CN112100291A - Data binning method and device - Google Patents

Data binning method and device Download PDF

Info

Publication number
CN112100291A
CN112100291A (application CN202010987254.2A)
Authority
CN
China
Prior art keywords
data
distribution
distribution type
binning
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010987254.2A
Other languages
Chinese (zh)
Inventor
周康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202010987254.2A priority Critical patent/CN112100291A/en
Publication of CN112100291A publication Critical patent/CN112100291A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data binning method and device, and relates to the field of computer technology. One embodiment of the method comprises: classifying the distribution type of the data; transforming data that does not follow a uniform or normal distribution; classifying the distribution type of the transformed data; and adaptively binning the data according to its distribution type. The embodiment automatically detects the distribution type of the data and selects a suitable binning method for each type. A binning method built around a scale-weighted k-means algorithm is constructed, which can control the percentage range of the number of samples in each class and thereby meets the business requirement of automatically grading one-dimensional index data.

Description

Data binning method and device
Technical Field
The invention relates to the field of computer technology, and in particular to a data binning method and device.
Background
A bank's nationwide outlets carry tens of thousands of operation-related indexes, and outlets, employees, equipment, regions, and so on need to be graded automatically according to these indexes. The index data follow different distributions, are unlabeled, and are large in volume, so manual configuration is prohibitively expensive; a method is therefore needed that automatically identifies the distribution type of the data and grades it reasonably. With an unsupervised automatic binning technique, the distribution type of each individual operational index is analyzed automatically and an adaptive binning method is applied, so that outlets, employees, equipment, regions, and so on can be graded automatically.
In the prior art, the same binning method is applied to data of every distribution, so the diversity of data distributions cannot be accommodated; moreover, the percentage of samples contained in each bin is not controllable, which fails to meet the business requirements of grading. To solve these technical problems, the existing binning approach needs to be improved so that it can:
a. automatically detect the data distribution and, according to the distribution type, apply a corresponding binning method to automatically bin tens of thousands of operational indexes; the binning results can then serve as an important basis for grading outlets, employees, equipment, regions, and so on;
b. during binning, control the range of the percentage of samples contained in each bin according to the business requirements of grading, so that the grading is scientifically sound and meets business needs.
Some terms associated with the present invention are listed below:
Unsupervised learning: learning from samples that carry no labels;
Clustering algorithm: for a large unlabeled data set, dividing the data into several categories according to the similarity between data points, so that data within a category are relatively similar and data across categories are relatively dissimilar;
k-means: an unsupervised clustering algorithm;
Data binning: a data processing technique for reducing the effect of minor observation errors; it groups a range of continuous values into a smaller number of "bins".
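As a simple illustration of what binning does (not part of the claimed method), a column of continuous values can be grouped into a handful of bins, for example with pandas; the values, labels, and bin counts below are arbitrary examples.

```python
# Illustrative only: grouping continuous values into a small number of "bins".
import pandas as pd

values = pd.Series([0.3, 0.8, 1.1, 1.4, 2.0, 2.7, 3.5, 4.2])

# Equal-width bins: each bin spans the same value range.
equal_width = pd.cut(values, bins=4, labels=["D", "C", "B", "A"])

# Equal-frequency bins: each bin holds roughly the same number of samples.
equal_freq = pd.qcut(values, q=4, labels=["D", "C", "B", "A"])

print(equal_width.value_counts())
print(equal_freq.value_counts())
```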
Disclosure of Invention
In view of this, embodiments of the present invention provide a data binning method and apparatus that can automatically bin tens of thousands of operation-related indexes with only a small amount of configuration, providing a basis for subsequent grading and evaluation. The invention automatically detects the distribution type of the data, applies a different binning method to each type, and can control the percentage range of the number of samples at each binning level.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data binning method including:
classifying the distribution type of the data;
transforming data that does not follow a uniform or normal distribution;
classifying the distribution type of the transformed data; and
adaptively binning the data according to its distribution type.
Optionally, classifying the distribution type of the data includes:
obtaining an index value by dividing the data by its mean, and
performing a Kolmogorov test on the index value to determine which distribution type the data is closest to.
Optionally, the data that does not follow a uniform or normal distribution includes data whose distribution type is a chi-square distribution, an exponential distribution, a Poisson distribution, or another distribution type.
Optionally, transforming data that does not follow a uniform or normal distribution includes performing a Box-Cox transform or a square-root transform on such data.
Optionally, classifying the distribution type of the transformed data includes:
performing a Kolmogorov test on the index value of the transformed data to determine whether the data is normally distributed.
Optionally, adaptively binning the data according to its distribution type includes:
if the distribution type of the data is a normal distribution, binning the data using the three-standard-deviation rule; or
if the distribution type of the data is a uniform distribution, binning the data at fixed percentages.
Optionally, adaptively binning the data according to its distribution type includes:
if the distribution type of the data is neither uniform nor normal, binning the data using a scale-weighted k-means algorithm.
Optionally, the scale-weighted k-means algorithm comprises:
setting the number of classes N and the expected proportion of samples for each class, and calculating the expected number of samples in each class;
randomly initializing the center point of each class, and initializing the scale weight of each class to 1 / (expected number of samples);
initializing the state of all samples to unassigned;
for each unassigned sample, calculating the absolute value of the scale-weighted distance between the sample and every class center;
finding the sample with the smallest weighted distance among all unassigned samples and assigning it to the class with the smallest weighted distance;
recalculating the center point and scale weight of each class;
if unassigned samples remain, returning to the step of calculating the scale-weighted distances between unassigned samples and all class centers; otherwise, checking whether the center points have stopped changing;
if the center points have changed, returning to the step of initializing all sample states to unassigned; otherwise, ending.
Optionally, the method further comprises first preprocessing the data, including outlier cleaning and/or missing-value cleaning.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for binning data, including:
a distribution type classification module for classifying the distribution type of the data;
a data transformation module for transforming data that does not follow a uniform or normal distribution; and
an adaptive binning module for adaptively binning the data according to its distribution type.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for data binning, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects: it automatically detects the distribution type of the data and selects a suitable binning method for each type; it improves the ordinary k-means clustering algorithm into a binning method built around a scale-weighted k-means algorithm, which can control the percentage range of the number of samples in each class and thereby meets the business requirement of automatically grading one-dimensional index data.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of data binning in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of specific details of a method of data binning according to an embodiment of the present invention;
FIG. 3 is a flow diagram of a scale weighted k-means algorithm according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the main modules of an apparatus for data binning in accordance with an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
According to one aspect of an embodiment of the present invention, a method of data binning is provided.
Fig. 1 is a schematic diagram of a main flow of a data binning method according to an embodiment of the present invention, as shown in fig. 1, the data binning method according to the embodiment of the present invention includes: step S101, step S102, step S103, and step S104.
In step S101, the distribution type of the data is classified. This includes dividing the data by its mean to obtain an index value, and performing a Kolmogorov test on the index value to determine which distribution type the data is closest to.
In step S102, data that does not follow a uniform or normal distribution is transformed, which includes performing a Box-Cox transform or a square-root transform on such data. The data that does not follow a uniform or normal distribution includes data whose distribution type is a chi-square distribution, an exponential distribution, a Poisson distribution, or another distribution type.
In step S103, the distribution type of the transformed data is classified, which includes performing a Kolmogorov test on the index value of the transformed data to determine whether the data is normally distributed.
In step S104, the data is adaptively binned according to its distribution type: if the distribution type is a normal distribution, the data is binned using the three-standard-deviation rule; if the distribution type is a uniform distribution, the data is binned at fixed percentages; and if the distribution type is neither uniform nor normal, the data is binned using a scale-weighted k-means algorithm.
The scale-weighted k-means algorithm comprises: setting the number of classes N and the expected proportion of samples for each class, and calculating the expected number of samples in each class; randomly initializing the center point of each class, and initializing the scale weight of each class to 1 / (expected number of samples); initializing the state of all samples to unassigned; for each unassigned sample, calculating the absolute value of the scale-weighted distance to every class center; finding the sample with the smallest weighted distance among all unassigned samples and assigning it to the class with the smallest weighted distance; recalculating the center point and scale weight of each class; if unassigned samples remain, returning to the distance-calculation step, otherwise checking whether the center points have stopped changing; if the center points have changed, returning to the step of initializing all sample states to unassigned, otherwise ending.
FIG. 2 is a schematic diagram of specific details of a method of data binning according to an embodiment of the present invention. The method of data binning of the present invention is described in more detail below in conjunction with FIG. 2.
In this embodiment, as shown in fig. 2, the data is first preprocessed. Since the raw data contains outliers and missing values, it must be preprocessed before binning. The preprocessing includes outlier cleaning: the data of a single index are sorted and, by default, the values of the bottom and top 1% of samples are truncated. It also includes missing-value cleaning: depending on the type of business index, missing values can be filled with 0, a fixed value, or the mean.
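A minimal sketch of this preprocessing step, assuming the single-index data is held in a numpy array; the function name, the 1% truncation fraction, and the mean-fill default are illustrative choices consistent with the description above, not names taken from the patent.

```python
import numpy as np

def preprocess(index_data, trunc_frac=0.01, fill="mean"):
    """Missing-value filling followed by truncation of the bottom/top 1% of values."""
    x = np.asarray(index_data, dtype=float)

    # Missing-value cleaning: fill NaN with 0, a fixed numeric value, or the mean.
    if fill == "mean":
        fill_value = np.nanmean(x)
    elif fill == "zero":
        fill_value = 0.0
    else:
        fill_value = float(fill)
    x = np.where(np.isnan(x), fill_value, x)

    # Outlier cleaning: truncate (clip) values outside the [1%, 99%] quantiles.
    lo, hi = np.quantile(x, [trunc_frac, 1.0 - trunc_frac])
    return np.clip(x, lo, hi)
```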
Next, the distribution type of the preprocessed data is classified. The original data is divided by its mean to obtain an index value, which expresses how many times the mean each original value is. This keeps a clear business meaning while reducing the range over which the data varies. A Kolmogorov test is then performed on the index value to determine which distribution type the data is closest to. The distribution types of primary interest include the uniform, normal, chi-square, exponential, Poisson, and skewed distributions. According to one embodiment of the present invention, the distribution types of the data are classified into uniform, normal, chi-square, exponential, and Poisson distributions, plus other distribution types such as skewed distributions.
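A minimal sketch of this classification step, assuming the Kolmogorov test refers to the one-sample Kolmogorov-Smirnov test as implemented in scipy.stats.kstest and that "closest distribution type" means the candidate with the largest p-value; the candidate list, the 0.05 cut-off, and the function name are illustrative assumptions, and the Poisson case is omitted because the K-S test applies directly only to continuous distributions.

```python
import numpy as np
from scipy import stats

def classify_distribution(data, alpha=0.05):
    """Divide by the mean to get an index value, then K-S test it against candidate distributions."""
    x = np.asarray(data, dtype=float)
    index_value = x / x.mean()                     # how many times the mean each value is

    candidates = {
        "uniform": ("uniform", (index_value.min(), index_value.max() - index_value.min())),
        "normal": ("norm", (index_value.mean(), index_value.std(ddof=1))),
        "chi-square": ("chi2", stats.chi2.fit(index_value)),
        "exponential": ("expon", stats.expon.fit(index_value)),
    }

    best_name, best_p = "other", -1.0
    for name, (dist, params) in candidates.items():
        p_value = stats.kstest(index_value, dist, args=params).pvalue
        if p_value > best_p:
            best_name, best_p = name, p_value

    # If even the best fit is rejected, treat the data as "other" (e.g. skewed).
    return (best_name if best_p > alpha else "other"), index_value
```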
Data whose distribution type is neither uniform nor normal is then transformed. The transform is mainly a Box-Cox transform, although other transforms such as a square-root transform may also be used. Transforming such data converts data whose distribution is approximately normal into a normal distribution, so that the same binning method can be applied in the subsequent step. Data whose distribution type is uniform or normal does not need to be transformed; data whose distribution type is chi-square, exponential, Poisson, or another type such as a skewed distribution is brought closer to a normal distribution by a Box-Cox transform.
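A minimal sketch of this transformation step using scipy's Box-Cox implementation; Box-Cox requires strictly positive input, so the small shift applied below is an added assumption rather than something specified in the patent, as is the function name.

```python
import numpy as np
from scipy import stats

def to_approximately_normal(index_value, method="box-cox"):
    """Transform non-uniform, non-normal data towards a normal distribution."""
    x = np.asarray(index_value, dtype=float)

    if method == "box-cox":
        # Box-Cox needs strictly positive values; shift if necessary (assumption).
        shift = 1e-6 - x.min() if x.min() <= 0 else 0.0
        transformed, fitted_lambda = stats.boxcox(x + shift)
        return transformed
    if method == "sqrt":
        return np.sqrt(np.clip(x, 0.0, None))
    raise ValueError(f"unknown transform: {method}")
```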
The distribution type of the transformed data is then classified; that is, a Kolmogorov test is performed on the index value of the transformed data to determine whether the data is normally distributed.
Finally, the data is adaptively binned according to its distribution type. If the distribution type is a normal distribution, the data is binned using the three-standard-deviation rule; if the distribution type is a uniform distribution, the data is binned at fixed percentages; and if the distribution type is neither uniform nor normal, the data is binned using a scale-weighted k-means algorithm, a sketch of which is given after the description of FIG. 3 below.
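A minimal sketch of the binning strategies for the normal and uniform cases; the choice of five bins and the exact placement of the cut points within mean ± 3 standard deviations are illustrative assumptions, since the patent does not fix them here.

```python
import numpy as np

def bin_normal(x, n_bins=5):
    """Bin normally distributed data with cut points spread over mean ± 3 standard deviations."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)
    edges = np.linspace(mu - 3 * sigma, mu + 3 * sigma, n_bins + 1)
    # bin index 0 .. n_bins-1; values beyond ±3 sigma fall into the end bins
    return np.digitize(x, edges[1:-1])

def bin_uniform(x, percents=(0.2, 0.4, 0.6, 0.8)):
    """Bin uniformly distributed data at fixed percentage (quantile) cut points."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, percents)
    return np.digitize(x, edges)              # bin index 0 .. len(percents)
```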
FIG. 3 is a flow diagram of the scale-weighted k-means algorithm according to an embodiment of the invention. In the actual business scenario at hand, the samples within each binning level should be as similar as possible, i.e., the intra-group variance should be small, while the approximate number of samples within each level is kept from deviating significantly from the expected range. The invention therefore improves the k-means algorithm by adding a scale-weighting factor, creating a scale-weighted k-means algorithm that controls the size of each class and thereby achieves the desired binning effect on one-dimensional data.
In this embodiment, as shown in FIG. 3, the specific flow of the scale-weighted k-means algorithm is as follows: set the number of classes N and the expected proportion of samples for each class, and calculate the expected number of samples in each class; randomly initialize the center point of each class, and initialize the scale weight of each class to 1 / (expected number of samples); initialize the state of all samples to unassigned; for each unassigned sample, calculate the absolute value of the scale-weighted distance to every class center; find the sample with the smallest weighted distance among all unassigned samples and assign it to the class with the smallest weighted distance; recalculate the center point and scale weight of each class; if unassigned samples remain, return to the distance-calculation step, otherwise check whether the center points have stopped changing; if the center points have changed, return to the step of initializing all sample states to unassigned, otherwise end.
Where:
scale weight of a class = (number of samples contained in the class) / (expected number of samples of the class)
scale-weighted distance = (scale weight of the class) × (absolute distance from the sample to the class center)
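A minimal sketch of the scale-weighted k-means algorithm for one-dimensional data, following the flow of FIG. 3 and the two formulas above; the iteration cap, random seed, and tie handling are assumptions added to make the sketch runnable, and the function name is illustrative.

```python
import numpy as np

def scale_weighted_kmeans(x, proportions, max_iter=100, seed=0):
    """One-dimensional scale-weighted k-means.

    proportions -- expected share of samples for each class, e.g. [0.2, 0.3, 0.5].
    Returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, k = len(x), len(proportions)
    expected = np.asarray(proportions, dtype=float) * n   # expected number of samples per class

    centers = rng.choice(x, size=k, replace=False)        # randomly initialize class centers
    weights = 1.0 / expected                               # initial scale weight of each class

    for _ in range(max_iter):
        prev_centers = centers.copy()
        labels = np.full(n, -1)                            # mark all samples unassigned
        counts = np.zeros(k)

        for _ in range(n):
            unassigned = np.where(labels == -1)[0]
            # scale-weighted distance = scale weight of class * |sample - class center|
            dist = weights[None, :] * np.abs(x[unassigned, None] - centers[None, :])
            i, j = np.unravel_index(np.argmin(dist), dist.shape)
            labels[unassigned[i]] = j                      # assign the globally closest sample

            # recalculate the center point and scale weight of the affected class
            counts[j] += 1
            centers[j] = x[labels == j].mean()
            weights[j] = counts[j] / expected[j]

        if np.allclose(centers, prev_centers):             # center points no longer change: stop
            break

    return labels, centers
```

With proportions=[0.2, 0.3, 0.5], for example, a class that already holds more than its expected share acquires a larger scale weight, so it appears farther away to the remaining samples and stops attracting them; this is what keeps each bin's sample percentage near its target range.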
Fig. 4 is a schematic diagram of the main modules of an apparatus 400 for data binning according to an embodiment of the present invention. As shown in fig. 4, the apparatus 400 comprises: a distribution type classification module 401 for classifying the distribution type of the data; a data transformation module 402 for transforming data that does not follow a uniform or normal distribution; and an adaptive binning module 403 for adaptively binning the data according to its distribution type.
According to an embodiment of the present invention, there is provided an electronic device for data binning including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a further embodiment of the present invention, there is provided a computer readable medium, on which a computer program is stored, which program, when executed by a processor, performs the method provided by the first aspect of an embodiment of the present invention.
Fig. 5 illustrates an exemplary system architecture 500 to which the data binning method or the data binning apparatus of embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.
It should be noted that the data binning method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the data binning apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a task generation module for generating a box-storing task request in response to a user's application to store a tail box and sending the request to a server, and a task execution module for receiving a first dynamic password generated by the server in response to the box-storing task request together with a box-storing password entered by the user, verifying the first dynamic password and the storing password, and, when the verification passes, executing the storing task so that the storage bag bound to the tail box is stored in the data binning device. The names of these modules do not in all cases limit the modules themselves; for example, the task generation module may also be described as a "module that performs a binning task".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: generate a box-storing task request in response to a user's application to store a tail box and send the request to a server; receive a first dynamic password generated by the server in response to the box-storing task request together with a box-storing password entered by the user; and verify the first dynamic password and the storing password and, when the verification passes, execute the storing task so that the storage bag bound to the tail box is stored in the data binning device.
According to the technical scheme of this embodiment, the tail box of an outlet is bound to a storage bag in advance, and the articles in the tail box are stored via the storage bag in the outlet's data binning device, so that the outlet can open for business normally even under special weather conditions, traffic congestion, and the like. At the same time, because no escort is required at the end of the day, escort and custody costs can be reduced, lowering operating costs, shortening the working hours of outlet staff, and improving working efficiency.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of data binning, comprising:
classifying the distribution type of the data;
transforming data that does not follow a uniform or normal distribution;
classifying the distribution type of the transformed data; and
adaptively binning the data according to its distribution type.
2. The method of claim 1, wherein classifying the distribution type of the data comprises:
obtaining an index value by dividing the data by its mean, and
performing a Kolmogorov test on the index value to determine which distribution type the data is closest to.
3. The method of claim 1, wherein the data that does not follow a uniform or normal distribution comprises data whose distribution type is a chi-square distribution, an exponential distribution, a Poisson distribution, or another distribution type.
4. The method of claim 1, wherein transforming data that does not follow a uniform or normal distribution comprises: performing a Box-Cox transform or a square-root transform on such data.
5. The method of claim 1, wherein classifying the distribution type of the transformed data comprises:
performing a Kolmogorov test on the index value of the transformed data to determine whether the data is normally distributed.
6. The method of any one of claims 1 to 5, wherein adaptively binning the data according to a distribution type of the data comprises:
if the distribution type of the data is a normal distribution, binning the data using the three-standard-deviation rule; or
if the distribution type of the data is a uniform distribution, binning the data at fixed percentages.
7. The method of any one of claims 1 to 5, wherein adaptively binning the data according to a distribution type of the data comprises:
and if the distribution type of the data does not belong to uniform distribution or normal distribution, performing binning on the data by adopting a scale weighted k-means algorithm.
8. The method of claim 7, wherein the scale-weighted k-means algorithm comprises:
setting the number of classes N and the expected proportion of samples for each class, and calculating the expected number of samples in each class;
randomly initializing the center point of each class, and initializing the scale weight of each class to 1 / (expected number of samples);
initializing the state of all samples to unassigned;
for each unassigned sample, calculating the absolute value of the scale-weighted distance between the sample and every class center;
finding the sample with the smallest weighted distance among all unassigned samples and assigning it to the class with the smallest weighted distance;
recalculating the center point and scale weight of each class;
if unassigned samples remain, returning to the step of calculating the scale-weighted distances between unassigned samples and all class centers; otherwise, checking whether the center points have stopped changing;
if the center points have changed, returning to the step of initializing all sample states to unassigned; otherwise, ending.
9. The method of claim 1, further comprising first preprocessing the data, including outlier cleaning and/or missing-value cleaning.
10. An apparatus for binning data, comprising:
a distribution type classification module for classifying the distribution type of the data;
a data transformation module for transforming data that does not follow a uniform or normal distribution; and
an adaptive binning module for adaptively binning the data according to its distribution type.
11. An electronic device for data binning, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1-9.
CN202010987254.2A 2020-09-18 2020-09-18 Data binning method and device Pending CN112100291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987254.2A CN112100291A (en) 2020-09-18 2020-09-18 Data binning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987254.2A CN112100291A (en) 2020-09-18 2020-09-18 Data binning method and device

Publications (1)

Publication Number Publication Date
CN112100291A true CN112100291A (en) 2020-12-18

Family

ID=73759469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987254.2A Pending CN112100291A (en) 2020-09-18 2020-09-18 Data binning method and device

Country Status (1)

Country Link
CN (1) CN112100291A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819034A (en) * 2021-01-12 2021-05-18 平安科技(深圳)有限公司 Data binning threshold calculation method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
US20190163742A1 (en) Method and apparatus for generating information
AU2020385264B2 (en) Fusing multimodal data using recurrent neural networks
CN109684624B (en) Method and device for automatically identifying order address road area
CN113239275B (en) Information pushing method, device, electronic equipment and storage medium
CN110650170B (en) Method and device for pushing information
CN110309142B (en) Method and device for rule management
CN110766348A (en) Method and device for combining picking tasks
CN110910178A (en) Method and device for generating advertisement
CN112418258A (en) Feature discretization method and device
CN110619253B (en) Identity recognition method and device
CN107908662B (en) Method and device for realizing search system
CN112100291A (en) Data binning method and device
CN116048463A (en) Intelligent recommendation method and device for content of demand item based on label management
CN114329164A (en) Method, apparatus, device, medium and product for processing data
CN111275476B (en) Quotation method and device for logistics storage service
CN113612777A (en) Training method, traffic classification method, device, electronic device and storage medium
CN111274383B (en) Object classifying method and device applied to quotation
CN113568738A (en) Resource allocation method and device based on multi-label classification, electronic equipment and medium
CN112906723A (en) Feature selection method and device
CN113379173A (en) Method and apparatus for labeling warehouse goods
CN110895564A (en) Potential customer data processing method and device
CN113554041B (en) Method and device for marking labels for users
CN111915115A (en) Execution policy setting method and device
CN113360770B (en) Content recommendation method, device, equipment and storage medium
CN112699010A (en) Method and device for processing crash logs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination