WO2020248356A1

WO2020248356A1 - Data binning processing method and apparatus, electronic device and computer-readable medium

Info

Publication number: WO2020248356A1
Application number: PCT/CN2019/100804
Authority: WO
Inventors: 陈星为
Original assignee: 同盾控股有限公司
Priority date: 2019-06-12
Filing date: 2019-08-15
Publication date: 2020-12-17
Also published as: CN110245140A; CN110245140B

Abstract

A data binning processing method, an apparatus, an electronic device and a computer-readable medium, relating to the field of data processing. The method comprises: acquiring data to be processed, together with a target binning method and a preset number of bins for the data (S1); if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, N being a positive integer greater than 1 (S2); processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method, so as to determine a target quantile of the data to be processed (S3); and performing a binning operation on the data to be processed according to the target quantile, so as to obtain a binning result (S4). The method makes it possible to bin data with a relatively large data volume.

Description

Data binning processing method and device, electronic equipment and computer readable medium

This disclosure claims priority to Chinese invention patent application No. 201910504964.2, filed on June 12, 2019 and entitled "Data Binning Processing Method and Device, Electronic Equipment and Computer Readable Medium".

Technical field

The present disclosure relates to the technical field of data processing, and in particular to a data binning processing method and device, electronic equipment and computer-readable media.

Background

Data binning is a commonly used data processing method. Binning divides data into sub-intervals according to the value of a certain attribute, for example sub-intervals of age or sub-intervals of height. If the attribute value of a data record falls within a certain sub-interval, the record is put into the bin represented by that sub-interval.

With the development of big data, the scale of data is gradually increasing. A binning method that can adapt to large-scale data is extremely important for data processing.

It should be noted that the information disclosed in the above background section is only used to strengthen the understanding of the background of the present disclosure, and therefore may include information that does not constitute the prior art known to those of ordinary skill in the art.

Summary of the invention

In view of this, the embodiments of the present disclosure provide a data binning processing method and device, electronic equipment, and computer readable medium, which can perform binning processing on data with a large data scale.

Other characteristics and advantages of the present disclosure will become apparent through the following detailed description, or partly learned through the practice of the present disclosure.

According to the first aspect of the embodiments of the present disclosure, a data binning processing method is proposed. The method includes: obtaining the data to be processed and its target binning method and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine a target quantile of the data to be processed; and binning the data to be processed according to the target quantile to obtain a binning result.

In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed includes: if the target binning mode is the first binning mode, determining a first candidate segmentation point of the data to be processed; distributing the data to be processed to the N nodes in an orderly manner according to the first candidate segmentation point; sorting the data to be processed on each node after the ordered distribution to obtain the first sorted data on each node; obtaining the global KS of the data to be processed according to the first sorted data on each node; and determining the target quantile according to the global KS of the data to be processed.

In some exemplary embodiments of the present disclosure, determining the first candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain the second sorted data on each node; performing equal-frequency division on each second sorted data according to the number N of nodes to obtain the first pre-segmentation points on each node; and determining the first candidate segmentation point according to the first pre-segmentation points.

In some exemplary embodiments of the present disclosure, determining the target quantile according to the global KS of the data to be processed includes: determining second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed; and determining the target quantile among the second candidate segmentation points according to the preset number of bins.

In some exemplary embodiments of the present disclosure, determining second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed includes: determining a maximum KS in the global KS and taking the corresponding data to be processed as a second candidate segmentation point; and, if the amount of data to be processed on the left side and on the right side of that second candidate segmentation point is greater than a preset data amount, determining on the left side and on the right side respectively the data to be processed corresponding to a maximum KS as further second candidate segmentation points.

In some exemplary embodiments of the present disclosure, determining the target quantile among the second candidate segmentation points according to the preset number of bins includes: determining whether the number of second candidate segmentation points is less than the preset number of bins; if the number of second candidate segmentation points is less than the preset number of bins, determining the second candidate segmentation points as the target binning points; and if the number of second candidate segmentation points is greater than or equal to the preset number of bins, determining the target binning points according to the preset number of bins and using a dynamic programming method.

In some exemplary embodiments of the present disclosure, the data binning processing method further includes: if the data volume of the data to be processed is less than the preset threshold, sorting the data to be processed to generate third sorted data; determining the KS of the third sorted data; determining third candidate segmentation points according to the KS of the third sorted data; determining whether the number of third candidate segmentation points is greater than or equal to the preset number of bins; and, if the number of third candidate segmentation points is greater than or equal to the preset number of bins, determining the target binning points according to the preset number of bins and using a dynamic programming method.

In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed further includes: if the target binning mode is the second binning mode, determining a fourth candidate segmentation point of the data to be processed; distributing the data to be processed to the N nodes in an orderly manner according to the fourth candidate segmentation point; sorting the data to be processed on each node after the ordered distribution to obtain the fourth sorted data on each node; and determining the target quantile in the fourth sorted data.

In some exemplary embodiments of the present disclosure, determining the fourth candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain the fifth sorted data on each node; performing equal-frequency division on each fifth sorted data according to the number N of nodes to obtain the second pre-segmentation points on each node; and determining the fourth candidate segmentation point according to the second pre-segmentation points.

In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed further includes: if the target binning mode is the third binning mode, obtaining the maximum value and the minimum value on each node; determining the maximum value and the minimum value of the data to be processed according to the maximum value and the minimum value on each node; and determining the target quantile according to the maximum value and the minimum value of the data to be processed and the preset number of bins.
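The third binning mode can be illustrated with a minimal sketch. The text does not name the mode, so the sketch assumes it is equal-width binning: each node reports only its local minimum and maximum, and the cut points are spaced evenly between the global extremes. The function name `equal_width_cuts` and the `(min, max)` tuple layout are hypothetical.

```python
def equal_width_cuts(node_min_max, n_bins):
    """Return n_bins - 1 evenly spaced cut points.

    node_min_max: one (local_min, local_max) pair per node.
    Assumption: the third binning mode is equal-width binning.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    # The global extremes follow directly from the per-node extremes.
    global_min = min(lo for lo, _ in node_min_max)
    global_max = max(hi for _, hi in node_min_max)
    width = (global_max - global_min) / n_bins
    return [global_min + k * width for k in range(1, n_bins)]


if __name__ == "__main__":
    print(equal_width_cuts([(0, 40), (10, 100), (5, 60)], 4))
```

Only two numbers per node cross the network in this sketch, which is why this mode needs neither sorting nor redistribution.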

According to a second aspect of the embodiments of the present disclosure, a data binning processing device is proposed. The device includes a data acquisition module, a data distribution module, a target quantile determination module, and a binning module. The data acquisition module is configured to acquire the data to be processed and its target binning method and preset number of bins. The data distribution module is configured to randomly distribute the data to be processed to N nodes if the data volume of the data to be processed is greater than or equal to a preset threshold, where N is a positive integer greater than 1. The target quantile determination module is configured to process the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed. The binning module is configured to perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the data binning processing method described in any one of the foregoing.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, characterized in that, when the program is executed by a processor, the data binning processing method described in any one of the foregoing is implemented.

The data binning processing method, device, electronic equipment, and computer-readable medium provided by some embodiments of the present disclosure allocate the data to be processed to multiple nodes, determine the target quantile from the data on the multiple nodes, and finally bin the data to be processed according to the target quantile. Because data with a large data volume is distributed to multiple nodes that complete the binning operation together, the method overcomes the defect that a single node has too little memory to process large-scale data.

It should be understood that the above general description and the following detailed description are only exemplary and cannot limit the present disclosure.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments in accordance with the disclosure, and together with the specification are used to explain the principle of the disclosure. The drawings described below are only some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 shows a schematic diagram of an exemplary system architecture to which a data binning processing method or data binning processing device of an embodiment of the present disclosure can be applied.

Fig. 2 is a flowchart showing a method for processing data binning according to an exemplary embodiment.

Fig. 3 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 4 is a flow chart showing yet another data binning processing method according to an exemplary embodiment.

Fig. 5 is a flow chart showing still another method for processing data binning according to an exemplary embodiment.

Fig. 6 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 7 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 8 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 9 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 10 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 11 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 12 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 13 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 14 is a flowchart showing another data binning processing method according to an exemplary embodiment.

Fig. 15 is a block diagram showing a data binning processing device according to an exemplary embodiment.

Fig. 16 is a schematic structural diagram showing another computer system applied to a data binning processing device according to an exemplary embodiment.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein; on the contrary, these embodiments are provided so that this disclosure will be comprehensive and complete, and will fully convey the concept of the example embodiments to those skilled in the art. In the figures, the same reference numerals denote the same or similar parts, and thus their repeated description will be omitted.

The features, structures, or characteristics described in the present disclosure may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a sufficient understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced without one or more of the specific details, or other methods, components, devices, steps, etc. can be used. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail to avoid obscuring aspects of the present disclosure.

The accompanying drawings are only schematic illustrations of the present disclosure, and the same reference numerals in the figures indicate the same or similar parts, and thus their repeated description will be omitted. Some block diagrams shown in the drawings do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor devices and/or microcontroller devices.

The flowchart shown in the drawings is only an exemplary description, and does not necessarily include all contents and steps, nor does it have to be executed in the described order. For example, some steps can be decomposed, and some steps can be combined or partially combined, so the actual execution order may be changed according to actual conditions.

In this specification, the terms "a", "an", "the", "said" and "at least one" are used to indicate that there are one or more elements/components/etc.; the terms "comprising", "including" and "having" are used to mean open-ended inclusion and mean that, in addition to the listed elements/components/etc., there may be additional elements/components/etc.; the terms "first", "second" and "third" are only used as markers and are not a limitation on the number of objects.

The exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 shows a schematic diagram of an exemplary system architecture of a data binning processing method or a data binning processing device that can be applied to an embodiment of the present disclosure.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables.

The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on. The terminal devices 101, 102, 103 may be various electronic devices with display screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and so on.

The server 105 may be a server that provides various services, for example, a background management server that provides support for devices operated by users using the terminal devices 101, 102, and 103. The background management server can analyze and process the received request and other data, and feed back the processing result to the terminal device.

The server 105 may, for example, obtain the data to be processed and its target binning method and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1; process the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed; and perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative. The server 105 may be a single physical server or may be composed of multiple servers. There may be any number of terminal devices, networks, and servers according to actual needs.

In related technologies, data can be divided into sub-intervals according to the value of a certain attribute, for example sub-intervals of age or sub-intervals of height. If the attribute value of a data record falls within a certain sub-interval, the record can be put into the bin represented by that sub-interval, and the attributes of the whole sub-interval are then used to represent the attributes of the data in it. This kind of binning can be understood as the discretization of data, and the discretization of data can have the following advantages:

1. Discrete features are easy to add and remove, which is conducive to rapid iteration of the model.

2. When the sparse vectors formed by discretized data are multiplied as inner products, the calculation is fast, the results are convenient to store, and the approach is easy to scale.

3. Discretized data is robust to abnormal data. For example, in age data, the abnormal value "age greater than 300" would greatly interfere with the model. After the age data is discretized (for example, an age greater than 30 is expressed as 1 and any other age as 0), the feature takes only the values 0 and 1, so substituting the discretized abnormal data into the model does not interfere with the model.

4. For generalized linear models, the expressive power of continuous data is limited. Substituting the discretized data into the model is equivalent to introducing non-linearity into the model, which improves its expressive power and its fit.

5. Substituting discretized continuous data into the model makes the model more stable. For example, age changes over time: if 20 to 30 years old is taken as one age range and a user is 25 years old, the user's age changes to 26 after one year, but the corresponding discrete value remains unchanged.

6. Discretizing continuous data simplifies the logistic regression model and reduces the risk of model overfitting.
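The robustness and stability points in the list above (items 3 and 5) can be illustrated with a minimal sketch. The bin edges `AGE_EDGES` and the helper names are hypothetical values chosen for illustration and do not come from the disclosure.

```python
from bisect import bisect_right

# Hypothetical age bin edges: [.., 20), [20, 30), [30, 40), [40, ..)
AGE_EDGES = [20, 30, 40]


def age_bin(age, edges=AGE_EDGES):
    """Return the index of the age range that contains `age`."""
    return bisect_right(edges, age)


def over_30(age):
    """Binary discretization from item 3: 1 if age > 30, otherwise 0."""
    return 1 if age > 30 else 0


if __name__ == "__main__":
    # Item 5: a user ageing from 25 to 26 stays in the same bin.
    print(age_bin(25), age_bin(26))
    # Item 3: the abnormal age 300 maps to the same value as a normal age of 45.
    print(over_30(300), over_30(45))
```

After discretization the abnormal record is indistinguishable from any other record in its bin, which is the sense in which it no longer disturbs the model.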

Referring to FIG. 2, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.

Step S1: Obtain the data to be processed and its target binning method and preset binning number.

In some embodiments, the preset number of bins refers to the number of bins designated by the user to divide the data to be processed, and the target binning mode refers to the binning mode specified by the user. In some embodiments, the target binning manner may include at least one of a first binning manner, a second binning manner, and a third binning manner.

Step S2: If the data volume of the data to be processed is greater than or equal to a preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1.

In some embodiments, the preset threshold may refer to the amount of data that can be processed by a single machine. For example, for a list of to-be-processed data including a label column, a sequence number column, and a feature value column, assuming that the label, sequence number, and feature value are all int data (integer data, each occupying 4 bytes), a server with 1G of memory can only handle $10^8$ to $10^9$ records. In some embodiments, when the amount of data to be processed is greater than or equal to the preset threshold, the data to be processed can be randomly allocated to N nodes for processing.

In some embodiments, the N nodes may refer to N terminals that can perform data processing, such as N servers or N computer terminals. The present disclosure does not limit the physical form of the N nodes, which depends on the actual deployment.

In some embodiments, the amount of data to be processed randomly allocated to each node is approximately the same.
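Step S2 can be sketched as follows. This is a single-process illustration under stated assumptions, not a distributed implementation: the function name `random_allocate`, the `seed` parameter, and the use of in-memory lists to stand in for nodes are all hypothetical.

```python
import random


def random_allocate(records, n_nodes, seed=None):
    """Randomly split `records` into n_nodes partitions of near-equal size."""
    if n_nodes < 2:
        raise ValueError("N must be a positive integer greater than 1")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Dealing the shuffled records round-robin keeps the partition
    # sizes within one record of each other.
    return [shuffled[i::n_nodes] for i in range(n_nodes)]


if __name__ == "__main__":
    partitions = random_allocate(range(10), 3, seed=0)
    print([len(p) for p in partitions])
```

Shuffling before dealing means every node receives an approximately representative sample of the whole data set, which the later per-node equal-frequency division relies on.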

Step S3: Process the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed.

Step S4: Perform a binning operation on the to-be-processed data according to the target quantile point to obtain a binning result.

In some embodiments, the data to be processed can be divided at the target quantile to form multiple bins of data.
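A minimal sketch of step S4, assuming the target quantile is given as a list of cut values and that a record equal to a cut value belongs to the bin on its left. Both the convention and the name `bin_indices` are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def bin_indices(values, target_cut_points):
    """Map each value to the index of its bin.

    Assumed convention: bin k holds values v with cut[k-1] < v <= cut[k].
    """
    cuts = sorted(target_cut_points)
    return [bisect_left(cuts, v) for v in values]


if __name__ == "__main__":
    values = [1, 5, 10, 15, 20, 25]
    labels = bin_indices(values, [10, 20])
    print(labels)           # bin index of each record
    print(Counter(labels))  # number of records per bin
```

With k cut values the records fall into k + 1 bins, and each lookup is a binary search over the sorted cut list.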

The foregoing embodiment provides a data binning processing method. On the one hand, the relationship between the amount of data to be processed and the preset threshold is considered before binning, so as to avoid failing to process the data because the data volume is too large. On the other hand, data with a large data volume is distributed to multiple nodes that complete the binning operation together, which overcomes the defect that a single node has too little memory to process large-scale data.

Referring to FIG. 3, step S3 provided in the embodiment shown in FIG. 2 may include the following steps.

Step S31: If the target binning mode is the first binning mode, determine the first candidate segmentation point of the data to be processed.

In some embodiments, the first binning mode may be a distributed data binning processing method based on the KS value of the data.

In some embodiments, determining the first candidate segmentation point may include the steps shown in FIG. 4.

Step S311: Sort the to-be-processed data on each node respectively to obtain the second sorted data in each node.

In some embodiments, the data to be processed may be randomly distributed to N nodes, where N is a positive integer greater than 1.

For example, M data records to be processed are randomly allocated to N nodes, and the data on the nodes are respectively denoted as $M_1, M_2, \ldots, M_{N-1}, M_N$.

In some embodiments, the data to be processed on each node may be sorted separately to obtain the second sorted data in each node.

For example, the data $M_1, M_2, \ldots, M_{N-1}, M_N$ on the nodes are sorted to generate the second sorted data $M'_1, M'_2, \ldots, M'_{N-1}, M'_N$ on each node.

In some embodiments, the sorting method used on a node may be selected according to the memory size of the node and the memory required for processing the data to be processed on it. For example, when the memory required for the data to be processed on a single node is less than half of the memory of the node, bucket sort (such as radix sort) can be used to sort the data on the node. When the memory required is greater than or equal to half of the memory of the node, quick sort can be used instead. Quick sort occupies less memory but is slower, while bucket sort is faster but occupies more memory.
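The memory-based choice of sorting method can be sketched as follows, under several assumptions: records are non-negative integers, each record costs 12 bytes (three 4-byte int fields, as in the example in the text), and Python's built-in in-place sort stands in for the quick sort named above (the built-in sort is not quick sort). The names `radix_sort` and `sort_on_node` are hypothetical.

```python
def radix_sort(values):
    """LSD radix sort (base 256) for non-negative integers.

    Builds 256 auxiliary buckets per pass, trading memory for speed.
    """
    out = list(values)
    if not out:
        return out
    max_value = max(out)
    shift = 0
    while (max_value >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for v in out:
            buckets[(v >> shift) & 0xFF].append(v)
        out = [v for bucket in buckets for v in bucket]
        shift += 8
    return out


def sort_on_node(values, node_memory_bytes, bytes_per_record=12):
    """Pick the sort according to the half-of-memory rule in the text."""
    required = len(values) * bytes_per_record
    if required < node_memory_bytes / 2:
        return radix_sort(values)  # enough spare memory: faster bucket-style sort
    data = list(values)
    data.sort()  # stand-in for the lower-memory quick sort
    return data


if __name__ == "__main__":
    print(sort_on_node([5, 3, 300, 0, 70000], node_memory_bytes=10**6))
```

Both branches return the same ordering; only the time and memory trade-off differs.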

In some embodiments, the memory required for processing the data to be processed on a node is related to the data amount, the data type, and the number of attributes of the data to be processed on the node. For example, for a list of to-be-processed data including a label column, a serial number column, and a feature value column, suppose its data volume is $10^8$ to $10^9$ records and the label, serial number, and feature value are all int data (each occupying 4 bytes); then at least 1G of memory is required to process the data.

Step S312: Perform equal frequency division on each second sorted data according to the number N of the nodes, to obtain the first pre-segment point on each node.

In some embodiments, the equal-frequency division of the second sorted data on each node can be realized according to the number N of nodes designated by the user and the data volume of the data to be processed on each node. Assuming that the amount of data to be processed on the first node is 1000 and the number of nodes is 5, the second sorted data on the first node can be divided at equal frequency into bins of 1000/5 = 200 records each.

In some embodiments, equal frequency division is performed on each node according to the amount of data to be processed on each node and the number N of said nodes to obtain the first pre-segment point on each node.

For example, assuming that the second sorted data on the nodes are $M'_1, M'_2, \ldots, M'_{N-1}, M'_N$, the second sorted data on each node can be divided at equal frequency according to the number N of nodes and the amount of data on that node. Suppose that the first pre-segmentation points determined on the first node are $m_{11}, m_{12}, \ldots, m_{1,N-1}$ (it is easy to understand that only N-1 segmentation points are needed to divide M data records into N bins), the first pre-segmentation points determined on the second node are $m_{21}, m_{22}, \ldots, m_{2,N-1}$, and the first pre-segmentation points determined on the i-th node are $m_{i1}, m_{i2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.
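Step S312 can be sketched for a single node as follows. The sketch assumes that each pre-segmentation point is the last value of its bin, so that a bin holds the values up to and including its cut; the function name `equal_frequency_cuts` is hypothetical.

```python
def equal_frequency_cuts(sorted_values, n_bins):
    """Return n_bins - 1 cut values that split sorted data into
    n_bins parts of (nearly) equal size.

    Assumption: each cut is the last value of its bin.
    """
    size = len(sorted_values)
    if n_bins < 2 or size < n_bins:
        raise ValueError("need at least 2 bins and at least one record per bin")
    return [sorted_values[(k * size) // n_bins - 1] for k in range(1, n_bins)]


if __name__ == "__main__":
    node_data = list(range(1, 1001))           # 1000 sorted records on one node
    print(equal_frequency_cuts(node_data, 5))  # 4 cuts, 200 records per bin
```

Every node runs the same function on its own sorted data, so each node produces N-1 pre-segmentation points without seeing the other nodes' records.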

Step S313: Determine the first candidate segmentation point according to the first pre-segmentation point.

In some embodiments, the first pre-segmentation points on multiple nodes may be correspondingly averaged to determine the first candidate segmentation point. For example, assuming that the preset number of bins is N, the first pre-segmentation points determined on the first node are $m_{11}, m_{12}, \ldots, m_{1,N-1}$, the first pre-segmentation points determined on the second node are $m_{21}, m_{22}, \ldots, m_{2,N-1}$, and the first pre-segmentation points determined on the i-th node are $m_{i1}, m_{i2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.

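In some embodiments, the first candidate segmentation points may be determined as the averages of the corresponding first pre-segmentation points over the N nodes:

$$C_j = \frac{1}{N}\sum_{i=1}^{N} m_{ij}, \qquad j = 1, 2, \ldots, N-1,$$

where $m_{ij}$ represents the j-th first pre-segmentation point on the i-th node, so that $m_{i,N-1}$ represents the (N-1)-th first pre-segmentation point on the i-th node.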

In some other embodiments, the first pre-segmentation points on multiple nodes may be correspondingly calculated as the median, maximum, or minimum, etc., as the first candidate segmentation point.
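Step S313 can be sketched as a column-wise reduction over the per-node pre-segmentation points. The function name `combine_pre_cuts` and the `how` parameter are hypothetical; the four reducers correspond to the average, median, maximum, and minimum options mentioned in the text.

```python
from statistics import mean, median


def combine_pre_cuts(pre_cuts_per_node, how="mean"):
    """Combine the j-th pre-segmentation points of all nodes into the
    j-th candidate segmentation point.

    pre_cuts_per_node: one list of N-1 cut values per node.
    """
    reducers = {"mean": mean, "median": median, "max": max, "min": min}
    reducer = reducers[how]
    # zip(*...) groups the j-th cut from every node into one column.
    return [reducer(column) for column in zip(*pre_cuts_per_node)]


if __name__ == "__main__":
    pre_cuts = [[10, 20, 30], [12, 22, 32], [14, 24, 34]]
    print(combine_pre_cuts(pre_cuts))
    print(combine_pre_cuts(pre_cuts, how="max"))
```

Only N-1 values per node are exchanged in this step, which keeps the coordination cost small compared with moving the data itself.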

The embodiment shown in FIG. 4 not only uses multiple nodes to determine the first candidate segmentation points for a preliminary division of the data to be processed, but also selects the sorting method for each node according to the memory size of the node and the amount of data to be processed, which ensures running speed while making full use of node memory.

Step S32: Distributing the to-be-processed data to the N nodes in an orderly manner according to the first candidate segmentation point.

In some embodiments, ordered allocation refers to a specific and known size relationship between the data to be processed on each node after allocation. For example, the maximum value of the data to be processed on the first node is smaller than the minimum value of the data to be processed on the second node, and so on.

For example, assuming that the number of nodes N is 4 and the first candidate segmentation points are $C_1$, $C_2$, and $C_3$, the data to be processed are allocated to the 4 nodes in order according to the first candidate segmentation points as follows: the data from the 0th to the $C_1$-th are assigned to the first node, the data from the ($C_1$+1)-th to the $C_2$-th are assigned to the second node, the data from the ($C_2$+1)-th to the $C_3$-th are assigned to the third node, and the data from the ($C_3$+1)-th to the last are assigned to the fourth node.
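Step S32 can be sketched as a redistribution of the randomly allocated partitions. The text describes the ranges by position; this sketch instead assumes the candidate segmentation points are attribute-value thresholds and that a value equal to a threshold goes to the lower node. The function name `redistribute_in_order` is hypothetical, and in-memory lists stand in for nodes.

```python
from bisect import bisect_left


def redistribute_in_order(partitions, candidate_cuts):
    """Move every record to the node whose value range contains it.

    partitions: the randomly allocated data, one list per node.
    candidate_cuts: N-1 thresholds (assumed to be attribute values).
    """
    cuts = sorted(candidate_cuts)
    ordered = [[] for _ in range(len(cuts) + 1)]
    for partition in partitions:
        for value in partition:
            # Node k receives values v with cut[k-1] < v <= cut[k].
            ordered[bisect_left(cuts, value)].append(value)
    return ordered


if __name__ == "__main__":
    random_parts = [[7, 1, 12], [3, 9, 15], [5, 11, 2]]
    for node in redistribute_in_order(random_parts, [4, 9]):
        print(sorted(node))
```

After this step the largest value on each node is no greater than the smallest value on the next node, so sorting each node locally yields a globally sorted sequence.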

Step S33: Sort the to-be-processed data on each node after the ordered distribution, respectively, to obtain the first sorted data in each node.

In some embodiments, the sorting method can be selected according to the memory size of each node and the amount of data to be processed on the node, so as to sort the data to be processed on each node.

Step S34: Obtain the global KS of the to-be-processed data according to the first sorted data in each node.

In related technologies, the KS value can be used to evaluate the risk discrimination ability of a model. The indicator measures the gap between the cumulative distributions of the first sample and the second sample. The larger the KS value, the better the variable distinguishes the first sample from the second sample.

In some embodiments, each node may include the first sample data and the second sample data.

In some embodiments, the labeling rules of the first sample and the second sample may be defined by the user. For example, in bank data, the user can define the data corresponding to those customers with credit problems as the first sample, and define the data corresponding to those customers without credit problems as the second sample.

In some embodiments, the KS value of an interval (there may be only one data in the interval) can be obtained in the following manner.

1. Sort the data.

2. Divide the sorted data, in order, into multiple data intervals.

3. Obtain the number of first samples (for example, good data) and the number of second samples (for example, bad data) in each interval.

4. Obtain the cumulative number of first samples and the cumulative number of second samples of each interval. The cumulative number of first samples is the number of first samples in the current interval plus the number of first samples in all intervals before it. For example, if the first interval has 3 first samples, the second interval has 2 first samples, and the third interval has 4 first samples, then the cumulative number of first samples in the second interval is 2+3.

5. Obtain the ratio of the cumulative number of first samples in each interval to the total number of first samples (good%) and the ratio of the cumulative number of second samples in each interval to the total number of second samples (bad%).

6. Take the absolute value of the difference between the ratio of the cumulative number of first samples in the interval to the total number of first samples and the ratio of the cumulative number of second samples in the interval to the total number of second samples (|good%-bad%|) as the KS value of the interval.
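The six steps above can be sketched as follows. This is an illustrative Python sketch only; it assumes the data has already been sorted and grouped into intervals, and that each interval is summarized by a hypothetical pair (number of first samples, number of second samples).

```python
def interval_ks(intervals):
    """Return the KS value of every interval.

    intervals: list of (first_count, second_count) pairs, listed in the
    order of the sorted data. Both classes must occur at least once.
    """
    total_first = sum(first for first, _ in intervals)
    total_second = sum(second for _, second in intervals)
    ks_values = []
    cum_first = cum_second = 0
    for first, second in intervals:
        # Steps 4 and 5: cumulative counts and their share of the totals.
        cum_first += first
        cum_second += second
        good_pct = cum_first / total_first
        bad_pct = cum_second / total_second
        # Step 6: |good% - bad%| is the KS value of the interval.
        ks_values.append(abs(good_pct - bad_pct))
    return ks_values

# First-sample counts 3, 2, 4 as in the example; second-sample counts are made up.
ks = interval_ks([(3, 1), (2, 1), (4, 2)])
```

The KS value of the last interval is always 0, because both cumulative ratios reach 1 there.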

In some embodiments, the repeated data to be processed may be merged before the global KS of the data to be processed is determined.

In some embodiments, since the first sorted data is also ordered across the nodes, the global KS value of the data to be processed can be determined according to the amount of first-sample data and the amount of second-sample data in each node.

In some embodiments, the global KS of a data refers to the KS value of that data obtained on the basis of all the data to be processed. For example, if the data to be processed is divided among three nodes, which hold N1, N2, and N3 first samples and N4, N5, and N6 second samples respectively, then the global KS value of the last data on the second node can be expressed as |(N1+N2)/(N1+N2+N3) - (N4+N5)/(N4+N5+N6)|.
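A minimal Python sketch of this idea follows. It assumes, for illustration, that each node is represented by the label sequence of its sorted data (1 for a first sample, 0 for a second sample) and that the nodes are listed in ascending order; the function name and data layout are hypothetical.

```python
def global_ks_per_node(node_labels):
    """Return, for each node, the global KS value at every data position.

    node_labels[k] holds the labels (1 = first sample, 0 = second sample)
    of node k's sorted data; nodes are in ascending order of value.
    """
    # Each node only needs to report its two counts to the others.
    totals = [(sum(labels), len(labels) - sum(labels)) for labels in node_labels]
    total_first = sum(first for first, _ in totals)
    total_second = sum(second for _, second in totals)

    result = []
    offset_first = offset_second = 0  # counts held by all preceding nodes
    for labels, (first, second) in zip(node_labels, totals):
        cum_first, cum_second = offset_first, offset_second
        node_ks = []
        for label in labels:
            cum_first += label
            cum_second += 1 - label
            node_ks.append(abs(cum_first / total_first - cum_second / total_second))
        result.append(node_ks)
        offset_first += first
        offset_second += second
    return result

ks = global_ks_per_node([[1, 0], [1, 1], [0, 0]])
```

Here N1, N2, N3 are 1, 2, 0 and N4, N5, N6 are 1, 0, 2, so the last data on the second node has a global KS of |3/3 - 1/3|.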

Step S35: Determine the target quantile according to the global KS of the data to be processed.

In some embodiments, the target quantile can be determined according to the steps shown in FIG. 5.

Step S351: Determine a second candidate segmentation point in the first sorted data on the N nodes according to the global KS of the data to be processed.

In some embodiments, the second candidate segmentation point can also be determined according to the steps shown in FIG. 6.

Step S3511: Determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point.

In some embodiments, data corresponding to a maximum KS value can be determined in the data to be processed according to the global KS of the data to be processed as the second candidate segmentation point.

Step S3512: If the amounts of data to be processed on both the left and the right of the second candidate segmentation point are greater than the preset data amount, determine, on each of the left and right sides of the second candidate segmentation point, the to-be-processed data corresponding to a maximum KS as a further second candidate segmentation point.

In some embodiments, the preset data amount may be set by the user in advance.

In some embodiments, it is determined whether the amounts of data to be processed on the left and right of the second candidate segmentation point obtained in step S3511 are greater than the preset data amount (if more than one second candidate segmentation point has been obtained, the check is made on the left and right sides of each of them). If the amounts of data on the left and right of every second candidate segmentation point are all greater than the preset data amount, the to-be-processed data corresponding to a maximum KS is determined on each of the left and right sides of each second candidate segmentation point and used as a further second candidate segmentation point; if there is a second candidate segmentation point for which the amount of data to be processed on its left or right side is less than the preset data amount, the iteration stops.
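The iterative search can be sketched in Python as a recursion. The sketch makes two simplifying assumptions that are not stated in the text: the KS values are recomputed within each segment being split, and the stop condition is applied per segment rather than globally. The names `best_ks_index`, `candidate_splits`, and `min_size` (standing in for the preset data amount) are hypothetical.

```python
def best_ks_index(labels, lo, hi):
    """Index in [lo, hi) with the largest KS computed inside the segment.

    labels: 1 for a first sample, 0 for a second sample, in sorted order.
    Returns None when one of the two classes is absent from the segment.
    """
    total_first = sum(labels[lo:hi])
    total_second = (hi - lo) - total_first
    if total_first == 0 or total_second == 0:
        return None
    best_i, best_ks = None, -1.0
    cum_first = cum_second = 0
    for i in range(lo, hi):
        cum_first += labels[i]
        cum_second += 1 - labels[i]
        ks = abs(cum_first / total_first - cum_second / total_second)
        if ks > best_ks:  # ties keep the leftmost index
            best_i, best_ks = i, ks
    return best_i

def candidate_splits(labels, min_size, lo=0, hi=None):
    """Collect candidate segmentation points (indices) in ascending order."""
    hi = len(labels) if hi is None else hi
    i = best_ks_index(labels, lo, hi)
    if i is None:
        return []
    left_size, right_size = i - lo + 1, hi - i - 1
    if left_size <= min_size or right_size <= min_size:
        return [i]  # keep this point but do not split any further
    return (candidate_splits(labels, min_size, lo, i + 1)
            + [i]
            + candidate_splits(labels, min_size, i + 1, hi))

splits = candidate_splits([1, 1, 1, 0, 1, 0, 0, 0], min_size=2)
```

A larger `min_size` ends the recursion earlier and therefore yields fewer candidate points.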

Step S352: Determine the target quantile point in the second candidate segmentation point according to the preset number of bins.

In some embodiments, the determination of the target quantile point among the second candidate segmentation points according to the preset number of bins can be achieved through the steps shown in FIG. 7.

Step S3521: Determine whether the number of the second candidate segmentation points is less than the preset number of bins.

Step S3522: If the number of the second candidate segmentation points is less than the preset number of bins, it is determined that the second candidate segmentation point is the target quantile point.

Step S3523: If the number of the second candidate segmentation points is greater than or equal to the preset number of bins, the target quantile point is determined according to the preset number of bins and using a dynamic programming method.

In some embodiments, assuming that the number of second candidate segmentation points is N and the number of target bins is M, where N is greater than or equal to M, M-1 target quantile points can be determined from the N second candidate segmentation points.

In some embodiments, when M-1 target quantile points are determined among the N second candidate segmentation points, there may be C(N, M-1) possible solutions, that is, the number of ways of choosing M-1 points from N.

For each solution, the IV value of the corresponding solution can be obtained by formula (1).

Among them, good_Pcnt_i% represents the proportion of the number of first samples in the i-th interval (the interval may include only one number) to the total number of first samples, and bad_Pcnt_i% represents the proportion of the number of second samples in the i-th interval to the total number of second samples.
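Formula (1) itself does not appear in this text. Assuming it is the commonly used definition of information value (IV), and writing M for the number of bins of a solution (a symbol borrowed from the preceding paragraphs), it would read as follows, with both proportions taken as fractions:

```latex
\mathrm{IV} \;=\; \sum_{i=1}^{M}
\left(\mathrm{good\_Pcnt}_i - \mathrm{bad\_Pcnt}_i\right)
\ln\frac{\mathrm{good\_Pcnt}_i}{\mathrm{bad\_Pcnt}_i}
```

Under this definition every term is non-negative, and the IV of a solution is the sum of independent per-interval contributions.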

In some embodiments, the IV value of each solution can be obtained in turn, the solution corresponding to the maximum IV value can be found as the optimal solution, and the target quantile can be determined according to the optimal solution. This method occupies less space and has simple logic. However, it repeats many calculations, and its calculation efficiency is not high.

In some embodiments, a dynamic programming method can be selected to determine the target points. The dynamic programming method can cache the solution of the sub-problem that has been solved, and the solution of the sub-problem can be used directly next time, avoiding repeated operations.
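One possible dynamic programming formulation is sketched below in Python. It is an assumption-laden illustration rather than the disclosed implementation: it assumes the standard IV definition (sum over bins of (good% - bad%) * ln(good% / bad%)), treats a bin lacking one class as contributing 0, and takes as input the hypothetical list `counts` of (first samples, second samples) for each elementary interval between adjacent candidate points.

```python
import math

def bin_iv(good, bad, total_good, total_bad):
    """IV contribution of one bin (assumed standard IV definition)."""
    if good == 0 or bad == 0:
        return 0.0  # simplification: a bin lacking one class contributes 0
    g, b = good / total_good, bad / total_bad
    return (g - b) * math.log(g / b)

def best_cuts(counts, num_bins):
    """Merge the elementary intervals into num_bins bins with maximum IV.

    Returns (max_iv, cuts); a cut c means "split before interval c".
    """
    n = len(counts)
    if not 1 <= num_bins <= n:
        raise ValueError("num_bins must be between 1 and len(counts)")
    # Prefix sums give the counts of any run of intervals in O(1).
    pg, pb = [0], [0]
    for good, bad in counts:
        pg.append(pg[-1] + good)
        pb.append(pb[-1] + bad)

    def run_iv(a, j):
        return bin_iv(pg[j] - pg[a], pb[j] - pb[a], pg[n], pb[n])

    neg = float("-inf")
    # best[k][j]: max IV when the first j intervals form exactly k bins.
    best = [[neg] * (n + 1) for _ in range(num_bins + 1)]
    prev = [[0] * (n + 1) for _ in range(num_bins + 1)]
    best[0][0] = 0.0
    for k in range(1, num_bins + 1):
        for j in range(k, n + 1):
            for a in range(k - 1, j):  # the last bin covers intervals a..j-1
                if best[k - 1][a] == neg:
                    continue
                cand = best[k - 1][a] + run_iv(a, j)  # reuse cached sub-solution
                if cand > best[k][j]:
                    best[k][j], prev[k][j] = cand, a
    cuts, j = [], n
    for k in range(num_bins, 1, -1):  # walk back through the stored choices
        j = prev[k][j]
        cuts.append(j)
    return best[num_bins][n], cuts[::-1]

max_iv, cuts = best_cuts([(10, 1), (8, 3), (5, 5), (3, 8), (1, 10)], 3)
```

Because the totals are fixed, the IV of a solution is a sum of per-bin terms, which is what allows each `best[k][j]` entry to be computed once and reused instead of re-evaluating every combination.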

The foregoing embodiment provides a data binning processing method, which has the following beneficial effects:

1. The data to be processed is binned based on the KS index, which can effectively bin continuous variables and has stronger interpretability. The method can also accommodate many user-specific requirements, for example, a requirement that the IV of the binning result be monotonic.

2. Sort the data to be processed according to the memory of the node and the amount of data to be processed on the node, ensuring the running speed while fully utilizing the memory of the node.

3. Use the dynamic programming method to determine the target quantile, saving running time.

4. Compared with the equal-frequency and equal-distance binning methods, this method does not require business experience and can complete the binning operation automatically.

5. This method distributes large-scale data to be processed to multiple nodes, then determines the target quantile points from the data on the multiple nodes, and finally performs the binning operation on the data to be processed according to the target quantile points, overcoming the shortcoming that a single machine's memory is too small to handle large-scale data.

Referring to FIG. 8, the data binning processing method provided by the embodiment of the present disclosure may further include the following steps.

Step S1: Obtain data to be processed.

Step S5: If the amount of data to be processed is less than a preset threshold, sort the data to be processed to generate third sorted data.

In some embodiments, the sorting method may be selected according to the memory size of the node and the memory size required for processing the data to be processed, so as to sort the data to be processed. In some embodiments, when the memory space required for the data to be processed on a single node is less than half of the memory of the node, bucket sorting (such as radix sort) can be used to sort the data to be processed on the node. When the space required for the data to be processed on the node is greater than or equal to half of the memory of the node, quick sort can be used to sort the data to be processed on the node. Quick sort occupies less memory but is slower, while bucket sorting is faster but occupies more memory.
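The selection rule can be written as a small Python sketch. The function name and the byte-based parameters are hypothetical; only the half-of-memory threshold comes from the text.

```python
def choose_sort_method(required_bytes, node_memory_bytes):
    """Pick a sorting method from the memory the data needs on one node.

    Less than half of the node memory: bucket-style sort (faster, more memory).
    Otherwise: quick sort (slower, less memory).
    """
    if required_bytes < node_memory_bytes / 2:
        return "bucket"
    return "quick"

method = choose_sort_method(required_bytes=400, node_memory_bytes=1000)
```

Data needing exactly half of the node memory falls on the quick sort side, matching the "greater than or equal to" wording above.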

Step S6: Determine the KS of the third sorted data.

In some embodiments, the repeated data to be processed may be merged before determining the KS of the data to be processed.

In some embodiments, the KS value of each data in the third sorted data can be determined based on the total number of first samples and the total number of second samples in the third sorted data, together with the cumulative number of first samples and the cumulative number of second samples at each data in the third sorted data.

Step S7: Determine a third candidate segmentation point according to the KS of the third sorted data.

In some embodiments, a maximum KS may be determined among the KS values of the third sorted data, and the corresponding to-be-processed data may be used as the third candidate segmentation point.

In some embodiments, if the amounts of data to be processed on both the left and right sides of the third candidate segmentation point are greater than the preset data amount, the to-be-processed data corresponding to a maximum KS is determined on each of the left and right sides of the third candidate segmentation point and used as a further third candidate segmentation point.

In some embodiments, the preset data amount may be set by the user in advance.

In some embodiments, it is determined whether the amounts of data to be processed on the left and right sides of the third candidate segmentation point are greater than the preset data amount (if more than one third candidate segmentation point has been obtained in the above steps, the check is made on the left and right sides of each of them). If the amounts of data to be processed on the left and right sides of every third candidate segmentation point are all greater than the preset data amount, the to-be-processed data corresponding to a maximum KS is determined on each of the left and right sides of each third candidate segmentation point and used as a further third candidate segmentation point. If there is a third candidate segmentation point for which the amount of data to be processed on its left or right side is less than the preset data amount, the iteration stops.

Step S8: Determine whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins.

In some embodiments, if the number of the third candidate segmentation points is less than the preset number of bins, it is determined that the third candidate segmentation point is the target quantile point.

Step S9: If the number of the third candidate segmentation points is greater than or equal to the preset number of bins, the target quantile point is determined according to the preset number of bins and using a dynamic programming method.

In some embodiments, assuming that the number of third candidate segmentation points is N and the number of target bins is M, where N is greater than or equal to M, M-1 target quantile points must be determined from the N third candidate segmentation points.

For each solution, the IV value of the solution can be obtained by formula (1).

In some embodiments, a third candidate segmentation point corresponding to the solution with the largest IV value may be selected as the target quantile point.

In some embodiments, the IV value of each solution can be obtained in turn, the solution corresponding to the largest IV value can be found as the optimal solution, and the target quantile can be determined according to the optimal solution. This way of obtaining the optimal solution occupies less space and has simple logic. However, it repeats many calculations, and its calculation efficiency is not high.

In some embodiments, the technical solution provided in the embodiment shown in FIG. 8 can be used on a single node to complete the binning processing of data with a single attribute. If a data list includes data with multiple attributes, for example both age and score, the data in the data list can also be distributed to multiple nodes by attribute, and the above method can be applied on those nodes at the same time to complete the binning processing.

The technical solution provided by the embodiment shown in Fig. 8, on the one hand, bins the data to be processed based on the KS index, which can effectively bin continuous variables and is more interpretable. On the other hand, it sorts the data to be processed according to the node memory and the amount of data to be processed on the node, ensuring running speed while making full use of node memory. Furthermore, the method uses dynamic programming to find the eligible target quantile points, which saves running time.

Referring to FIG. 9, step S3 provided by the embodiment shown in FIG. 2 may further include the following steps.

Step S36: If the target binning mode is the second binning mode, determine the fourth candidate segmentation point of the data to be processed.

Referring to FIG. 10, step S36 provided in the embodiment shown in FIG. 9 may include the following steps.

S361: Sort the to-be-processed data on each node respectively to obtain fifth sorted data in each node.

In some embodiments, the data to be processed on each node may be sorted separately to obtain the fifth sorted data in each node.

In some embodiments, the sorting method can be selected according to the memory size of the node and the memory size required for processing the data to be processed, so as to sort the data to be processed. In some embodiments, when the memory space required for the data to be processed on a single node is less than half of the memory of the node, bucket sorting (such as radix sort) can be used to sort the data to be processed on the node. When the space required for the data to be processed on the node is greater than or equal to half of the memory of the node, quick sort can be used to sort the data to be processed on the node. Quick sort occupies less memory but is slower, while bucket sorting is faster but occupies more memory.

S362: Perform equal frequency division on each fifth sorted data according to the number N of the nodes, to obtain a second pre-segment point on each node.

In some embodiments, the equal frequency division of the sorted data on each node can be realized according to the number N of nodes designated by the user and the amount of data to be processed on each node. Assuming that the amount of data to be processed on the first node is 1000 and the number of bins preset by the user is 5, the sorted data on the first node can be divided into equal-frequency parts of 1000/5 = 200 data each.

In some embodiments, the second pre-segment point on each node can be obtained after equal frequency division of each node according to the amount of data to be processed on each node and the number N of the nodes.

S363. Determine the fourth candidate segmentation point according to the second pre-segmentation point.

In some embodiments, the fourth candidate segmentation point may be determined according to the second pre-segmentation point.

In some embodiments, the second pre-segment points on the nodes may be averaged position by position to determine the fourth candidate segmentation points. For example, suppose the number of nodes N is 4, the second pre-segment points determined on the first node are 2.2, 4.2, 5.8, and 8.2, and the second pre-segment points determined on the second node are 1.8, 3.8, 6.2, and 7.8; averaging the corresponding second pre-segment points of the first node and the second node then gives the fourth candidate segmentation points 2, 4, 6, and 8.

In some other embodiments, the median, maximum, or minimum of the corresponding second pre-segment points on the nodes may be used as the fourth candidate segmentation point.
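The per-node equal frequency division and the position-by-position combination can be sketched in Python as follows. The helper names are hypothetical, only the averaging variant is shown, and the boundary convention (the last value of each equally sized run) is an assumption.

```python
def equal_frequency_points(sorted_values, parts):
    """Boundary values that split sorted_values into `parts` runs of equal size."""
    n = len(sorted_values)
    # The k-th boundary is the last value of the k-th run.
    return [sorted_values[(k * n) // parts - 1] for k in range(1, parts)]

def average_points(per_node_points):
    """Average the pre-segment points of all nodes position by position."""
    return [sum(column) / len(column) for column in zip(*per_node_points)]

# Pre-segment points of the first and second node from the example above.
candidates = average_points([[2.2, 4.2, 5.8, 8.2], [1.8, 3.8, 6.2, 7.8]])
pre_points = equal_frequency_points(list(range(1, 1001)), 5)
```

Replacing the mean in `average_points` with a median, maximum, or minimum gives the alternative embodiments mentioned above.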

Step S37: Distributing the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point.

In some embodiments, ordered allocation refers to a specific, known size relationship between the data to be processed on each node. For example, the maximum value of the data to be processed on the first node is smaller than the minimum value of the data to be processed on the second node, and so on.

Step S38: Sort the to-be-processed data on each node after the orderly distribution respectively to obtain the fourth sorted data in each node.

Step S39: Determine the target quantile point in the fourth sorted data according to the preset number of bins.

In some embodiments, if the data to be processed has been sorted, the target points can be determined according to the amount of data to be processed and the preset number of bins.

For example, suppose the amount of data to be processed is 10000, the first node holds 2520 fourth sorted data, the second node holds 2480, and the third and fourth nodes hold 2500 each, and the maximum value on the first node is smaller than the minimum value on the second node, and so on. If the preset number of bins is 4, the target quantile points should be the 2500th, 5000th, and 7500th data. Because the data on each of the four nodes is sorted and the four nodes are themselves ordered, it is easy to determine the 2500th, 5000th, and 7500th data after sorting.
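Locating a globally ranked data item across ordered nodes only requires the node sizes. The following Python sketch uses the node sizes 2520, 2480, 2500, and 2500 from the example and assumes 4 bins; the function names are hypothetical.

```python
def equal_frequency_ranks(total, num_bins):
    """1-based global ranks of the equal-frequency quantile points."""
    return [k * total // num_bins for k in range(1, num_bins)]

def locate_global_rank(node_sizes, rank):
    """Map a 1-based global rank to (node index, 0-based index on that node)."""
    offset = 0  # amount of data held by all preceding nodes
    for node, size in enumerate(node_sizes):
        if rank <= offset + size:
            return node, rank - offset - 1
        offset += size
    raise ValueError("rank exceeds the total amount of data")

sizes = [2520, 2480, 2500, 2500]
ranks = equal_frequency_ranks(sum(sizes), 4)
positions = [locate_global_rank(sizes, r) for r in ranks]
```

Each quantile point is then read directly from one node's sorted data, so no global sort or data movement is needed at this stage.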

The binning processing method provided in the foregoing embodiment completes binning processing of large-scale to-be-processed data on multiple nodes based on an equal frequency method. This method first randomly allocates the to-be-processed data to multiple nodes and determines the preliminary equal-frequency cut-off points, namely the fourth candidate segmentation points; it then allocates the data to be processed to each node in order according to the fourth candidate segmentation points, sorts the data on each node, and finally determines the target quantile points based on the sorted data and the preset number of bins. The binning processing method can perform binning processing on evenly distributed large-scale data.

In some embodiments, step S3 provided in the embodiment shown in FIG. 2 may further include the following steps.

If the target binning method is the third binning method, the maximum value and the minimum value on each node are respectively obtained; the maximum value and the minimum value of the data to be processed are determined according to the maximum value and the minimum value on each node Value; the target quantile is determined according to the maximum and minimum values of the data to be processed and the preset number of bins.

In some embodiments, after randomly distributing the data to be processed to N nodes, the maximum value and minimum value on each node can be obtained respectively, and an overall maximum value and minimum value can be determined from the per-node maximum and minimum values as the maximum and minimum values of the data to be processed. Given the maximum and minimum values of the data to be processed and the preset number of bins, the quantile points of the data to be processed can be determined. For example, if it is known that the maximum value of the data to be processed is 10000, the minimum value is 1, and the number of bins is 4, then the target quantile points are approximately 2500, 5000, and 7500, and the binning operation can be performed on the data according to the target quantile points.
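A minimal Python sketch of this equal-width computation follows. It assumes each node reports a hypothetical (minimum, maximum) pair; the per-node values below are made up so that the overall minimum is 1 and the overall maximum is 10000.

```python
def equal_width_points(node_ranges, num_bins):
    """Equal-width quantile points from per-node (min, max) pairs."""
    lo = min(node_min for node_min, _ in node_ranges)
    hi = max(node_max for _, node_max in node_ranges)
    width = (hi - lo) / num_bins
    return [lo + k * width for k in range(1, num_bins)]

points = equal_width_points([(1, 4000), (250, 10000), (3, 9000)], 4)
```

With a minimum of 1 rather than 0, the exact points are 2500.75, 5000.5, and 7500.25, which the example above rounds to 2500, 5000, and 7500.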

In the above embodiment, the maximum value and minimum value are first determined on each node, then the maximum value and minimum value of the large-scale data to be processed are determined from the per-node maximum and minimum values, and finally the binning operation on the data to be processed is completed according to the maximum value, the minimum value, and the preset number of bins. This method is simple and easy to operate, and is suitable for relatively concentrated data to be processed.

Fig. 11 is a flowchart showing a data binning processing method according to an exemplary embodiment.

Referring to FIG. 11, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.

Step S111: Obtain the data to be processed and its target binning method and preset binning number.

Step S112: Determine that the amount of data to be processed is greater than or equal to a preset threshold.

Step S113: Randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1.

Step S114, if the target binning mode is the first binning mode, sort the to-be-processed data on each node to obtain the second sorted data in each node.

Step S115: Perform equal frequency division on each second sorted data according to the number of nodes to obtain the first pre-segment point on each node.

Step S116: Determine the first candidate segmentation point according to the first pre-segmentation point.

Step S117: Distribute the to-be-processed data to the N nodes in an orderly manner according to the first candidate segmentation point.

Step S118: Sort the to-be-processed data on each node after the ordered distribution, respectively, to obtain the first sorted data in each node.

Step S119: Obtain the global KS of the to-be-processed data according to the first sorted data in each node.

Step S1110: Determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point.

Step S1111: Determine whether the data amount of the data to be processed on the left and right of the second candidate segmentation point is greater than a preset data amount.

If the amounts of data to be processed on the left and right of the second candidate segmentation point are greater than the preset data amount, step S1112 is executed; if they are not greater than the preset data amount, step S1113 is executed.

Step S1112: Determine the to-be-processed data corresponding to a maximum KS on the left and right sides of the second candidate segmentation point, respectively, as the second candidate segmentation point. Then, continue to perform step S1111 until the amount of data to be processed on the left and right sides of the second candidate segmentation point is less than or equal to the preset data amount.

Step S1113: Determine whether the number of the second candidate segmentation points is less than the preset number of bins.

If it is determined that the number of the second candidate segmentation points is less than the preset number of bins, step S1114 is executed; if it is determined that the number of the second candidate segmentation points is not less than the preset number of bins, Step S1115 is executed.

Step S1114: Determine that the second candidate segmentation point is the target quantile point.

Step S1115: Determine the target quantile point according to the preset number of bins and using a dynamic programming method.

Step S1116: Obtain a binning result of the to-be-processed data according to the target quantile.

1. The data to be processed is binned based on the KS index, which can effectively bin continuous variables and has stronger interpretability.

4. This method distributes large-scale data to be processed to multiple nodes, then determines the target quantile points from the data on the multiple nodes, and finally performs the binning operation on the data to be processed according to the target quantile points, overcoming the shortcoming that a single machine's memory is too small to handle large-scale data.

Fig. 12 is a flowchart showing a method for processing data binning according to an exemplary embodiment.

Step S121: Obtain the data to be processed and its target binning method and preset binning number.

Step S122: Determine that the amount of data to be processed is greater than or equal to a preset threshold.

Step S123: If the target binning mode is the second binning mode, sort the to-be-processed data on each node to obtain the fifth sorted data in each node.

Step S124: Perform equal frequency division on each fifth sorted data according to the number of nodes to obtain second pre-segment points on each node.

Step S125: Determine the fourth candidate segmentation point according to the second pre-segmentation point.

Step S126: Distribute the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point.

Step S127: Sort the to-be-processed data on each node after the orderly distribution respectively to obtain the fourth sorted data in each node.

Step S128: Determine the target quantile point in the fourth sorted data according to the preset number of bins.

Step S129: Obtain a binning result of the to-be-processed data according to the target quantile.

Fig. 13 is a flowchart showing a method for processing data binning according to an exemplary embodiment.

Referring to FIG. 13, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.

Step S131: Obtain the to-be-processed data and its target binning method and preset binning number.

Step S132: Determine that the amount of data to be processed is greater than or equal to a preset threshold.

Step S133: Randomly allocate the data to be processed to N nodes, where N is a positive integer greater than 1.

Step S134: If the target binning mode is the third binning mode, the maximum value and the minimum value on each node are obtained respectively.

Step S135: Determine the maximum value and the minimum value of the to-be-processed data according to the maximum value and the minimum value on each node.

Step S136: Determine the target quantile point according to the maximum value and minimum value of the data to be processed and the preset number of bins.

Step S137: Obtain a binning result of the to-be-processed data according to the target quantile.

Fig. 14 is a flow chart showing a method for processing data binning according to an exemplary embodiment.

Referring to FIG. 14, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.

Step S141: Obtain the data to be processed and its target binning method and preset binning number.

Step S142: Determine that the amount of data to be processed is less than a preset threshold.

Step S143: Sort the to-be-processed data to generate third sorted data.

Step S144: Determine the KS of the third sorted data.

Step S145: Determine a maximum KS among the KS of the third sorted data, and use the corresponding to-be-processed data as the fifth candidate segmentation point.

Step S146: Determine whether the amount of data to be processed on the left and right sides of the fifth candidate segmentation point is greater than a preset data amount.

If it is determined that the data amount of the data to be processed on the left and right sides of the fifth candidate segmentation point is greater than the preset data amount, then continue to perform step S146, otherwise, perform step S147.

Step S147: Determine whether the number of the fifth candidate segmentation points is less than the preset number of bins.

If it is determined that the number of the fifth candidate segmentation points is less than the preset number of bins, step S148 is executed; otherwise, step S149 is executed.

Step S148: Determine that the fifth candidate segmentation point is the target quantile point.

Step S149: Determine the target quantile point according to the preset number of bins and using a dynamic programming method.

Step S1410: Obtain a binning result of the to-be-processed data according to the target quantile.

In some embodiments, the technical solution provided by the embodiment shown in FIG. 14 can be used on a single node to complete the binning processing of data with a single attribute. If a data list includes data with multiple attributes, for example both age and score, the data in the data list can also be assigned to multiple nodes by attribute, and the above method can be applied on those nodes at the same time to complete the binning processing.

The technical solution provided by the embodiment shown in Fig. 14, on the one hand, bins the data to be processed based on the KS index, which can effectively bin continuous variables and is more interpretable. On the other hand, it sorts the data to be processed according to the node memory and the amount of data to be processed on the node, ensuring running speed while making full use of node memory. Furthermore, the method uses dynamic programming to find the eligible target quantile points, which saves running time.

Fig. 15 is a block diagram showing a data binning processing device according to an exemplary embodiment. Referring to FIG. 15, the device 150 includes a data acquisition module 1501, a data distribution module 1502, a target quantile determination module 1503, and a binning module 1504.

Among them, the data acquisition module 1501 can be configured to acquire the data to be processed, its target binning method, and the preset number of bins; the data distribution module 1502 can be configured to randomly distribute the data to be processed to N nodes if the data volume of the data to be processed is greater than or equal to the preset threshold, where N is a positive integer greater than 1; the target quantile determination module 1503 may be configured to process the to-be-processed data on the N nodes according to the preset number of bins and the target binning method, so as to determine the target quantile of the to-be-processed data; and the binning module 1504 may be configured to perform a binning operation on the to-be-processed data according to the target quantile to obtain a binning result.

In some embodiments, the target quantile determination module 1503 shown in FIG. 15 may include a first candidate segmentation point determination sub-module, a first allocation sub-module, a first sorting sub-module, a global KS determination sub-module, and a first target quantile determination sub-module.

The first candidate segmentation point determination submodule may be configured to determine the first candidate segmentation point of the data to be processed if the target binning mode is the first binning mode. The first allocation submodule may be configured to distribute the data to be processed to the N nodes in order according to the first candidate segmentation point. The first sorting submodule may be configured to sort the data to be processed on each node after the ordered distribution, so as to obtain the first sorted data in each node. The global KS determination submodule may be configured to obtain the global KS of the data to be processed according to the first sorted data in each node. The first target quantile determination submodule may be configured to determine the target quantile according to the global KS of the data to be processed.
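One way a global KS could be assembled from per-node sorted data is sketched below. The sketch assumes that each record carries a binary label (1 for "bad", 0 for "good"), that KS at a value is the absolute difference between the cumulative bad rate and the cumulative good rate up to that value, and that after the ordered distribution equal values never straddle two nodes. These assumptions, and the function name `global_ks`, are illustrative and are not stated in this section.

```python
def global_ks(sorted_nodes):
    """Compute KS at every distinct value across range-partitioned nodes.

    sorted_nodes: list of per-node lists of (value, label), each sorted by
    value, with every value on node i below every value on node i + 1.
    Returns a list of (value, ks) in global order.
    """
    total_bad = sum(label for node in sorted_nodes for _, label in node)
    total_good = sum(1 - label for node in sorted_nodes for _, label in node)
    if total_bad == 0 or total_good == 0:
        raise ValueError("KS needs at least one record of each class")

    result = []
    bad_offset = good_offset = 0  # counts contributed by earlier nodes
    for node in sorted_nodes:
        bad = good = 0
        for i, (value, label) in enumerate(node):
            bad += label
            good += 1 - label
            # Emit KS only at the last occurrence of a repeated value.
            if i + 1 < len(node) and node[i + 1][0] == value:
                continue
            ks = abs((bad_offset + bad) / total_bad
                     - (good_offset + good) / total_good)
            result.append((value, ks))
        bad_offset += bad
        good_offset += good
    return result
```

Because each node only needs the class totals of the nodes before it, the per-node cumulative counts can be computed independently and then shifted by those offsets.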

In some embodiments, the first candidate segmentation point determination sub-module may include a second sorting unit, a first pre-segment point determination unit, and a first candidate segmentation point determination unit.

The second sorting unit may be configured to sort the data to be processed on each node to obtain the second sorted data in each node. The first pre-segment point determination unit may be configured to perform equal-frequency division on each second sorted data according to the number N of nodes, so as to obtain the first pre-segment point on each node. The first candidate segmentation point determination unit may be configured to determine the first candidate segmentation point according to the first pre-segment point.
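The three units can be sketched as follows. Each node sorts its own data and cuts it into N equal-frequency parts, which gives N - 1 pre-segment points per node. How the candidate segmentation points are derived from the pre-segment points is not detailed in this passage; the sketch assumes a sample-sort style rule (pool all pre-segment points, sort them, and take N - 1 evenly spaced values). That rule and the function names are assumptions.

```python
def pre_segment_points(node_data, n_nodes):
    """Sort one node's data and return its n_nodes - 1 equal-frequency points."""
    s = sorted(node_data)
    if not s:
        return []
    return [s[len(s) * k // n_nodes] for k in range(1, n_nodes)]


def candidate_split_points(nodes):
    """Pool the pre-segment points of all nodes and pick N - 1 candidates.

    The pooling rule is an assumed, sample-sort style choice.
    """
    n = len(nodes)
    pooled = sorted(p for node in nodes for p in pre_segment_points(node, n))
    if not pooled:
        return []
    return [pooled[len(pooled) * k // n] for k in range(1, n)]
```

With N - 1 candidates, every value can later be routed to exactly one of the N nodes by comparing it against the candidates, which is what makes the subsequent distribution ordered.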

In some embodiments, the first target quantile determination sub-module 035 shown in FIG. 15 may include a second candidate segmentation point determination unit and a target quantile determination unit.

The second candidate segmentation point determination unit may be configured to determine the second candidate segmentation point in the first sorted data on the N nodes according to the global KS of the data to be processed. The target quantile determination unit may be configured to determine the target quantile among the second candidate segmentation points according to the preset number of bins.

In some embodiments, the second candidate segmentation point determination unit may include a maximum KS determination subunit and a binary unit.

The maximum KS determining subunit may be configured to determine the maximum KS in the global KS and use its corresponding data to be processed as the second candidate segmentation point. The binary unit may be configured to, if the data volume of the data to be processed on the left and right sides of the second candidate segmentation point is greater than the preset data volume, determine the data to be processed corresponding to the maximum KS on the left side and on the right side of the second candidate segmentation point, respectively, as further second candidate segmentation points.
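The repeated left/right search can be read as a recursive bisection, sketched below on a single sorted list of labeled records. Two points are assumptions rather than statements of the disclosure: KS is recomputed inside each sub-segment (as in common best-KS binning), and each side is tested against the preset data volume independently. The names `ks_curve`, `best_ks_splits`, and `min_size` are hypothetical.

```python
def ks_curve(segment):
    """KS at each distinct value of a sorted (value, label) segment.

    Returns (index, value, ks) triples; the last value is skipped because
    splitting there would leave an empty right side.
    """
    total_bad = sum(label for _, label in segment)
    total_good = len(segment) - total_bad
    if total_bad == 0 or total_good == 0:
        return []  # a pure segment has no meaningful KS
    out = []
    bad = good = 0
    for i, (value, label) in enumerate(segment[:-1]):
        bad += label
        good += 1 - label
        if segment[i + 1][0] != value:
            out.append((i, value, abs(bad / total_bad - good / total_good)))
    return out


def best_ks_splits(segment, min_size):
    """Take the max-KS value as a split, then recurse into each side that
    still holds more than min_size records."""
    curve = ks_curve(segment)
    if not curve:
        return []
    i, value, _ = max(curve, key=lambda t: t[2])
    left, right = segment[:i + 1], segment[i + 1:]
    splits = [value]
    if len(left) > min_size:
        splits = best_ks_splits(left, min_size) + splits
    if len(right) > min_size:
        splits = splits + best_ks_splits(right, min_size)
    return splits
```

The recursion stops on its own when a side becomes small or contains only one class, so the number of candidates depends on the data rather than on the preset number of bins.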

In some embodiments, the second target quantile determination unit may include a first judgment subunit, a second target quantile determination subunit, and a second target quantile determination subunit.

The first judgment subunit may be configured to judge whether the number of the second candidate segmentation points is less than the preset number of bins. The second target quantile determination subunit may be configured to determine that the second candidate segmentation points are the target quantiles if the number of the second candidate segmentation points is less than the preset number of bins. The second target quantile determination subunit may be configured to determine the target quantile according to the preset number of bins and using a dynamic programming method if the number of the second candidate segmentation points is greater than or equal to the preset number of bins.
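The selection rule can be sketched as below. The objective optimized by the dynamic programming is not given in this passage, so the sketch takes it as a caller-supplied function `bin_score(i, j)` and maximizes the sum of bin scores; it also assumes that a preset number of bins `n_bins` corresponds to `n_bins - 1` cut points. Both assumptions, and the name `select_cut_points`, are illustrative only.

```python
from functools import lru_cache


def select_cut_points(candidates, n_bins, bin_score):
    """Pick cut points from sorted candidates.

    Boundaries are indexed 0..m+1: 0 is the left edge, 1..m are the
    candidates, m+1 is the right edge. bin_score(i, j) is a hypothetical
    score of the bin spanning boundary i to boundary j (higher is better).
    """
    m = len(candidates)
    if m < n_bins:
        return list(candidates)  # too few candidates: keep them all

    @lru_cache(maxsize=None)
    def best(j, b):
        """Best (score, interior boundaries) covering 0..j with b bins."""
        if b == 1:
            return (bin_score(0, j), ())
        options = []
        for i in range(b - 1, j):  # i is the last interior boundary
            score, cuts = best(i, b - 1)
            options.append((score + bin_score(i, j), cuts + (i,)))
        return max(options)

    _, cuts = best(m + 1, n_bins)
    return [candidates[i - 1] for i in cuts]
```

Memoizing on `(j, b)` means each sub-problem is solved once, so the search visits on the order of m² · n_bins combinations instead of enumerating every subset of candidates.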

In some embodiments, the device 150 shown in FIG. 15 may further include: a third sorting module, a KS determination module, a third candidate segmentation point determination module, a second judgment module, and a third target quantile determination module.

The third sorting module may be configured to sort the data to be processed to generate third sorted data if the data volume of the data to be processed is less than the preset threshold. The KS determination module may be configured to determine the KS of the third sorted data. The third candidate segmentation point determination module may be configured to determine the third candidate segmentation point according to the KS of the third sorted data. The second judgment module may be configured to judge whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins. The third target quantile determination module may be configured to determine the target quantile according to the preset number of bins and using a dynamic programming method if the number of the third candidate segmentation points is greater than or equal to the preset number of bins.

In some embodiments, the target quantile determination module 1503 shown in FIG. 15 may further include: a fourth candidate segmentation point determination submodule, a second allocation submodule, a fourth sorted data acquisition submodule, and a fourth target quantile determination submodule.

The fourth candidate segmentation point determination submodule may be configured to determine the fourth candidate segmentation point of the data to be processed if the target binning mode is the second binning mode. The second allocation submodule may be configured to distribute the data to be processed to the N nodes in order according to the fourth candidate segmentation point. The fourth sorted data acquisition submodule may be configured to sort the data to be processed on each node after the ordered distribution, so as to obtain the fourth sorted data in each node. The fourth target quantile determination submodule may be configured to determine the target quantile in the fourth sorted data according to the preset number of bins.
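The second binning mode reads like equal-frequency binning on range-partitioned data; that reading is an assumption, as are the function names below. In the sketch, values are routed to nodes by comparing them with the candidate segmentation points, each node sorts locally, and the k-th target quantile is looked up by global rank without ever merging the nodes.

```python
from bisect import bisect_right


def ordered_distribute(data, split_points):
    """Route each value to the node whose range contains it, then sort per node."""
    nodes = [[] for _ in range(len(split_points) + 1)]
    for x in data:
        nodes[bisect_right(split_points, x)].append(x)
    return [sorted(node) for node in nodes]


def equal_frequency_cut_points(sorted_nodes, n_bins):
    """Read the n_bins - 1 equal-frequency cut points by global rank.

    Node i holds global ranks [offset, offset + len(node)), so a target rank
    is resolved with one subtraction on the node that owns it.
    """
    total = sum(len(node) for node in sorted_nodes)
    targets = [total * k // n_bins for k in range(1, n_bins)]
    cuts = []
    offset = 0
    t = 0
    for node in sorted_nodes:
        while t < len(targets) and targets[t] < offset + len(node):
            cuts.append(node[targets[t] - offset])
            t += 1
        offset += len(node)
    return cuts
```

Only the node sizes need to be exchanged between nodes to compute the offsets, which is why the ordered distribution is performed before the local sort.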

In some embodiments, the fourth candidate segmentation point determination submodule may include: a fifth sorting submodule, a second pre-segment point determination submodule, and a fourth candidate segmentation point submodule.

The fifth sorting submodule may be configured to sort the data to be processed on each node to obtain the fifth sorted data in each node. The second pre-segment point determination submodule may be configured to perform equal-frequency division on each fifth sorted data according to the number N of nodes, so as to obtain the second pre-segment point on each node. The fourth candidate segmentation point submodule may be configured to determine the fourth candidate segmentation point according to the second pre-segment point.

In some embodiments, the device 150 shown in FIG. 15 may further include: a node maximum value acquisition module, a global maximum value determination module, and a fifth target quantile determination submodule.

The node maximum value acquisition module may be configured to obtain the maximum value and the minimum value on each node if the target binning mode is the third binning mode. The global maximum value determination module may be configured to determine the maximum value and the minimum value of the data to be processed according to the maximum value and the minimum value on each node. The fifth target quantile determination submodule may be configured to determine the target quantile according to the maximum value and the minimum value of the data to be processed and the preset number of bins.
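Determining cut points from only the global maximum, the global minimum, and the number of bins is consistent with equal-width binning; treating the third binning mode that way is an assumption. A minimal sketch, with a hypothetical function name, follows.

```python
def equal_width_cut_points(nodes, n_bins):
    """Reduce per-node (min, max) pairs to global extremes, then cut evenly."""
    local = [(min(node), max(node)) for node in nodes if node]
    if not local:
        return []
    lo = min(low for low, _ in local)
    hi = max(high for _, high in local)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


# Three nodes with global range [0, 20] and four bins of width 5.
print(equal_width_cut_points([[3, 10], [0, 7], [20, 5]], 4))  # [5.0, 10.0, 15.0]
```

Each node contributes only two numbers, so this mode needs neither sorting nor an ordered distribution of the data.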

Since each functional module of the data binning processing device 150 of the exemplary embodiment of the present disclosure corresponds to the steps of the foregoing exemplary embodiment of the data binning processing method, it will not be repeated here.

Referring now to FIG. 16, it shows a schematic structural diagram of a computer system 1600 suitable for implementing a terminal device according to an embodiment of the present application. The terminal device shown in FIG. 16 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.

As shown in FIG. 16, the computer system 1600 includes a central processing unit (CPU) 1601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1602 or a program loaded from a storage portion 1608 into a random access memory (RAM) 1603. In the RAM 1603, various programs and data required for the operation of the system 1600 are also stored. The CPU 1601, ROM 1602, and RAM 1603 are connected to each other through a bus 1604. An input/output (I/O) interface 1605 is also connected to the bus 1604.

The following components are connected to the I/O interface 1605: an input part 1606 including a keyboard, a mouse, etc.; an output part 1607 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage part 1608 including a hard disk; and a communication section 1609 including a network interface card such as a LAN card, a modem, etc. The communication section 1609 performs communication processing via a network such as the Internet. A drive 1610 is also connected to the I/O interface 1605 as needed. A removable medium 1611, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1610 as needed, so that the computer program read from it is installed into the storage part 1608 as needed.

In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication part 1609, and/or installed from the removable medium 1611. When the computer program is executed by the central processing unit (CPU) 1601, it executes the above-mentioned functions defined in the system of the present application.

It should be noted that the computer-readable medium shown in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this application, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer-readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium. The computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the possible implementation of the system architecture, functions, and operations of the system, method, and computer program product according to various embodiments of the present application. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagram or flowchart, and the combination of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or can be realized by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments described in the present application can be implemented in software or hardware. The described unit may also be provided in the processor. For example, it may be described as: a processor includes a sending unit, an acquiring unit, a determining unit, and a first processing unit. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.

As another aspect, the present application also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist alone without being assembled into the device. The computer-readable medium carries one or more programs. When the one or more programs are executed by a device, the functions that the device can implement include: obtaining the data to be processed, its target binning method, and the preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method, so as to determine the target quantile of the data to be processed; and performing a binning operation on the data to be processed according to the target quantile to obtain the binning result.

Through the description of the foregoing embodiments, those skilled in the art can easily understand that the exemplary embodiments described herein can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solutions of the embodiments of the present disclosure can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (which can be a CD-ROM, USB flash drive, removable hard disk, etc.) and includes several instructions that enable a computing device (which may be a personal computer, a server, a mobile terminal, or a smart device, etc.) to execute the method according to the embodiment of the present disclosure, such as one or more steps shown in FIG. 2.

In addition, the above-mentioned drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of the present disclosure, and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of these processings. In addition, it is easy to understand that these processes can be executed synchronously or asynchronously in multiple modules, for example.

Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing what is disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptive changes of the present disclosure. These variations, uses, or adaptive changes follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field that are not disclosed in the present disclosure. The description and embodiments are only regarded as exemplary, and the true scope and spirit of the present disclosure are pointed out by the claims.

It should be understood that the present disclosure is not limited to the detailed structure, drawings, or implementation methods that have been shown here. On the contrary, the present disclosure intends to cover various modifications and equivalent arrangements included in the spirit and scope of the appended claims.

Claims

A data binning processing method, including:

Obtain the data to be processed and its target binning method and preset binning number;

If the data amount of the data to be processed is greater than or equal to the preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1;

Processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed;

Perform a binning operation on the to-be-processed data according to the target quantile point to obtain a binning result.
The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed comprises:

If the target binning mode is the first binning mode, determine the first candidate segmentation point of the data to be processed;

Allocating the to-be-processed data to the N nodes in an orderly manner according to the first candidate segmentation point;

Respectively sort the to-be-processed data on each node after the ordered distribution to obtain the first sorted data in each node;

Obtaining the global KS of the to-be-processed data according to the first ranking data in the respective nodes;

The target quantile is determined according to the global KS of the data to be processed.
The method according to claim 2, wherein determining the first candidate segmentation point of the data to be processed comprises:

Sort the to-be-processed data on each node separately to obtain the second sorted data in each node;

Perform equal frequency division on each second sorted data respectively according to the number N of said nodes to obtain the first pre-segment point on each node;

The first candidate segmentation point is determined according to the first pre-segmentation point.
The method according to claim 2, wherein determining the target quantile based on the global KS of the data to be processed comprises:

Determining a second candidate segmentation point according to the global KS of the to-be-processed data in the first ranking data on the N nodes;

The target quantile point is determined in the second candidate segmentation point according to the preset number of bins.
The method according to claim 4, wherein determining the second candidate segmentation point in the first ranking data on the N nodes according to the global KS of the data to be processed comprises:

Determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point;

If the amount of data to be processed on the left and right sides of the second candidate segmentation point is greater than the preset data amount, determine the data to be processed corresponding to the maximum KS on the left side and on the right side of the second candidate segmentation point, respectively, as the second candidate segmentation point.
The method according to claim 4, wherein determining the target quantile point in the second candidate segmentation point according to the preset number of bins comprises:

Determining whether the number of the second candidate segmentation points is less than the preset number of bins;

If the number of the second candidate segmentation points is less than the preset number of bins, determining that the second candidate segmentation point is the target quantile point;

If the number of the second candidate segmentation points is greater than or equal to the preset number of bins, the target quantile point is determined according to the preset number of bins and using a dynamic programming method.
The method according to claim 1, further comprising:

If the data amount of the data to be processed is less than the preset threshold, sort the data to be processed to generate third sorted data;

Determining the KS of the third ranking data;

Determining a third candidate segmentation point according to the KS of the third ranking data;

Judging whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins;

If the number of the third candidate segmentation points is greater than or equal to the preset number of bins, the target quantile point is determined according to the preset number of bins and using a dynamic programming method.
The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed further comprises:

If the target binning mode is the second binning mode, determine the fourth candidate segmentation point of the to-be-processed data;

Allocating the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point;

Respectively sort the to-be-processed data on each node after the orderly distribution to obtain the fourth sort data in each node;

The target quantile point is determined in the fourth ranking data according to the preset number of bins.
The method according to claim 8, wherein determining the fourth candidate segmentation point of the to-be-processed data comprises:

Sort the data to be processed on each node respectively to obtain the fifth sorted data in each node;

Perform equal frequency division on each fifth sorted data respectively according to the number N of said nodes to obtain the second pre-segment point on each node;

The fourth candidate segmentation point is determined according to the second pre-segmentation point.
The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed further comprises:

If the target binning mode is the third binning mode, the maximum value and the minimum value on each node are obtained respectively;

Determining the maximum value and the minimum value of the data to be processed according to the maximum value and the minimum value on each node;

The target quantile point is determined according to the maximum and minimum values of the data to be processed and the preset number of bins.
A data binning processing device, which includes:

The data acquisition module is configured to acquire the data to be processed and its target binning method and preset binning number;

The data distribution module is configured to randomly distribute the data to be processed to N nodes if the data amount of the data to be processed is greater than or equal to a preset threshold, where N is a positive integer greater than 1;

A target quantile determination module, configured to process the data to be processed on the N nodes according to the preset binning number and using the target binning method to determine the target quantile of the data to be processed;

The binning module is configured to perform a binning operation on the to-be-processed data according to the target quantile to obtain a binning result.
An electronic device including:

One or more processors;

Storage device for storing one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-10.
A computer readable medium having a computer program stored thereon, wherein the program is executed by a processor to implement the method according to any one of claims 1-10.