WO2020248356A1 - Data binning processing method and apparatus, electronic device and computer-readable medium - Google Patents

Info

Publication number
WO2020248356A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processed
binning
target
node
Prior art date
Application number
PCT/CN2019/100804
Other languages
French (fr)
Chinese (zh)
Inventor
陈星为
Original Assignee
同盾控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 同盾控股有限公司 filed Critical 同盾控股有限公司
Publication of WO2020248356A1 publication Critical patent/WO2020248356A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • The present disclosure relates to the technical field of data processing, and in particular to a data binning processing method and apparatus, an electronic device, and a computer-readable medium.
  • Data binning is a commonly used data processing method. It divides data into sub-intervals according to the value of a chosen attribute, for example by age or by height. If the attribute value of a data item falls within a given sub-interval, the item is placed in the bin represented by that sub-interval.
  • The embodiments of the present disclosure provide a data binning processing method and apparatus, an electronic device, and a computer-readable medium, which can perform binning on large-scale data.
  • A data binning processing method includes: obtaining the data to be processed together with its target binning method and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the data to be processed on the N nodes according to the preset number of bins and the target binning method to determine a target quantile of the data to be processed; and binning the data to be processed according to the target quantile to obtain a binning result.
  • Processing the data to be processed on the N nodes according to the preset number of bins and the target binning method to determine the target quantile includes: if the target binning mode is the first binning mode, determining a first candidate segmentation point of the data to be processed; distributing the data to be processed to the N nodes in an orderly manner according to the first candidate segmentation point; sorting the data to be processed on each node after the ordered distribution to obtain first sorted data on each node; obtaining the global KS of the data to be processed from the first sorted data on each node; and determining the target quantile according to the global KS of the data to be processed.
  • Determining the first candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain second sorted data on each node; performing equal-frequency division on each second sorted data according to the number N of nodes to obtain first pre-segmentation points on each node; and determining the first candidate segmentation point according to the first pre-segmentation points.
  • Determining the target quantile according to the global KS of the data to be processed includes: determining second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed; and determining the target quantile among the second candidate segmentation points according to the preset number of bins.
  • Determining a second candidate segmentation point in the first sorted data on the N nodes according to the global KS of the data to be processed includes: determining a maximum KS in the global KS and using its corresponding data to be processed as a second candidate segmentation point; and, if the amounts of data to be processed on the left and on the right of the second candidate segmentation point are both greater than a preset data amount, determining, on each of the left side and the right side of the second candidate segmentation point, the data to be processed corresponding to a maximum KS as a further second candidate segmentation point.
  • Determining the target quantile among the second candidate segmentation points according to the preset number of bins includes: determining whether the number of second candidate segmentation points is less than the preset number of bins; if the number of second candidate segmentation points is less than the preset number of bins, taking the second candidate segmentation points as the target quantile; and if the number of second candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantile according to the preset number of bins using a dynamic programming method.
  • The data binning processing method further includes: if the data volume of the data to be processed is less than the preset threshold, sorting the data to be processed to generate third sorted data; determining the KS of the third sorted data; determining third candidate segmentation points according to the KS of the third sorted data; determining whether the number of third candidate segmentation points is greater than or equal to the preset number of bins; and, if the number of third candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantile according to the preset number of bins using a dynamic programming method.
  • Processing the data to be processed on the N nodes according to the preset number of bins and the target binning method to determine the target quantile further includes: if the target binning mode is the second binning mode, determining a fourth candidate segmentation point of the data to be processed; distributing the data to be processed to the N nodes in an orderly manner according to the fourth candidate segmentation point; sorting the data to be processed on each node after the ordered distribution to obtain fourth sorted data on each node; and determining the target quantile in the fourth sorted data.
  • Determining the fourth candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain fifth sorted data on each node; performing equal-frequency division on each fifth sorted data according to the number N of nodes to obtain second pre-segmentation points on each node; and determining the fourth candidate segmentation point according to the second pre-segmentation points.
  • Processing the data to be processed on the N nodes according to the preset number of bins and the target binning method to determine the target quantile further includes: if the target binning mode is the third binning mode, obtaining the maximum value and the minimum value on each node; determining the maximum and minimum values of the data to be processed from the maximum and minimum values on each node; and determining the target quantile according to the maximum and minimum values of the data to be processed and the preset number of bins.
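  • The third binning mode can be sketched as follows, under the assumption (suggested by the use of the global maximum and minimum, but not stated outright in the text) that it is equal-width binning. The function name `equal_width_split_points` and the sample per-node ranges are hypothetical.

```python
from typing import List, Sequence, Tuple


def equal_width_split_points(
    node_min_max: Sequence[Tuple[float, float]], num_bins: int
) -> List[float]:
    """Combine per-node (min, max) pairs into global equal-width split points."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    # Each node only reports two numbers, so the coordinator never sees raw data.
    global_min = min(lo for lo, _ in node_min_max)
    global_max = max(hi for _, hi in node_min_max)
    width = (global_max - global_min) / num_bins
    # num_bins bins need num_bins - 1 interior split points.
    return [global_min + k * width for k in range(1, num_bins)]


# Three nodes report their local ranges; the data is split into 4 bins.
points = equal_width_split_points([(0, 40), (10, 100), (5, 60)], num_bins=4)
```

  • Only two values per node cross the network in this sketch, which is why this mode needs neither a global sort nor an ordered redistribution.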
  • A data binning processing apparatus includes: a data acquisition module, a data distribution module, a target quantile determination module, and a binning module.
  • The data acquisition module is configured to acquire the data to be processed together with its target binning method and preset number of bins.
  • The data distribution module is configured to randomly distribute the data to be processed to N nodes if the data volume of the data to be processed is greater than or equal to a preset threshold, where N is a positive integer greater than 1.
  • The target quantile determination module is configured to process the data to be processed on the N nodes according to the preset number of bins and using the target binning method, to determine the target quantile of the data to be processed.
  • The binning module is configured to perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.
  • An electronic device includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data binning processing method described in any one of the foregoing.
  • A computer-readable medium stores a computer program which, when executed by a processor, implements the data binning processing method described in any one of the foregoing.
  • The data binning processing method, apparatus, electronic device, and computer-readable medium provided by some embodiments of the present disclosure allocate the data to be processed to multiple nodes, determine the target quantile from the data on those nodes, and finally bin the data to be processed according to the target quantile.
  • The data binning processing method distributes large-volume data to multiple nodes and uses the nodes simultaneously to complete the binning operation, which overcomes the defect that a single node has too little memory to process large-scale data.
  • Fig. 1 shows a schematic diagram of an exemplary system architecture to which the data binning processing method or data binning processing apparatus of an embodiment of the present disclosure can be applied.
  • Fig. 2 is a flowchart showing a method for processing data binning according to an exemplary embodiment.
  • Fig. 3 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 4 is a flow chart showing yet another data binning processing method according to an exemplary embodiment.
  • Fig. 5 is a flow chart showing still another method for processing data binning according to an exemplary embodiment.
  • Fig. 6 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 7 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 8 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 9 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 10 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 11 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 12 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 13 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 14 is a flowchart showing another data binning processing method according to an exemplary embodiment.
  • Fig. 15 is a block diagram showing a data binning processing device according to an exemplary embodiment.
  • Fig. 16 is a schematic structural diagram showing another computer system applied to a data binning processing device according to an exemplary embodiment.
  • The terms “a”, “an”, “the”, “said” and “at least one” are used to indicate that there are one or more elements/components/etc.; the terms “including” and “having” are used to mean open-ended inclusion, meaning that there may be additional elements/components/etc. besides those listed; the terms “first”, “second” and “third” are used only as markers and do not limit the number of objects.
  • Fig. 1 shows a schematic diagram of an exemplary system architecture of a data binning processing method or a data binning processing device that can be applied to an embodiment of the present disclosure.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background management server that provides support for devices operated by users using the terminal devices 101, 102, and 103.
  • the background management server can analyze and process the received request and other data, and feed back the processing result to the terminal device.
  • The server 105 may, for example, obtain the data to be processed together with its target binning method and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1; process the data to be processed on the N nodes according to the preset number of bins and the target binning method to determine the target quantile of the data to be processed; and perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.
  • The server 105 may be a single physical server or may be composed of multiple servers. There may be any number of terminal devices, networks, and servers according to actual needs.
  • Data can be divided into sub-intervals according to the value of a chosen attribute, for example by age or by height. If the attribute value of a data item falls within a given sub-interval, the item is put into the bin represented by that sub-interval, and the attributes of the whole sub-interval are then used to represent the data in it.
  • This sort of binning can be understood as the discretization of data, and the discretization of data can have the following advantages:
  • Discretized data is robust to abnormal data. For example, an abnormal value such as an age greater than 300 would interfere strongly with a model, but after the age data is discretized (an age greater than 30 is expressed as 1, otherwise as 0) the model only ever sees 0 or 1, so substituting the discretized abnormal value into the model does not interfere with it.
  • The model also becomes more stable. For example, age changes over time: if 20 to 30 years old is taken as one age range and a user is 25 years old, the user will be 26 a year later, but the corresponding discretized value remains unchanged.
  • Fig. 2 is a flowchart showing a method for processing data binning according to an exemplary embodiment.
  • the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
  • Step S1 Obtain the data to be processed and its target binning method and preset binning number.
  • the preset number of bins refers to the number of bins designated by the user to divide the data to be processed
  • the target binning mode refers to the binning mode specified by the user.
  • the target binning manner may include at least one of a first binning manner, a second binning manner, and a third binning manner.
  • Step S2 If the data volume of the data to be processed is greater than or equal to a preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1.
  • The preset threshold may refer to the amount of data that a single machine can process. For example, for a list of data to be processed that includes a label column, a sequence number column, and a feature value column, assuming that the label, sequence number, and feature value are all int data (integers, each occupying 4 bytes), a server with 1 GB of memory can handle only about $10^8$ to $10^9$ records. In some embodiments, when the amount of data to be processed is greater than or equal to the preset threshold, the data to be processed can be randomly allocated to N nodes for processing.
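  • The memory estimate above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the helper name `rows_that_fit` is hypothetical, 1 GB is taken as 2^30 bytes, and working memory for sorting or bookkeeping is ignored.

```python
def rows_that_fit(memory_bytes: int, num_columns: int, bytes_per_value: int = 4) -> int:
    """Upper bound on the rows that fit in memory, ignoring any working overhead."""
    return memory_bytes // (num_columns * bytes_per_value)


ONE_GIB = 2 ** 30
# Three int columns (label, sequence number, feature value) at 4 bytes each
# cost 12 bytes per row.
max_rows = rows_that_fit(ONE_GIB, num_columns=3)
```

  • Under these assumptions the bound is roughly 9 x 10^7 rows, which sits at the lower end of the range quoted in the text; a threshold chosen in practice would need to leave headroom for sorting.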
  • N nodes may refer to N terminals that can perform data processing, such as N servers or N computer terminals.
  • the present disclosure does not limit the physical form of the N nodes, and the actual operation shall prevail.
  • the amount of data to be processed randomly allocated to each node is approximately the same.
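  • A minimal single-process sketch of the random allocation in step S2, with lists standing in for nodes. The function name `random_allocate` and the fixed seed are assumptions for illustration; a real deployment would ship each partition to a separate machine.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_allocate(data: Sequence[T], num_nodes: int, seed: int = 0) -> List[List[T]]:
    """Shuffle the data and deal it out so partition sizes differ by at most one."""
    if num_nodes < 2:
        raise ValueError("num_nodes must be greater than 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Taking every num_nodes-th item starting at offset i deals the shuffled
    # items round-robin, which keeps the partitions approximately equal in size.
    return [shuffled[i::num_nodes] for i in range(num_nodes)]


partitions = random_allocate(range(10), num_nodes=3)
```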
  • Step S3 processing the data to be processed on the N nodes according to the preset number of bins and using the target binning method to determine the target quantile of the data to be processed.
  • Step S4 Perform a binning operation on the to-be-processed data according to the target quantile point to obtain a binning result.
  • the data to be processed can be divided at the target quantile to form multiple bins of data.
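  • Step S4 can be sketched with a binary search over the sorted target quantile points. The function name `assign_bins` is hypothetical, and the convention that a value equal to a split point falls into the bin on its right is an assumption; the text does not say which side is closed.

```python
from bisect import bisect_right
from typing import List, Sequence


def assign_bins(values: Sequence[float], split_points: Sequence[float]) -> List[int]:
    """Return the bin index of each value given the target quantile points."""
    points = sorted(split_points)
    # bisect_right counts how many split points are <= the value,
    # which is exactly the index of the bin the value belongs to.
    return [bisect_right(points, v) for v in values]


# Three split points define four bins, indexed 0 to 3.
bins = assign_bins([5, 18, 30, 47, 300], split_points=[18, 30, 45])
```

  • Note that the outlier 300 lands in the same bin as 47, which is the robustness to abnormal values described earlier.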
  • the foregoing embodiment provides a data binning processing method.
  • The relationship between the data volume of the data to be processed and the preset threshold is considered before binning, which avoids failing to complete the binning of the data to be processed because the data volume is too large.
  • Multiple nodes are used simultaneously to complete the binning operation of the data to be processed, which overcomes the defect that a single node has too little memory to handle large-scale data.
  • step S3 provided in the embodiment shown in FIG. 2 may include the following steps.
  • Step S31 If the target binning mode is the first binning mode, determine the first candidate segmentation point of the data to be processed.
  • The first binning method may be a distributed data binning processing method based on the KS value of the data.
  • determining the first candidate segmentation point may include the steps shown in FIG. 4.
  • Step S311 Sort the to-be-processed data on each node respectively to obtain the second sorted data in each node.
  • the data to be processed may be randomly distributed to N nodes, where N is a positive integer greater than 1.
  • M data items to be processed are randomly allocated to N nodes, and the data on the nodes are denoted $M_1, M_2, \ldots, M_{N-1}, M_N$ respectively.
  • the data to be processed on each node may be sorted separately to obtain the second sorted data in each node.
  • The sorting method may be selected according to the memory size of the node and the memory required to process the data to be processed. For example, when the memory required for the data to be processed on a single node is less than half of the node's memory, bucket sorting (such as radix sort) can be used to sort the data on the node; when the required memory is greater than or equal to half of the node's memory, quick sort can be used. Quick sort occupies less memory but is slower, whereas bucket sorting is faster but occupies more memory.
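  • The memory-based choice of sorting method can be sketched as below. The names `radix_sort` and `sort_on_node` are hypothetical, the radix sort handles only non-negative integers, and Python's built-in sort (Timsort) is used purely as a stand-in for the quick sort branch.

```python
from typing import List, Sequence


def radix_sort(values: Sequence[int]) -> List[int]:
    """LSD radix sort for non-negative integers, one byte per pass."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative integers")
    result = list(values)
    max_value = max(result, default=0)
    shift = 0
    while (max_value >> shift) > 0:
        # 256 buckets per pass: this extra space is the memory cost of bucket sorting.
        buckets: List[List[int]] = [[] for _ in range(256)]
        for v in result:
            buckets[(v >> shift) & 0xFF].append(v)
        result = [v for bucket in buckets for v in bucket]
        shift += 8
    return result


def sort_on_node(values: Sequence[int], required_bytes: int, node_memory_bytes: int) -> List[int]:
    """Pick the sorting method using the half-of-node-memory rule from the text."""
    if required_bytes < node_memory_bytes / 2:
        return radix_sort(values)
    fallback = list(values)
    fallback.sort()  # stand-in for quick sort; the built-in sort is Timsort
    return fallback


data = [170, 45, 75, 90, 802, 24, 2, 66]
small_footprint = sort_on_node(data, required_bytes=32, node_memory_bytes=1024)
large_footprint = sort_on_node(data, required_bytes=900, node_memory_bytes=1024)
```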
  • The memory required for processing the data to be processed on a node is related to the data amount, the data type, and the number of attributes included in the data to be processed on the node. For example, for a list of data to be processed that includes a label column, a serial number column, and a feature value column, suppose its data volume is $10^8$ to $10^9$ records, and suppose that the label, serial number, and feature value are all int data (each int value occupies 4 bytes); then at least 1 GB of memory is required to process this data.
  • Step S312 Perform equal frequency division on each second sorted data according to the number N of the nodes, to obtain the first pre-segment point on each node.
  • The equal-frequency division of the second sorted data on each node can be performed according to the number N of nodes designated by the user and the data volume of the data to be processed on each node. Assuming that the first node holds 1000 data items and the number of nodes is 5, the second sorted data on the first node can be divided with equal frequency into boxes of 1000/5 = 200 items each.
  • Equal-frequency division is performed on each node according to the amount of data to be processed on that node and the number N of nodes, to obtain the first pre-segmentation points on each node.
  • If the second sorted data on the nodes are denoted $M'_1, M'_2, \ldots, M'_{N-1}, M'_N$, each second sorted data can be divided with equal frequency according to the number N of nodes and the amount of data on its node.
  • The first pre-segmentation points determined on the first node are $m_{1,1}, m_{1,2}, \ldots, m_{1,N-1}$ (only N-1 segmentation points are needed to divide the data into N boxes), those determined on the second node are $m_{2,1}, m_{2,2}, \ldots, m_{2,N-1}$, and those determined on the i-th node are $m_{i,1}, m_{i,2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.
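  • Step S312 on a single node can be sketched as follows. The function name `equal_frequency_points` is hypothetical, and taking the last value of each part as the pre-segmentation point is an assumed convention that the text does not fix.

```python
from typing import List, Sequence


def equal_frequency_points(sorted_values: Sequence[float], num_parts: int) -> List[float]:
    """Return num_parts - 1 pre-segmentation points from already sorted data."""
    n = len(sorted_values)
    if num_parts < 2 or n < num_parts:
        raise ValueError("need at least 2 parts and at least one value per part")
    # The k-th point is the last value of the k-th part, so each part
    # holds about n / num_parts values.
    return [sorted_values[(k * n) // num_parts - 1] for k in range(1, num_parts)]


# The example from the text: 1000 items on one node, 5 nodes, 200 items per box.
node_data = list(range(1, 1001))
points = equal_frequency_points(node_data, num_parts=5)
```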
  • Step S313 Determine the first candidate segmentation point according to the first pre-segmentation point.
  • The first pre-segmentation points on the nodes may be averaged position by position to determine the first candidate segmentation points. For example, assuming that the number of nodes is N, the first pre-segmentation points determined on the first node are $m_{1,1}, m_{1,2}, \ldots, m_{1,N-1}$, those determined on the second node are $m_{2,1}, m_{2,2}, \ldots, m_{2,N-1}$, and those determined on the i-th node are $m_{i,1}, m_{i,2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.
  • Writing $c_j$ for the j-th first candidate segmentation point, the averaging rule gives $c_j = \frac{1}{N} \sum_{i=1}^{N} m_{i,j}$ for $j = 1, \ldots, N-1$,
  • where $m_{i,N-1}$ denotes the (N-1)-th first pre-segmentation point on the i-th node.
  • Alternatively, the median, maximum, or minimum of the corresponding first pre-segmentation points on the nodes may be used as the first candidate segmentation point.
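  • Step S313 can be sketched as a position-by-position reduction over the per-node points. The function name `combine_pre_segmentation_points` and the sample values are hypothetical; the reducer argument covers the mean as well as the median, maximum, or minimum mentioned in the text.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def combine_pre_segmentation_points(
    per_node_points: Sequence[Sequence[float]],
    reducer: Callable[[Sequence[float]], float] = mean,
) -> List[float]:
    """Reduce the j-th pre-segmentation point of every node to one candidate point."""
    if len({len(points) for points in per_node_points}) != 1:
        raise ValueError("every node must supply the same number of points")
    # zip(*...) groups the j-th point of every node into one column.
    return [reducer(column) for column in zip(*per_node_points)]


# Three nodes, each with N - 1 = 2 pre-segmentation points.
per_node = [[10, 20], [12, 22], [14, 27]]
averaged = combine_pre_segmentation_points(per_node)
medians = combine_pre_segmentation_points(per_node, reducer=median)
```

  • Because the data was allocated randomly, each node holds a sample of the whole, so the combined points approximate the global equal-frequency points without a global sort.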
  • The embodiment shown in Fig. 4 not only determines, across multiple nodes, the first candidate segmentation points used for a preliminary division of the data to be processed, but also selects the sorting method on each node according to the node's memory size and the amount of data to be processed, which preserves running speed while making full use of node memory.
  • Step S32 Distributing the to-be-processed data to the N nodes in an orderly manner according to the first candidate segmentation point.
  • ordered allocation refers to a specific and known size relationship between the data to be processed on each node after allocation. For example, the maximum value of the data to be processed on the first node is smaller than the minimum value of the data to be processed on the second node, and so on.
  • For example, allocating the data to be processed to 4 nodes in order according to the first candidate segmentation points $C_1$, $C_2$, $C_3$ can be expressed as: the 0th through $C_1$-th data are assigned to the first node, the $(C_1+1)$-th through $C_2$-th data to the second node, the $(C_2+1)$-th through $C_3$-th data to the third node, and the $(C_3+1)$-th through last data to the fourth node.
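  • The ordered allocation of step S32 can be sketched by routing each value to a node by binary search over the candidate segmentation points. The function name `ordered_allocate` is hypothetical, the candidate points are treated as values rather than positions, and sending a value equal to a candidate point to the lower node is an assumed convention.

```python
from bisect import bisect_left
from typing import List, Sequence


def ordered_allocate(values: Sequence[float], candidate_points: Sequence[float]) -> List[List[float]]:
    """Route each value to a node so that node k holds only values below node k + 1."""
    points = sorted(candidate_points)
    nodes: List[List[float]] = [[] for _ in range(len(points) + 1)]
    for v in values:
        # bisect_left counts the candidate points strictly below v,
        # so a value equal to a candidate point goes to the lower node.
        nodes[bisect_left(points, v)].append(v)
    return nodes


# Three candidate points route the data to four nodes.
nodes = ordered_allocate([7, 1, 9, 4, 12, 3, 8, 15], candidate_points=[4, 8, 12])
```

  • After this step each node can sort its own data independently, and concatenating the per-node results in node order yields a globally sorted sequence.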
  • Step S33 Sort the to-be-processed data on each node after the ordered distribution, respectively, to obtain the first sorted data in each node.
  • the sorting method can be selected according to the memory size of each node and the data amount of the data to be processed on the node to realize the sorting of the amount of data to be processed on each node.
  • Step S34 Obtain the global KS of the to-be-processed data according to the first ranking data in each node.
  • the KS value can be used to evaluate the risk discrimination ability of the model.
  • The indicator measures the gap between the cumulative proportions of the first sample and the second sample. The larger the KS value, the better the variable distinguishes the first sample from the second sample.
  • each node may include the first sample data and the second sample data.
  • the labeling rules of the first sample and the second sample may be defined by the user.
  • the user can define the data corresponding to those customers with credit problems as the first sample, and define the data corresponding to those customers without credit problems as the second sample.
  • The KS value of an interval (there may be only one data item in the interval) can be obtained from the cumulative number of first samples and the cumulative number of second samples of that interval.
  • The cumulative number of first samples of an interval is the number of first samples in the current interval plus the number of first samples in all intervals before it. For example, if the first interval has 3 first samples, the second interval has 2 first samples, and the third interval has 4 first samples, then the cumulative number of first samples of the second interval is 2 + 3 = 5. The cumulative number of second samples is defined in the same way.
  • the repeated data to be processed may be merged before the global KS of the data to be processed is determined.
  • The global KS value of the data to be processed can be determined according to the data volume of the first sample and the data volume of the second sample on each node.
  • The global KS of a data item refers to the KS value of that item obtained on the basis of all the data to be processed. For example, if the data to be processed is divided among three nodes that hold N1, N2, N3 first samples and N4, N5, N6 second samples respectively, then the global KS value of the last data item on the second node can be expressed as $\left| \frac{N1 + N2}{N1 + N2 + N3} - \frac{N4 + N5}{N4 + N5 + N6} \right|$.
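  • A single-process sketch of step S34, assuming the KS value of a row is the absolute difference between the cumulative proportions of first and second samples. The function name `global_ks_per_node` and the row layout are hypothetical; each inner list stands for the sorted data on one node after ordered allocation, with label 1 marking a first sample and 0 a second sample.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Row = Tuple[float, int]  # (feature value, label); label 1 = first sample, 0 = second sample


def global_ks_per_node(nodes: Sequence[Sequence[Row]]) -> List[List[float]]:
    """Compute the global KS of every row from per-node class counts."""
    first_counts = [sum(label for _, label in node) for node in nodes]
    second_counts = [len(node) - f for node, f in zip(nodes, first_counts)]
    total_first, total_second = sum(first_counts), sum(second_counts)
    if total_first == 0 or total_second == 0:
        raise ValueError("both sample classes must be present")
    # Offsets are the samples held by all earlier nodes; only these counts,
    # not the rows themselves, need to be exchanged between nodes.
    first_offsets = [0] + list(accumulate(first_counts))[:-1]
    second_offsets = [0] + list(accumulate(second_counts))[:-1]
    result: List[List[float]] = []
    for node, cum_first, cum_second in zip(nodes, first_offsets, second_offsets):
        ks_values = []
        for _, label in node:
            cum_first += label
            cum_second += 1 - label
            ks_values.append(abs(cum_first / total_first - cum_second / total_second))
        result.append(ks_values)
    return result


# Three nodes holding 1, 2, 0 first samples and 1, 0, 2 second samples.
ks = global_ks_per_node([[(1, 1), (2, 0)], [(3, 1), (4, 1)], [(5, 0), (6, 0)]])
```

  • For the last row on the second node this reproduces the expression in the text: the cumulative first samples are 1 + 2 out of 3 and the cumulative second samples are 1 + 0 out of 3.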
  • Step S35 Determine the target quantile according to the global KS of the data to be processed.
  • the target quantile can be determined according to the steps shown in FIG. 5.
  • Step S351 Determine a second candidate segmentation point in the first ranking data on the N nodes according to the global KS of the data to be processed.
  • the second candidate segmentation point can also be determined according to the steps shown in FIG. 6.
  • Step S3511 Determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point.
  • data corresponding to a maximum KS value can be determined in the data to be processed according to the global KS of the data to be processed as the second candidate segmentation point.
  • Step S3512 If the amounts of data to be processed on the left and on the right of the second candidate segmentation point are both greater than the preset data amount, determine, on each of the left side and the right side of the second candidate segmentation point, the data to be processed corresponding to a maximum KS as a further second candidate segmentation point.
  • The preset data amount may be set by the user in advance.
  • It is determined whether the amounts of data to be processed on the left and on the right of the second candidate segmentation point obtained in step S3511 are greater than the preset data amount (if more than one second candidate segmentation point has been obtained, the check is made for the left and right sides of each of them).
  • If the amounts of data on the left and right of every second candidate segmentation point are all greater than the preset data amount, the data to be processed corresponding to a maximum KS on the left and on the right of each second candidate segmentation point continues to be determined as a further second candidate segmentation point; if there is a second candidate segmentation point for which the amount of data on the left or right side is less than the preset data amount, the iteration stops.
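  • Steps S3511 and S3512 can be sketched as a round-by-round search over a list of KS values. The function name `max_ks_candidates` is hypothetical. Two points are assumptions rather than statements of the text: the precomputed KS values are reused inside each sub-range instead of being recomputed, and the left and right amounts are measured within the sub-range being searched.

```python
from typing import List, Sequence, Tuple


def max_ks_candidates(ks: Sequence[float], min_side: int) -> List[int]:
    """Return the indices chosen as second candidate segmentation points."""
    candidates: List[int] = []
    segments: List[Tuple[int, int]] = [(0, len(ks))]  # half-open index ranges
    while segments:
        next_segments: List[Tuple[int, int]] = []
        keep_going = True
        for start, end in segments:
            if start >= end:
                continue
            # The position of the maximum KS in this range becomes a candidate.
            best = max(range(start, end), key=lambda i: ks[i])
            candidates.append(best)
            left, right = best - start, end - best - 1
            if left > min_side and right > min_side:
                next_segments += [(start, best), (best + 1, end)]
            else:
                # One side is too small: stop after finishing this round.
                keep_going = False
        segments = next_segments if keep_going else []
    return sorted(candidates)


ks_values = [0.1, 0.3, 0.5, 0.9, 0.6, 0.4, 0.2]
two_rounds = max_ks_candidates(ks_values, min_side=2)
one_round = max_ks_candidates(ks_values, min_side=3)
```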
  • Step S352 Determine the target quantile point in the second candidate segmentation point according to the preset number of bins.
  • the determination of the target quantile at the second candidate cut-off point according to the preset number of bins can be achieved through the steps shown in FIG. 7.
  • Step S3521 Determine whether the number of the second candidate segmentation points is less than the preset number of bins.
  • Step S3522 If the number of the second candidate segmentation points is less than the preset number of bins, it is determined that the second candidate segmentation point is the target quantile point.
  • Step S3523 If the number of the second candidate segmentation points is greater than or equal to the preset number of bins, the target binning point is determined according to the preset number of bins and using a dynamic programming method.
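  • For example, if the preset number of bins is M, M-1 target quantile points can be selected from the N second candidate segmentation points.
  • For each such selection (solution), the IV value can be obtained by formula (1), given here in the standard form of the information value: $$IV = \sum_{i} \left( good\_Pcnt_i\% - bad\_Pcnt_i\% \right) \ln \frac{good\_Pcnt_i\%}{bad\_Pcnt_i\%} \qquad (1)$$
  • where $good\_Pcnt_i\%$ represents the proportion of the first samples in the i-th interval (the interval may include only one value) to the total number of first samples, and $bad\_Pcnt_i\%$ represents the proportion of the second samples in the i-th interval to the total number of second samples.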
  • In one approach, the IV value of every solution is computed in turn, the solution with the maximum IV value is taken as the optimal solution, and the target quantile is determined from it.
  • This approach occupies little space and has simple logic, but it repeats many calculations, so its computational efficiency is low.
  • Alternatively, a dynamic programming method can be used to determine the target quantile.
  • The dynamic programming method caches the solutions of sub-problems that have already been solved so that they can be reused directly, avoiding repeated computation.
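  • One way the dynamic programming step could look is sketched below. It assumes the IV is the standard information value, which is additive over bins, so the best split of a prefix into k bins can be cached and reused. The function name `best_iv_cuts`, the input layout, and the rule that a bin with no first samples or no second samples is rejected are all assumptions for illustration.

```python
import math
from functools import lru_cache
from itertools import accumulate
from typing import List, Sequence, Tuple


def best_iv_cuts(counts: Sequence[Tuple[int, int]], num_bins: int) -> Tuple[float, List[int]]:
    """counts[i] = (first samples, second samples) in the i-th elementary interval,
    i.e. the intervals delimited by the candidate segmentation points.
    Returns the best IV and the cuts used; cut p separates interval p - 1 from p."""
    if not 1 <= num_bins <= len(counts):
        raise ValueError("num_bins must be between 1 and the number of intervals")
    good_prefix = [0] + list(accumulate(g for g, _ in counts))
    bad_prefix = [0] + list(accumulate(b for _, b in counts))
    total_good, total_bad = good_prefix[-1], bad_prefix[-1]

    def bin_iv(i: int, j: int) -> float:
        """IV contribution of one bin made of intervals i .. j - 1."""
        good = good_prefix[j] - good_prefix[i]
        bad = bad_prefix[j] - bad_prefix[i]
        if good == 0 or bad == 0:
            return -math.inf  # assumed rule: reject bins missing one class
        g, b = good / total_good, bad / total_bad
        return (g - b) * math.log(g / b)

    @lru_cache(maxsize=None)  # the cache is what avoids repeated sub-problem work
    def best(k: int, j: int) -> Tuple[float, Tuple[int, ...]]:
        """Best IV for splitting intervals 0 .. j - 1 into k bins."""
        if k == 1:
            return bin_iv(0, j), ()
        options = []
        for i in range(k - 1, j):
            prev_iv, prev_cuts = best(k - 1, i)
            options.append((prev_iv + bin_iv(i, j), prev_cuts + (i,)))
        return max(options)

    iv, cuts = best(num_bins, len(counts))
    return iv, list(cuts)


# Five elementary intervals (four candidate points) reduced to three bins.
counts = [(10, 1), (8, 2), (3, 7), (1, 9), (2, 8)]
iv, cuts = best_iv_cuts(counts, num_bins=3)
```

  • Enumerating every selection would evaluate the same bins many times; here each (k, j) sub-problem is solved once, at the cost of the memory held by the cache.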
  • The data to be processed is binned based on the KS index, which bins continuous variables effectively and is more interpretable. The method can also accommodate specific requirements of many users, for example, requiring the IV of the binning result to be monotonic.
  • this method does not require business experience and can automatically complete the binning operation.
  • This method distributes large-scale data to be processed to multiple nodes, determines the target quantile points from the data on those nodes, and finally bins the data to be processed according to the target quantile points, which overcomes the limitation that a single machine's memory is too small to handle large-scale data.
  • the data binning processing method provided by the embodiment of the present disclosure may further include the following steps.
  • Step S1 Obtain data to be processed.
  • Step S5 If the amount of data to be processed is less than a preset threshold, sort the data to be processed to generate third sorted data.
  • The sorting method may be selected according to the memory size of the node and the memory required for processing the data to be processed.
  • For example, bucket sort (such as radix sort) or quick sort can be used to sort the data to be processed on the node. Quick sort occupies less memory but is slower, while bucket sort is faster but occupies more memory.
  • The memory required for processing the data to be processed in a node is related to the data amount, the data type, and the number of attributes included in the data to be processed on the node. For example, for a list of to-be-processed data including a label column, a serial number column, and a feature value column, suppose its data volume is 10^8 to 10^9 rows and the label, serial number, and feature value are all int data (each int value occupies 4 bytes); then at least about 1 GB of memory is required to process the data.
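  • The arithmetic in this example, and one possible way to act on it, can be sketched as follows. The selection rule and the `bucket_overhead` factor are hypothetical assumptions for illustration; the text only states that bucket sort needs more memory than quick sort.

```python
def estimate_bytes(rows, columns, bytes_per_value=4):
    """Raw size of a table of fixed-width values (4-byte int by default)."""
    return rows * columns * bytes_per_value


def choose_sort(required_bytes, node_memory_bytes, bucket_overhead=2.0):
    """Hypothetical policy: use bucket sort only if its assumed extra memory fits."""
    if required_bytes * bucket_overhead <= node_memory_bytes:
        return "bucket"
    return "quick"


# Label, serial number, and feature value columns, 10^8 rows of 4-byte ints.
needed = estimate_bytes(10**8, 3)  # 1_200_000_000 bytes, about 1.2 GB
```

  • With 10^8 rows the raw data alone is about 1.2 GB, which matches the "at least about 1 GB" estimate; 10^9 rows would need roughly ten times as much.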
  • Step S6 Determine the KS of the third sorted data.
  • the repeated data to be processed may be merged before determining the KS of the data to be processed.
  • The KS value at each data point in the third sorted data can be determined based on the total number of first samples and the total number of second samples in the third sorted data, together with the cumulative number of first samples and the cumulative number of second samples at that data point.
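  • A minimal sketch of this computation follows. It assumes the usual KS definition (the absolute difference between the two cumulative proportions), encodes first samples as label 0 and second samples as label 1, and merges repeated values before computing KS. The function name `ks_per_value` is hypothetical.

```python
from itertools import groupby


def ks_per_value(samples):
    """Return [(value, ks)] for each distinct value, in ascending order.

    samples: iterable of (value, label); label 0 = first sample, 1 = second sample.
    """
    ordered = sorted(samples)
    total_bad = sum(label for _, label in ordered)
    total_good = len(ordered) - total_bad
    if total_good == 0 or total_bad == 0:
        raise ValueError("both sample classes must be present")

    result = []
    cum_good = cum_bad = 0
    # groupby merges repeated values so each distinct value gets one KS.
    for value, group in groupby(ordered, key=lambda s: s[0]):
        labels = [label for _, label in group]
        cum_bad += sum(labels)
        cum_good += len(labels) - sum(labels)
        ks = abs(cum_good / total_good - cum_bad / total_bad)
        result.append((value, ks))
    return result
```

  • The last distinct value always has KS equal to 0, because both cumulative proportions reach 1 there.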
  • Step S7 Determine a third candidate segmentation point according to the KS of the third ranking data.
  • a maximum KS may be determined among the KSs of the third ranking data, and the corresponding to-be-processed data may be used as the third candidate segmentation point.
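  • If the amount of data to be processed on the left and right sides of the third candidate segmentation point is greater than a preset data amount, the to-be-processed data corresponding to one largest KS is determined on each of the left and right sides of the third candidate segmentation point as a further third candidate segmentation point.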
  • the preset data amount may be set by the user in advance.
  • It is then determined whether the amount of data to be processed on both the left and right sides of the third candidate segmentation point is greater than the preset data amount (if more than one third candidate segmentation point was obtained in the above steps, this is determined for each of them). If the amount of data on both sides of every third candidate segmentation point is greater than the preset data amount, a maximum KS continues to be determined on the left and right sides of each third candidate segmentation point, and the corresponding to-be-processed data is used as a further third candidate segmentation point. If there is a third candidate segmentation point for which the amount of data on the left or right side is less than the preset data amount, the iteration stops.
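  • The iteration can be sketched as a recursive bisection. Two simplifications are assumptions of this sketch rather than statements of the disclosure: KS is recomputed inside each side before its maximum is taken, and the size check stops only the branch it applies to rather than the whole iteration. Rows are assumed to be already merged per distinct value; `best_ks_split` and `candidate_points` are hypothetical names.

```python
def best_ks_split(rows, lo, hi):
    """Index in [lo, hi - 1) with the largest KS inside rows[lo:hi], or None.

    rows: list of (value, good_count, bad_count), sorted by distinct value.
    A split at index i puts rows[lo..i] on the left and rows[i+1..hi-1] on the right.
    """
    total_good = sum(r[1] for r in rows[lo:hi])
    total_bad = sum(r[2] for r in rows[lo:hi])
    if hi - lo < 2 or total_good == 0 or total_bad == 0:
        return None
    best, best_ks = None, -1.0
    cum_good = cum_bad = 0
    for i in range(lo, hi - 1):  # the last row would leave the right side empty
        cum_good += rows[i][1]
        cum_bad += rows[i][2]
        ks = abs(cum_good / total_good - cum_bad / total_bad)
        if ks > best_ks:
            best, best_ks = i, ks
    return best


def candidate_points(rows, min_count):
    """Collect candidate segmentation points by repeated max-KS bisection."""
    points = []

    def size(lo, hi):
        return sum(r[1] + r[2] for r in rows[lo:hi])

    def recurse(lo, hi):
        i = best_ks_split(rows, lo, hi)
        if i is None:
            return
        points.append(rows[i][0])
        # Keep splitting only while both sides exceed the preset data amount.
        if size(lo, i + 1) > min_count and size(i + 1, hi) > min_count:
            recurse(lo, i + 1)
            recurse(i + 1, hi)

    recurse(0, len(rows))
    return sorted(points)
```

  • A side that contains only one class of samples has no defined KS, so the sketch produces no further candidate there.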
  • Step S8 Determine whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins.
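  • If the number of the third candidate segmentation points is less than the preset number of bins, the third candidate segmentation point is determined to be the target quantile point.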
  • Step S9 If the number of the third candidate segmentation points is greater than or equal to the preset number of bins, the target binning point is determined according to the preset number of bins and using a dynamic programming method.
  • Suppose there are N third candidate segmentation points and the preset number of bins is M, where N is greater than or equal to M. M-1 target quantile points must then be selected from the N third candidate segmentation points.
  • the IV value of the solution can be obtained by formula (1).
  • a third candidate segmentation point corresponding to the solution with the largest IV value may be selected as the target quantile point.
  • the IV value of each solution can be obtained in turn, and the solution corresponding to the largest IV value can be found as the optimal solution, and the target quantile can be determined according to the optimal solution.
  • This way of obtaining the optimal solution occupies less space and has simple logic. However, it repeats many calculations, so its computational efficiency is low.
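  • The exhaustive approach described above can be sketched as follows, to make the repeated work visible. The sketch assumes the standard IV definition and per-interval (first-sample count, second-sample count) pairs as input, and it rejects solutions containing a bin with an empty class. The function name `brute_force_cuts` is hypothetical.

```python
import math
from itertools import combinations


def brute_force_cuts(counts, n_bins):
    """Try every choice of n_bins - 1 cuts and keep the one with the largest IV.

    counts: (good, bad) per elementary interval, in value order.
    Returns (best_iv, cuts); cut k means "cut after elementary interval k".
    """
    n = len(counts)
    total_good = sum(g for g, _ in counts)
    total_bad = sum(b for _, b in counts)
    best_iv, best = float("-inf"), None
    for cuts in combinations(range(n - 1), n_bins - 1):
        edges = [0] + [c + 1 for c in cuts] + [n]
        iv = 0.0
        for lo, hi in zip(edges, edges[1:]):
            # Each bin is re-summed for every solution: this is the repeated work.
            good = sum(c[0] for c in counts[lo:hi])
            bad = sum(c[1] for c in counts[lo:hi])
            if good == 0 or bad == 0:
                iv = float("-inf")
                break
            pg, pb = good / total_good, bad / total_bad
            iv += (pg - pb) * math.log(pg / pb)
        if iv > best_iv:
            best_iv, best = iv, cuts
    return best_iv, best
```

  • The number of solutions examined is the number of ways to choose M-1 cuts from the available candidate points, and many of those solutions share identical bins.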
  • a dynamic programming method can be selected to determine the target points.
  • the dynamic programming method can cache the solution of the sub-problem that has been solved, and the solution of the sub-problem can be used directly next time, avoiding repeated operations.
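  • One way to write the sub-problem structure is the recurrence below. It is an illustrative formulation, not taken from the disclosure: the N candidate points are assumed to delimit N+1 elementary intervals, IV(i, j) denotes the IV contribution of a single bin formed by merging elementary intervals i+1 through j, and F(j, m) is the largest IV obtainable by splitting the first j elementary intervals into m bins.

```latex
F(j, 1) = \mathrm{IV}(0, j), \qquad
F(j, m) = \max_{m-1 \le i < j} \bigl[ F(i, m-1) + \mathrm{IV}(i, j) \bigr], \qquad
\text{answer} = F(N+1, M)
```

  • Each value F(i, m-1) is computed once and reused for every j greater than i, which is the repeated operation that caching avoids.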
  • The technical solution provided in the embodiment shown in FIG. 8 can be used on a single node to complete the binning of data with a single attribute. If a data list includes data with multiple attributes, for example both age and score, the data in the list can also be distributed to multiple nodes by attribute, and the above method can be applied on those nodes at the same time to complete the binning.
  • The technical solution provided by the embodiment shown in FIG. 8, on the one hand, bins the data to be processed based on the KS index, which bins continuous variables effectively and is more interpretable.
  • On the other hand, the sorting method is selected according to the node memory and the amount of data to be processed on the node, so that running speed is ensured while the node memory is fully utilized.
  • this method uses dynamic programming to find out the eligible target quantiles, which saves running time.
  • step S3 provided by the embodiment shown in FIG. 2 may further include the following steps.
  • Step S36 If the target binning mode is the second binning mode, determine the fourth candidate segmentation point of the data to be processed.
  • step S36 provided in the embodiment shown in FIG. 9 may include the following steps.
  • S361 Sort the to-be-processed data on each node respectively to obtain fifth sorted data in each node.
  • the data to be processed may be randomly distributed to N nodes, where N is a positive integer greater than 1.
  • the data to be processed on each node may be sorted separately to obtain the fifth sorted data in each node.
  • The sorting method can be selected according to the memory size of the node and the memory required for processing the data to be processed.
  • For example, bucket sort (such as radix sort) or quick sort can be used to sort the data to be processed on the node. Quick sort occupies less memory but is slower, while bucket sort is faster but occupies more memory.
  • The memory required for processing the data to be processed in a node is related to the data amount, the data type, and the number of attributes included in the data to be processed on the node. For example, for a list of to-be-processed data including a label column, a serial number column, and a feature value column, suppose its data volume is 10^8 to 10^9 rows and the label, serial number, and feature value are all int data (each int value occupies 4 bytes); then at least about 1 GB of memory is required to process the data.
  • S362 Perform equal frequency division on each fifth sorted data according to the number N of the nodes, to obtain a second pre-segment point on each node.
  • The equal frequency division of the sorted data on each node can be realized according to the number N of nodes designated by the user and the amount of data to be processed on each node. For example, assuming that the amount of data to be processed on the first node is 1000 and the number of bins preset by the user is 5, the sorted data on the first node can be divided with equal frequency into bins of 1000/5 = 200 data items each.
  • the second pre-segment point on each node can be obtained after equal frequency division of each node according to the amount of data to be processed on each node and the number N of the nodes.
  • the fourth candidate segmentation point may be determined according to the second pre-segmentation point.
  • The second pre-segment points on the nodes may be averaged position by position to determine the fourth candidate segmentation points. For example, suppose the number of nodes N is 4, the second pre-segment points determined on the first node are 2.2, 4.2, 5.8, and 8.2, and the second pre-segment points determined on the second node are 1.8, 3.8, 6.2, and 7.8. Averaging the corresponding second pre-segment points on the first node and the second node then gives the fourth candidate segmentation points 2, 4, 6, and 8.
  • Alternatively, the median, maximum, or minimum of the corresponding second pre-segment points on the nodes may be used as the fourth candidate segmentation point.
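  • The two steps (equal frequency division on each node, then position-wise averaging across nodes) can be sketched as follows. The function names are hypothetical, and taking the last element of each equal-frequency part as its pre-segment point is an assumption of this sketch.

```python
def equal_frequency_points(sorted_values, parts):
    """Pre-segment points that split sorted_values into `parts` equal-count parts."""
    n = len(sorted_values)
    if not 1 <= parts <= n:
        raise ValueError("parts must be between 1 and the number of values")
    # The k-th point is the last element of the k-th part.
    return [sorted_values[(k * n) // parts - 1] for k in range(1, parts)]


def average_points(per_node_points):
    """Average the pre-segment points of all nodes position by position."""
    return [sum(column) / len(column) for column in zip(*per_node_points)]


# Pre-segment points of the first and second nodes from the example above.
candidates = average_points([[2.2, 4.2, 5.8, 8.2], [1.8, 3.8, 6.2, 7.8]])
```

  • Replacing the mean in `average_points` with a median, maximum, or minimum gives the alternative described above.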
  • Step S37 Distributing the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point.
  • ordered allocation refers to a specific, known size relationship between the data to be processed on each node. For example, the maximum value of the data to be processed on the first node is smaller than the minimum value of the data to be processed on the second node, and so on.
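  • Ordered allocation can be illustrated with a range partition over the candidate segmentation points. In this sketch each node receives the values that fall in one interval between consecutive points, and values equal to a point go to the lower node; both choices and the name `assign_to_nodes` are assumptions for illustration.

```python
from bisect import bisect_left


def assign_to_nodes(values, split_points):
    """Send each value to the node whose range contains it.

    split_points must be sorted ascending; K points yield K + 1 nodes.
    Node k holds values v with split_points[k-1] < v <= split_points[k].
    """
    nodes = [[] for _ in range(len(split_points) + 1)]
    for v in values:
        nodes[bisect_left(split_points, v)].append(v)
    return nodes
```

  • After this allocation the data inside a node is still unsorted, but every value on one node is smaller than every value on the next node, which is the size relationship described above.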
  • Step S38 Sort the to-be-processed data on each node after the orderly distribution respectively to obtain the fourth sorted data in each node.
  • The sorting method can be selected according to the memory size of each node and the amount of data to be processed on the node, in order to sort the data to be processed on each node.
  • Step S39 Determine the target quantile point in the fourth ranking data according to the preset number of bins.
  • the target points can be determined according to the amount of data to be processed and the preset number of bins.
  • For example, suppose the amount of data to be processed is 10000, the fourth sorted data on the first node contains 2520 items, the fourth sorted data on the second node contains 2480 items, and the fourth sorted data on each of the third and fourth nodes contains 2500 items.
  • The maximum value on the first node is smaller than the minimum value on the second node, and so on. If the preset number of bins is 4, the target quantile points should be the 2500th, 5000th, and 7500th data items. Because the data on each of the four nodes is sorted and the four nodes are themselves ordered, the 2500th, 5000th, and 7500th data items in the overall order are easy to locate.
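  • Locating these items only requires the node sizes, as the following sketch shows. It assumes each node's data is held as a sorted list and that the nodes are given in ascending order; `global_kth` and `equal_frequency_targets` are hypothetical names.

```python
def global_kth(sorted_nodes, k):
    """Return the k-th smallest item (1-based) across ordered, sorted nodes."""
    for node in sorted_nodes:
        if k <= len(node):
            return node[k - 1]
        k -= len(node)  # skip this node entirely
    raise IndexError("k exceeds the total number of items")


def equal_frequency_targets(sorted_nodes, n_bins):
    """Target quantile points for n_bins equal-frequency bins."""
    total = sum(len(node) for node in sorted_nodes)
    if total < n_bins:
        raise ValueError("fewer items than bins")
    return [global_kth(sorted_nodes, (j * total) // n_bins) for j in range(1, n_bins)]
```

  • With node sizes 2520, 2480, 2500, and 2500, the 2500th item lies on the first node, the 5000th is the last item on the second node, and the 7500th is the last item on the third node.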
  • the binning processing method provided in the foregoing embodiment completes binning processing of large-scale to-be-processed data on multiple nodes based on an equal frequency method.
  • This method first randomly allocates the to-be-processed data to multiple nodes and determines preliminary equal-frequency segmentation points (the fourth candidate segmentation points). It then allocates the data to be processed to the nodes in order according to the fourth candidate segmentation points, sorts the data on each node, and finally determines the target quantile points based on the sorted data and the preset number of bins.
  • the binning processing method can perform binning processing on evenly distributed large-scale data.
  • step S3 provided in the embodiment shown in FIG. 2 may further include the following steps.
  • If the target binning method is the third binning method, the maximum value and the minimum value on each node are obtained respectively; the maximum value and the minimum value of the data to be processed are determined according to the maximum value and the minimum value on each node; and the target quantile point is determined according to the maximum and minimum values of the data to be processed and the preset number of bins.
  • the maximum value and minimum value on each node can be obtained respectively, and a maximum value and minimum value can be determined from the maximum value and minimum value on each node.
  • The target quantile points of the data to be processed can then be determined. For example, if it is known that the maximum value of the data to be processed is 10000, the minimum value is 1, and the number of bins is 4, then the target quantile points are approximately 2500, 5000, and 7500, and the data can be binned according to these target quantile points.
  • In this embodiment, the maximum value and minimum value are first determined on each node, the maximum value and minimum value of the large-scale data to be processed are then determined from the per-node maximum and minimum values, and finally the binning of the data to be processed is completed according to the maximum value and minimum value of the data to be processed and the preset number of bins. This method is simple and easy to operate, and is suitable for some concentrated data to be processed.
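  • This equal-width procedure can be sketched in a few lines. The input format (one (minimum, maximum) pair per node) and the name `equal_width_points` are assumptions for illustration.

```python
def equal_width_points(node_min_max, n_bins):
    """Equal-width target quantile points from per-node (min, max) pairs."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo = min(node_min for node_min, _ in node_min_max)  # global minimum
    hi = max(node_max for _, node_max in node_min_max)  # global maximum
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]
```

  • Only two numbers per node need to be communicated, which is why this mode is the cheapest of the three. For a minimum of 1 and a maximum of 10000 with 4 bins, the exact points are 2500.75, 5000.5, and 7500.25, which the example above rounds to 2500, 5000, and 7500.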
  • Fig. 11 is a flowchart showing a data binning processing method according to an exemplary embodiment.
  • the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
  • Step S111 Obtain the data to be processed and its target binning method and preset binning number.
  • Step S112 Determine that the amount of data to be processed is greater than or equal to a preset threshold.
  • Step S113 Randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1.
  • Step S114 if the target binning mode is the first binning mode, sort the to-be-processed data on each node to obtain the second sorted data in each node.
  • Step S115 Perform equal frequency division on each second sorted data according to the number of nodes to obtain the first pre-segment point on each node.
  • Step S116 Determine the first candidate segmentation point according to the first pre-segmentation point.
  • Step S117 Distribute the to-be-processed data to the N nodes in an orderly manner according to the first candidate segmentation point.
  • Step S118 Sort the to-be-processed data on each node after the ordered distribution, respectively, to obtain the first sorted data in each node.
  • Step S119 Obtain the global KS of the to-be-processed data according to the first ranking data in each node.
  • Step S1110 Determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point.
  • Step S1111 Determine whether the data amount of the data to be processed on the left and right of the second candidate segmentation point is greater than a preset data amount.
  • If the amount of data to be processed on both the left and right of the second candidate segmentation point is greater than the preset data amount, step S1112 is executed; otherwise, step S1113 is executed.
  • Step S1112 Determine the to-be-processed data corresponding to a maximum KS on the left and right sides of the second candidate segmentation point, respectively, as the second candidate segmentation point. Then, continue to perform step S1111 until the amount of data to be processed on the left and right sides of the second candidate segmentation point is less than or equal to the preset data amount.
  • Step S1113 Determine whether the number of the second candidate segmentation points is less than the preset number of bins.
  • If it is determined that the number of the second candidate segmentation points is less than the preset number of bins, step S1114 is executed; otherwise, step S1115 is executed.
  • Step S1114 Determine that the second candidate segmentation point is the target quantile point.
  • Step S1115 Determine the target quantile point according to the preset number of bins and using a dynamic programming method.
  • Step S1116 Obtain a binning result of the to-be-processed data according to the target quantile.
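  • The final step, obtaining a binning result from the target quantile points, can be sketched as follows. Treating each bin as closed on the right (a value equal to a target quantile point falls in the lower bin) is an assumption of this sketch, and `apply_bins` is a hypothetical name.

```python
from bisect import bisect_left


def apply_bins(values, target_points):
    """Map each value to a bin index in 0 .. len(target_points).

    target_points must be sorted ascending; bin k covers
    (target_points[k-1], target_points[k]].
    """
    return [bisect_left(target_points, v) for v in values]
```

  • For example, with target quantile points 2500, 5000, and 7500, the values 1 and 2500 fall in bin 0, 2501 falls in bin 1, and 9999 falls in bin 3.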
  • In this embodiment, the data to be processed is binned based on the KS index, which bins continuous variables effectively and is more interpretable.
  • This method distributes large-scale data to be processed to multiple nodes, determines the target quantile points from the data on those nodes, and finally bins the data to be processed according to the target quantile points, which overcomes the limitation that a single machine's memory is too small to handle large-scale data.
  • Fig. 12 is a flowchart showing a method for processing data binning according to an exemplary embodiment.
  • the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
  • Step S121 Obtain the data to be processed and its target binning method and preset binning number.
  • Step S122 Determine that the amount of data to be processed is greater than or equal to a preset threshold.
  • Step S123 If the target binning mode is the second binning mode, sort the to-be-processed data on each node to obtain the fifth sorted data in each node.
  • Step S124 Perform equal frequency division on each fifth sorted data according to the number of nodes to obtain second pre-segment points on each node.
  • Step S125 Determine the fourth candidate segmentation point according to the second pre-segmentation point.
  • Step S126 Distribute the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point.
  • Step S127 Sort the to-be-processed data on each node after the orderly distribution respectively to obtain the fourth sorted data in each node.
  • Step S128 Determine the target quantile point in the fourth ranking data according to the preset number of bins.
  • Step S129 Obtain a binning result of the to-be-processed data according to the target quantile.
  • the binning processing method provided in the foregoing embodiment completes binning processing of large-scale to-be-processed data on multiple nodes based on an equal frequency method.
  • This method first randomly allocates the to-be-processed data to multiple nodes and determines preliminary equal-frequency segmentation points (the fourth candidate segmentation points). It then allocates the data to be processed to the nodes in order according to the fourth candidate segmentation points, sorts the data on each node, and finally determines the target quantile points based on the sorted data and the preset number of bins.
  • the binning processing method can perform binning processing on evenly distributed large-scale data.
  • Fig. 13 is a flowchart showing a method for processing data binning according to an exemplary embodiment.
  • the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
  • Step S131 Obtain the to-be-processed data and its target binning method and preset binning number.
  • Step S132 Determine that the amount of data to be processed is greater than or equal to a preset threshold.
  • Step S133 Randomly allocate the data to be processed to N nodes, where N is a positive integer greater than 1.
  • Step S134 If the target binning mode is the third binning mode, the maximum value and the minimum value on each node are obtained respectively.
  • Step S135 Determine the maximum value and the minimum value of the to-be-processed data according to the maximum value and the minimum value on each node.
  • Step S136 Determine the target quantile point according to the maximum value and minimum value of the data to be processed and the preset number of bins.
  • Step S137 Obtain a binning result of the to-be-processed data according to the target quantile.
  • In this embodiment, the maximum value and minimum value are first determined on each node, the maximum value and minimum value of the large-scale data to be processed are then determined from the per-node maximum and minimum values, and finally the binning of the data to be processed is completed according to the maximum value and minimum value of the data to be processed and the preset number of bins. This method is simple and easy to operate, and is suitable for some concentrated data to be processed.
  • Fig. 14 is a flow chart showing a method for processing data binning according to an exemplary embodiment.
  • the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
  • Step S141 Obtain the data to be processed and its target binning method and preset binning number.
  • Step S142 Determine that the amount of data to be processed is less than a preset threshold.
  • Step S143 Sort the to-be-processed data to generate third sorted data.
  • Step S144 Determine the KS of the third sorted data.
  • Step S145 Determine a maximum KS among the KS of the third sorted data, and use the corresponding to-be-processed data as the fifth candidate segmentation point.
  • Step S146 Determine whether the amount of data to be processed on the left and right sides of the fifth candidate segmentation point is greater than a preset data amount.
  • If it is determined that the amount of data to be processed on both the left and right sides of the fifth candidate segmentation point is greater than the preset data amount, the to-be-processed data corresponding to a maximum KS on each side is determined as a further fifth candidate segmentation point and step S146 is performed again; otherwise, step S147 is performed.
  • Step S147 Determine whether the number of the fifth candidate segmentation points is less than the preset number of bins.
  • If the number of the fifth candidate segmentation points is less than the preset number of bins, step S148 is executed; otherwise, step S149 is executed.
  • Step S148 Determine that the fifth candidate segmentation point is the target quantile point.
  • Step S149 Determine the target quantile point according to the preset number of bins and using a dynamic programming method.
  • Step S1410 Obtain a binning result of the to-be-processed data according to the target quantile.
  • The technical solution provided by the embodiment shown in FIG. 14 can be used on a single node to complete the binning of data with a single attribute. If a data list includes data with multiple attributes, for example both age and score, the data in the list can also be assigned to multiple nodes by attribute, and the above method can be applied on those nodes at the same time to complete the binning.
  • The technical solution provided by the embodiment shown in FIG. 14, on the one hand, bins the data to be processed based on the KS index, which bins continuous variables effectively and is more interpretable.
  • On the other hand, the sorting method is selected according to the node memory and the amount of data to be processed on the node, so that running speed is ensured while the node memory is fully utilized.
  • this method uses dynamic programming to find out the eligible target quantiles, which saves running time.
  • Fig. 15 is a block diagram showing a data binning processing device according to an exemplary embodiment. Referring to FIG. 15, the device 150 includes a data acquisition module 1501, a data distribution module 1502, a target quantile determination module 1503, and a binning module 1504.
  • the data acquisition module 1501 can be configured to acquire the data to be processed and its target binning method and the preset number of bins; the data distribution module 1502 can be configured to: if the data volume of the data to be processed is greater than or equal to the preset threshold, The data to be processed is randomly distributed to N nodes, where N is a positive integer greater than 1.
  • The target quantile determination module 1503 may be configured to process the to-be-processed data on the N nodes according to the preset number of bins and the target binning method, so as to determine the target quantile point of the data to be processed; the binning module 1504 may be configured to perform a binning operation on the to-be-processed data according to the target quantile point to obtain a binning result.
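  • The random distribution performed by the data distribution module can be sketched as follows. Nodes are modeled as in-memory lists purely for illustration; a real deployment would send each item to a separate machine. The name `distribute_randomly` and the `seed` parameter are hypothetical.

```python
import random


def distribute_randomly(data, n_nodes, seed=None):
    """Assign every item to one of n_nodes nodes uniformly at random."""
    if n_nodes < 2:
        raise ValueError("N must be a positive integer greater than 1")
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    nodes = [[] for _ in range(n_nodes)]
    for item in data:
        nodes[rng.randrange(n_nodes)].append(item)
    return nodes
```

  • Random assignment gives every node a sample with roughly the same distribution as the whole data set, which is why per-node equal-frequency points can later be combined into global candidates.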
  • The target quantile determination module 1503 shown in FIG. 15 may include a first candidate segmentation point determination submodule, a first allocation submodule, a first sorting submodule, a global KS determination submodule, and a first target quantile determination submodule.
  • The first candidate segmentation point determination submodule may be configured to determine the first candidate segmentation point of the data to be processed if the target binning mode is the first binning mode; the first allocation submodule may be configured to distribute the data to be processed to the N nodes in an orderly manner according to the first candidate segmentation point; the first sorting submodule may be configured to sort the data to be processed on each node after the ordered distribution, to obtain the first sorted data in each node; the global KS determination submodule may be configured to obtain the global KS of the to-be-processed data according to the first sorted data in each node; the first target quantile determination submodule may be configured to determine the target quantile point according to the global KS of the data to be processed.
  • the first candidate segmentation point determination sub-module may include a second sorting unit, a first pre-segment point determination unit, and a first candidate segmentation point determination unit.
  • The second sorting unit may be configured to sort the to-be-processed data on each node to obtain the second sorted data in each node; the first pre-segment point determination unit may be configured to perform equal frequency division on each second sorted data according to the number N of nodes, to obtain the first pre-segment point on each node; the first candidate segmentation point determination unit may be configured to determine the first candidate segmentation point according to the first pre-segment point.
  • the first target quantile determination sub-module 035 shown in FIG. 15 may include a second candidate segmentation point determination unit and a target quantile determination unit.
  • The second candidate segmentation point determination unit may be configured to determine the second candidate segmentation point according to the global KS of the to-be-processed data in the first sorted data on the N nodes; the target quantile determination unit may be configured to determine the target quantile point among the second candidate segmentation points according to the preset number of bins.
  • the second candidate segmentation point determination unit may include a maximum KS determination subunit and a binary unit.
  • The maximum KS determining subunit may be configured to determine a maximum KS in the global KS and use its corresponding to-be-processed data as the second candidate segmentation point; the binary unit may be configured to, if the amount of data to be processed on the left and right of the second candidate segmentation point is greater than the preset data amount, determine the to-be-processed data corresponding to the largest KS on each of the left and right of the second candidate segmentation point as a further second candidate segmentation point.
  • the second target quantile determination unit may include a first judgment subunit, a second target quantile determination subunit, and a second target quantile determination subunit.
  • The first judgment subunit may be configured to judge whether the number of the second candidate segmentation points is less than the preset number of bins; the second target quantile determination subunit may be configured to, if the number of the second candidate segmentation points is less than the preset number of bins, determine that the second candidate segmentation point is the target quantile point; the second target quantile determination subunit may be configured to, if the number of the second candidate segmentation points is greater than or equal to the preset number of bins, determine the target quantile point according to the preset number of bins using a dynamic programming method.
  • the device 150 shown in FIG. 15 may further include: a third ranking module, a KS determination module, a third candidate segmentation point determination module, a second judgment module, and a third target quantile determination module.
  • the third sorting module may be configured to sort the to-be-processed data to generate third sorted data if the data amount of the to-be-processed data is less than a preset threshold;
  • the KS determination module may be configured to determine the KS of the third sorted data;
  • the third candidate segmentation point determination module may be configured to determine the third candidate segmentation point according to the KS of the third sorted data;
  • the second judgment module may be configured to determine whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins;
  • the third target quantile determination module may be configured to, if the number of the third candidate segmentation points is greater than or equal to the preset number of bins, determine the target quantile point according to the preset number of bins using a dynamic programming method.
  • The target quantile determination module 1503 shown in FIG. 15 may further include: a fourth candidate segmentation point determination submodule, a second allocation submodule, a fourth sorted data acquisition submodule, and a fourth target quantile determination submodule.
  • the fourth candidate segmentation point determination submodule may be configured to determine the fourth candidate segmentation point of the data to be processed if the target binning mode is the second binning mode;
  • the second allocation submodule may be configured to allocate the to-be-processed data to the N nodes in an orderly manner according to the fourth candidate segmentation point;
  • the fourth sorted data acquisition submodule may be configured to sort the to-be-processed data on each node after the ordered allocation, to obtain the fourth sorted data in each node;
  • the fourth target quantile determination submodule may be configured to determine the target quantile point in the fourth sorted data according to the preset number of bins.
  • the fourth candidate segmentation point determination submodule may include: a fifth ranking submodule, a second pre-segment point determination submodule, and a fourth candidate segmentation point submodule.
  • the fifth sorting sub-module may be configured to sort the to-be-processed data on each node to obtain the fifth sorting data in each node;
  • the second pre-segment point determination submodule may be configured to perform equal frequency division on each fifth sorted data according to the number N of the nodes, to obtain the second pre-segment point on each node;
  • the fourth candidate segmentation point submodule may be configured to determine the fourth candidate segmentation point according to the second pre-segment point.
  • the device 150 shown in FIG. 15 may further include: a node maximum value acquisition module, a global maximum value determination module, and a fifth target quantile determination submodule.
  • the node maximum value acquisition module may be configured to obtain the maximum value and the minimum value on each node if the target binning mode is the third binning mode; the global maximum value determination module may be configured to determine the maximum and minimum values of the data to be processed according to the maximum and minimum values on each node; the fifth target quantile determination submodule may be configured to determine the target quantile according to the maximum and minimum values of the data to be processed and the preset number of bins.
  • since each functional module of the data binning processing device 150 of the exemplary embodiment of the present disclosure corresponds to the steps of the foregoing exemplary embodiment of the data binning processing method, the details are not repeated here.
  • FIG. 16 shows a schematic structural diagram of a computer system 1600 suitable for implementing a terminal device according to an embodiment of the present application.
  • the terminal device shown in FIG. 16 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the computer system 1600 includes a central processing unit (CPU) 1601, which can perform various appropriate actions and processing based on a program stored in a read-only memory (ROM) 1602 or a program loaded from a storage portion 1608 into a random access memory (RAM) 1603.
  • in the RAM 1603, various programs and data required for the operation of the system 1600 are also stored.
  • the CPU 1601, ROM 1602, and RAM 1603 are connected to each other through a bus 1604.
  • An input/output (I/O) interface 1605 is also connected to the bus 1604.
  • the following components are connected to the I/O interface 1605: an input part 1606 including a keyboard, a mouse, etc.; an output part 1607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage part 1608 including a hard disk; and a communication section 1609 including a network interface card such as a LAN card, a modem, etc.
  • the communication section 1609 performs communication processing via a network such as the Internet.
  • a drive 1610 is also connected to the I/O interface 1605 as needed.
  • a removable medium 1611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1610 as needed, so that the computer program read from it is installed into the storage portion 1608 as needed.
  • the process described above with reference to the flowchart can be implemented as a computer software program.
  • the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication part 1609, and/or installed from the removable medium 1611.
  • when the computer program is executed by the central processing unit (CPU) 1601, it performs the above-mentioned functions defined in the system of the present application.
  • the computer-readable medium shown in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the above-mentioned module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram or flowchart, and each combination of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present application can be implemented in software or hardware.
  • the described unit may also be provided in the processor.
  • a processor includes a sending unit, an acquiring unit, a determining unit, and a first processing unit.
  • the names of these units do not constitute a limitation on the unit itself under certain circumstances.
  • the present application also provides a computer-readable medium, which may be included in the device described in the above-mentioned embodiments; or it may exist alone without being assembled into the device.
  • the above-mentioned computer-readable medium carries one or more programs.
  • the functions that the device can implement include: obtaining the data to be processed and its target binning mode and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the to-be-processed data on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the to-be-processed data; and performing a binning operation on the to-be-processed data according to the target quantile to obtain the binning result.
  • the exemplary embodiments described herein can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solutions of the embodiments of the present disclosure can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) and includes several instructions that enable a computing device (which may be a personal computer, a server, a mobile terminal, or a smart device, etc.) to execute the method according to the embodiment of the present disclosure, such as one or more steps shown in FIG. 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data binning processing method, an apparatus, an electronic device and a computer-readable medium, related to the field of data processing. The method comprises: acquiring data to be processed, as well as a target binning means and a pre-determined number of bins thereof (S1); if a data volume of the data to be processed is greater than or equal to a pre-determined threshold, randomly allocating the data to be processed to a number N of nodes, N being a positive integer greater than 1 (S2); processing the data to be processed on the N nodes according to the pre-determined number of bins and using the target binning means, so as to determine a target quantile of the data to be processed (S3); according to the target quantile, performing a binning operation on the data to be processed, so as to obtain a binning result (S4). It is possible to perform binning processing of data having a relatively high data volume.

Description

Data binning processing method and device, electronic equipment and computer-readable medium
This disclosure claims priority to the Chinese invention patent application filed on June 12, 2019, with application number 201910504964.2, entitled "Data Binning Processing Method and Device, Electronic Equipment and Computer Readable Medium".
Technical field
The present disclosure relates to the technical field of data processing, and in particular to a data binning processing method and device, electronic equipment and computer-readable media.
Background
Data binning is a commonly used data processing method. Data binning divides data into sub-intervals according to the value of a certain attribute, for example sub-intervals by age or by height. If the attribute value of a data item falls within a certain sub-interval, the data item is put into the bin represented by that sub-interval.
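The idea of placing a data item into the bin whose sub-interval contains its attribute value can be illustrated with a minimal Python sketch. The function name `assign_bin`, the age boundaries, and the convention that a value equal to a boundary goes to the bin on its right are all assumptions made for illustration; the source does not prescribe them.

```python
from bisect import bisect_right

def assign_bin(value, cut_points):
    """Return the index of the bin that `value` falls into.

    `cut_points` must be sorted in ascending order. k cut points define
    k + 1 bins, and a value equal to a cut point goes to the bin on its
    right (an illustrative convention, not taken from the source).
    """
    return bisect_right(cut_points, value)

# Hypothetical age boundaries: (-inf, 18), [18, 35), [35, 60), [60, +inf)
age_cut_points = [18, 35, 60]
ages = [5, 18, 34, 35, 72]
bins = [assign_bin(age, age_cut_points) for age in ages]
```

Here `bins` holds one bin index per age, so all later processing can work on the bin index instead of the raw value.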
With the development of big data, the scale of data is gradually increasing, so a binning method that can adapt to large-scale data is extremely important for data processing.
It should be noted that the information disclosed in the above background section is only used to strengthen the understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to those of ordinary skill in the art.
Summary of the invention
In view of this, the embodiments of the present disclosure provide a data binning processing method and device, electronic equipment, and a computer-readable medium, which can perform binning processing on large-scale data.
Other characteristics and advantages of the present disclosure will become apparent through the following detailed description, or will be learned in part through the practice of the present disclosure.
According to a first aspect of the embodiments of the present disclosure, a data binning processing method is proposed. The method includes: obtaining the data to be processed and its target binning mode and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed; and performing a binning operation on the data to be processed according to the target quantile to obtain a binning result.
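The threshold check and the random allocation step can be sketched as follows. This is a single-process illustration in which each "node" is simply a Python list; the names `allocate_randomly` and `plan_binning` and the `seed` parameter are hypothetical and are not taken from the source.

```python
import random

def allocate_randomly(data, n_nodes, seed=None):
    """Randomly assign each record to one of `n_nodes` partitions."""
    if n_nodes <= 1:
        raise ValueError("N must be a positive integer greater than 1")
    rng = random.Random(seed)
    nodes = [[] for _ in range(n_nodes)]
    for record in data:
        nodes[rng.randrange(n_nodes)].append(record)
    return nodes

def plan_binning(data, threshold, n_nodes, seed=None):
    """Return the partitions on which the target quantiles would be computed."""
    if len(data) >= threshold:
        # Large data: spread the records over N nodes.
        return allocate_randomly(data, n_nodes, seed)
    # Small data: keep everything in a single partition.
    return [list(data)]

nodes = plan_binning(list(range(100)), threshold=50, n_nodes=4, seed=0)
```

Random allocation keeps every record exactly once; only the placement changes, so the later quantile computation still sees the full data set.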
In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed, includes: if the target binning mode is the first binning mode, determining a first candidate segmentation point of the data to be processed; allocating the data to be processed to the N nodes in order according to the first candidate segmentation point; sorting the data to be processed on each node after the ordered allocation, to obtain the first sorted data on each node; obtaining the global KS of the data to be processed according to the first sorted data on each node; and determining the target quantile according to the global KS of the data to be processed.
In some exemplary embodiments of the present disclosure, determining the first candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain the second sorted data on each node; dividing each second sorted data at equal frequency according to the number N of the nodes, to obtain the first pre-segment points on each node; and determining the first candidate segmentation point according to the first pre-segment points.
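A minimal sketch of the pre-segment and ordered-allocation steps is given below. The text above does not say how the candidate segmentation points are derived from the pre-segment points, so this sketch assumes one common choice: pool the pre-segment points of all nodes, sort them, and take N - 1 equal-frequency points from the pool. All function names are hypothetical.

```python
from bisect import bisect_right

def equal_frequency_points(sorted_values, parts):
    """Return parts - 1 values that cut `sorted_values` into equal-sized runs."""
    n = len(sorted_values)
    if n == 0:
        return []
    return [sorted_values[(i * n) // parts] for i in range(1, parts)]

def candidate_split_points(nodes):
    """Per-node equal-frequency pre-segment points, pooled into N - 1 candidates."""
    n_nodes = len(nodes)
    pre_points = []
    for node_data in nodes:
        pre_points.extend(equal_frequency_points(sorted(node_data), n_nodes))
    # Assumption: candidates are equal-frequency points of the pooled pre-points.
    return equal_frequency_points(sorted(pre_points), n_nodes)

def allocate_in_order(nodes, split_points):
    """Send each value to the partition of its range, then sort each partition."""
    ordered = [[] for _ in range(len(split_points) + 1)]
    for node_data in nodes:
        for value in node_data:
            ordered[bisect_right(split_points, value)].append(value)
    return [sorted(part) for part in ordered]

nodes = [[9, 1, 5, 3], [2, 8, 4, 6], [7, 0, 11, 10]]
split_points = candidate_split_points(nodes)          # [5, 7]
ordered_nodes = allocate_in_order(nodes, split_points)
```

After the ordered allocation, every value on one node is no larger than any value on the next node, so concatenating the per-node sorted data yields a globally sorted sequence without any single node having to hold all of the data.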
In some exemplary embodiments of the present disclosure, determining the target quantile according to the global KS of the data to be processed includes: determining second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed; and determining the target quantile among the second candidate segmentation points according to the preset number of bins.
In some exemplary embodiments of the present disclosure, determining the second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed includes: determining a maximum KS in the global KS, and taking its corresponding data to be processed as a second candidate segmentation point; and if the amount of data to be processed on the left side and the right side of the second candidate segmentation point is greater than a preset data amount, determining, on the left side and on the right side of the second candidate segmentation point respectively, the data to be processed corresponding to a maximum KS, to also serve as second candidate segmentation points.
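The recursive search for maximum-KS points can be sketched on a single sorted list. This section does not define KS, so the sketch assumes it is the Kolmogorov-Smirnov statistic between the cumulative distributions of two label classes (0 and 1). It also assumes that each side is split further whenever that side alone holds more than the preset amount of data, and that records with a value less than or equal to the split value go to the left. The names `best_ks_index` and `ks_split_points` are hypothetical.

```python
def best_ks_index(records):
    """Return the index with the largest KS in `records`, or None.

    `records` is a list of (value, label) pairs sorted by value, label in {0, 1}.
    """
    total_pos = sum(label for _, label in records)
    total_neg = len(records) - total_pos
    if total_pos == 0 or total_neg == 0:
        return None  # KS is undefined when one class is missing
    best_index, best_ks = None, 0.0
    cum_pos = cum_neg = 0
    for i, (value, label) in enumerate(records):
        cum_pos += label
        cum_neg += 1 - label
        # Evaluate only after the last record of a run of equal values.
        if i + 1 < len(records) and records[i + 1][0] == value:
            continue
        ks = abs(cum_pos / total_pos - cum_neg / total_neg)
        if ks > best_ks:
            best_index, best_ks = i, ks
    return best_index

def ks_split_points(records, min_size):
    """Recursively collect maximum-KS values as candidate segmentation points."""
    i = best_ks_index(records)
    if i is None:
        return []
    left, right = records[: i + 1], records[i + 1:]
    points = []
    if len(left) > min_size:
        points += ks_split_points(left, min_size)
    points.append(records[i][0])
    if len(right) > min_size:
        points += ks_split_points(right, min_size)
    return points

records = list(zip(range(1, 9), [0, 0, 0, 1, 0, 1, 1, 1]))
points = ks_split_points(records, min_size=2)
```

The KS at the last record of any segment is always zero, so a selected index never lies at the end of the segment and both halves are strictly smaller than the input, which guarantees that the recursion terminates.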
In some exemplary embodiments of the present disclosure, determining the target quantile among the second candidate segmentation points according to the preset number of bins includes: determining whether the number of the second candidate segmentation points is less than the preset number of bins; if the number of the second candidate segmentation points is less than the preset number of bins, determining that the second candidate segmentation points are the target quantiles; and if the number of the second candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantile according to the preset number of bins and using a dynamic programming method.
In some exemplary embodiments of the present disclosure, the data binning processing method further includes: if the data volume of the data to be processed is less than the preset threshold, sorting the data to be processed to generate third sorted data; determining the KS of the third sorted data; determining third candidate segmentation points according to the KS of the third sorted data; determining whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins; and if the number of the third candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantile according to the preset number of bins and using a dynamic programming method.
In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed, further includes: if the target binning mode is the second binning mode, determining a fourth candidate segmentation point of the data to be processed; allocating the data to be processed to the N nodes in order according to the fourth candidate segmentation point; sorting the data to be processed on each node after the ordered allocation, to obtain the fourth sorted data on each node; and determining the target quantile in the fourth sorted data according to the preset number of bins.
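The last step of this mode can be sketched as a rank lookup across the ordered nodes. This section does not name the second binning mode; the sketch assumes it is equal-frequency binning, so the target quantiles are the values at global ranks k * total / bins. The function name `global_quantiles` and the sample data are hypothetical.

```python
def global_quantiles(ordered_nodes, n_bins):
    """Return n_bins - 1 equal-frequency cut values.

    `ordered_nodes` is a list of sorted lists in which every value on one
    node is no larger than any value on the next node.
    """
    total = sum(len(part) for part in ordered_nodes)
    points = []
    for k in range(1, n_bins):
        rank = (k * total) // n_bins  # 0-based global rank of the k-th cut
        for part in ordered_nodes:
            if rank < len(part):
                points.append(part[rank])
                break
            rank -= len(part)  # skip this node and continue on the next one
    return points

ordered_nodes = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9, 10, 11]]
quantiles = global_quantiles(ordered_nodes, n_bins=4)
```

Only the per-node record counts are needed to locate the node that holds a given global rank, so no node has to receive the data of another node at this stage.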
In some exemplary embodiments of the present disclosure, determining the fourth candidate segmentation point of the data to be processed includes: sorting the data to be processed on each node to obtain the fifth sorted data on each node; dividing each fifth sorted data at equal frequency according to the number N of the nodes, to obtain the second pre-segment points on each node; and determining the fourth candidate segmentation point according to the second pre-segment points.
In some exemplary embodiments of the present disclosure, processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed, further includes: if the target binning mode is the third binning mode, obtaining the maximum value and the minimum value on each node; determining the maximum and minimum values of the data to be processed according to the maximum and minimum values on each node; and determining the target quantile according to the maximum and minimum values of the data to be processed and the preset number of bins.
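A minimal sketch of this mode follows. This section does not name the third binning mode; since the cut points depend only on the global extremes and the number of bins, the sketch assumes equal-width binning. The function name `equal_width_points` and the sample data are hypothetical.

```python
def equal_width_points(nodes, n_bins):
    """Return n_bins - 1 cut values that split [global_min, global_max] evenly."""
    # Each node reports only its own minimum and maximum.
    node_extremes = [(min(part), max(part)) for part in nodes if part]
    global_min = min(low for low, _ in node_extremes)
    global_max = max(high for _, high in node_extremes)
    width = (global_max - global_min) / n_bins
    return [global_min + k * width for k in range(1, n_bins)]

nodes = [[9, 1, 5, 3], [2, 8, 4, 6], [7, 0, 12, 10]]
cut_points = equal_width_points(nodes, n_bins=4)
```

Each node contributes just two numbers, so this mode needs neither sorting nor any redistribution of the data.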
According to a second aspect of the embodiments of the present disclosure, a data binning processing device is proposed. The device includes: a data acquisition module, a data distribution module, a target quantile determination module, and a binning module. The data acquisition module is configured to acquire the data to be processed and its target binning mode and preset number of bins; the data distribution module is configured to randomly allocate the data to be processed to N nodes if the data volume of the data to be processed is greater than or equal to a preset threshold, where N is a positive integer greater than 1; the target quantile determination module is configured to process the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed; and the binning module is configured to perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data binning processing method described in any one of the foregoing.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the data binning processing method described in any one of the foregoing.
The data binning processing method, device, electronic equipment, and computer-readable medium provided by some embodiments of the present disclosure allocate the data to be processed to multiple nodes, determine the target quantile from the data on the multiple nodes, and finally bin the data to be processed according to the target quantile. By distributing a large volume of data to multiple nodes and using those nodes together to complete the binning operation, the method overcomes the limitation that a single node has too little memory to process large-scale data.
It should be understood that the above general description and the following detailed description are only exemplary and do not limit the present disclosure.
Description of the drawings
The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments in accordance with the present disclosure, and together with the specification are used to explain the principle of the present disclosure. The drawings described below are only some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.
Fig. 1 shows a schematic diagram of an exemplary system architecture of a data binning processing method or data binning processing device applied to an embodiment of the present disclosure.
Fig. 2 is a flowchart showing a data binning processing method according to an exemplary embodiment.
Fig. 3 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 4 is a flowchart showing yet another data binning processing method according to an exemplary embodiment.
Fig. 5 is a flowchart showing still another data binning processing method according to an exemplary embodiment.
Fig. 6 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 7 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 8 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 9 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 10 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 11 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 12 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 13 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 14 is a flowchart showing another data binning processing method according to an exemplary embodiment.
Fig. 15 is a block diagram showing a data binning processing device according to an exemplary embodiment.
Fig. 16 is a schematic structural diagram of a computer system applied to a data binning processing device according to an exemplary embodiment.
Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be comprehensive and complete, and will fully convey the concept of the example embodiments to those skilled in the art. In the figures, the same reference numerals denote the same or similar parts, and thus their repeated description will be omitted.
The features, structures, or characteristics described in the present disclosure may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a sufficient understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the present disclosure.
The accompanying drawings are only schematic illustrations of the present disclosure. The same reference numerals in the figures indicate the same or similar parts, and thus their repeated description will be omitted. Some block diagrams shown in the drawings do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are only exemplary descriptions; they do not necessarily include all contents and steps, nor do the steps have to be executed in the described order. For example, some steps can be decomposed, and some steps can be combined or partially combined, so the actual execution order may change according to actual conditions.
In this specification, the terms "a", "an", "the", "said" and "at least one" are used to indicate that there are one or more elements/components/etc.; the terms "including", "comprising" and "having" are used in an open-ended, inclusive sense and mean that there may be additional elements/components/etc. besides those listed; the terms "first", "second" and "third" are used only as labels and do not limit the number of their objects.
The exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which a data binning processing method or a data binning processing device of an embodiment of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on. The terminal devices 101, 102, 103 may be various electronic devices that have display screens and support web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and so on.
The server 105 may be a server that provides various services, for example, a background management server that provides support for devices operated by users through the terminal devices 101, 102, and 103. The background management server can analyze and otherwise process received data such as requests, and feed back the processing result to the terminal device.
The server 105 may, for example, obtain the data to be processed and its target binning mode and preset number of bins; if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly allocate the data to be processed to N nodes, where N is a positive integer greater than 1; process the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile of the data to be processed; and perform a binning operation on the data to be processed according to the target quantile to obtain a binning result.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative. The server 105 may be a single physical server or may be composed of multiple servers. There may be any number of terminal devices, networks, and servers according to actual needs.
In the related art, data can be divided into sub-intervals according to the value of an attribute, for example by age or by height. If the attribute value of a data item falls within a sub-interval, the item can be placed in the bin that the sub-interval represents, and the attribute of the whole sub-interval is then used to represent the attribute of the data in it. Binning of this kind can be understood as discretizing the data, which can offer the following advantages:
1. Discrete data is easy to add or remove, and the discrete data type facilitates rapid model iteration.
2. Inner products over the sparse vectors formed from discretized data are fast to compute, and the results are easy to store and to scale.
3. Discretized data is robust to abnormal data. For example, in age data, an abnormal value such as "age greater than 300" can severely disturb a model; after the age data is discretized (for example, ages greater than 30 are represented as 1 and all others as 0), the data contains only the features 0 and 1, and feeding the discretized abnormal value into the model no longer disturbs it.
4. For generalized linear models, the expressive power of continuous data is limited. Feeding discretized data into such a model is equivalent to introducing non-linearity, which improves expressive power and the quality of the fit.
5. A model fed with discretized continuous data becomes more stable. For example, for age data that changes over time, if ages 20 to 30 form one interval and a user is 25 years old, the user's age becomes 26 a year later but the corresponding discrete value is unchanged.
6. Discretizing continuous data has the effect of simplifying a logistic regression model and reduces the risk of overfitting.
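The discretization described above can be illustrated with a minimal Python sketch. The split points `age_edges`, the sample ages, and the helper name `assign_bin` are hypothetical and chosen only for illustration; they do not come from the disclosure.

```python
from bisect import bisect_right

def assign_bin(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` is a sorted list of split points. Bin k covers
    edges[k-1] <= value < edges[k]; the first and last bins are open-ended.
    """
    return bisect_right(edges, value)

age_edges = [20, 30, 40]        # hypothetical split points
ages = [25, 26, 35, 300]        # 300 is an abnormal value

# Ages 25 and 26 share a bin, so the discrete value is stable over one year.
bins = [assign_bin(a, age_edges) for a in ages]

# Two-valued discretization from the text: ages greater than 30 become 1.
binary = [1 if a > 30 else 0 for a in ages]
```

After discretization the abnormal age 300 is indistinguishable from any other age in the last bin, which is the robustness property listed as advantage 3.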
FIG. 2 is a flowchart of a data binning processing method according to an exemplary embodiment.
Referring to FIG. 2, the data binning processing method provided by an embodiment of the present disclosure may include the following steps.
Step S1: obtain the data to be processed together with its target binning mode and preset number of bins.
In some embodiments, the preset number of bins is the number of bins, specified by the user, into which the data to be processed is to be divided, and the target binning mode is the binning mode specified by the user. In some embodiments, the target binning mode may include at least one of a first binning mode, a second binning mode, and a third binning mode.
Step S2: if the data volume of the data to be processed is greater than or equal to a preset threshold, randomly distribute the data to be processed to N nodes, where N is a positive integer greater than 1.
In some embodiments, the preset threshold may be the amount of data that a single machine can process. For example, for a list of data to be processed that includes a label column, a serial-number column, and a feature-value column, assuming that labels, serial numbers, and feature values are all of type int (integer data, 4 bytes each), a server with 1 GB of memory can handle only about $10^8$ to $10^9$ data items. In some embodiments, when the data volume of the data to be processed is greater than or equal to the preset threshold, the data can be randomly distributed to N nodes for processing.
In some embodiments, the N nodes may be N terminals capable of data processing, for example N servers or N computer terminals. The present disclosure does not limit the physical form of the N nodes, which depends on the actual deployment.
In some embodiments, the amounts of data randomly distributed to the individual nodes are approximately equal.
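Step S2 can be sketched as follows. This is only an illustration under stated assumptions: the "nodes" are modelled as in-memory Python lists, and the names `distribute_randomly` and `plan_partitions` are hypothetical. Shuffling and then dealing the items round-robin is one simple way to obtain a random allocation whose partition sizes differ by at most one item.

```python
import random

def distribute_randomly(data, n_nodes, seed=None):
    """Shuffle `data` and deal it round-robin into `n_nodes` partitions."""
    if n_nodes <= 1:
        raise ValueError("n_nodes must be greater than 1")
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Slice i takes items i, i + n_nodes, i + 2 * n_nodes, ...
    return [shuffled[i::n_nodes] for i in range(n_nodes)]

def plan_partitions(data, threshold, n_nodes, seed=None):
    """Distribute only when the data volume reaches the preset threshold."""
    if len(data) >= threshold:
        return distribute_randomly(data, n_nodes, seed)
    return [list(data)]  # small data stays on a single machine

parts = plan_partitions(list(range(10)), threshold=5, n_nodes=3, seed=0)
single = plan_partitions([0, 1, 2], threshold=5, n_nodes=3)
```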
Step S3: process the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, to determine the target quantile points of the data to be processed.
Step S4: perform a binning operation on the data to be processed according to the target quantile points to obtain a binning result.
In some embodiments, the data to be processed can be split at the target quantile points to form multiple bins of data.
The above embodiment provides a data binning processing method. On the one hand, the relationship between the data volume and the preset threshold is considered before binning, which avoids the situation where binning cannot be completed because the data volume is too large. On the other hand, large data is distributed to multiple nodes that complete the binning operation concurrently, which overcomes the inability of a single node with too little memory to process large-scale data.
Referring to FIG. 3, step S3 of the embodiment shown in FIG. 2 may include the following steps.
Step S31: if the target binning mode is the first binning mode, determine first candidate segmentation points of the data to be processed.
In some embodiments, the first binning mode may be a distributed data binning processing method based on the KS value of the data.
In some embodiments, determining the first candidate segmentation points may include the steps shown in FIG. 4.
Step S311: sort the data to be processed on each node separately to obtain second sorted data on each node.
In some embodiments, the data to be processed may first be randomly distributed to N nodes, where N is a positive integer greater than 1.
For example, M data items to be processed are randomly distributed to N nodes, and the data on the nodes is denoted $M_1, M_2, \ldots, M_{N-1}, M_N$.
In some embodiments, the data to be processed on each node may be sorted separately to obtain the second sorted data on each node.
For example, sorting the data $M_1, M_2, \ldots, M_{N-1}, M_N$ on the nodes yields the second sorted data $M'_1, M'_2, \ldots, M'_{N-1}, M'_N$ on the respective nodes.
In some embodiments, the sorting method may be selected according to the node's memory size and the memory required to process the data to be processed. For example, when the memory required by the data to be processed on a single node is less than half of the node's memory, bucket sort (for example, radix sort) may be used to sort the data on that node; when the required memory is greater than or equal to half of the node's memory, quicksort may be used. Quicksort uses less memory but is slower, whereas bucket sort is faster but uses more memory.
In some embodiments, the memory a node needs to process its data depends on the data volume, the data type, and the number of attributes of the data to be processed on that node. For example, for a list of data to be processed that includes a label column, a serial-number column, and a feature-value column, assuming its data volume is $10^8$ to $10^9$ items and that labels, serial numbers, and feature values are all of type int (4 bytes each), at least 1 GB of memory is required to process this data.
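The memory-based choice of sorting method described above can be sketched as follows. The helper names and the three-int-columns layout follow the example in the text; treating "memory required" as simply rows times columns times 4 bytes is an assumption made for illustration and ignores any per-object overhead.

```python
BYTES_PER_INT = 4  # size of one int value, as stated in the text

def required_bytes(n_rows, n_int_columns=3):
    """Rough memory estimate: label, serial number, and feature value per row."""
    return n_rows * n_int_columns * BYTES_PER_INT

def choose_sort(n_rows, node_memory_bytes, n_int_columns=3):
    """Pick bucket sort when the data needs less than half the node memory."""
    need = required_bytes(n_rows, n_int_columns)
    return "bucket" if need < node_memory_bytes / 2 else "quick"

one_gib = 2 ** 30
small = choose_sort(10 ** 6, one_gib)   # about 12 MB of data
large = choose_sort(10 ** 8, one_gib)   # about 1.2 GB of data
```

Under this estimate, $10^8$ rows of three int columns occupy 1.2 × 10^9 bytes, which is consistent with the text's figure of at least 1 GB.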
Step S312: perform equal-frequency division on each set of second sorted data according to the number of nodes N, to obtain first pre-segmentation points on each node.
In some embodiments, the equal-frequency division of the second sorted data on each node can be carried out according to the number of nodes N specified by the user and the data volume on each node. Assuming the first node holds 1000 data items and the number of nodes is 5, the second sorted data on the first node can be divided equal-frequency with 1000/5 = 200 items per bin.
In some embodiments, equal-frequency division is performed on each node according to the data volume on that node and the number of nodes N, to obtain the first pre-segmentation points on each node.
For example, given the second sorted data $M'_1, M'_2, \ldots, M'_{N-1}, M'_N$ on the nodes, the second sorted data of each node can be divided equal-frequency according to the number of nodes N and the data volume on that node. Suppose the first pre-segmentation points determined on the first node are $m_{11}, m_{12}, \ldots, m_{1,N-1}$ (only N-1 segmentation points are needed to divide M data items into N bins), those determined on the second node are $m_{21}, m_{22}, \ldots, m_{2,N-1}$, and those determined on the i-th node are $m_{i1}, m_{i2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.
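A minimal sketch of the equal-frequency division on one node is shown below. It assumes that each pre-segmentation point is taken to be the largest value of its bin; the text does not fix this convention, and the function name `equal_frequency_points` is hypothetical.

```python
def equal_frequency_points(sorted_values, n_bins):
    """Return n_bins - 1 split values that divide `sorted_values`
    into n_bins groups of (nearly) equal size."""
    n = len(sorted_values)
    if n_bins < 2 or n < n_bins:
        raise ValueError("need at least 2 bins and at least one item per bin")
    # The k-th split value is the last item of the k-th group.
    return [sorted_values[(k * n) // n_bins - 1] for k in range(1, n_bins)]

# 1000 sorted items on one node, 5 nodes: 200 items per bin, 4 split points.
node_data = list(range(1, 1001))
points = equal_frequency_points(node_data, 5)
```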
Step S313: determine the first candidate segmentation points according to the first pre-segmentation points.
In some embodiments, the first candidate segmentation points may be determined by averaging the corresponding first pre-segmentation points across the nodes. For example, assume the preset number of bins is N, the first pre-segmentation points determined on the first node are $m_{11}, m_{12}, \ldots, m_{1,N-1}$, those determined on the second node are $m_{21}, m_{22}, \ldots, m_{2,N-1}$, and those determined on the i-th node are $m_{i1}, m_{i2}, \ldots, m_{i,N-1}$, where i is a positive integer less than or equal to N.
In some embodiments, the first candidate segmentation points may be determined as
$$\frac{1}{N}\sum_{i=1}^{N} m_{i1},\quad \frac{1}{N}\sum_{i=1}^{N} m_{i2},\quad \ldots,\quad \frac{1}{N}\sum_{i=1}^{N} m_{i,N-1}$$
where $m_{i,N-1}$ denotes the (N-1)-th first pre-segmentation point on the i-th node.
In other embodiments, the median, maximum, or minimum of the corresponding first pre-segmentation points across the nodes may instead be used as the first candidate segmentation points.
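Combining the per-node pre-segmentation points position by position can be sketched as follows. The function name `combine_pre_split_points` and the sample values are hypothetical; the `how` parameter covers the mean as well as the median, maximum, and minimum alternatives mentioned above.

```python
from statistics import mean, median

def combine_pre_split_points(per_node_points, how=mean):
    """Combine the j-th pre-segmentation point of every node into one
    candidate point, for each position j."""
    if len({len(p) for p in per_node_points}) != 1:
        raise ValueError("every node must supply the same number of points")
    # zip(*...) groups the j-th point of each node together.
    return [how(column) for column in zip(*per_node_points)]

per_node = [
    [10, 20, 30],   # points from node 1
    [12, 22, 32],   # points from node 2
    [14, 24, 34],   # points from node 3
]
by_mean = combine_pre_split_points(per_node)
by_median = combine_pre_split_points(per_node, how=median)
by_max = combine_pre_split_points(per_node, how=max)
```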
In the embodiment shown in FIG. 4, multiple nodes jointly determine the first candidate segmentation points used for a preliminary division of the data to be processed, and the data on each node is sorted with a method chosen according to the node's memory size and the data volume, which maintains running speed while making full use of node memory.
Step S32: distribute the data to be processed to the N nodes in order according to the first candidate segmentation points.
In some embodiments, ordered allocation means that, after allocation, the data on the different nodes has a specific, known order relationship. For example, the maximum of the data on the first node is less than the minimum of the data on the second node, and so on.
For example, assuming the number of nodes N is 4 and the first candidate segmentation points are $C_1$, $C_2$, and $C_3$, distributing the data to be processed to the 4 nodes in order according to the first candidate segmentation points can be expressed as: the data from the 0th to the $C_1$-th item goes to the first node, the data from the $(C_1+1)$-th to the $C_2$-th item goes to the second node, the data from the $(C_2+1)$-th to the $C_3$-th item goes to the third node, and the data from the $(C_3+1)$-th item to the last item goes to the fourth node.
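The ordered allocation can be sketched as below. This sketch assumes the candidate segmentation points are compared against data values (a value equal to a split point goes to the lower node); the text could also be read as using positions in the sorted data, so this is one possible interpretation. The function name is hypothetical.

```python
from bisect import bisect_left

def distribute_in_order(values, split_points):
    """Send each value to the node whose value range contains it.

    With sorted split points C1 < C2 < ..., node 0 receives v <= C1,
    node 1 receives C1 < v <= C2, and the last node receives the rest.
    """
    nodes = [[] for _ in range(len(split_points) + 1)]
    for v in values:
        nodes[bisect_left(split_points, v)].append(v)
    return nodes

nodes = distribute_in_order([5, 1, 9, 3, 7, 4], split_points=[3, 6])
```

After this step every value on one node is smaller than every value on the next node, so sorting each node locally yields a globally ordered data set.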
Step S33: sort the data to be processed on each node after the ordered allocation, to obtain first sorted data on each node.
In some embodiments, the sorting method may be selected according to each node's memory size and the volume of data to be processed on that node.
Step S34: obtain the global KS of the data to be processed according to the first sorted data on the nodes.
In the related art, the KS value can be used to evaluate a model's ability to discriminate risk; the indicator measures the gap between the cumulative proportions of the first sample and the second sample. The larger the KS value, the better the variable separates the first sample from the second sample.
In some embodiments, each node may contain both first-sample data and second-sample data.
In some embodiments, the labeling rules for the first sample and the second sample may be defined by the user. For example, in bank data, the user may define the data of customers with credit problems as the first sample and the data of customers without credit problems as the second sample.
In some embodiments, the KS value of an interval (which may contain only a single data item) can be obtained as follows.
1. Sort the data.
2. Divide the sorted data, in order, into multiple data intervals.
3. Obtain the number of first samples (for example, good data) and the number of second samples (for example, bad data) in each interval.
4. Obtain the cumulative number of first samples for each interval (that is, the number of first samples in the current interval plus the number of first samples in all preceding intervals; for example, if the first interval has 3 first samples, the second has 2, and the third has 4, the cumulative number of first samples for the second interval is 2 + 3 = 5) and the cumulative number of second samples.
5. Obtain, for each interval, the ratio of the cumulative number of first samples to the total number of first samples (good%) and the ratio of the cumulative number of second samples to the total number of second samples (bad%).
6. Take the absolute value of the difference between the two ratios of an interval (|good% - bad%|) as the KS value of that interval.
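The six steps above can be sketched in Python as follows, with each distinct value treated as its own interval. The record layout `(value, label)` with label 1 for the first sample and 0 for the second sample is an assumption for illustration, as is the function name `ks_by_value`.

```python
from itertools import groupby

def ks_by_value(records):
    """Return [(value, ks), ...] for each distinct value in ascending order.

    `records` is a list of (value, label) pairs; label 1 marks a first-sample
    item and label 0 a second-sample item.
    """
    records = sorted(records)                        # step 1
    total_first = sum(label for _, label in records)
    total_second = len(records) - total_first
    if total_first == 0 or total_second == 0:
        raise ValueError("both samples must be present")
    result = []
    cum_first = cum_second = 0
    # Steps 2-3: one interval per distinct value, duplicates merged.
    for value, group in groupby(records, key=lambda r: r[0]):
        labels = [label for _, label in group]
        cum_first += sum(labels)                     # step 4
        cum_second += len(labels) - sum(labels)
        good = cum_first / total_first               # step 5
        bad = cum_second / total_second
        result.append((value, abs(good - bad)))      # step 6
    return result

records = [(1, 1), (2, 1), (3, 0), (4, 1), (5, 0), (6, 0)]
ks = dict(ks_by_value(records))
```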
In some embodiments, duplicate data items may be merged before the global KS of the data to be processed is determined.
In some embodiments, because the first sorted data is also ordered across nodes, the global KS values of the data to be processed can be determined from the amounts of first-sample data and second-sample data on each node.
In some embodiments, the global KS of a data item is its KS value computed over all of the data to be processed. For example, if the data to be processed is distributed over three nodes that hold $N_1$, $N_2$, and $N_3$ first samples and $N_4$, $N_5$, and $N_6$ second samples respectively, the global KS value of the last data item on the second node can be expressed as $\left|\frac{N_1+N_2}{N_1+N_2+N_3} - \frac{N_4+N_5}{N_4+N_5+N_6}\right|$.
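Because the nodes are ordered relative to one another, each node only needs the first-sample and second-sample counts of the nodes before it to turn its local cumulative counts into global ones. The sketch below illustrates this idea on in-memory lists; the `(value, label)` record layout (label 1 for first sample, 0 for second sample), the assumption of distinct values, and the function name are all hypothetical.

```python
def global_ks_per_node(node_records):
    """Compute the global KS of every item, node by node.

    `node_records` is a list of per-node lists of (value, label) pairs that
    are sorted within each node and ordered across nodes.
    """
    first_counts = [sum(label for _, label in recs) for recs in node_records]
    second_counts = [len(recs) - f for recs, f in zip(node_records, first_counts)]
    total_first, total_second = sum(first_counts), sum(second_counts)

    result = []
    offset_first = offset_second = 0   # counts held by all preceding nodes
    for recs, f, s in zip(node_records, first_counts, second_counts):
        cum_first, cum_second = offset_first, offset_second
        node_ks = []
        for value, label in recs:
            cum_first += label
            cum_second += 1 - label
            ks = abs(cum_first / total_first - cum_second / total_second)
            node_ks.append((value, ks))
        result.append(node_ks)
        offset_first += f
        offset_second += s
    return result

nodes = [[(1, 1), (2, 0)], [(3, 1), (4, 1)], [(5, 0), (6, 0)]]
per_node_ks = global_ks_per_node(nodes)
```

In this example the nodes hold 1, 2, and 0 first samples and 1, 0, and 2 second samples, so the last item on the second node has global KS |3/3 - 1/3| = 2/3, matching the expression in the text.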
Step S35: determine the target quantile points according to the global KS of the data to be processed.
In some embodiments, the target quantile points can be determined by the steps shown in FIG. 5.
Step S351: determine second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed.
In some embodiments, the second candidate segmentation points can be determined by the steps shown in FIG. 6.
Step S3511: determine the maximum KS among the global KS values, and take the corresponding data item as a second candidate segmentation point.
In some embodiments, the data item corresponding to the maximum KS value can be determined in the data to be processed according to its global KS, and used as a second candidate segmentation point.
Step S3512: if the data volumes to the left and to the right of the second candidate segmentation point are greater than a preset data volume, determine, on the left side and on the right side of the second candidate segmentation point respectively, the data item corresponding to the maximum KS, as further second candidate segmentation points.
In some embodiments, the preset data volume may be set by the user in advance.
In some embodiments, it is determined whether the data volumes to the left and to the right of the second candidate segmentation point obtained in step S3511 are greater than the preset data volume (if step S3511 yields more than one second candidate segmentation point, the check is made for each of them). If the data volumes to the left and to the right of the second candidate segmentation points are all greater than the preset data volume, the data item corresponding to the maximum KS is again determined on the left side and on the right side of each second candidate segmentation point and added as a second candidate segmentation point. If any second candidate segmentation point has a left or right side whose data volume is less than the preset data volume, the iteration stops.
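The iterative search for candidate segmentation points can be sketched recursively, as below. This is a simplified variant under explicit assumptions, not the method of the disclosure itself: the KS is recomputed within each sub-range, splitting stops per segment (rather than globally) as soon as either side holds no more than `min_size` items, values are assumed distinct, and records are `(value, label)` pairs with label 1 for the first sample and 0 for the second. All names are hypothetical.

```python
def best_ks_split(segment):
    """Index i with the largest KS inside `segment` (split after item i),
    or None when the segment cannot be split meaningfully."""
    total_first = sum(label for _, label in segment)
    total_second = len(segment) - total_first
    if total_first == 0 or total_second == 0:
        return None
    best_index, best_ks = None, -1.0
    cum_first = cum_second = 0
    # Skip the last item: splitting after it would leave the right side empty.
    for i, (_, label) in enumerate(segment[:-1]):
        cum_first += label
        cum_second += 1 - label
        ks = abs(cum_first / total_first - cum_second / total_second)
        if ks > best_ks:
            best_index, best_ks = i, ks
    return best_index

def candidate_split_points(records, min_size):
    """Collect split values by repeatedly splitting at the maximum KS."""
    points = []

    def recurse(segment):
        i = best_ks_split(segment)
        if i is None:
            return
        points.append(segment[i][0])
        left, right = segment[:i + 1], segment[i + 1:]
        # Continue only while both sides exceed the preset data volume.
        if len(left) > min_size and len(right) > min_size:
            recurse(left)
            recurse(right)

    recurse(sorted(records))
    return sorted(points)

separable = [(v, 1) for v in range(1, 5)] + [(v, 0) for v in range(5, 9)]
mixed = list(zip(range(1, 9), [1, 1, 0, 1, 0, 0, 1, 0]))
```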
Step S352: determine the target quantile points among the second candidate segmentation points according to the preset number of bins.
In some embodiments, determining the target quantile points among the second candidate segmentation points according to the preset number of bins can be implemented by the steps shown in FIG. 7.
Step S3521: determine whether the number of second candidate segmentation points is less than the preset number of bins.
Step S3522: if the number of second candidate segmentation points is less than the preset number of bins, determine that the second candidate segmentation points are the target quantile points.
Step S3523: if the number of second candidate segmentation points is greater than or equal to the preset number of bins, determine the target quantile points according to the preset number of bins and using a dynamic programming method.
In some embodiments, assuming the number of second candidate segmentation points is N and the target number of bins is M, where N is greater than or equal to M, M-1 target quantile points can be determined among the N second candidate segmentation points.
In some embodiments, there may be $\binom{N}{M-1}$ solutions for choosing M-1 target quantile points from the N second candidate segmentation points, and the IV value of each solution can be obtained by formula (1):
$$IV = \sum_{i} \left(\mathrm{good\_Pcnt}_i - \mathrm{bad\_Pcnt}_i\right)\ln\frac{\mathrm{good\_Pcnt}_i}{\mathrm{bad\_Pcnt}_i} \tag{1}$$
where $\mathrm{good\_Pcnt}_i$ denotes the ratio of the number of first samples in the i-th interval (which may contain only a single value) to the total number of first samples, and $\mathrm{bad\_Pcnt}_i$ denotes the ratio of the number of second samples in the i-th interval to the total number of second samples.
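The IV of one candidate solution can be computed directly from the per-bin sample counts. The sketch below uses the common form of the information value, the sum over bins of (good - bad) · ln(good / bad); the bin counts and the function name are hypothetical, and bins that lack one of the two samples are rejected because the logarithm is undefined for them.

```python
from math import comb, log

def information_value(bins):
    """IV of a binning given (first_count, second_count) for each bin."""
    total_first = sum(f for f, _ in bins)
    total_second = sum(s for _, s in bins)
    iv = 0.0
    for f, s in bins:
        if f == 0 or s == 0:
            raise ValueError("each bin needs items from both samples")
        good = f / total_first     # share of all first samples in this bin
        bad = s / total_second     # share of all second samples in this bin
        iv += (good - bad) * log(good / bad)
    return iv

uninformative = information_value([(50, 50), (50, 50)])
informative = information_value([(80, 20), (20, 80)])

# Number of ways to choose M - 1 = 4 points from N = 10 candidates.
n_solutions = comb(10, 4)
```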
In some embodiments, the IV value of each solution can be computed in turn, the solution with the maximum IV value is taken as the optimal solution, and the target quantile points are determined from it. This approach uses little space and its logic is simple, but it repeats many calculations and is therefore computationally inefficient.
In some embodiments, a dynamic programming method can be used to determine the target quantile points. Dynamic programming caches the solutions of sub-problems that have already been solved so that they can be reused directly, which avoids repeated computation.
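One way to apply dynamic programming here relies on the fact that the IV is a sum of per-bin terms, so the best division of the first j segments into k bins can be built from the best division of a shorter prefix into k - 1 bins. The sketch below is an illustration under assumptions, not the disclosure's exact procedure: the N candidate points are taken to split the sorted data into N + 1 atomic segments with known `(first_count, second_count)` pairs, bins lacking one sample are scored as minus infinity, and all names are hypothetical.

```python
from functools import lru_cache
from math import log

def select_split_points(segment_counts, n_bins):
    """Return (best_iv, cuts): cuts are indices of the chosen candidate
    points, where candidate c lies between segment c and segment c + 1."""
    n_seg = len(segment_counts)
    if not 1 <= n_bins <= n_seg:
        raise ValueError("n_bins must be between 1 and the number of segments")
    pre_f, pre_s = [0], [0]                 # prefix sums of the counts
    for f, s in segment_counts:
        pre_f.append(pre_f[-1] + f)
        pre_s.append(pre_s[-1] + s)
    total_f, total_s = pre_f[-1], pre_s[-1]

    def bin_iv(a, b):
        """IV contribution of the bin made of segments a .. b-1."""
        f, s = pre_f[b] - pre_f[a], pre_s[b] - pre_s[a]
        if f == 0 or s == 0:
            return float("-inf")
        good, bad = f / total_f, s / total_s
        return (good - bad) * log(good / bad)

    @lru_cache(maxsize=None)                # caches solved sub-problems
    def best(k, j):
        """Best (iv, cuts) for dividing the first j segments into k bins."""
        if k == 1:
            return bin_iv(0, j), ()
        options = []
        for t in range(k - 1, j):           # last bin covers segments t .. j-1
            iv, cuts = best(k - 1, t)
            options.append((iv + bin_iv(t, j), cuts + (t - 1,)))
        return max(options)

    return best(n_bins, n_seg)

counts = [(30, 10), (25, 15), (10, 30), (5, 45)]   # 3 candidates, 4 segments
iv_two_bins, cuts_two_bins = select_split_points(counts, 2)
```

Each sub-problem `best(k, j)` is evaluated once and reused, which is the caching behaviour the text attributes to dynamic programming.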
The above embodiment provides a data binning processing method that has the following beneficial effects:
1. Binning the data to be processed based on the KS indicator bins continuous variables effectively and is more interpretable; the method can also accommodate many user-specific requirements, for example that the IV of the binning result be monotonic.
2. The data to be processed is sorted with a method chosen according to node memory and the data volume on the node, which maintains running speed while making full use of node memory.
3. Using a dynamic programming method to determine the target quantile points saves running time.
4. Compared with binning methods such as equal-frequency and equal-width binning, this method requires no business experience and completes the binning operation automatically.
5. The method distributes large-scale data to be processed to multiple nodes, determines the target quantile points from the data on those nodes, and finally bins the data to be processed according to the target quantile points, which overcomes the inability of a single machine with too little memory to process large-scale data.
Referring to FIG. 8, the data binning processing method provided by an embodiment of the present disclosure may further include the following steps.
Step S1: obtain the data to be processed.
Step S5: if the data volume of the data to be processed is less than the preset threshold, sort the data to be processed to generate third sorted data.
In some embodiments, the sorting method may be selected according to the node's memory size and the memory required to process the data to be processed. In some embodiments, when the memory required by the data to be processed on a single node is less than half of the node's memory, bucket sort (for example, radix sort) may be used to sort the data on that node; when the required memory is greater than or equal to half of the node's memory, quicksort may be used. Quicksort uses less memory but is slower, whereas bucket sort is faster but uses more memory.
In some embodiments, the memory a node needs to process its data depends on the data volume, the data type, and the number of attributes of the data to be processed on that node. For example, for a list of data to be processed that includes a label column, a serial-number column, and a feature-value column, assuming its data volume is $10^8$ to $10^9$ items and that labels, serial numbers, and feature values are all of type int (4 bytes each), at least 1 GB of memory is required to process this data.
Step S6: determine the KS of the third sorted data.
In some embodiments, duplicate data items may be merged before the KS of the data to be processed is determined.
In some embodiments, the KS value of each data item in the third sorted data may be determined from the total number of first samples and the total number of second samples in the third sorted data, together with the cumulative number of first samples and the cumulative number of second samples at each data item.
Step S7: determine third candidate segmentation points according to the KS of the third sorted data.
In some embodiments, the maximum KS may be determined among the KS values of the third sorted data, and the corresponding data item is used as a third candidate segmentation point.
In some embodiments, if the data volumes to the left and to the right of the third candidate segmentation point are greater than a preset data volume, the data item corresponding to the maximum KS is determined on the left side and on the right side of the third candidate segmentation point respectively, as further third candidate segmentation points.
In some embodiments, the preset data volume may be set by the user in advance.
In some embodiments, it is determined whether the data volumes to the left and to the right of the third candidate segmentation point are greater than the preset data volume (if the above step yields more than one third candidate segmentation point, the check is made for each of them). If the data volumes to the left and to the right of the third candidate segmentation points are all greater than the preset data volume, the data item corresponding to the maximum KS is again determined on the left side and on the right side of each third candidate segmentation point and added as a third candidate segmentation point. If any third candidate segmentation point has a left or right side whose data volume is less than the preset data volume, the iteration stops.
Step S8: Determine whether the number of the third candidate segmentation points is greater than or equal to the preset number of bins.
In some embodiments, if the number of the third candidate segmentation points is less than the preset number of bins, the third candidate segmentation points are determined to be the target quantile points.
Step S9: If the number of the third candidate segmentation points is greater than or equal to the preset number of bins, determine the target quantile points according to the preset number of bins by using a dynamic programming method.
In some embodiments, assuming that the number of second candidate segmentation points is N and the target number of bins is M, where N is greater than or equal to M, M-1 target quantile points must be determined among the N second candidate segmentation points.
In some embodiments, when M-1 target segmentation points are determined among the N second candidate segmentation points, there may be $\binom{N}{M-1}$ solutions, and the IV value of each solution can be obtained by formula (1).
In some embodiments, the third candidate segmentation points corresponding to the solution with the largest IV value may be selected as the target quantile points.
In some embodiments, the IV value of each solution may be computed in turn, the solution with the largest IV value taken as the optimal solution, and the target quantile points determined from the optimal solution. This way of finding the optimal solution uses little space and is logically simple, but it repeats many calculations, so its computational efficiency is low.
In some embodiments, a dynamic programming method may be used to determine the target quantile points. Dynamic programming caches the solutions of sub-problems that have already been solved, so that a cached solution can be used directly the next time the same sub-problem arises, which avoids repeated computation.
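The contrast between exhaustive enumeration and dynamic programming can be sketched as below. Formula (1) is not reproduced in this passage, so the sketch assumes the commonly used per-bin information value, (good share - bad share) * ln(good share / bad share), with a small additive smoothing term `eps` to keep the logarithm finite. The candidate segmentation points are assumed to divide the sorted data into elementary segments whose good and bad counts are given in `seg_good` and `seg_bad`; a cut index `j` means a boundary before segment `j`. All function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def bin_iv(good: int, bad: int, total_good: int, total_bad: int,
           eps: float = 0.5) -> float:
    """IV contribution of one bin (assumed definition, with additive smoothing)."""
    pg = (good + eps) / total_good
    pb = (bad + eps) / total_bad
    return (pg - pb) * math.log(pg / pb)


def solution_iv(seg_good: Sequence[int], seg_bad: Sequence[int],
                cuts: Sequence[int]) -> float:
    """Total IV of the binning defined by the chosen cut indices."""
    total_g, total_b = sum(seg_good), sum(seg_bad)
    bounds = [0, *cuts, len(seg_good)]
    return sum(bin_iv(sum(seg_good[i:j]), sum(seg_bad[i:j]), total_g, total_b)
               for i, j in zip(bounds, bounds[1:]))


def best_cuts_brute_force(seg_good: Sequence[int], seg_bad: Sequence[int],
                          m_bins: int) -> Tuple[float, List[int]]:
    """Evaluate every way of choosing m_bins - 1 cuts; simple but repetitive."""
    n = len(seg_good)
    best = max(combinations(range(1, n), m_bins - 1),
               key=lambda c: solution_iv(seg_good, seg_bad, c))
    return solution_iv(seg_good, seg_bad, best), list(best)


def best_cuts_dp(seg_good: Sequence[int], seg_bad: Sequence[int],
                 m_bins: int) -> Tuple[float, List[int]]:
    """dp[k][j] caches the best IV of grouping the first j segments into k bins."""
    n = len(seg_good)
    pre_g, pre_b = [0], [0]
    for g, b in zip(seg_good, seg_bad):
        pre_g.append(pre_g[-1] + g)
        pre_b.append(pre_b[-1] + b)
    total_g, total_b = pre_g[n], pre_b[n]
    neg_inf = float("-inf")
    dp = [[neg_inf] * (n + 1) for _ in range(m_bins + 1)]
    back = [[0] * (n + 1) for _ in range(m_bins + 1)]
    dp[0][0] = 0.0
    for k in range(1, m_bins + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                if dp[k - 1][i] == neg_inf:
                    continue
                cand = dp[k - 1][i] + bin_iv(pre_g[j] - pre_g[i],
                                             pre_b[j] - pre_b[i],
                                             total_g, total_b)
                if cand > dp[k][j]:
                    dp[k][j], back[k][j] = cand, i
    # Walk the back-pointers to recover the m_bins - 1 chosen cut indices.
    cuts, j = [], n
    for k in range(m_bins, 1, -1):
        j = back[k][j]
        cuts.append(j)
    return dp[m_bins][n], sorted(cuts)
```

Because the total IV is a sum over bins, the best grouping of a prefix can be reused for every longer prefix, which is what lets the table replace the repeated evaluations of the exhaustive search. Both functions assume `m_bins` does not exceed the number of segments and that both classes are present.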
In some embodiments, the technical solution provided by the embodiment shown in FIG. 8 may be used on a single node to complete the binning of data of a single attribute. If a data list includes data of multiple attributes, for example both age and score, the data in the list may also be distributed to multiple nodes by attribute, and the above method applied on the nodes simultaneously to complete the binning.
The technical solution provided by the embodiment shown in FIG. 8, on the one hand, bins the to-be-processed data based on the KS index, which handles continuous variables effectively and is more interpretable. On the other hand, it sorts the to-be-processed data according to the node memory and the data amount of the to-be-processed data on the node, which preserves running speed while making full use of node memory. Further, the method uses dynamic programming to find the target quantile points that meet the conditions, which saves running time.
Referring to FIG. 9, step S3 provided by the embodiment shown in FIG. 2 may further include the following steps.
Step S36: If the target binning mode is the second binning mode, determine the fourth candidate segmentation points of the to-be-processed data.
Referring to FIG. 10, step S36 provided by the embodiment shown in FIG. 9 may include the following steps.
S361: Sort the to-be-processed data on each node respectively to obtain fifth sorted data in each node.
In some embodiments, the to-be-processed data may first be randomly distributed to N nodes, where N is a positive integer greater than 1.
In some embodiments, the to-be-processed data on each node may be sorted separately to obtain the fifth sorted data in each node.
In some embodiments, the sorting method may be selected according to the memory size of the node and the memory required to process the to-be-processed data. In some embodiments, when the memory required for the to-be-processed data on a single node is less than half of that node's memory, bucket sort (for example, radix sort) may be used to sort the data on the node; when the required memory is greater than or equal to half of that node's memory, quick sort may be used. Quick sort occupies less memory but is slower, whereas bucket sort is faster but occupies more memory.
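The half-memory rule above can be sketched as follows. This is only an illustration under stated assumptions: the values are non-negative integers, memory is estimated as item count times a fixed `bytes_per_value`, and the function names are hypothetical. The radix sort builds auxiliary buckets, while the quick sort rearranges the list in place.

```python
from typing import List


def choose_sort_method(required_bytes: int, node_memory_bytes: int) -> str:
    """Bucket-style sort below half of node memory, quick sort otherwise."""
    return "bucket" if required_bytes < node_memory_bytes / 2 else "quick"


def radix_sort(values: List[int]) -> List[int]:
    """LSD radix sort (base 256) for non-negative integers; uses extra buckets."""
    out = list(values)
    if not out:
        return out
    max_v, shift = max(out), 0
    while (max_v >> shift) > 0:
        buckets: List[List[int]] = [[] for _ in range(256)]
        for v in out:
            buckets[(v >> shift) & 0xFF].append(v)
        out = [v for bucket in buckets for v in bucket]
        shift += 8
    return out


def quick_sort_in_place(values: List[int]) -> None:
    """Iterative quick sort that rearranges the list without copying it."""
    stack = [(0, len(values) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = values[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while values[i] < pivot:
                i += 1
            while values[j] > pivot:
                j -= 1
            if i <= j:
                values[i], values[j] = values[j], values[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))


def sort_on_node(values: List[int], bytes_per_value: int,
                 node_memory_bytes: int) -> List[int]:
    """Pick the method from the memory estimate, then sort."""
    required = len(values) * bytes_per_value
    if choose_sort_method(required, node_memory_bytes) == "bucket":
        return radix_sort(values)
    quick_sort_in_place(values)
    return values
```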
In some embodiments, the memory required by a node to process the to-be-processed data depends on the data amount, the data type, and the number of attributes of the to-be-processed data on that node. For example, for a to-be-processed data list consisting of a label column, a serial number column, and a feature value column, assume that its data amount is $10^8$ to $10^9$ rows and that the label, serial number, and feature value are all int data (each int occupying 4 bytes); processing such data then requires at least about 1 GB of memory.
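The estimate in the example can be checked with a few lines of arithmetic. The helper name is hypothetical, and the calculation counts only the raw column storage, ignoring any overhead of the runtime or of the sorting method.

```python
def estimate_bytes(rows: int, columns: int, bytes_per_value: int) -> int:
    """Raw storage needed for a table of fixed-width values."""
    return rows * columns * bytes_per_value


# Three int columns (label, serial number, feature value), 4 bytes each.
low = estimate_bytes(10**8, 3, 4)   # 1.2e9 bytes, a little over 1 GB
high = estimate_bytes(10**9, 3, 4)  # 1.2e10 bytes
print(f"{low / 2**30:.2f} GiB to {high / 2**30:.2f} GiB")
```

At the lower end of the stated range the raw data already exceeds 1 GB, which is consistent with the "at least" wording of the example.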
S362: Perform equal-frequency division on each fifth sorted data according to the number N of nodes to obtain second pre-segmentation points on each node.
In some embodiments, equal-frequency division of the sorted data on each node can be performed according to the user-specified number N of nodes and the data amount of the to-be-processed data on each node. Assuming that the data amount on the first node is 1000 and the number of bins preset by the user is 5, the sorted data on the first node can be divided with equal frequency so that each bin holds 1000/5 = 200 items.
In some embodiments, the second pre-segmentation points on each node are obtained after the data on each node has been divided with equal frequency according to the data amount on that node and the number N of nodes.
S363: Determine the fourth candidate segmentation points according to the second pre-segmentation points.
In some embodiments, the fourth candidate segmentation points may be determined according to the second pre-segmentation points.
In some embodiments, the corresponding second pre-segmentation points on the nodes may be averaged to determine the fourth candidate segmentation points. For example, assume that the number N of nodes is 4, that the second pre-segmentation points determined on the first node are 2.2, 4.2, 5.8, and 8.2, and that those determined on the second node are 1.8, 3.8, 6.2, and 7.8; averaging the corresponding second pre-segmentation points of the first node and the second node then gives the fourth candidate segmentation points 2, 4, 6, and 8.
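Steps S362 and S363 can be sketched as below. The sketch assumes that a pre-segmentation point is taken as the last value of each equal-frequency part and that every node produces the same number of points; both function names are hypothetical.

```python
from typing import List, Sequence


def equal_frequency_cuts(sorted_values: Sequence[float], parts: int) -> List[float]:
    """Pre-segmentation points of one node: last value of each of the first parts - 1 parts."""
    n = len(sorted_values)
    return [sorted_values[(k * n) // parts - 1] for k in range(1, parts)]


def average_cuts(per_node_cuts: Sequence[Sequence[float]]) -> List[float]:
    """Average the corresponding pre-segmentation points across nodes."""
    return [sum(column) / len(column) for column in zip(*per_node_cuts)]


# A node holding 1000 sorted items, divided into 5 parts of 200 items each.
print(equal_frequency_cuts(list(range(1, 1001)), 5))

# The two nodes from the example above.
print(average_cuts([[2.2, 4.2, 5.8, 8.2], [1.8, 3.8, 6.2, 7.8]]))
```

Replacing the mean in `average_cuts` with a median, maximum, or minimum over each column gives the other variants the description mentions.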
In some other embodiments, the median, maximum, or minimum of the corresponding second pre-segmentation points on the nodes may instead be used as the fourth candidate segmentation points.
Step S37: Distribute the to-be-processed data to the N nodes in order according to the fourth candidate segmentation points.
In some embodiments, ordered distribution means that a specific, known ordering exists between the to-be-processed data on different nodes. For example, the maximum of the to-be-processed data on the first node is smaller than the minimum of the to-be-processed data on the second node, and so on.
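One way to realise the ordered distribution is a range partition on the candidate segmentation points, sketched here with the standard-library `bisect` module. The function name is hypothetical, and the sketch assumes that k cut points define k + 1 nodes and that a value equal to a cut point goes to the lower node.

```python
from bisect import bisect_left
from typing import List, Sequence


def assign_to_nodes(values: Sequence[float], cuts: Sequence[float]) -> List[List[float]]:
    """Route each value to the node whose range contains it (cuts must be ascending)."""
    nodes: List[List[float]] = [[] for _ in range(len(cuts) + 1)]
    for v in values:
        nodes[bisect_left(cuts, v)].append(v)
    return nodes


nodes = assign_to_nodes([7, 1, 5, 3, 9, 2, 8], [2, 6])
print(nodes)
```

After the assignment every value on node k is no larger than every value on node k + 1, so sorting each node locally yields a globally ordered dataset without a global sort.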
Step S38: Sort the to-be-processed data on each node after the ordered distribution to obtain fourth sorted data in each node.
In some embodiments, the sorting method may be selected according to the memory size of each node and the data amount of the to-be-processed data on the node.
Step S39: Determine the target quantile points in the fourth sorted data according to the preset number of bins.
In some embodiments, once the to-be-processed data has been sorted, the target quantile points can be determined from the data amount of the to-be-processed data and the preset number of bins.
For example, suppose the total amount of to-be-processed data is 10000, the fourth sorted data on the first node contains 2520 items, that on the second node contains 2480 items, and that on the third node and the fourth node contains 2500 items each, and the maximum on the first node is smaller than the minimum on the second node, and so on. If the number of nodes is 4, the target quantile points should be the 2500th, 5000th, and 7500th data items. Because the data on each of the four nodes is sorted and the four nodes are also ordered among themselves, the 2500th, 5000th, and 7500th items of the overall order are easy to locate.
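Locating those items needs only the per-node counts, as the following sketch shows. It assumes 1-based global ranks, non-empty nodes, and a hypothetical function name; each result is a pair of node index and 0-based position inside that node's sorted data.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence, Tuple


def locate_quantiles(node_sizes: Sequence[int], bins: int) -> List[Tuple[int, int]]:
    """Map each equal-frequency rank to (node index, local index)."""
    total = sum(node_sizes)
    ends = list(accumulate(node_sizes))  # global rank of the last item on each node
    located = []
    for k in range(1, bins):
        rank = k * total // bins             # 1-based global rank of the k-th quantile
        node = bisect_left(ends, rank)       # first node whose last rank reaches it
        start = ends[node] - node_sizes[node]
        located.append((node, rank - start - 1))
    return located


print(locate_quantiles([2520, 2480, 2500, 2500], 4))
```

With the counts from the example, the 2500th item falls inside the first node, while the 5000th and 7500th items are the last items of the second and third nodes respectively.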
The binning processing method provided by the above embodiment bins large-scale to-be-processed data on multiple nodes based on the equal-frequency method. The method first randomly distributes the to-be-processed data to multiple nodes and determines preliminary equal-frequency segmentation points, namely the fourth candidate segmentation points; it then distributes the to-be-processed data to the nodes in order according to the fourth candidate segmentation points and sorts the data on each node; finally, it determines the target quantile points from the sorted data and the preset number of bins. This method is suitable for binning large-scale data that is evenly distributed.
In some embodiments, step S3 provided by the embodiment shown in FIG. 2 may further include the following steps.
If the target binning mode is the third binning mode, the maximum and minimum values on each node are obtained respectively; the maximum and minimum values of the to-be-processed data are determined from the maximum and minimum values on the nodes; and the target quantile points are determined according to the maximum and minimum values of the to-be-processed data and the preset number of bins.
In some embodiments, after the to-be-processed data has been randomly distributed to N nodes, the maximum and minimum values on each node can be obtained, and the largest of the node maxima and the smallest of the node minima taken as the maximum and minimum values of the to-be-processed data. Once the maximum and minimum values of the to-be-processed data and the preset number of bins are known, the quantile points of the to-be-processed data can be determined. For example, if the maximum value of the to-be-processed data is 10000, the minimum value is 1, and the number of bins is 4, the target quantile points are approximately 2500, 5000, and 7500, and the data can be binned according to these target quantile points.
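The equal-width procedure reduces to combining per-node extremes and stepping through the range, as sketched below with hypothetical function names. Note that with a minimum of exactly 1 the computed points are 2500.75, 5000.5, and 7500.25, which the example above rounds to 2500, 5000, and 7500.

```python
from typing import List, Sequence, Tuple


def global_min_max(node_data: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Combine the per-node minima and maxima into global extremes."""
    return (min(min(values) for values in node_data),
            max(max(values) for values in node_data))


def equal_width_cuts(lo: float, hi: float, bins: int) -> List[float]:
    """Interior boundaries of `bins` equal-width intervals over [lo, hi]."""
    width = (hi - lo) / bins
    return [lo + k * width for k in range(1, bins)]


lo, hi = global_min_max([[40, 1, 900], [10000, 77], [5000, 3]])
print(lo, hi, equal_width_cuts(lo, hi, 4))
```

Only two numbers per node travel across the network, which is why this mode is the cheapest of the three; the trade-off is that bin populations can be very uneven when the data is skewed.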
In the above embodiment, the maximum and minimum values are first determined on each node, the maximum and minimum values of the large-scale to-be-processed data are then determined from the per-node maxima and minima, and the binning of the to-be-processed data is finally completed according to the maximum value, the minimum value, and the preset number of bins. This method is simple and easy to operate, and is suitable for to-be-processed data whose distribution is relatively concentrated.
FIG. 11 is a flowchart of a data binning processing method according to an exemplary embodiment.
Referring to FIG. 11, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
Step S111: Obtain the to-be-processed data together with its target binning mode and preset number of bins.
Step S112: Determine whether the data amount of the to-be-processed data is greater than or equal to a preset threshold.
Step S113: If so, randomly distribute the to-be-processed data to N nodes, where N is a positive integer greater than 1.
Step S114: If the target binning mode is the first binning mode, sort the to-be-processed data on each node respectively to obtain second sorted data in each node.
Step S115: Perform equal-frequency division on each second sorted data according to the number of nodes to obtain first pre-segmentation points on each node.
Step S116: Determine the first candidate segmentation points according to the first pre-segmentation points.
Step S117: Distribute the to-be-processed data to the N nodes in order according to the first candidate segmentation points.
Step S118: Sort the to-be-processed data on each node after the ordered distribution to obtain first sorted data in each node.
Step S119: Obtain the global KS of the to-be-processed data according to the first sorted data in each node.
Step S1110: Determine a maximum KS in the global KS, and use the to-be-processed data corresponding to it as the second candidate segmentation point.
Step S1111: Determine whether the data amounts of the to-be-processed data on the left side and on the right side of the second candidate segmentation point are greater than a preset data amount.
If the data amounts of the to-be-processed data on the left side and on the right side of the second candidate segmentation point are greater than the preset data amount, step S1112 is performed; otherwise, step S1113 is performed.
Step S1112: Determine the to-be-processed data corresponding to a maximum KS on the left side and on the right side of the second candidate segmentation point respectively, each serving as a further second candidate segmentation point. Then return to step S1111, until the data amount of the to-be-processed data on the left side or the right side of a second candidate segmentation point is less than or equal to the preset data amount.
Step S1113: Determine whether the number of the second candidate segmentation points is less than the preset number of bins.
If the number of the second candidate segmentation points is less than the preset number of bins, step S1114 is performed; otherwise, step S1115 is performed.
Step S1114: Determine that the second candidate segmentation points are the target quantile points.
Step S1115: Determine the target quantile points according to the preset number of bins by using a dynamic programming method.
Step S1116: Obtain the binning result of the to-be-processed data according to the target quantile points.
The above embodiment provides a data binning processing method with the following beneficial effects:
1. The to-be-processed data is binned based on the KS index, which handles continuous variables effectively and is more interpretable.
2. The to-be-processed data is sorted according to the node memory and the data amount of the to-be-processed data on the node, which preserves running speed while making full use of node memory.
3. The target quantile points are determined with a dynamic programming method, which saves running time.
4. The method distributes the large-scale to-be-processed data to multiple nodes, determines the target quantile points from the data on the multiple nodes, and finally bins the to-be-processed data according to the target quantile points, which overcomes the limitation that the memory of a single machine is too small to handle large-scale data.
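The global KS of step S119 can be obtained without gathering all rows on one machine, because the nodes are already ordered among themselves. The sketch below assumes binary labels (1 for "bad", 0 for "good"), that both classes occur in the data, and that each node first reports its class counts so that running offsets can be passed along; the function name is hypothetical and the loop over nodes stands in for work that would run on each node.

```python
from typing import List, Sequence


def global_ks(node_labels: Sequence[Sequence[int]]) -> List[List[float]]:
    """KS value at every row, computed node by node from global cumulative counts."""
    # First pass: each node reports only (bad count, good count).
    counts = [(sum(labels), len(labels) - sum(labels)) for labels in node_labels]
    total_bad = sum(bad for bad, _ in counts)
    total_good = sum(good for _, good in counts)

    result: List[List[float]] = []
    offset_bad = offset_good = 0
    # Second pass: each node continues the cumulative counts of the nodes before it.
    for labels, (bad, good) in zip(node_labels, counts):
        cum_bad, cum_good = offset_bad, offset_good
        node_ks = []
        for y in labels:
            if y == 1:
                cum_bad += 1
            else:
                cum_good += 1
            node_ks.append(abs(cum_bad / total_bad - cum_good / total_good))
        result.append(node_ks)
        offset_bad += bad
        offset_good += good
    return result


print(global_ks([[0, 0, 1], [0, 1, 1]]))
```

The values equal those of a single pass over the concatenated, sorted data, so the maximum-KS search of step S1110 can then be run over the per-node results.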
FIG. 12 is a flowchart of a data binning processing method according to an exemplary embodiment.
Referring to FIG. 12, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
Step S121: Obtain the to-be-processed data together with its target binning mode and preset number of bins.
Step S122: Determine whether the data amount of the to-be-processed data is greater than or equal to a preset threshold.
Step S123: If the target binning mode is the second binning mode, sort the to-be-processed data on each node respectively to obtain fifth sorted data in each node.
Step S124: Perform equal-frequency division on each fifth sorted data according to the number of nodes to obtain second pre-segmentation points on each node.
Step S125: Determine the fourth candidate segmentation points according to the second pre-segmentation points.
Step S126: Distribute the to-be-processed data to the N nodes in order according to the fourth candidate segmentation points.
Step S127: Sort the to-be-processed data on each node after the ordered distribution to obtain fourth sorted data in each node.
Step S128: Determine the target quantile points in the fourth sorted data according to the preset number of bins.
Step S129: Obtain the binning result of the to-be-processed data according to the target quantile points.
The binning processing method provided by the above embodiment bins large-scale to-be-processed data on multiple nodes based on the equal-frequency method. The method first randomly distributes the to-be-processed data to multiple nodes and determines preliminary equal-frequency segmentation points, namely the fourth candidate segmentation points; it then distributes the to-be-processed data to the nodes in order according to the fourth candidate segmentation points and sorts the data on each node; finally, it determines the target quantile points from the sorted data and the preset number of bins. This method is suitable for binning large-scale data that is evenly distributed.
FIG. 13 is a flowchart of a data binning processing method according to an exemplary embodiment.
Referring to FIG. 13, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
Step S131: Obtain the to-be-processed data together with its target binning mode and preset number of bins.
Step S132: Determine whether the data amount of the to-be-processed data is greater than or equal to a preset threshold.
Step S133: If so, randomly distribute the to-be-processed data to N nodes, where N is a positive integer greater than 1.
Step S134: If the target binning mode is the third binning mode, obtain the maximum and minimum values on each node respectively.
Step S135: Determine the maximum and minimum values of the to-be-processed data according to the maximum and minimum values on each node.
Step S136: Determine the target quantile points according to the maximum and minimum values of the to-be-processed data and the preset number of bins.
Step S137: Obtain the binning result of the to-be-processed data according to the target quantile points.
In the above embodiment, the maximum and minimum values are first determined on each node, the maximum and minimum values of the large-scale to-be-processed data are then determined from the per-node maxima and minima, and the binning of the to-be-processed data is finally completed according to the maximum value, the minimum value, and the preset number of bins. This method is simple and easy to operate, and is suitable for to-be-processed data whose distribution is relatively concentrated.
FIG. 14 is a flowchart of a data binning processing method according to an exemplary embodiment.
Referring to FIG. 14, the data binning processing method provided by the embodiment of the present disclosure may include the following steps.
Step S141: Obtain the to-be-processed data together with its target binning mode and preset number of bins.
Step S142: Determine whether the data amount of the to-be-processed data is less than a preset threshold.
Step S143: If so, sort the to-be-processed data to generate third sorted data.
Step S144: Determine the KS of the third sorted data.
Step S145: Determine a maximum KS among the KS values of the third sorted data, and use the to-be-processed data corresponding to it as the fifth candidate segmentation point.
Step S146: Determine whether the data amounts of the to-be-processed data on the left side and on the right side of the fifth candidate segmentation point are greater than a preset data amount.
If the data amounts of the to-be-processed data on the left side and on the right side of the fifth candidate segmentation point are greater than the preset data amount, the to-be-processed data corresponding to a maximum KS is determined on each side as a further fifth candidate segmentation point and step S146 is performed again; otherwise, step S147 is performed.
Step S147: Determine whether the number of the fifth candidate segmentation points is less than the preset number of bins.
If the number of the fifth candidate segmentation points is less than the preset number of bins, step S148 is performed; otherwise, step S149 is performed.
Step S148: Determine that the fifth candidate segmentation points are the target quantile points.
Step S149: Determine the target quantile points according to the preset number of bins by using a dynamic programming method.
Step S1410: Obtain the binning result of the to-be-processed data according to the target quantile points.
In some embodiments, the technical solution provided by the embodiment shown in FIG. 14 may be used on a single node to complete the binning of data of a single attribute. If a data list includes data of multiple attributes, for example both age and score, the data in the list may also be distributed to multiple nodes by attribute, and the above method applied on the nodes simultaneously to complete the binning.
The technical solution provided by the embodiment shown in FIG. 14, on the one hand, bins the to-be-processed data based on the KS index, which handles continuous variables effectively and is more interpretable. On the other hand, it sorts the to-be-processed data according to the node memory and the data amount of the to-be-processed data on the node, which preserves running speed while making full use of node memory. Further, the method uses dynamic programming to find the target quantile points that meet the conditions, which saves running time.
FIG. 15 is a block diagram of a data binning processing apparatus according to an exemplary embodiment. Referring to FIG. 15, the apparatus 150 includes a data acquisition module 1501, a data distribution module 1502, a target quantile point determination module 1503, and a binning module 1504.
The data acquisition module 1501 may be configured to obtain the to-be-processed data together with its target binning mode and preset number of bins. The data distribution module 1502 may be configured to randomly distribute the to-be-processed data to N nodes if the data amount of the to-be-processed data is greater than or equal to a preset threshold, where N is a positive integer greater than 1. The target quantile point determination module 1503 may be configured to process the to-be-processed data on the N nodes according to the preset number of bins and using the target binning mode, so as to determine the target quantile points of the to-be-processed data. The binning module 1504 may be configured to bin the to-be-processed data according to the target quantile points to obtain a binning result.
In some embodiments, the target quantile point determination module 1503 shown in FIG. 15 may include a first candidate segmentation point determination sub-module, a first distribution sub-module, a first sorting sub-module, a global KS determination sub-module, and a first target quantile point determination sub-module.
The first candidate segmentation point determination sub-module may be configured to determine the first candidate segmentation points of the to-be-processed data if the target binning mode is the first binning mode. The first distribution sub-module may be configured to distribute the to-be-processed data to the N nodes in order according to the first candidate segmentation points. The first sorting sub-module may be configured to sort the to-be-processed data on each node after the ordered distribution to obtain first sorted data in each node. The global KS determination sub-module may be configured to obtain the global KS of the to-be-processed data according to the first sorted data in each node. The first target quantile point determination sub-module may be configured to determine the target quantile points according to the global KS of the to-be-processed data.
In some embodiments, the first candidate segmentation point determination sub-module may include a second sorting unit, a first pre-segmentation point determination unit, and a first candidate segmentation point determination unit.
The second sorting unit may be configured to sort the to-be-processed data on each node respectively to obtain second sorted data in each node. The first pre-segmentation point determination unit may be configured to perform equal-frequency division on each second sorted data according to the number N of nodes to obtain first pre-segmentation points on each node. The first candidate segmentation point determination unit may be configured to determine the first candidate segmentation points according to the first pre-segmentation points.
In some embodiments, the first target quantile point determination sub-module 035 shown in FIG. 15 may include a second candidate segmentation point determination unit and a target quantile point determination unit.
The second candidate segmentation point determination unit may be configured to determine the second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the to-be-processed data. The target quantile point determination unit may be configured to determine the target quantile points among the second candidate segmentation points according to the preset number of bins.
在一些实施例中,第二候选切分点确定单元可以包括最大KS确定子单元和二分子单元。In some embodiments, the second candidate segmentation point determination unit may include a maximum KS determination subunit and a binary unit.
其中,最大KS确定子单元可以配置为在所述全局KS中确定一个最大KS,将其对应的待处理数据作为所述第二候选切分点;二分子单元,若所述第二候选切分点左侧和右侧的待处理数据的数据量大于预设数据量,则在所述第二候选切分点的左侧和右侧分别确定一个最大KS对应的待处理数据,以作为所述第二候选切分点。Wherein, the maximum KS determining subunit may be configured to determine a maximum KS in the global KS, and use its corresponding to-be-processed data as the second candidate segmentation point; a binary unit, if the second candidate segmentation If the data volume of the data to be processed on the left and right of the point is greater than the preset data volume, the data to be processed corresponding to the largest KS is determined on the left and right of the second candidate segmentation point, respectively, as the The second candidate segmentation point.
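One possible reading of the maximum KS and bisection subunits is sketched below in Python. Two points are interpretations rather than statements of the disclosure: the precomputed global KS values are reused on each side instead of being recomputed within the sub-range, and the size condition is checked for the left and right sides independently. The name `ks_candidates` and the parameter `min_size` (standing in for the preset data amount) are hypothetical.

```python
def ks_candidates(ks_values, min_size, lo=0, hi=None):
    """Collect candidate segmentation values by repeated bisection.

    ks_values: list of (value, ks) pairs in ascending value order.
    min_size:  a side is bisected further only if it holds more than
               min_size records.
    Returns the candidate values in ascending order.
    """
    if hi is None:
        hi = len(ks_values)
    if hi <= lo:
        return []

    # The record with the largest KS in [lo, hi) becomes a candidate.
    best = max(range(lo, hi), key=lambda i: ks_values[i][1])
    cuts = [ks_values[best][0]]

    # Bisect the left and right sides while they are still large enough.
    if best - lo > min_size:
        cuts += ks_candidates(ks_values, min_size, lo, best)
    if hi - best - 1 > min_size:
        cuts += ks_candidates(ks_values, min_size, best + 1, hi)
    return sorted(cuts)
```

A larger `min_size` stops the bisection earlier and therefore yields fewer candidates, which determines whether the later selection step has to reduce them to the preset number of bins.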
In some embodiments, the second target quantile determination unit may include a first judgment subunit, a second target quantile determination subunit, and a second target quantile determination subunit.
The first judgment subunit judges whether the number of second candidate segmentation points is less than the preset number of bins; the second target quantile determination subunit, if the number of second candidate segmentation points is less than the preset number of bins, determines that the second candidate segmentation points are the target quantiles; and the second target quantile determination subunit, if the number of second candidate segmentation points is greater than or equal to the preset number of bins, determines the target quantiles according to the preset number of bins by using a dynamic programming method.
In some embodiments, the device 150 shown in FIG. 15 may further include: a third sorting module, a KS determination module, a third candidate segmentation point determination module, a second judgment module, and a third target quantile determination module.
The third sorting module may be configured to sort the data to be processed to generate third sorted data if the data amount of the data to be processed is less than a preset threshold; the KS determination module may be configured to determine the KS of the third sorted data; the third candidate segmentation point determination module may be configured to determine third candidate segmentation points according to the KS of the third sorted data; the second judgment module may be configured to judge whether the number of third candidate segmentation points is greater than or equal to the preset number of bins; and the third target quantile determination module may be configured to, if the number of third candidate segmentation points is greater than or equal to the preset number of bins, determine the target quantiles according to the preset number of bins by using a dynamic programming method.
In some embodiments, the target quantile determination module 03 shown in FIG. 15 may further include: a fourth candidate segmentation point determination sub-module, a second allocation sub-module, a fourth sorted data acquisition sub-module, and a fourth target quantile determination sub-module.
The fourth candidate segmentation point determination sub-module may be configured to determine fourth candidate segmentation points of the data to be processed if the target binning mode is the second binning mode; the second allocation sub-module may be configured to allocate the data to be processed to the N nodes in order according to the fourth candidate segmentation points; the fourth sorted data acquisition sub-module may be configured to sort the data to be processed on each node after the ordered allocation, so as to obtain fourth sorted data on each node; and the fourth target quantile determination sub-module may be configured to determine the target quantiles in the fourth sorted data according to the preset number of bins.
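Assuming that the second binning mode is an equal-frequency mode (which this passage does not state explicitly), the last step can be sketched in Python as a lookup of global ranks in data that is already ordered across nodes. Nodes are simulated as sorted lists, `equal_frequency_points` is a hypothetical name, and each returned point is treated as the first value of the next bin, which is one convention among several.

```python
from bisect import bisect_right
from itertools import accumulate


def equal_frequency_points(sorted_nodes, n_bins):
    """Pick n_bins - 1 cut values from range-partitioned, sorted nodes.

    sorted_nodes: list of sorted lists; every value on node i is <= every
    value on node i + 1.
    """
    counts = [len(part) for part in sorted_nodes]
    ends = list(accumulate(counts))  # global rank just past each node
    total = ends[-1] if ends else 0
    if total == 0:
        return []

    points = []
    for k in range(1, n_bins):
        rank = (k * total) // n_bins      # 0-based global rank of the k-th cut
        node = bisect_right(ends, rank)   # first node whose end exceeds rank
        start = ends[node] - counts[node]
        points.append(sorted_nodes[node][rank - start])
    return points
```

Only the per-node counts are needed to locate each cut, so no node has to hold the full data set.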
In some embodiments, the fourth candidate segmentation point determination sub-module may include: a fifth sorting sub-module, a second pre-segmentation point determination sub-module, and a fourth candidate segmentation point sub-module.
The fifth sorting sub-module may be configured to sort the data to be processed on each node separately, so as to obtain fifth sorted data on each node; the second pre-segmentation point determination sub-module may be configured to perform equal-frequency division on each set of fifth sorted data according to the number N of nodes, so as to obtain second pre-segmentation points on each node; and the third candidate segmentation point sub-module may be configured to determine the fourth candidate segmentation points according to the second pre-segmentation points.
In some embodiments, the device 150 shown in FIG. 15 may further include: a node extremum acquisition module, a global extremum determination module, and a fifth target quantile determination sub-module.
The node extremum acquisition module may be configured to obtain the maximum value and the minimum value on each node if the target binning mode is the third binning mode; the global extremum determination module may be configured to determine the maximum value and the minimum value of the data to be processed according to the maximum value and the minimum value on each node; and the fourth target quantile determination sub-module determines the target quantiles according to the maximum value and the minimum value of the data to be processed and the preset number of bins.
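As a minimal sketch, and assuming that the third binning mode is an equal-width mode (an interpretation of the maximum/minimum procedure above), the three modules reduce to a per-node reduction followed by simple arithmetic. Nodes are simulated as Python lists and `equal_width_points` is a hypothetical name.

```python
def equal_width_points(nodes, n_bins):
    """Compute n_bins - 1 equally spaced cut values from per-node extrema."""
    # Each non-empty node reports only its local minimum and maximum.
    local_extrema = [(min(part), max(part)) for part in nodes if part]
    if not local_extrema:
        return []

    # The global extrema are the extrema of the local extrema.
    lowest = min(low for low, _ in local_extrema)
    highest = max(high for _, high in local_extrema)

    width = (highest - lowest) / n_bins
    return [lowest + k * width for k in range(1, n_bins)]
```

In this mode no sorting or redistribution is needed, since only two numbers per node have to be collected.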
Since the functional modules of the data binning processing device 150 of the exemplary embodiment of the present disclosure correspond to the steps of the exemplary embodiment of the data binning processing method described above, they are not described again here.
Reference is now made to FIG. 16, which shows a schematic structural diagram of a computer system 1600 suitable for implementing a terminal device according to an embodiment of the present application. The terminal device shown in FIG. 16 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 16, the computer system 1600 includes a central processing unit (CPU) 1601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1602 or a program loaded from a storage portion 1608 into a random access memory (RAM) 1603. The RAM 1603 also stores various programs and data required for the operation of the system 1600. The CPU 1601, the ROM 1602, and the RAM 1603 are connected to one another through a bus 1604. An input/output (I/O) interface 1605 is also connected to the bus 1604.
The following components are connected to the I/O interface 1605: an input portion 1606 including a keyboard, a mouse, and the like; an output portion 1607 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker; a storage portion 1608 including a hard disk and the like; and a communication portion 1609 including a network interface card such as a LAN card or a modem. The communication portion 1609 performs communication processing via a network such as the Internet. A drive 1610 is also connected to the I/O interface 1605 as needed. A removable medium 1611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1610 as needed, so that a computer program read from it can be installed into the storage portion 1608 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 1609, and/or installed from the removable medium 1611. When the computer program is executed by the central processing unit (CPU) 1601, the above-mentioned functions defined in the system of the present application are performed.
It should be noted that the computer-readable medium described in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and that module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and each combination of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application can be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including a sending unit, an acquiring unit, a determining unit, and a first processing unit. The names of these units do not, in some cases, constitute a limitation on the units themselves.
In another aspect, the present application also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist alone without being assembled into the device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by such a device, the device is enabled to implement functions including: obtaining data to be processed together with its target binning mode and preset number of bins; if the data amount of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1; processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine target quantiles of the data to be processed; and performing a binning operation on the data to be processed according to the target quantiles to obtain a binning result.
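The first and last of these functions can be sketched in Python as follows. This is only an illustration under stated assumptions: nodes are simulated as in-memory lists, the names `scatter` and `assign_bins` and the `seed` parameter are hypothetical, and a value equal to a quantile is placed in the higher bin, which is one convention among several. The quantile determination itself depends on the target binning mode and is not shown.

```python
import random
from bisect import bisect_right


def scatter(data, n_nodes, seed=None):
    """Randomly allocate every record to one of n_nodes simulated nodes."""
    rng = random.Random(seed)
    nodes = [[] for _ in range(n_nodes)]
    for record in data:
        nodes[rng.randrange(n_nodes)].append(record)
    return nodes


def assign_bins(values, quantiles):
    """Map each value to a bin index given ascending target quantiles.

    With q quantiles there are q + 1 bins, numbered 0 to q.
    """
    return [bisect_right(quantiles, value) for value in values]
```

Random allocation spreads the load evenly but leaves every node with an unordered sample of the full value range, which is why the first and second binning modes add an ordered reallocation step before sorting.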
From the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solutions of the embodiments of the present disclosure can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) and includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile terminal, a smart device, or the like) to execute the method according to the embodiments of the present disclosure, for example one or more of the steps shown in FIG. 2.
In addition, the above drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of the present disclosure and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of the processing. It is also easy to understand that the processing may be executed, for example, synchronously or asynchronously in multiple modules.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing what is disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not applied for in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the detailed structures, drawings, or implementation methods shown here; on the contrary, the present disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

  1. A data binning processing method, comprising:
    obtaining data to be processed together with its target binning mode and preset number of bins;
    if the data amount of the data to be processed is greater than or equal to a preset threshold, randomly allocating the data to be processed to N nodes, where N is a positive integer greater than 1;
    processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine target quantiles of the data to be processed; and
    performing a binning operation on the data to be processed according to the target quantiles to obtain a binning result.
  2. The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine the target quantiles of the data to be processed, comprises:
    if the target binning mode is a first binning mode, determining first candidate segmentation points of the data to be processed;
    allocating the data to be processed to the N nodes in order according to the first candidate segmentation points;
    sorting the data to be processed on each node after the ordered allocation, so as to obtain first sorted data on each node;
    obtaining the global KS of the data to be processed according to the first sorted data on each node; and
    determining the target quantiles according to the global KS of the data to be processed.
  3. The method according to claim 2, wherein determining the first candidate segmentation points of the data to be processed comprises:
    sorting the data to be processed on each node separately, so as to obtain second sorted data on each node;
    performing equal-frequency division on each set of second sorted data according to the number N of nodes, so as to obtain first pre-segmentation points on each node; and
    determining the first candidate segmentation points according to the first pre-segmentation points.
  4. The method according to claim 2, wherein determining the target quantiles according to the global KS of the data to be processed comprises:
    determining second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed; and
    determining the target quantiles among the second candidate segmentation points according to the preset number of bins.
  5. The method according to claim 4, wherein determining the second candidate segmentation points in the first sorted data on the N nodes according to the global KS of the data to be processed comprises:
    determining a maximum KS in the global KS, and taking the corresponding data to be processed as a second candidate segmentation point; and
    if the amount of data to be processed on the left side and on the right side of the second candidate segmentation point is greater than a preset data amount, determining, on the left side and on the right side of the second candidate segmentation point respectively, the data to be processed corresponding to a maximum KS, as further second candidate segmentation points.
  6. The method according to claim 4, wherein determining the target quantiles among the second candidate segmentation points according to the preset number of bins comprises:
    judging whether the number of second candidate segmentation points is less than the preset number of bins;
    if the number of second candidate segmentation points is less than the preset number of bins, determining that the second candidate segmentation points are the target quantiles; and
    if the number of second candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantiles according to the preset number of bins by using a dynamic programming method.
  7. The method according to claim 1, further comprising:
    if the data amount of the data to be processed is less than the preset threshold, sorting the data to be processed to generate third sorted data;
    determining the KS of the third sorted data;
    determining third candidate segmentation points according to the KS of the third sorted data;
    judging whether the number of third candidate segmentation points is greater than or equal to the preset number of bins; and
    if the number of third candidate segmentation points is greater than or equal to the preset number of bins, determining the target quantiles according to the preset number of bins by using a dynamic programming method.
  8. The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine the target quantiles of the data to be processed, further comprises:
    if the target binning mode is a second binning mode, determining fourth candidate segmentation points of the data to be processed;
    allocating the data to be processed to the N nodes in order according to the fourth candidate segmentation points;
    sorting the data to be processed on each node after the ordered allocation, so as to obtain fourth sorted data on each node; and
    determining the target quantiles in the fourth sorted data according to the preset number of bins.
  9. The method according to claim 8, wherein determining the fourth candidate segmentation points of the data to be processed comprises:
    sorting the data to be processed on each node separately, so as to obtain fifth sorted data on each node;
    performing equal-frequency division on each set of fifth sorted data according to the number N of nodes, so as to obtain second pre-segmentation points on each node; and
    determining the fourth candidate segmentation points according to the second pre-segmentation points.
  10. The method according to claim 1, wherein processing the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine the target quantiles of the data to be processed, further comprises:
    if the target binning mode is a third binning mode, obtaining the maximum value and the minimum value on each node;
    determining the maximum value and the minimum value of the data to be processed according to the maximum value and the minimum value on each node; and
    determining the target quantiles according to the maximum value and the minimum value of the data to be processed and the preset number of bins.
  11. A data binning processing apparatus, comprising:
    a data acquisition module, configured to obtain data to be processed together with its target binning mode and preset number of bins;
    a data allocation module, configured to randomly allocate the data to be processed to N nodes if the data amount of the data to be processed is greater than or equal to a preset threshold, where N is a positive integer greater than 1;
    a target quantile determination module, configured to process the data to be processed on the N nodes according to the preset number of bins and using the target binning mode, so as to determine target quantiles of the data to be processed; and
    a binning module, configured to perform a binning operation on the data to be processed according to the target quantiles to obtain a binning result.
  12. An electronic device, comprising:
    one or more processors; and
    a storage device for storing one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-10.
  13. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-10.
PCT/CN2019/100804 2019-06-12 2019-08-15 Data binning processing method and apparatus, electronic device and computer-readable medium WO2020248356A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910504964.2 2019-06-12
CN201910504964.2A CN110245140B (en) 2019-06-12 2019-06-12 Data binning processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
WO2020248356A1 true WO2020248356A1 (en) 2020-12-17

Family

ID=67886711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/100804 WO2020248356A1 (en) 2019-06-12 2019-08-15 Data binning processing method and apparatus, electronic device and computer-readable medium

Country Status (2)

Country Link
CN (1) CN110245140B (en)
WO (1) WO2020248356A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311599B (en) * 2020-01-17 2024-03-26 北京达佳互联信息技术有限公司 Image processing method, device, electronic equipment and storage medium
CN112667608B (en) * 2020-04-03 2022-01-25 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN112667741B (en) * 2020-04-13 2022-07-08 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN111506485B (en) * 2020-04-15 2021-07-27 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111507479B (en) * 2020-04-15 2021-08-10 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111242244B (en) * 2020-04-24 2020-09-18 支付宝(杭州)信息技术有限公司 Characteristic value sorting method, system and device
CN111611243B (en) * 2020-05-13 2023-06-13 第四范式(北京)技术有限公司 Data processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070185896A1 (en) * 2006-02-01 2007-08-09 Oracle International Corporation Binning predictors using per-predictor trees and MDL pruning
CN108764273A (en) * 2018-04-09 2018-11-06 中国平安人寿保险股份有限公司 Data processing method and apparatus, terminal device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU, XIAOJIE: "Online loan overdue prediction based on Parallel Random Forest", MASTER THESIS, no. 2, 1 March 2016 (2016-03-01), pages 1 - 59, XP009524869, ISSN: 1674-0246 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114491416A (en) * 2022-02-23 2022-05-13 北京百度网讯科技有限公司 Characteristic information processing method and device, electronic equipment and storage medium
EP4134834A1 (en) * 2022-02-23 2023-02-15 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus of processing feature information, electronic device, and storage medium

Also Published As

Publication number Publication date
CN110245140A (en) 2019-09-17
CN110245140B (en) 2020-07-17

Similar Documents

Publication Publication Date Title
WO2020248356A1 (en) Data binning processing method and apparatus, electronic device and computer-readable medium
US11176487B2 (en) Gradient-based auto-tuning for machine learning and deep learning models
CN108549583B (en) Big data processing method and device, server and readable storage medium
CN103853618B (en) Resource allocation method with minimized cloud system cost based on expiration date drive
JP6199812B2 (en) System and method for performing parallel search on explicitly represented graphs
Chon et al. GMiner: A fast GPU-based frequent itemset mining method for large-scale data
CN108804383B (en) Support point parallel enumeration method and device based on measurement space
CN106909942B (en) Subspace clustering method and device for high-dimensionality big data
CN108334951A (en) Pre-statistics of data for nodes of a decision tree
US20220229701A1 (en) Dynamic allocation of computing resources
Kijsipongse et al. Dynamic load balancing on GPU clusters for large-scale K-Means clustering
CN105229608A (en) Based on the database processing towards array of coprocessor
CN112597126A (en) Data migration method and device
CN107958266A (en) MPI-based method for discretizing continuous attributes in parallel
WO2021208174A1 (en) Distributed-type graph computation method, terminal, system, and storage medium
CN113760638A (en) Log service method and device based on kubernets cluster
CN111597256A (en) Transaction asynchronous processing method
CN112667770A (en) Method and device for classifying articles
CN113780333A (en) User group classification method and device
CN114662777A (en) Photovoltaic module serial line arrangement determining method and device, electronic equipment and storage medium
Zaslavsky et al. Visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation
Lincoln et al. Cache-adaptive exploration: Experimental results and scan-hiding for adaptivity
CN107463541A (en) File difference comparative approach, storage medium, electronic equipment and system
CN108011735A (en) Community discovery method and device
CN106227600A (en) Energy-aware multidimensional virtual resource allocation method

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19933004; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 19933004; Country of ref document: EP; Kind code of ref document: A1