CN112836765A - Data processing method and device for distributed learning and electronic equipment


Info

Publication number: CN112836765A
Application number: CN202110233219.6A
Authority: CN (China)
Prior art keywords: data, sample, simulation, interval, determining
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN112836765B (en)
Inventors: 谭明超, 马国强, 范涛, 陈天健, 杨强
Current Assignee: WeBank Co Ltd
Original Assignee: WeBank Co Ltd
Application filed by WeBank Co Ltd
Priority to CN202110233219.6A
Publication of CN112836765A
Application granted
Publication of CN112836765B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06N: Computing arrangements based on specific computational models
    • G06N 20/00: Machine learning
    • Y02D: Climate change mitigation technologies in information and communication technologies (ICT), i.e. ICT aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application provides a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium, and a computer program product. The method includes: determining a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample feature extremes and sample counts of the sample feature data stored in each of a plurality of second devices; determining the total number of samples in each interval based on the number of sub-samples in each second device that fall into that interval; constructing simulation data in each interval based on the total number of samples in the interval and the simulation quantiles corresponding to it; and forming total simulation data from the per-interval simulation data and determining target quantiles based on the total simulation data. The method and apparatus protect the security of the sample feature data while obtaining the target quantiles quickly.

Description

Data processing method and device for distributed learning and electronic equipment
Technical Field
The present application relates to data processing technologies, and in particular, to a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the continuous development of big data and distributed technologies, feature binning needs to be performed on feature data in many fields. Feature binning is a technique for grouping data; each group may be called a bin. In the field of machine learning, continuous features can be discretized by binning, and the degree of correlation between features and labels can be examined based on the binning results. For example, information values and weights of evidence are derived from the binning results for feature-data preprocessing and feature selection.
In the related art, feature data is usually stored in a distributed manner across multiple parties, and feature binning needs to be performed jointly over the feature data of those parties. However, in such multi-party cooperation, each party may expose its locally stored feature data, creating a risk of data leakage.
Disclosure of Invention
The embodiments of the present application provide a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium, and a computer program product, which can protect the security of sample feature data and quickly obtain target quantiles.
The technical solution of the embodiments of the present application is implemented as follows:
the embodiment of the application provides a data processing method for distributed learning, which comprises the following steps:
determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extreme values and sample numbers of sample characteristic data stored in a plurality of second devices respectively;
determining a total number of samples in each interval based on the number of subsamples in each second device corresponding to said each interval;
constructing simulation data in each interval based on the total number of samples in each interval and the simulation quantile point corresponding to each interval;
forming total simulation data based on the simulation data in each interval, and determining a target quantile point based on the total simulation data;
sending the target quantiles to each of the second devices, so that each second device constructs a sample set based on the target quantiles and trains a machine learning model for performing a classification task based on the sample set.
An embodiment of the present application provides a data processing apparatus for distributed learning, including:
The simulation quantile determining module is used for determining a plurality of simulation quantiles and a plurality of corresponding intervals based on the sample characteristic extreme values and the sample quantity of the sample characteristic data stored in the second devices respectively;
an interval sample number determining module, configured to determine a total number of samples in each interval based on a number of sub-samples corresponding to each interval in each second device;
the simulation data construction module is used for constructing simulation data in each interval based on the total number of samples in each interval and the simulation quantile point corresponding to each interval;
the target quantile determining module is used for forming total simulation data based on the simulation data in each interval and determining a target quantile based on the total simulation data;
and the characteristic data processing module is used for sending the target quantile points to the second equipment so as to enable the second equipment to construct a sample set based on the target quantile points and train a machine learning model for performing a classification task based on the sample set.
In the foregoing solution, the simulation quantile determining module is further configured to determine a global sample feature extreme value and a global sample number of the global sample feature data based on a sample feature extreme value and a sample number of sample feature data stored in each of the plurality of second devices; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices, and the global sample feature extremum comprises a maximum value and a minimum value of the global sample feature data; determining an overall feature interval of the global sample feature data based on the global sample feature extremum; determining a distance interval based on a preset bin number and the global sample characteristic extreme value; carrying out equidistant partition processing on the overall characteristic interval based on the distance interval so as to determine a plurality of simulation quantiles and a plurality of corresponding intervals; wherein the distance interval is a difference between adjacent ones of the plurality of simulation quantiles.
In the above solution, the simulation data construction module is further configured to determine the feature data range of each interval based on the simulation quantiles corresponding to that interval; determine a simulation data distribution ratio based on the total number of samples in the interval and its feature data range, where the simulation data distribution ratio is the ratio of the difference between the simulation quantiles bounding the feature data range to the total number of samples; and construct uniformly distributed simulation data in each interval based on the simulation data distribution ratio, where the difference between adjacent simulation data equals that ratio.
In the above solution, the target quantile determining module is further configured to splice and fit the simulation data of the plurality of intervals based on the simulation quantiles to form total simulation data, the total simulation data being data with a specific order; determine a binning proportion and divide the total simulation data based on it to obtain a plurality of different bins, where each bin includes at least one piece of simulation data and the numbers of simulation data in different bins are consistent; and determine the quantiles corresponding to the plurality of different bins as the target quantiles.
In the foregoing solution, the data processing apparatus for distributed learning further includes: the parallel processing module is used for creating a plurality of tasks for obtaining the target quantile; the plurality of target quantile obtaining tasks are used for obtaining target quantiles of global sample characteristic data with different dimensions; wherein global sample feature data for each dimension characterizes data of the same feature, the global sample feature data comprising the sample feature data stored by each of the plurality of second devices; and executing a plurality of tasks for obtaining the target quantile points in parallel to obtain the target quantile points of the global sample characteristic data with different dimensions.
In the foregoing solution, the feature data processing module is further configured to send the target quantile point to each second device, so that each second device determines each sub-box of the sample feature data based on the target quantile point, and determines sub-positive and negative sample distributions respectively corresponding to each sub-box based on tag data of each stored sample feature data; determining total positive and negative sample distribution respectively corresponding to each sub-box based on the sub-positive and negative sample distribution sent by each second device; determining a feature index value corresponding to global sample feature data based on the total positive and negative sample distribution of each bin, wherein the feature index value corresponding to the global sample feature data is used for enabling each second device to execute the following operations: when the characteristic index value exceeds an index threshold value, constructing a sample set, and training a machine learning model for carrying out a classification task based on the sample set; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
In the foregoing solution, the sample feature data in the sample set carries a pre-labeled classification result, and the data processing apparatus for distributed learning further includes: the model training module is used for carrying out classification prediction on each sample characteristic data in the sample set through the machine learning model to obtain a prediction classification result of each sample characteristic data; calculating a loss value based on the difference between the pre-labeled classification result on each sample characteristic data and the predicted classification result; updating model parameters of the machine learning model based on the loss values.
An embodiment of the present application provides a data processing system for distributed learning, including a first device and a plurality of second devices, wherein:
the first device is configured to determine a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample feature extremes and sample counts of the sample feature data stored in each of the plurality of second devices; determine the total number of samples in each interval based on the number of sub-samples in each second device corresponding to that interval; construct simulation data in each interval based on the total number of samples in the interval and the simulation quantiles corresponding to it; form total simulation data based on the per-interval simulation data and determine target quantiles based on the total simulation data; and send the target quantiles to each second device;
the second device is used for determining a sample characteristic extreme value and a sample number of the stored sample characteristic data and sending the sample characteristic extreme value and the sample number to the first device; determining the number of sub-samples in each interval based on the simulation quantile point of the first equipment and the corresponding intervals, and sending the number of the sub-samples to the first equipment; and constructing a sample set based on the target quantile determined by the first equipment, and training a machine learning model for carrying out a classification task based on the sample set.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the data processing method for distributed learning provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the data processing method for distributed learning provided by the embodiments of the present application.
The embodiment of the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the data processing method for distributed learning provided by the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
a plurality of simulation quantiles and corresponding intervals are determined, and the numbers of samples that the second devices hold in those intervals are obtained in a single pass, so that simulation data can be constructed in each interval and the final target quantiles obtained;
each second device transmits only the extremes and counts of its sample feature data to the first device, never the feature data itself, which avoids the data leakage that would result from each second device providing its feature data when quantiles are computed over distributed data, thereby protecting data security to a certain extent;
when obtaining the target quantiles, a single round of transmission (one send and one receive) between the first device and the second devices, together with the construction of simulation data, replaces methods that continuously transmit intermediate data to recursively and iteratively compute the target quantiles; this reduces the complexity of data processing, improves its efficiency, and allows the target quantiles to be obtained quickly and accurately.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a distributed learning data processing system provided by an embodiment of the present application;
fig. 2 is an alternative structural schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 3A is an alternative flowchart of a data processing method for distributed learning provided by an embodiment of the present application;
fig. 3B is an alternative flowchart of a data processing method for distributed learning provided by an embodiment of the present application;
fig. 4 is an alternative flowchart of a data processing method for distributed learning provided by an embodiment of the present application;
fig. 5A is an alternative schematic diagram of a data processing method for distributed learning provided by an embodiment of the present application;
fig. 5B is an alternative schematic diagram of a data processing method for distributed learning provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered limiting; all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where the terms "first", "second", and "third" appear in the specification, they are used merely to distinguish between similar items and do not indicate a particular ordering of items. It should be understood that "first", "second", and "third" may be interchanged in specific order or sequence where appropriate, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Binning: sorting raw data and dividing split points according to some rule, so that values between two split points are classified into one bin. In the field of machine learning, binning continuous features discretizes them and yields Weight of Evidence (WOE), Information Value (IV) and similar indices for feature preprocessing and selection; training a model on discretized features can also accelerate model iteration and effectively improve the model's robustness and interpretability.
Common binning methods include equidistant binning, equal-frequency binning and optimal binning. Equidistant binning: after the data is sorted, the maximum and minimum values are found and split points are placed at equal distances between them. Equal-frequency binning: after binning, the number of data in each bin is approximately equal. Optimal binning: split points are chosen so that an evaluation index, such as the IV value or a chi-square test, is optimal after binning. (The two simple rules are sketched in code after this term list.)
2) Quantile: a numerical point that divides the probability-distribution range of a random variable into equal parts. In this application, quantiles represent the split points of feature binning.
3) Weight of Evidence (WOE): an index for evaluating feature data, used to measure the difference between the normal-sample distribution and the default-sample distribution.
4) Information Value (IV): an index for evaluating feature data, used to measure the predictive power of a feature. (The standard formulas for WOE and IV are given after this list.)
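For illustration only (this sketch is not part of the patent text; the function names and sample data are invented here), the two simple binning rules can be written as:

```python
def equidistant_bin_points(data, n_bins):
    # Equidistant binning: split points evenly spaced between min and max.
    lo, hi = min(data), max(data)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]

def equal_frequency_bin_points(data, n_bins):
    # Equal-frequency binning: take the value at each 1/n_bins rank
    # boundary so that every bin holds roughly the same count.
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]

ages = [23, 45, 31, 52, 27, 39, 61, 30, 48, 35]
print(equidistant_bin_points(ages, 4))      # [32.5, 42.0, 51.5]
print(equal_frequency_bin_points(ages, 4))  # [30, 39, 48]
```

The standard definitions of WOE and IV (the notation below is chosen here; the patent itself gives no formulas) are, with p_i and n_i the positive and negative counts in bin i and P and N the corresponding totals:

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{p_i/P}{n_i/N}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{p_i}{P} - \frac{n_i}{N}\right)\mathrm{WOE}_i
```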
Distributed learning has achieved breakthroughs in many application fields. However, in the process of implementing the present application, the applicant found that, because each participant in distributed learning holds its own feature data, when binning that feature data, that is, when obtaining its split points, it is impossible to balance gathering all the feature data to compute accurate split points against protecting each party's data privacy.
Based on this, embodiments of the present application provide a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium, and a computer program product, which avoid the data leakage caused by each second device providing its feature data when quantiles are obtained over distributed data, protect data security to a certain extent, and ensure that the final target quantiles are obtained quickly and accurately.
The data processing method for distributed learning provided by the embodiment of the present application may be implemented by various types of electronic devices, such as a terminal, a server, or a combination of the two.
First, the data processing system for distributed learning provided by the embodiments of the present application is described. An exemplary system is described below, taking as an example a deployment in which several servers cooperate to implement the data processing method for distributed learning provided by the embodiments of the present application. Referring to fig. 1, fig. 1 is a schematic diagram of an alternative structure of a data processing system 100 for distributed learning according to an embodiment of the present application.
As shown in fig. 1, the first device 200 is connected to a second device 400 (exemplary second devices 400-1 and 400-2 are shown) via a network 300, and the network 300 may be a wide area network or a local area network, or a combination thereof, and uses a wireless link to realize data transmission.
As an example, the second devices 400-1 and 400-2 transmit the sample feature extremum and the number of samples of the respective stored sample feature data to the first device 200; the first device 200 receives the sample characteristic extreme value and the sample number, determines a plurality of simulation quantiles and a plurality of corresponding intervals, and sends the plurality of simulation quantiles to the second devices 400-1 and 400-2; after receiving the plurality of simulation quantiles, the second devices 400-1 and 400-2 determine the number of samples in the corresponding intervals respectively based on the simulation quantiles, and send the number of sub-samples in each interval to the first device 200; after receiving the number of the sub-samples in each interval, the first device 200 determines the total number of the samples in each interval, constructs simulation data in each interval based on the total number of the samples in each interval and the simulation quantile corresponding to each interval, forms total simulation data based on the simulation data in each interval, determines a target quantile by using the total simulation data, and sends the target quantile to the second devices 400-1 and 400-2; after the second devices 400-1 and 400-2 receive the target quantiles, a sample set is constructed based on the target quantiles, and a machine learning model for performing a classification task is trained based on the sample set.
It should be noted that 400-1 and 400-2 are two examples of the second device, and in practical implementation, the first device 200 may perform data transmission with multiple second devices 400 to implement the data processing method for distributed learning provided in the embodiment of the present application.
In some embodiments, the first device 200 and the second device 400 may be independent physical servers, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. The network 300 may be a wide area network or a local area network, or a combination of both. The first device 200 and the second device 400 may be directly or indirectly connected through wired or wireless communication, and the embodiments of the present application are not limited thereto.
Next, an electronic device for implementing the data processing method for distributed learning provided in the embodiment of the present application is described, and in practical applications, the electronic device may be implemented as the first device 200 and the second device 400 (shown in fig. 1 by 400-1 and 400-2) shown in fig. 1.
Taking the electronic device as the first device 200 shown in fig. 1 as an example, the first device (as the active party) and the second devices (as participants) can be applied in a distributed learning scenario to perform joint feature binning and data modeling. An exemplary application is a horizontal federated learning scenario, in which the first device acts as the active party and a plurality of second devices act as participants: the participants provide the extremes and counts of their sample feature data, the active party leads the feature binning to obtain the target quantiles, and the participants are combined to jointly analyze the feature data and train a machine learning model. Referring to fig. 2, fig. 2 is an alternative structural schematic diagram of an electronic device (implemented as a first device 200) provided in an embodiment of the present application. The first device 200 shown in fig. 2 includes: at least one processor 210, at least one network interface 220, and a memory 230. The various components in the first device 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among these components; in addition to a data bus, it includes a power bus, a control bus, and a status-signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 230 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 230 optionally includes one or more storage devices physically located remotely from processor 210.
Memory 230 includes volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 230 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 230 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
An operating system 231 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 232 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
in some embodiments, the apparatus for data processing of distributed learning provided in the embodiments of the present application may be implemented in the first device 200 by using software, and fig. 2 shows a data processing apparatus 233 for distributed learning stored in the memory 230, which may be software in the form of a computer program, a plug-in, and the like. The data processing apparatus 233 for distributed learning includes the following software modules: a simulated quantile determination module 2331, an interval sample number determination module 2332, a simulated data construction module 2333, a target quantile determination module 2334 and a feature data processing module 2335. These modules may be logical functional modules and thus may be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the data processing apparatus for distributed learning provided in this embodiment may be implemented in hardware, and for example, the data processing apparatus for distributed learning provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the data processing method for distributed learning provided in this embodiment, for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The data processing method for distributed learning provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the first device provided by the embodiment of the present application. Referring to fig. 3A, fig. 3A is an alternative flowchart of a data processing method for distributed learning according to an embodiment of the present application, which will be described with reference to the steps shown in fig. 3A.
In step 101, a plurality of simulation quantiles and a plurality of corresponding intervals are determined based on sample characteristic extreme values and the number of samples of sample characteristic data stored in each of the plurality of second devices.
In some embodiments, the sample characteristic data stored by each of the plurality of second devices may be in the same dimension. Here, the same dimension represents the same feature, that is, the sample feature data of the same dimension is data representing the same feature, and each of the second devices respectively possesses some sample feature data under the feature.
For example, the plurality of second devices may be servers of banking systems provided by a plurality of banks, and the data stored in the servers of the banking systems each include sample characteristic data of the characteristic "age".
In practical implementation, the features of the data owned by each second device are the same, while the user dimension may differ. For example, second device 1 holds data for the feature "age" of users 1 and 2, and second device 2 holds data for the feature "age" of users 3 and 4. The second devices perform joint feature binning, led by the first device, over the feature data they own. For example, each second device provides the first device with the sample feature extremes and sample count of its sample feature data in a certain dimension (feature), so that the first device obtains the data information of the feature data in the current dimension and carries out the method for obtaining target quantiles described below, achieving the purpose of feature binning.
In some embodiments, referring to fig. 3B, fig. 3B is an optional flowchart of the data processing method for distributed learning provided in the embodiment of the present application, and step 101 shown in fig. 3B may be implemented by step 1011 to step 1014, which will be described in conjunction with each step.
In step 1011, a global sample feature extremum and a global sample number of the global sample feature data are determined based on the sample feature extremum and the sample number of the sample feature data stored by each of the plurality of second devices.
Here, the global sample feature data includes sample feature data stored in each of the plurality of second devices, the sample feature extremum includes a maximum value and a minimum value of the sample feature data, and the global sample feature extremum includes a maximum value and a minimum value of the global sample feature data.
In practical implementation, the plurality of second devices are sorted according to the respective stored sample characteristic data to obtain the maximum value and the minimum value of the corresponding sample characteristic data and the number of samples. Here, the sample feature data sorting may be ascending sorting or descending sorting according to the size of the sample feature data, or priority sorting according to the level of the sample feature data.
In some embodiments, the first device compares the maximum value and the minimum value based on the maximum value and the minimum value of the sample feature data stored in each of the plurality of second devices to obtain the maximum value and the minimum value of the corresponding global sample feature data; and the first equipment carries out accumulation processing on the plurality of sample numbers based on the sample numbers of the sample characteristic data stored by the plurality of second equipment respectively to obtain the global sample number of the corresponding global sample characteristic data.
In step 1012, an overall feature interval of the global sample feature data is determined based on the global sample feature extremum.
In some embodiments, the range of the full sample feature data is determined based on the maximum value and the minimum value of the global sample feature data, resulting in an overall feature interval, and the overall feature interval takes the maximum value and the minimum value as the endpoints of the interval.
In step 1013, a distance interval is determined based on a preset bin number and the global sample feature extremum.
It should be noted that the bin number may be preset and may be adaptively adjusted according to the global sample count. Generally, to ensure that the data in each interval after equidistant binning is as regular as possible, a larger bin number can be chosen (for example, a bin-number threshold may be preset so that the actually configured bin number exceeds it). The feature data is then divided more finely, and the feature data within each interval can be treated as uniformly distributed within a set, acceptable error.
In practical implementation, the difference value between the maximum value and the minimum value of the global sample feature data is determined, and the ratio of the current difference value to the number of the bins is used as the distance interval of equidistant division.
In step 1014, an equidistant segmentation process is performed on the global feature interval based on the distance intervals to determine a plurality of simulated quantiles and a corresponding plurality of intervals.
In some embodiments, the overall characteristic interval is divided equidistantly according to the distance interval, so as to obtain a plurality of simulation quantiles and a plurality of corresponding intervals. Here, the difference between adjacent analog quantiles in the plurality of analog quantile points is the same, that is, the above distance interval.
In actual implementation, starting from the minimum value of the overall feature interval, the value reached after each accumulation of the distance interval is taken in turn as the next split point; a plurality of split points is thus determined in sequence, and the corresponding plurality of intervals is determined from those split points. The plurality of split points is then taken as the plurality of simulation quantiles.
It should be noted that the simulation quantile point may be a binning point within the overall feature interval, and may also include a global sample feature maximum value and a global sample feature minimum value. For example, if the simulated quantile points include the bin points and the maximum and minimum values of the global sample features in the global feature interval, the equidistant division processing on the global feature interval may be implemented as: determining 4 simulation quantiles "0", "15", "30", "40" to obtain corresponding 3 intervals (which can also be called a plurality of bins): 0 to 15, 15 to 30, 30 to 40.
It should be noted that the interval to which a simulation quantile belongs can be preset. For example, a simulation quantile may be set to belong to the interval that takes it as the interval's minimum value; following the example above, simulation quantile 15 would belong to the interval 15 to 30, and simulation quantile 30 to the interval 30 to 40. In other examples, a simulation quantile may instead be set to belong to the interval that takes it as the interval's maximum value, which is not repeated here.
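As a minimal sketch of steps 1011 to 1014 (for illustration only; the function and variable names are invented here, and only extremes and counts are exchanged, never raw feature data):

```python
def simulation_quantiles(device_stats, n_bins):
    # device_stats: one (local_min, local_max, local_count) tuple per
    # second device; only these statistics are shared with the first device.
    global_min = min(s[0] for s in device_stats)
    global_max = max(s[1] for s in device_stats)
    global_count = sum(s[2] for s in device_stats)
    step = (global_max - global_min) / n_bins          # distance interval
    # Simulation quantiles include both extremes, giving n_bins intervals.
    points = [global_min + step * i for i in range(n_bins + 1)]
    intervals = list(zip(points[:-1], points[1:]))
    return points, intervals, global_count

points, intervals, total = simulation_quantiles([(0, 35, 120), (5, 40, 80)], 4)
print(points)     # [0.0, 10.0, 20.0, 30.0, 40.0]
print(intervals)  # [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (30.0, 40.0)]
```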
In step 102, a total number of samples in each interval is determined based on the number of subsamples in each second device corresponding to each interval.
In actual implementation, for each section, the first device adds up the number of sub-samples corresponding to each section in each second device as the total number of samples in each section.
In step 103, simulation data in each interval is constructed based on the total number of samples in each interval and the simulation quantile point corresponding to each interval.
In some embodiments, the method of step 103 may be implemented by: determining a characteristic data range of a corresponding interval based on the simulation quantile corresponding to each interval; determining a distribution proportion of the simulation data based on the total sample number in each interval and the characteristic data range of the corresponding interval; the simulation data distribution proportion is the ratio of the difference value of the simulation quantile points corresponding to the characteristic data range to the total sample number; and constructing uniformly distributed simulation data in each interval based on the simulation data distribution proportion, wherein the difference value of adjacent simulation data is the simulation data distribution proportion.
In actual implementation, according to the distribution ratio of the simulation data, in a corresponding interval, numerical values corresponding to the distribution ratio of the simulation data are sequentially superimposed from an endpoint (minimum value of the interval) of the left interval, and a plurality of simulation data which are uniformly distributed in the corresponding interval are obtained.
For example, in the banking business, if the first device is a server of a bank lending system and the corresponding feature data is a loan amount in units of ten thousand yuan, then when the first device determines that there are 10 feature data in the interval (2, 3] (the interval excludes the value 2 and includes the value 3), the simulation data distribution ratio is determined to be 0.1, and the uniformly distributed simulation data are constructed as 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, and 3.
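A minimal sketch of this construction (illustrative only; the rounding is added merely to keep the printed floats tidy):

```python
def build_interval_simulation(left, right, count):
    # Simulation data distribution ratio: interval width over sample count.
    ratio = (right - left) / count
    # count uniformly spaced values in (left, right]; for (2, 3] with
    # 10 samples this yields 2.1, 2.2, ..., 3.0 as in the example above.
    return [round(left + ratio * k, 10) for k in range(1, count + 1)]

print(build_interval_simulation(2, 3, 10))
```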
Constructing uniformly distributed simulation data in each interval replaces obtaining the real sample feature data of each interval from the second devices. This reduces each second device's data-exposure risk, maintains its privacy, ensures data security to a certain extent, reduces the transmission complexity of sample feature data, and improves the efficiency of obtaining the target split points.
In step 104, total simulation data is formed based on the simulation data within each interval, and a target quantile is determined based on the total simulation data.
In some embodiments, the method of step 104 may be implemented as follows: splice and fit the simulation data of the multiple intervals based on the simulation quantiles to form total simulation data, and determine the target quantiles based on the total simulation data.
The total simulation data is data with a specific order. Since the simulation data within each interval are uniformly distributed, they have a specific sequence within that interval, and the splicing and fitting process joins the intervals end to end according to the simulation quantiles, splicing the simulation data of the plurality of intervals into the total simulation data.
Here, the end-to-end joining of intervals according to the simulation quantiles may be: if a simulation quantile is both the maximum value of a first interval and the minimum value of a second interval, the first interval is connected to the second interval, so that the data of the two intervals are directly spliced and fitted together. For example, if the plurality of intervals obtained by the first device are [1, 2] and (2, 3], and 20 uniformly distributed simulation data are constructed for [1, 2] and 10 for (2, 3], the two intervals are spliced at the simulation quantile "2": the latter 10 simulation data are appended after the first 20, giving 30 total simulation data.
In some embodiments, determining quantiles based on the total simulation data may be accomplished as follows: determine a binning proportion and divide the total simulation data based on it to obtain a plurality of different bins; then determine the split points corresponding to those bins as the target split points. Here, each bin includes at least one piece of simulation data, and the numbers of simulation data in different bins are consistent.
In practical implementation, the binning proportion is preset and is used to perform equal-frequency binning on the total simulation data. For example, assume the total simulation data consists of 50 values. If the binning proportion is 50%, the values ranked 25th and before form the first bin, the values ranked after the 25th form the second bin, and the 25th value is the split point, which determines the target quantile. If the binning proportion is 20%, the values ranked 10th and before form the first bin, those ranked 11th to 20th the second bin, those ranked 21st to 30th the third bin, those ranked 31st to 40th the fourth bin, and those ranked after the 40th the fifth bin; the values ranked 10th, 20th, 30th, and 40th serve as the four split points delimiting the five bins, and the target quantiles are thus determined.
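Continuing the sketch above (reusing the hypothetical build_interval_simulation helper), the splicing and the equal-frequency split can be written as:

```python
def target_quantiles(interval_sims, bin_fraction):
    # interval_sims: the per-interval simulation data in interval order;
    # splicing them end to end yields the ordered total simulation data.
    total = [x for sim in interval_sims for x in sim]
    step = max(1, int(len(total) * bin_fraction))
    # Every step-th ranked value is a split point (the final value,
    # which would only close the last bin, is excluded).
    return [total[i - 1] for i in range(step, len(total), step)]

sims = [build_interval_simulation(a, a + 1, 10) for a in range(5)]
print(target_quantiles(sims, 0.2))  # [1.0, 2.0, 3.0, 4.0]: four split points, five bins
```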
In other embodiments, determining quantiles based on the total simulation data may also be achieved as follows: determine a distance interval and divide the total simulation data equidistantly by it; starting from the minimum value, take the value reached after each accumulation of the distance interval as the next target quantile, determining a plurality of target quantiles in sequence and obtaining a plurality of bins from them. Here, the ratio of the difference between the maximum and minimum values of the global sample feature data to the bin number is used as the distance interval; for the bin number, refer to the description above, which is not repeated here.
The first device performs the feature binning on the total simulation data to quickly obtain the target split points, without using the second devices' real feature data for joint binning. This reduces the risks of data loss and leakage during transmission, avoids the complex processing of continuously transmitting data among multiple parties and continuously computing, comparing and updating split points, and improves the efficiency with which split points are obtained over distributed data.
In some embodiments, after step 104, the following scheme may also be performed: creating a plurality of tasks for obtaining target quantile points; the multiple target quantile obtaining tasks are used for obtaining target quantiles of global sample characteristic data with different dimensions; the global sample feature data of each dimension represents data of the same feature, and the global sample feature data comprise sample feature data stored by a plurality of second devices respectively; and executing a plurality of tasks for obtaining the target quantile points in parallel to obtain the target quantile points of the global sample characteristic data with different dimensions.
In practical implementations, the feature data stored by the plurality of second devices is typically multi-dimensional. For example, a second device may be a server of a bank lending system, a server of an online investment platform, or the like. Taking the bank lending system as an example, its server stores the loan information of each borrowing user, which typically includes features of multiple dimensions, such as: user name, mobile phone number, bank card number, age, address, loan amount, and overdue repayment. For the lending systems of a plurality of banks, features of the same dimensions are stored in each of them, and the first device (usually a third party, such as a risk-control platform or a loan-income evaluation platform) can perform feature binning on the features of each dimension separately to obtain the target quantiles of the corresponding dimension.
In the method, the quantiles of the global sample feature data with different dimensions are simultaneously obtained in a parallel processing mode, so that the solving time of the target quantile is saved, and the processing efficiency of the feature data is improved.
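A minimal sketch of such parallel processing (illustrative; obtain_target_quantiles stands in for the whole per-dimension pipeline described above and is a hypothetical helper, not defined in the patent):

```python
from concurrent.futures import ThreadPoolExecutor

def obtain_target_quantiles(feature_name):
    # Hypothetical wrapper: run simulation quantiles, per-interval counts,
    # simulation-data construction and equal-frequency splitting for one
    # feature dimension, returning its target quantiles.
    ...

features = ["age", "loan_amount", "income"]
with ThreadPoolExecutor() as pool:
    # One task per feature dimension, executed in parallel.
    results = dict(zip(features, pool.map(obtain_target_quantiles, features)))
```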
In step 105, the target quantile point is sent to each second device, so that each second device constructs a sample set based on the target quantile point, and trains a machine learning model for performing a classification task based on the sample set.
In some embodiments, step 105 may be implemented by sending the target quantile point to each second device, so that each second device determines each bin of the sample feature data based on the target quantile point, and determines sub positive and negative sample distributions respectively corresponding to each bin based on the tag data of the respective stored sample feature data; determining total positive and negative sample distribution respectively corresponding to each sub-box based on the positive and negative sample distribution sent by each second device; determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each sub-box; wherein the feature index value corresponding to the global sample feature data is used for causing each of the second devices to perform the following operations: when the characteristic index value exceeds an index threshold value, constructing a sample set, and training a machine learning model for carrying out a classification task based on the sample set; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
In actual implementation, the first device sends the final target quantiles of the global sample feature data to each second device. Each second device determines the bins of the sample feature data based on the target quantiles, distributes its stored sample feature data into the corresponding bins, determines from the label data whether each sample feature datum in each bin is a positive or negative sample, and counts the numbers of positive and negative samples in each bin as the bin's sub positive and negative sample distribution.
Here, the label data is reference data corresponding to current sample feature data or global sample feature data, and specifically, for features of the same dimension, the label data is generally used to distinguish whether the feature data belonging to the current feature is qualified or not, so as to distinguish whether the feature data belongs to a positive sample or a negative sample (the positive sample may be understood as a normal sample, and the negative sample may be understood as a default sample). The tag data is usually an empirical value, and is an evaluation index obtained through a large number of experiments and testing processes, for example, by taking a bank lending system as an example, for the feature of age, the corresponding tag data is usually 20 years old, which indicates that 20 years old is a qualified age index capable of participating in lending, if the age is less than 20 years old, the current sample feature data or the global sample feature data is determined as a negative sample, and if the age exceeds 20 years old, the current sample feature data or the global sample feature data is determined as a positive sample.
After the sub positive and negative sample distributions in each bin of each second device are determined, for each bin, the sub distributions sent by the second devices are aggregated to obtain the bin's total positive and negative sample distribution. Specifically, the positive-sample counts for the bin are accumulated as the total positive-sample count, and the negative-sample counts as the total negative-sample count, yielding the total positive and negative sample distribution of each bin. A feature index value corresponding to the global sample feature data is then determined based on the total positive and negative sample distributions of the bins, where the global sample feature data includes the sample feature data stored in each of the plurality of second devices.
Here, the feature index value is used to evaluate the feasibility of a corresponding feature (under which global sample feature data or sample feature data is provided), that is, whether the current global sample feature data can be used to train a machine learning model as a training sample or can be used to perform feature selection and feature processing subsequently. The characteristic index value may be an IV value, a WOE value, or the like.
In actual implementation, the IV value or the WOE value of the current bin is determined based on the total positive and negative sample distribution of each bin, and the IV values or the WOE values of all the bins are subjected to aggregation processing to obtain the IV value or the WOE value corresponding to the global sample characteristic data. Here, the aggregation process may be an accumulation process or a weighted sum process, which is not limited in the present application.
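For illustration, a sketch of this aggregation using the IV value as the feature index (standard IV formula; the per-device bin counts are the only values shared with the first device, and the handling of empty bins is a simplification assumed here):

```python
import math

def global_iv(per_device_bins):
    # per_device_bins: for each second device, a list of
    # (positive_count, negative_count) pairs, one pair per bin.
    n_bins = len(per_device_bins[0])
    pos = [sum(dev[i][0] for dev in per_device_bins) for i in range(n_bins)]
    neg = [sum(dev[i][1] for dev in per_device_bins) for i in range(n_bins)]
    P, N = sum(pos), sum(neg)
    iv = 0.0
    for p, n in zip(pos, neg):
        if p == 0 or n == 0:
            continue  # skip degenerate bins in this simplified sketch
        woe = math.log((p / P) / (n / N))   # per-bin Weight of Evidence
        iv += (p / P - n / N) * woe         # accumulate Information Value
    return iv

# Two devices, three bins each:
print(global_iv([[(30, 5), (20, 10), (5, 30)],
                 [(25, 8), (18, 12), (7, 25)]]))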
In some embodiments, when the feature metric value exceeds the metric threshold, each second device may use the respective stored sample feature data as a training sample, construct a sample set, and train a machine learning model for performing a classification task based on the sample set.
Here, the index threshold is a threshold for evaluating feasibility of a feature (global sample feature data or sample feature data is provided under the feature), and when the feature index value exceeds the index threshold, it is determined that the current feature data can be used for training of the machine learning model.
Here, the machine learning model may be a binary classification model or a multi-class classification model. In practical implementation, taking a binary classification model as an example: when the machine learning model is used for classification prediction of feature data, the sample feature data in the sample set carries pre-labeled classification results, and training can be implemented as follows. Each second device performs classification prediction on each sample feature data in the sample set through the machine learning model to obtain a predicted classification result for each sample feature data; a loss value is calculated based on the difference between the pre-labeled classification result and the predicted classification result of each sample feature data, and the model parameters of the machine learning model are updated based on the loss value.
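A minimal sketch of one such training step, assuming a logistic-regression binary classifier with cross-entropy loss (the model family, loss, and learning rate are assumptions the application leaves open):

    import numpy as np

    def train_step(w, b, X, y, lr=0.1):
        # X: sample feature data; y: pre-labeled classification results (0/1)
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))            # predicted results
        loss = -np.mean(y * np.log(p + 1e-9)
                        + (1 - y) * np.log(1 - p + 1e-9)) # cross-entropy loss
        grad_w = X.T @ (p - y) / len(y)                   # gradient of the loss
        grad_b = float(np.mean(p - y))
        return w - lr * grad_w, b - lr * grad_b, loss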
It should be noted that training of the machine learning model may be carried out independently by one second device, with the trained model then synchronized to the first device and the remaining second devices. Alternatively, it may be carried out jointly by the first device and the second devices: each second device trains the machine learning model on the sample set it constructed and sends the trained model to the first device; the first device aggregates the models trained by the second devices into a global machine learning model, and the second devices synchronously update their locally trained models with it.
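The application only states that the first device "aggregates" the trained models; a FedAvg-style weighted parameter average, sketched below, is one common realization (weighting by per-device sample count is an assumption):

    import numpy as np

    def aggregate_models(param_list, sample_counts):
        # param_list: parameter vectors trained by each second device
        # sample_counts: number of training samples on each device
        w = np.asarray(sample_counts, dtype=float)
        stacked = np.stack([np.asarray(p) for p in param_list])
        return (w[:, None] * stacked).sum(axis=0) / w.sum()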
In practical implementation, the above binary classification model can be applied in many real scenarios to perform binary prediction on feature data. For example, in consumer finance, binary prediction on a loan user's data supports credit evaluation, determining whether the user is in default or in good standing; in medical image recognition, binary prediction on human feature data supports lesion assessment, judging whether a patient is sick or healthy, and so on.
For example, the machine learning model may also be a risk control model, where the sample feature data in the sample set carries pre-labeled target risk evaluation results, and training the risk control model may be implemented as follows: risk prediction is performed on each sample feature data in the sample set through the risk control model to obtain a predicted risk evaluation result for each sample feature data, and the model parameters of the risk control model are updated based on the difference between the labeled target risk evaluation result and the predicted risk evaluation result of each sample feature data.
Here, the sample feature data in the sample set may include user data, and the risk control model may predict user credit based on the user data for intelligent risk control. The risk control model can serve many application scenarios, such as anti-fraud, whitelist pre-screening, pre-loan credit qualification, and post-loan early-warning scoring in the consumer finance industry.
Illustratively, if the model's training samples come from the loan systems of several banks and include feature data such as user name, gender, mobile phone number, income information, loan amount, and overdue status, the risk control model can classify this feature data to assess whether a user's identity is qualified, predict the risk of a loan amount exceeding a threshold, evaluate whether the user's loan credit score meets the standard, and so on.
According to the embodiments of this application, second devices holding feature data need not provide the complete feature data; constructing simulation data in place of the real feature data means that the second devices never expose their feature data to one another, while the binning that produces the quantile points is still performed jointly over the multi-party distributed data. This preserves the data privacy of each second device, better protects data security, and is particularly valuable in application scenarios with strict data-privacy requirements. Moreover, when obtaining the target quantile points, the required data is transmitted only once (one send/receive) between the first device and the second devices, and the quantile points are derived from the constructed simulation data; this avoids the continuous transmission of intermediate data and the computational and communication burden of deriving target quantile points by repeated recursion and iteration, improves data processing efficiency, and ensures the target quantile points are obtained quickly and accurately.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Distributed data processing is commonly applied in federated learning scenarios: it is led by an active party, which joins multiple participants in feature data processing and model training. In a typical horizontal federated learning scenario, the participants hold feature data with the same user features but not entirely the same users, and the active party coordinates the participants in feature analysis and model training based on this data. When analyzing the feature data, in order to ensure the security of each participant's data, an effective, fast, and secure method for obtaining feature-data quantile points is needed to facilitate the analysis and processing of the feature data.
The present application is applied with a first device (active party) and multiple second devices (participants) in a horizontal federated learning scenario. For example, the method provided by the embodiments can be applied to a loan platform: the platform's control center acts as the active party and several banks act as participants. Each bank provides the extremes and count of its sample feature data (which may represent a user's personal information and loan data, such as user name, gender, mobile phone number, income information, loan amount, and overdue status); the active party performs feature binning on this data to obtain target quantile points and binning results, and computes feature index values from the binning results to judge the value of the feature data. The active party then coordinates the participants in federated modeling over the feature data to obtain a risk control model, which is jointly applied on the loan platform to evaluate user credit or predict risk.
In some practical scenarios, for distributed data without privacy protection, a distributed GK-summary algorithm may be used: each node maintains a data structure called a summary, and the summaries are merged centrally to compress and order the original data, from which the quantile points are computed and the binning completed. In a horizontal federated learning scenario, however, a common practical problem arises: each participant holds its own feature data, making cross-party value comparisons difficult, and if every participant were to hand over all its feature data, a large amount of data would leak, seriously compromising data privacy.
The embodiments of this application provide an implementation of feature binning in horizontal federated learning.
In horizontal federated learning, a row of the data matrix (which may take tabular form) represents the feature data of one user, and a column represents one data feature (or the label). The participants in horizontal federated learning hold the same data features but differ in the user dimension.
[Table 1: image not reproduced; columns: mobile phone number/device number, age, income, transaction times, whether there is overdue]
TABLE 1
For example, refer to Table 1, an optional schematic table describing distributed feature data. Table 1 shows the feature data of the second device 1 (hereinafter party A): each column represents feature information in one dimension and contains party A's sample feature data in that dimension. Party A's features include: mobile phone number/device number, age, income, transaction times, and whether there is overdue. Each row represents the feature data of one user; specifically, the first row of Table 1 represents the feature data of user 1 (mobile phone number/device number U1, age 28, income 20000, transaction times 10, overdue 1). Here, for the overdue feature, 1 means yes and 0 means no.
[Table 2: image not reproduced; same columns as Table 1]
TABLE 2
Refer to Table 2, another optional schematic table describing distributed feature data. Table 2 shows the feature data of the second device 2 (hereinafter party B); party B's features likewise include mobile phone number/device number, age, income, transaction times, and whether there is overdue. For example, the first row of Table 2 represents the feature data of user 4 (mobile phone number/device number U4, age 12, income 0, transaction times 3, overdue 0).
In some embodiments, for the feature of each dimension (for example, each column shown in Table 1 and Table 2), each participant holds feature data in the corresponding dimension, and the first device implements the distributed-learning data processing method provided by the embodiments of this application, that is, performs feature binning on the feature data of each dimension to obtain the target quantile points.
In some embodiments, referring to fig. 4, an optional flowchart of distributed-learning data processing provided by the embodiments of this application, the method may be implemented cooperatively by the first device (hereinafter a trusted third party) and the second devices (hereinafter participants), specifically through steps 401 to 411, described below.
Step 401: the participants sort their respective sample feature data to obtain the sample feature extremes and the sample count of the sample feature data.
Here, the sample characteristic extremum includes a maximum value and a minimum value of the sample characteristic data.
The sample feature data stored by each participant is feature data in the same dimension; this will not be repeated below.
In practical implementation, referring to fig. 5A, an alternative schematic diagram of the distributed-learning processing method: participant 1, participant 2, and participant 3 each count the maximum and minimum values of their own sample feature data and the number of samples of the corresponding feature.
For example, for the "income" feature: participant 1 holds 25 sample feature data, and after sorting them its minimum income is 6 and its maximum is 11; participant 2 holds 10 sample feature data, with minimum income 8 and maximum 14; participant 3 holds 45 sample feature data, with minimum income 0 and maximum 10. Here, the unit of the sample feature data for the "income" feature is ten thousand dollars.
Step 402: the multiple participants send respective sample feature extrema and sample numbers to a trusted third party (arbiter).
In actual implementation, referring to fig. 5A, the multiple participants send the maximum value, the minimum value, and the number of samples of the respective sample feature data to a trusted third party (hereinafter referred to as a third party).
For example, continuing the example: participant 1 sends 6, 11, 25; participant 2 sends 8, 14, 10; participant 3 sends 0, 10, 45.
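On the participant side, steps 401-402 reduce to the following sketch (the function name is illustrative):

    def local_summary(samples):
        # step 401: sort the local sample feature data and keep only the
        # extremes and the count; raw values are never transmitted
        s = sorted(samples)
        return s[0], s[-1], len(s)

    # step 402: participant 1 above would send (6, 11, 25)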
Step 403: the third party receives the sample feature extremes and sample counts and aggregates them to obtain the global sample feature extremes and the global sample count of the global sample feature data.
Continuing the example, after receiving the data sent by each participant, the third party compares and aggregates it: the global sample feature extremes of the corresponding feature (that is, the minimum and maximum values of the global sample feature data) are 0 and 14, and the global sample count is 80.
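The third party's step 403 is then a simple aggregation, sketched below:

    def global_summary(summaries):
        # summaries: (min, max, count) triples received from the participants
        mins, maxs, counts = zip(*summaries)
        return min(mins), max(maxs), sum(counts)

    # global_summary([(6, 11, 25), (8, 14, 10), (0, 10, 45)]) -> (0, 14, 80)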
It should be noted that the global sample feature data is composed of the sample feature data stored by all of the participant devices.
Step 404: the third party determines the overall feature interval of the global sample feature data based on the global sample feature extremes, and performs equidistant binning on the overall feature interval to obtain a plurality of simulation quantile points and a plurality of corresponding intervals.
In some embodiments, the bin number is preset; the overall feature interval of the global sample feature data is determined based on the maximum and minimum values of the global sample feature data, and the overall feature interval is divided equidistantly based on the preset bin number to obtain a plurality of simulation quantile points and a plurality of corresponding intervals.
Here, the number of bins is set in advance; to keep the data of each interval after equidistant binning as uniformly distributed as possible within the acceptable error range, a larger bin number may be chosen so that the feature data is divided more finely.
In practical implementation, the difference between the maximum and the minimum is determined, and the ratio of this difference to the bin number is used as the equidistant step; starting from the minimum, the value reached by each accumulated step is taken as the next simulation quantile point, the simulation quantile points are determined in sequence, and the corresponding intervals are determined based on them.
For example, continuing the example with the bin number preset to 10: from the ratio 1.4 of the difference 14 between the maximum and the minimum of the global sample feature data to the bin number 10, 11 simulation quantile points are determined: 0.0, 1.4, 2.8, 4.2, 5.6, 7.0, 8.4, 9.8, 11.2, 12.6, 14, yielding the 10 corresponding intervals [0.0, 1.4], (1.4, 2.8], (2.8, 4.2], (4.2, 5.6], (5.6, 7.0], (7.0, 8.4], (8.4, 9.8], (9.8, 11.2], (11.2, 12.6], (12.6, 14].
It should be noted that the determined simulation quantile points may be only the interior split points of the overall feature interval, or may include the global sample feature extremes; in the above example, 11 simulation quantile points may be obtained, or only the 9 interior points excluding 0 and 14.
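A minimal sketch of the step 404 computation, here emitting both extremes along with the interior split points:

    def equidistant_quantiles(vmin, vmax, n_bins):
        # split [vmin, vmax] into n_bins equal-width intervals; the step is
        # (maximum - minimum) / bin number, accumulated from the minimum
        step = (vmax - vmin) / n_bins
        return [vmin + i * step for i in range(n_bins + 1)]

    # equidistant_quantiles(0, 14, 10)
    # -> [0.0, 1.4, 2.8, 4.2, 5.6, 7.0, 8.4, 9.8, 11.2, 12.6, 14.0]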
Step 405: the third party sends the plurality of simulated quantiles to a plurality of participants.
In practical implementation, referring to fig. 5B, an alternative schematic diagram of the distributed-learning processing method: the third party sends the plurality of simulation quantile points to the participants.
Step 406: after receiving the plurality of simulation quantiles, the plurality of participants collect and summarize the sample characteristic data based on the intervals corresponding to the simulation quantile points, and determine the number of the sub-samples of the sample characteristic data in each interval.
In actual implementation, as shown in fig. 5B, the participants collect and summarize their sample feature data according to the intervals defined by the simulation quantile points and determine the number of sub-samples of sample feature data in each interval.
For example, continuing the example, the participants collect and summarize the sample feature data falling in the interval (7.0, 8.4] defined by the simulation quantile points 7.0 and 8.4, and each determines its own number of sub-samples in that interval.
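On each participant, step 406 amounts to a histogram over the received split points; a sketch, using the (low, high] interval convention of the example:

    import bisect

    def count_per_interval(samples, quantiles):
        # count local values per interval (q[i], q[i+1]]; the lowest interval
        # also absorbs values equal to q[0], the global minimum
        counts = [0] * (len(quantiles) - 1)
        for x in samples:
            i = bisect.bisect_left(quantiles, x) - 1
            counts[max(i, 0)] += 1
        return counts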
Step 407: the plurality of participants send the number of subsamples within the plurality of intervals to a third party.
In actual implementation, as shown in fig. 5B, the participants send the sub-sample counts for the intervals to the third party.
For example, continuing the example, for the interval (7.0, 8.4]: participant 1 sends a sub-sample count of 2 to the third party, participant 2 sends a count of 2, and participant 3 sends a count of 3.
Step 408: the third party receives the number of sub-samples in a plurality of intervals and determines the total number of samples in each interval.
For example, continuing the example, for the interval (7.0, 8.4]: the third party receives the sub-sample counts sent by the participants and determines that the interval (7.0, 8.4] contains 7 data in total.
Step 409: the third party constructs uniformly distributed simulation data based on the total sample number and the corresponding simulation quantiles in each interval.
In actual implementation, as shown in fig. 5B, the third party constructs uniformly distributed simulation data in each interval.
For example, the ratio of the width of an interval (the difference between its simulation quantile points) to the total sample count in that interval is used as the simulation data distribution proportion, and the simulation data is constructed uniformly at equal intervals, with the difference between adjacent simulation data equal to the simulation data distribution proportion.
It should be noted that when the third party performs equidistant binning on the overall feature interval, the bin number can be adjusted flexibly to keep the data of each interval as regular as possible; since the feature range of each interval is small, uniformly distributed data constructed within the acceptable final error can stand in for the data actually distributed in the corresponding interval.
For example, continuing the example, for the interval (7.0, 8.4]: the ratio 0.2 of the width 1.4 to the total sample count 7 is used as the simulation data distribution proportion, and the simulation data in the interval is constructed in sequence as 7.2, 7.4, 7.6, 7.8, 8.0, 8.2, and 8.4.
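Step 409 for a single interval, sketched so that it reproduces the numbers above:

    def simulate_interval(q_low, q_high, count):
        # spacing = interval width / total sample count in the interval;
        # adjacent simulation data differ by exactly this spacing
        step = (q_high - q_low) / count
        return [q_low + step * (i + 1) for i in range(count)]

    # simulate_interval(7.0, 8.4, 7)
    # -> [7.2, 7.4, 7.6, 7.8, 8.0, 8.2, 8.4]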
Step 410: the third party splices the simulation data of all intervals to form the total simulation data.
Here, the simulation data in each interval is uniformly distributed and ordered by value; the simulation data of the intervals is spliced and fitted at the simulation quantile points to form the total simulation data.
Step 411: the third party determines the target quantile based on the total simulation data.
In some embodiments, the target quantile points are obtained by performing equal-frequency binning on the total simulation data.
In actual implementation, a binning proportion is determined, the total simulation data is divided according to the binning proportion to obtain a plurality of different bins, and the split points corresponding to the different bins are determined as the target quantile points.
Note that, in the equal-frequency binning processing, after the feature data is binned, the number of data in each bin is substantially equal.
For example, with the binning proportion set to 20%, the total simulation data is divided as follows: the first 20% of the data is the first bin and its split point is a target quantile point; the 20%-40% portion is the second bin with its split point as a target quantile point; proceeding at a 20% frequency yields 5 bins for the global sample feature data and 6 target quantile points that include the global sample feature extremes.
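A sketch of this equal-frequency cut over the (already sorted) total simulation data; the index rounding is an implementation assumption:

    def equal_frequency_quantiles(total_sim, ratio=0.2):
        # cut the sorted total simulation data every `ratio` fraction; the
        # cut values plus both extremes are the target quantile points
        n = len(total_sim)
        inner = [total_sim[round(k * ratio * (n - 1))]
                 for k in range(1, round(1 / ratio))]
        return [total_sim[0]] + inner + [total_sim[-1]]

With ratio 0.2 this yields 4 interior cuts plus the two extremes: the 6 target quantile points and 5 bins described above.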
In other embodiments, the total simulation data may be processed in other ways to obtain the target quantile points. Specifically, after a summary is constructed on a single side (a summary being a data structure for storing and maintaining feature data), the total simulation data may be subjected to equal-frequency binning to obtain the target quantile points. The target quantile points may also be obtained by performing equidistant binning on the total simulation data, or by an optimal-binning method.
In some embodiments, for the multiple features given in Tables 1 and 2 (represented as global sample feature data in multiple dimensions, the global sample feature data being composed of the sample feature data stored by each participant), quantile-point tasks can be constructed and executed in parallel to determine the target quantile points of the global sample feature data of the different dimensions.
In this way, the quantile points of global sample feature data of different dimensions are solved in parallel and simultaneously, which saves the solution time for the target quantile points and improves the processing efficiency of the feature data.
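One way to run the per-feature tasks in parallel, sketched with a thread pool (the concurrency mechanism is an assumption; solve_one stands for steps 401-411 applied to a single feature dimension):

    from concurrent.futures import ThreadPoolExecutor

    def quantiles_for_all_features(feature_names, solve_one):
        # one target-quantile task per feature dimension, executed in parallel
        with ThreadPoolExecutor() as pool:
            return dict(zip(feature_names, pool.map(solve_one, feature_names)))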
In some embodiments, after the target quantile points are obtained for the global sample feature data of each dimension, the feature index values of the feature data (which may be expressed as weight of evidence or information value) are determined based on the feature binning results corresponding to the target quantile points; the availability or value of the feature data is judged from the feature index values for subsequent feature preprocessing and feature selection, and the participants jointly train a machine learning model based on the available feature data. Here, the machine learning model may include a risk control model for risk assessment in many real scenarios.
In the embodiments of this application, no participant provides its complete feature data, which solves the data-leakage problem of common distributed split-point methods, and no complex recursion or iteration is required. When the third party and the participants exchange data, the required data is transmitted once, avoiding the continuous transmission of intermediate data and the computational and communication burden of obtaining the target quantile points through repeated recursion and iteration over intermediate data; this improves data processing efficiency, avoids repeated data transmission, and better protects data security.
Continuing with the exemplary structure of the data processing apparatus 233 for distributed learning provided by the embodiments of this application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the data processing apparatus 233 for distributed learning of the memory 230 may include:
a simulation quantile determining module 2331 configured to determine a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extrema and the number of samples of sample characteristic data stored by each of the plurality of second devices;
an interval sample number determining module 2332, configured to determine the total number of samples in each interval based on the number of sub-samples corresponding to each interval in each second device;
a simulation data constructing module 2333, configured to construct simulation data in each interval based on the total number of samples in each interval and the simulation quantiles corresponding to each interval;
a target quantile determining module 2334 for forming total simulation data based on the simulation data in each interval and determining target quantiles based on the total simulation data;
a feature data processing module 2335, configured to send the target quantile to each second device, so that each second device constructs a sample set based on the target quantile, and trains a machine learning model for performing a classification task based on the sample set.
In some embodiments, the analog quantile determining module 2331 is further configured to determine a global sample feature extremum and a global sample number for the global sample feature data based on the sample feature extremum and the sample number for the sample feature data stored by each of the plurality of second devices; the global sample feature data comprise sample feature data stored by the second devices respectively, and the global sample feature extreme value comprises the maximum value and the minimum value of the global sample feature data; determining a global characteristic interval of the global sample characteristic data based on the global sample characteristic extreme value; determining a distance interval based on the preset box number and the global sample characteristic extreme value; carrying out equidistant division processing on the overall characteristic interval based on the distance interval so as to determine a plurality of simulation quantile points and a plurality of corresponding intervals; wherein the distance interval is a difference between adjacent ones of the plurality of analog quantiles.
In some embodiments, the simulation data construction module 2333 is further configured to determine a feature data range for each interval based on the simulation quantile corresponding to the respective interval; determining a distribution proportion of the simulation data based on the total sample number in each interval and the characteristic data range of the corresponding interval; the simulation data distribution proportion is the ratio of the difference value of the simulation quantile points corresponding to the characteristic data range to the total sample number; and constructing uniformly distributed simulation data in each interval based on the simulation data distribution proportion, wherein the difference value of adjacent simulation data is the simulation data distribution proportion.
In some embodiments, the target quantile determination module 2334 is further configured to stitch fit the simulated data in the plurality of intervals based on the simulated quantiles to form total simulated data; wherein, the total analog data is data with a specific sequence; determining a binning proportion, and dividing the total analog data based on the binning proportion to obtain a plurality of different bins; the sub-boxes comprise at least one sub-simulation data, and the number of the sub-simulation data in different sub-boxes is consistent; and determining the corresponding quantiles of the plurality of different bins as target quantiles.
In some embodiments, the data processing apparatus for distributed learning further comprises: parallel processing module 2336 (not shown in fig. 2) for creating a plurality of targeted quantile tasks; the multiple target quantile obtaining tasks are used for obtaining target quantiles of global sample characteristic data with different dimensions; the global sample feature data of each dimension represents data of the same feature, and the global sample feature data comprise sample feature data stored by a plurality of second devices respectively; and executing a plurality of tasks for obtaining the target quantile points in parallel to obtain the target quantile points of the global sample characteristic data with different dimensions.
In some embodiments, the feature data processing module 2335 is further configured to send the target quantile point to each second device, so that each second device determines each bin of the sample feature data based on the target quantile point, and determines sub positive and negative sample distributions corresponding to each bin based on the tag data of the respective stored sample feature data; determining total positive and negative sample distribution respectively corresponding to each sub-box based on the positive and negative sample distribution sent by each second device; determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each sub-box; wherein the feature index value corresponding to the global sample feature data is used for causing each of the second devices to perform the following operations: when the characteristic index value exceeds an index threshold value, constructing a sample set, and training a machine learning model for carrying out a classification task based on the sample set; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
In some embodiments, the sample feature data in the sample set carries a pre-labeled classification result, and the data processing apparatus for distributed learning further includes: a model training module 2337 (not shown in fig. 2) configured to perform classification prediction on each sample feature data in the sample set through a machine learning model to obtain a prediction classification result of each sample feature data; calculating a loss value based on the difference between the pre-labeled classification result and the predicted classification result on each sample characteristic data; based on the loss value, model parameters of the machine learning model are updated.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the data processing method of distributed learning described above in the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform the method provided by embodiments of the present application, for example, the data processing method of distributed learning as shown in fig. 3A and 3B.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of this application, no participant provides its complete feature data, avoiding the data leakage that occurs when each second device supplies its feature data to obtain quantile points over distributed data, and protecting data security to a certain extent. When obtaining the target quantile points, the required data is transmitted (sent and received) once between the first device and the second devices, and the quantile points are obtained by constructing simulation data instead of continuously transmitting intermediate data; this replaces the recursive, iterative computation of target quantile points, reduces computational and transmission difficulty and the complexity of data processing, improves data processing efficiency, and ensures the target quantile points are obtained quickly and accurately.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (12)

1. A data processing method for distributed learning, applied to a first device, the method comprising:
determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extreme values and sample numbers of sample characteristic data stored in a plurality of second devices respectively;
determining a total number of samples in each interval based on the number of subsamples in each second device corresponding to said each interval;
constructing simulation data in each interval based on the total number of samples in each interval and the simulation quantile point corresponding to each interval;
forming total simulation data based on the simulation data in each interval, and determining a target quantile point based on the total simulation data;
sending the target quantile points to each of the second devices, so that each of the second devices constructs a sample set based on the target quantile points and trains a machine learning model for performing a classification task based on the sample set.
2. The method of claim 1, wherein determining a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample characteristic extremum and the number of samples of the sample characteristic data stored by each of the plurality of second devices comprises:
determining a global sample characteristic extreme value and a global sample number of the global sample characteristic data based on the sample characteristic extreme value and the sample number of the sample characteristic data respectively stored by the plurality of second devices; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices, and the global sample feature extremum comprises a maximum value and a minimum value of the global sample feature data;
determining an overall feature interval of the global sample feature data based on the global sample feature extremum;
determining a distance interval based on a preset bin number and the global sample characteristic extreme value;
carrying out equidistant partition processing on the overall characteristic interval based on the distance interval so as to determine a plurality of simulation quantiles and a plurality of corresponding intervals; wherein the distance interval is a difference between adjacent ones of the plurality of simulation quantiles.
3. The method of claim 1 or 2, wherein said constructing the simulation data in each of said intervals based on the total number of samples in each of said intervals and the corresponding simulation quantile point in each of said intervals comprises:
determining a characteristic data range of a corresponding interval based on the simulation quantile point corresponding to each interval;
determining a distribution proportion of the simulation data based on the total sample number in each interval and the characteristic data range of the corresponding interval; wherein, the simulation data distribution proportion is the ratio of the difference of the simulation quantile points corresponding to the characteristic data range to the total sample number;
and constructing uniformly distributed simulation data in each interval based on the simulation data distribution proportion, wherein the difference value of the adjacent simulation data is the simulation data distribution proportion.
4. The method of claim 1, wherein said forming total simulation data based on simulation data within said each interval and determining a target quantile based on said total simulation data comprises:
performing splicing fitting on the simulation data in the plurality of intervals based on the simulation quantile points to form total simulation data; wherein the total simulation data is data having a specific order;
determining a binning proportion, and dividing the total simulation data based on the binning proportion to obtain a plurality of different bins; the sub-boxes comprise at least one sub-simulation data, and the number of the sub-simulation data in the different sub-boxes is consistent;
and determining the corresponding quantiles of the plurality of different bins as target quantiles.
5. The method of claim 1, further comprising:
creating a plurality of tasks for obtaining target quantile points;
the plurality of target quantile obtaining tasks are used for obtaining target quantiles of global sample characteristic data with different dimensions; wherein global sample feature data for each dimension characterizes data of the same feature, the global sample feature data comprising the sample feature data stored by each of the plurality of second devices;
and executing a plurality of tasks for obtaining the target quantile points in parallel to obtain the target quantile points of the global sample characteristic data with different dimensions.
6. The method of claim 1, wherein the sending the target quantile to each of the second devices to cause each of the second devices to construct a sample set based on the target quantile and train a machine learning model for performing a classification task based on the sample set comprises:
sending the target quantile point to each second device, so that each second device determines each sub-box of the sample characteristic data based on the target quantile point, and determines sub-positive and negative sample distribution respectively corresponding to each sub-box based on label data of the sample characteristic data stored in each second device;
determining total positive and negative sample distribution respectively corresponding to each sub-box based on the sub-positive and negative sample distribution sent by each second device;
determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each sub-box;
wherein the feature index value corresponding to the global sample feature data is used for causing each of the second devices to perform the following operations:
when the characteristic index value exceeds an index threshold value, constructing a sample set, and training a machine learning model for carrying out a classification task based on the sample set;
wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
7. The method of claim 6, wherein the sample feature data in the sample set carries pre-labeled classification results, the method further comprising:
classifying and predicting each sample characteristic data in the sample set through the machine learning model to obtain a prediction classification result of each sample characteristic data;
calculating a loss value based on the difference between the pre-labeled classification result on each sample characteristic data and the predicted classification result;
updating model parameters of the machine learning model based on the loss values.
8. A data processing apparatus for distributed learning, comprising:
the simulation quantile determining module is used for determining a plurality of simulation quantiles and a plurality of corresponding intervals based on the sample characteristic extreme values and the sample quantity of the sample characteristic data stored in the second devices respectively;
an interval sample number determining module, configured to determine a total number of samples in each interval based on a number of sub-samples corresponding to each interval in each second device;
the simulation data construction module is used for constructing simulation data in each interval based on the total number of samples in each interval and the simulation quantile point corresponding to each interval;
the target quantile determining module is used for forming total simulation data based on the simulation data in each interval and determining a target quantile based on the total simulation data;
and the characteristic data processing module is used for sending the target quantile points to the second equipment so as to enable the second equipment to construct a sample set based on the target quantile points and train a machine learning model for performing a classification task based on the sample set.
9. A data processing system for distributed learning, comprising: a first device and a plurality of second devices; wherein,
the first device to:
determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extreme values and sample numbers of sample characteristic data stored in a plurality of second devices respectively;
determining a total number of samples in each interval based on the number of subsamples in each second device corresponding to said each interval;
constructing simulation data in each interval based on the total number of samples in each interval and the simulation quantile point corresponding to each interval;
forming total simulation data based on the simulation data in each interval, and determining a target quantile point based on the total simulation data;
sending the target quantile points to each second device;
the second device to:
determining a sample characteristic extreme value and a sample number of stored sample characteristic data, and sending the sample characteristic extreme value and the sample number to the first device;
determining the number of sub-samples in each interval based on the simulation quantile points and the corresponding intervals from the first device, and sending the number of sub-samples to the first device;
and constructing a sample set based on the target quantile determined by the first equipment, and training a machine learning model for carrying out a classification task based on the sample set.
10. A data processing apparatus for distributed learning, comprising:
a memory for storing executable instructions;
a processor for implementing the distributed learning data processing method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
11. A computer-readable storage medium storing executable instructions for implementing the distributed learning data processing method of any one of claims 1 to 7 when executed by a processor.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the data processing method of distributed learning of any one of claims 1 to 7.
CN202110233219.6A 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment Active CN112836765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110233219.6A CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110233219.6A CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Publications (2)

Publication Number Publication Date
CN112836765A true CN112836765A (en) 2021-05-25
CN112836765B CN112836765B (en) 2023-12-22

Family

ID=75934411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110233219.6A Active CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN112836765B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143233A1 (en) * 2019-01-07 2020-07-16 平安科技(深圳)有限公司 Method and device for building scorecard model, computer apparatus and storage medium
CN111506485A (en) * 2020-04-15 2020-08-07 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111950706A (en) * 2020-08-10 2020-11-17 中国平安人寿保险股份有限公司 Data processing method and device based on artificial intelligence, computer equipment and medium
CN112257873A (en) * 2020-11-11 2021-01-22 深圳前海微众银行股份有限公司 Training method, device, system, equipment and storage medium of machine learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何雯; 白翰茹; 李超: "基于联邦学习的企业数据共享探讨" [Discussion on Enterprise Data Sharing Based on Federated Learning], 信息与电脑(理论版) [Information & Computer (Theory Edition)], no. 08 *

Also Published As

Publication number Publication date
CN112836765B (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN109492945A (en) Business risk identifies monitoring method, device, equipment and storage medium
CN109409677A (en) Enterprise Credit Risk Evaluation method, apparatus, equipment and storage medium
CN110309840A (en) Risk trade recognition methods, device, server and storage medium
US20190180379A1 (en) Life insurance system with fully automated underwriting process for real-time underwriting and risk adjustment, and corresponding method thereof
Ekina et al. Application of bayesian methods in detection of healthcare fraud
CN110276552A (en) Risk analysis method, device, equipment and readable storage medium storing program for executing before borrowing
CN107729519B (en) Multi-source multi-dimensional data-based evaluation method and device, and terminal
CN110852881B (en) Risk account identification method and device, electronic equipment and medium
CN112800095A (en) Data processing method, device, equipment and storage medium
CN111815169A (en) Business approval parameter configuration method and device
CN111950625A (en) Risk identification method and device based on artificial intelligence, computer equipment and medium
CN113516417A (en) Service evaluation method and device based on intelligent modeling, electronic equipment and medium
CN116414815A (en) Data quality detection method, device, computer equipment and storage medium
CN108629508A (en) Credit risk sorting technique, device, computer equipment and storage medium
CN109255389A (en) A kind of equipment evaluation method, device, equipment and readable storage medium storing program for executing
CN109767333A (en) Select based method, device, electronic equipment and computer readable storage medium
CN114722789B (en) Data report integrating method, device, electronic equipment and storage medium
CN112836765A (en) Data processing method and device for distributed learning and electronic equipment
WO2022143431A1 (en) Method and apparatus for training anti-money laundering model
CN114254762A (en) Interpretable machine learning model construction method and device and computer equipment
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
CN113987351A (en) Artificial intelligence based intelligent recommendation method and device, electronic equipment and medium
CN106301880A (en) One determines that cyberrelationship degree of stability, Internet service recommend method and apparatus
CN111696637A (en) Quality detection method and related device for medical record data
CN111552814B (en) Assessment scheme generation method and device based on assessment index map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant