CN112836765B - Data processing method and device for distributed learning and electronic equipment - Google Patents

Data processing method and device for distributed learning and electronic equipment

Info

Publication number
CN112836765B
CN112836765B (application CN202110233219.6A)
Authority
CN
China
Prior art keywords
data
sample
simulation
interval
quantiles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110233219.6A
Other languages
Chinese (zh)
Other versions
CN112836765A (en)
Inventor
谭明超
马国强
范涛
陈天健
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202110233219.6A priority Critical patent/CN112836765B/en
Publication of CN112836765A publication Critical patent/CN112836765A/en
Application granted granted Critical
Publication of CN112836765B publication Critical patent/CN112836765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium and a computer program product. The method comprises the following steps: determining a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample feature extremum and the sample number of the sample feature data stored by each of a plurality of second devices; determining the total sample number in each interval based on the sub-sample number corresponding to each interval in each second device; constructing simulation data in each interval based on the total sample number in the interval and the simulation quantiles corresponding to the interval; forming total simulation data from the simulation data in each interval, and determining a target quantile based on the total simulation data. According to the method and apparatus, the security of the sample feature data is protected and the target quantile can be obtained quickly.

Description

Data processing method and device for distributed learning and electronic equipment
Technical Field
The present application relates to data processing technology, and in particular, to a data processing method, apparatus, electronic device, computer readable storage medium and computer program product for distributed learning.
Background
With the continuous development of big data and distributed technologies, feature data needs to be binned in many fields. Feature binning is a technique that divides a set of data into groups, each of which may be referred to as a bin. In the machine learning field, continuous features can be discretized by binning them, and the degree of correlation between a feature and the label can be examined based on the binning result, for example by computing information values (IV) and weights of evidence (WOE) for feature preprocessing and feature selection.
In the related art, feature data is usually stored in a distributed manner across multiple parties, and feature binning needs to be performed jointly on the feature data of all parties. However, when the related art performs such joint feature binning, each party exposes the feature data it stores, which creates a risk of data leakage.
Disclosure of Invention
The embodiments of the present application provide a data processing method and apparatus for distributed learning, an electronic device, a computer-readable storage medium and a computer program product, which can protect the security of sample feature data and obtain the target quantile quickly.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data processing method for distributed learning, which comprises the following steps:
determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample feature extremum and sample number of the sample feature data stored by the plurality of second devices respectively;
determining the total sample number in each interval based on the sub-sample number corresponding to each interval in each second device;
constructing simulation data in each interval based on the total sample number in each interval and the simulation quantiles corresponding to each interval;
forming total simulation data based on the simulation data in each interval, and determining a target quantile based on the total simulation data;
transmitting the target quantile to each of the second devices, so that each of the second devices constructs a sample set based on the target quantile and trains a machine learning model for performing classification tasks based on the sample set.
The embodiment of the application provides a data processing device for distributed learning, which comprises:
The simulation quantile determining module is used for determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extremum and sample quantity of sample characteristic data stored by a plurality of second devices respectively;
A section sample number determining module, configured to determine a total sample number in each section based on a sub-sample number corresponding to each section in each second device;
the simulation data construction module is used for constructing simulation data in each interval based on the total sample number in each interval and the simulation quantiles corresponding to each interval;
the target quantile determining module is used for forming total simulation data based on the simulation data in each interval and determining a target quantile based on the total simulation data;
and the characteristic data processing module is used for sending the target quantiles to each second device so that each second device builds a sample set based on the target quantiles and trains a machine learning model for performing classification tasks based on the sample set.
In the above scheme, the analog quantile determining module is further configured to determine a global sample feature extremum and a global sample number of the global sample feature data based on sample feature extremum and sample numbers of the sample feature data stored in each of the plurality of second devices; the global sample characteristic data comprise sample characteristic data stored by each of the plurality of second devices, and the global sample characteristic extremum comprises a maximum value and a minimum value of the global sample characteristic data; determining a global feature interval of the global sample feature data based on the global sample feature extremum; determining a distance interval based on a preset number of bins and the global sample feature extremum; performing equidistant partition processing on the overall characteristic interval based on the distance interval to determine a plurality of simulation quantiles and a plurality of corresponding intervals; wherein the distance interval is a difference between adjacent ones of the plurality of analog quantiles.
In the above scheme, the analog data construction module is further configured to determine a characteristic data range of the corresponding section based on the analog quantiles corresponding to each section; determining a simulation data distribution proportion based on the total sample number in each interval and the characteristic data range of the corresponding interval; the simulation data distribution proportion is the ratio of the difference value of the simulation quantiles corresponding to the characteristic data range to the total sample number; and constructing uniformly distributed simulation data in each interval based on the simulation data distribution proportion, wherein the difference value of adjacent simulation data is the simulation data distribution proportion.
In the above scheme, the target quantile determining module is further configured to perform a stitching fit on the simulation data in the multiple intervals based on the simulation quantile, so as to form total simulation data; wherein the total analog data is data having a specific order; determining a box division ratio, and dividing the total simulation data based on the box division ratio to obtain a plurality of different boxes; wherein the sub-boxes comprise at least one piece of sub-simulation data, and the sub-simulation data in different boxes are consistent in quantity; and determining the corresponding quantiles of the plurality of different bins as target quantiles.
In the above-mentioned scheme, the data processing device for distributed learning further includes: the parallel processing module is used for creating a plurality of tasks for solving target quantiles; the tasks for solving the target quantile are used for solving the target quantiles of the global sample characteristic data with different dimensions; wherein global sample feature data for each dimension characterizes data of the same feature, the global sample feature data comprising the sample feature data stored by each of the plurality of second devices; and executing a plurality of tasks for solving the target quantile in parallel to obtain the target quantile of the global sample characteristic data with different dimensions.
In the above scheme, the feature data processing module is further configured to send the target quantile to each second device, so that each second device determines each sub-bin of the sample feature data based on the target quantile, and determines sub-positive and negative sample distributions corresponding to each sub-bin based on tag data of each stored sample feature data; determining total positive and negative sample distribution corresponding to each sub-box respectively based on the sub-positive and negative sample distribution sent by each second device; determining a feature index value corresponding to global sample feature data based on the total positive and negative sample distribution of each bin, wherein the feature index value corresponding to the global sample feature data is used for enabling each second device to execute the following operations: when the characteristic index value exceeds an index threshold value, a sample set is constructed, and a machine learning model for performing classification tasks is trained based on the sample set; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
In the above solution, the sample feature data in the sample set carries a pre-labeled classification result, and the data processing apparatus for distributed learning further includes: the model training module is used for carrying out classification prediction on each sample characteristic data in the sample set through the machine learning model to obtain a prediction classification result of each sample characteristic data; calculating a loss value based on the difference between the pre-labeled classification result and the predicted classification result on each sample characteristic data; model parameters of the machine learning model are updated based on the loss values.
An embodiment of the present application provides a data processing system for distributed learning, including: a first device and a plurality of second devices; wherein,
the first device is used for determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extremum and sample quantity of sample characteristic data stored by the second devices respectively; determining the total sample number in each interval based on the sub-sample number corresponding to each interval in each second device; constructing simulation data in each interval based on the total sample number in each interval and the simulation quantiles corresponding to each interval; forming total simulation data based on the simulation data in each interval, and determining a target quantile based on the total simulation data; transmitting the target quantile to each of the second devices;
the second device is used for determining the sample feature extremum and sample number of its stored sample feature data and sending them to the first device; determining the number of sub-samples in each interval based on the simulation quantiles and the corresponding plurality of intervals received from the first device, and sending the sub-sample numbers to the first device; and constructing a sample set based on the target quantile determined by the first device, and training a machine learning model for performing classification tasks based on the sample set.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the data processing method for distributed learning provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the data processing method for distributed learning provided by the embodiments of the present application.
Embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the data processing method for distributed learning provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
a plurality of simulation quantiles and corresponding intervals are determined, and the number of samples falling into each interval is obtained from the second devices in a single round, so that simulation data can be constructed in each interval and the final target quantile can be obtained;
each second device only transmits the extremum and number of its sample feature data to the first device and never transmits the feature data itself, which avoids the data-leakage problem caused by each second device exposing feature data when split points are solved over distributed data, and thereby protects data security to a certain extent;
when obtaining the target quantile, the required data is transmitted (sent and received) between the first device and the second devices only once, and the approach of obtaining quantiles from constructed simulation data replaces continuously transmitting intermediate data and recursively or iteratively recomputing the target quantile; this reduces the complexity of data processing, improves data processing efficiency, and ensures that the target quantile is obtained quickly and accurately.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a data processing system for distributed learning provided by embodiments of the present application;
FIG. 2 is a schematic diagram of an alternative architecture of an electronic device provided in an embodiment of the present application;
FIG. 3A is a schematic flowchart of an alternative data processing method for distributed learning provided in an embodiment of the present application;
FIG. 3B is a schematic flowchart of an alternative data processing method for distributed learning provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of an alternative data processing method for distributed learning provided in an embodiment of the present application;
FIG. 5A is a schematic diagram of an alternative data processing method for distributed learning provided in an embodiment of the present application;
FIG. 5B is a schematic diagram of an alternative data processing method for distributed learning provided in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Where "first/second" and similar descriptions appear in this application, the following applies: the terms "first/second/third" merely distinguish similar objects and do not denote a particular ordering of the objects; it should be understood that "first/second/third" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Binning: after the original data are sorted, bin split points are determined by a certain rule, and the values falling between two split points are assigned to the corresponding bin. In the machine learning field, binning continuous features discretizes them and allows indicators such as the Weight of Evidence (WOE) and Information Value (IV) to be computed, so that features can be preprocessed and selected; training a model on the discretized features can also accelerate model iteration and effectively enhance the robustness and interpretability of the model.
Here, common binning methods include equidistant binning, equal-frequency binning, optimal binning, etc. Equidistant binning: after sorting the data, the maximum and minimum values are found, and the split points are spaced evenly between them. Equal-frequency binning: after binning, the number of data items in each bin is approximately equal. Optimal binning: an evaluation index such as the IV value or a chi-square test is used to score and optimize the binning (a minimal code sketch of the first two methods follows this term list).
2) Quantile (split point): a numerical point that divides the value range of a random variable into several parts; in this application, split points are used to characterize feature bins.
3) Weight of Evidence (WOE): an index for evaluating feature data, used to measure the difference between the distribution of normal samples and the distribution of default samples.
4) Information Value (IV): an index for evaluating feature data, used to measure the predictive power of a feature.
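As a concrete illustration of the first two binning methods mentioned above, the following is a minimal sketch; it is not taken from the patent, and the example data and function names are purely illustrative.

```python
# Contrast equidistant (equal-width) and equal-frequency binning on one feature column.
import numpy as np

def equidistant_bin_points(values, num_bins):
    # Split [min, max] into num_bins equally wide intervals.
    lo, hi = float(np.min(values)), float(np.max(values))
    step = (hi - lo) / num_bins
    return [lo + step * i for i in range(1, num_bins)]

def equal_frequency_bin_points(values, num_bins):
    # Choose split points so each bin holds roughly the same number of samples.
    levels = [i / num_bins for i in range(1, num_bins)]
    return list(np.quantile(values, levels))

ages = np.array([18, 22, 25, 25, 30, 31, 35, 40, 52, 70], dtype=float)
print(equidistant_bin_points(ages, 4))      # equal-width cut points
print(equal_frequency_bin_points(ages, 4))  # equal-count cut points
```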
While distributed learning has achieved breakthrough results in many application fields, the applicant found in the course of implementing the present application that, because each participant in distributed learning holds its own feature data, it is difficult when binning the feature data (i.e., solving for the quantiles of the feature data) to strike a balance between obtaining accurate quantiles over all of the feature data and protecting the data privacy of each participant.
Based on the above, the embodiments of the present application provide a data processing method, apparatus, electronic device, computer readable storage medium and computer program product for distributed learning, which can avoid the problem of data leakage caused by providing feature data by each second device when obtaining the quantile based on distributed learning, protect the data security to a certain extent, and ensure that the final target quantile is obtained quickly and accurately.
The data processing method for distributed learning provided in the embodiments of the present application may be implemented by various types of electronic devices, such as a terminal, a server, or a combination of both.
First, the data processing system for distributed learning provided in the embodiments of the present application is described. An exemplary data processing system for distributed learning is described below, taking servers cooperating with each other to implement the data processing method for distributed learning provided in the embodiments of the present application as an example. Referring to FIG. 1, FIG. 1 is a schematic diagram of an alternative architecture of a data processing system 100 for distributed learning provided in an embodiment of the present application.
As shown in fig. 1, the first device 200 is connected to a second device 400 (second devices 400-1 and 400-2 are shown as examples) through a network 300. The network 300 may be a wide area network or a local area network, or a combination of both, and data transmission is implemented using a wireless link.
As an example, the second devices 400-1 and 400-2 transmit the sample feature extremum and the number of samples of the respective stored sample feature data to the first device 200; the first device 200 receives the sample characteristic extremum and the sample number, determines a plurality of analog quantiles and a corresponding plurality of intervals, and transmits the plurality of analog quantiles to the second devices 400-1 and 400-2; after the second devices 400-1 and 400-2 receive the plurality of analog quantiles, determining the number of samples in the corresponding section based on the analog quantiles, respectively, and transmitting the number of sub-samples of each section to the first device 200; after receiving the number of sub-samples of each section, the first device 200 determines the number of total samples in each section, constructs analog data in each section based on the number of total samples in each section and the analog quantiles corresponding to each section, forms total analog data based on the analog data in each section, determines a target quantile using the total analog data, and transmits the target quantile to the second devices 400-1 and 400-2; after receiving the target quantiles, the second devices 400-1 and 400-2 construct a sample set based on the target quantiles and train a machine learning model for performing classification tasks based on the sample set.
It should be noted that 400-1 and 400-2 are two examples of the second device, and in actual implementation, the first device 200 may perform data transmission with a plurality of second devices 400 to implement the data processing method for distributed learning provided in the embodiments of the present application.
In some embodiments, the first device 200 and the second device 400 may be independent physical servers, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The network 300 may be a wide area network or a local area network, or a combination of both. The first device 200 and the second device 400 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
Next, an electronic device for implementing the data processing method for distributed learning provided in the embodiment of the present application will be described, where in practical application, the electronic device may be implemented as the first device 200 and the second device 400 shown in fig. 1 (shown in 400-1 and 400-2 in fig. 1).
Taking the electronic device as the first device 200 shown in FIG. 1 as an example, the first device (as the active party) and the second devices (as participants) can be applied to a distributed learning scenario to perform joint feature binning and joint data modeling. Applied to a horizontal federated learning scenario, the first device acts as the active party and the plurality of second devices act as participants: the participants provide the extremum and number of their sample feature data, the active party leads the feature binning and obtains the target quantile, and the participants jointly analyze the feature data and train a machine learning model. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of an electronic device (implemented as the first device 200) according to an embodiment of the present application. The first device 200 shown in FIG. 2 includes: at least one processor 210, at least one network interface 220, and a memory 230. The various components in the first device 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable connected communication between these components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to the data bus, but for clarity of illustration the various buses are labeled as bus system 240 in FIG. 2.
The processor 210 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (for example a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, discrete gate or transistor logic, a discrete hardware component, or the like.
Memory 230 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 230 optionally includes one or more storage devices that are physically remote from processor 210.
Memory 230 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 230 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 230 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 231, including system programs, e.g., a framework layer, a core library layer, a driver layer, etc., for handling various basic system services and performing hardware-related tasks;
network communication module 232 for reaching other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
in some embodiments, the apparatus for data processing for distributed learning provided in the embodiments of the present application may be implemented in a software manner in the first device 200, and fig. 2 shows a data processing apparatus 233 for distributed learning stored in the memory 230, which may be software in the form of a computer program, a plug-in, or the like. The data processing apparatus 233 of the distributed learning includes the following software modules: a simulation quantile determination module 2331, a section sample number determination module 2332, a simulation data construction module 2333, a target quantile determination module 2334, and a feature data processing module 2335. These modules may be logical functional modules, and thus may be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the data processing apparatus for distributed learning provided in the embodiments of the present application may be implemented in hardware, and by way of example, the data processing apparatus for distributed learning provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the data processing method for distributed learning provided in the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSP, programmable logic device (PLD, programmable Logic Device), complex programmable logic device (CPLD, complex Programmable Logic Device), field programmable gate array (FPGA, field-Programmable Gate Array), or other electronic component.
The data processing method for distributed learning provided in the embodiment of the present application will be described with reference to an exemplary application and implementation of the first device provided in the embodiment of the present application. Referring to fig. 3A, fig. 3A is a schematic flowchart of an alternative data processing method for distributed learning according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 101, a plurality of simulation quantiles and a corresponding plurality of intervals are determined based on the sample feature extremum and the number of samples of the sample feature data stored by each of the plurality of second devices.
In some embodiments, the sample feature data stored by each of the plurality of second devices may be of the same dimension. Here, sample feature data of the same dimension represent the same feature; that is, each second device holds some sample feature data under that feature.
For example, the plurality of second devices may be servers of banking systems provided by a plurality of banks, and the data stored in the servers of the plurality of banking systems each include sample feature data of the feature "age".
In actual implementation, the data held by the second devices have identical features but may differ in the user dimension. For example, the second device 1 holds data of the feature "age" for users 1 and 2, and the second device 2 holds data of the feature "age" for users 3 and 4. The second devices perform joint feature binning through the first device using the feature data they own: each second device provides the sample feature extremum and the sample number of its sample feature data under a given dimension (feature) to the first device, so that the first device can obtain information about the feature data under the current dimension and carry out the method for obtaining the target quantile described below, thereby achieving feature binning.
In some embodiments, referring to fig. 3B, fig. 3B is a schematic flow chart of an optional data processing method for distributed learning provided in the embodiments of the present application, and step 101 shown in fig. 3B may be implemented through steps 1011 to 1014, which will be described in connection with the steps.
In step 1011, a global sample feature extremum and a global sample number of the global sample feature data are determined based on the sample feature extremum and the sample number of the sample feature data stored by each of the plurality of second devices.
Here, the global sample feature data includes sample feature data stored by each of the plurality of second devices, the sample feature extremum includes a maximum value and a minimum value of the sample feature data, and the global sample feature extremum includes a maximum value and a minimum value of the global sample feature data.
In practical implementation, each of the plurality of second devices sorts its stored sample feature data to obtain the maximum value, the minimum value, and the number of samples of that data. Here, sorting may be in ascending or descending order of the sample feature data, or by priority according to the level of the sample feature data; this is only an example of sorting in the embodiments of the present application and is not a limitation.
In some embodiments, the first device compares the maximum value and the minimum value based on the maximum value and the minimum value of the sample characteristic data stored by each of the plurality of second devices, to obtain the maximum value and the minimum value of the corresponding global sample characteristic data; the first device performs accumulation processing on the plurality of sample numbers based on the sample numbers of the sample feature data stored by the plurality of second devices, so as to obtain the corresponding global sample numbers of the global sample feature data.
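A hypothetical sketch of how the first device might derive the global extremum and global sample count from the (max, min, count) triples reported by each second device follows; the names and example figures are illustrative assumptions, not taken from the patent.

```python
def aggregate_global_stats(reports):
    """reports: list of (max_value, min_value, sample_count), one per second device."""
    global_max = max(r[0] for r in reports)    # compare reported maxima
    global_min = min(r[1] for r in reports)    # compare reported minima
    global_count = sum(r[2] for r in reports)  # accumulate sample numbers
    return global_max, global_min, global_count

# e.g. three banks reporting statistics of the "age" feature
reports = [(75.0, 19.0, 1200), (68.0, 22.0, 800), (81.0, 18.0, 1500)]
print(aggregate_global_stats(reports))  # (81.0, 18.0, 3500)
```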
In step 1012, a global feature interval for the global sample feature data is determined based on the global sample feature extremum.
In some embodiments, the range of the global sample feature data is determined based on the maximum and minimum values of the global sample feature data, resulting in an overall feature interval with the maximum and minimum values as the endpoints of the interval.
In step 1013, a distance interval is determined based on the preset number of bins and the global sample feature extremum.
It should be noted that the number of bins may be preset and may also be adaptively adjusted according to the global sample number. In general, to ensure that the data in each interval after equidistant binning is as regular as possible, the number of bins can be set relatively large (for example, a bin-number threshold may be preset so that the actually configured number of bins exceeds this threshold whenever possible); the feature data is then divided more finely, and within an acceptable error range the feature data in each interval can be regarded as uniformly distributed.
In actual implementation, the difference value between the maximum value and the minimum value of the global sample characteristic data is determined, and the ratio of the current difference value and the number of bins is taken as the equidistant divided distance interval.
In step 1014, the overall characteristic interval is equidistantly partitioned based on the distance interval to determine a plurality of analog quantiles and a corresponding plurality of intervals.
In some embodiments, the overall feature interval is equidistantly divided according to the distance interval, so as to obtain a plurality of simulation quantiles and a plurality of corresponding intervals. Here, the differences between adjacent ones of the plurality of analog quantiles are the same, i.e. the distance intervals above.
In actual implementation, starting from the minimum value of the overall characteristic interval, sequentially taking the corresponding value after the accumulated distance interval as the next box dividing point, sequentially determining a plurality of box dividing points, and determining a plurality of corresponding intervals based on the box dividing points; the plurality of binning points is determined as a plurality of analog binning points.
It should be noted that, the analog quantile may be a binning point in the overall feature interval, and may also include a global sample feature maximum and minimum. For example, if the analog quantile includes a binning point in the overall feature interval and a global sample feature maximum and minimum, the equidistant partitioning of the overall feature interval may be implemented as: the 4 simulated quantiles "0", "15", "30", "40" are determined to yield corresponding 3 intervals (which may also be referred to as bins): 0 to 15, 15 to 30 and 30 to 40.
It should be noted that, to which section the simulation quantile belongs, may be preset. For example, the simulation quantile may be set to be subordinate to a section having the simulation quantile as the minimum value of the corresponding section, and in the above example, the simulation quantile 15 may be subordinate to a section of 15 to 30, and the simulation quantile 30 may be subordinate to a section of 30 to 40. In other examples, the analog quantile may also be set to be subordinate to a zone in which the analog quantile is the maximum value of the corresponding zone, which is not described herein.
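Under the equidistant rule of steps 1012 to 1014, the simulation quantiles and intervals could be derived roughly as in the sketch below; the function name and the example endpoints are illustrative assumptions.

```python
def simulation_quantiles(global_min, global_max, num_bins):
    step = (global_max - global_min) / num_bins            # distance interval
    points = [global_min + step * i for i in range(num_bins + 1)]
    intervals = list(zip(points[:-1], points[1:]))          # consecutive (left, right] bins
    return points, intervals

points, intervals = simulation_quantiles(0.0, 40.0, 4)
print(points)     # [0.0, 10.0, 20.0, 30.0, 40.0]  -> simulation quantiles
print(intervals)  # [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (30.0, 40.0)]
```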
In step 102, the number of total samples within each interval is determined based on the number of sub-samples in each second device corresponding to each interval.
In actual implementation, for each interval, the first device performs accumulation processing on the number of sub-samples corresponding to each interval in each second device, as the total number of samples in each interval.
In step 103, the simulation data in each section is constructed based on the total number of samples in each section and the simulation quantiles corresponding to each section.
In some embodiments, the method of step 103 may be implemented by: determining a characteristic data range of each section based on the corresponding simulation quantile of each section; determining a simulated data distribution ratio based on the number of the total samples in each interval and the characteristic data range of the corresponding interval; the analog data distribution proportion is the ratio of the difference value of the analog dividing points corresponding to the characteristic data range to the total sample number; and constructing uniformly distributed analog data in each interval based on the analog data distribution proportion, wherein the difference value of adjacent analog data is the analog data distribution proportion.
In actual implementation, according to the analog data distribution ratio, values corresponding to the analog data distribution ratio are sequentially superimposed from the left-side section end point (the minimum value of the section) in the corresponding section, and a plurality of analog data which are uniformly distributed in the corresponding section are obtained.
For example, in the banking field, if the first device is a server of a bank lending system and the corresponding feature data is the loan amount in units of ten thousand yuan, then when the first device determines that the interval (2, 3] contains 10 items of feature data (the interval excludes the value 2 and includes the value 3), the simulation data distribution ratio is determined to be 0.1, and the uniformly distributed simulation data are constructed as 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9 and 3.
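A sketch of this uniform construction is shown below: the spacing (the "distribution ratio") is the interval width divided by the total sample count, and the values are laid out from the left endpoint in equal steps. The helper name is an assumption for illustration.

```python
def build_interval_simulation(left, right, total_count):
    spacing = (right - left) / total_count              # simulation data distribution ratio
    # place total_count evenly spaced values inside (left, right]
    return [round(left + spacing * k, 10) for k in range(1, total_count + 1)]

# Interval (2, 3] of the loan-amount feature with 10 samples in total:
print(build_interval_simulation(2.0, 3.0, 10))
# [2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0]
```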
In this way, uniformly distributed analog data are built in each interval to replace the acquisition of real sample characteristic data in the corresponding interval from each second device, so that the data exposure risk of each second device is reduced, the privacy of each second device is maintained, the data safety is ensured to a certain extent, meanwhile, the transmission complexity of the sample characteristic data is reduced, and the efficiency of obtaining target quantiles is improved.
In step 104, total simulation data is formed based on the simulation data within each interval, and a target quantile is determined based on the total simulation data.
In some embodiments, the method of step 104 may be implemented by: performing splicing fitting on the simulation data in the intervals based on the simulation quantiles to form total simulation data; the target quantiles are determined based on the total simulation data.
The total analog data is data having a specific order. Because the simulation data in each section are uniformly distributed, the simulation data in the corresponding section has a specific sequence, and the splicing fitting processing performs end-to-end connection processing on the corresponding section according to the simulation dividing points, so that the splicing fitting on the simulation data in a plurality of sections is realized, and the total simulation data is formed.
Here, connecting the intervals end to end according to the simulation quantiles may mean: if a simulation quantile is the maximum value of a first interval and the minimum value of a second interval, the first interval and the second interval are joined at that point, so that the data of the two intervals are spliced together directly. For example, if the intervals obtained by the first device are [1, 2] and (2, 3], 20 uniformly distributed simulation data are constructed for [1, 2] and 10 uniformly distributed simulation data are constructed for (2, 3], then the two intervals are spliced at the simulation quantile "2": the 10 simulation data are appended after the first 20, yielding 30 items of total simulation data.
In some embodiments, determining quantiles based on the analog data may be accomplished by: determining a box division ratio, and dividing the total simulation data based on the box division ratio to obtain a plurality of different boxes; and determining the corresponding sub-box points of the plurality of different sub-boxes as target sub-box points. Here, each sub-bin includes at least one piece of sub-analog data, and the sub-analog data in different sub-bins is identical in number.
In practical implementation, the binning ratio is preset for performing equal-frequency binning on the total simulation data. For example, assume the total simulation data range from 0 to 50. If the binning ratio is 50%, the data ranked 25th and earlier form the first bin, the data ranked after the 25th form the second bin, the 25th item is taken as the split point, and the target quantile is thus determined. If the binning ratio is 25%, the data ranked 10th and earlier form the first bin, those after the 10th up to the 20th form the second bin, those after the 20th up to the 30th form the third bin, those after the 30th up to the 40th form the fourth bin, and those after the 40th form the fifth bin; the items at positions 10, 20, 30 and 40 are taken as the split points of the five bins, and the target quantiles are thus determined.
In other embodiments, determining quantiles based on analog data may also be accomplished by: determining distance intervals, equally dividing the total simulation data based on the distance intervals, sequentially taking the value corresponding to the accumulated distance intervals as the next target quantile from the minimum value, sequentially determining a plurality of target quantiles, and obtaining a plurality of bins based on the target quantiles. Here, the ratio of the difference between the maximum value and the minimum value of the global sample feature data to the number of bins is taken as the distance interval, and the number of bins refers to the description of the embodiment of the present application, and is not described herein.
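The equal-frequency variant described in the preceding paragraphs could be sketched as follows: splice the per-interval simulation data in interval order, then take every bin_size-th value as a target quantile. The 25% ratio and the example data continue the [1, 2] / (2, 3] example above and are assumptions for illustration only.

```python
def target_quantiles(interval_sim_data, bin_ratio):
    # interval_sim_data is already ordered interval by interval, so splicing is a flatten.
    total = [v for interval in interval_sim_data for v in interval]
    bin_size = int(len(total) * bin_ratio)
    num_bins = int(1 / bin_ratio)
    # every bin_size-th value becomes a target quantile (the last endpoint is dropped)
    return [total[k * bin_size - 1] for k in range(1, num_bins)]

sim = [
    [round(1.0 + 0.05 * k, 2) for k in range(1, 21)],  # 20 points fitted in [1, 2]
    [round(2.0 + 0.10 * k, 2) for k in range(1, 11)],  # 10 points fitted in (2, 3]
]
print(target_quantiles(sim, 0.25))  # three split points for four equal-frequency bins
```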
Here, the first device performs the feature binning on the basis of the total simulation data, so the target quantile is obtained quickly without needing the real feature data of the second devices for joint binning; this reduces the risk of data loss and leakage during transmission, avoids the complex procedure of repeatedly transmitting data and repeatedly calculating, comparing and updating split points across multiple parties, and improves the efficiency of obtaining feature split points from distributed data.
In some embodiments, following step 104, the following scheme may also be performed: creating a plurality of tasks for solving target quantiles; the tasks for solving the target quantile are used for solving the target quantile of the global sample characteristic data with different dimensions; the global sample characteristic data of each dimension represents the data of the same characteristic, and the global sample characteristic data comprises sample characteristic data stored by a plurality of second devices respectively; and executing a plurality of tasks for solving the target quantile in parallel to obtain the target quantiles of the global sample characteristic data with different dimensions.
In actual practice, the feature data stored by the plurality of second devices is typically multi-dimensional. For example, a second device may be a server of a bank lending system, a server of an online investment platform, or the like. Taking a bank lending system as an example, its server stores the loan information of each borrowing user, which typically includes features of multiple dimensions, such as: user name, mobile phone number, bank card number, age, address, loan amount, whether repayment is overdue, etc. The lending systems of multiple banks each store features of the same dimensions, and the first device (usually a third party such as a risk control platform or a loan income evaluation platform) may perform feature binning on the features of each dimension to obtain the target quantiles of the corresponding dimensions.
Here, through the parallel processing mode, the quantiles of the global sample characteristic data with different dimensions are simultaneously obtained, so that the solving time of the target quantiles is saved, and the processing efficiency of the characteristic data is improved.
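One hypothetical way to run one target-quantile task per feature dimension in parallel is sketched below; solve_target_quantile stands in for the whole simulated-data pipeline described above, and the feature names and placeholder result are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_target_quantile(feature_name):
    # ... gather per-device stats, build simulation data, bin it ...
    return feature_name, [0.25, 0.5, 0.75]  # placeholder quantiles for illustration

feature_dims = ["age", "income", "loan_amount"]
with ThreadPoolExecutor(max_workers=len(feature_dims)) as pool:
    results = dict(pool.map(solve_target_quantile, feature_dims))
print(results)  # target quantiles keyed by feature dimension
```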
In step 105, the target quantiles are sent to each of the second devices, such that each of the second devices builds a sample set based on the target quantiles and trains a machine learning model for performing classification tasks based on the sample set.
In some embodiments, step 105 may be implemented by sending the target quantile to each second device, so that each second device determines each bin of the sample feature data based on the target quantile, and determines sub-positive and negative sample distributions respectively corresponding to each bin based on the tag data of the sample feature data stored respectively; determining total positive and negative sample distribution corresponding to each sub-box respectively based on the sub-positive and negative sample distribution sent by each second device; determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each sub-bin; the feature index value corresponding to the global sample feature data is used for enabling each second device to execute the following operations: when the characteristic index value exceeds an index threshold value, a sample set is constructed, and a machine learning model for performing classification tasks is trained based on the sample set; wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
In actual implementation, the first device sends the final target quantiles of the global sample feature data to the respective second devices. Each second device determines each sub-box of the sample characteristic data based on the target sub-position point, distributes the sample characteristic data stored in each sub-box to the corresponding sub-box, sequentially determines whether the sample characteristic data is a positive sample or a negative sample according to the label data for the sample characteristic data in each sub-box, and counts the number of the positive samples and the number of the negative samples in each sub-box to be used as sub-positive and negative sample distribution of the current sub-box.
Here, the tag data is reference data corresponding to the current sample feature data or the global sample feature data. Specifically, for features of the same dimension, the tag data is generally used to distinguish whether feature data belonging to the current feature meets the standard, i.e., whether it belongs to a positive sample or a negative sample (a positive sample can be understood as a normal sample, a negative sample as a default sample). The tag data is usually an empirical value obtained through extensive experiments and tests. Taking a bank lending system as an example, for the feature "age" the corresponding tag data is typically 20 years, the acceptable age threshold for participating in a loan: if the age is below 20, the current sample feature data or global sample feature data is determined to be a negative sample, and if the age is above 20 it is determined to be a positive sample.
After determining the distribution of the sub positive and negative samples in each sub box of the second equipment, for each sub box, carrying out aggregation treatment on the distribution of the sub positive and negative samples sent by each second equipment to obtain the total distribution of the sub positive and negative samples of each sub box; specifically, the positive sample number in each sub-box is accumulated to be the total positive sample number, and the negative sample number in each sub-box is accumulated to be the total negative sample number, so that the total positive and negative sample distribution of each sub-box is obtained; and determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each bin, wherein the global sample characteristic data comprises sample characteristic data stored by a plurality of second devices.
Here, the feature index value is used to evaluate the feasibility of the corresponding feature (the feature under which the global sample feature data or the sample feature data is provided), i.e., whether the current global sample feature data can be used to train the machine learning model as a training sample or can be used for subsequent feature selection and feature processing. The characteristic index value may be an IV value, a WOE value, or the like.
In actual implementation, based on total positive and negative sample distribution of each bin, determining the IV value or WOE value of the current bin, and performing aggregation treatment on the IV values or WOE values of all bins to obtain the IV value or WOE value corresponding to the global sample characteristic data. Here, the aggregation process may be an accumulation process or a weighted summation process, which is not limited in this application.
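A sketch of the WOE/IV computation described above, as the first device might perform it from the aggregated positive/negative counts of each bin, is given below; the small epsilon guarding against empty bins is an implementation assumption.

```python
import math

def woe_iv(bins):
    """bins: list of (total_positive_count, total_negative_count), one per bin."""
    eps = 1e-9
    pos_total = sum(p for p, _ in bins) + eps
    neg_total = sum(n for _, n in bins) + eps
    woe, iv = [], 0.0
    for p, n in bins:
        pos_rate = (p + eps) / pos_total          # share of all positive samples in this bin
        neg_rate = (n + eps) / neg_total          # share of all negative samples in this bin
        w = math.log(pos_rate / neg_rate)         # WOE of the bin
        woe.append(w)
        iv += (pos_rate - neg_rate) * w           # IV accumulates over all bins
    return woe, iv

print(woe_iv([(90, 10), (60, 40), (20, 80)]))
```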
In some embodiments, each second device may construct a sample set using the respective stored sample feature data as a training sample when the feature index value exceeds the index threshold, and train a machine learning model for performing the classification task based on the sample set.
Here, the index threshold is a threshold for evaluating the feasibility of a feature (the feature having global sample feature data or sample feature data under it), and when the feature index value exceeds the index threshold, it is determined that the current feature data can be used for training of the machine learning model.
Here, the machine learning model may be a two-class model, or may be a multi-class model. In practical implementation, taking a machine learning model as a classification model as an example for explanation, when the machine learning model is used for classifying and predicting feature data, sample feature data in a sample set carries a pre-labeled classification result, and training the machine learning model can be realized by the following ways: each second device performs classification prediction on each sample characteristic data in the sample set through a machine learning model to obtain a prediction classification result of each sample characteristic data; and calculating a loss value based on the difference between the pre-labeled classification result and the predicted classification result on each sample characteristic data, and updating model parameters of the machine learning model based on the loss value.
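The patent does not fix a concrete model form; as one assumed instance, a minimal logistic-regression sketch of a single training update (predict, compute the loss against the pre-labeled result, update the parameters) is shown below with synthetic data.

```python
import numpy as np

def train_step(weights, features, labels, lr=0.1):
    logits = features @ weights
    preds = 1.0 / (1.0 + np.exp(-logits))               # predicted classification result
    loss = -np.mean(labels * np.log(preds + 1e-9)
                    + (1 - labels) * np.log(1 - preds + 1e-9))
    grad = features.T @ (preds - labels) / len(labels)  # gradient of the loss
    return weights - lr * grad, loss                     # updated model parameters

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # e.g. binned/encoded sample feature data
y = (X[:, 0] > 0).astype(float)      # pre-labeled classification results
w = np.zeros(3)
for _ in range(50):
    w, loss = train_step(w, X, y)
print(w, loss)
```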
It should be noted that the training of the machine learning model may be performed independently by a second device, with the trained model then synchronized to the first device and the remaining second devices; alternatively, it may be performed jointly by the first device and the second devices: each second device trains a model on the sample feature data in its own sample set and sends the trained model to the first device, the first device aggregates the locally trained models into a global machine learning model, and the global model is synchronized back to the second devices to update each locally trained model.
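The aggregation performed by the first device is not fixed by the application; a common choice in such joint training is a sample-count-weighted average of the locally trained parameters (FedAvg-style), sketched below under that assumption.

```python
import numpy as np

def aggregate_models(local_weights, sample_counts):
    """Combine locally trained parameter vectors into a global model."""
    coeffs = np.asarray(sample_counts, dtype=float)
    coeffs /= coeffs.sum()                                 # weight by local sample count
    stacked = np.stack([np.asarray(w, dtype=float) for w in local_weights])
    return (coeffs[:, None] * stacked).sum(axis=0)         # global model parameters
```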
In practical implementation, the classification model can be applied in many real-world scenarios to classify feature data. For example, in the consumer finance field, binary classification is performed on a borrower's loan data for credit evaluation, determining whether the user is in default or in good standing; in the field of medical image recognition, binary classification is performed on physiological feature data to evaluate lesion findings and determine whether a patient is ill or healthy.
The machine learning model may also be a risk control model, in which case the sample feature data in the sample set carry pre-labeled target risk evaluation results, and the risk control model can be trained as follows: risk prediction is performed on each piece of sample feature data in the sample set through the risk control model to obtain a predicted risk evaluation result for each piece of sample feature data; the model parameters of the risk control model are then updated based on the difference between the labeled target risk evaluation result and the predicted risk evaluation result of each piece of sample feature data.
Here, the sample feature data in the sample set may include user data, and the risk control model may predict a user's credit based on the user data for intelligent risk control. The risk control model can be applied in many scenarios, for example anti-fraud, whitelist pre-screening, pre-loan credit review, and post-loan early-warning scoring in the consumer finance industry.
For example, if the training samples of the model come from the lending systems of several banks and include feature data such as user name, gender, mobile phone number, income information, loan amount and overdue status, the risk control model may classify this feature data to assess the risk of the user's identity, predict whether the loan amount exceeds a threshold, evaluate whether the user's credit score meets the standard, and so on.
According to the embodiments of the application, each second device holding feature data does not need to provide its complete feature data; instead, simulation data is constructed to stand in for the real feature data, so the feature data of the second devices is never exposed to one another. Even though the data is distributed across multiple parties that must jointly perform the binning process of solving for split points, the data privacy of each second device is preserved, data security is better protected to a certain extent, and the approach is particularly valuable in application scenarios with strict data privacy requirements. In addition, when the target quantiles are solved, the required data is transmitted (sent and received) between the first device and the second devices only once; obtaining the quantiles from simulation data replaces the repeated transmission of intermediate data and the recursive, iterative computation of the target quantiles, which reduces the computation and transmission burden, improves data processing efficiency, and ensures that the target quantiles are obtained quickly and accurately.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
Distributed data processing is commonly applied in federated learning scenarios, where an active party leads multiple participants in jointly performing feature data processing and model training. In a horizontal federated learning scenario, the participants hold feature data with identical user features but different users, and the active party performs feature analysis and model training jointly with the participants based on this data. Here, to guarantee the safety of each participant's data during feature data analysis, an effective, fast and secure method for obtaining the feature data split points is needed, so that the feature data can be analyzed and processed conveniently.
The application is described below in terms of a first device (active party) and a plurality of second devices (participants) in a horizontal federated learning scenario. The method provided by the embodiments of the application can be applied to a lending platform: the control center of the lending platform acts as the active party and a number of banks act as participants. Each bank provides the extremum and count information of its own sample feature data (here, the sample feature data may be a user's personal information and lending data, such as user name, gender, mobile phone number, income information, loan amount and overdue status). The active party performs feature binning on this information to obtain the target quantiles and the corresponding binning results, computes a feature index value from the binning results, and judges the value of the feature data according to the feature index value. The active party then coordinates the participants to perform federated modeling based on the feature data, obtaining a risk control model that is applied jointly on the lending platform to evaluate user credit or perform risk prediction.
In some practical application scenarios with unprotected distributed data, a distributed GK-summary algorithm can be adopted: each node maintains a data structure called a summary, the summaries are then merged, and the merged summary compresses and orders the original data so that the split points can be computed and the binning operation completed. In a horizontal federated learning scenario, however, the usual problem when computing quantiles is that each participant holds its own feature data, so it is difficult to compare the data of the different participants; and if every participant provided all of its feature data, a large amount of data would be leaked and data privacy would be seriously compromised.
The embodiments of the application therefore provide an implementation of feature binning in horizontal federated learning.
In horizontal federated learning, each row of the data matrix (which may be in tabular form) represents the feature data of one user, and each column represents a data feature (or label). For the participants that hold feature data in horizontal federated learning, all participants hold the same data features, but the user dimensions differ.
TABLE 1
Illustratively, referring to Table 1, Table 1 is an alternative schematic illustration of the distributed feature data. Table 1 shows the feature data of the second device 1 (hereinafter referred to as party A): different columns represent feature information in different dimensions and contain party A's sample feature data in the corresponding dimension. For example, party A's features include: mobile phone number/device number, age, income, number of transactions, and whether there is an overdue record. Different rows represent the feature data of different users. Specifically, in Table 1 the first row of data represents the feature data of user 1 (mobile phone number/device number U1, age 28, income 20000, number of transactions 10, overdue 1). Here, for the overdue feature, 1 indicates yes and 0 indicates no.
TABLE 2
Referring to Table 2, Table 2 is an alternative schematic illustration of the distributed feature data. Table 2 shows the feature data of the second device 2 (hereinafter referred to as party B); party B's features likewise include: mobile phone number/device number, age, income, number of transactions, and whether there is an overdue record. For example, in Table 2 the first row of data represents the feature data of user 4 (mobile phone number/device number U4, age 12, income 0, number of transactions 3, overdue 0).
In some embodiments, for each feature dimension (e.g., each column shown in Table 1 and Table 2), every participant holds feature data in the corresponding dimension, and the first device applies the data processing method for distributed learning provided in the embodiments of the application, i.e., performs feature binning on the feature data of each dimension and obtains the target quantiles.
In some embodiments, referring to fig. 4, fig. 4 is an optional flowchart of data processing for distributed learning provided in the embodiments of the present application, where the data processing method for distributed learning provided in the embodiments of the present application may be cooperatively implemented by a first device (hereinafter, a trusted third party is taken as an example) and a second device (hereinafter, a participant is taken as an example), and may specifically be implemented through steps 401-411, which will be described in detail below.
Step 401: the participants sort their respective sample feature data to obtain the sample feature extremum and the sample count of the sample feature data.
Here, the sample feature extremum includes a maximum value and a minimum value of the sample feature data.
It should be noted that the sample feature data stored by each participant is feature data in the same dimension; this will not be repeated below.
In actual implementation, referring to fig. 5A, fig. 5A shows an alternative schematic diagram of a processing method of distributed learning. In fig. 5A, participants 1, 2, 3 respectively count the maximum and minimum values of the respective sample feature data, and the number of samples of the corresponding feature.
For example, for the "income" feature, participant 1 has 25 pieces of sample feature data; after sorting them, the minimum income is 6 and the maximum income is 11. Participant 2 has 10 pieces of sample feature data; after sorting them, the minimum income is 8 and the maximum income is 14. Participant 3 has 45 pieces of sample feature data; after sorting them, the minimum income is 0 and the maximum income is 10. Here, the unit of the "income" feature is 10,000 yuan.
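A minimal sketch of the statistics each participant reports in step 401 is given below; the sample values are invented for illustration, and only the reported triple (minimum, maximum, count) corresponds to the quantities described above.

```python
def local_statistics(feature_values):
    """What a participant reports in step 401: (minimum, maximum, sample count)."""
    ordered = sorted(feature_values)          # data sorting
    return ordered[0], ordered[-1], len(ordered)

# illustrative income values only; the example above reports (6, 11, 25) for participant 1
print(local_statistics([6, 9.5, 7, 11, 8]))  # -> (6, 11, 5)
```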
Step 402: the plurality of participants send respective sample feature extremum and sample numbers to a trusted third party (arbiter).
In practical implementation, referring to fig. 5A, multiple participants send the maximum value, the minimum value, and the number of samples of the respective sample feature data to a trusted third party (hereinafter referred to as a third party).
For example, following the above example, participant 1 sends 6, 11, 25; participant 2 sends 8, 14, 10; and participant 3 sends 0, 10, 45.
Step 403: the third party receives the sample feature extrema and sample counts and aggregates them to obtain the global sample feature extremum and the global sample count of the global sample feature data.
Here, following the above example, after the third party receives the data sent by the participants and compares and aggregates it, the global sample feature extremum of the corresponding feature (expressed as the minimum and maximum of the global sample feature data) is 0 and 14, and the global sample count is 80.
It should be noted that the global sample feature data represents the sum of the sample feature data stored by each of the participant devices.
Step 404: the third party determines the overall feature interval of the global sample feature data based on the global sample feature extremum, and performs equidistant binning on the overall feature interval to obtain a plurality of simulation quantiles and a plurality of corresponding intervals.
In some embodiments, the number of bins is preset, the overall feature interval of the global sample feature data is determined based on the maximum value and the minimum value of the global sample feature data, and the overall feature interval is equidistantly divided based on the preset number of bins, so as to obtain a plurality of simulation quantiles and a plurality of corresponding intervals.
Here, the number of bins is set in advance. To ensure, as far as possible within the acceptable error range, that the data in each interval after equidistant binning is close to uniformly distributed, a larger number of bins can be chosen so that the feature data is divided more finely.
In practical implementation, the difference between the maximum and the minimum is determined, and the ratio of this difference to the number of bins is taken as the equidistant interval width; starting from the minimum value, each successive accumulation of the interval width gives the next simulation quantile, the simulation quantiles are determined one by one, and the corresponding intervals are determined from the simulation quantiles.
For example, continuing the above example, the number of bins is preset to 10. The difference between the minimum and maximum of the global sample feature data is 14, and its ratio to the bin number 10 is 1.4, so the 11 simulation quantiles are determined to be 0.0, 1.4, 2.8, 4.2, 5.6, 7.0, 8.4, 9.8, 11.2, 12.6, 14, giving the 10 corresponding intervals [0.0, 1.4], (1.4, 2.8], (2.8, 4.2], (4.2, 5.6], (5.6, 7.0], (7.0, 8.4], (8.4, 9.8], (9.8, 11.2], (11.2, 12.6], (12.6, 14].
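Equivalently, the simulation quantiles can be produced by placing bin-number + 1 equally spaced points between the global minimum and maximum, as in the following sketch, which reproduces the values of the example.

```python
import numpy as np

def simulation_quantiles(global_min, global_max, n_bins):
    """Equidistant simulation quantiles over the overall feature interval."""
    return np.linspace(global_min, global_max, n_bins + 1)

print(simulation_quantiles(0.0, 14.0, 10))
# [ 0.   1.4  2.8  4.2  5.6  7.   8.4  9.8 11.2 12.6 14. ]
```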
It should be noted that the simulation quantiles determined here may be only the split points inside the overall feature interval, or may also include the global sample feature extremum; in the above example, the resulting simulation quantiles may be the 11 values listed, or the 9 interior split points excluding 0 and 14.
Step 405: the third party sends the plurality of simulation quantiles to the participants.
In actual implementation, referring to Fig. 5B, Fig. 5B shows an alternative schematic diagram of the data processing method for distributed learning. In Fig. 5B, the third party sends the plurality of simulation quantiles to the participants.
Step 406: after receiving the plurality of simulation quantiles, the plurality of participants collect and summarize the sample characteristic data based on the intervals corresponding to the simulation quantiles, and determine the number of sub-samples of the sample characteristic data in each interval.
In practical implementation, referring to Fig. 5B, the participants collect and aggregate their sample feature data according to the intervals corresponding to the simulation quantiles, and determine the number of sub-samples of the sample feature data in each interval.
For example, continuing the above example, each participant collects the sample feature data falling into the interval (7.0, 8.4] based on the simulation quantiles 7.0 and 8.4, and determines the number of sub-samples in that interval.
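On each participant, the per-interval counting of step 406 amounts to a histogram over the simulation quantiles. The sketch below uses illustrative local values; note that numpy's histogram uses left-closed intervals, a slight deviation from the (lower, upper] convention above that does not affect the illustration.

```python
import numpy as np

def subsample_counts(feature_values, quantiles):
    """Number of a participant's samples falling into each interval."""
    counts, _ = np.histogram(feature_values, bins=quantiles)
    return counts.tolist()

income_party_1 = [6, 7.5, 8.2, 9, 11]                    # illustrative values only
print(subsample_counts(income_party_1, np.linspace(0.0, 14.0, 11)))
# -> [0, 0, 0, 0, 1, 2, 1, 1, 0, 0]
```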
Step 407: the plurality of parties send the number of subsamples within the plurality of intervals to the third party.
In actual implementation, referring to fig. 5B, in fig. 5B, multiple parties send the number of subsamples over multiple intervals to a third party.
For example, continuing the above example, for the interval (7.0, 8.4], participant 1 sends a sub-sample count of 2 to the third party, participant 2 sends a count of 2, and participant 3 sends a count of 3.
Step 408: the third party receives the number of subsamples within the plurality of intervals and determines a total number of samples within each interval.
For example, continuing the above example, for the interval (7.0, 8.4], the third party receives the sub-sample counts sent by the participants and determines that there are 7 data points in the interval.
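On the third-party side, this aggregation is a simple per-interval sum of the reported counts, e.g.:

```python
# Sub-sample counts reported for the interval (7.0, 8.4] in the example above
reported = {"participant 1": 2, "participant 2": 2, "participant 3": 3}
total_in_interval = sum(reported.values())   # 7, used when building the simulation data
```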
Step 409: the third party builds evenly distributed simulation data based on the number of total samples and corresponding simulation quantiles within each interval.
In actual implementation, referring to fig. 5B, in fig. 5B, a third party builds uniformly distributed simulation data at each interval.
For example, the ratio of the difference between the interval's simulation quantiles to the total number of samples in the interval is taken as the simulation data distribution ratio, and simulation data is constructed equidistantly and uniformly for the interval, the difference between adjacent simulation values being the simulation data distribution ratio.
It should be noted that, because the third party performs equidistant binning on the overall feature interval, the number of bins can be adjusted flexibly to ensure as far as possible that the data in each interval has a certain regularity; since the feature range of each interval is very small, uniformly distributed data can be constructed within the acceptable final error to stand in for the data actually distributed in the corresponding interval.
For example, continuing the above example, for the interval (7.0, 8.4], the ratio of the interval width 1.4 to the total sample count 7, namely 0.2, is taken as the simulation data distribution ratio, and the simulation data in the interval is constructed in turn as 7.2, 7.4, 7.6, 7.8, 8.0, 8.2, 8.4.
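The construction of an interval's simulation data can be sketched as follows; it reproduces the values of the example for the interval (7.0, 8.4] with 7 samples.

```python
def build_interval_simulation(lower, upper, n_samples):
    """Evenly spaced stand-in values for the n_samples real values in (lower, upper]."""
    step = (upper - lower) / n_samples        # the simulation data distribution ratio
    return [round(lower + step * (i + 1), 6) for i in range(n_samples)]

print(build_interval_simulation(7.0, 8.4, 7))
# [7.2, 7.4, 7.6, 7.8, 8.0, 8.2, 8.4]
```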
Step 410: and the third party splices the simulation data in each interval to form total simulation data.
Here, the simulation data in each interval is uniformly distributed and ordered by value; the simulation data of the intervals is spliced together based on the simulation quantiles to form the total simulation data.
Step 411: the third party determines the target quantile based on the total simulation data.
In some embodiments, the target quantiles are obtained by performing equal-frequency binning on the total simulation data.
In actual implementation, a binning ratio is determined, the total simulation data is divided according to the binning ratio to obtain a plurality of bins, and the split points of these bins are determined as the target quantiles.
Equal-frequency binning divides the feature data so that the number of data points in each bin is approximately equal.
For example, the binning ratio is determined to be 20%: the total simulation data is divided so that the first 20% of the feature data forms the first bin, whose split point is a target quantile; the 20%-40% segment forms the second bin, whose split point is also a target quantile; and so on at a frequency of 20%, yielding 5 bins of the global sample feature data and 6 target quantiles that include the global sample feature extremum.
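A compact way to express this equal-frequency step is to take the empirical quantiles of the concatenated simulation data at 0%, 20%, ..., 100%, as sketched below; the function name and the use of numpy's quantile routine are choices of this illustration.

```python
import numpy as np

def target_quantiles(total_simulation_data, n_bins=5):
    """Equal-frequency split points of the total simulation data.

    With a 20% binning ratio (n_bins = 5) this yields 6 target quantiles,
    including the global sample feature extremum at the 0% and 100% positions.
    """
    data = np.sort(np.asarray(total_simulation_data, dtype=float))
    return np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
```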
In other embodiments, the total simulation data may be processed in other ways to obtain the target quantiles. Specifically, a summary (a data structure used to store and maintain feature data) may be constructed on the third-party side alone, and the total simulation data then subjected to equal-frequency binning to obtain the target quantiles. In still other embodiments, the target quantiles may be obtained by equidistant binning of the total simulation data, or by an optimal binning method.
In some embodiments, for the multiple features shown in Table 1 and Table 2 (i.e., global sample feature data in multiple dimensions, where the global sample feature data consists of the sample feature data stored by the participants), one quantile-solving task can be constructed per dimension, and the target quantiles of the global sample feature data of the different dimensions can be computed in parallel.
Here, by solving the quantiles of the global sample feature data of the different dimensions simultaneously in parallel, the time needed to solve the target quantiles is reduced and the processing efficiency of the feature data is improved.
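One possible way to run the per-feature tasks in parallel is sketched below; the thread-pool choice and the callable signature are assumptions of the example, with solve_one_feature standing for steps 404-411 applied to a single feature dimension.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_all_features(per_feature_inputs, solve_one_feature):
    """Run one target-quantile task per feature dimension in parallel.

    per_feature_inputs: {feature name: whatever solve_one_feature needs}
    solve_one_feature:  callable implementing steps 404-411 for one feature
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(solve_one_feature, payload)
                   for name, payload in per_feature_inputs.items()}
        return {name: future.result() for name, future in futures.items()}
```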
In some embodiments, after the target quantiles are obtained for the global sample feature data of each dimension, a feature index value of the feature data (which may be expressed as weight of evidence or information value) is determined based on the feature binning result corresponding to the target quantiles; the usability or value of the feature data is judged from the feature index value, the feature is used for subsequent preprocessing and feature selection, and a machine learning model is trained jointly with the participants based on the usable feature data. Here, the machine learning model may include a risk control model used for risk assessment in various real-world scenarios.
In the embodiments of the application, no participant provides its complete feature data, which avoids the data leakage problem of common distributed binning methods and dispenses with complex procedures such as recursive iteration.
Continuing with the description below of an exemplary architecture of the data processing apparatus 233 implemented as software modules for distributed learning provided by embodiments of the present application, in some embodiments, as shown in fig. 2, the software modules stored in the data processing apparatus 233 for distributed learning of the memory 230 may include:
a simulation quantile determination module 2331, configured to determine a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample feature extremum and sample count of the sample feature data stored by each of the plurality of second devices;
a section sample number determining module 2332, configured to determine a total sample number in each section based on a sub-sample number corresponding to each section in each second device;
The simulation data construction module 2333 is configured to construct simulation data in each section based on the total sample number in each section and the simulation quantiles corresponding to each section;
a target quantile determination module 2334 for forming total simulation data based on the simulation data within each interval and determining a target quantile based on the total simulation data;
the feature data processing module 2335 is configured to send the target quantiles to each second device, so that each second device constructs a sample set based on the target quantiles and trains a machine learning model for performing classification tasks based on the sample set.
In some embodiments, the simulation quantile determination module 2331 is further configured to: determine the global sample feature extremum and the global sample count of the global sample feature data based on the sample feature extremum and sample count of the sample feature data stored by each of the plurality of second devices, where the global sample feature data comprises the sample feature data stored by the plurality of second devices and the global sample feature extremum comprises the maximum and minimum of the global sample feature data; determine the global feature interval of the global sample feature data based on the global sample feature extremum; determine a distance interval based on a preset number of bins and the global sample feature extremum; and perform equidistant division on the global feature interval based on the distance interval to determine the plurality of simulation quantiles and the corresponding plurality of intervals, where the distance interval is the difference between adjacent simulation quantiles.
In some embodiments, the simulation data construction module 2333 is further configured to: determine the feature data range of each interval based on the simulation quantiles corresponding to that interval; determine a simulation data distribution ratio based on the total number of samples in each interval and the feature data range of the corresponding interval, where the simulation data distribution ratio is the ratio of the difference between the simulation quantiles bounding the feature data range to the total number of samples; and construct uniformly distributed simulation data in each interval based on the simulation data distribution ratio, the difference between adjacent simulation values being the simulation data distribution ratio.
In some embodiments, the target quantile determination module 2334 is further configured to: splice the simulation data of the plurality of intervals together based on the simulation quantiles to form the total simulation data, where the total simulation data has a definite order; determine a binning ratio and divide the total simulation data based on the binning ratio to obtain a plurality of different bins, where the numbers of simulation values in the different bins are consistent; and determine the split points of the plurality of different bins as the target quantiles.
In some embodiments, the data processing apparatus for distributed learning further comprises a parallel processing module 2336 (not shown in Fig. 2), configured to create a plurality of tasks for solving target quantiles, each task being used to solve the target quantiles of the global sample feature data of one dimension, where the global sample feature data of each dimension characterizes the data of the same feature and comprises the sample feature data stored by the plurality of second devices; and to execute the plurality of tasks in parallel to obtain the target quantiles of the global sample feature data of the different dimensions.
In some embodiments, the feature data processing module 2335 is further configured to: send the target quantiles to each second device, so that each second device determines the bins of its sample feature data based on the target quantiles and determines the sub positive and negative sample distribution of each bin based on the label data of the sample feature data it stores; determine the total positive and negative sample distribution of each bin based on the sub distributions sent by the second devices; and determine the feature index value corresponding to the global sample feature data based on the total positive and negative sample distributions of the bins. The feature index value corresponding to the global sample feature data is used to make each second device perform the following operations: when the feature index value exceeds the index threshold, construct a sample set and train a machine learning model for performing classification tasks based on the sample set; wherein the global sample feature data comprises the sample feature data stored by each of the plurality of second devices.
In some embodiments, the sample feature data in the sample set carries a pre-labeled classification result, and the data processing apparatus for distributed learning further includes: model training module 2337 (not shown in fig. 2) for performing classification prediction on each sample feature data in the sample set through a machine learning model to obtain a prediction classification result of each sample feature data; calculating a loss value based on the difference between the pre-labeled classification result and the predicted classification result on each sample characteristic data; model parameters of the machine learning model are updated based on the loss values.
It should be noted that, the description of the apparatus in the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the data processing method of distributed learning described in the embodiment of the present application.
The present embodiments provide a computer readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method provided by the embodiments of the present application, for example, a data processing method of distributed learning as shown in fig. 3A, 3B.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the application, no participant provides its complete feature data, which avoids the data leakage that would be caused by each second device providing its feature data when quantiles are solved over distributed data, and thereby better protects data security to a certain extent. In addition, when the target quantiles are solved, the required data is transmitted (sent and received) between the first device and the second devices only once, and constructing simulation data to obtain the quantiles replaces the repeated transmission of intermediate data and the recursive, iterative computation of the target quantiles; this reduces the computation and transmission burden, lowers the complexity of data processing, improves data processing efficiency, and ensures that the target quantiles are obtained quickly and accurately.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (11)

1. A data processing method for distributed learning, applied to a first device, the method comprising:
determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample feature extremum and sample number of the sample feature data stored by the plurality of second devices respectively;
determining the total sample number in each interval based on the sub-sample number corresponding to each interval in each second device;
constructing simulation data in each interval based on the total sample number in each interval and the simulation quantiles corresponding to each interval;
forming total simulation data based on the simulation data in each interval, and determining a target quantile based on the total simulation data;
and sending the target quantiles to each second device so that each second device builds a sample set based on the target quantiles and trains a machine learning model for performing classification tasks based on the sample sets.
2. The method of claim 1, wherein the determining a plurality of simulation quantiles and a corresponding plurality of intervals based on the sample feature extremum and the number of samples of the sample feature data stored by each of the plurality of second devices comprises:
Determining a global sample feature extremum and a global sample number of the global sample feature data based on the sample feature extremum and the sample number of the sample feature data stored by each of the plurality of second devices; the global sample characteristic data comprise sample characteristic data stored by each of the plurality of second devices, and the global sample characteristic extremum comprises a maximum value and a minimum value of the global sample characteristic data;
determining a global feature interval of the global sample feature data based on the global sample feature extremum;
determining a distance interval based on a preset number of bins and the global sample feature extremum;
performing equidistant division processing on the global feature interval based on the distance interval to determine a plurality of simulation quantiles and a corresponding plurality of intervals; wherein the distance interval is a difference between adjacent ones of the plurality of simulation quantiles.
3. The method according to claim 1 or 2, wherein the constructing the simulation data in each interval based on the total number of samples in each interval and the simulation quantiles corresponding to each interval includes:
determining the characteristic data range of the corresponding interval based on the simulation quantiles corresponding to each interval;
Determining a simulation data distribution proportion based on the total sample number in each interval and the characteristic data range of the corresponding interval; the simulation data distribution proportion is the ratio of the difference value of the simulation quantiles corresponding to the characteristic data range to the total sample number;
and constructing uniformly distributed simulation data in each interval based on the simulation data distribution proportion, wherein the difference value of adjacent simulation data is the simulation data distribution proportion.
4. The method of claim 1, wherein the forming total simulation data based on the simulation data within each interval and determining the target quantile based on the total simulation data comprises:
performing splice fitting on the simulation data in a plurality of intervals based on the simulation quantiles to form total simulation data; wherein the total simulation data is data having a specific order;
determining a binning ratio, and dividing the total simulation data based on the binning ratio to obtain a plurality of different bins; wherein each bin comprises at least one piece of sub-simulation data, and the sub-simulation data in different bins are consistent in quantity;
And determining the corresponding quantiles of the plurality of different bins as target quantiles.
5. The method according to claim 1, wherein the method further comprises:
creating a plurality of tasks for solving target quantiles;
the tasks for solving the target quantile are used for solving the target quantiles of the global sample characteristic data with different dimensions; wherein global sample feature data for each dimension characterizes data of the same feature, the global sample feature data comprising the sample feature data stored by each of the plurality of second devices;
and executing a plurality of tasks for solving the target quantile in parallel to obtain the target quantile of the global sample characteristic data with different dimensions.
6. The method of claim 1, wherein the sending the target quantiles to each of the second devices to cause each of the second devices to construct a sample set based on the target quantiles and to train a machine learning model for performing classification tasks based on the sample set comprises:
transmitting the target quantile to each second device, so that each second device determines each sub-bin of the sample characteristic data based on the target quantile, and determines sub-positive and negative sample distribution corresponding to each sub-bin based on label data of the sample characteristic data stored in each second device;
Determining total positive and negative sample distribution corresponding to each sub-box respectively based on the sub-positive and negative sample distribution sent by each second device;
determining a characteristic index value corresponding to the global sample characteristic data based on the total positive and negative sample distribution of each sub-bin;
the feature index value corresponding to the global sample feature data is used for enabling each second device to execute the following operations:
when the characteristic index value exceeds an index threshold value, a sample set is constructed, and a machine learning model for performing classification tasks is trained based on the sample set;
wherein the global sample feature data comprises sample feature data stored by each of the plurality of second devices.
7. The method of claim 6, wherein the sample feature data in the sample set carries pre-labeled classification results, the method further comprising:
classifying and predicting each sample characteristic data in the sample set through the machine learning model to obtain a prediction classification result of each sample characteristic data;
calculating a loss value based on the difference between the pre-labeled classification result and the predicted classification result on each sample characteristic data;
Model parameters of the machine learning model are updated based on the loss values.
8. A data processing apparatus for distributed learning, comprising:
the simulation quantile determining module is used for determining a plurality of simulation quantiles and a plurality of corresponding intervals based on sample characteristic extremum and sample quantity of sample characteristic data stored by a plurality of second devices respectively;
a section sample number determining module, configured to determine a total sample number in each section based on a sub-sample number corresponding to each section in each second device;
the simulation data construction module is used for constructing simulation data in each interval based on the total sample number in each interval and the simulation quantiles corresponding to each interval;
the target quantile determining module is used for forming total simulation data based on the simulation data in each interval and determining a target quantile based on the total simulation data;
and the characteristic data processing module is used for sending the target quantiles to each second device so that each second device builds a sample set based on the target quantiles and trains a machine learning model for performing classification tasks based on the sample set.
9. A data processing system for distributed learning based on the method of claim 1, comprising: a first device and a plurality of second devices; wherein,
the first device is configured to:
transmitting the target quantile to each of the second devices;
the second device is configured to:
determining a sample feature extremum and a sample number for storing sample feature data, and transmitting the sample feature extremum and the sample number to a first device;
determining the number of sub-samples in each interval based on the simulation quantiles determined by the first device and the corresponding plurality of intervals, and sending the numbers of sub-samples to the first device;
a sample set is constructed based on the target quantiles determined by the first device, and a machine learning model for performing classification tasks is trained based on the sample set.
10. A data processing apparatus for distributed learning, comprising:
a memory for storing executable instructions;
a processor for implementing the data processing method of distributed learning of any one of claims 1 to 7 when executing executable instructions stored in said memory.
11. A computer readable storage medium storing executable instructions for implementing the data processing method of distributed learning of any one of claims 1 to 7 when executed by a processor.
CN202110233219.6A 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment Active CN112836765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110233219.6A CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110233219.6A CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Publications (2)

Publication Number Publication Date
CN112836765A CN112836765A (en) 2021-05-25
CN112836765B true CN112836765B (en) 2023-12-22

Family

ID=75934411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110233219.6A Active CN112836765B (en) 2021-03-01 2021-03-01 Data processing method and device for distributed learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN112836765B (en)

Citations (5)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143233A1 (en) * 2019-01-07 2020-07-16 平安科技(深圳)有限公司 Method and device for building scorecard model, computer apparatus and storage medium
CN111506485A (en) * 2020-04-15 2020-08-07 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111950706A (en) * 2020-08-10 2020-11-17 中国平安人寿保险股份有限公司 Data processing method and device based on artificial intelligence, computer equipment and medium
CN112257873A (en) * 2020-11-11 2021-01-22 深圳前海微众银行股份有限公司 Training method, device, system, equipment and storage medium of machine learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Discussion on Enterprise Data Sharing Based on Federated Learning; He Wen; Bai Hanru; Li Chao; Information & Computer (Theoretical Edition), No. 08; full text *

Also Published As

Publication number Publication date
CN112836765A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
US20210150372A1 (en) Training method and system for decision tree model, storage medium, and prediction method
CN111815169A (en) Business approval parameter configuration method and device
CN112036483B (en) AutoML-based object prediction classification method, device, computer equipment and storage medium
CN107368526A (en) A kind of data processing method and device
CN113674087A (en) Enterprise credit rating method, apparatus, electronic device and medium
CN113011895A (en) Associated account sample screening method, device and equipment and computer storage medium
CN114638442A (en) Flight training scheme generation system, method and equipment for individual difference
CN113919432A (en) Classification model construction method, data classification method and device
CN109255389A (en) A kind of equipment evaluation method, device, equipment and readable storage medium storing program for executing
CN112836765B (en) Data processing method and device for distributed learning and electronic equipment
CN112131587A (en) Intelligent contract pseudo-random number security inspection method, system, medium and device
CN113011893B (en) Data processing method, device, computer equipment and storage medium
CN113269179B (en) Data processing method, device, equipment and storage medium
CN115619541A (en) Risk prediction system and method
CN115936895A (en) Risk assessment method, device and equipment based on artificial intelligence and storage medium
CN106301880A (en) One determines that cyberrelationship degree of stability, Internet service recommend method and apparatus
CN117390455B (en) Data processing method and device, storage medium and electronic equipment
Agiza et al. Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMs
CN115130623B (en) Data fusion method and device, electronic equipment and storage medium
CN117273622A (en) Resource allocation method, device, equipment and computer readable storage medium
CN117974320A (en) Data processing method and device for wind control strategy generation
CN117876090A (en) Risk identification method, electronic device, storage medium, and program product
CN114239985A (en) Exchange rate prediction method and device, electronic equipment and storage medium
CN117609061A (en) Account test analysis method and device based on support vector machine
CN117391844A (en) Method and device for determining overdue prediction result, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant