CN112182982B - Multiparty joint modeling method, device, equipment and storage medium - Google Patents

Multiparty joint modeling method, device, equipment and storage medium

Info

Publication number
CN112182982B
CN112182982B (application number CN202011165475.8A)
Authority
CN
China
Prior art keywords
data
cluster
histogram
sample
bucket
Prior art date
Legal status
Active
Application number
CN202011165475.8A
Other languages
Chinese (zh)
Other versions
CN112182982A (en)
Inventor
宋传园 (Song Chuanyuan)
冯智 (Feng Zhi)
吕亮亮 (Lyu Liangliang)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011165475.8A
Publication of CN112182982A
Application granted
Publication of CN112182982B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Abstract

The disclosure provides a multiparty joint modeling method based on a distributed system, relating to the fields of machine learning, secure computing, and the like. The multiparty joint modeling method comprises the following steps: intersecting the sample identifiers included in each of a plurality of clusters to obtain intersection sample identifiers and, for each of the plurality of clusters, the cluster sample data corresponding to the intersection sample identifiers, wherein the sample identifiers and cluster sample data of each cluster are stored distributed across a plurality of clients of that cluster; bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each cluster; constructing a global information gain histogram based on the sample labels and the cluster bucket data of each of the plurality of clusters; and constructing a decision tree model based on the global information gain histogram.

Description

Multiparty joint modeling method, device, equipment and storage medium
Technical Field
The present disclosure relates to the fields of machine learning, secure computing, etc., and more particularly, to a multiparty joint modeling method, apparatus, device, and storage medium.
Background
With the development of algorithms and big data, algorithms and computing power are no longer the bottleneck impeding the development of AI; genuine, effective data sources in each field are now the most precious resource. At the same time, barriers that are difficult to break down exist between data sources. In most industries data exist as isolated islands, and because of industry competition, privacy and security concerns, complicated administrative procedures, and similar problems, data integration faces heavy resistance even between departments of the same company. In practice, integrating data scattered across different places and institutions is almost impossible, or the cost required is enormous.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a multiparty joint modeling method, comprising: intersecting the sample identifiers included in each of a plurality of clusters to obtain intersection sample identifiers and, for each of the plurality of clusters, the cluster sample data corresponding to the intersection sample identifiers, wherein the sample identifiers and cluster sample data of each cluster are stored distributed across a plurality of clients of that cluster; bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each of the plurality of clusters; constructing a global information gain histogram based on a sample label and the cluster bucket data of each of the plurality of clusters, wherein the sample label is the true value of each sample and is stored in a specific cluster of the plurality of clusters; and constructing a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided a multiparty joint prediction method based on a distributed system, comprising: inputting a prediction sample into a decision tree model; for each sub-decision tree of the decision tree model, obtaining the cluster to which the root node belongs; communicating with that cluster to obtain the feature of the root node; sending the prediction sample's feature data for the root node's feature to the cluster to which the node belongs, and obtaining the cluster to which the child node belongs; iterating this process to obtain each sub-decision tree's predicted value for the prediction sample; and summing the predicted values of all sub-decision trees to obtain the predicted value for the prediction sample.
According to an aspect of the present disclosure, there is provided a multiparty joint modeling apparatus based on a distributed system, comprising: an intersection module configured to intersect the data included in the clusters so that each of the plurality of clusters obtains its corresponding cluster sample data; a bucketing module configured to bucket the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each of the plurality of clusters; a first construction module configured to construct a global information gain histogram based on the sample label and the cluster bucket data of the plurality of clusters; and a second construction module configured to construct a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including: a processor; and a memory storing a program, the program comprising instructions, and the instructions when executed by the processor cause the processor to perform a multi-party joint modeling method according to the above and/or a multi-party joint prediction method according to the above.
According to another aspect of the present disclosure, there is also provided a computer readable storage medium storing a program comprising instructions that, when executed by a processor of an electronic device, cause the electronic device to perform a multi-party joint modeling method according to the above and/or a multi-party joint prediction method according to the above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements a multi-party joint modeling method as described above and/or a multi-party joint prediction method as described above.
According to the technical scheme of the present disclosure, a multiparty joint modeling method based on a distributed system is realized by bucketing the distributed data and constructing an information gain histogram over the distributed data, which improves the speed of multiparty joint modeling and allows modeling to be completed even in scenarios with large data volumes.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIGS. 1-2 are flowcharts illustrating a multi-party joint modeling method according to an example embodiment;
FIG. 3 is a block diagram illustrating components of a distributed system according to an example embodiment;
FIG. 4 is a flowchart illustrating bucketing the cluster sample data of each of a plurality of clusters in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a bucketing process according to an example embodiment;
FIG. 6 is a flowchart illustrating generating at least one data bucket based on feature data of a current feature in accordance with an example embodiment;
FIG. 7 is a flowchart illustrating the construction of a global information gain histogram in accordance with an exemplary embodiment;
FIG. 8 is a flowchart illustrating constructing a first information gain histogram in accordance with an exemplary embodiment;
FIG. 9 is a flowchart illustrating obtaining a first information gain sub-histogram or first candidate split gain of a feature of a node to be split in accordance with an example embodiment;
FIG. 10 is a flowchart illustrating constructing a first information gain sub-histogram according to an exemplary embodiment;
FIG. 11 is a flowchart illustrating a multi-party joint prediction method according to an example embodiment;
FIG. 12 is a block diagram illustrating a multi-party joint modeling apparatus according to an example embodiment;
FIG. 13 is a block diagram illustrating an exemplary computing device that may be used in connection with the exemplary embodiments.
Detailed Description
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, existing multiparty joint modeling methods are slow, and, especially in scenarios with large data volumes, are so constrained by device performance, storage capacity, and other factors that multiparty joint modeling cannot be performed at all, which greatly limits their practical application.
To solve these technical problems, the present disclosure provides a multiparty joint modeling method based on a distributed system: the sample identifiers of a plurality of clusters are intersected to obtain the cluster sample data of each cluster; the cluster sample data of each cluster are bucketed to obtain cluster bucket data; a global information gain histogram is constructed based on the sample labels and the cluster bucket data of each cluster; and a decision tree model is constructed based on the global information gain histogram. In this way, a multiparty joint modeling method based on a distributed system is realized by bucketing the distributed data and constructing an information gain histogram over the distributed data, which improves the speed of multiparty joint modeling and allows modeling to be completed in scenarios with large data volumes.
The multi-party joint modeling method of the present disclosure will be further described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a multiparty joint modeling method based on a distributed system according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the multiparty joint modeling method may include: step S101, intersecting the sample identifiers included in each of a plurality of clusters to obtain intersection sample identifiers and, for each of the plurality of clusters, the cluster sample data corresponding to the intersection sample identifiers; step S102, bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each of the plurality of clusters; step S103, constructing a global information gain histogram based on the sample label and the cluster bucket data of each of the plurality of clusters; and step S104, constructing a decision tree model based on the global information gain histogram. By building a distributed system in which each cluster's data are stored distributed across its multiple clients, and by having the clients perform the preliminary bucketing of the distributed data and construct the information gain sub-histograms, the speed of multiparty joint modeling can be greatly improved, and the model can support rapid multiparty joint modeling in richer scenarios.
According to some embodiments, a distributed system includes a plurality of clusters, each cluster including a server and a plurality of clients. In one exemplary embodiment, as shown in fig. 3, distributed system 3000 includes clusters 3100, 3200, and 3300, each including one server and three clients; e.g., cluster 3100 includes server 3110 and client 3101, cluster 3200 includes server 3210 and client 3201, and cluster 3300 includes server 3310 and client 3301. The main functions of a server may include coordinating the clients within its own cluster, instructing the clients to complete tasks such as bucketing and building histograms, integrating information uploaded by the clients, sending information down to the clients, performing some computation, and communicating with the servers of other clusters. In one exemplary embodiment, communication between clusters may be encrypted using the Paillier cryptosystem. The main functions of a client may include storing data, receiving instructions from the server to complete bucketing and histogram-building tasks, and uploading information to the server. Communication within a cluster may have no privacy requirement. In one exemplary embodiment, a client communicates only with the server of its own cluster.
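For illustration, the division of roles described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: every class and field name (Client, Server, Cluster, samples) is hypothetical, not taken from the disclosure.

```python
# Illustrative sketch of the distributed roles; all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Client:
    """Stores one shard of the cluster's sample data; talks only to its own server."""
    client_id: str
    samples: dict = field(default_factory=dict)  # sample_id -> {feature: value}

@dataclass
class Server:
    """Coordinates the clients of one cluster and communicates with peer servers."""
    cluster_id: str
    clients: list = field(default_factory=list)

    def broadcast(self, message):
        # Intra-cluster traffic; the disclosure places no privacy requirement on it.
        return [(c.client_id, message) for c in self.clients]

@dataclass
class Cluster:
    server: Server
    # Inter-cluster messages would be Paillier-encrypted; encryption is omitted here.
```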
Each cluster may include a quantity of raw sample data distributed across the multiple clients of the cluster, the sample data including sample identifiers. All sample identifiers included in each cluster are intersected to obtain the common sample identifiers. Each cluster then selects, from its raw sample data, the samples whose identifiers coincide with the common sample identifiers as the cluster sample data of that cluster.
In one exemplary embodiment, step S101 may include: the server of each cluster gathers all sample identifiers within the cluster; sample identifier intersection among the clusters is performed based on an oblivious transfer (OT) security protocol, so that every cluster obtains the same common sample identifiers; and the server of each cluster sends the common sample identifiers to the clients of its cluster, instructing each client to intersect the common sample identifiers with the sample identifiers of the raw sample data it holds, to obtain the client sample data of each client. The cluster sample data may include, for example, the client sample data stored at the corresponding plurality of clients.
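The data flow of this intersection step can be sketched as follows. The actual protocol performs the intersection privately over an OT security protocol; plain set intersection stands in for it here, and all function and variable names are hypothetical.

```python
# Plaintext stand-in for the sample-identifier intersection of step S101.
from functools import reduce

def intersect_sample_ids(cluster_id_sets):
    """Each cluster contributes the union of its clients' sample identifiers."""
    return reduce(set.intersection, cluster_id_sets)

def filter_client_samples(client_samples, common_ids):
    """Each client keeps only samples whose identifier is in the common set."""
    return {sid: row for sid, row in client_samples.items() if sid in common_ids}

cluster_a = {"s1", "s2", "s3", "s5"}   # union of cluster A's client sample IDs
cluster_b = {"s2", "s3", "s4", "s5"}
common = intersect_sample_ids([cluster_a, cluster_b])
print(sorted(common))  # ['s2', 's3', 's5']
```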
According to some embodiments, the cluster sample data and the client sample data may each include a sample identifier and at least one feature, and, as shown in fig. 4, step S102 of bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each of the plurality of clusters may include: step S10201, traversing the at least one feature of the cluster sample data of each of the plurality of clusters; step S10202, generating at least one data bucket based on the feature data of the current feature; and step S10203, integrating the data buckets corresponding to all the features to obtain the cluster bucket data of the cluster. Bucketing the cluster sample data reduces the number of split points, and corresponding information gains, that must be computed, greatly improving the modeling speed; at the same time, because bucketing erases the individual feature values of the samples within a bucket, it can serve as a basis for multiparty joint modeling under privacy requirements.
Bucketing is a process of discretizing feature data based on feature information. According to some embodiments, a data bucket may include at least one of a sample identifier, a bucket value, a belonging client, and a belonging feature. In one exemplary embodiment, as shown in FIG. 5, data 501 include sample identifiers 1-15 and the feature data of a selected feature. The bucketing process may include, for example: sorting the feature data of the selected feature to generate sorted feature data 502; and splitting the feature data into a plurality of data buckets 5001 based on a preset bucketing rule to obtain bucket data 503. The belonging clients of a data bucket may include the clients holding the feature data placed in that bucket; the belonging feature of a data bucket may be the feature on which the bucketing is based; the sample identifiers of a data bucket may include the sample identifiers corresponding to the feature data placed in that bucket; and the value of a data bucket may be, for example, the average, median, minimum, maximum, or a value obtained by another calculation method over all the feature data placed in that bucket, which is not limited herein. In one exemplary embodiment, as shown in FIG. 5, the value of each data bucket 5001 may be the median of all the feature data placed in that bucket.
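A minimal sketch of this bucketing process, assuming equal-size buckets and the median as each bucket's value (merely one of the options listed above); the function name and the bucket size are illustrative.

```python
# Equal-size bucketing with the median as the bucket value, as in Fig. 5.
import statistics

def bucketize(feature_data, bucket_size=3):
    """feature_data: list of (sample_id, value) pairs for one feature."""
    ordered = sorted(feature_data, key=lambda pair: pair[1])  # sort by value
    buckets = []
    for i in range(0, len(ordered), bucket_size):
        chunk = ordered[i:i + bucket_size]
        buckets.append({
            "sample_ids": [sid for sid, _ in chunk],
            "value": statistics.median(v for _, v in chunk),  # median, per Fig. 5
        })
    return buckets

data = [(1, 0.2), (2, 1.4), (3, 0.9), (4, 2.1), (5, 0.5), (6, 1.8)]
for b in bucketize(data):
    print(b["value"], b["sample_ids"])  # 0.5 [1, 5, 3] / 1.8 [2, 6, 4]
```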
According to some embodiments, a data bucket to be merged may include at least one of a sample identifier, a bucket value, a belonging client, and a belonging feature. As shown in fig. 6, generating at least one data bucket based on the feature data of the current feature in step S10202 may include: step S602, judging whether the feature data of the current feature are distributed on the same client; step S603, in response to the feature data of the current feature being distributed across a plurality of clients, instructing each of those clients to bucket the feature data of the current feature included in its own client sample data, generating at least one data bucket to be merged for the current feature, and uploading the at least one data bucket to be merged to the server corresponding to those clients; and step S604, merging the received data buckets to be merged uploaded by the plurality of clients to generate at least one data bucket. In this way, when the feature data of the current feature are distributed across multiple clients, the server instructs each client to preliminarily bucket the sample data it holds and then merges buckets whose values are identical or close into a single bucket, so that the distributed data are bucketed even in that case. Compared with transmitting all feature data of the current feature from the clients to the server and having the server sort and bucket them, this approach markedly accelerates bucketing and thus greatly improves the modeling speed.
According to some embodiments, step S604 of merging the received data buckets to be merged uploaded by the plurality of clients to generate the at least one data bucket may include, for example: sorting all data buckets to be merged of the current feature by bucket value; and merging consecutive data buckets to be merged whose values are identical or close into one data bucket. The sample identifiers of a merged data bucket may be all identifiers included in the merged buckets; the belonging clients of a merged data bucket may include the belonging clients of the merged buckets; the belonging feature of a merged data bucket may be the current feature; and the value of a merged data bucket may be, for example, the average, median, minimum, maximum, or a value obtained by another calculation method over the values of the merged buckets, which is not limited herein.
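A sketch of this server-side merge under assumptions: buckets are merged when their values differ by at most a tolerance atol, and the merged value is a sample-count-weighted mean (one of the calculation methods listed above). All names and the tolerance are illustrative.

```python
# Merge to-be-merged buckets from several clients (step S604 sketch).
def merge_buckets(to_merge, atol=0.05):
    """to_merge: list of {"value", "sample_ids", "clients"} dicts from all clients."""
    ordered = sorted(to_merge, key=lambda b: b["value"])  # sort by bucket value
    merged = []
    for b in ordered:
        if merged and abs(b["value"] - merged[-1]["value"]) <= atol:
            last = merged[-1]
            n_old, n_new = len(last["sample_ids"]), len(b["sample_ids"])
            # sample-count-weighted mean of the member bucket values
            last["value"] = (last["value"] * n_old + b["value"] * n_new) / (n_old + n_new)
            last["sample_ids"] = last["sample_ids"] + b["sample_ids"]
            last["clients"] |= set(b["clients"])
        else:
            merged.append({"value": b["value"],
                           "sample_ids": list(b["sample_ids"]),
                           "clients": set(b["clients"])})
    return merged
```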
According to some embodiments, as shown in fig. 6, step S10202 of generating at least one data bucket based on the feature data of the current feature may include: step S605, sending each of the at least one merged data bucket to its belonging client. The client bucket data of that client then include the merged data buckets.
According to some embodiments, as shown in fig. 6, step S10202 of generating at least one data bucket based on the feature data of the current feature may include: step S606, in response to the feature data of the current feature being located on a single client, instructing that client to bucket the feature data of the current feature, generating at least one data bucket, and uploading the at least one data bucket to the server corresponding to that client. Step S601 and step S607 in fig. 6 are similar to steps S10201 and S10203 in fig. 4, respectively. Step S607 may be performed after step S605 or step S606. In this way, when all feature data of the current feature reside on a single client, that client's bucketing result is used directly as the final result, which avoids the workload of server-side bucketing and improves the modeling speed.
According to the technical scheme of the present disclosure, the clients are instructed to bucket the client sample data and generate either data buckets to be merged or final data buckets; the server merges the data buckets to be merged and combines them with the other data buckets to obtain the cluster bucket data and the client bucket data of every client. Rapidly bucketing the distributed data in this way can greatly improve the modeling speed while enabling the model to support richer scenarios.
According to some embodiments, the plurality of clusters includes a first cluster and at least one second cluster. As shown in fig. 2, the multiparty joint modeling method may further include: step S202, generating a public/private key pair at the server of the first cluster; and step S203, sending the public key and the modeling parameters to the server of each of the at least one second cluster. Step S201 and step S204 in fig. 2 are similar to step S101 and step S102 in fig. 1. With homomorphic encryption, the result that a second cluster computes on the encrypted data, once decrypted by the first cluster, equals the result of the same computation on the unencrypted data, so the encrypted information can serve as a basis for multiparty joint modeling under privacy requirements.
Modeling parameters may include, for example, the maximum number of iterations, the learning rate, stop-split conditions, model convergence conditions, and so forth. The modeling parameters are public to every cluster participating in the modeling and do not require encryption.
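By way of illustration only, such a parameter set might look as follows; every name and value here is a hypothetical example, not prescribed by the disclosure.

```python
# Hypothetical modeling parameters of the kind shared in step S203.
modeling_params = {
    "max_iterations": 50,     # maximum number of sub-decision trees
    "learning_rate": 0.1,
    "max_depth": 6,           # a stop-split condition
    "min_samples_leaf": 20,   # a leaf-size stop-split condition
    "convergence_tol": 1e-4,  # model convergence condition
}
# These are public to every participating cluster and are sent unencrypted,
# together with the Paillier public key.
```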
According to some embodiments, the cluster sample data of the first cluster further include a sample label. As shown in fig. 7, constructing the global information gain histogram based on the sample label and the cluster bucket data of each of the plurality of clusters includes: step S10301, obtaining the current model's predicted value for each sample corresponding to each sample identifier of the first cluster's cluster bucket data; step S10302, calculating first-order gradient data and second-order gradient data based on the predicted values and the sample labels; and step S10303, constructing a global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each of the plurality of clusters. By calculating first-order and second-order gradient data, a global information gain histogram can be constructed in combination with the cluster feature data of each cluster, so that an optimal split point can later be determined from the global information gain histogram to build the decision tree. Meanwhile, the information gain histogram reduces the number of split points and split thresholds that must be computed, and accelerates histogram subtraction, improving the modeling speed.
According to some embodiments, the current model includes one or more sub-decision trees. The decision tree model constructed by the present disclosure may be, for example, a gradient-boosted decision tree model, an XGBoost model, a LightGBM model, or another model, which is not limited herein. Each leaf node of the current model includes at least one sample identifier, indicating the samples split to that node. The structure of the sub-decision trees included in the current model, the clusters to which all nodes belong, and the sample identifiers included in all leaf nodes may be public to all clusters. A sample's predicted value may be calculated by summing each sub-decision tree's predicted value for the sample to obtain the model's predicted value for that sample.
The first-order gradient data and second-order gradient data may be obtained by differentiating the objective function set for the model to obtain its first-order and second-order gradients, and then substituting each sample's predicted value and sample label into those gradients, yielding the first-order and second-order gradient data of the sample.
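For example, for two commonly used objective functions the first- and second-order gradients take the following well-known forms; the choice of objective is an assumption, since the disclosure does not fix one.

```python
# First- and second-order gradients (g, h) for two common objectives,
# as used in gradient-boosted trees.
import math

def gradients_squared_loss(pred, label):
    # L = (pred - label)^2 / 2  =>  g = pred - label, h = 1
    return pred - label, 1.0

def gradients_logloss(pred, label):
    # L = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(pred)
    # => g = p - y, h = p * (1 - p)
    p = 1.0 / (1.0 + math.exp(-pred))
    return p - label, p * (1.0 - p)
```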
The information gain histogram may include a plurality of histogram buckets in one-to-one correspondence with the data buckets, each histogram bucket representing the information gain of its corresponding data bucket. Each histogram bucket includes the first-order gradient sum and the second-order gradient sum over all samples of the corresponding data bucket, as well as the number of samples the data bucket includes.
According to some embodiments, as shown in fig. 2, step S10303 of constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each of the plurality of clusters includes: step S207, encrypting the first-order gradient data and the second-order gradient data and sending them to the server of each of the at least one second cluster; step S208, obtaining at least one node to be split of the current model, the node to be split including at least one sample identifier; step S209, constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster, and the at least one node to be split; step S211, receiving at least one ciphertext information gain histogram from the server of each of the at least one second cluster; step S212, decrypting the at least one ciphertext information gain histogram to obtain at least one second information gain histogram in one-to-one correspondence with it; and step S213, combining the first information gain histogram and the at least one second information gain histogram to obtain the global information gain histogram. Steps S205 and S206 in fig. 2 are similar to steps S10301 and S10302 in fig. 7, respectively. By sending the encrypted gradients to the at least one second cluster and then receiving and decrypting the ciphertext information gain histograms it returns, the corresponding second information gain histograms are obtained without either party acquiring the other's gradient data or sample data. Combining the first cluster's first information gain histogram with the at least one second information gain histogram then yields the global information gain histogram under privacy requirements.
A node to be split may be, for example, a leaf node of the most recent sub-decision tree that meets the allowed-split condition. The allowed-split condition may be, for example, that the number of sample identifiers included in the leaf node is smaller than a preset value, that the depth of the leaf node is smaller than a preset depth, and so on, which is not limited herein. Being a leaf node, the node to be split may include a plurality of sample identifiers representing the samples the model has split to this node.
According to some embodiments, as shown in fig. 8, step S209 of constructing the first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster, and the at least one node to be split includes: step S20901, for each of the at least one node to be split, traversing at least one feature of the first cluster's cluster bucket data; step S20902, obtaining a first information gain sub-histogram or a first candidate split gain for the current feature of the node to be split, based on the node to be split and the feature data of the current feature; and step S20903, merging the first information gain sub-histograms or first candidate split gains of every feature of the first cluster's cluster bucket data for every node to be split, to obtain the first information gain histogram. Constructing a first information gain sub-histogram or first candidate split gain for every node to be split and every feature of the first cluster yields the first information gain histogram, from which the global information gain histogram can subsequently be obtained to build the decision tree.
According to some embodiments, as shown in fig. 9, step S20902 of obtaining a first information gain sub-histogram or a first candidate split gain for the current feature of the node to be split, based on the node to be split and the feature data of the current feature, includes: step S902, judging whether the feature data of the current feature are distributed on the same client; step S903, in response to the feature data of the current feature being distributed across a plurality of clients, instructing each of those clients to construct a first information gain sub-histogram to be merged based on the first-order gradient data, the second-order gradient data, its own client bucket data, and the node to be split, and to upload it to the server corresponding to those clients; and step S904, merging the received first information gain sub-histograms to be merged uploaded by the plurality of clients to construct the first information gain sub-histogram. In this way, when the feature data of the current feature are distributed across multiple clients, each client generates a sub-histogram to be merged and the server merges them, so that a first information gain sub-histogram is constructed for the current node to be split and the current feature even in that case. Compared with having the server of the first cluster build the sub-histogram directly, this lets multiple clients build sub-histograms in parallel, improving the modeling speed.
According to some embodiments, the first information gain sub-histogram includes at least one histogram bucket corresponding to all data buckets of the feature, and each first information gain sub-histogram to be merged includes at least one histogram bucket to be merged, likewise corresponding to all data buckets of the feature; the histogram buckets and the histogram buckets to be merged each include at least one of a first-order gradient sum and a second-order gradient sum. As shown in fig. 10, step S904 of merging the first information gain sub-histograms to be merged uploaded by the plurality of clients to construct the first information gain sub-histogram includes: step S90401, merging the histogram buckets to be merged in the received first information gain sub-histograms to be merged to generate at least one histogram bucket; and step S90402, constructing the first information gain sub-histogram based on the at least one histogram bucket. By merging the histogram buckets to be merged that correspond to the same data bucket into one histogram bucket, the first information gain sub-histogram is constructed in a distributed manner, and it can later be merged with the sub-histograms of the other features to obtain the first information gain histogram.
Each histogram bucket, and each histogram bucket to be merged, may comprise the first-order gradient sum, the second-order gradient sum, and the number of sample identifiers over the intersection of the sample identifiers of its corresponding data bucket with the sample identifiers included in the current node to be split.
According to some embodiments, step S90401 of merging the histogram buckets to be merged in the first information gain sub-histograms to be merged uploaded by the plurality of clients to generate at least one histogram bucket may include: merging the one or more histogram buckets to be merged that correspond to each data bucket of the current feature to obtain the histogram bucket corresponding to that data bucket. The first-order gradient sum of a histogram bucket may be the sum of the first-order gradient sums of the merged histogram buckets, its second-order gradient sum may be the sum of their second-order gradient sums, and its sample identifier count may be the sum of their sample identifier counts.
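A sketch of this merge, assuming each client reports, per data bucket, the triple of first-order gradient sum, second-order gradient sum, and sample count; the field layout is illustrative.

```python
# Fuse histogram buckets reported by several clients for the same data bucket
# by summing their three statistics (steps S90401-S90402 sketch).
from collections import defaultdict

def merge_histogram_buckets(to_merge):
    """to_merge: list of (data_bucket_id, g_sum, h_sum, count) from all clients."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for bucket_id, g_sum, h_sum, count in to_merge:
        acc[bucket_id][0] += g_sum   # sum of first-order gradient sums
        acc[bucket_id][1] += h_sum   # sum of second-order gradient sums
        acc[bucket_id][2] += count   # sum of sample-identifier counts
    return dict(acc)
```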
According to some embodiments, as shown in fig. 9, step S20902 of obtaining a first information gain sub-histogram or a first candidate split gain for the current feature of the node to be split may include: step S905, in response to all feature data of the current feature being located on a single client, instructing that client to construct the first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, its own client bucket data, and the node to be split, to calculate a first candidate split gain from that sub-histogram, and to upload the first candidate split gain to the server of the first cluster. Step S901 and step S906 in fig. 9 are similar to steps S20901 and S20903 in fig. 8, respectively. Step S906 may be performed after step S904 or step S905. In this way, when all feature data of the current feature reside on a single client, the first information gain sub-histogram of the current feature of the current node to be split is constructed directly and the first candidate split gain is calculated from it, which avoids the workload of server-side histogram construction and further improves the modeling speed.
The first candidate split gain may be the maximum gain of the current feature of the current node to be split, obtained by computing the split gain corresponding to each histogram bucket of the first information gain sub-histogram and selecting the maximum. The split gain corresponding to a histogram bucket may be computed as follows. For the current node to be split and the current feature, obtain the first-order gradient sum, the second-order gradient sum, and the sample identifier count over all histogram buckets; these three sums were already obtained during the previous split-gain computation and are called the parent-node information gain raw data. Compute the same three sums over the histogram buckets corresponding to all data buckets whose values are smaller than that of the data bucket corresponding to the current histogram bucket; these are called the left-child information gain raw data. The right-child information gain raw data can then be obtained by subtracting the left-child raw data from the parent-node raw data. An information gain is computed from the raw data of each of the three nodes, and the split gain is the left-child information gain plus the right-child information gain minus the parent-node information gain. The information gain may be computed by dividing the square of the first-order gradient sum by the sum of the second-order gradient sum and a regularization parameter, or by dividing the square of the first-order gradient sum by the sample identifier count, or by another calculation method, which is not limited herein.
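The rule above can be sketched as follows, using the first scoring variant (square of the first-order gradient sum divided by the second-order gradient sum plus a regularization parameter lam); all names are illustrative.

```python
# Split gain over an information gain sub-histogram:
# gain = score(left) + score(right) - score(parent).
def node_score(G, H, lam=1.0):
    return G * G / (H + lam)

def best_split_gain(histogram, lam=1.0):
    """histogram: list of (g_sum, h_sum) per histogram bucket, ordered by bucket value."""
    G_parent = sum(g for g, _ in histogram)   # parent-node raw data
    H_parent = sum(h for _, h in histogram)
    best = float("-inf")
    G_left = H_left = 0.0
    for g, h in histogram[:-1]:               # one candidate split per bucket boundary
        G_left += g                           # left child: buckets with smaller values
        H_left += h
        G_right = G_parent - G_left           # right child obtained by subtraction
        H_right = H_parent - H_left
        gain = (node_score(G_left, H_left, lam)
                + node_score(G_right, H_right, lam)
                - node_score(G_parent, H_parent, lam))
        best = max(best, gain)
    return best
```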
According to the technical scheme of the present disclosure, the clients are instructed either to construct first information gain sub-histograms to be merged or to calculate first candidate split gains directly; the server merges the sub-histograms to be merged and combines them with the first candidate split gains to obtain the first information gain histogram. Constructing the first information gain histogram quickly in this way improves the modeling speed while enabling the model to support richer scenarios.
According to some embodiments, as shown in fig. 2, step S10303 of constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each of the plurality of clusters may include: step S210, constructing a ciphertext information gain histogram based on the ciphertext first-order gradient data, the ciphertext second-order gradient data, the cluster bucket data of the second cluster, and the at least one node to be split. In an exemplary embodiment, a method similar to steps S901 to S906 above may be used, with the ciphertext first-order and second-order gradient data taking the place of the plaintext first-order and second-order gradient data, to obtain the corresponding ciphertext information gain histogram.
According to some embodiments, as shown in fig. 2, step S104 of constructing the decision tree model based on the global information gain histogram includes: step S214, determining the optimal split point based on the global information gain histogram; step S215, instructing the client at which the optimal split point is located to split it; step S216, iterating this splitting process until a split-termination condition is reached, generating a sub-decision tree; and step S217, iterating the sub-decision-tree generation process until an iteration-termination condition is reached, obtaining the decision tree model. New leaf nodes are thus obtained by determining and splitting the optimal split point, and repeated iteration of these steps yields the decision tree model.
According to some embodiments, step S215 of instructing the client at which the optimal split point is located to split it includes: instructing the client to calculate a split threshold based on the optimal split point and its own client bucket data, to obtain the sample identifiers included in each resulting leaf node, and to upload the leaf nodes to the server; and synchronizing the cluster to which the node at the optimal split point belongs, together with the leaf nodes, to the other clusters. Splitting the optimal split point thus produces new leaf nodes, and synchronizing them from the cluster owning the split node to every cluster realizes model sharing among the clusters.
The optimal split point may be the histogram bucket with the largest split gain over all nodes to be split and all features. The split threshold may be calculated as follows: sort the feature data corresponding to the intersection of the sample identifiers of the data bucket corresponding to the optimal split point with the sample identifiers of the node to be split corresponding to the optimal split point; take the average of each pair of adjacent sorted feature values as a candidate split threshold and compute its split gain; and select the candidate with the largest split gain as the split threshold of the optimal split point. The cluster to which each node belongs is public, while the split threshold is stored only in that cluster; during prediction, the owning cluster of each node of the multiparty shared model can therefore be located, and data sent to it to obtain the next node, until the final predicted value is produced.
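A sketch of the threshold selection under these rules; gain_of is a placeholder for the split-gain evaluation described above, and the function assumes at least two distinct feature values in the intersection.

```python
# Choose the split threshold: midpoints of adjacent sorted feature values are
# tried and the one with the largest split gain is kept.
def split_threshold(values, gain_of):
    """values: feature values of (optimal-bucket samples ∩ node samples)."""
    ordered = sorted(values)
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    return max(candidates, key=gain_of)  # gain_of: threshold -> split gain
```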
According to another aspect of the present disclosure, there is also provided a multiparty joint prediction method based on a distributed system. As shown in fig. 11, the multiparty joint prediction method may include: step S1101, inputting a prediction sample into the decision tree model; step S1102, for each sub-decision tree of the decision tree model, obtaining the cluster to which the root node belongs; step S1103, communicating with that cluster to obtain the feature of the root node; step S1104, sending the prediction sample's feature data for that feature to the owning cluster and obtaining the cluster to which the child node belongs; step S1105, iterating this process to obtain each sub-decision tree's predicted value for the prediction sample; and step S1106, summing the predicted values of all sub-decision trees to obtain the predicted value for the prediction sample. Each sub-decision tree's predicted value is computed by repeatedly communicating with the cluster owning the current node and receiving the next node, and the model's predicted value is their sum. Because the decision tree model is shared by the clusters while each split threshold is stored only in the cluster owning its node, the complete model is not disclosed to any single cluster; prediction is realized by combining the information stored in each cluster, supporting multiparty joint modeling in privacy scenarios.
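The traversal loop can be sketched as follows; query_cluster stands in for the inter-cluster communication, since only the owning cluster holds a node's feature and split threshold. All attribute and function names are hypothetical.

```python
# Joint prediction loop of Fig. 11: the caller never sees split thresholds,
# only the next node returned by the owning cluster.
def predict(sample, trees, query_cluster):
    total = 0.0
    for tree in trees:                       # each sub-decision tree
        node = tree.root
        while not node.is_leaf:
            # the shared model reveals only which cluster owns the node
            feature = query_cluster(node.cluster, "feature_of", node)
            # the owning cluster applies its private threshold and routes the sample
            node = query_cluster(node.cluster, "route", node, sample[feature])
        total += node.leaf_value             # sum over sub-decision trees (step S1106)
    return total
```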
According to another aspect of the present disclosure, there is also provided a multiparty joint modeling apparatus. As shown in fig. 12, the multiparty joint modeling apparatus 1200 may include: an intersection module 1201 configured to intersect the sample identifiers included in each of the plurality of clusters to obtain intersection sample identifiers and, for each cluster, the cluster sample data corresponding to the intersection sample identifiers; a bucketing module 1202 configured to bucket the cluster sample data of each of the plurality of clusters to obtain cluster bucket data for each of the plurality of clusters; a first construction module 1203 configured to construct a global information gain histogram based on the sample label and the cluster bucket data of the plurality of clusters; and a second construction module 1204 configured to construct a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided an electronic device, which may include: a processor; and a memory storing a program comprising instructions that when executed by the processor cause the processor to perform a multi-party joint modeling method according to the above.
According to another aspect of the present disclosure, there is also provided a computer readable storage medium storing a program comprising instructions that, when executed by a processor of an electronic device, cause the electronic device to perform a multi-party joint modeling method according to the above.
With reference to fig. 13, a computing device 13000, which is an example of a hardware device (electronic device) that can be applied to aspects of the present disclosure, will now be described. The computing device 13000 can be any machine configured to perform processes and/or calculations and can be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a robot, a smart phone, an on-board computer, or any combination thereof. The multi-party joint modeling method described above may be implemented, in whole or at least in part, by the computing device 13000 or a similar device or system.
The computing device 13000 can include elements that are connected to the bus 13002 (possibly via one or more interfaces) or that communicate with the bus 13002. For example, the computing device 13000 can include the bus 13002, one or more processors 13004, one or more input devices 13006, and one or more output devices 13008. The one or more processors 13004 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (e.g., special processing chips). The input device 13006 can be any type of device capable of inputting information to the computing device 13000 and can include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. The output device 13008 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The computing device 13000 can also include a non-transitory storage device 13010, which can be any storage device that is non-transitory and enables data storage, and can include, but is not limited to, a magnetic disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, an optical disk or any other optical medium, a ROM (read-only memory), a RAM (random-access memory), a cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 13010 may be detachable from an interface, and may hold data/programs (including instructions)/code for implementing the methods and steps described above. The computing device 13000 can also include a communication device 13012, which may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing device 13000 can also include a working memory 13014, which can be any type of working memory that can store programs (including instructions) and/or data useful for the operation of the processor 13004, and can include, but is not limited to, random access memory and/or read only memory devices.
The software elements (programs) may reside in the working memory 13014 and include, but are not limited to, an operating system 13016, one or more application programs 13018, drivers, and/or other data and code. Instructions for performing the above-described methods and steps may be included in one or more applications 13018 and the above-described multi-party joint modeling method may be implemented by the instructions of one or more applications 13018 being read and executed by the processor 13004. More specifically, in the multiparty joint modeling method described above, step S101 to step S104 may be implemented, for example, by the processor 13004 executing the application 13018 having the instructions of step S101 to step S104. Further, other steps in the multi-party joint modeling method described above may be implemented, for example, by the processor 13004 executing an application 13018 having instructions for performing the corresponding steps. Executable code or source code of instructions of the software elements (programs) may be stored in a non-transitory computer readable storage medium (such as the storage device 13010 described above) and may be stored in the working memory 13014 (possibly compiled and/or installed) when executed. Executable code or source code for instructions of software elements (programs) may also be downloaded from a remote location.
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may also be used, and/or particular elements may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented by programming hardware (e.g., programmable logic circuits including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in an assembly language or a hardware programming language such as Verilog, VHDL, or C++, using logic and algorithms according to the present disclosure.
It should also be appreciated that the foregoing method may be implemented by a server-client mode. For example, a client may receive data entered by a user and send the data to a server. The client may also receive data input by the user, perform a part of the foregoing processes, and send the processed data to the server. The server may receive data from the client and perform the aforementioned method or another part of the aforementioned method and return the execution result to the client. The client may receive the result of the execution of the method from the server and may present it to the user, for example, via an output device.
It should also be appreciated that the components of the computing device 13000 can be distributed across a network. For example, some processing may be performed by one processor while other processing is performed by another processor remote from it. Other components of the computing device 13000 can be similarly distributed. As such, the computing device 13000 can be interpreted as a distributed computing system that performs processing at multiple locations.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and devices are merely exemplary embodiments or examples, and that the scope of the invention is limited not by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (19)

1. A multi-party joint modeling method based on a distributed system, wherein the distributed system comprises a plurality of clusters, each cluster of the plurality of clusters comprising a server and a plurality of clients, the method comprising:
intersecting the sample identifiers included in each of the plurality of clusters to obtain intersection sample identifiers and the cluster sample data corresponding to the intersection sample identifiers, wherein the sample identifiers and cluster sample data included in each of the plurality of clusters are stored distributed across the plurality of clients of the corresponding cluster, the cluster sample data comprise the client sample data stored at the corresponding plurality of clients, and the cluster sample data and the client sample data comprise sample identifiers and at least one feature;
bucketing the cluster sample data of each cluster of the plurality of clusters respectively to obtain cluster bucket data of each cluster of the plurality of clusters, comprising:
traversing, for each cluster of the plurality of clusters, the at least one feature of the cluster sample data of that cluster;
generating at least one data bucket based on feature data of the current feature; and
integrating the data buckets corresponding to all the features to obtain the cluster bucket data of that cluster, wherein the cluster bucket data of the cluster comprises the at least one feature of the cluster sample data of the cluster and one or more data buckets corresponding to each feature of the at least one feature of the cluster sample data of the cluster;
constructing a global information gain histogram based on a sample label and the cluster bucket data of each cluster of the plurality of clusters, wherein the sample label is a true value of each sample and is stored in a particular cluster of the plurality of clusters; and
constructing a decision tree model based on the global information gain histogram.
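As an illustrative, non-limiting sketch of the bucketing step of claim 1, the snippet below bins one feature column into quantile buckets and traverses all features of a cluster's sample data. The use of quantile cut points, the fixed bucket count, and all function names are assumptions for illustration; the claim does not prescribe a particular bucketing rule.

```python
import numpy as np

def bucket_feature(values: np.ndarray, sample_ids: np.ndarray, n_buckets: int = 16):
    """Split one feature column into quantile data buckets.

    Returns a list of (upper_edge, sample_id_array) pairs, one per non-empty bucket.
    """
    # Interior quantile cut points give roughly equal-population buckets.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_buckets + 1)[1:-1])
    bucket_index = np.searchsorted(edges, values, side="right")
    buckets = []
    for b in range(n_buckets):
        mask = bucket_index == b
        if mask.any():
            buckets.append((edges[min(b, len(edges) - 1)], sample_ids[mask]))
    return buckets

def bucket_cluster(sample_ids: np.ndarray, feature_matrix: np.ndarray):
    """Traverse all features of one cluster's sample data: one bucket list per feature."""
    return [bucket_feature(feature_matrix[:, j], sample_ids)
            for j in range(feature_matrix.shape[1])]
```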
2. The multi-party joint modeling method of claim 1, wherein the generating at least one data bucket based on the feature data of the current feature comprises:
determining whether the feature data of the current feature is distributed on a single client;
in response to the feature data of the current feature being distributed across a plurality of clients, instructing each client of the plurality of clients to bucket the feature data of the current feature included in its client sample data, to generate at least one to-be-merged data bucket of the current feature, and to upload the at least one to-be-merged data bucket to a server corresponding to the plurality of clients; and
merging the at least one to-be-merged data bucket uploaded by the plurality of clients to generate the at least one data bucket, wherein each data bucket of the at least one data bucket is formed by merging one or more to-be-merged data buckets having identical or similar values, the sample identifiers of each data bucket of the at least one data bucket comprise all sample identifiers of the one or more to-be-merged data buckets, and the clients of each data bucket of the at least one data bucket comprise the clients of the one or more to-be-merged data buckets.
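The server-side merge of claim 2 can be pictured as follows: to-be-merged data buckets whose representative values are identical or within a tolerance are folded into a single data bucket whose sample identifiers and clients are the unions of the parts. The triple-based bucket representation and the tolerance are illustrative assumptions.

```python
def merge_buckets(uploaded, tol=1e-6):
    """Merge per-client buckets with identical or similar representative values.

    `uploaded` is a list of (value, sample_ids, client_id) triples collected
    from all clients. Returns merged buckets as (value, sample_ids, client_set).
    """
    merged = []
    for value, sample_ids, client in sorted(uploaded, key=lambda t: t[0]):
        if merged and abs(value - merged[-1][0]) <= tol:
            # Same or similar value: fold sample ids and client into the last bucket.
            merged[-1][1].extend(sample_ids)
            merged[-1][2].add(client)
        else:
            merged.append([value, list(sample_ids), {client}])
    return [(v, ids, clients) for v, ids, clients in merged]
```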
3. The multi-party joint modeling method of claim 2, wherein generating at least one data bucket based on the feature data of the current feature further comprises:
sending each data bucket of the at least one data bucket to the clients of that data bucket, wherein the client bucket data of each such client comprises the at least one data bucket.
4. The multi-party joint modeling method of claim 3, wherein generating at least one data bucket based on the feature data of the current feature further comprises:
in response to the feature data of the current feature being distributed on a single client, instructing that client to bucket the feature data of the current feature, to generate at least one data bucket, and to upload the at least one data bucket to a server corresponding to that client, wherein the client bucket data of that client comprises the at least one data bucket.
5. The multi-party joint modeling method of claim 4, wherein the plurality of clusters comprises a first cluster and at least one second cluster, and the cluster sample data of the first cluster further comprises the sample label,
wherein constructing the global information gain histogram based on the sample label and the cluster bucket data of each cluster of the plurality of clusters comprises:
obtaining a predicted value of a current model for each sample corresponding to each sample identifier of the cluster bucket data of the first cluster;
calculating first-order gradient data and second-order gradient data based on the predicted values and the sample labels; and
constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each cluster of the plurality of clusters.
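As a concrete, non-limiting instance of the gradient computation of claim 5: for a binary-classification boosting model trained with logistic loss, the first-order and second-order gradients have the closed form g = p − y and h = p(1 − p). The loss function is an assumption; the claim itself is loss-agnostic.

```python
import numpy as np

def logistic_gradients(raw_scores: np.ndarray, labels: np.ndarray):
    """First/second-order gradients of logistic loss w.r.t. the current model's scores.

    g_i = p_i - y_i,  h_i = p_i * (1 - p_i),  with p_i = sigmoid(raw_score_i).
    """
    p = 1.0 / (1.0 + np.exp(-raw_scores))   # predicted probabilities
    g = p - labels                          # first-order gradient data
    h = p * (1.0 - p)                       # second-order gradient data
    return g, h
```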
6. The multi-party joint modeling method of claim 5, wherein constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each cluster of the plurality of clusters comprises:
encrypting the first-order gradient data and the second-order gradient data, and sending the encrypted first-order gradient data and the encrypted second-order gradient data to a server of each second cluster of the at least one second cluster;
acquiring at least one node to be split of the current model, wherein each node to be split comprises at least one sample identifier;
constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster, and the at least one node to be split;
receiving at least one ciphertext information gain histogram from the server of each second cluster of the at least one second cluster;
decrypting the at least one ciphertext information gain histogram to obtain at least one second information gain histogram in one-to-one correspondence with the at least one ciphertext information gain histogram; and
merging the first information gain histogram and the at least one second information gain histogram to obtain the global information gain histogram.
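The encryption of claim 6 is commonly realized with an additively homomorphic scheme, so that a second cluster can accumulate encrypted gradients into histogram buckets without learning their values. The sketch below uses the third-party python-paillier package (`phe`); the choice of Paillier and the key length are assumptions, as the claim fixes no particular scheme.

```python
# pip install phe  (python-paillier, an additively homomorphic Paillier implementation)
from phe import paillier

# First cluster: generate a keypair and encrypt per-sample gradients.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
g = [0.12, -0.43, 0.05]                      # first-order gradient data (toy values)
enc_g = [public_key.encrypt(x) for x in g]   # sent to each second cluster's server

# Second cluster: aggregate ciphertexts per histogram bucket without decrypting.
bucket_sum = enc_g[0] + enc_g[1] + enc_g[2]  # homomorphic addition of ciphertexts

# Back at the first cluster: decrypt the ciphertext histogram entry.
assert abs(private_key.decrypt(bucket_sum) - sum(g)) < 1e-9
```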
7. The multi-party joint modeling method of claim 6, wherein constructing the first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster, and the at least one node to be split comprises:
traversing, for each node to be split of the at least one node to be split, at least one feature of the cluster bucket data of the first cluster;
obtaining, based on the node to be split and feature data of the current feature, a first information gain sub-histogram or a first candidate split gain of the current feature of the node to be split; and
merging the first information gain sub-histograms or the first candidate split gains of each feature of the at least one feature of the cluster bucket data of the first cluster for each node to be split to obtain the first information gain histogram.
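A minimal sketch of the per-node traversal of claim 7: for one node to be split and one feature, only the samples belonging to that node contribute to the gradient sums of each data bucket. The data layout (per-bucket sample-identifier lists and gradient dictionaries) is an illustrative assumption.

```python
def gain_sub_histogram(node_sample_ids, feature_buckets, g, h):
    """Build the information gain sub-histogram of one feature for one node to be split.

    `feature_buckets` is a list of sample-id iterables, one per data bucket of the
    feature; `g`/`h` map sample id -> first/second-order gradient.
    Returns one (G, H) pair of gradient sums per data bucket.
    """
    node = set(node_sample_ids)
    hist = []
    for bucket_ids in feature_buckets:
        ids = [i for i in bucket_ids if i in node]  # restrict to the node's samples
        hist.append((sum(g[i] for i in ids), sum(h[i] for i in ids)))
    return hist
```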
8. The multi-party joint modeling method of claim 7, wherein obtaining the first information gain sub-histogram or the first candidate split gain of the current feature of the node to be split based on the node to be split and the feature data of the current feature comprises:
determining whether the feature data of the current feature is distributed on a single client;
in response to the feature data of the current feature being distributed across a plurality of clients, instructing each client of the plurality of clients to construct a to-be-merged first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, the client bucket data of that client, and the node to be split, and to upload the to-be-merged first information gain sub-histogram to a server corresponding to the plurality of clients; and
merging the received to-be-merged first information gain sub-histograms uploaded by the plurality of clients to construct the first information gain sub-histogram.
9. The multi-party joint modeling method of claim 8, wherein the first information gain sub-histogram comprises at least one histogram bucket in one-to-one correspondence with all the data buckets belonging to the feature, the to-be-merged first information gain sub-histogram comprises at least one to-be-merged histogram bucket in one-to-one correspondence with all the data buckets belonging to the feature, and the histogram buckets and the to-be-merged histogram buckets each comprise a first-order gradient sum and a second-order gradient sum,
wherein merging the received to-be-merged first information gain sub-histograms uploaded by the plurality of clients to construct the first information gain sub-histogram comprises:
merging the to-be-merged histogram buckets of the to-be-merged first information gain sub-histograms uploaded by the plurality of clients to generate at least one histogram bucket, wherein each histogram bucket of the at least one histogram bucket is formed by merging one or more to-be-merged histogram buckets that correspond to the same data bucket and come from different to-be-merged first information gain sub-histograms, the first-order gradient sum of each histogram bucket of the at least one histogram bucket is the sum of the first-order gradient sums of the one or more to-be-merged histogram buckets, and the second-order gradient sum of each histogram bucket of the at least one histogram bucket is the sum of the second-order gradient sums of the one or more to-be-merged histogram buckets; and
constructing the first information gain sub-histogram based on the at least one histogram bucket.
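Because each histogram bucket carries only additive statistics, the merge of claim 9 reduces to an element-wise sum of (first-order gradient sum, second-order gradient sum) pairs across the clients' to-be-merged sub-histograms, as sketched below under the same illustrative layout as above.

```python
def merge_sub_histograms(client_histograms):
    """Merge to-be-merged sub-histograms: per data bucket, sum the (G, H) pairs.

    `client_histograms` is a list of per-client histograms, each a list of
    (G, H) tuples aligned to the same data buckets of the feature.
    """
    n_buckets = len(client_histograms[0])
    merged = []
    for b in range(n_buckets):
        G = sum(hist[b][0] for hist in client_histograms)
        H = sum(hist[b][1] for hist in client_histograms)
        merged.append((G, H))
    return merged
```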
10. The multi-party joint modeling method of claim 9, wherein obtaining the first information gain sub-histogram or the first candidate split gain of the current feature of the node to be split based on the node to be split and the feature data of the current feature further comprises:
in response to all the feature data of the current feature being distributed on a single client, instructing that client to construct the first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, the client bucket data of that client, and the node to be split, to calculate the first candidate split gain based on the first information gain sub-histogram, and to upload the first candidate split gain to the server of the first cluster.
11. The multi-party joint modeling method of claim 6, wherein the ciphertext information gain histogram is constructed based on ciphertext first-order gradient data, ciphertext second-order gradient data, the cluster bucket data of the second cluster, and the at least one node to be split.
12. The multi-party joint modeling method of claim 5, wherein the current model comprises one or more sub-decision trees, and each node of the at least one node to be split is one of the leaf nodes of the last sub-decision tree of the current model.
13. The multi-party joint modeling method of claim 12, wherein constructing the decision tree model based on the global information gain histogram comprises:
determining an optimal split point based on the global information gain histogram;
instructing the client where the optimal split point is located to split at the optimal split point;
iterating the splitting process until a split termination condition is reached, thereby generating the sub-decision tree; and
iterating the sub-decision tree generation process until an iteration termination condition is reached, thereby obtaining the decision tree model.
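For the optimal-split determination of claim 13, gradient-boosting systems typically score each candidate split with the second-order gain criterion Gain = ½(G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)) − γ. The sketch below scans a merged gain histogram with this criterion; the formula and the regularization constants λ and γ are assumptions drawn from the common XGBoost-style objective, not from the claims.

```python
def best_split(histogram, lam=1.0, gamma=0.0):
    """Pick the optimal split point from a merged (global) gain histogram.

    `histogram` is a list of per-bucket (G, H) gradient sums ordered by feature
    value. Returns (best_bucket_index, best_gain): split between bucket i and i+1.
    """
    G_total = sum(G for G, _ in histogram)
    H_total = sum(H for _, H in histogram)
    best_idx, best_gain = -1, float("-inf")
    G_left = H_left = 0.0
    for i, (G, H) in enumerate(histogram[:-1]):
        G_left += G
        H_left += H
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam)
                      - G_total ** 2 / (H_total + lam)) - gamma
        if gain > best_gain:
            best_idx, best_gain = i, gain
    return best_idx, best_gain
```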
14. The multi-party joint modeling method of claim 13, wherein the current model comprises the structures of the one or more sub-decision trees, and the clusters to which all nodes belong are disclosed to the plurality of clusters,
wherein instructing the client where the optimal split point is located to split at the optimal split point comprises:
instructing the client to calculate a split threshold based on the optimal split point and the client bucket data of that client, to obtain the sample identifiers included in the split leaf nodes, and to upload the leaf nodes to the server; and
synchronizing the cluster to which the node of the optimal split point belongs and the leaf nodes to the clusters other than that cluster.
15. A multi-party joint prediction method based on a decision tree model constructed according to the method of any one of claims 1-14, the method comprising:
inputting a prediction sample into the decision tree model;
acquiring, for each sub-decision tree of the decision tree model, the cluster to which the root node belongs;
communicating with that cluster to acquire the feature of the root node;
sending feature data of the feature of the root node of the prediction sample to that cluster, and acquiring the cluster to which the child node belongs;
iterating this process to obtain a predicted value of each sub-decision tree for the prediction sample; and
summing the predicted values of the sub-decision trees for the prediction sample to obtain a predicted value of the prediction sample.
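The prediction loop of claim 15 amounts to routing the prediction sample through each sub-decision tree, one node-owning cluster at a time, and summing the leaf outputs. The dictionary-based tree representation below is an illustrative assumption; in deployment each comparison would be a round trip to the cluster that owns the node's feature.

```python
def predict(sample, sub_trees):
    """Predict by traversing each sub-decision tree and summing leaf values.

    Each tree node is a dict: leaves carry {'value': v}; internal nodes carry
    {'cluster': c, 'feature': f, 'threshold': t, 'left': ..., 'right': ...},
    where in the multi-party setting the comparison is answered by cluster `c`.
    """
    total = 0.0
    for root in sub_trees:
        node = root
        while 'value' not in node:
            # In deployment this comparison is a round trip to node['cluster'],
            # which alone holds feature node['feature'] for this sample.
            go_left = sample[node['feature']] <= node['threshold']
            node = node['left'] if go_left else node['right']
        total += node['value']
    return total
```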
16. A multi-party joint modeling apparatus based on a distributed system, wherein the distributed system comprises a plurality of clusters, each cluster of the plurality of clusters comprising a server and a plurality of clients, the apparatus comprising:
an intersection module configured to perform an intersection on the sample identifiers included in each of the plurality of clusters to obtain an intersection sample identifier and the cluster sample data, included in each of the plurality of clusters, corresponding to the intersection sample identifier, wherein the sample identifiers and the cluster sample data included in each of the plurality of clusters are distributively stored in the plurality of clients of the corresponding cluster, the cluster sample data comprises the client sample data stored on the corresponding plurality of clients, and the cluster sample data and the client sample data each comprise a sample identifier and at least one feature;
a bucketing module configured to bucket the cluster sample data of each cluster of the plurality of clusters respectively to obtain cluster bucket data of each cluster of the plurality of clusters, including:
traversing, for each cluster of the plurality of clusters, the at least one feature of the cluster sample data of that cluster;
generating at least one data bucket based on feature data of the current feature; and
integrating the data buckets corresponding to all the features to obtain the cluster bucket data of that cluster, wherein the cluster bucket data of the cluster comprises the at least one feature of the cluster sample data of the cluster and one or more data buckets corresponding to each feature of the at least one feature of the cluster sample data of the cluster;
a first construction module configured to construct a global information gain histogram based on a sample label and the cluster bucket data of each cluster of the plurality of clusters, wherein the sample label is a true value of each sample and is stored in a particular cluster of the plurality of clusters; and
a second construction module configured to construct a decision tree model based on the global information gain histogram.
17. An electronic device, comprising:
a processor; and
a memory storing a program comprising instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 1-15.
18. A computer readable storage medium storing a program, the program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1-15.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-15.
CN202011165475.8A 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium Active CN112182982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011165475.8A CN112182982B (en) 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112182982A CN112182982A (en) 2021-01-05
CN112182982B true CN112182982B (en) 2024-03-01

Family

ID=73922290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011165475.8A Active CN112182982B (en) 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182982B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722739B * 2021-09-06 2024-04-09 JD Technology Holding Co., Ltd. Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114124784B * 2022-01-27 2022-04-12 Network Information Research Institute, Institute of Systems Engineering, Academy of Military Sciences Intelligent routing decision protection method and system based on vertical federation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372136A (en) * 2010-12-30 2017-02-01 Facebook, Inc. Distributed cache system and method and storage medium
CN111695697A (en) * 2020-06-12 2020-09-22 Shenzhen Qianhai WeBank Co., Ltd. Multi-party combined decision tree construction method and device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160269247A1 (en) * 2015-03-13 2016-09-15 Nec Laboratories America, Inc. Accelerating stream processing by dynamic network aware topology re-optimization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Distributed Parallel Construction Method for Equi-Width Histograms in Cloud Databases; Wang Yang; Zhong Yong; Zhou Weibo; Yang Guanci; Advanced Engineering Sciences (Issue 02); full text *
Design and Implementation of a Spark-Based Distributed Health Big Data Analysis System; Wu Lei; Ouyang Heming; Software Guide (Issue 07); full text *

Also Published As

Publication number Publication date
CN112182982A (en) 2021-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant