WO2020029590A1 - Sample prediction method and device based on federated training, and storage medium - Google Patents


Info

Publication number
WO2020029590A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
node
training
round
data
Prior art date
Application number
PCT/CN2019/080297
Other languages
French (fr)
Chinese (zh)
Inventor
范涛
成柯葳
马国强
刘洋
陈天健
杨强
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 filed Critical 深圳前海微众银行股份有限公司
Publication of WO2020029590A1 publication Critical patent/WO2020029590A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers

Definitions

  • the invention relates to the technical field of machine learning, and in particular to a sample prediction method, device, and computer-readable storage medium based on federated training.
  • traditionally, one party independently trains a model on its own sample data, that is, unilateral modeling; based on the established mathematical model, the relatively important features in the sample feature set can then be determined.
  • for example, users may have both consumption behavior and lending behavior: the consumption behavior data is generated at a consumer service provider, while the loan behavior data is generated at a financial service provider.
  • if the financial service provider needs to predict a user's lending behavior from consumption-behavior features, it must use the consumer service provider's consumption behavior data together with its own lending behavior data and perform machine learning jointly to build a prediction model.
  • the main purpose of the present invention is to provide a sample prediction method, device, and computer-readable storage medium based on federated training, which aims to solve the problem that the prior art cannot jointly train on the sample data of different data providers, and thus cannot achieve modeling with the mutual participation of both parties.
  • the present invention provides a sample prediction method based on federated training.
  • the sample prediction method based on federated training includes the following steps:
  • the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model, wherein the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples;
  • joint prediction is performed on the samples to be predicted to determine a sample category of the samples to be predicted or to obtain a prediction score of the samples to be predicted.
  • the federated training-based sample prediction method includes:
  • the encrypted ID strings of both parties are compared to identify the intersection of the two parties' samples, and the intersection is used as the aligned training samples.
  • the two aligned training samples are a first training sample and a second training sample, respectively;
  • the first training sample attribute includes a sample ID and some sample features
  • the second training sample attribute includes a sample ID, another part of sample features, and a data label
  • the first training sample is provided by the first data party and stored locally on the first data party
  • the second training sample is provided by the second data party and stored locally on the second data party.
  • using the XGBoost algorithm to perform federated training on two aligned training sample sets to construct a gradient boosting tree model includes:
  • if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set are sent to the first data party, so that the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • the sample set corresponding to the current node is split to generate new nodes to build the regression tree of the gradient boosted tree model.
  • before the step of obtaining, on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting, the method further includes:
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to this round of node splitting are initialized; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, the first-order and second-order gradients used in the first round of node splitting are reused;
  • if the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, the first-order and second-order gradients are updated according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, the same first-order and second-order gradients used in the first round of node splitting of that regression tree are reused.
  • the federated training-based sample prediction method further includes:
  • the federated training-based sample prediction method further includes:
  • the related information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • counting the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model includes:
  • each global best split node is used as a split node of a regression tree in the gradient boosting tree model, and the average gain of the split nodes corresponding to the same feature code is counted.
  • performing joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtaining a prediction score of the samples to be predicted includes:
  • a query request is initiated to the first data party, so that the first data party compares the feature data of the local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
  • the sample category of the sample to be predicted is determined based on the data label of the samples corresponding to the node to which the sample to be predicted belongs, or the prediction score of the sample to be predicted is obtained based on the weight value of the node to which it belongs.
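The traversal protocol described above can be sketched as follows. The class names (Node, PartyA), the tree layout, and the concrete feature values are illustrative assumptions, not part of the patent; the key point is that each internal node is owned by one party, and the owning party alone compares its local feature value against the node's attribute value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    owner: str = ""                 # "A" or "B": which party holds the split feature
    feature: str = ""               # feature code, meaningful only to its owner
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: float = 0.0             # leaf weight (prediction score)

class PartyA:
    """Party A holds its share of the sample's features and answers
    traversal queries without revealing raw feature values to Party B."""
    def __init__(self, features):
        self.features = features

    def next_branch(self, feature, threshold):
        # Party A compares its local feature value with the node's
        # attribute value and only reveals which branch to take.
        return "left" if self.features[feature] <= threshold else "right"

def joint_predict(tree, party_a, features_b):
    """Party B traverses the tree; for nodes owned by Party A it issues
    a query and receives only the next branch to follow."""
    node = tree
    while node.left is not None:    # internal node
        if node.owner == "B":
            branch = "left" if features_b[node.feature] <= node.threshold else "right"
        else:
            branch = party_a.next_branch(node.feature, node.threshold)
        node = node.left if branch == "left" else node.right
    return node.weight              # leaf weight = prediction score

# Hypothetical two-level tree: root owned by B, children owned by A.
tree = Node("B", "Bill", 3102,
            Node("A", "Amount", 200, Node(weight=-0.5), Node(weight=0.3)),
            Node("A", "Age", 35, Node(weight=0.1), Node(weight=0.8)))
print(joint_predict(tree, PartyA({"Amount": 150, "Age": 40}), {"Bill": 3000}))  # -0.5
```

The sample category can then be read off the leaf's majority label, or the leaf weight can be accumulated across trees as a prediction score.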
  • the present invention also provides a sample prediction device based on federated training.
  • the sample prediction device based on federated training includes a memory, a processor, and a sample prediction program stored in the memory and executable on the processor.
  • the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a sample prediction program which, when executed by a processor, implements the steps of the federated training-based sample prediction method described in any of the above.
  • the present invention uses the XGBoost algorithm to perform federated training on two aligned training sample sets to build a gradient boosting tree model.
  • the gradient boosting tree model is a set of regression trees: it includes multiple regression trees, and each split node of a regression tree corresponds to a feature of the training samples.
  • joint prediction is performed to determine the sample category of the sample to be predicted or to obtain the prediction score of the sample to be predicted.
  • the invention realizes federated modeling using training samples of different data parties, and can then predict samples whose data features are distributed across multiple parties.
  • FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the federated training-based sample prediction device of the present invention;
  • FIG. 2 is a schematic flowchart of an embodiment of the federated training-based sample prediction method of the present invention;
  • FIG. 3 is a schematic flowchart of sample alignment in an embodiment of the federated training-based sample prediction method of the present invention;
  • FIG. 4 is a detailed flowchart of an embodiment of step S10 in FIG. 2;
  • FIG. 5 is a schematic diagram of training results in an embodiment of the federated training-based sample prediction method of the present invention.
  • the invention provides a sample prediction device based on federated training.
  • FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the federated training-based sample prediction device of the present invention.
  • the federated training-based sample prediction device of the present invention may be a personal computer, or a device with computing capability such as a server.
  • the federated training-based sample prediction device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory (for example, a magnetic disk memory).
  • the memory 1005 may optionally be a storage device independent of the foregoing processor 1001.
  • the structure of the federated training-based sample prediction device shown in FIG. 1 does not constitute a limitation on the device; it may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a sample prediction program.
  • the network interface 1004 is mainly used to connect to a background server and perform data communication with the background server;
  • the user interface 1003 is mainly used to connect to a client (user) and perform data communication with the client;
  • and the processor 1001 may be used to call the sample prediction program stored in the memory 1005 and perform the following operations:
  • the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model, wherein the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples;
  • joint prediction is performed on the samples to be predicted to determine a sample category of the samples to be predicted or to obtain a prediction score of the samples to be predicted.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the encrypted ID strings of both parties are compared to identify the intersection of the two parties' samples, and the intersection is used as the aligned training samples.
  • the two aligned training samples are a first training sample and a second training sample, respectively;
  • the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
  • the first training sample is provided by the first data party and stored locally on the first data party, and the second training sample is provided by the second data party and stored locally on the second data party;
  • the processor 1001 calls the sample prediction program stored in the memory 1005 and further performs the following operations:
  • if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set are sent to the first data party, so that the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • the sample set corresponding to the current node is split to generate new nodes to build the regression tree of the gradient boosted tree model.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to this round of node splitting are initialized; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, the first-order and second-order gradients used in the first round of node splitting are reused;
  • if the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, the first-order and second-order gradients are updated according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, the same first-order and second-order gradients used in the first round of node splitting of that regression tree are reused.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the related information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • a query request is initiated to the first data party, so that the first data party compares the feature data of the local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
  • the sample category of the sample to be predicted is determined based on the data label of the samples corresponding to the node to which the sample to be predicted belongs, or the prediction score of the sample to be predicted is obtained based on the weight value of the node to which it belongs.
  • FIG. 2 is a schematic flowchart of an embodiment of the federated training-based sample prediction method according to the present invention.
  • the federated training-based sample prediction method includes the following steps:
  • step S10: the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model.
  • the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples.
  • the XGBoost (eXtreme Gradient Boosting) algorithm is an improvement, based on the Boosting approach, of the GBDT (Gradient Boosting Decision Tree) algorithm.
  • the internal decision tree uses a regression tree.
  • the output of the algorithm is a collection of regression trees.
  • the basic idea of training is to traverse all segmentation methods of all features of the training samples (that is, all candidate node splits), select the segmentation with the least loss, obtain two leaves (that is, split the node to generate new nodes), and then continue traversing until a stop condition is reached.
  • the training samples used by the XGBoost algorithm here are two independent training sample sets, each belonging to a different data party. If the two sets are regarded as one overall training sample set, then, because they belong to different data parties, the overall set can be seen as partitioned such that the two parties hold different features of the same samples (feature-wise sample splitting).
  • federated training means that the sample training process is completed through the cooperation of the two data parties.
  • the regression trees contained in the finally trained gradient boosting tree model have split nodes corresponding to features from both parties' training samples.
  • the gain of a split node can be used as a basis for evaluating feature importance: the larger the gain of a split node, the smaller the segmentation loss at that node, and the greater the importance of the feature corresponding to that split node.
  • because the trained gradient boosting tree model includes multiple regression trees, and different regression trees may use the same feature for node segmentation, it is necessary to count, over all regression trees included in the gradient boosting tree model, the average gain of the split nodes corresponding to each feature, and to use the average gain as the score of the corresponding feature.
  • Step S20 Based on the gradient boosting tree model, perform joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtain a prediction score of the samples to be predicted.
  • the gradient boosting tree model trained using the XGBoost algorithm enables joint prediction of samples to be predicted, thereby classifying or scoring the prediction samples.
  • This embodiment uses the XGBoost algorithm to perform federated training on two aligned training sample sets to build a gradient boosting tree model.
  • the gradient boosting tree model is a set of regression trees: it includes multiple regression trees, and each split node of a regression tree corresponds to a feature of the training samples.
  • joint prediction is performed on the samples to be predicted to determine the sample category of the sample to be predicted or to obtain the prediction score of the sample to be predicted.
  • the invention realizes federated modeling using training samples of different data parties, and can then predict samples whose data features are distributed across multiple parties.
  • the two data parties perform sample alignment processing on both sides before performing federated modeling.
  • the specific processing flow is shown in FIG. 3.
  • sample alignment between the two parties uses a blind-signature scheme together with the RSA encryption algorithm to interactively encrypt the sample IDs.
  • the encrypted ID strings are compared to identify the intersection and non-intersection parts of the two parties' samples (the non-intersection part remains private and invisible to the other party).
  • the present invention needs to encrypt the sample data during the sample alignment process.
  • the sample IDs of data party A are denoted X_A: {u1, u2, u3, u4}
  • the sample IDs of data party B are denoted X_B: {u1, u2, u3, u5}
  • the blind signature of data x is denoted E(x)
  • the RSA key generated by party B is (n, e, d)
  • the RSA public key obtained by party A is (n, e).
  • Party A compares D_A and Z_B; if two encrypted strings are equal, the corresponding IDs in X_A and X_B are equal.
  • the equal IDs form the intersection of the samples ({u1, u2, u3}) and are retained; the unequal parts ({u4, u5}), being in encrypted form, are not visible to either party and can be discarded.
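The interactive encryption scheme above can be sketched as follows. The toy RSA parameters, the fixed random seed, and the direct comparison of signatures are simplifying assumptions (a practical scheme uses large keys, secure randomness, and hashes the signed values again before exchange); the variable names D_A and Z_B follow the text.

```python
import hashlib
import random
from math import gcd

p, q = 1000003, 1000033             # toy primes; illustrative only
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))   # party B's private exponent

def h(x: str) -> int:
    """Hash a sample ID into the RSA group."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % n

ids_a = ["u1", "u2", "u3", "u4"]    # party A's sample IDs (X_A)
ids_b = ["u1", "u2", "u3", "u5"]    # party B's sample IDs (X_B)

rng = random.Random(0)
blinds, blinded = {}, []
for u in ids_a:                     # party A blinds each hashed ID
    while True:
        r = rng.randrange(2, n)
        if gcd(r, n) == 1:
            break
    blinds[u] = r
    blinded.append((h(u) * pow(r, e, n)) % n)

signed = [pow(m, d, n) for m in blinded]   # party B signs blindly with d

# Party A unblinds: (h(u)^d * r) * r^{-1} = h(u)^d mod n
D_A = {u: (s * pow(blinds[u], -1, n)) % n for u, s in zip(ids_a, signed)}

Z_B = {pow(h(u), d, n) for u in ids_b}     # party B signs its own IDs

intersection = sorted(u for u, sig in D_A.items() if sig in Z_B)
print(intersection)  # ['u1', 'u2', 'u3']
```

Party B never sees A's raw IDs (only blinded values), and the non-intersecting IDs u4 and u5 remain hidden behind the signatures.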
  • this embodiment specifically uses two independent training samples for illustration.
  • the first data party provides a first training sample
  • the attributes of the first training sample include a sample ID and some sample features
  • the second data party provides a second training sample
  • the second training sample attributes include a sample ID, another part of the sample features, and a data label.
  • the sample characteristics refer to the characteristics exhibited or possessed by the sample. For example, if the sample is a person, the corresponding sample characteristics may be age, gender, income, education, etc. Data labels are used to classify multiple different samples. The classification results are determined based on the characteristics of the samples.
  • the main significance of the federated training modeling of the present invention is to achieve two-way privacy protection of both parties' sample data; therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample is stored locally at the second data party.
  • the data in Table 1 below is provided by the first data party and stored locally at the first data party.
  • the data in Table 2 is provided by the second data party and stored locally.
  • the first training sample attributes include a sample ID (X1 to X5), an Age feature, a Gender feature, and an Amount of credit feature.
  • the second training sample attributes include a sample ID (X1 to X5), a Bill feature, an Education feature, and a data label Label.
  • FIG. 4 is a schematic diagram of a detailed process of an embodiment of step S10 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S10 specifically includes:
  • step S101: on the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting are obtained;
  • the XGBoost algorithm is a machine learning modeling method. It uses a classifier (that is, a classification function) to map sample data to a category, so that it can be applied to data prediction. In the process of learning classification rules with the classifier, a loss function is needed to measure the size of the fitting error.
  • the gradient boosting tree model requires multiple rounds of federated training.
  • each round of federated training generates one regression tree, and generating a regression tree requires multiple node splits.
  • within one round of federated training, the first node split uses the initially saved training samples, and each subsequent node split uses the training samples corresponding to the new nodes generated by the previous split; every node split in the same round follows the same first-order and second-order gradients used in the first split of that round. The next round of federated training uses the results of the previous round to update the first-order and second-order gradients.
  • the XGBoost algorithm supports a custom loss function.
  • the custom loss function is used to obtain the first-order and second-order partial derivatives of the objective function, which correspond to the first-order and second-order gradients of the local sample data to be trained.
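As one concrete example of such a custom loss (an assumption here; the patent does not fix a particular loss function), the logistic loss for binary classification yields the following first-order and second-order derivatives with respect to the raw prediction:

```python
import math

def logistic_grad_hess(y_true, raw_pred):
    """First- and second-order partial derivatives of the logistic loss
    with respect to the raw (pre-sigmoid) prediction, as in XGBoost."""
    p = 1.0 / (1.0 + math.exp(-raw_pred))  # sigmoid of the raw score
    g = p - y_true                          # first-order gradient g_i
    h = p * (1.0 - p)                       # second-order gradient h_i
    return g, h

# At initialization the raw prediction is typically 0 for every sample:
print(logistic_grad_hess(1, 0.0))  # (-0.5, 0.25)
```

These g_i and h_i values are exactly the per-sample quantities that the second data party initializes, encrypts, and later updates between boosting rounds.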
  • a split node needs to be determined, and the split node can be selected according to its gain.
  • the formula for calculating the gain is as follows:
  • Gain = 1/2 · [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ
  • where I_L denotes the sample set contained in the left child node after the current node split, I_R denotes the sample set contained in the right child node, I = I_L ∪ I_R is the sample set of the current node, g_i denotes the first-order gradient of sample i, h_i denotes the second-order gradient of sample i, and λ and γ are constants.
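Under the notation above, the gain of a candidate split can be computed from the gradient sums of the two child sample sets. This minimal sketch assumes λ = 1 and γ = 0 as default constants (the patent leaves the constants unspecified):

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following the formula above.
    g_*/h_* are the sums of first-/second-order gradients over the
    left and right child sample sets; lam and gamma are the constants."""
    def score(g, h):
        # Structure score (G^2 / (H + lambda)) of one node
        return g * g / (h + lam)

    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Example gradient sums: left child (G=-2, H=1), right child (G=3, H=2)
print(split_gain(-2.0, 1.0, 3.0, 2.0))  # 2.375
```

The split with the largest gain over all candidate features and thresholds becomes the best split node for the current round.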
  • since the aligned samples of both parties share the same gradients, and since the data labels exist only in the second data party's sample data, the first-order and second-order gradients computed from the second data party's sample data are used to calculate the gain of a split node for each sample division under each splitting method.
  • step S102: if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • when the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients of the sample data are calculated on the second data party and encrypted before being sent to the first data party.
  • the gain of a split node for the first data party's local sample data under each splitting method is then calculated on the first data party; since the gain is computed from the encrypted first-order and second-order gradients, the calculated gain is itself an encrypted value, and there is no need to encrypt it again.
  • the new node can be split to generate a regression tree.
  • the second data party, which holds the sample data with the data labels, dominates the construction of the regression trees of the gradient boosting tree model; therefore, the gains of the first data party's local split candidates under each splitting method, calculated on the first data party, need to be sent back to the second data party.
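The encryption of the gradients can be realized with an additively homomorphic scheme such as Paillier. The patent does not name a specific cryptosystem, so this is a sketch under that assumption; the small key, the fixed seed, and the integer-encoded gradient values are toy simplifications (real systems use large keys, secure randomness, and fixed-point encoding for real-valued gradients).

```python
import math
import random

def keygen(p=1000003, q=1000033):
    """Toy Paillier key pair; p, q are illustrative small primes."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)
    return (n,), (n, lam, mu)            # public key, private key

def encrypt(pub, m, rng):
    (n,) = pub
    n2 = n * n
    while True:                          # random factor coprime to n
        r = rng.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    # c = (1 + n)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pub, priv = keygen()
rng = random.Random(0)
c_g = encrypt(pub, 12, rng)              # an integer-encoded gradient sum
c_h = encrypt(pub, 30, rng)
# Additive homomorphism: E(a) * E(b) mod n^2 = E(a + b), so the first data
# party can aggregate encrypted gradients without learning their values.
c_sum = (c_g * c_h) % (pub[0] ** 2)
print(decrypt(priv, c_sum))  # 42
```

This property is what lets the first data party compute encrypted gains from encrypted gradient sums, which only the second data party can decrypt.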
  • step S103: if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, the sample IDs of the sample set are sent to the first data party.
  • when the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set corresponding to this round are sent to the first data party, and the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the received sample IDs under each splitting method.
  • step S104: the second data party receives the encrypted gains of all split nodes returned by the first data party and decrypts them;
  • step S105: on the second data party, based on the first-order and second-order gradients, the gain of a split node for the local training samples corresponding to the sample IDs under each splitting method is calculated;
  • step S106: determine the global best split node for the current round of node splitting based on the gains of all split nodes calculated by both parties;
  • the gains of all split nodes calculated by the two parties can be regarded as the gains of splitting the two parties' overall data samples under each splitting method; therefore, by comparing the magnitudes of these gains, the split node with the largest gain is taken as the global best split node for the current round of node splitting.
  • the sample feature corresponding to the global best split node may belong either to the training samples of the first data party or to the training samples of the second data party.
  • the relevant information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • if the split node comes from party A, the record is (Site A, E_A(f_i), gain).
  • if the split node comes from party B, the record is (Site B, E_B(f_i), gain).
  • E_A(f_i) denotes party A's encoding of feature f_i, and E_B(f_i) denotes party B's encoding of feature f_i; marking a feature by its code allows it to be identified without revealing the original feature data.
  • each global best split node is used as a split node of a regression tree in the gradient boosting tree model, and the average gain of the split nodes corresponding to the same feature code is counted.
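The statistic above can be sketched as follows; the record format mirrors the (Site, E(f_i), gain) triples kept by the second data party, and the gain values are hypothetical:

```python
from collections import defaultdict

def average_gains(split_records):
    """Average gain over all split nodes sharing the same feature code."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for provider, code, gain in split_records:
        sums[(provider, code)] += gain
        counts[(provider, code)] += 1
    return {key: sums[key] / counts[key] for key in sums}

records = [("Site A", "E_A(f1)", 0.8),   # hypothetical gains
           ("Site A", "E_A(f1)", 0.4),
           ("Site B", "E_B(f2)", 0.6)]
scores = average_gains(records)
print(scores[("Site A", "E_A(f1)")])
```

The resulting per-feature-code averages serve as the feature scores, without either party having to reveal what its feature codes stand for.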
  • step S107: based on the global best split node of the current round of node splitting, the sample set corresponding to the current node is split to generate new nodes and build the regression tree of the gradient boosting tree model.
  • the sample data corresponding to the current node split of the current round belongs to the first data side.
  • the sample data corresponding to the current node split of the current round belongs to the second data side.
  • new nodes (left and right child nodes) can be generated to build a regression tree.
  • new nodes can be continuously generated, and a deeper regression tree can be obtained. If node splitting is stopped, a regression tree of the gradient boosted tree model can be obtained.
  • the first-order and second-order gradients of the training samples used for node splitting are obtained in the following manner:
  • the first round of node split corresponds to the construction of the first regression tree
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting are initialized;
  • a depth threshold of the regression tree is preset to limit node splitting.
  • the node splitting is stopped, and a regression tree of the gradient boosted tree model is obtained, otherwise the next round of node splitting is continued.
  • the condition restricting node splitting may also be to stop when a node cannot be split further, for example when the samples corresponding to the current node cannot be divided any further.
  • a threshold value for the number of regression trees is preset to limit the number of regression trees generated.
  • the condition limiting the number of generated regression trees may also be to stop building regression trees when nodes can no longer be split.
  • the Age feature in Table 1 has 5 sample data division methods, the Gender feature has 2, and the Amount of credit feature has 5; therefore the sample data in Table 1 has a total of 12 division methods, that is, the gain of the split node corresponding to each of the 12 division methods needs to be calculated.
  • since the Bill Payment feature in Table 2 has 5 candidate sample-data divisions and the Education feature has 3, the sample data in Table 2 has 8 divisions in total; that is, the split-node gain value needs to be calculated for each of the 8 division methods.
  • this feature is used as the split node (the corresponding samples are X1, X2, X3, X4, X5), and two new child nodes are generated: the left node corresponds to the sample set (X1, X5) with values less than or equal to 3102, and the right node corresponds to the sample set (X2, X3, X4) with values greater than 3102. Taking (X1, X5) and (X2, X3, X4) as the new sample sets, the second and third rounds of node splitting continue, splitting these two new nodes and generating further nodes.
  • the sample gradient values used in the first round of node splitting continue to be used. Assuming the feature corresponding to one split node of this round is Amount of credit less than or equal to 200, this feature is used as the split node (the corresponding samples are X1 and X5) to generate two new child nodes: the left node corresponds to sample X5 (less than or equal to 200) and the right node to sample X1 (greater than 200). Similarly, if the feature corresponding to the other split node of this round is Age less than or equal to 35, this feature is used as the split node (the corresponding samples are X2, X3, X4) to generate two new child nodes: the left node corresponds to samples X2 and X3 (less than or equal to 35) and the right node to sample X4 (greater than 35).
  • the specific implementation process refers to the first round of federated training.
  • the second round of federated training trains the second regression tree.
  • based on the results of the previous round of federated training, the first-order and second-order gradients used in that round are updated, and the second round of federated training continues node splitting to generate new nodes that construct the next regression tree.
  • the specific implementation process refers to the construction process of the previous regression tree.
  • the sample data in Tables 1 and 2 in the above embodiment produced two regression trees after two rounds of federated training.
  • the first regression tree includes three split nodes, which are: Bill Payment is less than or equal to 3102, Amount of credit is less than or equal to 200, Age is less than or equal to 35;
  • the average gain of Bill Payment is (gain1 + gain4) / 2; Education is 0; Age is gain3; Gender is gain5; Amount of credit is gain2.
  • the specific implementation process of performing joint prediction on the prediction samples includes:
  • the next traversal node is determined by comparing the data point of the local to-be-predicted sample with the attribute value of the current traversal node;
  • the split-node records of the regression trees are stored on the second data side; therefore, in this embodiment, the second data side takes the lead in completing the joint prediction of the samples to be predicted, specifically by traversing the regression trees of the gradient boosting tree model.
  • the regression trees corresponding to the model determine the node to which the sample to be predicted belongs, which is determined by comparing the data points of the sample to be predicted with the attribute value of the currently traversed node.
  • the sample category of the sample to be predicted can be determined based on the data label of the training samples corresponding to the node to which it belongs, or the prediction score of the sample to be predicted can be obtained based on the weight value of that node.
  • the invention also provides a computer-readable storage medium.
  • the computer-readable storage medium of the present invention stores a sample prediction program, and when the sample prediction program is executed by a processor, it implements the steps of the federated training-based sample prediction method described in any one of the foregoing embodiments.
  • the methods in the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present invention.
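The division counts discussed above (12 for Table 1, 8 for Table 2) follow from allowing one candidate division per distinct feature value. A minimal sketch; the feature values below are illustrative placeholders, since Tables 1 and 2 themselves are not reproduced in this excerpt:

```python
def count_divisions(columns):
    """Number of candidate split divisions = sum of distinct values per feature.

    columns: dict mapping feature name -> list of sample values
    (illustrative data; the actual Tables 1 and 2 are not shown here).
    """
    return sum(len(set(vals)) for vals in columns.values())

# Table 1 pattern: Age has 5 distinct values, Gender 2, Amount of credit 5 -> 12.
table1 = {"Age": [21, 35, 47, 52, 60],
          "Gender": ["M", "F", "M", "F", "M"],
          "Amount of credit": [100, 200, 300, 400, 500]}
assert count_divisions(table1) == 12

# Table 2 pattern: Bill Payment has 5 distinct values, Education 3 -> 8.
table2 = {"Bill Payment": [1000, 2000, 3102, 4000, 5000],
          "Education": ["HS", "BS", "MS", "BS", "HS"]}
assert count_divisions(table2) == 8
```

Each of these candidate divisions then has its split-node gain evaluated, and the best one becomes the global best split node for the round.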


Abstract

A sample prediction method based on federated training, comprising the following steps: performing federated training on two aligned training samples by using an XGboost algorithm to construct a gradient boosting tree model (S10), wherein the gradient boosting tree model comprises a plurality of regression trees, and a split node of each regression tree corresponds to a feature of each training sample; and performing joint prediction on a sample to be predicted on the basis of the gradient boosting tree model, to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted (S20). According to the method, federated training-based modeling is implemented by using training samples of different data parties, and thus sample prediction is implemented on the basis of the established model.

Description

Sample prediction method, device and storage medium based on federated training
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 10, 2018, with application number 201810913869.3 and invention title "Sample Prediction Method, Device, and Storage Medium Based on Federated Training", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of machine learning, and in particular to a sample prediction method, device, and computer-readable storage medium based on federated training.
Background
In the current information age, certain human behaviors, such as consumption behavior, can be expressed through data, which has given rise to big data analysis. Machine learning can be used to build corresponding behavior analysis models, which can then classify people's behaviors or make predictions based on users' behavioral features.
In existing machine learning technology, one party usually trains on the sample data independently, that is, unilateral modeling. Based on the established model, the relatively important features in the sample feature set can be determined. However, in many cross-domain big data analysis scenarios, a user may have both consumption behavior and borrowing behavior: the user's consumption behavior data is generated at the consumer service provider, while the user's borrowing behavior data is generated at the financial service provider. If the financial service provider needs to predict the user's borrowing behavior based on the user's consumption behavior features, it needs to use the consumer service provider's consumption behavior data together with its own borrowing behavior data to build a prediction model through machine learning.
Therefore, for the above application scenarios, a new modeling approach is needed to realize joint training on the sample data of different data providers, so that both parties can jointly participate in modeling.
Summary of the invention
The main purpose of the present invention is to provide a sample prediction method, device, and computer-readable storage medium based on federated training, aiming to solve the technical problem that the prior art cannot realize joint training on the sample data of different data providers and therefore cannot allow both parties to jointly participate in modeling and sample prediction.
To achieve the above objective, the present invention provides a sample prediction method based on federated training, which includes the following steps:
performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, where the gradient boosting tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, to determine the sample category of the sample to be predicted or to obtain a prediction score for the sample to be predicted.
Optionally, the federated training-based sample prediction method includes:
before federated training, using a blind signature and the RSA encryption algorithm to interactively encrypt the IDs of the sample data;
comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using that intersection as the aligned training samples.
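The ID alignment step above can be sketched as an RSA blind-signature private set intersection. The following is a simplified illustration only: the toy key size, hash construction, and message flow are assumptions, not the patent's exact protocol.

```python
import hashlib
import math
import random

def h_int(s, n):
    """Hash an ID string to an integer modulo n (demo full-domain hash)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big") % n

def h_out(x):
    """Second hash applied to the RSA signature to get a comparable key."""
    return hashlib.sha256(str(x).encode()).hexdigest()

# Demo RSA key for party B (tiny primes, for illustration only).
p, q = 1000003, 1000033
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))

a_ids = ["u1", "u2", "u3", "u5"]   # party A's sample IDs
b_ids = ["u2", "u3", "u4", "u5"]   # party B's sample IDs

# Party A blinds each hashed ID with a random factor r^e before sending,
# so party B never sees the raw hashed IDs.
blinded, blinds = [], []
for sid in a_ids:
    r = random.randrange(2, n - 1)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n - 1)
    blinds.append(r)
    blinded.append((h_int(sid, n) * pow(r, e, n)) % n)

# Party B signs the blinded values with its private exponent d (blind signature).
signed = [pow(y, d, n) for y in blinded]

# Party A unblinds: multiplying by r^-1 leaves H(id)^d mod n, then hashes it.
a_keys = {h_out((s * pow(r, -1, n)) % n): sid
          for s, r, sid in zip(signed, blinds, a_ids)}

# Party B computes the same keys for its own IDs directly.
b_keys = {h_out(pow(h_int(sid, n), d, n)) for sid in b_ids}

# The key intersection reveals only the common IDs: the aligned samples.
aligned = sorted(sid for k, sid in a_keys.items() if k in b_keys)
```

Here `aligned` contains the IDs common to both parties (u2, u3, u5), which serve as the aligned training samples; IDs held by only one party are never revealed to the other in the clear.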
Optionally, the two aligned training samples are a first training sample and a second training sample, respectively;
the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party.
Optionally, performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model includes:
on the second data party side, obtaining the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them together with the sample IDs of the sample set to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party can reuse the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
the second data party receiving the encrypted gain values of all split nodes returned by the first data party and decrypting them;
on the second data party side, based on the first-order and second-order gradients, calculating the gain value of the split node under each possible division of the local training samples corresponding to the sample IDs;
determining the global best split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
based on the global best split node of the current round, splitting the sample set corresponding to the current node and generating new nodes to construct a regression tree of the gradient boosting tree model.
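The gain values exchanged in the steps above can be computed with the standard XGboost split-gain formula from per-child gradient sums. A sketch, assuming regularization parameters lambda and gamma, which the excerpt does not fix:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Standard XGboost gain for splitting a node into left/right children.

    g_*, h_* are the sums of first-order and second-order gradients of the
    samples falling into each child; lam and gamma are regularization terms.
    """
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(samples, g, h, feature_values):
    """Enumerate candidate thresholds of one feature, return (best gain, threshold).

    samples: list of sample indices at the current node.
    g, h: dicts mapping sample index -> first/second-order gradient.
    feature_values: dict mapping sample index -> value of this feature.
    """
    best = (float("-inf"), None)
    for t in sorted({feature_values[i] for i in samples}):
        left = [i for i in samples if feature_values[i] <= t]
        right = [i for i in samples if feature_values[i] > t]
        if not left or not right:
            continue  # a division must put samples on both sides
        gain = split_gain(sum(g[i] for i in left), sum(h[i] for i in left),
                          sum(g[i] for i in right), sum(h[i] for i in right))
        if gain > best[0]:
            best = (gain, t)
    return best
```

Each party runs this enumeration over its own features; only summed (and, for the first party, encrypted) gradients and the resulting gain values cross the boundary, so raw feature values stay local.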
Optionally, before the step of obtaining, on the second data party side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting, the method further includes:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether the current round is the first round of node splitting for constructing the first regression tree;
if the current round is the first round of node splitting for constructing the first regression tree, initializing, on the second data party side, the first-order and second-order gradients of each training sample in the corresponding sample set; if the current round is not the first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
if the current round of node splitting corresponds to constructing a regression tree other than the first, judging whether the current round is the first round of node splitting for constructing that regression tree;
if the current round is the first round of node splitting for constructing a regression tree other than the first, updating the first-order and second-order gradients based on the previous round of federated training; if it is not the first round, reusing the first-order and second-order gradients used in the first round of node splitting.
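The initialize/update logic above depends on the loss function, which the excerpt does not fix. Assuming the common binary logistic loss used with XGboost, the gradients are g_i = p_i - y_i and h_i = p_i(1 - p_i), where p_i is the sigmoid of the current raw score:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(y_true, y_pred_raw):
    """First/second-order gradients of binary logistic loss w.r.t. raw scores.

    For logistic loss: g_i = p_i - y_i and h_i = p_i * (1 - p_i), with
    p_i = sigmoid(raw score).
    """
    g, h = [], []
    for y, f in zip(y_true, y_pred_raw):
        p = sigmoid(f)
        g.append(p - y)
        h.append(p * (1.0 - p))
    return g, h

# First regression tree: raw scores start at 0 (the 'initialize' step).
y = [1, 0, 1]
g0, h0 = gradients(y, [0.0, 0.0, 0.0])

# After a tree is built, raw scores change, and the first round of node
# splitting for the next tree recomputes g and h (the 'update' step).
g1, h1 = gradients(y, [0.8, -0.4, 0.2])
```

With zero raw scores every p_i is 0.5, giving g0 = [-0.5, 0.5, -0.5] and h0 = [0.25, 0.25, 0.25]; all later rounds for the same tree reuse these values, matching the description above.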
Optionally, the federated training-based sample prediction method further includes:
when new nodes are generated to construct a regression tree of the gradient boosting tree model, judging, on the second data party side, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise continuing with the next round of node splitting;
when node splitting stops, judging, on the second data party side, whether the total number of regression trees reaches a preset number threshold;
if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise continuing with the next round of federated training.
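The two stopping conditions above can be sketched as an outer control loop. Here `build_one_round` is a hypothetical placeholder, not part of the source, standing in for one round of federated node splitting:

```python
def train_boosted_trees(build_one_round, max_depth=3, max_trees=5):
    """Outer control loop for the two stopping conditions described above.

    build_one_round(tree, depth) performs one round of node splitting and
    returns False when the nodes cannot split further (the alternative
    stopping condition mentioned in the description).
    """
    trees = []
    while len(trees) < max_trees:          # tree-count threshold
        tree, depth = {}, 0
        while depth < max_depth:           # depth threshold per tree
            if not build_one_round(tree, depth):
                break                      # node cannot continue to split
            depth += 1
        trees.append(tree)
    return trees
```

The inner loop corresponds to building one regression tree round by round; the outer loop corresponds to one round of federated training per tree.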
Optionally, the federated training-based sample prediction method further includes:
on the second data party side, recording the related information of the global best split node determined in each round of node splitting;
where the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
Optionally, counting the average gain value of the split nodes corresponding to the same feature in the gradient boosting tree model includes:
on the second data party side, using each global best split node as a split node of each regression tree in the gradient boosting tree model, and counting the average gain value of the split nodes corresponding to the same feature code.
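Counting the average gain per feature code can be sketched as follows. The record format and the numeric gains are illustrative; they stand in for the gain1 to gain5 values in the two-tree example given later in this document:

```python
from collections import defaultdict

def average_gain(split_records):
    """Average split-node gain per feature code over all regression trees.

    split_records: list of (feature_code, gain) tuples, one per recorded
    global best split node (field names are illustrative assumptions).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, gain in split_records:
        totals[feature] += gain
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Bill Payment is chosen as a split node twice (gain1 and gain4 averaged),
# the other features once each; Education never splits, so it gets no entry.
records = [("Bill Payment", 0.9), ("Amount of credit", 0.7),
           ("Age", 0.5), ("Bill Payment", 0.3), ("Gender", 0.2)]
importance = average_gain(records)
```

A feature that never serves as a split node (Education in the example) simply does not appear, which matches an average gain of 0 in the description.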
Optionally, performing joint prediction on the sample to be predicted based on the gradient boosting tree model, to determine its sample category or obtain its prediction score, includes:
on the second data party side, traversing the regression trees corresponding to the gradient boosting tree model;
if the attribute value of the currently traversed node is recorded at the second data party, comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node to determine the next node to traverse;
if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party can compare the data point of its local sample to be predicted with the attribute value of the currently traversed node, determine the next node to traverse, and return that node information to the second data party;
when the traversal of the regression trees corresponding to the gradient boosting tree model is complete, determining the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which it belongs, or obtaining its prediction score based on the weight value of that node.
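The joint traversal can be sketched as follows. The node record layout and the `query_first_party` callback are assumptions standing in for the cross-party query request described above:

```python
def predict_score(trees, local_features, query_first_party):
    """Traverse each regression tree, asking the other party when needed.

    trees: list of dicts mapping node_id -> node. A split node is an assumed
    record {"owner": "first"|"second", "feature": ..., "threshold": ...,
    "left": id, "right": id}; a leaf is {"weight": w}.
    query_first_party(node) stands in for the query request to the first
    data party, which compares its local feature value and answers
    True (go left) or False (go right).
    """
    score = 0.0
    for tree in trees:
        node = tree[0]                       # root node has id 0
        while "weight" not in node:          # descend until a leaf
            if node["owner"] == "second":    # attribute value held locally
                go_left = local_features[node["feature"]] <= node["threshold"]
            else:                            # attribute held by first party
                go_left = query_first_party(node)
            node = tree[node["left"] if go_left else node["right"]]
        score += node["weight"]              # sum leaf weights over all trees
    return score
```

Because each comparison happens on the side that owns the feature, neither party reveals its raw feature values during prediction; only node choices cross the boundary.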
Further, to achieve the above objective, the present invention also provides a sample prediction device based on federated training, which includes a memory, a processor, and a sample prediction program stored in the memory and executable on the processor, where the sample prediction program, when executed by the processor, implements the steps of the federated training-based sample prediction method according to any one of the preceding items.
Further, to achieve the above objective, the present invention also provides a computer-readable storage medium storing a sample prediction program, where the sample prediction program, when executed by a processor, implements the steps of the federated training-based sample prediction method according to any one of the preceding items.
The present invention uses the XGboost algorithm to perform federated training on two aligned training samples to construct a gradient boosting tree model, where the gradient boosting tree model is a set of multiple regression trees and each split node of each regression tree corresponds to one feature of the training samples. Finally, based on the gradient boosting tree model, joint prediction is performed on the sample to be predicted to determine its sample category or obtain its prediction score. The present invention realizes federated training modeling using the training samples of different data parties, and thus enables prediction for samples whose features come from multiple parties' data.
Brief description of the drawings
FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the sample prediction device based on federated training of the present invention;
FIG. 2 is a schematic flowchart of an embodiment of the sample prediction method based on federated training of the present invention;
FIG. 3 is a schematic flowchart of sample alignment in an embodiment of the sample prediction method based on federated training of the present invention;
FIG. 4 is a detailed flowchart of an embodiment of step S10 in FIG. 2;
FIG. 5 is a schematic diagram of a training result of an embodiment of the sample prediction method based on federated training of the present invention.
The realization of the objectives, functional features, and advantages of the present invention will be further explained with reference to the embodiments and the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
The present invention provides a sample prediction device based on federated training.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the sample prediction device based on federated training of the present invention.
The sample prediction device based on federated training of the present invention may be a personal computer, or a device with computing capability such as a server.
As shown in FIG. 1, the sample prediction device based on federated training may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a magnetic disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art can understand that the structure of the sample prediction device based on federated training shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a sample prediction program.
In the sample prediction device based on federated training shown in FIG. 1, the network interface 1004 is mainly used to connect to a backend server and perform data communication with it; the user interface 1003 is mainly used to connect to a client and perform data communication with it; and the processor 1001 may be used to call the sample prediction program stored in the memory 1005 and perform the following operations:
performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, where the gradient boosting tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, to determine the sample category of the sample to be predicted or to obtain a prediction score for the sample to be predicted.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 and also performs the following operations:
before federated training, using a blind signature and the RSA encryption algorithm to interactively encrypt the IDs of the sample data;
comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using that intersection as the aligned training samples.
Further, the two aligned training samples are a first training sample and a second training sample, respectively; the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label; the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party. The processor 1001 calls the sample prediction program stored in the memory 1005 and also performs the following operations:
on the second data party side, obtaining the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them together with the sample IDs of the sample set to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party can reuse the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
the second data party receiving the encrypted gain values of all split nodes returned by the first data party and decrypting them;
on the second data party side, based on the first-order and second-order gradients, calculating the gain value of the split node under each possible division of the local training samples corresponding to the sample IDs;
determining the global best split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
based on the global best split node of the current round, splitting the sample set corresponding to the current node and generating new nodes to construct a regression tree of the gradient boosting tree model.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在进行节点分裂时,判断本轮节点分裂是否对应构造首棵回归树;When performing node splitting, determine whether the current round of node splitting corresponds to the construction of the first regression tree;
若本轮节点分裂对应构造首棵回归树,则判断本轮节点分裂是否为构造首棵回归树的首轮节点分裂;If the current round of node splitting corresponds to the construction of the first regression tree, determine whether this round of node splitting is the first round of node splitting to construct the first regression tree;
若本轮节点分裂为构造首棵回归树的首轮节点分裂,则在所述第二数据方侧,初始化本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;若本轮节点分裂为构造首棵回归树的非首轮节点分裂,则沿用首轮节点分裂所使用的一阶梯度与二阶梯度;If the current round of node splitting is the first round of node splitting to construct the first regression tree, then on the second data side, initialize the first and second steps of each training sample in the sample set corresponding to this round of node splitting; if this Round node splitting is a non-first-round node split that constructs the first regression tree, then the first and second steps used in the first round of node splitting are used;
若本轮节点分裂对应构造非首棵回归树,则判断本轮节点分裂是否为构造非首棵回归树的首轮节点分裂;If the current round of node splitting corresponds to constructing a non-first regression tree, determine whether the current round of node splitting is the first round of node splitting to construct a non-first regression tree;
若本轮节点分裂为构造非首棵回归树的首轮节点分裂，则根据上一轮联邦训练更新一阶梯度与二阶梯度；若本轮节点分裂为构造非首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。If the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, update the first-order gradient and second-order gradient according to the previous round of federated training; if it is a non-first round of node splitting for constructing a non-first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第一数据方侧，基于加密的所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;On the first data party's side, based on the encrypted first-order gradient and second-order gradient, calculating the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
或者在所述第一数据方侧，沿用首轮节点分裂所使用的一阶梯度与二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;Or, on the first data party's side, re-using the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
对所有分裂节点的收益值进行加密后发送至所述第二数据方。Encrypting the gain values of all split nodes and sending them to the second data party.
进一步地，处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
当生成新的节点以构建梯度提升树模型的回归树时,在所述第二数据方侧,判断本轮回归树的深度是否达到预设深度阈值;When a new node is generated to construct a regression tree of the gradient boosted tree model, on the second data side, it is judged whether the depth of the regression tree of the current round reaches a preset depth threshold;
若本轮回归树的深度达到所述预设深度阈值,则停止节点分裂,得到梯度提升树模型的一棵回归树,否则继续下一轮节点分裂;If the depth of the regression tree in the current round reaches the preset depth threshold, stop node splitting to obtain a regression tree of the gradient boosted tree model, otherwise continue to the next round of node splitting;
当停止节点分裂时,在所述第二数据方侧,判断本轮回归树的总数量是否达到预设数量阈值;When the node splitting is stopped, judging whether the total number of regression trees in the current round reaches a preset number threshold on the second data side;
若本轮回归树的总数量达到所述预设数量阈值,则停止联邦训练,否则继续下一轮联邦训练。If the total number of regression trees in the current round reaches the preset number threshold, the federal training is stopped, otherwise the next round of federal training is continued.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第二数据方侧,记录每一轮节点分裂确定的全局最佳分裂节点的相关信息;On the second data side, record related information of the global best split node determined by each round of node splitting;
其中,所述相关信息包括:对应样本数据的提供方、对应样本数据的特征编码以及收益值。The related information includes: a provider corresponding to the sample data, a feature code corresponding to the sample data, and a revenue value.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第二数据方侧,遍历所述梯度提升树模型对应的回归树;Traverse the regression tree corresponding to the gradient boosted tree model on the second data side;
若当前遍历节点的属性值记录在所述第二数据方,则通过比较本地待预测样本的数据点与当前遍历节点的属性值,以确定下一遍历节点;If the attribute value of the current traversal node is recorded on the second data side, comparing the data point of the local to-be-predicted sample with the attribute value of the current traversal node to determine the next traversal node;
若当前遍历节点的属性值记录在所述第一数据方，则向所述第一数据方发起查询请求，以供在所述第一数据方侧，通过比较本地待预测样本的数据点与当前遍历节点的属性值，确定下一遍历节点并向所述第二数据方返回该节点信息;If the attribute value of the currently traversed node is recorded at the first data party, a query request is initiated to the first data party, so that on the first data party's side the next traversal node is determined by comparing the data points of the local sample to be predicted with the attribute value of the currently traversed node, and the information of that node is returned to the second data party;
当遍历完所述梯度提升树模型对应的回归树时，基于待预测样本所属节点所对应的样本的数据标签，确定待预测样本的样本类别，或基于待预测样本所属节点的权重值，获得待预测样本的预测得分。When the regression trees of the gradient boosted tree model have been fully traversed, the sample category of the sample to be predicted is determined based on the data labels of the samples at the node to which it belongs, or its prediction score is obtained based on the weight value of that node.
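A single-process sketch of this joint traversal follows, with the network query to the first data party replaced by a plain callback; the node layout (`site`, `feature`, `threshold`, `leaf` keys) and all names are illustrative assumptions, not the patent's actual data structures.

```python
# Hypothetical sketch: party B walks the regression tree; when a node's split
# attribute is recorded at party A, it "queries" A via a callback that stands
# in for the network request described above.
def predict(tree, features_b, query_a):
    node = tree
    while "leaf" not in node:
        if node["site"] == "B":
            val = features_b[node["feature"]]   # attribute held locally at B
        else:
            val = query_a(node["feature"])      # attribute recorded at A
        node = node["left"] if val <= node["threshold"] else node["right"]
    return node["leaf"]
```

In a real deployment `query_a` would return only the identity of the next node, so party A's raw feature values never leave its side.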
基于上述基于联邦训练的样本预测装置实施例方案涉及的硬件运行环境,提出本发明基于联邦训练的样本预测方法的以下各实施例。Based on the hardware operating environment involved in the foregoing solution of the federal training-based sample prediction device embodiment, the following embodiments of the federal training-based sample prediction method of the present invention are proposed.
参照图2,图2为本发明基于联邦训练的样本预测方法一实施例的流程示意图。本实施例中,所述基于联邦训练的样本预测方法包括以下步骤:Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of a federal training-based sample prediction method according to the present invention. In this embodiment, the federal training-based sample prediction method includes the following steps:
步骤S10，采用XGboost算法对两个对齐的训练样本进行联邦训练，以构建梯度提升树模型，其中，所述梯度提升树模型包括多棵回归树，所述回归树的一个分裂节点对应训练样本的一个特征;In step S10, the XGboost algorithm is used to perform federated training on two aligned training samples to construct a gradient boosted tree model, where the gradient boosted tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
XGboost(eXtreme Gradient Boosting)算法是在GBDT(Gradient Boosting Decision Tree,梯度提升树)算法的基础上对Boosting算法进行的改进，内部决策树使用的是回归树，算法输出是回归树的集合，包含有多棵回归树，训练学习的基本思路是遍历训练样本所有特征的所有分割方法(也即节点分裂的方式)，选择损失最小的分割方法，得到两个叶子(也即分裂节点而生成新节点)，然后继续遍历，直至:The XGboost (eXtreme Gradient Boosting) algorithm is an improvement of the Boosting algorithm built on GBDT (Gradient Boosting Decision Tree). Its internal decision trees are regression trees, and its output is a collection of multiple regression trees. The basic idea of training is to traverse all segmentation methods over all features of the training samples (that is, all ways of splitting a node), select the segmentation with the smallest loss to obtain two leaves (that is, split the node to generate new nodes), and then continue traversing until:
(1)若满足停止分裂条件,则输出一棵回归树;(1) If the stopping splitting condition is satisfied, a regression tree is output;
(2)若满足停止迭代条件,则输出一个回归树集合。(2) If the stopping iteration condition is satisfied, a regression tree set is output.
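The traverse-and-split loop above can be sketched in a minimal single-party form. This is an illustrative sketch, not the patent's implementation: it assumes squared-error loss, mean-residual leaves, and plain-text data (no federation or encryption), and all function and parameter names are hypothetical.

```python
def train_gbdt(X, y, n_trees=2, max_depth=1, lr=1.0):
    """Grow n_trees regression trees; each tree greedily picks, among all
    features and thresholds, the split with the smallest squared-error loss."""
    def sse(res, idx):                        # loss of a leaf predicting the mean
        m = sum(res[i] for i in idx) / len(idx)
        return sum((res[i] - m) ** 2 for i in idx)

    def best_split(res, idx):
        best = None                           # (loss, feature, threshold, L, R)
        for f in range(len(X[0])):            # traverse every feature ...
            for t in sorted({X[i][f] for i in idx}):   # ... and every threshold
                left = [i for i in idx if X[i][f] <= t]
                right = [i for i in idx if X[i][f] > t]
                if left and right:
                    loss = sse(res, left) + sse(res, right)
                    if best is None or loss < best[0]:
                        best = (loss, f, t, left, right)
        return best

    def grow(res, idx, depth):
        if depth >= max_depth or len(idx) < 2:        # stop-splitting condition
            return {"leaf": sum(res[i] for i in idx) / len(idx)}
        b = best_split(res, idx)
        if b is None:
            return {"leaf": sum(res[i] for i in idx) / len(idx)}
        _, f, t, left, right = b
        return {"feature": f, "threshold": t,
                "left": grow(res, left, depth + 1),
                "right": grow(res, right, depth + 1)}

    def predict_one(tree, x):
        while "leaf" not in tree:
            tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
        return tree["leaf"]

    pred, trees = [0.0] * len(y), []
    for _ in range(n_trees):                          # stop-iteration condition
        res = [y[i] - pred[i] for i in range(len(y))]  # each tree fits residuals
        tree = grow(res, list(range(len(y))), 0)
        trees.append(tree)
        pred = [pred[i] + lr * predict_one(tree, X[i]) for i in range(len(y))]
    return trees, pred
```

The later embodiments replace this single-party split search with a two-party search over encrypted gradients.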
本实施例中，XGboost算法使用的训练样本为两个独立的训练样本，也即每一个训练样本分别归属不同的数据方。如果将两个训练样本看成一个整体训练样本，则由于两个训练样本归属不同的数据方，因此，可以看成是对整体训练样本进行切分，进而训练样本是同一样本的不同特征(样本纵切)。In this embodiment, the training samples used by the XGboost algorithm are two independent training samples; that is, each training sample belongs to a different data party. If the two training samples are viewed as one overall training sample, then, since they belong to different data parties, this can be seen as a partition of the overall training sample in which each party holds different features of the same samples (a vertical split of the samples).
此外,由于两个训练样本分别归属不同的数据方,因此,为实现联邦训练建模,需要对双方提供的原始样本数据进行样本对齐。In addition, because the two training samples belong to different data parties, in order to achieve federal training modeling, it is necessary to perform sample alignment on the original sample data provided by both parties.
本实施例中，联邦训练是指样本训练过程由两数据方协作共同完成，最终训练得到的梯度提升树模型所包含的回归树，其分裂节点对应双方训练样本的特征。In this embodiment, federated training means that the sample training process is completed cooperatively by the two data parties; in the regression trees of the finally trained gradient boosted tree model, the split nodes correspond to features from both parties' training samples.
XGboost算法中,在遍历训练样本所有特征的所有分割方法时,通过收益值来评价分割方法的优劣,每次分裂节点都选择损失最小的分割方法。因此,分裂节点的收益值可作为特征重要性的评价依据,分裂节点的收益值越大,则节点分割损失越小,进而该分裂节点对应的特征的重要性也越大。In the XGboost algorithm, when traversing all the segmentation methods of all the features of the training sample, the value of the segmentation method is evaluated by the revenue value, and each segmentation node selects the segmentation method with the smallest loss. Therefore, the revenue value of a split node can be used as the basis for evaluating the importance of features. The larger the revenue value of a split node, the smaller the node segmentation loss, and the greater the importance of the feature corresponding to the split node.
本实施例中，由于训练得到的梯度提升树模型中包括有多棵回归树，而不同回归树有可能使用了相同特征进行节点分割，因此，需要统计梯度提升树模型包括的所有回归树中同一特征对应的分裂节点的平均收益值，并将平均收益值作为对应特征的评分。In this embodiment, since the trained gradient boosted tree model includes multiple regression trees, and different regression trees may use the same feature for node splitting, it is necessary to compute, over all regression trees in the model, the average gain value of the split nodes corresponding to the same feature, and to use that average gain value as the score of the corresponding feature.
步骤S20,基于所述梯度提升树模型,对待预测样本进行联合预测,以确定待预测样本的样本类别或获得待预测样本的预测得分。Step S20: Based on the gradient boosting tree model, perform joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtain a prediction score of the samples to be predicted.
本实施例中,采用XGboost算法训练得到的梯度提升树模型可以实现对预测样本进行联合预测,从而实现对预测样本进行分类或进行打分。In this embodiment, the gradient boosted tree model trained by using the XGboost algorithm can realize joint prediction of prediction samples, thereby achieving classification or scoring of the prediction samples.
本实施例采用XGboost算法对两个对齐的训练样本进行联邦训练，以构建梯度提升树模型，其中，梯度提升树模型为回归树集合，其包括有多棵回归树，每棵回归树的一个分裂节点对应训练样本的一个特征；最后在基于梯度提升树模型，对待预测样本进行联合预测，以确定待预测样本的样本类别或获得待预测样本的预测得分。本发明实现了使用不同数据方的训练样本进行联邦训练建模，进而可实现对具有多方样本数据特征的样本进行预测。This embodiment uses the XGboost algorithm to perform federated training on two aligned training samples to build a gradient boosted tree model, where the gradient boosted tree model is a collection of multiple regression trees and each split node of each regression tree corresponds to one feature of the training samples; finally, joint prediction is performed on the samples to be predicted based on the gradient boosted tree model, to determine their sample category or obtain their prediction score. The present invention realizes federated training and modeling using training samples from different data parties, and can therefore predict samples whose features span multiple parties' sample data.
进一步地,为保证联邦建模过程中,不同数据方使用的样本梯度一致,因此,两个数据方在进行联邦建模之前,先进行双方样本对齐处理,具体处理流程如图3所示。Further, in order to ensure that the sample gradients used by different data parties are consistent during the federation modeling process, the two data parties perform sample alignment processing on both sides before performing federated modeling. The specific processing flow is shown in FIG. 3.
双方样本对齐采用盲签名和RSA加密演算法对样本ID进行交互加密方案，通过比较加密后的ID加密串来识别双方样本中交集部分与非交集部分(隐私部分，彼此双方不可见)，为实现对非交集部分样本数据的隐私保护，本发明在样本对齐过程中需要对样本数据进行加密。The sample alignment between the two parties applies an interactive encryption scheme to the sample IDs using blind signatures and the RSA algorithm; by comparing the encrypted ID strings, the intersection and non-intersection parts of the two parties' samples are identified (the non-intersection part stays private and invisible to the other party). To protect the privacy of the non-intersecting sample data, the present invention encrypts the sample data during the sample alignment process.
假设数据A方的样本id标识为X_A = {u1,u2,u3,u4}，数据B方的样本id标识为X_B = {u1,u2,u3,u5}，数据x的盲签名为E(x)，B方生成的RSA密钥是(n,e,d)，A方得到的RSA公钥是(n,e)，进行如下示例过程:Assume the sample ids of data party A are X_A = {u1, u2, u3, u4}, the sample ids of data party B are X_B = {u1, u2, u3, u5}, the blind signature of data x is E(x), the RSA key generated by party B is (n, e, d), and the RSA public key obtained by party A is (n, e). The example then proceeds as follows:
(1)A方对id加密：Y_A = {(r^e mod n)·E(u) | u∈X_A}，其中r是对应于X_A中每一个不同的样本id生成的不同随机数，然后A方把Y_A发送给B方;(1) Party A encrypts the ids: Y_A = {(r^e mod n)·E(u) | u∈X_A}, where r is a distinct random number generated for each different sample id in X_A; party A then sends Y_A to party B;
(2)B方把该id加密串进行再次加密：Z_A = {y^d mod n | y∈Y_A}，B方再把双层加密的串Z_A发给A方;(2) Party B encrypts the encrypted id strings again: Z_A = {y^d mod n | y∈Y_A}, and sends the doubly encrypted strings Z_A back to party A;
(3)A方对Z_A进行如下操作，即对每个元素去盲化后再作一次盲签名（由于z = r·E(u)^d mod n，去盲化即消去r，得到E(u)^d）：(3) Party A performs the following operation on Z_A, unblinding each element and then applying the blind signature once more (since z = r·E(u)^d mod n, unblinding cancels r and leaves E(u)^d):

D_A = {E(z·r^(-1) mod n) | z∈Z_A} = {E(E(u)^d mod n) | u∈X_A}
(4)B方对id加密：Z_B = {E(E(u)^d mod n) | u∈X_B}，然后把Z_B发送给A方;(4) Party B encrypts its own ids: Z_B = {E(E(u)^d mod n) | u∈X_B}, and then sends Z_B to party A;
(5)A方比较D_A和Z_B，如果两个加密串相等，则表示对应的样本id相等。相等的id即样本交集部分({u1,u2,u3})，保留；不相等的部分({u4,u5})由于始终以加密形式出现，双方对此不可见，可丢弃。(5) Party A compares D_A and Z_B; equal encrypted strings indicate equal sample ids. The equal ids form the sample intersection ({u1, u2, u3}) and are kept; the unequal parts ({u4, u5}) only ever appear in encrypted form, remain invisible to both parties, and can be discarded.
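The five-step exchange can be sketched end to end as a toy program. Assumptions to note: a deliberately tiny RSA modulus, SHA-256 standing in for the blind-signature hash E, both parties simulated in one process, and Python 3.8+ for modular inverses via `pow(x, -1, n)`; this is an illustration of the protocol's arithmetic, not production-grade cryptography.

```python
import hashlib
import math
import random

def H(x, n):
    # The "blind signature" E(x), modelled here as SHA-256 mapped into Z_n.
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big") % n

def psi(ids_a, ids_b):
    p, q, e = 1000003, 1000033, 65537            # party B's toy RSA key
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    # (1) A blinds each id:  y = (r^e mod n) * E(u) mod n
    blind, y_a = {}, {}
    for u in ids_a:
        r = random.randrange(2, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(2, n)
        blind[u] = r
        y_a[u] = (pow(r, e, n) * H(u, n)) % n
    # (2) B signs the blinded strings:  z = y^d = r * E(u)^d mod n
    z_a = {u: pow(y, d, n) for u, y in y_a.items()}
    # (3) A unblinds and hashes:  D_A = E(z * r^-1 mod n) = E(E(u)^d mod n)
    d_a = {u: H(pow(blind[u], -1, n) * z % n, n) for u, z in z_a.items()}
    # (4) B signs and hashes its own ids:  Z_B = E(E(u)^d mod n)
    z_b = {H(pow(H(u, n), d, n), n) for u in ids_b}
    # (5) A keeps exactly the ids whose tokens also appear in Z_B
    return {u for u, t in d_a.items() if t in z_b}
```

Because z = (r^e·E(u))^d = r·E(u)^d mod n, unblinding with r⁻¹ gives both parties comparable tokens E(E(u)^d) without either side revealing its raw ids.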
进一步地,为便于描述本发明的联合训练的具体实现方式,本实施例具体以两个独立的训练样本进行举例说明。Further, in order to facilitate the description of the specific implementation of the joint training of the present invention, this embodiment specifically uses two independent training samples for illustration.
本实施例中，第一数据方提供第一训练样本，第一训练样本属性包括样本ID以及部分样本特征；第二数据方提供第二训练样本，第二训练样本属性包括样本ID、另一部分样本特征以及数据标签。In this embodiment, the first data party provides the first training sample, whose attributes include a sample ID and part of the sample features; the second data party provides the second training sample, whose attributes include a sample ID, the other part of the sample features, and a data label.
其中,样本特征是指样本所表现或具有的特征,比如样本为人,则对应的样本特征可以是年龄、性别、收入、学历等。数据标签用于对多个不同的样本进行分类,分类的结果具体依据于样本的特征进行判定得出。The sample characteristics refer to the characteristics exhibited or possessed by the sample. For example, if the sample is a person, the corresponding sample characteristics may be age, gender, income, education, etc. Data labels are used to classify multiple different samples. The classification results are determined based on the characteristics of the samples.
本发明联邦训练进行建模的主要意义在于实现双方样本数据的双向隐私保护。因此，在联邦训练过程中，第一训练样本保存在第一数据方本地，第二训练样本保存在第二数据方本地，例如下面表1中数据由第一数据方提供并保存在第一数据方本地，下面表2中数据由第二数据方提供并保存在第二数据方本地。The main significance of the federated training modeling of the present invention is to achieve two-way privacy protection of both parties' sample data. Therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample locally at the second data party; for example, the data in Table 1 below is provided by and stored locally at the first data party, and the data in Table 2 below is provided by and stored locally at the second data party.
表1 Table 1

样本ID Sample ID | Age | Gender | Amount of given credit
X1 | 20 | 1 | 5000
X2 | 30 | 1 | 300000
X3 | 35 | 0 | 250000
X4 | 48 | 0 | 300000
X5 | 10 | 1 | 200
如上表1所示，第一训练样本属性包含有样本ID(X1~X5)、Age特征、Gender特征以及Amount of given credit特征。As shown in Table 1 above, the attributes of the first training sample include a sample ID (X1 to X5), an Age feature, a Gender feature, and an Amount of given credit feature.
表2 Table 2

样本ID Sample ID | Bill Payment | Education | Lable
X1 | 3102 | 2 | 24
X2 | 17250 | 3 | 14
X3 | 14027 | 2 | 16
X4 | 6787 | 1 | 10
X5 | 280 | 1 | 26
如上表2所示，第二训练样本属性包含有样本ID(X1~X5)、Bill Payment特征、Education特征以及数据标签Lable。As shown in Table 2 above, the attributes of the second training sample include a sample ID (X1 to X5), a Bill Payment feature, an Education feature, and a data label Lable.
进一步地,参照图4,图4为图2中步骤S10一实施例的细化流程示意图。基于上述实施例,本实施例中,上述步骤S10具体包括:Further, referring to FIG. 4, FIG. 4 is a schematic diagram of a detailed process of an embodiment of step S10 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S10 specifically includes:
步骤S101，在所述第二数据方侧，获取本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;In step S101, on the second data party's side, obtain the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
XGboost算法是一种机器学习建模方法,需要使用分类器(也即分类函数)把样本数据映射到给定类别中的某一个,从而可以应用于数据预测。在利用分类器学习分类规则过程中,需要使用损失函数来判断机器学习的拟合误差大小。XGboost algorithm is a machine learning modeling method. It needs to use a classifier (that is, a classification function) to map sample data to a certain category, so that it can be applied to data prediction. In the process of using the classifier to learn classification rules, it is necessary to use a loss function to determine the size of the fitting error of machine learning.
本实施例中，在每次进行节点分裂时，在第二数据方侧，获取本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度。In this embodiment, each time node splitting is performed, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting are obtained on the second data party's side.
其中,梯度提升树模型需要进行多轮联邦训练,每一轮联邦训练对应生成一棵回归树,而一棵回归树的生成需要进行多次节点分裂。Among them, the gradient boosting tree model requires multiple rounds of federal training. Each round of federal training corresponds to the generation of a regression tree, and the generation of a regression tree requires multiple node splits.
因此，在每一轮联邦训练过程中，首次节点分裂使用的是最开始保存的训练样本，下一次的节点分裂则会使用上一次节点分裂所产生的新节点对应样本集的训练样本，并且同一轮联邦训练过程中，每一轮节点分裂都沿用首轮节点分裂所使用的一阶梯度与二阶梯度。而下一轮的联邦训练会使用上一轮联邦训练结果更新上一轮联邦训练所使用的一阶梯度与二阶梯度。Therefore, within each round of federated training, the first node split uses the initially saved training samples, and each subsequent split uses the training samples of the sample sets corresponding to the new nodes generated by the previous split; within one round of federated training, every node split re-uses the first-order and second-order gradients of that round's first split. The next round of federated training then updates those first-order and second-order gradients using the result of the previous round.
XGboost算法支持自定义损失函数，使用自定义的损失函数对目标函数求一阶偏导数与二阶偏导数，对应得到本地待训练的样本数据的一阶梯度与二阶梯度。The XGboost algorithm supports custom loss functions: the first-order and second-order partial derivatives of the objective function are taken with the custom loss function, yielding the first-order gradient and second-order gradient of the local sample data to be trained.
基于上述实施例中对于XGboost算法与梯度提升树模型的说明，构建回归树需要确定分裂节点，而分裂节点可通过收益值确定。收益值gain的计算公式如下：Based on the description of the XGboost algorithm and the gradient boosted tree model in the above embodiment, constructing a regression tree requires determining split nodes, and split nodes can be determined by their gain values. The gain is calculated as:

gain = 1/2 · [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ

其中，I_L代表当前节点分裂后左子节点包含的样本集合，I_R代表当前节点分裂后右子节点包含的样本集合，I = I_L ∪ I_R，g_i表示样本i的一阶梯度，h_i表示样本i的二阶梯度，λ、γ为常数。Here I_L is the sample set contained in the left child after the current node splits, I_R the sample set contained in the right child, I = I_L ∪ I_R, g_i the first-order gradient of sample i, h_i the second-order gradient of sample i, and λ, γ are constants.
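For concreteness, a plain-text sketch of this gain computation for one candidate split follows; in the federated scheme the sums over g_i and h_i would be evaluated under homomorphic encryption, and the function name and default constants here are illustrative.

```python
def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting the current node's samples into I_L and I_R,
    given per-sample first-order (g) and second-order (h) gradients."""
    def score(idx):
        G = sum(g[i] for i in idx)    # sum of first-order gradients
        Hs = sum(h[i] for i in idx)   # sum of second-order gradients
        return G * G / (Hs + lam)
    all_idx = list(left_idx) + list(right_idx)  # I = I_L ∪ I_R
    return 0.5 * (score(left_idx) + score(right_idx) - score(all_idx)) - gamma
```

Each party evaluates this quantity for every candidate split of its own features; the split with the largest gain wins.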
由于待训练的样本数据分别存在第一数据方与第二数据方，因此，需要在第一数据方侧与第二数据方侧分别计算各自样本数据在每一种分裂方式下分裂节点的收益值。Since the sample data to be trained resides separately at the first data party and the second data party, the gain value of the split node under each splitting mode must be calculated for each party's own sample data on that party's side.
本实施例中，由于第一数据方与第二数据方预先进行了样本对齐，因而双方具有相同的梯度特征，同时由于数据标签存在于第二数据方的样本数据中，因此，基于第二数据方的样本数据的一阶梯度与二阶梯度，计算双方样本数据在每一种分裂方式下分裂节点的收益值。In this embodiment, because the first and second data parties have aligned their samples in advance, the two parties share the same gradients; and because the data label resides in the second data party's sample data, the gain values of the split nodes of both parties' sample data under each splitting mode are calculated based on the first-order and second-order gradients of the second data party's sample data.
步骤S102，若本轮节点分裂为构造回归树的首轮节点分裂，则对所述一阶梯度与所述二阶梯度进行加密后与所述样本集的样本ID一起发送至所述第一数据方，以供在所述第一数据方侧基于加密的所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S102, if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order gradient and second-order gradient are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each splitting mode can be calculated, based on the encrypted gradients, for the local training samples corresponding to the sample IDs;
本实施例中，为实现联邦训练过程中实现双方样本数据的双向隐私保护，因此，若本轮节点分裂为构造回归树的首轮节点分裂，则在第二数据方侧计算得到样本数据的一阶梯度与二阶梯度后，先进行加密，然后再发送给第一数据方。In this embodiment, to achieve two-way privacy protection of both parties' sample data during federated training, if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients computed on the second data party's side are first encrypted and then sent to the first data party.
在第一数据方侧，基于接收到的样本数据的一阶梯度与二阶梯度，以及上述收益值gain的计算公式，计算得到第一数据方本地样本数据在每一种分裂方式下分裂节点的收益值，由于一阶梯度与二阶梯度进行了加密，因此，计算得到的收益值也是加密值，因而无需对收益值进行加密。On the first data party's side, the gain values of the split nodes of its local sample data under each splitting mode are calculated from the received first-order and second-order gradients and the gain formula above. Since the gradients are encrypted, the calculated gain values are themselves encrypted, so no further encryption of the gain values is needed.
在计算出样本数据的各种分割方式下分裂节点的收益值后,即可分裂生成新节点以构建回归树。本实施例优选由样本数据具有数据标签的第二数据方主导构建梯度提升树模型的回归树。因此,需要将在第一数据方侧计算得到的第一数据方本地样本数据在每一种分裂方式下分裂节点的收益值发送给第二数据方。After calculating the revenue value of the split node under various segmentation methods of the sample data, the new node can be split to generate a regression tree. In this embodiment, it is preferred that a second data party having sample data with a data label dominate the construction of the regression tree of the gradient boosted tree model. Therefore, the local sample data of the first data side calculated on the first data side needs to be sent to the second data side for the revenue value of the split node in each split mode.
步骤S103，若本轮节点分裂为构造回归树的非首轮节点分裂，则将所述样本集的样本ID发送至所述第一数据方，以供在所述第一数据方侧沿用首轮节点分裂所使用的一阶梯度与二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S103, if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, the sample IDs of the sample set are sent to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting can be re-used to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
本实施例中，若本轮节点分裂为构造回归树的非首轮节点分裂，则只需将本轮节点分裂对应的样本集的样本ID发送给第一数据方，而第一数据方继续沿用首轮节点分裂时所使用的一阶梯度与二阶梯度，计算本地与接收到的样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值。In this embodiment, if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the corresponding sample set need to be sent to the first data party; the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the received sample IDs.
步骤S104，第二数据方接收所述第一数据方返回的所有分裂节点的加密收益值并进行解密;In step S104, the second data party receives the encrypted gain values of all split nodes returned by the first data party and decrypts them;
步骤S105，在所述第二数据方侧，基于所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S105, on the second data party's side, based on the first-order gradient and the second-order gradient, calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
在第二数据方侧，基于计算得到的样本数据的一阶梯度与二阶梯度，以及上述收益值gain的计算公式，计算第二数据方本地待训练的样本数据在每一种分裂方式下分裂节点的收益值。On the second data party's side, the gain values of the split nodes of the second data party's local sample data under each splitting mode are calculated from the computed first-order and second-order gradients and the gain formula above.
步骤S106,基于双方各自计算出的所有分裂节点的收益值,确定本轮节点分裂的全局最佳分裂节点;Step S106: Determine the global best split node for the current round of node splitting based on the return values of all split nodes calculated by both parties;
由于双方初始的样本数据进行了样本对齐，因此，双方各自计算出的所有分裂节点的收益值可以看成是双方整体数据样本在每一种分裂方式下分裂节点的收益值，因此，通过比较各收益值的大小，将收益值最大的分裂节点作为本轮节点分裂的全局最佳分裂节点。Since the two parties' initial sample data were aligned, the gain values of all split nodes calculated by each party can be viewed as the gain values of the split nodes of the combined data samples under every splitting mode; therefore, by comparing the gain values, the split node with the largest gain value is taken as the global best split node for the current round of node splitting.
需要说明的是,该全局最佳分裂节点对应的样本特征既有可能属于第一数据方的训练样本,也有可能属于第二数据方的训练样本。It should be noted that the sample features corresponding to the global best split node may belong to both the training samples on the first data side and the training samples on the second data side.
可选的，由于梯度提升树模型的回归树构建由第二数据方主导，因此，在第二数据方侧，需要记录每一轮节点分裂确定的全局最佳分裂节点的相关信息；相关信息包括：对应样本数据的提供方、对应样本数据的特征编码以及收益值。Optionally, since the construction of the regression trees of the gradient boosted tree model is led by the second data party, the second data party's side needs to record the relevant information of the global best split node determined in each round of node splitting; the relevant information includes the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
例如，若数据方A持有全局最佳分割点对应的特征f_i，则这条记录为(Site A, E_A(f_i), gain)。反之，若数据方B持有全局最佳分割点对应的特征f_i，则这条记录为(Site B, E_B(f_i), gain)。其中，E_A(f_i)表示数据方A对特征f_i进行编码，E_B(f_i)表示数据方B对特征f_i进行编码，通过编码可以标示特征f_i而不泄露其原始特征数据。For example, if data party A holds the feature f_i corresponding to the global best split point, the record is (Site A, E_A(f_i), gain); conversely, if data party B holds it, the record is (Site B, E_B(f_i), gain). Here E_A(f_i) denotes party A's encoding of feature f_i and E_B(f_i) party B's encoding of it; the encoding identifies feature f_i without revealing the original feature data.
可选的,在上述实施例中进行特征选择时,优选以各全局最佳分裂节点作为梯度提升树模型中各回归树的分裂节点,统计同一特征编码对应的分裂节点的平均收益值。Optionally, when performing feature selection in the above embodiment, it is preferable to use each global best split node as the split node of each regression tree in the gradient boosting tree model to count the average return value of the split nodes corresponding to the same feature code.
步骤S107,基于本轮节点分裂的全局最佳分裂节点,对当前节点对应的样本集进行分裂,生成新的节点以构建梯度提升树模型的回归树。Step S107: Based on the global best split node of the current node split, split the sample set corresponding to the current node to generate a new node to build a regression tree of the gradient boosted tree model.
若本轮节点分裂的全局最佳分裂节点对应的样本特征属于第一数据方的训练样本,则本轮分割的当前节点对应的样本数据属于第一数据方。相应地,若本轮节点分裂的全局最佳分裂节点对应的样本特征属于第二数据方的训练样本,则本轮分割的当前节点对应的样本数据属于第二数据方。If the sample features corresponding to the global best split node of the current node split belong to the training samples of the first data side, the sample data corresponding to the current node split of the current round belongs to the first data side. Correspondingly, if the sample features corresponding to the global best split node of the current node split belong to the training samples of the second data side, the sample data corresponding to the current node split of the current round belongs to the second data side.
通过节点分裂,即可生成新的节点(左子节点和右子节点),从而构建回归树。而通过多轮节点分裂,则可以不断生成新的节点,进而得到树深度更深的回归树,而若停止节点分裂,则可得到梯度提升树模型的一棵回归树。Through node splitting, new nodes (left and right child nodes) can be generated to build a regression tree. Through multiple rounds of node splitting, new nodes can be continuously generated, and a deeper regression tree can be obtained. If node splitting is stopped, a regression tree of the gradient boosted tree model can be obtained.
本实施例中,由于双方计算通信的数据都是模型中间结果的加密数据,因此训练过程也不会泄露原始特征数据。同时整个训练过程中使用加密算法以保证数据的隐私性。优选采用部分同态加密算法,支持加法同态。In this embodiment, since the data calculated and communicated by both parties are encrypted data of the intermediate results of the model, the training process will not leak the original feature data. At the same time, an encryption algorithm is used throughout the training process to ensure the privacy of the data. Partial homomorphic encryption algorithm is preferably used, which supports addition homomorphism.
进一步地,在一实施例中,基于节点分裂条件的不同,具体通过以下方式得到用于节点分裂的训练样本的一阶梯度与二阶梯度:Further, in one embodiment, based on the difference in node splitting conditions, the one-step and two-step degrees of the training samples for node splitting are specifically obtained in the following manner:
1、本轮节点分裂对应构造首棵回归树1.The first round of node split corresponds to the construction of the first regression tree
1.1、若本轮节点分裂为构造首棵回归树的首轮节点分裂，则在第二数据方侧，初始化本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;1.1. If the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party's side, initialize the first-order gradient and second-order gradient of each training sample in the sample set corresponding to this round of node splitting;
1.2、若本轮节点分裂为构造首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。1.2. If the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
2、本轮节点分裂对应构造非首棵回归树2.Corresponding to the current round of node splitting to construct a non-first regression tree
2.1、若本轮节点分裂对应构造非首棵回归树的首轮节点分裂，则根据上一轮联邦训练更新一阶梯度与二阶梯度;2.1. If the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, update the first-order gradient and second-order gradient according to the previous round of federated training;
2.2、若本轮节点分裂为构造非首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。2.2. If the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
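Cases 1.1-2.2 above amount to a small piece of gradient bookkeeping, which can be sketched as follows; `state`, `init_fn`, and `update_fn` are hypothetical placeholders for the loss-specific derivative computations performed on the second data party's side, not the patent's actual interfaces.

```python
def gradients_for_round(state, first_tree, first_split, update_fn, init_fn):
    """Return the (g, h) gradients to use for one node split.
    state persists between calls; init_fn/update_fn compute fresh gradients."""
    if first_split:
        if first_tree:
            state["g"], state["h"] = init_fn()    # case 1.1: initialize
        else:
            state["g"], state["h"] = update_fn()  # case 2.1: update from last round
    # cases 1.2 / 2.2: a non-first split re-uses the stored gradients
    return state["g"], state["h"]
```

Within one tree the gradients are computed once and re-used; they change only when a new tree starts.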
Further, in an embodiment, in order to reduce the complexity of the regression trees, a depth threshold is preset for the regression trees to limit node splitting.
In this embodiment, each time new nodes are generated to build a regression tree of the gradient boosting tree model, the second data party determines whether the depth of the current regression tree reaches the preset depth threshold;
If the depth of the current regression tree reaches the preset depth threshold, node splitting is stopped, yielding one regression tree of the gradient boosting tree model; otherwise the next round of node splitting continues.
It should be noted that the condition limiting node splitting may also be that splitting stops when a node cannot be split further; for example, when the current node corresponds to only one sample, node splitting cannot continue.
Further, in another embodiment, in order to avoid over-fitting during training, a threshold on the number of regression trees is preset to limit how many regression trees are generated.
In this embodiment, when node splitting stops, the second data party determines whether the total number of regression trees built so far reaches the preset number threshold;
If the total number of regression trees reaches the preset number threshold, federated training is stopped; otherwise the next round of federated training continues.
It should be noted that the condition limiting the number of generated regression trees may also be to stop building regression trees when the nodes can no longer be split.
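The two stopping conditions described above can be sketched as simple predicates; the parameter names are illustrative, not taken from the patent.

```python
def should_stop_splitting(tree_depth, node_sample_count, max_depth):
    """Stop splitting when the preset depth threshold is reached, or when
    the node cannot be split further (e.g. it holds a single sample)."""
    return tree_depth >= max_depth or node_sample_count <= 1

def should_stop_training(tree_count, max_trees):
    """Stop federated training once the preset number of regression trees
    has been built."""
    return tree_count >= max_trees
```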
To better understand the present invention, the federated training and modeling process of the present invention is illustrated below based on the sample data in Tables 1 and 2 of the foregoing embodiments.
First round of federated training: training the first regression tree
(1) First round of node splitting
1.1. On the second data party's side, compute the first-order gradient (g_i) and second-order gradient (h_i) of the sample data in Table 2; encrypt g_i and h_i and send them to the first data party;
1.2. On the first data party's side, compute, based on g_i and h_i, the gain value of the split node under every possible division of the sample data in Table 1; send the gain values to the second data party;
Since the Age feature in Table 1 admits 5 ways of dividing the sample data, the Gender feature 2 ways, and the Amount of given credit feature 5 ways, the sample data in Table 1 admits 12 division schemes in total; that is, the gain values of the split nodes corresponding to 12 division schemes need to be computed.
1.3. On the second data party's side, compute the gain value of the split node under every possible division of the sample data in Table 2;
Since the Bill Payment feature in Table 2 admits 5 ways of dividing the sample data and the Education feature 3 ways, the sample data in Table 2 admits 8 division schemes in total; that is, the gain values of the split nodes corresponding to 8 division schemes need to be computed.
1.4. From the gain values of the split nodes under the 12 division schemes computed on the first data party's side and the gain values of the split nodes under the 8 division schemes computed on the second data party's side, select the feature corresponding to the maximum gain value as the globally optimal split node of the current round of node splitting;
1.5. Based on the globally optimal split node of the current round, split the sample data corresponding to the current node and generate new nodes to build a regression tree of the gradient boosting tree model.
1.6. Determine whether the depth of the current regression tree reaches the preset depth threshold; if it does, stop node splitting, thereby obtaining one regression tree of the gradient boosting tree model; otherwise continue with the next round of node splitting;
1.7. Determine whether the total number of regression trees reaches the preset number threshold; if it does, stop federated training; otherwise enter the next round of federated training.
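The per-division gain values enumerated in steps 1.2 and 1.3 can be sketched with the standard XGBoost gain formula. This is a plaintext, single-machine illustration only: the regularization terms λ and γ are assumed at typical defaults, and the homomorphic computation over encrypted gradients that the protocol requires on the first data party's side is omitted.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost split gain: 0.5 * (GL^2/(HL+lam) + GR^2/(HR+lam)
    - (GL+GR)^2/(HL+HR+lam)) - gamma."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def enumerate_split_gains(values, grads, hess, thresholds):
    """For one feature, compute the gain of every candidate division
    ('value <= threshold' goes left) -- one gain per division scheme."""
    gains = []
    for t in thresholds:
        gl = sum(g for v, g in zip(values, grads) if v <= t)
        hl = sum(h for v, h in zip(values, hess) if v <= t)
        gains.append(split_gain(gl, hl, sum(grads) - gl, sum(hess) - hl))
    return gains
```

The 12 + 8 division schemes in the example would each contribute one such gain value, and step 1.4 simply takes the maximum over all of them.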
(2) Second and third rounds of node splitting
2.1. Assume that the feature corresponding to the previous round of node splitting is Bill Payment less than or equal to 3102. That feature serves as the split node (the corresponding samples being X1, X2, X3, X4 and X5) and produces two new child nodes: the left node corresponds to the sample set (X1, X5) with values less than or equal to 3102, and the right node corresponds to the sample set (X2, X3, X4) with values greater than 3102. The sample sets (X1, X5) and (X2, X3, X4) then serve as new sample sets for the second and third rounds of node splitting, respectively, so that the two new nodes are split and further nodes are generated;
2.2. Since the second and third rounds of node splitting belong to the same round of federated training, the sample gradient values used in the first round of node splitting continue to be used. Assume that the feature corresponding to one split node of this round is Amount of given credit less than or equal to 200; that feature serves as the split node (the corresponding samples being X1 and X5) and produces two new child nodes, where the left node corresponds to sample X5 (less than or equal to 200) and the right node to sample X1 (greater than 200). Likewise, the feature corresponding to the other split node of this round is Age less than or equal to 35; that feature serves as the split node (the corresponding samples being X2, X3 and X4) and produces two new child nodes, where the left node corresponds to samples X2 and X3 (less than or equal to 35) and the right node to sample X4 (greater than 35). For the specific implementation flow, refer to the first round of node splitting.
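The partitioning of a node's sample set into two child sample sets, as in the walk-through above, can be sketched as follows. The Bill Payment values attached to X1–X5 here are hypothetical placeholders (the actual values come from Table 2), chosen only so that the grouping matches the example.

```python
def partition(sample_set, feature, threshold):
    """Split a node's sample set: 'value <= threshold' goes to the left
    child, the rest to the right child, each becoming a new sample set."""
    left = {sid: s for sid, s in sample_set.items() if s[feature] <= threshold}
    right = {sid: s for sid, s in sample_set.items() if s[feature] > threshold}
    return left, right

# Hypothetical Bill Payment values reproducing the example's grouping:
samples = {"X1": {"Bill Payment": 1500}, "X2": {"Bill Payment": 4000},
           "X3": {"Bill Payment": 5200}, "X4": {"Bill Payment": 6787},
           "X5": {"Bill Payment": 900}}
left, right = partition(samples, "Bill Payment", 3102)
# left holds X1 and X5; right holds X2, X3 and X4, as in the walk-through
```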
Second round of federated training: training the second regression tree
3.1. Since this round of node splitting belongs to the next round of federated training, the first-order and second-order gradients used in the previous round of federated training are updated with the result of that round, and the second round of federated training proceeds with node splitting, generating new nodes to build the next regression tree. For the specific implementation flow, refer to the construction of the previous regression tree.
3.2. As shown in FIG. 5, after two rounds of federated training, the sample data in Tables 1 and 2 of the above embodiments produce two regression trees. The first regression tree contains three split nodes: Bill Payment less than or equal to 3102, Amount of given credit less than or equal to 200, and Age less than or equal to 35. The second regression tree contains two split nodes: Bill Payment less than or equal to 6787, and Gender == 1.
3.3. Based on the two regression trees of the gradient boosting tree model shown in FIG. 5, the average gain values corresponding to the features of the sample data are: Bill Payment, (gain1 + gain4)/2; Education, 0; Age, gain3; Gender, gain5; Amount of given credit, gain2.
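The per-feature average gain of step 3.3 can be computed as below. The numeric values stand in for gain1…gain5, the gains recorded at the five split nodes of the two trees in FIG. 5; they are placeholders, not values from the patent.

```python
from collections import defaultdict

def average_gain_per_feature(split_records, all_features):
    """split_records: (feature, gain) pairs collected from every split node
    across all regression trees. A feature never used in any split gets 0."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, gain in split_records:
        totals[feature] += gain
        counts[feature] += 1
    return {f: (totals[f] / counts[f] if counts[f] else 0.0)
            for f in all_features}

# Placeholder gains for the five split nodes of the two trees in FIG. 5:
records = [("Bill Payment", 1.0), ("Amount of given credit", 2.0),
           ("Age", 3.0), ("Bill Payment", 4.0), ("Gender", 5.0)]
features = ["Bill Payment", "Education", "Age", "Gender",
            "Amount of given credit"]
avg = average_gain_per_feature(records, features)
# Bill Payment averages (gain1 + gain4) / 2; Education, never used, gets 0
```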
Further, in an embodiment of the federated training-based sample prediction method of the present invention, the specific implementation flow of joint prediction on a sample to be predicted includes:
(1) On the second data party's side, traverse the regression trees corresponding to the gradient boosting tree model;
(2) If the attribute value of the currently traversed node is recorded at the second data party, determine the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
(3) If the attribute value of the currently traversed node is recorded at the first data party, initiate a query request to the first data party, so that on the first data party's side the next node to traverse is determined by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node, and that node information is returned to the second data party;
(4) When the regression trees corresponding to the gradient boosting tree model have been fully traversed, determine the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which the sample belongs, or obtain the prediction score of the sample based on the weight value of that node.
In this embodiment, since the split-node records of a regression tree are kept on the second data party's side when the tree is generated, the second data party takes the lead in completing the joint prediction of the sample to be predicted, specifically by traversing the regression trees of the gradient boosting tree model to determine the node to which the sample belongs. The node to which the sample to be predicted belongs is determined by comparing the data points of the sample with the attribute values of the currently traversed node.
After the node to which the sample to be predicted belongs has been determined, the sample category of the sample can be determined based on the data labels of the training samples corresponding to that node, or the prediction score of the sample can be obtained based on the weight value of that node.
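The joint-prediction traversal described above can be sketched as follows. The node layout, the "A"/"B" party tags, and the `query_remote` callback are illustrative stand-ins: the callback represents the query request sent to the first data party when the split attribute is recorded there, and in the real protocol the second data party never sees the first party's feature values.

```python
def traverse(tree, local_features, query_remote):
    """Walk one regression tree from the root to a leaf on the second
    data party's side. Internal nodes carry {"party", "feature",
    "threshold", "left", "right"}; leaves carry {"leaf": weight}."""
    nodes = tree["nodes"]
    node = nodes[tree["root"]]
    while "leaf" not in node:
        if node["party"] == "B":  # attribute recorded at the second party
            go_left = local_features[node["feature"]] <= node["threshold"]
            node = nodes[node["left"] if go_left else node["right"]]
        else:  # attribute at the first party: ask it which child to take
            node = nodes[query_remote(node)]
    return node["leaf"]

def predict_score(trees, local_features, query_remote):
    # The prediction score sums the leaf weights over all regression trees.
    return sum(traverse(t, local_features, query_remote) for t in trees)
```

A toy tree mirroring the shape of the first tree in FIG. 5 (with made-up leaf weights) would be traversed locally at the Bill Payment node and via `query_remote` at a node whose attribute belongs to the first data party.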
The present invention further provides a computer-readable storage medium.
The computer-readable storage medium of the present invention stores a sample prediction program which, when executed by a processor, implements the steps of the federated training-based sample prediction method described in any one of the foregoing embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described above, which are merely illustrative rather than restrictive. Under the teaching of the present invention, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present invention and the scope protected by the claims; any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the protection of the present invention.

Claims (20)

  1. A sample prediction method based on federated training, characterized in that the sample prediction method based on federated training comprises the following steps:
    performing federated training on two aligned training samples by using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted.
  2. The sample prediction method based on federated training according to claim 1, characterized in that the sample prediction method based on federated training comprises:
    before performing the federated training, interactively encrypting the IDs of the sample data by using a blind signature and the RSA encryption algorithm;
    identifying the intersection of the two parties' samples by comparing the encrypted ID strings of the two parties, and using the intersection of the samples as the training samples after sample alignment.
  3. The sample prediction method based on federated training according to claim 2, characterized in that the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample comprise a sample ID and a part of the sample features, and the attributes of the second training sample comprise a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
  4. The sample prediction method based on federated training according to claim 3, characterized in that performing federated training on the two aligned training samples by using the XGboost algorithm to construct the gradient boosting tree model comprises:
    on the second data party's side, obtaining a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient and sending them, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each division scheme is computed, based on the encrypted first-order and second-order gradients, for the local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting are reused to compute the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    the second data party receiving the encrypted gain values of all the split nodes returned by the first data party and decrypting them;
    on the second data party's side, computing, based on the first-order gradient and the second-order gradient, the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    determining the globally optimal split node of the current round of node splitting based on the gain values of all the split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the globally optimal split node of the current round, and generating new nodes to build a regression tree of the gradient boosting tree model.
  5. The sample prediction method based on federated training according to claim 4, characterized in that, before the step of obtaining, on the second data party's side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
    when performing node splitting, determining whether the current round of node splitting corresponds to constructing the first regression tree;
    if the current round of node splitting corresponds to constructing the first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, on the second data party's side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
    if the current round of node splitting corresponds to constructing a non-first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the non-first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the non-first regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing the non-first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
  6. The sample prediction method based on federated training according to claim 4, characterized in that the sample prediction method based on federated training further comprises:
    when new nodes are generated to build a regression tree of the gradient boosting tree model, determining, on the second data party's side, whether the depth of the current regression tree reaches a preset depth threshold;
    if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise continuing with the next round of node splitting;
    when node splitting is stopped, determining, on the second data party's side, whether the total number of regression trees reaches a preset number threshold;
    if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise continuing with the next round of federated training.
  7. The sample prediction method based on federated training according to claim 4, characterized in that the sample prediction method based on federated training further comprises:
    recording, on the second data party's side, related information of the globally optimal split node determined in each round of node splitting;
    wherein the related information comprises: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
  8. The sample prediction method based on federated training according to claim 7, characterized in that performing joint prediction on the sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted, comprises:
    traversing, on the second data party's side, the regression trees corresponding to the gradient boosting tree model;
    if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
    if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that on the first data party's side the next node to traverse is determined by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node, and that node information is returned to the second data party;
    when the regression trees corresponding to the gradient boosting tree model have been fully traversed, determining the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which the sample belongs, or obtaining the prediction score of the sample based on the weight value of that node.
  9. A sample prediction apparatus based on federated training, characterized in that the sample prediction apparatus based on federated training comprises a memory, a processor, and a sample prediction program stored in the memory and executable on the processor, wherein the sample prediction program, when executed by the processor, implements the following steps:
    performing federated training on two aligned training samples by using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted.
  10. The sample prediction apparatus based on federated training according to claim 9, characterized in that the processor invokes the sample prediction program stored in the memory to further perform the following steps:
    before performing the federated training, interactively encrypting the IDs of the sample data by using a blind signature and the RSA encryption algorithm;
    identifying the intersection of the two parties' samples by comparing the encrypted ID strings of the two parties, and using the intersection of the samples as the training samples after sample alignment.
  11. The sample prediction apparatus based on federated training according to claim 10, characterized in that the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample comprise a sample ID and a part of the sample features, and the attributes of the second training sample comprise a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
  12. The sample prediction apparatus based on federated training according to claim 11, characterized in that performing federated training on the two aligned training samples by using the XGboost algorithm to construct the gradient boosting tree model comprises:
    on the second data party's side, obtaining a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient and sending them, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each division scheme is computed, based on the encrypted first-order and second-order gradients, for the local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting are reused to compute the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    the second data party receiving the encrypted gain values of all the split nodes returned by the first data party and decrypting them;
    on the second data party's side, computing, based on the first-order gradient and the second-order gradient, the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    determining the globally optimal split node of the current round of node splitting based on the gain values of all the split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the globally optimal split node of the current round, and generating new nodes to build a regression tree of the gradient boosting tree model.
  13. The sample prediction apparatus based on federated training according to claim 12, characterized in that, before the step of obtaining, on the second data party's side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the processor invokes the sample prediction program stored in the memory to further perform the following steps:
    when performing node splitting, determining whether the current round of node splitting corresponds to constructing the first regression tree;
    if the current round of node splitting corresponds to constructing the first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, on the second data party's side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
    if the current round of node splitting corresponds to constructing a non-first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the non-first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the non-first regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing the non-first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
  14. The federated-training-based sample prediction device according to claim 12, wherein the processor calls the sample prediction program stored in the memory to further perform the following steps:
    when a new node is generated to construct a regression tree of the gradient boosting tree model, determining, at the second data party side, whether the depth of the current regression tree reaches a preset depth threshold;
    if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting;
    when node splitting is stopped, determining, at the second data party side, whether the total number of regression trees reaches a preset number threshold;
    if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise, continuing with the next round of federated training.
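The two stopping criteria in this claim amount to a nested loop: an inner loop that splits nodes until the depth threshold is reached, and an outer loop that adds trees until the count threshold is reached. A minimal control-flow sketch, with hypothetical callback names standing in for the per-level splitting and per-tree finalization work:

```python
def train_loop(max_depth, max_trees, grow_one_level, finish_tree):
    """Control flow of the stopping criteria: the inner loop splits nodes
    until the preset depth threshold is reached; the outer loop continues
    federated training until the preset number of trees is reached."""
    trees = 0
    while trees < max_trees:          # next round of federated training
        depth = 0
        while depth < max_depth:      # next round of node splitting
            grow_one_level(depth)
            depth += 1
        finish_tree(trees)            # one regression tree completed
        trees += 1
    return trees
```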
  15. The federated-training-based sample prediction device according to claim 12, wherein the processor calls the sample prediction program stored in the memory to further perform the following steps:
    recording, at the second data party side, related information of the global optimal split node determined in each round of node splitting;
    wherein the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
  16. The federated-training-based sample prediction device according to claim 15, wherein the step of performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted, comprises:
    traversing, at the second data party side, a regression tree corresponding to the gradient boosting tree model;
    if the attribute value of the currently traversed node is recorded at the second data party, comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node to determine the next node to traverse;
    if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party compares the data point of its local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
    when the regression tree corresponding to the gradient boosting tree model has been traversed, determining the sample category of the sample to be predicted based on the data label of the sample corresponding to the node to which the sample to be predicted belongs, or obtaining the prediction score of the sample to be predicted based on the weight value of the node to which the sample to be predicted belongs.
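The joint traversal above can be sketched as a walk over a tree in which each internal node records which party holds its split feature; when the coordinating (second) party does not hold the value, it asks the other party which branch to take. The node layout and names below are hypothetical, used only to illustrate the control flow:

```python
def predict(tree, sample_b, query_party_a):
    """Walk one regression tree. `sample_b` holds the second party's local
    feature values; `query_party_a` asks the first party, by node id, which
    branch its local feature value selects (it returns True for left)."""
    node = tree
    while "leaf" not in node:
        if node["owner"] == "B":   # attribute recorded at the second data party
            go_left = sample_b[node["feature"]] < node["threshold"]
        else:                       # attribute recorded at the first data party
            go_left = query_party_a(node["id"])
        node = node["left"] if go_left else node["right"]
    return node["leaf"]             # weight value / label of the node reached
```

Note that neither party reveals its raw feature values: party A only returns the identity of the next node, matching the query-request step of the claim.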
  17. A computer-readable storage medium, wherein a sample prediction program is stored on the computer-readable storage medium, and when the sample prediction program is executed by a processor, the following steps are implemented:
    performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises multiple regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted.
  18. The computer-readable storage medium according to claim 17, wherein when the sample prediction program is executed by the processor, the following steps are further implemented:
    before the federated training, interactively encrypting the IDs of the sample data using a blind signature and the RSA encryption algorithm;
    comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using the intersection as the aligned training samples.
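The intersection step above can be illustrated as follows. This sketch substitutes a keyed hash for the blind-signature/RSA exchange of the claim (the real protocol additionally prevents either party from learning IDs outside the intersection); only the compare-encrypted-strings-and-intersect logic is shown, and all names are illustrative.

```python
import hashlib

def encrypt_ids(ids, shared_key):
    """Map each plaintext ID to a deterministic encrypted string
    (keyed SHA-256 here, standing in for the blind-RSA signature)."""
    return {hashlib.sha256((shared_key + i).encode()).hexdigest(): i
            for i in ids}

def aligned_samples(ids_a, ids_b, shared_key):
    """Compare the two parties' encrypted ID strings and return the
    intersection, which becomes the aligned training sample set."""
    enc_a = encrypt_ids(ids_a, shared_key)
    enc_b = encrypt_ids(ids_b, shared_key)
    return sorted(enc_a[c] for c in enc_a.keys() & enc_b.keys())
```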
  19. The computer-readable storage medium according to claim 18, wherein the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party.
  20. The computer-readable storage medium according to claim 19, wherein the performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model comprises:
    obtaining, at the second data party side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradients and second-order gradients and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party computes, based on the encrypted first-order gradients and second-order gradients, the gain value of the split node under each splitting mode for its local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order gradients and second-order gradients used in the first round of node splitting to compute the gain value of the split node under each splitting mode for its local training samples corresponding to the sample IDs;
    receiving, by the second data party, the encrypted gain values of all split nodes returned by the first data party, and decrypting them;
    computing, at the second data party side, based on the first-order gradients and second-order gradients, the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
    determining the global optimal split node of the current round of node splitting based on the gain values of all split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the global optimal split node of the current round of node splitting, and generating new nodes to construct a regression tree of the gradient boosting tree model.
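The gain value each party computes per candidate split is, in the standard XGBoost formulation, derived from sums of first-order and second-order gradients on each side of the split. A plaintext sketch (the patent evaluates the same quantity over encrypted gradients; `lam`, the L2 regularization term, is a conventional XGBoost parameter and not a value specified here):

```python
def split_gain(g, h, left_idx, lam=1.0):
    """Gain of splitting a node whose samples carry first-order gradients g
    and second-order gradients h, with left_idx marking the samples routed
    to the left child; the remaining samples go to the right child."""
    def score(gs, hs):
        # (sum of first-order gradients)^2 / (sum of second-order + lambda)
        return sum(gs) ** 2 / (sum(hs) + lam)
    left = sorted(left_idx)
    right = [i for i in range(len(g)) if i not in left_idx]
    gl, hl = [g[i] for i in left], [h[i] for i in left]
    gr, hr = [g[i] for i in right], [h[i] for i in right]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(g, h))
```

The split with the largest such gain across both parties' candidates is the global optimal split node of the round.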
PCT/CN2019/080297 2018-08-10 2019-03-29 Sample prediction method and device based on federated training, and storage medium WO2020029590A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810913869.3 2018-08-10
CN201810913869.3A CN109165683B (en) 2018-08-10 2018-08-10 Sample prediction method, device and storage medium based on federal training

Publications (1)

Publication Number Publication Date
WO2020029590A1 true WO2020029590A1 (en) 2020-02-13

Family

ID=64895662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080297 WO2020029590A1 (en) 2018-08-10 2019-03-29 Sample prediction method and device based on federated training, and storage medium

Country Status (2)

Country Link
CN (1) CN109165683B (en)
WO (1) WO2020029590A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
CN111414646A (en) * 2020-03-20 2020-07-14 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111444956A (en) * 2020-03-25 2020-07-24 平安科技(深圳)有限公司 Low-load information prediction method and device, computer system and readable storage medium
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111666576A (en) * 2020-04-29 2020-09-15 平安科技(深圳)有限公司 Data processing model generation method and device and data processing method and device
CN111814985A (en) * 2020-06-30 2020-10-23 平安科技(深圳)有限公司 Model training method under federated learning network and related equipment thereof
CN111882054A (en) * 2020-05-27 2020-11-03 杭州中奥科技有限公司 Method and related equipment for cross training of network data of encryption relationship between two parties
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111914277A (en) * 2020-08-07 2020-11-10 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
CN112231308A (en) * 2020-10-14 2021-01-15 深圳前海微众银行股份有限公司 Method, device, equipment and medium for removing weight of horizontal federal modeling sample data
CN112288094A (en) * 2020-10-09 2021-01-29 武汉大学 Federal network representation learning method and system
CN112381307A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN112651458A (en) * 2020-12-31 2021-04-13 深圳云天励飞技术股份有限公司 Method and device for training classification model, electronic equipment and storage medium
CN112749749A (en) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 Classification method and device based on classification decision tree model and electronic equipment
CN112836830A (en) * 2021-02-01 2021-05-25 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN113051239A (en) * 2021-03-26 2021-06-29 北京沃东天骏信息技术有限公司 Data sharing method, use method of model applying data sharing method and related equipment
CN113204443A (en) * 2021-06-03 2021-08-03 京东科技控股股份有限公司 Data processing method, equipment, medium and product based on federal learning framework
CN113392164A (en) * 2020-03-13 2021-09-14 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing longitudinal federated tree
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113657996A (en) * 2021-08-26 2021-11-16 深圳市洞见智慧科技有限公司 Method and device for determining feature contribution degree in federated learning and electronic equipment
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113722987A (en) * 2021-08-16 2021-11-30 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113722739A (en) * 2021-09-06 2021-11-30 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113807534A (en) * 2021-03-08 2021-12-17 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113807380A (en) * 2020-12-31 2021-12-17 京东科技信息技术有限公司 Method and device for training federated learning model and electronic equipment
CN113824546A (en) * 2020-06-19 2021-12-21 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN113824677A (en) * 2020-12-28 2021-12-21 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN114399000A (en) * 2022-01-20 2022-04-26 中国平安人寿保险股份有限公司 Object interpretability feature extraction method, device, equipment and medium of tree model
CN114677200A (en) * 2022-04-01 2022-06-28 重庆邮电大学 Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning
CN114882333A (en) * 2021-05-31 2022-08-09 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
US11914678B2 (en) 2020-09-23 2024-02-27 International Business Machines Corporation Input encoding for classifier generalization

Families Citing this family (34)

Publication number Priority date Publication date Assignee Title
CN109165683B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Sample prediction method, device and storage medium based on federal training
CN109670484B (en) * 2019-01-16 2022-03-25 电子科技大学 Mobile phone individual identification method based on bispectrum characteristics and lifting tree
CN112183759B (en) * 2019-07-04 2024-02-13 创新先进技术有限公司 Model training method, device and system
CN110443378B (en) * 2019-08-02 2023-11-03 深圳前海微众银行股份有限公司 Feature correlation analysis method and device in federal learning and readable storage medium
CN110674979A (en) * 2019-09-11 2020-01-10 腾讯科技(深圳)有限公司 Risk prediction model training method, prediction device, medium and equipment
CN110717671B (en) * 2019-10-08 2021-08-31 深圳前海微众银行股份有限公司 Method and device for determining contribution degree of participants
CN110795603B (en) * 2019-10-29 2021-02-19 支付宝(杭州)信息技术有限公司 Prediction method and device based on tree model
CN110796266B (en) * 2019-10-30 2021-06-15 深圳前海微众银行股份有限公司 Method, device and storage medium for implementing reinforcement learning based on public information
CN110851869B (en) * 2019-11-14 2023-09-19 深圳前海微众银行股份有限公司 Sensitive information processing method, device and readable storage medium
CN110944011B (en) * 2019-12-16 2021-12-07 支付宝(杭州)信息技术有限公司 Joint prediction method and system based on tree model
CN110968886B (en) * 2019-12-20 2022-12-02 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111242385A (en) * 2020-01-19 2020-06-05 苏宁云计算有限公司 Prediction method, device and system of gradient lifting tree model
CN111309848A (en) * 2020-01-19 2020-06-19 苏宁云计算有限公司 Generation method and system of gradient lifting tree model
CN111368901A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Multi-party combined modeling method, device and medium based on federal learning
CN113392101B (en) * 2020-03-13 2024-06-18 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing transverse federal tree
CN113554476B (en) * 2020-04-23 2024-04-19 京东科技控股股份有限公司 Training method and system of credit prediction model, electronic equipment and storage medium
CN111598186B (en) * 2020-06-05 2021-07-16 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111695697B (en) * 2020-06-12 2023-09-08 深圳前海微众银行股份有限公司 Multiparty joint decision tree construction method, equipment and readable storage medium
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN111915019B (en) * 2020-08-07 2023-06-20 平安科技(深圳)有限公司 Federal learning method, system, computer device, and storage medium
CN111967615B (en) * 2020-09-25 2024-05-28 北京百度网讯科技有限公司 Multi-model training method and device based on feature extraction, electronic equipment and medium
CN112199706B (en) * 2020-10-26 2022-11-22 支付宝(杭州)信息技术有限公司 Tree model training method and business prediction method based on multi-party safety calculation
CN112464287B (en) * 2020-12-12 2022-07-05 同济大学 Multi-party XGboost safety prediction model training method based on secret sharing and federal learning
CN112529101B (en) * 2020-12-24 2024-05-14 深圳前海微众银行股份有限公司 Classification model training method and device, electronic equipment and storage medium
CN113822311B (en) * 2020-12-31 2023-09-01 京东科技控股股份有限公司 Training method and device of federal learning model and electronic equipment
WO2022144001A1 (en) * 2020-12-31 2022-07-07 京东科技控股股份有限公司 Federated learning model training method and apparatus, and electronic device
CN113807544B (en) * 2020-12-31 2023-09-26 京东科技控股股份有限公司 Training method and device of federal learning model and electronic equipment
CN112597135A (en) * 2021-01-04 2021-04-02 天冕信息技术(深圳)有限公司 User classification method and device, electronic equipment and readable storage medium
CN112766514B (en) * 2021-01-22 2021-12-24 支付宝(杭州)信息技术有限公司 Method, system and device for joint training of machine learning model
CN112767129A (en) * 2021-01-22 2021-05-07 建信金融科技有限责任公司 Model training method, risk prediction method and device
CN113642669B (en) * 2021-08-30 2024-04-05 平安医疗健康管理股份有限公司 Feature analysis-based fraud prevention detection method, device, equipment and storage medium
CN113705727B (en) * 2021-09-16 2023-05-12 四川新网银行股份有限公司 Decision tree modeling method, prediction method, equipment and medium based on differential privacy
CN114580011B (en) * 2022-01-29 2024-06-14 国网青海省电力公司电力科学研究院 Electric power facility security situation sensing method and system based on federal privacy training
CN114362948B (en) * 2022-03-17 2022-07-12 蓝象智联(杭州)科技有限公司 Federated derived feature logistic regression modeling method

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
WO2018031597A1 (en) * 2016-08-08 2018-02-15 Google Llc Systems and methods for data aggregation based on one-time pad based sharing
CN107832581A (en) * 2017-12-15 2018-03-23 百度在线网络技术(北京)有限公司 Trend prediction method and device
CN107871160A (en) * 2016-09-26 2018-04-03 谷歌公司 Communicate efficient joint study
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample predictions method, apparatus and storage medium based on federation's training

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
CN101056166B (en) * 2007-05-28 2010-04-21 北京飞天诚信科技有限公司 A method for improving the data transmission security
CN104009842A (en) * 2014-05-15 2014-08-27 华南理工大学 Communication data encryption and decryption method based on DES encryption algorithm, RSA encryption algorithm and fragile digital watermarking
CN113435602A (en) * 2016-11-01 2021-09-24 第四范式(北京)技术有限公司 Method and system for determining feature importance of machine learning sample
CN107704966A (en) * 2017-10-17 2018-02-16 华南理工大学 A kind of Energy Load forecasting system and method based on weather big data
CN107767183A (en) * 2017-10-31 2018-03-06 常州大学 Brand loyalty method of testing based on combination learning and profile point
CN107993139A (en) * 2017-11-15 2018-05-04 华融融通(北京)科技有限公司 A kind of anti-fake system of consumer finance based on dynamic regulation database and method
CN108257105B (en) * 2018-01-29 2021-04-20 南华大学 Optical flow estimation and denoising joint learning depth network model for video image
TWM561279U (en) * 2018-02-12 2018-06-01 林俊良 Blockchain system and node server for processing strategy model scripts of financial assets
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning

Cited By (52)

Publication number Priority date Publication date Assignee Title
CN113392164B (en) * 2020-03-13 2024-01-12 京东城市(北京)数字科技有限公司 Method for constructing longitudinal federal tree, main server, service platform and system
CN113392164A (en) * 2020-03-13 2021-09-14 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing longitudinal federated tree
CN111414646B (en) * 2020-03-20 2024-03-29 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111414646A (en) * 2020-03-20 2020-07-14 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
CN111444956A (en) * 2020-03-25 2020-07-24 平安科技(深圳)有限公司 Low-load information prediction method and device, computer system and readable storage medium
CN111444956B (en) * 2020-03-25 2023-10-31 平安科技(深圳)有限公司 Low-load information prediction method, device, computer system and readable storage medium
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111666576A (en) * 2020-04-29 2020-09-15 平安科技(深圳)有限公司 Data processing model generation method and device and data processing method and device
CN111666576B (en) * 2020-04-29 2023-08-04 平安科技(深圳)有限公司 Data processing model generation method and device, and data processing method and device
CN111882054A (en) * 2020-05-27 2020-11-03 杭州中奥科技有限公司 Method and related equipment for cross training of network data of encryption relationship between two parties
CN111882054B (en) * 2020-05-27 2024-04-12 杭州中奥科技有限公司 Method for cross training of encryption relationship network data of two parties and related equipment
CN113824546A (en) * 2020-06-19 2021-12-21 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN113824546B (en) * 2020-06-19 2024-04-02 百度在线网络技术(北京)有限公司 Method and device for generating information
CN111814985B (en) * 2020-06-30 2023-08-29 平安科技(深圳)有限公司 Model training method under federal learning network and related equipment thereof
CN111814985A (en) * 2020-06-30 2020-10-23 平安科技(深圳)有限公司 Model training method under federated learning network and related equipment thereof
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111914277A (en) * 2020-08-07 2020-11-10 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
CN111914277B (en) * 2020-08-07 2023-09-01 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
US11914678B2 (en) 2020-09-23 2024-02-27 International Business Machines Corporation Input encoding for classifier generalization
CN112288094A (en) * 2020-10-09 2021-01-29 武汉大学 Federal network representation learning method and system
CN112288094B (en) * 2020-10-09 2022-05-17 武汉大学 Federal network representation learning method and system
CN112231308A (en) * 2020-10-14 2021-01-15 深圳前海微众银行股份有限公司 Method, device, equipment and medium for removing weight of horizontal federal modeling sample data
CN112231308B (en) * 2020-10-14 2024-05-03 深圳前海微众银行股份有限公司 Method, device, equipment and medium for de-duplication of transverse federal modeling sample data
CN112381307B (en) * 2020-11-20 2023-12-22 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN112381307A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN113824677A (en) * 2020-12-28 2021-12-21 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113824677B (en) * 2020-12-28 2023-09-05 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113807380B (en) * 2020-12-31 2023-09-01 京东科技信息技术有限公司 Training method and device of federal learning model and electronic equipment
CN113807380A (en) * 2020-12-31 2021-12-17 京东科技信息技术有限公司 Method and device for training federated learning model and electronic equipment
CN112651458A (en) * 2020-12-31 2021-04-13 深圳云天励飞技术股份有限公司 Method and device for training classification model, electronic equipment and storage medium
CN112651458B (en) * 2020-12-31 2024-04-02 深圳云天励飞技术股份有限公司 Classification model training method and device, electronic equipment and storage medium
CN112749749A (en) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 Classification method and device based on classification decision tree model and electronic equipment
CN112749749B (en) * 2021-01-14 2024-04-16 深圳前海微众银行股份有限公司 Classification decision tree model-based classification method and device and electronic equipment
CN112836830B (en) * 2021-02-01 2022-05-06 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN112836830A (en) * 2021-02-01 2021-05-25 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN113807534B (en) * 2021-03-08 2023-09-01 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113807534A (en) * 2021-03-08 2021-12-17 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113051239A (en) * 2021-03-26 2021-06-29 北京沃东天骏信息技术有限公司 Data sharing method, use method of model applying data sharing method and related equipment
CN114882333A (en) * 2021-05-31 2022-08-09 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN113204443B (en) * 2021-06-03 2024-04-16 京东科技控股股份有限公司 Data processing method, device, medium and product based on federal learning framework
CN113204443A (en) * 2021-06-03 2021-08-03 京东科技控股股份有限公司 Data processing method, equipment, medium and product based on federal learning framework
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113722987B (en) * 2021-08-16 2023-11-03 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113722987A (en) * 2021-08-16 2021-11-30 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113657996A (en) * 2021-08-26 2021-11-16 深圳市洞见智慧科技有限公司 Method and device for determining feature contribution degree in federated learning and electronic equipment
CN113722739B (en) * 2021-09-06 2024-04-09 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739A (en) * 2021-09-06 2021-11-30 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114399000A (en) * 2022-01-20 2022-04-26 中国平安人寿保险股份有限公司 Object interpretability feature extraction method, device, equipment and medium of tree model
CN114677200A (en) * 2022-04-01 2022-06-28 重庆邮电大学 Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning

Also Published As

Publication number Publication date
CN109165683A (en) 2019-01-08
CN109165683B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
WO2020029590A1 (en) Sample prediction method and device based on federated training, and storage medium
CN109034398B (en) Gradient lifting tree model construction method and device based on federal training and storage medium
US11985037B2 (en) Systems and methods for conducting more reliable assessments with connectivity statistics
US11665072B2 (en) Parallel computational framework and application server for determining path connectivity
US11968105B2 (en) Systems and methods for social graph data analytics to determine connectivity within a community
US11032585B2 (en) Real-time synthetically generated video from still frames
WO2020119272A1 (en) Risk identification model training method and apparatus, and server
US9875277B1 (en) Joining database tables
WO2022142001A1 (en) Target object evaluation method based on multi-score card fusion, and related device therefor
CN104077723A (en) Social network recommending system and social network recommending method
US10742627B2 (en) System and method for dynamic network data validation
CN110851485B (en) Social relation mining method and device, computer equipment and readable medium
CN113688252A (en) Safe cross-domain recommendation method based on multi-feature collaborative knowledge map and block chain
CN112101577A (en) XGboost-based cross-sample federal learning and testing method, system, device and medium
CN112560105B (en) Joint modeling method and device for protecting multi-party data privacy
CN114139202A (en) Privacy protection sample prediction application method and system based on federal learning
WO2021135540A1 (en) Neo4j-based anomalous user processing method and apparatus, computer device, and medium
CN107194280B (en) Model establishing method and device
US11853400B2 (en) Distributed machine learning engine
CN112529102A (en) Feature expansion method, device, medium, and computer program product
CN110175283B (en) Recommendation model generation method and device
CN117033997A (en) Data segmentation method, device, electronic equipment and medium
Li et al. Incentive and knowledge distillation based federated learning for cross-silo applications
CN116702899B (en) Entity fusion method suitable for public and private linkage scene
US11847246B1 (en) Token based communications for machine learning systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19848348

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19848348

Country of ref document: EP

Kind code of ref document: A1