WO2020029590A1 - Sample prediction method and device based on federated training, and storage medium - Google Patents


Info

Publication number
WO2020029590A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
node
training
round
data
Prior art date
Application number
PCT/CN2019/080297
Other languages
French (fr)
Chinese (zh)
Inventor
范涛
成柯葳
马国强
刘洋
陈天健
杨强
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 filed Critical 深圳前海微众银行股份有限公司
Publication of WO2020029590A1 publication Critical patent/WO2020029590A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers

Definitions

  • the invention relates to the technical field of machine learning, and in particular to a sample prediction method, device, and computer-readable storage medium based on federated training.
  • traditionally, one party independently trains a model on its own sample data, that is, unilateral modeling; based on the established mathematical model, the relatively important features in the sample feature set can then be determined.
  • for example, users may have both consumption behavior and lending behavior: the consumption behavior data is generated at a consumer service provider, while the loan behavior data is generated at a financial service provider.
  • if the financial service provider needs to predict a user's lending behavior from consumption-behavior features, it must use the consumer service provider's consumption behavior data together with its own lending behavior data and perform machine learning jointly to build a prediction model.
  • the main purpose of the present invention is to provide a sample prediction method, device, and computer-readable storage medium based on federated training, which aims to solve the problem that the prior art cannot jointly train on the sample data of different data providers, and thus cannot achieve modeling with the mutual participation of both parties.
  • the present invention provides a sample prediction method based on federated training.
  • the sample prediction method based on federated training includes the following steps:
  • the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model, wherein the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples;
  • joint prediction is performed on the samples to be predicted to determine a sample category of the samples to be predicted or to obtain a prediction score of the samples to be predicted.
  • the federated training-based sample prediction method includes:
  • the encrypted ID strings of both parties are compared to identify the intersection of the two parties' samples, and the intersection is used as the aligned training samples.
  • the two aligned training samples are a first training sample and a second training sample, respectively;
  • the first training sample attribute includes a sample ID and some sample features
  • the second training sample attribute includes a sample ID, another part of sample features, and a data label
  • the first training sample is provided by the first data party and stored locally on the first data party
  • the second training sample is provided by the second data party and stored locally on the second data party.
  • using the XGBoost algorithm to perform federated training on two aligned training sample sets to construct a gradient boosting tree model includes:
  • if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set are sent to the first data party, so that the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • the sample set corresponding to the current node is split to generate new nodes to build the regression tree of the gradient boosted tree model.
  • before the step of obtaining, on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting, the method further includes:
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to this round of node splitting are initialized; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, the first-order and second-order gradients used in the first round of node splitting are reused;
  • if the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, the first-order and second-order gradients are updated according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, the same first-order and second-order gradients used in the first round of node splitting of that regression tree are reused.
  • the federated training-based sample prediction method further includes:
  • the federated training-based sample prediction method further includes:
  • the related information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • counting the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model includes:
  • each global best split node is used as a split node of a regression tree in the gradient boosting tree model, and the average gain of the split nodes corresponding to the same feature code is counted.
  • performing joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtaining a prediction score of the samples to be predicted includes:
  • a query request is initiated to the first data party, so that the first data party compares the feature data of the local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
  • the sample category of the sample to be predicted is determined based on the data label of the samples corresponding to the node to which the sample to be predicted belongs, or the prediction score of the sample to be predicted is obtained based on the weight value of the node to which it belongs.
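The traversal protocol described above can be sketched as follows. The class names (Node, PartyA), the tree layout, and the concrete feature values are illustrative assumptions, not part of the patent; the key point is that each internal node is owned by one party, and the owning party alone compares its local feature value against the node's attribute value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    owner: str = ""                 # "A" or "B": which party holds the split feature
    feature: str = ""               # feature code, meaningful only to its owner
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: float = 0.0             # leaf weight (prediction score)

class PartyA:
    """Party A holds its share of the sample's features and answers
    traversal queries without revealing raw feature values to Party B."""
    def __init__(self, features):
        self.features = features

    def next_branch(self, feature, threshold):
        # Party A compares its local feature value with the node's
        # attribute value and only reveals which branch to take.
        return "left" if self.features[feature] <= threshold else "right"

def joint_predict(tree, party_a, features_b):
    """Party B traverses the tree; for nodes owned by Party A it issues
    a query and receives only the next branch to follow."""
    node = tree
    while node.left is not None:    # internal node
        if node.owner == "B":
            branch = "left" if features_b[node.feature] <= node.threshold else "right"
        else:
            branch = party_a.next_branch(node.feature, node.threshold)
        node = node.left if branch == "left" else node.right
    return node.weight              # leaf weight = prediction score

# Hypothetical two-level tree: root owned by B, children owned by A.
tree = Node("B", "Bill", 3102,
            Node("A", "Amount", 200, Node(weight=-0.5), Node(weight=0.3)),
            Node("A", "Age", 35, Node(weight=0.1), Node(weight=0.8)))
print(joint_predict(tree, PartyA({"Amount": 150, "Age": 40}), {"Bill": 3000}))  # -0.5
```

The sample category can then be read off the leaf's majority label, or the leaf weight can be accumulated across trees as a prediction score.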
  • the present invention also provides a sample prediction device based on federated training.
  • the sample prediction device based on federated training includes a memory, a processor, and a sample prediction program stored in the memory and executable on the processor.
  • the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a sample prediction program which, when executed by a processor, implements the steps of the federated training-based sample prediction method described in any of the above.
  • the present invention uses the XGBoost algorithm to perform federated training on two aligned training sample sets to build a gradient boosting tree model.
  • the gradient boosting tree model is a set of regression trees: it includes multiple regression trees, and each split node of a regression tree corresponds to a feature of the training samples.
  • joint prediction is performed to determine the sample category of the sample to be predicted or to obtain the prediction score of the sample to be predicted.
  • the invention realizes federated modeling using training samples of different data parties, and can then predict samples whose data features are distributed across multiple parties.
  • FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the federated training-based sample prediction device of the present invention;
  • FIG. 2 is a schematic flowchart of an embodiment of the federated training-based sample prediction method of the present invention;
  • FIG. 3 is a schematic flowchart of sample alignment in an embodiment of the federated training-based sample prediction method of the present invention;
  • FIG. 4 is a detailed flowchart of an embodiment of step S10 in FIG. 2;
  • FIG. 5 is a schematic diagram of training results in an embodiment of the federated training-based sample prediction method of the present invention.
  • the invention provides a sample prediction device based on federated training.
  • FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the federated training-based sample prediction device of the present invention.
  • the federated training-based sample prediction device of the present invention may be a personal computer, or a device with computing capability such as a server.
  • the federated training-based sample prediction device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory (for example, a magnetic disk memory).
  • the memory 1005 may optionally be a storage device independent of the foregoing processor 1001.
  • the structure of the federated training-based sample prediction device shown in FIG. 1 does not constitute a limitation on the device; it may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a sample prediction program.
  • the network interface 1004 is mainly used to connect to a background server and perform data communication with the background server;
  • the user interface 1003 is mainly used to connect to a client (user) and perform data communication with the client;
  • and the processor 1001 may be used to call the sample prediction program stored in the memory 1005 and perform the following operations:
  • the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model, wherein the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples;
  • joint prediction is performed on the samples to be predicted to determine a sample category of the samples to be predicted or to obtain a prediction score of the samples to be predicted.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the encrypted ID strings of both parties are compared to identify the intersection of the two parties' samples, and the intersection is used as the aligned training samples.
  • the two aligned training samples are a first training sample and a second training sample, respectively;
  • the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
  • the first training sample is provided by the first data party and stored locally on the first data party, and the second training sample is provided by the second data party and stored locally on the second data party;
  • the processor 1001 calls the sample prediction program stored in the memory 1005 and further performs the following operations:
  • if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set are sent to the first data party, so that the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • the sample set corresponding to the current node is split to generate new nodes to build the regression tree of the gradient boosted tree model.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to this round of node splitting are initialized; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, the first-order and second-order gradients used in the first round of node splitting are reused;
  • if the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, the first-order and second-order gradients are updated according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, the same first-order and second-order gradients used in the first round of node splitting of that regression tree are reused.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • the related information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
  • a query request is initiated to the first data party, so that the first data party compares the feature data of the local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
  • the sample category of the sample to be predicted is determined based on the data label of the samples corresponding to the node to which the sample to be predicted belongs, or the prediction score of the sample to be predicted is obtained based on the weight value of the node to which it belongs.
  • FIG. 2 is a schematic flowchart of an embodiment of the federated training-based sample prediction method according to the present invention.
  • the federated training-based sample prediction method includes the following steps:
  • step S10: the XGBoost algorithm is used to perform federated training on two aligned training sample sets to construct a gradient boosting tree model.
  • the gradient boosting tree model includes multiple regression trees, and a split node of a regression tree corresponds to a feature of the training samples.
  • the XGBoost (eXtreme Gradient Boosting) algorithm is an improvement, based on the Boosting approach, of the GBDT (Gradient Boosting Decision Tree) algorithm.
  • the internal decision tree uses a regression tree.
  • the output of the algorithm is a collection of regression trees.
  • the basic idea of training is to traverse all segmentation methods of all features of the training samples (that is, all candidate node splits), select the segmentation with the least loss, obtain two leaves (that is, split the node to generate new nodes), and then continue traversing until a stop condition is reached.
  • the training samples used by the XGBoost algorithm here are two independent training sample sets, each belonging to a different data party. If the two sets are regarded as one overall training sample set, then, because they belong to different data parties, the overall set can be seen as partitioned such that the two parties hold different features of the same samples (feature-wise sample splitting).
  • federated training means that the sample training process is completed through the cooperation of the two data parties.
  • the regression trees contained in the finally trained gradient boosting tree model have split nodes corresponding to features from both parties' training samples.
  • the gain of a split node can be used as a basis for evaluating feature importance: the larger the gain of a split node, the smaller the segmentation loss at that node, and the greater the importance of the feature corresponding to that split node.
  • because the trained gradient boosting tree model includes multiple regression trees, and different regression trees may use the same feature for node segmentation, it is necessary to count, over all regression trees included in the gradient boosting tree model, the average gain of the split nodes corresponding to each feature, and to use the average gain as the score of the corresponding feature.
  • Step S20 Based on the gradient boosting tree model, perform joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtain a prediction score of the samples to be predicted.
  • the gradient boosting tree model trained using the XGBoost algorithm enables joint prediction of samples to be predicted, thereby classifying or scoring the prediction samples.
  • This embodiment uses the XGBoost algorithm to perform federated training on two aligned training sample sets to build a gradient boosting tree model.
  • the gradient boosting tree model is a set of regression trees: it includes multiple regression trees, and each split node of a regression tree corresponds to a feature of the training samples.
  • joint prediction is performed on the samples to be predicted to determine the sample category of the sample to be predicted or to obtain the prediction score of the sample to be predicted.
  • the invention realizes federated modeling using training samples of different data parties, and can then predict samples whose data features are distributed across multiple parties.
  • the two data parties perform sample alignment processing on both sides before performing federated modeling.
  • the specific processing flow is shown in FIG. 3.
  • sample alignment between the two parties uses a blind-signature scheme together with the RSA encryption algorithm to interactively encrypt the sample IDs.
  • the encrypted ID strings are compared to identify the intersection and non-intersection parts of the two parties' samples (the non-intersection part remains private and invisible to the other party).
  • the present invention needs to encrypt the sample data during the sample alignment process.
  • the sample IDs of data party A are denoted X_A: {u1, u2, u3, u4}
  • the sample IDs of data party B are denoted X_B: {u1, u2, u3, u5}
  • the blind signature of data x is denoted E(x)
  • the RSA key generated by party B is (n, e, d)
  • the RSA public key obtained by party A is (n, e).
  • Party A compares D_A and Z_B; if two encrypted strings are equal, the corresponding IDs in X_A and X_B are equal.
  • the equal IDs form the intersection of the samples ({u1, u2, u3}) and are retained; the unequal parts ({u4, u5}), being in encrypted form, are not visible to either party and can be discarded.
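The interactive encryption scheme above can be sketched as follows. The toy RSA parameters, the fixed random seed, and the direct comparison of signatures are simplifying assumptions (a practical scheme uses large keys, secure randomness, and hashes the signed values again before exchange); the variable names D_A and Z_B follow the text.

```python
import hashlib
import random
from math import gcd

p, q = 1000003, 1000033             # toy primes; illustrative only
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))   # party B's private exponent

def h(x: str) -> int:
    """Hash a sample ID into the RSA group."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % n

ids_a = ["u1", "u2", "u3", "u4"]    # party A's sample IDs (X_A)
ids_b = ["u1", "u2", "u3", "u5"]    # party B's sample IDs (X_B)

rng = random.Random(0)
blinds, blinded = {}, []
for u in ids_a:                     # party A blinds each hashed ID
    while True:
        r = rng.randrange(2, n)
        if gcd(r, n) == 1:
            break
    blinds[u] = r
    blinded.append((h(u) * pow(r, e, n)) % n)

signed = [pow(m, d, n) for m in blinded]   # party B signs blindly with d

# Party A unblinds: (h(u)^d * r) * r^{-1} = h(u)^d mod n
D_A = {u: (s * pow(blinds[u], -1, n)) % n for u, s in zip(ids_a, signed)}

Z_B = {pow(h(u), d, n) for u in ids_b}     # party B signs its own IDs

intersection = sorted(u for u, sig in D_A.items() if sig in Z_B)
print(intersection)  # ['u1', 'u2', 'u3']
```

Party B never sees A's raw IDs (only blinded values), and the non-intersecting IDs u4 and u5 remain hidden behind the signatures.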
  • this embodiment specifically uses two independent training samples for illustration.
  • the first data party provides a first training sample
  • the attributes of the first training sample include a sample ID and some sample features
  • the second data party provides a second training sample
  • the second training sample attributes include a sample ID, another part of the sample features, and a data label.
  • the sample characteristics refer to the characteristics exhibited or possessed by the sample. For example, if the sample is a person, the corresponding sample characteristics may be age, gender, income, education, etc. Data labels are used to classify multiple different samples. The classification results are determined based on the characteristics of the samples.
  • the main significance of the federated training modeling of the present invention is to achieve two-way privacy protection of both parties' sample data; therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample is stored locally at the second data party.
  • the data in Table 1 below is provided by the first data party and stored locally at the first data party.
  • the data in Table 2 is provided by the second data party and stored locally.
  • the first training sample attributes include a sample ID (X1 to X5), an Age feature, a Gender feature, and an Amount of credit feature.
  • the second training sample attributes include a sample ID (X1 to X5), a Bill feature, an Education feature, and a data label Label.
  • FIG. 4 is a schematic diagram of a detailed process of an embodiment of step S10 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S10 specifically includes:
  • step S101: on the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting are obtained;
  • the XGBoost algorithm is a machine learning modeling method. It uses a classifier (that is, a classification function) to map sample data to a category, so that it can be applied to data prediction. In the process of learning classification rules with the classifier, a loss function is needed to measure the size of the fitting error.
  • the gradient boosting tree model requires multiple rounds of federated training.
  • each round of federated training generates one regression tree, and generating a regression tree requires multiple node splits.
  • within one round of federated training, the first node split uses the initially saved training samples, and each subsequent node split uses the training samples corresponding to the new nodes generated by the previous split; every node split in the same round follows the same first-order and second-order gradients used in the first split of that round. The next round of federated training uses the results of the previous round to update the first-order and second-order gradients.
  • the XGBoost algorithm supports a custom loss function.
  • the custom loss function is used to obtain the first-order and second-order partial derivatives of the objective function, which correspond to the first-order and second-order gradients of the local sample data to be trained.
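As one concrete example of such a custom loss (an assumption here; the patent does not fix a particular loss function), the logistic loss for binary classification yields the following first-order and second-order derivatives with respect to the raw prediction:

```python
import math

def logistic_grad_hess(y_true, raw_pred):
    """First- and second-order partial derivatives of the logistic loss
    with respect to the raw (pre-sigmoid) prediction, as in XGBoost."""
    p = 1.0 / (1.0 + math.exp(-raw_pred))  # sigmoid of the raw score
    g = p - y_true                          # first-order gradient g_i
    h = p * (1.0 - p)                       # second-order gradient h_i
    return g, h

# At initialization the raw prediction is typically 0 for every sample:
print(logistic_grad_hess(1, 0.0))  # (-0.5, 0.25)
```

These g_i and h_i values are exactly the per-sample quantities that the second data party initializes, encrypts, and later updates between boosting rounds.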
  • a split node needs to be determined, and the split node can be selected according to its gain.
  • the formula for calculating the gain is as follows:
  • Gain = 1/2 · [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ
  • where I_L denotes the sample set contained in the left child node after the current node split, I_R denotes the sample set contained in the right child node, I = I_L ∪ I_R is the sample set of the current node, g_i denotes the first-order gradient of sample i, h_i denotes the second-order gradient of sample i, and λ and γ are constants.
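Under the notation above, the gain of a candidate split can be computed from the gradient sums of the two child sample sets. This minimal sketch assumes λ = 1 and γ = 0 as default constants (the patent leaves the constants unspecified):

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following the formula above.
    g_*/h_* are the sums of first-/second-order gradients over the
    left and right child sample sets; lam and gamma are the constants."""
    def score(g, h):
        # Structure score (G^2 / (H + lambda)) of one node
        return g * g / (h + lam)

    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Example gradient sums: left child (G=-2, H=1), right child (G=3, H=2)
print(split_gain(-2.0, 1.0, 3.0, 2.0))  # 2.375
```

The split with the largest gain over all candidate features and thresholds becomes the best split node for the current round.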
  • since the aligned samples of both parties share the same gradients, and since the data labels exist only in the second data party's sample data, the first-order and second-order gradients computed from the second data party's sample data are used to calculate the gain of a split node for each sample division under each splitting method.
  • step S102: if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain of a split node for its local training samples corresponding to the sample IDs under each splitting method;
  • when the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients of the sample data are calculated on the second data party and encrypted before being sent to the first data party.
  • the gain of a split node for the first data party's local sample data under each splitting method is then calculated on the first data party; since the gain is computed from the encrypted first-order and second-order gradients, the calculated gain is itself an encrypted value, and there is no need to encrypt it again.
  • the new node can be split to generate a regression tree.
  • the second data party, which holds the sample data with the data labels, dominates the construction of the regression trees of the gradient boosting tree model; therefore, the gains of the first data party's local split candidates under each splitting method, calculated on the first data party, need to be sent back to the second data party.
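The encryption of the gradients can be realized with an additively homomorphic scheme such as Paillier. The patent does not name a specific cryptosystem, so this is a sketch under that assumption; the small key, the fixed seed, and the integer-encoded gradient values are toy simplifications (real systems use large keys, secure randomness, and fixed-point encoding for real-valued gradients).

```python
import math
import random

def keygen(p=1000003, q=1000033):
    """Toy Paillier key pair; p, q are illustrative small primes."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)
    return (n,), (n, lam, mu)            # public key, private key

def encrypt(pub, m, rng):
    (n,) = pub
    n2 = n * n
    while True:                          # random factor coprime to n
        r = rng.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    # c = (1 + n)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

pub, priv = keygen()
rng = random.Random(0)
c_g = encrypt(pub, 12, rng)              # an integer-encoded gradient sum
c_h = encrypt(pub, 30, rng)
# Additive homomorphism: E(a) * E(b) mod n^2 = E(a + b), so the first data
# party can aggregate encrypted gradients without learning their values.
c_sum = (c_g * c_h) % (pub[0] ** 2)
print(decrypt(priv, c_sum))  # 42
```

This property is what lets the first data party compute encrypted gains from encrypted gradient sums, which only the second data party can decrypt.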
  • step S103: if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, the sample IDs of the sample set are sent to the first data party.
  • when the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the sample set corresponding to this round are sent to the first data party, and the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain of a split node for its local training samples corresponding to the received sample IDs under each splitting method.
  • step S104: the second data party receives the encrypted gains of all split nodes returned by the first data party and decrypts them;
  • step S105: on the second data party, based on the first-order and second-order gradients, the gain of a split node for the local training samples corresponding to the sample IDs under each splitting method is calculated;
  • step S106: determine the global best split node for the current round of node splitting based on the gains of all split nodes calculated by both parties;
  • the gains of all split nodes calculated by the two parties can be regarded as the gains of splitting the two parties' overall data samples under each splitting method; therefore, by comparing the magnitudes of these gains, the split node with the largest gain is taken as the global best split node for the current round of node splitting.
  • the sample feature corresponding to the global best split node may belong either to the training samples of the first data party or to the training samples of the second data party.
  • the relevant information includes: the provider corresponding to the sample data, the feature code corresponding to the sample data, and the gain.
  • if the split node comes from party A, the record is (Site A, E_A(f_i), gain).
  • if the split node comes from party B, the record is (Site B, E_B(f_i), gain).
  • E_A(f_i) denotes party A's encoding of feature f_i, and E_B(f_i) denotes party B's encoding of feature f_i; marking a feature by its code allows it to be identified without revealing the original feature data.
  • each global best split node is used as a split node of a regression tree in the gradient boosting tree model, and the average gain of the split nodes corresponding to the same feature code is counted.
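The statistic above can be sketched as follows; the record format mirrors the (Site, E(f_i), gain) triples kept by the second data party, and the gain values are hypothetical:

```python
from collections import defaultdict

def average_gains(split_records):
    """Average gain over all split nodes sharing the same feature code."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for provider, code, gain in split_records:
        sums[(provider, code)] += gain
        counts[(provider, code)] += 1
    return {key: sums[key] / counts[key] for key in sums}

records = [("Site A", "E_A(f1)", 0.8),   # hypothetical gains
           ("Site A", "E_A(f1)", 0.4),
           ("Site B", "E_B(f2)", 0.6)]
scores = average_gains(records)
print(scores[("Site A", "E_A(f1)")])
```

The resulting per-feature-code averages serve as the feature scores, without either party having to reveal what its feature codes stand for.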
  • step S107: based on the global best split node of the current round of node splitting, the sample set corresponding to the current node is split to generate new nodes and build the regression tree of the gradient boosting tree model.
  • the sample data corresponding to the current node split of the current round belongs to the first data side.
  • the sample data corresponding to the current node split of the current round belongs to the second data side.
  • new nodes (left and right child nodes) can be generated to build a regression tree.
  • new nodes can be continuously generated, and a deeper regression tree can be obtained. If node splitting is stopped, a regression tree of the gradient boosted tree model can be obtained.
  • the first-order and second-order gradients of the training samples used for node splitting are obtained in the following manner:
  • the first round of node split corresponds to the construction of the first regression tree
  • if the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting are initialized;
  • a depth threshold of the regression tree is preset to limit node splitting.
  • the node splitting is stopped, and a regression tree of the gradient boosted tree model is obtained, otherwise the next round of node splitting is continued.
  • the condition restricting node splitting may also be to stop when a node cannot be split further, for example when the samples corresponding to the current node cannot be divided any further.
  • a threshold value for the number of regression trees is preset to limit the number of regression trees generated.
  • the condition limiting the number of generated regression trees may also be to stop building regression trees when nodes can no longer be split.
  • the Age feature in Table 1 has 5 sample data division methods, the Gender feature has 2, and the Amount of credit feature has 5; therefore the sample data in Table 1 has a total of 12 division methods, that is, the gain of the split node corresponding to each of the 12 division methods needs to be calculated.
  • since the Bill Payment feature in Table 2 has 5 candidate sample-data divisions and the Education feature has 3, the sample data in Table 2 has 8 divisions in total; that is, the split-node gain value needs to be calculated for each of the 8 division methods.
  • this feature is used as the split node (the corresponding samples are X1, X2, X3, X4, X5), and two new child nodes are generated: the left node corresponds to the sample set (X1, X5) with values less than or equal to 3102, and the right node corresponds to the sample set (X2, X3, X4) with values greater than 3102. Taking (X1, X5) and (X2, X3, X4) as the new sample sets, the second and third rounds of node splitting continue, splitting these two new nodes and generating further nodes.
  • the sample gradient values used in the first round of node splitting continue to be used. Assuming the feature corresponding to one split node of this round is Amount of credit less than or equal to 200, this feature is used as the split node (the corresponding samples are X1 and X5) to generate two new child nodes: the left node corresponds to sample X5 (less than or equal to 200) and the right node to sample X1 (greater than 200). Similarly, if the feature corresponding to the other split node of this round is Age less than or equal to 35, this feature is used as the split node (the corresponding samples are X2, X3, X4) to generate two new child nodes: the left node corresponds to samples X2 and X3 (less than or equal to 35) and the right node to sample X4 (greater than 35).
  • the specific implementation process refers to the first round of federated training.
  • the second round of federated training trains the second regression tree.
  • based on the results of the previous round of federated training, the first-order and second-order gradients used in that round are updated, and the second round of federated training continues node splitting to generate new nodes that construct the next regression tree.
  • the specific implementation process refers to the construction process of the previous regression tree.
  • the sample data in Tables 1 and 2 in the above embodiment produced two regression trees after two rounds of federated training.
  • the first regression tree includes three split nodes, which are: Bill Payment is less than or equal to 3102, Amount of credit is less than or equal to 200, Age is less than or equal to 35;
  • the average gain of Bill Payment is (gain1 + gain4) / 2; Education is 0; Age is gain3; Gender is gain5; Amount of credit is gain2.
  • the specific implementation process of performing joint prediction on the prediction samples includes:
  • the next traversal node is determined by comparing the data point of the local to-be-predicted sample with the attribute value of the current traversal node;
  • the split-node records of the regression trees are stored on the second data side; therefore, in this embodiment, the second data side takes the lead in completing the joint prediction of the samples to be predicted, specifically by traversing the regression trees of the gradient boosting tree model.
  • the regression trees corresponding to the model determine the node to which the sample to be predicted belongs, which is determined by comparing the data points of the sample to be predicted with the attribute value of the currently traversed node.
  • the sample category of the sample to be predicted can be determined based on the data label of the training samples corresponding to the node to which it belongs, or the prediction score of the sample to be predicted can be obtained based on the weight value of that node.
  • the invention also provides a computer-readable storage medium.
  • the computer-readable storage medium of the present invention stores a sample prediction program, and when the sample prediction program is executed by a processor, it implements the steps of the federated training-based sample prediction method described in any one of the foregoing embodiments.
  • the methods in the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present invention.
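The division counts discussed above (12 for Table 1, 8 for Table 2) follow from allowing one candidate division per distinct feature value. A minimal sketch; the feature values below are illustrative placeholders, since Tables 1 and 2 themselves are not reproduced in this excerpt:

```python
def count_divisions(columns):
    """Number of candidate split divisions = sum of distinct values per feature.

    columns: dict mapping feature name -> list of sample values
    (illustrative data; the actual Tables 1 and 2 are not shown here).
    """
    return sum(len(set(vals)) for vals in columns.values())

# Table 1 pattern: Age has 5 distinct values, Gender 2, Amount of credit 5 -> 12.
table1 = {"Age": [21, 35, 47, 52, 60],
          "Gender": ["M", "F", "M", "F", "M"],
          "Amount of credit": [100, 200, 300, 400, 500]}
assert count_divisions(table1) == 12

# Table 2 pattern: Bill Payment has 5 distinct values, Education 3 -> 8.
table2 = {"Bill Payment": [1000, 2000, 3102, 4000, 5000],
          "Education": ["HS", "BS", "MS", "BS", "HS"]}
assert count_divisions(table2) == 8
```

Each of these candidate divisions then has its split-node gain evaluated, and the best one becomes the global best split node for the round.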


Abstract

A sample prediction method based on federated training, comprising the following steps: performing federated training on two aligned training samples by using an XGboost algorithm to construct a gradient boosting tree model (S10), wherein the gradient boosting tree model comprises a plurality of regression trees, and a split node of each regression tree corresponds to a feature of each training sample; and performing joint prediction on a sample to be predicted on the basis of the gradient boosting tree model, to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted (S20). According to the method, federated training-based modeling is implemented by using training samples of different data parties, and thus sample prediction is implemented on the basis of the established model.

Description

Sample prediction method, device and storage medium based on federated training
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 10, 2018, with application number 201810913869.3 and invention title "Sample Prediction Method, Device, and Storage Medium Based on Federated Training", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of machine learning, and in particular to a sample prediction method, device, and computer-readable storage medium based on federated training.
Background
In the current information age, certain human behaviors, such as consumption behavior, can be expressed through data, which has given rise to big data analysis. Machine learning can be used to build corresponding behavior analysis models, which can then classify people's behaviors or make predictions based on users' behavioral features.
In existing machine learning technology, one party usually trains on the sample data independently, that is, unilateral modeling. Based on the established model, the relatively important features in the sample feature set can be determined. However, in many cross-domain big data analysis scenarios, a user may have both consumption behavior and borrowing behavior: the user's consumption behavior data is generated at the consumer service provider, while the user's borrowing behavior data is generated at the financial service provider. If the financial service provider needs to predict the user's borrowing behavior based on the user's consumption behavior features, it needs to use the consumer service provider's consumption behavior data together with its own borrowing behavior data to build a prediction model through machine learning.
Therefore, for the above application scenarios, a new modeling approach is needed to realize joint training on the sample data of different data providers, so that both parties can jointly participate in modeling.
Summary of the invention
The main purpose of the present invention is to provide a sample prediction method, device, and computer-readable storage medium based on federated training, aiming to solve the technical problem that the prior art cannot realize joint training on the sample data of different data providers and therefore cannot allow both parties to jointly participate in modeling and sample prediction.
To achieve the above objective, the present invention provides a sample prediction method based on federated training, which includes the following steps:
performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, where the gradient boosting tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, to determine the sample category of the sample to be predicted or to obtain a prediction score for the sample to be predicted.
Optionally, the federated training-based sample prediction method includes:
before federated training, using a blind signature and the RSA encryption algorithm to interactively encrypt the IDs of the sample data;
comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using that intersection as the aligned training samples.
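The ID alignment step above can be sketched as an RSA blind-signature private set intersection. The following is a simplified illustration only: the toy key size, hash construction, and message flow are assumptions, not the patent's exact protocol.

```python
import hashlib
import math
import random

def h_int(s, n):
    """Hash an ID string to an integer modulo n (demo full-domain hash)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big") % n

def h_out(x):
    """Second hash applied to the RSA signature to get a comparable key."""
    return hashlib.sha256(str(x).encode()).hexdigest()

# Demo RSA key for party B (tiny primes, for illustration only).
p, q = 1000003, 1000033
n = p * q
e = 65537
d = pow(e, -1, (p - 1) * (q - 1))

a_ids = ["u1", "u2", "u3", "u5"]   # party A's sample IDs
b_ids = ["u2", "u3", "u4", "u5"]   # party B's sample IDs

# Party A blinds each hashed ID with a random factor r^e before sending,
# so party B never sees the raw hashed IDs.
blinded, blinds = [], []
for sid in a_ids:
    r = random.randrange(2, n - 1)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n - 1)
    blinds.append(r)
    blinded.append((h_int(sid, n) * pow(r, e, n)) % n)

# Party B signs the blinded values with its private exponent d (blind signature).
signed = [pow(y, d, n) for y in blinded]

# Party A unblinds: multiplying by r^-1 leaves H(id)^d mod n, then hashes it.
a_keys = {h_out((s * pow(r, -1, n)) % n): sid
          for s, r, sid in zip(signed, blinds, a_ids)}

# Party B computes the same keys for its own IDs directly.
b_keys = {h_out(pow(h_int(sid, n), d, n)) for sid in b_ids}

# The key intersection reveals only the common IDs: the aligned samples.
aligned = sorted(sid for k, sid in a_keys.items() if k in b_keys)
```

Here `aligned` contains the IDs common to both parties (u2, u3, u5), which serve as the aligned training samples; IDs held by only one party are never revealed to the other in the clear.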
Optionally, the two aligned training samples are a first training sample and a second training sample, respectively;
the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party.
Optionally, performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model includes:
on the second data party side, obtaining the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them together with the sample IDs of the sample set to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party can reuse the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
the second data party receiving the encrypted gain values of all split nodes returned by the first data party and decrypting them;
on the second data party side, based on the first-order and second-order gradients, calculating the gain value of the split node under each possible division of the local training samples corresponding to the sample IDs;
determining the global best split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
based on the global best split node of the current round, splitting the sample set corresponding to the current node and generating new nodes to construct a regression tree of the gradient boosting tree model.
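The gain values exchanged in the steps above can be computed with the standard XGboost split-gain formula from per-child gradient sums. A sketch, assuming regularization parameters lambda and gamma, which the excerpt does not fix:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Standard XGboost gain for splitting a node into left/right children.

    g_*, h_* are the sums of first-order and second-order gradients of the
    samples falling into each child; lam and gamma are regularization terms.
    """
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(samples, g, h, feature_values):
    """Enumerate candidate thresholds of one feature, return (best gain, threshold).

    samples: list of sample indices at the current node.
    g, h: dicts mapping sample index -> first/second-order gradient.
    feature_values: dict mapping sample index -> value of this feature.
    """
    best = (float("-inf"), None)
    for t in sorted({feature_values[i] for i in samples}):
        left = [i for i in samples if feature_values[i] <= t]
        right = [i for i in samples if feature_values[i] > t]
        if not left or not right:
            continue  # a division must put samples on both sides
        gain = split_gain(sum(g[i] for i in left), sum(h[i] for i in left),
                          sum(g[i] for i in right), sum(h[i] for i in right))
        if gain > best[0]:
            best = (gain, t)
    return best
```

Each party runs this enumeration over its own features; only summed (and, for the first party, encrypted) gradients and the resulting gain values cross the boundary, so raw feature values stay local.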
Optionally, before the step of obtaining, on the second data party side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting, the method further includes:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether the current round is the first round of node splitting for constructing the first regression tree;
if the current round is the first round of node splitting for constructing the first regression tree, initializing, on the second data party side, the first-order and second-order gradients of each training sample in the corresponding sample set; if the current round is not the first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
if the current round of node splitting corresponds to constructing a regression tree other than the first, judging whether the current round is the first round of node splitting for constructing that regression tree;
if the current round is the first round of node splitting for constructing a regression tree other than the first, updating the first-order and second-order gradients based on the previous round of federated training; if it is not the first round, reusing the first-order and second-order gradients used in the first round of node splitting.
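The initialize/update logic above depends on the loss function, which the excerpt does not fix. Assuming the common binary logistic loss used with XGboost, the gradients are g_i = p_i - y_i and h_i = p_i(1 - p_i), where p_i is the sigmoid of the current raw score:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(y_true, y_pred_raw):
    """First/second-order gradients of binary logistic loss w.r.t. raw scores.

    For logistic loss: g_i = p_i - y_i and h_i = p_i * (1 - p_i), with
    p_i = sigmoid(raw score).
    """
    g, h = [], []
    for y, f in zip(y_true, y_pred_raw):
        p = sigmoid(f)
        g.append(p - y)
        h.append(p * (1.0 - p))
    return g, h

# First regression tree: raw scores start at 0 (the 'initialize' step).
y = [1, 0, 1]
g0, h0 = gradients(y, [0.0, 0.0, 0.0])

# After a tree is built, raw scores change, and the first round of node
# splitting for the next tree recomputes g and h (the 'update' step).
g1, h1 = gradients(y, [0.8, -0.4, 0.2])
```

With zero raw scores every p_i is 0.5, giving g0 = [-0.5, 0.5, -0.5] and h0 = [0.25, 0.25, 0.25]; all later rounds for the same tree reuse these values, matching the description above.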
Optionally, the federated training-based sample prediction method further includes:
when new nodes are generated to construct a regression tree of the gradient boosting tree model, judging, on the second data party side, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise continuing with the next round of node splitting;
when node splitting stops, judging, on the second data party side, whether the total number of regression trees reaches a preset number threshold;
if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise continuing with the next round of federated training.
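The two stopping conditions above can be sketched as an outer control loop. Here `build_one_round` is a hypothetical placeholder, not part of the source, standing in for one round of federated node splitting:

```python
def train_boosted_trees(build_one_round, max_depth=3, max_trees=5):
    """Outer control loop for the two stopping conditions described above.

    build_one_round(tree, depth) performs one round of node splitting and
    returns False when the nodes cannot split further (the alternative
    stopping condition mentioned in the description).
    """
    trees = []
    while len(trees) < max_trees:          # tree-count threshold
        tree, depth = {}, 0
        while depth < max_depth:           # depth threshold per tree
            if not build_one_round(tree, depth):
                break                      # node cannot continue to split
            depth += 1
        trees.append(tree)
    return trees
```

The inner loop corresponds to building one regression tree round by round; the outer loop corresponds to one round of federated training per tree.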
Optionally, the federated training-based sample prediction method further includes:
on the second data party side, recording the related information of the global best split node determined in each round of node splitting;
where the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
Optionally, counting the average gain value of the split nodes corresponding to the same feature in the gradient boosting tree model includes:
on the second data party side, using each global best split node as a split node of each regression tree in the gradient boosting tree model, and counting the average gain value of the split nodes corresponding to the same feature code.
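Counting the average gain per feature code can be sketched as follows. The record format and the numeric gains are illustrative; they stand in for the gain1 to gain5 values in the two-tree example given later in this document:

```python
from collections import defaultdict

def average_gain(split_records):
    """Average split-node gain per feature code over all regression trees.

    split_records: list of (feature_code, gain) tuples, one per recorded
    global best split node (field names are illustrative assumptions).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, gain in split_records:
        totals[feature] += gain
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Bill Payment is chosen as a split node twice (gain1 and gain4 averaged),
# the other features once each; Education never splits, so it gets no entry.
records = [("Bill Payment", 0.9), ("Amount of credit", 0.7),
           ("Age", 0.5), ("Bill Payment", 0.3), ("Gender", 0.2)]
importance = average_gain(records)
```

A feature that never serves as a split node (Education in the example) simply does not appear, which matches an average gain of 0 in the description.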
Optionally, performing joint prediction on the sample to be predicted based on the gradient boosting tree model, to determine its sample category or obtain its prediction score, includes:
on the second data party side, traversing the regression trees corresponding to the gradient boosting tree model;
if the attribute value of the currently traversed node is recorded at the second data party, comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node to determine the next node to traverse;
if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party can compare the data point of its local sample to be predicted with the attribute value of the currently traversed node, determine the next node to traverse, and return that node information to the second data party;
when the traversal of the regression trees corresponding to the gradient boosting tree model is complete, determining the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which it belongs, or obtaining its prediction score based on the weight value of that node.
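The joint traversal can be sketched as follows. The node record layout and the `query_first_party` callback are assumptions standing in for the cross-party query request described above:

```python
def predict_score(trees, local_features, query_first_party):
    """Traverse each regression tree, asking the other party when needed.

    trees: list of dicts mapping node_id -> node. A split node is an assumed
    record {"owner": "first"|"second", "feature": ..., "threshold": ...,
    "left": id, "right": id}; a leaf is {"weight": w}.
    query_first_party(node) stands in for the query request to the first
    data party, which compares its local feature value and answers
    True (go left) or False (go right).
    """
    score = 0.0
    for tree in trees:
        node = tree[0]                       # root node has id 0
        while "weight" not in node:          # descend until a leaf
            if node["owner"] == "second":    # attribute value held locally
                go_left = local_features[node["feature"]] <= node["threshold"]
            else:                            # attribute held by first party
                go_left = query_first_party(node)
            node = tree[node["left"] if go_left else node["right"]]
        score += node["weight"]              # sum leaf weights over all trees
    return score
```

Because each comparison happens on the side that owns the feature, neither party reveals its raw feature values during prediction; only node choices cross the boundary.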
Further, to achieve the above objective, the present invention also provides a sample prediction device based on federated training, which includes a memory, a processor, and a sample prediction program stored in the memory and executable on the processor, where the sample prediction program, when executed by the processor, implements the steps of the federated training-based sample prediction method according to any one of the preceding items.
Further, to achieve the above objective, the present invention also provides a computer-readable storage medium storing a sample prediction program, where the sample prediction program, when executed by a processor, implements the steps of the federated training-based sample prediction method according to any one of the preceding items.
The present invention uses the XGboost algorithm to perform federated training on two aligned training samples to construct a gradient boosting tree model, where the gradient boosting tree model is a set of multiple regression trees and each split node of each regression tree corresponds to one feature of the training samples. Finally, based on the gradient boosting tree model, joint prediction is performed on the sample to be predicted to determine its sample category or obtain its prediction score. The present invention realizes federated training modeling using the training samples of different data parties, and thus enables prediction for samples whose features come from multiple parties' data.
Brief description of the drawings
FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the sample prediction device based on federated training of the present invention;
FIG. 2 is a schematic flowchart of an embodiment of the sample prediction method based on federated training of the present invention;
FIG. 3 is a schematic flowchart of sample alignment in an embodiment of the sample prediction method based on federated training of the present invention;
FIG. 4 is a detailed flowchart of an embodiment of step S10 in FIG. 2;
FIG. 5 is a schematic diagram of a training result of an embodiment of the sample prediction method based on federated training of the present invention.
The realization of the objectives, functional features, and advantages of the present invention will be further explained with reference to the embodiments and the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
The present invention provides a sample prediction device based on federated training.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the sample prediction device based on federated training of the present invention.
The sample prediction device based on federated training of the present invention may be a personal computer, or a device with computing capability such as a server.
As shown in FIG. 1, the sample prediction device based on federated training may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a magnetic disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art can understand that the structure of the sample prediction device based on federated training shown in FIG. 1 does not constitute a limitation on the device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a sample prediction program.
In the sample prediction device based on federated training shown in FIG. 1, the network interface 1004 is mainly used to connect to a backend server and perform data communication with it; the user interface 1003 is mainly used to connect to a client and perform data communication with it; and the processor 1001 may be used to call the sample prediction program stored in the memory 1005 and perform the following operations:
performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, where the gradient boosting tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, to determine the sample category of the sample to be predicted or to obtain a prediction score for the sample to be predicted.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 and also performs the following operations:
before federated training, using a blind signature and the RSA encryption algorithm to interactively encrypt the IDs of the sample data;
comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using that intersection as the aligned training samples.
Further, the two aligned training samples are a first training sample and a second training sample, respectively; the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label; the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party. The processor 1001 calls the sample prediction program stored in the memory 1005 and also performs the following operations:
on the second data party side, obtaining the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them together with the sample IDs of the sample set to the first data party, so that the first data party can calculate, based on the encrypted first-order and second-order gradients, the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party can reuse the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each possible division of its local training samples corresponding to the sample IDs;
the second data party receiving the encrypted gain values of all split nodes returned by the first data party and decrypting them;
on the second data party side, based on the first-order and second-order gradients, calculating the gain value of the split node under each possible division of the local training samples corresponding to the sample IDs;
determining the global best split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
based on the global best split node of the current round, splitting the sample set corresponding to the current node and generating new nodes to construct a regression tree of the gradient boosting tree model.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在进行节点分裂时,判断本轮节点分裂是否对应构造首棵回归树;When performing node splitting, determine whether the current round of node splitting corresponds to the construction of the first regression tree;
若本轮节点分裂对应构造首棵回归树,则判断本轮节点分裂是否为构造首棵回归树的首轮节点分裂;If the current round of node splitting corresponds to the construction of the first regression tree, determine whether this round of node splitting is the first round of node splitting to construct the first regression tree;
若本轮节点分裂为构造首棵回归树的首轮节点分裂,则在所述第二数据方侧,初始化本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;若本轮节点分裂为构造首棵回归树的非首轮节点分裂,则沿用首轮节点分裂所使用的一阶梯度与二阶梯度;If the current round of node splitting is the first round of node splitting to construct the first regression tree, then on the second data side, initialize the first and second steps of each training sample in the sample set corresponding to this round of node splitting; if this Round node splitting is a non-first-round node split that constructs the first regression tree, then the first and second steps used in the first round of node splitting are used;
若本轮节点分裂对应构造非首棵回归树,则判断本轮节点分裂是否为构造非首棵回归树的首轮节点分裂;If the current round of node splitting corresponds to constructing a non-first regression tree, determine whether the current round of node splitting is the first round of node splitting to construct a non-first regression tree;
若本轮节点分裂为构造非首棵回归树的首轮节点分裂，则根据上一轮联邦训练更新一阶梯度与二阶梯度；若本轮节点分裂为构造非首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。If the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, update the first-order gradient and second-order gradient according to the previous round of federated training; if it is a non-first round of node splitting for constructing a non-first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第一数据方侧，基于加密的所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;On the first data party's side, based on the encrypted first-order gradient and second-order gradient, calculating the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
或者在所述第一数据方侧，沿用首轮节点分裂所使用的一阶梯度与二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;Or, on the first data party's side, re-using the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
对所有分裂节点的收益值进行加密后发送至所述第二数据方。Encrypting the gain values of all split nodes and sending them to the second data party.
进一步地，处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
当生成新的节点以构建梯度提升树模型的回归树时,在所述第二数据方侧,判断本轮回归树的深度是否达到预设深度阈值;When a new node is generated to construct a regression tree of the gradient boosted tree model, on the second data side, it is judged whether the depth of the regression tree of the current round reaches a preset depth threshold;
若本轮回归树的深度达到所述预设深度阈值,则停止节点分裂,得到梯度提升树模型的一棵回归树,否则继续下一轮节点分裂;If the depth of the regression tree in the current round reaches the preset depth threshold, stop node splitting to obtain a regression tree of the gradient boosted tree model, otherwise continue to the next round of node splitting;
当停止节点分裂时,在所述第二数据方侧,判断本轮回归树的总数量是否达到预设数量阈值;When the node splitting is stopped, judging whether the total number of regression trees in the current round reaches a preset number threshold on the second data side;
若本轮回归树的总数量达到所述预设数量阈值,则停止联邦训练,否则继续下一轮联邦训练。If the total number of regression trees in the current round reaches the preset number threshold, the federal training is stopped, otherwise the next round of federal training is continued.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第二数据方侧,记录每一轮节点分裂确定的全局最佳分裂节点的相关信息;On the second data side, record related information of the global best split node determined by each round of node splitting;
其中,所述相关信息包括:对应样本数据的提供方、对应样本数据的特征编码以及收益值。The related information includes: a provider corresponding to the sample data, a feature code corresponding to the sample data, and a revenue value.
进一步地,处理器1001调用存储器1005中存储的样本预测程序还执行以下操作:Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to perform the following operations:
在所述第二数据方侧,遍历所述梯度提升树模型对应的回归树;Traverse the regression tree corresponding to the gradient boosted tree model on the second data side;
若当前遍历节点的属性值记录在所述第二数据方,则通过比较本地待预测样本的数据点与当前遍历节点的属性值,以确定下一遍历节点;If the attribute value of the current traversal node is recorded on the second data side, comparing the data point of the local to-be-predicted sample with the attribute value of the current traversal node to determine the next traversal node;
若当前遍历节点的属性值记录在所述第一数据方，则向所述第一数据方发起查询请求，以供在所述第一数据方侧，通过比较本地待预测样本的数据点与当前遍历节点的属性值，确定下一遍历节点并向所述第二数据方返回该节点信息;If the attribute value of the currently traversed node is recorded at the first data party, a query request is initiated to the first data party, so that on the first data party's side the next traversal node is determined by comparing the data points of the local sample to be predicted with the attribute value of the currently traversed node, and the information of that node is returned to the second data party;
当遍历完所述梯度提升树模型对应的回归树时，基于待预测样本所属节点所对应的样本的数据标签，确定待预测样本的样本类别，或基于待预测样本所属节点的权重值，获得待预测样本的预测得分。When the regression trees of the gradient boosted tree model have been fully traversed, the sample category of the sample to be predicted is determined based on the data labels of the samples at the node to which it belongs, or its prediction score is obtained based on the weight value of that node.
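A single-process sketch of this joint traversal follows, with the network query to the first data party replaced by a plain callback; the node layout (`site`, `feature`, `threshold`, `leaf` keys) and all names are illustrative assumptions, not the patent's actual data structures.

```python
# Hypothetical sketch: party B walks the regression tree; when a node's split
# attribute is recorded at party A, it "queries" A via a callback that stands
# in for the network request described above.
def predict(tree, features_b, query_a):
    node = tree
    while "leaf" not in node:
        if node["site"] == "B":
            val = features_b[node["feature"]]   # attribute held locally at B
        else:
            val = query_a(node["feature"])      # attribute recorded at A
        node = node["left"] if val <= node["threshold"] else node["right"]
    return node["leaf"]
```

In a real deployment `query_a` would return only the identity of the next node, so party A's raw feature values never leave its side.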
基于上述基于联邦训练的样本预测装置实施例方案涉及的硬件运行环境,提出本发明基于联邦训练的样本预测方法的以下各实施例。Based on the hardware operating environment involved in the foregoing solution of the federal training-based sample prediction device embodiment, the following embodiments of the federal training-based sample prediction method of the present invention are proposed.
参照图2,图2为本发明基于联邦训练的样本预测方法一实施例的流程示意图。本实施例中,所述基于联邦训练的样本预测方法包括以下步骤:Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of a federal training-based sample prediction method according to the present invention. In this embodiment, the federal training-based sample prediction method includes the following steps:
步骤S10，采用XGboost算法对两个对齐的训练样本进行联邦训练，以构建梯度提升树模型，其中，所述梯度提升树模型包括多棵回归树，所述回归树的一个分裂节点对应训练样本的一个特征;In step S10, the XGboost algorithm is used to perform federated training on two aligned training samples to construct a gradient boosted tree model, where the gradient boosted tree model includes multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
XGboost(eXtreme Gradient Boosting)算法是在GBDT(Gradient Boosting Decision Tree,梯度提升树)算法的基础上对Boosting算法进行的改进，内部决策树使用的是回归树，算法输出是回归树的集合，包含有多棵回归树，训练学习的基本思路是遍历训练样本所有特征的所有分割方法(也即节点分裂的方式)，选择损失最小的分割方法，得到两个叶子(也即分裂节点而生成新节点)，然后继续遍历，直至:The XGboost (eXtreme Gradient Boosting) algorithm is an improvement of the Boosting algorithm built on GBDT (Gradient Boosting Decision Tree). Its internal decision trees are regression trees, and its output is a collection of multiple regression trees. The basic idea of training is to traverse all segmentation methods over all features of the training samples (that is, all ways of splitting a node), select the segmentation with the smallest loss to obtain two leaves (that is, split the node to generate new nodes), and then continue traversing until:
(1)若满足停止分裂条件,则输出一棵回归树;(1) If the stopping splitting condition is satisfied, a regression tree is output;
(2)若满足停止迭代条件,则输出一个回归树集合。(2) If the stopping iteration condition is satisfied, a regression tree set is output.
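The traverse-and-split loop above can be sketched in a minimal single-party form. This is an illustrative sketch, not the patent's implementation: it assumes squared-error loss, mean-residual leaves, and plain-text data (no federation or encryption), and all function and parameter names are hypothetical.

```python
def train_gbdt(X, y, n_trees=2, max_depth=1, lr=1.0):
    """Grow n_trees regression trees; each tree greedily picks, among all
    features and thresholds, the split with the smallest squared-error loss."""
    def sse(res, idx):                        # loss of a leaf predicting the mean
        m = sum(res[i] for i in idx) / len(idx)
        return sum((res[i] - m) ** 2 for i in idx)

    def best_split(res, idx):
        best = None                           # (loss, feature, threshold, L, R)
        for f in range(len(X[0])):            # traverse every feature ...
            for t in sorted({X[i][f] for i in idx}):   # ... and every threshold
                left = [i for i in idx if X[i][f] <= t]
                right = [i for i in idx if X[i][f] > t]
                if left and right:
                    loss = sse(res, left) + sse(res, right)
                    if best is None or loss < best[0]:
                        best = (loss, f, t, left, right)
        return best

    def grow(res, idx, depth):
        if depth >= max_depth or len(idx) < 2:        # stop-splitting condition
            return {"leaf": sum(res[i] for i in idx) / len(idx)}
        b = best_split(res, idx)
        if b is None:
            return {"leaf": sum(res[i] for i in idx) / len(idx)}
        _, f, t, left, right = b
        return {"feature": f, "threshold": t,
                "left": grow(res, left, depth + 1),
                "right": grow(res, right, depth + 1)}

    def predict_one(tree, x):
        while "leaf" not in tree:
            tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
        return tree["leaf"]

    pred, trees = [0.0] * len(y), []
    for _ in range(n_trees):                          # stop-iteration condition
        res = [y[i] - pred[i] for i in range(len(y))]  # each tree fits residuals
        tree = grow(res, list(range(len(y))), 0)
        trees.append(tree)
        pred = [pred[i] + lr * predict_one(tree, X[i]) for i in range(len(y))]
    return trees, pred
```

The later embodiments replace this single-party split search with a two-party search over encrypted gradients.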
本实施例中，XGboost算法使用的训练样本为两个独立的训练样本，也即每一个训练样本分别归属不同的数据方。如果将两个训练样本看成一个整体训练样本，则由于两个训练样本归属不同的数据方，因此，可以看成是对整体训练样本进行切分，进而训练样本是同一样本的不同特征(样本纵切)。In this embodiment, the training samples used by the XGboost algorithm are two independent training samples; that is, each training sample belongs to a different data party. If the two training samples are viewed as one overall training sample, then, since they belong to different data parties, this can be seen as a partition of the overall training sample in which each party holds different features of the same samples (a vertical split of the samples).
此外,由于两个训练样本分别归属不同的数据方,因此,为实现联邦训练建模,需要对双方提供的原始样本数据进行样本对齐。In addition, because the two training samples belong to different data parties, in order to achieve federal training modeling, it is necessary to perform sample alignment on the original sample data provided by both parties.
本实施例中，联邦训练是指样本训练过程由两数据方协作共同完成，最终训练得到的梯度提升树模型所包含的回归树，其分裂节点对应双方训练样本的特征。In this embodiment, federated training means that the sample training process is completed cooperatively by the two data parties; in the regression trees of the finally trained gradient boosted tree model, the split nodes correspond to features from both parties' training samples.
XGboost算法中,在遍历训练样本所有特征的所有分割方法时,通过收益值来评价分割方法的优劣,每次分裂节点都选择损失最小的分割方法。因此,分裂节点的收益值可作为特征重要性的评价依据,分裂节点的收益值越大,则节点分割损失越小,进而该分裂节点对应的特征的重要性也越大。In the XGboost algorithm, when traversing all the segmentation methods of all the features of the training sample, the value of the segmentation method is evaluated by the revenue value, and each segmentation node selects the segmentation method with the smallest loss. Therefore, the revenue value of a split node can be used as the basis for evaluating the importance of features. The larger the revenue value of a split node, the smaller the node segmentation loss, and the greater the importance of the feature corresponding to the split node.
本实施例中，由于训练得到的梯度提升树模型中包括有多棵回归树，而不同回归树有可能使用了相同特征进行节点分割，因此，需要统计梯度提升树模型包括的所有回归树中同一特征对应的分裂节点的平均收益值，并将平均收益值作为对应特征的评分。In this embodiment, since the trained gradient boosted tree model includes multiple regression trees, and different regression trees may use the same feature for node splitting, it is necessary to compute, over all regression trees in the model, the average gain value of the split nodes corresponding to the same feature, and to use that average gain value as the score of the corresponding feature.
步骤S20,基于所述梯度提升树模型,对待预测样本进行联合预测,以确定待预测样本的样本类别或获得待预测样本的预测得分。Step S20: Based on the gradient boosting tree model, perform joint prediction on the samples to be predicted to determine a sample category of the samples to be predicted or obtain a prediction score of the samples to be predicted.
本实施例中,采用XGboost算法训练得到的梯度提升树模型可以实现对预测样本进行联合预测,从而实现对预测样本进行分类或进行打分。In this embodiment, the gradient boosted tree model trained by using the XGboost algorithm can realize joint prediction of prediction samples, thereby achieving classification or scoring of the prediction samples.
本实施例采用XGboost算法对两个对齐的训练样本进行联邦训练，以构建梯度提升树模型，其中，梯度提升树模型为回归树集合，其包括有多棵回归树，每棵回归树的一个分裂节点对应训练样本的一个特征；最后在基于梯度提升树模型，对待预测样本进行联合预测，以确定待预测样本的样本类别或获得待预测样本的预测得分。本发明实现了使用不同数据方的训练样本进行联邦训练建模，进而可实现对具有多方样本数据特征的样本进行预测。This embodiment uses the XGboost algorithm to perform federated training on two aligned training samples to build a gradient boosted tree model, where the gradient boosted tree model is a collection of multiple regression trees and each split node of each regression tree corresponds to one feature of the training samples; finally, joint prediction is performed on the samples to be predicted based on the gradient boosted tree model, to determine their sample category or obtain their prediction score. The present invention realizes federated training and modeling using training samples from different data parties, and can therefore predict samples whose features span multiple parties' sample data.
进一步地,为保证联邦建模过程中,不同数据方使用的样本梯度一致,因此,两个数据方在进行联邦建模之前,先进行双方样本对齐处理,具体处理流程如图3所示。Further, in order to ensure that the sample gradients used by different data parties are consistent during the federation modeling process, the two data parties perform sample alignment processing on both sides before performing federated modeling. The specific processing flow is shown in FIG. 3.
双方样本对齐采用盲签名和RSA加密演算法对样本ID进行交互加密方案，通过比较加密后的ID加密串来识别双方样本中交集部分与非交集部分(隐私部分，彼此双方不可见)，为实现对非交集部分样本数据的隐私保护，本发明在样本对齐过程中需要对样本数据进行加密。The sample alignment between the two parties applies an interactive encryption scheme to the sample IDs using blind signatures and the RSA algorithm; by comparing the encrypted ID strings, the intersection and non-intersection parts of the two parties' samples are identified (the non-intersection part stays private and invisible to the other party). To protect the privacy of the non-intersecting sample data, the present invention encrypts the sample data during the sample alignment process.
假设数据A方的样本id标识为X_A = {u1,u2,u3,u4}，数据B方的样本id标识为X_B = {u1,u2,u3,u5}，数据x的盲签名为E(x)，B方生成的RSA密钥是(n,e,d)，A方得到的RSA公钥是(n,e)，进行如下示例过程:Assume the sample ids of data party A are X_A = {u1, u2, u3, u4}, the sample ids of data party B are X_B = {u1, u2, u3, u5}, the blind signature of data x is E(x), the RSA key generated by party B is (n, e, d), and the RSA public key obtained by party A is (n, e). The example then proceeds as follows:
(1)A方对id加密：Y_A = {(r^e mod n)·E(u) | u∈X_A}，其中r是对应于X_A中每一个不同的样本id生成的不同随机数，然后A方把Y_A发送给B方;(1) Party A encrypts the ids: Y_A = {(r^e mod n)·E(u) | u∈X_A}, where r is a distinct random number generated for each different sample id in X_A; party A then sends Y_A to party B;
(2)B方把该id加密串进行再次加密：Z_A = {y^d mod n | y∈Y_A}，B方再把双层加密的串Z_A发给A方;(2) Party B encrypts the encrypted id strings again: Z_A = {y^d mod n | y∈Y_A}, and sends the doubly encrypted strings Z_A back to party A;
(3)A方对Z_A进行如下操作，即对每个元素去盲化后再作一次盲签名（由于z = r·E(u)^d mod n，去盲化即消去r，得到E(u)^d）：(3) Party A performs the following operation on Z_A, unblinding each element and then applying the blind signature once more (since z = r·E(u)^d mod n, unblinding cancels r and leaves E(u)^d):

D_A = {E(z·r^(-1) mod n) | z∈Z_A} = {E(E(u)^d mod n) | u∈X_A}
(4)B方对id加密：Z_B = {E(E(u)^d mod n) | u∈X_B}，然后把Z_B发送给A方;(4) Party B encrypts its own ids: Z_B = {E(E(u)^d mod n) | u∈X_B}, and then sends Z_B to party A;
(5)A方比较D_A和Z_B，如果两个加密串相等，则表示对应的样本id相等。相等的id即样本交集部分({u1,u2,u3})，保留；不相等的部分({u4,u5})由于始终以加密形式出现，双方对此不可见，可丢弃。(5) Party A compares D_A and Z_B; equal encrypted strings indicate equal sample ids. The equal ids form the sample intersection ({u1, u2, u3}) and are kept; the unequal parts ({u4, u5}) only ever appear in encrypted form, remain invisible to both parties, and can be discarded.
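The five-step exchange can be sketched end to end as a toy program. Assumptions to note: a deliberately tiny RSA modulus, SHA-256 standing in for the blind-signature hash E, both parties simulated in one process, and Python 3.8+ for modular inverses via `pow(x, -1, n)`; this is an illustration of the protocol's arithmetic, not production-grade cryptography.

```python
import hashlib
import math
import random

def H(x, n):
    # The "blind signature" E(x), modelled here as SHA-256 mapped into Z_n.
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big") % n

def psi(ids_a, ids_b):
    p, q, e = 1000003, 1000033, 65537            # party B's toy RSA key
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    # (1) A blinds each id:  y = (r^e mod n) * E(u) mod n
    blind, y_a = {}, {}
    for u in ids_a:
        r = random.randrange(2, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(2, n)
        blind[u] = r
        y_a[u] = (pow(r, e, n) * H(u, n)) % n
    # (2) B signs the blinded strings:  z = y^d = r * E(u)^d mod n
    z_a = {u: pow(y, d, n) for u, y in y_a.items()}
    # (3) A unblinds and hashes:  D_A = E(z * r^-1 mod n) = E(E(u)^d mod n)
    d_a = {u: H(pow(blind[u], -1, n) * z % n, n) for u, z in z_a.items()}
    # (4) B signs and hashes its own ids:  Z_B = E(E(u)^d mod n)
    z_b = {H(pow(H(u, n), d, n), n) for u in ids_b}
    # (5) A keeps exactly the ids whose tokens also appear in Z_B
    return {u for u, t in d_a.items() if t in z_b}
```

Because z = (r^e·E(u))^d = r·E(u)^d mod n, unblinding with r⁻¹ gives both parties comparable tokens E(E(u)^d) without either side revealing its raw ids.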
进一步地,为便于描述本发明的联合训练的具体实现方式,本实施例具体以两个独立的训练样本进行举例说明。Further, in order to facilitate the description of the specific implementation of the joint training of the present invention, this embodiment specifically uses two independent training samples for illustration.
本实施例中，第一数据方提供第一训练样本，第一训练样本属性包括样本ID以及部分样本特征；第二数据方提供第二训练样本，第二训练样本属性包括样本ID、另一部分样本特征以及数据标签。In this embodiment, the first data party provides the first training sample, whose attributes include a sample ID and part of the sample features; the second data party provides the second training sample, whose attributes include a sample ID, the other part of the sample features, and a data label.
其中,样本特征是指样本所表现或具有的特征,比如样本为人,则对应的样本特征可以是年龄、性别、收入、学历等。数据标签用于对多个不同的样本进行分类,分类的结果具体依据于样本的特征进行判定得出。The sample characteristics refer to the characteristics exhibited or possessed by the sample. For example, if the sample is a person, the corresponding sample characteristics may be age, gender, income, education, etc. Data labels are used to classify multiple different samples. The classification results are determined based on the characteristics of the samples.
本发明联邦训练进行建模的主要意义在于实现双方样本数据的双向隐私保护。因此，在联邦训练过程中，第一训练样本保存在第一数据方本地，第二训练样本保存在第二数据方本地，例如下面表1中数据由第一数据方提供并保存在第一数据方本地，下面表2中数据由第二数据方提供并保存在第二数据方本地。The main significance of the federated training modeling of the present invention is to achieve two-way privacy protection of both parties' sample data. Therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample locally at the second data party; for example, the data in Table 1 below is provided by and stored locally at the first data party, and the data in Table 2 below is provided by and stored locally at the second data party.
表1 Table 1

样本ID Sample ID | Age | Gender | Amount of given credit
X1 | 20 | 1 | 5000
X2 | 30 | 1 | 300000
X3 | 35 | 0 | 250000
X4 | 48 | 0 | 300000
X5 | 10 | 1 | 200
如上表1所示，第一训练样本属性包含有样本ID(X1~X5)、Age特征、Gender特征以及Amount of given credit特征。As shown in Table 1 above, the attributes of the first training sample include a sample ID (X1 to X5), an Age feature, a Gender feature, and an Amount of given credit feature.
表2 Table 2

样本ID Sample ID | Bill Payment | Education | Lable
X1 | 3102 | 2 | 24
X2 | 17250 | 3 | 14
X3 | 14027 | 2 | 16
X4 | 6787 | 1 | 10
X5 | 280 | 1 | 26
如上表2所示，第二训练样本属性包含有样本ID(X1~X5)、Bill Payment特征、Education特征以及数据标签Lable。As shown in Table 2 above, the attributes of the second training sample include a sample ID (X1 to X5), a Bill Payment feature, an Education feature, and a data label Lable.
进一步地,参照图4,图4为图2中步骤S10一实施例的细化流程示意图。基于上述实施例,本实施例中,上述步骤S10具体包括:Further, referring to FIG. 4, FIG. 4 is a schematic diagram of a detailed process of an embodiment of step S10 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S10 specifically includes:
步骤S101，在所述第二数据方侧，获取本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;In step S101, on the second data party's side, obtain the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
XGboost算法是一种机器学习建模方法,需要使用分类器(也即分类函数)把样本数据映射到给定类别中的某一个,从而可以应用于数据预测。在利用分类器学习分类规则过程中,需要使用损失函数来判断机器学习的拟合误差大小。XGboost algorithm is a machine learning modeling method. It needs to use a classifier (that is, a classification function) to map sample data to a certain category, so that it can be applied to data prediction. In the process of using the classifier to learn classification rules, it is necessary to use a loss function to determine the size of the fitting error of machine learning.
本实施例中，在每次进行节点分裂时，在第二数据方侧，获取本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度。In this embodiment, each time node splitting is performed, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting are obtained on the second data party's side.
其中,梯度提升树模型需要进行多轮联邦训练,每一轮联邦训练对应生成一棵回归树,而一棵回归树的生成需要进行多次节点分裂。Among them, the gradient boosting tree model requires multiple rounds of federal training. Each round of federal training corresponds to the generation of a regression tree, and the generation of a regression tree requires multiple node splits.
因此，在每一轮联邦训练过程中，首次节点分裂使用的是最开始保存的训练样本，下一次的节点分裂则会使用上一次节点分裂所产生的新节点对应样本集的训练样本，并且同一轮联邦训练过程中，每一轮节点分裂都沿用首轮节点分裂所使用的一阶梯度与二阶梯度。而下一轮的联邦训练会使用上一轮联邦训练结果更新上一轮联邦训练所使用的一阶梯度与二阶梯度。Therefore, within each round of federated training, the first node split uses the initially saved training samples, and each subsequent split uses the training samples of the sample sets corresponding to the new nodes generated by the previous split; within one round of federated training, every node split re-uses the first-order and second-order gradients of that round's first split. The next round of federated training then updates those first-order and second-order gradients using the result of the previous round.
XGboost算法支持自定义损失函数，使用自定义的损失函数对目标函数求一阶偏导数与二阶偏导数，对应得到本地待训练的样本数据的一阶梯度与二阶梯度。The XGboost algorithm supports custom loss functions: the first-order and second-order partial derivatives of the objective function are taken with the custom loss function, yielding the first-order gradient and second-order gradient of the local sample data to be trained.
基于上述实施例中对于XGboost算法与梯度提升树模型的说明，构建回归树需要确定分裂节点，而分裂节点可通过收益值确定。收益值gain的计算公式如下：Based on the description of the XGboost algorithm and the gradient boosted tree model in the above embodiment, constructing a regression tree requires determining split nodes, and split nodes can be determined by their gain values. The gain is calculated as:

gain = 1/2 · [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ

其中，I_L代表当前节点分裂后左子节点包含的样本集合，I_R代表当前节点分裂后右子节点包含的样本集合，I = I_L ∪ I_R，g_i表示样本i的一阶梯度，h_i表示样本i的二阶梯度，λ、γ为常数。Here I_L is the sample set contained in the left child after the current node splits, I_R the sample set contained in the right child, I = I_L ∪ I_R, g_i the first-order gradient of sample i, h_i the second-order gradient of sample i, and λ, γ are constants.
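For concreteness, a plain-text sketch of this gain computation for one candidate split follows; in the federated scheme the sums over g_i and h_i would be evaluated under homomorphic encryption, and the function name and default constants here are illustrative.

```python
def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting the current node's samples into I_L and I_R,
    given per-sample first-order (g) and second-order (h) gradients."""
    def score(idx):
        G = sum(g[i] for i in idx)    # sum of first-order gradients
        Hs = sum(h[i] for i in idx)   # sum of second-order gradients
        return G * G / (Hs + lam)
    all_idx = list(left_idx) + list(right_idx)  # I = I_L ∪ I_R
    return 0.5 * (score(left_idx) + score(right_idx) - score(all_idx)) - gamma
```

Each party evaluates this quantity for every candidate split of its own features; the split with the largest gain wins.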
由于待训练的样本数据分别存在第一数据方与第二数据方，因此，需要在第一数据方侧与第二数据方侧分别计算各自样本数据在每一种分裂方式下分裂节点的收益值。Since the sample data to be trained resides separately at the first data party and the second data party, the gain value of the split node under each splitting mode must be calculated for each party's own sample data on that party's side.
本实施例中，由于第一数据方与第二数据方预先进行了样本对齐，因而双方具有相同的梯度特征，同时由于数据标签存在于第二数据方的样本数据中，因此，基于第二数据方的样本数据的一阶梯度与二阶梯度，计算双方样本数据在每一种分裂方式下分裂节点的收益值。In this embodiment, because the first and second data parties have aligned their samples in advance, the two parties share the same gradients; and because the data label resides in the second data party's sample data, the gain values of the split nodes of both parties' sample data under each splitting mode are calculated based on the first-order and second-order gradients of the second data party's sample data.
步骤S102，若本轮节点分裂为构造回归树的首轮节点分裂，则对所述一阶梯度与所述二阶梯度进行加密后与所述样本集的样本ID一起发送至所述第一数据方，以供在所述第一数据方侧基于加密的所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S102, if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order gradient and second-order gradient are encrypted and sent, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each splitting mode can be calculated, based on the encrypted gradients, for the local training samples corresponding to the sample IDs;
本实施例中，为实现联邦训练过程中实现双方样本数据的双向隐私保护，因此，若本轮节点分裂为构造回归树的首轮节点分裂，则在第二数据方侧计算得到样本数据的一阶梯度与二阶梯度后，先进行加密，然后再发送给第一数据方。In this embodiment, to achieve two-way privacy protection of both parties' sample data during federated training, if the current round of node splitting is the first round of node splitting for constructing a regression tree, the first-order and second-order gradients computed on the second data party's side are first encrypted and then sent to the first data party.
在第一数据方侧，基于接收到的样本数据的一阶梯度与二阶梯度，以及上述收益值gain的计算公式，计算得到第一数据方本地样本数据在每一种分裂方式下分裂节点的收益值，由于一阶梯度与二阶梯度进行了加密，因此，计算得到的收益值也是加密值，因而无需对收益值进行加密。On the first data party's side, the gain values of the split nodes of its local sample data under each splitting mode are calculated from the received first-order and second-order gradients and the gain formula above. Since the gradients are encrypted, the calculated gain values are themselves encrypted, so no further encryption of the gain values is needed.
在计算出样本数据的各种分割方式下分裂节点的收益值后,即可分裂生成新节点以构建回归树。本实施例优选由样本数据具有数据标签的第二数据方主导构建梯度提升树模型的回归树。因此,需要将在第一数据方侧计算得到的第一数据方本地样本数据在每一种分裂方式下分裂节点的收益值发送给第二数据方。After calculating the revenue value of the split node under various segmentation methods of the sample data, the new node can be split to generate a regression tree. In this embodiment, it is preferred that a second data party having sample data with a data label dominate the construction of the regression tree of the gradient boosted tree model. Therefore, the local sample data of the first data side calculated on the first data side needs to be sent to the second data side for the revenue value of the split node in each split mode.
步骤S103，若本轮节点分裂为构造回归树的非首轮节点分裂，则将所述样本集的样本ID发送至所述第一数据方，以供在所述第一数据方侧沿用首轮节点分裂所使用的一阶梯度与二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S103, if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, the sample IDs of the sample set are sent to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting can be re-used to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
本实施例中，若本轮节点分裂为构造回归树的非首轮节点分裂，则只需将本轮节点分裂对应的样本集的样本ID发送给第一数据方，而第一数据方继续沿用首轮节点分裂时所使用的一阶梯度与二阶梯度，计算本地与接收到的样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值。In this embodiment, if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample IDs of the corresponding sample set need to be sent to the first data party; the first data party continues to use the first-order and second-order gradients from the first round of node splitting to calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the received sample IDs.
步骤S104，第二数据方接收所述第一数据方返回的所有分裂节点的加密收益值并进行解密;In step S104, the second data party receives the encrypted gain values of all split nodes returned by the first data party and decrypts them;
步骤S105，在所述第二数据方侧，基于所述一阶梯度与所述二阶梯度，计算本地与所述样本ID对应的训练样本在每一种分裂方式下分裂节点的收益值;In step S105, on the second data party's side, based on the first-order gradient and the second-order gradient, calculate the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
在第二数据方侧，基于计算得到的样本数据的一阶梯度与二阶梯度，以及上述收益值gain的计算公式，计算第二数据方本地待训练的样本数据在每一种分裂方式下分裂节点的收益值。On the second data party's side, the gain values of the split nodes of the second data party's local sample data under each splitting mode are calculated from the computed first-order and second-order gradients and the gain formula above.
步骤S106,基于双方各自计算出的所有分裂节点的收益值,确定本轮节点分裂的全局最佳分裂节点;Step S106: Determine the global best split node for the current round of node splitting based on the return values of all split nodes calculated by both parties;
由于双方初始的样本数据进行了样本对齐，因此，双方各自计算出的所有分裂节点的收益值可以看成是双方整体数据样本在每一种分裂方式下分裂节点的收益值，因此，通过比较各收益值的大小，将收益值最大的分裂节点作为本轮节点分裂的全局最佳分裂节点。Since the two parties' initial sample data were aligned, the gain values of all split nodes calculated by each party can be viewed as the gain values of the split nodes of the combined data samples under every splitting mode; therefore, by comparing the gain values, the split node with the largest gain value is taken as the global best split node for the current round of node splitting.
需要说明的是,该全局最佳分裂节点对应的样本特征既有可能属于第一数据方的训练样本,也有可能属于第二数据方的训练样本。It should be noted that the sample features corresponding to the global best split node may belong to both the training samples on the first data side and the training samples on the second data side.
可选的，由于梯度提升树模型的回归树构建由第二数据方主导，因此，在第二数据方侧，需要记录每一轮节点分裂确定的全局最佳分裂节点的相关信息；相关信息包括：对应样本数据的提供方、对应样本数据的特征编码以及收益值。Optionally, since the construction of the regression trees of the gradient boosted tree model is led by the second data party, the second data party's side needs to record the relevant information of the global best split node determined in each round of node splitting; the relevant information includes the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
例如，若数据方A持有全局最佳分割点对应的特征f_i，则这条记录为(Site A, E_A(f_i), gain)。反之，若数据方B持有全局最佳分割点对应的特征f_i，则这条记录为(Site B, E_B(f_i), gain)。其中，E_A(f_i)表示数据方A对特征f_i进行编码，E_B(f_i)表示数据方B对特征f_i进行编码，通过编码可以标示特征f_i而不泄露其原始特征数据。For example, if data party A holds the feature f_i corresponding to the global best split point, the record is (Site A, E_A(f_i), gain); conversely, if data party B holds it, the record is (Site B, E_B(f_i), gain). Here E_A(f_i) denotes party A's encoding of feature f_i and E_B(f_i) party B's encoding of it; the encoding identifies feature f_i without revealing the original feature data.
可选的,在上述实施例中进行特征选择时,优选以各全局最佳分裂节点作为梯度提升树模型中各回归树的分裂节点,统计同一特征编码对应的分裂节点的平均收益值。Optionally, when performing feature selection in the above embodiment, it is preferable to use each global best split node as the split node of each regression tree in the gradient boosting tree model to count the average return value of the split nodes corresponding to the same feature code.
步骤S107,基于本轮节点分裂的全局最佳分裂节点,对当前节点对应的样本集进行分裂,生成新的节点以构建梯度提升树模型的回归树。Step S107: Based on the global best split node of the current node split, split the sample set corresponding to the current node to generate a new node to build a regression tree of the gradient boosted tree model.
若本轮节点分裂的全局最佳分裂节点对应的样本特征属于第一数据方的训练样本,则本轮分割的当前节点对应的样本数据属于第一数据方。相应地,若本轮节点分裂的全局最佳分裂节点对应的样本特征属于第二数据方的训练样本,则本轮分割的当前节点对应的样本数据属于第二数据方。If the sample features corresponding to the global best split node of the current node split belong to the training samples of the first data side, the sample data corresponding to the current node split of the current round belongs to the first data side. Correspondingly, if the sample features corresponding to the global best split node of the current node split belong to the training samples of the second data side, the sample data corresponding to the current node split of the current round belongs to the second data side.
通过节点分裂,即可生成新的节点(左子节点和右子节点),从而构建回归树。而通过多轮节点分裂,则可以不断生成新的节点,进而得到树深度更深的回归树,而若停止节点分裂,则可得到梯度提升树模型的一棵回归树。Through node splitting, new nodes (left and right child nodes) can be generated to build a regression tree. Through multiple rounds of node splitting, new nodes can be continuously generated, and a deeper regression tree can be obtained. If node splitting is stopped, a regression tree of the gradient boosted tree model can be obtained.
本实施例中,由于双方计算通信的数据都是模型中间结果的加密数据,因此训练过程也不会泄露原始特征数据。同时整个训练过程中使用加密算法以保证数据的隐私性。优选采用部分同态加密算法,支持加法同态。In this embodiment, since the data calculated and communicated by both parties are encrypted data of the intermediate results of the model, the training process will not leak the original feature data. At the same time, an encryption algorithm is used throughout the training process to ensure the privacy of the data. Partial homomorphic encryption algorithm is preferably used, which supports addition homomorphism.
进一步地,在一实施例中,基于节点分裂条件的不同,具体通过以下方式得到用于节点分裂的训练样本的一阶梯度与二阶梯度:Further, in one embodiment, based on the difference in node splitting conditions, the one-step and two-step degrees of the training samples for node splitting are specifically obtained in the following manner:
1、本轮节点分裂对应构造首棵回归树1.The first round of node split corresponds to the construction of the first regression tree
1.1、若本轮节点分裂为构造首棵回归树的首轮节点分裂，则在第二数据方侧，初始化本轮节点分裂对应的样本集中各训练样本的一阶梯度与二阶梯度;1.1. If the current round of node splitting is the first round of node splitting for constructing the first regression tree, then on the second data party's side, initialize the first-order gradient and second-order gradient of each training sample in the sample set corresponding to this round of node splitting;
1.2、若本轮节点分裂为构造首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。1.2. If the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
2、本轮节点分裂对应构造非首棵回归树2.Corresponding to the current round of node splitting to construct a non-first regression tree
2.1、若本轮节点分裂对应构造非首棵回归树的首轮节点分裂，则根据上一轮联邦训练更新一阶梯度与二阶梯度;2.1. If the current round of node splitting is the first round of node splitting for constructing a non-first regression tree, update the first-order gradient and second-order gradient according to the previous round of federated training;
2.2、若本轮节点分裂为构造非首棵回归树的非首轮节点分裂，则沿用首轮节点分裂所使用的一阶梯度与二阶梯度。2.2. If the current round of node splitting is a non-first round of node splitting for constructing a non-first regression tree, re-use the first-order and second-order gradients used in the first round of node splitting.
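Cases 1.1-2.2 above amount to a small piece of gradient bookkeeping, which can be sketched as follows; `state`, `init_fn`, and `update_fn` are hypothetical placeholders for the loss-specific derivative computations performed on the second data party's side, not the patent's actual interfaces.

```python
def gradients_for_round(state, first_tree, first_split, update_fn, init_fn):
    """Return the (g, h) gradients to use for one node split.
    state persists between calls; init_fn/update_fn compute fresh gradients."""
    if first_split:
        if first_tree:
            state["g"], state["h"] = init_fn()    # case 1.1: initialize
        else:
            state["g"], state["h"] = update_fn()  # case 2.1: update from last round
    # cases 1.2 / 2.2: a non-first split re-uses the stored gradients
    return state["g"], state["h"]
```

Within one tree the gradients are computed once and re-used; they change only when a new tree starts.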
Further, in an embodiment, in order to reduce the complexity of the regression trees, a depth threshold is preset for the regression trees to limit node splitting.
In this embodiment, each time new nodes are generated to build a regression tree of the gradient boosting tree model, the second data party determines whether the depth of the current regression tree reaches the preset depth threshold;
If the depth of the current regression tree reaches the preset depth threshold, node splitting is stopped, yielding one regression tree of the gradient boosting tree model; otherwise the next round of node splitting continues.
It should be noted that the condition limiting node splitting may also be that splitting stops when a node cannot be split further; for example, when the current node corresponds to only one sample, node splitting cannot continue.
Further, in another embodiment, in order to avoid over-fitting during training, a threshold on the number of regression trees is preset to limit how many regression trees are generated.
In this embodiment, when node splitting stops, the second data party determines whether the total number of regression trees built so far reaches the preset number threshold;
If the total number of regression trees reaches the preset number threshold, federated training is stopped; otherwise the next round of federated training continues.
It should be noted that the condition limiting the number of generated regression trees may also be to stop building regression trees when the nodes can no longer be split.
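The two stopping conditions described above can be sketched as simple predicates; the parameter names are illustrative, not taken from the patent.

```python
def should_stop_splitting(tree_depth, node_sample_count, max_depth):
    """Stop splitting when the preset depth threshold is reached, or when
    the node cannot be split further (e.g. it holds a single sample)."""
    return tree_depth >= max_depth or node_sample_count <= 1

def should_stop_training(tree_count, max_trees):
    """Stop federated training once the preset number of regression trees
    has been built."""
    return tree_count >= max_trees
```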
To better understand the present invention, the federated training and modeling process of the present invention is illustrated below based on the sample data in Tables 1 and 2 of the foregoing embodiments.
First round of federated training: training the first regression tree
(1) First round of node splitting
1.1. On the second data party's side, compute the first-order gradient (g_i) and second-order gradient (h_i) of the sample data in Table 2; encrypt g_i and h_i and send them to the first data party;
1.2. On the first data party's side, compute, based on g_i and h_i, the gain value of the split node under every possible division of the sample data in Table 1; send the gain values to the second data party;
Since the Age feature in Table 1 admits 5 ways of dividing the sample data, the Gender feature 2 ways, and the Amount of given credit feature 5 ways, the sample data in Table 1 admits 12 division schemes in total; that is, the gain values of the split nodes corresponding to 12 division schemes need to be computed.
1.3. On the second data party's side, compute the gain value of the split node under every possible division of the sample data in Table 2;
Since the Bill Payment feature in Table 2 admits 5 ways of dividing the sample data and the Education feature 3 ways, the sample data in Table 2 admits 8 division schemes in total; that is, the gain values of the split nodes corresponding to 8 division schemes need to be computed.
1.4. From the gain values of the split nodes under the 12 division schemes computed on the first data party's side and the gain values of the split nodes under the 8 division schemes computed on the second data party's side, select the feature corresponding to the maximum gain value as the globally optimal split node of the current round of node splitting;
1.5. Based on the globally optimal split node of the current round, split the sample data corresponding to the current node and generate new nodes to build a regression tree of the gradient boosting tree model.
1.6. Determine whether the depth of the current regression tree reaches the preset depth threshold; if it does, stop node splitting, thereby obtaining one regression tree of the gradient boosting tree model; otherwise continue with the next round of node splitting;
1.7. Determine whether the total number of regression trees reaches the preset number threshold; if it does, stop federated training; otherwise enter the next round of federated training.
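The per-division gain values enumerated in steps 1.2 and 1.3 can be sketched with the standard XGBoost gain formula. This is a plaintext, single-machine illustration only: the regularization terms λ and γ are assumed at typical defaults, and the homomorphic computation over encrypted gradients that the protocol requires on the first data party's side is omitted.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost split gain: 0.5 * (GL^2/(HL+lam) + GR^2/(HR+lam)
    - (GL+GR)^2/(HL+HR+lam)) - gamma."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def enumerate_split_gains(values, grads, hess, thresholds):
    """For one feature, compute the gain of every candidate division
    ('value <= threshold' goes left) -- one gain per division scheme."""
    gains = []
    for t in thresholds:
        gl = sum(g for v, g in zip(values, grads) if v <= t)
        hl = sum(h for v, h in zip(values, hess) if v <= t)
        gains.append(split_gain(gl, hl, sum(grads) - gl, sum(hess) - hl))
    return gains
```

The 12 + 8 division schemes in the example would each contribute one such gain value, and step 1.4 simply takes the maximum over all of them.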
(2) Second and third rounds of node splitting
2.1. Assume that the feature corresponding to the previous round of node splitting is Bill Payment less than or equal to 3102. That feature serves as the split node (the corresponding samples being X1, X2, X3, X4 and X5) and produces two new child nodes: the left node corresponds to the sample set (X1, X5) with values less than or equal to 3102, and the right node corresponds to the sample set (X2, X3, X4) with values greater than 3102. The sample sets (X1, X5) and (X2, X3, X4) then serve as new sample sets for the second and third rounds of node splitting, respectively, so that the two new nodes are split and further nodes are generated;
2.2. Since the second and third rounds of node splitting belong to the same round of federated training, the sample gradient values used in the first round of node splitting continue to be used. Assume that the feature corresponding to one split node of this round is Amount of given credit less than or equal to 200; that feature serves as the split node (the corresponding samples being X1 and X5) and produces two new child nodes, where the left node corresponds to sample X5 (less than or equal to 200) and the right node to sample X1 (greater than 200). Likewise, the feature corresponding to the other split node of this round is Age less than or equal to 35; that feature serves as the split node (the corresponding samples being X2, X3 and X4) and produces two new child nodes, where the left node corresponds to samples X2 and X3 (less than or equal to 35) and the right node to sample X4 (greater than 35). For the specific implementation flow, refer to the first round of node splitting.
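The partitioning of a node's sample set into two child sample sets, as in the walk-through above, can be sketched as follows. The Bill Payment values attached to X1–X5 here are hypothetical placeholders (the actual values come from Table 2), chosen only so that the grouping matches the example.

```python
def partition(sample_set, feature, threshold):
    """Split a node's sample set: 'value <= threshold' goes to the left
    child, the rest to the right child, each becoming a new sample set."""
    left = {sid: s for sid, s in sample_set.items() if s[feature] <= threshold}
    right = {sid: s for sid, s in sample_set.items() if s[feature] > threshold}
    return left, right

# Hypothetical Bill Payment values reproducing the example's grouping:
samples = {"X1": {"Bill Payment": 1500}, "X2": {"Bill Payment": 4000},
           "X3": {"Bill Payment": 5200}, "X4": {"Bill Payment": 6787},
           "X5": {"Bill Payment": 900}}
left, right = partition(samples, "Bill Payment", 3102)
# left holds X1 and X5; right holds X2, X3 and X4, as in the walk-through
```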
Second round of federated training: training the second regression tree
3.1. Since this round of node splitting belongs to the next round of federated training, the first-order and second-order gradients used in the previous round of federated training are updated with the result of that round, and the second round of federated training proceeds with node splitting, generating new nodes to build the next regression tree. For the specific implementation flow, refer to the construction of the previous regression tree.
3.2. As shown in FIG. 5, after two rounds of federated training, the sample data in Tables 1 and 2 of the above embodiments produce two regression trees. The first regression tree contains three split nodes: Bill Payment less than or equal to 3102, Amount of given credit less than or equal to 200, and Age less than or equal to 35. The second regression tree contains two split nodes: Bill Payment less than or equal to 6787, and Gender == 1.
3.3. Based on the two regression trees of the gradient boosting tree model shown in FIG. 5, the average gain values corresponding to the features of the sample data are: Bill Payment, (gain1 + gain4)/2; Education, 0; Age, gain3; Gender, gain5; Amount of given credit, gain2.
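The per-feature average gain of step 3.3 can be computed as below. The numeric values stand in for gain1…gain5, the gains recorded at the five split nodes of the two trees in FIG. 5; they are placeholders, not values from the patent.

```python
from collections import defaultdict

def average_gain_per_feature(split_records, all_features):
    """split_records: (feature, gain) pairs collected from every split node
    across all regression trees. A feature never used in any split gets 0."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, gain in split_records:
        totals[feature] += gain
        counts[feature] += 1
    return {f: (totals[f] / counts[f] if counts[f] else 0.0)
            for f in all_features}

# Placeholder gains for the five split nodes of the two trees in FIG. 5:
records = [("Bill Payment", 1.0), ("Amount of given credit", 2.0),
           ("Age", 3.0), ("Bill Payment", 4.0), ("Gender", 5.0)]
features = ["Bill Payment", "Education", "Age", "Gender",
            "Amount of given credit"]
avg = average_gain_per_feature(records, features)
# Bill Payment averages (gain1 + gain4) / 2; Education, never used, gets 0
```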
Further, in an embodiment of the federated training-based sample prediction method of the present invention, the specific implementation flow of joint prediction on a sample to be predicted includes:
(1) On the second data party's side, traverse the regression trees corresponding to the gradient boosting tree model;
(2) If the attribute value of the currently traversed node is recorded at the second data party, determine the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
(3) If the attribute value of the currently traversed node is recorded at the first data party, initiate a query request to the first data party, so that on the first data party's side the next node to traverse is determined by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node, and that node information is returned to the second data party;
(4) When the regression trees corresponding to the gradient boosting tree model have been fully traversed, determine the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which the sample belongs, or obtain the prediction score of the sample based on the weight value of that node.
In this embodiment, since the split-node records of a regression tree are kept on the second data party's side when the tree is generated, the second data party takes the lead in completing the joint prediction of the sample to be predicted, specifically by traversing the regression trees of the gradient boosting tree model to determine the node to which the sample belongs. The node to which the sample to be predicted belongs is determined by comparing the data points of the sample with the attribute values of the currently traversed node.
After the node to which the sample to be predicted belongs has been determined, the sample category of the sample can be determined based on the data labels of the training samples corresponding to that node, or the prediction score of the sample can be obtained based on the weight value of that node.
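The joint-prediction traversal described above can be sketched as follows. The node layout, the "A"/"B" party tags, and the `query_remote` callback are illustrative stand-ins: the callback represents the query request sent to the first data party when the split attribute is recorded there, and in the real protocol the second data party never sees the first party's feature values.

```python
def traverse(tree, local_features, query_remote):
    """Walk one regression tree from the root to a leaf on the second
    data party's side. Internal nodes carry {"party", "feature",
    "threshold", "left", "right"}; leaves carry {"leaf": weight}."""
    nodes = tree["nodes"]
    node = nodes[tree["root"]]
    while "leaf" not in node:
        if node["party"] == "B":  # attribute recorded at the second party
            go_left = local_features[node["feature"]] <= node["threshold"]
            node = nodes[node["left"] if go_left else node["right"]]
        else:  # attribute at the first party: ask it which child to take
            node = nodes[query_remote(node)]
    return node["leaf"]

def predict_score(trees, local_features, query_remote):
    # The prediction score sums the leaf weights over all regression trees.
    return sum(traverse(t, local_features, query_remote) for t in trees)
```

A toy tree mirroring the shape of the first tree in FIG. 5 (with made-up leaf weights) would be traversed locally at the Bill Payment node and via `query_remote` at a node whose attribute belongs to the first data party.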
The present invention further provides a computer-readable storage medium.
The computer-readable storage medium of the present invention stores a sample prediction program which, when executed by a processor, implements the steps of the federated training-based sample prediction method described in any one of the foregoing embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described above, which are merely illustrative rather than restrictive. Under the teaching of the present invention, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present invention and the scope protected by the claims; any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the protection of the present invention.

Claims (20)

  1. A sample prediction method based on federated training, characterized in that the sample prediction method based on federated training comprises the following steps:
    performing federated training on two aligned training samples by using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted.
  2. The sample prediction method based on federated training according to claim 1, characterized in that the sample prediction method based on federated training comprises:
    before performing the federated training, interactively encrypting the IDs of the sample data by using a blind signature and the RSA encryption algorithm;
    identifying the intersection of the two parties' samples by comparing the encrypted ID strings of the two parties, and using the intersection of the samples as the training samples after sample alignment.
  3. The sample prediction method based on federated training according to claim 2, characterized in that the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample comprise a sample ID and a part of the sample features, and the attributes of the second training sample comprise a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
  4. The sample prediction method based on federated training according to claim 3, characterized in that performing federated training on the two aligned training samples by using the XGboost algorithm to construct the gradient boosting tree model comprises:
    on the second data party's side, obtaining a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient and sending them, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each division scheme is computed, based on the encrypted first-order and second-order gradients, for the local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting are reused to compute the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    the second data party receiving the encrypted gain values of all the split nodes returned by the first data party and decrypting them;
    on the second data party's side, computing, based on the first-order gradient and the second-order gradient, the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    determining the globally optimal split node of the current round of node splitting based on the gain values of all the split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the globally optimal split node of the current round, and generating new nodes to build a regression tree of the gradient boosting tree model.
  5. The sample prediction method based on federated training according to claim 4, characterized in that, before the step of obtaining, on the second data party's side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
    when performing node splitting, determining whether the current round of node splitting corresponds to constructing the first regression tree;
    if the current round of node splitting corresponds to constructing the first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, on the second data party's side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
    if the current round of node splitting corresponds to constructing a non-first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the non-first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the non-first regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing the non-first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
  6. The sample prediction method based on federated training according to claim 4, characterized in that the sample prediction method based on federated training further comprises:
    when new nodes are generated to build a regression tree of the gradient boosting tree model, determining, on the second data party's side, whether the depth of the current regression tree reaches a preset depth threshold;
    if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise continuing with the next round of node splitting;
    when node splitting is stopped, determining, on the second data party's side, whether the total number of regression trees reaches a preset number threshold;
    if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise continuing with the next round of federated training.
  7. The sample prediction method based on federated training according to claim 4, characterized in that the sample prediction method based on federated training further comprises:
    recording, on the second data party's side, related information of the globally optimal split node determined in each round of node splitting;
    wherein the related information comprises: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
  8. The sample prediction method based on federated training according to claim 7, characterized in that performing joint prediction on the sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted, comprises:
    traversing, on the second data party's side, the regression trees corresponding to the gradient boosting tree model;
    if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
    if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that on the first data party's side the next node to traverse is determined by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node, and that node information is returned to the second data party;
    when the regression trees corresponding to the gradient boosting tree model have been fully traversed, determining the sample category of the sample to be predicted based on the data labels of the samples corresponding to the node to which the sample belongs, or obtaining the prediction score of the sample based on the weight value of that node.
  9. A sample prediction apparatus based on federated training, characterized in that the sample prediction apparatus based on federated training comprises a memory, a processor, and a sample prediction program stored in the memory and executable on the processor, wherein the sample prediction program, when executed by the processor, implements the following steps:
    performing federated training on two aligned training samples by using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine a sample category of the sample to be predicted or obtain a prediction score of the sample to be predicted.
  10. The sample prediction apparatus based on federated training according to claim 9, characterized in that the processor invokes the sample prediction program stored in the memory to further perform the following steps:
    before performing the federated training, interactively encrypting the IDs of the sample data by using a blind signature and the RSA encryption algorithm;
    identifying the intersection of the two parties' samples by comparing the encrypted ID strings of the two parties, and using the intersection of the samples as the training samples after sample alignment.
  11. The sample prediction apparatus based on federated training according to claim 10, characterized in that the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample comprise a sample ID and a part of the sample features, and the attributes of the second training sample comprise a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
  12. The sample prediction apparatus based on federated training according to claim 11, characterized in that performing federated training on the two aligned training samples by using the XGboost algorithm to construct the gradient boosting tree model comprises:
    on the second data party's side, obtaining a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient and sending them, together with the sample IDs of the sample set, to the first data party, so that on the first data party's side the gain value of the split node under each division scheme is computed, based on the encrypted first-order and second-order gradients, for the local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that on the first data party's side the first-order and second-order gradients used in the first round of node splitting are reused to compute the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    the second data party receiving the encrypted gain values of all the split nodes returned by the first data party and decrypting them;
    on the second data party's side, computing, based on the first-order gradient and the second-order gradient, the gain value of the split node under each division scheme for the local training samples corresponding to the sample IDs;
    determining the globally optimal split node of the current round of node splitting based on the gain values of all the split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the globally optimal split node of the current round, and generating new nodes to build a regression tree of the gradient boosting tree model.
  13. The sample prediction apparatus based on federated training according to claim 12, characterized in that, before the step of obtaining, on the second data party's side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the processor invokes the sample prediction program stored in the memory to further perform the following steps:
    when performing node splitting, determining whether the current round of node splitting corresponds to constructing the first regression tree;
    if the current round of node splitting corresponds to constructing the first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, on the second data party's side, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of node splitting; if the current round of node splitting is a non-first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
    if the current round of node splitting corresponds to constructing a non-first regression tree, determining whether the current round of node splitting is the first round of node splitting for constructing the non-first regression tree;
    if the current round of node splitting is the first round of node splitting for constructing the non-first regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if the current round of node splitting is a non-first round of node splitting for constructing the non-first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
  14. The federated-training-based sample prediction device according to claim 12, wherein the processor calls the sample prediction program stored in the memory to further perform the following steps:
    when a new node is generated to construct a regression tree of the gradient boosting tree model, determining, at the second data party side, whether the depth of the current regression tree reaches a preset depth threshold;
    if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting;
    when node splitting is stopped, determining, at the second data party side, whether the total number of regression trees reaches a preset number threshold;
    if the total number of regression trees reaches the preset number threshold, stopping the federated training; otherwise, continuing with the next round of federated training.
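The two stopping criteria in this claim amount to a nested loop: an inner loop that splits nodes until the depth threshold is reached, and an outer loop that adds trees until the count threshold is reached. A minimal control-flow sketch, with hypothetical callback names standing in for the per-level splitting and per-tree finalization work:

```python
def train_loop(max_depth, max_trees, grow_one_level, finish_tree):
    """Control flow of the stopping criteria: the inner loop splits nodes
    until the preset depth threshold is reached; the outer loop continues
    federated training until the preset number of trees is reached."""
    trees = 0
    while trees < max_trees:          # next round of federated training
        depth = 0
        while depth < max_depth:      # next round of node splitting
            grow_one_level(depth)
            depth += 1
        finish_tree(trees)            # one regression tree completed
        trees += 1
    return trees
```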
  15. The federated-training-based sample prediction device according to claim 12, wherein the processor calls the sample prediction program stored in the memory to further perform the following steps:
    recording, at the second data party side, related information of the global optimal split node determined in each round of node splitting;
    wherein the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
  16. The federated-training-based sample prediction device according to claim 15, wherein the step of performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted, comprises:
    traversing, at the second data party side, a regression tree corresponding to the gradient boosting tree model;
    if the attribute value of the currently traversed node is recorded at the second data party, comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node to determine the next node to traverse;
    if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party compares the data point of its local sample to be predicted with the attribute value of the currently traversed node, determines the next node to traverse, and returns the node information to the second data party;
    when the regression tree corresponding to the gradient boosting tree model has been traversed, determining the sample category of the sample to be predicted based on the data label of the sample corresponding to the node to which the sample to be predicted belongs, or obtaining the prediction score of the sample to be predicted based on the weight value of the node to which the sample to be predicted belongs.
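The joint traversal above can be sketched as a walk over a tree in which each internal node records which party holds its split feature; when the coordinating (second) party does not hold the value, it asks the other party which branch to take. The node layout and names below are hypothetical, used only to illustrate the control flow:

```python
def predict(tree, sample_b, query_party_a):
    """Walk one regression tree. `sample_b` holds the second party's local
    feature values; `query_party_a` asks the first party, by node id, which
    branch its local feature value selects (it returns True for left)."""
    node = tree
    while "leaf" not in node:
        if node["owner"] == "B":   # attribute recorded at the second data party
            go_left = sample_b[node["feature"]] < node["threshold"]
        else:                       # attribute recorded at the first data party
            go_left = query_party_a(node["id"])
        node = node["left"] if go_left else node["right"]
    return node["leaf"]             # weight value / label of the node reached
```

Note that neither party reveals its raw feature values: party A only returns the identity of the next node, matching the query-request step of the claim.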
  17. A computer-readable storage medium, wherein a sample prediction program is stored on the computer-readable storage medium, and when the sample prediction program is executed by a processor, the following steps are implemented:
    performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises multiple regression trees, and one split node of a regression tree corresponds to one feature of the training samples;
    performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample category of the sample to be predicted or obtain the prediction score of the sample to be predicted.
  18. The computer-readable storage medium according to claim 17, wherein when the sample prediction program is executed by the processor, the following steps are further implemented:
    before the federated training, interactively encrypting the IDs of the sample data using a blind signature and the RSA encryption algorithm;
    comparing the encrypted ID strings of the two parties to identify the intersection of the two parties' samples, and using the intersection as the aligned training samples.
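The intersection step above can be illustrated as follows. This sketch substitutes a keyed hash for the blind-signature/RSA exchange of the claim (the real protocol additionally prevents either party from learning IDs outside the intersection); only the compare-encrypted-strings-and-intersect logic is shown, and all names are illustrative.

```python
import hashlib

def encrypt_ids(ids, shared_key):
    """Map each plaintext ID to a deterministic encrypted string
    (keyed SHA-256 here, standing in for the blind-RSA signature)."""
    return {hashlib.sha256((shared_key + i).encode()).hexdigest(): i
            for i in ids}

def aligned_samples(ids_a, ids_b, shared_key):
    """Compare the two parties' encrypted ID strings and return the
    intersection, which becomes the aligned training sample set."""
    enc_a = encrypt_ids(ids_a, shared_key)
    enc_b = encrypt_ids(ids_b, shared_key)
    return sorted(enc_a[c] for c in enc_a.keys() & enc_b.keys())
```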
  19. The computer-readable storage medium according to claim 18, wherein the two aligned training samples are a first training sample and a second training sample, respectively;
    the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include a sample ID, another part of the sample features, and a data label;
    the first training sample is provided by the first data party and stored locally at the first data party, and the second training sample is provided by the second data party and stored locally at the second data party.
  20. The computer-readable storage medium according to claim 19, wherein the performing federated training on two aligned training samples using the XGboost algorithm to construct a gradient boosting tree model comprises:
    obtaining, at the second data party side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
    if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradients and second-order gradients and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party computes, based on the encrypted first-order gradients and second-order gradients, the gain value of the split node under each splitting mode for its local training samples corresponding to the sample IDs;
    if the current round of node splitting is a non-first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order gradients and second-order gradients used in the first round of node splitting to compute the gain value of the split node under each splitting mode for its local training samples corresponding to the sample IDs;
    receiving, by the second data party, the encrypted gain values of all split nodes returned by the first data party, and decrypting them;
    computing, at the second data party side, based on the first-order gradients and second-order gradients, the gain value of the split node under each splitting mode for the local training samples corresponding to the sample IDs;
    determining the global optimal split node of the current round of node splitting based on the gain values of all split nodes computed by the two parties;
    splitting the sample set corresponding to the current node based on the global optimal split node of the current round of node splitting, and generating new nodes to construct a regression tree of the gradient boosting tree model.
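The gain value each party computes per candidate split is, in the standard XGBoost formulation, derived from sums of first-order and second-order gradients on each side of the split. A plaintext sketch (the patent evaluates the same quantity over encrypted gradients; `lam`, the L2 regularization term, is a conventional XGBoost parameter and not a value specified here):

```python
def split_gain(g, h, left_idx, lam=1.0):
    """Gain of splitting a node whose samples carry first-order gradients g
    and second-order gradients h, with left_idx marking the samples routed
    to the left child; the remaining samples go to the right child."""
    def score(gs, hs):
        # (sum of first-order gradients)^2 / (sum of second-order + lambda)
        return sum(gs) ** 2 / (sum(hs) + lam)
    left = sorted(left_idx)
    right = [i for i in range(len(g)) if i not in left_idx]
    gl, hl = [g[i] for i in left], [h[i] for i in left]
    gr, hr = [g[i] for i in right], [h[i] for i in right]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(g, h))
```

The split with the largest such gain across both parties' candidates is the global optimal split node of the round.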
PCT/CN2019/080297 2018-08-10 2019-03-29 Sample prediction method and device based on federated training, and storage medium WO2020029590A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810913869.3 2018-08-10
CN201810913869.3A CN109165683B (en) 2018-08-10 2018-08-10 Sample prediction method, device and storage medium based on federal training

Publications (1)

Publication Number Publication Date
WO2020029590A1 true WO2020029590A1 (en) 2020-02-13

Family

ID=64895662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080297 WO2020029590A1 (en) 2018-08-10 2019-03-29 Sample prediction method and device based on federated training, and storage medium

Country Status (2)

Country Link
CN (1) CN109165683B (en)
WO (1) WO2020029590A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
CN111414646A (en) * 2020-03-20 2020-07-14 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111444956A (en) * 2020-03-25 2020-07-24 平安科技(深圳)有限公司 Low-load information prediction method and device, computer system and readable storage medium
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111666576A (en) * 2020-04-29 2020-09-15 平安科技(深圳)有限公司 Data processing model generation method and device and data processing method and device
CN111814985A (en) * 2020-06-30 2020-10-23 平安科技(深圳)有限公司 Model training method under federated learning network and related equipment thereof
CN111882054A (en) * 2020-05-27 2020-11-03 杭州中奥科技有限公司 Method and related equipment for cross training of network data of encryption relationship between two parties
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111914277A (en) * 2020-08-07 2020-11-10 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
CN112231308A (en) * 2020-10-14 2021-01-15 深圳前海微众银行股份有限公司 Method, device, equipment and medium for removing weight of horizontal federal modeling sample data
CN112288094A (en) * 2020-10-09 2021-01-29 武汉大学 Federal network representation learning method and system
CN112381307A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN112651458A (en) * 2020-12-31 2021-04-13 深圳云天励飞技术股份有限公司 Method and device for training classification model, electronic equipment and storage medium
CN112749749A (en) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 Classification method and device based on classification decision tree model and electronic equipment
CN112836830A (en) * 2021-02-01 2021-05-25 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN113051239A (en) * 2021-03-26 2021-06-29 北京沃东天骏信息技术有限公司 Data sharing method, use method of model applying data sharing method and related equipment
CN113204443A (en) * 2021-06-03 2021-08-03 京东科技控股股份有限公司 Data processing method, equipment, medium and product based on federal learning framework
CN113392164A (en) * 2020-03-13 2021-09-14 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing longitudinal federated tree
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113657996A (en) * 2021-08-26 2021-11-16 深圳市洞见智慧科技有限公司 Method and device for determining feature contribution degree in federated learning and electronic equipment
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113722987A (en) * 2021-08-16 2021-11-30 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113722739A (en) * 2021-09-06 2021-11-30 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113807534A (en) * 2021-03-08 2021-12-17 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113807380A (en) * 2020-12-31 2021-12-17 京东科技信息技术有限公司 Method and device for training federated learning model and electronic equipment
CN113824546A (en) * 2020-06-19 2021-12-21 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN113824677A (en) * 2020-12-28 2021-12-21 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN114399000A (en) * 2022-01-20 2022-04-26 中国平安人寿保险股份有限公司 Object interpretability feature extraction method, device, equipment and medium of tree model
CN114677200A (en) * 2022-04-01 2022-06-28 重庆邮电大学 Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning
CN114882333A (en) * 2021-05-31 2022-08-09 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
US11914678B2 (en) 2020-09-23 2024-02-27 International Business Machines Corporation Input encoding for classifier generalization

Families Citing this family (34)

Publication number Priority date Publication date Assignee Title
CN109165683B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Sample prediction method, device and storage medium based on federal training
CN109670484B (en) * 2019-01-16 2022-03-25 电子科技大学 Mobile phone individual identification method based on bispectrum characteristics and lifting tree
CN112183759B (en) * 2019-07-04 2024-02-13 创新先进技术有限公司 Model training method, device and system
CN110443378B (en) * 2019-08-02 2023-11-03 深圳前海微众银行股份有限公司 Feature correlation analysis method and device in federal learning and readable storage medium
CN110674979A (en) * 2019-09-11 2020-01-10 腾讯科技(深圳)有限公司 Risk prediction model training method, prediction device, medium and equipment
CN110717671B (en) * 2019-10-08 2021-08-31 深圳前海微众银行股份有限公司 Method and device for determining contribution degree of participants
CN110795603B (en) * 2019-10-29 2021-02-19 支付宝(杭州)信息技术有限公司 Prediction method and device based on tree model
CN110796266B (en) * 2019-10-30 2021-06-15 深圳前海微众银行股份有限公司 Method, device and storage medium for implementing reinforcement learning based on public information
CN110851869B (en) * 2019-11-14 2023-09-19 深圳前海微众银行股份有限公司 Sensitive information processing method, device and readable storage medium
CN110944011B (en) * 2019-12-16 2021-12-07 支付宝(杭州)信息技术有限公司 Joint prediction method and system based on tree model
CN110968886B (en) * 2019-12-20 2022-12-02 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111242385A (en) * 2020-01-19 2020-06-05 苏宁云计算有限公司 Prediction method, device and system of gradient lifting tree model
CN111309848A (en) * 2020-01-19 2020-06-19 苏宁云计算有限公司 Generation method and system of gradient lifting tree model
CN111368901A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Multi-party combined modeling method, device and medium based on federal learning
CN113392101B (en) * 2020-03-13 2024-06-18 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing transverse federal tree
CN113554476B (en) * 2020-04-23 2024-04-19 京东科技控股股份有限公司 Training method and system of credit prediction model, electronic equipment and storage medium
CN111598186B (en) * 2020-06-05 2021-07-16 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111695697B (en) * 2020-06-12 2023-09-08 深圳前海微众银行股份有限公司 Multiparty joint decision tree construction method, equipment and readable storage medium
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN111915019B (en) * 2020-08-07 2023-06-20 平安科技(深圳)有限公司 Federal learning method, system, computer device, and storage medium
CN111967615B (en) * 2020-09-25 2024-05-28 北京百度网讯科技有限公司 Multi-model training method and device based on feature extraction, electronic equipment and medium
CN112199706B (en) * 2020-10-26 2022-11-22 支付宝(杭州)信息技术有限公司 Tree model training method and business prediction method based on multi-party safety calculation
CN112464287B (en) * 2020-12-12 2022-07-05 同济大学 Multi-party XGboost safety prediction model training method based on secret sharing and federal learning
CN112529101B (en) * 2020-12-24 2024-05-14 深圳前海微众银行股份有限公司 Classification model training method and device, electronic equipment and storage medium
CN113822311B (en) * 2020-12-31 2023-09-01 京东科技控股股份有限公司 Training method and device of federal learning model and electronic equipment
WO2022144001A1 (en) * 2020-12-31 2022-07-07 京东科技控股股份有限公司 Federated learning model training method and apparatus, and electronic device
CN113807544B (en) * 2020-12-31 2023-09-26 京东科技控股股份有限公司 Training method and device of federal learning model and electronic equipment
CN112597135A (en) * 2021-01-04 2021-04-02 天冕信息技术(深圳)有限公司 User classification method and device, electronic equipment and readable storage medium
CN112766514B (en) * 2021-01-22 2021-12-24 支付宝(杭州)信息技术有限公司 Method, system and device for joint training of machine learning model
CN112767129A (en) * 2021-01-22 2021-05-07 建信金融科技有限责任公司 Model training method, risk prediction method and device
CN113642669B (en) * 2021-08-30 2024-04-05 平安医疗健康管理股份有限公司 Feature analysis-based fraud prevention detection method, device, equipment and storage medium
CN113705727B (en) * 2021-09-16 2023-05-12 四川新网银行股份有限公司 Decision tree modeling method, prediction method, equipment and medium based on differential privacy
CN114580011B (en) * 2022-01-29 2024-06-14 国网青海省电力公司电力科学研究院 Electric power facility security situation sensing method and system based on federal privacy training
CN114362948B (en) * 2022-03-17 2022-07-12 蓝象智联(杭州)科技有限公司 Federated derived feature logistic regression modeling method

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
WO2018031597A1 (en) * 2016-08-08 2018-02-15 Google Llc Systems and methods for data aggregation based on one-time pad based sharing
CN107832581A (en) * 2017-12-15 2018-03-23 百度在线网络技术(北京)有限公司 Trend prediction method and device
CN107871160A (en) * 2016-09-26 2018-04-03 谷歌公司 Communicate efficient joint study
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample predictions method, apparatus and storage medium based on federation's training

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
CN101056166B (en) * 2007-05-28 2010-04-21 北京飞天诚信科技有限公司 A method for improving the data transmission security
CN104009842A (en) * 2014-05-15 2014-08-27 华南理工大学 Communication data encryption and decryption method based on DES encryption algorithm, RSA encryption algorithm and fragile digital watermarking
CN113435602A (en) * 2016-11-01 2021-09-24 第四范式(北京)技术有限公司 Method and system for determining feature importance of machine learning sample
CN107704966A (en) * 2017-10-17 2018-02-16 华南理工大学 A kind of Energy Load forecasting system and method based on weather big data
CN107767183A (en) * 2017-10-31 2018-03-06 常州大学 Brand loyalty method of testing based on combination learning and profile point
CN107993139A (en) * 2017-11-15 2018-05-04 华融融通(北京)科技有限公司 A kind of anti-fake system of consumer finance based on dynamic regulation database and method
CN108257105B (en) * 2018-01-29 2021-04-20 南华大学 Optical flow estimation and denoising joint learning depth network model for video image
TWM561279U (en) * 2018-02-12 2018-06-01 林俊良 Blockchain system and node server for processing strategy model scripts of financial assets
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning

Cited By (52)

Publication number Priority date Publication date Assignee Title
CN113392164B (en) * 2020-03-13 2024-01-12 京东城市(北京)数字科技有限公司 Method for constructing longitudinal federal tree, main server, service platform and system
CN113392164A (en) * 2020-03-13 2021-09-14 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing longitudinal federated tree
CN111414646B (en) * 2020-03-20 2024-03-29 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111414646A (en) * 2020-03-20 2020-07-14 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
CN111444956A (en) * 2020-03-25 2020-07-24 平安科技(深圳)有限公司 Low-load information prediction method and device, computer system and readable storage medium
CN111444956B (en) * 2020-03-25 2023-10-31 平安科技(深圳)有限公司 Low-load information prediction method, device, computer system and readable storage medium
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111666576A (en) * 2020-04-29 2020-09-15 平安科技(深圳)有限公司 Data processing model generation method and device and data processing method and device
CN111666576B (en) * 2020-04-29 2023-08-04 平安科技(深圳)有限公司 Data processing model generation method and device, and data processing method and device
CN111882054A (en) * 2020-05-27 2020-11-03 杭州中奥科技有限公司 Method and related equipment for cross training of network data of encryption relationship between two parties
CN111882054B (en) * 2020-05-27 2024-04-12 杭州中奥科技有限公司 Method for cross training of encryption relationship network data of two parties and related equipment
CN113824546A (en) * 2020-06-19 2021-12-21 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN113824546B (en) * 2020-06-19 2024-04-02 百度在线网络技术(北京)有限公司 Method and device for generating information
CN111814985B (en) * 2020-06-30 2023-08-29 平安科技(深圳)有限公司 Model training method under federal learning network and related equipment thereof
CN111814985A (en) * 2020-06-30 2020-10-23 平安科技(深圳)有限公司 Model training method under federated learning network and related equipment thereof
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111914277A (en) * 2020-08-07 2020-11-10 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
CN111914277B (en) * 2020-08-07 2023-09-01 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
US11914678B2 (en) 2020-09-23 2024-02-27 International Business Machines Corporation Input encoding for classifier generalization
CN112288094A (en) * 2020-10-09 2021-01-29 武汉大学 Federal network representation learning method and system
CN112288094B (en) * 2020-10-09 2022-05-17 武汉大学 Federal network representation learning method and system
CN112231308A (en) * 2020-10-14 2021-01-15 深圳前海微众银行股份有限公司 Method, device, equipment and medium for removing weight of horizontal federal modeling sample data
CN112231308B (en) * 2020-10-14 2024-05-03 深圳前海微众银行股份有限公司 Method, device, equipment and medium for de-duplication of transverse federal modeling sample data
CN112381307B (en) * 2020-11-20 2023-12-22 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN112381307A (en) * 2020-11-20 2021-02-19 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN113824677A (en) * 2020-12-28 2021-12-21 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113824677B (en) * 2020-12-28 2023-09-05 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113807380B (en) * 2020-12-31 2023-09-01 京东科技信息技术有限公司 Training method and device of federal learning model and electronic equipment
CN113807380A (en) * 2020-12-31 2021-12-17 京东科技信息技术有限公司 Method and device for training federated learning model and electronic equipment
CN112651458A (en) * 2020-12-31 2021-04-13 深圳云天励飞技术股份有限公司 Method and device for training classification model, electronic equipment and storage medium
CN112651458B (en) * 2020-12-31 2024-04-02 深圳云天励飞技术股份有限公司 Classification model training method and device, electronic equipment and storage medium
CN112749749A (en) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 Classification method and device based on classification decision tree model and electronic equipment
CN112749749B (en) * 2021-01-14 2024-04-16 深圳前海微众银行股份有限公司 Classification decision tree model-based classification method and device and electronic equipment
CN112836830B (en) * 2021-02-01 2022-05-06 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN112836830A (en) * 2021-02-01 2021-05-25 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN113807534B (en) * 2021-03-08 2023-09-01 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113807534A (en) * 2021-03-08 2021-12-17 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN113051239A (en) * 2021-03-26 2021-06-29 北京沃东天骏信息技术有限公司 Data sharing method, use method of model applying data sharing method and related equipment
CN114882333A (en) * 2021-05-31 2022-08-09 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN113204443B (en) * 2021-06-03 2024-04-16 京东科技控股股份有限公司 Data processing method, device, medium and product based on federal learning framework
CN113204443A (en) * 2021-06-03 2021-08-03 京东科技控股股份有限公司 Data processing method, equipment, medium and product based on federal learning framework
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113722987B (en) * 2021-08-16 2023-11-03 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113722987A (en) * 2021-08-16 2021-11-30 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113657996A (en) * 2021-08-26 2021-11-16 深圳市洞见智慧科技有限公司 Method and device for determining feature contribution degree in federated learning and electronic equipment
CN113722739B (en) * 2021-09-06 2024-04-09 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739A (en) * 2021-09-06 2021-11-30 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114399000A (en) * 2022-01-20 2022-04-26 中国平安人寿保险股份有限公司 Object interpretability feature extraction method, device, equipment and medium of tree model
CN114677200A (en) * 2022-04-01 2022-06-28 重庆邮电大学 Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning

Also Published As

Publication number Publication date
CN109165683A (en) 2019-01-08
CN109165683B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
WO2020029590A1 (en) Sample prediction method and device based on federated training, and storage medium
CN109034398B (en) Gradient lifting tree model construction method and device based on federal training and storage medium
US11985037B2 (en) Systems and methods for conducting more reliable assessments with connectivity statistics
US11665072B2 (en) Parallel computational framework and application server for determining path connectivity
US11968105B2 (en) Systems and methods for social graph data analytics to determine connectivity within a community
US11032585B2 (en) Real-time synthetically generated video from still frames
WO2020119272A1 (en) Risk identification model training method and apparatus, and server
US9875277B1 (en) Joining database tables
WO2022142001A1 (en) Target object evaluation method based on multi-score card fusion, and related device therefor
CN104077723A (en) Social network recommending system and social network recommending method
US10742627B2 (en) System and method for dynamic network data validation
CN110851485B (en) Social relation mining method and device, computer equipment and readable medium
CN113688252A (en) Safe cross-domain recommendation method based on multi-feature collaborative knowledge map and block chain
CN112101577A (en) XGboost-based cross-sample federal learning and testing method, system, device and medium
CN112560105B (en) Joint modeling method and device for protecting multi-party data privacy
CN114139202A (en) Privacy protection sample prediction application method and system based on federal learning
WO2021135540A1 (en) Neo4j-based anomalous user processing method and apparatus, computer device, and medium
CN107194280B (en) Model establishing method and device
US11853400B2 (en) Distributed machine learning engine
CN112529102A (en) Feature expansion method, device, medium, and computer program product
CN110175283B (en) Recommendation model generation method and device
CN117033997A (en) Data segmentation method, device, electronic equipment and medium
Li et al. Incentive and knowledge distillation based federated learning for cross-silo applications
CN116702899B (en) Entity fusion method suitable for public and private linkage scene
US11847246B1 (en) Token based communications for machine learning systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19848348

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19848348

Country of ref document: EP

Kind code of ref document: A1