CN116029392A - Joint training method and system based on federated learning - Google Patents

Joint training method and system based on federated learning

Info

Publication number
CN116029392A
CN116029392A (application CN202310065357.7A)
Authority
CN
China
Prior art keywords
party
active
passive
ciphertext
derivative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310065357.7A
Other languages
Chinese (zh)
Inventor
彭胜波
彭宇
周宏�
王克敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Corp Guizhou Provincial Co
Original Assignee
China Tobacco Corp Guizhou Provincial Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Corp Guizhou Provincial Co filed Critical China Tobacco Corp Guizhou Provincial Co
Priority to CN202310065357.7A
Publication of CN116029392A
Legal status: Pending


Abstract

The invention discloses a joint training method and system based on federated learning. The method comprises the following steps: each active party calculates the first and second derivatives of each sample, homomorphically encrypts them, and sends the ciphertexts to the parameter server of the passive party; based on a secure multi-party computation protocol, the parameter server of the passive party sums, over the sample dimension, the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] from each active party, obtaining the aggregated first-derivative ciphertext [g_i] and second-derivative ciphertext [h_i] of each sample; based on a precision-lossless privacy-preserving tree boosting algorithm, a coordinator coordinates the active and passive parties to train a boosted tree model from [g_i] and [h_i], so as to support federated learning scenarios in which horizontal and vertical partitioning coexist.

Description

Joint training method and system based on federated learning
Technical Field
The invention relates to the field of federated learning, and in particular to a joint training method and system based on federated learning.
Background
Federated learning (Federated Learning) is an emerging basic artificial-intelligence technology, first proposed by Google in 2016 and originally used to update local models for Android mobile phone users. Its design goal is to carry out efficient machine learning among multiple participants or computing nodes while guaranteeing information security during big-data exchange, protecting terminal-data and personal-data privacy, and ensuring legal compliance.
Horizontal federated learning and vertical federated learning are two categories of federated learning. In horizontal federated learning, multiple participants holding rows of samples with the same feature space jointly perform federated learning. In vertical federated learning, multiple participants holding common samples but different feature sets jointly perform federated learning. Building on vertical federated learning, Cheng et al. proposed a precision-lossless privacy-preserving tree boosting algorithm in the paper "SecureBoost: A Lossless Federated Learning Framework" (arXiv, 2019) to train a high-quality boosted tree model in a vertical federated environment. The method requires the data of the different parties to have the following properties: the sample data is vertically partitioned, with one data provider supplying the label data and the other data providers supplying the feature data.
However, in a vertical federated learning scenario, label data may be distributed among different participants. Existing vertical federated learning algorithms cannot handle scenarios in which horizontal and vertical partitioning coexist. For example, party A is a bank holding user label data such as "has a lending crisis" or "has no lending crisis", and may or may not hold user features; however, the labels of party A are distributed among different banks, e.g., the Agricultural Bank holds part of the labels and the Construction Bank holds part of the labels. Party B is an insurance company holding the users' feature data (e.g., age, income). The banks and the insurance company are in a vertical relationship, while the labels held by the Agricultural Bank and those held by the Construction Bank are in a horizontal relationship.
Disclosure of Invention
The invention provides a joint training method and system based on federated learning, to support federated learning scenarios in which horizontal and vertical partitioning coexist.
In a first aspect, an embodiment of the present invention provides a joint training method based on federated learning, applied to a joint training system based on federated learning, where the system includes at least two active parties and at least one passive party, the passive party owns feature data, each active party owns part of the label data, any one of the at least two active parties is selected as the coordinator, and all active and passive parties have completed sample alignment. The method comprises:
each active party calculates the first and second derivatives of each sample, homomorphically encrypts them, and sends the ciphertexts to the parameter server of the passive party;
based on a secure multi-party computation protocol, the parameter server of the passive party sums, over the sample dimension, the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] from each active party, obtaining the aggregated first-derivative ciphertext [g_i] and second-derivative ciphertext [h_i] of each sample;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train a boosted tree model from [g_i] and [h_i].
Further, training the boosted tree model from [g_i] and [h_i] under the coordination of the coordinator, based on the precision-lossless privacy-preserving tree boosting algorithm, comprises:
the parameter server of the passive party computes, from [g_i] and [h_i], the ciphertexts [g_l] and [h_l] of the left-subtree gradient sums of the current node;
the active party decrypts [g_l] and [h_l], calculates the split information of the current node from g_l and h_l (see the gain-computation sketch after this list), and sends the split information to the coordinator;
the coordinator calculates the globally optimal split information from the received split information and sends it to the corresponding passive party;
the passive party partitions the sample space according to the split information, adds a record of the node split information to a lookup table, and then broadcasts the record id and the sample space of the record to the active parties;
the active party splits the current node according to the received sample space and associates the current node with the passive party and the record id;
the child nodes produced by splitting the current node are then taken as parent nodes, and the above steps are repeated until a preset termination condition is reached.
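To make the split-information computation concrete, the following is a minimal sketch of the standard XGBoost split-gain formula evaluated by an active party after decrypting the left-subtree sums g_l and h_l; the regularization parameters lam and gamma, their default values, and all function names are illustrative assumptions rather than part of the claimed method.

    def split_gain(g_l, h_l, g, h, lam=1.0, gamma=0.0):
        """XGBoost gain of one candidate split.

        g_l, h_l: decrypted sums of first/second derivatives in the left child
        g, h:     sums over all samples at the current node
        lam:      L2 regularization on leaf weights (assumed)
        gamma:    complexity penalty per added leaf (assumed)
        """
        g_r, h_r = g - g_l, h - h_l                  # right child by complement
        score = lambda gs, hs: gs * gs / (hs + lam)  # structure score of a leaf
        return 0.5 * (score(g_l, h_l) + score(g_r, h_r) - score(g, h)) - gamma

    def best_local_split(candidates, g, h):
        """Return the locally optimal (feature, threshold, gain) triple.

        candidates: iterable of (feature, threshold, g_l, h_l), one entry per
                    candidate split position evaluated at this active party.
        """
        return max(((f, t, split_gain(gl, hl, g, h)) for f, t, gl, hl in candidates),
                   key=lambda item: item[2])

Under this sketch, the coordinator obtains the globally optimal split simply by taking the maximum-gain triple among those reported by all parties.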
Further, the preset termination condition comprises:
the maximum split gain of the node is smaller than or equal to a set gain threshold;
or,
the number of samples contained in a leaf node is smaller than a set number threshold;
or,
the tree depth of the boosted tree equals the set depth threshold.
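As a sketch only, the three stopping rules above can be checked with a single predicate during tree construction; the threshold names and default values are illustrative assumptions.

    def should_stop(max_gain, n_leaf_samples, depth,
                    gain_threshold=1e-6, min_samples=20, max_depth=6):
        """True if any preset termination condition is met (assumed defaults)."""
        return (max_gain <= gain_threshold       # split gain too small
                or n_leaf_samples < min_samples  # leaf node too small
                or depth >= max_depth)           # boosted tree at maximum depth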
Further, training the boosted tree model from [g_i] and [h_i] under the coordination of the coordinator, based on the precision-lossless privacy-preserving tree boosting algorithm, comprises:
the passive party calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node, and broadcasts [g_l] and [h_l] to different active parties according to the computing-power resources of each active party;
each active party calculates split information of the current node from g_l and h_l and sends it to the coordinator, and the coordinator calculates the globally optimal split information;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model according to the globally optimal split information.
Further, after the coordinator coordinates the active and passive parties to train the boosted tree model from [g_i] and [h_i] based on the precision-lossless privacy-preserving tree boosting algorithm, the method further comprises:
when calculating an evaluation index on the validation set, calculating a local evaluation index value based on the labels owned by each active party;
and performing a statistical operation on the corresponding index values based on the secure multi-party computation protocol, thereby obtaining evaluation index information based on all labels.
Further, training the boosted tree model from [g_i] and [h_i] under the coordination of the coordinator, based on the precision-lossless privacy-preserving tree boosting algorithm, comprises:
the parameter server of the passive party partitions [g_i] and [h_i] across at least two worker servers;
each worker server calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the parameter server of the passive party aggregates the [g_l] and [h_l] of each worker server;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model from [g_l] and [h_l].
Further, the secure multi-party computation protocol comprises: the SPDZ protocol supporting two-party secure computation, or the NPDZ protocol supporting multi-party secure computation.
Further, the parameter server of the passive party summing, over the sample dimension, the first and second derivatives from the different active parties based on the secure multi-party computation protocol comprises:
acquiring the number of participants, and selecting a target protocol from the SPDZ protocol and the NPDZ protocol according to the number of participants;
based on the target protocol, summing, over the sample dimension, the first and second derivatives from the different active parties, respectively.
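A minimal sketch of this selection rule, under the assumption implied by the preceding paragraph that SPDZ serves the two-participant case and NPDZ the multi-participant case; the function name and string labels are placeholders for whatever MPC backend implements these protocols.

    def select_protocol(n_participants):
        """Choose the aggregation protocol from the participant count."""
        if n_participants <= 2:
            return "SPDZ"  # two-party secure computation
        return "NPDZ"      # multi-party secure computation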
In a second aspect, an embodiment of the present invention further provides a joint training system based on federated learning. The system includes at least two active parties and at least one passive party; the passive party owns feature data, each active party owns part of the label data, any one of the at least two active parties is selected as the coordinator, and all active and passive parties have completed sample alignment. The system applies the joint training method based on federated learning according to any one of the embodiments of the present invention.
According to the invention, the ciphertext gradients of all samples are summed on the passive party based on the SPDZ or NPDZ protocol, no third-party node is needed, the security of the participants' data is guaranteed to the greatest extent, and federated learning scenarios in which horizontal and vertical partitioning coexist are supported.
When the sample size of the participants' federated training is large, the computational overhead of comparing the split gains of different features is considerable. When selecting the global split point, in an embodiment of the invention the passive party calculates the left-subtree gradient ciphertexts [g_l] and [h_l] of the current node and broadcasts [g_l] and [h_l] to different active parties according to their computing-power resources, so that all parties jointly compute the globally optimal split point, which improves the efficiency of the algorithm.
When evaluation indices such as the confusion matrix are calculated on the validation set, the label information of all active parties is needed, but each active party holds only part of the labels, and these cannot be combined in the clear. In an embodiment of the invention, when an evaluation index is calculated, each active party first calculates a local evaluation index value based on the labels it owns, and a statistical operation on the corresponding index values is then performed based on the secure multi-party computation protocol, thereby obtaining evaluation index information based on all labels.
In joint modeling scenarios such as advertisement delivery, financial credit, and knowledge graphs, the data volume of the participants is generally large, and a federated learning algorithm cannot read all the data into memory for computation, which raises the question of how to use such massive data for joint modeling. In an embodiment of the invention, the parameter server of the passive party partitions [g_i] and [h_i] across at least two worker servers; each worker server calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node; and the parameter server of the passive party aggregates the [g_l] and [h_l] of each worker server, thereby solving the massive-data problem.
Drawings
FIG. 1 is a flowchart of a joint training method based on federated learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model evaluation method according to an embodiment of the present invention;
FIG. 3 is an interaction diagram of a joint training method based on federated learning according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and do not limit it. It should further be noted that, for convenience of description, only the parts related to the present invention, rather than the entire structure, are shown in the drawings.
FIG. 1 is a flowchart of a joint training method based on federated learning according to an embodiment of the present invention. The embodiment is applicable to the situation in which, in a vertical federated learning scenario, label data may be distributed among different participants. The method may be performed by a joint training system based on federated learning that comprises at least two active parties and at least one passive party; the passive party owns feature data, each active party owns part of the label data, and any one of the at least two active parties is selected as the coordinator, the coordinator also being an active party. All active and passive parties have completed sample alignment.
Referring to FIG. 1, the joint training method based on federated learning specifically includes the following steps:
s110, each active party calculates the first derivative and the second derivative of each sample, and sends the ciphertext to the parameter server of the passive party after homomorphic encryption.
Each sample includes the feature data and/or the label data of that sample.
The first and second derivatives are the terms obtained from the second-order Taylor expansion of the loss function around the current prediction.
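As an illustrative sketch of S110: for a logistic loss, the per-sample derivatives are g_i = p_i − y_i and h_i = p_i(1 − p_i), and an additively homomorphic scheme can encrypt them element-wise. The sketch below uses the open-source python-paillier package (phe); the choice of loss, library, key length, and variable names are assumptions, not prescribed by the patent.

    import numpy as np
    from phe import paillier  # python-paillier: additively homomorphic encryption

    def local_derivatives(y_true, y_margin):
        """First/second derivatives of the logistic loss for locally held labels."""
        p = 1.0 / (1.0 + np.exp(-y_margin))  # predicted probability
        return p - y_true, p * (1.0 - p)     # g, h

    # Key pair; the public key s is shared with the parties that only encrypt.
    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

    y = np.array([1.0, 0.0, 1.0])            # toy labels held by one active party
    g, h = local_derivatives(y, np.zeros_like(y))

    # Element-wise ciphertexts [g_ji], [h_ji] sent to the passive party's server.
    enc_g = [public_key.encrypt(float(v)) for v in g]
    enc_h = [public_key.encrypt(float(v)) for v in h]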
For example, party A is a bank holding user label data such as "has a lending crisis" or "has no lending crisis", and may or may not hold user features; however, the labels of party A are distributed among different banks, e.g., the Agricultural Bank holds part of the labels and the Construction Bank holds part of the labels. Party B is an insurance company holding the users' feature data (e.g., age, income). The Construction Bank can serve as both the coordinator and an active party, the Agricultural Bank is an active party, and the insurance company is the passive party.
S120, based on the secure multi-party computation protocol, the parameter server of the passive party sums, over the sample dimension, the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] from each active party, obtaining the aggregated first-derivative ciphertext [g_i] and second-derivative ciphertext [h_i] of each sample.
Here j indexes the different active parties and i indexes the different samples.
The secure multi-party computation protocol includes: the SPDZ protocol supporting two-party secure computation, or the NPDZ protocol supporting multi-party secure computation.
Optionally, the secure multi-party computation protocol may further include ABY (Arithmetic, Boolean, Yao) and ABY3.
Specifically, the choice between the SPDZ protocol and the NPDZ protocol is determined by the number of participants.
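A minimal sketch of the per-sample aggregation in S120: with an additively homomorphic scheme such as Paillier, the ciphertexts from all active parties can be added without decryption, standing in here for the SPDZ/NPDZ-based summation. It reuses the enc_g lists from the sketch above; all names are illustrative.

    def aggregate_ciphertexts(per_party_enc):
        """Sum ciphertexts over the active-party index j, per sample i.

        per_party_enc: list over active parties, each a list over samples of
                       Paillier ciphertexts; returns the per-sample sums [g_i].
        """
        n_samples = len(per_party_enc[0])
        return [sum((party[i] for party in per_party_enc[1:]), per_party_enc[0][i])
                for i in range(n_samples)]

    # e.g., enc_g_total = aggregate_ciphertexts([enc_g_of_A1, enc_g_of_A2])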
S130, based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train a boosted tree model from [g_i] and [h_i].
The precision-lossless privacy-preserving tree boosting algorithm was proposed by Cheng et al. in the paper "SecureBoost: A Lossless Federated Learning Framework" (arXiv, 2019).
According to the invention, the ciphertext gradients of all samples are summed on the passive party based on the SPDZ or NPDZ protocol, no third-party node is needed, the security of the participants' data is guaranteed to the greatest extent, and federated learning scenarios in which horizontal and vertical partitioning coexist are supported.
Further, training the boosted tree model from [g_i] and [h_i] under the coordination of the coordinator, based on the precision-lossless privacy-preserving tree boosting algorithm, comprises:
the passive party calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node, and broadcasts [g_l] and [h_l] to different active parties according to the computing-power resources of each active party;
each active party calculates split information of the current node from g_l and h_l and sends it to the coordinator, and the coordinator calculates the globally optimal split information;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model according to the globally optimal split information.
In this embodiment of the invention, the computation is distributed according to the computing-power resources of the parties, which improves the training efficiency of the model.
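A sketch of one way the candidate-split workload could be sharded across active parties in proportion to their computing-power resources; the proportional-floor scheme and all names are assumptions made for illustration, since the patent does not fix a concrete distribution policy.

    def assign_candidates(candidate_ids, party_power):
        """Assign candidate-split ids to parties proportionally to power weights.

        candidate_ids: list of candidate split identifiers to evaluate
        party_power:   dict mapping party name -> relative computing power
        """
        total = sum(party_power.values())
        parties = list(party_power.items())
        assignment, start = {}, 0
        for idx, (party, power) in enumerate(parties):
            if idx == len(parties) - 1:
                count = len(candidate_ids) - start  # last party takes the rest
            else:
                count = int(len(candidate_ids) * power / total)
            assignment[party] = candidate_ids[start:start + count]
            start += count
        return assignment

    # e.g., assign_candidates(list(range(100)), {"A1": 2.0, "A2": 1.0})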
FIG. 2 is a schematic diagram of a model evaluation method according to an embodiment of the present invention. Referring to FIG. 2, after the coordinator coordinates the active and passive parties to train the boosted tree model from [g_i] and [h_i] based on the precision-lossless privacy-preserving tree boosting algorithm, the method further comprises:
when calculating an evaluation index on the validation set, calculating a local evaluation index value based on the labels owned by each active party;
and performing a statistical operation on the corresponding index values based on the secure multi-party computation protocol, thereby obtaining evaluation index information based on all labels.
In this embodiment of the invention, when an evaluation index is calculated, each active party first calculates a local evaluation index value based on the labels it owns, and a statistical operation on the corresponding index values is then performed based on the secure multi-party computation protocol, thereby obtaining evaluation index information based on all labels.
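To make the evaluation step concrete: each active party can compute a confusion matrix over the validation samples whose labels it owns, and only the counts are then combined. The plain secure_sum helper below stands in for the SPDZ/NPDZ secure aggregation; all names and the toy data are illustrative assumptions.

    import numpy as np

    def local_confusion_matrix(y_true, y_pred):
        """2x2 matrix [[TN, FP], [FN, TP]] over locally labeled samples."""
        cm = np.zeros((2, 2), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[int(t), int(p)] += 1
        return cm

    def secure_sum(matrices):
        """Placeholder for the secure multi-party summation of local counts."""
        return sum(matrices)

    cm_total = secure_sum([
        local_confusion_matrix([1, 0, 1], [1, 0, 0]),  # active party A1 (toy)
        local_confusion_matrix([0, 0, 1], [0, 1, 1]),  # active party A2 (toy)
    ])
    accuracy = np.trace(cm_total) / cm_total.sum()     # metric over all labels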
Further, training the boosted tree model from [g_i] and [h_i] under the coordination of the coordinator, based on the precision-lossless privacy-preserving tree boosting algorithm, comprises:
the parameter server of the passive party partitions [g_i] and [h_i] across at least two worker servers;
each worker server calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the parameter server of the passive party aggregates the [g_l] and [h_l] of each worker server;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model from [g_l] and [h_l].
In this embodiment of the invention, the parameter server of the passive party partitions [g_i] and [h_i] across at least two worker servers; each worker server calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node; and the parameter server of the passive party aggregates the [g_l] and [h_l] of each worker server, thereby solving the massive-data problem.
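A minimal single-process sketch of the parameter-server pattern just described: the per-sample ciphertext vector is sharded across workers, each worker homomorphically sums the ciphertexts of its shard that fall into the left subtree, and the server adds the partial sums into [g_l]. The shard layout and names are illustrative assumptions.

    def shard(values, n_workers):
        """Partition a list of ciphertexts into n_workers contiguous shards."""
        k, r = divmod(len(values), n_workers)
        shards, start = [], 0
        for w in range(n_workers):
            end = start + k + (1 if w < r else 0)
            shards.append(values[start:end])
            start = end
        return shards

    def worker_left_sum(shard_enc, shard_left_mask):
        """One worker's partial sum over its shard's left-subtree ciphertexts."""
        picked = [c for c, is_left in zip(shard_enc, shard_left_mask) if is_left]
        return sum(picked[1:], picked[0]) if picked else None

    def server_aggregate(partials):
        """Parameter server adds the non-empty worker partials into [g_l]."""
        partials = [p for p in partials if p is not None]
        return sum(partials[1:], partials[0])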
FIG. 3 is an interaction diagram of a joint training method based on federated learning according to an embodiment of the present invention. The training method is illustrated with vertical federated learning in (X, X, Y) mode, where X denotes feature data and Y denotes label data. Assume that all participants have completed the sample alignment process before federated learning. Participants A_j (j = 1, 2, ..., |A|), the active parties, each own part of the label data Y; participants B_m (m = 1, 2, ..., |B|), the passive parties, own the feature data X_i, where i identifies different samples, j identifies different active parties, |A| denotes the number of active parties, m identifies different passive parties, and |B| denotes the number of passive parties. In this embodiment A_1 serves as the coordinator and A_j (j ≠ 1) are the other active parties. Taking XGBoost as the model, the joint modeling steps are as follows:
(1) A_1 generates a homomorphic-encryption public/private key pair <s, d>, broadcasts the private key d to A_j (j ≠ 1), and broadcasts the public key s to B_m.
(2) A_j calculates the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] of the samples it owns, and sends [g_ji] and [h_ji] to the parameter server (PS, ParamServer) of B_m.
(3) The PS of B_m sums [g_ji] and [h_ji] based on the SPDZ protocol to obtain [g_i] and [h_i]; the PS of B_m partitions [g_i] and [h_i] across 3 workers, denoted w1, w2 and w3;
each worker calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the PS of B_m aggregates the [g_l] and [h_l] of each worker to obtain the ciphertexts [g_l] and [h_l] of the left-subtree gradient of the current node, and broadcasts the ciphertexts to A_j.
(4) A_j decrypts [g_l] and [h_l], calculates the split gain, split feature, and split threshold from g_l and h_l, and sends the split information to A_1.
(5) A_1 calculates the globally optimal split feature and split threshold from the split information and sends the split information to the corresponding B_m.
(6) B_m partitions the sample space according to the split information and adds a record to the lookup table (creating the lookup table first if it does not yet exist); B_m then broadcasts R_n and the sample space of the left subtree to A_j, where R_n is the record id in the lookup table (the data-structure sketch after this example illustrates the lookup table and the node association).
(7) A_j splits the current node according to the received sample space and associates the current node with (B_m, R_n).
(8) Repeat (1)–(7) until the tree-building termination condition is reached.
The preset termination condition comprises:
the maximum split gain of the node is smaller than or equal to a set gain threshold; or,
the number of samples contained in a leaf node is smaller than a set number threshold; or,
the tree depth of the boosted tree equals the set depth threshold.
The present embodiment does not limit these thresholds in any way; they may be set according to actual needs.
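As referenced in step (6) above, this sketch shows one plausible shape for the data structures in steps (6)–(7): the passive party's lookup table keeps the sensitive (feature, threshold) pair under a record id R_n, while the active party's tree node stores only the opaque pair (B_m, R_n). All class and field names, and the toy values, are assumptions for illustration.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class PassiveLookupTable:
        """Held by the passive party; split details never leave it."""
        records: dict = field(default_factory=dict)
        next_id: int = 0

        def add(self, feature, threshold):
            rid = self.next_id               # R_n, broadcast to the active parties
            self.records[rid] = (feature, threshold)
            self.next_id += 1
            return rid

    @dataclass
    class ActiveTreeNode:
        """Held by an active party; stores only which party/record to query."""
        owner_party: str                     # B_m
        record_id: int                       # R_n
        left: Optional["ActiveTreeNode"] = None
        right: Optional["ActiveTreeNode"] = None

    # Step (6): the passive party records the split and obtains R_n.
    table = PassiveLookupTable()
    rid = table.add(feature="income", threshold=35000.0)
    # Step (7): the active party associates the current node with (B_m, R_n).
    node = ActiveTreeNode(owner_party="B_1", record_id=rid)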
Take the Construction Bank and the Agricultural Bank as the active parties and an insurance company as the passive party. The Construction Bank owns part of the label data and the Agricultural Bank owns part of the label data, the labels being: the user has a lending crisis, or the user has no lending crisis. The insurance company owns the feature data: the user's age and the user's income. Assume the Construction Bank is both an active party and the coordinator, the Agricultural Bank is an active party, and the insurance company is the passive party.
Taking XGBoost as the model, the joint modeling steps are as follows:
(1) The Construction Bank server generates a homomorphic-encryption public/private key pair <s, d>, broadcasts the private key d to the Agricultural Bank, and broadcasts the public key s to the insurance company.
(2) The Construction Bank server and the Agricultural Bank server each calculate the first and second derivatives of the sample labels they own and homomorphically encrypt them to obtain [g_1i], [h_1i] and [g_2i], [h_2i], which they send to the insurance company server.
(3) The insurance company server sums [g_1i] and [g_2i] based on the SPDZ protocol to obtain [g_i], and sums [h_1i] and [h_2i] to obtain [h_i]; the insurance company server partitions [g_i] and [h_i] across 3 worker servers, denoted w1, w2 and w3;
each worker calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the insurance company server aggregates the [g_l] and [h_l] of each worker to obtain the left-subtree gradient ciphertexts [g_l] and [h_l] of the current node, and broadcasts [g_l] and [h_l] to the Construction Bank server and the Agricultural Bank server.
(4) The Construction Bank server and the Agricultural Bank server decrypt [g_l] and [h_l], calculate the split gain, split feature, and split threshold from g_l and h_l, and send the split information to the Construction Bank server.
(5) The Construction Bank server calculates the globally optimal split feature and split threshold from the split information and sends the split information to the insurance company server.
(6) The insurance company server partitions the sample space according to the split information and adds a record to the lookup table (creating the lookup table first if it does not yet exist); it then broadcasts R_n and the sample space of the left subtree to the Construction Bank server and the Agricultural Bank server, where R_n is the record id in the lookup table.
(7) The Construction Bank server and the Agricultural Bank server split the current node according to the received sample space and associate the current node with (insurance company, R_n).
(8) Repeat (1)–(7) until the tree-building termination condition is reached.
An embodiment of the invention further provides a joint training system based on federated learning. The system includes at least two active parties and at least one passive party; the passive party owns feature data, each active party owns part of the label data, any one of the at least two active parties is selected as the coordinator, and all active and passive parties have completed sample alignment.
Note that the above are only preferred embodiments of the invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the invention has been described in detail through the above embodiments, the invention is not limited to them and may include other equivalent embodiments without departing from the concept of the invention, the scope of which is determined by the appended claims.

Claims (9)

1. A joint training method based on federated learning, applied to a joint training system based on federated learning, wherein the system includes at least two active parties and at least one passive party, the passive party owns feature data, each active party owns part of the label data, any one of the at least two active parties is selected as the coordinator, and all active and passive parties have completed sample alignment, the method comprising:
each active party calculates the first and second derivatives of each sample, homomorphically encrypts them, and sends the ciphertexts to the parameter server of the passive party;
based on a secure multi-party computation protocol, the parameter server of the passive party sums, over the sample dimension, the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] from each active party, obtaining the aggregated first-derivative ciphertext [g_i] and second-derivative ciphertext [h_i] of each sample;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train a boosted tree model from [g_i] and [h_i].
2. The method according to claim 1, wherein, based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinating the active and passive parties to train the boosted tree model from [g_i] and [h_i] comprises:
the parameter server of the passive party computes, from [g_i] and [h_i], the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the active party decrypts [g_l] and [h_l], calculates the split information of the current node from g_l and h_l, and sends the split information to the coordinator;
the coordinator calculates the globally optimal split information from the received split information and sends it to the corresponding passive party;
the passive party partitions the sample space according to the split information, adds a record of the node split information to a lookup table, and then broadcasts the record id and the sample space of the record to the active parties;
the active party splits the current node according to the received sample space and associates the current node with the passive party and the record id;
the child nodes produced by splitting the current node are then taken as parent nodes, and the above steps are repeated until a preset termination condition is reached.
3. The method of claim 2, wherein the preset termination condition comprises:
the maximum split gain of the node is smaller than or equal to a set gain threshold;
or,
the number of samples contained in a leaf node is smaller than a set number threshold;
or,
the tree depth of the boosted tree equals the set depth threshold.
4. The method according to claim 1, wherein, based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinating the active and passive parties to train the boosted tree model from [g_i] and [h_i] comprises:
the passive party calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node, and broadcasts [g_l] and [h_l] to different active parties according to the computing-power resources of each active party;
each active party calculates split information of the current node from g_l and h_l and sends it to the coordinator, and the coordinator calculates the globally optimal split information;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model according to the globally optimal split information.
5. The method according to claim 1, wherein, after the coordinator coordinates the active and passive parties to train the boosted tree model from [g_i] and [h_i] based on the precision-lossless privacy-preserving tree boosting algorithm, the method further comprises:
when calculating an evaluation index on the validation set, calculating a local evaluation index value based on the labels owned by each active party;
and performing a statistical operation on the corresponding index values based on the secure multi-party computation protocol, thereby obtaining evaluation index information based on all labels.
6. The method according to claim 1, wherein, based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinating the active and passive parties to train the boosted tree model from [g_i] and [h_i] comprises:
the parameter server of the passive party partitions [g_i] and [h_i] across at least two worker servers;
each worker server calculates the ciphertexts [g_l] and [h_l] of the left-subtree gradients of the current node;
the parameter server of the passive party aggregates the [g_l] and [h_l] of each worker server;
based on the precision-lossless privacy-preserving tree boosting algorithm, the coordinator coordinates the active and passive parties to train the boosted tree model from [g_l] and [h_l].
7. The method of claim 1, wherein the secure multi-party computation protocol comprises: the SPDZ protocol supporting two-party secure computation, or the NPDZ protocol supporting multi-party secure computation.
8. The method of claim 7, wherein, based on the secure multi-party computation protocol, the parameter server of the passive party summing, over the sample dimension, the first-derivative ciphertexts [g_ji] and second-derivative ciphertexts [h_ji] from each active party to obtain the aggregated first-derivative ciphertext [g_i] and second-derivative ciphertext [h_i] comprises:
acquiring the number of participants, and selecting a target protocol from the SPDZ protocol and the NPDZ protocol according to the number of participants;
based on the target protocol, summing, over the sample dimension, the first and second derivatives from the coordinator and the different active parties, respectively.
9. A joint training system based on federated learning, wherein the system includes at least two active parties and at least one passive party, the passive party owns feature data, each active party owns part of the label data, any one of the at least two active parties is selected as the coordinator, and all active and passive parties have completed sample alignment, and wherein the system applies the joint training method based on federated learning according to any one of claims 1 to 8.
CN202310065357.7A 2023-02-06 2023-02-06 Joint training method and system based on federated learning Pending CN116029392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310065357.7A CN116029392A (en) 2023-02-06 2023-02-06 Joint training method and system based on federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310065357.7A CN116029392A (en) 2023-02-06 2023-02-06 Joint training method and system based on federated learning

Publications (1)

Publication Number Publication Date
CN116029392A true CN116029392A (en) 2023-04-28

Family

ID=86079333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310065357.7A Pending CN116029392A (en) Joint training method and system based on federated learning

Country Status (1)

Country Link
CN (1) CN116029392A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117675411A (en) * 2024-01-31 2024-03-08 智慧眼科技股份有限公司 Global model acquisition method and system based on longitudinal XGBoost algorithm
CN117675411B (en) * 2024-01-31 2024-04-26 智慧眼科技股份有限公司 Global model acquisition method and system based on longitudinal XGBoost algorithm



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination