CN113591115A - Method for batch normalization in logistic regression model for safe federal learning - Google Patents
- Publication number
- CN113591115A (application CN202110890465.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- data
- training
- calculation parameters
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention relates to the field of federated learning, and in particular discloses a method for batch normalization in a logistic regression model for secure federated learning, comprising the following steps. S1, preparation of training data by all federated learning participants: the participants may comprise multiple parties P1, P2, ..., Pn, holding training data data_1, data_2, ..., data_n respectively, wherein one party holds the label y (assume the labeled party is P1), and a coordinator is responsible for gradient decryption. S2, fusing the data sets: in vertical federated learning, data intersection should first be performed, i.e. the sample IDs commonly owned by all participants are screened out to form a fused data set fuse_data. S3, training the logistic regression model in vertical federated learning. S4, prediction. On the one hand, the method avoids model-training failures caused by excessively large, non-converging gradients in secure federated logistic regression modeling, improving the training success rate of the model.
Description
Technical Field
The invention relates to the field of federated learning, and in particular to a method for batch normalization in a logistic regression model for secure federated learning.
Background
Machine learning refers to the process of using algorithms to guide a computer to construct a reasonable model from known data and to use that model to judge new situations; it plays a very important role in applications such as web search, online advertising, product recommendation, mechanical-failure prediction, insurance pricing, and financial risk management. Traditionally, machine learning models are trained on a centralized corpus of data, which may be collected from one or several data providers. Although parallel distributed algorithms have been developed to speed up training, the training data itself is still collected centrally and stored in one data center.
In May 2018, the European Union's General Data Protection Regulation (GDPR) raised privacy-protection requirements to a new level, and many other laws and regulations concerning private data have since been published. Sharing data through the previous platform mechanisms is therefore challenged, and data collection for machine learning faces serious privacy problems: because the data used for training is often sensitive, it may come from multiple owners with different privacy requirements, and this privacy problem limits the amount of data actually available.
Federated learning stipulates that all data stay local, so that privacy is not disclosed and regulations are not violated; a virtual common model is built by combining the data of multiple participants into a system of joint benefit. Concretely, each party's data never leaves its local premises, and the common model is built by exchanging parameters under an encryption mechanism, without violating data-privacy regulations. As a modeling method that guarantees data security, federated learning has huge application prospects in industries such as sales and finance. In these industries, data cannot be aggregated directly for machine-learning model training, owing to factors including intellectual property, privacy protection, and data security; a federated model must instead be trained via federated learning.
In the logistic regression model of vertical federated learning, the approximation of the sigmoid function used when computing on homomorphically encrypted data cannot bound its output from above or below; during model computation, the sum of products of model parameters and input features is likewise unbounded, so when this sum grows too large the model's gradient keeps increasing and fails to converge.
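A minimal sketch (not part of the patent) of why an unbounded polynomial approximation is a problem: the degree-3 Taylor expansion of the sigmoid, a common choice when computing under homomorphic encryption (where only additions and multiplications are available), tracks the true sigmoid near zero but leaves the (0, 1) range entirely once |WX| grows — exactly the regime in which the gradient stops converging.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_taylor(z):
    # Degree-3 Taylor expansion of the sigmoid around 0; unlike the
    # true sigmoid, its output is unbounded for large |z|.
    return 0.5 + z / 4.0 - z ** 3 / 48.0

# Near zero the approximation is accurate:
print(sigmoid(0.5), sigmoid_taylor(0.5))   # both about 0.622
# For a large |W.x| it falls far outside (0, 1):
print(sigmoid(10.0), sigmoid_taylor(10.0)) # 0.99995... vs about -17.8
```

Keeping WX in a small range around zero — which is what the batch normalization below achieves — keeps the approximation in its accurate regime.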
Disclosure of Invention
The invention aims to provide a method for batch normalization in a logistic regression model for secure federated learning, so as to solve the problems raised in the background art above.
To achieve this aim, the invention provides the following technical scheme:
A method for batch normalization in a logistic regression model for secure federated learning, comprising the following steps:
S1, preparation of training data by all federated learning participants: the participants may comprise multiple parties P1, P2, ..., Pn, holding training data data_1, data_2, ..., data_n respectively, wherein one party holds the label y (assume the labeled party is P1), and a coordinator is responsible for gradient decryption;
S2, fusing the data sets: in vertical federated learning, data intersection should first be performed, i.e. the sample IDs commonly owned by all participants are screened out to form a fused data set fuse_data;
S3, training the logistic regression model in vertical federated learning: the specific training method comprises the following steps,
S301, in each training round, every participant of the joint training locally forms a data set X_Pn = [x_1, x_2, ..., x_m] ∈ R^m from the intersected sample IDs and features, together with its own model parameters W_Pn ∈ R^m;
S302, each participant takes a batch B[X] ∈ R^(b×m) of size b from its own data and multiplies it with the model parameters W along the feature dimension m, obtaining the first-step model calculation parameters WX_b ∈ R^(b×1);
S303, each participant computes, from the first-step model calculation parameters, the batch mean μ_b and variance v_b²;
S304, the mean μ_b and variance v_b² of the current batch are placed into a sliding-window queue of length k that records the means and variances of previous batches, M(μ) and M(v²);
S305, if the number of entries stored in the sliding-window queue exceeds the set value, the entry at the head of the queue is dequeued;
S306, the moving averages of the means and variances of the model calculation parameters in the sliding-window queue are computed, obtaining M(μ) and M(v²), and WX_b is normalized according to the following formula to obtain the normalized model calculation parameters:
BN(WX_b) = α · (WX_b − M(μ)) / √(M(v²) + ε) + β
where ε is a small settable constant added to prevent division by zero, and [α, β] are the normalization parameters with initial values [1, 0];
S307, the model calculation parameters obtained in step S302 are normalized using the mean and variance obtained in step S306, yielding the normalized model calculation parameters;
S308, every participant except P1 homomorphically encrypts its normalized model calculation parameters BN(WX_b), obtaining E_Pn[BN(WX_b)], and sends the encrypted result to P1;
S309, P1 sums its own normalized model calculation parameters with the encrypted ones received, obtaining E[WX], and uses its local supervision information (the label y) to compute the encrypted model's gradient calculation parameters E[h(WX)] = f(E[WX]) − y and P1's gradient E[g_P1] = E[h(WX)] · x_i;
S310, P1 sends the gradient calculation parameters E[h(WX)] to the other participants and its gradient E[g_P1] to the coordinator;
S311, the other participants compute their respective gradients E[g_Pn] using the gradient calculation parameters and send them to the coordinator;
S312, the coordinator decrypts each participant's gradient to obtain g_Pn and distributes it back to each participant;
S313, each participant uses the decrypted gradient g_Pn to update its model parameters [W, α, β];
S314, when the model reaches the maximum number of iterations, training terminates; otherwise, the process repeats from step S302;
S315, after training finishes, every participant saves its model parameters W_Pn together with the moving mean M(μ) and variance M(v²) of the model calculation parameters from the sliding-window queue;
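The batch-normalization core of steps S302–S307 can be sketched as follows. This is an illustrative plain-Python rendering under assumed names (`SlidingWindowBN`, `k`, `eps` are not from the patent), not the patented implementation; a `deque` with `maxlen=k` implements the head-of-queue eviction of step S305 automatically.

```python
import math
from collections import deque

class SlidingWindowBN:
    """Sketch of S303-S307: normalize a batch of model calculation
    parameters WX_b with moving statistics kept in a sliding-window
    queue of length k. Names are illustrative, not from the patent."""

    def __init__(self, k=5, eps=1e-5):
        self.window = deque(maxlen=k)      # S305: full queue evicts its head
        self.alpha, self.beta = 1.0, 0.0   # normalization parameters, initial [1, 0]
        self.eps = eps

    def __call__(self, wx):
        b = len(wx)
        mu = sum(wx) / b                              # S303: batch mean mu_b
        var = sum((x - mu) ** 2 for x in wx) / b      # S303: batch variance v_b^2
        self.window.append((mu, var))                 # S304: enqueue current stats
        m_mu = sum(m for m, _ in self.window) / len(self.window)   # S306: M(mu)
        m_var = sum(v for _, v in self.window) / len(self.window)  # S306: M(v^2)
        # S306/S307: BN(WX_b) = alpha * (WX_b - M(mu)) / sqrt(M(v^2) + eps) + beta
        return [self.alpha * (x - m_mu) / math.sqrt(m_var + self.eps) + self.beta
                for x in wx]

# One participant's batch of model calculation parameters WX_b:
bn = SlidingWindowBN(k=3)
out = bn([2.0, 4.0, 6.0])
print(out)  # roughly zero-mean, unit-variance values
```

Each party runs this locally on its own WX_b before encryption, so the values exchanged in S308 already lie in a bounded range.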
S4, prediction: the specific prediction steps are as follows:
S401, the new user's sample IDs are intersected across the participants;
S402, the new user's data features X are matrix-multiplied with the model parameters W_Pn to obtain the model calculation parameters WX, which are normalized using the mean and variance of the model calculation parameters saved during training;
the normalized model calculation parameters are thus obtained, and the prediction results and prediction probabilities are computed by the formula y_pred = f(BN(WX)), where f denotes the sigmoid function (or its approximation).
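A sketch of prediction step S402, assuming the statistics M(μ) and M(v²) saved at the end of training are available locally; all names and values are illustrative, not from the patent.

```python
import math

def predict(x, w, m_mu, m_var, alpha=1.0, beta=0.0, eps=1e-5):
    """S402 sketch: compute WX for a new user, normalize it with the
    statistics saved at the end of training, then apply the sigmoid f
    to obtain the prediction probability. Illustrative names only."""
    wx = sum(wi * xi for wi, xi in zip(w, x))         # model calculation parameter WX
    wx_norm = alpha * (wx - m_mu) / math.sqrt(m_var + eps) + beta
    prob = 1.0 / (1.0 + math.exp(-wx_norm))           # y_pred = f(BN(WX))
    return (1 if prob >= 0.5 else 0), prob

label, prob = predict(x=[1.0, 2.0], w=[0.3, -0.1], m_mu=0.0, m_var=1.0)
print(label, prob)
```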
Compared with the prior art, the invention has the following beneficial effects: on the one hand, the method avoids model-training failures caused by excessively large, non-converging gradients in secure federated logistic regression modeling and improves the training success rate of the model; on the other hand, it changes the distribution of the original data, making it harder for other participants to infer the original data's distribution shape, thereby improving security.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a technical solution: a method for batch normalization in a logistic regression model for secure federated learning, comprising the following steps:
S1, preparation of training data by all federated learning participants: the participants may comprise multiple parties P1, P2, ..., Pn, holding training data data_1, data_2, ..., data_n respectively, wherein one party holds the label y (assume the labeled party is P1), and a coordinator is responsible for gradient decryption;
S2, fusing the data sets: in vertical federated learning, data intersection should first be performed, i.e. the sample IDs commonly owned by all participants are screened out to form a fused data set fuse_data;
S3, training the logistic regression model in vertical federated learning: the specific training method comprises the following steps,
S301, in each training round, every participant of the joint training locally forms a data set X_Pn = [x_1, x_2, ..., x_m] ∈ R^m from the intersected sample IDs and features, together with its own model parameters W_Pn ∈ R^m;
S302, each participant takes a batch B[X] ∈ R^(b×m) of size b from its own data and multiplies it with the model parameters W along the feature dimension m, obtaining the first-step model calculation parameters WX_b ∈ R^(b×1);
S303, each participant computes, from the first-step model calculation parameters, the batch mean μ_b and variance v_b²;
S304, the mean μ_b and variance v_b² of the current batch are placed into a sliding-window queue of length k that records the means and variances of previous batches, M(μ) and M(v²);
S305, if the number of entries stored in the sliding-window queue exceeds the set value, the entry at the head of the queue is dequeued;
S306, the moving averages of the means and variances of the model calculation parameters in the sliding-window queue are computed, obtaining M(μ) and M(v²), and WX_b is normalized according to the following formula to obtain the normalized model calculation parameters:
BN(WX_b) = α · (WX_b − M(μ)) / √(M(v²) + ε) + β
where ε is a small settable constant added to prevent division by zero, and [α, β] are the normalization parameters with initial values [1, 0];
S307, the model calculation parameters obtained in step S302 are normalized using the mean and variance obtained in step S306, yielding the normalized model calculation parameters;
S308, every participant except P1 homomorphically encrypts its normalized model calculation parameters BN(WX_b), obtaining E_Pn[BN(WX_b)], and sends the encrypted result to P1;
S309, P1 sums its own normalized model calculation parameters with the encrypted ones received, obtaining E[WX], and uses its local supervision information (the label y) to compute the encrypted model's gradient calculation parameters E[h(WX)] = f(E[WX]) − y and P1's gradient E[g_P1] = E[h(WX)] · x_i;
S310, P1 sends the gradient calculation parameters E[h(WX)] to the other participants and its gradient E[g_P1] to the coordinator;
S311, the other participants compute their respective gradients E[g_Pn] using the gradient calculation parameters and send them to the coordinator;
S312, the coordinator decrypts each participant's gradient to obtain g_Pn and distributes it back to each participant;
S313, each participant uses the decrypted gradient g_Pn to update its model parameters [W, α, β];
S314, when the model reaches the maximum number of iterations, training terminates; otherwise, the process repeats from step S302;
S315, after training finishes, every participant saves its model parameters W_Pn together with the moving mean M(μ) and variance M(v²) of the model calculation parameters from the sliding-window queue;
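The encrypted gradient exchange of steps S308–S313 can be walked through with plain numbers. In this toy the `E`/`D` placeholders merely mark where an additively homomorphic scheme such as Paillier would encrypt and where the coordinator would decrypt; no real cryptography is performed, and all values are illustrative.

```python
# Toy walk-through of the message flow in S308-S313 (no real crypto).
E = lambda v: v          # placeholder: homomorphic encryption E[.]
D = lambda v: v          # placeholder: the coordinator's decryption

def sigmoid_approx(z):
    # f(.) applied to encrypted data must be a polynomial; a linear
    # approximation of the sigmoid is used here for illustration.
    return 0.5 + z / 4.0

# S308: parties other than P1 encrypt their normalized BN(WX_b) and send to P1
bn_p1, bn_p2 = 0.8, -0.3
enc_sum = bn_p1 + E(bn_p2)            # S309: P1 sums the shares -> E[WX]

y = 1.0                               # P1's local label (supervision information)
enc_h = sigmoid_approx(enc_sum) - y   # S309: E[h(WX)] = f(E[WX]) - y

x_p1, x_p2 = 1.5, 2.0                 # each party's local feature value
enc_g1 = enc_h * x_p1                 # S309/S310: E[g_P1], sent to the coordinator
enc_g2 = enc_h * x_p2                 # S311: the other party's E[g_Pn]

g1, g2 = D(enc_g1), D(enc_g2)         # S312: coordinator decrypts and distributes
lr, w1 = 0.1, 0.5
w1 -= lr * g1                         # S313: each party updates its parameters
print(g1, g2, w1)
```

Because the normalized values entering this exchange are bounded, the gradients g_Pn stay bounded as well, which is the convergence benefit the patent claims.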
S4, prediction: the specific prediction steps are as follows:
S401, the new user's sample IDs are intersected across the participants;
S402, the new user's data features X are matrix-multiplied with the model parameters W_Pn to obtain the model calculation parameters WX, which are normalized using the mean and variance of the model calculation parameters saved during training;
the normalized model calculation parameters are thus obtained, and the prediction results and prediction probabilities are computed by the formula y_pred = f(BN(WX)), where f denotes the sigmoid function (or its approximation).
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (1)
1. A method for batch normalization in a logistic regression model for secure federated learning, characterized by comprising the following steps:
S1, preparation of training data by all federated learning participants: the participants may comprise multiple parties P1, P2, ..., Pn, holding training data data_1, data_2, ..., data_n respectively, wherein one party holds the label y (assume the labeled party is P1), and a coordinator is responsible for gradient decryption;
S2, fusing the data sets: in vertical federated learning, data intersection should first be performed, i.e. the sample IDs commonly owned by all participants are screened out to form a fused data set fuse_data;
S3, training the logistic regression model in vertical federated learning: the specific training method comprises the following steps,
S301, in each training round, every participant of the joint training locally forms a data set X_Pn = [x_1, x_2, ..., x_m] ∈ R^m from the intersected sample IDs and features, together with its own model parameters W_Pn ∈ R^m;
S302, each participant takes a batch B[X] ∈ R^(b×m) of size b from its own data and multiplies it with the model parameters W along the feature dimension m, obtaining the first-step model calculation parameters WX_b ∈ R^(b×1);
S303, each participant computes, from the first-step model calculation parameters, the batch mean μ_b and variance v_b²;
S304, the mean μ_b and variance v_b² of the current batch are placed into a sliding-window queue of length k that records the means and variances of previous batches, M(μ) and M(v²);
S305, if the number of entries stored in the sliding-window queue exceeds the set value, the entry at the head of the queue is dequeued;
S306, the moving averages of the means and variances of the model calculation parameters in the sliding-window queue are computed, obtaining M(μ) and M(v²), and WX_b is normalized according to the following formula to obtain the normalized model calculation parameters:
BN(WX_b) = α · (WX_b − M(μ)) / √(M(v²) + ε) + β
where ε is a small settable constant added to prevent division by zero, and [α, β] are the normalization parameters with initial values [1, 0];
S307, the model calculation parameters obtained in step S302 are normalized using the mean and variance obtained in step S306, yielding the normalized model calculation parameters;
S308, every participant except P1 homomorphically encrypts its normalized model calculation parameters BN(WX_b), obtaining E_Pn[BN(WX_b)], and sends the encrypted result to P1;
S309, P1 sums its own normalized model calculation parameters with the encrypted ones received, obtaining E[WX], and uses its local supervision information (the label y) to compute the encrypted model's gradient calculation parameters E[h(WX)] = f(E[WX]) − y and P1's gradient E[g_P1] = E[h(WX)] · x_i;
S310, P1 sends the gradient calculation parameters E[h(WX)] to the other participants and its gradient E[g_P1] to the coordinator;
S311, the other participants compute their respective gradients E[g_Pn] using the gradient calculation parameters and send them to the coordinator;
S312, the coordinator decrypts each participant's gradient to obtain g_Pn and distributes it back to each participant;
S313, each participant uses the decrypted gradient g_Pn to update its model parameters [W, α, β];
S314, when the model reaches the maximum number of iterations, training terminates; otherwise, the process repeats from step S302;
S315, after training finishes, every participant saves its model parameters W_Pn together with the moving mean M(μ) and variance M(v²) of the model calculation parameters from the sliding-window queue;
S4, prediction: the specific prediction steps are as follows:
S401, the new user's sample IDs are intersected across the participants;
S402, the new user's data features X are matrix-multiplied with the model parameters W_Pn to obtain the model calculation parameters WX, which are normalized using the mean and variance of the model calculation parameters saved during training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110890465.9A CN113591115A (en) | 2021-08-04 | 2021-08-04 | Method for batch normalization in logistic regression model for safe federal learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113591115A true CN113591115A (en) | 2021-11-02 |
Family
ID=78254874
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115292738A | 2022-10-08 | 2022-11-04 | 豪符密码检测技术(成都)有限责任公司 | Method for detecting security and correctness of federated learning model and data
CN115292738B | 2022-10-08 | 2023-01-17 | 豪符密码检测技术(成都)有限责任公司 | Method for detecting security and correctness of federated learning model and data
Legal Events
Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination