CN113947214A - Client knowledge distillation-based federal learning implementation method - Google Patents

Client knowledge distillation-based federal learning implementation method

Info

Publication number
CN113947214A
CN113947214A (Application CN202111410414.8A)
Authority
CN
China
Prior art keywords
client
model
server
training
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111410414.8A
Other languages
Chinese (zh)
Inventor
王建新
王殊
刘渊
张德文
聂璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Sanxiang Bank Co Ltd
Original Assignee
Hunan Sanxiang Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Sanxiang Bank Co Ltd filed Critical Hunan Sanxiang Bank Co Ltd
Priority to CN202111410414.8A priority Critical patent/CN113947214A/en
Publication of CN113947214A publication Critical patent/CN113947214A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 20/00 Payment architectures, schemes or protocols
    • G06Q 20/38 Payment protocols; Details thereof
    • G06Q 20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q 20/401 Transaction verification
    • G06Q 20/4016 Transaction verification involving fraud or risk level assessment in transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a federated learning implementation method based on client-side knowledge distillation. The method comprises the following steps: when a new communication round starts, the server sends a model parameter set to each client, the set being formed from the model parameters submitted to the server by all clients in the previous communication round; after each client receives the model parameter set, it starts the local model parameter training of the current communication round, during which feature information from the other clients is migrated into its local model; when a client finishes the training of the current communication round, it sends the parameters of its local model to the server; after all clients have sent back their new model parameters, the server generates a new set from them for the training of the next communication round. The method reduces the number of communication rounds required for federated modeling without reducing model performance, and improves the efficiency of joint modeling with multi-institution data.

Description

Client knowledge distillation-based federal learning implementation method
Technical Field
The invention relates to the technical field of federated learning, and in particular to a federated learning implementation method based on client-side knowledge distillation.
Background
With the further development of big-data technology, data privacy and security have become a global hot issue. In 2018, the well-known social media company Facebook was penalized by regulators for allowing certain data-analysis companies to collect large amounts of user data for group analysis without user permission, which caused Facebook's stock price to drop and triggered large-scale protests. Also in 2018, the European Union passed the General Data Protection Regulation (GDPR) to return control over personal data to citizens and residents and to simplify the regulatory environment for international business within the EU. Relevant departments in China have likewise issued related laws, such as the Personal Information Protection Law and the Data Security Law, to ensure that citizens retain control over their personal information. These laws and regulations raise users' requirements for data security and privacy to a new level, making it more difficult to collect and use users' private data.
Beyond the difficulty of collecting users' personal data, competition and privacy-protection concerns across industries mean that it is no longer practical to gather the independent data scattered across different organizations into one complete data set and train a model on it. Because such widely distributed data cannot be used in a centralized manner, these isolated, distributed data sets are called data islands. Faced with individual data islands, how to use widely distributed data safely and legally for joint modeling has become a key step in deploying artificial intelligence systems.
To address the data-island problem, research institutions have proposed various approaches for indirectly accessing data islands. The federated learning approach proposed by McMahan et al. attempts to perform joint modeling with decentralized data under privacy protection. Google's cross-device federated learning method uses input-method data stored on users' mobile devices around the world to learn a word-prediction model while protecting user privacy, and Sheller et al. treat the medical data of multiple institutions as multiple data islands and obtain a semantic segmentation model by joint modeling with federated learning. The core steps of these methods are as follows: clients log in to the server; the server selects the set of clients that will run training in the current communication round; the server sends the latest global model parameters of the current round to each selected client; each selected client then trains in parallel on its own local private data, obtains new model parameters, and sends them back to the server; finally, the server computes a weighted average of the model parameters sent back by the selected clients and updates the global model parameters. Fig. 1 shows the workflow of the classical federated learning approach.
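For concreteness, the weighted-average aggregation step of the classical workflow described above can be sketched as follows (a minimal NumPy illustration of federated averaging in the spirit of McMahan et al., not the method of this invention; the per-layer parameter layout and the use of local sample counts as weights are illustrative assumptions):

    import numpy as np

    def fedavg_aggregate(client_params, client_sizes):
        """Weighted average of client model parameters (classical federated averaging).

        client_params: list of dicts mapping layer name -> np.ndarray
        client_sizes:  number of local samples per client, used as aggregation weights
        """
        total = float(sum(client_sizes))
        global_params = {}
        for name in client_params[0]:
            # weight each client's tensor by its share of the total data
            global_params[name] = sum(
                (n / total) * params[name]
                for params, n in zip(client_params, client_sizes)
            )
        return global_params

    # usage: two clients with a single-layer model
    clients = [{"w": np.ones((2, 2))}, {"w": 3 * np.ones((2, 2))}]
    print(fedavg_aggregate(clients, client_sizes=[100, 300])["w"])  # every entry is 2.5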
Federated learning makes joint modeling across data islands possible, but when facing multi-institution data sets the classical federated learning method needs many communication rounds to reach the expected performance, and more communication rounds mean more data exchanged between the server and the clients, i.e. a larger communication burden. How to increase the convergence speed of the model and reduce the number of communication rounds, without sacrificing the performance of the global model, is therefore one of the goals of federated learning methods.
Disclosure of Invention
The invention overcomes the shortcomings of existing federated learning methods when facing multi-organization data (each organization is regarded as a client participating in the federated training). It implements a federated learning method that performs knowledge distillation at the client: each client uses the models sent by the other clients in the previous communication round and extracts the knowledge information they contain in order to improve its own local model. The invention reduces the number of communication rounds required for federated modeling without reducing model performance, and improves the efficiency of joint modeling with multi-institution data.
The core of the invention involves a server side and a client side; the procedure is as follows:
Step 1: clients connect to the server; once the number of connected clients meets the requirement, the learning process starts. Let t denote the current communication round, initialized to t = 0.
Step 2: each connected client uploads the initial parameters of its local model to the server, and the server stores these initial parameters as a parameter set. Let θ_i denote the parameters sent to the server by client i.
Step 3: the server increments the communication round, t ← t + 1, and sends to each client k the parameter set of the previous round with client k's own parameters removed,
θ^{t-1}_{-k} = θ^{t-1} \ {θ_k},
where θ^{t-1} = {θ_1, θ_2, …, θ_K} is the set of parameters submitted by all K clients in round (t - 1). The clients then run the following steps in parallel:
S1: client k receives the parameter set θ^{t-1}_{-k} sent by the server, then uses its local data X_i to obtain the output of its local model, f(X_i; θ_k).
S2: the client evaluates each model in the received parameter set on its local data and averages the outputs, obtaining the distribution E_k(X_i, t), defined as
E_k(X_i, t) = (1/(N - 1)) Σ_{j≠k} σ(f(X_i; θ_j))    (1)
where N is the number of clients and σ is the softmax activation function.
S3: the client uses the KL divergence (Kullback-Leibler divergence) D_KL(p || q) to measure the similarity between two distributions p and q. The similarity between the predicted distribution and the averaged distribution, measured with the KL divergence, is defined as
L_d = D_KL(σ(f(X_i; θ_k)) || E_k(X_i, t))    (2)
where σ is the activation function. The loss L_d forms the first part of the overall loss function of client k; reducing this KL divergence helps migrate the information of the other clients' data, contained in the parameter set, into the current client.
S4: the client computes the task-related loss, which forms the second part of the overall loss function. For a semantic segmentation task, the Dice coefficient between the predicted distribution and the data labels is used as the task loss; for a classification task, the cross-entropy loss is used.
S5: the client minimizes the loss function to update the parameters θ_k of its local model.
S6: the client sends the updated model parameters θ_k back to the server.
After all clients have sent back their new model parameters, the server updates the parameter set θ^{t-1} by replacing each client's old parameters with its new ones; the updated set is used in the next communication round.
Step 4: if the maximum number of communication rounds is reached, training ends; otherwise, return to step 3.
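A minimal sketch of the server-side loop in steps 1-4 is given below. It assumes a synchronous, in-process simulation; the client objects and their init_params/local_train interface are hypothetical placeholders introduced only for illustration and are not part of the disclosure. In a real deployment the inner loop would be replaced by network communication with the clients.

    def run_server(clients, max_rounds):
        """Steps 1-4: maintain the parameter set and drive the communication rounds."""
        # steps 1-2: collect the initial parameters theta_k of every connected client
        theta = [c.init_params() for c in clients]
        for t in range(1, max_rounds + 1):
            new_theta = list(theta)
            for k, client in enumerate(clients):
                # step 3: send client k the previous-round set with its own entry removed
                others = [p for j, p in enumerate(theta) if j != k]
                # S1-S6 run on the client, which returns its updated parameters
                new_theta[k] = client.local_train(others)
            # the server replaces each client's old parameters with the new ones
            theta = new_theta
        return theta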
The invention has the following beneficial effects. The federated learning implementation method based on client knowledge distillation exploits the information contained in the model parameters that the clients send to the server, and uses this information during each client's local training, thereby improving the performance of the client's local model, effectively reducing the number of communication rounds required by the federated learning method and accelerating the training process. The information is migrated into the local model with a knowledge distillation technique: the models coming from the other clients act as teacher models, and the client's own local model acts as the student model. Whereas traditional federated learning improves the performance of the global model on each client by aggregating models at the server, here each client directly improves the performance of its local model through local knowledge distillation; this direct update of the local model further reduces the time required for joint modeling while improving model performance.
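As background to the teacher/student transfer just mentioned, standard knowledge distillation matches the student's softened output distribution to the teacher's. The following generic NumPy sketch illustrates only that background idea; the temperature value and logits are illustrative, and this is not the client-side loss of the invention, which is given by formulas (1)-(4) below.

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = np.asarray(logits, dtype=float) / temperature
        z = z - z.max()                      # for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def kl_divergence(p, q, eps=1e-12):
        """D_KL(p || q) between two discrete probability distributions."""
        p, q = np.asarray(p), np.asarray(q)
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    # distillation loss: push the student's softened distribution toward the teacher's
    teacher_logits = [2.0, 1.0, 0.1]
    student_logits = [1.5, 0.8, 0.3]
    T = 2.0  # softening temperature (illustrative)
    print(kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T)))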
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments are briefly described below; it will be apparent to those skilled in the art that other drawings can be derived from these drawings without inventive effort.
FIG. 1 is a schematic workflow diagram of a classical federated learning approach;
FIG. 2 is a schematic diagram of the federated learning implementation of the present invention;
FIG. 3 is a diagram of a server sending a set of parameters to a client;
FIG. 4 is a schematic illustration of a client performing local knowledge distillation;
FIG. 5 is a schematic diagram of an application example of the method under financial wind control.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments and the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention and not restrictive of its full scope. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention. The technical solutions provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic workflow diagram of the classical federated learning method. One of the goals of federated learning is to build a general global model that performs well on every client, so the global model needs to exploit the information that each client extracts from its own data. How to exploit such client data information has been studied extensively in traditional centralized learning, and one of the resulting techniques is knowledge distillation. In knowledge distillation there are two models: a teacher model with a huge number of parameters and a student model with relatively few parameters. The goal is to use the strongly expressive teacher model to transfer the knowledge it contains to the student model, so that the student can reach performance similar to the teacher's with a much smaller number of parameters. In the present invention, before performing its local computation, each client uses the parameter information of the other clients obtained from the server to migrate the knowledge contained therein into its local model, thereby improving the performance of the local model.
The invention provides a new federated learning implementation to address the slow convergence (i.e., the large number of communication rounds required) of federated learning methods when training on cross-silo, multi-institution data sets. In the invention, each client obtains from the server the models that the other clients have sent to the server, regards them as teacher models, and then uses knowledge distillation to migrate the information contained in those models into its own local model. To help readers better understand the objective and workflow of the invention, its technical details are described below with reference to the accompanying drawings. In Fig. 2 there are K clients with their local models; in the initial stage each client transmits its model to the server, and the server saves the model parameters of these clients as a set. Assume the current task is image semantic segmentation: the local data of client k consists of image data X_i and corresponding label data Y_i, stored as the data set D_k, and during training the client reads data-label pairs (X_i, Y_i) from D_k in order. The models used by the clients all share the same structure, and each client k has an independent local model with parameters θ_k.
Step 1: the server receives the initialized parameters from the clients and stores them as θ^0; set t = 0.
Step 2: set t ← t + 1. The server sends to each client k the parameter set of the other clients, θ^{t-1}_{-k} = θ^{t-1} \ {θ_k}; this process is shown in Fig. 3. The clients then run training in parallel; for any client k, the local model θ_k is updated as follows:
S1: from the data set D_k of the current client k, read a data-label pair (X_i, Y_i), where i is the index of the data, i ∈ {1, …, |D_k|};
S2: after the client receives the parameter set sent by the server, two loss terms are computed. With reference to Fig. 3, the steps are as follows:
A: based on X_i, obtain the predicted output of the current client's local model, f(X_i; θ_k);
B: based on X_i, obtain the average output E_k(X_i, t) of the other clients' models using formula (1), where σ is the Softmax activation function;
C: compute the KL divergence L_d using formula (2) as one part of the overall loss function L;
D: the second part of the loss is the task loss L_task; in the image segmentation task the Dice coefficient is commonly used as the loss, defined as
L_task = 1 - (2 Σ p·y) / (Σ p + Σ y)    (3)
where p = f(X_i; θ_k) is the predicted distribution, y = Y_i is the corresponding label, and the sums run over all pixels;
E: the overall loss function is defined as
L = α L_task + (1 - α) L_d    (4)
where α is a weighting factor for the loss function;
F: update the client's local model θ_k using the loss function, as follows:
θ_k ← θ_k - η ∇_{θ_k} L    (5)
where η is the learning rate.
S3: repeat steps A to F until i = |D_k|;
S4: when client k finishes training, send the new θ_k back to the server.
Step 3: when the server receives the new parameters θ_k from client k, it replaces the old θ_k in θ^{t-1}. Once the parameters sent by all clients have been received and θ^{t-1} has been updated, jump to step 2.
Consider financial risk control based on horizontal federated learning as an example. Financial institutions face a shortage of samples in anti-fraud scenarios, and horizontal federated learning can be applied to anti-fraud across the banking industry. The invention therefore provides a federated learning implementation that can build a high-quality financial risk-control model within the banking industry while protecting user privacy. For example, bank A holds the transaction data and anti-fraud data of users in one region, while bank B holds the transaction data and anti-fraud data of users in another region. To optimize the anti-fraud models of both parties while protecting the privacy of each bank's data, cross-institution modeling can be performed with the present method; the workflow is shown in Fig. 5. Legend: banks A and B each obtain the latest model of the other bank (B and A, respectively) stored on the server, train with their local data, update their local models, and upload the latest model parameters to the server.
Unlike the traditional federated learning algorithm, which constructs a general global model by computing a weighted average of model parameters on the server side, the proposed method improves model performance by performing knowledge distillation locally at each client. Feedback from this example shows that the method effectively improves learning efficiency and reduces the number of communication rounds required for joint modeling with multi-institution data, without reducing model performance.
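Putting formulas (1)-(4) together, the per-sample loss computed during a client's local knowledge distillation can be sketched as follows. This is a NumPy illustration under the assumption of a generic model function f(x, theta) that returns class scores; the toy linear model, the KL direction as reconstructed in formula (2), and the weighting alpha = 0.5 are illustrative, and the gradient step of formula (5) would in practice be performed by the training framework.

    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def kl(p, q, eps=1e-12):
        """D_KL(p || q) between two discrete distributions."""
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def soft_dice_loss(pred, target, eps=1e-6):
        """1 minus the (soft) Dice coefficient, as in formula (3)."""
        inter = np.sum(pred * target)
        return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

    def client_loss(x, y, theta_k, other_thetas, f, alpha=0.5):
        """Distillation term plus task term for one sample, formulas (1)-(4)."""
        pred = softmax(f(x, theta_k))                            # sigma(f(X_i; theta_k))
        # formula (1): average softened output of the other clients' models
        ensemble = np.mean([softmax(f(x, th)) for th in other_thetas], axis=0)
        loss_d = kl(pred, ensemble)                              # formula (2)
        loss_task = soft_dice_loss(pred, y)                      # formula (3)
        return alpha * loss_task + (1.0 - alpha) * loss_d        # formula (4)

    # toy usage: a linear "model" with 3 classes and 2 input features
    f = lambda x, theta: theta @ x
    x = np.array([1.0, 2.0])
    y = np.array([0.0, 1.0, 0.0])          # one-hot label
    theta_k = np.random.randn(3, 2)
    others = [np.random.randn(3, 2) for _ in range(3)]
    print(client_loss(x, y, theta_k, others, f))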
An embodiment of the invention also provides a storage medium storing a computer program which, when executed by a processor, implements some or all of the steps of the embodiments of the client knowledge distillation-based federated learning implementation method provided by the invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (6)

1. A federated learning implementation method based on client knowledge distillation, characterized by comprising the following steps:
when a new communication round starts, the server sends a model parameter set to each client, the model parameter set being formed from the model parameters submitted to the server by all clients in the previous communication round;
after each client receives the model parameter set sent by the server, it starts the local model parameter training of the current communication round, the local model being the model stored locally at the client; during local model parameter training, feature information from other clients is migrated into the local model;
when a client finishes the training of the current communication round, it sends the parameters of its local model to the server;
after all clients have sent back their new model parameters, the server generates a new set from the new model parameters for training in the next communication round.
2. The method of claim 1, wherein the server sending a model parameter set to each client at the start of a new communication round comprises:
before the new communication round t begins, the server sends to each client k the parameter set formed from the model parameters sent to the server by the other clients in the previous communication round (t - 1):
θ^{t-1}_{-k} = θ^{t-1} \ {θ_k}
wherein θ_k is the model parameter sent by client k to the server in communication round (t - 1), and θ^{t-1} is the set of model parameters sent by all clients to the server after the previous communication round (t - 1).
3. The method of claim 2, wherein, after each client receives the model parameter set sent by the server, starting the local model parameter training of the current communication round comprises:
after receiving the model parameter set sent by the server, any client k in communication round t defines the feature information E_k(X_i, t) coming from the other clients as:
E_k(X_i, t) = (1/(N - 1)) Σ_{j≠k} σ(f(X_i; θ_j))
wherein N is the number of all clients, X_i is a data sample owned by the client, σ is the activation function, and f(X_i; θ_k) is the predicted distribution output by the local model.
4. The method of claim 1, wherein migrating the feature information from other clients into the local model during local model parameter training comprises:
at any client k in communication round t, the defined feature information E_k(X_i, t) from the other clients is migrated into the local model parameters θ_k by measuring, with the KL divergence, the similarity between the predicted distribution and the averaged distribution:
L_d = D_KL(σ(f(X_i; θ_k)) || E_k(X_i, t))
wherein σ is the activation function and D_KL is the KL divergence.
5. The method of claim 4, wherein sending the parameters of the local model to the server when the client finishes the training of the current communication round comprises:
at any client k in communication round t, sending the updated parameters θ_k to the server.
6. The method of claim 5, wherein, after all clients have sent back new model parameters, the server generating a new set from the new model parameters for training of the next communication round comprises:
the server uses the parameters θ_1, θ_2, …, θ_N sent back by the clients to replace the corresponding entries of θ^{t-1} and starts the training of the next communication round, wherein θ_i denotes the parameters sent to the server by client i.
CN202111410414.8A 2021-11-23 2021-11-23 Client knowledge distillation-based federal learning implementation method Pending CN113947214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111410414.8A CN113947214A (en) 2021-11-23 2021-11-23 Client knowledge distillation-based federal learning implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111410414.8A CN113947214A (en) 2021-11-23 2021-11-23 Client knowledge distillation-based federal learning implementation method

Publications (1)

Publication Number Publication Date
CN113947214A true CN113947214A (en) 2022-01-18

Family

ID=79338692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111410414.8A Pending CN113947214A (en) 2021-11-23 2021-11-23 Client knowledge distillation-based federal learning implementation method

Country Status (1)

Country Link
CN (1) CN113947214A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115423540A (en) * 2022-11-04 2022-12-02 中邮消费金融有限公司 Financial model knowledge distillation method and device based on reinforcement learning
CN115423540B (en) * 2022-11-04 2023-02-03 中邮消费金融有限公司 Financial model knowledge distillation method and device based on reinforcement learning
CN117437039A (en) * 2023-12-21 2024-01-23 湖南三湘银行股份有限公司 Commercial bank loan wind control method based on longitudinal federal learning
CN117437039B (en) * 2023-12-21 2024-04-30 湖南三湘银行股份有限公司 Commercial bank loan wind control method based on longitudinal federal learning

Similar Documents

Publication Publication Date Title
Chen et al. Asynchronous online federated learning for edge devices with non-iid data
Liu et al. Competing bandits in matching markets
CN111931062B (en) Training method and related device of information recommendation model
CN110347932B (en) Cross-network user alignment method based on deep learning
CN111428147A (en) Social recommendation method of heterogeneous graph volume network combining social and interest information
CN113947214A (en) Client knowledge distillation-based federal learning implementation method
CN110852447A (en) Meta learning method and apparatus, initialization method, computing device, and storage medium
EP4350572A1 (en) Method, apparatus and system for generating neural network model, devices, medium and program product
US11514368B2 (en) Methods, apparatuses, and computing devices for trainings of learning models
CN112436992B (en) Virtual network mapping method and device based on graph convolution network
CN115344883A (en) Personalized federal learning method and device for processing unbalanced data
CN111428127A (en) Personalized event recommendation method and system integrating topic matching and two-way preference
Ma et al. Adaptive distillation for decentralized learning from heterogeneous clients
CN112085158A (en) Book recommendation method based on stack noise reduction self-encoder
CN117236421B (en) Large model training method based on federal knowledge distillation
CN112487305B (en) GCN-based dynamic social user alignment method
CN116757262B (en) Training method, classifying method, device, equipment and medium of graph neural network
Ostonov et al. Rlss: A deep reinforcement learning algorithm for sequential scene generation
CN116992151A (en) Online course recommendation method based on double-tower graph convolution neural network
CN116957106A (en) Federal learning model training method based on dynamic attention mechanism
WO2023169167A1 (en) Model training method and apparatus, and device and storage medium
CN117009674A (en) Cloud native API recommendation method integrating data enhancement and contrast learning
WO2020151017A1 (en) Scalable field human-machine dialogue system state tracking method and device
CN116431915A (en) Cross-domain recommendation method and device based on federal learning and attention mechanism
CN115470520A (en) Differential privacy and denoising data protection method under vertical federal framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination