CN115495771A

CN115495771A - Data privacy protection method and system based on self-adaptive adjustment weight

Info

Publication number: CN115495771A
Application number: CN202210798075.3A
Authority: CN
Inventors: 陈益强; 何雨婷; 杨晓东; 于汉超
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2022-07-06
Filing date: 2022-07-06
Publication date: 2022-12-20

Abstract

The invention provides a data privacy protection method and system based on adaptively adjusted weights, which addresses the degraded model performance and slower convergence caused by non-independent and identically distributed (Non-IID) data, and belongs to the technical field of federated learning applications. The method comprises the following steps: at the start of each round of federated communication, the server evaluates the class-level credibility of the global model using an auxiliary data set, and issues the credibility matrix and the global model parameters to the clients participating in that round; each client evaluates the sample-level credibility of the global model on its local private data set, weights the knowledge distillation using the class credibility and the sample credibility, dynamically guides the training of the local model, and uploads the updated local model parameters to the server; and the server performs weighted aggregation of the local model parameters to update the global model.

Description

Data privacy protection method and system based on self-adaptive adjustment weight

Technical Field

The invention relates to the technical field of federated learning and data security, and in particular to a federated learning method and system based on client-side selective knowledge distillation.

Background

Conventional machine learning techniques have been successfully applied to computer vision, natural language processing, recommendation systems, and automatic control. As artificial intelligence is applied across industries, concern about user privacy and data security continues to grow. Countries are steadily strengthening the protection of data security and privacy: for example, the European Union's General Data Protection Regulation (GDPR) took effect in 2018, and China passed the Personal Information Protection Law of the People's Republic of China in 2021. Because of the privacy restrictions imposed by these laws and regulations, data in fields such as medicine, enterprise, and the military is scattered across isolated data islands. Federated Learning (FL), which has emerged in recent years, enables secure sharing of multi-party data by transmitting model parameters instead of raw data. On the one hand, user privacy and data security are well protected because the data never leaves the local device; on the other hand, joint training can make full use of the local private data of each client, which solves the data island problem.

The environments of the participating clients (different users, devices, organizations, and so on) are inherently heterogeneous, so the data in federated learning is non-independent and identically distributed (Non-IID). Non-IID data is a pressing frontier problem in federated learning, and label distribution shift is particularly significant in practical application scenarios. Client data heterogeneity can cause local training to deviate seriously from the global objective, so that the updates diverge (weight divergence). Therefore, one research challenge of Non-IID federated learning lies in constraining the direction of model updates during each client's local training, so that the model learns from local private data while preserving the knowledge of the global model. One prior approach adds a correction term to the local loss function so that the local update does not deviate too far from the global model, where the correction term is the L2 distance between the local model and the global model of the previous round. Another observes that a global model trained on the complete data set learns better representations than a local model trained on a skewed subset; on this basis, a contrastive learning loss term is added to the local loss function to reduce the distance between the representation learned by the local model and that learned by the global model, and to increase the distance between the representation learned by the local model and that learned by the previous local model. A further approach adopts Elastic Weight Consolidation (EWC) to mitigate catastrophic forgetting in federated learning, adding a penalty term to the local loss function that discourages changes to the local model parameters that are important for the global task.

The above methods all use the global model to constrain the update direction of the client's local model, so as to prevent the updated local model from differing too much from the global model. However, they share two drawbacks. First, they do not adaptively adjust the weight between the correction term and the task loss term in the local objective function. If the weight of the correction term is too large, the current federated round cannot learn new knowledge; if it is too small, the optimization direction of the current round deviates from the global objective. The weight therefore has to be tuned very carefully to approach the ideal model, and this trial-and-error tuning consumes a great deal of time and effort. Second, they do not consider that a poorly performing global model can mislead the local model into updating in the wrong direction, especially for clients that did not participate in the previous federated round. Because the clients participating in each round change dynamically and their data is Non-IID, the aggregated global model has different representation capabilities for different classes. At the beginning of federated learning, the global model has not yet learned good representations, and local training should focus on learning from local private data rather than on retaining global knowledge. In the middle and later stages of federated communication, the global model performs better than the local model on specific classes, and the local model should selectively retain global knowledge at the class level and the sample level.

Most existing federated learning algorithms are based on the classical Federated Averaging (FedAvg) algorithm, which uses a traditional Client-Server (C-S) architecture and splits distributed training into multiple iterations of client-side local training and server-side parameter aggregation. As shown in FIG. 1, during client-side local training, each client downloads the model from the server and then trains it for multiple epochs on its local private data set; during server-side parameter aggregation, the server receives the updated model parameters from the clients and averages them, using each client's total sample size as its weight. Assume there are N clients in total, with local private data sets $D = \{D_1, D_2, \ldots, D_N\}$. The objective function $L$ of the global model $w$ is the weighted average of the objective functions $L_i$ of the clients' local models:

$$L(w) = \sum_{i=1}^{N} q_i L_i(w), \qquad q_i = \frac{|D_i|}{\sum_{j=1}^{N} |D_j|},$$

where $q_i$ is the aggregation weight of client i and $|D_i|$ is the total sample size of the local private data set of client i.
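
The sample-size-weighted averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes each client's parameters have already been flattened into a list of floats, and the function name `fedavg_aggregate` is hypothetical.

```python
from typing import List, Sequence


def fedavg_aggregate(client_params: Sequence[Sequence[float]],
                     sample_counts: Sequence[int]) -> List[float]:
    """Average flattened client parameters with weights q_i = |D_i| / sum_j |D_j|."""
    if not client_params or len(client_params) != len(sample_counts):
        raise ValueError("need one sample count per client")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, n_i in zip(client_params, sample_counts):
        q_i = n_i / total  # aggregation weight of this client
        for j in range(dim):
            aggregated[j] += q_i * params[j]
    return aggregated
```

For example, two clients holding 1 and 3 samples contribute with weights 0.25 and 0.75, so the client with more data pulls the global parameters further toward its own update.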

In practical federated learning scenarios, the local data of each client is usually Non-IID; in particular, the label distribution P(y) may differ across clients. When each client trains on its own skewed local data set, the directions in which the models fit the samples are inconsistent, so each local model deviates from the global objective and suffers "catastrophic forgetting" of global knowledge, which in turn makes the aggregated global model deviate too far from the ideal model. On the one hand, the global model's ability to extract features differs across classes: it is strong for the majority classes in the local private data of the currently selected clients, so the credibility of the global model's output logits (the inputs to the softmax activation function with which the neural network outputs its predictions) is higher on the majority-class channels than on the minority-class channels. On the other hand, the output logits of the global model also depend on the specific training sample. When a training sample belongs to one of these minority classes, or when a majority-class sample has atypical features, the global model cannot extract the sample's features effectively, so the credibility of the output logits is low on every class channel.

In summary, how to selectively retain global knowledge and adaptively adjust the local update direction during local training, thereby improving the generalization and convergence speed of the model, is the focus of this research.

Disclosure of Invention

The invention aims to overcome the defects that, when facing Non-IID data, existing federated learning methods cannot adaptively adjust the update direction of the local model and do not consider changes in the performance of the global model or its performance differences between classes and between samples, and provides a data privacy protection method based on adaptively adjusted weights, which comprises the following steps:

step 1, inputting an auxiliary data set marked with class labels into a global model to obtain the classification precision of the global model on each class of data, and using the classification precision as a class credibility matrix;

step 2, at least one client acquires the global model from the cloud and initializes its local model with it, the client locally holding a local private data set whose samples correspond to class labels; inputting the local private data set into the local model to obtain the output of the local model and the classification loss; inputting the local private data set into the global model held locally at the client to obtain the output of the global model and its classification precision on each sample of the local private data, as a sample credibility matrix; obtaining the distillation loss according to the sample credibility matrix and the class credibility matrix; training the local model based on a total loss consisting of the classification loss and the distillation loss; and uploading the trained model parameters of the local model to the cloud;

step 3, the cloud performs weighted aggregation on the received model parameters to obtain a new model.

In the data privacy protection method based on adaptive adjustment weight, step 3 further comprises: replacing the global model with the new model and executing steps 1 to 3 in a loop until the total loss converges or a preset number of iterations is reached, whereupon the training and updating of the global model is finished and the current global model is saved as the final model for classifying or predicting specified data.

In the data privacy protection method based on adaptive adjustment weight, the distillation loss is

$$L_{SSD} = \mathbb{E}_{x}\left[ M(x) \odot \left( z^{g}(x) - z(x) \right)^{2} \right],$$

where $M(x)$ is a class-related weight vector, $\odot$ denotes element-wise multiplication, the square is taken element-wise, and the weighted errors are averaged over the class channels; $z^{g}$ is the output of the global model, and $z$ is the output of the local model;

the value at position $k_1$ of the weight vector $M$ is:

$$M(x)[k_1] = M_{max} \cdot \left[ M_{class}[k_1]\, M_{sample}(x) - 0.1 \right]^{+},$$

$$M_{sample}(x) = 1 - \left( 1 - p^{g}(x)[k_2] \right)^{0.5};$$

where sample $x$ belongs to class $k_2$,

$$M_{class}[k_1] = \frac{A_{k_1,k_1}}{\sum_{k} A_{k,k_1}},$$

$A_{k_1,k_1}$ is the recall of class $k_1$, and $A_{k,k_1}$ is the probability that class $k$ is mispredicted as class $k_1$; $p^{g}(x)[k_2]$ is the probability with which the global model correctly predicts that sample $x$ belongs to class $k_2$, and $M_{max}$ is the upper limit of the distillation loss term.

According to the data privacy protection method based on the self-adaptive adjustment weight, the client is a medical institution data center or a financial institution data center, and the new model is used for classifying input images or predicting risks of transaction data.

The invention also provides a data privacy protection system based on adaptively adjusted weights, which comprises:

the cloud, configured to input the auxiliary data set marked with class labels into the global model to obtain the classification precision of the global model on each class of data, as a class credibility matrix, and to obtain a new model by performing weighted aggregation on the received model parameters;

the client, configured to acquire the global model from the cloud and initialize its local model with it, the client locally holding a local private data set whose samples correspond to class labels; to input the local private data set into the local model to obtain the output of the local model and the classification loss; to input the local private data set into the global model held locally at the client to obtain the output of the global model and its classification precision on each sample of the local private data, as a sample credibility matrix; to obtain the distillation loss according to the sample credibility matrix and the class credibility matrix; to train the local model based on a total loss consisting of the classification loss and the distillation loss; and to upload the trained model parameters of the local model to the cloud.

In the data privacy protection system based on adaptive adjustment weight, the cloud is further configured to: replace the global model with the new model, save the current global model as the final model, and classify or predict specified data.

In the data privacy protection system based on adaptive adjustment weight, the distillation loss is

$$L_{SSD} = \mathbb{E}_{x}\left[ M(x) \odot \left( z^{g}(x) - z(x) \right)^{2} \right],$$

where $M(x)$ is a class-related weight vector, $\odot$ denotes element-wise multiplication, the square is taken element-wise, and the weighted errors are averaged over the class channels; $z^{g}$ is the output of the global model, and $z$ is the output of the local model;

the value at position $k_1$ of the weight vector $M$ is:

$$M(x)[k_1] = M_{max} \cdot \left[ M_{class}[k_1]\, M_{sample}(x) - 0.1 \right]^{+},$$

$$M_{sample}(x) = 1 - \left( 1 - p^{g}(x)[k_2] \right)^{0.5};$$

where sample $x$ belongs to class $k_2$,

$$M_{class}[k_1] = \frac{A_{k_1,k_1}}{\sum_{k} A_{k,k_1}},$$

$A_{k_1,k_1}$ is the recall of class $k_1$, and $A_{k,k_1}$ is the probability that class $k$ is mispredicted as class $k_1$; $p^{g}(x)[k_2]$ is the probability with which the global model correctly predicts that sample $x$ belongs to class $k_2$, and $M_{max}$ is the upper limit of the distillation loss term.

According to the data privacy protection system based on the self-adaptive adjustment weight, the client is a medical institution data center or a financial institution data center, and the new model is used for classifying input images or predicting risks of transaction data.

The invention also provides a storage medium for storing a program for executing the data privacy protection method based on the self-adaptive adjustment weight.

The invention also provides a client used for any data privacy protection system based on the self-adaptive weight adjustment.

According to the scheme, the invention has the advantages that:

The method introduces a client-side selective knowledge distillation strategy into the local training process of federated learning: the credibility of the global model is evaluated at the class level and the sample level, and global knowledge is selectively distilled into the local model according to that credibility. The local model of each client thus learns from its local private data without deviating from the global model, which reduces the number of federated communication rounds, improves model performance, and accelerates convergence.

Drawings

FIG. 1 is a flow chart of federated learning in the prior art;

FIG. 2 is a flow chart of the federated learning of the present invention;

FIG. 3 is a schematic diagram of data distribution of clients;

FIG. 4 is a graph of comparative results on a reference data set;

FIG. 5 is a diagram of a confusion matrix for a global model over a test set.

Detailed Description

Before describing embodiments of the present invention in detail, some of the terms used therein will be explained as follows:

A client is a node that is served by the server. The clients may be a large number of mobile or Internet of Things devices, or different organizations (for example, government departments, medical institutions, financial institutions, or geographically distributed data centers), and the local private data is stored at the clients. The client in the embodiments of the invention is not limited to any particular application scenario.

The server (cloud) is a node that provides services to the clients. The central server is connected to every client and coordinates the clients to build a model jointly without leaking the original data.

The deep learning model is a deep neural network formed by connecting a plurality of processing units. The model updated at the server is called the global model, and the model updated at a client is called a local model.

The auxiliary data set is a public data set used to assist federated training. It may consist of a public data set related to the training task, or of a small amount of local data published by each client.

In view of the limitations and challenges of prior approaches, the invention proposes a federated learning method based on client-side selective knowledge distillation. Its key point is that the knowledge of the global model is selectively distilled into the local model, and the weight of the distillation loss during local training is adaptively adjusted according to the credibility of the global model at the class level and the sample level, which improves model performance and accelerates convergence. Since M_class and M_sample are recalculated in every federated round, and the weight M of the distillation loss is determined by both, the approach is called adaptive weight adjustment.

The invention comprises the following key technical points:

Key point 1: a federated learning method based on client-side knowledge distillation is introduced to address catastrophic forgetting during local training. The technical effect is that the knowledge of the global model is retained during local updates;

Key point 2: a client-side selective knowledge distillation strategy is introduced to address the global model's performance differences between classes and between samples. The class-level credibility of the global model is evaluated at the server using the auxiliary data set, the sample-level credibility of the global model is evaluated at each client on every sample of its local private data, and knowledge distillation is weighted by the class credibility and the sample credibility. The training of the local model is thus dynamically guided by the strategy of "following what is good and correcting what is not". The technical effect is that the local update process is adjusted adaptively.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

To address the above problems, the invention provides a client-based Selective Self-Distillation federated learning method (FedSSD), which takes the global model as the teacher and the local model as the student and selectively learns global representations from the global model, so that the local update of the model parameters is adaptively corrected and new knowledge is learned from local private data without forgetting global knowledge.

According to one embodiment of the invention, the auxiliary data set $D_V$ is constructed from a small portion of data shared by the clients with the server, or from a public data set. For example, a breast cancer histological image classification task may use a small portion of the public data set Camelyon17 as the auxiliary data set. In the t-th round of federated communication, the server first uses the auxiliary data set $D_V$ to evaluate the class-level credibility of the global model, represented by the confusion matrix $A^{t} \in \mathbb{R}^{K \times K}$, which is obtained by evaluating the precision of the global model $w^{t}$ on the auxiliary data set $D_V$. Here K is the number of classes and $A_{k_1,k_2}$ is the probability that the global model predicts class $k_1$ as class $k_2$. The credibility matrix $A^{t}$ and the global model $w^{t}$ are then issued to the clients participating in this federated round. A client may be a mobile phone terminal or an organization such as a hospital, a research institute, or a company.
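
The class-level evaluation on the auxiliary data set can be sketched as follows. This is a minimal illustration under stated assumptions: the global model's predictions on $D_V$ are taken as an already computed list of class indices, each row of the matrix is normalised by the number of true samples of that class, and the function name `confusion_matrix` is hypothetical rather than taken from the source.

```python
from typing import List, Sequence


def confusion_matrix(true_labels: Sequence[int],
                     predicted_labels: Sequence[int],
                     num_classes: int) -> List[List[float]]:
    """Row-normalised confusion matrix: A[k1][k2] = P(predicted k2 | true class k1)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for y, y_hat in zip(true_labels, predicted_labels):
        counts[y][y_hat] += 1
    matrix = []
    for row in counts:
        row_total = sum(row)
        # A class absent from the auxiliary set yields an all-zero row.
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return matrix
```

With this normalisation the diagonal entry `A[k][k]` is the recall of class k, which matches how the diagonal is described in the text.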

According to one embodiment of the invention, client i receives the credibility matrix $A^{t}$ and the global model $w^{t}$ issued by the server. It first initializes its local model with the global model, $w_i^{t} = w^{t}$, and then optimizes its local objective function $L_i$ on the local private data $D_i$ using the Stochastic Gradient Descent (SGD) algorithm. To prevent local training from causing catastrophic forgetting of the global model, both the classification loss and the distillation loss are optimized: $L_i = L_{CE,i} + L_{SSD,i}$.

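According to one embodiment of the invention, for each input sample x, the logits output by the global model are denoted $z^{g}(x)$, the logits output by the local model are denoted $z(x)$, and the predicted probabilities after the softmax layer are $p^{g}(x) = \mathrm{softmax}(z^{g}(x))$ and $p(x) = \mathrm{softmax}(z(x))$. The classification loss $L_{CE,i}$ is generally the cross-entropy loss,

$$L_{CE,i} = \mathbb{E}_{(x,y) \sim D_i}\left[ -\log p(x)[y] \right].$$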

The distillation loss is formed from the output logits of the global model $z^{g}$ ($p^{g}$ after softmax), the output logits of the local model $z$, and the weight $M$, which is composed of $M_{sample}$ and $M_{class}$. The output logits of the global model on the local data represent the global knowledge and can be regarded as an absolute estimate of how strongly the sample belongs to each class; after the softmax layer, the logits are converted into a relative estimate across classes. To decouple the prediction capability of the global model on each class, a weighted Mean Square Error (MSE) is used to align the logits output by the local model and the global model, instead of using the Kullback-Leibler (KL) divergence to align their predicted distributions. The selective distillation loss is therefore defined as

$$L_{SSD,i} = \mathbb{E}_{x \sim D_i}\left[ M(x) \odot \left( z^{g}(x) - z(x) \right)^{2} \right],$$

where $M(x)$ is a class-related weight vector, $\mathbb{E}$ denotes expectation, and $\odot$ denotes element-wise multiplication; the square is taken element-wise and the weighted errors are averaged over the class channels. Suppose sample x belongs to class $k_2$; the value at position $k_1$ of the vector M is:

$$M(x)[k_1] = M_{max} \cdot \left[ M_{class}[k_1]\, M_{sample}(x) - 0.1 \right]^{+},$$

$$M_{sample}(x) = 1 - \left( 1 - p^{g}(x)[k_2] \right)^{0.5},$$

$$M_{class}[k_1] = \frac{A_{k_1,k_1}}{\sum_{k} A_{k,k_1}},$$

where $[\cdot]^{+} = \max(\cdot, 0)$ guarantees that each element of M is greater than or equal to 0. $A_{k_1,k_1}$ is the recall of class $k_1$, and $A_{k,k_1}$ is the probability that class $k$ is mispredicted as class $k_1$. $p^{g}(x)[k_2]$ is the probability with which the global model correctly predicts that sample x belongs to class $k_2$. $M_{max}$ is the upper limit of the distillation loss term.
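
The per-sample weighting can be illustrated with a small pure-Python sketch. Two parts are assumptions rather than statements from the source: the class credibility is taken to be the precision-style ratio of the diagonal entry to its column sum in the confusion matrix, and the weighted squared logit errors are averaged over the class channels. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_credibility(confusion: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMED form: M_class[k1] = A[k1][k1] / sum_k A[k][k1]; 0 if k1 is never predicted."""
    num_classes = len(confusion)
    weights = []
    for k1 in range(num_classes):
        column_sum = sum(confusion[k][k1] for k in range(num_classes))
        weights.append(confusion[k1][k1] / column_sum if column_sum > 0 else 0.0)
    return weights


def sample_credibility(global_probs: Sequence[float], label: int) -> float:
    """M_sample(x) = 1 - (1 - p^g(x)[k2])^0.5, where k2 is the true class of x."""
    return 1.0 - max(0.0, 1.0 - global_probs[label]) ** 0.5


def distillation_weights(m_class: Sequence[float], m_sample: float,
                         m_max: float) -> List[float]:
    """M(x)[k1] = M_max * [M_class[k1] * M_sample(x) - 0.1]^+ ."""
    return [m_max * max(c * m_sample - 0.1, 0.0) for c in m_class]


def selective_distillation_loss(global_logits: Sequence[float],
                                local_logits: Sequence[float],
                                weights: Sequence[float]) -> float:
    """ASSUMED reduction: weighted squared logit error averaged over classes."""
    terms = [w * (zg - z) ** 2
             for w, zg, z in zip(weights, global_logits, local_logits)]
    return sum(terms) / len(terms)
```

In use, a client would compute `softmax` on the global logits of a sample, pass the result and the true label to `sample_credibility`, and combine it with `class_credibility` of the received matrix. When the global model is unreliable for a class or a sample, the product falls below the 0.1 threshold and the corresponding weight is clamped to zero, so that channel contributes nothing to the distillation loss.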

The framework of the overall process is shown in fig. 2. The process can be summarized as follows:

1. the server initializes the global model $w^{0}$ and calculates its class-level credibility on the auxiliary data set;

2. the server randomly selects C × N clients $S_t$ to participate in the current round of federated training and sends the global model parameters $w^{t}$ and the credibility matrix $A^{t}$ to these clients;

3. each selected client computes the sample-level credibility of the global model $w^{t}$ on its local private data set $D_i$, and trains and updates its local model by minimizing $L_i = L_{CE,i} + L_{SSD,i}$;

4. each client uploads its updated model parameters $w_i^{t+1}$ to the server;

5. the server performs weighted aggregation on the received model parameters,

$$w^{t+1} = \sum_{i \in S_t} \frac{|D_i|}{\sum_{j \in S_t} |D_j|}\, w_i^{t+1},$$

and calculates the credibility matrix $A^{t+1}$ of the updated global model $w^{t+1}$ on the auxiliary data set.

Steps 2 to 5 are repeated until the model converges.
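
One communication round of the process summarized above can be sketched as a skeleton in Python. It is an illustration only: the local training and the credibility evaluation are passed in as callables, parameters are flattened lists of floats, and the names `run_round`, `local_update`, and `evaluate_credibility` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Params = List[float]
Matrix = List[List[float]]


def run_round(global_params: Params,
              client_sizes: Sequence[int],
              fraction: float,
              local_update: Callable[[int, Params, Matrix], Params],
              evaluate_credibility: Callable[[Params], Matrix],
              rng: random.Random) -> Tuple[Params, List[int]]:
    """Run one federated round and return (new global parameters, selected client ids)."""
    # Server: class-level credibility of the current global model.
    credibility = evaluate_credibility(global_params)
    # Server: randomly select C x N clients (at least one).
    num_selected = max(1, int(fraction * len(client_sizes)))
    selected = rng.sample(range(len(client_sizes)), num_selected)
    # Clients: start from a copy of the global model and train locally.
    updates = [local_update(i, list(global_params), credibility) for i in selected]
    # Server: aggregate, weighting each client by its sample count.
    total = sum(client_sizes[i] for i in selected)
    new_global = [0.0] * len(global_params)
    for i, params in zip(selected, updates):
        weight = client_sizes[i] / total
        for j, value in enumerate(params):
            new_global[j] += weight * value
    return new_global, selected
```

Calling `run_round` repeatedly, feeding each returned parameter list back in as `global_params`, corresponds to repeating steps 2 to 5. Keeping the client-side training behind a callable leaves the choice of model and optimizer open.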

According to an embodiment of the invention, in practical application scenarios the final goal is to obtain a global model with strong generalization, which each client can download from the server for local inference. For example, in open government data scenarios, the method can break down the data islands between government departments and enable secure sharing of cross-department and social data; in biomedical scenarios, the method can effectively unite multiple hospitals for tasks such as disease prediction, medical image recognition, drug discovery, and gene sequencing; in financial scenarios, the method can train a joint credit risk-control model without leaking user information, thereby achieving more precise financial risk control.

According to one embodiment of the invention, a Dirichlet distribution ($P_k \sim \mathrm{Dir}(\delta)$) is used to construct Non-IID data distribution scenarios, where $P_{k,i}$ is the proportion of the samples of class k that is held by client i. δ is a hyperparameter that controls the degree of heterogeneity between clients: a smaller value means a more uneven distribution of client data and a greater degree of heterogeneity. The effectiveness of the method is verified on three public data sets: CIFAR10, CIFAR100, and TinyImageNet. δ = 0.5 is the default value, and an example of the data distribution of each client is shown in FIG. 3, where the abscissa is the client ID (10 clients by default), the ordinate is the class ID (the three data sets have 10, 100, and 200 classes, respectively), each rectangle represents the number of samples of that class held by the client, and a darker color indicates more samples. In addition, the data partitioning method of the FedAvg algorithm is also adopted, in which K classes are randomly allocated to each client, denoted here by #K = K. For CIFAR10, a model consistent with FedAvg is used, namely a simple CNN (two convolutional layers and two fully connected layers); ResNet50 is used for CIFAR100 and TinyImageNet.
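
A Dirichlet-based label partition of this kind can be sketched with the Python standard library alone. The sketch assumes one common construction: for every class, client proportions are drawn from a symmetric Dirichlet distribution (sampled via normalised Gamma variates) and the class's sample indices are split accordingly. The function name `dirichlet_partition` and its parameters are hypothetical, and the source does not state that the experiments used exactly this procedure.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        delta: float, seed: int = 0) -> List[List[int]]:
    """Split sample indices across clients with per-class proportions P_k ~ Dir(delta)."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(delta, 1) draws divided by their sum.
        gammas = [rng.gammavariate(delta, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total > 0:
            proportions = [g / total for g in gammas]
        else:
            proportions = [1.0 / num_clients] * num_clients
        start, cumulative = 0, 0.0
        for client_id in range(num_clients):
            cumulative += proportions[client_id]
            if client_id == num_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = min(max(round(cumulative * len(indices)), start), len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients
```

Smaller values of `delta` concentrate each class on a few clients, which reproduces the more heterogeneous settings described in the text; larger values approach an even split.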

The method is compared with five related works: the baseline FedAvg and four similar methods that add regularization terms to the local objective function, namely FedProx ("Federated Optimization in Heterogeneous Networks" by Li et al.), FedCurv ("Overcoming Forgetting in Federated Learning on Non-IID Data" by Shoham et al.), MOON ("Model-Contrastive Federated Learning" by Li et al.), and SCAFFOLD ("SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" by Karimireddy et al.). The comparison results on the benchmark data sets are shown in FIG. 4, where the first row shows the accuracy of the global model on the test set over rounds, and the second row shows the average accuracy of the clients' local models on the same test set. FedSSD outperforms the other methods on CIFAR10 and TinyImageNet; on CIFAR100 its global test accuracy is slightly lower than that of MOON, but its average local accuracy is higher. In the first few rounds of federated communication, the average local accuracy of FedSSD differs little from that of the baseline, because the global model has not yet learned good feature representations. Compared with the other methods, the average local accuracy of FedSSD is greatly improved on all three data sets, which shows that FedSSD can effectively retain global knowledge while learning from local private data. In addition to the above comparative experiments, the influence of the degree of data heterogeneity on FedSSD was analyzed; the global test accuracy is shown in Table 1. FedSSD performs well under data distributions with different degrees of heterogeneity.

Table 1

To further illustrate the effectiveness of the invention, the confusion matrix on the test set of the global model issued in a certain round is visualized in FIG. 5(b); a client is randomly selected, and its data distribution is shown in FIG. 5(a). The confusion matrices on the test set of the models obtained after training on the local private data with the baseline FedAvg and with the proposed FedSSD are shown in FIG. 5(c). The model after the FedAvg local update has poor performance on the

Claims

1. A data privacy protection method based on self-adaptive weight adjustment is characterized by comprising the following steps:

step 1, inputting an auxiliary data set marked with category labels into a global model to obtain the classification precision of the global model on each category of data, and using the classification precision as a class credibility matrix;

step 2, at least one client acquires the global model from a cloud, the client initializes a local model of the client, the client locally has a local private data set, and samples in the local private data set correspond to category labels; inputting the local private data set into the local model to obtain the output of the local model and the classification loss; inputting the local private data set into the global model local to the client to obtain the output of the global model and the classification precision of each sample in the local private data as a sample credibility matrix; obtaining distillation loss according to the sample credibility matrix and the class credibility matrix; training the local model based on a total loss consisting of the classification loss and the distillation loss; uploading the trained model parameters of the local model to the cloud;

step 3, the cloud performs weighted aggregation on the received model parameters to obtain a new model.

2. The data privacy protection method based on adaptive adjustment weight according to claim 1, wherein step 3 further comprises: replacing the global model with the new model and executing steps 1 to 3 in a loop until the total loss converges or a preset number of iterations is reached, whereupon the training and updating of the global model is finished and the current global model is saved as the final model for classifying or predicting specified data.

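3. The data privacy protection method based on adaptive adjustment weight according to claim 1, wherein the distillation loss is

$$L_{SSD} = \mathbb{E}_{x}\left[ M(x) \odot \left( z^{g}(x) - z(x) \right)^{2} \right],$$

where $M(x)$ is a class-related weight vector, $\odot$ denotes element-wise multiplication, the square is taken element-wise, and the weighted errors are averaged over the class channels; $z^{g}$ is the output of the global model, and $z$ is the output of the local model;

the value at position $k_1$ of the weight vector $M$ is:

$$M(x)[k_1] = M_{max} \cdot \left[ M_{class}[k_1]\, M_{sample}(x) - 0.1 \right]^{+},$$

$$M_{sample}(x) = 1 - \left( 1 - p^{g}(x)[k_2] \right)^{0.5};$$

where sample $x$ belongs to class $k_2$,

$$M_{class}[k_1] = \frac{A_{k_1,k_1}}{\sum_{k} A_{k,k_1}},$$

$A_{k_1,k_1}$ is the recall of class $k_1$, and $A_{k,k_1}$ is the probability that class $k$ is mispredicted as class $k_1$; $p^{g}(x)[k_2]$ is the probability with which the global model correctly predicts that sample $x$ belongs to class $k_2$, and $M_{max}$ is the upper limit of the distillation loss term.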

4. The adaptive weight adjustment based data privacy protection method according to claim 1, wherein the client is a medical institution data center or a financial institution data center, and the new model is used for classifying input images or performing risk prediction on transaction data.

5. A data privacy protection system based on self-adaptive weight adjustment is characterized by comprising:

the cloud, configured to input an auxiliary data set marked with category labels into a global model to obtain the classification precision of the global model on each category of data, as a class credibility matrix, and to obtain a new model by performing weighted aggregation on the received model parameters;

the client, configured to acquire the global model from the cloud and initialize its local model with it, the client locally holding a local private data set whose samples correspond to category labels; to input the local private data set into the local model to obtain the output of the local model and the classification loss; to input the local private data set into the global model held locally at the client to obtain the output of the global model and its classification precision on each sample of the local private data, as a sample credibility matrix; to obtain the distillation loss according to the sample credibility matrix and the class credibility matrix; to train the local model based on a total loss consisting of the classification loss and the distillation loss; and to upload the trained model parameters of the local model to the cloud.

6. The data privacy protection system based on adaptive adjustment weight according to claim 5, wherein the cloud is further configured to: replace the global model with the new model, save the current global model as the final model, and classify or predict specified data.

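7. The data privacy protection system based on adaptive adjustment weight according to claim 5, wherein the distillation loss is

$$L_{SSD} = \mathbb{E}_{x}\left[ M(x) \odot \left( z^{g}(x) - z(x) \right)^{2} \right],$$

where $M(x)$ is a class-related weight vector, $\odot$ denotes element-wise multiplication, the square is taken element-wise, and the weighted errors are averaged over the class channels; $z^{g}$ is the output of the global model, and $z$ is the output of the local model;

the value at position $k_1$ of the weight vector $M$ is:

$$M(x)[k_1] = M_{max} \cdot \left[ M_{class}[k_1]\, M_{sample}(x) - 0.1 \right]^{+},$$

$$M_{sample}(x) = 1 - \left( 1 - p^{g}(x)[k_2] \right)^{0.5};$$

where sample $x$ belongs to class $k_2$,

$$M_{class}[k_1] = \frac{A_{k_1,k_1}}{\sum_{k} A_{k,k_1}},$$

$A_{k_1,k_1}$ is the recall of class $k_1$, and $A_{k,k_1}$ is the probability that class $k$ is mispredicted as class $k_1$; $p^{g}(x)[k_2]$ is the probability with which the global model correctly predicts that sample $x$ belongs to class $k_2$, and $M_{max}$ is the upper limit of the distillation loss term.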

8. The adaptive weight adjustment based data privacy protection system of claim 5, wherein the client is a medical institution data center or a financial institution data center, and the new model is used for classifying input images or performing risk prediction on transaction data.

9. A storage medium storing a program for executing the adaptive weight-based data privacy protection method according to any one of claims 1 to 4.

10. A client terminal for use in the data privacy protection system based on the adaptive adjustment weight in any one of claims 5 to 8.