CN113361625A

CN113361625A - Error data detection method with privacy protection in federated learning scene

Info

Publication number: CN113361625A
Application number: CN202110696108.9A
Authority: CN
Inventors: 李向阳; 张兰; 李安然
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2021-09-07

Abstract

The invention discloses a method for detecting error data with privacy protection in a federated learning scene, which comprises the following steps: 1, constructing a training target of a federal model, 2, training the federal model, 3, testing the federal model, 4, detecting a user side containing error data, 5, detecting and deleting the error data, 6, retraining the federal learning model, and 7, testing the result of the error data. The method can efficiently detect the user side containing the error training data and the error training data in the federal learning process with privacy protection, and repair errors at a lower cost, so that the prediction accuracy of the federal learning is improved, the convergence speed of the federal learning model is accelerated, and the two proposed efficient detection algorithms can respectively save calculation resources and communication resources so as to meet the requirement of dynamic resource limitation of the federal learning.

Description

Error data detection method with privacy protection in federated learning scene

Technical Field

The invention relates to a wrong data detection method with privacy protection in a federal learning scene, belonging to the field of data safety and data quality evaluation.

Background

In recent years, artificial intelligence has raised a wave and a tide, and AI gradually enters the living aspect of people from face recognition, living body inspection, criminal case alarm, to alpha dog wars, human go chess, hand and plum stone, to unmanned driving, and to the generally-applied precise marketing. The AI model is obtained by training a large amount of high-quality data, and in real life, except a few big companies, most enterprises have the problems of small data volume and poor data quality, so that the realization of an artificial intelligence technology is not supported sufficiently; meanwhile, the domestic and foreign supervision environment also gradually strengthens data protection, so that data freely flows on the premise of safety compliance, and becomes a great trend; data owned by business companies often has great potential value from both a user and enterprise perspective. Two companies and even departments between the companies need to consider the exchange of benefits, and the organizations cannot provide the respective data to be aggregated with other companies, so that even in the same company, the data often appears in an isolated island form, and the federal study is born at the end of the year.

Federal learning enables a data provider to complete a learning task through parameter updating instead of original data after local training of a sharing model under the coordination of a cloud server. In the federal learning process, the local data quality of a user affects the performance of the global model, and a large amount of error data (e.g., error label data) can reduce the performance of the global model, for example, the model has slow convergence speed and low test precision.

There have been a series of works for data error detection for centralized deep learning, including robustness and interpretability analysis of models, which fall into two broad categories: model-based interpretability analysis and data-based interpretability analysis. Model-based interpretability analysis focuses on building a more robust model by perturbing hidden units of the model. Interpretability analysis based on the data tracks predictions of the model through a model learning algorithm and returns to the training data to determine the training data points that have the greatest impact on the given test data. The influence function value (influence function) of a single data point is used to approximate the true influence value obtained by retraining the model after removing the data point from the training data.

But existing model interpretable work cannot be used directly in FL systems: 1) the existing method is designed for centralized model training and needs to directly access original training data, while in a federal system, local data cannot be directly accessed by a third party due to protection of user data privacy; 2) even if local data is accessed in some way, existing impact function evaluations incur significant computational and communication overhead, which is unacceptable for resource-constrained clients in federated systems.

Disclosure of Invention

The invention provides a method for detecting error data with privacy protection in a federated learning scene to overcome the defects of the prior art, which aims to detect a user side containing error training data and the error training data in a federated learning process with high efficiency and privacy protection and repair errors at a lower cost, thereby improving the prediction accuracy of federated learning, accelerating the convergence speed of a federated learning model, saving computing resources and communication resources and meeting the requirement of dynamic resource limitation of federated learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention relates to a method for detecting error data with privacy protection in a federated learning scene, which is characterized in that a terminal server and K local user terminals { C }_k1,2, …, K, where C_kRepresents the kth local user end and the kth local user end C_kStore n_kA training sample, denoted as { z_k,i|i＝1,2,…,n_k}，z_k,iRepresents the kth local subscriber C_kThe ith training sample of (1); the terminal server stores a test data set Z_S(ii) a The error data detection method is carried out according to the following steps:

step 1, constructing a training target of a federal model:

constructing a loss function L (z, theta) of the federal model by using the formula (1):

in formula (1), θ represents a model parameter of federal learning after random initialization by using gaussian distribution, z represents training samples of K local clients, and F_k(theta) denotes the kth local subscriber terminal C_kAnd has the following average of the loss functions:

in the formula (2), L (z)_k,iAnd theta) represents the kth local subscriber C_kThe loss function of the ith training sample of (1);

step 2, training of a federal model:

step 2.1, defining and initializing the current training time t as 1; assigning global model parameter theta of federal learning to global model parameter theta of t training_t；

Step 2.2, in the training process of the t time, the terminal server randomly selects m local user terminals, and the global model parameter theta of the t training is used_tSending the data to the selected user side; the k selected local user terminal C_kIndependently calculating updated model parameters using equation (3)

And will be

Sending the model parameters to the terminal server, and storing the updated model parameters by the terminal server

And used as a training log;

in the formula (3), η represents the learning rate;

the local client C selected at the k-th time of the t + 1-th training_kThe model parameters of (1);

represents the kth local user terminal C during the t training_kThe gradient of the mean of the loss function of (a);

step 2.3, the terminal server utilizes the formula (4) to aggregate to obtain the global model parameter theta of the t +1 th training_t+1；

In equation (4), n represents the total number of samples of all local ues, i.e.

Step 2.4, after the t +1 is assigned to the t, the step 2.2 is returned to for execution sequentially until the global model parameter theta_tUntil convergence, to obtain the optimal global model parameters

Step 3, testing of a federal model:

the terminal server sends a test data set Z to the terminal server_SInput to the optimal global model parameters

Obtaining the global model parameters

A set of mispredicted test samples Z;

step 4, detecting a user side containing error data:

the terminal server calculates the kth local user terminal C by using the formula (5)_kDistance D of local update and global update_k：

In the formula (5), the reaction mixture is,

represents the kth local subscriber C_kWhether it is selected in the t-th training, if so

Indicates is selected if

Indicating not selected, N (k) indicating the k-th local user terminal C_kThe selected times in the training process from T/2 times to T times; theta_tRepresenting that the terminal server aggregates training logs uploaded by K local user terminals in the t training process to obtain a global model parameter of the t training;

if the kth local subscriber C_kDistance D of_kIf the ratio of the local subscriber terminal C to the median distances of all the local subscriber terminals is greater than the set distance threshold δ, it indicates that the kth local subscriber terminal C is_kThe client side contains error data and is marked as a negative influence client side;

and 5, detecting error data and deleting:

suppose the kth local user terminal C_kIf the local user side is a negative influence user side, the terminal server requires all local user sides to report available computing resources and communication resources of the local user sides, meanwhile, the computing resources and communication resources required by algorithm 1 with low communication overhead based on the differential privacy mechanism and algorithm 2 with high computing efficiency based on the differential privacy mechanism are estimated, and if the kth local user side C is a local user side C_kCan satisfy the calculationThe requirement of method 1 is that algorithm 1 is selected to calculate the ith training sample z_k,iInfluence function value of (I)_f(z_k,i) Otherwise, calculate the ith training sample z using Algorithm 2_k,iInfluence function value of (I)_f(z_k,i)；

If the value of the influence function I_f(z_k,i) If the ratio of the median of the impact function values of all negative impact clients is greater than the set impact threshold, the ith training sample z is represented_k,iIs an error sample, thereby to the k-th local user terminal C_kJudging all training samples to detect all error samples;

the terminal server sends a deleting command to the kth negative influence user terminal C_kSo that the kth negatively affects the ue C_kDeleting all error samples of the self;

step 6, retraining the federated learning model:

the terminal server adjusts the probability of each local user terminal being selected according to the influence function values of all the local user terminals, so that the terminal server cooperates with the local user terminals to update the global model parameters

Step 6.1, initializing t to 1;

step 6.2, in the t training process, making K local user terminals have equal initial selection probability P_t ¹,P_t ²,…,P_t ^k,…,P_t ^K；

Step 6.3, the terminal server selects probability P according to the initial_t ¹,P_t ²,…,P_t ^k,…,P_t ^KSelecting m local clients and participating in the training process shown in the step 2.2-step 2.3, so as to obtain global model parameters theta obtained by respectively uploading training logs in the t-th training process and aggregating the training logs after deleting all error samples of the K local clients_t', and the k-th selected bookGround user terminal C_kModel parameters θ 'at the time of t-th training after deleting all error samples of itself'_t ^k；

Step 6.4, in the t +1 training process, the terminal server updates the kth local user terminal C by using the formula (6)_kIs selected, is selected_t ^kTo obtain the kth local user end C in the t +1 training process_kIs selected from

In the formula (6), S_tRepresenting the local user terminal set selected by the terminal server in the t training process;

step 6.5, the terminal server trains the probability of the process according to the t +1

Selecting m local clients and participating in the training process shown in the step 2.2-step 2.3;

6.6, assigning t +1 to t, and returning to the step 6.4 to execute in sequence until the global model parameter theta_tUntil convergence, resulting in optimal global model parameters

Step 7, global model parameters are calculated

The mispredicted test sample set Z is input to the optimal global model parameters

In the prediction process, the optimal global model parameters are obtained

The mispredicted test sample set Z 'is determined to be theta' if the sample size of the test sample set Z 'does not meet the requirement of the terminal server'_tIs assigned to theta_t，θ′_t ^kIs assigned to

Step 4-step 7 are executed again, otherwise, it represents that the kth local user terminal C is detected_kAll error data.

The method for detecting error data with privacy protection in the federated learning scene is also characterized in that the algorithm 1 in the step 2 is carried out according to the following process:

in the t training process, the terminal server sever calculates any test sample Z in the test sample set Z_testGradient of (2)

And sends it to the kth local user terminal C_k；

The kth local subscriber terminal C_kRandomly selecting m samples from the jth training sample stored in the test data storage unit, and according to the gradient of the test data

Taylor expansion based on m samples is performed on equation (7) for calculating the vector product s_testJ th estimated value s of_test,jThen for the jth estimated value s_test,jAfter Gaussian noise is added, a noise-added estimation value is obtained

The kth local subscriber terminal C_kRepetition of r₁Selecting a sample and calculating an estimated value, and finally obtaining the estimated value after noise addition

And transmitting to the terminal server;

in the formula (7), H_kRepresents the kth local subscriber C_kλ represents a threshold value such that the sum of the hessian matrices

Is a semi-positive definite matrix, I represents a unit matrix,

representing test data z_testA gradient of (a);

the terminal server will add the estimated value after making a noise

As a vector product s_testSo that the ith training sample z is calculated using equation (8)_k,iInfluence function value of (I)_f(z_k,i)：

In the formula (8), the reaction mixture is,

represents the vector product s_testThe transposing of (1).

The algorithm 2 in the step 2 is performed according to the following processes:

And sends it to the kth local user terminal C_k；

The kth local subscriber terminal C_kRandomly selecting m samples from training samples stored in the training device for the jth time, and calculating a Hessian matrix based on the m samples

Element h in line l_lAnd a test specimen z_testGradient of (2)

Element of line l

Calculating the vector product s using equation (9)_testJ th estimated value s of_test,j(ii) a Then for the jth estimated value s_test,jAfter Gaussian noise is added, a noise-added estimation value is obtained

The kth local subscriber terminal C_kRepetition of r₂Selecting a sample and calculating an estimated value, and finally obtaining the estimated value after noise addition

And transmitting to the terminal server;

the terminal server will add the estimated value after making a noise

As a vector product s_testSo that the ith training sample z is calculated using equation (8)_k,iInfluence function value of (I)_f(z_k,i)。

Compared with the prior art, the invention has the beneficial effects that:

because the invention adopts a layered detection method, the detection method is high in efficiency; meanwhile, detection methods for optimizing computing resources and communication resources are respectively designed, and the detection methods are high in adaptability; in addition, local data are not exposed to any third party in the whole detection process, parameters transmitted in the middle are disturbed through differential privacy, and the user data privacy is protected through the detection method.

Drawings

FIG. 1 is a flow chart of a system for error data detection with privacy protection in the federated learning scenario of the present invention.

Detailed Description

In this embodiment, a method for detecting error data with privacy protection in a federated learning scenario is applied by a terminal server and K local clients { C } as shown in fig. 1_k1,2, …, K, where C_kRepresents the kth local user end and the kth local user end C_kStore n_kA training sample, denoted as { z_k,i|i＝1,2,…,n_k}，z_k,iRepresents the kth local subscriber C_kThe ith training sample of (1); terminal server storage test data set Z_S(ii) a The error data detection method is carried out according to the following steps:

step 1, federal model training:

the objective of the federal model training is to minimize the loss function shown in equation (1):

in the formula (1), θ represents a model parameter of federal learning after random initialization by using gaussian distribution, and is used for mapping an input space X to an output space Y, z represents training samples of K local user terminals, L (z, θ) represents a loss function of the model θ, and F_k(theta) denotes the kth local subscriber terminal C_kIs calculated by equation (2):

step 2, training of a federal model:

Step 2.2, in the training process of the t time, the terminal server randomly selects m local user terminals C_kAnd the t-th trained global model parameter theta is measured_tSending the data to the selected user side; the k selected local user terminal C_kIndependently calculating updated model parameters using equation (3)

And will be

Sending to the terminal server, the terminal server saves the local update parameters

And use it as a training log;

in the formula (3), η represents the learning rate;

step 2.3, the terminal server obtains the global model parameter theta of the t +1 th round by using formula (4) polymerization_t+1；

Step 1.4, after t +1 is assigned to t, the step 2.2 is returned to for execution in sequence until the global model parameter theta_tUntil convergence, to obtain the optimal global model parameters

Step 3, federal model test:

the terminal server sends a test data set Z_SInput to the optimal global model parameters

In (3), obtaining global model parameters

A set of mispredicted test samples Z;

step 4, detecting a user side containing error data:

In the formula (5), the reaction mixture is,

Indicates is selected if

Indicating not selected, N (k) indicating the k-th local user terminal C_kThe selected times in the training process from T/2 times to T times; theta_tRepresenting that the terminal server aggregates training logs uploaded by K local user terminals in the t training process to obtain global model parameters of the t training;

step 5, detecting error data:

suppose the kth local user terminal C_kIf the local user side is a negative influence user side, the terminal server requires all local user sides to report available computing resources and communication resources of the local user sides, meanwhile, the computing resources and communication resources required by algorithm 1 with low communication overhead based on the differential privacy mechanism and algorithm 2 with high computing efficiency based on the differential privacy mechanism are estimated, and if the kth local user side C is a local user side C_kCan meet the requirements of the algorithm 1, the algorithm 1 is selected to calculate the ith training sample z_k,iInfluence function value of (I)_f(z_k,i) Otherwise, calculate the ith training sample z using Algorithm 2_k,iInfluence function value of (I)_f(z_k,i)。

step 6, based on the influence function value I_f(z_k,i) Local user side selection and global model parameters

Updating:

Step 6.1, initializing t to 1;

Step 6.3, the terminal server selects the probability P according to the initial_t ¹,P_t ²,…,P_t ^k,…,P_t ^KSelecting m tth training processes to participate in the training process shown in the step 2.2-the step 2.3, so as to obtain global model parameters theta 'obtained by respectively uploading training logs and aggregating the training logs in the tth training process after deleting all error samples of K local user sides'_tAnd the k selected local client C_kModel parameters θ 'at the time of t-th training after deleting all error samples of itself'_t ^k；

Step 6.4, in the t +1 training process, the terminal server updates the kth local user terminal C by using the formula (4)_kIs selected, is selected_t ^kTo obtain the kth local user end C in the t +1 training process_kIs selected from

step 6.6, after t +1 is assigned to t, the step 6.4 is returned to and executed in sequence until the global model parameter theta'_tUntil convergence, to obtain the optimal global model parameters

Step 7, global model parameters are calculated

In the prediction process, the optimal global model parameters are obtained

The mispredicted test sample set Z 'is determined to be theta' if the sample size of the test sample set Z 'does not meet the requirement of the terminal server'_tIs assigned to theta_t，θ′_t ^kAssigning to the local user terminal C, executing the steps 4-7 again, otherwise, indicating that the kth local user terminal C is detected_kAll error data.

Due to calculation of I_f(z_k,i) Requires O (np)²+p³) Wherein p represents a global model parameter

The calculation cost is large, and in order to reduce the calculation cost, the algorithm 1 is carried out according to the following processes:

And sends it to the kth local user terminal C_k；

And transmitting to a terminal server;

Is a semi-positive definite matrix, I represents a unit matrix,

representing test data z_testOf the gradient of (c).

The terminal server will add the estimated value after making a noise

In the formula (8), the reaction mixture is,

represents the vector product s_testThe transposing of (1).

Due to calculation of I_f(z_k,i) Requires O (Kp)²+ np) and thus a large communication overhead, and to reduce the communication overhead, algorithm 2 proceeds as follows:

in the t training process, the terminal server calculates any test sample Z in the test sample set Z_testGradient of (2)

And sends it to the kth local user terminal C_k；

Element h in line l_lAnd test data

Row i element of the gradient of

Calculating the vector product s using equation (9)_testJ th estimated value s of_test,j. Then for the jth estimated value s_test,jAfter Gaussian noise is added, a noise-added estimation value is obtained

And transmitting to a terminal server;

the terminal server will add the estimated value after making a noise

Claims

1. A method for detecting error data with privacy protection in a federated learning scene is characterized in that a terminal server and K local user terminals { C }are applied_k1,2, …, K, where C_kRepresents the kth local user end and the kth local user end C_kStore n_kA training sample, denoted as { z_k，i|i＝1，2，…，n_k}，z_k，iRepresents the kth local subscriber C_kThe ith training sample of (1); the terminal server stores a test data set Z_S(ii) a The error data detection method is carried out according to the following steps:

step 1, constructing a training target of a federal model:

in the formula (2), L (z)_k，iAnd theta) represents the kth local subscriber C_kThe loss function of the ith training sample of (1);

step 2, training of a federal model:

step 2.1, defining and initializing the current training times t ═ l; assigning global model parameter theta of federal learning to global model parameter theta of t training_t；

And will be

And used as a training log;

in the formula (3), η represents the learning rate;

Step 3, testing of a federal model:

Obtaining the global model parameters

A set of mispredicted test samples Z;

step 4, detecting a user side containing error data:

the terminalThe server calculates the kth local client C by using the formula (5)_kDistance D of local update and global update_k：

In the formula (5), the reaction mixture is,

Indicates is selected if

and 5, detecting error data and deleting:

suppose the kth local user terminal C_kIf the local user side is a negative influence user side, the terminal server requires all local user sides to report available computing resources and communication resources of the local user sides, meanwhile, the computing resources and communication resources required by algorithm 1 with low communication overhead based on the differential privacy mechanism and algorithm 2 with high computing efficiency based on the differential privacy mechanism are estimated, and if the kth local user side C is a local user side C_kIf the computing resources can meet the requirements of the algorithm 1, the algorithm 1 is selected for computingIth training sample z_k，iInfluence function value of (I)_f(z_k，i) Otherwise, calculate the ith training sample z using Algorithm 2_k，iInfluence function value of (I)_f(z_k，i)；

If the value of the influence function I_f(z_k，i) If the ratio of the median of the impact function values of all negative impact clients is greater than the set impact threshold, the ith training sample z is represented_k，iIs an error sample, thereby to the k-th local user terminal C_kJudging all training samples to detect all error samples;

step 6, retraining the federated learning model:

Step 6.1, initializing t to 1;

step 6.2, in the t training process, making K local user terminals have equal initial selection probability P_t ¹，P_t ²，…，P_t ^k，…，P_t ^K；

Step 6.3, the terminal server selects probability P according to the initial_t ¹，P_t ²，…，P_t ^k，…，P_t ^KSelecting m local user sides and participating in the training process shown in the step 2.2-the step 2.3, so as to obtain global model parameters theta 'obtained by respectively uploading training logs in the t-th training process and aggregating the training logs after deleting all error samples of the K local user sides'_tAnd the k selected local userTerminal C_kModel parameters θ 'at the time of t-th training after deleting all error samples of itself'_t ^k；

In the formula (6), S_tRepresenting the local user terminal set selected by the terminal server in the f training process;

Step 7, global model parameters are calculated

In the prediction process, the optimal global model parameters are obtained

2. The method for detecting error data with privacy protection in a federated learning scene as claimed in claim 1, wherein the algorithm 1 in step 2 is performed as follows:

And sends it to the kth local user terminal C_k；

Taylor expansion based on m samples is performed on equation (7) for calculating the vector product s_testJ th estimated value s of_test，jThen for the jth estimated value s_test，jAfter Gaussian noise is added, a noise-added estimation value is obtained

And transmitting to the terminal server;

Is a semi-positive definite matrix, I represents a unit matrix,

representing test data z_testA gradient of (a);

the terminal server will add the estimated value after making a noise

As a vector product s_testSo that the ith training sample z is calculated using equation (8)_k，iInfluence function value of (I)_f(z_k，i)：

In the formula (8), the reaction mixture is,

represents the vector product s_testThe transposing of (1).

3. The method for detecting error data with privacy protection in a federated learning scene as claimed in claim 2, wherein the algorithm 2 in step 2 is performed as follows:

in the t training process, the terminalThe server sever calculates any test sample Z in the test sample set Z_testGradient of (2)

And sends it to the kth local user terminal C_k；

Element h in line l_lAnd a test specimen z_testGradient of (2)

Element of line l

Calculating the vector product s using equation (9)_testJ th estimated value s of_test，j(ii) a Then for the jth estimated value s_test，jAfter Gaussian noise is added, a noise-added estimation value is obtained

And transmitting to the terminal server;

the terminal server will add the estimated value after making a noise

As a vector product s_testSo that the ith training sample z is calculated using equation (8)_k，iInfluence function value of (I)_f(z_k，i)。