CN113361625A - Error data detection method with privacy protection in federated learning scene - Google Patents

Error data detection method with privacy protection in federated learning scene

Info

Publication number
CN113361625A
Authority
CN
China
Prior art keywords
training
test
local
local user
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110696108.9A
Other languages
Chinese (zh)
Inventor
Xiangyang Li
Lan Zhang
Anran Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110696108.9A
Publication of CN113361625A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for detecting erroneous data with privacy protection in a federated learning scenario, comprising the following steps: 1. constructing the training objective of the federated model; 2. training the federated model; 3. testing the federated model; 4. detecting the clients that contain erroneous data; 5. detecting and deleting the erroneous data; 6. retraining the federated learning model; 7. testing the result of error-data detection. The method can efficiently detect both the clients containing erroneous training data and the erroneous training data themselves during privacy-preserving federated learning, and can repair the errors at low cost, thereby improving the prediction accuracy of federated learning and accelerating the convergence of the federated model. The two proposed efficient detection algorithms save computation resources and communication resources respectively, so as to meet federated learning's requirement of dynamic resource constraints.

Description

Error data detection method with privacy protection in federated learning scene
Technical Field
The invention relates to a method for detecting erroneous data with privacy protection in a federated learning scenario, belonging to the fields of data security and data quality assessment.
Background
In recent years, artificial intelligence has advanced in waves, and AI has gradually entered everyday life: from face recognition, liveness detection, and criminal-case alerting, to AlphaGo defeating human Go champions such as Lee Sedol, to autonomous driving and widely deployed precision marketing. AI models are obtained by training on large amounts of high-quality data, yet in practice, apart from a few large companies, most enterprises suffer from small data volumes and poor data quality, which cannot sufficiently support artificial-intelligence applications. Meanwhile, regulatory environments at home and abroad are gradually strengthening data protection, so enabling data to flow freely under the premise of security compliance has become a major trend. Data owned by commercial companies often carries great potential value from both the user and the enterprise perspective. Companies, and even departments within the same company, must weigh the exchange of benefits, so organizations cannot simply pool their respective data with others; even within a single company, data often exists as isolated islands. Federated learning emerged to address this situation.
Federated learning enables data providers, under the coordination of a cloud server, to complete a learning task by locally training a shared model and exchanging parameter updates instead of raw data. In the federated learning process, the quality of each user's local data affects the performance of the global model: a large amount of erroneous data (e.g., mislabeled data) degrades the global model, for example through slow convergence and low test accuracy.
A series of works has addressed data error detection for centralized deep learning, including robustness and interpretability analysis of models, which fall into two broad categories: model-based interpretability analysis and data-based interpretability analysis. Model-based interpretability analysis focuses on building a more robust model by perturbing the model's hidden units. Data-based interpretability analysis traces the model's predictions back through the learning algorithm to the training data, in order to determine which training points have the greatest impact on a given test point. The influence function value of a single data point is used to approximate the true influence obtained by retraining the model after removing that point from the training data.
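For context, the standard influence-function approximation from the interpretability literature (Koh and Liang) is reproduced below; the quantities match those used in Algorithms 1 and 2 later in this document:

$$I(z, z_{test}) = -\,\nabla_\theta L(z_{test},\hat\theta)^{\top}\, H_{\hat\theta}^{-1}\,\nabla_\theta L(z,\hat\theta),\qquad H_{\hat\theta}=\frac{1}{n}\sum_{i=1}^{n}\nabla_\theta^{2} L(z_i,\hat\theta)$$

where H_θ̂ is the Hessian of the training loss at the trained parameters θ̂; removing z and retraining changes the loss at z_test by approximately −(1/n)·I(z, z_test).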
However, existing model-interpretability work cannot be used directly in federated learning (FL) systems: 1) existing methods are designed for centralized model training and require direct access to the raw training data, whereas in a federated system local data cannot be accessed directly by any third party, in order to protect user data privacy; 2) even if local data could be accessed in some way, existing influence-function evaluations incur significant computation and communication overhead, which is unacceptable for the resource-constrained clients in federated systems.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method for detecting erroneous data with privacy protection in a federated learning scenario. It aims to detect, efficiently and with privacy protection, both the clients that contain erroneous training data and the erroneous training data themselves during federated learning, and to repair the errors at low cost, thereby improving the prediction accuracy of federated learning, accelerating the convergence of the federated model, saving computation and communication resources, and meeting federated learning's requirement of dynamic resource constraints.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention relates to a method for detecting erroneous data with privacy protection in a federated learning scenario, applied to a terminal server and K local clients {C_k | k = 1, 2, …, K}, where C_k denotes the k-th local client; C_k stores n_k training samples, denoted {z_{k,i} | i = 1, 2, …, n_k}, where z_{k,i} is the i-th training sample of C_k; and the terminal server stores a test data set Z_S. The error-data detection method proceeds according to the following steps:

Step 1, constructing the training objective of the federated model:

The loss function L(z, θ) of the federated model is constructed by formula (1):

$$\min_{\theta} L(z,\theta)=\sum_{k=1}^{K}\frac{n_k}{n}\,F_k(\theta) \qquad (1)$$

In formula (1), θ denotes the model parameters of federated learning after random initialization from a Gaussian distribution, z denotes the training samples of the K local clients, and F_k(θ) denotes the average loss of the k-th local client C_k:

$$F_k(\theta)=\frac{1}{n_k}\sum_{i=1}^{n_k} L(z_{k,i},\theta) \qquad (2)$$

In formula (2), L(z_{k,i}, θ) denotes the loss function of the i-th training sample of the k-th local client C_k;
Step 2, training the federated model:

Step 2.1, define and initialize the current training round t = 1; assign the global model parameters θ of federated learning to the global model parameters θ_t of the t-th round.

Step 2.2, in the t-th training round, the terminal server randomly selects m local clients and sends the global model parameters θ_t of the t-th round to the selected clients; the k-th selected local client C_k independently computes its updated model parameters θ_{t+1}^k by formula (3) and sends θ_{t+1}^k to the terminal server, which stores the updated model parameters θ_{t+1}^k as a training log:

$$\theta_{t+1}^{k}=\theta_t-\eta\,\nabla F_k(\theta_t) \qquad (3)$$

In formula (3), η denotes the learning rate; θ_{t+1}^k denotes the model parameters of the k-th selected local client C_k in the (t+1)-th round; and ∇F_k(θ_t) denotes the gradient of the average loss of the k-th local client C_k in the t-th round.

Step 2.3, the terminal server aggregates by formula (4) to obtain the global model parameters θ_{t+1} of the (t+1)-th round:

$$\theta_{t+1}=\sum_{k\in S_t}\frac{n_k}{n}\,\theta_{t+1}^{k} \qquad (4)$$

In formula (4), S_t denotes the set of clients selected in the t-th round, and n denotes the total number of samples of all local clients, i.e. n = Σ_{k=1}^{K} n_k.

Step 2.4, after assigning t+1 to t, return to step 2.2 and execute in sequence until the global model parameters θ_t converge, obtaining the optimal global model parameters θ*.
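As an illustration only, a minimal sketch of one round of steps 2.2-2.3 follows; the client object, its `loss_gradient` helper, and the renormalization over the selected set are our own assumptions, not part of the patent:

```python
import numpy as np

def fedavg_round(theta_t, clients, sample_counts, m, eta, rng):
    """One round of steps 2.2-2.3: the server randomly selects m clients,
    each selected client takes one local gradient step (formula (3)),
    and the server aggregates the returned parameters (formula (4))."""
    selected = rng.choice(len(clients), size=m, replace=False)
    training_log = {}  # server-side log of the uploaded local updates
    for k in selected:
        grad_k = clients[k].loss_gradient(theta_t)  # gradient of F_k at theta_t
        training_log[k] = theta_t - eta * grad_k    # formula (3)
    # formula (4): sample-size-weighted aggregation; we renormalize over the
    # selected set so the weights sum to 1 (a common FedAvg convention)
    n_sel = sum(sample_counts[k] for k in selected)
    theta_next = sum((sample_counts[k] / n_sel) * training_log[k]
                     for k in selected)
    return theta_next, training_log
```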
Step 3, testing of a federal model:
the terminal server sends a test data set Z to the terminal serverSInput to the optimal global model parameters
Figure BDA0003128457600000036
Obtaining the global model parameters
Figure BDA0003128457600000037
A set of mispredicted test samples Z;
Step 4, detecting clients that contain erroneous data:

The terminal server computes, by formula (5), the distance D_k between the local updates of the k-th local client C_k and the global updates:

$$D_k=\frac{1}{N(k)}\sum_{t=T/2}^{T}\mathbb{1}_t^{k}\,\bigl\lVert \theta_t^{k}-\theta_t \bigr\rVert \qquad (5)$$

In formula (5), 1_t^k indicates whether the k-th local client C_k was selected in the t-th training round: 1_t^k = 1 means selected, and 1_t^k = 0 means not selected; N(k) denotes the number of times the k-th local client C_k was selected during training rounds T/2 through T (T being the total number of rounds); θ_t denotes the global model parameters of the t-th round, obtained by the terminal server aggregating the training logs uploaded by the K local clients in the t-th round.

If the ratio of the distance D_k of the k-th local client C_k to the median of the distances of all local clients is greater than the set distance threshold δ, the k-th local client C_k contains erroneous data and is marked as a negative-impact client.
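A sketch of the step-4 screening follows; the layout of the training log (`local_params[t][k]`, `global_params[t]`) and the selection mask are our own assumptions:

```python
import numpy as np

def flag_negative_clients(local_params, global_params, selected, T, delta):
    """Step 4 sketch: compute D_k by formula (5) over rounds T/2..T and flag
    clients whose D_k exceeds delta times the median distance over all
    clients; selected[t][k] is True iff client k was chosen in round t."""
    K = len(selected[0])
    D = np.zeros(K)
    for k in range(K):
        rounds = [t for t in range(T // 2, T) if selected[t][k]]
        if rounds:  # N(k) > 0
            D[k] = np.mean([np.linalg.norm(local_params[t][k] - global_params[t])
                            for t in rounds])
    med = np.median(D)
    return [k for k in range(K) if med > 0 and D[k] / med > delta]
```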
Step 5, detecting and deleting erroneous data:

Suppose the k-th local client C_k is a negative-impact client. The terminal server requires all local clients to report their available computation and communication resources, and meanwhile estimates the computation and communication resources required by Algorithm 1 (low communication overhead, based on a differential-privacy mechanism) and by Algorithm 2 (high computational efficiency, based on a differential-privacy mechanism). If the computation resources of the k-th local client C_k can satisfy the requirements of Algorithm 1, Algorithm 1 is selected to compute the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i}; otherwise, Algorithm 2 is used to compute I_f(z_{k,i}).

If the ratio of the influence function value I_f(z_{k,i}) to the median of the influence function values of all negative-impact clients is greater than the set impact threshold, the i-th training sample z_{k,i} is an erroneous sample; all training samples of the k-th local client C_k are judged in this way, so that all erroneous samples are detected.

The terminal server sends a delete command to the k-th negative-impact client C_k, so that C_k deletes all of its own erroneous samples.
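Once the influence values have been computed by Algorithm 1 or Algorithm 2, the step-5 thresholding reduces to a median-ratio test; a sketch under an assumed data layout:

```python
import numpy as np

def flag_error_samples(influence_by_client, impact_threshold):
    """Step 5 sketch: influence_by_client maps each negative-impact client k
    to the list of values I_f(z_{k,i}) of its training samples; a sample is
    flagged when its influence exceeds impact_threshold times the median
    influence over all negative-impact clients."""
    all_values = [v for vals in influence_by_client.values() for v in vals]
    med = np.median(all_values)
    return {k: [i for i, v in enumerate(vals) if v / med > impact_threshold]
            for k, vals in influence_by_client.items()}
```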
Step 6, retraining the federated learning model:

The terminal server adjusts the probability that each local client is selected according to the influence function values of all local clients, so that the terminal server cooperates with the local clients to update the global model parameters θ*:

Step 6.1, initialize t = 1.

Step 6.2, in the t-th training round, let the K local clients have equal initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K.

Step 6.3, the terminal server selects m local clients according to the initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K to participate in the training process of steps 2.2-2.3; after all erroneous samples of the K local clients have been deleted, this yields the global model parameters θ'_t, aggregated from the training logs uploaded in the t-th round, and the model parameters θ'^k_t of the k-th selected local client C_k in the t-th round.

Step 6.4, in the (t+1)-th training round, the terminal server updates the selection probability P_t^k of the k-th local client C_k by formula (6), obtaining its selection probability P_{t+1}^k for the (t+1)-th round, as sketched after this step list:

[Formula (6): the update rule for the selection probability P_t^k; the equation image is not preserved in this extraction.]

In formula (6), S_t denotes the set of local clients selected by the terminal server in the t-th training round.

Step 6.5, the terminal server selects m local clients according to the (t+1)-th round probabilities P_{t+1}^1, …, P_{t+1}^K to participate in the training process of steps 2.2-2.3.

Step 6.6, after assigning t+1 to t, return to step 6.4 and execute in sequence until the global model parameters θ'_t converge, obtaining the optimal global model parameters θ*.
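Since the equation image for formula (6) is not preserved, only the generic shape of step 6.4 can be illustrated: each client selected in round t has its probability rescaled by a weight derived from its influence values, and the probabilities are renormalized. The exponential weighting below is purely our assumption:

```python
import numpy as np

def update_selection_probs(P_t, S_t, client_influence, beta=1.0):
    """Step 6.4 sketch (shape only; the exact formula (6) is not recoverable
    from the source). Clients in S_t whose data showed larger aggregate
    influence are down-weighted; probabilities are renormalized to sum to 1."""
    P_next = np.array(P_t, dtype=float)
    for k in S_t:
        P_next[k] *= np.exp(-beta * client_influence[k])  # assumed form
    return P_next / P_next.sum()
```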
Step 7, global model parameters are calculated
Figure BDA0003128457600000053
The mispredicted test sample set Z is input to the optimal global model parameters
Figure BDA0003128457600000054
In the prediction process, the optimal global model parameters are obtained
Figure BDA0003128457600000055
The mispredicted test sample set Z 'is determined to be theta' if the sample size of the test sample set Z 'does not meet the requirement of the terminal server'tIs assigned to thetat,θ′t kIs assigned to
Figure BDA0003128457600000056
Step 4-step 7 are executed again, otherwise, it represents that the kth local user terminal C is detectedkAll error data.
The method for detecting erroneous data with privacy protection in the federated learning scenario is further characterized in that Algorithm 1 in step 5 proceeds as follows:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, using the gradient ∇_θ L(z_test, θ*), performs the m-sample Taylor expansion of formula (7) to compute the j-th estimate s_{test,j} of the vector product s_test; Gaussian noise is then added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_1 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}=\nabla_\theta L(z_{test},\theta^*)+\bigl(I-(H_k+\lambda I)\bigr)\,s_{test,j-1} \qquad (7)$$

In formula (7), H_k denotes the Hessian matrix of the k-th local client C_k; λ denotes a threshold such that H_k + λI is a positive semi-definite matrix; I denotes the identity matrix; and ∇_θ L(z_test, θ*) denotes the gradient of the test sample z_test.

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8):

$$I_f(z_{k,i})=-\,s_{test}^{\top}\,\nabla_\theta L(z_{k,i},\theta^*) \qquad (8)$$

In formula (8), s_test^⊤ denotes the transpose of the vector product s_test.
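A client-side sketch of Algorithm 1 follows, reading formula (7) as the stochastic Taylor-series (LiSSA-style) recursion for (H_k + λI)^{-1}·v; the helpers `sample_batch` and `hvp` (a mini-batch Hessian-vector product), the recursion depth J, and the noise scale sigma are all our assumptions:

```python
import numpy as np

def algorithm1_s_test(client, grad_test, lam, m, r1, J, sigma, rng):
    """Algorithm 1 sketch: r1 noised Taylor-series estimates of
    s_test = (H_k + lam*I)^{-1} grad_test, averaged before upload."""
    estimates = []
    for _ in range(r1):
        batch = client.sample_batch(m)   # m randomly selected local samples
        s = grad_test.copy()
        for _ in range(J):
            # formula (7): s <- grad_test + (I - (H_k + lam*I)) s
            s = grad_test + s - (client.hvp(batch, s) + lam * s)
        s += rng.normal(0.0, sigma, size=s.shape)  # Gaussian noise (DP)
        estimates.append(s)
    return np.mean(estimates, axis=0)  # noised s_test uploaded to the server

def influence_value(s_test, grad_train_sample):
    """Formula (8): I_f(z_{k,i}) = -s_test^T grad_theta L(z_{k,i}, theta*)."""
    return -float(s_test @ grad_train_sample)
```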
Algorithm 2 in step 5 proceeds as follows:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, based on the m samples, computes the element h_l in row l of the Hessian matrix H_k and the element ∇_θ L(z_test, θ*)^(l) in row l of the gradient of the test sample z_test; the j-th estimate s_{test,j} of the vector product s_test is then computed by formula (9). Gaussian noise is added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_2 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}^{(l)}=\frac{\nabla_\theta L(z_{test},\theta^*)^{(l)}}{h_l+\lambda} \qquad (9)$$

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8).
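Under the diagonal-Hessian reading of formula (9) adopted above (itself our reconstruction), Algorithm 2 replaces the recursion with an element-wise division, trading estimation accuracy for cost; `hessian_diagonal` is an assumed helper:

```python
import numpy as np

def algorithm2_s_test(client, grad_test, lam, m, r2, sigma, rng):
    """Algorithm 2 sketch: r2 noised diagonal-Hessian estimates of s_test,
    averaged before upload; each coordinate l is grad_test[l]/(h_l + lam)
    per formula (9)."""
    estimates = []
    for _ in range(r2):
        h_diag = client.hessian_diagonal(client.sample_batch(m))  # h_l values
        s = grad_test / (h_diag + lam)              # formula (9), element-wise
        s += rng.normal(0.0, sigma, size=s.shape)   # Gaussian noise (DP)
        estimates.append(s)
    return np.mean(estimates, axis=0)
```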
Compared with the prior art, the invention has the following beneficial effects:

Because the invention adopts a hierarchical detection method, detection is efficient. Detection methods that optimize computation resources and communication resources respectively are designed, so the method adapts well to varying resource constraints. In addition, local data are never exposed to any third party during the whole detection process, and the intermediate transmitted parameters are perturbed by differential privacy, so the detection method protects the privacy of user data.
Drawings
FIG. 1 is a flow chart of the system for error data detection with privacy protection in the federated learning scenario of the present invention.
Detailed Description
In this embodiment, a method for detecting erroneous data with privacy protection in a federated learning scenario is applied, as shown in FIG. 1, to a terminal server and K local clients {C_k | k = 1, 2, …, K}, where C_k denotes the k-th local client; C_k stores n_k training samples, denoted {z_{k,i} | i = 1, 2, …, n_k}, where z_{k,i} is the i-th training sample of C_k; the terminal server stores a test data set Z_S. The error-data detection method proceeds according to the following steps:
Step 1, constructing the training objective of the federated model:

The objective of federated model training is to minimize the loss function shown in formula (1):

$$\min_{\theta} L(z,\theta)=\sum_{k=1}^{K}\frac{n_k}{n}\,F_k(\theta) \qquad (1)$$

In formula (1), θ denotes the model parameters of federated learning after random initialization from a Gaussian distribution, used to map an input space X to an output space Y; z denotes the training samples of the K local clients; L(z, θ) denotes the loss function of the model θ; and F_k(θ) denotes the average loss of the k-th local client C_k, computed by formula (2):

$$F_k(\theta)=\frac{1}{n_k}\sum_{i=1}^{n_k} L(z_{k,i},\theta) \qquad (2)$$

In formula (2), L(z_{k,i}, θ) denotes the loss function of the i-th training sample of the k-th local client C_k;
Step 2, training the federated model:

Step 2.1, define and initialize the current training round t = 1; assign the global model parameters θ of federated learning to the global model parameters θ_t of the t-th round.

Step 2.2, in the t-th training round, the terminal server randomly selects m local clients and sends the global model parameters θ_t of the t-th round to the selected clients; the k-th selected local client C_k independently computes its updated model parameters θ_{t+1}^k by formula (3) and sends θ_{t+1}^k to the terminal server, which stores these local update parameters as a training log:

$$\theta_{t+1}^{k}=\theta_t-\eta\,\nabla F_k(\theta_t) \qquad (3)$$

In formula (3), η denotes the learning rate; θ_{t+1}^k denotes the model parameters of the k-th selected local client C_k in the (t+1)-th round; and ∇F_k(θ_t) denotes the gradient of the average loss of the k-th local client C_k in the t-th round.

Step 2.3, the terminal server aggregates by formula (4) to obtain the global model parameters θ_{t+1} of the (t+1)-th round:

$$\theta_{t+1}=\sum_{k\in S_t}\frac{n_k}{n}\,\theta_{t+1}^{k} \qquad (4)$$

In formula (4), S_t denotes the set of clients selected in the t-th round, and n denotes the total number of samples of all local clients, i.e. n = Σ_{k=1}^{K} n_k.

Step 2.4, after assigning t+1 to t, return to step 2.2 and execute in sequence until the global model parameters θ_t converge, obtaining the optimal global model parameters θ*.
Step 3, federal model test:
the terminal server sends a test data set ZSInput to the optimal global model parameters
Figure BDA0003128457600000083
In (3), obtaining global model parameters
Figure BDA0003128457600000084
A set of mispredicted test samples Z;
step 4, detecting a user side containing error data:
the terminal server calculates the kth local user terminal C by using the formula (5)kDistance D of local update and global updatek
Figure BDA0003128457600000085
In the formula (5), the reaction mixture is,
Figure BDA0003128457600000086
represents the kth local subscriber CkWhether it is selected in the t-th training, if so
Figure BDA0003128457600000087
Indicates is selected if
Figure BDA0003128457600000088
Indicating not selected, N (k) indicating the k-th local user terminal CkThe selected times in the training process from T/2 times to T times; thetatRepresenting that the terminal server aggregates training logs uploaded by K local user terminals in the t training process to obtain global model parameters of the t training;
if the kth local subscriber CkDistance D ofkIf the ratio of the local subscriber terminal C to the median distances of all the local subscriber terminals is greater than the set distance threshold δ, it indicates that the kth local subscriber terminal C iskThe client side contains error data and is marked as a negative influence client side;
step 5, detecting error data:
suppose the kth local user terminal CkIf the local user side is a negative influence user side, the terminal server requires all local user sides to report available computing resources and communication resources of the local user sides, meanwhile, the computing resources and communication resources required by algorithm 1 with low communication overhead based on the differential privacy mechanism and algorithm 2 with high computing efficiency based on the differential privacy mechanism are estimated, and if the kth local user side C is a local user side CkCan meet the requirements of the algorithm 1, the algorithm 1 is selected to calculate the ith training sample zk,iInfluence function value of (I)f(zk,i) Otherwise, calculate the ith training sample z using Algorithm 2k,iInfluence function value of (I)f(zk,i)。
If the value of the influence function If(zk,i) If the ratio of the median of the impact function values of all negative impact clients is greater than the set impact threshold, the ith training sample z is representedk,iIs an error sample, thereby to the k-th local user terminal CkJudging all training samples to detect all error samples;
the terminal server sends a deleting command to the kth negative influence user terminal CkSo that the kth negatively affects the ue CkDeleting all error samples of the self;
Step 6, selecting local clients and updating the global model parameters θ* based on the influence function values I_f(z_{k,i}):

The terminal server adjusts the probability that each local client is selected according to the influence function values of all local clients, so that the terminal server cooperates with the local clients to update the global model parameters θ*:

Step 6.1, initialize t = 1.

Step 6.2, in the t-th training round, let the K local clients have equal initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K.

Step 6.3, the terminal server selects m local clients according to the initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K to participate in the training process of steps 2.2-2.3; after all erroneous samples of the K local clients have been deleted, this yields the global model parameters θ'_t, aggregated from the training logs uploaded in the t-th round, and the model parameters θ'^k_t of the k-th selected local client C_k in the t-th round.

Step 6.4, in the (t+1)-th training round, the terminal server updates the selection probability P_t^k of the k-th local client C_k by formula (6), obtaining its selection probability P_{t+1}^k for the (t+1)-th round:

[Formula (6): the update rule for the selection probability P_t^k; the equation image is not preserved in this extraction.]

In formula (6), S_t denotes the set of local clients selected by the terminal server in the t-th training round.

Step 6.5, the terminal server selects m local clients according to the (t+1)-th round probabilities P_{t+1}^1, …, P_{t+1}^K to participate in the training process of steps 2.2-2.3.

Step 6.6, after assigning t+1 to t, return to step 6.4 and execute in sequence until the global model parameters θ'_t converge, obtaining the optimal global model parameters θ*.

Step 7, input the previously mispredicted test sample set Z into the model with the optimal global parameters θ* for prediction, obtaining the set Z' of test samples still mispredicted by θ*. If the sample size of Z' does not meet the requirement of the terminal server, assign θ'_t to θ_t and θ'^k_t to θ^k_t, and execute steps 4-7 again; otherwise, all erroneous data of the k-th local client C_k have been detected.
Because directly computing I_f(z_{k,i}) requires O(np² + p³) operations, where p denotes the dimension of the global model parameters θ*, the computation cost is large. To reduce the computation cost, Algorithm 1 proceeds as follows:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, using the gradient ∇_θ L(z_test, θ*), performs the m-sample Taylor expansion of formula (7) to compute the j-th estimate s_{test,j} of the vector product s_test; Gaussian noise is then added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_1 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}=\nabla_\theta L(z_{test},\theta^*)+\bigl(I-(H_k+\lambda I)\bigr)\,s_{test,j-1} \qquad (7)$$

In formula (7), H_k denotes the Hessian matrix of the k-th local client C_k; λ denotes a threshold such that H_k + λI is a positive semi-definite matrix; I denotes the identity matrix; and ∇_θ L(z_test, θ*) denotes the gradient of the test sample z_test.

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8):

$$I_f(z_{k,i})=-\,s_{test}^{\top}\,\nabla_\theta L(z_{k,i},\theta^*) \qquad (8)$$

In formula (8), s_test^⊤ denotes the transpose of the vector product s_test.
Because computing I_f(z_{k,i}) incurs O(Kp² + np) communication overhead, which is large, Algorithm 2 proceeds as follows to reduce the communication overhead:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, based on the m samples, computes the element h_l in row l of the Hessian matrix H_k and the element ∇_θ L(z_test, θ*)^(l) in row l of the gradient of the test sample z_test; the j-th estimate s_{test,j} of the vector product s_test is then computed by formula (9). Gaussian noise is added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_2 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}^{(l)}=\frac{\nabla_\theta L(z_{test},\theta^*)^{(l)}}{h_l+\lambda} \qquad (9)$$

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8).

Claims (3)

1. A method for detecting erroneous data with privacy protection in a federated learning scenario, characterized in that it is applied to a terminal server and K local clients {C_k | k = 1, 2, …, K}, where C_k denotes the k-th local client; C_k stores n_k training samples, denoted {z_{k,i} | i = 1, 2, …, n_k}, where z_{k,i} is the i-th training sample of C_k; the terminal server stores a test data set Z_S; and the error-data detection method proceeds according to the following steps:

Step 1, constructing the training objective of the federated model:

The loss function L(z, θ) of the federated model is constructed by formula (1):

$$\min_{\theta} L(z,\theta)=\sum_{k=1}^{K}\frac{n_k}{n}\,F_k(\theta) \qquad (1)$$

In formula (1), θ denotes the model parameters of federated learning after random initialization from a Gaussian distribution, z denotes the training samples of the K local clients, and F_k(θ) denotes the average loss of the k-th local client C_k:

$$F_k(\theta)=\frac{1}{n_k}\sum_{i=1}^{n_k} L(z_{k,i},\theta) \qquad (2)$$

In formula (2), L(z_{k,i}, θ) denotes the loss function of the i-th training sample of the k-th local client C_k;

Step 2, training the federated model:

Step 2.1, define and initialize the current training round t = 1; assign the global model parameters θ of federated learning to the global model parameters θ_t of the t-th round.

Step 2.2, in the t-th training round, the terminal server randomly selects m local clients and sends the global model parameters θ_t of the t-th round to the selected clients; the k-th selected local client C_k independently computes its updated model parameters θ_{t+1}^k by formula (3) and sends θ_{t+1}^k to the terminal server, which stores the updated model parameters θ_{t+1}^k as a training log:

$$\theta_{t+1}^{k}=\theta_t-\eta\,\nabla F_k(\theta_t) \qquad (3)$$

In formula (3), η denotes the learning rate; θ_{t+1}^k denotes the model parameters of the k-th selected local client C_k in the (t+1)-th round; and ∇F_k(θ_t) denotes the gradient of the average loss of the k-th local client C_k in the t-th round.

Step 2.3, the terminal server aggregates by formula (4) to obtain the global model parameters θ_{t+1} of the (t+1)-th round:

$$\theta_{t+1}=\sum_{k\in S_t}\frac{n_k}{n}\,\theta_{t+1}^{k} \qquad (4)$$

In formula (4), S_t denotes the set of clients selected in the t-th round, and n denotes the total number of samples of all local clients, i.e. n = Σ_{k=1}^{K} n_k.

Step 2.4, after assigning t+1 to t, return to step 2.2 and execute in sequence until the global model parameters θ_t converge, obtaining the optimal global model parameters θ*.

Step 3, testing the federated model:

The terminal server inputs the test data set Z_S into the model with the optimal global parameters θ*, obtaining the set Z of test samples mispredicted by the model θ*.

Step 4, detecting clients that contain erroneous data:

The terminal server computes, by formula (5), the distance D_k between the local updates of the k-th local client C_k and the global updates:

$$D_k=\frac{1}{N(k)}\sum_{t=T/2}^{T}\mathbb{1}_t^{k}\,\bigl\lVert \theta_t^{k}-\theta_t \bigr\rVert \qquad (5)$$

In formula (5), 1_t^k indicates whether the k-th local client C_k was selected in the t-th training round: 1_t^k = 1 means selected, and 1_t^k = 0 means not selected; N(k) denotes the number of times the k-th local client C_k was selected during training rounds T/2 through T (T being the total number of rounds); θ_t denotes the global model parameters of the t-th round, obtained by the terminal server aggregating the training logs uploaded by the K local clients in the t-th round.

If the ratio of the distance D_k of the k-th local client C_k to the median of the distances of all local clients is greater than the set distance threshold δ, the k-th local client C_k contains erroneous data and is marked as a negative-impact client.

Step 5, detecting and deleting erroneous data:

Suppose the k-th local client C_k is a negative-impact client. The terminal server requires all local clients to report their available computation and communication resources, and meanwhile estimates the computation and communication resources required by Algorithm 1 (low communication overhead, based on a differential-privacy mechanism) and by Algorithm 2 (high computational efficiency, based on a differential-privacy mechanism). If the computation resources of the k-th local client C_k can satisfy the requirements of Algorithm 1, Algorithm 1 is selected to compute the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i}; otherwise, Algorithm 2 is used to compute I_f(z_{k,i}).

If the ratio of the influence function value I_f(z_{k,i}) to the median of the influence function values of all negative-impact clients is greater than the set impact threshold, the i-th training sample z_{k,i} is an erroneous sample; all training samples of the k-th local client C_k are judged in this way, so that all erroneous samples are detected.

The terminal server sends a delete command to the k-th negative-impact client C_k, so that C_k deletes all of its own erroneous samples.

Step 6, retraining the federated learning model:

The terminal server adjusts the probability that each local client is selected according to the influence function values of all local clients, so that the terminal server cooperates with the local clients to update the global model parameters θ*:

Step 6.1, initialize t = 1.

Step 6.2, in the t-th training round, let the K local clients have equal initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K.

Step 6.3, the terminal server selects m local clients according to the initial selection probabilities P_t^1, P_t^2, …, P_t^k, …, P_t^K to participate in the training process of steps 2.2-2.3; after all erroneous samples of the K local clients have been deleted, this yields the global model parameters θ'_t, aggregated from the training logs uploaded in the t-th round, and the model parameters θ'^k_t of the k-th selected local client C_k in the t-th round.

Step 6.4, in the (t+1)-th training round, the terminal server updates the selection probability P_t^k of the k-th local client C_k by formula (6), obtaining its selection probability P_{t+1}^k for the (t+1)-th round:

[Formula (6): the update rule for the selection probability P_t^k; the equation image is not preserved in this extraction.]

In formula (6), S_t denotes the set of local clients selected by the terminal server in the t-th training round.

Step 6.5, the terminal server selects m local clients according to the (t+1)-th round probabilities P_{t+1}^1, …, P_{t+1}^K to participate in the training process of steps 2.2-2.3.

Step 6.6, after assigning t+1 to t, return to step 6.4 and execute in sequence until the global model parameters θ'_t converge, obtaining the optimal global model parameters θ*.

Step 7, input the previously mispredicted test sample set Z into the model with the optimal global parameters θ* for prediction, obtaining the set Z' of test samples still mispredicted by θ*. If the sample size of Z' does not meet the requirement of the terminal server, assign θ'_t to θ_t and θ'^k_t to θ^k_t, and execute steps 4-7 again; otherwise, all erroneous data of the k-th local client C_k have been detected.
2. The method for detecting erroneous data with privacy protection in a federated learning scenario as claimed in claim 1, wherein Algorithm 1 in step 5 proceeds as follows:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, using the gradient ∇_θ L(z_test, θ*), performs the m-sample Taylor expansion of formula (7) to compute the j-th estimate s_{test,j} of the vector product s_test; Gaussian noise is then added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_1 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}=\nabla_\theta L(z_{test},\theta^*)+\bigl(I-(H_k+\lambda I)\bigr)\,s_{test,j-1} \qquad (7)$$

In formula (7), H_k denotes the Hessian matrix of the k-th local client C_k; λ denotes a threshold such that H_k + λI is a positive semi-definite matrix; I denotes the identity matrix; and ∇_θ L(z_test, θ*) denotes the gradient of the test sample z_test.

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8):

$$I_f(z_{k,i})=-\,s_{test}^{\top}\,\nabla_\theta L(z_{k,i},\theta^*) \qquad (8)$$

In formula (8), s_test^⊤ denotes the transpose of the vector product s_test.
3. The method for detecting erroneous data with privacy protection in a federated learning scenario as claimed in claim 2, wherein Algorithm 2 in step 5 proceeds as follows:

In the t-th training round, the terminal server computes the gradient ∇_θ L(z_test, θ*) of any test sample z_test in the test sample set Z and sends it to the k-th local client C_k.

For the j-th estimate, the k-th local client C_k randomly selects m samples from its stored training samples and, based on the m samples, computes the element h_l in row l of the Hessian matrix H_k and the element ∇_θ L(z_test, θ*)^(l) in row l of the gradient of the test sample z_test; the j-th estimate s_{test,j} of the vector product s_test is then computed by formula (9). Gaussian noise is added to s_{test,j} to obtain the noised estimate s̃_{test,j}. The k-th local client C_k repeats the sample selection and estimation r_2 times, finally obtains the averaged noised estimate s̃_test, and transmits it to the terminal server:

$$s_{test,j}^{(l)}=\frac{\nabla_\theta L(z_{test},\theta^*)^{(l)}}{h_l+\lambda} \qquad (9)$$

The terminal server takes the noised estimate s̃_test as the vector product s_test, so that the influence function value I_f(z_{k,i}) of the i-th training sample z_{k,i} is computed by formula (8).
CN202110696108.9A 2021-06-23 2021-06-23 Error data detection method with privacy protection in federated learning scene Pending CN113361625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110696108.9A CN113361625A (en) 2021-06-23 2021-06-23 Error data detection method with privacy protection in federated learning scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110696108.9A CN113361625A (en) 2021-06-23 2021-06-23 Error data detection method with privacy protection in federated learning scene

Publications (1)

Publication Number Publication Date
CN113361625A true CN113361625A (en) 2021-09-07

Family

ID=77535777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110696108.9A Pending CN113361625A (en) 2021-06-23 2021-06-23 Error data detection method with privacy protection in federated learning scene

Country Status (1)

Country Link
CN (1) CN113361625A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200034665A1 (en) * 2018-07-30 2020-01-30 DataRobot, Inc. Determining validity of machine learning algorithms for datasets
CN109961098A (en) * 2019-03-22 2019-07-02 中国科学技术大学 A kind of training data selection method of machine learning
CN110633570A (en) * 2019-07-24 2019-12-31 浙江工业大学 Black box attack defense method for malicious software assembly format detection model
EP3828777A1 (en) * 2019-10-31 2021-06-02 NVIDIA Corporation Processor and system to train machine learning models based on comparing accuracy of model parameters
CN111460524A (en) * 2020-03-27 2020-07-28 鹏城实验室 Data integrity detection method and device and computer readable storage medium
CN112214342A (en) * 2020-09-14 2021-01-12 德清阿尔法创新研究院 Efficient error data detection method in federated learning scene
CN112435230A (en) * 2020-11-20 2021-03-02 哈尔滨市科佳通用机电股份有限公司 Deep learning-based data set generation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HARD, A. et al.: "Training Keyword Spotting Models on Non-IID Data with Federated Learning", arXiv *
WANG, Qi et al.: "Network Intrusion Detection Method Based on Big Data Analysis Technology", Microcomputer Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881306A (en) * 2023-02-22 2023-03-31 中国科学技术大学 Networked ICU intelligent medical decision-making method based on federal learning and storage medium

Similar Documents

Publication Publication Date Title
CN111124840B (en) Method and device for predicting alarm in business operation and maintenance and electronic equipment
CN107153874B (en) Water quality prediction method and system
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN112784920A (en) Cloud-side-end-coordinated dual-anti-domain self-adaptive fault diagnosis method for rotating part
CN112561119A (en) Cloud server resource performance prediction method using ARIMA-RNN combined model
CN115982141A (en) Characteristic optimization method for time series data prediction
CN115051929A (en) Network fault prediction method and device based on self-supervision target perception neural network
CN113361625A (en) Error data detection method with privacy protection in federated learning scene
CN117407665A (en) Retired battery time sequence data missing value filling method based on generation countermeasure network
KR20200000660A (en) System and method for generating prediction model for real-time time-series data
Lo Predicting software reliability with support vector machines
CN113779116B (en) Object ordering method, related equipment and medium
CN115204463A (en) Residual service life uncertainty prediction method based on multi-attention machine mechanism
CN115392434A (en) Depth model reinforcement method based on graph structure variation test
CN115616163A (en) Gas accurate preparation and concentration measurement system
CN114139601A (en) Evaluation method and system for artificial intelligence algorithm model of power inspection scene
Imbiriba et al. Recursive Gaussian processes and fingerprinting for indoor navigation
CN111382391A (en) Target correlation feature construction method for multi-target regression
Andersson et al. Data-driven impulse response regularization via deep learning
Woo et al. Development of a reinforcement learning-based adaptive scheduling algorithm for block assembly production line
Pearson et al. Predicting ecological outcomes using fuzzy interaction webs
Li et al. Two-stage Walsh-average-based robust estimation and variable selection for partially linear additive spatial autoregressive models
CN114970344A (en) Packed tower pressure drop prediction method based on width migration learning
Pan et al. Anomalous Update Identification Based on Cosine Similarity for Collaborative Wind Power Forecasting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210907