CN112464269A - Data selection method in federated learning scene - Google Patents

Data selection method in federated learning scene

Info

Publication number
CN112464269A
CN112464269A
Authority
CN
China
Prior art keywords
user
data
users
server
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011464915.XA
Other languages
Chinese (zh)
Inventor
张兰
李向阳
李安然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deqing Alpha Innovation Research Institute
Original Assignee
Deqing Alpha Innovation Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deqing Alpha Innovation Research Institute filed Critical Deqing Alpha Innovation Research Institute
Priority to CN202011464915.XA priority Critical patent/CN112464269A/en
Publication of CN112464269A publication Critical patent/CN112464269A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/602 - Providing cryptographic facilities or services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data selection method in a federated learning scenario comprises the steps of filtering out task-related users and data, selecting users before training, selecting users and data during training, and training the model. Because server-side log information is used to dynamically select users, and data are selected by a gradient upper-bound value that takes into account the impact of erroneous data on the gradient, the data selection strategy is efficient and accurate.

Description

Data selection method in federated learning scene
Technical Field
The invention relates to a data selection method in a federated learning scenario, and belongs to the field of data analysis and data quality evaluation.
Background
How to acquire large, high-quality data sets has become a common bottleneck for many machine learning models and AI applications. This is not only because collecting and labeling large numbers of samples is very expensive, but also because privacy issues prevent data sharing in many fields (e.g., medicine and economics). The advent of federated learning has made it possible for end users to jointly train network models using local data. In the federated learning process, the quality of each user's local data affects the performance of the global model, and low-quality data (e.g., mislabeled data and non-uniformly distributed data) seriously hinder the global model from achieving good results.
The invention aims to select a group of high-quality training samples for a given federated learning task, in a privacy-preserving manner and under a given budget, so as to improve the accuracy of the model and accelerate its convergence.
There has been a series of work on data selection in deep learning: 1) some methods define quality metrics such as task relevance and content diversity, score data samples against these metrics, and select high-scoring data for training; 2) other methods dynamically select training samples important to the model to compose each data batch during training and so speed up convergence, typically quantifying importance by gradient norm or loss value. These methods cannot be used directly in federated learning: 1) they require direct access to all training samples, whereas in a federated system data cannot be accessed directly by third parties; 2) directly computing the importance of each sample incurs unacceptable overhead for resource-limited participants; 3) they do not take into account the impact of non-IID or erroneous samples on the selection strategy and may assign more importance to erroneous samples, thereby degrading model performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to select, in a privacy-preserving manner and under a given budget, a group of high-quality training samples for a given federated learning task, so as to improve model accuracy and accelerate model convergence. The method comprises the steps of filtering out task-related users and data, selecting users before training, selecting users and data during training, and training the model.
Preferably, task-related user and data filtering proceeds as follows: when an FL task arrives, the server first computes, for each user C_k, k ∈ [K], the intersection of its label set Y_k = {y_k | (x_k, y_k) ∈ D_k} with the target label set Y, i.e. the sample set {(x_k, y_k) | y_k ∈ Y_k ∩ Y}, in order to identify users holding data of the target categories. If the number of samples in this intersection exceeds the minimum number v required by the target model, |{(x_k, y_k) | y_k ∈ Y_k ∩ Y}| > v, the user is considered relevant. To meet the privacy-protection requirement, we use private set intersection (PSI).
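As a non-private illustration of this filtering step, the sketch below counts each user's samples whose labels fall in the target set Y and keeps users above the threshold v. A real deployment would compute the intersection under a PSI protocol; the labels here are compared in the clear, and all names and values are hypothetical.

```python
def filter_relevant_users(user_labels, target_labels, v):
    """Keep users whose datasets contain more than v samples
    with labels in the target label set Y."""
    relevant = []
    for user_id, labels in user_labels.items():
        # |{(x_k, y_k) | y_k in Y_k ∩ Y}|: count samples whose
        # label falls in the target set
        n_matching = sum(1 for y in labels if y in target_labels)
        if n_matching > v:
            relevant.append(user_id)
    return relevant

users = {"C1": [0, 0, 1, 2, 2], "C2": [7, 8], "C3": [1, 1, 1, 1]}
print(filter_relevant_users(users, {0, 1, 2}, v=3))  # → ['C1', 'C3']
```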
Preferably, user selection before training proceeds as follows: the server further selects a set of high-quality users (with index set Q) from the relevant users using a determinantal point process (DPP) based algorithm, so as to maximize homogeneity and content diversity under a budget constraint B: max V(Q), s.t. Σ_{k∈Q, Q⊆[N′]} b_k ≤ B, where V(Q) is the quality value of the selected users. The server then coordinates the selected users to begin training the model. This module mainly comprises the following steps:
a) Homogeneity-based user selection: the server preferentially selects users whose data are uniformly distributed over the categories, with no category missing. When homogeneity is used as the selection criterion, V_μ(Q) = Σ_{k∈Q} μ_k, where μ_k is defined as the difference between user k's data distribution and the uniform distribution (its defining equation appears as an image in the source). To compute μ_k with privacy protection, the server and each user jointly evaluate it using an efficient and secure two-party computation protocol based on BGN homomorphic encryption under the server's public key. The server then greedily selects the user with the largest marginal homogeneity gain per unit cost until the budget B is exhausted, yielding the best set of users.
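The budgeted greedy selection above can be sketched as follows. The concrete score (one minus the total-variation distance to the uniform label distribution) and the score-per-cost greedy rule are illustrative assumptions, since the patent gives the exact μ_k formula only as an equation image, and the secure two-party computation is omitted.

```python
import numpy as np

def homogeneity(label_counts, n_classes):
    """Assumed homogeneity score: 1 minus the total-variation
    distance between the user's label distribution and uniform."""
    p = np.zeros(n_classes)
    for y, c in label_counts.items():
        p[y] = c
    p = p / p.sum()
    uniform = np.full(n_classes, 1.0 / n_classes)
    return 1.0 - 0.5 * np.abs(p - uniform).sum()

def greedy_select(users, budget, n_classes):
    """users: {id: (label_counts, cost)}; repeatedly pick the user
    with the largest score-per-cost ratio until budget B runs out."""
    remaining, chosen, spent = dict(users), [], 0.0
    while remaining:
        best = max(remaining,
                   key=lambda k: homogeneity(remaining[k][0], n_classes)
                   / remaining[k][1])
        cost = remaining.pop(best)[1]
        if spent + cost > budget:
            break
        chosen.append(best)
        spent += cost
    return chosen

users = {"A": ({0: 1, 1: 1}, 1.0), "B": ({0: 4}, 1.0)}
print(greedy_select(users, budget=1.5, n_classes=2))  # → ['A']
```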
b) Diversity-based user selection: the server selects users with diverse data content to participate in model training. When content diversity is used as the selection criterion, V(Q) = ρ(D) with D = ∪_{k∈Q} D_k, where the diversity function ρ is built from pairwise similarities (its defining equations appear as images in the source) and S(v_i, v_j) is a similarity function between users C_i and C_j, such as the Euclidean distance. The server greedily selects the next user with the lowest similarity to the currently selected set.
To compute the content diversity of a data set, a feature-vector representation of the data must first be extracted. We extract features with a deep learning model, for example extracting content feature vectors of pictures with a VGG-16 network, and then compute the content diversity over all of the user's data. When the user's data volume M is large and the feature dimension l is high, the cost of computing content diversity, O(M²l), is large; moreover, existing computation methods need direct access to the raw data. We therefore propose an efficient privacy-preserving content-diversity computation that represents the features of each user's data set with low-dimensional vectors based on the Johnson-Lindenstrauss (JL) transform and protects the privacy of each sample with a randomized response mechanism. It mainly comprises the following steps:
i. Constructing a data-set content sketch: user C_k locally generates content feature vectors Φ_k = {φ_{k,i} | i ∈ [U_k]}; the server then selects a mapping matrix w, and each φ_{k,i} is mapped to the low-dimensional vector h(φ_{k,i}) = sign(w·φ_{k,i}). The distortion caused by the mapping reduces the accuracy of the diversity estimate but protects the privacy of the users' feature vectors to a certain extent; the content-vector sketch of data set D_k is the collection of the vectors h(φ_{k,i}).
ii. Randomized response mechanism: to further protect the privacy of the presence of each datum, we use a randomized response mechanism to generate a perturbed representation h̃(φ_{k,i}) of each sketch vector h(φ_{k,i}): each bit of h̃(φ_{k,i}) is set to 1 with probability f/2, set to 0 with probability f/2, and kept equal to the corresponding bit of h(φ_{k,i}) with probability 1 − f, where f is a parameter controlling the degree of privacy. The user then assembles the perturbed vectors into a perturbed sketch and sends it to the server; the server uses the perturbed sketch vectors to compute pairwise similarities and, from them, the content diversity. In this way the server reduces the overhead of computing content diversity by several orders of magnitude while protecting the users' data privacy.
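A minimal sketch of the two steps above: a JL-style sign random projection followed by bitwise randomized response with privacy parameter f. The dimensions, the shared projection matrix, and the use of {0,1} bits are illustrative assumptions; a real system would also transmit the sketch over a secure channel.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_sketch(features, w):
    """Map each l-dim feature vector to a d-dim {0,1} sketch:
    h(phi) = sign(w . phi), stored as bits."""
    return (features @ w.T > 0).astype(np.int8)

def randomized_response(bits, f, rng):
    """Per bit: output 1 with prob. f/2, 0 with prob. f/2,
    and keep the true bit with prob. 1 - f."""
    r = rng.random(bits.shape)
    out = bits.copy()
    out[r < f / 2] = 1
    out[(r >= f / 2) & (r < f)] = 0
    return out

l, d = 128, 16                       # feature dim, sketch dim (assumed)
w = rng.standard_normal((d, l))      # projection matrix chosen by the server
phi = rng.standard_normal((100, l))  # one user's content feature vectors
sketch = randomized_response(sign_sketch(phi, w), f=0.2, rng=rng)
# The server estimates pairwise similarity from Hamming distances
# between perturbed sketches instead of touching raw features.
```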
c) User selection based on the determinantal point process: when homogeneity and diversity are considered jointly, the user-selection problem is converted into a DPP problem. Given user C_i's homogeneity μ_i and the similarity S_ij between users C_i and C_j, we define a positive semi-definite matrix A_{[N′]} = [A_ij], i, j ∈ [N′], with A_ij = μ_i μ_j S_ij; the probability that the user set Q is selected is then P_A(Q) = det(A_Q), where A_Q is the submatrix of A indexed by Q (with similarity block S_Q = [S_ij], i, j ∈ Q). As homogeneity increases and similarity decreases, the determinant increases, so DPP-based selection tends to select users with evenly distributed categories while avoiding users with highly similar content. With the value function V_d(Q) = det(A_Q) we transform the user-selection problem into a log-submodular maximization and iteratively select the user C_k that maximizes P_A(Q ∪ {k}).
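The iterative selection can be sketched with a naive greedy determinant maximization over the kernel A_ij = μ_i μ_j S_ij. The homogeneity scores and similarity matrix below are toy values; efficient DPP MAP solvers would replace the brute-force determinant loop in practice.

```python
import numpy as np

def greedy_dpp(mu, S, budget_size):
    """Greedily add the user k that maximizes det(A_{Q ∪ {k}})
    for the kernel A_ij = mu_i * mu_j * S_ij."""
    n = len(mu)
    A = np.outer(mu, mu) * S
    chosen = []
    for _ in range(budget_size):
        best, best_det = None, -np.inf
        for k in range(n):
            if k in chosen:
                continue
            idx = chosen + [k]
            det = np.linalg.det(A[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = k, det
        chosen.append(best)
    return chosen

mu = np.array([0.9, 0.8, 0.85])       # toy homogeneity scores
S = np.array([[1.0, 0.95, 0.2],       # users 0 and 1 are near-duplicates
              [0.95, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
print(greedy_dpp(mu, S, 2))           # picks two dissimilar users
```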
Preferably, user and data selection during training proceeds as follows: given the selected set of high-quality users with index set Q ⊆ [N′], in order to further improve model performance and reduce training overhead, a fraction ζ of the users is selected in each training iteration, and each selected user locally selects important data samples to participate in model training. An intuitive way to quantify the importance λ(z_{k,i}, t) of user C_k's sample z_{k,i} in round t is an upper bound on its gradient norm computed from the input and output of the last layer (layer L) of the model; data are then selected by this gradient-based upper-bound value. However, when the model is complex and the data volume is large, the computation cost of this method, O(n·s), is high, where n is the total data volume and s is the number of model parameters, θ ∈ R^s. We therefore propose a policy for dynamically selecting users based on server-side log information. Specifically, in round t the server selects m users according to each user's selection probability (given as an equation image in the source), so that users with a large impact on the model are selected with higher probability. Each selected user C_k then locally computes the importance λ(z_{k,i}, t−1) of each datum and selects data z_{k,i} ∈ D_k with a probability determined by this importance (its exact expression appears as an equation image in the source), taking into account that the L2 norm of the gradient of erroneous data is far larger than that of correct data.
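The round-t sampling logic can be sketched as follows. Sampling proportional to the scores and capping suspiciously large importance values are illustrative assumptions, since the patent's probability expressions survive only as equation images; the cap reflects the observation that erroneous samples can carry disproportionately large gradient norms.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_users(user_scores, m, rng):
    """Server side: draw m distinct users with probability
    proportional to their logged influence scores."""
    p = user_scores / user_scores.sum()
    return rng.choice(len(user_scores), size=m, replace=False, p=p)

def sample_data(importance, frac, cap, rng):
    """User side: clip very large importance scores (likely
    erroneous samples), then sample a fraction of the data with
    probability proportional to the clipped score."""
    clipped = np.minimum(importance, cap)
    p = clipped / clipped.sum()
    n_pick = max(1, int(frac * len(importance)))
    return rng.choice(len(importance), size=n_pick, replace=False, p=p)

scores = np.array([1.0, 3.0, 0.5, 2.5])     # per-user influence (toy)
picked_users = sample_users(scores, m=2, rng=rng)
imp = np.array([0.1, 0.2, 9.0, 0.3, 0.25])  # 9.0 is suspiciously large
picked_data = sample_data(imp, frac=0.4, cap=1.0, rng=rng)
```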
Preferably, model training proceeds as follows: in each iteration, all selected users train their local models on the selected samples, and the server aggregates the users' model updates to update the global model. The server repeats this process until the globally optimal model θ* is obtained.
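The aggregation step follows the familiar federated-averaging pattern, sketched below on a toy linear-regression task. The squared-loss local objective, learning rate, and weighting by dataset size are illustrative assumptions, not the patent's specific aggregation rule.

```python
import numpy as np

def local_update(theta, X, y, lr=0.1):
    """One local gradient step on a squared-loss linear model."""
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    return theta - lr * grad

def server_round(theta, users):
    """users: list of (X, y); aggregate local updates weighted
    by each user's number of samples (FedAvg-style)."""
    updates, weights = [], []
    for X, y in users:
        updates.append(local_update(theta, X, y))
        weights.append(len(y))
    w = np.array(weights, dtype=float)
    return np.average(updates, axis=0, weights=w)

rng = np.random.default_rng(2)
theta_true = np.array([1.0, -2.0])
users = []
for _ in range(3):
    X = rng.standard_normal((20, 2))
    users.append((X, X @ theta_true))    # noiseless toy data

theta = np.zeros(2)
for _ in range(200):                     # repeat until convergence
    theta = server_round(theta, users)
# theta converges toward theta_true
```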
The invention designs an efficient data selection method for federated learning that improves model accuracy and accelerates model convergence. Because a vector sketch and a randomized response mechanism are adopted, user selection is efficient and privacy-preserving; and because server-side log information is used to dynamically select users, and data are selected by a gradient upper-bound value that takes into account the impact of erroneous data on the gradient, the data selection strategy is efficient and accurate.
Drawings
FIG. 1 is a flow diagram of an efficient data selection system in a federated learning scenario.
Detailed Description
The invention will be described in detail below with reference to the accompanying figure. As shown in FIG. 1, the data selection method in the federated learning scenario proposed by the present invention is mainly divided into the following modules: filtering out task-related users and data, selecting users before training, selecting users and data during training, and training the model.
(1) Task-related user and data filtering: when an FL task arrives, the server first computes, for each user C_k, k ∈ [K], the intersection of its label set Y_k = {y_k | (x_k, y_k) ∈ D_k} with the target label set Y, i.e. the sample set {(x_k, y_k) | y_k ∈ Y_k ∩ Y}, in order to identify users holding data of the target categories. If the number of samples in this intersection exceeds the minimum number v required by the target model, |{(x_k, y_k) | y_k ∈ Y_k ∩ Y}| > v, the user is relevant. To meet the privacy-protection requirement, we use private set intersection (PSI).
(2) User selection before training: the server further selects a set of high-quality users (with index set Q) from the relevant users using a determinantal point process (DPP) based algorithm, so as to maximize homogeneity and content diversity under a budget constraint B: max V(Q), s.t. Σ_{k∈Q, Q⊆[N′]} b_k ≤ B, where V(Q) is the quality value of the selected users. The server then coordinates the selected users to begin training the model. This module mainly comprises the following steps:
a) Homogeneity-based user selection: the server preferentially selects users whose data are uniformly distributed over the categories, with no category missing. When homogeneity is used as the selection criterion, V_μ(Q) = Σ_{k∈Q} μ_k, where μ_k is defined as the difference between user k's data distribution and the uniform distribution (its defining equation appears as an image in the source). To compute μ_k with privacy protection, the server and each user jointly evaluate it using an efficient and secure two-party computation protocol based on BGN homomorphic encryption under the server's public key. The server then greedily selects the user with the largest marginal homogeneity gain per unit cost until the budget B is exhausted, yielding the best set of users.
b) Diversity-based user selection: the server selects users with diverse data content to participate in model training. When content diversity is used as the selection criterion, V(Q) = ρ(D) with D = ∪_{k∈Q} D_k, where the diversity function ρ is built from pairwise similarities (its defining equations appear as images in the source) and S(v_i, v_j) is a similarity function between users C_i and C_j, such as the Euclidean distance. The server greedily selects the next user with the lowest similarity to the currently selected set.
To compute the content diversity of a data set, a feature-vector representation of the data must first be extracted. We extract features with a deep learning model, for example extracting content feature vectors of pictures with a VGG-16 network, and then compute the content diversity over all of the user's data. When the user's data volume M is large and the feature dimension l is high, the cost of computing content diversity, O(M²l), is large; moreover, existing computation methods need direct access to the raw data. We therefore propose an efficient privacy-preserving content-diversity computation that represents the features of each user's data set with low-dimensional vectors based on the Johnson-Lindenstrauss (JL) transform and protects the privacy of each sample with a randomized response mechanism. It mainly comprises the following steps:
i. Constructing a data-set content sketch: user C_k locally generates content feature vectors Φ_k = {φ_{k,i} | i ∈ [U_k]}; the server then selects a mapping matrix w, and each φ_{k,i} is mapped to the low-dimensional vector h(φ_{k,i}) = sign(w·φ_{k,i}). The distortion caused by the mapping reduces the accuracy of the diversity estimate but protects the privacy of the users' feature vectors to a certain extent; the content-vector sketch of data set D_k is the collection of the vectors h(φ_{k,i}).
ii. Randomized response mechanism: to further protect the privacy of the presence of each datum, we use a randomized response mechanism to generate a perturbed representation h̃(φ_{k,i}) of each sketch vector h(φ_{k,i}): each bit of h̃(φ_{k,i}) is set to 1 with probability f/2, set to 0 with probability f/2, and kept equal to the corresponding bit of h(φ_{k,i}) with probability 1 − f, where f is a parameter controlling the degree of privacy. The user then assembles the perturbed vectors into a perturbed sketch and sends it to the server; the server uses the perturbed sketch vectors to compute pairwise similarities and, from them, the content diversity. In this way the server reduces the overhead of computing content diversity by several orders of magnitude while protecting the users' data privacy.
(3) User selection based on the determinantal point process: when homogeneity and diversity are considered jointly, the user-selection problem is converted into a DPP problem. Given user C_i's homogeneity μ_i and the similarity S_ij between users C_i and C_j, we define a positive semi-definite matrix A_{[N′]} = [A_ij], i, j ∈ [N′], with A_ij = μ_i μ_j S_ij; the probability that the user set Q is selected is then P_A(Q) = det(A_Q), where A_Q is the submatrix of A indexed by Q (with similarity block S_Q = [S_ij], i, j ∈ Q). As homogeneity increases and similarity decreases, the determinant increases, so DPP-based selection tends to select users with evenly distributed categories while avoiding users with highly similar content. With the value function V_d(Q) = det(A_Q) we transform the user-selection problem into a log-submodular maximization and iteratively select the user C_k that maximizes P_A(Q ∪ {k}).
(4) User and data selection during training: given the selected set of high-quality users with index set Q ⊆ [N′], in order to further improve model performance and reduce training overhead, a fraction ζ of the users is selected in each training iteration, and each selected user locally selects important data samples to participate in model training. An intuitive way to quantify the importance λ(z_{k,i}, t) of user C_k's sample z_{k,i} in round t is an upper bound on its gradient norm computed from the input and output of the last layer (layer L) of the model; data are then selected by this gradient-based upper-bound value. However, when the model is complex and the data volume is large, the computation cost of this method, O(n·s), is high, where n is the total data volume and s is the number of model parameters, θ ∈ R^s. We therefore propose a policy for dynamically selecting users based on server-side log information. Specifically, in round t the server selects m users according to each user's selection probability (given as an equation image in the source), so that users with a large impact on the model are selected with higher probability. Each selected user C_k then locally computes the importance λ(z_{k,i}, t−1) of each datum and selects data z_{k,i} ∈ D_k with a probability determined by this importance (its exact expression appears as an equation image in the source), taking into account that the L2 norm of the gradient of erroneous data is far larger than that of correct data.
(5) Model training: in each iteration, all selected users train their local models on the selected samples, and the server aggregates the users' model updates to update the global model. The server repeats this process until the globally optimal model θ* is obtained.

Claims (5)

1. A data selection method in a federated learning scenario, characterized by comprising: filtering out task-related users and data, selecting users before training, selecting users and data during training, and training a model.
2. The data selection method in a federated learning scenario according to claim 1, characterized in that the task-related users and data are filtered as follows: when an FL task arrives, the server first computes, for each user C_k, k ∈ [K], the intersection of its label set Y_k = {y_k | (x_k, y_k) ∈ D_k} with the target label set Y, i.e. {(x_k, y_k) | y_k ∈ Y_k ∩ Y}, in order to identify users holding data of the target categories; if the number of samples in this intersection exceeds the minimum number v required by the target model, |{(x_k, y_k) | y_k ∈ Y_k ∩ Y}| > v, the user is relevant; to meet the privacy-protection requirement, private set intersection (PSI) is used.
3. The data selection method in a federated learning scenario according to claim 1, characterized in that user selection before training proceeds as follows: the server further selects a set of high-quality users (with index set Q) from the relevant users using a determinantal point process (DPP) based algorithm, so as to maximize homogeneity and content diversity under a budget constraint B: max V(Q), s.t. Σ_{k∈Q, Q⊆[N′]} b_k ≤ B, where V(Q) is the quality value of the selected users; the server then coordinates the selected users to begin training the model; this module mainly comprises the following steps:
a) Homogeneity-based user selection: the server preferentially selects users whose data are uniformly distributed over the categories, with no category missing; when homogeneity is used as the selection criterion, V_μ(Q) = Σ_{k∈Q} μ_k, where μ_k is defined as the difference between user k's data distribution and the uniform distribution (its defining equation appears as an image in the source); to compute μ_k with privacy protection, the server and each user jointly evaluate it using an efficient and secure two-party computation protocol based on BGN homomorphic encryption under the server's public key; the server then greedily selects the user with the largest marginal homogeneity gain per unit cost until the budget B is exhausted, yielding the best set of users.
b) Diversity-based user selection: the server selects users with diverse data content to participate in model training; when content diversity is used as the selection criterion, V(Q) = ρ(D) with D = ∪_{k∈Q} D_k, where the diversity function ρ is built from pairwise similarities (its defining equations appear as images in the source) and S(v_i, v_j) is a similarity function between users C_i and C_j, such as the Euclidean distance; the server greedily selects the next user with the lowest similarity to the currently selected set.
To compute the content diversity of a data set, a feature-vector representation of the data must first be extracted; we extract features with a deep learning model, for example extracting content feature vectors of pictures with a VGG-16 network, and then compute the content diversity over all of the user's data; when the user's data volume M is large and the feature dimension l is high, the cost of computing content diversity, O(M²l), is large, and existing computation methods need direct access to the raw data; we therefore propose an efficient privacy-preserving content-diversity computation that represents the features of each user's data set with low-dimensional vectors based on the Johnson-Lindenstrauss (JL) transform and protects the privacy of each sample with a randomized response mechanism, mainly comprising the following steps:
i. Constructing a data-set content sketch: user C_k locally generates content feature vectors Φ_k = {φ_{k,i} | i ∈ [U_k]}; the server then selects a mapping matrix w, and each φ_{k,i} is mapped to the low-dimensional vector h(φ_{k,i}) = sign(w·φ_{k,i}); the distortion caused by the mapping reduces the accuracy of the diversity estimate but protects the privacy of the users' feature vectors to a certain extent; the content-vector sketch of data set D_k is the collection of the vectors h(φ_{k,i}).
ii. Randomized response mechanism: to further protect the privacy of the presence of each datum, a randomized response mechanism is used to generate a perturbed representation h̃(φ_{k,i}) of each sketch vector h(φ_{k,i}): each bit of h̃(φ_{k,i}) is set to 1 with probability f/2, set to 0 with probability f/2, and kept equal to the corresponding bit of h(φ_{k,i}) with probability 1 − f, where f is a user-defined parameter controlling the degree of privacy; the user then assembles the perturbed vectors into a perturbed sketch and sends it to the server; the server uses the perturbed sketch vectors to compute pairwise similarities and, from them, the content diversity; in this way the server reduces the overhead of computing content diversity by several orders of magnitude while protecting the users' data privacy.
c) User selection based on the determinantal point process: when homogeneity and diversity are considered jointly, the user-selection problem is converted into a DPP problem; given user C_i's homogeneity μ_i and the similarity S_ij between users C_i and C_j, a positive semi-definite matrix A_{[N′]} = [A_ij], i, j ∈ [N′], is defined with A_ij = μ_i μ_j S_ij, and the probability that the user set Q is selected is P_A(Q) = det(A_Q), where A_Q is the submatrix of A indexed by Q (with similarity block S_Q = [S_ij], i, j ∈ Q); as homogeneity increases and similarity decreases, the determinant increases, so DPP-based selection tends to select users with evenly distributed categories while avoiding users with highly similar content; with the value function V_d(Q) = det(A_Q), the user-selection problem is transformed into a log-submodular maximization, and the user C_k that maximizes P_A(Q ∪ {k}) is selected iteratively.
4. The data selection method in a federated learning scenario according to claim 1, characterized in that user and data selection during training proceeds as follows: given the selected set of high-quality users with index set Q ⊆ [N′], in order to further improve model performance and reduce training overhead, a fraction ζ of the users is selected in each training iteration, and each selected user locally selects important data samples to participate in model training; an intuitive way to quantify the importance λ(z_{k,i}, t) of user C_k's sample z_{k,i} in round t is an upper bound on its gradient norm computed from the input and output of the last layer (layer L) of the model, and data are then selected by this gradient-based upper-bound value; however, when the model is complex and the data volume is large, the computation cost of this method, O(n·s), is high, where n is the total data volume and s is the number of model parameters, θ ∈ R^s; a policy for dynamically selecting users based on server-side log information is therefore proposed: in round t the server selects m users according to each user's selection probability (given as an equation image in the source), so that users with a large impact on the model are selected with higher probability; each selected user C_k then locally computes the importance λ(z_{k,i}, t−1) of each datum and selects data z_{k,i} ∈ D_k with a probability determined by this importance (its exact expression appears as an equation image in the source), taking into account that the L2 norm of the gradient of erroneous data is far larger than that of correct data.
5. The data selection method in a federated learning scenario according to claim 1, characterized in that in each iteration all selected users train their local models on the selected samples, and the server aggregates the users' model updates to update the global model; the server repeats this process until the globally optimal model θ* is obtained.
CN202011464915.XA 2020-12-14 2020-12-14 Data selection method in federated learning scene Pending CN112464269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011464915.XA CN112464269A (en) 2020-12-14 2020-12-14 Data selection method in federated learning scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011464915.XA CN112464269A (en) 2020-12-14 2020-12-14 Data selection method in federated learning scene

Publications (1)

Publication Number Publication Date
CN112464269A true CN112464269A (en) 2021-03-09

Family

ID=74804713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011464915.XA Pending CN112464269A (en) 2020-12-14 2020-12-14 Data selection method in federated learning scene

Country Status (1)

Country Link
CN (1) CN112464269A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326947A (en) * 2021-05-28 2021-08-31 山东师范大学 Joint learning model training method and system
CN114189899A (en) * 2021-12-10 2022-03-15 东南大学 User equipment selection method based on random aggregation beam forming
CN114219147A (en) * 2021-12-13 2022-03-22 南京富尔登科技发展有限公司 Power distribution station fault prediction method based on federal learning
CN114841016A (en) * 2022-05-26 2022-08-02 北京交通大学 Multi-model federal learning method, system and storage medium
CN115391734A (en) * 2022-10-11 2022-11-25 广州天维信息技术股份有限公司 Client satisfaction analysis system based on federal learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200242466A1 (en) * 2017-03-22 2020-07-30 Visa International Service Association Privacy-preserving machine learning
CN111866954A (en) * 2020-07-21 2020-10-30 重庆邮电大学 User selection and resource allocation method based on federal learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200242466A1 (en) * 2017-03-22 2020-07-30 Visa International Service Association Privacy-preserving machine learning
CN111866954A (en) * 2020-07-21 2020-10-30 重庆邮电大学 User selection and resource allocation method based on federated learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIFFANY TUOR, ET AL.: "Data Selection for Federated Learning with Relevant and Irrelevant Data at Clients", arXiv:2001.08300, pages 1 - 14 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326947A (en) * 2021-05-28 2021-08-31 山东师范大学 Joint learning model training method and system
CN113326947B (en) * 2021-05-28 2023-06-16 山东师范大学 Training method and system for joint learning model
CN114189899A (en) * 2021-12-10 2022-03-15 东南大学 User equipment selection method based on random aggregation beam forming
CN114219147A (en) * 2021-12-13 2022-03-22 南京富尔登科技发展有限公司 Power distribution station fault prediction method based on federated learning
CN114219147B (en) * 2021-12-13 2024-06-07 南京富尔登科技发展有限公司 Power distribution station fault prediction method based on federated learning
CN114841016A (en) * 2022-05-26 2022-08-02 北京交通大学 Multi-model federated learning method, system and storage medium
CN115391734A (en) * 2022-10-11 2022-11-25 广州天维信息技术股份有限公司 Client satisfaction analysis system based on federated learning
CN115391734B (en) * 2022-10-11 2023-03-10 广州天维信息技术股份有限公司 Client satisfaction analysis system based on federated learning

Similar Documents

Publication Publication Date Title
CN112464269A (en) Data selection method in federated learning scene
Cao et al. Multi-marginal Wasserstein GAN
CN112446423B (en) Fast hybrid high-order attention domain adversarial network method based on transfer learning
Xu et al. Unsupervised domain adaptation via importance sampling
CN104008174A (en) Privacy-protection index generation method for mass image retrieval
Liu et al. Intelligent and secure content-based image retrieval for mobile users
CN109726195B (en) Data enhancement method and device
CN112883070B (en) Generative adversarial network recommendation method with differential privacy
US11874866B2 (en) Multiscale quantization for fast similarity search
CN113378938A (en) Edge transform graph neural network-based small sample image classification method and system
CN112668482A (en) Face recognition training method and device, computer equipment and storage medium
CN116229552A (en) Face recognition method for embedded hardware based on YOLOV7 model
CN116450877A (en) Image text matching method based on semantic selection and hierarchical alignment
CN114401229A (en) Encrypted traffic identification method based on Transformer deep learning model
Chapel et al. Partial Gromov-Wasserstein with applications on positive-unlabeled learning
CN116630726B (en) Multi-mode-based bird classification method and system
CN114003744A (en) Image retrieval method and system based on convolutional neural network and vector homomorphic encryption
CN113935396A (en) Manifold theory-based method and related device for defending against adversarial example attacks
CN116383470B (en) Image searching method with privacy protection function
CN117456267A (en) Class-incremental learning method based on similarity prototype replay
CN116796038A (en) Remote sensing data retrieval method, remote sensing data retrieval device, edge processing equipment and storage medium
CN114911967B (en) Three-dimensional model sketch retrieval method based on self-adaptive domain enhancement
CN106529601A (en) Image classification prediction method based on multi-task learning in sparse subspace
CN112906829B (en) Method and device for constructing digit recognition model based on the MNIST dataset
CN115481415A (en) Communication cost optimization method, system, device and medium based on vertical federated learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination