CN117829307A - Federated learning method and system for data heterogeneity - Google Patents

Federated learning method and system for data heterogeneity

Info

Publication number
CN117829307A
Authority
CN
China
Prior art keywords
user
data
model
cluster
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311769640.4A
Other languages
Chinese (zh)
Inventor
赵川
魏宇楠
赵圣楠
埃尔加内·阿米娜
林宇成
鞠雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quancheng Provincial Laboratory
Original Assignee
Quancheng Provincial Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quancheng Provincial Laboratory
Priority to CN202311769640.4A
Publication of CN117829307A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a federated learning method and system for data heterogeneity, belonging to the technical field of federated learning. The method comprises the following steps: the central server acquires user-side data, performs federated learning training to cluster the user sides, and is connected with the user sides through acceleration nodes; the user side acquires the global model gradient and the aggregated gradient through the acceleration nodes, performs local model training, updates the local model, and sends the local model gradient to the acceleration node; each acceleration node aggregates the model gradients of its user sides and sends the aggregated model gradient to the central server, each acceleration node being connected with the user sides in one user cluster; after receiving the user gradients sent by the acceleration nodes, the central server aggregates them to obtain the global model gradient, updates the global model, and then distributes the aggregated gradient to the user sides through the acceleration nodes. The invention guarantees the training accuracy of the model and improves the efficiency and performance of federated learning.

Description

Federated learning method and system for data heterogeneity
Technical Field
The invention relates to a federated learning method and system for data heterogeneity, and belongs to the technical field of federated learning.
Background
In recent years, with the rapid increase in the number of smart phones and smart wearable devices, a large amount of user-generated data is stored on these terminal devices, which have storage and data processing capabilities. As a new machine learning paradigm, federated learning (FL) can make full use of the computing, storage, and data resources of edge devices to implement collaborative model training across multiple devices while protecting the privacy of users' local data. FL has been used in a number of application areas, including human behavior prediction, emotion detection, natural language processing, and enterprise infrastructure.
McMahan et al. of Google first proposed the basic framework of FL and designed a parameter aggregation algorithm named FedAvg, but non-IID (non-independent and identically distributed) data affects the prediction accuracy of the global model, and as data heterogeneity increases, the accuracy of FedAvg drops significantly. To address the challenges presented by non-IID data, FedDyn and SCAFFOLD each use regularization methods to estimate global knowledge of the data distribution across all devices. However, this approach can produce large deviations when only a small number of devices participate in each round of training. Studies have shown that momentum can be used to improve the accuracy of FL and can be combined with other methods. Schemes such as CMFL, Oort, and Favor neutralize the effects of non-IID data and accelerate convergence by selecting a set of "excellent" devices to participate in each round of training, but these methods do not take full advantage of the valuable data stored on a small number of devices. FedRep and FedMD build different models for each device using transfer learning and knowledge distillation, respectively, which makes it difficult for new devices to select an appropriate model for initialization.
In a typical FL framework, the user devices participating in training first perform local training on their own data and then upload the updated model parameters to the central server. In each round, the central server aggregates the model parameters from different devices and then broadcasts the aggregated global model parameters back to the devices. Throughout the FL training process, the training data never leaves each device, so data privacy is protected. FL typically involves a large number of devices with highly heterogeneous hardware resources (CPU, memory, and network resources) and non-IID data. Existing FL frameworks can be divided into synchronous FL (e.g., FedAvg) and asynchronous FL (e.g., FedAsync). When large-scale heterogeneous devices perform federated learning, synchronous FL often causes idle effects, with devices entering an idle state; asynchronous FL can keep devices from falling idle, but the more powerful devices then communicate with the server over more rounds, possibly causing the server to crash, and the stale models uploaded by less powerful devices can also degrade the training results of the global model. In addition, in both frameworks, the non-IID data stored on the devices causes significant differences in the weights of the device updates, greatly affecting the training accuracy of the final model.
Disclosure of Invention
In order to solve the above problems, the invention provides a federated learning method and system for data heterogeneity, which can guarantee the training accuracy of the model and improve the efficiency and performance of federated learning.
The technical scheme adopted for solving the technical problems is as follows:
in a first aspect, the federated learning method for data heterogeneity provided by an embodiment of the present invention includes the following steps:
the central server acquires the data of the user terminal, performs federated learning training to cluster the user terminals, and is connected with the user terminals through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
As a possible implementation manner of this embodiment, the federated learning training process includes:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
As a possible implementation manner of this embodiment, each row of the data distribution matrix of the user terminal corresponds to the number of data samples of different data labels owned by one user terminal.
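As a small illustration of such a matrix (using assumed label arrays and a 10-class setting such as MNIST; the function and variable names are not taken from the patent), one possible construction is:

```python
import numpy as np

def data_distribution_matrix(client_labels, num_classes=10):
    """Build the user-side data distribution matrix: one row per client,
    one column per data label, entries are sample counts."""
    mat = np.zeros((len(client_labels), num_classes), dtype=int)
    for row, labels in enumerate(client_labels):
        counts = np.bincount(np.asarray(labels), minlength=num_classes)
        mat[row] = counts[:num_classes]
    return mat

# toy example: three clients with skewed label holdings
clients = [np.array([0, 0, 0, 3, 3]), np.array([1, 1, 4]), np.array([7, 7, 9, 9])]
print(data_distribution_matrix(clients))
```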
As a possible implementation manner of this embodiment, the calculating of the EMD distance between user-side data distributions to obtain the user similarity matrix includes:
assume that the data probability distributions of user terminal $U_j$ and user terminal $U_l$ are $P=\{(p_1,w_{p_1}),(p_2,w_{p_2}),\dots,(p_m,w_{p_m})\}$ and $Q=\{(q_1,w_{q_1}),(q_2,w_{q_2}),\dots,(q_m,w_{q_m})\}$ respectively, wherein $w_{p_i}$ and $w_{q_j}$ denote the weight of a feature value under the corresponding distribution;
define the distance matrix between P and Q as $Dist=[d_{ij}]$, wherein $d_{ij}$ denotes the distance between $p_i$ and $q_j$, and find the flow matrix $F=[f_{ij}]$ that minimizes the total flow cost from $p_i$ to $q_j$, namely:

$$\min_{F}\ \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}\quad \text{s.t.}\quad f_{ij}\ge 0,\quad \sum_{j=1}^{m} f_{ij}\le w_{p_i},\quad \sum_{i=1}^{m} f_{ij}\le w_{q_j},\quad \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}=\min\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{m} w_{q_j}\Big),$$

wherein $f_{ij}$ denotes the amount of flow from $p_i$ to $q_j$;
after finding the optimal flow matrix F, the EMD distance between the two data distributions P and Q is:

$$EMD(P,Q)=\frac{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}},$$

wherein $1\le i\le m$ and $1\le j\le m$.
As a possible implementation manner of this embodiment, the calculating of Euclidean distances to obtain user clusters includes:
converting the distance matrix Dist into an undirected weighted graph $G=\langle V,E\rangle$, wherein each vertex represents a user and the weight of each edge, $w_{jl}$, is the distance between the two users $U_j$ and $U_l$ it connects, i.e. $w_{jl}=d_{jl}=EMD(U_j,U_l)$;
for each vertex $v_j\in V$, defining its degree $d_j$ as the sum of the weights of all edges connected to it, i.e. $d_j=\sum_{v_l\in V} w_{jl}$, and calculating the degree matrix of the distance matrix Dist:

$$D=\mathrm{diag}(d_1,d_2,\dots,d_n);$$

calculating the Laplacian matrix of graph G:

$$L=D-W,$$

wherein $W=[w_{jl}]$ is the weighted adjacency matrix;
letting the point sets of the k subgraphs of graph G be $A_1,A_2,\dots,A_k$, with $A_i\cap A_j=\varnothing$ for $i\ne j$ and $A_1\cup A_2\cup\dots\cup A_k=V$;
for $A,B\subset V$ with $A\cap B=\varnothing$, defining the graph-cut loss function between A and B as:

$$W(A,B)=\sum_{j\in A,\,l\in B} w_{jl};$$

the graph-cut loss function of the k subgraphs is defined as:

$$\mathrm{cut}(A_1,A_2,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k} W(A_i,\bar{A}_i);$$

for $A\subset V$, defining $\mathrm{vol}(A):=\sum_{j\in A} d_j$;
performing optimal subgraph cutting on G with the Normalized Cut method, wherein the graph-cut loss is defined as:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

wherein $\bar{A}_j$ is the complement of $A_j$; introducing the indicator matrix $H=[h_1,h_2,\dots,h_k]\in\mathbb{R}^{n\times k}$;
for each vector $h_j$, defining:

$$h_{ij}=\begin{cases}\dfrac{1}{\sqrt{\mathrm{vol}(A_j)}}, & v_i\in A_j,\\ 0, & v_i\notin A_j;\end{cases}$$

for each $h_i$, it then holds that:

$$h_i^{\mathsf T}L\,h_i=\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

namely:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k} h_i^{\mathsf T}L\,h_i=\mathrm{tr}(H^{\mathsf T}LH),\quad\text{s.t. } H^{\mathsf T}DH=I;$$

letting $T=D^{1/2}H$, there is:

$$\mathrm{tr}(H^{\mathsf T}LH)=\mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T),\quad\text{s.t. } T^{\mathsf T}T=I;$$

the optimization objective function is therefore:

$$\arg\min_{T}\ \mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T)\quad\text{s.t. } T^{\mathsf T}T=I,$$

wherein $T\in\mathbb{R}^{n\times k}$ is the feature matrix formed, after row normalization, by the eigenvectors corresponding to the k smallest eigenvalues of $D^{-1/2}LD^{-1/2}$;
each row of the feature matrix $T$ is treated as a sample; k cluster centers $E=\{e_1,e_2,\dots,e_k\}$ are initialized in the sample set, and in each iteration every sample computes its Euclidean distance to each cluster center and is assigned to the closest cluster $c_i$, until the maximum number of iterations is reached, obtaining the clustering result $C=\{c_i\mid i\in[1,k]\}$.
As a possible implementation manner of this embodiment, the cluster-based aggregation of client model gradients includes:
after completing local model training, user terminal $U_j$ in cluster $c_i$ periodically transmits its model gradient $W_j$ to acceleration node $AN_i$; $AN_i$ uploads the aggregated model gradient of the cluster to the central server at regular intervals, and the central server takes the distance from each user in each cluster to the cluster center as the corresponding weight for parameter aggregation when aggregating the gradients, wherein $l_{ij}$ denotes the distance between user $U_j$ and cluster center $e_i$, $L_i$ denotes the average distance between the users in cluster $c_i$ and the cluster center, and the aggregation result is the global model gradient after the (r+1)-th round of iteration.
As a possible implementation manner of this embodiment, the user side includes a smart phone, a PC, or a smart wearable device.
In a second aspect, the federated learning system for data heterogeneity provided in the embodiment of the present invention includes a client, an acceleration node, and a central server,
the central server acquires user-side data, performs federated learning training to cluster the user sides, and is connected with the user sides through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
As a possible implementation manner of this embodiment, the federated learning training process performed by the central server is:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
As a possible implementation manner of this embodiment, the user side includes a smart phone, a PC, or a smart wearable device.
The technical scheme of the embodiment of the invention has the following beneficial effects:
the federal learning method for data isomerism comprises the following steps: the central server acquires the data of the user terminal, performs federal learning training to cluster the user terminal, and is connected with the user terminal through an acceleration node; the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes; the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster; and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node. The invention ensures that the heterogeneity of the user side data does not influence the model training effect, and can fully utilize the computing resources on the user side equipment, thereby improving the efficiency and performance of federal learning.
Aiming at the heterogeneity of data among users, the method groups users with similar data distributions based on spectral clustering and then carries out federated model training, thereby reducing the influence of data heterogeneity and effectively improving model accuracy. In the parameter aggregation stage, the method takes the distance between the users in a cluster and the cluster center as part of the aggregation weight, further reducing the influence of differences in user data distribution. The invention guarantees the training accuracy of the model and improves the efficiency and performance of federated learning.
Drawings
FIG. 1 is a flow chart illustrating a federated learning method for data heterogeneity according to an exemplary embodiment;
FIG. 2 is a block diagram of a federated learning system for data heterogeneity, according to an exemplary embodiment;
FIG. 3 is a diagram of a federated learning training framework, according to an exemplary embodiment;
fig. 4 is a schematic diagram illustrating a user data preprocessing process according to an exemplary embodiment.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
in order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different structures of the invention. In order to simplify the present disclosure, components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and processes are omitted so as to not unnecessarily obscure the present invention.
As shown in fig. 1, the federal learning method for data isomerism provided by the embodiment of the invention includes the following steps:
the central server acquires the data of the user terminal, performs federated learning training to cluster the user terminals, and is connected with the user terminals through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
As a possible implementation manner of this embodiment, the federated learning training process includes:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
As a possible implementation manner of this embodiment, each row of the data distribution matrix of the user terminal corresponds to the number of data samples of different data labels owned by one user terminal.
As a possible implementation manner of this embodiment, the calculating of the EMD distance between user-side data distributions to obtain the user similarity matrix includes:
assume that the data probability distributions of user terminal $U_j$ and user terminal $U_l$ are $P=\{(p_1,w_{p_1}),(p_2,w_{p_2}),\dots,(p_m,w_{p_m})\}$ and $Q=\{(q_1,w_{q_1}),(q_2,w_{q_2}),\dots,(q_m,w_{q_m})\}$ respectively, wherein $w_{p_i}$ and $w_{q_j}$ denote the weight of a feature value under the corresponding distribution;
define the distance matrix between P and Q as $Dist=[d_{ij}]$, wherein $d_{ij}$ denotes the distance between $p_i$ and $q_j$, and find the flow matrix $F=[f_{ij}]$ that minimizes the total flow cost from $p_i$ to $q_j$, namely:

$$\min_{F}\ \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}\quad \text{s.t.}\quad f_{ij}\ge 0,\quad \sum_{j=1}^{m} f_{ij}\le w_{p_i},\quad \sum_{i=1}^{m} f_{ij}\le w_{q_j},\quad \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}=\min\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{m} w_{q_j}\Big),$$

wherein $f_{ij}$ denotes the amount of flow from $p_i$ to $q_j$;
after finding the optimal flow matrix F, the EMD distance between the two data distributions P and Q is:

$$EMD(P,Q)=\frac{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}},$$

wherein $1\le i\le m$ and $1\le j\le m$.
As a possible implementation manner of this embodiment, the calculating of Euclidean distances to obtain user clusters includes:
converting the distance matrix Dist into an undirected weighted graph $G=\langle V,E\rangle$, wherein each vertex represents a user and the weight of each edge, $w_{jl}$, is the distance between the two users $U_j$ and $U_l$ it connects, i.e. $w_{jl}=d_{jl}=EMD(U_j,U_l)$;
for each vertex $v_j\in V$, defining its degree $d_j$ as the sum of the weights of all edges connected to it, i.e. $d_j=\sum_{v_l\in V} w_{jl}$, and calculating the degree matrix of the distance matrix Dist:

$$D=\mathrm{diag}(d_1,d_2,\dots,d_n);$$

calculating the Laplacian matrix of graph G:

$$L=D-W,$$

wherein $W=[w_{jl}]$ is the weighted adjacency matrix;
letting the point sets of the k subgraphs of graph G be $A_1,A_2,\dots,A_k$, with $A_i\cap A_j=\varnothing$ for $i\ne j$ and $A_1\cup A_2\cup\dots\cup A_k=V$;
for $A,B\subset V$ with $A\cap B=\varnothing$, defining the graph-cut loss function between A and B as:

$$W(A,B)=\sum_{j\in A,\,l\in B} w_{jl};$$

the graph-cut loss function of the k subgraphs is defined as:

$$\mathrm{cut}(A_1,A_2,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k} W(A_i,\bar{A}_i);$$

for $A\subset V$, defining $\mathrm{vol}(A):=\sum_{j\in A} d_j$;
performing optimal subgraph cutting on G with the Normalized Cut method, wherein the graph-cut loss is defined as:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

wherein $\bar{A}_j$ is the complement of $A_j$; introducing the indicator matrix $H=[h_1,h_2,\dots,h_k]\in\mathbb{R}^{n\times k}$;
for each vector $h_j$, defining:

$$h_{ij}=\begin{cases}\dfrac{1}{\sqrt{\mathrm{vol}(A_j)}}, & v_i\in A_j,\\ 0, & v_i\notin A_j;\end{cases}$$

for each $h_i$, it then holds that:

$$h_i^{\mathsf T}L\,h_i=\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

namely:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k} h_i^{\mathsf T}L\,h_i=\mathrm{tr}(H^{\mathsf T}LH),\quad\text{s.t. } H^{\mathsf T}DH=I;$$

letting $T=D^{1/2}H$, there is:

$$\mathrm{tr}(H^{\mathsf T}LH)=\mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T),\quad\text{s.t. } T^{\mathsf T}T=I;$$

the optimization objective function is therefore:

$$\arg\min_{T}\ \mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T)\quad\text{s.t. } T^{\mathsf T}T=I,$$

wherein $T\in\mathbb{R}^{n\times k}$ is the feature matrix formed, after row normalization, by the eigenvectors corresponding to the k smallest eigenvalues of $D^{-1/2}LD^{-1/2}$;
each row of the feature matrix $T$ is treated as a sample; k cluster centers $E=\{e_1,e_2,\dots,e_k\}$ are initialized in the sample set, and in each iteration every sample computes its Euclidean distance to each cluster center and is assigned to the closest cluster $c_i$, until the maximum number of iterations is reached, obtaining the clustering result $C=\{c_i\mid i\in[1,k]\}$.
As a possible implementation manner of this embodiment, the cluster-based aggregation of client model gradients includes:
after completing local model training, user terminal $U_j$ in cluster $c_i$ periodically transmits its model gradient $W_j$ to acceleration node $AN_i$; $AN_i$ uploads the aggregated model gradient of the cluster to the central server at regular intervals, and the central server takes the distance from each user in each cluster to the cluster center as the corresponding weight for parameter aggregation when aggregating the gradients, wherein $l_{ij}$ denotes the distance between user $U_j$ and cluster center $e_i$, $L_i$ denotes the average distance between the users in cluster $c_i$ and the cluster center, and the aggregation result is the global model gradient after the (r+1)-th round of iteration.
As a possible implementation manner of this embodiment, the user side includes a smart phone, a PC, or a smart wearable device.
As shown in fig. 2, the federated learning system for data heterogeneity provided in the embodiment of the present invention includes a client, an acceleration node and a central server,
the central server acquires user-side data, performs federated learning training to cluster the user sides, and is connected with the user sides through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
As a possible implementation manner of this embodiment, the federated learning training process performed by the central server is:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
The invention provides a federated learning framework for data-heterogeneous conditions, aiming to solve the technical problem of user data heterogeneity in federated learning. The system model, as shown in fig. 3, mainly comprises three types of entities:
center Server (CS): the central server is a type of cloud computing equipment which is reliable, high in reliability and high in computing power and data processing power, and after receiving the user gradient sent by the acceleration nodes, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user through each acceleration node.
Acceleration Node (AN): the accelerating node is a terminal device with the functions of calculation, storage and network routing, is closer to a user than the central server, has smaller communication delay and is mainly used for reducing the communication delay between the user and the central server and the communication bandwidth of the central server. After the central server clusters the users, each acceleration node is responsible for being connected with terminal equipment in one user cluster, periodically aggregates the model gradients of the users in the group and then sends the aggregated model gradients to the central server, and simultaneously distributes global model gradients to the user equipment.
User terminal: each user terminal holds a small local data set, and the data across user terminals are non-independently and identically distributed (non-IID). A user can communicate with the central server using terminal devices such as a smart phone, a PC, or a smart wearable device, so as to collaboratively train a more accurate and efficient machine learning model. In each round of FL, the user side performs local model training and uploads the local model gradient to an AN; the user side obtains the global model gradient from the CS to update the local model, and then performs a new round of training with its own data, until training is completed.
The federated learning training process of the system is shown in algorithm 1. Without loss of generality, it is assumed that there are m acceleration nodes, each connected to one user cluster, and that each cluster contains n user devices connected to the same acceleration node. It is also assumed that the central server knows the union of all user data features, but cannot distinguish the data features owned by any single user.
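For readers who want a concrete picture of this flow, the following is a minimal, illustrative sketch of one communication round in such a hierarchical setup, assuming the user clusters have already been formed; the function names, the simulated gradients, and the size-weighted server-side combination are assumptions of this sketch, not the patent's reference implementation.

```python
import numpy as np

def client_update(global_grad, rng, dim=10):
    """Stand-in for local training: each client perturbs the received global
    gradient to produce its own local model gradient."""
    return global_grad + rng.normal(scale=0.1, size=dim)

def acceleration_node_aggregate(client_grads):
    """Each acceleration node averages the gradients of its user cluster."""
    return np.mean(client_grads, axis=0)

def central_server_aggregate(node_grads, cluster_sizes):
    """The central server combines per-cluster gradients (size-weighted here)."""
    w = np.asarray(cluster_sizes, dtype=float)
    return np.average(np.stack(node_grads), axis=0, weights=w / w.sum())

rng = np.random.default_rng(0)
dim, clusters = 10, [4, 3, 5]            # three user clusters of different sizes
global_grad = np.zeros(dim)

for r in range(3):                        # a few communication rounds
    node_grads = []
    for n_users in clusters:              # each acceleration node serves one cluster
        grads = [client_update(global_grad, rng, dim) for _ in range(n_users)]
        node_grads.append(acceleration_node_aggregate(grads))
    global_grad = central_server_aggregate(node_grads, clusters)
    print(f"round {r}: ||global gradient|| = {np.linalg.norm(global_grad):.4f}")
```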
1.1 data preprocessing stage
For the user devices participating in the current round of training, the central server first counts the data distribution of each user, as shown in fig. 4 (a). Taking the MNIST data set as an example, the data labels fall into 10 categories, and each row of the matrix corresponds to the number of data samples of each data label owned by one device. The EMD distance between every two users is then calculated. Assume that the data probability distributions of users $U_j$ and $U_l$ are $P=\{(p_1,w_{p_1}),\dots,(p_m,w_{p_m})\}$ and $Q=\{(q_1,w_{q_1}),\dots,(q_m,w_{q_m})\}$, where $w_{p_i}$ and $w_{q_j}$ denote the weight of a feature value under the corresponding distribution (here the weight of each data feature is equal by default). Define the distance matrix between P and Q as $Dist=[d_{ij}]$, where $d_{ij}$ denotes the distance between $p_i$ and $q_j$ ($1\le i\le m$, $1\le j\le m$), and finally find a flow matrix $F=[f_{ij}]$ that minimizes the total flow cost from $p_i$ to $q_j$, namely:

$$\min_{F}\ \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}\quad \text{s.t.}\quad f_{ij}\ge 0,\quad \sum_{j=1}^{m} f_{ij}\le w_{p_i},\quad \sum_{i=1}^{m} f_{ij}\le w_{q_j},\quad \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}=\min\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{m} w_{q_j}\Big),$$

where $f_{ij}$ denotes the amount of flow from $p_i$ to $q_j$. This is an optimization problem that minimizes a linear function under linear constraints. After the optimal F is found, the EMD distance between the two data distributions is:

$$EMD(P,Q)=\frac{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}}.$$
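As an illustration of this EMD computation, the sketch below solves the transportation linear program with SciPy for two users' normalized label-count vectors. The ground-distance matrix between labels (|i - j| here) and all variable names are assumptions of this sketch, not values fixed by the patent.

```python
import numpy as np
from scipy.optimize import linprog

def emd(w_p, w_q, dist):
    """Earth Mover's Distance between two normalized label distributions.

    w_p, w_q : length-m weight vectors (here: normalized label counts)
    dist     : m x m ground-distance matrix d_ij between labels
    Solves  min sum_ij f_ij * d_ij  over the flow matrix F, then normalizes
    by the total flow. With both inputs normalized, the marginal constraints
    become equalities.
    """
    m = len(w_p)
    c = dist.reshape(-1)                      # objective coefficients
    A_eq = np.zeros((2 * m, m * m))
    for i in range(m):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # sum_j f_ij = w_p[i]
    for j in range(m):
        A_eq[m + j, j::m] = 1.0               # sum_i f_ij = w_q[j]
    b_eq = np.concatenate([w_p, w_q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    assert res.success
    flow = res.x
    return float(flow @ c / flow.sum())

# example: label-count rows of two users on a 10-class dataset (e.g. MNIST)
counts_u = np.array([600, 0, 0, 550, 0, 0, 0, 620, 0, 0], dtype=float)
counts_v = np.array([0, 580, 0, 0, 610, 0, 0, 0, 0, 590], dtype=float)
w_u, w_v = counts_u / counts_u.sum(), counts_v / counts_v.sum()
labels = np.arange(10, dtype=float)
ground = np.abs(labels[:, None] - labels[None, :])   # |i - j| as ground distance
print(f"EMD(U, V) = {emd(w_u, w_v, ground):.4f}")
```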
after the distance matrix Dist between users is obtained (shown in fig. 4 (b)), it is converted into an undirected weighted graph g=<V,E>As shown in FIG. 4 (c), where each vertex represents a user, the weight of each edgeFor any two users U j And U l Distance between, i.e.)> />
Thereafter, algorithm 2 is run forDefinition of its degree d j For the sum of the weights of all edges connected to it, i.e.Then calculating to obtain a degree matrix of the distance matrix Dist:
the laplacian matrix of graph G is further calculated:
let the point set of k subgraphs of graph G be A 1 ,A 2 ,…,A k WhereinAnd A is 1 ∪A 2 ∪…∪A k =v. For->B ε V, define the tangent pattern loss function between A and B as:
the tangent pattern loss function of the k-subgraph is defined as:
for the followingDefinition of vol (A): sigma j∈A d j . G was performed using the Normalized Cut mapping method
Cutting an optimal subgraph, and defining graph cutting loss as follows:
wherein the method comprises the steps ofIs A j Is a complement of (a). Introducing an indication matrix->For vector h j Definition: />
For the followingThe method comprises the following steps:
namely:
order theThen there are:
the optimization objective function is therefore:
wherein the method comprises the steps ofIs->The feature matrix is formed by the feature vectors corresponding to the first k minimum feature values according to line standardization. For feature matrix->Each row of which is considered as a sample, and k is initialized in the sample set as shown in algorithm 3 The clustering centers E= { E 1 ,e 2 ,…,e k In each iteration, each sample calculates the Euclidean distance from each cluster center and is integrated into cluster c with the shortest distance i Until the maximum iteration times are reached, obtaining a clustering result C= { C i |i∈[1,k]}。/>
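The spectral clustering steps above (algorithm 2 followed by the k-means of algorithm 3) can be illustrated with the following sketch. Because the original formula images are not reproduced here, the conversion of EMD distances into edge affinities (a Gaussian kernel below), the bandwidth parameter, and the toy distance matrix are assumptions of this illustration rather than the patent's exact definitions.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_cluster(dist, k, sigma=0.5, n_iter=100, seed=0):
    """Cluster users from a pairwise EMD distance matrix via normalized cuts.

    dist  : n x n symmetric matrix of EMD distances between users
    k     : number of user clusters
    sigma : bandwidth of the (assumed) Gaussian kernel turning distances
            into edge affinities
    """
    n = dist.shape[0]
    W = np.exp(-(dist ** 2) / (2 * sigma ** 2))     # affinity matrix (assumption)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                               # vertex degrees
    L = np.diag(d) - W                              # Laplacian L = D - W
    L_sym = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)   # D^-1/2 L D^-1/2
    _, eigvecs = eigh(L_sym)                        # eigenvalues in ascending order
    T = eigvecs[:, :k]                              # k smallest eigenvalues
    T = T / np.linalg.norm(T, axis=1, keepdims=True)      # row normalization

    # plain k-means on the rows of T (Euclidean distance to cluster centers)
    rng = np.random.default_rng(seed)
    centers = T[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((T[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.array([T[assign == c].mean(axis=0) if np.any(assign == c)
                            else centers[c] for c in range(k)])
    return assign

# toy EMD distance matrix for 6 users forming two groups with similar label mixes
dist = np.array([[0.0, 0.1, 0.1, 0.9, 0.9, 0.9],
                 [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
                 [0.1, 0.1, 0.0, 0.9, 0.9, 0.9],
                 [0.9, 0.9, 0.9, 0.0, 0.1, 0.1],
                 [0.9, 0.9, 0.9, 0.1, 0.0, 0.1],
                 [0.9, 0.9, 0.9, 0.1, 0.1, 0.0]])
print(spectral_cluster(dist, k=2))   # e.g. [0 0 0 1 1 1] (cluster labels may swap)
```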
1.2 Gradient aggregation stage
After completing local model training, user device $U_j$ in cluster $c_i$ periodically transmits its model gradient $W_j$ to acceleration node $AN_i$, and $AN_i$ periodically uploads the aggregated model gradient of the cluster to the CS. By default, the weight of each model uploaded to the server is unrelated to the amount of data on the device uploading it: every local model has the same weight, which avoids gradient aggregation errors caused by differences in the sizes of user data sets. When the resource heterogeneity of the devices is very high, the system instead uses the distance from each user in a cluster to the cluster center as the corresponding weight for parameter aggregation, where $l_{ij}$ denotes the distance between user $U_j$ and cluster center $e_i$, $L_i$ denotes the average distance between the users in cluster $c_i$ and the cluster center, and the aggregation result is the global model gradient after the (r+1)-th round of iteration. The CS distributes the aggregated model gradient to each $AN_i$, which then distributes it to each user device $U_j$.
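The exact weighting formula is given only as an image in the original text, so the sketch below uses one plausible choice, weights inversely proportional to each user's distance from its cluster center and normalized within the cluster, purely to illustrate the two-level aggregation; the helper names and the size-weighted combination at the central server are likewise assumptions.

```python
import numpy as np

def aggregate_cluster(grads, dists_to_center, eps=1e-8):
    """Aggregate one cluster's gradients, weighting each user by its distance
    to the cluster center. The inverse-distance weighting used here is an
    illustrative choice, not the patent's exact formula."""
    d = np.asarray(dists_to_center, dtype=float) + eps
    w = (1.0 / d) / (1.0 / d).sum()               # closer to center -> larger weight
    return np.tensordot(w, np.stack(grads), axes=1)

def aggregate_global(cluster_grads, cluster_sizes):
    """Central server combines the per-cluster aggregates into the global gradient."""
    w = np.asarray(cluster_sizes, dtype=float)
    return np.average(np.stack(cluster_grads), axis=0, weights=w / w.sum())

# two clusters, three users each, 4-dimensional gradients
rng = np.random.default_rng(2)
g1 = [rng.normal(size=4) for _ in range(3)]
g2 = [rng.normal(size=4) for _ in range(3)]
agg1 = aggregate_cluster(g1, dists_to_center=[0.2, 0.5, 0.9])
agg2 = aggregate_cluster(g2, dists_to_center=[0.1, 0.4, 0.4])
print(aggregate_global([agg1, agg2], cluster_sizes=[3, 3]))
```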
Aiming at the heterogeneity of data among users, the method groups users with similar data distributions based on spectral clustering and then carries out federated model training, thereby reducing the influence of data heterogeneity and effectively improving model accuracy.
In the parameter aggregation stage, the method takes the distance between the users in the cluster and the cluster center as a part of the aggregation weight consideration, and further reduces the influence caused by the user data distribution difference.
The invention guarantees the training accuracy of the model and improves the efficiency and performance of federated learning.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. A federated learning method for data heterogeneity, characterized by comprising the following steps:
the central server acquires the data of the user terminal, performs federated learning training to cluster the user terminals, and is connected with the user terminals through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
2. The data heterogeneity-oriented federated learning method of claim 1, wherein the federated learning training process comprises:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
3. The federated learning method for data heterogeneity according to claim 2, wherein each row of the client data distribution matrix corresponds to the number of data samples of different data labels owned by a client.
4. The federated learning method for data heterogeneity according to claim 3, wherein calculating the EMD distance between user-side data distributions to obtain the user similarity matrix comprises:
assuming that the data probability distributions of user terminal $U_j$ and user terminal $U_l$ are $P=\{(p_1,w_{p_1}),(p_2,w_{p_2}),\dots,(p_m,w_{p_m})\}$ and $Q=\{(q_1,w_{q_1}),(q_2,w_{q_2}),\dots,(q_m,w_{q_m})\}$ respectively, wherein $w_{p_i}$ and $w_{q_j}$ denote the weight of a feature value under the corresponding distribution;
defining the distance matrix between P and Q as $Dist=[d_{ij}]$, wherein $d_{ij}$ denotes the distance between $p_i$ and $q_j$, and finding the flow matrix $F=[f_{ij}]$ that minimizes the total flow cost from $p_i$ to $q_j$, namely:

$$\min_{F}\ \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}\quad \text{s.t.}\quad f_{ij}\ge 0,\quad \sum_{j=1}^{m} f_{ij}\le w_{p_i},\quad \sum_{i=1}^{m} f_{ij}\le w_{q_j},\quad \sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}=\min\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{m} w_{q_j}\Big),$$

wherein $f_{ij}$ denotes the amount of flow from $p_i$ to $q_j$;
after finding the optimal flow matrix F, the EMD distance between the two data distributions P and Q is:

$$EMD(P,Q)=\frac{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}\,d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij}},$$

wherein $1\le i\le m$ and $1\le j\le m$.
5. The data heterogeneity-oriented federated learning method of claim 4, wherein the calculating of Euclidean distances to obtain user clusters comprises:
converting the distance matrix Dist into an undirected weighted graph $G=\langle V,E\rangle$, wherein each vertex represents a user and the weight of each edge, $w_{jl}$, is the distance between the two users $U_j$ and $U_l$ it connects, i.e. $w_{jl}=d_{jl}=EMD(U_j,U_l)$;
for each vertex $v_j\in V$, defining its degree $d_j$ as the sum of the weights of all edges connected to it, i.e. $d_j=\sum_{v_l\in V} w_{jl}$, and calculating the degree matrix of the distance matrix Dist:

$$D=\mathrm{diag}(d_1,d_2,\dots,d_n);$$

calculating the Laplacian matrix of graph G:

$$L=D-W,$$

wherein $W=[w_{jl}]$ is the weighted adjacency matrix;
letting the point sets of the k subgraphs of graph G be $A_1,A_2,\dots,A_k$, with $A_i\cap A_j=\varnothing$ for $i\ne j$ and $A_1\cup A_2\cup\dots\cup A_k=V$;
for $A,B\subset V$ with $A\cap B=\varnothing$, defining the graph-cut loss function between A and B as:

$$W(A,B)=\sum_{j\in A,\,l\in B} w_{jl};$$

the graph-cut loss function of the k subgraphs is defined as:

$$\mathrm{cut}(A_1,A_2,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k} W(A_i,\bar{A}_i);$$

for $A\subset V$, defining $\mathrm{vol}(A):=\sum_{j\in A} d_j$;
performing optimal subgraph cutting on G with the Normalized Cut method, wherein the graph-cut loss is defined as:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

wherein $\bar{A}_j$ is the complement of $A_j$; introducing the indicator matrix $H=[h_1,h_2,\dots,h_k]\in\mathbb{R}^{n\times k}$;
for each vector $h_j$, defining:

$$h_{ij}=\begin{cases}\dfrac{1}{\sqrt{\mathrm{vol}(A_j)}}, & v_i\in A_j,\\ 0, & v_i\notin A_j;\end{cases}$$

for each $h_i$, it then holds that:

$$h_i^{\mathsf T}L\,h_i=\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},$$

namely:

$$\mathrm{NCut}(A_1,A_2,\dots,A_k)=\sum_{i=1}^{k} h_i^{\mathsf T}L\,h_i=\mathrm{tr}(H^{\mathsf T}LH),\quad\text{s.t. } H^{\mathsf T}DH=I;$$

letting $T=D^{1/2}H$, there is:

$$\mathrm{tr}(H^{\mathsf T}LH)=\mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T),\quad\text{s.t. } T^{\mathsf T}T=I;$$

the optimization objective function is therefore:

$$\arg\min_{T}\ \mathrm{tr}(T^{\mathsf T}D^{-1/2}LD^{-1/2}T)\quad\text{s.t. } T^{\mathsf T}T=I,$$

wherein $T\in\mathbb{R}^{n\times k}$ is the feature matrix formed, after row normalization, by the eigenvectors corresponding to the k smallest eigenvalues of $D^{-1/2}LD^{-1/2}$;
each row of the feature matrix $T$ is treated as a sample; k cluster centers $E=\{e_1,e_2,\dots,e_k\}$ are initialized in the sample set, and in each iteration every sample computes its Euclidean distance to each cluster center and is assigned to the closest cluster $c_i$, until the maximum number of iterations is reached, obtaining the clustering result $C=\{c_i\mid i\in[1,k]\}$.
6. The data heterogeneity-oriented federated learning method of claim 5, wherein the cluster-based aggregation of client model gradients comprises:
after completing local model training, user terminal $U_j$ in cluster $c_i$ periodically transmits its model gradient $W_j$ to acceleration node $AN_i$; $AN_i$ uploads the aggregated model gradient of the cluster to the central server at regular intervals, and the central server takes the distance from each user in each cluster to the cluster center as the corresponding weight for parameter aggregation when aggregating the gradients, wherein $l_{ij}$ denotes the distance between user $U_j$ and cluster center $e_i$, $L_i$ denotes the average distance between the users in cluster $c_i$ and the cluster center, and the aggregation result is the global model gradient after the (r+1)-th round of iteration.
7. The federated learning method for data heterogeneity according to any one of claims 1-6, wherein the client comprises a smart phone, a PC, or a smart wearable device.
8. A federated learning system oriented to data heterogeneity, characterized by comprising a user side, an acceleration node and a central server,
the central server acquires user-side data, performs federated learning training to cluster the user sides, and is connected with the user sides through acceleration nodes;
the user side acquires global model gradients and aggregation gradients through the acceleration nodes, performs local model training, updates the local model and sends the local model gradients to the acceleration nodes;
the acceleration nodes aggregate the model gradients of the user terminals and then send the aggregated model gradients to the central server, and each acceleration node is connected with the user terminal in one user cluster;
and after receiving the user gradient sent by the acceleration node, the central server aggregates the user gradient to obtain a global model gradient and updates the global model, and then distributes the aggregated gradient to the user terminal through the acceleration node.
9. The data heterogeneity-oriented federated learning system of claim 8, wherein the federated learning training process performed by the central server is:
acquiring the data distribution condition of a user side, and constructing a data distribution matrix of the user side;
calculating the EMD distance between user-side data distributions to obtain a user similarity matrix;
calculating Euclidean distances to obtain user clusters;
and aggregating the client model gradients based on the cluster.
10. The data heterogeneity-oriented federated learning system according to claim 8 or 9, wherein the client comprises a smart phone, a PC, or a smart wearable device.
CN202311769640.4A 2023-12-20 2023-12-20 Federal learning method and system for data heterogeneity Pending CN117829307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311769640.4A CN117829307A (en) 2023-12-20 2023-12-20 Federal learning method and system for data heterogeneity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311769640.4A CN117829307A (en) 2023-12-20 2023-12-20 Federal learning method and system for data heterogeneity

Publications (1)

Publication Number Publication Date
CN117829307A true CN117829307A (en) 2024-04-05

Family

ID=90522227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311769640.4A Pending CN117829307A (en) 2023-12-20 2023-12-20 Federal learning method and system for data heterogeneity

Country Status (1)

Country Link
CN (1) CN117829307A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118171076A (en) * 2024-05-14 2024-06-11 中国矿业大学 Data feature extraction method, system and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination