CN115952860A - Heterogeneous statistics-oriented clustering federated learning method - Google Patents

Heterogeneous statistics-oriented clustering federated learning method

Info

Publication number
CN115952860A
Authority
CN
China
Prior art keywords
node, clustering, model, edge, cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310060893.8A
Other languages
Chinese (zh)
Inventor
左方
高铭远
刘家萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202310060893.8A priority Critical patent/CN115952860A/en
Publication of CN115952860A publication Critical patent/CN115952860A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a heterogeneous statistics-oriented clustering federated learning method, which comprises the following steps: step 1, constructing an edge node distribution classifier; step 2, determining a metric for edge node clustering; step 3, determining a clustering method for the node clusters; step 4, clustering the edge nodes with the clustering method; step 5, the server initializes the global model and sends it to the head node of each node cluster; step 6, after receiving the model, an edge node trains it on its local dataset, updates it, and sends the updated model to the next node in the cluster for training, until all nodes in each cluster have completed training and the updated model is uploaded to the server; step 7, the server receives the updated models of all clusters, performs a weighted average, and updates the global model; step 8, steps 6 and 7 are repeated until the global model converges. Compared with traditional federated learning methods, the method is more efficient and more widely applicable.

Description

Heterogeneous statistics-oriented clustering federated learning method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a heterogeneous statistics-oriented clustering federated learning method.
Background
Modern mobile and Internet of Things devices (e.g., smartphones, smart wearables, smart home devices) produce large amounts of data every day, which provides opportunities to build complex machine learning (ML) models for challenging artificial intelligence tasks. In conventional high-performance computing (HPC), all data is collected and concentrated in one place for processing by a supercomputer with hundreds to thousands of compute nodes. However, concerns about security and privacy have led to new legislation, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), that prevents data from being transmitted to a centralized location, making traditional high-performance computing difficult to apply to collecting and processing scattered data. Federated learning addresses these security and privacy challenges by using decentralized data: local models are trained on the local data of each client (data owner), and a central aggregator accumulates the learning gradients of the local models to train a global model, giving rise to an emerging high-performance computing paradigm. While the computing resources of a single client may be far less powerful than the compute nodes of a traditional supercomputer, the computing power of a large number of clients can be accumulated to form a very powerful "decentralized virtual supercomputer". Federated learning has proven successful in a range of applications, from consumer devices such as GBoard and keyword spotting to the pharmaceutical, medical research, financial, and manufacturing industries.
The data in federated learning is owned by the clients and may vary widely in quantity and content, resulting in severe data heterogeneity that does not typically occur in data-center distributed learning, where the data distribution is well controlled. In data-center distributed learning, the categories and characteristics of the training data are evenly distributed across all clients, i.e., independently and identically distributed (IID). In federated learning, however, the distribution of data classes and features depends on the data owner, resulting in a non-uniform data distribution referred to as non-IID data heterogeneity. This heterogeneity greatly affects training time and accuracy, and a technical solution for this situation is needed.
Disclosure of Invention
To address the problem that in federated learning the distribution of data classes and features depends on the data owners, making the data distribution non-uniform and thereby greatly affecting training time and accuracy, the invention provides a heterogeneous statistics-oriented clustering federated learning method for federated learning environments with statistical data heterogeneity, realizing a more efficient and more widely applicable federated learning method.
In order to achieve the purpose, the invention adopts the following technical scheme:
a heterogeneous statistics-oriented clustering federal learning method comprises the following steps:
step 1, constructing an edge node distribution classifier;
step 2, determining a measurement index of edge node clustering;
step 3, determining a clustering method of the node cluster;
step 4, clustering the edge nodes by using a clustering method;
step 5, the server initializes the global model and sends the model to the head node of each node cluster;
step 6, after receiving the model, the edge node performs local training on a local data set and updates the model, sends the updated model to the next node in the cluster for training until all the nodes in each cluster complete training, and uploads the updated model to the server;
step 7, the server receives the updated models of all clusters, then carries out weighted average and updates the global model;
step 8, repeating step 6 and step 7 until the global model converges.
Further, the step 1 comprises:
splitting the global model $f_\theta$ into a deep feature extractor $f_{\theta_{feat}}$ and a classifier $f_{\theta_{clf}}$, where $\theta = (\theta_{feat}, \theta_{clf})$ is the parameter set of the global model;
before federated learning formally starts, a pre-training phase is used to estimate the data distribution on the edge nodes participating in training, during which each edge node $k$ starts from the same random initialization $\theta^{0}$ and performs $e$ rounds of training on its local dataset, updating the model to $\theta_k^{e}$;
the edge node distribution classifier is constructed either from the parameters $\psi_{clf}$ of the local classifier $f_{\theta_{clf,k}^{e}}$ or from its predictions $\psi_{conf}$ on a public dataset $D_{pub}$ held at the server side;
at the server side, the classifier is applied to the updated model $\theta_k^{e}$ of each edge node to obtain an estimate $\hat{P}_k = \psi(\theta_k^{e})$ of the node's data distribution.
Further, the step 2 comprises:
starting from the approximation $\hat{P}_k$ of the data distribution of edge node $k$, similar node clusters are established out of nodes with different distributions, so that the distance between node clusters is minimized while the distance within each node cluster is maximized;
cosine and Euclidean distances are used to compare the weights $\psi_{clf}$ of the client classifiers, while the KL divergence is used as the metric for $\psi_{conf}$, which is given in the form of an actual probability distribution (a confidence vector).
Further, the clustering method comprises the following strategies:
strategy 1: the clients are randomly assigned to the node clusters until a defined stopping criterion is met;
strategy 2: first, $N_S$ homogeneous clusters are obtained with the K-means method; then all node clusters are formed by iteratively extracting one edge node at a time from each K-means cluster, until in each node cluster $S$ the number of samples satisfies $n_S \geq n_{S,min}$ and the number of edge nodes satisfies $K_S \leq K_{S,max}$;
strategy 3: an edge node $k_i$, $i \in [K]$, is randomly selected and assigned to the current node cluster $S$; then a second edge node $k_j$ is selected such that the distance between $k_i$ and $k_j$ is maximal, i.e. $k_j = \arg\max_j \tau(\hat{P}_i, \hat{P}_j)$; this process is repeated, each time adding the node that maximizes the total intra-cluster distance, until the set maximum number of edge nodes $K_{S,max}$ or the minimum number of samples $n_{S,min}$ is reached, where $\tau$ is the clustering metric.
Compared with the prior art, the invention has the following beneficial effects:
the method is suitable for the federal learning environment with data statistics heterogeneity, can be conveniently deployed under the traditional two-layer framework of the server-edge node, and can also be expanded and deployed under the three-layer framework of the cloud-edge server-edge node. Compared with the traditional federal learning method, the method is more efficient and has stronger applicability.
Drawings
Fig. 1 is a first schematic flow chart of a heterogeneous statistics-oriented clustering federated learning method according to an embodiment of the present invention;
Fig. 2 is a second schematic flow chart of a heterogeneous statistics-oriented clustering federated learning method according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
the goal of traditional federal learning is to learn a global model
Figure BDA0004061210170000035
Each edge node K ∈ [ K ]]Can be based on the local data set->
Figure BDA0004061210170000036
To obtain n k The FedAvg is a method based on T communication rounds of iteration and aims to solve the problem of ^ er>
Figure BDA0004061210170000037
Wherein
Figure BDA0004061210170000038
Is a local empirical risk,/ k Is the cross entropy loss, n = ∑ Σ k n k Is the total amount of data involved in the training. At each round T e [ T ∈ [ ]]The server will theta t Is sent to a randomly selected->
Figure BDA0004061210170000041
A part of the customer. Each client
Figure BDA0004061210170000042
Using D by minimizing local objects k Performing local gradient descent to reduce theta t Is updated to be->
Figure BDA0004061210170000043
And returns it to the server. The updated model is then summarized by the server to a new global model->
Figure BDA0004061210170000044
Is medium, i.e.>
Figure BDA0004061210170000045
However, in real-world scenarios, there is no guarantee that local datasets from different customers are independently extracted from the same underlying distribution.
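For illustration, the following is a minimal sketch of one FedAvg communication round in PyTorch. The function name, the structure of `clients`, and the assumption that all state-dict entries are floating-point tensors are simplifications for this example, not part of the invention.

```python
import copy
import torch


def fedavg_round(global_model, clients, lr=0.01, local_epochs=1):
    """One FedAvg round: broadcast theta^t, run local SGD, weighted-average."""
    states, sizes = [], []
    for loader in clients:  # each element: an iterable of (x, y) mini-batches
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        states.append(model.state_dict())
        sizes.append(sum(len(y) for _, y in loader))  # n_k
    # theta^{t+1} = sum_k (n_k / n_{C^t}) * theta_k^{t+1}
    n = float(sum(sizes))
    avg = {key: sum(s[key].float() * (m / n) for s, m in zip(states, sizes))
           for key in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```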
To address these problems, as shown in Fig. 1 and Fig. 2, the invention provides a heterogeneous statistics-oriented clustering federated learning method comprising the following steps:
step one, constructing an edge node distribution classifier psi. The invention integrates a global model f θ Separation into a depth feature extractor
Figure BDA0004061210170000046
And a classifier->
Figure BDA0004061210170000047
Wherein θ = (θ) featclf ) Is a set of parameters of the global model. The classification output is selected by>
Figure BDA00040612101700000421
It is given. Before the formal start of federal learning, a pre-training phase is used to estimate the data distribution on the edge nodes participating in training, during which each edge node k initializes θ from the same random 0 Initially, e rounds of training are performed on their local data sets to update the model to ≧>
Figure BDA0004061210170000048
The invention uses two strategies, based on local classifier @, respectively>
Figure BDA0004061210170000049
Radix Ginseng (radix Ginseng)Number psi clf Or its public data set at the server side->
Figure BDA00040612101700000410
Upper prediction psi conf . For strategy one, assume that the weight of the classifier can represent the local distribution of each client and directly feed it back to the clustering method φ (.) . For strategy two, at a common "feature set">
Figure BDA00040612101700000411
Up-test each pick>
Figure BDA00040612101700000412
Wherein->
Figure BDA00040612101700000413
Containing c e [ N [ ] C ]J samples of (a). Then according to class>
Figure BDA00040612101700000414
The predictions are averaged and a confidence vector for the kth client is defined as ≥>
Figure BDA00040612101700000415
On the server side, the classifier psi is used, and the updated model is combined with the updated edge node>
Figure BDA00040612101700000416
An estimate of the data distribution of the node is obtained>
Figure BDA00040612101700000417
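As a concrete illustration, the sketch below gives one plausible reading of the two strategies. The attribute name `model.classifier` and the reduction of the per-class averages to a single confidence per class are assumptions for this example rather than choices prescribed by the invention.

```python
import torch
import torch.nn.functional as F


def psi_clf(model):
    """Strategy one: flatten the classifier-head weights and use them as a
    proxy for the node's local data distribution."""
    head = model.classifier  # assumed attribute holding f_{theta_clf}
    return torch.cat([p.detach().flatten() for p in head.parameters()])


def psi_conf(model, public_sets):
    """Strategy two: average the softmax predictions per class on the public
    dataset D_pub; public_sets[c] is a tensor of J inputs of class c."""
    model.eval()
    per_class = []
    with torch.no_grad():
        for x_c in public_sets:
            probs = F.softmax(model(x_c), dim=1)  # shape (J, N_C)
            per_class.append(probs.mean(dim=0))   # mean prediction for class c
    # One plausible confidence vector: the model's mean confidence in the
    # true class c, yielding a length-N_C probability-like vector.
    return torch.stack([p[c] for c, p in enumerate(per_class)])
```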
Step two, determining the clustering metric $\tau$. Starting from the approximation $\hat{P}_k$ of the data distribution of edge node $k$, similar node clusters are established out of nodes with different distributions, in order to minimize the distance between node clusters while maximizing the distance within each node cluster. Given $\hat{P}_i$ and $\hat{P}_j$, a metric $\tau(\hat{P}_i, \hat{P}_j)$ for measuring the distance between two distribution estimates is required. Cosine and Euclidean distances are used to compare the client classifier weights $\psi_{clf}$, while the KL divergence is used as the metric for $\psi_{conf}$, which is given in the form of an actual probability distribution (a confidence vector).
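A minimal sketch of the metric $\tau$ follows; the function signature and the `kind` switch are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def tau(p_i, p_j, kind="kl"):
    """Distance between two distribution estimates produced by psi."""
    if kind == "cosine":      # for psi_clf weight vectors
        return 1.0 - F.cosine_similarity(p_i, p_j, dim=0)
    if kind == "euclidean":   # for psi_clf weight vectors
        return torch.dist(p_i, p_j, p=2)
    # KL divergence for psi_conf confidence vectors; eps guards against
    # log(0), and both vectors are renormalized to sum to one.
    eps = 1e-12
    p = (p_i + eps) / (p_i + eps).sum()
    q = (p_j + eps) / (p_j + eps).sum()
    return torch.sum(p * torch.log(p / q))
```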
Step three, determining the clustering method $\phi$ of the node clusters. First, define $D_S = \bigcup_{k \in S} D_k$ as the union of the data of the clients belonging to node cluster $S$. The aim is to find the maximum number of node clusters $N_S$ satisfying the constraints of a minimum number of samples $n_{S,min}$ and a maximum number of clients $K_{S,max}$ per cluster. Given the edge node distribution classifier $\psi(\cdot)$ and the clustering metric $\tau$, the invention introduces three strategies to find an approximation of this maximization problem. The first is the $\phi_{rand}$ strategy, a simple and practical method in which clients are randomly assigned to node clusters until a defined stopping criterion is met. The second is the $\phi_{kmeans}$ strategy, based on the K-means algorithm: first, $N_S$ homogeneous clusters are obtained with the K-means method; then all node clusters are formed by iteratively extracting one edge node at a time from each K-means cluster, until in each node cluster $S$ the number of samples satisfies $n_S \geq n_{S,min}$ and the number of edge nodes satisfies $K_S \leq K_{S,max}$. Finally, the $\phi_{greedy}$ strategy follows a greedy approach to generating node clusters: initially, an edge node $k_i$, $i \in [K]$, is randomly selected and assigned to the current node cluster $S$; then a second edge node $k_j$ is selected such that the distance between $k_i$ and $k_j$ is maximal, i.e. $k_j = \arg\max_j \tau(\hat{P}_i, \hat{P}_j)$; this process is repeated, each time adding the node that maximizes the total intra-cluster distance, until the preset maximum number of edge nodes $K_{S,max}$ or the minimum number of samples $n_{S,min}$ is reached.
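The sketch below illustrates $\phi_{greedy}$ under the stated constraints; the exact handling of the stopping conditions is a simplifying assumption for this example.

```python
import random


def phi_greedy(estimates, sizes, k_s_max, n_s_min, tau):
    """Greedily build node clusters that are internally as diverse as possible.

    estimates: dict node_id -> distribution estimate hat{P}_k from psi
    sizes:     dict node_id -> local sample count n_k
    """
    unassigned = set(estimates)
    clusters = []
    while unassigned:
        seed = random.choice(sorted(unassigned))  # randomly selected k_i
        unassigned.remove(seed)
        cluster, n_samples = [seed], sizes[seed]
        # Grow until the cluster reaches K_{S,max} nodes or n_{S,min} samples.
        while unassigned and len(cluster) < k_s_max and n_samples < n_s_min:
            # Add the node whose distribution is farthest from the current
            # cluster, maximizing intra-cluster diversity.
            best = max(unassigned,
                       key=lambda j: sum(float(tau(estimates[i], estimates[j]))
                                         for i in cluster))
            unassigned.remove(best)
            cluster.append(best)
            n_samples += sizes[best]
        clusters.append(cluster)
    return clusters
```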
Step four, according to the clustering method, after pre-training is finished all edge nodes participating in training are divided into $N_S$ node clusters, so that edge nodes with different distributions are combined in the same cluster while edge nodes with similar distributions are separated.
Step five, the server initializes the global model $\theta^{t}$, communicates with all edge nodes participating in training, and sends the model to the head node $k_{i,1}$ of each node cluster $S_i$, $i \in [N_S]$.
Step six, node $k_{i,1}$, upon receiving the model $\theta^{t}$, performs $E_k$ rounds of training on its local data $D_{k_{i,1}}$, updating the model to $\theta_{k_{i,1}}^{t}$; it then sends $\theta_{k_{i,1}}^{t}$ to the next edge node $k_{i,2}$ in the node cluster, and this process is repeated until the last client $k_{i,K_{S_i}}$ of the node cluster has received the model and completed local training. After $k_{i,K_{S_i}}$ finishes training, it sends the updated model $\theta_{S_i}^{t}$ to the head node $k_{i,1}$ of the cluster.
Step seven, the head node $k_{i,1}$ of each cluster, after receiving the model $\theta_{S_i}^{t}$, judges according to the training effect whether step six should be repeated for $E_S$ further sub-rounds; if this is not required, it sends $\theta_{S_i}^{t}$ to the server. After receiving the updated models returned by all node clusters, the server averages the model updates as $\theta^{t+1} = \sum_{i=1}^{N_S} \frac{n_{S_i}}{n}\, \theta_{S_i}^{t}$, where $n_{S_i}$ is the number of training samples in cluster $S_i$.
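Putting steps five to seven together, the following sketch simulates one global round on a single machine; in a real deployment the inner loop would be a model hand-off between nodes rather than a loop, and all names here are assumptions for illustration.

```python
import copy
import torch


def train_cluster_sequentially(global_model, cluster_loaders, lr=0.01, epochs=1):
    """Steps five and six: the model visits each node of one cluster in turn,
    starting from the head node k_{i,1}."""
    model = copy.deepcopy(global_model)
    loss_fn = torch.nn.CrossEntropyLoss()
    for loader in cluster_loaders:  # k_{i,1}, k_{i,2}, ... in order
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):     # E_k local rounds per node
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        # In deployment, the updated weights are now sent to the next node.
    return model.state_dict()


def global_round(global_model, clusters, cluster_sizes):
    """Step seven: weighted average of the models returned by all clusters."""
    states = [train_cluster_sequentially(global_model, c) for c in clusters]
    n = float(sum(cluster_sizes))
    avg = {key: sum(s[key].float() * (m / n)
                    for s, m in zip(states, cluster_sizes))
           for key in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```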
Step eight, repeating step six and step seven until the global model converges.
It should be noted that if the method is applied to a three-layer cloud-edge server-edge node architecture, then in the edge server-edge node layer each edge server can simply be regarded as the server in the above steps, while in the server-edge server layer the edge servers can be regarded as the edge nodes in the above steps, with sequential training performed among the edge servers without clustering. The rationale is that merging models is only useful once the models have been trained on a larger dataset; statistically, after $N_S$ rounds each model may have been trained on the entire dataset, so that the performance of this strategy approaches that of a centralized strategy.
The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (4)

1. A heterogeneous statistics-oriented clustering federated learning method is characterized by comprising the following steps:
step 1, constructing an edge node distribution classifier;
step 2, determining a measurement index of edge node clustering;
step 3, determining a clustering method of the node cluster;
step 4, clustering the edge nodes by using a clustering method;
step 5, the server initializes the global model and sends the model to the head node of each node cluster;
step 6, after receiving the model, the edge node performs local training on a local data set and updates the model, sends the updated model to the next node in the cluster for training until all the nodes in each cluster complete training, and uploads the updated model to the server;
step 7, the server receives the updated models of all clusters, then carries out weighted average and updates the global model;
step 8, repeating step 6 and step 7 until the global model converges.
2. The heterogeneous statistics-oriented clustering federated learning method according to claim 1, wherein the step 1 comprises:
splitting the global model $f_\theta$ into a deep feature extractor $f_{\theta_{feat}}$ and a classifier $f_{\theta_{clf}}$, where $\theta = (\theta_{feat}, \theta_{clf})$ is the parameter set of the global model;
before federated learning formally starts, using a pre-training phase to estimate the data distribution on the edge nodes participating in training, during which each edge node $k$ starts from the same random initialization $\theta^{0}$ and performs $e$ rounds of training on its local dataset, updating the model to $\theta_k^{e}$;
constructing the edge node distribution classifier either from the parameters $\psi_{clf}$ of the local classifier $f_{\theta_{clf,k}^{e}}$ or from its predictions $\psi_{conf}$ on a public dataset $D_{pub}$ held at the server side;
at the server side, applying the classifier to the updated model $\theta_k^{e}$ of each edge node to obtain an estimate $\hat{P}_k = \psi(\theta_k^{e})$ of the node's data distribution.
3. The heterogeneous statistics-oriented clustering federated learning method according to claim 2, wherein the step 2 comprises:
starting from the approximation $\hat{P}_k$ of the data distribution of edge node $k$, establishing similar node clusters out of nodes with different distributions, so that the distance between node clusters is minimized while the distance within each node cluster is maximized;
using cosine and Euclidean distances to compare the weights $\psi_{clf}$ of the client classifiers, and using the KL divergence as the metric for $\psi_{conf}$, which is given in the form of an actual probability distribution (a confidence vector).
4. The heterogeneous statistics-oriented clustering federated learning method according to claim 1, wherein the clustering method comprises:
strategy 1: randomly assigning the clients to the node clusters until a defined stopping criterion is met;
strategy 2: first obtaining $N_S$ homogeneous clusters with the K-means method, then forming all node clusters by iteratively extracting one edge node at a time from each K-means cluster, until in each node cluster $S$ the number of samples satisfies $n_S \geq n_{S,min}$ and the number of edge nodes satisfies $K_S \leq K_{S,max}$;
strategy 3: randomly selecting an edge node $k_i$, $i \in [K]$, and assigning it to the current node cluster $S$; then selecting a second edge node $k_j$ such that the distance between $k_i$ and $k_j$ is maximal, i.e. $k_j = \arg\max_j \tau(\hat{P}_i, \hat{P}_j)$; repeating this process, each time adding the node that maximizes the total intra-cluster distance, until the preset maximum number of edge nodes $K_{S,max}$ or the minimum number of samples $n_{S,min}$ is reached, where $\tau$ is the clustering metric.
CN202310060893.8A 2023-01-17 2023-01-17 Heterogeneous statistics-oriented clustering federated learning method Pending CN115952860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310060893.8A CN115952860A (en) 2023-01-17 2023-01-17 Heterogeneous statistics-oriented clustering federated learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310060893.8A CN115952860A (en) 2023-01-17 2023-01-17 Heterogeneous statistics-oriented clustering federated learning method

Publications (1)

Publication Number Publication Date
CN115952860A true CN115952860A (en) 2023-04-11

Family

ID=87282541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310060893.8A Pending CN115952860A (en) 2023-01-17 2023-01-17 Heterogeneous statistics-oriented clustering federated learning method

Country Status (1)

Country Link
CN (1) CN115952860A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117806838A (en) * 2024-02-29 2024-04-02 浪潮电子信息产业股份有限公司 Heterogeneous data-based device clustering method, apparatus, device, system and medium
CN117806838B (en) * 2024-02-29 2024-06-04 浪潮电子信息产业股份有限公司 Heterogeneous data-based device clustering method, apparatus, device, system and medium


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination