CN115204416A - Heterogeneous client-oriented federated learning method based on hierarchical sampling optimization - Google Patents

Heterogeneous client-oriented federated learning method based on hierarchical sampling optimization

Info

Publication number
CN115204416A
Authority
CN
China
Prior art keywords: client, data, clients, training, heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210690767.6A
Other languages
Chinese (zh)
Inventor
马武彬
鲁晨阳
郑龙信
吴亚辉
周浩浩
戴超凡
邓苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210690767.6A
Publication of CN115204416A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a federated learning method for heterogeneous clients based on hierarchical sampling optimization, in which available clients are selected from different clusters. The parameter server broadcasts the global model to all clients; each client trains on a local data sample to obtain local model parameters; the parameter server collects the local model parameter information of each client and divides the clients into different clusters with a clustering method. In each training round, available clients are drawn from each cluster according to sample weight to participate in training, and gradient aggregation is performed. After the clients participating in a round receive the latest global model parameters from the parameter server, they compute the gradient under the current parameters with local data and send the latest parameters back to the parameter server after the local iterations; the parameter server then takes a weighted average of the returned parameters. The invention can converge to the globally optimal solution under heterogeneous client conditions.

Description

Federated learning method for heterogeneous clients based on hierarchical sampling optimization
Technical Field
The invention belongs to the technical field of distributed learning, and particularly relates to a federated learning method for heterogeneous clients based on hierarchical sampling optimization.
Background
Federated learning is a new distributed learning paradigm. Compared with traditional distributed machine learning, it faces the following difficulties: (1) the communication capability, computing power, and storage capacity of the clients differ greatly (device heterogeneity); (2) the data distribution and data volume of the clients differ greatly (data heterogeneity); and (3) communication consumption is high. Under heterogeneous client conditions (including device heterogeneity and data heterogeneity), the large differences in client data distributions greatly slow model convergence; under extreme data heterogeneity the traditional federated learning algorithms cannot converge, and the training curve fluctuates strongly as the number of local iteration rounds increases.
Federated learning is a new distributed machine learning architecture that allows multiple devices (referred to as clients in federated learning) to train a global model collectively without uploading local data. Compared with traditional distributed machine learning, the main differences are: client nodes are often unreliable (edge nodes in federated learning are limited by their devices, communication, and other factors and frequently drop offline), because each node has independent control over its local device and data; communication consumption in federated learning is higher than computation consumption; node data in federated learning are non-independent and non-identically distributed (non-IID); and the local data volumes of the clients are extremely unbalanced. These new features pose challenges to the design and analysis of federated learning algorithms.
One of the major challenges is client heterogeneity, which includes data heterogeneity and device heterogeneity. Client heterogeneity in federated learning is widespread under real-world conditions, for example: non-uniform client distributions, because each client's data are generated locally and the sample-generation mechanism may differ across clients (such as different countries or regions); feature distribution skew (covariate shift), such as handwriting recognition, where different people write even the same character differently; label distribution skew (prior probability shift), for example Chinese is used mainly within China and by relatively few people abroad; quantity skew or imbalance; and differences in device computing power and communication. In real life many situations lead to non-IID data. Traditional machine learning is based on the assumption that data are independent and identically distributed, but federated learning differs from centralized machine learning: when the data are not centralized, the data of the nodes are not independent and identically distributed.
Consider a practical scenario in which a federated learning method is used to train a mobile-phone input-method model. Phones differ in processing speed, local data, network conditions, and so on: the newest phones compute and transmit faster than old ones, and phones in well-covered locations such as towns have more stable communication than phones in rural areas or areas with signal interference. During model training, old phones train slowly and may not finish the training task on time, and phones with poor signal are prone to dropping out while transmitting the model over the network, so the data distribution seen by the parameter server differs from the actual distribution. Because of client heterogeneity, some types of data participate in the training process more frequently, which introduces bias into the training data.
In recent years, in the field of big data, improvements in the storage and computing capacity of machines have greatly promoted the development of large-scale distributed machine learning deployed in data centers. Traditional distributed machine learning requires the overall data to be concentrated on one node or data center for training. However, as the local computing power of mobile devices such as phones, smart wearables, and sensors improves, and with the restrictions on user data privacy in recent years, training locally on distributed devices and then transmitting the training parameters to a parameter server is more effective than transmitting the data to a central node. This setting, known as federated learning, must address challenges such as large-scale training data, privacy protection, and heterogeneous data and devices.
McMahan et al. proposed in 2016 a deep-network federated learning method based on iterative model averaging (FedAvg); the learning task is performed by a loose federation of clients coordinated by a central server, which is why the approach is called federated learning. Compared with data-center distributed machine learning, the main advantage of federated learning is that it separates model training from the need for direct access to the raw data, so the FedAvg algorithm is of great significance when data privacy requirements are strict or data are difficult to share centrally; at the same time, FedAvg accelerates learning by using multiple rounds of local iteration, which greatly helps reduce communication consumption.
Peter Kairouz et al. discussed the latest advances in federated learning and summarized the pressing challenges it currently faces: non-independent and non-identically distributed data; privacy protection of personal data; training under limited communication bandwidth; robustness against malicious nodes and attacks; and fairness in federated learning. The article points out that client data heterogeneity and device heterogeneity greatly affect learning efficiency in federated learning and are among the urgent challenges currently facing the field.
To address the problem of non-independent and non-identically distributed data in federated learning, Yue Zhao et al. improved the FedAvg algorithm. They found that FedAvg suffers a large accuracy loss when the data are non-IID, proposed computing the weight divergence with the earth mover's distance, which can improve the accuracy of federated learning on non-IID data, and proposed a data-sharing federated learning strategy that improves training on non-IID data by creating a small, globally shared portion of data on the central server and distributing it to all client devices. Creating globally shared data on the central server can reduce the influence of data skew, but this approach effectively introduces artificial bias, and sharing data itself violates the federated learning principle of data privacy protection and is very difficult to implement.
Starting from the objective function, Tian Li et al. add a restriction (proximal) term to the model objective so that when each client updates with local data the new model does not deviate too far from the global model, reducing the influence of data heterogeneity. From another direction, Jiang Y et al. argue that with heterogeneous data a sufficiently accurate global model cannot be obtained, so the model is personalized: the global model is further trained with local data to obtain a higher-quality personalized model. Along similar lines, A. Ghosh et al. and Sattler et al. propose dividing the clients into different clusters and then training a separate global model within each cluster; they cluster the clients' local empirical loss functions or node gradients with different clustering methods. Because the clients in each cluster are highly similar, the intra-cluster global models have high accuracy, but the models trained in this way generalize poorly, which defeats the purpose of collaborative training, and the clustering methods they use all require the number of clusters to be specified in advance, which is difficult in practical applications. Yikai Yan et al. consider the intermittent availability of clients: different clients participate in training different numbers of times, which biases the trained model toward the data of the clients that participate more, so during client selection the clients that have participated less are selected preferentially to keep each client's participation count as equal as possible.
Under heterogeneous client conditions (including device heterogeneity and data heterogeneity), the large differences in client data distributions greatly slow model convergence; under extreme data heterogeneity the traditional federated learning algorithms cannot converge, and the training curve fluctuates strongly as the number of local iteration rounds increases.
Disclosure of Invention
Aiming at the difficulty that client heterogeneity in federated learning brings to model training, the invention provides FedSSO, a federated learning algorithm optimized by hierarchical (stratified) sampling. In FedSSO, a density-based clustering method divides all clients into different clusters so that the clients in each cluster are highly similar, and available clients are drawn from the different clusters according to sample weight to participate in training. In this way, all kinds of data participate proportionally in each training round, which accelerates convergence of the model to the globally optimal solution. At the same time, a decreasing learning rate and a local-iteration-round selection mechanism are set to guarantee convergence of the model.
Specifically, the invention discloses a federated learning method for heterogeneous clients based on hierarchical sampling optimization, which comprises the following steps:
selecting available clients from different clusters, wherein the data received by the clients are heterogeneous;
the parameter server initializes a global model and broadcasts it to all clients; each client trains on a local data sample according to the received global model to obtain local model parameters; the parameter server collects the local model parameter information of each client and divides the clients into different clusters with a clustering method;
during each training round, available clients are drawn from each cluster according to the sample weight to participate in training, and gradient aggregation is performed, which ensures that all kinds of data participate in every training round and reduces the influence of client heterogeneity; the training objective is a convex function;
after the clients participating in a round receive the latest global model parameters from the parameter server, they compute the gradient under the current parameters using local data, run E iterations of stochastic gradient descent, and send the latest parameters back to the parameter server, which takes a weighted average of the returned parameters.
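For illustration only, the following is a minimal sketch of one such training round in Python with NumPy. The helper names (local_sgd, fedsso_round), the least-squares stand-in loss, and the toy data are assumptions for the example and are not the patented implementation; the cluster assignments would come from the clustering step described above.

```python
import numpy as np

def local_sgd(w, data, lr, E, batch_size=10, rng=np.random.default_rng(0)):
    """Run E steps of stochastic gradient descent on one client's local data
    (least-squares loss used as an illustrative stand-in)."""
    X, y = data
    for _ in range(E):
        idx = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w = w - lr * grad
    return w

def fedsso_round(w_global, client_data, clusters, frac, lr, E):
    """One round: draw clients from every cluster, train locally, weighted-average."""
    selected = []
    for members in clusters.values():                    # stratified sampling over clusters
        quota = max(1, int(round(frac * len(members))))  # proportional quota per cluster
        selected += list(np.random.choice(members, size=quota, replace=False))
    updates, sizes = [], []
    for c in selected:
        updates.append(local_sgd(w_global.copy(), client_data[c], lr, E))
        sizes.append(len(client_data[c][1]))
    weights = np.array(sizes, dtype=float) / sum(sizes)  # weighted average of returned parameters
    return sum(wgt * upd for wgt, upd in zip(weights, updates))

# toy usage: 6 clients in 2 pre-computed clusters, decreasing learning rate per round
rng = np.random.default_rng(1)
client_data = {c: (rng.normal(size=(50, 3)), rng.normal(size=50)) for c in range(6)}
clusters = {0: [0, 1, 2], 1: [3, 4, 5]}
w = np.zeros(3)
for t in range(5):
    w = fedsso_round(w, client_data, clusters, frac=0.5, lr=0.01 / (1 + t), E=5)
```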
Further, the mean of one-dimensional data is learned from the N clients, and the target is converted into the following mean-square-error minimization problem:
$$\min_x F(x)=\sum_{i=1}^{N}\rho_i\,\mathbb{E}_{\xi_i\sim D_i}\big[(x-\xi_i)^2\big]$$
where ξ_i ∼ D_i is a sample drawn independently from the local data with mean e_i = E[ξ_i], ρ_i is the weight of the i-th client, ξ_i is a sample of the client, and τ_i is the aggregation weight offset of the i-th client.
Further, when each client has the same data volume, the optimal solution of the problem is:
$$x^{*}=\frac{1}{N}\sum_{i=1}^{N}e_i$$
With τ_i denoting the weight offset of the i-th client, the objective function will converge to:
$$x=\sum_{i=1}^{N}\Big(\frac{1}{N}+\tau_i\Big)e_i$$
further, the clustering method is an OPTICS clustering method.
Further, the formula for the client local model parameters is:
$$w_i = w_0 - \eta\,\nabla F_i(w_0;\xi_i)$$
where η is the learning rate, w_0 is the global model, ξ_i is a sample of the client, ∇F_i is the gradient of the loss function of the i-th client, and i denotes the i-th client.
Further, the weighted average method is as follows:
Figure BDA0003699562020000057
k is the total number of the clients extracted in each round, and K is the kth client.
Further, T_ε is the number of iterations required for the algorithm to reach accuracy ε, and the number of communications between the clients and the parameter server, T_ε/E, is:
Figure BDA0003699562020000062
where G² is the bound on the expected squared norm of the stochastic gradients, σ_k² is the variance bound of the stochastic gradients, ρ_k is the weight of the k-th client, E is the local iteration round, N is the total number of clients, K is the total number of clients drawn in each round, L denotes L-smoothness, and μ denotes μ-strong convexity;
when the degree of data heterogeneity is low and Γ is close to 0, a larger local iteration round E is better; when the degree of data heterogeneity is high and Γ is larger, a smaller local iteration round E is better.
Under the condition that all clients participate in training, the algorithm has the following convergence:
Figure BDA0003699562020000064
wherein,
Figure BDA0003699562020000065
$$\Gamma = F^{*}-\sum_{k=1}^{N}\rho_k F_k^{*}$$
is the data heterogeneity parameter.
Under the condition that part of clients participate in training, the algorithm convergence is as follows:
Figure BDA0003699562020000067
wherein
Figure BDA0003699562020000068
Meanwhile, the learning rate must be set to decrease gradually during training in order to converge to the optimal result, so that the algorithm reaches a convergence rate of O(E²/T).
The invention has the following beneficial effects:
even if the optimization problem of a convex objective function is solved, when an accurate gradient (non-random gradient) is calculated, the traditional federated learning algorithm cannot converge to a global optimal solution under the heterogeneous conditions of the client. In particular, under highly client heterogeneous conditions, the traditional federated learning approach may diverge due to constant learning rates and high local iteration rounds, and the present invention can overcome these problems.
A convergent algorithm is provided. The algorithm uses hierarchical (stratified) sampling: during each training round a certain number of available clients are drawn, according to the weight ratio, from client clusters divided in advance to participate in training, and the algorithm converges to the globally optimal solution at rate O(E²/T).
The FedSSO algorithm was evaluated using the standard data sets MNIST, CIFAR-10 and Sentiment140 and compared with FedAvg and FedProx. The evaluation results show that the FedSSO algorithm achieves higher training accuracy and faster training speed on heterogeneous data sets.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows the results of the present invention and of FedAvg and FedProx on the MNIST (non-IID) data set;
FIG. 3 shows the results of the present invention and of FedAvg and FedProx on the MNIST (non-IID2) data set;
FIG. 4 shows the results of the present invention and of FedAvg and FedProx on the CIFAR-10 (non-IID) data set;
FIG. 5 shows the results of the present invention and of FedAvg and FedProx on the CIFAR-10 (non-IID2) data set;
FIG. 6 shows the results of the present invention and of FedAvg and FedProx on the Sentiment140 data set;
FIG. 7 shows the FedAvg results under different degrees of data distribution heterogeneity on the MNIST (IID) data set;
FIG. 8 shows the FedAvg results under different degrees of data distribution heterogeneity on the MNIST (non-IID) data set;
FIG. 9 shows the FedAvg results under different degrees of data distribution heterogeneity on the MNIST (non-IID2) data set;
FIG. 10 shows the FedAvg results under different degrees of data distribution heterogeneity on the CIFAR-10 (IID) data set;
FIG. 11 shows the FedAvg results under different degrees of data distribution heterogeneity on the CIFAR-10 (non-IID) data set;
FIG. 12 shows the FedAvg results under different degrees of data distribution heterogeneity on the CIFAR-10 (non-IID2) data set;
FIG. 13 shows the results of the present invention under different degrees of data distribution heterogeneity on the Sentiment140 data set;
FIG. 14 shows the results of the present invention under extremely heterogeneous conditions on the MNIST (non-IID) data set;
FIG. 15 shows the results of the present invention under extremely heterogeneous conditions on the MNIST (non-IID2) data set;
FIG. 16 shows the results of the present invention under extremely heterogeneous conditions on the CIFAR-10 (non-IID) data set;
FIG. 17 shows the results of the present invention under extremely heterogeneous conditions on the CIFAR-10 (non-IID2) data set;
FIG. 18 shows the results of the present invention under extremely heterogeneous conditions on the Sentiment140 data set.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
The optimization model in federated learning is:
$$\min_{w} F(w)=\sum_{k=1}^{N}\rho_k F_k(w)$$
where N is the total number of clients and ρ_k is the weight of the k-th client. Suppose the local data distribution of the k-th client is D_k and
$$\xi_k \sim D_k$$
is a sample drawn independently from the local data. In the standard FedAvg algorithm (consider the t-th iteration), the central parameter server first broadcasts the latest global parameters w_t to all clients participating in training; then each participating client performs E rounds of local iteration:
$$w_{t+1}^{k}=w_{t}^{k}-\eta_t\,\nabla F_k\big(w_t^{k};\xi_t^{k}\big)$$
where η_t is the learning rate. Assuming K (1 ≤ K < N) clients are selected to participate in each round, the central server aggregates the collected client gradients:
Figure BDA0003699562020000084
the distribution of the overall data is a mixture of all local data distributions:
$$D=\sum_{k=1}^{N}\rho_k D_k$$
When the client data are independent and identically distributed, D_k = D for all k ∈ N. However, in real life the data distributions of different clients often differ, so the invention is based on the assumption that the data are not independent and identically distributed.
Influence of data heterogeneity, Example 1: consider a distributed optimization problem in which the objective function is convex and the goal is to learn the mean of one-dimensional data from N clients. Let ξ_i ∼ D_i with mean e_i = E[ξ_i]. We transform this objective into the following mean-square-error minimization problem:
$$\min_x F(x)=\sum_{i=1}^{N}\rho_i\,\mathbb{E}_{\xi_i\sim D_i}\big[(x-\xi_i)^2\big]$$
For convenience of calculation, assuming that each client has the same data volume, the optimal solution is:
$$x^{*}=\frac{1}{N}\sum_{i=1}^{N}e_i$$
Let τ_i be the aggregation weight offset of the i-th client caused by communication loss, client device differences, and so on; the objective function will then converge to:
$$x=\sum_{i=1}^{N}\Big(\frac{1}{N}+\tau_i\Big)e_i$$
Proof: differentiating the objective function gives:
$$F'(x)=2\sum_{i=1}^{N}\rho_i\,(x-e_i)$$
Setting the derivative to 0 gives:
$$x=\sum_{i=1}^{N}\rho_i\,e_i$$
Under the assumption that each client has the same data volume, ρ_i = 1/N for every i ∈ N, therefore
$$x^{*}=\frac{1}{N}\sum_{i=1}^{N}e_i$$
When the convergence value of the objective function is computed under realistic conditions, ρ_i = 1/N + τ_i (τ_i is the aggregation weight offset of client i), so
$$x=\sum_{i=1}^{N}\Big(\frac{1}{N}+\tau_i\Big)e_i$$
and x = x* if and only if e_1 = e_2 = … = e_N (the data distribution is IID) or τ_i = 0 for all i ∈ {1, 2, …, N}. Therefore, the traditional federated learning algorithm gives poor results under heterogeneous data conditions.
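A minimal numeric sketch of Example 1 follows; the per-client means e_i and offsets τ_i are made-up values used only to show that averaging with offset weights settles on Σ_i (1/N + τ_i) e_i rather than on the true mean x* = (1/N) Σ_i e_i.

```python
import numpy as np

e = np.array([1.0, 2.0, 3.0, 10.0])          # per-client means e_i (heterogeneous clients)
tau = np.array([0.10, 0.05, -0.05, -0.10])   # aggregation weight offsets tau_i
N = len(e)

x_star = e.mean()                             # unbiased optimum x* = 4.0
x_biased = np.sum((1.0 / N + tau) * e)        # value the biased aggregation converges to
print(x_star, x_biased, x_biased - x_star)    # bias = sum_i tau_i * e_i = -0.95
```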
The FedSSO algorithm framework of the invention: as described above, data and device heterogeneity severely degrade the performance of the FedAvg algorithm. In federated learning, the overall data distribution is a mixture of the client-side local data distributions according to their weights; in the FedAvg algorithm this weight is the sample weight. This setting considers only the difference in the data volume of each client and ignores differences in client hardware, communication, and the like. Consider the classical federated learning example of training a mobile-phone input-method model: the newest phones compute and transmit faster than old ones, and phones in well-covered locations such as towns have more stable communication than phones in rural areas or areas with signal interference, so the data distribution seen by the parameter server differs from the actual distribution. Because of client heterogeneity, some types of data participate in the training process more frequently, which introduces bias into the training data. To alleviate this problem, all types of data are used in every training round, so that the probability of each type of data participating in training is essentially the same and the training data distribution is an unbiased mixture of the sample distributions of the clients; this eliminates the bias in the training data and establishes the convergence result.
The detailed procedure of the FedSSO algorithm is given in Algorithm 1. The client-selection principle of FedSSO is to select available clients from different clusters (lines 2-8). The parameter server first initializes the global model and broadcasts w_0 to all clients; each client trains on one sample ξ_i of its local data according to the received model to obtain local model parameters w_i; the parameter server collects the local model parameter information of each client and divides the clients into different clusters using the OPTICS clustering method (Ordering Points To Identify the Clustering Structure). OPTICS is a density-based clustering algorithm: it defines clusters as the largest sets of density-connected points and divides regions of sufficient density into clusters. Unlike K-means and BIRCH, OPTICS can find clusters of arbitrary shape in noisy spatial data, whereas K-means and BIRCH are applicable only to convex sample sets. Compared with DBSCAN, OPTICS is insensitive to the input parameters, which improves clustering stability.
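For illustration, a minimal sketch of the clustering step with scikit-learn's OPTICS is shown below. The synthetic client parameter vectors and the parameter values (min_samples=2, max_eps=0.25, matching the density and radius settings stated in the experiments) are assumptions for the example, not the patented implementation.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# synthetic stand-in for the flattened local model parameters w_i of 100 clients
client_params, _ = make_blobs(n_samples=100, centers=5, n_features=8,
                              cluster_std=0.03, random_state=0)

optics = OPTICS(min_samples=2, max_eps=0.25)   # density 2, radius 0.25
labels = optics.fit_predict(client_params)     # label -1 marks noise/outlier clients

clusters = {}
for client_id, label in enumerate(labels):
    clusters.setdefault(label, []).append(client_id)  # group client ids by cluster
print({label: len(members) for label, members in clusters.items()})
```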
During each training round, available clients are drawn from each cluster in proportion to participate in the training, which ensures that all kinds of data participate in every round and reduces the influence of client heterogeneity. After the clients participating in a round receive the latest global model parameters from the parameter server, they compute the gradient under the current parameters with local data, run E iterations of stochastic gradient descent, and send the latest parameters back to the parameter server, which takes a weighted average of the returned parameters (lines 9-13).
Figure BDA0003699562020000101
Figure BDA0003699562020000111
In this section it will be shown that, for strongly convex and smooth functions and heterogeneous data, the FedSSO algorithm converges to the global optimum at rate O(E²/T). Moreover, the convergence conditions of the algorithm are analyzed, together with the necessity of the decreasing learning rate and the local-iteration-round selection mechanism.
The invention makes the following assumptions on the functions F_1, F_2, …, F_N:
Assumption 1. F_1, F_2, …, F_N are L-smooth: for all v and w,
$$F_k(v)\le F_k(w)+(v-w)^{\mathsf T}\nabla F_k(w)+\frac{L}{2}\,\|v-w\|_2^{2}.$$
Assumption 2. F_1, F_2, …, F_N are μ-strongly convex: for all v and w,
$$F_k(v)\ge F_k(w)+(v-w)^{\mathsf T}\nabla F_k(w)+\frac{\mu}{2}\,\|v-w\|_2^{2}.$$
Assumption 3. Let ξ_t^k be sampled uniformly at random from the local data of the k-th device. In each client, the variance of the stochastic gradients is bounded:
$$\mathbb{E}\big\|\nabla F_k(w_t^{k};\xi_t^{k})-\nabla F_k(w_t^{k})\big\|^{2}\le\sigma_k^{2},\quad k=1,\dots,N.$$
Assumption 4. The expected squared norm of the stochastic gradients is uniformly bounded:
$$\mathbb{E}\big\|\nabla F_k(w_t^{k};\xi_t^{k})\big\|^{2}\le G^{2}\quad\text{for all }k=1,\dots,N.$$
quantitative indicators of data heterogeneity hypothesis F * ,
Figure BDA0003699562020000119
Respectively, objective functions F and F k We can get the optimal solution of
Figure BDA00036995620200001110
To quantify the degree of data heterogeneity. When the client data is distributed as IID, Γ =0, the higher the data heterogeneity level, the higher the | Γ | value.
Quantitative indicator of device heterogeneity. Let τ_i be the expected difference, when the model is aggregated, between the aggregation weight of the i-th client and its parameter weight (the difference is affected by device computing power, the communication environment, and so on). Suppose
Figure BDA0003699562020000121
is the optimal solution of the objective function F_k; then
Figure BDA0003699562020000122
is the quantitative indicator of device heterogeneity.
Convergence analysis of Example 1
It is first shown that FedSSO converges in Example 1, whereas FedAvg produces a biased result. According to the setting, the clients are divided into different clusters {c_1, c_2, …, c_n}, and each cluster c_i contains n_{c_i} clients. Clients in the same cluster have similar means, and the mean of each cluster is E(ξ_{c_i}) = e_{c_i},
Figure BDA0003699562020000123
For any client k ∈ c_i,
Figure BDA0003699562020000124
We can rewrite the objective function as:
Figure BDA0003699562020000125
Because the available clients are drawn from each cluster, clients do not become unavailable owing to communication loss and the like, so
Figure BDA0003699562020000126
and the above formula can be rewritten as:
Figure BDA0003699562020000127
When the learning rate η ≤ 2/L and gradient descent is used, the solution of the above formula is:
Figure BDA0003699562020000128
Since it is known that
Figure BDA0003699562020000129
we can obtain:
Figure BDA00036995620200001210
When the clustering is sufficiently accurate, the data in each cluster can be regarded as identically distributed; as δ_{c_i} → 0 the objective function converges to the optimal solution.
Full client participation: this section discusses the convergence of the FedSSO algorithm when all clients participate in training. In fact, since FedSSO changes only the client-selection strategy, FedSSO is equivalent to the FedAvg algorithm when all clients participate in training. The convergence of FedAvg has been widely proven, but previous proofs preset the parameter aggregation weights and do not consider communication loss, client device heterogeneity, and similar situations; in fact these situations introduce a deviation between the training target and the optimal solution. In the proof, variables are introduced to represent the changes in client aggregation weights caused by such objective factors.
Assume the algorithm terminates after T iterations and returns w_T as the solution. E is the local iteration round of the client, and T is required to be an integer multiple of E so that w_T can be output as expected.
Theorem 1. Under Assumptions 1-4, with L, μ, σ_k, and G defined as in the assumptions, assume the learning rate η_t is decreasing, γ > 0, and η_t ≤ 2η_{t+E} for all t. The FedSSO algorithm with full client participation satisfies:
Figure BDA0003699562020000131
Figure BDA0003699562020000132
Figure BDA0003699562020000133
Figure BDA0003699562020000134
part of the devices participate: the convergence problem of the FedSSO algorithm under the partial client participation condition will be discussed in this section. Since the federated learning is seriously affected by "straggler's effect" in the full client participation mode (meaning that all nodes need to wait for the slowest node), the federated learning with partial client participation has a more realistic application. Suppose that
Figure BDA0003699562020000135
Set of clients participating in training for k-th iteration, S t The method is characterized by being formed by integrating clients randomly extracted from each cluster, wherein the total number of the clients extracted in each round is K. Assuming that the data volume of each client is balanced, the selected available clients are used in each round of training, so that the method is not influenced by communication loss and the like, and therefore rho 1 =ρ 2 =...=ρ N The polymerization step of FedSSO can be expressed as:
Figure BDA0003699562020000136
Defining ρ_1 = ρ_2 = … = ρ_N = 1/N seems to violate the federated learning assumption of unbalanced data; this can be resolved by the following transformation. Suppose
$$\tilde F_k(w)=N\rho_k F_k(w),$$
so that the local objective functions are rescaled. The global objective function can then be transformed into:
$$F(w)=\sum_{k=1}^{N}\rho_k F_k(w)=\frac{1}{N}\sum_{k=1}^{N}\tilde F_k(w).$$
Theorem 2. Under Assumptions 1-4, with L, μ, σ_k, and G defined as in the assumptions, assume the learning rate η_t is decreasing, γ > 0, η_t ≤ 2η_{t+E} for all t ≥ 0, and B is as defined in Theorem 1. Let
Figure BDA0003699562020000141
the following can be obtained:
Figure BDA0003699562020000142
wherein,
Figure BDA0003699562020000143
necessity of decreasing learning rate: it will be demonstrated in this section that selecting a progressively lower learning rate is essential for federal learning convergence under client heterogeneous conditions. In the previous certification process, we have obtained:
Figure BDA0003699562020000144
It can be seen that the update process of the algorithm resembles a Markov process: the next update of the global model is independent of the history and depends only on the current parameters. Since the model update is determined by two parts, we can see that in the above expression the first term is negative and the second term is positive, so the choice of learning rate has an important influence on the convergence of the algorithm.
When the model parameters are close to the optimal solution,
Figure BDA0003699562020000145
if the learning rate η_t is a constant, the first term of the above expression approaches 0 while the second term remains a positive constant; the model update then no longer reduces the objective value, and only an approximately optimal solution can be obtained. Therefore, it is emphasized that the invention must select a decreasing learning rate to converge to the optimal result.
Selection mechanism for the number of local iterations: according to the conclusion of Theorem 2, under appropriate parameter conditions the dominant term of equation (4) is:
Figure BDA0003699562020000146
Denote by T_ε the number of iterations required by the algorithm to reach accuracy ε. Then T_ε/E, the required number of communications, can be simplified as:
Figure BDA0003699562020000151
From the above expression it can be seen that a larger local iteration round is not always better; too many local iterations may increase the number of communications. In fact, for different parameters there exists an optimal local iteration round E, whose computation is related to the model parameters and the degree of data heterogeneity Γ. When the degree of data heterogeneity is low and Γ is close to 0, the latter half of the formula is the dominant term and a larger E is better; when the degree of data heterogeneity is high and Γ is larger, the first half of the formula is the dominant term and a smaller E is better.
We use different data sets and models to evaluate the FedSSO algorithm, compare it with the FedAvg and FedProx algorithms, and analyze the client local iteration rounds experimentally.
Data sets: experiments were performed using three different standard data sets, which are benchmark data sets summarized in prior federated learning work. For the convex problem, the invention uses a multinomial logistic regression model to compare the performance of different algorithms on the MNIST data set; to simulate heterogeneous client conditions, the data are distributed among 100 clients, each containing only 600 samples. The more complex CIFAR-10 data set is then selected; because its pictures are of common objects in life, such as planes and vehicles, it is harder than the handwritten-digit data set. The invention likewise distributes the overall data evenly among 100 clients, each containing only one category of data. To explore the behaviour of the algorithm in a non-convex setting, the invention uses an LSTM classifier to perform tweet sentiment analysis on the Sentiment140 data set, where each account corresponds to a device and the tweets posted by the account form its local data set.
Implementation: the invention selects the FedAvg and FedProx algorithms as baselines, with the FedProx parameter μ set to 0.2 as in the original paper. To ensure that each drawn sample is an unbiased estimate of the population, a target number of samples is drawn without replacement from the population (from each cluster in the FedSSO algorithm) in every round, so that each sample appears only once in each round of training. To simulate different degrees of data heterogeneity, a diversified sampling strategy is adopted. When independent and identically distributed data need to be simulated, the overall data are sampled without replacement and divided among the clients, so that the data distribution of each client is an unbiased estimate of the overall sample. When heterogeneous data need to be simulated, the overall data are sorted by label and divided into shards, each shard containing data of only one label; the shards are then randomly assigned to different clients. To simulate different degrees of data heterogeneity, two heterogeneous data settings are designed: non-IID and non-IID2. In non-IID each client contains two categories of data, and in non-IID2 each client contains only one category of data, simulating extremely heterogeneous client data.
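As an illustration of the label-sorted shard partition described above, the following sketch (an assumed helper, not the patent's code) sorts samples by label, splits them into single-label shards, and deals a fixed number of shards to each client; 2 shards per client corresponds to the non-IID setting and 1 shard per client to non-IID2.

```python
import numpy as np

def shard_partition(labels, num_clients, shards_per_client, seed=0):
    """Return {client_id: array of sample indices} with label-skewed shards."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")                  # sample indices sorted by label
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))                   # deal shards out at random
    return {
        c: np.concatenate([shards[s] for s in
                           shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    }

# e.g. 100 clients with 2 shards each (non-IID); shards_per_client=1 gives non-IID2
labels = np.random.default_rng(1).integers(0, 10, size=60000)  # stand-in for MNIST labels
parts = shard_partition(labels, num_clients=100, shards_per_client=2)
```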
For each data set, the initial learning rate is set to 0.01, and in the FedSSO experiments the decreasing learning-rate mechanism is η_t = 0.01/(1+t). The proportion of clients selected in each round relative to the total number of clients is 0.1, and the local batch size is 10. For clustering, the parameters of the OPTICS clustering method are: density 2 and radius 0.25.
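A one-line sketch of the stated schedule, η_t = 0.01/(1 + t), evaluated per communication round t (the gradient here is a placeholder used only to show where the schedule plugs in):

```python
import numpy as np

def lr(t, eta0=0.01):
    return eta0 / (1.0 + t)        # decreasing learning-rate schedule used in the experiments

w, grad = np.zeros(5), np.ones(5)  # placeholder parameters and gradient
for t in range(200):
    w -= lr(t) * grad              # the step size shrinks every round
```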
Experimental results: the algorithms were first tested under different data distribution conditions. For the FedAvg and FedProx algorithms the local iteration round is set to E = 1, and for the FedSSO algorithm results are tested with local iteration rounds of 1 and 5. The results are shown in FIGS. 2-6. For the experiments on the convex problem, tests were performed on the MNIST and CIFAR-10 data sets under different data heterogeneity conditions; for the experiment on the non-convex problem, tests were performed on the Sentiment140 data set, where each Twitter user is a client, the tweets it posts are its local data, and only a single heterogeneous distribution is considered. It can be observed that the FedSSO algorithm converges in all five experimental settings. In contrast, the training curves of the FedAvg and FedProx algorithms fluctuate strongly, and in particular the models cannot converge when the data distribution is extremely heterogeneous, which also verifies Theorem 2. Notably, because the data in different data sets have different degrees of similarity, on the MNIST data set (grayscale digit images) the differences between data categories are smaller than on CIFAR-10 (3-channel color images of objects common in life), so the loss curve of the trained model fluctuates less under heterogeneous data distribution. From the experiments in which local iteration rounds are added to the FedSSO algorithm, it can be seen that when the degree of data distribution heterogeneity is not high, increasing the number of local iterations can accelerate convergence of the model (FIGS. 2 and 4). However, when the data distribution is extremely unbalanced (FIGS. 3, 5 and 6), adding local iteration rounds has little effect; this problem is discussed in detail in the next section.
Local iteration round selection mechanism: this part first shows the convergence of the traditional federated learning method (FedAvg) as the number of local iteration rounds increases under different data heterogeneity settings. The results are shown in FIGS. 7-13. The selection of the local iteration round must take the effect of data heterogeneity into account. Take the experiments on the MNIST data set as an example (FIGS. 7, 8 and 9). When the data are IID, adding local iterations increases the convergence speed. When the data are non-IID, increasing the number of local iterations also increases the convergence speed, but not to the same extent as in the former case. When the data are non-IID2 (extremely heterogeneous data distribution), increasing the number of local iteration rounds of the clients slows the decrease of the training loss, and when the number of local iteration rounds is too high the model cannot converge. Meanwhile, it can be seen that as the degree of data heterogeneity increases, the fluctuation of the model's training loss curve gradually increases. This verifies that under heterogeneous conditions a decreasing learning rate mechanism is necessary for the convergence of federated learning. When the data distributions of different clients are not identical, increasing local iterations deepens the differences between model parameters, so the training loss curve fluctuates strongly; at this point an excessively high learning rate makes the model differences between clients too large to converge. Therefore, for traditional federated learning, when the data are independent and identically distributed, increasing the number of local iterations can accelerate model convergence; when the data distribution is heterogeneous, increasing the number of local iterations may slow convergence or prevent the model from converging.
FIGS. 14 to 18 show the convergence of the FedSSO method of the present invention with different local iteration rounds under different degrees of data distribution heterogeneity. When the data are less heterogeneous (FIGS. 14-16), increasing the local iteration rounds can speed up the convergence of the model. However, as the degree of data distribution heterogeneity increases, the acceleration effect of adding local iteration rounds gradually decreases, and the model may even fail to converge (FIGS. 17-18). Therefore, the FedSSO algorithm can accelerate convergence by increasing local iteration rounds in the case of heterogeneous clients, but under extremely heterogeneous data it still cannot guarantee convergence with a high number of local iteration rounds. Under extreme heterogeneous conditions, selecting a lower local iteration round is therefore the better choice.
FedSSO uses a density-based clustering method to divide heterogeneous clients into different clusters, so that the clients in each cluster are highly similar; during each training round a specified number of clients is drawn from all clusters in proportion to participate in training, which guarantees that all types of data participate in every round. We provide a convergence proof for the FedSSO algorithm under the standard federated learning assumptions and have validated the theory through experiments on standard data sets. The experiments show that the FedSSO algorithm improves greatly over the FedAvg and FedProx algorithms on heterogeneous data sets. Finally, the convergence conditions of the algorithm are analyzed, and it is shown that a decreasing learning rate is important for model convergence.
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations: if X employs A, X employs B, or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing instances.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes", "has", "contains", or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or a plurality of or more than one unit are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (9)

1. A federated learning method for heterogeneous clients based on hierarchical sampling optimization, applied to N clients and a parameter server, characterized by comprising the following steps:
selecting available clients from different clusters, wherein the data received by the clients are heterogeneous;
the parameter server initializes a global model and broadcasts it to all clients; each client trains on a local data sample according to the received global model to obtain local model parameters; the parameter server collects the local model parameter information of each client and divides the clients into different clusters with a clustering method;
during each training round, available clients are drawn from each cluster according to the sample weight to participate in training, and gradient aggregation is performed, which ensures that all kinds of data participate in every training round and reduces the influence of client heterogeneity; the training objective is a convex function;
after the clients participating in a round receive the latest global model parameters from the parameter server, they compute the gradient under the current parameters using local data, run E iterations of stochastic gradient descent, and send the latest parameters back to the parameter server, which takes a weighted average of the returned parameters.
2. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 1, wherein the mean of one-dimensional data is learned from the N clients and the target is converted into the following mean-square-error minimization problem:
$$\min_x F(x)=\sum_{i=1}^{N}\rho_i\,\mathbb{E}_{\xi_i\sim D_i}\big[(x-\xi_i)^2\big]$$
where ξ_i ∼ D_i is a sample drawn independently from the local data with mean e_i = E[ξ_i], ρ_i is the weight of the i-th client, ξ_i is a sample of the client, and τ_i is the aggregation weight offset of the i-th client.
3. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 2, wherein when each client has the same data volume, the optimal solution of the problem is:
$$x^{*}=\frac{1}{N}\sum_{i=1}^{N}e_i$$
With τ_i denoting the weight offset of the i-th client, the objective function will converge to:
$$x=\sum_{i=1}^{N}\Big(\frac{1}{N}+\tau_i\Big)e_i$$
4. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 1, wherein the clustering method is the OPTICS clustering method.
5. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 1, wherein the formula for the client local model parameters is:
$$w_i = w_t - \eta\,\nabla F_i(w_t;\xi_i)$$
where η is the learning rate, w_t is the global model, ξ_i is a sample of the client, ∇F_i is the gradient of the loss function of the i-th client, and i denotes the i-th client.
6. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 1, wherein the weighted average method is as follows:
Figure FDA0003699562010000026
where K is the total number of clients drawn in each round and k denotes the k-th client.
7. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 1, wherein T_ε is the number of iterations required by the algorithm to reach accuracy ε, and the number of communications between the clients and the parameter server, T_ε/E, is:
Figure FDA0003699562010000028
where G² is the bound on the expected squared norm of the stochastic gradients, σ_k² is the variance bound of the stochastic gradients, ρ_k is the weight of the k-th client, E is the local iteration round, N is the total number of clients, K is the total number of clients drawn in each round, L denotes L-smoothness, and μ denotes μ-strong convexity,
when the degree of data heterogeneity is low and Γ is close to 0, a larger local iteration round E is better; when the degree of data heterogeneity is high and Γ is larger, a smaller local iteration round E is better.
8. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 7, wherein under the condition that all clients participate in training, the convergence of the algorithm is:
Figure FDA0003699562010000031
wherein,
Figure FDA0003699562010000032
$$\Gamma=F^{*}-\sum_{k=1}^{N}\rho_k F_k^{*}$$
is the data heterogeneity parameter;
under the condition that part of clients participate in training, the algorithm convergence is as follows:
Figure FDA0003699562010000034
wherein
Figure FDA0003699562010000035
Meanwhile, the learning rate must be set to decrease gradually during training in order to converge to the optimal result, so that the algorithm reaches a convergence rate of O(E²/T).
9. The federated learning method for heterogeneous clients based on hierarchical sampling optimization according to claim 8, characterized in that, according to the following formula:
Figure FDA0003699562010000036
it is obtained that the algorithm can be guaranteed to converge to the optimal solution only by setting a decreasing learning rate, where the decreasing learning rate parameters are: η_t = 0.1/(1+t), η_t ≤ 2η_{t+E}.
CN202210690767.6A 2022-06-17 2022-06-17 Heterogeneous client-oriented joint learning method based on hierarchical sampling optimization Pending CN115204416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210690767.6A CN115204416A (en) 2022-06-17 2022-06-17 Heterogeneous client-oriented joint learning method based on hierarchical sampling optimization


Publications (1)

Publication Number Publication Date
CN115204416A true CN115204416A (en) 2022-10-18

Family

ID=83576625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210690767.6A Pending CN115204416A (en) 2022-06-17 2022-06-17 Heterogeneous client-oriented joint learning method based on hierarchical sampling optimization

Country Status (1)

Country Link
CN (1) CN115204416A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115936110A (en) * 2022-11-18 2023-04-07 重庆邮电大学 Federal learning method for relieving isomerism problem
CN116665319A (en) * 2023-07-31 2023-08-29 华南理工大学 Multi-mode biological feature recognition method based on federal learning
CN116665319B (en) * 2023-07-31 2023-11-24 华南理工大学 Multi-mode biological feature recognition method based on federal learning
CN117173750A (en) * 2023-09-14 2023-12-05 中国民航大学 Biological information processing method, electronic device, and storage medium
CN117349672A (en) * 2023-10-31 2024-01-05 深圳大学 Model training method, device and equipment based on differential privacy federal learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination