CN116542322A - Federated learning method - Google Patents

Federated learning method

Info

Publication number
CN116542322A
Authority
CN
China
Prior art keywords
model
client
pruning
training
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310498361.2A
Other languages
Chinese (zh)
Inventor
袁培燕
石玲
赵晓焱
张俊娜
刘春红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Normal University
Original Assignee
Henan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Normal University filed Critical Henan Normal University
Priority to CN202310498361.2A
Publication of CN116542322A
Legal status: Pending (Current)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The invention belongs to the technical field of edge artificial intelligence, and particularly relates to a federated learning method. In the method, each client performs sparse training on the network model downloaded from the server; in each training iteration, if the accuracy of the network model is greater than an accuracy threshold and the pruning rate is smaller than the target pruning rate, pruning is performed, and for each weight set to 0 the corresponding position in the mask is set to 0, until the iteration termination condition of the training is reached, yielding the local model and mask of each client, which are uploaded to the server. The server aggregates the local models uploaded by the clients to obtain a global model, and then obtains a subnet of the global model from the global model and the mask of each client and issues it to the client. In the invention, the client performs sparse training on the network model before pruning, so that the network model becomes lighter; compared with heuristic direct pruning, convergence is accelerated and the problem of limited client resources is better addressed.

Description

Federated learning method
Technical Field
The invention belongs to the technical field of edge artificial intelligence, and particularly relates to a federated learning method.
Background
According to the global mobile market report, the number of smartphone users worldwide was predicted to exceed four billion in 2023, and these devices generate a large amount of data every second. Uploading the data to the cloud causes considerable delay, and even if the application has low real-time requirements, the cloud will face increasing pressure as the device scale keeps expanding, so sinking computation to the edge is adopted to relieve the pressure on the cloud. Edge computing provides services close to the user or the data source, and performing part of the work at the edge better protects the user's data privacy.
Many studies have made great breakthroughs in applying machine learning to other fields. Edge devices are close to users and can collect a large amount of data; machine learning can extract useful information from these data, so its theory and technology can provide better services for users in edge computing. Federated learning can train a neural network model jointly by the clients and the server while protecting data security.
However, it is difficult for clients to train models with the assistance of a server. Neural network models are large and require many training iterations, while the resources and computing power of edge devices are limited, so they may be unable to bear the training and communication overhead. To reduce costs, many studies lighten the model by reducing the number of parameters, the amount of communication data, or the number of communication rounds, but this ignores the fact that data are not independently and identically distributed (non-IID) across clients. In addition, models trained from heterogeneous data on different clients are often heterogeneous themselves, and model performance may degrade after aggregation at the server. Personalized federated learning is a good solution to data heterogeneity; common approaches to constructing personalized federated learning include meta learning, transfer learning and adaptive adjustment, but these methods do not consider the limited computing and communication capabilities of clients.
The above methods for reducing model overhead damage model performance to some extent, while methods for heterogeneous data need additional parameters to represent the data, thereby increasing model overhead. Adopting a pruning strategy in federated learning can train a personalized model for each client while removing redundant parameters, but the pruning strategies generally adopted are heuristic and do not consider the dynamics of weights during training, which causes a non-negligible loss of accuracy and requires additional retraining to recover, so the cost is still relatively high.
Disclosure of Invention
The invention aims to provide a federated learning method for solving the problem of model accuracy loss caused by adopting a heuristic pruning strategy.
In order to solve the above technical problem, the invention provides a federated learning method comprising the following steps:
1) Each client performs sparse training on the network model downloaded from the server and judges the accuracy and pruning rate of the network model in each iteration of the sparse training; if the accuracy of the network model is greater than the accuracy threshold and the pruning rate is smaller than the target pruning rate, the weights of the network model that are close to 0 are reset to 0 for pruning, and for each weight set to 0 the corresponding position in the corresponding mask is set to 0; this is repeated until the iteration termination condition of the sparse training is reached, yielding the local model and mask of each client, which are uploaded to the server;
2) The server aggregates the local models uploaded by the clients to obtain a global model; a subnet of the global model is obtained according to the global model and the mask of each client and sent to the corresponding client.
The beneficial effects of the above technical scheme are as follows: in the invention, the client performs sparse training on the network model before pruning, so that the network model becomes lighter; compared with heuristic direct pruning, convergence is accelerated and the problem of limited client resources is better addressed, so that the network model can finally run better on the client. Compared with methods without sparse training, pruning has less influence on model accuracy; in addition, the accuracy and pruning rate of the network model are checked before pruning, and pruning is performed only when both meet the requirements, which prevents permanent damage to model accuracy.
Further, when aggregating the local models, only the parts of the local models that have not been pruned are aggregated.
The beneficial effect of this technical scheme is as follows: aggregating only the non-pruned parts can reduce the influence of data heterogeneity on model performance.
Further, the corresponding positions of the non-pruned parts of the local models are averaged to realize the aggregation.
Further, in step 2), the global model is multiplied by the mask of each client at the corresponding positions to obtain the subnet of that client.
Further, the loss function is converted into an optimization problem with a sparsity requirement, and the ADMM algorithm is adopted to solve the optimization objective, thereby realizing the sparse training of the network model.
The beneficial effect of this technical scheme is as follows: ADMM can solve the non-convex optimization problem under sparse constraints well, so it is used for the solution.
Drawings
FIG. 1 is a diagram of the personalized federated learning process with ADMM weight pruning according to the present invention;
FIG. 2 is a flow chart of ADMM weight pruning according to the present invention;
FIG. 3 is a schematic diagram of an aggregate local model of the present invention;
FIG. 4 is a schematic diagram of an acquisition subnetwork of the present invention;
FIG. 5 (a) is a graph of accuracy obtained when different target pruning rates are set for CIFAR-10;
FIG. 5 (b) is a graph of accuracy obtained when different target pruning rates are set for MNIST;
FIG. 6 (a) is a graph of accuracy obtained using different algorithms for CIFAR-10;
FIG. 6 (b) is a graph of accuracy obtained using different algorithms for MNIST;
FIG. 7 (a) is a graph comparing communication costs for CIFAR-10 using different algorithms;
FIG. 7 (b) is a comparison graph of communication costs for MNIST using different algorithms;
FIG. 8 (a) is a main interface diagram of a system platform;
FIG. 8 (b) is a flower identification function interface view of the system platform;
FIG. 9 is a selection model interface diagram;
FIG. 10 is a selection image interface diagram;
FIG. 11 (a) and FIG. 11 (b) are graphs of the flower recognition results of the server model and the client model, respectively, on the same picture.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Method embodiment:
Assume there are 1 server and n clients in the network; client 1, client 2 and client 3 are any 3 clients in communication with the server, and the server trains the model jointly with them. In this embodiment, the server, client 1, client 2 and client 3 are taken as an example to describe the model training process in detail. The whole learning process is shown in FIG. 1 and mainly includes three processes: ADMM weight pruning, local model aggregation, and subnet acquisition.
Step one: each client performs ADMM sparse training and pruning on the network model downloaded from the server to obtain its local model and mask, and uploads them to the server. As shown in FIG. 2, specifically:
Sparse training is first performed on the network model. The loss function of the model is converted into an optimization problem with a sparsity requirement, and the network model (a neural network in this embodiment) is trained sparsely, which can be expressed as:

min_{W_i} L_i(W_i)  s.t.  W_i ∈ S_i        (1)

Equation (1) minimizes the expected loss function L_i of client i under a constraint on the number of non-zero model weights. Here W_i denotes the neural network weights of client i, obtained as W_i = W_g ⊙ M_i; W_g denotes the parameters of the global model on the server; M_i denotes the mask of client i; S_i = {W_i | card(W_i) ≤ N_i} is a non-convex set, card(W_i) returns the number of non-zero elements of the neural network of client i, and N_i denotes the desired number of non-zero elements.
Equation (1) is then solved. It is a non-convex optimization problem under a sparse constraint, and ADMM is a powerful algorithm for solving such problems, so ADMM is used to solve the optimization objective, i.e. equation (1). ADMM combines the respective advantages of dual ascent and the augmented Lagrangian method: it decomposes the original problem into several sub-problems and solves the variables alternately, and it is widely used in large-scale distributed learning with good results. To minimize the objective function, the indicator function g(W_i) of S_i is first introduced:

g(W_i) = 0 if W_i ∈ S_i, and g(W_i) = +∞ otherwise        (2)
Adding the indicator function to equation (1) converts the original inequality constraint into the equality constraint of equation (3):

min_{W_i, Z_i} L_i(W_i) + g(Z_i)  s.t.  W_i = Z_i        (3)
Equation (3) is consistent with the standard form of ADMM; its augmented Lagrangian function is:

L_ρ(W_i, Z_i, Y) = L_i(W_i) + g(Z_i) + Y^T(W_i − Z_i) + (ρ/2)‖W_i − Z_i‖²        (4)
where Y^T is the Lagrangian multiplier and ρ is the penalty parameter. Let U_i = Y/ρ; then equation (4) becomes:

L_ρ(W_i, Z_i, U_i) = L_i(W_i) + g(Z_i) + (ρ/2)‖W_i − Z_i + U_i‖² − (ρ/2)‖U_i‖²        (5)
the ADMM algorithm iterates the following formula several times to solve:
solving W by using random gradient descent method i
The solution of equation (7) can be expressed as:

Z_i^{k+1} = Π_{S_i}(W_i^{k+1} + U_i^k)

where Π_{S_i}(·) denotes the Euclidean projection onto S_i. This projection has a closed-form solution: keep the N_i largest-magnitude entries of W_i^{k+1} + U_i^k unchanged and set the rest to zero. U_i is then updated using equation (8), which completes one iteration of ADMM.
In a concrete implementation, the loss used for the W-update can be taken as L_i(W_i) + (ρ/2)‖W_i − Z_i + U_i‖², and W, Z and U are updated dynamically after each round of training: W is the model weight after that round of iterative training; the N_i largest entries of W + U are kept unchanged and the remaining entries are set to zero, and the result is assigned to Z; U is then updated to U + W − Z.
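For concreteness, a minimal PyTorch sketch of one such ADMM round is given below. It is a sketch under stated assumptions rather than the patent's reference implementation: the function names (project_topk, admm_round), the per-tensor handling of the sparsity budget n_keep and the learning rate are illustrative choices.

import torch

def project_topk(t, n_keep):
    # Euclidean projection onto S_i: keep the n_keep largest-magnitude entries
    # of t and set the rest to zero (closed-form solution of equation (7)).
    flat = t.flatten()
    out = torch.zeros_like(flat)
    if n_keep > 0:
        idx = flat.abs().topk(min(n_keep, flat.numel())).indices
        out[idx] = flat[idx]
    return out.view_as(t)

def admm_round(model, loader, loss_fn, Z, U, rho, n_keep, lr=0.01):
    # One ADMM round: SGD on L_i(W) + (rho/2)*||W - Z + U||^2 (equation (6)),
    # followed by the Z-update (equation (7)) and the U-update (equation (8)).
    # Z and U are dicts of tensors keyed by parameter name; n_keep is applied
    # per tensor here, which is a simplification.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        for name, w in model.named_parameters():
            loss = loss + (rho / 2) * torch.norm(w - Z[name] + U[name]) ** 2
        loss.backward()
        opt.step()
    with torch.no_grad():
        for name, w in model.named_parameters():
            Z[name] = project_topk(w + U[name], n_keep)    # Z := top-N_i of (W + U)
            U[name] = U[name] + w - Z[name]                # U := U + W - Z
    return Z, U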
The network model accuracy acc and the pruning rate prune are calculated in each iteration. If the accuracy acc of the network model is greater than the accuracy threshold acc0 and the pruning rate prune is less than the target pruning rate prune0, i.e. acc > acc0 and prune < prune0, the weights of the network model that are close to 0 are reset to 0 for pruning; this is repeated until the iteration termination condition of the sparse training is reached, the corresponding mask is determined, and the local model of each client is obtained. Specifically, each iteration judges the accuracy and pruning rate of the network model as follows: if acc > acc0 and prune < prune0, pruning can be performed to lighten the model, and training continues in the next iteration after pruning; if acc > acc0 and prune ≥ prune0, pruning cannot continue, and the next iteration directly continues training; if acc ≤ acc0 and prune < prune0, the model accuracy does not yet meet the requirement, and the next iteration directly continues training; if acc ≤ acc0 and prune ≥ prune0, the model accuracy likewise does not meet the requirement, and the next iteration directly continues training. In other words, the whole training process uses the required number of iterations as the termination condition, and pruning is performed only when acc > acc0 and prune < prune0 are both satisfied during training.
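The pruning decision itself reduces to a single conjunction of the two conditions; the short sketch below (the function name should_prune is an illustrative assumption) makes the four cases explicit.

def should_prune(acc, prune, acc0, prune0):
    # Prune only when accuracy exceeds the threshold AND the current pruning
    # rate is still below the target; in the other three cases training simply
    # continues in the next iteration without pruning.
    return acc > acc0 and prune < prune0

# Example: accuracy 0.83 above a 0.80 threshold and pruning rate 0.25 below a
# 0.30 target, so pruning is allowed this iteration.
print(should_prune(acc=0.83, prune=0.25, acc0=0.80, prune0=0.30))  # True
print(should_prune(acc=0.78, prune=0.25, acc0=0.80, prune0=0.30))  # False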
During sparse training, a very large number of iterations would be required before p% of the parameters of the model become exactly zero (p% being the iterative pruning rate), which does not meet our lightweight goal; therefore training is stopped when the client reaches the predetermined number of iterations, at which point the weights to be pruned may not all be exactly 0, but most are close to 0. To meet the sparsity requirement, p% of the parameters of the model (which after sparse training are already near 0) are pruned, i.e. the weights near 0 are cut off. Since these weights are themselves small, pruning has less impact on model accuracy than methods without sparse training. Before pruning, it is necessary to judge whether the current accuracy and pruning rate meet the requirements, so as to prevent permanent damage to model accuracy. When a parameter in the client's neural network model is set to 0, the value at the corresponding position in the mask is set to 0; the weights with value 0 are frozen and are not updated in subsequent training, which reduces computation and makes the method suitable for small edge devices.
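A minimal PyTorch sketch of this pruning-and-freezing step is shown below. Per-tensor magnitude pruning, the function name magnitude_prune and the gradient-hook freezing are illustrative assumptions rather than details taken from the patent text.

import torch

def magnitude_prune(model, masks, frac):
    # For each parameter tensor: set the smallest-magnitude surviving weights
    # (a fraction frac of the tensor) to 0, record the positions in the mask,
    # and freeze them by zeroing their gradients in every backward pass.
    for name, w in model.named_parameters():
        mask = masks[name]
        alive = w.data[mask.bool()].abs()
        k = int(frac * w.numel())
        if alive.numel() == 0 or k == 0:
            continue
        thresh = alive.kthvalue(min(k, alive.numel())).values
        newly_pruned = (w.data.abs() <= thresh) & mask.bool()
        mask[newly_pruned] = 0.0
        w.data[newly_pruned] = 0.0
        # In practice this hook only needs to be registered once per parameter.
        w.register_hook(lambda g, m=mask: g * m)
    return masks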
Step two: the server aggregates the local models uploaded by the clients to obtain the global model.
After ADMM weight pruning, each client uploads its model, which retains personalized features, to the server. Models trained on datasets with similar labels are themselves similar. The traditional method of averaging all parameters is not suitable for the invention because the models have been pruned at the clients; only the non-pruned parts are aggregated, and the parameters that partially or completely overlap across the models are averaged. If a parameter has been pruned in all clients except one, it does not participate in the aggregation, which reduces the influence of heterogeneous data on accuracy. The aggregation method of the invention preserves the personalized attributes embedded in the models by finding partners for each client, and improves model performance.
For example, as shown in FIG. 3, the first parameter of the first layer of the local models of client 1, client 2 and client 3 is 3, 0 and 2 respectively; aggregating only the non-zero values gives an average of 5/2.
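A short sketch of this aggregation rule under an assumed state-dict layout is given below; it reproduces the FIG. 3 example by averaging each position only over the clients that did not prune it.

import torch

def aggregate_unpruned(states, masks):
    # states: list of per-client state_dicts; masks: matching list of 0/1 masks.
    global_state = {}
    for name in states[0]:
        stacked = torch.stack([s[name].float() for s in states])
        kept = torch.stack([m[name].float() for m in masks])
        count = kept.sum(dim=0).clamp(min=1.0)          # avoid division by zero
        global_state[name] = (stacked * kept).sum(dim=0) / count
    return global_state

# FIG. 3 example: parameters 3, 0 (pruned) and 2 aggregate to 5/2.
states = [{"w": torch.tensor([3.0])}, {"w": torch.tensor([0.0])}, {"w": torch.tensor([2.0])}]
masks = [{"w": torch.tensor([1.0])}, {"w": torch.tensor([0.0])}, {"w": torch.tensor([1.0])}]
print(aggregate_unpruned(states, masks)["w"])  # tensor([2.5000])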
Step three: the server obtains a subnet of the global model according to the global model and the mask of each client, and sends the subnet to the corresponding client.
After local training, each client marks the pruning state of its model with a mask and uploads the mask to the server; the mask embeds information about the local data and is used to distinguish the models of the clients on the server. The global model is multiplied by the mask at the corresponding positions to obtain a subnet, which is downloaded to the client. In short, the subnet is a part of the global model selected for the client according to the characteristics of its local data. After a client subset S is randomly selected, each client in S obtains a different subnet from the server, which realizes the personalization of federated learning.
For example, as shown in FIG. 4, the parameter and the mask value corresponding to the first neuron of the first layer of the subnet are 3 and 1 respectively, and W_g ⊙ M_1 gives 3 as the corresponding parameter in the subnet; similarly, the third parameter of the second layer of the subnet is obtained as 0.
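The subnet extraction is exactly this element-wise product; a short sketch under the same assumed state-dict layout:

def extract_subnet(global_state, mask):
    # Subnet of one client: W_g ⊙ M_i at every position; pruned positions stay 0.
    return {name: global_state[name] * mask[name] for name in global_state}

Applied to the FIG. 4 example, a global parameter of 3 with mask value 1 stays 3 in the subnet, while a mask value of 0 yields a subnet parameter of 0.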
This completes the federated learning method of the invention. From the above description of model training, it can be seen that ADMM weight pruning is mainly used to remove redundant parameters from the network, and a dedicated model is built for each client according to its mask, realizing a lightweight and personalized model.
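Putting the three processes together, one communication round can be outlined as follows. This is only an orchestration sketch: the client object and its methods (load_subnet, train_locally, mask) are hypothetical placeholders, and extract_subnet and aggregate_unpruned refer to the sketches given earlier in this description.

def federated_round(global_state, clients, masks, rho, n_keep, acc0, prune0):
    # One round: send each client its subnet, run local ADMM sparse training
    # with dynamic pruning, then aggregate only the unpruned parts.
    local_states = []
    for i, client in enumerate(clients):
        client.load_subnet(extract_subnet(global_state, masks[i]))  # download subnet
        local_states.append(client.train_locally(rho, n_keep, acc0, prune0))
        masks[i] = client.mask                                      # upload mask
    return aggregate_unpruned(local_states, masks), masks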
To better illustrate the effect of the method of the invention, the dynamic pruning algorithm implemented by the invention is compared with several existing algorithms on the datasets CIFAR-10 and MNIST, and a flower recognition system is used to further illustrate its practicability. Experiments were run on a 64-bit Windows 10 operating system, with PyTorch 1.3.0 and Android Studio 3.4.1 as the experimental platforms.
CIFAR-10 is a color image dataset with 10 classes and 60,000 images in total, of which 50,000 form the training set and 10,000 the test set. MNIST consists of grayscale images of the handwritten digits 0-9, with 60,000 training images and 10,000 test images. For CIFAR-10, a six-layer neural network architecture with five hidden layers and one output layer is used for training: first two 5×5 convolutional layers, then three fully connected layers with 120, 84 and 10 channels respectively, and finally the output layer. For MNIST, a five-layer neural network architecture with four hidden layers and one output layer is used: first two 5×5 convolutional layers, then two fully connected layers with 50 and 10 channels respectively, and finally the output layer. The experimental parameter settings for CIFAR-10 and MNIST are shown in Table 1.
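The CIFAR-10 network described above is LeNet-like; a plausible PyTorch definition is sketched below. The patent does not specify the convolutional channel widths or the pooling, so the values 6 and 16 and the 2x2 max pooling are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CifarNet(nn.Module):
    # Assumed instantiation of the described architecture: two 5x5 convolutional
    # layers, then fully connected layers of 120, 84 and 10 units.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)   # conv channel widths assumed
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)    # 32x32 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)    # 14x14 -> 5x5
        x = x.flatten(1)
        return self.fc3(F.relu(self.fc2(F.relu(self.fc1(x)))))

print(CifarNet()(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])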
Table 1: CIFAR-10 and MNIST parameter settings
In the communication between the server and the clients, 10% of the parameters are pruned per pruning step, and the target pruning rate is set to 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%; the resulting accuracies on CIFAR-10 and MNIST are shown in FIG. 5 (a) and FIG. 5 (b) respectively. On CIFAR-10 the accuracy is highest when the target pruning rate is 30%, and on MNIST when it is 10%; at higher pruning rates the model accuracy decreases because some important parameters are cut off. The communication volume obtained under different target pruning rates also differs: on CIFAR-10 the accuracy exceeds 90% at a target pruning rate of 30%, but correspondingly more traffic is required, and the same holds for MNIST.
Based on the experimental results in FIG. 5 (a) and FIG. 5 (b), the target pruning rates for CIFAR-10 were set to 30%, 50% and 70%, and those for MNIST to 10%, 50% and 90%. The method of the invention is compared with Standalone, FedAvg and LG-FedAvg under these target pruning rates; the three baseline algorithms are introduced below.
Standalone: each client trains the model on its local device with its own data, without server participation.
FedAvg: the classical federated averaging algorithm. The server sends the global model to each client, and each client uploads its model to the server for aggregation after local training.
LG-FedAvg: a useful high-dimensional representation is learned from the local data of each client, and the global model is applied to this representation, thereby reducing traffic and making the model lightweight.
FIG. 6 (a) and FIG. 6 (b) compare the accuracy of the six groups of experiments on the two datasets, and FIG. 7 (a) and FIG. 7 (b) compare the communication costs (Standalone has no communication cost since no server is involved). It can be observed from FIG. 6 (a), 6 (b), 7 (a) and 7 (b) that the lightweight personalized federated learning proposed by the invention outperforms Standalone, FedAvg and LG-FedAvg in both accuracy and traffic. For CIFAR-10, the highest accuracy is 90.14% at a target pruning rate of 30%, with 775.95 MB of traffic; LG-FedAvg reaches 80.78% accuracy with 1127.89 MB of traffic, so the accuracy of the method is 9.36 percentage points higher than that of LG-FedAvg and the traffic is reduced by a factor of about 1.45. For MNIST, the highest accuracy is 99.96% at a target pruning rate of 10%, with 212.45 MB of traffic; LG-FedAvg reaches 97.03% with 244.82 MB of traffic, so the accuracy is improved by 2.93 percentage points and the traffic is reduced by a factor of about 1.15.
In addition, a flower dataset created by Oxford University was used to train a MobileNetV2 model, which was deployed on an Android device to further illustrate the effectiveness of the invention. The dataset contains 17 classes of flowers with 80 pictures per class, 1360 pictures in total. MobileNetV2 is a lightweight convolutional neural network based on an inverted residual structure, proposed by the Google team in 2018. The experimental parameter settings for the flower dataset are shown in Table 2; the accuracy of the obtained model is 92.47% and the traffic per client is 0.8783 GB.
Table 2: Flower dataset parameter settings
The trained model is deployed into the local recognition system through the following three steps. FIG. 8 (a) shows the main interface of the system; clicking "PYTORCH flower recognition" opens the flower recognition interface shown in FIG. 8 (b). (1) Convert the .pth model file obtained in PyCharm into a .pt file supported by Android Studio. (2) Add the PyTorch dependency in the build.gradle of the Android Studio project, and place the converted model file and the label information into the project. (3) Perform inference with the model.
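Step (1) is commonly done with TorchScript tracing; a minimal sketch is given below, in which the file names and the 224x224 input size are assumptions for illustration.

import torch
from torchvision.models import mobilenet_v2

# Rebuild MobileNetV2 with 17 output classes and load the trained weights.
model = mobilenet_v2(num_classes=17)
model.load_state_dict(torch.load("flower_mobilenetv2.pth", map_location="cpu"))
model.eval()

# Trace with a dummy input and save the TorchScript .pt file, which the Android
# Studio project can load through the PyTorch Android dependency.
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)
traced.save("flower_mobilenetv2.pt")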
The invention deploys the global model of the server and the personalized model of a client into the system, where the client holds two kinds of flowers, Daisy and Bluebell. As shown in FIG. 9, both the server model and the client model are available for selection. Clicking "album" selects the image to be recognized on the mobile device, as shown in FIG. 10. FIG. 11 (a) and FIG. 11 (b) show the flower recognition results of the server model and the client model on the same picture; it can be clearly seen that the personalized model of the client achieves higher accuracy than the global model of the server.
In conclusion, the algorithm provided by the invention adapts better to edge devices and trains a personalized model for each client while reducing model overhead. The invention uses ADMM-based weight pruning and prunes after sparse training of the model, making the model lightweight; compared with heuristic direct pruning, convergence is accelerated and the problem of limited client resources is better addressed. A personalized model is constructed for each client using the masks, the server downloads a subnet of the global model to each client, and only the non-pruned parts are aggregated, which reduces the influence of data heterogeneity on model performance. Aiming at the problems of high model overhead and data heterogeneity, the invention trains a lightweight personalized model for each client so that the model can run better on the client.

Claims (5)

1. A federated learning method, comprising the following steps:
1) Each client performs sparse training on the network model downloaded from the server and judges the accuracy and pruning rate of the network model in each iteration of the sparse training; if the accuracy of the network model is greater than the accuracy threshold and the pruning rate is smaller than the target pruning rate, the weights of the network model that are close to 0 are reset to 0 for pruning, and for each weight set to 0 the corresponding position in the corresponding mask is set to 0; this is repeated until the iteration termination condition of the sparse training is reached, yielding the local model and mask of each client, which are uploaded to the server;
2) The server aggregates the local models uploaded by the clients to obtain a global model; a subnet of the global model is obtained according to the global model and the mask of each client and sent to the corresponding client.
2. The federated learning method according to claim 1, wherein, when the local models are aggregated, only the parts of the local models that have not been pruned are aggregated.
3. The federated learning method according to claim 2, wherein the aggregation is realized by averaging the corresponding positions of the non-pruned parts of the local models.
4. The federated learning method according to claim 1, wherein, in step 2), the global model is multiplied by the mask of each client at the corresponding positions to obtain the subnet of that client.
5. The federated learning method according to any one of claims 1 to 4, wherein the loss function is converted into an optimization problem with a sparsity requirement, and the ADMM algorithm is used to solve the optimization objective to realize the sparse training of the network model.
CN202310498361.2A 2023-04-28 2023-04-28 Federated learning method Pending CN116542322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310498361.2A CN116542322A (en) 2023-04-28 2023-04-28 Federated learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310498361.2A CN116542322A (en) 2023-04-28 2023-04-28 Federated learning method

Publications (1)

Publication Number Publication Date
CN116542322A true CN116542322A (en) 2023-08-04

Family

ID=87448223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310498361.2A Pending CN116542322A (en) Federated learning method

Country Status (1)

Country Link
CN (1) CN116542322A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216596A (en) * 2023-08-16 2023-12-12 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117216596B (en) * 2023-08-16 2024-04-30 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117151174A (en) * 2023-10-30 2023-12-01 国网浙江省电力有限公司杭州供电公司 Federal learning model light weight method and system based on model pruning and quantization
CN117436515A (en) * 2023-12-07 2024-01-23 四川警察学院 Federal learning method, system, device and storage medium
CN117436515B (en) * 2023-12-07 2024-03-12 四川警察学院 Federal learning method, system, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination