CN115278709A - Communication optimization method based on federal learning - Google Patents

Communication optimization method based on federal learning

Info

Publication number
CN115278709A
CN115278709A
Authority
CN
China
Prior art keywords
gradient
local
model
matrix
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210906790.4A
Other languages
Chinese (zh)
Other versions
CN115278709B (en)
Inventor
张盼
许春根
徐磊
梅琳
窦本年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210906790.4A priority Critical patent/CN115278709B/en
Publication of CN115278709A publication Critical patent/CN115278709A/en
Application granted granted Critical
Publication of CN115278709B publication Critical patent/CN115278709B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a communication optimization method based on federated learning, belonging to the technical field of federated machine learning and comprising the following steps: acquiring the local gradient core features uploaded by each client, where the local gradient core features are the data extracted by each client by preprocessing target data, training a local model, and performing singular value decomposition on the local model; performing sketch gradient compression and sketch gradient aggregation on the local gradient core features, extracting the first k sparse parameter matrices, executing the federated averaging aggregation algorithm on the first k sparse parameter matrices to obtain a global model, and sending the global model to each client; and calculating the accumulated gradient error value and the gradient error compensation value of each client, adaptively adjusting the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation, and then iterating. The method can effectively compress the model gradient, reduce the communication overhead generated in model training, and improve the accuracy of model prediction.

Description

Communication optimization method based on federal learning
Technical Field
The invention relates to the technical field of federated machine learning, and in particular to a communication optimization method based on federated learning.
Background
Existing federated learning generally uploads the intermediate parameters of model training to a parameter server to realize model aggregation, and two problems arise in this process: first, large-scale distributed training requires a large amount of communication bandwidth for gradient exchange, which limits the scalability of multi-node training; second, it requires expensive high-bandwidth network infrastructure.
When distributed training is performed on mobile devices, it additionally suffers from higher network latency, lower system throughput, and intermittent, unreliable connectivity.
To solve the above problems, if the traditional singular value decomposition (SVD) method is adopted directly, the accuracy of model prediction is improved but the gradient compression effect is not ideal; if the sketch compression method is adopted directly, the gradient compression effect is ideal but the accuracy of model prediction is reduced.
Disclosure of Invention
The invention aims to provide a communication optimization method based on federated learning, which can effectively compress model gradients, reduce the communication overhead generated in model training, and improve the accuracy of model prediction.
To achieve this purpose, the invention adopts the following technical scheme:
A communication optimization method based on federated learning comprises the following steps: S1, acquiring the local gradient core features uploaded by each client, where the local gradient core features are the model gradients extracted by each client through preprocessing target data, training a local model, and performing singular value decomposition on the local model; the local model is a federated model. S2, performing sketch gradient compression on the local gradient core features, performing sketch gradient aggregation, extracting the first k sparse parameter matrices, executing the federated averaging aggregation algorithm on the first k sparse parameter matrices to obtain a global model, and distributing the global model to each client. S3, calculating the accumulated gradient error value and the gradient error compensation value of each client, adaptively adjusting the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation, and then iteratively executing S1 to S3.
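For orientation, the following minimal Python/NumPy sketch illustrates how steps S1 to S3 fit together on the server side. It is an illustration only and not the patented implementation: the function names (local_core_features, top_k_sparsify), the top-k stand-in for sketch compression, and the random toy data are assumptions introduced for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_core_features(w_global, data):
    """S1 stand-in (client side): preprocessing + local training + SVD compression.
    Here a noisy gradient-like matrix of the same shape as w_global is returned."""
    return 0.1 * w_global + rng.normal(scale=0.01, size=w_global.shape)

def top_k_sparsify(g, k):
    """Keep the k largest-magnitude entries of g and zero the rest (sketch stand-in)."""
    flat = g.ravel().copy()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(g.shape)

n_clients, rounds, k = 4, 3, 50
w_global = rng.normal(size=(10, 20))      # global model parameters
datasets = [None] * n_clients             # placeholder local data sets

for t in range(rounds):
    # S1: collect the compressed local gradients uploaded by each client
    uploads = [local_core_features(w_global, d) for d in datasets]
    # S2: sketch-style compression (top-k here) and federated averaging
    aggregated = np.mean([top_k_sparsify(g, k) for g in uploads], axis=0)
    w_global = w_global - 1.0 * aggregated
    # S3: accumulated-error tracking and compensation would be applied here
print(w_global.shape)
```

In this skeleton the error accumulation of S3 is only marked by a comment; a fuller sketch of the compensation step is given after the detailed description below.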
Further, preprocessing the target data includes: dividing the target data into a plurality of data sets and executing a uniform random sampling operation to select the batch size of the data set in each iteration, obtaining a target data set; performing a norm-bound calculation on the target data set; acquiring the initialized global model parameters, training a local model on the norm-bounded target data set, and calculating the local gradient of the local model; and clipping the local gradient to obtain the gradient matrix of the local model.
Further, the singular value decomposition of the local model is performed as follows: each client executes a fast gradient descent algorithm and decomposes the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing the gradients, obtaining the local gradient core features.
Further, each client executing a fast gradient descent algorithm and decomposing the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing the gradients to obtain the local gradient core features includes: storing the data of the gradient matrix of the local model in a zero matrix in sequence and compressing until no zero-valued rows remain in the zero matrix, obtaining a first matrix; constructing a subsampled randomized Hadamard transform matrix, computing the product of the subsampled randomized Hadamard transform matrix and the first matrix, and compressing the first matrix from m rows to a smaller sketch dimension, obtaining a second matrix; and executing the SVD algorithm on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing the gradients, and obtaining the local gradient core features.
Further, S3 includes: calculating the local model loss value of each client in the current iteration; calculating the cross entropy of the local model loss value, and calculating the quality evaluation weight of the local model according to the cross entropy; calculating the local gradient error generated by sketch gradient compression during sketch gradient aggregation to obtain the accumulated gradient error value and the gradient error compensation value of each client; adjusting the weight with which the local gradient core features of each client enter sketch gradient aggregation according to the quality evaluation weight and the accumulated gradient error value of each client's local model; and adaptively adjusting the gradient to be compensated of each client's local gradient core features according to the gradient error compensation value, then entering the next iteration.
Drawings
The invention and its features, aspects and advantages will become more apparent from reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. Like reference symbols in the various drawings indicate like elements. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is a general flow chart of a federated learning-based communication optimization method provided by the present invention;
FIG. 2 is a detailed flowchart of a preprocessing algorithm in a federated learning-based communication optimization method provided by the present invention;
FIG. 3 is a detailed flowchart of a fast gradient descent algorithm in a communication optimization method based on federated learning according to the present invention;
FIG. 4 is a detailed flowchart of gradient error compensation in the federated learning-based communication optimization method provided by the present invention.
Detailed Description
The invention will be further described with reference to the following drawings and specific examples, which should not be construed as limiting the invention thereto.
In the prior art, when the intermediate parameters of model training are uploaded to a parameter server to realize model aggregation in a federated learning algorithm, directly adopting the traditional SVD singular value decomposition method improves the accuracy of model prediction but gives an unsatisfactory gradient compression effect, while directly adopting the sketch compression method gives an ideal gradient compression effect but reduces the accuracy of model prediction. Moreover, when distributed training is performed in a mobile edge network, the resources of mobile terminals are limited, and training brings higher network latency, lower system throughput, and intermittent network connection failures. To address these problems, as shown in FIG. 1, the invention provides a communication optimization method based on federated learning, in particular a mobile edge network communication optimization method, which optimizes the communication and reduces the huge communication overhead generated during model training.
First, each client preprocesses its local target data and then trains a local model, where the local model is a federated model; as shown in FIG. 2, the steps specifically include:
splitting the target data into mini-batch data sets and executing a uniform random sampling operation to select the batch size of the data set in each iteration, obtaining the target data set d'_i = d_i ⊙ S_c, where i denotes the i-th client, d_i is the data of the i-th client, c is the number of matrix columns, and S_c is a random uniform sampling matrix with the same dimensions as the sensitive original data;
since each client runs a federated learning model on the target data set d'_i to extract features d'_l = F(d'_i), a norm-bound calculation involving the minimum data batch size and D_l, the infinity norm of the first-layer output of the neural network model, is performed on the target data set d'_i in order to limit the global sensitivity; from this bound it follows that d'_l has an upper limit S_f = max ||f(d) - f(d')||_2, where f is a real-valued function, and provided this bound holds for d_l, the sensitivity of the data is controlled and sensitive data is protected from leakage to a certain extent;
the global model parameters w_global are then initialized and distributed to each client; each client trains a local model on the norm-bounded target data set and calculates the local gradient g^(i) of the local model; the local gradient is clipped to prevent gradient explosion, finally yielding the gradient matrix A of the local model.
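A small client-side sketch of the preprocessing above is given below, under stated assumptions: the exact norm-bound and clipping formulas appear only as images in the source, so a standard uniform mini-batch sample and an L2 gradient clip are used as stand-ins, and the gradient itself is replaced by a toy statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess_and_clip(data, batch_size, clip_norm):
    """Hypothetical client-side preprocessing: uniform random sampling of one
    mini-batch (the d'_i = d_i (.) S_c step), followed by L2 clipping of a
    gradient stand-in to prevent gradient explosion."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    batch = data[idx]
    # stand-in "local gradient": mean of the batch (a real client would backprop here)
    grad = batch.mean(axis=0)
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    return grad

data = rng.normal(size=(1000, 32))     # toy local data of one client
g = preprocess_and_clip(data, batch_size=64, clip_norm=1.0)
print(g.shape, np.linalg.norm(g))
```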
Each client then executes a fast gradient descent algorithm, as shown in FIG. 3, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing the gradients so as to approximately describe a certain percentage of the information of the gradient matrix, and then compressing the gradients in matrix form to obtain the local gradient core features, thereby reducing the high communication cost generated by federated training; the steps specifically include:
collecting in sequence the n data points of each of the m rows of the gradient matrix A ∈ R^{m×n} of the local model and storing them in a zero matrix F ∈ R^{m×n}, then compressing out all zero-valued rows of the zero matrix to obtain a first matrix F';
when it is determined that no zero-valued rows remain in F, constructing a subsampled randomized Hadamard transform matrix Φ = N·H·D, where N ∈ R^{q×m} is a scaled random matrix whose rows are sampled uniformly without replacement from the m rows of the m×m identity matrix and rescaled by √(m/q); D is an m×m diagonal matrix whose entries are i.i.d. Rademacher random variables (-1 or +1 with equal probability); and H ∈ {±1}^{m×m} is the Hadamard matrix defined by h_ij = (-1)^{⟨i-1, j-1⟩}, where ⟨i-1, j-1⟩ is the dot product of the b-bit binary representations of the integers i-1 and j-1, with b = log2 m, and i, j index the row and column; the Hadamard matrix can also be expressed recursively as H_m = [[H_{m/2}, H_{m/2}], [H_{m/2}, -H_{m/2}]], where m is the size of the matrix;
computing the product of the subsampled randomized Hadamard transform matrix Φ and the first matrix F' so as to compress the size of F' from m rows to q rows, and concatenating the newly compressed gradient with the shrunk gradient to obtain a second matrix B ∈ R^{k×n};
executing the SVD algorithm [U, Λ, V] = SVD(B) on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing the gradients, and shrinking the singular values so that the original information is compressed into the leading rows of B, improving the communication efficiency of uploading the local model parameters while having only a small influence on the local gradient core features.
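The following sketch illustrates this compression pipeline (remove zero rows, apply Φ = N·H·D, then SVD the sketched matrix). It assumes the standard SRHT scaling √(m/q) and uses a plain top-k truncated SVD in place of the patent's exact shrinkage step, since that step is shown only as an image in the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def srht(m, q):
    """Subsampled randomized Hadamard transform Phi = N @ H @ D (standard construction)."""
    assert m & (m - 1) == 0, "Hadamard size must be a power of two"
    # H: Sylvester-construction Hadamard matrix
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    # D: diagonal of i.i.d. Rademacher variables
    D = np.diag(rng.choice([-1.0, 1.0], size=m))
    # N: q rows sampled uniformly without replacement, rescaled by sqrt(m/q)
    rows = rng.choice(m, size=q, replace=False)
    N = np.zeros((q, m))
    N[np.arange(q), rows] = np.sqrt(m / q)
    return N @ H @ D

def compress_gradient(A, q, k):
    """Drop all-zero rows, sketch with SRHT, then keep the top-k singular directions."""
    F = A[~np.all(A == 0, axis=1)]            # first matrix F': remove zero-value rows
    m = 1 << (len(F) - 1).bit_length()        # pad rows up to a power of two for H
    F_pad = np.vstack([F, np.zeros((m - len(F), A.shape[1]))])
    B = srht(m, q) @ F_pad                    # second matrix B with q rows
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]            # compressed "local gradient core features"

A = rng.normal(size=(100, 64))
U, s, Vt = compress_gradient(A, q=32, k=8)
print(U.shape, s.shape, Vt.shape)
```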
The parameter server performs sketch gradient compression on the local gradient core features of each client, then performs sketch gradient aggregation, extracts the first k sparse parameter matrices, executes the federated averaging aggregation algorithm on the first k sparse parameter matrices to obtain the global model, and distributes the global model to each client.
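A generic server-side illustration of sketch-based aggregation is given below. It uses a count sketch, one common sketch construction, merely as an assumed stand-in for the sketch used in the patent, and applies federated averaging by scaling each client's contribution before insertion; the top-k extraction then recovers a sparse update.

```python
import numpy as np

rng = np.random.default_rng(0)

class CountSketch:
    """Minimal count sketch used here to illustrate server-side sketch aggregation."""
    def __init__(self, dim, width, depth=5, seed=0):
        r = np.random.default_rng(seed)
        self.buckets = r.integers(0, width, size=(depth, dim))   # bucket hashes
        self.signs = r.choice([-1.0, 1.0], size=(depth, dim))    # sign hashes
        self.table = np.zeros((depth, width))

    def add(self, vec):
        for d in range(self.table.shape[0]):
            np.add.at(self.table[d], self.buckets[d], self.signs[d] * vec)

    def estimate(self):
        # median-of-estimates recovery for every coordinate
        depth = self.table.shape[0]
        est = self.signs * self.table[np.arange(depth)[:, None], self.buckets]
        return np.median(est, axis=0)

def server_round(client_grads, width, k):
    """Sketch each client's (flattened) gradient, aggregate the sketches, then keep
    only the k largest recovered coordinates before the FedAvg-style update."""
    dim = client_grads[0].size
    sk = CountSketch(dim, width)
    for g in client_grads:
        sk.add(g.ravel() / len(client_grads))   # federated averaging inside the sketch
    recovered = sk.estimate()
    topk = np.zeros_like(recovered)
    idx = np.argpartition(np.abs(recovered), -k)[-k:]
    topk[idx] = recovered[idx]
    return topk.reshape(client_grads[0].shape)

grads = [rng.normal(size=(10, 20)) for _ in range(4)]
update = server_round(grads, width=64, k=30)
print(np.count_nonzero(update))
```

Because the sketch is linear, adding each client's scaled gradient into the same table is equivalent to sketching the federated average, which is what makes this style of aggregation communication-friendly.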
then, as shown in fig. 4, calculating an accumulated gradient error value and a gradient error compensation value of each client, adaptively adjusting the gradient of the local gradient core feature to be compensated and the weight of the local gradient core feature during sketch gradient aggregation, and substituting the weights into the next iterative computation global model; the method specifically comprises the following steps:
calculating the local model loss value of each client in the current iteration, where DO_i denotes the i-th client and the loss is evaluated at the parameters of the i-th client's local model in the t-th iteration;
calculating the cross entropy H(f(x_i), y_i) = -Σ_{x_i} y_i log f(x_i) of the local model loss value and using it to evaluate the local model loss: the smaller the cross entropy H(f(x_i), y_i), the closer the local model prediction is to the true result, indicating higher model training quality;
then calculating the quality evaluation weight of the local model according to the cross entropy, where (x_i, y_i) is the local data label pair of the i-th client, x_i is the input data, y_i is the desired output, and f(x_i) is the local model prediction result of the i-th client;
the quality evaluation weights of the clients are aggregated in each round of sketch gradient aggregation; the local gradient error that client D_i incurs from sketch gradient compression during sketch gradient aggregation is recorded as the accumulated gradient error value of the i-th client in the t-th iteration, where β is an error factor, and the accumulated gradient error value is continuously updated as the iterations proceed; the gradient error compensation value is then computed from the accumulated gradient error value and the gradient obtained after the SVD and Sketch compression, where α is a compensation coefficient;
the weight with which the local gradient core features of each client enter sketch gradient aggregation is adjusted according to the quality evaluation weight and the accumulated gradient error value of each client's local model; the gradient to be compensated of each client's local gradient core features is adaptively adjusted according to the gradient error compensation value, and the optimal parameters α and β are selected and substituted into the next iteration of the computation, thereby improving the accuracy of the global model; in the global model update, n is the number of clients participating in model training, ψ is the number of high-quality clients selected uniformly at random, η is the local model learning rate, and P_i(D_i) is the quality evaluation weight of client D_i's local model in the i-th model aggregation (model aggregation here means performing federated averaging over the gradient updates uploaded by all clients in order to update the global model); the update also involves the local gradient that client i did not compensate in the t-th iteration, the compensation coefficient α, and the accumulated gradient error value of the i-th client in the t-th iteration.
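The error accumulation and compensation loop can be illustrated with the following sketch. The exact update rules for the accumulated error and the compensation value appear only as images in the source, so a standard error-feedback update parameterized by the error factor β and the compensation coefficient α is used here as an assumption, with a top-k compressor standing in for the combined SVD and Sketch compression.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress_lossy(g, k):
    """Stand-in for the SVD + Sketch compression: keep only the k largest entries."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g.ravel()), -k)[-k:]
    out.ravel()[idx] = g.ravel()[idx]
    return out

def compensated_round(g_local, error_acc, k, alpha=0.5, beta=0.9):
    """One client round of error feedback: alpha (compensation coefficient) and
    beta (error factor) play the roles described in the text; the precise patented
    formulas are not reproduced here."""
    g_comp = g_local + alpha * error_acc              # add the gradient to be compensated
    g_sent = compress_lossy(g_comp, k)                # what is actually uploaded
    new_error = beta * error_acc + (g_comp - g_sent)  # accumulate the compression residual
    return g_sent, new_error

g = rng.normal(size=(10, 20))
err = np.zeros_like(g)
for t in range(3):
    sent, err = compensated_round(g, err, k=40)
    print(t, np.linalg.norm(err))
```

The design intent matches the text: information discarded by the lossy compression is not thrown away but carried forward in the accumulated error and re-injected into later rounds.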
In summary, according to the communication optimization method based on federated learning provided by the present invention, each client independently trains a local model and then performs singular value decomposition (SVD) on it, extracting the local gradient core features and sending them to the parameter server; the parameter server performs Sketch gradient compression and then Sketch gradient aggregation, extracts the first k approximate gradient values from the aggregated parameters, executes the federated averaging aggregation algorithm on these first k approximate gradient values to obtain the global model, and sends the global model to each client for the next iteration. For the local gradient error generated by the lossy Sketch gradient compression, adaptive gradient error accumulation and gradient error compensation are supported: the gradient to be compensated is adaptively adjusted and added into the next iteration. The combined SVD and Sketch compression effectively compresses the model gradients and reduces the communication overhead generated in model training, while the gradient error introduced by combining SVD with Sketch operations (compared with performing SVD alone) is corrected through gradient error compensation, thereby improving the accuracy of global model prediction. The invention can greatly reduce the communication overhead between the clients and the server during the whole training process and accelerate model convergence, while keeping the accuracy of model prediction almost the same as the original prediction accuracy.
The above description is of the preferred embodiment of the invention. It should be understood that the invention is not limited to the particular embodiments described above, and that devices and structures not described in detail are to be implemented in a manner common in the art; any person skilled in the art may make many possible variations and modifications, or modify equivalent embodiments, without departing from the technical solution of the invention. Therefore, any simple modification, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, remains within the scope of the technical solution of the present invention.

Claims (5)

1. A communication optimization method based on federated learning, characterized by comprising the following steps:
s1, acquiring the local gradient core features uploaded by each client; the local gradient core features are the model gradients extracted by each client through preprocessing target data, training a local model, and performing singular value decomposition on the local model; the local model is a federated model;
s2, performing sketch gradient compression on the local gradient core features, performing sketch gradient aggregation, extracting the first k sparse parameter matrices, executing the federated averaging aggregation algorithm on the first k sparse parameter matrices to obtain a global model, and distributing the global model to each client;
s3, calculating the accumulated gradient error value and the gradient error compensation value of each client, adaptively adjusting the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation, and then iteratively executing S1 to S3.
2. The method of claim 1, wherein preprocessing the target data comprises:
dividing the target data into a plurality of data sets, and executing a uniform random sampling operation to select the batch size of the data set in each iteration to obtain a target data set;
performing a norm-bound calculation on the target data set;
acquiring initialized global model parameters, training a local model on the target data set after the norm-bound calculation, and calculating the local gradient of the local model;
and clipping the local gradient to obtain a gradient matrix of the local model.
3. The federated learning-based communication optimization method as claimed in claim 2, wherein the singular value decomposition of the local model is performed as follows:
each client executes a fast gradient descent algorithm and decomposes the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing gradients to obtain the local gradient core features.
4. The federated learning-based communication optimization method of claim 3, wherein each client executing a fast gradient descent algorithm to decompose the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing gradients, so as to obtain the local gradient core features, comprises:
storing data of the gradient matrix of the local model in a zero matrix in sequence, and compressing until no zero-valued rows remain in the zero matrix, to obtain a first matrix;
constructing a subsampled randomized Hadamard transform matrix, computing the product of the subsampled randomized Hadamard transform matrix and the first matrix, and compressing the first matrix from m rows to a smaller sketch dimension, to obtain a second matrix;
and executing the SVD algorithm on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector, and a diagonal matrix containing gradients, to obtain the local gradient core features.
5. The method according to claim 4, wherein S3 comprises:
calculating the local model loss value of each client in the current iteration;
calculating the cross entropy of the local model loss value, and calculating the quality evaluation weight of the local model according to the cross entropy;
calculating the local gradient error generated by sketch gradient compression during sketch gradient aggregation to obtain the accumulated gradient error value and the gradient error compensation value of each client;
adjusting the weight with which the local gradient core features of each client enter sketch gradient aggregation according to the quality evaluation weight and the accumulated gradient error value of each client's local model; and adaptively adjusting the gradient to be compensated of each client's local gradient core features according to the gradient error compensation value, and entering the next iteration.
CN202210906790.4A 2022-07-29 2022-07-29 Communication optimization method based on federal learning Active CN115278709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210906790.4A CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210906790.4A CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Publications (2)

Publication Number Publication Date
CN115278709A true CN115278709A (en) 2022-11-01
CN115278709B CN115278709B (en) 2024-04-26

Family

ID=83772330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210906790.4A Active CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Country Status (1)

Country Link
CN (1) CN115278709B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216596A (en) * 2023-08-16 2023-12-12 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117692939A (en) * 2024-02-02 2024-03-12 南京邮电大学 Client scheduling method in dynamic communication environment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087713A1 (en) * 2017-09-21 2019-03-21 Qualcomm Incorporated Compression of sparse deep convolutional network weights
US20200117580A1 (en) * 2018-04-22 2020-04-16 Sas Institute Inc. Validation Sets for Machine Learning Algorithms
US20200142948A1 (en) * 2018-11-07 2020-05-07 Samsung Electronics Co., Ltd. System and method for cached convolution calculation
CN111507481A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Federated learning system
CN112199702A (en) * 2020-10-16 2021-01-08 鹏城实验室 Privacy protection method, storage medium and system based on federal learning
CN112288097A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Federal learning data processing method and device, computer equipment and storage medium
CN113222179A (en) * 2021-03-18 2021-08-06 北京邮电大学 Federal learning model compression method based on model sparsification and weight quantization
CN112862011A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Model training method and device based on federal learning and federal learning system
CN113645197A (en) * 2021-07-20 2021-11-12 华中科技大学 Decentralized federal learning method, device and system
CN114339252A (en) * 2021-12-31 2022-04-12 深圳大学 Data compression method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAIZHONG CUI; XIAOXIN SU; YIPENG ZHOU; LEI ZHANG: "ClusterGrad: Adaptive Gradient Compression by Clustering in Federated Learning", GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 25 January 2021 (2021-01-25) *
P. RAVICHANDRAN; C. SARAVANAKUMAR; J. DAFNI ROSE; M. VIJAYAKUMAR; V. MUTHU LAKSHMI: "Efficient Multilevel Federated Compressed Reinforcement Learning of Smart Homes Using Deep Learning Methods", 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), 16 December 2021 (2021-12-16) *
YANG YANG: "Research on Data Compression and Resource Allocation in the Internet of Things", China Doctoral Dissertations Full-text Database (Electronic Journal), 15 May 2020 (2020-05-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216596A (en) * 2023-08-16 2023-12-12 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117216596B (en) * 2023-08-16 2024-04-30 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117692939A (en) * 2024-02-02 2024-03-12 南京邮电大学 Client scheduling method in dynamic communication environment
CN117692939B (en) * 2024-02-02 2024-04-12 南京邮电大学 Client scheduling method in dynamic communication environment

Also Published As

Publication number Publication date
CN115278709B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN115278709A (en) Communication optimization method based on federal learning
CN111555781B (en) Large-scale MIMO channel state information compression and reconstruction method based on deep learning attention mechanism
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN111259917B (en) Image feature extraction method based on local neighbor component analysis
CN114332500B (en) Image processing model training method, device, computer equipment and storage medium
CN115829027A (en) Comparative learning-based federated learning sparse training method and system
CN115587633A (en) Personalized federal learning method based on parameter layering
CN113658130A (en) No-reference screen content image quality evaluation method based on dual twin network
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
CN115865145A (en) Large-scale MIMO channel state information feedback method based on Transformer
CN114626550A (en) Distributed model collaborative training method and system
CN117236201B (en) Diffusion and ViT-based downscaling method
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
CN114677545B (en) Lightweight image classification method based on similarity pruning and efficient module
CN115952860A (en) Heterogeneous statistics-oriented clustering federal learning method
CN115909441A (en) Face recognition model establishing method, face recognition method and electronic equipment
CN114677535A (en) Training method of domain-adaptive image classification network, image classification method and device
CN112116062A (en) Multilayer perceptron nonlinear compression method based on tensor string decomposition
Limmer et al. Optimal deep neural networks for sparse recovery via Laplace techniques
CN117222005B (en) Fingerprint positioning method, fingerprint positioning device, electronic equipment and storage medium
CN110111326B (en) Reconstructed image quality evaluation method based on ERT system
CN115019079B (en) Method for accelerating deep learning training by distributed outline optimization for image recognition
CN114581946B (en) Crowd counting method and device, storage medium and electronic equipment
CN112598130B (en) Soil moisture data reconstruction method based on self-encoder and singular value threshold and computer readable storage medium
Li et al. HyperFeel: An Efficient Federated Learning Framework Using Hyperdimensional Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant