CN115278709B - Communication optimization method based on federal learning - Google Patents

Communication optimization method based on federal learning

Info

Publication number
CN115278709B
CN115278709B
Authority
CN
China
Prior art keywords
gradient
matrix
local
model
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210906790.4A
Other languages
Chinese (zh)
Other versions
CN115278709A (en)
Inventor
张盼
许春根
徐磊
梅琳
窦本年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210906790.4A priority Critical patent/CN115278709B/en
Publication of CN115278709A publication Critical patent/CN115278709A/en
Application granted granted Critical
Publication of CN115278709B publication Critical patent/CN115278709B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22Traffic simulation tools or models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a communication optimization method based on federal learning, which belongs to the technical field of federal machine learning and comprises the following steps: acquiring the local gradient core features uploaded by each client, where the local gradient core features are the data extracted after each client preprocesses its target data, trains a local model, and performs singular value decomposition on the local model; carrying out sketch gradient compression and sketch gradient aggregation on the local gradient core features, extracting the first k sparse parameter matrices, applying a federal average aggregation algorithm to the first k sparse parameter matrices to obtain a global model, and sending the global model to each client; and calculating an accumulated gradient error value and a gradient error compensation value for each client, adaptively adjusting the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation, and then iterating. The method can effectively compress the model gradient, reduce the communication overhead generated in model training, and improve the accuracy of model prediction.

Description

Communication optimization method based on federal learning
Technical Field
The invention relates to the technical field of federal machine learning, in particular to a federal learning-based communication optimization method.
Background
Existing federal learning generally uploads the intermediate parameters of model training to a parameter server to realize model aggregation, and two problems exist in this process: firstly, large-scale distributed training requires a large amount of communication bandwidth for gradient exchange, which limits the scalability of multi-node training; secondly, it requires expensive high-bandwidth network infrastructure.
When distributed training is performed on mobile devices, such training additionally suffers from higher network latency, lower system throughput, and intermittent connection failures.
In order to solve the above problems, if the conventional SVD singular value decomposition method is adopted directly, the accuracy of model prediction is good, but the gradient compression effect is not ideal. If a sketch compression method is adopted directly, the gradient compression effect is ideal, but the accuracy of model prediction is reduced.
Disclosure of Invention
The technical problem addressed by the invention is to provide a federal learning-based communication optimization method that can effectively compress model gradients, reduce the communication overhead generated in model training, and improve model prediction accuracy.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A federal learning-based communication optimization method, comprising the steps of: s1, acquiring local gradient core characteristics uploaded by each client; the local gradient core features refer to model gradients extracted by preprocessing target data, training a local model and carrying out singular value decomposition on the local model by each client; the local model is a federal model; s2, carrying out sketch gradient compression on the local gradient core features, carrying out sketch gradient aggregation, extracting first k sparse parameter matrixes, carrying out a federal average aggregation algorithm on the first k sparse parameter matrixes to obtain a global model, and distributing the global model to each client; s3, calculating an accumulated gradient error value and a gradient error compensation value of each client, adaptively adjusting the gradient to be compensated of the local gradient core feature and the weight of the local gradient core feature when sketch gradient polymerization is carried out, and then iteratively executing S1 to S3.
Further, preprocessing the target data includes: dividing the target data into a plurality of datasets and performing a uniform random sampling operation to select the batch size of the dataset in each iteration, obtaining the target dataset; performing a norm-bound calculation on the target dataset; acquiring the initialized global model parameters, training the local model on the norm-bounded target dataset, and calculating the local gradient of the local model; and clipping the local gradient to obtain the gradient matrix of the local model.
Further, performing singular value decomposition on the local model means: each client executes a fast gradient descent algorithm to decompose the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients, thereby obtaining the local gradient core features.
Further, each client executing the fast gradient descent algorithm to decompose the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients, obtaining the local gradient core features, includes: sequentially storing the data of the gradient matrix of the local model in a zero matrix, and compressing it until no zero-valued rows remain in the zero matrix, obtaining a first matrix; constructing a subsampled randomized Hadamard transform matrix, and computing the product of the subsampled randomized Hadamard transform matrix and the first matrix to compress the number of rows of the first matrix, obtaining a second matrix; and executing an SVD algorithm on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients, thereby obtaining the local gradient core features.
Further, S3 includes: calculating the local model loss value of each client in the current iteration; calculating the cross entropy of the local model loss value, and calculating the quality evaluation weight of the local model according to the cross entropy; calculating the local gradient error generated by sketch gradient compression during sketch gradient aggregation, and obtaining the accumulated gradient error value and gradient error compensation value of each client; adjusting, according to the quality evaluation weight and the accumulated gradient error value of the local model of each client, the weight of each client's local gradient core features during sketch gradient aggregation; and adaptively adjusting the gradient to be compensated of each client's local gradient core features according to the gradient error compensation value, and then entering the next iteration.
Drawings
The invention and its features, aspects and advantages will become more apparent from the detailed description of non-limiting embodiments with reference to the following drawings. Like numbers refer to like parts throughout. The drawings are not intended to be drawn to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is a general flow chart of a federal learning-based communications optimization method provided by the present invention;
FIG. 2 is a specific flowchart of a preprocessing algorithm in a federal learning-based communication optimization method provided by the invention;
FIG. 3 is a specific flowchart of a fast gradient descent algorithm in a federal learning-based communication optimization method provided by the invention;
Fig. 4 is a specific flowchart for gradient error compensation in a federal learning-based communication optimization method provided by the invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings and specific examples, which are not intended to limit the invention.
In the prior art, when the intermediate parameters of model training are uploaded to a parameter server to realize model aggregation in a federal learning algorithm, directly adopting the conventional SVD singular value decomposition method gives good model prediction accuracy but an unsatisfactory gradient compression effect, while directly adopting a sketch compression method gives an ideal gradient compression effect but reduced model prediction accuracy. To solve these problems, as shown in fig. 1, the invention provides a communication optimization method based on federal learning, in particular a mobile edge network communication optimization method: when distributed training is performed under a mobile edge network, the resources of the mobile end are limited, and such training brings higher network latency, lower system throughput and intermittent network connection failures; the method performs communication optimization on the mobile end and reduces the huge communication overhead incurred during model training.
Firstly, each client preprocesses its local target data and then trains a local model, which is a federal model; this step is shown in fig. 2 and specifically includes:
partitioning the target data into minibatch datasets and performing a uniform random sampling operation to select the batch size of the dataset in each iteration, obtaining the target dataset d'_i = d_i ⊙ S_c, where i denotes the i-th client, d_i is the data of the i-th client, c is the number of matrix columns, and S_c is a random uniform sampling matrix with the same dimension as the sensitive original data;
since each client runs a federal learning model on the target dataset d'_i to extract features, i.e. d'_l = F(d'_i), a norm-bound calculation is performed on the target dataset d'_i to limit a certain global sensitivity; the bound involves the minimum data batch size, and d_l denotes the infinity norm of the l-th layer output of the neural network model. From this formula, the upper limit of d'_l is S_f = max ||f(d) − f(d')||_2, where f is a real-valued function; provided this bound holds, the sensitivity of d_l is limited and sensitive data is protected from leakage to a certain extent;
initializing the global model parameters w_global and distributing them to each client; each client trains the local model on the norm-bounded target dataset and calculates the local gradient g^(i) of the local model; the local gradient is clipped to prevent gradient explosion, finally obtaining the gradient matrix A of the local model.
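As a rough illustration of this client-side preprocessing, the sketch below assumes a generic model object exposing set_weights/compute_gradient methods and uses a simple L2-norm clipping rule; these helper names and the specific clipping formula are illustrative assumptions, since the patent's own clipping formula is not reproduced in the text above.

```python
import numpy as np

def sample_minibatch(d_i: np.ndarray, batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """Uniform random sampling of a minibatch from the i-th client's local data d_i."""
    idx = rng.choice(d_i.shape[0], size=batch_size, replace=False)
    return d_i[idx]

def clip_gradient(g: np.ndarray, max_norm: float) -> np.ndarray:
    """Bound the norm of the local gradient to prevent gradient explosion (assumed L2 clipping rule)."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

def local_gradient_matrix(model, w_global, d_target: np.ndarray, max_norm: float = 1.0) -> np.ndarray:
    """One local step: start from the distributed global parameters, compute the local gradient g^(i)
    on the norm-bounded target dataset, and clip it to obtain the gradient matrix A."""
    model.set_weights(w_global)            # hypothetical model interface
    g = model.compute_gradient(d_target)   # hypothetical model interface
    return clip_gradient(g, max_norm)      # gradient matrix A of the local model
```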
Then each client executes a fast gradient descent algorithm, as shown in fig. 3: the gradient matrix of the local model is decomposed into a left singular vector, a right singular vector and a diagonal matrix containing the gradients, which approximately describe a certain percentage of the information of the gradient matrix, and the gradients are then compressed in matrix form to obtain the local gradient core features, reducing the high communication cost incurred by federal training; this specifically comprises the following steps:
row by row, the n data points of each row of the gradient matrix A ∈ R^(m×n) of the local model are collected in sequence and stored in a zero matrix F ∈ R^(m×n), and all zero-valued rows of the zero matrix are compressed away to obtain a first matrix F', where R^(m×n) denotes the set of real matrices of dimension m×n;
when zero values exist in F, a subsampled randomized Hadamard transform matrix Φ = N·H·D is constructed, where N ∈ R^(q×m) is a scaled random matrix whose rows are sampled uniformly without replacement from the m rows of the m×m identity matrix and rescaled by √(m/q), with q the number of rows of the random matrix N; D is an m×m diagonal matrix whose entries are i.i.d. random variables equal to −1 or 1 with equal probability; and H ∈ {±1}^(m×m) is a Hadamard matrix, meaning H_ij = (−1)^⟨i−1, j−1⟩, where ⟨i−1, j−1⟩ is the dot product of the b-bit binary vectors of the integers i−1 and j−1, m = 2^b, and i, j index the row and column; the Hadamard matrix can also be expressed recursively as H_(2m) = [H_m, H_m; H_m, −H_m] with H_1 = (1), where m is the size of the matrix;
computing the product of the subsampled randomized Hadamard transform matrix Φ and the first matrix F', compressing the number of rows of F' from m to q, and connecting the newly compressed gradient with the contracted gradient to obtain a second matrix B ∈ R^(k×n);
executing an SVD algorithm on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients according to [U, Λ, V] = SVD(B), where U denotes the left singular vectors, V the right singular vectors and Λ the diagonal matrix of singular values; the original information of B is then compressed into its first k/2 rows, which improves the communication efficiency of uploading the local model parameters while only slightly affecting the obtained local gradient core. Here I_k denotes an identity matrix of the same dimension as the diagonal matrix of singular values; the diagonal matrix of singular values is reduced by subtraction, after which the diagonal matrix of the singular values of k/2 rows is deleted and the diagonal matrix of the singular values of the remaining k/2 rows is retained.
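A minimal sketch of this fast gradient descent compression is given below. It assumes the row count m is a power of two, takes the sketch size q equal to k for simplicity, and uses a frequent-directions-style shrinkage (subtracting the squared singular value at position k/2) where the patent's exact shrinkage formula is not reproduced above; all of these choices are illustrative assumptions.

```python
import numpy as np

def hadamard(m: int) -> np.ndarray:
    """Sylvester construction H_(2m) = [H_m, H_m; H_m, -H_m] with H_1 = (1); m must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

def srht(m: int, q: int, rng: np.random.Generator) -> np.ndarray:
    """Subsampled randomized Hadamard transform Phi = N . H . D of size q x m."""
    D = np.diag(rng.choice([-1.0, 1.0], size=m))     # i.i.d. +/-1 diagonal entries with equal probability
    H = hadamard(m)
    rows = rng.choice(m, size=q, replace=False)      # q rows of the m x m identity, sampled without replacement
    N = np.sqrt(m / q) * np.eye(m)[rows]             # rescaled row-sampling matrix (assumed scaling)
    return N @ H @ D

def gradient_core_features(A: np.ndarray, k: int, rng: np.random.Generator = None) -> np.ndarray:
    """Compress the m x n gradient matrix A into the local gradient core features (about k/2 rows)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape                                    # m assumed to be a power of two, with k <= m
    Phi = srht(m, k, rng)                             # illustrative choice: sketch size q = k
    B = Phi @ A                                       # second matrix B in R^(k x n); all-zero rows of A add nothing
    U, s, Vt = np.linalg.svd(B, full_matrices=False)  # [U, Lambda, V] = SVD(B)
    pivot = s[min(k // 2, len(s) - 1)]                # reference singular value at position k/2
    s_shrunk = np.sqrt(np.maximum(s ** 2 - pivot ** 2, 0.0))
    return (np.diag(s_shrunk) @ Vt)[: k // 2]         # keep the k/2 informative rows
```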
The parameter server carries out sketch gradient compression on the local gradient core features of each client, carries out sketch gradient aggregation, extracts the first k sparse parameter matrices, applies a federal average aggregation algorithm to the first k sparse parameter matrices to obtain the global model, and distributes the global model to each client.
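A minimal server-side sketch of this step follows. It assumes that "extracting the first k sparse parameter matrices" means keeping the k largest-magnitude entries of each uploaded sketch, that the uploaded core features live in the same space as the global parameters, and that the federal average uses the supplied client weights; these are illustrative assumptions rather than the patent's exact procedure.

```python
import numpy as np

def top_k_sparsify(g: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries of the uploaded (sketched) gradient, zeroing the rest."""
    flat = g.ravel().copy()
    if k < flat.size:
        cutoff = np.partition(np.abs(flat), -k)[-k]
        flat[np.abs(flat) < cutoff] = 0.0
    return flat.reshape(g.shape)

def federal_average(client_grads, weights=None):
    """Federal average aggregation: a weighted average of the sparsified client gradients."""
    if weights is None:
        weights = [1.0 / len(client_grads)] * len(client_grads)
    return sum(w * g for w, g in zip(weights, client_grads))

def server_round(uploaded_core_features, k: int, w_global: np.ndarray,
                 lr: float = 0.1, weights=None) -> np.ndarray:
    """One S2 round: sketch-compressed uploads -> top-k sparsification -> federal average -> new global model."""
    sparse = [top_k_sparsify(g, k) for g in uploaded_core_features]
    update = federal_average(sparse, weights)
    return w_global - lr * update   # global model that is then distributed back to every client
```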
Then, as shown in fig. 4, the accumulated gradient error value and the gradient error compensation value of each client are calculated, the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation are adaptively adjusted, and these are substituted into the next iteration of the global model computation; this specifically comprises the following steps:
calculating the local model loss value of each client in the current iteration, where D_i denotes the i-th client and w_i^t is the parameter of the local model of the i-th client in the t-th iteration;
calculating the cross entropy H(f(x_i), y_i) = −Σ_(x_i) y_i log f(x_i) of the local model loss value, which is used to evaluate the local model loss: the smaller the value of the cross entropy H(f(x_i), y_i), the closer the prediction of the local model is to the real result, and the higher the model training quality;
then calculating the quality assessment weight P_i(D_i) of the local model according to the cross entropy, where (x_i, y_i) is the labelled local data of the i-th client, x_i is the input data, y_i is the expected output, and f(x_i) is the local model prediction result of the i-th client;
in each sketch gradient aggregation weighted by the quality evaluation weight of each client, the client D_i incurs a local gradient error caused by sketch gradient compression during the sketch gradient aggregation process, where D_i represents the dataset of the i-th client; this error can be represented by the accumulated gradient error value e_i^t, i.e. the gradient error accumulated by the i-th client in the t-th iteration, where β is an error factor, and the accumulated gradient error value is continuously updated as the iterations proceed. A gradient error compensation value is then computed from the accumulated error, where α is the compensation coefficient and ĝ_i^t is the gradient after SVD and sketch compression;
according to the quality evaluation weight and the accumulated gradient error value of each client's local model, the weight of each client's local gradient core features during sketch gradient aggregation is adjusted; according to the gradient error compensation value, the gradient to be compensated of each client's local gradient core features is adaptively adjusted, and the optimal parameters α and β are selected and substituted into the next iteration, thereby improving the accuracy of the global model. In the global model update, n is the number of clients participating in model training, ψ is the number of high-quality clients selected uniformly at random, v is the number of data samples owned by the i-th client, η is the local model learning rate, P_i(D_i) is the quality assessment weight of the local model of client D_i in the i-th model aggregation (model aggregation performs a federal average over the gradient updates uploaded by all clients to update the global model), g_i^t is the uncompensated local gradient of client i in iteration t, α is the compensation coefficient, and e_i^t is the accumulated gradient error value of the i-th client in the t-th iteration.
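The sketch below illustrates the error accumulation and compensation. The update rule e_i^(t+1) = β·e_i^t + (g_i^t − ĝ_i^t), the compensated upload ĝ_i^t + α·e_i^t, and the mapping from cross entropy to a quality weight are all assumptions standing in for the patent's formulas, which are not fully reproduced in the text above.

```python
import numpy as np

def cross_entropy_weight(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Quality-assessment weight from the cross entropy: a lower cross entropy (better local model)
    gives a higher weight (assumed mapping, not the patent's formula)."""
    ce = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))
    return 1.0 / (1.0 + ce)

class ErrorFeedbackClient:
    """Accumulates the local gradient error caused by SVD + sketch compression and
    compensates the gradient uploaded in the next iteration."""

    def __init__(self, alpha: float, beta: float, shape):
        self.alpha = alpha                # compensation coefficient alpha
        self.beta = beta                  # error factor beta
        self.error = np.zeros(shape)      # accumulated gradient error e_i^t

    def compensate(self, g_true: np.ndarray, g_compressed: np.ndarray) -> np.ndarray:
        """Return the gradient to upload: the compressed gradient plus a fraction of the accumulated
        error, then fold this round's compression loss into the running error (assumed update rule)."""
        compensated = g_compressed + self.alpha * self.error
        self.error = self.beta * self.error + (g_true - g_compressed)
        return compensated
```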
In summary, in the federal learning-based communication optimization method provided by the invention, each client first trains a local model independently, then performs singular value decomposition (SVD) on the local model, extracts the local gradient core features, and sends them to the parameter server; the parameter server performs sketch gradient compression, performs sketch gradient aggregation, extracts the top-k approximate gradient values of the aggregated parameters, applies a federal average aggregation algorithm to these k approximate gradient values to obtain the global model, and sends the global model to each client for the next round of iteration. For the local gradient error generated by sketch gradient compression, adaptive gradient error accumulation and gradient error compensation are supported: the gradient to be compensated is adaptively adjusted and added in the next iteration. Combining SVD with sketch compression effectively compresses the model gradient and reduces the communication overhead generated in model training, and the gradient error introduced by combining SVD with the sketch operation, compared with performing the SVD operation alone, is corrected by gradient error compensation, thereby improving the accuracy of global model prediction. The invention can greatly reduce the communication cost between the clients and the server over the whole training process and accelerate model convergence, while the accuracy of model prediction remains almost the same as the original prediction accuracy.
The foregoing describes preferred embodiments of the present invention. It is to be understood that the invention is not limited to the specific embodiments described above, and that devices and structures not described in detail are to be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or adapt the invention to equivalent embodiments, without departing from the technical solution of the present invention; these do not affect the essential content of the present invention. Therefore, any simple modification, equivalent variation or modification of the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (2)

1. A federal learning-based communication optimization method, comprising the steps of:
S1, acquiring local gradient core features uploaded by each client; the local gradient core features are the model gradients extracted after each client preprocesses its target data, trains a local model and performs singular value decomposition on the local model; the local model is a federal model;
S2, carrying out sketch gradient compression on the local gradient core features, carrying out sketch gradient aggregation, extracting the first k sparse parameter matrices, applying a federal average aggregation algorithm to the first k sparse parameter matrices to obtain a global model, and distributing the global model to each client;
S3, calculating an accumulated gradient error value and a gradient error compensation value for each client, adaptively adjusting the gradient to be compensated of the local gradient core features and the weight of the local gradient core features during sketch gradient aggregation, and then iteratively executing S1 to S3;
each client preprocesses its local target data and trains a local model, the local model being a federal model; the implementation steps specifically comprise:
partitioning the target data into minibatch datasets and performing a uniform random sampling operation to select the batch size of the dataset in each iteration, obtaining the target dataset d'_i = d_i ⊙ S_c, where i denotes the i-th client, d_i is the data of the i-th client, c is the number of matrix columns, and S_c is a random uniform sampling matrix with the same dimension as the sensitive original data;
each client runs a federal learning model on the target dataset d'_i to extract features, d'_l = F(d'_i), and performs a norm-bound calculation on the target dataset d'_i, where the bound involves the minimum data batch size and d_l denotes the infinity norm of the l-th layer output of the neural network model; the upper limit of d'_l is S_f = max ||f(d) − f(d')||_2, where f is a real-valued function;
initializing the global model parameters w_global and distributing them to each client; each client trains the local model on the norm-bounded target dataset and calculates the local gradient g^(i) of the local model; the local gradient is clipped, obtaining the gradient matrix A of the local model;
each client executes a fast gradient descent algorithm to decompose the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients, which approximately describe a certain percentage of the information of the gradient matrix, and compresses the gradient in matrix form to obtain the local gradient core features, specifically comprising the following steps:
row by row, the n data points of each row of the gradient matrix A ∈ R^(m×n) of the local model are collected in sequence and stored in a zero matrix F ∈ R^(m×n), and all zero-valued rows of the zero matrix are compressed away to obtain a first matrix F', where R^(m×n) denotes the set of real matrices of dimension m×n;
when zero values exist in F, a subsampled randomized Hadamard transform matrix Φ = N·H·D is constructed, where N ∈ R^(q×m) is a scaled random matrix whose rows are sampled uniformly without replacement from the m rows of the m×m identity matrix and rescaled by √(m/q), with q the number of rows of the random matrix N; D is an m×m diagonal matrix whose entries are i.i.d. random variables equal to −1 or 1 with equal probability; and H ∈ {±1}^(m×m) is a Hadamard matrix, meaning H_ij = (−1)^⟨i−1, j−1⟩, where ⟨i−1, j−1⟩ is the dot product of the b-bit binary vectors of the integers i−1 and j−1, m = 2^b, and i, j index the row and column; the Hadamard matrix can also be expressed recursively as H_(2m) = [H_m, H_m; H_m, −H_m] with H_1 = (1), where m is the size of the matrix;
calculating the product of the subsampled randomized Hadamard transform matrix Φ and the first matrix F', and connecting the newly compressed gradient with the contracted gradient to obtain a second matrix B ∈ R^(k×n);
executing an SVD algorithm on the second matrix, decomposing the gradient matrix of the local model into a left singular vector, a right singular vector and a diagonal matrix containing the gradients according to [U, Λ, V] = SVD(B), where U denotes the left singular vectors, V the right singular vectors and Λ the diagonal matrix of singular values, and compressing the original information of B into its first k/2 rows, where I_k denotes an identity matrix of the same dimension as the diagonal matrix of singular values; the diagonal matrix of singular values is reduced by subtraction, after which the diagonal matrix of the singular values of k/2 rows is deleted and the diagonal matrix of the singular values of the remaining k/2 rows is retained.
2. The federal learning-based communication optimization method according to claim 1, wherein S3 comprises:
Calculating a local model loss value of each client in the current iteration process;
Calculating cross entropy of the local model loss value, and calculating quality evaluation weight of the local model according to the cross entropy;
Calculating local gradient errors generated by sketch gradient compression in the sketch gradient aggregation process, and obtaining accumulated gradient error values and gradient error compensation values of all clients;
according to the quality evaluation weight and the accumulated gradient error value of each client's local model, adjusting the weight of each client's local gradient core features during sketch gradient aggregation; and adaptively adjusting the gradient to be compensated of each client's local gradient core features according to the gradient error compensation value, and then entering the next iteration.
CN202210906790.4A 2022-07-29 2022-07-29 Communication optimization method based on federal learning Active CN115278709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210906790.4A CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210906790.4A CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Publications (2)

Publication Number Publication Date
CN115278709A CN115278709A (en) 2022-11-01
CN115278709B true CN115278709B (en) 2024-04-26

Family

ID=83772330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210906790.4A Active CN115278709B (en) 2022-07-29 2022-07-29 Communication optimization method based on federal learning

Country Status (1)

Country Link
CN (1) CN115278709B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216596B (en) * 2023-08-16 2024-04-30 中国人民解放军总医院 Federal learning optimization communication method, system and storage medium based on gradient clustering
CN117692939B (en) * 2024-02-02 2024-04-12 南京邮电大学 Client scheduling method in dynamic communication environment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507481A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Federated learning system
CN112199702A (en) * 2020-10-16 2021-01-08 鹏城实验室 Privacy protection method, storage medium and system based on federal learning
CN112288097A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Federal learning data processing method and device, computer equipment and storage medium
CN112862011A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Model training method and device based on federal learning and federal learning system
CN113222179A (en) * 2021-03-18 2021-08-06 北京邮电大学 Federal learning model compression method based on model sparsification and weight quantization
CN113645197A (en) * 2021-07-20 2021-11-12 华中科技大学 Decentralized federal learning method, device and system
CN114339252A (en) * 2021-12-31 2022-04-12 深圳大学 Data compression method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10754764B2 (en) * 2018-04-22 2020-08-25 Sas Institute Inc. Validation sets for machine learning algorithms
US20190087713A1 (en) * 2017-09-21 2019-03-21 Qualcomm Incorporated Compression of sparse deep convolutional network weights
US11074317B2 (en) * 2018-11-07 2021-07-27 Samsung Electronics Co., Ltd. System and method for cached convolution calculation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507481A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Federated learning system
CN112199702A (en) * 2020-10-16 2021-01-08 鹏城实验室 Privacy protection method, storage medium and system based on federal learning
CN112288097A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Federal learning data processing method and device, computer equipment and storage medium
CN113222179A (en) * 2021-03-18 2021-08-06 北京邮电大学 Federal learning model compression method based on model sparsification and weight quantization
CN112862011A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Model training method and device based on federal learning and federal learning system
CN113645197A (en) * 2021-07-20 2021-11-12 华中科技大学 Decentralized federal learning method, device and system
CN114339252A (en) * 2021-12-31 2022-04-12 深圳大学 Data compression method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Laizhong Cui, Xiaoxin Su, Yipeng Zhou, Lei Zhang. "ClusterGrad: Adaptive Gradient Compression by Clustering in Federated Learning". GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2021. *
P. Ravichandran, C. Saravanakumar, J. Dafni Rose, M. Vijayakumar, V. Muthu Lakshmi. "Efficient Multilevel Federated Compressed Reinforcement Learning of Smart Homes Using Deep Learning Methods". 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), 2021. *
Yang Yang. "Research on Data Compression and Resource Allocation in the Internet of Things". China Doctoral Dissertations Full-text Database (Electronic Journal), 2020-05-15. *

Also Published As

Publication number Publication date
CN115278709A (en) 2022-11-01

Similar Documents

Publication Publication Date Title
CN115278709B (en) Communication optimization method based on federal learning
CN113222179B (en) Federal learning model compression method based on model sparsification and weight quantification
CN111555781B (en) Large-scale MIMO channel state information compression and reconstruction method based on deep learning attention mechanism
Khashman et al. Image compression using neural networks and Haar wavelet
CN110677651A (en) Video compression method
Aaron et al. Dynamic incremental k-means clustering
CN104199627B (en) Gradable video encoding system based on multiple dimensioned online dictionary learning
Saravanan et al. Intelligent Satin Bowerbird Optimizer Based Compression Technique for Remote Sensing Images.
CN114581544A (en) Image compression method, computer device and computer storage medium
CN110753225A (en) Video compression method and device and terminal equipment
CN110929798A (en) Image classification method and medium based on structure optimization sparse convolution neural network
CN112183742A (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN115829027A (en) Comparative learning-based federated learning sparse training method and system
CN116192209A (en) Gradient uploading method for air computing federal learning under MIMO channel
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
CN111010222A (en) Deep learning-based large-scale MIMO downlink user scheduling method
Ndubuaku et al. Edge-enhanced analytics via latent space dimensionality reduction
CN112492312B (en) Image compression recovery method, device, equipment and medium based on wavelet transform
CN114677545B (en) Lightweight image classification method based on similarity pruning and efficient module
Chen et al. Rate distortion optimization for adaptive gradient quantization in federated learning
CN114419341A (en) Convolutional neural network image identification method based on transfer learning improvement
Limmer et al. Optimal deep neural networks for sparse recovery via Laplace techniques
Qiao et al. Dcs-risr: Dynamic channel splitting for efficient real-world image super-resolution
Qian et al. Deep Image Semantic Communication Model for Artificial Intelligent Internet of Things
CN117236201B (en) Diffusion and ViT-based downscaling method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant