CN115983320A - Federated learning model parameter quantization method based on deep reinforcement learning - Google Patents


Info

Publication number
CN115983320A
CN115983320A (application CN202211657889.1A)
Authority
CN
China
Prior art keywords
quantization
model
agent
reinforcement learning
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211657889.1A
Other languages
Chinese (zh)
Inventor
董宇涵
郑斯辉
陈翔
李志德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Tsinghua University
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen Research Institute Tsinghua University
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Tsinghua University, Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen Research Institute Tsinghua University
Priority to CN202211657889.1A
Publication of CN115983320A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A federated learning model parameter quantization method based on deep reinforcement learning comprises the following steps: S1, obtaining the current global model parameters; S2, computing M percentiles of the current global model parameters as the environment state observed by the deep reinforcement learning agent; S3, according to the action output by the agent, constructing L quantization step points according to a given rule and using them as the mapping set of the quantization operation; S4, quantized transmission: performing X rounds of model quantization and transmission, each round using the quantization mapping set from the previous step, collecting the quantization errors and training errors of the X rounds, calculating their mean values, obtaining a reward value according to the reward function, and feeding the reward value back to the agent; S5, the agent continuously records the state, action and reward of each interaction and updates its network model when the number of records reaches a given threshold. The method achieves smaller quantization errors and higher test accuracy.

Description

Federated learning model parameter quantization method based on deep reinforcement learning
Technical Field
The invention relates to the field of distributed artificial intelligence, and in particular to a federated learning model parameter quantization method based on deep reinforcement learning.
Background
Machine Learning (ML) is one of the most representative artificial intelligence technologies today. It can learn optimization strategies from massive data samples and has shown great potential in many applications, obtaining extensive research and application particularly in fields such as computer vision and natural language processing. However, as privacy becomes increasingly important to individuals, institutions and even countries, it is becoming more and more difficult to collect large amounts of data and place them on a central server for training. Federated Learning (FL) emerged as a distributed ML method: it completes training through model exchange between users and the server, does not require uploading the users' raw data, and can avoid privacy disclosure to a certain extent. However, machine learning network models are large and require hundreds of iterations to converge, so the federated learning training process consumes a large amount of communication resources, which is particularly prominent in wireless communication systems.
To improve the communication efficiency of wireless federated learning and reduce the communication overhead required by distributed model training, it is necessary to compress the federated learning model. The main model compression methods at present include low-rank approximation, sparsification and quantization, where quantization refers to converting high-precision neural network model parameters into low-precision approximations that can be represented with a few bits (the quantization bit width) before transmission; this approach has been shown to greatly reduce communication overhead without unduly affecting neural network model performance. For example, it has been proposed to randomly round the model parameter vector onto a limited set of discrete values and to losslessly encode the model efficiently by exploiting the unequal occurrence probabilities of those discrete values, thereby effectively improving the communication efficiency of FL.
The prior art mainly concerns the quantization of model parameters when a user transmits a model to the server (uplink communication), and pays less attention to the quantization of model parameters when the server broadcasts the model to users (downlink communication). Some work considers the quantization compression problem of downlink communication; the proposed algorithm transmits only the difference between the broadcast global model and the previous model, and because this difference has a smaller dynamic range than the model itself, the scheme achieves a lower error. However, in this scheme the latest global model must be downloaded every round regardless of whether the user participates in training, which adds extra communication overhead from the user's perspective. In addition, current solutions only consider uniform quantization schemes; non-uniform quantization can further reduce the quantization error, but setting and optimizing its quantization step points is difficult. As a result, the performance of existing wireless federated learning model parameter quantization methods is seriously degraded at low bit widths.
It is noted that the information disclosed in the above background section is only for understanding of the background of the present application and therefore may include information that does not constitute prior art that is known to a person of ordinary skill in the art.
Disclosure of Invention
The main purpose of the invention is to provide a federated learning model parameter quantization method based on deep reinforcement learning, so as to reduce the performance loss when the model is transmitted with a low quantization bit width.
In order to achieve this purpose, the invention adopts the following technical scheme:
In a first aspect, a federated learning model parameter quantization method based on deep reinforcement learning comprises the following steps:
S1, obtaining the current global model parameters;
S2, state processing: computing M percentiles of the current global model parameters as the environment state observed by the deep reinforcement learning agent; the deep reinforcement learning agent comprises an action network and an evaluation network;
S3, action processing: according to the action output by the deep reinforcement learning agent, constructing L quantization step points according to a given rule as the mapping set of the quantization operation;
S4, quantized transmission: performing X rounds of model quantization and transmission, each round using the quantization mapping set from the previous step; collecting the quantization errors and training errors of the X rounds, calculating their mean values, obtaining a reward value according to the reward function, and feeding the reward value back to the reinforcement learning agent;
S5, model updating: the reinforcement learning agent continuously records the state, action and reward of each interaction, and updates the network model of the deep reinforcement learning agent when the number of records reaches a given threshold.
In a second aspect, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the above federated learning model parameter quantization method based on deep reinforcement learning.
The invention has the following beneficial effects:
the invention provides a method for quantizing federal learning model parameters based on deep reinforcement learning, which solves the problem that the performance of the existing method for quantizing the parameters of the wireless federal learning model is seriously damaged when the bit width is low. The invention reduces the performance loss of low quantization bit width, therefore, in order to reach the same performance, the quantization bit width lower than that of the common method can be adopted, and the invention has the indirect benefit of reducing communication overhead. The method adopts non-uniform quantization, autonomously optimizes the selection of quantization step points through a reinforcement learning intelligent agent, is suitable for an uplink communication link and a downlink communication link, and can obtain smaller quantization error and higher test accuracy compared with the traditional uniform quantization method. Through test comparison and verification, the accuracy of the model on the test set is higher when the method is adopted under the same training round.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 illustrates the process of quantizing federated learning model parameters based on deep reinforcement learning according to an embodiment of the present invention.
FIG. 3 is a graph comparing the training and quantization errors of an embodiment of the present invention with those of a prior art method.
FIG. 4 is a graph comparing the test accuracy of an embodiment of the present invention with that of a prior art method.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Referring to fig. 1, an embodiment of the present invention provides a federated learning model parameter quantization method based on deep reinforcement learning, comprising the following steps:
S1, obtaining the current global model parameters;
S2, state processing: computing M percentiles of the current global model parameters as the environment state observed by the deep reinforcement learning agent; the deep reinforcement learning agent comprises two deep neural networks, namely an action network and an evaluation network;
S3, action processing: according to the action output by the deep reinforcement learning agent, constructing L quantization step points according to a given rule as the mapping set of the quantization operation;
S4, quantized transmission: performing X rounds of model quantization and transmission, each round using the quantization mapping set from the previous step; collecting the quantization errors and training errors of the X rounds, calculating their mean values, obtaining a reward value according to the reward function, and feeding the reward value back to the reinforcement learning agent;
S5, model updating: the reinforcement learning agent continuously records the state, action and reward of each interaction, and updates the network model of the deep reinforcement learning agent when the number of records reaches a given threshold.
The embodiment of the invention thus provides a federated learning model parameter quantization method based on deep reinforcement learning that can formulate a reasonable quantization strategy through interaction with the environment and effectively reduce the performance loss caused by transmitting the model with a low quantization bit width. The method adopts non-uniform quantization, autonomously optimizes the selection of quantization step points through a reinforcement learning agent, is applicable to both uplink and downlink communication links, and obtains smaller quantization errors and higher test accuracy than the traditional uniform quantization method.
Specific embodiments of the present invention are further described below.
The federated learning model parameter quantization method based on deep reinforcement learning mainly comprises the following steps. First, a deep reinforcement learning agent is constructed; the agent comprises two deep neural networks, namely an action network and an evaluation network. In the state processing step, the system computes M percentiles of the current global model parameters as the environment state observed by the reinforcement learning agent. In the action processing step, the system constructs L quantization step points according to a given rule from the action output by the reinforcement learning agent and uses them as the mapping set of the quantization operation. In the quantized transmission step, the system performs X rounds of model quantization and transmission, each round using the quantization mapping set from the previous step; it collects the quantization errors and training errors of the X rounds, computes their mean values, obtains a reward value according to the reward function, and feeds the reward back to the reinforcement learning agent. Finally, throughout this process the reinforcement learning agent keeps recording the state, action and reward of each interaction, and when the number of records reaches a given threshold, the network model of the agent is updated. The specific flow is shown in fig. 2.
1. Building reinforcement learning agent
A Deep Reinforcement Learning (DRL) agent is the core decision-making unit of the method of the present invention and consists of two neural networks: the action network π(a, s; θ_a), where θ_a denotes the parameters of the neural network, s denotes the state vector observed by the agent at the current time, and π(a, s; θ_a) gives the probability that the agent chooses action a in this state; and the evaluation network V(s; θ_v), where θ_v denotes the parameters of the neural network, and V(s; θ_v) gives the value of the state s as currently estimated by the agent.
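For illustration only, a minimal PyTorch sketch of these two networks is shown below, using the sizes reported later in the performance analysis (state dimension M = 5, action dimension L/2 = 4, a hidden layer of 150 neurons); the choice of PyTorch and of the Tanh activation are assumptions of this sketch rather than details given in the original text.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Action network pi(a, s; theta_a): maps the state to the mean of an (L/2)-dimensional Gaussian."""
    def __init__(self, state_dim=5, action_dim=4, hidden=150):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, s):
        return self.net(s)  # mean vector mu, from which actions a_i ~ N(mu_i, sigma) are sampled


class EvaluationNetwork(nn.Module):
    """Evaluation network V(s; theta_v): maps the state to a scalar value estimate."""
    def __init__(self, state_dim=5, hidden=150):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)


# Example: evaluate both networks on a batch of two states
s = torch.randn(2, 5)
mu, value = ActionNetwork()(s), EvaluationNetwork()(s)
```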
2. State processing
Assume that the neural network model to be transmitted is a vector w of dimension d, where d is the number of parameters it contains. Let w_s be the vector obtained by sorting the parameters in w in ascending order of absolute value, and define the x-th percentile as

p_x = w_s^(⌈x·d/100⌉),

where w_s^(i) is the i-th element of w_s and ⌈·⌉ denotes rounding up.

Let the dimension of the input state of the reinforcement learning agent be M ≥ 1, and construct the vector

s = [p_1, p_2, …, p_M]^T

as the state observed at the current time, whose m-th element p_m is the m-th of the M selected percentiles of the current global model parameters.
Here, a balance can be struck between the representation accuracy and the state space dimension by adjusting the value of M.
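As a minimal illustrative sketch (not the exact formula of the original filing), the percentile-based state can be computed as follows with NumPy; the evenly spaced percentage points and the function name percentile_state are assumptions of this example.

```python
import numpy as np

def percentile_state(w, M=5):
    """Build the state s = [p_1, ..., p_M]^T from the current global model parameters w.

    The parameters are sorted by absolute value and M percentiles of the sorted
    vector are taken; the evenly spaced percentage points used here are an
    illustrative assumption.
    """
    w = np.asarray(w, dtype=float).ravel()
    d = w.size
    w_s = w[np.argsort(np.abs(w))]                 # sort parameters by absolute value, ascending
    xs = np.linspace(0.0, 100.0, M + 2)[1:-1]      # M interior percentage points (assumed spacing)
    idx = np.clip(np.ceil(xs * d / 100.0).astype(int) - 1, 0, d - 1)
    return w_s[idx]

# Example: 5-dimensional state for a random 1000-parameter "model"
s = percentile_state(np.random.randn(1000), M=5)
```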
3. Action processing
Assume that the action dimension of the DRL agent's output is L/2. After the state vector s is obtained in step S2, it is input into the action network π(a, s; θ_a) of the DRL agent to obtain the output vector of the network

μ = [μ_1, μ_2, …, μ_{L/2}]^T.

Given the sampling variance σ of all actions, action a_i is drawn from the normal distribution N(μ_i, σ), i = 1, 2, …, L/2, forming the action vector

a = [a_1, a_2, …, a_{L/2}]^T.

Based on this vector, the quantization step points and the complete mapping set can be computed. Assume that the quantization step points lie only in a given range [-B, B], B > 0. The search and optimization interval [-B, 0] is divided into L/2 sub-intervals, the i-th of which is denoted I_i = [l_i, u_i] with left endpoint l_i and right endpoint u_i, and a uniformly distributed reference vector c is defined whose i-th element is the center point of the i-th sub-interval, i.e.

c_i = (l_i + u_i)/2.

The quantization step point vector q_n for the negative part is then calculated by combining the action vector a with the reference vector c and sorting the result, where Sort(x) denotes sorting the elements of the vector x from small to large. On this basis, the quantization step point vector q_p for the non-negative part is obtained from q_n by reversing its order and negating its elements. Concatenating q_n and q_p yields the complete quantization step point vector of dimension L,

q = [q_n^T, q_p^T]^T.
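The construction of the step points can be sketched as follows in NumPy. Two explicit assumptions are made here: the interval [-B, 0] is divided uniformly into L/2 sub-intervals, and the sampled actions are treated as additive offsets to the sub-interval centers before sorting; the additive rule is one plausible reading of the construction, not the exact formula of the original filing.

```python
import numpy as np

def build_step_points(mu, L=8, B=0.15, sigma=0.1, rng=None):
    """Build the L quantization step points from the action-network output mu (length L/2)."""
    rng = np.random.default_rng() if rng is None else rng
    half = L // 2
    a = rng.normal(loc=mu, scale=sigma)            # sample actions a_i ~ N(mu_i, sigma)
    edges = np.linspace(-B, 0.0, half + 1)         # assumed uniform division of [-B, 0]
    c = (edges[:-1] + edges[1:]) / 2.0             # c_i: center of the i-th sub-interval
    q_n = np.sort(c + a)                           # negative-part step points (assumed additive rule)
    q_p = -q_n[::-1]                               # non-negative part: reverse order and negate
    return np.concatenate([q_n, q_p])              # complete step-point vector of dimension L

# Example: step points for a zero mean-action output with B = 0.15 (the range used in the tests)
q = build_step_points(np.zeros(4), L=8, B=0.15)
```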
4. Quantized transmission
According to the quantization step point vector q obtained in step S3, a mapping set Q = {q_1, q_2, …, q_L} can be constructed, where q_l is the l-th element of the vector q. The process in which the server broadcasts the global model, the user equipment receives and trains the model, the user equipment uploads the model, and the server aggregates the uploaded models is called one round. Assuming that the current global model is w, quantized transmission is performed in the following X rounds according to the following steps.

In the first step, assume the current round is the t-th round, t = 1, 2, …, X. Each parameter w in the global model of the current round is randomly quantized according to the following rule: letting q_i and q_j be the two adjacent step points in Q with q_i ≤ w ≤ q_j,

Q(w) = ξ(w, q_i, q_j),

where ξ(w, q_i, q_j) is a random variable that takes the value q_j with probability (w − q_i)/(q_j − q_i) and the value q_i otherwise.

Denoting the quantized model parameters as Q(w) = [Q(w_1), Q(w_2), …, Q(w_d)]^T, the quantization error of this round can then be calculated as e_t = ||Q(w) − w||_2.
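The random quantization of one round and its quantization error can be illustrated with the following NumPy sketch; clipping parameters that fall outside the step-point range to the nearest endpoint is an assumption made so that the example is self-contained.

```python
import numpy as np

def stochastic_quantize(w, q, rng=None):
    """Randomly map each parameter of w to one of its two neighbouring step points in q."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    q = np.sort(np.asarray(q, dtype=float))
    w_c = np.clip(w, q[0], q[-1])                     # assumption: clip values outside [q_1, q_L]
    hi = np.clip(np.searchsorted(q, w_c), 1, len(q) - 1)
    lo = hi - 1
    p_up = (w_c - q[lo]) / (q[hi] - q[lo])            # probability of rounding up to q[hi]
    return np.where(rng.random(w.shape) < p_up, q[hi], q[lo])

# Per-round quantization error e_t = ||Q(w) - w||_2 for example step points
q = 0.0375 * np.array([-4, -3, -2, -1, 1, 2, 3, 4])   # L = 8 step points, as in the uniform baseline
w = 0.05 * np.random.randn(1000)
e_t = float(np.linalg.norm(stochastic_quantize(w, q) - w))
```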
In the second step, the quantized global model Q(w) is broadcast to the users, who receive the model and train on their local data sets based on it. Let the set of participating users be S and denote the training error of the k-th user by l_k. After each user finishes training, it uploads its updated model w_k and its training error to the server; the server aggregates {w_k}_{k∈S} to obtain a new global model w and calculates the average training error of the round as

l̄_t = (1/|S|)·Σ_{k∈S} l_k.
After the X rounds are completed, the X quantization errors and average training errors have been collected, and their means are calculated as

ē = (1/X)·Σ_{t=1}^{X} e_t,   l̄ = (1/X)·Σ_{t=1}^{X} l̄_t.

The reward r of this training process is then calculated from ē and l̄ according to the reward function, which combines the amplified quantization error, the training error relative to that of the previous quantized transmission, and a penalty for divergent training. Here, l̄' denotes the average training error obtained in the previous quantized transmission; if steps S1 to S5 are referred to as one iteration, l̄' is the value obtained at this step in the previous iteration, and for the first quantized transmission it is initialized accordingly. α > 1 is a constant factor used to amplify the influence of the quantization error; β_1, β_2 > 0 are constant weight factors used to adjust the influence of the quantization error and the training error on the reward value; I(·) is an indicator function that takes the value 1 when the condition in parentheses holds and 0 otherwise; δ > 0 is the penalty given when gradient explosion occurs in the training process, i.e. when w contains values reaching INF, where INF denotes a very large value (typically 1.796e308).
5. Model updating
In step S2 the agent obtains the state s; in step S3 the agent outputs the action a; in step S4 the agent receives the reward r and can calculate the next state s' from the latest w using the method of step S2. The quadruple (s, a, r, s') of each iteration is stored in the agent's record buffer B. If the number of records in the buffer is less than the given threshold P, the next record continues to be generated according to steps S2-S4; otherwise, the agent model is updated according to the following steps.

In the first step, assume the i-th record in the buffer is p_i = (s_i, a_i, r_i, s'_i), where i = 1, 2, …, P. Let A_P = 0 and calculate the advantage function of each successive state-action pair backwards according to

A_{i-1} = r_{i-1} + γ·V(s'_i; θ_v) − V(s_i; θ_v) + γλ·A_i,

where γ and λ are both constants used to adjust the influence of future returns on the current advantage.
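For illustration, this backward recursion can be implemented literally as stated, with the value estimates supplied as plain arrays so that the evaluation network itself is abstracted away; the function name and array layout are choices of this sketch.

```python
import numpy as np

def compute_advantages(rewards, v_s, v_s_next, gamma=0.99, lam=0.95):
    """Backward recursion A_{i-1} = r_{i-1} + gamma*V(s'_i) - V(s_i) + gamma*lam*A_i, with A_P = 0.

    Arrays are 0-based: rewards[k], v_s[k] and v_s_next[k] hold r, V(s; theta_v)
    and V(s'; theta_v) of the (k+1)-th record.
    """
    P = len(rewards)
    A = np.zeros(P)                        # A[P-1] plays the role of A_P = 0
    for i in range(P - 1, 0, -1):          # work backwards from the last record
        A[i - 1] = rewards[i - 1] + gamma * v_s_next[i] - v_s[i] + gamma * lam * A[i]
    return A

# Toy example with P = 6 records
adv = compute_advantages(np.ones(6), np.zeros(6), np.zeros(6))
```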
In the second step, the records in the buffer are divided into groups of C records each. Let θ'_a = θ_a. Then, for each group of data, the action loss function corresponding to the c-th record (whose position in B is i_c) is calculated as the clipped loss

L^a_c = −min( ρ·A_{i_c}, clip(ρ, 1−ε, 1+ε)·A_{i_c} ),

where c = 1, 2, …, C, the probability ratio ρ is defined as

ρ = π(a_{i_c}, s_{i_c}; θ'_a) / π(a_{i_c}, s_{i_c}; θ_a),

and the clipping function clip(ρ, 1−ε, 1+ε) limits ρ to the interval [1−ε, 1+ε], with ε the clipping rate. The action network is then updated according to

θ'_a ← θ'_a − η_a·∇_{θ'_a} L^a_c,

where η_a is the learning rate and ∇_{θ'_a} L^a_c is the gradient of the loss function.

Similarly, let θ'_c = θ_c for the evaluation network. Then, for each group of data, the value loss function L^v_c corresponding to the c-th record is calculated as the squared error between the evaluation network's estimate V(s_{i_c}; θ'_c) and the target A_{i_c} + V(s_{i_c}; θ_c), and the model is updated according to

θ'_c ← θ'_c − η_c·∇_{θ'_c} L^v_c,

where η_c is the learning rate.

In the third step, after all the training is finished, the models are updated by letting θ_a = θ'_a and θ_c = θ'_c, and all records in the buffer B are emptied. The procedure then returns to step S2 and continues interacting to obtain training data.
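Purely as an illustration of this update step, the following NumPy sketch computes the clipped action loss and the squared value loss for one group of records from precomputed probabilities, value estimates and advantages; the gradient steps themselves would be taken by an automatic-differentiation framework and are omitted, and the exact loss forms are standard PPO-style choices assumed from the context rather than quoted from the original filing.

```python
import numpy as np

def ppo_group_losses(pi_new, pi_old, advantages, v_new, v_old, eps=0.2):
    """Mean clipped action loss and mean squared value loss for one group of C records.

    pi_new / pi_old : probabilities of the taken actions under theta'_a and theta_a.
    v_new / v_old   : value estimates V(s; theta'_c) and V(s; theta_c).
    eps             : clipping rate (0.2 in the test example); loss forms are assumed.
    """
    rho = pi_new / pi_old                                    # probability ratio
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)             # clip(rho, 1 - eps, 1 + eps)
    action_loss = -np.minimum(rho * advantages, clipped * advantages)
    value_target = advantages + v_old                        # assumed return target A + V(s; theta_c)
    value_loss = (value_target - v_new) ** 2
    return float(action_loss.mean()), float(value_loss.mean())

# Toy example for a group of C = 4 records
al, vl = ppo_group_losses(np.array([0.30, 0.25, 0.40, 0.20]),
                          np.array([0.28, 0.30, 0.35, 0.22]),
                          np.array([0.5, -0.2, 0.1, 0.3]),
                          np.array([1.0, 0.8, 0.6, 0.4]),
                          np.array([0.9, 0.9, 0.5, 0.5]))
```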
Performance analysis
In order to verify the benefits brought by the method, a simulation experiment is carried out on the public data set CIFAR-10. This is an object recognition image data set whose training set contains 50000 samples; in the test example they are randomly divided into 100 non-overlapping small data sets to simulate the local data sets of 100 users. In the FL training process, 10 users are randomly selected in each round for local computation, the local batch size is set to 50, the number of local epochs is set to 5, the learning rate is initialized to 0.15, and it is decayed by a factor of 0.99 every 10 rounds.
For the DRL agent, two identical multilayer perceptron models are used as the action network and the evaluation network, each with a hidden layer of 150 neurons; in the DRL training stage, each user uses only 20% of its local data set for training in order to speed up the process. The state dimension is M = 5 and 3-bit quantization is considered, so the total number of quantization step points is L = 8 and the action dimension is 4; the action sampling variance is σ = 0.1, the optimization range of the quantization step points is B = 0.15, and each action is executed for X = 4 rounds. In the reward function, α = 10, β_1 and β_2 take the values 150000 and 0.3 respectively, and the penalty factor δ is 5. In the model training parameters, the clipping rate is ε = 0.2, the gain adjustment factors γ and λ take the values 0.99 and 0.95 respectively, the learning rates η_a and η_c of the action network and the evaluation network are both set to 0.0004, the buffer threshold is P = 2048, and the group size is C = 16. The number of DRL training iterations is 40000.
After the DRL agent training is finished, the trained agent is used in 1000 rounds of FL training, in which all users use their complete training sets; accuracy is then tested on a test set containing 10000 samples to verify the performance of the model, a higher accuracy indicating a better effect. The method is compared with conventional uniform quantization, whose quantization step point vector is fixed to 0.0375 × [-4, -3, -2, -1, 1, 2, 3, 4]^T.
Fig. 3 compares the errors obtained on the CIFAR-10 data set when the neural network model parameters of the downlink communication process are quantized with conventional uniform quantization and with the method proposed by the present invention. According to the reward function setting, the DRL agent is rewarded when its choice of quantization step points reduces the training error and the quantization error, and punished otherwise; the final scheme therefore obtains lower quantization and training errors than the general uniform quantization scheme.
Fig. 4 compares the accuracy obtained on the test set when the downlink communication process is handled with uniform quantization and with the method proposed by the present invention on the CIFAR-10 data set. The proposed method reduces both the quantization error and the training error, which means that during training the error between the broadcast model received by each user and the unquantized model is smaller and convergence on the local data sets is faster; consequently, the model converges faster with the proposed method and, for the same number of training rounds, its accuracy on the test set is higher.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The background section of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments and it is not intended to limit the invention to the specific embodiments described. It will be apparent to those skilled in the art that numerous alterations and modifications can be made to the described embodiments without departing from the inventive concepts herein, and such alterations and modifications are to be considered as within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the application.

Claims (7)

1. A federated learning model parameter quantization method based on deep reinforcement learning, characterized by comprising the following steps:
S1, obtaining the current global model parameters;
S2, state processing: computing M percentiles of the current global model parameters as the environment state observed by the deep reinforcement learning agent; the deep reinforcement learning agent comprises an action network and an evaluation network;
S3, action processing: according to the action output by the deep reinforcement learning agent, constructing L quantization step points according to a given rule and using them as the mapping set of the quantization operation;
S4, quantized transmission: performing X rounds of model quantization and transmission, each round using the quantization mapping set of the previous step; collecting the quantization errors and training errors of the X rounds, calculating their mean values, obtaining a reward value according to the reward function, and feeding the reward value back to the reinforcement learning agent;
S5, model updating: the reinforcement learning agent continuously records the state, action and reward of each interaction, and updates the network model of the deep reinforcement learning agent when the number of records reaches a given threshold.
2. The method as claimed in claim 1, wherein the action network is denoted π(a, s; θ_a), where θ_a represents the parameters of the neural network, s represents the state vector observed by the agent at the current time, and π(a, s; θ_a) gives the probability that the agent chooses the action a in this state; and the evaluation network is denoted V(s; θ_v), where θ_v represents the parameters of the neural network, and V(s; θ_v) gives the value of the state s as currently estimated by the agent.
3. The federated learning model parameter quantization method based on deep reinforcement learning according to claim 1 or 2, wherein, in step S2, the neural network model to be transmitted is a vector w of dimension d, where d is the dimension of the network model; w_s is obtained by sorting the parameters in w in ascending order of absolute value, and the x-th percentile is defined as

p_x = w_s^(⌈x·d/100⌉),

where w_s^(i) is the i-th element of w_s and ⌈·⌉ represents rounding up;

the dimension of the input state of the deep reinforcement learning agent is set to M ≥ 1, and the following vector is constructed:

s = [p_1, p_2, …, p_M]^T,

as the state observed at the current time, whose m-th element p_m is the m-th of the M selected percentiles of the current global model parameters; by adjusting the value of M, a balance is achieved between the representation accuracy and the state space dimension.
4. The federated learning model parameter quantization method based on deep reinforcement learning according to claim 2, wherein, in step S3, after the state vector s is obtained in step S2, it is input into the action network π(a, s; θ_a) to obtain the output vector of the network:

μ = [μ_1, μ_2, …, μ_{L/2}]^T,

where L/2 is the action dimension of the deep reinforcement learning agent's output; given the sampling variance σ of all actions, the action a_i is drawn from the normal distribution N(μ_i, σ), i = 1, 2, …, L/2, forming the action vector:

a = [a_1, a_2, …, a_{L/2}]^T;

based on this vector, the quantization step points and the complete mapping set are calculated: the quantization step points lie in a given range [-B, B], B > 0, the search and optimization interval [-B, 0] is divided into L/2 sub-intervals, the i-th of which is denoted I_i = [l_i, u_i] with left endpoint l_i and right endpoint u_i, and a uniformly distributed reference vector c is defined whose i-th element is the center point of the i-th sub-interval, i.e.

c_i = (l_i + u_i)/2;

next, the quantization step point vector q_n for the negative part is calculated by combining the action vector a with the reference vector c and sorting the result, where Sort(x) denotes sorting the elements of the vector x from small to large; the quantization step point vector q_p for the non-negative part is obtained from q_n by reversing its order and negating its elements; concatenating q_n and q_p yields the complete quantization step point vector of dimension L:

q = [q_n^T, q_p^T]^T.
5. The federated learning model parameter quantization method based on deep reinforcement learning according to claim 1 or 2, wherein, in step S4, a mapping set Q = {q_1, q_2, …, q_L} is constructed according to the quantization step point vector q obtained in step S3, where q_l is the l-th element of the vector q; the process in which the server broadcasts the global model, the user equipment receives and trains the model, the user equipment uploads the model, and the server aggregates the uploaded models constitutes one round, and the current global model w is quantized and transmitted in the following X rounds according to the following steps:

in the first step, let the current round be the t-th round, t = 1, 2, …, X; each parameter w in the global model of the current round is randomly quantized according to the following rule: letting q_i and q_j be the two adjacent step points in Q with q_i ≤ w ≤ q_j,

Q(w) = ξ(w, q_i, q_j),

where ξ(w, q_i, q_j) is a random variable that takes the value q_j with probability (w − q_i)/(q_j − q_i) and the value q_i otherwise; the quantized model parameters are denoted Q(w) = [Q(w_1), Q(w_2), …, Q(w_d)]^T, and the quantization error of the round is calculated as e_t = ||Q(w) − w||_2;

in the second step, the quantized global model Q(w) is broadcast to the users, so that the users receiving the model train on their local data sets based on it; the set of users is S, and the training error of the k-th user is l_k; after each user finishes training, the server receives the updated model w_k and the training error, aggregates {w_k}_{k∈S} to obtain a new global model w, and calculates the average training error of the round as

l̄_t = (1/|S|)·Σ_{k∈S} l_k;

after the X rounds are completed, the X quantization errors and average training errors are collected and their means ē and l̄ are calculated;

the reward r of the training process is then calculated from these means according to the reward function, where l̄' represents the average training error obtained in the previous quantized transmission (initialized accordingly for the first quantized transmission); α > 1 is a constant factor for amplifying the influence of the quantization error; β_1, β_2 > 0 are constant weight factors used to adjust the influence of the quantization error and the training error on the reward value; I(·) is an indicator function that takes the value 1 when the condition in parentheses holds and 0 otherwise; and δ > 0 is the penalty given when gradient explosion occurs in the training process, i.e. when w contains a value reaching the set very large value INF.
6. The federated learning model parameter quantization method based on deep reinforcement learning according to claim 1 or 2, wherein, in step S5, the state s acquired by the agent in step S2, the action a output by the agent in step S3, the reward r received by the agent in step S4, and the next state s' calculated from the latest w by the method of step S2 are stored as a quadruple (s, a, r, s') in the record buffer of the agent; if the number of records in the buffer is less than the given threshold P, the next record continues to be generated according to steps S2-S4; otherwise, the agent model is updated according to the following steps:

in the first step, the i-th record in the buffer is p_i = (s_i, a_i, r_i, s'_i), where i = 1, 2, …, P; let A_P = 0 and calculate the advantage function of each successive state-action pair according to the following formula:

A_{i-1} = r_{i-1} + γ·V(s'_i; θ_v) − V(s_i; θ_v) + γλ·A_i,

where γ and λ are constants used to adjust the influence of future returns on the current advantage;

in the second step, the records in the buffer are divided into groups of C records each; let θ'_a = θ_a, and then for each group of data the action loss function corresponding to the c-th record is calculated as the clipped loss

L^a_c = −min( ρ·A_{i_c}, clip(ρ, 1−ε, 1+ε)·A_{i_c} ),

where c = 1, 2, …, C, the position of the c-th record in B is i_c, the probability ratio ρ is

ρ = π(a_{i_c}, s_{i_c}; θ'_a) / π(a_{i_c}, s_{i_c}; θ_a),

and the clipping function clip(ρ, 1−ε, 1+ε) limits ρ to the interval [1−ε, 1+ε]; the model is updated according to

θ'_a ← θ'_a − η_a·∇_{θ'_a} L^a_c,

where η_a is the learning rate and ∇_{θ'_a} L^a_c is the gradient of the loss function; let θ'_c = θ_c; then, for each group of data, the value loss function L^v_c corresponding to the c-th record is calculated as the squared error between V(s_{i_c}; θ'_c) and the target A_{i_c} + V(s_{i_c}; θ_c), and the model is updated according to

θ'_c ← θ'_c − η_c·∇_{θ'_c} L^v_c,

where η_c is the learning rate;

in the third step, after all the training is finished, the model is updated by letting θ_a = θ'_a and θ_c = θ'_c, and all records in the buffer are emptied; the procedure returns to step S2 and continues interacting to obtain training data.
7. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the federated learning model parameter quantization method based on deep reinforcement learning as claimed in any one of claims 1 to 6.
CN202211657889.1A 2022-12-22 2022-12-22 Federal learning model parameter quantification method based on deep reinforcement learning Pending CN115983320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211657889.1A CN115983320A (en) 2022-12-22 2022-12-22 Federal learning model parameter quantification method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211657889.1A CN115983320A (en) 2022-12-22 2022-12-22 Federal learning model parameter quantification method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115983320A true CN115983320A (en) 2023-04-18

Family

ID=85966039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211657889.1A Pending CN115983320A (en) 2022-12-22 2022-12-22 Federal learning model parameter quantification method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115983320A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648123A (en) * 2024-01-30 2024-03-05 中国人民解放军国防科技大学 Micro-service rapid integration method, system, equipment and storage medium
CN117648123B (en) * 2024-01-30 2024-06-11 中国人民解放军国防科技大学 Micro-service rapid integration method, system, equipment and storage medium


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination