CN115016858B - Task unloading method based on post-decision state deep reinforcement learning - Google Patents

Task unloading method based on post-decision state deep reinforcement learning

Info

Publication number
CN115016858B
CN115016858B · CN202210572305.4A · CN202210572305A
Authority
CN
China
Prior art keywords
task
state
post
action
buffer
Prior art date
Legal status
Active
Application number
CN202210572305.4A
Other languages
Chinese (zh)
Other versions
CN115016858A (en)
Inventor
张竞哲
贺晓帆
张晨
周嘉曦
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210572305.4A priority Critical patent/CN115016858B/en
Publication of CN115016858A publication Critical patent/CN115016858A/en
Application granted granted Critical
Publication of CN115016858B publication Critical patent/CN115016858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44594Unloading
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/509Offload
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which can make decisions on every dimension of the unloading action, such as the unloading target and the unloading quantity of the tasks. For different optimization targets, the optimal strategy can be obtained by changing the cost function. The invention uses the experience replay mechanism of DQN, randomly selecting collected historical experience as training samples, which improves learning efficiency. Meanwhile, a post-decision state learning framework and an additional hot start process are used to accelerate learning. The traditional post-decision learning framework has high learning efficiency but requires additional prior information. The invention acquires the additional information required by traditional post-decision learning through an extra learning process, and combines the efficient post-decision state learning framework, the hot start process and the experience replay mechanism to achieve rapid convergence of the unloading method.

Description

Task unloading method based on post-decision state deep reinforcement learning
Technical Field
The invention relates to the technical field of machine learning and distributed computing, and in particular to a task unloading method based on post-decision state deep reinforcement learning.
Background
Against the background of explosive growth in computing demand and data size, edge computing is widely applied to address the limited computing power of terminal devices. Edge computing is a paradigm in which tasks are offloaded to edge devices for processing. Mobile devices usually focus on reducing latency and energy consumption: when the wireless channel condition is poor, a mobile device prefers to process tasks on its local CPU, whereas when the channel condition is good it tends to offload most tasks to the network edge for processing. However, if a task is offloaded to an unreliable server, information such as the position and identity of the user may be revealed, threatening user privacy. Therefore, the problem of privacy disclosure needs to be considered while balancing energy consumption.
On the other hand, as the scale of computing tasks keeps growing, distributed computing is also widely used in edge computing. The computational efficiency of a distributed computing system is susceptible to the computing power of individual nodes and to the communication environment: some nodes may take a long time to finish their computation and return results, causing a straggler effect that introduces computation delay and degrades efficiency. Coded computation is a framework that applies coding theory to distributed computing; by introducing redundancy through flexible coding techniques, the straggler effect can be effectively mitigated. Repetition coding is a simple and common coding method in which the same task is offloaded to several different nodes for processing, so that a result is obtained as soon as any one node completes the computation. However, when the channel condition is poor, blindly replicating tasks and offloading them to multiple servers not only wastes energy but is also unfavorable for privacy protection. To balance demands such as energy consumption and privacy preservation, the problem can be modeled as a Markov Decision Process (MDP) with appropriate states, an action space and a cost function, and the optimal offloading strategy that minimizes the long-term cost can be solved with reinforcement learning algorithms.
In practical scenarios, the state space of such Markov problems is usually large, and generic reinforcement learning algorithms converge slowly, which hinders practical application.
Disclosure of Invention
The invention provides a task unloading method based on post-decision state deep reinforcement learning, which is used for solving, or at least partially solving, the technical problem of low task unloading efficiency in the prior art.
The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
In one embodiment, the system state in the state set in step S1 is in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
In one embodiment, the actions in the action set in step S1 correspond to unloading decisions, the action taken at time n being a_n. The unloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers performing task processing at time n, with k_n ≤ m.
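As an illustration, the action set can be enumerated in code. The encoding below (a tuple of offload target, p_n and k_n) is an assumption made for this sketch and is not the patent's exact data structure.

```python
from typing import List, Tuple

# Illustrative action encoding (assumed, not the patented representation):
#   ("none", 0, 0)   -> process nothing at time n (p_n = 0)
#   ("local", p, 0)  -> process p tasks on the local CPU
#   ("edge", p, k)   -> offload p tasks to the k edge servers with the best channels
def build_action_set(b_n: int, m: int) -> List[Tuple[str, int, int]]:
    actions: List[Tuple[str, int, int]] = [("none", 0, 0)]
    for p in range(1, b_n + 1):
        actions.append(("local", p, 0))
        for k in range(1, m + 1):
            actions.append(("edge", p, k))
    return actions

# Example: 3 buffered tasks and m = 5 edge servers give 1 + 3 + 3*5 = 19 actions
print(len(build_action_set(3, 5)))
```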
In one embodiment, the strategy in step S3 is a greedy strategy, which specifically comprises randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
In one embodiment, the post-decision state in step S4 is the intermediate state reached after the action is taken in the current state and before moving to the next state; the post-decision state is expressed in the following form:
wherein p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next moment is expressed in the following form:
wherein b_max denotes the capacity of the task buffer, and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m respectively represent the channel states of the 1st, 2nd, ..., m-th edge server at time n+1; the state s_{n+1} at the next moment is the state at time n+1.
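A minimal Python sketch of one plausible reading of these dynamics is given below: the buffer part of the post-decision state reflects whether the p_n submitted tasks were actually completed (which is why the transition to the post-decision state is random and must be estimated), and the next state then adds the random arrivals Δb_n, clipped at b_max, together with new channel states. The success model and all names are assumptions made for illustration.

```python
import random

H_LEVELS = [-130, -125]            # channel power gains in dB (embodiment values)

# Assumed transition model for illustration only: the p_n tasks leave the buffer
# if at least one of the k_n chosen servers finishes (repetition coding);
# otherwise they remain. Local processing (k_n = 0) is assumed to always finish.
def sample_post_decision_state(b_n, h_n, p_n, k_n, p_complete=0.5):
    if p_n == 0 or k_n == 0:
        success = True
    else:
        success = any(random.random() < p_complete for _ in range(k_n))
    b_pds = b_n - p_n if success else b_n
    return (b_pds, tuple(h_n))

def sample_next_state(pds, delta_b, b_max, num_servers=5):
    b_pds, _ = pds
    b_next = min(b_pds + delta_b, b_max)          # arrivals beyond b_max overflow
    h_next = tuple(random.choice(H_LEVELS) for _ in range(num_servers))
    return (b_next, h_next)

# Example with the embodiment's parameters (b_max = 15, m = 5)
h_n = [random.choice(H_LEVELS) for _ in range(5)]
pds = sample_post_decision_state(b_n=7, h_n=h_n, p_n=3, k_n=2)
print(pds, sample_next_state(pds, delta_b=random.randint(0, 5), b_max=15))
```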
In one embodiment, step S4 includes:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
In one embodiment, when a task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
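The following sketch computes both per-task energies. The local-processing expression follows the formula above; the offloading expression is an assumed standard transmission-energy model (minimum power needed to deliver L bits within a slot of length τ over bandwidth W and channel gain h), supplied here only because the original equation image is not reproduced in this text.

```python
# Per-task energy models (illustrative; the offloading formula is an assumption).
def e_local(kappa, L, zeta, tau):
    # Energy to finish an L-sized task locally within one slot of length tau.
    return kappa * L**3 * zeta**3 / tau**2

def e_offload(L, tau, W, h, N0):
    # Assumed Shannon-capacity-based model: transmit L bits within time tau.
    rate = L / tau                                  # required rate in bit/s
    power = (N0 * W / h) * (2 ** (rate / W) - 1)    # minimum transmit power
    return power * tau

print(e_local(1e-28, 1e3, 800, 1e-3))               # ~5.12e-05 J with the embodiment values
```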
In one embodiment, the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
In one embodiment, when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. The corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
the invention realizes the evaluation network by introducing the deep neural network and playing back with experience and minimizing the loss functionIs +.>Is used for updating the weight parameters of the (a). The additionally adopted hot start learning process can accelerate the parameter updating of the deep network. Second, the invention is an enhancement based on a post-decision learning frameworkThe algorithm and the traditional post-decision state learning framework need additional prior information, and the invention carries out additional estimation on the state transition probability from the current state to the post-decision state by adding an additional learning process, so that the structural advantage of the post-decision state can be utilized to further accelerate the updating of the network while eliminating the prior knowledge requirement, and the algorithm performance of the invention is better than the traditional deep learning algorithm DQN, thereby improving the task unloading efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a task offloading method based on post-decision state deep reinforcement learning in an embodiment of the invention;
FIG. 2 is a diagram illustrating a post-decision state according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a processing framework of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
fig. 4 is a diagram of simulation results of a method according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a computation unloading strategy learning framework based on post-decision state deep reinforcement learning for quickly solving the optimal decision in task unloading. The framework introduces a post-decision state learning framework and a hot start process on the basis of the traditional deep learning algorithm DQN, thereby accelerating algorithm convergence.
The main conception and innovation of the invention are as follows:
the invention relates to a task unloading method based on deep reinforcement learning, which can make decisions on each dimension of unloading actions, such as unloading objects, unloading quantity and the like of tasks. The optimal strategy under different targets can be realized by changing the cost function for different optimization targets. The method utilizes the experience playback mechanism of the DQN, randomly selects the collected historical experience as a training sample, and can improve the learning efficiency. Meanwhile, a post-decision learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision state learning framework has higher learning efficiency but requires additional prior information, and the invention provides a task unloading algorithm based on post-decision state deep reinforcement learning, which utilizes an additional learning process to acquire the additional information required in the traditional post-decision state learning, and utilizes an efficient post-decision learning framework, a hot start process and an experience playback mechanism to realize rapid algorithm convergence.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
Referring to fig. 1, a flowchart of a task offloading method based on post-decision state deep reinforcement learning in an embodiment of the invention is shown. In the implementation process, the Markov random problem corresponding to the target task is a task similar to the target task; it differs from the target task of this example only in that new tasks arrive according to a different distribution. Fig. 2 is a schematic diagram of the post-decision state in the embodiment of the invention.
It should be noted that steps S3 to S6 are executed repeatedly both for the hot-start task and for the target task; the target task is the task for which the unloading decision needs to be made.
The optimal unloading strategy comprises the unloading target and the unloading quantity, wherein the unloading target is the location where the tasks are to be processed and the unloading quantity is the number of tasks unloaded.
In general, the invention relates to a task unloading method based on post-decision state deep reinforcement learning, and the corresponding optimal unloading strategy for different targets can be solved by changing the cost function. The method is a deep reinforcement learning algorithm that combines deep learning with the post-decision state learning framework of reinforcement learning; it retains the advantages of the common deep learning algorithm DQN, eliminates the need for prior knowledge in the post-decision state learning framework, and further improves the training speed of the model with an additional hot start process. A minimal sketch of the overall training procedure is given below.
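The following Python sketch illustrates the S1-S7 flow under simplifying assumptions; the agent and environment interfaces (select_action, step, replay and so on) are illustrative names, not the patented implementation.

```python
import random

def train(agent, env, hot_start_env, max_iters=100_000):
    """Sketch of S1-S7: hot start on a similar Markov problem, then the target task."""
    for phase_env in (hot_start_env, env):        # S2-S6 (hot start) and then S7
        agent.reset_transition_estimate()         # (re-)initialize P_k(pds | s, a)
        s = phase_env.reset()                     # random initial state
        for n in range(1, max_iters + 1):
            a = agent.select_action(s)            # S3: greedy w.r.t. Q_eval with prob. 1-eps
            pds, cost, s_next = phase_env.step(a) # S4: observe post-decision state and cost
            agent.buffer.append((s, a, pds, cost, s_next))
            if n % agent.update_interval == 0:    # S5: periodic updates
                agent.update_transition_estimate()            # statistical update of P_k
                batch = random.sample(agent.buffer,
                                      min(len(agent.buffer), agent.batch_size))
                agent.replay(batch)               # minimize the loss, update Q_eval
            if n % agent.target_interval == 0:
                agent.sync_target()               # copy Q_eval weights to Q_target
            s = s_next
            if agent.converged():                 # S6/S7: stop when Q_eval has converged
                break
```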
In one embodiment, the system state in the state set in step S1 is in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
In one embodiment, the actions in the action set in step S1 correspond to offloading decisions, the action taken at time n being a_n. The offloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers handling offloaded tasks at time n, with k_n ≤ m.
In a specific implementation process, the method of the invention also initializes the corresponding action set, and employs repetition-coded computation, i.e., as long as one of the chosen edge servers completes the computation, the p_n tasks have been successfully processed. In this embodiment, m = 5, j = 2, h = {-130, -125} (dB) with the corresponding channel state transition probabilities, k_n ∈ {1, 2, 3, 4, 5}, and the probability that each edge server completes the computation is 0.5.
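Under the usual independence assumption between servers, the probability that an offloaded batch succeeds grows quickly with k_n, which is the trade-off the unloading action has to balance against energy and privacy cost:

```python
# Probability that at least one of k_n servers finishes, assuming independent
# completions with per-server probability 0.5 as in this embodiment.
def batch_success_prob(k_n, p_complete=0.5):
    return 1 - (1 - p_complete) ** k_n

for k in range(1, 6):
    print(k, batch_success_prob(k))   # 0.5, 0.75, 0.875, 0.9375, 0.96875
```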
In one embodiment, the strategy in step S3 is a greedy strategy, which specifically comprises randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
Specifically, a_n = argmin_a Q_eval(s_n, a), which accelerates the updating of the evaluation network Q_eval. In this embodiment, ε = 0.1. The action value function is the evaluation network function Q_eval, which is calculated by the following formula:
Q_eval(s_n, a) = c_k(s_n, a) + Σ_{s̃} P_k(s̃ | s_n, a) · Q̃_eval(s̃, a; θ).
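A sketch of this selection rule is shown below; cost_k, p_k and q_pds stand for the stage cost c_k(s, a), the estimated transition probabilities P_k(·|s, a) and the evaluation network output Q̃_eval, and are assumed callables supplied by the learner.

```python
import random

# Epsilon-greedy selection over the expanded action value
#   Q_eval(s, a) = c_k(s, a) + sum_pds P_k(pds | s, a) * Q~_eval(pds, a),
# with epsilon = 0.1 as in this embodiment.
def select_action(s, actions, cost_k, p_k, q_pds, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)             # explore
    def q_value(a):
        return cost_k(s, a) + sum(prob * q_pds(pds, a)
                                  for pds, prob in p_k(s, a).items())
    return min(actions, key=q_value)              # exploit: smallest expected cost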
in one embodiment, after the post-decision state in step S4 is the intermediate state before moving to the next state after taking action for the current state, the post-decision state is expressed in the form of:
wherein p is n For the number of tasks to be processed in the task buffer at time n, Δb n Indicating the number of tasks that have been reached,representing a post-decision state after taking action at time n;
state s at the next time n+1 The expression form is:
b max indicating the capacity of the task buffer,respectively representing the channel state of the 1 st edge server at the time of n+1, the channel state of the 2 nd edge server at the time of n+1, the channel state of the m th edge server at the time of n+1, and the state s at the next time n+1 I.e. the state at time n + 1.
In one embodiment, step S4 includes:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
Specifically, storing an unprocessed task in a task buffer will generate a buffer cost, and if the task is offloaded to an edge server, a corresponding privacy cost will be generated, and if the task buffer overflows due to insufficient capacity, a corresponding overflow cost will be generated.
In this embodiment, the buffer can store at most 15 tasks, i.e. b_max = 15, and Δb takes values in {0, 1, 2, 3, 4, 5}, where the corresponding arrival probability distribution is random for the hot-start task and uniform for the target task. The weight coefficients can be taken as η_1 = 50, η_2 = 10^6, η_3 = 150, η_4 = 300.
In one embodiment, when a task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
The energy consumed by a unit task is the energy consumed by a single task.
In this example, κ = 10^-28, L = 10^3, ζ = 800, τ = 10^-3, W = 10 MHz, and N_0 = 10^-19 W/Hz.
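Plugging these values into the local-energy expression reconstructed above (itself an assumption, since the original equation image is not reproduced here) gives on the order of tens of microjoules per locally processed task:

```python
kappa, L, zeta, tau = 1e-28, 1e3, 800, 1e-3
print(kappa * L**3 * zeta**3 / tau**2)   # 5.12e-05 J per task, under the assumed formula
```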
In one embodiment, the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
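One plausible reading of this statistical update is a simple count-based estimator of P_k(s̃ | s, a), sketched below; the text does not commit to this particular estimator.

```python
from collections import defaultdict

class TransitionEstimator:
    """Count-based estimate of P_k(pds | s, a) from observed transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, pds):
        self.counts[(s, a)][pds] += 1

    def probabilities(self, s, a):
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return {}                      # no observations yet for this (s, a)
        return {pds: c / total for pds, c in self.counts[(s, a)].items()}
```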
In one embodiment, when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. According to the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experience; after a batch of experience has been used to update the network, the parameters of the evaluation network are assigned to the target network Q_target and the corresponding function Q_target is updated at the same time. The corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
Specifically, the input variables of the evaluation network include the post-decision state s̃ and the action a, and θ is the weight parameter of the evaluation network. For each input pair (s̃, a) the neural network produces an output Q̃_eval(s̃, a; θ), and the network is updated. Since a batch of experience is randomly selected by the experience replay mechanism, multiple such post-decision state and action pairs (s̃, a) are input to update the network. Both the evaluation network Q_eval and the target network Q_target are substituted into the above formula to calculate the corresponding Q_eval and Q_target functions.
The action value function Q_eval is updated by the evaluation network according to the above formula and is used only at the time of action selection (S3); the action value function Q_target is updated by the target network according to the above formula and is used as the target value only at the time of the network update (S5).
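A compact sketch of this replay update in PyTorch is shown below. The tensor layout, the min over next actions and the helper full_q_target (which expands Q(s', a') = c_k(s', a') + Σ P_k(·|s', a')·Q̃_target(·, a') for every candidate action a') are assumptions for illustration rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def replay_update(q_eval, optimizer, batch, gamma, full_q_target):
    """One experience-replay step on a sampled batch (illustrative sketch).

    q_eval(pds, a)        -> predicted post-decision value, shape (B,)
    full_q_target(s_next) -> Q_target(s', a') for every action a', shape (B, |A|)
    batch = (pds, a, c_u, s_next) as tensors built from the sampled experience.
    """
    pds, a, c_u, s_next = batch
    with torch.no_grad():
        q_next = full_q_target(s_next).min(dim=1).values   # best (lowest-cost) next action
        target = c_u + gamma * q_next                       # TD target from the PDS onwards
    pred = q_eval(pds, a)
    loss = F.mse_loss(pred, target)                         # the loss minimized above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```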
Fig. 3 is a schematic diagram of a processing framework of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the invention.
In a specific embodiment, P_k(s̃ | s, a) and Q_eval are updated every 200 iterations, the target network and Q_target are updated every 1000 iterations, and every 10000 iterations the average cost over 100000 randomly sampled state transitions is counted to evaluate the algorithm performance.
In order to more clearly illustrate the method proposed by the present invention, the following description is made by specific experimental data.
1. Simulation conditions and content
The operating system is Microsoft Windows and the programming simulation language is python. The simulation adopts a group of parameters to simulate the effect of the algorithm and the conventional common deep reinforcement learning algorithm DQN.
2. Simulation result analysis
Fig. 4 is a graph comparing the effects of the currently popular deep reinforcement learning algorithm DQN and the proposed algorithm. It can be seen from the figure that the proposed algorithm can reduce average cost faster than DQN, and is a more efficient task offloading algorithm based on post-decision state deep reinforcement learning.
Aiming at the problem that the convergence speed of deep learning algorithms is still limited in prior-art task unloading methods, the invention updates the parameter values of the evaluation network and the target network by introducing a deep neural network, using experience replay and minimizing a loss function. The additionally employed hot start learning process accelerates parameter updating of the deep network. Second, an enhanced algorithm based on the post-decision state learning framework is adopted, which solves the problem that the traditional post-decision state learning framework requires additional prior information.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (9)

1. A task unloading method based on post-decision state deep reinforcement learning, characterized by comprising the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
2. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein the system states in the state set in step S1 are in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
3. The post-decision state deep reinforcement learning based task offloading method of claim 2, wherein the actions in the action set in step S1 correspond to offloading decisions, the action taken at time n being a_n. The offloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers handling offloaded tasks at time n, with k_n ≤ m.
4. The task offloading method of claim 1, wherein the strategy in step S3 is a greedy strategy, specifically comprising randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
5. The post-decision state deep reinforcement learning-based task offloading method of claim 2, wherein the post-decision state in step S4 is the intermediate state reached after the action is taken in the current state and before moving to the next state, and is expressed in the following form:
wherein p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next moment is expressed in the following form:
wherein b_max denotes the capacity of the task buffer, and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m respectively represent the channel states of the 1st, 2nd, ..., m-th edge server at time n+1; the state s_{n+1} at the next moment is the state at time n+1.
6. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein step S4 comprises:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
7. The post-decision state deep reinforcement learning based task offloading method of claim 6, wherein when the task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
8. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
9. The task offloading method of claim 1, wherein when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'; according to the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experience, and after a batch of experience has been used to update the network, the parameters of the evaluation network are assigned to the target network Q_target and the corresponding function Q_target is updated at the same time; the corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
CN202210572305.4A 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning Active CN115016858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115016858A CN115016858A (en) 2022-09-06
CN115016858B true CN115016858B (en) 2024-03-29

Family

ID=83069645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572305.4A Active CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115016858B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726826A (en) * 2020-05-25 2020-09-29 上海大学 Online task unloading method in base station intensive edge computing network
CN113064671A (en) * 2021-04-27 2021-07-02 清华大学 Multi-agent-based edge cloud extensible task unloading method
CN113434212A (en) * 2021-06-24 2021-09-24 北京邮电大学 Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN113504987A (en) * 2021-06-30 2021-10-15 广州大学 Mobile edge computing task unloading method and device based on transfer learning
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726826A (en) * 2020-05-25 2020-09-29 上海大学 Online task unloading method in base station intensive edge computing network
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN113064671A (en) * 2021-04-27 2021-07-02 清华大学 Multi-agent-based edge cloud extensible task unloading method
CN113434212A (en) * 2021-06-24 2021-09-24 北京邮电大学 Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN113504987A (en) * 2021-06-30 2021-10-15 广州大学 Mobile edge computing task unloading method and device based on transfer learning
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A fast deep Q-learning network edge-cloud migration strategy for in-vehicle services; Peng Jun; Wang Chenglong; Jiang Fu; Gu Xin; Mou Yueyue; Liu Weirong; Journal of Electronics & Information Technology; 20200115(01); full text *
An offloading strategy based on software-defined networking and mobile edge computing in the Internet of Vehicles; Zhang Haibo; Jing Kunlun; Liu Kaijian; He Xiaofan; Journal of Electronics & Information Technology; 20200315(03); full text *
Research on NOMA-MEC-based offloading strategies in the Internet of Vehicles; Zhang Haibo et al.; Journal of Electronics & Information Technology; 20210430; 42(4); full text *

Also Published As

Publication number Publication date
CN115016858A (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN112882815B (en) Multi-user edge calculation optimization scheduling method based on deep reinforcement learning
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN108962238B (en) Dialogue method, system, equipment and storage medium based on structured neural network
CN112817653A (en) Cloud-side-based federated learning calculation unloading computing system and method
CN110955463B (en) Internet of things multi-user computing unloading method supporting edge computing
CN110890930B (en) Channel prediction method, related equipment and storage medium
CN113434212A (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN114065863B (en) Federal learning method, apparatus, system, electronic device and storage medium
CN112181655A (en) Hybrid genetic algorithm-based calculation unloading method in mobile edge calculation
CN112083967B (en) Cloud edge computing task unloading method, computer equipment and storage medium
CN112995343B (en) Edge node calculation unloading method with performance and demand matching capability
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN111158912A (en) Task unloading decision method based on deep learning in cloud and mist collaborative computing environment
CN112686376A (en) Node representation method based on timing diagram neural network and incremental learning method
CN113760511A (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN113867843A (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN114385272B (en) Ocean task oriented online adaptive computing unloading method and system
CN115016858B (en) Task unloading method based on post-decision state deep reinforcement learning
Tang et al. Variational deep q network
CN113778550B (en) Task unloading system and method based on mobile edge calculation
Cesa-Bianchi et al. Efficient online learning via randomized rounding
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
Jeong et al. Deep reinforcement learning-based task offloading decision in the time varying channel
CN113961204A (en) Vehicle networking computing unloading method and system based on multi-target reinforcement learning
CN114116995A (en) Session recommendation method, system and medium based on enhanced graph neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant