CN115016858A

CN115016858A - Task unloading method based on post-decision state deep reinforcement learning

Info

Publication number: CN115016858A
Application number: CN202210572305.4A
Authority: CN
Inventors: 张竞哲; 贺晓帆; 张晨; 周嘉曦
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-09-06
Anticipated expiration: 2042-05-24
Also published as: CN115016858B

Abstract

The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which can make decisions on all dimensions of unloading actions, such as unloading objects, unloading quantity and the like of tasks. And facing to different optimization targets, and realizing the optimal strategies under different targets by changing the cost function. The invention utilizes the experience playback mechanism of DQN, randomly selects collected historical experiences as training samples and improves the learning efficiency. Meanwhile, a post-decision state learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision learning framework has higher learning efficiency, but needs additional prior information. The invention provides a task unloading method based on post-decision state deep reinforcement learning, which utilizes an additional learning process to obtain additional information required in the traditional post-decision learning, and realizes the rapid convergence of the unloading method by utilizing an efficient post-decision state learning framework, a hot start process and an experience playback mechanism.

Description

Task unloading method based on post-decision state deep reinforcement learning

Technical Field

The invention relates to the technical field of machine learning and distributed computing, in particular to a task unloading method based on post-decision state deep reinforcement learning.

Background

Under the background of explosive increase of operation demand and data scale, edge calculation is often applied to solve the problem of limited computing capability of terminal equipment. Edge computing is a mode of offloading tasks to edge devices for processing. Mobile devices are typically focused on reducing latency and power consumption, so when wireless channel conditions are poor, the mobile device will preferentially process tasks at the local CPU, while when wireless channel conditions are good, the mobile device will tend to offload most of the tasks to the edge of the network for processing. If the task is unloaded to an unreliable server for processing, information such as the position and the identity of the user can be revealed, and the privacy of the user is threatened. Therefore, the issue of privacy disclosure needs to be considered while balancing energy consumption.

On the other hand, distributed computing is also widely used in edge computing due to the increasing scale of computing tasks. The computational efficiency of a distributed computing system is susceptible to the computing power of individual nodes or the communication environment. Part of nodes may take a long time to complete calculation and return calculation results, which further causes the effect of a queue-dropping person to bring calculation delay and affects calculation efficiency. The encoding calculation is a framework for applying an encoding theory to the field of distributed calculation, and through flexible and various encoding technologies, redundancy is properly introduced, so that the effect of falling behind can be effectively relieved. Replication calculation is a simple and common encoding mode, and the same task is offloaded to a plurality of different users for processing, so that a calculation result can be obtained only by waiting for one node to complete calculation. However, when the channel condition is not good, the task is copied and unloaded to a plurality of servers blindly, and the simultaneous processing not only wastes energy, but also is not beneficial to privacy protection. To balance various requirements of energy consumption, privacy protection, etc., an optimal offloading strategy that minimizes long-term cost can be solved using a reinforcement learning algorithm by modeling such problems as a Markov Decision Process (MDP) with appropriate states, action space, and cost loss functions.

In practical situations, the state space of the markov problem is usually large, and the efficiency of a general reinforcement learning algorithm is low, which is not beneficial to practical application.

Disclosure of Invention

The invention provides a task unloading method based on post-decision state deep reinforcement learning, which is used for solving or at least partially solving the technical problem of low task unloading efficiency in the prior art.

The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:

s1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;

s2: the random initialization starting state specifically includes: initializing post-decision state after taking action a in state s

State transition probability of

Evaluating a network

Weight parameter of, target network

K represents the state from s to the post-decision state

Using a Markov random problem pair corresponding to the target task

Performing hot start, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a certain moment, the action taken and the corresponding post-decision stateCost of taking action and status to the next time;

s3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;

s4: observing the post-decision state, forming a group of experiences by the post-decision state, the cost generated by taking the action of the step S3 and the state at the next moment, and storing the group of experiences in an experience buffer;

s5: at certain intervals, update

Randomly selecting a batch of experiences from the experience buffer, and updating the evaluation network by experience playback

And corresponding evaluation network function, and updating the evaluation network

Assigning the weight parameter of to the target network

Updating the corresponding target network function;

s6: adding 1 to the iteration number, and repeatedly executing the steps S3-S5 until the network is evaluated

Converging to finish hot start;

s7: setting the current iteration number to 1, emptying the task buffer and reinitializing

Evaluation network obtained by hot start

Repeating the steps S3-S6 for the target task until the evaluation network converges; based on an evaluation network

And correspondingly evaluating the network function to obtain the optimal unloading strategy in different states.

In one embodiment, the system states in the state set in step S1 are in the form of:

wherein s is _n The system state at the time n is defined by the channel state and the state of the task buffer, and the task buffer b has i states in total, which means that b is { b ═ b ¹ ,b ² ,....,b ⁱ In which b is ¹ 、b ⁱ Respectively representing the 1 st and ith states of the task buffer, b _n The number of tasks in the task buffer at time n is indicated, and channel h has j states, which is indicated as h ═ h { (h) } ¹ ,h ² ,....,h ^j }，h ¹ 、h ^j Respectively, 1 st and j th states of the channel, m represents the number of edge servers,

respectively showing the channel state of the 1 st edge server at the time n and the channel state of the mth edge server at the time n.

In one embodiment, the action of the set of actions in step S1 corresponds to an unload decision, and the action taken at time n is a _n The offload decision includes three cases, the first, to store p in the task buffer _n One task is processed at the local CPU and the second is no task, when p is _n 0, third, p in task buffer _n The tasks are simultaneously and respectively unloaded to the k with the best channel _n An edge server processes, wherein p _n For the number of tasks to be processed in the task buffer at n moments, k _n Number of edge servers tasking at time n, k _n ≤m。

In one embodiment, the strategy in step S3 is a greedy strategy, which specifically includesRandomly selecting an action with probability e, and selecting an action value Q under the current state with probability 1-e _eval The action for which the function is minimal.

In one embodiment, the post-decision state in step S4 is an intermediate state before the transition to the next state after the action is taken by the current state, and is represented by:

wherein p is _n For the number of tasks to be processed in the task buffer at time n, Δ b _n Indicating the number of newly arrived tasks,

representing a post-decision state after taking action at n moments;

state s at the next moment _n+1 Is represented by the form:

b _max which represents the capacity of the task buffer and,

respectively show the channel state of the 1 st edge server at the time of n +1, the channel state of the 2 nd edge server at the time of n +1, the channel state of the m-th edge server at the time of n +1, and the state s at the next time _n+1 I.e. the state at time n + 1.

In one embodiment, step S4 includes:

acquiring the caching cost of the task in the task buffer, the privacy cost of unloading the task to the edge server, the energy consumption cost generated by processing the task and the overflow cost generated when the task buffer overflows due to insufficient capacity, wherein the energy consumption cost c generated by processing the task ^energy Involving the energy consumption and task handling at the local CPUEnergy consumption of traffic offloading to edge servers for processing; caching cost c of task in task cache ^holding ＝b _n -p _n Privacy cost of task offload to edge server processing c ^privacy ＝p _n Overflow cost c generated when the task buffer overflows due to insufficient capacity ^overflow ＝max{b _n -p _n +Δb _n -b _max ,0}，b _max The size of the task buffer; p is a radical of _n For the number of tasks to be processed in the task buffer at time n, b _n Indicating the number of tasks in the task buffer at time n, Δ b _n Representing the number of newly arrived tasks;

the slave s is obtained according to the cost of holding the task buffer by the task, the privacy cost of unloading the task to the edge server, the energy consumption cost generated by processing the task and the overflow cost generated when the task buffer overflows due to insufficient capacity _n To

Cost function c of ^k (s, a) and from

To s _n+1 Cost function of

c ^k (s,a)＝η ₁ ·c ^holding +η ₂ ·c ^energy +η ₃ ·c ^privacy ,

Wherein eta is ₁ ,η ₂ ,η ₃ ,η ₄ K and u represent state transition identifiers respectively for corresponding weight coefficients;

after observing a complete state transition process, the current state, the post-decision state, the state at the next moment, and the sampling are carried outA set of experiences comprising the action taken and the cost of the action

And storing the data into an experience buffer.

In one embodiment, when a task is processed at the local CPU, the energy consumed per unit task is:

e ^local ＝κL ³ ζ ³ /τ ² ,

wherein, kappa is a CPU parameter, L is a task size, zeta is a CPU frequency, and tau is a time interval; when offloading the task to the edge server, the energy consumed by a unit task is:

where W is the bandwidth of the edge computation network, h is the power gain, N ₀ Is the noise power spectral density.

In one embodiment, the update in step S5

The method of (2) adopts a statistical updating method.

In one embodiment, when the single set of experience is randomly selected from the experience buffer in step S5, the selected experience is

When a batch of experience is played back empirically, the weight parameters of the evaluation network are updated by minimizing the loss of the following loss function:

where γ represents a discount factor, θ is a weight parameter of the evaluation network,

representing a post-decisionStatus of state

State s by time n +1 _n+1 Is determined by the cost function of (a) a,

for decision states after input

And evaluating the network after action a

Output of (Q) _target (s ', a ') is the input state s ' and Q after action a _target And (4) outputting the function. Update the corresponding Q according to the following formula _eval And Q _target Function:

where Q (s, a) represents the value of a Q function in state s and action a, the Q function including an evaluation function Q _eval And an objective function Q _target ，

Expressed as a post-input decision state

Evaluating the network or target network in response to action a

To output of (c).

Compared with the prior art, the invention has the advantages and beneficial technical effects as follows:

the invention realizes the network evaluation by introducing the deep neural network and using the method of empirical playback and minimized loss function

With a target network

Updating the weight parameter of (2). The additionally adopted warm-start learning process can accelerate the parameter updating of the deep network. The invention is a reinforcement algorithm based on a post-decision learning framework, and the traditional post-decision state learning framework needs additional prior information.

Drawings

In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the embodiments or technical solutions in the prior art are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a post-decision state according to an embodiment of the present invention;

FIG. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;

FIG. 4 is a diagram of simulation results of a method according to an embodiment of the present invention.

Detailed Description

The invention aims to solve the problem of quickly solving an optimal decision in task unloading, and provides a computation unloading strategy learning framework based on post-decision depth reinforcement learning. The framework is introduced on the basis of the traditional deep learning algorithm DQN, and then the framework and the hot start process are decided and learned, so that the algorithm convergence is accelerated.

The main concept and innovation of the invention are as follows:

the invention relates to a task unloading method based on deep reinforcement learning, which can make decisions on all dimensions of unloading actions, such as unloading objects, unloading quantity and the like of tasks. And facing to different optimization targets, and realizing the optimal strategies under different targets by changing the cost function. The method utilizes an experience playback mechanism of the DQN, randomly selects collected historical experiences as training samples, and therefore learning efficiency can be improved. Meanwhile, a post-decision learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision state learning framework has higher learning efficiency but needs additional prior information, but the invention provides a task unloading algorithm based on the post-decision state deep reinforcement learning, the additional information needed in the traditional post-decision state learning is obtained by utilizing an additional learning process, and the rapid convergence of the algorithm is realized by utilizing the efficient post-decision learning framework, the hot start process and the experience playback mechanism.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:

State transition probability of

Evaluating a network

Weight parameter of, target network

K represents a transition identifier from the state s to the post-decision state, and the evaluation network is paired with a Markov random problem corresponding to the target task

Performing hot start, and setting the iteration number to be 1, wherein the experience buffer is used for storing the state at a certain moment, the action taken, the corresponding post-decision state, the cost generated by the action taken and the state at the next moment;

s5: at certain intervals, update

Randomly selecting a batch of experiences from the experience buffer to update the evaluation network through experience playback

Updating the corresponding evaluation network function, and updating the updated evaluation network

Assigning the weight parameter of to the target network

Updating the corresponding target network function;

Converging to finish hot start;

Evaluation network obtained by hot start

Repeating steps S3-S6 for the target task until network convergence is evaluated; based on an evaluation network

Fig. 1 is a flowchart of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention. In the specific implementation process, a Markov random problem corresponding to a target task, namely a task similar to the target task, is different from an example target task in the distribution of new tasks. Fig. 2 is a schematic diagram of a post-decision state according to an embodiment of the present invention.

It should be noted that, for the task of the warm start process, the steps S3-S6 need to be repeatedly executed, and for the target task, that is, the task that needs to make task unloading decision, the steps S3-S6 need to be repeatedly executed.

The optimal unloading strategy comprises unloading modes and unloading quantities, wherein the unloading objects, namely tasks, need to be unloaded to where to process, and the unloading quantities, namely the unloading task quantities.

Generally speaking, the invention discloses a task unloading method based on post-decision state deep reinforcement learning, and corresponding optimal unloading strategies can be solved for different targets by changing a cost function. The deep reinforcement learning algorithm combines the post-decision state learning framework in deep learning and reinforcement learning, has the advantages of a common deep learning algorithm DQN, can eliminate the need of prior knowledge in the post-decision state learning framework, and can further improve the training speed of the model by utilizing an additional hot start process.

wherein s is _n The system state at the time n is defined by the channel state and the state of the task buffer, and the task buffer b has i states in total, which means that b is { b ═ b ¹ ,b ² ,....,b ⁱ In which b ¹ 、b ⁱ Respectively representing the 1 st and ith states of the task buffer, b _n The number of tasks in the task buffer at time n is indicated, and channel h has j states, which is indicated as h ═ h { (h) } ¹ ,h ² ,....,h ^j }，h ¹ 、h ^j Respectively, 1 st and j th states of the channel, m represents the number of edge servers,

In one embodiment, the action of the set of actions in step S1 corresponds to an unload decision, and the action taken at time n is a _n The offload decision includes three cases, the first, to store p in the task buffer _n One task is processed at the local CPU and the second is no task, when p is _n The number of the first, second,p in task buffer _n The tasks are simultaneously and respectively unloaded to the k with the best channel _n An edge server processes, wherein p _n For the number of tasks to be processed in the task buffer at n moments, k _n Number of edge servers, k, handling offload tasks for n moments _n ≤m。

In the specific implementation process, the method of the invention also initializes the corresponding action set. And adopts a repetition coding calculation method, i.e. as long as one of the edge servers completes the calculation, p _n The individual tasks have been successfully processed. In this embodiment, m-5, j-2, h-130, -125 (dB), and the corresponding state transition probability is

k _n With {1,2,3,4,5}, the probability of each edge server completing the computation is 0.5.

In one embodiment, the strategy in step S3 is a greedy strategy, which specifically includes randomly selecting an action with a probability e, and selecting the action value function Q in the current state with probabilities 1-e _eval Minimum motion.

In particular, that is to say that a _n ＝argmin _a Q _eval (s _n A) to expedite evaluation of the network

Convergence of (2). In this example, e is 0.1. The action value function here is an evaluation network function Q _eval It needs to be calculated by the following formula:

wherein p is _n For the number of tasks to be processed in the task buffer at time n, Δ b _n Indicating the number of tasks that have been newly reached,

representing a post-decision state after taking action at n moments;

state s at the next moment _n+1 Is represented by the following form:

b _max which represents the capacity of the task buffer and,

respectively shows the channel state of the 1 st edge server at the time of n +1, the channel state of the 2 nd edge server at the time of n +1, the channel state of the mth edge server at the time of n +1, and the state s at the next time _n+1 I.e. the state at time n + 1.

In one embodiment, step S4 includes:

acquiring the caching cost of the task in the task buffer, the privacy cost of unloading the task to the edge server, the energy consumption cost generated by processing the task and the overflow cost generated when the task buffer overflows due to insufficient capacity, wherein the energy consumption cost c generated by processing the task ^energy The method comprises the steps of energy consumption of processing a task in a local CPU and energy consumption of unloading the task to an edge server for processing; caching cost c of task in task cache ^holding ＝b _n -p _n Privacy cost of task offload to edge server processing c ^privacy ＝p _n Overflow cost c generated when the task buffer overflows due to insufficient capacity ^overflow ＝max{b _n -p _n +Δb _n -B _max ,0}，B _max The size of the task buffer; p is a radical of _n For the number of tasks to be processed in the task buffer at time n, b _n Indicating the number of tasks in the task buffer at time n, Δ b _n Representing the number of newly arrived tasks;

the slave s is obtained according to the holding cost of the task buffer, the privacy cost of the task unloading to the edge server, the energy loss cost generated by processing the task and the overflow cost generated when the task buffer overflows due to insufficient capacity _n To

Cost function c of ^k (s, a) and from

To s _n+1 Cost function of

c ^k (s,a)＝η ₁ ·c ^holding +η ₂ ·c ^energy +η ₃ ·c ^privacy ,

Wherein eta ₁ ,η ₂ ,η ₃ ,η ₄ K and u represent state transition identifiers respectively for corresponding weight coefficients;

after observing a complete state transition process, a set of experiences comprising the current state, the post-decision state, the state at the next time, the action taken and the cost of the action generation

And storing the data into an experience buffer.

Specifically, storing unprocessed tasks in the task buffer incurs a buffering cost, and if the tasks are offloaded to the edge server for processing, a corresponding privacy cost is incurred, and if the task buffer overflows due to insufficient capacity, a corresponding overflow cost is incurred.

In a particular embodiment, the buffer is capable of storing up to 15 tasks, i.e. b _max 15, Δ b may be {0,1,2,3,4,5}, the corresponding probability distribution in the hot start tasks is random, and the target tasks are uniform. The weighting factor can take on a value of η ₁ ＝50，η ₂ ＝10^6，η ₃ ＝150，η ₄ ＝300。

e ^local ＝κL ³ ζ ³ /τ ² ，

The energy consumed by a unit task is the energy consumed by a single task.

Kappa-10 in this example ^-28 ，L＝10 ³ ，ζ＝800，τ＝10 ^-3 ，W＝10MHz,N ₀ ＝10 ^-19 W/Hz。

In one embodiment, the update in step S5

The method of (2) adopts a statistical updating method.

representing post-decision state

State s by time n +1 _n+1 Is determined by the cost function of (a) a,

for decision states after input

And evaluating the network after action a

Output of Q _target (s ', a ') is the function Q after the input state s ' and the action a _target To output of (c). Updating the evaluation network for each set of experiences based on an experience replay mechanism

And corresponding function Q _eval After a batch of experience is finished and network updating is carried out, the parameters of the evaluation network are assigned to the target network

Updating the corresponding function Q simultaneously _target (ii) a Update the corresponding Q according to the following formula _eval And Q _target Function:

wherein Q (s, a) represents the value of a Q function in state s and action a, the Q function comprising an evaluation function Q _eval Eyes of HemuStandard function Q _target ，

Expressed as a post-input decision state

Evaluating the network or target network in response to action a

To output of (c).

In particular, evaluating a network

The input variables of (1) include post-decision state

And action a, theta is an evaluation network

Each parameter of the neural network is input into a group

Then an output is obtained

The network is updated. Because a batch of experiences are randomly selected in the experience playback mechanism, a plurality of groups of the post-decision state and action pairs can be input

Updates to the network). Evaluating a network

And a target network

All need to be substituted into the above equation to calculate the corresponding Q _eval And Q _target A function.

The operation value function Q is _eval By evaluating networks

Updating according to the formula, and only using the motion selection (S3); function of action value Q _target By the target network

According to the above formula update, it is used as a target value only at the time of network update (S5).

Fig. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention.

In one embodiment, the updates are performed every 200 times

And Q _eval Every 1000 updates

And Q _target The average cost of 100000 state transitions was randomly counted every 10000 times to evaluate algorithm performance.

In order to more clearly illustrate the method proposed by the present invention, it is explained below by specific experimental data.

1. Simulation conditions and content

The operating system is Microsoft Windows 10 and the programming emulation language is python. The simulation adopts a group of parameters to carry out effect simulation on the algorithm and the existing common deep reinforcement learning algorithm DQN.

2. Analysis of simulation results

Fig. 4 is a comparison graph of the effect of the currently popular deep reinforcement learning algorithm DQN and the proposed algorithm. Compared with DQN, the average cost of the proposed algorithm can be reduced quickly, and the algorithm is a more efficient task unloading algorithm based on post-decision state deep reinforcement learning.

Aiming at the problem that the convergence speed of a deep learning algorithm in the task unloading method in the prior art is still limited, the invention realizes the updating of the parameter values of an evaluation network and a target network by introducing a deep neural network and playing back the parameters by experience and minimizing a loss function. The additionally adopted warm-start learning process can accelerate the parameter updating of the deep network. Secondly, a strengthening algorithm based on a post-decision state learning framework is adopted, and aiming at the problem that the traditional post-decision state learning framework needs additional prior information, an additional learning process is added to additionally estimate the state transition probability from the current state to the post-decision state, so that the requirement on prior knowledge is eliminated, the network updating can be further accelerated by utilizing the structural advantage of the post-decision state, and the algorithm performance of the invention is better than that of the traditional deep learning algorithm DQN.

It should be understood that the above description of the preferred embodiments is illustrative, and not restrictive, and that various changes and modifications may be made therein by those skilled in the art without departing from the scope of the invention as defined in the appended claims.

Claims

1. A task unloading method based on post-decision state deep reinforcement learning is characterized by comprising the following steps:

State transition probability of

Evaluating a network

Weight parameter of, target network

s5: at certain intervals, update

And corresponding evaluation function, and updating the evaluation network

Assigning the weight parameter of to the target network

Updating a corresponding target network function;

Converging to finish hot start;

Evaluation network obtained by hot start

2. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein the system state in the state set in step S1 is in the form of:

wherein s is _n The system state at time n is defined by the channel state and the state of the task buffer, and the task buffer b has i states in common, and represents that b is { b ═ b } ¹ ,b ² ,....,b ⁱ In which b ¹ 、b ⁱ Respectively representing the 1 st and ith states of the task buffer, b _n Indicates the number of tasks in the task buffer at time n, and the channel h has j states, which is expressed as h ═ h { (h) ¹ ,h ² ,....,h ^j }，h ¹ 、h ^j Respectively, 1 st and j th states of the channel, m represents the number of edge servers,

3. The task offloading method based on post-decision state deep reinforcement learning of claim 2, wherein the action of the action set in step S1 corresponds to an offloading decision, and the action taken at time n is a _n The offload decision includes three cases, the first, to store p in the task buffer _n One task is processed at the local CPU and the second is no task, when p is _n Third, p in task buffer is defined as 0 _n The tasks are simultaneously and respectively unloaded to the k with the best channel _n An edge server processes, wherein p _n For the number of tasks to be processed in the task buffer at n moments, k _n Number of servers, k, handling offload tasks for n moments _n ≤m。

4. The task offloading method based on post-decision state depth reinforcement learning of claim 1, wherein the strategy in step S3 is a greedy strategy, and specifically includes randomly selecting an action with a probability epsilon, and selecting the action value function Q in the current state with a probability 1-epsilon _eval Minimum motion.

5. The task offloading method based on post-decision state deep reinforcement learning of claim 2, wherein the post-decision state in step S4 is an intermediate state before the transition to the next state after the action is taken by the current state, and the representation form of the post-decision state is:

representing a post-decision state after taking action at n moments;

state s at the next moment _n+ 1 is represented by the form:

wherein, b _max Which represents the capacity of the task buffer and,

6. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein step S4 comprises:

Cost function c of ^k (s, a) and from

To s _n+1 Cost function of

after observing a complete state transition process, form a set of experiences of the current state, the post-decision state, the state at the next time, the action taken, and the cost of the action generation

And storing the data into an experience buffer.

7. The task offloading method based on post-decision state deep reinforcement learning of claim 6, wherein when the task is processed in the local CPU, the energy consumed by the unit task is:

e ^local ＝κL ³ ζ ³ /τ ² ,

8. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein the step S5 is updated

The method of (2) adopts a statistical updating method.

9. The method for task off-loading based on post-decision state deep reinforcement learning of claim 1, wherein when step S5 is executed, a single set of experience is randomly selected from the experience buffer

wherein γ representsA discount factor, theta is a weight parameter for evaluating the network,

representing post-decision state

State s by time n +1 _n+1 Is determined by the cost function of (a) a,

for making a decision on the state after input

And evaluating the network after action a

Output of Q _target (s ', a ') is the input state s ' and Q after action a _target The output of the function updates the evaluation network for each group of experiences according to an experience playback mechanism

Updating the corresponding function Q simultaneously _target (ii) a Update the corresponding Q according to the following formula _eval And Q _target The function is:

Expressed as a post-input decision state

Evaluating the network or target network in response to action a

To output of (c).