CN113157344A - DRL-based energy consumption perception task unloading method in mobile edge computing environment - Google Patents

DRL-based energy consumption perception task unloading method in mobile edge computing environment

Info

Publication number
CN113157344A
CN113157344A (application CN202110481249.9A)
Authority
CN
China
Prior art keywords
task
enb
tasks
drl
energy consumption
Prior art date
Legal status
Granted
Application number
CN202110481249.9A
Other languages
Chinese (zh)
Other versions
CN113157344B (en)
Inventor
胡海洋
胡宇航
李忠金
魏泽丰
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110481249.9A priority Critical patent/CN113157344B/en
Publication of CN113157344A publication Critical patent/CN113157344A/en
Application granted granted Critical
Publication of CN113157344B publication Critical patent/CN113157344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44594 Unloading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a DRL-based energy-consumption-aware task offloading method in a mobile edge computing environment. The invention designs the state space, action space and reward function of the task offloading problem in a multi-eNB MEC environment. An actor-critic framework is adopted as the basic structure of the whole DRL-E2D algorithm, i.e. it mainly comprises two neural networks, an actor and a critic. The state observed by the MD in the environment is used as the input of the actor, and the action output by the actor, together with the state, is used as the input of the critic network. The invention combines the relevant knowledge of deep reinforcement learning and incorporates the deadline constraint into the reward function, so that the MD can, according to the system state, make the optimal decision of offloading tasks to a plurality of eNBs under the constraint on task duration.

Description

DRL-based energy consumption perception task unloading method in mobile edge computing environment
Technical Field
The invention belongs to the technical field of mobile edge computing, and relates to an energy-consumption-aware task offloading decision method in mobile edge computing, in particular to a DRL-based, model-free task offloading decision method under a deadline constraint.
Background
With the development of wireless networks, more and more mobile applications are emerging and gaining enormous popularity. These applications cover a wide range of fields, such as traffic monitoring, smart home, real-time vision processing and target tracking, and often require computation-intensive resources to achieve a high quality of experience (QoE); although the performance of Mobile Devices (MDs) keeps improving, running all applications on a single MD still results in high energy consumption and delay. Mobile Edge Computing (MEC) has become a promising technology to address this problem: compared with traditional cloud computing systems that use a remote public cloud, it provides computing power within the wireless access network. The advent of MEC allows an MD to offload its computation-intensive tasks to nearby eNodeBs (eNBs) to enhance its computing power.
Task or computation offloading in the MEC environment has been studied extensively. Conventional offloading schemes are model-based, i.e. it is generally assumed that the mobile signals between the MD and the eNBs can be well modeled. However, the MEC environment is very complex and user mobility is highly dynamic, making mobility models difficult to build and predict. With the emergence of Deep Reinforcement Learning (DRL), more and more researchers apply it to task offloading in MEC. DRL has three advantages: 1) it is a model-free optimization method and does not require any model-based mathematical knowledge; 2) it can solve optimization problems in highly dynamic, time-varying systems; 3) it can handle problems with large state and action spaces. These features indicate that DRL is an ideal method for accomplishing task offloading in MEC.
However, applying DRL technology to MEC task offloading must consider and solve the following problems. First, the MEC task offloading problem with a high density of eNBs is a large discrete action space problem; for example, with 5 eNBs in the MEC and 20 tasks for the MD to offload, there are 5 million possible offloading actions. In this case, DRL based on the deep Q-network (DQN) does not work well, because it can only handle small action space problems. Second, task offloading is a discrete control problem, so continuous control methods such as the deep deterministic policy gradient (DDPG) cannot be applied directly. Moreover, all of the above methods treat the task processing time as an average performance requirement and do not consider the deadline of a task, which is unreasonable. Thus, the reward functions of current task offloading schemes focus mainly on average-based performance metrics and fail to meet the deadline constraints of tasks. The invention provides a DRL-based energy-consumption-aware task offloading method (DRL-E2D) in a mobile edge computing environment, which learns the optimal decision from an unknown environment based on deep reinforcement learning, so that the MD maximizes the task offloading utility while satisfying task deadline constraints.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention provides a DRL-based energy-consumption-aware task offloading method in a mobile edge computing environment.
The general idea of the inventive method is:
the task unloading architecture of the multi-eNB MEC environment mainly comprises an MD and a plurality of eNBs. The MD may generate a certain number of tasks in each time period, and each task may be offloaded to any eNB through the wireless network for execution. Therefore, a reasonable offloading scheme is very important, which directly affects the execution time of the task and the energy consumption of the MD. Aiming at the condition that the deadline constraint of tasks is not considered by the reward function of most of the current task unloading schemes, the invention combines the deadline constraint with the utility of MD to finish the tasks, considers the energy consumption of MD and the task discarding penalty, and designs a combined reward function for processing the optimization problem.
The invention adopts DRL-E2D algorithm to solve the problems, firstly, the state space, the action space and the reward function of the task unloading problem under the multi-eNB MEC environment are designed. An actor-critic framework is adopted as the basic structure of the whole DRL-E2D algorithm, namely two neural networks of actor and critic are mainly included. At the same time, the state observed by MD under the environment is used as the input of the operator, and the action and state of the operator output are used as the critical network input. In order to deal with the problem of dimensionality disaster of a high-dimensional discrete motion space, an embedding layer is added into an actor network and a critic network, the embedding layer is used for converting continuous motion under the space into discrete motion, and a KNN algorithm with low complexity is adopted to extract a nearest neighbor motion value.
The method comprises the following specific steps:
step (1), constructing a task unloading scene under a multi-eNB MEC environment;
step (2), constructing a joint reward function for the task offloading scenario under the deadline constraint in the multi-eNB MEC environment:
Max: R(τ) = U(τ) - P(τ) - E(τ)   (a)
subject to constraints (b)-(e): the constraint on the number of offloaded tasks, the link transmission capacity constraint, the time constraint on task offloading, and the computation capacity constraint, as detailed in step (2.4) below (their exact expressions appear only as formula images in the original publication);
step (3), in the task offloading scenario of the multi-eNB MEC environment, constructing an actor-critic deep reinforcement learning network framework;
step (4), adopting the actor-critic deep reinforcement learning network framework to optimize the joint reward function for task offloading in step (2) and obtain the optimal task offloading solution;
it is a further object of the present invention to provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
It is a further object of the present invention to provide a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method described above.
The invention has the following beneficial effects: the invention is applicable to multi-eNB environments with a high-dimensional discrete action space in mobile edge computing, such as traffic monitoring, smart home, real-time vision processing, AI applications and other application scenarios, and aims to optimize the long-term energy consumption of the MD so as to save the battery capacity of the MD. The invention combines the relevant knowledge of deep reinforcement learning and incorporates the deadline constraint into the reward function, so that the MD can, according to the system state, make the optimal decision of offloading tasks to a plurality of eNBs under the constraint on task duration.
The notation used in the present description is as follows:
n: the number of eNBs in the MEC;
T_slot: the duration of each time period;
W: the workload of a task;
D: the data size of a task;
λ: the rate at which tasks arrive at the MD;
T_DL: the deadline constraint of a task;
z(τ): the number of tasks arriving at the MD in time period τ;
η_i(τ): the data transmission rate from the MD to eNB_i in time period τ;
L_i(τ): the task queue processed by eNB_i in time period τ;
α_i(τ): the number of tasks offloaded to eNB_i in time period τ;
β_i(τ): the tasks processed and completed in time period τ;
c_i(τ): the computation capacity of the MD or eNB_i;
d_i(τ): the number of tasks dropped by each eNB and by the MD in time period τ;
T_i^tx(τ): the data transmission time for offloading a task from the MD to eNB_i in time period τ;
T_i^ex(τ): the execution time of a task on the MD or on eNB_i;
c_0(τ): the computation capacity of the MD in time period τ, determined by its own hardware;
E(τ): the total energy consumed by the MD in time period τ;
U(τ): the total utility in time period τ;
P(τ): the penalty incurred by all dropped tasks in time period τ;
R(τ): the total reward in time period τ.
drawings
FIG. 1 is an architecture for task offloading in a multi-eNB MEC environment;
FIG. 2 is an architectural diagram of DRL-E2D;
FIGS. 3(1)-(3) show the convergence comparison between the DRL-E2D of the present invention and the conventional DQN algorithm when the number of eNBs is 1, 3 and 5, respectively;
FIGS. 4(1)-(3) respectively show the reward, energy consumption and loss cost obtained by the LB, Remote, DRL-E2D, DQN and MD algorithms for different numbers of eNBs;
FIGS. 5(1)-(3) respectively show the reward, energy consumption and loss cost obtained by the LB, Remote, DRL-E2D, DQN and MD algorithms for different task workloads W;
FIGS. 6(1)-(3) respectively show the reward, energy consumption and loss cost obtained by the LB, Remote, DRL-E2D, DQN and MD algorithms for different data sizes D.
Detailed Description
The invention is further analyzed with reference to the following figures.
FIG. 2 is an architectural diagram of DRL-E2D. The DRL-based energy consumption perception task unloading method under the mobile edge computing environment comprises the following steps:
step (1), constructing a task unloading scene under a multi-eNB MEC environment; FIG. 1 is an architecture for task offloading in a multi-eNB MEC environment;
the overall architecture of a task unloading scene under a multi-eNB MEC environment mainly comprises a single MD and n base station eNBs; the MD is used for sending the designated tasks to each base station for unloading and simultaneously executing the tasks locally;
(1.1) dividing the system time into equally spaced time periods, assuming that z (τ) tasks arrive at MD at the beginning of each time period, they are considered as an independent and identically distributed sequence, and each arriving task has constant data D and execution workload W;
(1.2) define the data transmission rate η_i(τ) from the MD to the i-th base station eNB_i in time period τ:
η_i(τ) = B_i·log2[1 + SNR_i(τ)]   (1)
where B_i represents the bandwidth that eNB_i allocates to the MD, SNR_i(τ) = p^tx·g_i(τ)/σ² denotes the signal-to-noise ratio, p^tx represents the transmission power of the MD, σ² represents the white Gaussian noise power, and g_i(τ) represents the channel gain, defined as g_i(τ) = g_0·d_i(τ)^(-θ), where g_0 and θ denote the path loss constant and the path loss exponent, respectively, and d_i(τ) denotes the path distance between eNB_i and the MD in time period τ;
(1.4) define the task processing queue L_i(τ+1) of eNB_i for the (τ+1)-th time period:
L_i(τ+1) = max{L_i(τ) - β_i(τ), 0} + α_i(τ)   (2)
where α_i(τ) denotes all tasks offloaded to eNB_i, and β_i(τ) denotes the tasks processed and completed by eNB_i in the τ-th time period;
define the task processing queue L_0(τ+1) of the MD for the (τ+1)-th time period:
L_0(τ+1) = max{L_0(τ) - β_0(τ), 0} + α_0(τ)   (3)
where α_0(τ) denotes the tasks kept locally on the MD, and β_0(τ) denotes the tasks processed and completed by the MD in the τ-th time period;
(1.5) since a task can be executed either on the MD or on an eNB, its execution time and energy consumption are defined separately for the two cases;
(1.5.1) for the case where the task is executed locally on the MD, its execution time and energy consumption are defined in terms of the task workload, the computation capacity of the MD and the power consumption of its M-core CPU (the exact expressions appear only as formula images in the original publication); the CPU power consumption involves a constant related to the chip architecture, the operating frequency F(τ) of the M-core CPU and the number of cores M; c_0(τ) represents the computation capacity of the MD, denoted c_0(τ) = M·F(τ); W represents the workload of the task;
(1.5.2) for the case where the MD offloads a task to eNB_i, the data transmission time and the execution time need to be considered separately; the data transmission time is defined as:
T_i^tx(τ) = D/η_i(τ)
meanwhile, the energy consumed by the data transmission can be defined as:
E_i^tx(τ) = p^tx·T_i^tx(τ)
where D represents the data size of the task and p^tx represents the transmission power of the MD;
when eNB_i receives the task, the task is put into its own task processing queue Q_i(τ) according to a first-come, first-served rule; the task execution time is defined as:
T_i^ex(τ) = W/c_i(τ)
where W represents the workload of the task and c_i(τ) denotes the computation capacity of eNB_i;
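A small numerical sketch of the per-task quantities defined in (1.2)-(1.5.2) follows; the formulas mirror the definitions above as reconstructed, while the concrete parameter values are assumptions chosen only for illustration:

```python
import math

# Illustrative parameter values (assumed for this sketch, not taken from the patent's experiments).
B_i = 100e6          # bandwidth allocated by eNB_i to the MD [Hz]
p_tx = 0.5           # transmission power of the MD [W]
sigma2 = 1e-13       # white Gaussian noise power [W]
g_i = 1e-9           # channel gain g_i(tau)
D = 10e6 * 8         # task data size [bits] (10 MB)
W = 25e9             # task workload [CPU cycles], treating "GHz*s" as cycles
c_i = 10e9           # computation capacity of eNB_i [cycles/s]

snr = p_tx * g_i / sigma2
eta_i = B_i * math.log2(1 + snr)        # data rate eta_i(tau), Eq. (1)

T_tx = D / eta_i                        # transmission time of one task
T_ex = W / c_i                          # execution time of one task on eNB_i
E_tx = p_tx * T_tx                      # transmission energy: power x time

print(f"rate={eta_i / 1e6:.1f} Mbit/s, T_tx={T_tx:.3f} s, T_ex={T_ex:.3f} s, E_tx={E_tx:.3f} J")
```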
step (2), constructing a joint reward function for the task offloading scenario under the deadline constraint in the multi-eNB MEC environment, specifically as follows:
(2.1) define the total energy consumption E(τ) of the MD for executing tasks locally and offloading tasks to the eNBs in each time period as the sum of the local execution energy and the transmission energy, i.e.
E(τ) = E_0^ex(τ) + Σ_{i=1..n} E_i^tx(τ)
where E_0^ex(τ) denotes the energy consumed by the MD locally executing the β_0(τ) tasks, and E_i^tx(τ) denotes the total transmission energy consumed by offloading the α_i(τ) tasks to eNB_i;
(2.2) considering the task deadline constraint, define the total utility U(τ) of the MD and all base stations as the utility accumulated over all tasks completed within their deadline by the MD and by every eNB in time period τ (the exact expressions appear only as formula images in the original publication), where n represents the number of eNBs in the MEC; T(t_j) represents the waiting or execution time of the j-th task t_j and T_DL represents the deadline of the task; β_0(τ) represents the number of tasks processed by the MD in time period τ, α_i(τ) denotes the number of tasks processed and completed by the i-th base station eNB_i in time period τ, and u represents the utility obtained by the MD for successfully completing a task;
(2.3) if a task misses its deadline, the task is considered to have timed out and will be discarded by the system, thus generating a loss; the loss function P(τ) is defined in terms of the numbers of discarded tasks (the exact expression appears only as a formula image in the original publication), where d_0(τ) represents the number of tasks dropped by the MD and d_i(τ) denotes the number of tasks dropped by eNB_i;
and (2.4) based on steps (2.1) to (2.3), define the optimization problem model for task offloading in this scenario:
Max: R(τ) = U(τ) - P(τ) - E(τ)   (a)
subject to constraints (b)-(e), whose exact expressions appear only as formula images in the original publication:
formula (a) represents the optimization objective, i.e. the reward function R(τ): maximize the total utility U(τ) obtained from completed tasks while minimizing the loss function P(τ) and the energy consumption E(τ);
formula (b) represents the constraint on the number of offloaded tasks, where z(τ) represents the number of tasks arriving at the MD in time period τ;
formula (c) represents the link transmission capacity constraint between the MD and each eNB, where η_i(τ) represents the data transmission rate from the MD to eNB_i in time period τ;
formula (d) represents the time constraint on task offloading, where T_slot represents the duration of each time period;
formula (e) represents the computation capacity constraint of each base station and of the MD, where β_i(τ) represents the tasks processed and completed in time period τ and c_i(τ) denotes the computation capacity of eNB_i;
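As an illustration of the reward in formula (a), the following sketch computes R(τ) = U(τ) - P(τ) - E(τ) for one period, assuming a utility u per task completed before its deadline and a fixed per-task drop penalty (the penalty constant and the helper names are assumptions, not part of the patent):

```python
def period_reward(completion_times, deadline, energy_j, dropped,
                  utility_per_task=1.0, drop_penalty=1.0):
    """R(tau) = U(tau) - P(tau) - E(tau) for one time period.

    completion_times : waiting+execution time of every task finished this period (MD and all eNBs)
    deadline         : T_DL, the per-task deadline
    energy_j         : E(tau), local execution energy plus all transmission energy [J]
    dropped          : total number of tasks dropped by the MD and the eNBs this period
    """
    utility = utility_per_task * sum(1 for t in completion_times if t <= deadline)   # U(tau)
    penalty = drop_penalty * dropped                                                  # P(tau)
    return utility - penalty - energy_j                                               # R(tau)

# Example: 4 tasks finished, one of them late; 2 tasks dropped; 0.8 J consumed.
r = period_reward([0.5, 1.2, 2.9, 3.4], deadline=3.0, energy_j=0.8, dropped=2)
```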
step (3), in the task offloading scenario of the multi-eNB MEC environment, constructing an actor-critic deep reinforcement learning network framework;
the actor-critic deep reinforcement learning network framework takes as the input state s_τ the task processing queues of all eNB_i and of the MD in time period τ, the data transmission rates from the MD to all eNB_i, and the total number of tasks arriving at the MD; the task offloading solution together with the computation capacity of the MD forms the action space, the task offloading solution is the output, and the objective reward function of formula (a) is the reward;
the state of time period τ is s_τ = [L_0(τ), L_1(τ), ..., L_i(τ), ..., L_n(τ), η_1(τ), ..., η_i(τ), ..., η_n(τ), z(τ)]
where L_0(τ) represents the task processing queue of the MD, L_i(τ) denotes the task processing queue on eNB_i, i = 1, 2, ..., n; η_i(τ) denotes the data transmission rate between the MD and eNB_i, and z(τ) represents the total number of tasks arriving at the MD;
the vector form of each action in the action space is a_τ = [a_0(τ), ..., a_i(τ), ..., a_n(τ), c_0(τ)], i.e. each action contains the number of tasks a_0(τ) kept locally on the MD, the number of tasks a_i(τ) offloaded to each eNB_i, and the computation capacity c_0(τ) of the MD;
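For illustration, the state and action vectors above can be assembled as follows (the NumPy layout and helper names are assumptions made for this sketch):

```python
import numpy as np

def build_state(L_md, L_enb, eta, z):
    """s_tau = [L_0, L_1..L_n, eta_1..eta_n, z] for n eNBs."""
    return np.concatenate(([L_md], L_enb, eta, [z])).astype(np.float32)

def build_action(a_local, a_enb, c_md):
    """a_tau = [a_0, a_1..a_n, c_0]: tasks kept locally, tasks per eNB, MD capacity."""
    return np.concatenate(([a_local], a_enb, [c_md])).astype(np.float32)

# Example with n = 3 eNBs and z(tau) = 6 arriving tasks.
s = build_state(L_md=2, L_enb=[1, 0, 4], eta=[12e6, 8e6, 20e6], z=6)
a = build_action(a_local=2, a_enb=[1, 0, 3], c_md=8e9)
```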
The actor-critic deep reinforcement learning network framework adopts an actor network and a critic network;
the actor network adopts a [100, n+1] structure with the ReLU activation function; the last layer is the action layer, which outputs n+1 probability values for the different actions; the actor network policy function is π(s_τ | θ^μ), which gives the action value obtained for state s_τ, where θ^μ is the actor network weight parameter;
the critic network structure is the same as that of the actor network; the critic network evaluation function is Q(s_τ, a_τ | θ^Q), which gives the expected action value obtained after taking action a_τ in state s_τ, where θ^Q is the critic network weight parameter;
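A minimal PyTorch-style sketch of such an actor and critic is given below; the hidden size of 100 follows the [100, n+1] structure above, while the choice of PyTorch and the single Q-value critic head are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """pi(s | theta_mu): maps the observed state to a proto-action vector."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),   # action layer: one value per action dimension
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a | theta_Q): scores a (state, action) pair."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```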
step (4), the actor-critic deep reinforcement learning network framework is adopted to optimize the joint reward function for task offloading in step (2), and the optimal task offloading solution is obtained;
(4.1) randomly initialize the weights θ^μ and θ^Q of the actor network and the critic network, copy these weights to the target actor network and the target critic network respectively, set the capacity of the experience replay pool to D, D > 0, and empty the experience replay pool;
θ^{μ'} ← θ^μ, θ^{Q'} ← θ^Q
where θ^{μ'} and θ^{Q'} represent the weights of the target actor network and the target critic network, respectively;
(4.2) initializing an MD system environment and distributing tasks to the MD to obtain an initial state value under the current round; the method comprises the following specific steps:
4.2.1 initializing MD system environment and generating a random noise generator N;
4.2.2 allocate z(τ) tasks to the MD, where τ = 0 denotes the initial time period;
4.2.3 obtain the initial state value observed by the MD from the system environment; since no task is running and the MD has not yet offloaded any task to an eNB, the local state of the MD at τ = 0 is:
s_τ = [L_0(τ), η_0(τ), z(τ)]   (5)
(4.3) run the actor-critic deep reinforcement learning network framework to obtain the optimal-value action for the state in each time period; the specific steps are as follows:
4.3.1 the actor network outputs a prototype action according to the current time period state s_τ; the prototype action enters the embedding layer for mapping, and the KNN algorithm is used to extract the k nearest-neighbor action values; the specific steps are as follows:
4.3.1.1 input state s_τ into the actor network; the actor network obtains an output π(s_τ | θ^μ) according to its policy π, and, to increase exploration randomness, exploration noise N_τ is added to obtain the prototype action a_p, i.e. a_p = π(s_τ | θ^μ) + N_τ;
4.3.1.2 in order to map the action value a_p in the continuous space to action values a_p' in the discrete space, an embedding layer is arranged between the actor and the critic; the obtained a_p is input into the embedding layer, which outputs d mapped values a_p'; from the d mapped action values a_p', the KNN algorithm extracts the k-nearest-neighbor action set A_k, measured by the Euclidean distance between actions, i.e. A_k = knn(a_p'); k may be chosen as 10;
4.3.2 the critic network takes all the nearest-neighbor action values obtained in step 4.3.1.2 and screens them to obtain the optimal-value action; after the MD executes the optimal-value action, the current transition is saved to the experience replay pool; the specific steps are as follows:
4.3.2.1 input the actions in A_k into the current critic network respectively; according to its evaluation function Q(s_τ, a | θ^Q), the critic outputs the value of each candidate action in A_k under the current state, and the action a_x with the maximum value is selected as the predicted action of the MD, i.e. a_τ = argmax_{a∈A_k} Q(s_τ, a | θ^Q);
4.3.2.2 the MD executes the task offloading decision according to action a_τ, obtains the return r_τ according to the result of the action execution, and observes the new state s_{τ+1}, forming a new sample [s_τ, a_τ, r_τ, s_{τ+1}] which is stored in the experience replay pool;
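Steps 4.3.1.1-4.3.2.1 (prototype action with exploration noise, KNN mapping into the discrete space, and critic-based selection of the best neighbour) can be sketched as below; the brute-force neighbour search, the toy action table and the dummy actor/critic are illustrative assumptions rather than the patented implementation:

```python
import numpy as np

def knn_actions(proto_action, discrete_actions, k=10):
    """Return the k discrete actions closest (Euclidean) to the continuous proto-action."""
    dists = np.linalg.norm(discrete_actions - proto_action, axis=1)
    idx = np.argsort(dists)[:k]
    return discrete_actions[idx]

def select_action(actor, critic, state, discrete_actions, noise_std=0.1, k=10):
    """Proto-action from the actor plus noise, then the critic picks the best of k neighbours."""
    proto = actor(state) + np.random.normal(0.0, noise_std, size=discrete_actions.shape[1])
    candidates = knn_actions(proto, discrete_actions, k)
    q_values = [critic(state, a) for a in candidates]      # Q(s, a | theta_Q) for each neighbour
    return candidates[int(np.argmax(q_values))]

# Toy usage: an enumerated discrete action table and dummy actor/critic callables.
actions = np.array([[a0, a1, c] for a0 in range(3) for a1 in range(3) for c in (1.0, 2.0)])
dummy_actor = lambda s: np.array([1.0, 1.0, 1.5])
dummy_critic = lambda s, a: -np.sum((a - 1.0) ** 2)
best = select_action(dummy_actor, dummy_critic, state=np.zeros(4), discrete_actions=actions)
```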
4.3.3 updating network parameters; the method comprises the following specific steps:
1) randomly sample m samples [s_τ, a_τ, r_τ, s_{τ+1}] from the experience replay pool and send them to the current actor network, the current critic network, the target actor network and the target critic network;
2) the target actor network outputs action a'_{τ+1} from the next time period state s_{τ+1}; the target critic network obtains the current target expected value y_τ from the state s_{τ+1} and the action a'_{τ+1} output by the target actor network, i.e. y_τ = r_τ + γ·Q'(s_{τ+1}, a'_{τ+1} | θ^{Q'}), and the current target expected value y_τ is delivered to the mean square error loss function, where Q'(· | θ^{Q'}) denotes the target critic network and γ denotes the discount factor;
3) the current critic network outputs the evaluation function Q(s_τ, a_τ | θ^Q) from the state s_τ, the action a_τ and the reward r_τ, which enters the sampled policy gradient and the mean square error loss function;
4) update all weights θ^Q and θ^μ of the actor network and the critic network through back-propagation of the neural networks; the mean square error loss function is L(θ^Q) = (1/m)·Σ_τ (y_τ - Q(s_τ, a_τ | θ^Q))², and the sampled policy gradient is ∇_{θ^μ}J ≈ (1/m)·Σ_τ ∇_a Q(s_τ, a | θ^Q)|_{a=π(s_τ)} · ∇_{θ^μ}π(s_τ | θ^μ);
5) update the network parameters of the target actor network and the target critic network, namely:
θ^{Q'} ← σθ^Q + (1-σ)θ^{Q'}
θ^{μ'} ← σθ^μ + (1-σ)θ^{μ'}
where σ is the network update weight, set to 0.1;
6) the actor network obtains the next time period state s_{τ+1} from the experience replay pool, and steps 1) to 6) are repeated up to the maximum time period;
4.3.4 repeat steps 4.3.1-4.3.3 until the maximum number of rounds is reached to obtain stable model parameters.
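A compact sketch of the updates in 4.3.3, in the spirit of DDPG as described above, is given below; the optimizer handling, the tensor shapes (r of shape [m, 1]) and the Actor/Critic modules from the earlier sketch are assumptions:

```python
import torch
import torch.nn.functional as F

def update(actor, critic, target_actor, target_critic, batch,
           actor_opt, critic_opt, gamma=0.99, sigma=0.1):
    """One update step: critic MSE loss, sampled policy gradient, soft target update."""
    s, a, r, s_next = batch                                    # tensors of shape [m, ...]

    # Step 2): target value y = r + gamma * Q'(s', pi'(s'))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Steps 3)-4): critic update via the mean square error loss
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 4): actor update via the sampled policy gradient (maximize Q => minimize -Q)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5): soft update, theta' <- sigma*theta + (1-sigma)*theta'
    # (targets are typically initialised as deep copies of the online networks)
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - sigma).add_(sigma * p.data)
```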
In order to verify the feasibility of the method, it is compared experimentally with three conventional algorithms, LB, Remote and Local, and with the reinforcement learning network DQN.
In this experiment, the computation capacity of each eNB is c_i(τ) = 10 GHz, the transmission power of the MD is set to a fixed value (given only as a formula image in the original publication), the total working time is 1000 s, the size of each time period is T_slot = 1 s, each task has the same workload W = 25 GHz·s and data size D = 10 MB, and the deadline T_DL of each time period τ is set to 3 s. When a task is completed within the deadline, the MD obtains utility u = 1; the bandwidth of the wireless network is B = 100 MHz, the white Gaussian noise is σ² = -174 dBm/Hz, the path loss constant is set to a fixed value (given only as a formula image in the original), the path loss exponent is θ = 4, and the distance d of each eNB from the MD is 1000.
The number of CPU cores of the MD is M = 4 and the operating frequency of each core is 2.0 GHz, so the computation capacity of the MD is c_0(τ) = M·F(τ) = 8 GHz.
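These experimental settings can be gathered into a single configuration object, as sketched below; the dictionary keys are assumptions, and the MD transmission power and path loss constant are omitted because they are given only as formula images in the original:

```python
EXPERIMENT_CONFIG = {
    "enb_capacity_hz": 10e9,        # c_i(tau) = 10 GHz per eNB
    "total_time_s": 1000,           # total working time
    "t_slot_s": 1.0,                # duration of each time period
    "workload_ghz_s": 25.0,         # W, identical for every task
    "data_size_mb": 10.0,           # D
    "deadline_s": 3.0,              # T_DL
    "utility_per_task": 1.0,        # u, gained when a task meets its deadline
    "bandwidth_hz": 100e6,          # B
    "noise_dbm_per_hz": -174,       # sigma^2
    "path_loss_exponent": 4,        # theta
    "enb_distance": 1000,           # d, distance of each eNB from the MD
    "md_cores": 4,                  # M
    "core_freq_hz": 2.0e9,          # F(tau)
}
EXPERIMENT_CONFIG["md_capacity_hz"] = (
    EXPERIMENT_CONFIG["md_cores"] * EXPERIMENT_CONFIG["core_freq_hz"]   # c_0 = M * F = 8 GHz
)
```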
The performance of each algorithm under different conditions is compared using three indicators: the reward, the energy consumption, and the loss cost generated by the MD discarding tasks.
1. Convergence comparison
Since the invention applies the KNN algorithm to map actions from the continuous space to the discrete space, the influence of different values of k in the KNN algorithm on convergence is considered in the experiments, where k = 1 extracts only one action from the prototype action, and k = 1% extracts the 1% of discrete actions nearest to the prototype action. In fig. 3(1)-(3), where the number of eNBs is 1, 3 and 5 respectively, a convergence comparison experiment is performed between the DRL-E2D provided by the present invention and the conventional DQN algorithm, with the upper limit of the number of rounds set to 250.
From fig. 3(2)-(3) it can be seen that DRL-E2D performs better with k = 1% than with k = 1, because a larger k is more beneficial for the neural network to infer a better next action based on its own policy; it can also be seen that, regardless of the number of eNBs, the convergence performance of DQN is consistently worse than that of DRL-E2D over the same number of rounds.
2. Effect of the number of eNBs
Fig. 4(1) shows that as the number of eNBs increases, the rewards obtained by LB, Remote, DRL-E2D and DQN increase, because these algorithms can benefit from offloading tasks to the eNBs; furthermore, with more eNBs, the MD gains more reward by completing more tasks and consuming less energy. Fig. 4(2) shows that the energy consumption of DRL-E2D remains essentially constant regardless of the number of eNBs, since the MD tends to give up tasks rather than execute them in order to obtain the maximum reward. Fig. 4(3) shows that as the number of eNBs increases, the loss of the remaining algorithms, with the exception of the Local algorithm, decreases accordingly.
3. Effect of task workload W
Fig. 5(1) shows that as W increases, the rewards obtained by all algorithms gradually decrease, because for a fixed task arrival rate λ, a larger W requires more computing resources, resulting in higher energy consumption, fewer completed tasks and lower reward; however, DRL-E2D consistently performs better than the other algorithms, indicating that it adapts better to changes in W. Fig. 5(2) shows that Remote has the lowest energy consumption, independent of W. The energy consumption of LB, DRL-E2D and DQN increases with W, since a larger W requires more computing resources and time, resulting in higher energy consumption. Fig. 5(3) shows that as W increases, the loss of all algorithms increases, with the loss of Local varying most drastically, since it discards more tasks than the other algorithms.
4. Influence of data size D
Fig. 6(1) shows that, except for Local, the rewards obtained by the remaining algorithms increase as D increases, because DRL-E2D, LB, Remote and DQN apply task offloading policies; thus, with a larger D, the MD spends more energy offloading tasks to the eNBs. Local does not employ task offloading, so its reward is independent of D.
Similarly, the energy consumption and the loss cost of the MD in fig. 6(2)-(3) also gradually increase as D increases. In this case, since the Remote algorithm offloads all tasks to the eNBs, the number of discarded tasks is larger and its loss cost changes faster.
In conclusion, the DRL-E2D algorithm provided by the invention performs well under various conditions.

Claims (10)

1. The DRL-based energy consumption perception task unloading method under the mobile edge computing environment is characterized by comprising the following steps of:
step (1), constructing a task unloading scene under a multi-eNB MEC environment;
step (2), constructing a joint reward function for the task offloading scenario under the deadline constraint in the multi-eNB MEC environment, specifically as follows:
(2.1) define the total energy consumption E(τ) of the MD for executing tasks locally and offloading tasks to the eNBs in each time period as the sum of the local execution energy and the transmission energy, i.e.
E(τ) = E_0^ex(τ) + Σ_{i=1..n} E_i^tx(τ)
where E_0^ex(τ) denotes the energy consumed by the MD locally executing the β_0(τ) tasks, and E_i^tx(τ) denotes the total transmission energy consumed by offloading the α_i(τ) tasks to eNB_i;
(2.2) considering the task deadline constraint, define the total utility U(τ) of the MD and all base stations as the utility accumulated over all tasks completed within their deadline by the MD and by every eNB in time period τ (the exact expressions appear only as formula images in the original publication), where n represents the number of eNBs in the MEC; T(t_j) represents the waiting or execution time of the j-th task t_j and T_DL represents the deadline of the task; β_0(τ) represents the number of tasks processed by the MD in time period τ, α_i(τ) denotes the number of tasks processed and completed by the i-th base station eNB_i in time period τ, and u represents the utility obtained by the MD for successfully completing a task;
(2.3) if a task misses its deadline, the task is considered to have timed out and will be discarded by the system, thus generating a loss; the loss function P(τ) is defined in terms of the numbers of discarded tasks (the exact expression appears only as a formula image in the original publication), where d_0(τ) represents the number of tasks dropped by the MD and d_i(τ) denotes the number of tasks dropped by eNB_i;
and (2.4) based on steps (2.1) to (2.3), define the optimization problem model for task offloading in this scenario:
Max: R(τ) = U(τ) - P(τ) - E(τ)   (a)
subject to constraints (b)-(e): the constraint on the number of offloaded tasks, the link transmission capacity constraint between the MD and each eNB, the time constraint on task offloading, and the computation capacity constraint of each base station and of the MD (their exact expressions appear only as formula images in the original publication);
step (3), in the task offloading scenario of the multi-eNB MEC environment, take as the input state s_τ the task processing queues of all eNB_i and of the MD in time period τ, the data transmission rates from the MD to all eNB_i, and the total number of tasks arriving at the MD; take the task offloading solution together with the computation capacity of the MD as the action space, take the task offloading solution as the output and the objective reward function of formula (a) as the reward, and construct the actor-critic deep reinforcement learning network framework;
the state of time period τ is s_τ = [L_0(τ), L_1(τ), ..., L_i(τ), ..., L_n(τ), η_1(τ), ..., η_i(τ), ..., η_n(τ), z(τ)];
where L_0(τ) represents the task processing queue of the MD, L_i(τ) denotes the task processing queue on eNB_i, i = 1, 2, ..., n; η_i(τ) denotes the data transmission rate between the MD and eNB_i, and z(τ) represents the total number of tasks arriving at the MD;
the vector form of each action in the action space is a_τ = [a_0(τ), ..., a_i(τ), ..., a_n(τ), c_0(τ)], i.e. each action contains the number of tasks a_0(τ) kept locally on the MD, the number of tasks a_i(τ) offloaded to each eNB_i, and the computation capacity c_0(τ) of the MD;
And step (4), adopting the actor-critic deep reinforcement learning network framework to optimize the joint reward function for task offloading in step (2) and obtain the optimal task offloading solution.
2. The method for energy consumption aware task offloading based on DRL in a mobile edge computing environment according to claim 1, wherein the step (1) is specifically as follows:
the overall architecture of the task offloading scenario in the multi-eNB MEC environment mainly comprises a single MD and n base stations (eNBs); the MD sends designated tasks to each base station for offloaded execution while also executing tasks locally;
(1.1) divide the system time into equally spaced time periods; assume that z(τ) tasks arrive at the MD at the beginning of each time period, that they form an independent and identically distributed sequence, and that each arriving task has a constant data size D and execution workload W;
(1.2) define the data transmission rate η_i(τ) from the MD to the i-th base station eNB_i in time period τ:
η_i(τ) = B_i·log2[1 + SNR_i(τ)]   (1)
where B_i represents the bandwidth that eNB_i allocates to the MD, SNR_i(τ) = p^tx·g_i(τ)/σ² denotes the signal-to-noise ratio, p^tx represents the transmission power of the MD, σ² represents the white Gaussian noise power, and g_i(τ) represents the channel gain, defined as g_i(τ) = g_0·d_i(τ)^(-θ), where g_0 and θ denote the path loss constant and the path loss exponent, respectively, and d_i(τ) denotes the path distance between eNB_i and the MD in time period τ;
(1.4) define the task processing queue L_i(τ+1) of eNB_i for the (τ+1)-th time period:
L_i(τ+1) = max{L_i(τ) - β_i(τ), 0} + α_i(τ)   (2)
where α_i(τ) denotes all tasks offloaded to eNB_i, and β_i(τ) denotes the tasks processed and completed by eNB_i in the τ-th time period;
define the task processing queue L_0(τ+1) of the MD for the (τ+1)-th time period:
L_0(τ+1) = max{L_0(τ) - β_0(τ), 0} + α_0(τ)   (3)
where α_0(τ) denotes the tasks kept locally on the MD, and β_0(τ) denotes the tasks processed and completed by the MD in the τ-th time period;
(1.5) since a task can be executed either on the MD or on an eNB, its execution time and energy consumption are defined separately for the two cases.
3. The method for energy consumption aware task offloading based on DRL in mobile edge computing environment according to claim 2, wherein the step (1.5) is specifically as follows:
(1.5.1) for the case where the task is executed locally on the MD, its execution time and energy consumption are defined in terms of the task workload, the computation capacity of the MD and the power consumption of its M-core CPU (the exact expressions appear only as formula images in the original publication); the CPU power consumption involves a constant related to the chip architecture, the operating frequency F(τ) of the M-core CPU and the number of cores M; c_0(τ) represents the computation capacity of the MD, denoted c_0(τ) = M·F(τ); W represents the workload of the task;
(1.5.2) for the case where the MD offloads a task to eNB_i, the data transmission time and the execution time need to be considered separately; the data transmission time is defined as:
T_i^tx(τ) = D/η_i(τ)   (6)
meanwhile, the energy consumed by the data transmission can be defined as:
E_i^tx(τ) = p^tx·T_i^tx(τ)
where D represents the data size of the task and p^tx represents the transmission power of the MD;
when eNB_i receives the task, the task is put into its own task processing queue Q_i(τ) according to a first-come, first-served rule; the task execution time is defined as:
T_i^ex(τ) = W/c_i(τ)
where W represents the workload of the task and c_i(τ) denotes the computation capacity of eNB_i.
4. The method for energy consumption aware task offloading based on DRL in a mobile edge computing environment according to claim 1, wherein the fourth step is as follows:
(4.1) randomly initialize the weights θ^μ and θ^Q of the actor network and the critic network, copy these weights to the target actor network and the target critic network respectively, set the capacity of the experience replay pool to D, D > 0, and empty the experience replay pool;
θ^{μ'} ← θ^μ, θ^{Q'} ← θ^Q
where θ^{μ'} and θ^{Q'} represent the weights of the target actor network and the target critic network, respectively;
(4.2) initializing an MD system environment and distributing tasks to the MD to obtain an initial state value under the current round;
and (4.3) run the actor-critic deep reinforcement learning network framework to obtain the optimal-value action for the state in each time period.
5. The method for DRL-based energy consumption aware task offloading in a mobile edge computing environment according to claim 4, wherein the step (4.2) is specifically as follows:
4.2.1 initializing MD system environment and generating a random noise generator N;
4.2.2 allocate z(τ) tasks to the MD, where τ = 0 denotes the initial time period;
4.2.3 obtain the initial state value observed by the MD from the system environment; since no task is running and the MD has not yet offloaded any task to an eNB, the local state of the MD at τ = 0 is:
s_τ = [L_0(τ), η_0(τ), z(τ)]   (5).
6. the method for DRL-based energy consumption aware task offloading in a mobile edge computing environment according to claim 4, wherein the step (4.3) is specifically as follows:
4.3.1 the actor network outputs a prototype action according to the current time period state s_τ; the prototype action enters the embedding layer for mapping, and the KNN algorithm is used to extract the k nearest-neighbor action values;
4.3.2 the critic network takes all the nearest-neighbor action values obtained in step 4.3.1 and screens them to obtain the optimal-value action; after the MD executes the optimal-value action, the current transition is saved to the experience replay pool;
4.3.3 update the network parameters;
4.3.4 repeat steps 4.3.1-4.3.3 until the maximum number of rounds is reached to obtain stable model parameters.
7. The method for DRL-based energy consumption aware task offloading in a mobile edge computing environment according to claim 6, wherein the step (4.3.1) is specifically as follows:
4.3.1.1 input state s_τ into the actor network; the actor network obtains an output π(s_τ | θ^μ) according to its policy π, and, to increase exploration randomness, exploration noise N_τ is added to obtain the prototype action a_p, i.e. a_p = π(s_τ | θ^μ) + N_τ;
4.3.1.2 in order to map the action value a_p in the continuous space to action values a_p' in the discrete space, an embedding layer is arranged between the actor and the critic; the obtained a_p is input into the embedding layer, which outputs d mapped values a_p'; from the d mapped action values a_p', the KNN algorithm extracts the k-nearest-neighbor action set A_k, measured by the Euclidean distance between actions, i.e. A_k = knn(a_p').
8. The method for energy consumption aware task offloading based on DRL in mobile edge computing environment according to claim 1, wherein the step (4.3.2) is specifically as follows:
4.3.2.1 input the actions in A_k into the current critic network respectively; according to its evaluation function Q(s_τ, a | θ^Q), the critic outputs the value of each candidate action in A_k under the current state, and the action a_x with the maximum value is selected as the predicted action of the MD, i.e. a_τ = argmax_{a∈A_k} Q(s_τ, a | θ^Q);
4.3.2.2 the MD executes the task offloading decision according to action a_τ, obtains the return r_τ according to the result of the action execution, and observes the new state s_{τ+1}, forming a new sample [s_τ, a_τ, r_τ, s_{τ+1}] which is stored in the experience replay pool.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-8.
10. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-8.
CN202110481249.9A 2021-04-30 2021-04-30 DRL-based energy consumption perception task unloading method in mobile edge computing environment Active CN113157344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110481249.9A CN113157344B (en) 2021-04-30 2021-04-30 DRL-based energy consumption perception task unloading method in mobile edge computing environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110481249.9A CN113157344B (en) 2021-04-30 2021-04-30 DRL-based energy consumption perception task unloading method in mobile edge computing environment

Publications (2)

Publication Number Publication Date
CN113157344A true CN113157344A (en) 2021-07-23
CN113157344B CN113157344B (en) 2022-06-14

Family

ID=76872731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110481249.9A Active CN113157344B (en) 2021-04-30 2021-04-30 DRL-based energy consumption perception task unloading method in mobile edge computing environment

Country Status (1)

Country Link
CN (1) CN113157344B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590229A (en) * 2021-08-12 2021-11-02 中山大学 Industrial Internet of things graph task unloading method and system based on deep reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109362064A (en) * 2018-09-14 2019-02-19 重庆邮电大学 The task buffer allocation strategy based on MEC in mobile edge calculations network
CN110262845A (en) * 2019-04-30 2019-09-20 北京邮电大学 The enabled distributed computing task discharging method of block chain and system
CN112015481A (en) * 2020-06-04 2020-12-01 湖南大学 Multi-Agent reinforcement learning-based mobile edge calculation unloading algorithm
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109362064A (en) * 2018-09-14 2019-02-19 重庆邮电大学 The task buffer allocation strategy based on MEC in mobile edge calculations network
CN110262845A (en) * 2019-04-30 2019-09-20 北京邮电大学 The enabled distributed computing task discharging method of block chain and system
CN112015481A (en) * 2020-06-04 2020-12-01 湖南大学 Multi-Agent reinforcement learning-based mobile edge calculation unloading algorithm
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
胡海洋等 (Hu Haiyang et al.): "Multi-objective optimization method for task scheduling in a mobile cloud computing environment" (移动云计算环境下任务调度的多目标优化方法), 《计算机研究与发展》 (Journal of Computer Research and Development), 15 September 2017 (2017-09-15) *
詹文翰 (Zhan Wenhan): "Research on computation offloading scheduling and resource management strategy optimization in mobile edge networks" (移动边缘网络计算卸载调度与资源管理策略优化研究), 《CNKI中国学术文献网络出版总库博士论文》 (CNKI doctoral dissertation database), 15 July 2020 (2020-07-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590229A (en) * 2021-08-12 2021-11-02 中山大学 Industrial Internet of things graph task unloading method and system based on deep reinforcement learning
CN113590229B (en) * 2021-08-12 2023-11-10 中山大学 Industrial Internet of things graph task unloading method and system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113157344B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110928654B (en) Distributed online task unloading scheduling method in edge computing system
CN112380008B (en) Multi-user fine-grained task unloading scheduling method for mobile edge computing application
CN111556461A (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
US11784931B2 (en) Network burst load evacuation method for edge servers
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN111813506A (en) Resource sensing calculation migration method, device and medium based on particle swarm algorithm
CN112214301B (en) Smart city-oriented dynamic calculation migration method and device based on user preference
CN114285853A (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
EP4024212A1 (en) Method for scheduling interference workloads on edge network resources
CN113485826A (en) Load balancing method and system for edge server
CN113573363A (en) MEC calculation unloading and resource allocation method based on deep reinforcement learning
CN113590279A (en) Task scheduling and resource allocation method for multi-core edge computing server
CN116112488A (en) Fine-grained task unloading and resource allocation method for MEC network
CN113157344B (en) DRL-based energy consumption perception task unloading method in mobile edge computing environment
CN113946423A (en) Multi-task edge computing scheduling optimization method based on graph attention network
CN116009990B (en) Cloud edge collaborative element reinforcement learning computing unloading method based on wide attention mechanism
CN117579701A (en) Mobile edge network computing and unloading method and system
CN116193516A (en) Cost optimization method for efficient federation learning in Internet of things scene
CN110768827A (en) Task unloading method based on group intelligent algorithm
Yao et al. Performance Optimization in Serverless Edge Computing Environment using DRL-Based Function Offloading
CN114520772B (en) 5G slice resource scheduling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant