CN114938381B - D2D-MEC unloading method based on deep reinforcement learning - Google Patents

D2D-MEC unloading method based on deep reinforcement learning

Info

Publication number
CN114938381B
CN114938381B (application CN202210771544.2A)
Authority
CN
China
Prior art keywords
mec
network
unloading
user
task
Prior art date
Legal status
Active
Application number
CN202210771544.2A
Other languages
Chinese (zh)
Other versions
CN114938381A (en)
Inventor
施苑英
王选宏
石薇
蒋军敏
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202210771544.2A
Publication of CN114938381A
Application granted
Publication of CN114938381B


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/104 - Peer-to-peer [P2P] networks
    • H04L67/1074 - Peer-to-peer [P2P] networks for supporting data block transmission mechanisms
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/70 - Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention belongs to the field of computation offloading methods, and provides a D2D-MEC unloading method based on deep reinforcement learning and a computer program product, which are used for solving the technical problem that in practical applications only one of MEC offloading and D2D offloading is selected, so that the advantages of the two cannot both be obtained and the long-term service overhead of the requesting user cannot be minimized. Combining MEC offloading and D2D offloading allows the mobile device to execute computing tasks locally, offload part of the tasks to an MEC server, or offload part of the tasks to a nearby D2D device for processing. In order to realize joint optimization of the offloading decision, power control and computing resource allocation, with the goal of minimizing energy consumption and service delay, a deep reinforcement learning algorithm based on proximal policy optimization (PPO) is adopted to obtain the optimal control policy.

Description

D2D-MEC unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of computation offloading methods, and particularly relates to a D2D-MEC unloading method based on deep reinforcement learning and a computer program product.
Background
With the rapid development of 5G technology, various new services represented by face recognition, augmented reality and natural language processing are continuously emerging. These services often have computationally intensive and latency sensitive characteristics, which pose a significant challenge for mobile devices with limited computing power and battery power.
Mobile Edge Computing (MEC) is a novel computing paradigm proposed by the European Telecommunications Standards Institute. It sinks cloud computing services from the cloud to the edge of the mobile network, so that users can offload computing tasks to an edge server for execution. This not only compensates for the shortcomings of terminal devices in computing performance and energy efficiency, but also meets the ultra-low delay requirements of new services and gives users a better service experience. Device-to-Device (D2D) communication is a local direct communication technology that does not require a central infrastructure. Using D2D technology, mobile devices can cooperatively share computing resources, so that resource-constrained devices can offload computing tasks to neighboring devices with idle resources for processing. Compared with MEC offloading, this short-range offloading mode can further reduce the data transmission delay and the energy consumption of mobile devices, and also helps relieve the pressure that large-scale concurrent offloading places on network communication and the MEC server. Therefore, D2D computation offloading has received increasing attention in recent years as an effective complement to MEC offloading.
However, in practical applications, generally only one of MEC offloading and D2D offloading is selected, so the advantages of both cannot be obtained and the long-term service overhead of the requesting user cannot be minimized.
Disclosure of Invention
The invention provides a D2D-MEC unloading method based on deep reinforcement learning and a computer program product, which are used for solving the technical problem that in practical applications only one of MEC offloading and D2D offloading is selected, so that the advantages of the two cannot both be obtained and the long-term service overhead of the requesting user cannot be minimized.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
the D2D-MEC unloading method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, establishing a D2D-MEC edge computing system model, and executing steps S2 to S8 according to the D2D-MEC edge computing system model;
s2, establishing a task model and determining task queuing delay;
s3, determining the transmission rate of the sub-channels in the network according to the Shannon formula;
s4, determining the processing time delay and the energy consumption of each unloading mode according to the task model, the task queuing time delay and the transmission rate of the sub-channel, and obtaining the total service time delay of the task and the total energy consumption generated by requesting a user to process the task;
s5, setting an overhead function according to total energy consumption and total service time delay generated by a request user for processing tasks;
s6, constructing a multi-agent deep reinforcement learning model, wherein the multi-agent deep reinforcement learning model comprises an Actor network and a Critic network;
s6.1, determining local state information observed by a requesting user at the beginning of each time slot;
s6.2, splicing the local observation state of any request user with the local observation states of all other request users, and deleting repeated information to obtain the global state of the request user;
s6.3, determining, according to the overhead function, a reward function indicating the environmental feedback reward and the cumulative discount reward obtained after the requesting user executes an action;
s6.4, determining optimization targets of an Actor network and a Critic network;
s7, training an Actor network and a Critic network, and optimizing parameters;
s8, selecting an unloading mode when the D2D-MEC network is unloaded through the trained Actor network, determining an unloading proportion, and distributing computing resources and transmitting power.
Further, the step S2 specifically includes:
s2.1, establishing a triplet task model:
T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>
wherein D_{i,j} represents the amount of computation data of the j-th task that requesting user i needs to process; C_{i,j} represents the number of CPU cycles required by the j-th task that requesting user i needs to process; a_{i,j} represents the arrival time of the j-th task that requesting user i needs to process; T_{i,j} represents the j-th task that requesting user i needs to process;
s2.2, determining the queuing delay q_{i,j} of task T_{i,j} by the following formula:
q_{i,j} = b_{i,j} - a_{i,j}
wherein b_{i,j} is the time at which requesting user i starts processing its j-th task.
Further, in step S3, cellular communication between the mobile user and the base station in the network and D2D communication between the mobile users all adopt an orthogonal frequency division multiple access mode, and the D2D communication and the cellular communication do not multiplex spectrum resources;
in step S3, specifically, the transmission rate of the sub-channel is determined by the following formula:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k; h_{i,k}(t) represents the subchannel gain between requesting user i and offload target k; p_{i,k}(t) represents the subchannel transmit power between requesting user i and offload target k; σ^2 represents the noise power; W represents the bandwidth of the subchannel.
Further, in step S4, the offloading modes include a mode 1, a mode 2 and a mode 3, where the mode 1 is that tasks are all executed locally by a user, the mode 2 is that tasks are partially or fully offloaded to an MEC server for processing, and the mode 3 is that tasks are partially or fully offloaded to an adjacent D2D service user for processing;
the step S4 specifically comprises the following steps:
s4.1, determining processing time delay and energy consumption of three unloading modes through local calculation time delay and energy consumption, MEC unloading time delay and energy consumption and D2D unloading time delay and energy consumption in the three unloading modes:
local computation delay and energy consumption:
the local computation delay L_i^loc(t) is:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
the local computation energy consumption E_i^loc(t) is:
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
wherein f_i^loc(t) is the CPU frequency of user i in time slot t, and κ_i is the effective switched capacitance; α_i(t) is the task offload ratio, with α_i(t) = 0 for mode 1 and α_i(t) ∈ (0,1] for mode 2 and mode 3;
MEC offloading delay and energy consumption:
the MEC offloading delay includes the transmission delay L_{i,0}^tr(t) and the computation delay L_{i,0}^c(t):
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein r_{i,0}(t) denotes r_{i,k}(t) with k taken as 0, f_mec is the CPU frequency of the MEC server, and u_mec(t) denotes the number of MEC offloading users in time slot t;
the MEC offloading energy consumption E_i^mec(t) is:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
wherein p_{i,0}(t) denotes p_{i,k}(t) with k taken as 0;
D2D offloading delay and energy consumption:
the D2D offloading delay includes the transmission delay L_{i,k}^tr(t) and the computation delay L_{i,k}^c(t):
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein f_k^sd(t) is the CPU frequency of service user k in time slot t;
the D2D offloading energy consumption E_i^d2d(t) is:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
s4.2, calculating the total service delay L_{i,j} of task T_{i,j}, which consists of the processing delay and the queuing delay:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
(the second term inside max{·} is absent in mode 1, and k = 0 in mode 2);
s4.3, calculating the total energy consumption E_{i,j} generated by requesting user i for processing task T_{i,j}:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
(the second term is absent in mode 1, and k = 0 in mode 2).
Further, in step S5, the cost function c_{i,j} is:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, 1{·} is the indicator function, and τ_max is the maximum service delay that task T_{i,j} can tolerate.
Further, in step S6.1, the local observation state o_i(t) includes the channel gain h_i(t) between the requesting user and each candidate offload target, the computation load history information F(t) of the candidate target devices over the previous T_W time slots, and the length Q_i(t) of the task queue at the beginning of time slot t;
wherein h_i(t) = [h_{i,0}(t), h_{i,1}(t), …, h_{i,N}(t)]; F(t) = [f_0(t), f_1(t), …, f_N(t)]^T;
f_0(t) = [u_mec(t-T_W), u_mec(t-T_W+1), …, u_mec(t-1)]^T, where f_0(t) represents the computation load information of the MEC server and u_mec(·) denotes the number of MEC offloading users; f_k(t) = [d_k(t-T_W), d_k(t-T_W+1), …, d_k(t-1)]^T, where f_k(t) represents the service record of offload target k and d_k(·) represents the amount of offload data processed by service user k; T_W represents the number of columns of the F(t) matrix; k = 1, 2, …, N, with N representing the total number of D2D service users.
Further, step S6.4 specifically includes:
s6.4.1, obtaining the estimate Â_i(t) of the advantage function of requesting user i in time slot t generated by the Critic network through the following formulas:
Â_i(t) = Σ_{l=0}^{T_max - t - 1} (γλ)^l δ_i(t+l),  δ_i(t) = r_i(t) + γ V_ω(s_i(t+1)) - V_ω(s_i(t))
wherein γ is the discount factor; λ ∈ [0,1] is a parameter used to balance the estimation bias and variance; T_max represents the cardinality of the time slot set T = {0, 1, …, T_max - 1}; V_ω(s_i(t)) represents the state value function estimated by the Critic network according to the global state s_i(t) of requesting user i; r_i(t) represents the instant reward of requesting user i in time slot t;
the Critic network parameter ω takes the optimal value ω*, and the optimal value ω* is determined by the following formula:
ω* = argmin_ω J(ω)
wherein J(ω) is the objective function of the Critic network:
J(ω) = E_{π_θold}[ (R_i(t) - V_ω(s_i(t)))^2 ]
wherein E_{π_θold}[·] represents the empirical average over the trajectory samples τ generated by the policy π_θold;
R_i(t) is the cumulative discount reward of the requesting user:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where t' indexes the time slots over which the reward is accumulated;
s6.4.2, determining the policy function of the Actor network that can support discrete-continuous hybrid decisions:
π_θ(a_i(t)|o_i(t)) = π_{θd}(a_i^d(t)|o_i(t)) · π_{θc}(a_i^c(t)|o_i(t), a_i^d(t))
s6.4.3, determining the optimal value θ* of the Actor network parameter θ, θ = [θ_d, θ_c]:
θ* = argmax_θ L(θ)
L(θ) = E_{π_θold}[ min( ρ_i(t) Â_i(t), clip(ρ_i(t), 1-ε, 1+ε) Â_i(t) ) ],  ρ_i(t) = π_θ(a_i(t)|o_i(t)) / π_θold(a_i(t)|o_i(t))
wherein L(θ) is the objective function of the Actor network; θ_d represents the network parameters of the discrete action policy network, and θ_c represents the network parameters of the continuous action policy network; the Actor network comprises the discrete action policy network π_{θd}(a_i^d(t)|o_i(t)) and the continuous action policy network π_{θc}(a_i^c(t)|o_i(t), a_i^d(t));
wherein E_{π_θold}[·] represents the empirical average over the samples (o_i, a_i) generated by the policy π_θold, ε is a hyper-parameter, π_θ(a_i(t)|o_i(t)) represents the policy function of the updated Actor network, π_θold(a_i(t)|o_i(t)) represents the policy function of the Actor network before the update, clip(·) represents the clip function used to limit the ratio ρ_i(t), a_i^d(t) represents the discrete action, and a_i^c(t) represents the continuous action.
Further, step S7 specifically includes:
s7.1, at the beginning of each round, all requesting users start from a randomly set initial state;
s7.2, at the beginning of each time slot, the local observation state o_i(t) observed in the time slot is input to the Actor network to obtain the offloading mode and target selected by requesting user i in time slot t, the offloading ratio, the resource allocation policy, and the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t);
wherein a_i(t) = {a_i^d(t), a_i^c(t)};
s7.3, at the end of each time slot, the local observation state o_i(t) of each requesting user, the discrete-continuous hybrid action a_i(t), the instant reward r_i(t) obtained after the requesting user performs the offloading, the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t), and the new local observation state o_i(t+1) observed by the requesting user are composed into the record {o_i(t), a_i(t), r_i(t), π_θold(a_i(t)|o_i(t)), o_i(t+1)} and stored into the data cache D;
s7.4, repeating step S7.2 and step S7.3 for each time slot until one round ends; the duration of each round is T_max time slots;
s7.5, calculating the cumulative discount reward R_i(t) and the estimate Â_i(t) of the advantage function of requesting user i in time slot t according to the cumulative discount reward and advantage function formulas, and updating the information in the data cache D to {o_i(t), a_i(t), π_θold(a_i(t)|o_i(t)), R_i(t), Â_i(t)};
s7.6, repeating steps S7.1 to S7.5 for each round until the data cache D is full, and taking the information stored in the data cache D as training data;
s7.7, updating the network parameters θ_d of the discrete action policy network and the network parameters θ_c of the continuous action policy network in the Actor network according to the objective function L(θ) of the Actor network, using part of the training data stored in the data cache D; updating the Critic network parameters ω according to the objective function J(ω) of the Critic network;
s7.8, repeating step S7.7 until all the training data stored in the data cache D have been used by step S7.7, and then emptying the data cache D;
s7.9, judging whether the number of rounds reaches a preset number of rounds; if so, training is completed and the trained Actor network and Critic network are obtained; otherwise, repeating steps S7.1 to S7.8 until the number of rounds reaches the preset number of rounds.
The invention also provides a computer program product comprising a computer program, characterized in that when the program is executed by a processor, the steps of the D2D-MEC unloading method based on deep reinforcement learning are realized.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a D2D-MEC unloading method based on deep reinforcement learning, which combines MEC unloading and D2D unloading, allows a mobile device to locally execute a computing task, or unloads part of the task to an MEC server, or unloads part of the task to adjacent D2D equipment for processing. In order to achieve joint optimization of offloading decisions, power control and computing resource allocation, with the goal of minimizing energy consumption and traffic service latency, a deep reinforcement learning algorithm based on near-end policy optimization (Proximal Policy Optimization, PPO) is employed to obtain an optimal control policy.
2. The invention describes the dynamic characteristics of the system by using a discrete time model. At the beginning of each time slot, the requesting user observes the current state of the D2D-MEC network, and accordingly makes an offloading decision, thereby minimizing the long-term service overhead of the requesting user in the D2D-MEC system.
3. The invention provides a multi-agent PPO algorithm capable of supporting a discrete-continuous mixed action space based on the PPO algorithm, which is used for jointly optimizing unloading decision and calculating a resource and transmitting power allocation scheme. Each request user is used as an independent intelligent agent by adopting a centralized training and distributed executing mechanism and an Actor-Critic framework, and the action to be taken is determined by utilizing an Actor network according to local state information observed by the request user; the centralized Critic network combines the local observation information of all the agents into the total state information, and then estimates the dominance function of each agent by combining the environmental rewards value, and guides the updating of the Actor network.
4. The invention also provides a computer program product capable of executing the steps of the method, and the method can be popularized and applied to monitoring on corresponding hardware equipment.
Drawings
FIG. 1 is a schematic diagram of a D2D-MEC network architecture;
FIG. 2 is a schematic diagram of a multi-agent centralized training distributed execution in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation principle of an Actor network according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating the process of collecting training data and updating network parameters according to an embodiment of the present invention.
Wherein: 1-first requesting user, 2-service user, 3-MEC server, 4-base station, 5-second requesting user, 6-third requesting user.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Reinforcement learning is an important branch of machine learning research, and refers to a process that an intelligent agent continuously explores and tries to make mistakes in the process of interacting with the environment, and gradually adjusts a behavior mode according to environmental feedback until an optimal behavior strategy is obtained. In recent years, with the advent of deep learning, reinforcement learning was fused with deep neural networks, forming a deep reinforcement learning (Deep Reinforcement Learning, DRL) technique. Compared with the traditional reinforcement learning, the DRL can fully utilize the strong expression capability of the deep neural network to process the sequential decision problem with high-dimensional state space and action space, so that the DRL is suitable for solving the problem of edge calculation unloading in the random dynamic environment of the wireless communication network.
The D2D-MEC network shown in fig. 1 is composed of one base station 4 (BS) and M mobile devices (MDs). The base station 4 is equipped with a local MEC server 3 that can provide computation offloading services for users. In this network architecture there are two types of mobile users: one type has computing tasks to process and is called Request Devices (RDs), such as the first, second and third requesting users 1, 5, 6 in fig. 1; the other type can provide D2D offloading services and is called Service Devices (SDs), such as the service user 2. The arrows in fig. 1 represent task offloading.
The requesting user can freely select a task offloading mode and an offloading target based on information such as the local calculation task amount, channel conditions, calculation loads of the MEC server 3 and the service user 2, and the like. The three offloading modes defined in the present invention are: mode 1 is local computing, i.e., tasks are all performed locally to the user, as the first requesting user in FIG. 1; mode 2 is MEC offloading, i.e. the task is partially or fully uploaded to the MEC server 3 for processing, the remainder being executed locally, as the second requesting user 5 in fig. 1; mode 3 is D2D offloading, i.e. offloading of the task partly or entirely over the D2D link to some service user 2 in the vicinity, the remainder being performed locally, as the third requesting user 6 in fig. 1.
Define the set of requesting users, the set of service users M_SD, and the candidate offload target set M_target = {0} ∪ M_SD, where 0 represents the MEC server 3 at the base station 4. At each instant a requesting user can select at most one offload target (the MEC server 3 at the base station 4 or a service user 2), and a service user 2 can serve at most one requesting user.
The invention adopts a discrete time model to describe the dynamic characteristics of the system; at the beginning of each time slot, the requesting user observes the current state of the D2D-MEC network and makes an offloading decision accordingly. The time slot length is Δt, and the time slot set is T = {0, 1, …, T_max - 1}. The method comprises the following specific steps:
(1) Task model building
The invention considers delay-sensitive computing tasks, where τ_max is the maximum service delay that a task can tolerate. The j-th task that requesting user i needs to process is described by the triplet T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>, wherein D_{i,j} is the amount of computation data of the task, in bits; C_{i,j} is the number of CPU cycles required by the task, in cycles; a_{i,j} is the arrival time of the task, in seconds. Newly arrived tasks are first put into the local buffer queue of the device and then processed in FIFO (first-in first-out) order, so that the tasks of each user form a task queue. Define Q_i(t) as the length of the task queue at the beginning of time slot t and b_{i,j} as the time at which task T_{i,j} starts to be processed; the queuing delay q_{i,j} of task T_{i,j} is:
q_{i,j} = b_{i,j} - a_{i,j}
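As an illustrative sketch only (not part of the patent disclosure), the triplet task model and the FIFO queuing delay q_{i,j} = b_{i,j} - a_{i,j} can be expressed in Python as follows; the class names Task and TaskQueue are hypothetical.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Task:
    """Triplet task model T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>."""
    data_bits: float      # D_{i,j}: amount of computation data, in bits
    cpu_cycles: float     # C_{i,j}: CPU cycles required
    arrival_time: float   # a_{i,j}: arrival time, in seconds

class TaskQueue:
    """FIFO buffer queue of one requesting user."""
    def __init__(self):
        self.queue = deque()

    def push(self, task: Task):
        self.queue.append(task)

    def pop(self, start_time: float):
        """Take the head-of-line task; queuing delay q = b_{i,j} - a_{i,j}."""
        task = self.queue.popleft()
        queuing_delay = start_time - task.arrival_time
        return task, queuing_delay

# Example: a task arriving at t = 0.2 s that starts service at t = 0.5 s
q = TaskQueue()
q.push(Task(data_bits=2e6, cpu_cycles=1e9, arrival_time=0.2))
_, delay = q.pop(start_time=0.5)
print(delay)  # 0.3 s of queuing delay
```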
(2) Establishing a communication model
In order to avoid signal interference between users, the cellular communication between the mobile users (MDs) and the base station (BS) and the D2D communication between mobile users all adopt orthogonal frequency division multiple access, and the D2D communication and the cellular communication do not reuse spectrum resources. Assuming that the network allocates one subchannel to each offloading user and the bandwidth of the subchannel is W Hz, according to the Shannon formula the transmission rate of the subchannel is:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k, h_{i,k}(t) denotes the subchannel gain between requesting user i and offload target k (k ∈ M_target), p_{i,k}(t) is a decision variable representing the subchannel transmit power between requesting user i and offload target k, with 0 < p_{i,k}(t) ≤ p_max, where p_max is the maximum transmit power of the user equipment; σ^2 is the noise power.
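A minimal numerical sketch of the subchannel rate above, assuming the standard Shannon capacity form r = W log2(1 + h·p/σ^2) and purely illustrative parameter values:

```python
import math

def subchannel_rate(bandwidth_hz, channel_gain, tx_power_w, noise_power_w):
    """r_{i,k}(t) = W * log2(1 + h_{i,k}(t) * p_{i,k}(t) / sigma^2), in bit/s."""
    return bandwidth_hz * math.log2(1.0 + channel_gain * tx_power_w / noise_power_w)

# Illustrative values: 1 MHz subchannel, -30 dB gain, 0.1 W transmit power, 1e-13 W noise
r = subchannel_rate(1e6, 1e-3, 0.1, 1e-13)
print(f"{r/1e6:.1f} Mbit/s")
```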
(3) Establishing a calculation model of time delay and energy consumption
In different offload modes, the latency of requesting users to process computing tasks and the resulting device energy consumption have different forms. Mode 1 involves only local computation latency and local computation energy consumption; in modes 2 and 3, the task processing delays include a local computation delay, an offload data transmission delay, and a remote device computation delay, and the energy consumption includes a local computation energy consumption and a data transmission energy consumption. Considering that the data volume of the task processing result is smaller, the invention ignores the time delay and the energy consumption of the result feedback.
(a) Local computation delay and energy consumption
Define the decision variable α_i(t) ∈ [0,1] as the task offload ratio of requesting user i in time slot t. When mode 1 is selected, α_i(t) = 0; when mode 2 or 3 is selected, α_i(t) ∈ (0,1]. Assuming that task T_{i,j} is processed in time slot t, the amount of data that needs to be computed locally is [1 - α_i(t)]D_{i,j}, and the resulting computation delay and computation energy consumption are respectively:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
To reduce power consumption, the mobile device applies the Dynamic Voltage and Frequency Scaling (DVFS) technique to schedule local computing resources. Define the decision variable f_i^loc(t) as the CPU frequency selected by user i in time slot t, with 0 < f_i^loc(t) ≤ f_max, where f_max is the maximum CPU frequency. κ_i denotes the effective switched capacitance, whose size is determined by the device chip architecture.
(b) MEC unloading delay and energy consumption
In mode 2, the requesting user transmits the offload data to the BS (base station) through an uplink subchannel for processing by the MEC server. Assuming that the MEC server allocates its computing resources evenly among all current offloading users, the transmission delay and computation delay of the offload data of task T_{i,j} are respectively:
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein u_mec(t) is the number of MEC offloading users in time slot t, f_mec is the CPU frequency of the MEC server, and r_{i,0}(t) is the information transmission rate between user i and the BS.
The device energy consumption of the requesting user depends on the offload data transmission delay and the transmit power, and is calculated as:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
(c) D2D offloading latency and energy consumption
When the user selects mode 3, the offload data is transmitted to the service user over the D2D link and processed immediately. The resulting transmission delay and computation delay are respectively:
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein r_{i,k}(t) is the information transmission rate between requesting user i and service user k, and f_k^sd(t) is the CPU frequency of service user k in time slot t.
The device energy consumption of the requesting user depends on the transmission delay L_{i,k}^tr(t) and the transmit power p_{i,k}(t), and is calculated as:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
To sum up, the total service delay of task T_{i,j} consists of the processing delay and the queuing delay, namely:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
where the second term inside max{·} is absent in mode 1 and k = 0 in mode 2. If L_{i,j} > τ_max, the task will be discarded due to service timeout.
The total energy consumption generated by the requesting user for processing T_{i,j} is:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
where the second term is absent in mode 1 and k = 0 in mode 2.
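The per-mode delay and energy terms above can be gathered into one helper, sketched below under the assumption that the local and offloaded portions of a task are processed in parallel (so the processing delay is the larger of the two branches); the function name service_cost and all numeric values are illustrative, not values prescribed by the invention.

```python
def service_cost(mode, alpha, D, C, q_delay,
                 f_loc, kappa, rate, p_tx, f_remote):
    """Return (total service delay L_ij, device energy E_ij) for one task.

    mode: 1 = local, 2 = MEC offload, 3 = D2D offload
    alpha: offload ratio (0 for mode 1); D: data bits; C: CPU cycles
    f_loc: local CPU frequency; kappa: effective switched capacitance
    rate: subchannel rate to the offload target; p_tx: transmit power
    f_remote: CPU share at the offload target (f_mec / u_mec or f_k^sd)
    """
    # Local branch: compute the (1 - alpha) share on the device itself
    t_loc = (1.0 - alpha) * C / f_loc
    e_loc = kappa * f_loc ** 2 * (1.0 - alpha) * C

    if mode == 1:
        return q_delay + t_loc, e_loc

    # Offload branch: transmit alpha*D bits, then compute alpha*C cycles remotely
    t_tx = alpha * D / rate
    t_remote = alpha * C / f_remote
    e_tx = p_tx * t_tx

    delay = q_delay + max(t_loc, t_tx + t_remote)   # parallel execution assumption
    return delay, e_loc + e_tx

# Example: offload 60% of a task to the MEC server
L, E = service_cost(mode=2, alpha=0.6, D=2e6, C=1e9, q_delay=0.05,
                    f_loc=1e9, kappa=1e-27, rate=20e6, p_tx=0.1, f_remote=2.5e9)
print(L, E)
```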
(4) Setting an overhead function
The system overhead defined by the invention includes the delay overhead, the energy consumption overhead and a service timeout penalty, which gives the overhead function of a single task T_{i,j}:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, and 1{·} is the indicator function, which equals 1 when the condition L_{i,j} > τ_max holds and 0 otherwise.
Since a user may process multiple tasks within a single slot, the overhead function of time slot t is defined as:
c_i(t) = Σ_{j ∈ T_i(t)} c_{i,j}
wherein T_i(t) is the index set of the tasks processed by user i in time slot t.
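An illustrative sketch of the per-task and per-slot overhead above; the weight values and the timeout threshold are arbitrary examples, not values prescribed by the invention.

```python
def task_cost(L, E, tau_max, beta1=1.0, beta2=1.0, beta3=5.0):
    """c_{i,j} = beta1*L + beta2*E + beta3 * 1{L > tau_max}."""
    timeout_penalty = 1.0 if L > tau_max else 0.0
    return beta1 * L + beta2 * E + beta3 * timeout_penalty

def slot_cost(tasks_in_slot, tau_max):
    """c_i(t): sum of the costs of all tasks processed by user i in slot t."""
    return sum(task_cost(L, E, tau_max) for (L, E) in tasks_in_slot)

print(slot_cost([(0.45, 0.41), (1.2, 0.30)], tau_max=1.0))  # second task times out
```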
The specific offloading algorithm is designed as follows:
The main object of the present invention is to minimize the long-term service overhead of the requesting users in the D2D-MEC system. Therefore, the optimization problem is modeled as a partially observable Markov decision process, and a multi-agent PPO algorithm that supports a discrete-continuous hybrid action space is proposed on the basis of the PPO algorithm to jointly optimize the offloading decision and the allocation of computing resources and transmit power. The algorithm employs a "centralized training, distributed execution" mechanism and an Actor-Critic framework, as shown in FIG. 2. Each requesting user acts as an independent agent and, based on its locally observed state information o_i(t), uses an Actor network to determine the action a_i(t) to be taken; the centralized Critic network combines the local observations of all agents into the overall state information and, together with the environmental reward value, estimates the advantage function Â_i(t) of each agent to guide the update of the Actor network. The method comprises the following steps:
(1) Obtaining local observation state and global state
At the beginning of each time slot, the requesting user i observes the data volume of the local task queue (i.e. the length of the task queue) and the state of the external environment, including the channel quality between the requesting user and the candidate offload target and the computational load of the candidate target, obtaining the local observation state o i (t)={Q i (t),h i (t),F(t)}。
wherein ,hi (t)=[h i,0 (t),h i,1 (t),…,h i,N (t)]Representing channel gains between the requesting user and each candidate offload target;
F(t)=[f 0 (t),f 1 (t),…,f N (t)] T f (T) is (N+1). Times.T W Matrix of dimensions representing previous T of candidate target device W The computation load history information of each slot is collected and distributed to all requesting users by a BS (base station). f (f) 0 (t)=[u mec (t-T W ),u mec (t-T W +1),…,u mec (t-1)] T Representing computational load information of MEC server, f k (t)=[d k (t-T W ),d k (t-T W +1),…,d k (t-1)] T Service record, d, representing service user k k (t-j) is the amount of offload data that the user handles in time slot t-j.
Global state of agent i is determined by its local observation state o i (t) and other agent's local observed state o -i (t) spliced, subscript-i represents all agents except i. Deleting repeated information in the splicing process to obtain a global state s i (t) is defined as follows:
s i (t)={0′ i (t),o′ -i (t),F(t)}
wherein ,o′i (t)=[Q i (t),h i (t)]。
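The construction of o_i(t) and s_i(t) described above can be sketched with numpy as follows; the flattening into plain vectors and the helper names are assumptions made for illustration only.

```python
import numpy as np

def local_observation(Q_i, h_i, F):
    """o_i(t) = {Q_i(t), h_i(t), F(t)} flattened into one vector.

    Q_i : scalar task-queue length
    h_i : (N+1,) channel gains to the MEC server and the N service users
    F   : (N+1, T_W) computation-load history of the candidate targets
    """
    return np.concatenate(([Q_i], h_i, F.ravel()))

def global_state(Q_all, h_all, F):
    """s_i(t): concatenate {Q_i, h_i} of every requesting user with one copy of F(t)."""
    per_user = np.concatenate([np.concatenate(([Q], h)) for Q, h in zip(Q_all, h_all)])
    return np.concatenate([per_user, F.ravel()])

N, T_W, M = 3, 4, 2               # 3 service users, 4-slot history, 2 requesting users
F = np.random.rand(N + 1, T_W)
h_all = [np.random.rand(N + 1) for _ in range(M)]
Q_all = [5.0, 2.0]
print(local_observation(Q_all[0], h_all[0], F).shape)  # (1 + 4 + 16,) = (21,)
print(global_state(Q_all, h_all, F).shape)             # (2*5 + 16,) = (26,)
```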
(2) Decision of mixing action
The decision process of the agent in each time slot is divided into two phases: an appropriate offloading mode is first selected, then an offloading ratio is determined, and computing resources and transmit power are allocated. For this purpose, discrete-continuous mixing actions are definedWherein discrete actions->Indicating the offloading mode and offloading destination selected by agent i at time slot t, 0 represents the MEC server, 1,2, …, N represents the service user, and n+1 represents the local calculation. Continuous action->Representing a resource allocation policy, which is defined as: />
As shown in fig. 3, in order to implement the hybrid action decision, the present invention extends the Actor network in the PPO algorithm, and defines discrete action policy networks respectivelyContinuous action policy networkd and θc Network parameters of the discrete action strategy network and the continuous action strategy network respectively) and solves the back propagation derivative problem after discrete action sampling by utilizing a Gumbel-Softmax method.
In FIG. 3, the state information F (t) is input to a Long Short-Term Memory (LSTM) layer to predict the recent load level of the candidate load shedding target, and then to be compared with Q i (t)、h i (t) stitching together as inputs to the discrete action policy network and the continuous action policy network. Policy functionObeying the Categorical distribution, outputting the probability value of all the discrete actions selected according to the input local observation state, and obtaining the discrete actions after Gumbel-Softmax sampling>G (0, 1) in fig. 3 represents the sampled value of the gummel distribution with parameter (0, 1). Policy function->Obeying Gaussian distribution, outputting the mean value and standard deviation of continuous motion distribution according to the input local observation state and discrete motion value, and obtaining continuous motion +.>Because all the request users have homogeneity and can generate the same decision result under the same input state, the invention adopts a parameter sharing mechanism to lead all the intelligent agents to share the same decision resultNetwork parameter θ d and θc To accelerate the convergence speed of the algorithm.
(3) Setting a reward function
After all agents perform their respective actions, the environment immediately feeds back the reward information. The instant reward obtained by agent i in time slot t is:
r_i(t) = -c_i(t)
The goal of an agent is to maximize its long-term reward, for which the following cumulative discount reward is defined:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where γ ∈ [0,1] is a discount factor used to balance long-term rewards against instant rewards, and t' indexes the time slots over which the reward is accumulated.
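A short sketch of the cumulative discount reward, computed backwards over one round (illustrative only; the reward values are arbitrary):

```python
def discounted_returns(rewards, gamma=0.95):
    """R_i(t) = sum_{t'>=t} gamma^(t'-t) * r_i(t'), for every slot of a round."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Rewards are the negated slot costs, r_i(t) = -c_i(t)
print(discounted_returns([-1.0, -0.5, -2.0], gamma=0.9))
```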
(4) Setting an objective function
In the present invention, the Actor network is formed by cascading a discrete action policy network and a continuous action policy network, and therefore, the policy function is defined as:
parameter θ= [ θ ] dc ]Optimal value θ * Satisfy the following requirements
/>
Wherein L (θ) is an objective function of the Actor network, θ old Representing the parameter values before the updating of the Actor network, training samples (o i ,a i ) By policyGenerate->Is the estimated value of the dominance function of agent i in time slot t, and is generated by Critic network. Pi θ (a i (t)|o i (t)) and->Respectively before and after updating, when the input state is o i At (t), agent selecting action a i (t) probability. clip function is used to limit the ratio +.>To avoid the algorithm difficult to converge due to too fast parameter update; epsilon is a hyper-parameter.
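A minimal PyTorch sketch of the clipped surrogate objective and the mean-square-error Critic objective above; log_prob_new and log_prob_old stand for the log-probabilities of the executed hybrid actions under the updated policy and the behaviour policy respectively (an assumption about the interface, not the patent's code). The actor objective is returned negated so that it can be minimized by gradient descent.

```python
import torch

def ppo_actor_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """-L(theta) = -E[ min(rho*A, clip(rho, 1-eps, 1+eps)*A) ], rho = pi_theta / pi_theta_old."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def critic_loss(values, returns):
    """J(omega): mean squared error between V_omega(s_i(t)) and the discounted return R_i(t)."""
    return torch.mean((returns - values) ** 2)

adv = torch.tensor([0.5, -1.0, 2.0])
lp_new = torch.tensor([-1.0, -0.8, -1.2]); lp_old = torch.tensor([-1.1, -0.9, -1.0])
print(ppo_actor_loss(lp_new, lp_old, adv))
print(critic_loss(torch.tensor([0.2, 0.1, -0.3]), torch.tensor([0.0, 0.5, -0.2])))
```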
Critic network is based on global state s of agent i (t) estimating a state value functionω is a network parameter whose optimal value satisfies:
j (ω) is the objective function of the Critic network, which is defined in terms of mean square error.
Critic network utilization status valueAnd environmental rewards r i And (t) further estimating an advantage function, and feeding back to an Actor for parameter training. The invention adopts a generalized dominance estimation (GAE) method to estimate the dominance function, and the calculation formula is as follows:
wherein the discount factor gamma and the parameter lambda epsilon 0,1 are used to balance the estimated deviation and variance.
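A sketch of generalized advantage estimation as described above (numpy; appending a final value of 0 for the state after the last slot is an assumption about the terminal-value convention):

```python
import numpy as np

def gae(rewards, values, gamma=0.95, lam=0.95):
    """A_hat(t) = sum_l (gamma*lam)^l * delta(t+l), delta(t) = r(t) + gamma*V(t+1) - V(t).

    values has one extra entry: the value of the state after the last slot (0 if terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

print(gae(rewards=[-1.0, -0.5, -2.0], values=[-3.0, -2.2, -1.9, 0.0]))
```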
(5) Algorithm training process
As shown in fig. 4, the training process of the algorithm is divided into two phases: training data is collected and network parameters are updated. In the first phase, the system collects data in rounds (epodes), one round having a duration T max And each time slot. At the beginning of each round, all agents start from a randomly set initial state and observe the local observation state o observed by the time slot i (t) input to an Actor network to obtain an offload modeUnloading proportion and resource allocation scheme->Executing action a i Probability of (t)>After the intelligent agent executes the unloading action, the instant rewards r of the environment feedback are obtained i (t) and observe a new local observation state o i (t+1). At the end of each time slot, information of all agents is addedSpliced together and stored in a data buffer D. After one round is completed, the system calculates the cumulative discount prize according to equations (1), (4) and (5), respectively>And dominance function value->Then the record in cache D is updated to +.>In the form of (a). If D is not full, it indicates that the sample data volume has not reached the trainingThe exercise requirements, the system will continue to execute a new round. />
In the second stage, the algorithm trains the Actor network and the Critic network respectively by using the cached data of D. The training process is carried out in batches, a small batch of data is randomly sampled from D during each training, and the parameter theta of the Actor network is updated according to the formula (2) d and θc And updating the parameter omega of the Critic network according to the formula (3). And after the parameter updating of the round is finished, D is emptied, new data are continuously collected, and the parameter updating of the next round is prepared.
After training, a more accurate unloading method selection can be performed in practical application through the trained Actor network.
The offloading method of the invention may also form a computer program product comprising a computer program which, when executed by a processor, implements the steps of a deep reinforcement learning based D2D-MEC offloading method.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A D2D-MEC offloading method based on deep reinforcement learning, comprising the steps of:
s1, establishing a D2D-MEC edge computing system model, and executing steps S2 to S8 according to the D2D-MEC edge computing system model;
s2, establishing a task model and determining task queuing delay;
s3, determining the transmission rate of the sub-channel in the network according to a shannon formula;
s4, determining the processing time delay and the energy consumption of each unloading mode according to the task model, the task queuing time delay and the transmission rate of the sub-channel, and obtaining the total service time delay of the task and the total energy consumption generated by requesting a user to process the task;
s5, setting an overhead function according to total energy consumption and total service time delay generated by a request user for processing tasks;
s6, constructing a multi-agent deep reinforcement learning model, wherein the multi-agent deep reinforcement learning model comprises an Actor network and a Critic network;
s6.1, determining local state information observed by a requesting user at the beginning of each time slot;
s6.2, splicing the local observation state of any request user with the local observation states of all other request users, and deleting repeated information to obtain the global state of the request user;
s6.3, determining, according to the overhead function, a reward function indicating the environmental feedback reward and the cumulative discount reward obtained after the requesting user executes an action;
s6.4, determining optimization targets of an Actor network and a Critic network;
s7, training an Actor network and a Critic network, and optimizing parameters;
s8, selecting an unloading mode when the D2D-MEC network is unloaded through the trained Actor network, determining an unloading proportion, and distributing computing resources and transmitting power;
in step S3, cellular communication between the mobile user and the base station in the network and D2D communication between the mobile users all adopt an orthogonal frequency division multiple access mode, and the D2D communication and the cellular communication do not multiplex spectrum resources;
in step S3, specifically, the transmission rate of the sub-channel is determined by the following formula:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k; h_{i,k}(t) represents the subchannel gain between requesting user i and offload target k; p_{i,k}(t) represents the subchannel transmit power between requesting user i and offload target k; σ^2 represents the noise power; W represents the bandwidth of the subchannel;
in step S4, the offloading modes are three, including a mode 1, a mode 2 and a mode 3, where the mode 1 is that tasks are all executed locally by a user, the mode 2 is that tasks are partially or fully offloaded to an MEC server for processing, and the mode 3 is that tasks are partially or fully offloaded to an adjacent D2D service user for processing;
the step S4 specifically comprises the following steps:
s4.1, determining processing time delay and energy consumption of three unloading modes through local calculation time delay and energy consumption, MEC unloading time delay and energy consumption and D2D unloading time delay and energy consumption in the three unloading modes:
local computation delay and energy consumption:
the local computation delay L_i^loc(t) is:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
the local computation energy consumption E_i^loc(t) is:
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
wherein f_i^loc(t) is the CPU frequency of user i in time slot t, and κ_i is the effective switched capacitance; α_i(t) is the task offload ratio, with α_i(t) = 0 for mode 1 and α_i(t) ∈ (0,1] for mode 2 and mode 3;
MEC offloading delay and energy consumption:
the MEC offloading delay includes the transmission delay L_{i,0}^tr(t) and the computation delay L_{i,0}^c(t):
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein r_{i,0}(t) denotes r_{i,k}(t) with k taken as 0, f_mec is the CPU frequency of the MEC server, and u_mec(t) denotes the number of MEC offloading users in time slot t;
the MEC offloading energy consumption E_i^mec(t) is:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
wherein p_{i,0}(t) denotes p_{i,k}(t) with k taken as 0;
D2D offloading delay and energy consumption:
the D2D offloading delay includes the transmission delay L_{i,k}^tr(t) and the computation delay L_{i,k}^c(t):
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein f_k^sd(t) is the CPU frequency of service user k in time slot t;
the D2D offloading energy consumption E_i^d2d(t) is:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
s4.2, calculating the total service delay L_{i,j} of task T_{i,j}, which consists of the processing delay and the queuing delay:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
(the second term inside max{·} is absent in mode 1, and k = 0 in mode 2);
s4.3, calculating the total energy consumption E_{i,j} generated by the requesting user for processing task T_{i,j}:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
(the second term is absent in mode 1, and k = 0 in mode 2).
2. The D2D-MEC unloading method based on deep reinforcement learning according to claim 1, wherein step S2 is specifically:
s2.1, establishing a triplet task model:
T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>
wherein D_{i,j} represents the amount of computation data of the j-th task that the requesting user needs to process; C_{i,j} represents the number of CPU cycles required by the j-th task that the requesting user needs to process; a_{i,j} represents the arrival time of the j-th task that the requesting user needs to process; T_{i,j} represents the j-th task that the requesting user needs to process;
s2.2, determining the queuing delay q_{i,j} of task T_{i,j} by the following formula:
q_{i,j} = b_{i,j} - a_{i,j};
wherein b_{i,j} is the time at which the requesting user starts processing its j-th task.
3. The D2D-MEC offloading method based on deep reinforcement learning according to claim 2, wherein: in step S5, the cost function c_{i,j} is:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, 1{·} is the indicator function, and τ_max is the maximum service delay that task T_{i,j} can tolerate.
4. The D2D-MEC offloading method based on deep reinforcement learning according to claim 3, wherein: in step S6.1, the local observation state o_i(t) includes the channel gain h_i(t) between the requesting user and each candidate offload target, the computation load history information F(t) of the candidate target devices over the previous T_W time slots, and the length Q_i(t) of the task queue at the beginning of time slot t;
wherein h_i(t) = [h_{i,0}(t), h_{i,1}(t), …, h_{i,N}(t)]; F(t) = [f_0(t), f_1(t), …, f_N(t)]^T;
f_0(t) = [u_mec(t-T_W), u_mec(t-T_W+1), …, u_mec(t-1)]^T, where f_0(t) represents the computation load information of the MEC server and u_mec(·) denotes the number of MEC offloading users; f_k(t) = [d_k(t-T_W), d_k(t-T_W+1), …, d_k(t-1)]^T, where f_k(t) represents the service record of offload target k and d_k(·) represents the amount of offload data processed by service user k; T_W represents the number of columns of the F(t) matrix; k = 1, 2, …, N, with N representing the total number of D2D service users.
5. The D2D-MEC offloading method of claim 4, wherein: the step S6.4 specifically comprises the following steps:
s6.4.1, obtaining the estimate Â_i(t) of the advantage function of requesting user i in time slot t generated by the Critic network through the following formulas:
Â_i(t) = Σ_{l=0}^{T_max - t - 1} (γλ)^l δ_i(t+l),  δ_i(t) = r_i(t) + γ V_ω(s_i(t+1)) - V_ω(s_i(t))
wherein γ is the discount factor; λ ∈ [0,1] is a parameter used to balance the estimation bias and variance; T_max represents the cardinality of the time slot set T = {0, 1, …, T_max - 1}; V_ω(s_i(t)) represents the state value function estimated by the Critic network according to the global state s_i(t) of requesting user i; r_i(t) represents the instant reward of requesting user i in time slot t;
the Critic network parameter ω takes the optimal value ω*, and the optimal value ω* is determined by the following formula:
ω* = argmin_ω J(ω)
wherein J(ω) is the objective function of the Critic network:
J(ω) = E_{π_θold}[ (R_i(t) - V_ω(s_i(t)))^2 ]
wherein E_{π_θold}[·] represents the empirical average over the trajectory samples τ generated by the policy π_θold;
R_i(t) is the cumulative discount reward of the requesting user:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where t' indexes the time slots over which the reward is accumulated;
s6.4.2, determining the policy function of the Actor network that can support discrete-continuous hybrid decisions:
π_θ(a_i(t)|o_i(t)) = π_{θd}(a_i^d(t)|o_i(t)) · π_{θc}(a_i^c(t)|o_i(t), a_i^d(t))
s6.4.3, determining the optimal value θ* of the Actor network parameter θ, θ = [θ_d, θ_c]:
θ* = argmax_θ L(θ)
L(θ) = E_{π_θold}[ min( ρ_i(t) Â_i(t), clip(ρ_i(t), 1-ε, 1+ε) Â_i(t) ) ],  ρ_i(t) = π_θ(a_i(t)|o_i(t)) / π_θold(a_i(t)|o_i(t))
wherein L(θ) is the objective function of the Actor network; θ_d represents the network parameters of the discrete action policy network, and θ_c represents the network parameters of the continuous action policy network; the Actor network comprises the discrete action policy network π_{θd}(a_i^d(t)|o_i(t)) and the continuous action policy network π_{θc}(a_i^c(t)|o_i(t), a_i^d(t));
wherein E_{π_θold}[·] represents the empirical average over the samples (o_i, a_i) generated by the policy π_θold, ε is a hyper-parameter, π_θ(a_i(t)|o_i(t)) represents the policy function of the updated Actor network, π_θold(a_i(t)|o_i(t)) represents the policy function of the Actor network before the update, clip(·) represents the clip function used to limit the ratio ρ_i(t), a_i^d(t) represents the discrete action, and a_i^c(t) represents the continuous action.
6. The D2D-MEC offloading method of claim 5, wherein: the step S7 specifically comprises the following steps:
s7.1, at the beginning of each round, all requesting users start from a randomly set initial state;
s7.2, at the beginning of each time slot, the local observation state o_i(t) observed in the time slot is input to the Actor network to obtain the offloading mode and target selected by requesting user i in time slot t, the offloading ratio, the resource allocation policy, and the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t);
wherein a_i(t) = {a_i^d(t), a_i^c(t)};
s7.3, at the end of each time slot, the local observation state o_i(t) of each requesting user, the discrete-continuous hybrid action a_i(t), the instant reward r_i(t) obtained after the requesting user performs the offloading, the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t), and the new local observation state o_i(t+1) observed by the requesting user are composed into the record {o_i(t), a_i(t), r_i(t), π_θold(a_i(t)|o_i(t)), o_i(t+1)} and stored into the data cache D;
s7.4, repeating step S7.2 and step S7.3 for each time slot until one round ends; the duration of each round is T_max time slots;
s7.5, calculating the cumulative discount reward R_i(t) and the estimate Â_i(t) of the advantage function of requesting user i in time slot t according to the cumulative discount reward and advantage function formulas, and updating the information in the data cache D to {o_i(t), a_i(t), π_θold(a_i(t)|o_i(t)), R_i(t), Â_i(t)};
s7.6, repeating steps S7.1 to S7.5 for each round until the data cache D is full, and taking the information stored in the data cache D as training data;
s7.7, updating the network parameters θ_d of the discrete action policy network and the network parameters θ_c of the continuous action policy network in the Actor network according to the objective function L(θ) of the Actor network, using part of the training data stored in the data cache D; updating the Critic network parameters ω according to the objective function J(ω) of the Critic network;
s7.8, repeating step S7.7 until all the training data stored in the data cache D have been used by step S7.7, and then emptying the data cache D;
s7.9, judging whether the number of rounds reaches a preset number of rounds; if so, training is completed and the trained Actor network and Critic network are obtained; otherwise, repeating steps S7.1 to S7.8 until the number of rounds reaches the preset number of rounds.
CN202210771544.2A 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning Active CN114938381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210771544.2A CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210771544.2A CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114938381A CN114938381A (en) 2022-08-23
CN114938381B true CN114938381B (en) 2023-09-01

Family

ID=82868820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210771544.2A Active CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114938381B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499875B (en) * 2022-09-14 2023-09-22 中山大学 Satellite internet task unloading method, system and readable storage medium
CN115913987A (en) * 2022-10-24 2023-04-04 浙江工商大学 Intelligent bus service unloading method in edge computing environment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113543342A (en) * 2021-07-05 2021-10-22 南京信息工程大学滨江学院 Reinforced learning resource allocation and task unloading method based on NOMA-MEC
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN113543342A (en) * 2021-07-05 2021-10-22 南京信息工程大学滨江学院 Reinforced learning resource allocation and task unloading method based on NOMA-MEC
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on task offloading for mobile edge computing based on deep reinforcement learning; 卢海峰; 顾春华; 罗飞; 丁炜超; 杨婷; 郑帅; Journal of Computer Research and Development (Issue 07); full text *

Also Published As

Publication number Publication date
CN114938381A (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN113612843B (en) MEC task unloading and resource allocation method based on deep reinforcement learning
CN114938381B (en) D2D-MEC unloading method based on deep reinforcement learning
CN111031102B (en) Multi-user, multi-task mobile edge computing system cacheable task migration method
CN111414252B (en) Task unloading method based on deep reinforcement learning
CN111586696B (en) Resource allocation and unloading decision method based on multi-agent architecture reinforcement learning
CN110557732B (en) Vehicle edge computing network task unloading load balancing system and balancing method
CN107708152B (en) Task unloading method of heterogeneous cellular network
CN110489176B (en) Multi-access edge computing task unloading method based on boxing problem
CN112511336B (en) Online service placement method in edge computing system
CN114205353B (en) Calculation unloading method based on hybrid action space reinforcement learning algorithm
CN115809147B (en) Multi-edge collaborative cache scheduling optimization method, system and model training method
CN115344395B (en) Heterogeneous task generalization-oriented edge cache scheduling and task unloading method and system
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN113364630A (en) Quality of service (QoS) differentiation optimization method and device
CN114285853A (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN116233927A (en) Load-aware computing unloading energy-saving optimization method in mobile edge computing
KR20230007941A (en) Edge computational task offloading scheme using reinforcement learning for IIoT scenario
CN116366576A (en) Method, device, equipment and medium for scheduling computing power network resources
CN113821346B (en) Edge computing unloading and resource management method based on deep reinforcement learning
CN117354934A (en) Double-time-scale task unloading and resource allocation method for multi-time-slot MEC system
CN116954866A (en) Edge cloud task scheduling method and system based on deep reinforcement learning
CN114025359B (en) Resource allocation and calculation unloading method, system, equipment and medium based on deep reinforcement learning
CN113452625B (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN114980160A (en) Unmanned aerial vehicle-assisted terahertz communication network joint optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant