CN114938381B - D2D-MEC unloading method based on deep reinforcement learning - Google Patents

D2D-MEC unloading method based on deep reinforcement learning

Info

Publication number
CN114938381B
CN114938381B (application CN202210771544.2A)
Authority
CN
China
Prior art keywords
mec
network
unloading
user
task
Prior art date
Legal status
Active
Application number
CN202210771544.2A
Other languages
Chinese (zh)
Other versions
CN114938381A (en)
Inventor
施苑英
王选宏
石薇
蒋军敏
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202210771544.2A
Publication of CN114938381A
Application granted
Publication of CN114938381B


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/104 - Peer-to-peer [P2P] networks
    • H04L67/1074 - Peer-to-peer [P2P] networks for supporting data block transmission mechanisms
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/70 - Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention belongs to the field of computation offloading methods, and provides a D2D-MEC unloading method based on deep reinforcement learning and a computer program product, which are used for solving the technical problem that in practical applications only one of MEC offloading and D2D offloading is selected, so that the advantages of the two cannot both be obtained and the long-term service overhead of the requesting user cannot be minimized. Combining MEC offloading and D2D offloading allows the mobile device to execute computing tasks locally, offload part of the tasks to an MEC server, or offload part of the tasks to a nearby D2D device for processing. In order to realize joint optimization of the offloading decision, power control and computing resource allocation, with the goal of minimizing energy consumption and service delay, a deep reinforcement learning algorithm based on proximal policy optimization (PPO) is adopted to obtain the optimal control policy.

Description

D2D-MEC unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of computation offloading methods, and particularly relates to a D2D-MEC unloading method based on deep reinforcement learning and a computer program product.
Background
With the rapid development of 5G technology, various new services represented by face recognition, augmented reality and natural language processing are continuously emerging. These services often have computationally intensive and latency sensitive characteristics, which pose a significant challenge for mobile devices with limited computing power and battery power.
Mobile Edge Computing (MEC) is a novel computing paradigm proposed by the European Telecommunications Standards Institute. It sinks cloud computing services from the cloud to the edge of the mobile network, so that users can offload computing tasks to an edge server for execution. This not only compensates for the shortcomings of terminal devices in computing performance and energy efficiency, but also meets the ultra-low delay requirements of new services and gives users a better service experience. Device-to-Device (D2D) communication is a local direct communication technology that does not require a central infrastructure. Using D2D technology, mobile devices can cooperatively share computing resources, so that resource-constrained devices can offload computing tasks to neighboring devices with idle resources for processing. Compared with MEC offloading, this short-range offloading mode can further reduce the data transmission delay and the energy consumption of mobile devices, and also helps relieve the pressure that large-scale concurrent offloading places on network communication and the MEC server. Therefore, D2D computation offloading has received increasing attention in recent years as an effective complement to MEC offloading.
However, in practical applications, generally only one of MEC offloading and D2D offloading is selected, so the advantages of both cannot be obtained and the long-term service overhead of the requesting user cannot be minimized.
Disclosure of Invention
The invention provides a D2D-MEC unloading method based on deep reinforcement learning and a computer program product, which are used for solving the technical problem that in practical applications only one of MEC offloading and D2D offloading is selected, so that the advantages of the two cannot both be obtained and the long-term service overhead of the requesting user cannot be minimized.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
the D2D-MEC unloading method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, establishing a D2D-MEC edge computing system model, and executing steps S2 to S8 according to the D2D-MEC edge computing system model;
s2, establishing a task model and determining task queuing delay;
s3, determining the transmission rate of the sub-channels in the network according to the Shannon formula;
s4, determining the processing time delay and the energy consumption of each unloading mode according to the task model, the task queuing time delay and the transmission rate of the sub-channel, and obtaining the total service time delay of the task and the total energy consumption generated by requesting a user to process the task;
s5, setting an overhead function according to total energy consumption and total service time delay generated by a request user for processing tasks;
s6, constructing a multi-agent deep reinforcement learning model, wherein the multi-agent deep reinforcement learning model comprises an Actor network and a Critic network;
s6.1, determining local state information observed by a requesting user at the beginning of each time slot;
s6.2, splicing the local observation state of any request user with the local observation states of all other request users, and deleting repeated information to obtain the global state of the request user;
s6.3, determining, according to the overhead function, a reward function indicating the environmental feedback reward and the cumulative discount reward obtained after the requesting user executes an action;
s6.4, determining optimization targets of an Actor network and a Critic network;
s7, training an Actor network and a Critic network, and optimizing parameters;
s8, selecting an unloading mode when the D2D-MEC network is unloaded through the trained Actor network, determining an unloading proportion, and distributing computing resources and transmitting power.
Further, the step S2 specifically includes:
s2.1, establishing a triplet task model:
T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>
wherein D_{i,j} represents the amount of computation data of the j-th task that requesting user i needs to process; C_{i,j} represents the number of CPU cycles required by the j-th task that requesting user i needs to process; a_{i,j} represents the arrival time of the j-th task that requesting user i needs to process; T_{i,j} represents the j-th task that requesting user i needs to process;
s2.2, determining the queuing delay q_{i,j} of task T_{i,j} by the following formula:
q_{i,j} = b_{i,j} - a_{i,j}
wherein b_{i,j} is the time at which requesting user i starts processing its j-th task.
Further, in step S3, cellular communication between the mobile user and the base station in the network and D2D communication between the mobile users all adopt an orthogonal frequency division multiple access mode, and the D2D communication and the cellular communication do not multiplex spectrum resources;
in step S3, specifically, the transmission rate of the sub-channel is determined by the following formula:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k; h_{i,k}(t) represents the subchannel gain between requesting user i and offload target k; p_{i,k}(t) represents the subchannel transmit power between requesting user i and offload target k; σ^2 represents the noise power; W represents the bandwidth of the subchannel.
Further, in step S4, the offloading modes include a mode 1, a mode 2 and a mode 3, where the mode 1 is that tasks are all executed locally by a user, the mode 2 is that tasks are partially or fully offloaded to an MEC server for processing, and the mode 3 is that tasks are partially or fully offloaded to an adjacent D2D service user for processing;
the step S4 specifically comprises the following steps:
s4.1, determining processing time delay and energy consumption of three unloading modes through local calculation time delay and energy consumption, MEC unloading time delay and energy consumption and D2D unloading time delay and energy consumption in the three unloading modes:
local computation delay and energy consumption:
the local computation delay L_i^loc(t) is:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
the local computation energy consumption E_i^loc(t) is:
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
wherein f_i^loc(t) is the CPU frequency of user i in time slot t, and κ_i is the effective switched capacitance; α_i(t) is the task offload ratio, with α_i(t) = 0 for mode 1 and α_i(t) ∈ (0,1] for mode 2 and mode 3;
MEC offloading delay and energy consumption:
the MEC offloading delay includes the transmission delay L_{i,0}^tr(t) and the computation delay L_{i,0}^c(t):
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein r_{i,0}(t) denotes r_{i,k}(t) with k taken as 0, f_mec is the CPU frequency of the MEC server, and u_mec(t) denotes the number of MEC offloading users in time slot t;
the MEC offloading energy consumption E_i^mec(t) is:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
wherein p_{i,0}(t) denotes p_{i,k}(t) with k taken as 0;
D2D offloading delay and energy consumption:
the D2D offloading delay includes the transmission delay L_{i,k}^tr(t) and the computation delay L_{i,k}^c(t):
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein f_k^sd(t) is the CPU frequency of service user k in time slot t;
the D2D offloading energy consumption E_i^d2d(t) is:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
s4.2, calculating the total service delay L_{i,j} of task T_{i,j}, which consists of the processing delay and the queuing delay:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
(the second term inside max{·} is absent in mode 1, and k = 0 in mode 2);
s4.3, calculating the total energy consumption E_{i,j} generated by requesting user i for processing task T_{i,j}:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
(the second term is absent in mode 1, and k = 0 in mode 2).
Further, in step S5, the cost function c_{i,j} is:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, 1{·} is the indicator function, and τ_max is the maximum service delay that task T_{i,j} can tolerate.
Further, in step S6.1, the local observation state o_i(t) includes the channel gain h_i(t) between the requesting user and each candidate offload target, the computation load history information F(t) of the candidate target devices over the previous T_W time slots, and the length Q_i(t) of the task queue at the beginning of time slot t;
wherein h_i(t) = [h_{i,0}(t), h_{i,1}(t), …, h_{i,N}(t)]; F(t) = [f_0(t), f_1(t), …, f_N(t)]^T;
f_0(t) = [u_mec(t-T_W), u_mec(t-T_W+1), …, u_mec(t-1)]^T, where f_0(t) represents the computation load information of the MEC server and u_mec(·) denotes the number of MEC offloading users; f_k(t) = [d_k(t-T_W), d_k(t-T_W+1), …, d_k(t-1)]^T, where f_k(t) represents the service record of offload target k and d_k(·) represents the amount of offload data processed by service user k; T_W represents the number of columns of the F(t) matrix; k = 1, 2, …, N, with N representing the total number of D2D service users.
Further, step S6.4 specifically includes:
s6.4.1, obtaining the estimate Â_i(t) of the advantage function of requesting user i in time slot t generated by the Critic network through the following formulas:
Â_i(t) = Σ_{l=0}^{T_max - t - 1} (γλ)^l δ_i(t+l),  δ_i(t) = r_i(t) + γ V_ω(s_i(t+1)) - V_ω(s_i(t))
wherein γ is the discount factor; λ ∈ [0,1] is a parameter used to balance the estimation bias and variance; T_max represents the cardinality of the time slot set T = {0, 1, …, T_max - 1}; V_ω(s_i(t)) represents the state value function estimated by the Critic network according to the global state s_i(t) of requesting user i; r_i(t) represents the instant reward of requesting user i in time slot t;
the Critic network parameter ω takes the optimal value ω*, and the optimal value ω* is determined by the following formula:
ω* = argmin_ω J(ω)
wherein J(ω) is the objective function of the Critic network:
J(ω) = E_{π_θold}[ (R_i(t) - V_ω(s_i(t)))^2 ]
wherein E_{π_θold}[·] represents the empirical average over the trajectory samples τ generated by the policy π_θold;
R_i(t) is the cumulative discount reward of the requesting user:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where t' indexes the time slots over which the reward is accumulated;
s6.4.2, determining the policy function of the Actor network that can support discrete-continuous hybrid decisions:
π_θ(a_i(t)|o_i(t)) = π_{θd}(a_i^d(t)|o_i(t)) · π_{θc}(a_i^c(t)|o_i(t), a_i^d(t))
s6.4.3, determining the optimal value θ* of the Actor network parameter θ, θ = [θ_d, θ_c]:
θ* = argmax_θ L(θ)
L(θ) = E_{π_θold}[ min( ρ_i(t) Â_i(t), clip(ρ_i(t), 1-ε, 1+ε) Â_i(t) ) ],  ρ_i(t) = π_θ(a_i(t)|o_i(t)) / π_θold(a_i(t)|o_i(t))
wherein L(θ) is the objective function of the Actor network; θ_d represents the network parameters of the discrete action policy network, and θ_c represents the network parameters of the continuous action policy network; the Actor network comprises the discrete action policy network π_{θd}(a_i^d(t)|o_i(t)) and the continuous action policy network π_{θc}(a_i^c(t)|o_i(t), a_i^d(t));
wherein E_{π_θold}[·] represents the empirical average over the samples (o_i, a_i) generated by the policy π_θold, ε is a hyper-parameter, π_θ(a_i(t)|o_i(t)) represents the policy function of the updated Actor network, π_θold(a_i(t)|o_i(t)) represents the policy function of the Actor network before the update, clip(·) represents the clip function used to limit the ratio ρ_i(t), a_i^d(t) represents the discrete action, and a_i^c(t) represents the continuous action.
Further, step S7 specifically includes:
s7.1, at the beginning of each round, all requesting users start from a randomly set initial state;
s7.2, at the beginning of each time slot, the local observation state o_i(t) observed in the time slot is input to the Actor network to obtain the offloading mode and target selected by requesting user i in time slot t, the offloading ratio, the resource allocation policy, and the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t);
wherein a_i(t) = {a_i^d(t), a_i^c(t)};
s7.3, at the end of each time slot, the local observation state o_i(t) of each requesting user, the discrete-continuous hybrid action a_i(t), the instant reward r_i(t) obtained after the requesting user performs the offloading, the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t), and the new local observation state o_i(t+1) observed by the requesting user are composed into the record {o_i(t), a_i(t), r_i(t), π_θold(a_i(t)|o_i(t)), o_i(t+1)} and stored into the data cache D;
s7.4, repeating step S7.2 and step S7.3 for each time slot until one round ends; the duration of each round is T_max time slots;
s7.5, calculating the cumulative discount reward R_i(t) and the estimate Â_i(t) of the advantage function of requesting user i in time slot t according to the cumulative discount reward and advantage function formulas, and updating the information in the data cache D to {o_i(t), a_i(t), π_θold(a_i(t)|o_i(t)), R_i(t), Â_i(t)};
s7.6, repeating steps S7.1 to S7.5 for each round until the data cache D is full, and taking the information stored in the data cache D as training data;
s7.7, updating the network parameters θ_d of the discrete action policy network and the network parameters θ_c of the continuous action policy network in the Actor network according to the objective function L(θ) of the Actor network, using part of the training data stored in the data cache D; updating the Critic network parameters ω according to the objective function J(ω) of the Critic network;
s7.8, repeating step S7.7 until all the training data stored in the data cache D have been used by step S7.7, and then emptying the data cache D;
s7.9, judging whether the number of rounds reaches a preset number of rounds; if so, training is completed and the trained Actor network and Critic network are obtained; otherwise, repeating steps S7.1 to S7.8 until the number of rounds reaches the preset number of rounds.
The invention also provides a computer program product comprising a computer program, characterized in that when the program is executed by a processor, the steps of the D2D-MEC unloading method based on deep reinforcement learning are realized.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a D2D-MEC unloading method based on deep reinforcement learning, which combines MEC unloading and D2D unloading, allows a mobile device to locally execute a computing task, or unloads part of the task to an MEC server, or unloads part of the task to adjacent D2D equipment for processing. In order to achieve joint optimization of offloading decisions, power control and computing resource allocation, with the goal of minimizing energy consumption and traffic service latency, a deep reinforcement learning algorithm based on near-end policy optimization (Proximal Policy Optimization, PPO) is employed to obtain an optimal control policy.
2. The invention describes the dynamic characteristics of the system by using a discrete time model. At the beginning of each time slot, the requesting user observes the current state of the D2D-MEC network, and accordingly makes an offloading decision, thereby minimizing the long-term service overhead of the requesting user in the D2D-MEC system.
3. The invention provides a multi-agent PPO algorithm capable of supporting a discrete-continuous mixed action space based on the PPO algorithm, which is used for jointly optimizing unloading decision and calculating a resource and transmitting power allocation scheme. Each request user is used as an independent intelligent agent by adopting a centralized training and distributed executing mechanism and an Actor-Critic framework, and the action to be taken is determined by utilizing an Actor network according to local state information observed by the request user; the centralized Critic network combines the local observation information of all the agents into the total state information, and then estimates the dominance function of each agent by combining the environmental rewards value, and guides the updating of the Actor network.
4. The invention also provides a computer program product capable of executing the steps of the method, and the method can be popularized and applied to monitoring on corresponding hardware equipment.
Drawings
FIG. 1 is a schematic diagram of a D2D-MEC network architecture;
FIG. 2 is a schematic diagram of a multi-agent centralized training distributed execution in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation principle of an Actor network according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating the process of collecting training data and updating network parameters according to an embodiment of the present invention.
Wherein: 1-first requesting user, 2-service user, 3-MEC server, 4-base station, 5-second requesting user, 6-third requesting user.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Reinforcement learning is an important branch of machine learning research, and refers to a process that an intelligent agent continuously explores and tries to make mistakes in the process of interacting with the environment, and gradually adjusts a behavior mode according to environmental feedback until an optimal behavior strategy is obtained. In recent years, with the advent of deep learning, reinforcement learning was fused with deep neural networks, forming a deep reinforcement learning (Deep Reinforcement Learning, DRL) technique. Compared with the traditional reinforcement learning, the DRL can fully utilize the strong expression capability of the deep neural network to process the sequential decision problem with high-dimensional state space and action space, so that the DRL is suitable for solving the problem of edge calculation unloading in the random dynamic environment of the wireless communication network.
The D2D-MEC network shown in fig. 1 is composed of one base station 4 (BS) and M mobile devices (MDs). The base station 4 is equipped with a local MEC server 3 that can provide computation offloading services for users. In this network architecture there are two types of mobile users: one type has computing tasks to process and is called Request Devices (RDs), such as the first, second and third requesting users 1, 5, 6 in fig. 1; the other type can provide D2D offloading services and is called Service Devices (SDs), such as the service user 2. The arrows in fig. 1 represent task offloading.
The requesting user can freely select a task offloading mode and an offloading target based on information such as the local calculation task amount, channel conditions, calculation loads of the MEC server 3 and the service user 2, and the like. The three offloading modes defined in the present invention are: mode 1 is local computing, i.e., tasks are all performed locally to the user, as the first requesting user in FIG. 1; mode 2 is MEC offloading, i.e. the task is partially or fully uploaded to the MEC server 3 for processing, the remainder being executed locally, as the second requesting user 5 in fig. 1; mode 3 is D2D offloading, i.e. offloading of the task partly or entirely over the D2D link to some service user 2 in the vicinity, the remainder being performed locally, as the third requesting user 6 in fig. 1.
Define the set of requesting users, the set of service users M_SD, and the candidate offload target set M_target = {0} ∪ M_SD, where 0 represents the MEC server 3 at the base station 4. At each instant a requesting user can select at most one offload target (the MEC server 3 at the base station 4 or a service user 2), and a service user 2 can serve at most one requesting user.
The invention adopts a discrete time model to describe the dynamic characteristics of the system; at the beginning of each time slot, the requesting user observes the current state of the D2D-MEC network and makes an offloading decision accordingly. The time slot length is Δt, and the time slot set is T = {0, 1, …, T_max - 1}. The method comprises the following specific steps:
(1) Task model building
The invention considers delay-sensitive computing tasks, where τ_max is the maximum service delay that a task can tolerate. The j-th task that requesting user i needs to process is described by the triplet T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>, wherein D_{i,j} is the amount of computation data of the task, in bits; C_{i,j} is the number of CPU cycles required by the task, in cycles; a_{i,j} is the arrival time of the task, in seconds. Newly arrived tasks are first put into the local buffer queue of the device and then processed in FIFO (first-in first-out) order, so that the tasks of each user form a task queue. Define Q_i(t) as the length of the task queue at the beginning of time slot t and b_{i,j} as the time at which task T_{i,j} starts to be processed; the queuing delay q_{i,j} of task T_{i,j} is:
q_{i,j} = b_{i,j} - a_{i,j}
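As an illustrative sketch only (not part of the patent disclosure), the triplet task model and the FIFO queuing delay q_{i,j} = b_{i,j} - a_{i,j} can be expressed in Python as follows; the class names Task and TaskQueue are hypothetical.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Task:
    """Triplet task model T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>."""
    data_bits: float      # D_{i,j}: amount of computation data, in bits
    cpu_cycles: float     # C_{i,j}: CPU cycles required
    arrival_time: float   # a_{i,j}: arrival time, in seconds

class TaskQueue:
    """FIFO buffer queue of one requesting user."""
    def __init__(self):
        self.queue = deque()

    def push(self, task: Task):
        self.queue.append(task)

    def pop(self, start_time: float):
        """Take the head-of-line task; queuing delay q = b_{i,j} - a_{i,j}."""
        task = self.queue.popleft()
        queuing_delay = start_time - task.arrival_time
        return task, queuing_delay

# Example: a task arriving at t = 0.2 s that starts service at t = 0.5 s
q = TaskQueue()
q.push(Task(data_bits=2e6, cpu_cycles=1e9, arrival_time=0.2))
_, delay = q.pop(start_time=0.5)
print(delay)  # 0.3 s of queuing delay
```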
(2) Establishing a communication model
In order to avoid signal interference between users, the cellular communication between the mobile users (MDs) and the base station (BS) and the D2D communication between mobile users all adopt orthogonal frequency division multiple access, and the D2D communication and the cellular communication do not reuse spectrum resources. Assuming that the network allocates one subchannel to each offloading user and the bandwidth of the subchannel is W Hz, according to the Shannon formula the transmission rate of the subchannel is:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k, h_{i,k}(t) denotes the subchannel gain between requesting user i and offload target k (k ∈ M_target), p_{i,k}(t) is a decision variable representing the subchannel transmit power between requesting user i and offload target k, with 0 < p_{i,k}(t) ≤ p_max, where p_max is the maximum transmit power of the user equipment; σ^2 is the noise power.
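A minimal numerical sketch of the subchannel rate above, assuming the standard Shannon capacity form r = W log2(1 + h·p/σ^2) and purely illustrative parameter values:

```python
import math

def subchannel_rate(bandwidth_hz, channel_gain, tx_power_w, noise_power_w):
    """r_{i,k}(t) = W * log2(1 + h_{i,k}(t) * p_{i,k}(t) / sigma^2), in bit/s."""
    return bandwidth_hz * math.log2(1.0 + channel_gain * tx_power_w / noise_power_w)

# Illustrative values: 1 MHz subchannel, -30 dB gain, 0.1 W transmit power, 1e-13 W noise
r = subchannel_rate(1e6, 1e-3, 0.1, 1e-13)
print(f"{r/1e6:.1f} Mbit/s")
```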
(3) Establishing a calculation model of time delay and energy consumption
In different offload modes, the latency of requesting users to process computing tasks and the resulting device energy consumption have different forms. Mode 1 involves only local computation latency and local computation energy consumption; in modes 2 and 3, the task processing delays include a local computation delay, an offload data transmission delay, and a remote device computation delay, and the energy consumption includes a local computation energy consumption and a data transmission energy consumption. Considering that the data volume of the task processing result is smaller, the invention ignores the time delay and the energy consumption of the result feedback.
(a) Local computation delay and energy consumption
Define the decision variable α_i(t) ∈ [0,1] as the task offload ratio of requesting user i in time slot t. When mode 1 is selected, α_i(t) = 0; when mode 2 or 3 is selected, α_i(t) ∈ (0,1]. Assuming that task T_{i,j} is processed in time slot t, the amount of data that needs to be computed locally is [1 - α_i(t)]D_{i,j}, and the resulting computation delay and computation energy consumption are respectively:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
To reduce power consumption, the mobile device applies the Dynamic Voltage and Frequency Scaling (DVFS) technique to schedule local computing resources. Define the decision variable f_i^loc(t) as the CPU frequency selected by user i in time slot t, with 0 < f_i^loc(t) ≤ f_max, where f_max is the maximum CPU frequency. κ_i denotes the effective switched capacitance, whose size is determined by the device chip architecture.
(b) MEC unloading delay and energy consumption
In mode 2, the requesting user transmits the offload data to the BS (base station) through an uplink subchannel for processing by the MEC server. Assuming that the MEC server allocates its computing resources evenly among all current offloading users, the transmission delay and computation delay of the offload data of task T_{i,j} are respectively:
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein u_mec(t) is the number of MEC offloading users in time slot t, f_mec is the CPU frequency of the MEC server, and r_{i,0}(t) is the information transmission rate between user i and the BS.
The device energy consumption of the requesting user depends on the offload data transmission delay and the transmit power, and is calculated as:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
(c) D2D offloading latency and energy consumption
When the user selects mode 3, the offload data is transmitted to the service user over the D2D link and processed immediately. The resulting transmission delay and computation delay are respectively:
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein r_{i,k}(t) is the information transmission rate between requesting user i and service user k, and f_k^sd(t) is the CPU frequency of service user k in time slot t.
The device energy consumption of the requesting user depends on the transmission delay L_{i,k}^tr(t) and the transmit power p_{i,k}(t), and is calculated as:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
To sum up, the total service delay of task T_{i,j} consists of the processing delay and the queuing delay, namely:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
where the second term inside max{·} is absent in mode 1 and k = 0 in mode 2. If L_{i,j} > τ_max, the task will be discarded due to service timeout.
The total energy consumption generated by the requesting user for processing T_{i,j} is:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
where the second term is absent in mode 1 and k = 0 in mode 2.
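The per-mode delay and energy terms above can be gathered into one helper, sketched below under the assumption that the local and offloaded portions of a task are processed in parallel (so the processing delay is the larger of the two branches); the function name service_cost and all numeric values are illustrative, not values prescribed by the invention.

```python
def service_cost(mode, alpha, D, C, q_delay,
                 f_loc, kappa, rate, p_tx, f_remote):
    """Return (total service delay L_ij, device energy E_ij) for one task.

    mode: 1 = local, 2 = MEC offload, 3 = D2D offload
    alpha: offload ratio (0 for mode 1); D: data bits; C: CPU cycles
    f_loc: local CPU frequency; kappa: effective switched capacitance
    rate: subchannel rate to the offload target; p_tx: transmit power
    f_remote: CPU share at the offload target (f_mec / u_mec or f_k^sd)
    """
    # Local branch: compute the (1 - alpha) share on the device itself
    t_loc = (1.0 - alpha) * C / f_loc
    e_loc = kappa * f_loc ** 2 * (1.0 - alpha) * C

    if mode == 1:
        return q_delay + t_loc, e_loc

    # Offload branch: transmit alpha*D bits, then compute alpha*C cycles remotely
    t_tx = alpha * D / rate
    t_remote = alpha * C / f_remote
    e_tx = p_tx * t_tx

    delay = q_delay + max(t_loc, t_tx + t_remote)   # parallel execution assumption
    return delay, e_loc + e_tx

# Example: offload 60% of a task to the MEC server
L, E = service_cost(mode=2, alpha=0.6, D=2e6, C=1e9, q_delay=0.05,
                    f_loc=1e9, kappa=1e-27, rate=20e6, p_tx=0.1, f_remote=2.5e9)
print(L, E)
```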
(4) Setting an overhead function
The system overhead defined by the invention includes the delay overhead, the energy consumption overhead and a service timeout penalty, which gives the overhead function of a single task T_{i,j}:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, and 1{·} is the indicator function, which equals 1 when the condition L_{i,j} > τ_max holds and 0 otherwise.
Since a user may process multiple tasks within a single slot, the overhead function of time slot t is defined as:
c_i(t) = Σ_{j ∈ T_i(t)} c_{i,j}
wherein T_i(t) is the index set of the tasks processed by user i in time slot t.
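An illustrative sketch of the per-task and per-slot overhead above; the weight values and the timeout threshold are arbitrary examples, not values prescribed by the invention.

```python
def task_cost(L, E, tau_max, beta1=1.0, beta2=1.0, beta3=5.0):
    """c_{i,j} = beta1*L + beta2*E + beta3 * 1{L > tau_max}."""
    timeout_penalty = 1.0 if L > tau_max else 0.0
    return beta1 * L + beta2 * E + beta3 * timeout_penalty

def slot_cost(tasks_in_slot, tau_max):
    """c_i(t): sum of the costs of all tasks processed by user i in slot t."""
    return sum(task_cost(L, E, tau_max) for (L, E) in tasks_in_slot)

print(slot_cost([(0.45, 0.41), (1.2, 0.30)], tau_max=1.0))  # second task times out
```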
The specific offloading algorithm is designed as follows:
The main object of the present invention is to minimize the long-term service overhead of the requesting users in the D2D-MEC system. Therefore, the optimization problem is modeled as a partially observable Markov decision process, and a multi-agent PPO algorithm that supports a discrete-continuous hybrid action space is proposed on the basis of the PPO algorithm to jointly optimize the offloading decision and the allocation of computing resources and transmit power. The algorithm employs a "centralized training, distributed execution" mechanism and an Actor-Critic framework, as shown in FIG. 2. Each requesting user acts as an independent agent and, based on its locally observed state information o_i(t), uses an Actor network to determine the action a_i(t) to be taken; the centralized Critic network combines the local observations of all agents into the overall state information and, together with the environmental reward value, estimates the advantage function Â_i(t) of each agent to guide the update of the Actor network. The method comprises the following steps:
(1) Obtaining local observation state and global state
At the beginning of each time slot, the requesting user i observes the data volume of the local task queue (i.e. the length of the task queue) and the state of the external environment, including the channel quality between the requesting user and the candidate offload target and the computational load of the candidate target, obtaining the local observation state o i (t)={Q i (t),h i (t),F(t)}。
wherein ,hi (t)=[h i,0 (t),h i,1 (t),…,h i,N (t)]Representing channel gains between the requesting user and each candidate offload target;
F(t)=[f 0 (t),f 1 (t),…,f N (t)] T f (T) is (N+1). Times.T W Matrix of dimensions representing previous T of candidate target device W The computation load history information of each slot is collected and distributed to all requesting users by a BS (base station). f (f) 0 (t)=[u mec (t-T W ),u mec (t-T W +1),…,u mec (t-1)] T Representing computational load information of MEC server, f k (t)=[d k (t-T W ),d k (t-T W +1),…,d k (t-1)] T Service record, d, representing service user k k (t-j) is the amount of offload data that the user handles in time slot t-j.
Global state of agent i is determined by its local observation state o i (t) and other agent's local observed state o -i (t) spliced, subscript-i represents all agents except i. Deleting repeated information in the splicing process to obtain a global state s i (t) is defined as follows:
s i (t)={0′ i (t),o′ -i (t),F(t)}
wherein ,o′i (t)=[Q i (t),h i (t)]。
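The construction of o_i(t) and s_i(t) described above can be sketched with numpy as follows; the flattening into plain vectors and the helper names are assumptions made for illustration only.

```python
import numpy as np

def local_observation(Q_i, h_i, F):
    """o_i(t) = {Q_i(t), h_i(t), F(t)} flattened into one vector.

    Q_i : scalar task-queue length
    h_i : (N+1,) channel gains to the MEC server and the N service users
    F   : (N+1, T_W) computation-load history of the candidate targets
    """
    return np.concatenate(([Q_i], h_i, F.ravel()))

def global_state(Q_all, h_all, F):
    """s_i(t): concatenate {Q_i, h_i} of every requesting user with one copy of F(t)."""
    per_user = np.concatenate([np.concatenate(([Q], h)) for Q, h in zip(Q_all, h_all)])
    return np.concatenate([per_user, F.ravel()])

N, T_W, M = 3, 4, 2               # 3 service users, 4-slot history, 2 requesting users
F = np.random.rand(N + 1, T_W)
h_all = [np.random.rand(N + 1) for _ in range(M)]
Q_all = [5.0, 2.0]
print(local_observation(Q_all[0], h_all[0], F).shape)  # (1 + 4 + 16,) = (21,)
print(global_state(Q_all, h_all, F).shape)             # (2*5 + 16,) = (26,)
```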
(2) Decision of mixing action
The decision process of the agent in each time slot is divided into two phases: an appropriate offloading mode is first selected, then an offloading ratio is determined, and computing resources and transmit power are allocated. For this purpose, discrete-continuous mixing actions are definedWherein discrete actions->Indicating the offloading mode and offloading destination selected by agent i at time slot t, 0 represents the MEC server, 1,2, …, N represents the service user, and n+1 represents the local calculation. Continuous action->Representing a resource allocation policy, which is defined as: />
As shown in fig. 3, in order to implement the hybrid action decision, the present invention extends the Actor network in the PPO algorithm, and defines discrete action policy networks respectivelyContinuous action policy networkd and θc Network parameters of the discrete action strategy network and the continuous action strategy network respectively) and solves the back propagation derivative problem after discrete action sampling by utilizing a Gumbel-Softmax method.
In FIG. 3, the state information F (t) is input to a Long Short-Term Memory (LSTM) layer to predict the recent load level of the candidate load shedding target, and then to be compared with Q i (t)、h i (t) stitching together as inputs to the discrete action policy network and the continuous action policy network. Policy functionObeying the Categorical distribution, outputting the probability value of all the discrete actions selected according to the input local observation state, and obtaining the discrete actions after Gumbel-Softmax sampling>G (0, 1) in fig. 3 represents the sampled value of the gummel distribution with parameter (0, 1). Policy function->Obeying Gaussian distribution, outputting the mean value and standard deviation of continuous motion distribution according to the input local observation state and discrete motion value, and obtaining continuous motion +.>Because all the request users have homogeneity and can generate the same decision result under the same input state, the invention adopts a parameter sharing mechanism to lead all the intelligent agents to share the same decision resultNetwork parameter θ d and θc To accelerate the convergence speed of the algorithm.
(3) Setting a reward function
After all agents perform their respective actions, the environment immediately feeds back the reward information. The instant reward obtained by agent i in time slot t is:
r_i(t) = -c_i(t)
The goal of an agent is to maximize its long-term reward, for which the following cumulative discount reward is defined:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where γ ∈ [0,1] is a discount factor used to balance long-term rewards against instant rewards, and t' indexes the time slots over which the reward is accumulated.
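A short sketch of the cumulative discount reward, computed backwards over one round (illustrative only; the reward values are arbitrary):

```python
def discounted_returns(rewards, gamma=0.95):
    """R_i(t) = sum_{t'>=t} gamma^(t'-t) * r_i(t'), for every slot of a round."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Rewards are the negated slot costs, r_i(t) = -c_i(t)
print(discounted_returns([-1.0, -0.5, -2.0], gamma=0.9))
```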
(4) Setting an objective function
In the present invention, the Actor network is formed by cascading a discrete action policy network and a continuous action policy network, and therefore, the policy function is defined as:
parameter θ= [ θ ] dc ]Optimal value θ * Satisfy the following requirements
/>
Wherein L (θ) is an objective function of the Actor network, θ old Representing the parameter values before the updating of the Actor network, training samples (o i ,a i ) By policyGenerate->Is the estimated value of the dominance function of agent i in time slot t, and is generated by Critic network. Pi θ (a i (t)|o i (t)) and->Respectively before and after updating, when the input state is o i At (t), agent selecting action a i (t) probability. clip function is used to limit the ratio +.>To avoid the algorithm difficult to converge due to too fast parameter update; epsilon is a hyper-parameter.
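A minimal PyTorch sketch of the clipped surrogate objective and the mean-square-error Critic objective above; log_prob_new and log_prob_old stand for the log-probabilities of the executed hybrid actions under the updated policy and the behaviour policy respectively (an assumption about the interface, not the patent's code). The actor objective is returned negated so that it can be minimized by gradient descent.

```python
import torch

def ppo_actor_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """-L(theta) = -E[ min(rho*A, clip(rho, 1-eps, 1+eps)*A) ], rho = pi_theta / pi_theta_old."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def critic_loss(values, returns):
    """J(omega): mean squared error between V_omega(s_i(t)) and the discounted return R_i(t)."""
    return torch.mean((returns - values) ** 2)

adv = torch.tensor([0.5, -1.0, 2.0])
lp_new = torch.tensor([-1.0, -0.8, -1.2]); lp_old = torch.tensor([-1.1, -0.9, -1.0])
print(ppo_actor_loss(lp_new, lp_old, adv))
print(critic_loss(torch.tensor([0.2, 0.1, -0.3]), torch.tensor([0.0, 0.5, -0.2])))
```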
Critic network is based on global state s of agent i (t) estimating a state value functionω is a network parameter whose optimal value satisfies:
j (ω) is the objective function of the Critic network, which is defined in terms of mean square error.
Critic network utilization status valueAnd environmental rewards r i And (t) further estimating an advantage function, and feeding back to an Actor for parameter training. The invention adopts a generalized dominance estimation (GAE) method to estimate the dominance function, and the calculation formula is as follows:
wherein the discount factor gamma and the parameter lambda epsilon 0,1 are used to balance the estimated deviation and variance.
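A sketch of generalized advantage estimation as described above (numpy; appending a final value of 0 for the state after the last slot is an assumption about the terminal-value convention):

```python
import numpy as np

def gae(rewards, values, gamma=0.95, lam=0.95):
    """A_hat(t) = sum_l (gamma*lam)^l * delta(t+l), delta(t) = r(t) + gamma*V(t+1) - V(t).

    values has one extra entry: the value of the state after the last slot (0 if terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

print(gae(rewards=[-1.0, -0.5, -2.0], values=[-3.0, -2.2, -1.9, 0.0]))
```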
(5) Algorithm training process
As shown in fig. 4, the training process of the algorithm is divided into two phases: training data is collected and network parameters are updated. In the first phase, the system collects data in rounds (epodes), one round having a duration T max And each time slot. At the beginning of each round, all agents start from a randomly set initial state and observe the local observation state o observed by the time slot i (t) input to an Actor network to obtain an offload modeUnloading proportion and resource allocation scheme->Executing action a i Probability of (t)>After the intelligent agent executes the unloading action, the instant rewards r of the environment feedback are obtained i (t) and observe a new local observation state o i (t+1). At the end of each time slot, information of all agents is addedSpliced together and stored in a data buffer D. After one round is completed, the system calculates the cumulative discount prize according to equations (1), (4) and (5), respectively>And dominance function value->Then the record in cache D is updated to +.>In the form of (a). If D is not full, it indicates that the sample data volume has not reached the trainingThe exercise requirements, the system will continue to execute a new round. />
In the second stage, the algorithm trains the Actor network and the Critic network respectively by using the cached data of D. The training process is carried out in batches, a small batch of data is randomly sampled from D during each training, and the parameter theta of the Actor network is updated according to the formula (2) d and θc And updating the parameter omega of the Critic network according to the formula (3). And after the parameter updating of the round is finished, D is emptied, new data are continuously collected, and the parameter updating of the next round is prepared.
After training, a more accurate unloading method selection can be performed in practical application through the trained Actor network.
The offloading method of the invention may also form a computer program product comprising a computer program which, when executed by a processor, implements the steps of a deep reinforcement learning based D2D-MEC offloading method.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A D2D-MEC offloading method based on deep reinforcement learning, comprising the steps of:
s1, establishing a D2D-MEC edge computing system model, and executing steps S2 to S8 according to the D2D-MEC edge computing system model;
s2, establishing a task model and determining task queuing delay;
s3, determining the transmission rate of the sub-channel in the network according to a shannon formula;
s4, determining the processing time delay and the energy consumption of each unloading mode according to the task model, the task queuing time delay and the transmission rate of the sub-channel, and obtaining the total service time delay of the task and the total energy consumption generated by requesting a user to process the task;
s5, setting an overhead function according to total energy consumption and total service time delay generated by a request user for processing tasks;
s6, constructing a multi-agent deep reinforcement learning model, wherein the multi-agent deep reinforcement learning model comprises an Actor network and a Critic network;
s6.1, determining local state information observed by a requesting user at the beginning of each time slot;
s6.2, splicing the local observation state of any request user with the local observation states of all other request users, and deleting repeated information to obtain the global state of the request user;
s6.3, determining, according to the overhead function, a reward function indicating the environmental feedback reward and the cumulative discount reward obtained after the requesting user executes an action;
s6.4, determining optimization targets of an Actor network and a Critic network;
s7, training an Actor network and a Critic network, and optimizing parameters;
s8, selecting an unloading mode when the D2D-MEC network is unloaded through the trained Actor network, determining an unloading proportion, and distributing computing resources and transmitting power;
in step S3, cellular communication between the mobile user and the base station in the network and D2D communication between the mobile users all adopt an orthogonal frequency division multiple access mode, and the D2D communication and the cellular communication do not multiplex spectrum resources;
in step S3, specifically, the transmission rate of the sub-channel is determined by the following formula:
r_{i,k}(t) = W log2(1 + h_{i,k}(t) p_{i,k}(t) / σ^2)
wherein r_{i,k}(t) represents the subchannel transmission rate between requesting user i and offload target k; h_{i,k}(t) represents the subchannel gain between requesting user i and offload target k; p_{i,k}(t) represents the subchannel transmit power between requesting user i and offload target k; σ^2 represents the noise power; W represents the bandwidth of the subchannel;
in step S4, the offloading modes are three, including a mode 1, a mode 2 and a mode 3, where the mode 1 is that tasks are all executed locally by a user, the mode 2 is that tasks are partially or fully offloaded to an MEC server for processing, and the mode 3 is that tasks are partially or fully offloaded to an adjacent D2D service user for processing;
the step S4 specifically comprises the following steps:
s4.1, determining processing time delay and energy consumption of three unloading modes through local calculation time delay and energy consumption, MEC unloading time delay and energy consumption and D2D unloading time delay and energy consumption in the three unloading modes:
local computation delay and energy consumption:
the local computation delay L_i^loc(t) is:
L_i^loc(t) = [1 - α_i(t)] C_{i,j} / f_i^loc(t)
the local computation energy consumption E_i^loc(t) is:
E_i^loc(t) = κ_i [f_i^loc(t)]^2 [1 - α_i(t)] C_{i,j}
wherein f_i^loc(t) is the CPU frequency of user i in time slot t, and κ_i is the effective switched capacitance; α_i(t) is the task offload ratio, with α_i(t) = 0 for mode 1 and α_i(t) ∈ (0,1] for mode 2 and mode 3;
MEC offloading delay and energy consumption:
the MEC offloading delay includes the transmission delay L_{i,0}^tr(t) and the computation delay L_{i,0}^c(t):
L_{i,0}^tr(t) = α_i(t) D_{i,j} / r_{i,0}(t)
L_{i,0}^c(t) = α_i(t) C_{i,j} / (f_mec / u_mec(t))
wherein r_{i,0}(t) denotes r_{i,k}(t) with k taken as 0, f_mec is the CPU frequency of the MEC server, and u_mec(t) denotes the number of MEC offloading users in time slot t;
the MEC offloading energy consumption E_i^mec(t) is:
E_i^mec(t) = p_{i,0}(t) L_{i,0}^tr(t)
wherein p_{i,0}(t) denotes p_{i,k}(t) with k taken as 0;
D2D offloading delay and energy consumption:
the D2D offloading delay includes the transmission delay L_{i,k}^tr(t) and the computation delay L_{i,k}^c(t):
L_{i,k}^tr(t) = α_i(t) D_{i,j} / r_{i,k}(t)
L_{i,k}^c(t) = α_i(t) C_{i,j} / f_k^sd(t)
wherein f_k^sd(t) is the CPU frequency of service user k in time slot t;
the D2D offloading energy consumption E_i^d2d(t) is:
E_i^d2d(t) = p_{i,k}(t) L_{i,k}^tr(t)
s4.2, calculating the total service delay L_{i,j} of task T_{i,j}, which consists of the processing delay and the queuing delay:
L_{i,j} = q_{i,j} + max{ L_i^loc(t), L_{i,k}^tr(t) + L_{i,k}^c(t) }
(the second term inside max{·} is absent in mode 1, and k = 0 in mode 2);
s4.3, calculating the total energy consumption E_{i,j} generated by the requesting user for processing task T_{i,j}:
E_{i,j} = E_i^loc(t) + p_{i,k}(t) L_{i,k}^tr(t)
(the second term is absent in mode 1, and k = 0 in mode 2).
2. The D2D-MEC unloading method based on deep reinforcement learning according to claim 1, wherein step S2 is specifically:
s2.1, establishing a triplet task model:
T_{i,j} = <D_{i,j}, C_{i,j}, a_{i,j}>
wherein D_{i,j} represents the amount of computation data of the j-th task that the requesting user needs to process; C_{i,j} represents the number of CPU cycles required by the j-th task that the requesting user needs to process; a_{i,j} represents the arrival time of the j-th task that the requesting user needs to process; T_{i,j} represents the j-th task that the requesting user needs to process;
s2.2, determining the queuing delay q_{i,j} of task T_{i,j} by the following formula:
q_{i,j} = b_{i,j} - a_{i,j};
wherein b_{i,j} is the time at which the requesting user starts processing its j-th task.
3. The D2D-MEC offloading method based on deep reinforcement learning according to claim 2, wherein: in step S5, the cost function c_{i,j} is:
c_{i,j} = β1 L_{i,j} + β2 E_{i,j} + β3 · 1{L_{i,j} > τ_max}
wherein β1, β2, β3 are the weight factors of the delay overhead, the energy consumption overhead and the service timeout penalty respectively, 1{·} is the indicator function, and τ_max is the maximum service delay that task T_{i,j} can tolerate.
4. The D2D-MEC offloading method based on deep reinforcement learning according to claim 3, wherein: in step S6.1, the local observation state o_i(t) includes the channel gain h_i(t) between the requesting user and each candidate offload target, the computation load history information F(t) of the candidate target devices over the previous T_W time slots, and the length Q_i(t) of the task queue at the beginning of time slot t;
wherein h_i(t) = [h_{i,0}(t), h_{i,1}(t), …, h_{i,N}(t)]; F(t) = [f_0(t), f_1(t), …, f_N(t)]^T;
f_0(t) = [u_mec(t-T_W), u_mec(t-T_W+1), …, u_mec(t-1)]^T, where f_0(t) represents the computation load information of the MEC server and u_mec(·) denotes the number of MEC offloading users; f_k(t) = [d_k(t-T_W), d_k(t-T_W+1), …, d_k(t-1)]^T, where f_k(t) represents the service record of offload target k and d_k(·) represents the amount of offload data processed by service user k; T_W represents the number of columns of the F(t) matrix; k = 1, 2, …, N, with N representing the total number of D2D service users.
5. The D2D-MEC offloading method of claim 4, wherein: the step S6.4 specifically comprises the following steps:
s6.4.1, obtaining the estimate Â_i(t) of the advantage function of requesting user i in time slot t generated by the Critic network through the following formulas:
Â_i(t) = Σ_{l=0}^{T_max - t - 1} (γλ)^l δ_i(t+l),  δ_i(t) = r_i(t) + γ V_ω(s_i(t+1)) - V_ω(s_i(t))
wherein γ is the discount factor; λ ∈ [0,1] is a parameter used to balance the estimation bias and variance; T_max represents the cardinality of the time slot set T = {0, 1, …, T_max - 1}; V_ω(s_i(t)) represents the state value function estimated by the Critic network according to the global state s_i(t) of requesting user i; r_i(t) represents the instant reward of requesting user i in time slot t;
the Critic network parameter ω takes the optimal value ω*, and the optimal value ω* is determined by the following formula:
ω* = argmin_ω J(ω)
wherein J(ω) is the objective function of the Critic network:
J(ω) = E_{π_θold}[ (R_i(t) - V_ω(s_i(t)))^2 ]
wherein E_{π_θold}[·] represents the empirical average over the trajectory samples τ generated by the policy π_θold;
R_i(t) is the cumulative discount reward of the requesting user:
R_i(t) = Σ_{t'=t}^{T_max - 1} γ^{t'-t} r_i(t')
where t' indexes the time slots over which the reward is accumulated;
s6.4.2, determining the policy function of the Actor network that can support discrete-continuous hybrid decisions:
π_θ(a_i(t)|o_i(t)) = π_{θd}(a_i^d(t)|o_i(t)) · π_{θc}(a_i^c(t)|o_i(t), a_i^d(t))
s6.4.3, determining the optimal value θ* of the Actor network parameter θ, θ = [θ_d, θ_c]:
θ* = argmax_θ L(θ)
L(θ) = E_{π_θold}[ min( ρ_i(t) Â_i(t), clip(ρ_i(t), 1-ε, 1+ε) Â_i(t) ) ],  ρ_i(t) = π_θ(a_i(t)|o_i(t)) / π_θold(a_i(t)|o_i(t))
wherein L(θ) is the objective function of the Actor network; θ_d represents the network parameters of the discrete action policy network, and θ_c represents the network parameters of the continuous action policy network; the Actor network comprises the discrete action policy network π_{θd}(a_i^d(t)|o_i(t)) and the continuous action policy network π_{θc}(a_i^c(t)|o_i(t), a_i^d(t));
wherein E_{π_θold}[·] represents the empirical average over the samples (o_i, a_i) generated by the policy π_θold, ε is a hyper-parameter, π_θ(a_i(t)|o_i(t)) represents the policy function of the updated Actor network, π_θold(a_i(t)|o_i(t)) represents the policy function of the Actor network before the update, clip(·) represents the clip function used to limit the ratio ρ_i(t), a_i^d(t) represents the discrete action, and a_i^c(t) represents the continuous action.
6. The D2D-MEC offloading method of claim 5, wherein: the step S7 specifically comprises the following steps:
s7.1, at the beginning of each round, all requesting users start from a randomly set initial state;
s7.2, at the beginning of each time slot, the local observation state o_i(t) observed in the time slot is input to the Actor network to obtain the offloading mode and target selected by requesting user i in time slot t, the offloading ratio, the resource allocation policy, and the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t);
wherein a_i(t) = {a_i^d(t), a_i^c(t)};
s7.3, at the end of each time slot, the local observation state o_i(t) of each requesting user, the discrete-continuous hybrid action a_i(t), the instant reward r_i(t) obtained after the requesting user performs the offloading, the probability π_θold(a_i(t)|o_i(t)) of performing the discrete-continuous hybrid action a_i(t), and the new local observation state o_i(t+1) observed by the requesting user are composed into the record {o_i(t), a_i(t), r_i(t), π_θold(a_i(t)|o_i(t)), o_i(t+1)} and stored into the data cache D;
s7.4, repeating step S7.2 and step S7.3 for each time slot until one round ends; the duration of each round is T_max time slots;
s7.5, calculating the cumulative discount reward R_i(t) and the estimate Â_i(t) of the advantage function of requesting user i in time slot t according to the cumulative discount reward and advantage function formulas, and updating the information in the data cache D to {o_i(t), a_i(t), π_θold(a_i(t)|o_i(t)), R_i(t), Â_i(t)};
s7.6, repeating steps S7.1 to S7.5 for each round until the data cache D is full, and taking the information stored in the data cache D as training data;
s7.7, updating the network parameters θ_d of the discrete action policy network and the network parameters θ_c of the continuous action policy network in the Actor network according to the objective function L(θ) of the Actor network, using part of the training data stored in the data cache D; updating the Critic network parameters ω according to the objective function J(ω) of the Critic network;
s7.8, repeating step S7.7 until all the training data stored in the data cache D have been used by step S7.7, and then emptying the data cache D;
s7.9, judging whether the number of rounds reaches a preset number of rounds; if so, training is completed and the trained Actor network and Critic network are obtained; otherwise, repeating steps S7.1 to S7.8 until the number of rounds reaches the preset number of rounds.
CN202210771544.2A 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning Active CN114938381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210771544.2A CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210771544.2A CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114938381A CN114938381A (en) 2022-08-23
CN114938381B true CN114938381B (en) 2023-09-01

Family

ID=82868820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210771544.2A Active CN114938381B (en) 2022-06-30 2022-06-30 D2D-MEC unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114938381B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499875B (en) * 2022-09-14 2023-09-22 中山大学 Satellite internet task unloading method, system and readable storage medium
CN115913987A (en) * 2022-10-24 2023-04-04 浙江工商大学 Intelligent bus service unloading method in edge computing environment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113543342A (en) * 2021-07-05 2021-10-22 南京信息工程大学滨江学院 Reinforced learning resource allocation and task unloading method based on NOMA-MEC
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN113543342A (en) * 2021-07-05 2021-10-22 南京信息工程大学滨江学院 Reinforced learning resource allocation and task unloading method based on NOMA-MEC
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on task offloading for mobile edge computing based on deep reinforcement learning; 卢海峰; 顾春华; 罗飞; 丁炜超; 杨婷; 郑帅; Journal of Computer Research and Development (Issue 07); full text *

Also Published As

Publication number Publication date
CN114938381A (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN113612843B (en) MEC task unloading and resource allocation method based on deep reinforcement learning
CN114938381B (en) D2D-MEC unloading method based on deep reinforcement learning
CN111031102B (en) Multi-user, multi-task mobile edge computing system cacheable task migration method
CN111414252B (en) Task unloading method based on deep reinforcement learning
CN111586696B (en) Resource allocation and unloading decision method based on multi-agent architecture reinforcement learning
CN110557732B (en) Vehicle edge computing network task unloading load balancing system and balancing method
CN107708152B (en) Task unloading method of heterogeneous cellular network
CN110489176B (en) Multi-access edge computing task unloading method based on boxing problem
CN112511336B (en) Online service placement method in edge computing system
CN114205353B (en) Calculation unloading method based on hybrid action space reinforcement learning algorithm
CN115809147B (en) Multi-edge collaborative cache scheduling optimization method, system and model training method
CN115344395B (en) Heterogeneous task generalization-oriented edge cache scheduling and task unloading method and system
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN113364630A (en) Quality of service (QoS) differentiation optimization method and device
CN114285853A (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN116233927A (en) Load-aware computing unloading energy-saving optimization method in mobile edge computing
KR20230007941A (en) Edge computational task offloading scheme using reinforcement learning for IIoT scenario
CN116366576A (en) Method, device, equipment and medium for scheduling computing power network resources
CN113821346B (en) Edge computing unloading and resource management method based on deep reinforcement learning
CN117354934A (en) Double-time-scale task unloading and resource allocation method for multi-time-slot MEC system
CN116954866A (en) Edge cloud task scheduling method and system based on deep reinforcement learning
CN114025359B (en) Resource allocation and calculation unloading method, system, equipment and medium based on deep reinforcement learning
CN113452625B (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN114980160A (en) Unmanned aerial vehicle-assisted terahertz communication network joint optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant