CN115118783A - Task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies - Google Patents

Task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies

Info

Publication number
CN115118783A
Authority
CN
China
Prior art keywords
task
delay
communication
server
time
Prior art date
Legal status
Pending
Application number
CN202210756389.7A
Other languages
Chinese (zh)
Inventor
吴琼
汪文华
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202210756389.7A
Publication of CN115118783A

Classifications

    • G06F 9/44594 — Unloading (program loading or initiating)
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • H04L 67/12 — Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • G06F 2209/502 — Proximity (indexing scheme relating to G06F 9/50)
    • G06F 2209/509 — Offload (indexing scheme relating to G06F 9/50)


Abstract

The invention discloses a task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies. The method comprises: constructing a vehicle edge computing scenario and a heterogeneous vehicular communication network in which a vehicle can offload tasks to a server for processing via three communication technologies; constructing a dynamic model of the base-station queues to guarantee queue stability; using stochastic network calculus to compute an upper bound on the system delay for offloading over each communication technology, the delay comprising the communication transmission time and the server processing time; establishing the utility of the vehicle edge computing system; formulating an optimization problem whose objective is to minimize the system utility while guaranteeing the task offloading delay and the stability of the base-station queues; and using Soft Actor-Critic reinforcement learning to learn the offloading policy and the server CPU allocation policy for each task. The task offloading policy and resource allocation scheme adopted by the invention outperform other offloading and resource allocation schemes in reducing system utility, controlling system stability, and guaranteeing task transmission delay.

Description

Task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies
Technical Field
The invention belongs to the technical field of edge computing and reinforcement learning for the Internet of Vehicles, and in particular relates to a task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies.
Background
The Internet of Vehicles (IoV) will grow rapidly in the coming 5G era, as the demand of vehicle users (VUs) for an immersive quality of experience (QoE) and computation-intensive services (such as online 3D gaming, augmented and virtual reality (AR/VR), video, or other interactive applications) increases dramatically. Furthermore, for an autonomous vehicle, high-resolution cameras, lidar, high-definition maps, and other on-board sensors will produce about 1 GB of data per second. Performing these computation-intensive tasks places a tremendous strain on a vehicle's limited on-board computing capability. To address the shortage of on-board computing resources (e.g., CPU), vehicle edge computing (VEC) is considered a very promising technology. VEC provides an open wireless network edge platform that enables vehicles to offload computation-intensive tasks to nearby roadside MEC servers with low latency. Although VEC can alleviate the bottleneck of insufficient on-board computing resources to some extent, emerging 5G applications and the ultra-reliable low-latency communication (URLLC) requirements of autonomous driving still put pressure on the development of the Internet of Vehicles. URLLC-related performance requirements include support for up to 1000 times the data volume at the server, ultra-low transmission delays below 5 ms, and ultra-high reliability of 99.99%. These stringent URLLC requirements pose a significant challenge to any single communication technology on the one hand, and challenge the reliability of VEC servers on the other.
Emerging heterogeneous V2X communication increases the communication capacity available to vehicles. Currently, three technologies are widely used in the Internet of Vehicles: dedicated short-range communication (DSRC), cellular vehicle-to-everything (C-V2X) communication, and millimeter-wave (mmWave) communication. DSRC enables short-range communication between vehicles without involving a roadside unit (RSU); it operates mainly in the 5.9 GHz band and is based on the IEEE 802.11p standard. C-V2X lets users benefit from the existing, extensive mobile communication infrastructure; in addition to the 5.9 GHz band, it may also operate in the licensed bands of cellular operators. However, research results show that neither technology supports reliable delay guarantees at high vehicle densities. As a next-generation wireless technology, mmWave operates in a large, under-used spectrum (i.e., 3–300 GHz), can provide multi-gigabit transmission for autonomous driving, and can support applications with high performance requirements. Heterogeneous V2X communication integrates the advantages of the three technologies, providing vehicles with wide-area coverage and more efficient and reliable transmission. However, the randomness of task generation and the time-varying channel conditions in vehicular scenarios greatly affect the offloading performance of vehicle edge computing tasks and pose a challenge for network performance optimization. In recent years, deep reinforcement learning (DRL) has been widely applied to task offloading decisions in the Internet of Vehicles; DRL can adjust its policy to achieve the optimal long-term goal without any prior information about the vehicular environment.
Therefore, the invention provides an ultra-reliable low-latency reinforcement-learning task offloading scheme over heterogeneous communication technologies. The scheme considers contention for task-offloading communication bandwidth and server computing resources, and derives delay upper bounds for mmWave, DSRC, and CV2I offloading using a moment-generating-function (MGF) based stochastic network calculus (SNC) method, thereby guaranteeing low task-offloading delay. Offloaded tasks increase the queue length at the server side and can make the server unstable; Lyapunov optimization has been widely used to stabilize queues, and the scheme uses the Lyapunov technique to guarantee system reliability. In addition, Soft Actor-Critic deep reinforcement learning is used to learn the offloading policy of each task and the server CPU allocation policy under the delay and reliability guarantees, so that optimal offloading and allocation decisions are made, the consumption utility of the whole system is reduced, and network and system performance is improved.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies, which outperforms other offloading and resource allocation methods in reducing system utility, controlling base-station queue stability, and meeting task transmission delay requirements.
The technical scheme is as follows: the task offloading method based on ultra-reliable low-latency reinforcement learning over heterogeneous communication technologies according to the invention comprises the following steps:
(1) constructing a vehicle edge computing scenario consisting of a base station connected to a server, several roadside units, and vehicles; a heterogeneous vehicular communication network is formed by the three communication technologies mmWave, DSRC, and CV2I, and a vehicle can offload tasks to the server for processing via the three technologies;
(2) constructing a bounded-burst traffic model based on stochastic network calculus;
(3) constructing a dynamic model of the base-station queues to guarantee queue stability;
(4) establishing communication transmission models for the three communication technologies mmWave, DSRC, and CV2I based on stochastic network calculus, and establishing a computation processing model for the CPU; combining the communication transmission model and the computation processing model via the min-plus convolution of the concatenation theorem to obtain the system service model;
(5) deriving an upper bound on the delay-violation probability for offloading and processing over each communication technology, where the delay comprises the communication transmission time and the server computation time;
(6) establishing the utility of the vehicle edge computing system, consisting of a communication utility and a computation utility;
(7) formulating an optimization problem whose objective is to minimize the system utility while guaranteeing the task offloading delay and the stability of the base-station queues;
(8) using Soft Actor-Critic reinforcement learning to learn the offloading policy and the server CPU allocation policy for each task.
Further, the step (2) is realized as follows:
assume the vehicles have N types of tasks to process; at the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the interval [t, t+1); given a time interval 0 ≤ s ≤ t, define the bivariate cumulative arrival A_i(s,t) = A_i(t) − A_i(s) as the cumulative amount of type-i tasks arriving at queue i in [s, t); A_i(s,t) follows a bounded-burst traffic model and is a stationary non-negative random process:

A_i(s,t) = λ_i [ ρ_i (t − s) + σ_i ]    (1)

where ρ_i is the task arrival rate and σ_i the task burst size, both constants; λ_i follows a Poisson distribution and denotes the number of vehicles generating the ith task in the interval [s, t).
Further, the step (3) is realized as follows:
the queue length at the base station evolves as:

q_i(t+1) = [ q_i(t) + A_i(t) − α_i(t) f_E Δt / ω_i ]^+    (2)

where q_i(t) is the queue length of the ith task at the beginning of slot t, f_E is the maximum CPU processing clock rate of the server, ω_i is the number of CPU clock cycles required to process each bit of task-i data, α_i(t) is the fraction of CPU clock cycles the server allocates to the ith task, and [x]^+ = max(x, 0). The stability of all queues is controlled by the following definition:

lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[ q_i(t) ] < ∞    (3)

The left-hand side of equation (3) describes the long-term time-averaged backlog of the queue; equation (3) means that strong stability of the queue corresponds to a finite average backlog, and hence a finite average queuing delay.
Further, the step (4) is realized as follows:
let β^mmw(s,t) denote the total mmWave communication service available in the time interval [s,t), C^(q) the channel capacity of slot q, ζ^(q) the channel gain of slot q, γ̄ the signal-to-noise ratio, B the bandwidth, and l and δ the transmission distance and path-loss exponent, respectively; the total communication service provided by mmWave is

β^mmw(s,t) = Σ_{q=s+1}^{t} C^(q),  with C^(q) = B log₂( 1 + γ̄ ζ^(q) l^(−δ) ) = η ln( 1 + γ̄ ζ^(q) l^(−δ) ),

where η = B log₂e; x_i^mmw denotes the proportion of task i transmitted via mmWave, and by the leftover-service theorem the communication service mmWave can provide to the ith task is

β_i^mmw(s,t) = [ β^mmw(s,t) − Σ_{j≠i} A_j^mmw(s,t) ]^+.

The latency-rate model of network calculus is used to establish the total communication service DSRC can provide in the interval [s,t):

β^dsrc(s,t) = R^dsrc [ t − s − T^dsrc ]^+,

where R^dsrc is the DSRC communication bandwidth and T^dsrc is the average access delay caused by collisions when data is transmitted over DSRC; x_i^dsrc denotes the proportion of task i transmitted via DSRC, and the communication service DSRC can provide to the ith task is likewise the leftover service

β_i^dsrc(s,t) = [ β^dsrc(s,t) − Σ_{j≠i} A_j^dsrc(s,t) ]^+.

Let β_i^cv2i(s,t) denote the communication service CV2I can provide to the ith task in [s,t):

β_i^cv2i(s,t) = R^cv2i ( t − s ),

where R^cv2i is the communication bandwidth reserved for the ith task.

Let β_i^comp(s,t) denote the amount of computation the server CPU can provide in [s,t) for processing task i offloaded to the server, which equals the amount processed by the CPU in equation (2):

β_i^comp(s,t) = Σ_{u=s}^{t−1} α_i(u) f_E Δt / ω_i.

Let the set G = {mmw, dsrc, cv2i} denote the communication technologies available for offloading; let A_i^g(s,t) denote the amount of task i offloaded to the server via communication technology g ∈ G in the interval [s,t), x_i^g the proportion of task i transmitted via technology g, and q_i(s) the backlog of task i not yet processed in base-station queue i before time s; the computation service β_i^{g,comp}(s,t) that the server CPU provides to A_i^g in [s,t) is the corresponding leftover service

β_i^{g,comp}(s,t) = [ β_i^comp(s,t) − q_i(s) − Σ_{g'≠g} A_i^{g'}(s,t) ]^+.

The delay of a task is the sum of the communication transmission time and the server CPU processing time, so the total service the system provides to task i combines the communication service and the CPU computation service; the total service the system can offer to task i offloaded via communication technology g is the min-plus convolution of the communication service β_i^{g,comm} of technology g and the CPU computation service β_i^{g,comp}:

β_i^g(s,t) = ( β_i^{g,comm} ⊗ β_i^{g,comp} )(s,t),

where ⊗ is the min-plus convolution operator, the most important operator in stochastic network calculus, with the rule

( f ⊗ g )(s,t) = inf_{s ≤ u ≤ t} { f(s,u) + g(u,t) }.
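For illustration, a small sketch of the min-plus convolution rule above, evaluated on discretized bivariate service curves (Python); the example latency-rate and constant-rate curves are hypothetical, not values from the patent.

```python
def min_plus_convolution(f, g, s, t):
    """(f ⊗ g)(s, t) = min over u in [s, t] of f(s, u) + g(u, t),
    for bivariate service curves given as Python callables on slot indices."""
    return min(f(s, u) + g(u, t) for u in range(s, t + 1))

# Hypothetical example: a latency-rate communication server and a constant-rate CPU server.
R_comm, T_access = 50.0, 2        # bits/slot and slots of access delay (assumed)
R_cpu = 40.0                      # bits/slot of CPU service (assumed)
beta_comm = lambda s, t: R_comm * max(t - s - T_access, 0)
beta_comp = lambda s, t: R_cpu * (t - s)

# End-to-end service offered to a task over [0, 10):
print(min_plus_convolution(beta_comm, beta_comp, 0, 10))
```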
further, the step (5) is realized as follows:
by W i g (t) delay of task i for offloading based on communication technology g, using ω i g The upper bound of the probability of the task i being transmitted by the communication technology g is shown, and the task transmission and processing time W is shown i g (t) exceeds
Figure BDA00037226424600000516
Has a probability of being less than epsilon i The definition is as follows:
Figure BDA00037226424600000517
obtaining an upper bound on the delay
Figure BDA00037226424600000518
The solution of (a) is:
Figure BDA00037226424600000519
in the formula
Figure BDA0003722642460000061
Comprises the following steps:
Figure BDA0003722642460000062
wherein
Figure BDA0003722642460000063
To obtain toUpper bound on probabilistic delay of transaction i
Figure BDA0003722642460000064
First needs to calculate
Figure BDA0003722642460000065
Figure BDA0003722642460000066
Suppose that
Figure BDA0003722642460000067
And
Figure BDA0003722642460000068
i.e. the offloaded tasks may obtain far greater communication and computational resources than the offloading rate of tasks offloaded based on communication technology g
Figure BDA0003722642460000069
To obtain
Figure BDA00037226424600000610
Closed-form solution of (c):
Figure BDA00037226424600000611
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA00037226424600000612
and
Figure BDA00037226424600000613
upper bound of delay
Figure BDA00037226424600000614
The probability e of exceeding the delay bound is determined by a number of factors i Determines the size of the delay, the second item is related to the size of the task burst, and the third item is the communication technology g andthe residual resources calculated by the server and the task amount of the task i unloaded based on the communication technology g jointly determine the delay upper bound
Figure BDA00037226424600000615
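The patent's closed-form bound is technology-specific; as a hedged illustration of how an MGF/Chernoff-style delay-violation bound of this kind can be evaluated numerically, the sketch below assumes a much simpler setting (i.i.d. per-slot arrivals with per-slot MGF m_A(θ) and a constant-rate server R), which is not the patent's exact expression.

```python
import math

def delay_violation_bound(w, rate_R, mgf_A, thetas):
    """Chernoff/MGF bound on P(W > w) for i.i.d. per-slot arrivals with per-slot MGF
    mgf_A(theta) and a constant-rate server of R bits/slot:
        P(W > w) <= inf_theta  e^{-theta*R*w} / (1 - mgf_A(theta)*e^{-theta*R}),
    valid whenever mgf_A(theta)*e^{-theta*R} < 1 (stability).  A simplified stand-in
    for the patent's technology-specific bound."""
    best = float("inf")
    for th in thetas:
        stab = mgf_A(th) * math.exp(-th * rate_R)
        if stab < 1.0:
            best = min(best, math.exp(-th * rate_R * w) / (1.0 - stab))
    return best

# Hypothetical numbers: Poisson(lam) vehicles per slot, each contributing rho bits.
lam, rho, R = 2.0, 100.0, 400.0
mgf_A = lambda th: math.exp(lam * (math.exp(th * rho) - 1.0))   # MGF of compound Poisson
thetas = [k / 1000 for k in range(1, 200)]

for w in (1, 2, 3, 5):
    print(w, delay_violation_bound(w, R, mgf_A, thetas))
```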
Further, the step (7) is realized as follows:
the optimization problem P1 (equation (30), shown as an image in the original) minimizes the long-term time-averaged system utility over the control variables α(t) = [α_1(t), α_2(t), ..., α_N(t)], the CPU clock-cycle allocation, and x(t) = [x_i^g(t)], the communication offloading policy, subject to the constraints C1–C4, where T_i^max is the maximum transmission-and-processing delay requirement of the ith task. Condition C1 requires the queues to be stable; condition C2 ensures that the transmission-and-processing time of each task type stays within the maximum delay requirement — since a task is offloaded over three different communication technologies, the maximum of w_i^mmw, w_i^dsrc, and w_i^cv2i is used as the transmission delay upper bound of the ith task; constraint C3 ensures that the CPU clock cycles used to process all tasks do not exceed the total CPU computing resources available on the server; constraint C4 ensures that each task selects mmWave, DSRC, or CV2I to perform the computation.

The Lyapunov technique is used to handle the long-term stochastic constraint C1: define the quadratic Lyapunov function L(t) and the one-slot Lyapunov drift ΔL_t,

L(t) = (1/2) Σ_{i=1}^{N} q_i(t)²,   ΔL_t = E[ L(t+1) − L(t) | q(t) ],

where q(t) = [q_1(t), q_2(t), ..., q_N(t)]; the expected system utility is then added to the drift to obtain the drift-plus-penalty term ΔL_t + V·E[ F(t) | q(t) ], where V is a non-negative parameter set by the system to trade off between system utility and queue backlog. For any given control parameter V ≥ 0 and offloading workload α_i, an upper bound on the drift-plus-penalty term is derived (shown as an image in the original); the original time-averaged long-term queue-length condition C1 is thereby absorbed implicitly into the optimization objective, and the optimization objective of problem P1 is converted into F_2(t) (shown as an image in the original).
A DRL framework containing states, actions, and rewards is employed to formulate the computing-resource allocation policy and the heterogeneous communication offloading policy in the VEC:

the state space s_t at time t consists of the task-arrival, queue-length, and available-resource quantities defined above (equation shown as an image in the original); the dimension of the state space is 5N;

the action space a_t at time t is a_t = [ α(t), x(t) ], i.e., the CPU clock-cycle allocation α_i(t) and the offloading proportions x_i^g(t) for each task i (equation shown as an image in the original), where α_i(t) and x_i^g(t) must satisfy the constraints of equation (30), namely Σ_i α_i(t) ≤ 1 and Σ_{g∈G} x_i^g(t) = 1; a virtual variable α_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function is applied at the output layer so that the N+1 variables satisfy

α_i(t) = exp(a_i) / Σ_{j=1}^{N+1} exp(a_j),

of which only the first N actions are used; similarly, applying a softmax function to the output actions x_i^mmw(t), x_i^dsrc(t), and x_i^cv2i(t) of each task i enforces Σ_{g∈G} x_i^g(t) = 1; the action space is therefore (N+1) + 3N = 4N+1 dimensional, and its dimension grows with the number of task types;

the reward function r_t at time t is:

r_t(a_t, s_t) = −F_2(t)    (38)

r_t(a_t, s_t) describes the reward the environment feeds back to the agent after taking action a_t in state s_t; π(a_t|s_t) denotes the distribution over actions the agent takes given state s_t, and the expected long-term discounted return of the system is calculated as

J(π) = E_τ [ Σ_t γ^t r_t(a_t, s_t) ]    (39)

where γ ∈ [0,1] is the discount factor indicating whether the agent emphasizes long-term or short-term rewards — the higher the value, the more the agent emphasizes long-term rewards, and conversely current short-term rewards; τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory generated by the agent according to the action distribution π(a_t|s_t).
Further, the step (8) is realized as follows:
while maximizing the optimization objective −F_2(t), the SAC algorithm introduces the policy entropy H(π(·|s_t)) into the reward, so that the expected long-term discounted return of the model becomes

J(π) = E_τ [ Σ_t γ^t ( r_t(a_t, s_t) + β_t H(π(·|s_t)) ) ]    (40)

where β_t is the weight of the policy entropy, which trades off between exploring feasible policies and maximizing the optimization objective; as the reward keeps changing, a fixed β_t can affect the stability of the whole training, so automatically adjusting β_t during training is necessary; the reinforcement-learning optimization problem is converted into choosing, for a lower entropy limit H̄, the β_t that makes β_t·H(π_t(·|s_t)) in equation (40) as large as possible: when the agent has not yet learned the optimal action, β_t is increased to explore more of the space; conversely, once the optimal policy has been learned, β_t is reduced to cut down exploration and accelerate the training of the model; based on the Lagrange multiplier method, the optimal weight β_t* is obtained (equations (41) and (42), shown as images in the original).
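A minimal sketch of the automatic entropy-temperature adjustment described above, following the standard Soft Actor-Critic dual update (Python); the target entropy, learning rate, and log-probability source are assumptions for illustration, not values from the patent.

```python
import torch

class EntropyTemperature:
    """Adjusts beta_t (the entropy weight) via the Lagrange-dual objective
    J(beta) = E[-beta * log pi(a|s) - beta * H_bar]: beta grows while the policy's
    entropy is below the target H_bar and shrinks once it exceeds it."""
    def __init__(self, target_entropy: float, lr: float = 3e-4):
        self.target_entropy = target_entropy          # H_bar (assumed, e.g. -action_dim)
        self.log_beta = torch.zeros(1, requires_grad=True)
        self.opt = torch.optim.Adam([self.log_beta], lr=lr)

    @property
    def beta(self) -> torch.Tensor:
        return self.log_beta.exp()

    def update(self, log_prob: torch.Tensor) -> float:
        # log_prob: log pi_t(a_t | s_t) for a batch of actions sampled from the policy
        loss = -(self.log_beta.exp() * (log_prob + self.target_entropy).detach()).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return float(loss)

# Usage (hypothetical 4N+1-dimensional action space with N = 3 task types):
temp = EntropyTemperature(target_entropy=-(4 * 3 + 1))
fake_log_prob = torch.full((256,), -5.0)       # stand-in for log pi(a|s) from the actor
temp.update(fake_log_prob)
print(temp.beta.item())
```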
Advantageous effects: compared with the prior art, the task offloading policy and resource allocation scheme adopted by the invention outperform other offloading and resource allocation schemes in reducing system utility, controlling the stability of the base-station queues, and guaranteeing the task transmission delay.
Drawings
FIG. 1 is a scene diagram of task offloading in heterogeneous-network vehicle edge computing;
FIG. 2 is a diagram of the server queue model framework;
FIG. 3 shows the relationship between the number of task types, the task arrival rate, and the delay upper bound for the scheme of the invention;
FIG. 4 shows the relationship between the number of task types, the task burstiness, and the delay upper bound for the scheme of the invention;
FIG. 5 shows the relationship between server computing resources, the heterogeneous communication technologies, and the delay violation probability for the scheme of the invention;
FIG. 6 shows the relationship between the number of task types, the heterogeneous communication technologies, and the delay violation probability for the scheme of the invention;
FIG. 7 shows the relationship between communication bandwidth resources, the heterogeneous communication technologies, and the delay violation probability for the scheme of the invention;
FIG. 8 shows the CCDF of the task delay-bound violation for the scheme of the invention;
FIG. 9 compares the queue backlog of the scheme of the invention with an evenly distributed average offloading policy, a randomly distributed random offloading policy, and a heterogeneous communication distribution policy;
FIG. 10 compares the system utility of the scheme of the invention with an evenly distributed average offloading policy, a randomly distributed random offloading policy, and a heterogeneous communication distribution policy.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
In a vehicle edge computing network environment, aiming at the requirements of vehicular-network applications in the 5G era for higher data rates, ultra-low delay, high reliability, and excellent user experience, the invention provides an ultra-reliable low-latency reinforcement-learning task offloading method based on heterogeneous communication technologies, and improves the network performance of vehicle edge computing with an ultra-reliable low-latency task offloading scheme based on the three heterogeneous communication technologies mmWave, DSRC, and CV2I. First, a vehicle edge computing task offloading model is provided: a vehicle user can choose mmWave, DSRC, and CV2I and allocate the proportion of each task offloaded over each communication technology to offload tasks to the edge server, so as to ensure low task-offloading delay, while the server allocates CPU resources based on the Lyapunov technique to ensure the reliability of the base-station queues. Second, the upper bound of the delay for offloading and processing over each communication technology is calculated using stochastic network calculus, where the delay includes the communication transmission time and the server processing time. Finally, Soft Actor-Critic reinforcement learning is used to learn the offloading policy and the server CPU allocation policy of each task.
The scheme ensures the stability of the base-station queues and does not track the processing of an individual task generated by an individual vehicle. The total time is divided into T equal slots, each of length Δt; the set of slots is denoted by 𝒯 = {1, 2, ..., T}. Considering vehicle mobility, the channel state is time-varying; the scheme assumes that the channel state information (CSI) and the distances between a vehicle and the base station and between a vehicle and an RSU do not change within one slot, but differ between slots. Considering that there are N types of tasks the vehicles need to process in the scene, a task of type i (i = 1, ..., N) is denoted task i. All vehicles can offload tasks directly to the cloud server connected to the base station for processing via the three communication modes mmWave, DSRC, and CV2I, or first offload them to an RSU, from which they are forwarded over a wired link to the cloud server connected to the base station; the set G = {mmw, dsrc, cv2i} denotes the communication technologies available for offloading. For communication technology g ∈ G, x_i^g denotes the proportion of task i offloaded using technology g, where Σ_{g∈G} x_i^g = 1. Because the delay of wired transmission is small, the scheme does not consider the transmission time from the RSU to the base station. The base station has N queues, each assumed infinitely long; queue i stores only task i; λ_i denotes the number of vehicles generating task i within the interval [s, t). The cloud server processes the tasks stored in the base-station queues; since the server is close to the base station, the transmission time from the base station to the server is ignored. The objective of the invention is to minimize the delay, including the task communication time and processing time, while meeting the power consumption, cost-effectiveness, and queue stability requirements. The method specifically comprises the following steps:
step 1: constructing a vehicle edge calculation scene, as shown in fig. 1, wherein the scene is composed of a base station connected with a server, a plurality of road side units and a vehicle; a vehicle heterogeneous communication network is constructed by three communication technologies of millimeter waves, DSRC and CV2I, and a vehicle can unload tasks to a server for processing through the three communication technologies.
Step 2: constructing a bounded burst type flow model based on a random network operation theory; a dynamic change model of the base station queue is constructed to ensure the stability of the base station queue, as shown in fig. 2.
At the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the interval [t, t+1). Given a time interval 0 ≤ s ≤ t, define the bivariate cumulative arrival A_i(s,t) = A_i(t) − A_i(s) as the cumulative amount of type-i tasks arriving at queue i in [s, t); A_i(s,t) is assumed to follow a bounded-burst traffic model and to be a stationary non-negative random process:

A_i(s,t) = λ_i [ ρ_i (t − s) + σ_i ]    (1)

where ρ_i is the task arrival rate and σ_i the task burst size, both constants, and λ_i follows a Poisson distribution.

Consider a server with N_E CPU processing cores whose maximum CPU processing clock rate is f_E cycles per second (frequency); processing each bit of data of a different application type requires a different number of CPU clock cycles of the server, i.e., processing each bit of task-i data requires ω_i CPU clock cycles. The server allocates a proportion α_i(t) of its CPU clock-cycle resources to the ith task; the allocation is written as the vector α(t) = [α_1(t), α_2(t), ..., α_N(t)], each element of which must lie in the feasible set A (shown as an image in the original).

In this scheme, the Lindley recursion is used to analyze the dynamics of the queue length, so the queue length at the base station can be expressed as:

q_i(t+1) = [ q_i(t) + A_i(t) − α_i(t) f_E Δt / ω_i ]^+    (2)

where q_i(t) is the queue length of the ith task at the beginning of slot t and [x]^+ = max(x, 0). The reliability of the server is achieved by guaranteeing the stability of every queue, controlled by the following definition:

lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[ q_i(t) ] < ∞    (3)

The left-hand side of this equation describes the long-term time-averaged backlog of the queue, meaning that strong stability of the queue corresponds to a finite average backlog with a finite average queuing delay.
Step 3: establish communication transmission models for the three communication technologies mmWave, DSRC, and CV2I based on stochastic network calculus, and establish a computation processing model for the CPU; combine the communication transmission model and the computation processing model via the min-plus convolution of the concatenation theorem to obtain the system service model; derive an upper bound on the delay-violation probability for offloading and processing over each communication technology; the delay includes the communication transmission time and the server computation time.
Because of the propagation characteristics of the mmWave band, small-scale fading in the mmWave channel is very weak, so the amplitude of the channel coefficient in the mmWave band is usually modeled as a random variable following a Nakagami-m distribution. For a given transmission distance l and path-loss exponent δ, the capacity of the mmWave channel follows from the Shannon formula as

C = B log₂( 1 + γ̄ ζ l^(−δ) ),

where γ̄ denotes the signal-to-noise ratio, B the bandwidth, and the random variable ζ the channel gain, which is time-independent and gamma-distributed, i.e., ζ ~ Γ(M, M^(−1)) with M the Nakagami index; the probability density function (p.d.f.) of ζ is given in the original (shown as an image).

Let β^mmw(s,t) denote the total mmWave communication service available in the interval [s,t), C^(q) the channel capacity of slot q, and ζ^(q) the channel gain of slot q. Then, following the literature, the communication service of mmWave is

β^mmw(s,t) = Σ_{q=s+1}^{t} C^(q) = η Σ_{q=s+1}^{t} ln( 1 + γ̄ ζ^(q) l^(−δ) )    (5)

where η = B log₂e; since the channel gain coefficients are independently and identically distributed, equation (5) can be further written in the form given in the original (shown as an image).

Let A_i^g(s,t) denote the amount of task i offloaded to the server via communication technology g in the interval [s,t); by the traffic model, A_i^g(s,t) = x_i^g A_i(s,t). Because the task volume A_i^mmw offloaded via mmWave must contend for mmWave communication bandwidth with the other task types A_j^mmw (j ≠ i), by the leftover-service theorem of stochastic network calculus the mmWave communication service obtained by task i in [s,t) is the total mmWave service β^mmw(s,t) minus the task volumes of all other mmWave-offloaded tasks j ≠ i; with x_i^mmw the proportion of task i transmitted via mmWave, the communication service mmWave provides to task i is

β_i^mmw(s,t) = [ β^mmw(s,t) − Σ_{j≠i} A_j^mmw(s,t) ]^+    (7)

Since the communication service provided by mmWave is generally larger than the data volume to be transmitted, the argument of [·]^+ is greater than 0, so equation (7) can be rearranged as shown in the original; introducing the shorthand x̄_i^mmw defined in the original, it simplifies to the form given there (shown as an image).
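As an illustration of the mmWave service model above, a short sketch that samples gamma-distributed channel gains and accumulates the per-slot Shannon capacity (Python); the bandwidth, SNR, distance, Nakagami index, and slot length are hypothetical values.

```python
import numpy as np

def mmwave_service(num_slots=100, B=400e6, snr=10.0, l=50.0, delta=2.5, M=3.0,
                   dt=0.1, seed=0):
    """Cumulative mmWave service beta^mmw(0, t): per-slot Shannon capacity with a
    gamma-distributed channel gain zeta ~ Gamma(M, 1/M) (Nakagami-m power gain)."""
    rng = np.random.default_rng(seed)
    zeta = rng.gamma(shape=M, scale=1.0 / M, size=num_slots)     # channel gains per slot
    capacity = B * np.log2(1.0 + snr * zeta * l ** (-delta))     # bits/s in each slot
    return np.cumsum(capacity * dt)                              # bits served up to slot t

beta = mmwave_service()
print(beta[-1])   # total bits the mmWave link could carry over the horizon
```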
DSRC communication is based on the IEEE 802.11p standard protocol. In 802.11p, the basic wireless access mode is based on the distributed coordination function, i.e., packet retransmission after a collision uses an exponential back-off algorithm, so the DSRC communication delay is mainly the access delay of backing off after a collision. Let T^dsrc denote the average access delay caused by collisions when data is transmitted over DSRC; following the literature, T^dsrc is determined by a constant u, a Pareto-type tail index, and the DSRC communication bandwidth R^dsrc (the expression is given in the original, shown as an image). Let β^dsrc(s,t) denote the total communication service the DSRC can provide in the interval [s,t); according to network calculus, such a service can be modeled as a latency-rate service:

β^dsrc(s,t) = R^dsrc [ t − s − T^dsrc ]^+

In the same way as for the mmWave communication service model, the communication service that DSRC can provide to task i follows from the leftover-service theorem:

β_i^dsrc(s,t) = [ β^dsrc(s,t) − Σ_{j≠i} A_j^dsrc(s,t) ]^+

Introducing the shorthand x̄_i^dsrc defined in the original, this can be simplified to the form given there (shown as an image).
CV2I is a cellular-network-based communication technology. C-V2X communication was standardized in Release 14 of 3GPP with two scheduling-based modes, mode 3 and mode 4, in both of which communication resources are pre-allocated. Therefore, CV2I is assumed to operate in a bandwidth-reservation mode, i.e., the communication bandwidth resources are pre-scheduled by the base station and allocated to the ith task, so a task i offloaded via CV2I does not need to contend for communication service with another task j offloaded over the same technology. The cumulative communication service that CV2I can provide to the ith task in the interval [s,t) is

β_i^cv2i(s,t) = R^cv2i ( t − s )    (13)

where R^cv2i is the communication bandwidth reserved for the task. To facilitate the derivation of the delay upper bound, equation (13) is put into a unified form with equations (9) and (12); in this form, the communication service CV2I can provide to task i in [s,t) is given by equation (14) of the original (shown as an image). It is worth noting that the contention term is zero here, because the CV2I-based model reserves bandwidth resources for the transmitted tasks, so there is no contention for communication bandwidth between tasks.
Next, the computation throughput model of the CPU is established to obtain the system throughput model, and, based on the established traffic model and system throughput model, a moment-generating-function (MGF) based stochastic network calculus is used to derive an upper bound on the delay-violation probability for task offloading and server processing over each communication technology. Let β_i^comp(s,t) denote the amount of computation the server CPU can provide in the interval [s,t) for processing task i offloaded to the server, which equals the amount of tasks processed by the CPU in equation (2):

β_i^comp(s,t) = Σ_{u=s}^{t−1} α_i(u) f_E Δt / ω_i

Let β_i^{g,comp}(s,t) denote the computation service the server CPU provides in [s,t) to A_i^g; it should be noted that β_i^{g,comp} covers only the computation of task i, so, by the leftover-service theorem of stochastic network calculus, A_i^g must compete for the CPU computation service β_i^comp with the task volumes A_i^{g'} offloaded via the other technologies g' ≠ g and with the backlog q_i(s) not yet processed in base-station queue i; the computation service available to A_i^g can therefore be calculated as:

β_i^{g,comp}(s,t) = [ β_i^comp(s,t) − q_i(s) − Σ_{g'≠g} A_i^{g'}(s,t) ]^+

Introducing the shorthand defined in the original (shown as an image), this can be simplified accordingly.
The delay of a task is defined as the communication transmission time plus the server CPU processing time, so the total service the system provides to task i combines the communication service and the CPU computation service. According to the concatenation theorem of stochastic network calculus, the total service the system can provide to task i offloaded via communication technology g is the min-plus convolution of the communication service β_i^{g,comm} of technology g and the CPU computation service β_i^{g,comp}:

β_i^g(s,t) = ( β_i^{g,comm} ⊗ β_i^{g,comp} )(s,t)

where ⊗ is the min-plus convolution operator, the most important operator in stochastic network calculus, with the rule:

( f ⊗ g )(s,t) = inf_{s ≤ u ≤ t} { f(s,u) + g(u,t) }
Stochastic network calculus overcomes the drawback of deterministic envelopes, which consider only the worst case, by allowing the envelope to be violated with a small, controlled probability, so as to fully exploit the statistical properties of the arriving data stream. The probabilistic delay upper bound w_i^g states that the probability that the task transmission-and-processing time W_i^g(t) exceeds w_i^g is smaller than ε_i, defined as:

P( W_i^g(t) > w_i^g ) ≤ ε_i    (20)

This inequality is based on the Chernoff bound P(X ≥ x) ≤ e^(−θx) E[e^(θX)]; defining at the same time the moment generating function of X, i.e., M_X(θ) = E[e^(θX)], equation (20) can be converted into the expression given in the original (shown as an image), which is a bivariate function of a positive parameter θ and the delay bound w_i^g. To achieve a tighter probability upper bound, the minimum over θ is taken as the equivalent violation probability, and the inequality can then be converted into equation (22) of the original (shown as an image). Solving equation (22) yields the solution for the delay upper bound w_i^g, given in equation (23) of the original (shown as an image), in which the term involving the moment generating functions of the arrival and service processes can be calculated as in equation (24) of the original (shown as an image).
To obtain the delay bound w_i^g of task i, the moment generating functions M_{A_i^g}(θ, s, t) of the arrival process and M_{β_i^g}(−θ, s, t) of the service process must be computed first. Since the communication service models of the different communication technologies and the CPU computation service model have already been put into a unified form β_i^{g,service}, service ∈ {comp, comm}, the MGF of the service process can be conveniently derived; the final result is given in equation (25) of the original (shown as an image).

Assume that the communication and computation resources obtainable by an offloaded task are much larger than the arrival rate of the tasks offloaded via communication technology g. Substituting the result of equation (25) into equation (23) yields a closed-form solution for w_i^g, given in equation (26) of the original (shown as an image), together with the constants it uses. From this expression it can be seen that the delay upper bound w_i^g is determined by several factors: the first term shows that the permitted probability ε_i of exceeding the delay bound determines the magnitude of the delay; each part of the second term is related to the task burst size; and the third term shows that the residual communication resources of technology g, the residual server computation resources, and the amount of task i offloaded via technology g jointly determine the delay upper bound w_i^g.
Step 4: establish the utility of the vehicle edge computing system; the system utility consists of two parts: a communication utility and a computation utility.

Communication utility: it is assumed that only the cellular-based CV2I communication is charged by the telecom operator, while DSRC and mmWave operate in unlicensed bands. Let c^comm denote the unit cost per bit of data for CV2I-based communication; the workload offloaded via CV2I is A_i^cv2i(t), so the communication utility F^comm_i(t) that the system attributes to the ith task is:

F^comm_i(t) = c^comm A_i^cv2i(t)

Computation utility: the computation utility of the system refers to the power cost incurred by the server for processing the tasks; the unit cost of the power consumed by the server is c^comp. To model a more realistic computing environment, dynamic voltage and frequency scaling (DVFS) is adopted to model the CPU power consumption; DVFS lets the system operate at a lower frequency and a correspondingly lower voltage under low load or memory-bound workloads, saving energy with almost no loss of performance. The total proportion of CPU resources the server allocates to processing the queued tasks is given in the original (shown as an image); under the DVFS assumption, the dynamic frequency of each CPU core follows accordingly (shown as an image), and, since the power consumption of a CPU is generally calculated as the cube of its frequency, the power consumption of each CPU core is the cubic expression given in the original (shown as an image), where κ denotes the effective switched-capacitance parameter. F^comp_i(t) denotes the computation utility of the CPU at time t (equation shown as an image in the original).

Finally, the utility function of the system is obtained as:

F(t) = Σ_{i=1}^{N} ( φ^comm F^comm_i(t) + φ^comp F^comp_i(t) )

where φ^comm and φ^comp are normalized weighting coefficients that keep the magnitudes of the CPU computation utility and the communication utility consistent.
Step 5: establish the optimization problem; the optimization objective is to minimize the system utility while ensuring the task offloading delay and the stability of the base-station queues.

Based on the offloading and processing delay and the system utility function obtained above, the heterogeneous communication offloading policy and the optimal CPU resource allocation problem are constructed. The optimization objective is to minimize the long-term average system utility of the model while satisfying the delay, communication cost, power consumption, and queue stability requirements; the optimization problem P1 is given in equation (30) of the original (shown as an image), where T_i^max is the maximum transmission-and-processing delay requirement of the ith task. The control variables are α(t) = [α_1(t), α_2(t), ..., α_N(t)], the CPU clock-cycle allocation, and x(t) = [x_i^g(t)], the communication offloading policy, with g ∈ G. Among the constraints, condition C1 requires the queues to be stable, and condition C2 ensures that the transmission-and-processing time of each task type stays within the maximum delay requirement; since a task is offloaded over three different communication technologies, the maximum of w_i^mmw, w_i^dsrc, and w_i^cv2i is used as the transmission delay upper bound of the ith task. Constraint C3 ensures that the CPU clock cycles used to process all tasks do not exceed the total CPU computing resources available on the server. Constraint C4 ensures that each task selects mmWave, DSRC, or CV2I to perform the computation.

Solving P1 for the optimal transmission offloading policy and CPU resource allocation policy is not easy, mainly because constraint C1 concerns the stability of the long-term time-averaged queue length: short-term decisions have a large impact on long-term stability, yet it is preferable to make decisions without considering future information. Therefore, the Lyapunov technique is first adopted to handle this long-term stochastic constraint C1.

The Lyapunov framework is an efficient framework for designing online control algorithms without any a priori knowledge. Define the quadratic Lyapunov function L(t) and the one-slot Lyapunov drift ΔL_t:

L(t) = (1/2) Σ_{i=1}^{N} q_i(t)²

ΔL_t = E[ L(t+1) − L(t) | q(t) ]

where q(t) = [q_1(t), q_2(t), ..., q_N(t)]. The expected system utility is then added to the drift to obtain the drift-plus-penalty term ΔL_t + V·E[ F(t) | q(t) ], where V is a non-negative parameter set by the system to trade off between system utility and queue backlog. For any given control parameter V ≥ 0 and offloading workload α_i, an upper bound on the drift-plus-penalty term can be derived (shown as an image in the original). The original time-averaged long-term queue-length condition C1 is thereby absorbed implicitly into the optimization objective, and the optimization objective of problem P1 is converted into F_2(t), giving problem P2 (shown as an image in the original).
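A minimal sketch of the drift-plus-penalty idea used above: at each slot, a controller scores a candidate action by V·F(t) plus the queue-weighted drift term and picks the best candidate (Python); the candidate set, service model, and V are illustrative assumptions, not the patent's learned policy.

```python
import numpy as np

def drift_plus_penalty_step(q, A, candidates, utility_fn, service_fn, V=50.0):
    """Pick the action minimizing  V*F(t) + sum_i q_i(t)*(A_i(t) - service_i(t)),
    the standard per-slot surrogate that absorbs the stability constraint C1."""
    best_action, best_score = None, float("inf")
    for action in candidates:
        served = service_fn(action)                      # bits each queue would drain
        score = V * utility_fn(action) + np.dot(q, A - served)
        if score < best_score:
            best_action, best_score = action, score
    return best_action

# Hypothetical example with 3 task queues and a few candidate CPU splits.
q = np.array([5e5, 2e5, 8e5])          # current backlogs (bits)
A = np.array([1e5, 1e5, 2e5])          # arrivals this slot (bits)
candidates = [np.array(a) for a in ([0.5, 0.2, 0.3], [0.2, 0.2, 0.6], [1/3, 1/3, 1/3])]
service_fn = lambda a: a * 10e9 * 0.1 / np.array([500, 800, 1000])   # alpha*f_E*dt/omega
utility_fn = lambda a: float(np.sum(a ** 3))                          # toy convex cost
print(drift_plus_penalty_step(q, A, candidates, utility_fn, service_fn))
```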
Although the original problem is simplified by the Lyapunov method, it is still far from easy to solve directly: on the one hand, the optimization problem P2 is not convex, and on the other hand, its resolution suffers from the curse of dimensionality due to the complexity of the vehicular environment. Thus, a DRL framework containing states, actions, and rewards is employed to formulate the computing-resource allocation policy and the heterogeneous communication offloading policy in the VEC.

State space s_t: the vehicle needs to observe the network and computing resources to decide the offloading proportions over the heterogeneous communication technologies, and the server allocates CPU clock-cycle resources for each type of task arriving at the base station by observing the queue lengths. In the model, the size of a task is the most fundamental factor influencing the task delay and the queue length, and the randomness of task arrivals influences the system stability, the task transmission delay, and the cost, so a variable reflecting the data volume is taken as one of the state components; the randomness of the task volume is reflected by the task number λ_i. In addition, the queue length affects the decision on server CPU resources needed to keep the system stable. Equation (2) gives the queue-length update; since A_i(t) is a random variable, and it is known in deep learning that environments with random rewards are harder to learn than environments with deterministic rewards, the quantity q_i(t) + A_i(t) is used as the queue-length state at time t, so that the randomness of A_i(t) is absorbed into the state-transition probability s_{t+1} ~ P(s_{t+1} | s_t, a_t) and a deterministic reward is obtained from the environment; thus Q_t = [q_1(t)+A_1(t), q_2(t)+A_2(t), ..., q_N(t)+A_N(t)] denotes the queue-length status of all tasks. Another consideration is that, at time t, the closed-form delay bound of equation (26) is influenced by the residual communication and computation resources, which reflect the resources available when task i competes with task j; equation (26) shows that the delay w_i^g is largely governed by a relatively small resource surplus, so the communication and computing resources available to task i are also included in the state, denoted ξ_i(t). For the available communication resources only the DSRC case needs to be considered here, since CV2I is based on the bandwidth-reservation model so there is no contention between tasks, and the mmWave band resources are abundant enough not to be considered. The resource status of all tasks is expressed as ξ_t = [ξ_1(t), ξ_2(t), ..., ξ_N(t)]. Thus, the state s_t at time t can be defined as the collection of these components (equation shown as an image in the original), and the dimension of the state space is 5N.
Action space a_t: for each type of task, the server allocates CPU clock-cycle resources α_i(t) to process task i, with α(t) = [α_1(t), α_2(t), ..., α_N(t)], and the offloading proportions decide how much of each task is offloaded via mmWave, DSRC, and CV2I. Thus the action at time t is defined as a_t = [ α(t), x(t) ] (equation shown as an image in the original), where α_i(t) and x_i^g(t) must satisfy the constraints of equation (30), namely Σ_i α_i(t) ≤ 1 and Σ_{g∈G} x_i^g(t) = 1. To realize the constraint on α_i(t) more easily at the output of the neural network, a virtual variable α_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function is applied at the output layer so that the N+1 variables satisfy

α_i(t) = exp(a_i) / Σ_{j=1}^{N+1} exp(a_j)

of which only the first N actions are used. Similarly, applying a softmax function to the output actions x_i^mmw(t), x_i^dsrc(t), and x_i^cv2i(t) of each task i enforces Σ_{g∈G} x_i^g(t) = 1; the action space is therefore (N+1) + 3N = 4N+1 dimensional, and its dimension grows with the number of task types.
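A short sketch of the softmax trick described above for mapping raw network outputs to a feasible action (Python); the grouping and ordering of the outputs are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))           # numerically stable softmax
    return z / z.sum()

def project_action(raw, N):
    """raw: unconstrained (4N+1)-dim network output.
    First N+1 entries -> CPU shares (softmax with a virtual slot, keep the first N,
    so sum_i alpha_i <= 1).  Next 3 entries per task -> offload split over
    {mmWave, DSRC, CV2I} (softmax per task, so each split sums to 1)."""
    alpha = softmax(raw[:N + 1])[:N]                 # drop the virtual variable
    splits = [softmax(raw[N + 1 + 3 * i: N + 4 + 3 * i]) for i in range(N)]
    return alpha, np.vstack(splits)

alpha, x = project_action(np.random.default_rng(0).normal(size=4 * 3 + 1), N=3)
print(alpha.sum() <= 1, np.allclose(x.sum(axis=1), 1.0))
```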
Reward function r_t: the system aims to improve performance in terms of utility, stability, and delay. Since reinforcement learning maximizes the long-term reward while the optimization objective is to be minimized, the reward function of the model at time t is defined as the negative of the optimization objective:

r_t(a_t, s_t) = −F_2(t)    (38)

r_t(a_t, s_t) describes the reward the environment feeds back to the agent after taking action a_t in state s_t. π(a_t|s_t) denotes the distribution over actions the agent can take given state s_t. The expected long-term discounted return of the system is calculated as

J(π) = E_τ [ Σ_t γ^t r_t(a_t, s_t) ]    (39)

where γ ∈ [0,1] is a discount factor that indicates whether the agent emphasizes long-term or short-term rewards: the higher the value, the more the agent emphasizes long-term rewards, and vice versa for current short-term rewards; τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory generated by the agent according to the action distribution π(a_t|s_t).
Step 6: Soft Actor-Critic reinforcement learning is used to learn the offloading policy and the server CPU allocation policy for each task.
All values in the state space and the action space are continuous variables, whereas conventional reinforcement learning methods can only handle discrete, low-dimensional variables. For control tasks with high-dimensional, continuous state and action spaces, neural networks are used to approximate the values over these spaces. The Soft Actor-Critic (SAC) algorithm is a reinforcement learning algorithm suited to continuous state and action spaces; it introduces the policy entropy into the reward, maximizing the reward while encouraging the agent to explore more feasible policies. SAC therefore has better robustness and stronger generalization capability.
While maximizing the optimization objective -F_2(t), the SAC algorithm introduces the policy entropy H(π(·|s_t)) into the reward, so the expected long-term discounted reward of the model becomes

J(π) = E_{τ~π}[ Σ_t γ^t ( r_t(a_t, s_t) + β_t H(π(·|s_t)) ) ],

where β_t is the weight of the policy entropy, trading off the exploration of feasible policies against the maximization of the objective. Because the reward changes constantly, a fixed β_t can affect the stability of the whole training process, so automatically adjusting β_t during training is very necessary. The reinforcement-learning optimization problem is converted accordingly (formula (40)), with a lower limit H_0 imposed on the policy entropy. The purpose of setting the lower limit H_0 is to make β_t H(π_t(·|s_t)) in formula (40) as large as possible: when the agent has not yet learned the optimal action, β_t is increased to explore more of the space; conversely, once the best strategy has been learned, β_t is reduced to curtail exploration and accelerate the training of the model. Based on the Lagrange-multiplier method, the update rule for β_t and the optimal policy π* that maximizes the expected long-term discounted reward learned by SAC can be obtained.
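A common way to realize this automatic adjustment of β_t is dual gradient descent on a log-parameterized weight, as in standard SAC implementations; the PyTorch sketch below uses an assumed entropy lower limit equal to the negative action dimension, which is a heuristic choice and not a value taken from the original text.

```python
import torch

N = 5                                           # number of task types (illustrative)
action_dim = 4 * N + 1
log_beta = torch.zeros(1, requires_grad=True)   # beta = exp(log_beta) stays positive
beta_opt = torch.optim.Adam([log_beta], lr=3e-4)
target_entropy = -float(action_dim)             # assumed entropy lower limit H_0

def update_beta(log_prob_batch):
    """One dual step on beta; log_prob_batch holds log pi(a_t|s_t) for a mini-batch."""
    loss = -(log_beta.exp() * (log_prob_batch + target_entropy).detach()).mean()
    beta_opt.zero_grad()
    loss.backward()
    beta_opt.step()
    return log_beta.exp().item()
```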
The training and testing algorithms of the SAC-based ultra-reliable low-latency task-offloading reinforcement learning over heterogeneous communication technologies are as follows.

Training-stage algorithm:
the SAC algorithm is based on an actor-Critic network framework, the actor network is used for strategy optimization, the Critic network carries out evaluation strategy, and finally the strategy pi converges to the optimal strategy pi through continuous strategy optimization and strategy evaluation * . average value mu of Gaussian distribution of operator network output strategy φ (t) and covariance ∑ φ (t), wherein φ represents a neural network parameter of an actor network; operator network samples from strategy high-dimensional Gaussian distribution and outputs high-dimensional data in current stateAction a t . criticc network output approximate Q θ (s t ,a t ) For policy evaluation, Q θ (s t ,a t ) Is shown in state s t In action a t Value function of action of θ (s t ,a t ) Indicating that action a is selected at the current time t t And then take the expectation of the future discount reward sum under the optimal action condition:
Figure BDA0003722642460000231
where θ represents a neural network parameter of the criticc network. The SAC algorithm introduces two Critic networks of the same network structure to ensure Q reduction θ (s t ,a t ) Over-estimation of, i.e. output approximation separately
Figure BDA0003722642460000232
And
Figure BDA0003722642460000233
θ 1 and theta 2 The sub-table represents the parameters of two critical networks. In addition, for faster and stable training, two target critical networks with the same structure as the critical network are introduced, and the network parameters are respectively
Figure BDA0003722642460000234
And
Figure BDA0003722642460000235
The algorithm is described in detail below.
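For concreteness, a minimal PyTorch sketch of the network components just described (a Gaussian-policy actor, two critics and two target critics) is given below; the hidden widths, layer counts and clamping range are illustrative assumptions rather than parameters disclosed in the original.

```python
import copy
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Actor network: outputs the mean and (diagonal) log-std of the policy Gaussian."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_std(h).clamp(-20, 2)

class Critic(nn.Module):
    """Critic network: approximates the action-value function Q_theta(s_t, a_t)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.q(torch.cat([s, a], dim=-1))

N = 5
actor = GaussianActor(state_dim=5 * N, action_dim=4 * N + 1)
critic1, critic2 = Critic(5 * N, 4 * N + 1), Critic(5 * N, 4 * N + 1)
target1, target2 = copy.deepcopy(critic1), copy.deepcopy(critic2)  # targets start from theta_1, theta_2
```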
To study the trade-off between system utility and stability under the weighting factor V, the SAC model needs to be trained for different values of V. First the network parameters φ, θ_1 and θ_2 are initialized, and the target parameters θ̄_1 and θ̄_2 are initialized to the same values as θ_1 and θ_2. A replay buffer with sufficient capacity is constructed to store the data collected during training. To reduce the influence of randomness on the stability of the SAC algorithm, the same millimeter-wave channel coefficient ζ is used. Training runs for K_max rounds in total; at the start of each round all queues are emptied first, and the initial state ξ_0 is set to the initial computing and communication resources, i.e. the available DSRC communication resource is initialized to R_dsrc, the available CPU computing resources are all set to f_E, and all other initial state entries are set to 0.
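The replay buffer used to store the collected data can be sketched as below; the capacity value is an assumption for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer for (s_t, a_t, r_t, s_{t+1}) tuples."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```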
Each round contains T_max time steps, and the algorithm runs from time step t = 0 to t = T_max. First, for each type of task, a vehicle number λ_i is generated at random; with the slot length set to one, i.e. t - s = 1, the amount of task data arriving at the base-station queue for each task type is computed according to A_i(0) = λ_i[ρ_i(t - s) + σ_i], which yields the queue-length state and the resource state and hence the complete state s_0 at t = 0. The state s_0 is fed into the actor network, which outputs the t = 0 policy mean μ_φ(0) and variance Σ_φ(0) of the high-dimensional Gaussian distribution; an action a_0 comprising the (N+1)-dimensional α_{N+1}(0) and the 3N-dimensional offloading proportions is then sampled from this Gaussian distribution. A softmax operation is applied to α_{N+1}(0) so that formula (37) is satisfied, and the first N values are taken as the CPU allocation policy; a softmax operation is likewise applied to the offloading proportions of each task so that they sum to one. With the obtained action a_0, all queue lengths Q(0) are updated by formula (1), the transmission delays of the different communication technologies for each task are obtained from formula (26), and the resource state ξ_1 of the next time instant is obtained. When the vehicle and the server take the action, the feedback reward r_0 is obtained from the environment. It should be noted that obtaining the queue-length state of the next time instant also requires the arrival amount A(1) generated during the slot starting at t = 0, after which the complete state s_1 at time t = 1 is obtained.
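One such interaction step might look like the following sketch; here `env` is a hypothetical wrapper around the queue update of formula (1), the delay computation of formula (26) and the reward -F_2(t), and `map_raw_action` is the mapping sketched earlier, so all of these names are assumptions.

```python
import numpy as np
import torch

def collect_step(env, actor, state, N):
    """Sample an action from the Gaussian policy, apply it, and return (action, reward, next_state)."""
    with torch.no_grad():
        mu, log_std = actor(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        raw = torch.normal(mu, log_std.exp()).squeeze(0).numpy()   # sample from the policy Gaussian
    alpha, eps = map_raw_action(raw, N)          # softmax mapping: CPU shares + offloading proportions
    next_state, reward = env.step(alpha, eps)    # hypothetical env: queues (1), delays (26), reward -F2(t)
    action = np.concatenate([alpha, eps.reshape(-1)])
    return action, reward, next_state
```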
The tuple (s_0, a_0, r_0, s_1) is then stored in the replay buffer. While the number of samples collected in the replay buffer is still below the required threshold, the vehicle and the server simply continue to the next state, which is sent to the actor network to begin the next time-step iteration. Once the number of samples in the replay buffer reaches the threshold, the parameters φ of the actor network, θ_1 and θ_2 of the two critic networks, θ̄_1 and θ̄_2 of the two target critic networks, and the relative entropy weight β_t are updated iteratively to maximize the objective function J(π) of equation (40). The model first draws a number I of tuples (s_t, a_t, r_t, s_{t+1}) at random from the replay buffer to form a mini-batch; all states s_t of the mini-batch are fed into the actor network to obtain the Gaussian policy distribution, from which an action a_new and its policy entropy are sampled. The policy entropy weight β is updated along the gradient of the entropy objective, which can be regarded as a gradient with respect to the relative entropy weight factor β:

Figure BDA00037226424600002416
next, state s is given t And action a new 2 criticic network output State-cost function
Figure BDA00037226424600002417
And
Figure BDA00037226424600002418
the loss function for the operator network can be calculated as:
Figure BDA00037226424600002419
for the sake of stability of the training here,
Figure BDA00037226424600002420
to make the operator network conductive, a resampling technique for the actions at the same time, i.e. a t =f φt ;s t ),ε t Is an input noise vector sampled from some fixed distributions, which is simply sampled from the gaussian distribution, and the sampled value is multiplied by the covariance and then added with the mean to obtain the final output action a t
To obtain the loss function of the critic networks, the s_t and a_t of the mini-batch samples are fed into the two critic networks to obtain the action-state value functions Q_{θ1}(s_t, a_t) and Q_{θ2}(s_t, a_t) based on state s_t and action a_t. The s_{t+1} of each tuple is fed into the actor network to obtain the new policy π'_φ and its entropy, and an action a_next is sampled from it; the two target critic networks then compute the target action-state value functions from s_{t+1} and a_next. The target value is then:

Figure BDA0003722642460000255

and the loss function of the critic networks is:

Figure BDA0003722642460000256
Figure BDA0003722642460000257
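Since the target value and the critic losses are given only as images in the source, the sketch below follows the standard soft-Bellman target used by SAC (reward plus the discounted, entropy-regularized minimum of the two target critics) and is an assumed reconstruction, not the verbatim formulas.

```python
import torch
import torch.nn.functional as F

def critic_losses(batch, actor, critics, targets, beta, gamma=0.99):
    """Twin-critic MSE losses against the soft Bellman target value."""
    s, a, r, s_next = batch                                  # tensors of a mini-batch
    with torch.no_grad():
        mu, log_std = actor(s_next)
        dist = torch.distributions.Normal(mu, log_std.exp())
        a_next = dist.sample()
        log_prob_next = dist.log_prob(a_next).sum(dim=-1)
        q_next = torch.min(targets[0](s_next, a_next),
                           targets[1](s_next, a_next)).squeeze(-1)
        y = r + gamma * (q_next - beta * log_prob_next)      # target value
    loss1 = F.mse_loss(critics[0](s, a).squeeze(-1), y)
    loss2 = F.mse_loss(critics[1](s, a).squeeze(-1), y)
    return loss1, loss2
```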
next, K is performed using an Adam optimizer u And updating the parameters of the wheel. The relative entropy weight factor beta and the actor network adopt the same learning rate alpha A The learning rate of the two critic networks is alpha C . Finally, after K u After round of updating, updating two target critic network parameters:
Figure BDA0003722642460000258
Figure BDA0003722642460000259
wherein τ is a constant satisfying τ < 1.
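The soft update of the target-critic parameters can be written as Polyak averaging; the value of τ below is illustrative.

```python
def soft_update(target_net, online_net, tau=0.005):
    """theta_bar <- tau * theta + (1 - tau) * theta_bar for every parameter pair."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```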
After this update, the next time step begins. When t = T_max, the queues are emptied and re-initialized again, and the next round starts. Once the maximum number of rounds K_max is reached, the optimal relative entropy weight factor β*, the actor network parameters, the critic network parameters and the target-critic network parameters are obtained, i.e. the optimal policy π* is obtained.
FIG. 3 depicts how the upper bound of the delay grows with the arrival rate ρ_i for different numbers of task types. As the task arrival rate increases, the delay performance of C-V2I-based offloading is better than that of the other two offloading modes, because when tasks are transmitted through C-V2I there is no competition for communication resources between tasks; only under heavy traffic does the increasing task arrival rate ρ_i have a large impact on the C-V2I delay, because as the arrival rate grows the server CPU can no longer process the tasks in the base station in time, so the term in equation (14) increases. As can be seen from FIG. 3, at low arrival rates the delay of millimeter wave is even lower than that of C-V2I, because the millimeter-wave bandwidth is very large. Another notable phenomenon in the figure is that, for a relatively small number of task types (N = 5), the delay upper bounds of millimeter wave and C-V2I first go through a decreasing stage. This happens because, as shown by equation (26), the communication resources available to the initially offloaded tasks exceed the available CPU computing resources, so the main factor affecting the delay is the computing resource; as the task arrival rate ρ_i increases, the communication resource instead becomes the main determinant of the task delay.
FIG. 4 shows how the delay upper bound increases with the burstiness σ_i. In this simulation the arrival rates ρ_i of all tasks are set to 0.5 Mbps. As the burstiness σ_i increases up to about 5 Mbps, FIG. 4 shows that σ_i has a linear effect on the upper bound of the task delay. Furthermore, the delay upper bound of DSRC is larger than that of the other technologies, which means DSRC is not suitable for heavy-traffic vehicular scenarios.
FIG. 5 compares the violation probability of the delay upper bound for the different offloading communication technologies under different server CPU resources, with the number of task types N = 5. As is apparent from FIG. 5, increasing the CPU cycle resources can significantly reduce the probabilistic delay. In addition, millimeter-wave communication is observed to outperform the other two offloading communication technologies in the low-traffic scenario (ρ = 0.5 Mbps), mainly because millimeter wave has a huge bandwidth advantage in that regime compared with the bandwidth that C-V2I reserves for the tasks.
FIG. 6 studies the impact of the number N of different task types on the delay performance of the different heterogeneous communication technologies. In this simulation the arrival rates of all tasks are set to ρ = 0.8 Mbps and the total CPU processing capacity is f_E = 10^4 GHz. As can be seen from FIG. 6, when N = 10 the probabilistic delay of millimeter wave exceeds that of C-V2I, which again reflects, as in FIGS. 3 and 4, that the delay performance of millimeter wave under heavy traffic is not as good as that of C-V2I. On the other hand, the network performance of DSRC deteriorates rapidly as the traffic increases.
FIG. 7 shows the stochastic delay upper bound with violation probability ε for various amounts of communication resources of millimeter wave, DSRC and C-V2I. In this simulation the same basic parameter settings are used for all communication technologies, i.e. the burstiness σ and the arrival rate ρ are set to 0 Mbps and 0.5 Mbps respectively, the CPU cycle resources are f_E = 10^4 GHz, and each communication technology carries the offloading of N = 5 task types. As can be seen from FIG. 7, the delay performance of each communication technology deteriorates to varying degrees as the communication resources decrease.
FIG. 8 shows the complementary cumulative distribution function (CCDF) of the delays of the three offloaded tasks under the SAC-based strategy for different values of V; the CCDF reflects the probability that the task delay exceeds a given threshold. The delay of a task is defined here as the maximum offloading delay over the three different communication technologies. In the figure, the task offloading delay exceeding the delay requirement T = 50 ms under the SAC strategies with different V values is an extremely small-probability event. This reflects the effectiveness of the SAC strategy proposed by the present invention and its ability to guarantee the delay requirements of the offloaded tasks.
FIG. 9 shows the queue backlog corresponding to the offloaded 3D game under different values of the weight coefficient V. The 3D game is considered because it requires the most CPU cycle resources, whereas the other two tasks, VR and AR, require relatively few CPU cycles and can generally be processed in time. As can be seen from FIG. 9, the SAC-based strategy can guarantee the stability of the queue, and the larger V is, the smaller the final stable queue length. Another phenomenon that can be observed from the figure is that both the random-based strategy and the average-based strategy have smaller queue lengths than the SAC strategy, because these two strategies do not take the system utility into account.
FIG. 10 shows the system utility F(t) of the equal-allocation equal-offloading (EAEO) strategy based on average distribution, the random-allocation random-offloading (RARO) strategy, and the SAC strategies with different values of V. As can be seen from the figure, the system utility F(t) of the EAEO and RARO policies is much greater than that of the SAC-based policy, which shows that the EAEO and RARO policies in FIG. 9 achieve queue-length stability by sacrificing system utility (i.e. CPU resources), thereby verifying the effectiveness of the SAC-based policy. FIG. 10 also shows the difference in the utility F(t) of the SAC-based strategies for different values of V. As V increases, the system utility F(t) decreases instead, because with a larger V the system allocates more processing clock-rate resources f_E to ensure the stability of the system. The additionally allocated CPU cycle resources allow the system to offload over the millimeter-wave and DSRC communication technologies, so that the overall reward r_t is minimized.

Claims (7)

1. A task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning is characterized by comprising the following steps:
(1) constructing a vehicle edge calculation scene, wherein the scene consists of a base station connected with a server, a plurality of road side units and a vehicle; the method comprises the steps that a vehicle heterogeneous communication network is formed by three communication technologies of millimeter waves, DSRC and CV2I, and a vehicle can unload tasks to a server through the three communication technologies for processing;
(2) constructing a bounded burst type flow model based on a random network operation theory;
(3) constructing a base station queue dynamic change model to ensure the stability of a base station queue;
(4) establishing communication transmission models of three communication technologies of millimeter waves, DSRC and CV2I based on a random network calculus theory, and establishing a calculation processing model of a CPU; carrying out minimum convolution on the communication transmission model and the calculation processing model by a series theorem to obtain a system processing model;
(5) deriving an upper bound on delay probabilities for offloading and processing based on the respective communication technologies; the delay comprises communication transmission time and server calculation processing time;
(6) establishing a vehicle edge computing system utility, the system utility consisting of a communication utility and a computing utility;
(7) establishing an optimization problem, wherein the optimization target is to minimize the system utility and ensure the task unloading delay and the stability of a base station queue;
(8) soft Actor Critic reinforcement learning is used to learn the offload policy and server CPU allocation policy for each task.
2. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (2) is implemented as follows:
assuming that the vehicles have K types of tasks to process, at the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the time interval [t, t+1); for a given time interval 0 ≤ s ≤ t, the bivariate cumulative process A_i(s,t) = A_i(t-s) = A_i(t) - A_i(s) is defined as the cumulative amount of tasks of type i arriving at queue i; A_i(s,t) follows a bounded burst-type traffic model and is a stationary non-negative random process:

A_i(s,t) = λ_i[ρ_i(t-s) + σ_i]    (1)

where ρ_i is the task arrival rate, σ_i is the task burst size, both of which are constants, and λ_i follows a Poisson distribution, representing the number of vehicles generating the i-th type of task in the [s, t) time interval.
3. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (3) is implemented as follows:
the queue length of the base station is expressed as:
Figure FDA0003722642450000021
wherein q_i(t) is the queue length of the i-th task at the beginning of time slot t, f_E is the maximum CPU processing clock rate of the server, ω_i denotes the number of CPU clock cycles required by the server to process each bit of task i, α_i(t) represents the share of CPU clock cycles that the server allocates to the i-th task, and [x]^+ = max(x, 0); the stability of all queues is governed by the following definition:
Figure FDA0003722642450000022
the left end of equation (3) describes the long-term time-averaged backlog of the queue; equation (3) means that the strong stability of the queue corresponds to a finite average backlog with a finite average queuing delay.
4. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (4) is implemented as follows:
by β^mmw(s, t) is denoted the total communication transmission that millimeter wave can provide over the time interval [s, t); C^(q) denotes the channel capacity of time slot q, ζ^(q) denotes the channel gain of time slot q, and
Figure FDA0003722642450000023
denotes the signal-to-noise ratio, B the bandwidth, and l and δ the transmission distance and the path-loss exponent, respectively; the total communication transmission that millimeter wave can provide is:
Figure FDA0003722642450000024
where η = B log_2 e; using
Figure FDA0003722642450000025
to denote the proportion of task i transmitted through millimeter wave, the communication transmission capacity that millimeter wave can provide for the i-th task is:
Figure FDA0003722642450000026
wherein
Figure FDA0003722642450000027
the delay-rate model in stochastic network calculus theory is used to establish the total communication throughput that DSRC can provide within the time interval [s, t):
Figure FDA0003722642450000031
R_dsrc is the DSRC communication bandwidth, and
Figure FDA0003722642450000032
denotes the average access delay caused by collisions when data is transmitted over DSRC; the amount of communication traffic that DSRC can provide for the i-th task is:
Figure FDA0003722642450000033
wherein
Figure FDA0003722642450000034
Figure FDA0003722642450000035
denotes the proportion of task i transmitted through DSRC;
by β_i^cv2i(s, t) is denoted the amount of communication traffic that C-V2I can provide for the i-th task over the time interval [s, t):
Figure FDA0003722642450000036
where R_cv2i is the communication bandwidth reserved for the i-th task, and where
Figure FDA0003722642450000037
By using
Figure FDA0003722642450000038
denotes the amount of computation that the server CPU can provide in the time interval [s, t) for processing task i offloaded to the server, which equals the amount processed by the CPU in equation (2):
Figure FDA0003722642450000039
let the set
Figure FDA00037226424500000310
denote the communication technologies over which offloading can take place; by using
Figure FDA00037226424500000311
to denote the task volume of task i offloaded to the server in the time interval [s, t) via communication technology
Figure FDA00037226424500000312
with
Figure FDA00037226424500000313
denoting the proportion of task i communicated via communication technology g, and q_i(s) the backlogged tasks not yet processed in base-station queue i before time s; using
Figure FDA00037226424500000314
to denote the amount of computation processing that the server CPU provides in the time interval [s, t) for
Figure FDA00037226424500000315
the amount of computation processing that
Figure FDA00037226424500000316
can obtain is calculated as:
Figure FDA00037226424500000317
wherein
Figure FDA00037226424500000318
the delay of a task is the sum of the communication transmission time and the server CPU processing time; the total processing amount that the system can provide for task i is the combination of the communication transmission amount and the CPU computation amount, and for task i offloaded via communication technology g the total processing amount that the system can provide is the min-plus convolution of the communication transmission of technology g,
Figure FDA00037226424500000319
and the CPU computation processing,
Figure FDA0003722642450000041
namely:
Figure FDA0003722642450000042
in the formula, the operator
Figure FDA0003722642450000043
is the min-plus (minimum) convolution operator, the most important operator in stochastic network calculus theory, with the following operation rule:
Figure FDA0003722642450000044
5. the task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (5) is implemented as follows:
by
Figure FDA0003722642450000045
is denoted the delay of task i offloaded via communication technology g, and by
Figure FDA0003722642450000046
the probabilistic delay upper bound of task i transmitted via communication technology g: the probability that the task transmission and processing time W_i^g(t) exceeds
Figure FDA0003722642450000047
is less than ε_i, defined as follows:
the delay upper bound
Figure FDA0003722642450000049
is solved as:
Figure FDA00037226424500000410
in the formula,
Figure FDA00037226424500000411
is given by:
Figure FDA00037226424500000412
wherein
Figure FDA00037226424500000413
To obtain the probabilistic delay upper bound of task i,
Figure FDA00037226424500000414
one first needs to calculate
Figure FDA00037226424500000415
Figure FDA00037226424500000416
Suppose that
Figure FDA00037226424500000417
and
Figure FDA00037226424500000418
i.e. the offloaded task can obtain communication and computing resources far greater than the offloading rate
Figure FDA00037226424500000419
of the task offloaded via communication technology g, so as to obtain the closed-form solution of
Figure FDA00037226424500000420
:
Figure FDA00037226424500000421
where
Figure FDA0003722642450000051
and
Figure FDA0003722642450000052
The delay upper bound
Figure FDA0003722642450000053
is determined jointly by several factors: the first term is governed by the delay-bound violation probability ε_i, the second term is related to the size of the task burst, and the third term is determined jointly by the residual resources of communication technology g and of the server computation together with the amount of task i offloaded via communication technology g; these together determine the delay upper bound
Figure FDA0003722642450000054
6. The task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (7) is implemented as follows:
Figure FDA0003722642450000055
where T_i^max is the maximum transmission-and-processing delay requirement of the i-th task; the control variable α(t) = [α_1(t), α_2(t), ..., α_N(t)] allocates the CPU clock-cycle resources, and
Figure FDA0003722642450000056
offloading policies for communication, wherein
Figure FDA0003722642450000057
Condition C1 is for the queue to be in a stable state; condition C2 ensures that the transmission and processing time for each type of task is within the maximum delay requirement, since the task is offloaded through three different communication techniques,
Figure FDA0003722642450000058
and
Figure FDA0003722642450000059
the maximum of the three is used as the upper bound of the transmission delay of the i-th task; constraint C3 ensures that the CPU clock cycles used to process all tasks cannot exceed the total CPU computing resources available on the server; constraint C4 ensures that each task selects millimeter wave, DSRC or C-V2I to perform the computation task;
the Lyapunov technique is used to handle this long-term stochastic constraint C1:
a second-order Lyapunov function L(t) and the one-slot Lyapunov drift ΔL_t are defined as
Figure FDA0003722642450000061
Figure FDA0003722642450000062
where q(t) = [q_1(t), q_2(t), ..., q_N(t)]; the expected system utility is then added to the drift, yielding a drift-plus-penalty term, i.e.
Figure FDA0003722642450000063
where V is a non-negative parameter set by the system to trade off system utility against queue backlog; for any given control parameter V ≥ 0 and offloading workload α_i, the drift-plus-penalty term is bounded as:
Figure FDA0003722642450000064
where
Figure FDA0003722642450000065
The original long-term time-averaged queue-length condition C1 is thus absorbed implicitly into the optimization objective, and the optimization objective of problem P1 is converted into F_2(t):
Figure FDA0003722642450000066
A DRL framework containing states, actions and rewards is employed to formulate computing resource allocation policies and heterogeneous communication offloading policy issues in the VEC:
the state space s_t at time t is:
Figure FDA0003722642450000067
since
Figure FDA0003722642450000068
and
Figure FDA0003722642450000069
are both N-dimensional, and
Figure FDA00037226424500000610
is 4N-dimensional, the dimension of the state space is 5N;
the action space a_t at time t is:
Figure FDA00037226424500000611
where α_i(t) and
Figure FDA00037226424500000612
are required to satisfy the constraint conditions in the formula (30)
Figure FDA00037226424500000613
And
Figure FDA00037226424500000614
a virtual variable α_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function at the output layer then makes the N+1 variables satisfy:
Figure FDA0003722642450000071
taking only the first N actions; similarly, the output action for each task i
Figure FDA0003722642450000072
And
Figure FDA0003722642450000073
using softmax function, implement
Figure FDA0003722642450000074
Thus the action space is N +1+3N dimensions, the dimension of the action space increasing with the number of task types; also, by acting on the output of each task
Figure FDA0003722642450000075
And
Figure FDA0003722642450000076
using softmax function, thereby implementing
Figure FDA0003722642450000077
so the action space is 4N+1 dimensional;
the reward function r_t at time t is:
r_t(a_t, s_t) = -F_2(t)    (38)
r_t(a_t, s_t) describes the reward fed back by the environment to the agent after action a_t is taken in state s_t; π(a_t|s_t) denotes the distribution over the action space from which the agent acts given state s_t; the expected long-term discounted return of the system is calculated as:
Figure FDA0003722642450000078
where γ ∈ [0, 1] is the discount factor describing whether the agent attaches more importance to long-term or short-term rewards: the higher the value, the more the agent values long-term rewards, and the lower the value, the more it values the current short-term reward; τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory generated by the agent following the action distribution π(a_t|s_t).
7. The task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (8) is implemented as follows:
while maximizing the optimization objective -F_2(t), the SAC algorithm introduces the policy entropy
Figure FDA0003722642450000079
into the reward; the expected long-term discounted reward of the model is then
Figure FDA00037226424500000710
β_t is the weight of the policy entropy, trading off the exploration of feasible policies against the maximization of the optimization objective; because the reward changes constantly, a fixed β_t can affect the stability of the whole training, so automatically adjusting β_t during training is very necessary; the reinforcement-learning optimization problem is converted into:
Figure FDA0003722642450000081
the purpose of setting the lower limit
Figure FDA0003722642450000082
is to make β_t H(π_t(·|s_t)) in formula (40) as large as possible; when the agent has not yet learned the optimal action, β_t is increased to explore more of the space; conversely, if the best strategy has already been learned, β_t is reduced to curtail exploration and accelerate the training of the model; based on the Lagrange-multiplier method one can obtain
Figure FDA0003722642450000083
Figure FDA0003722642450000084
CN202210756389.7A 2022-06-30 2022-06-30 Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning Pending CN115118783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210756389.7A CN115118783A (en) 2022-06-30 2022-06-30 Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210756389.7A CN115118783A (en) 2022-06-30 2022-06-30 Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning

Publications (1)

Publication Number Publication Date
CN115118783A true CN115118783A (en) 2022-09-27

Family

ID=83330488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210756389.7A Pending CN115118783A (en) 2022-06-30 2022-06-30 Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning

Country Status (1)

Country Link
CN (1) CN115118783A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130693A (en) * 2023-10-26 2023-11-28 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130693A (en) * 2023-10-26 2023-11-28 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium
CN117130693B (en) * 2023-10-26 2024-02-13 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN111867139B (en) Deep neural network self-adaptive back-off strategy implementation method and system based on Q learning
CN113296845B (en) Multi-cell task unloading algorithm based on deep reinforcement learning in edge computing environment
CN112860350A (en) Task cache-based computation unloading method in edge computation
CN110717300B (en) Edge calculation task allocation method for real-time online monitoring service of power internet of things
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN111132074B (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
CN113543074A (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN113727306B (en) Decoupling C-V2X network slicing method based on deep reinforcement learning
CN116541106B (en) Computing task unloading method, computing device and storage medium
CN115118783A (en) Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning
CN114928611B (en) IEEE802.11p protocol-based energy-saving calculation unloading optimization method for Internet of vehicles
Shaodong et al. Multi-step reinforcement learning-based offloading for vehicle edge computing
CN113452625B (en) Deep reinforcement learning-based unloading scheduling and resource allocation method
CN115052262A (en) Potential game-based vehicle networking computing unloading and power optimization method
Omland Deep Reinforcement Learning for Computation Offloading in Mobile Edge Computing
CN117834643B (en) Deep neural network collaborative reasoning method for industrial Internet of things
Sun et al. EC-DDPG: DDPG-Based Task Offloading Framework of Internet of Vehicle for Mission Critical Applications
CN115134242B (en) Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy
Ma et al. Deep Reinforcement Learning-based Edge Caching and Multi-link Cooperative Communication in Internet-of-Vehicles
CN117793805B (en) Dynamic user random access mobile edge computing resource allocation method and system
Wang et al. Joint Optimization for MEC Computation Offloading and Resource Allocation in IoV Based on Deep Reinforcement Learning
CN117793801B (en) Vehicle-mounted task unloading scheduling method and system based on hybrid reinforcement learning
Zhao et al. A Novel Multi-Criteria Contribution Evaluation Scheme for Federated Learning in Internet of Vehicles

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination