CN115118783A - Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning - Google Patents
- Publication number
- CN115118783A (application CN202210756389.7A)
- Authority
- CN
- China
- Prior art keywords
- task
- delay
- communication
- server
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44594—Unloading
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/502—Proximity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a task offloading method based on ultra-reliable low-delay reinforcement learning over heterogeneous communication technologies. The method comprises: constructing a vehicle edge computing scenario and a vehicular heterogeneous communication network in which a vehicle can offload a task to a server for processing through three communication technologies; constructing a dynamic model of the base station queues to ensure queue stability; computing an upper bound on the system delay for offloading over each communication technology using stochastic network calculus, where the delay comprises communication transmission time and server processing time; establishing a vehicle edge computing system utility; formulating an optimization problem whose objective is to minimize the system utility while guaranteeing the task offloading delay and the stability of the base station queues; and using Soft Actor-Critic reinforcement learning to learn the offloading policy and server CPU allocation policy for each task. The task offloading strategy and resource allocation scheme adopted by the invention are superior to other offloading and resource allocation schemes in reducing system utility, controlling system stability, and guaranteeing task transmission delay.
Description
Technical Field
The invention belongs to the technical field of vehicle networking edge computing reinforcement learning, and particularly relates to a task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning.
Background
The Internet of Vehicles (IoV) will grow rapidly in the coming 5G era, with Vehicle Users (VUs) demanding immersive quality of experience (QoE) and computation-intensive services (such as online 3D gaming, augmented and virtual reality (AR/VR), video, and other interactive applications). Furthermore, for an autonomous vehicle, high-resolution cameras, lidar, high-speed high-definition maps, and other onboard sensors produce on the order of 1 GB of data per second; executing these tasks places a tremendous strain on onboard hardware with limited computing capability. To mitigate the shortage of onboard computing resources (e.g., CPU), Vehicle Edge Computing (VEC) is considered a very promising technology. VEC provides an open wireless network edge platform that enables vehicles to offload computation-intensive task loads to nearby roadside MEC servers with low latency. Although VEC can alleviate the bottleneck of insufficient onboard computing resources to some extent, emerging 5G applications and the ultra-reliable low-latency communication (URLLC) requirements of autonomous driving still put pressure on the development of the Internet of Vehicles. URLLC-related performance requirements include support for up to 1000 times current data volumes, ultra-low transmission delays below 5 ms, and ultra-high reliability of 99.99%; these stringent requirements pose a significant challenge to any single communication technology on the one hand, and to the reliability of VEC servers on the other.
The emerging heterogeneous V2X communication technology increases the communication capacity of vehicles. Currently, three technologies are widely used in the Internet of Vehicles: Dedicated Short Range Communication (DSRC), cellular-based vehicle-to-everything (C-V2X) communication, and millimeter-wave (mmWave) communication. DSRC enables short-range communication for vehicles without necessarily involving a roadside unit (RSU); it operates primarily in the 5.9 GHz band and is based on the 802.11p standard protocol. C-V2X lets users benefit from the existing extensive mobile communication infrastructure; in addition to operating at 5.9 GHz, C-V2X may also operate on a cellular operator's licensed bands. However, research results show that neither technology supports reliable delay guarantees at high vehicle densities. Among next-generation wireless technologies, millimeter wave operates in a large, mostly unused portion of the spectrum (roughly 30-300 GHz), can realize multi-gigabit transmission capability for autonomous driving, and can serve applications with high performance requirements. Heterogeneous V2X communication integrates the advantages of the three communication technologies, providing vehicles with wide-area coverage and more efficient, reliable communication transmission. However, the randomness of task generation and the time-varying channel conditions in vehicular scenarios greatly affect the offloading performance of vehicle edge computing tasks and challenge network performance optimization. In recent years, Deep Reinforcement Learning (DRL) has been widely applied to task-offloading policy decisions in the Internet of Vehicles; DRL can adjust its policy toward the optimal long-term goal without any prior information about the vehicular environment.
Therefore, the invention provides an ultra-reliable low-delay reinforcement-learning task offloading scheme over heterogeneous communication technologies. The new scheme accounts for contention over task-offloading communication bandwidth and server computing resources, and derives the delay upper bounds of mmWave, DSRC, and CV2I by a stochastic network calculus (SNC) method based on the moment generating function (MGF), thereby guaranteeing low task-offloading delay. Offloaded tasks increase the queue length on the server side and can make the server unstable; Lyapunov optimization, which has been widely used to stabilize queueing systems, is applied in this scheme to ensure the reliability of the system. In addition, Soft Actor-Critic deep reinforcement learning is used to learn the offloading policy of each task and the server CPU allocation policy under the delay and reliability guarantees, so that optimal offloading and allocation decisions can be made, the consumption utility of the whole system is reduced, and network and system performance is improved.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning, which is superior to other unloading and resource allocation methods in reducing system utility, controlling base station queue stability and ensuring task transmission delay requirements.
The technical scheme is as follows: the invention relates to a task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning, which comprises the following steps of:
(1) constructing a vehicle edge calculation scene, wherein the scene consists of a base station connected with a server, a plurality of road side units and a vehicle; the method comprises the steps that a vehicle heterogeneous communication network is formed by three communication technologies of millimeter waves, DSRC and CV2I, and a vehicle can unload tasks to a server through the three communication technologies for processing;
(2) constructing a bounded-burst traffic model based on stochastic network calculus;
(3) constructing a base station queue dynamic change model to ensure the stability of a base station queue;
(4) establishing communication transmission models of the three communication technologies (millimeter wave, DSRC, and CV2I) based on stochastic network calculus, and establishing a computation processing model of the CPU; applying the concatenation (tandem-server) theorem to take the min-plus convolution of the communication transmission model and the computation processing model, yielding the end-to-end system service model;
(5) deriving an upper bound on delay probabilities for offloading and processing based on the respective communication technologies; the delay comprises communication transmission time and server calculation processing time;
(6) establishing a vehicle edge computing system utility, the system utility consisting of a communication utility and a computing utility;
(7) establishing an optimization problem, wherein the optimization target is to minimize the system utility and ensure the task unloading delay and the stability of a base station queue;
(8) Soft Actor-Critic (SAC) reinforcement learning is used to learn the offloading policy and server CPU allocation policy for each task.
Further, the step (2) is realized as follows:
assume the vehicle has N types of tasks to process; at the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the interval [t, t+1); given a time interval with 0 ≤ s ≤ t, define the bivariate cumulative process A_i(s,t) = A_i(t) - A_i(s) as the cumulative amount of task-i data arriving at queue i over [s, t); A_i(s,t) is a bounded-burst traffic model satisfying a stationary non-negative random process:

A_i(s,t) = λ_i [ ρ_i (t - s) + σ_i ]    (1)

where ρ_i is the task arrival rate and σ_i the task burst size, both constants, and λ_i follows a Poisson distribution and represents the number of vehicles generating task i in the [s, t) time interval.
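As a small numerical illustration of the bounded-burst envelope in equation (1) (a sketch; the function name and parameter values are illustrative, not from the patent):

```python
import numpy as np

def arrival_envelope(lam, rho, sigma, s, t):
    """Bounded-burst traffic envelope of eq. (1): A_i(s, t) = lam * (rho*(t - s) + sigma)."""
    return lam * (rho * (t - s) + sigma)

# lam is Poisson-distributed: the number of vehicles generating task i in [s, t)
rng = np.random.default_rng(0)
lam = rng.poisson(3)
print(arrival_envelope(lam, rho=2.0, sigma=5.0, s=0, t=10))
```

The envelope grows linearly in the interval length t - s, with the burst term sigma adding a constant offset per contributing vehicle.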
Further, the step (3) is realized as follows:
the queue length of the base station is expressed as:
wherein q is i (t) is the queue length of the ith task at the beginning of time slot t, f E Processing the clock rate, omega, for the maximum CPU of the server i Indicating that the amount of data per bit task i processed by the server requires a CPU clock period, α i (t) represents the CPU clock cycle duty that the server allocates to the ith task, and [ x [ ]] + Max (x, 0); the stability of all queues is controlled by the following definitions:
the left end of equation (3) describes the long-term time-averaged backlog of the queue; equation (3) means that the strong stability of the queue corresponds to a finite average backlog with a finite average queuing delay.
Further, the step (4) is realized as follows:
let β^mmw(s,t) denote the total mmWave communication service available over the time interval [s, t), C^(q) the channel capacity of slot q, and ζ^(q) the channel gain of slot q; γ denotes the signal-to-noise ratio, B the bandwidth, and l and δ the transmission distance and path-loss exponent, respectively. By the Shannon formula, C^(q) = B log2(1 + ζ^(q) γ l^(-δ)), and the total service that mmWave can provide is the capacity accumulated over the slots of [s, t): β^mmw(s,t) = Σ_{q=s}^{t-1} C^(q) Δt, where the shorthand η = B log2 e is used in the subsequent bounds. Let π_i^mmw denote the proportion of task i transmitted over mmWave; by the leftover-service theorem, the communication service that mmWave provides to the ith task is β^mmw(s,t) minus the traffic of all other tasks sharing the band: β_i^mmw(s,t) = [ β^mmw(s,t) - Σ_{j≠i} π_j^mmw A_j(s,t) ]^+.
The delay-rate (latency-rate) model in network calculus is used to model the total traffic that DSRC communication can provide within a time interval [s, t):

β^dsrc(s,t) = R_dsrc [ t - s - T_dsrc ]^+

where R_dsrc is the DSRC communication bandwidth and T_dsrc is the average access delay caused by collisions when data is transmitted over DSRC. Let π_i^dsrc denote the proportion of task i transmitted over DSRC; the communication service that DSRC can provide to the ith task over [s, t), written β_i^dsrc(s,t), is obtained by subtracting the traffic of the other tasks sharing DSRC from β^dsrc(s,t), analogously to the mmWave case.
Let β_i^cpu(s,t) denote the computation service that the server CPU can provide over the time interval [s, t) for processing task i offloaded to the server; consistent with the processing term in equation (2),

β_i^cpu(s,t) = Σ_{q=s}^{t-1} α_i(q) f_E Δt / ω_i

Let the set G represent the communication technologies available for offloading, let A_i^g(s,t) denote the amount of task i offloaded to the server via communication technology g over [s, t), with π_i^g the proportion of task i transmitted via g, and let q_i(s) be the backlog of queue i not yet processed before time s. The delay of a task is the sum of the communication transmission time and the server CPU processing time, and the total processing the system performs for task i is the sum of the communication transmission amount and the CPU computation amount; the overall service the system can provide to task i offloaded via communication technology g is therefore the min-plus convolution of the communication service β_i^g(s,t) of technology g and the CPU computation service β_i^cpu(s,t):

β_i^{sys,g}(s,t) = ( β_i^g ⊗ β_i^cpu )(s,t)

where ⊗ is the min-plus convolution operator, the central operator of stochastic network calculus, with the rule

( f ⊗ g )(s,t) = min_{s ≤ u ≤ t} { f(s,u) + g(u,t) }
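The min-plus convolution rule above can be evaluated directly over discrete slots (a sketch; the two constant-rate curves are illustrative stand-ins for the communication and CPU service models):

```python
def min_plus_conv(beta1, beta2, s, t):
    """(beta1 ⊗ beta2)(s, t) = min over u in [s, t] of beta1(s, u) + beta2(u, t)."""
    return min(beta1(s, u) + beta2(u, t) for u in range(s, t + 1))

comm = lambda s, t: 5.0 * (t - s)   # communication service curve (illustrative rate)
cpu = lambda s, t: 3.0 * (t - s)    # CPU processing service curve (illustrative rate)
print(min_plus_conv(comm, cpu, 0, 10))   # bottleneck rate dominates: 3.0 * 10 = 30.0
```

For two constant-rate curves the convolution reduces to the slower rate, which matches the intuition that the end-to-end service of a tandem of transmission and processing is limited by its bottleneck stage.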
further, the step (5) is realized as follows:
Let W_i^g(t) denote the delay of task i offloaded via communication technology g, and let w_i^g denote the probabilistic upper bound on that delay: the probability that the transmission-and-processing time W_i^g(t) exceeds w_i^g is at most ε_i, i.e.

P( W_i^g(t) > w_i^g ) ≤ ε_i

Assuming that the offloaded tasks can obtain communication and computation resources far greater than the offloading rate of the tasks offloaded via technology g, a closed-form solution for w_i^g is obtained. The bound consists of three terms: the first, governed by the delay-violation probability ε_i, determines the scale of the delay; the second is related to the task burst size; and the third is jointly determined by the residual resources of communication technology g and of the server's computation, together with the amount of task i offloaded via g.
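The shape of such a probabilistic delay bound can be illustrated with a generic Chernoff/MGF-style tail bound (purely illustrative; this is not the patent's closed-form bound, and every parameter here is an assumption):

```python
import math

def tail_bound(theta, sigma, rho_a, rho_s, w):
    """Generic Chernoff/MGF-style delay tail sketch:
    P(W > w) <= exp(theta * sigma) * exp(-theta * rho_s * w),
    meaningful only in the stable regime where service rate rho_s
    exceeds arrival rate rho_a."""
    assert rho_s > rho_a, "service rate must exceed arrival rate"
    return math.exp(theta * sigma) * math.exp(-theta * rho_s * w)

print(tail_bound(theta=0.1, sigma=5.0, rho_a=2.0, rho_s=4.0, w=20.0))
```

The burst term sigma loosens the bound while the deadline w tightens it exponentially, mirroring the three-term structure described above: violation probability, burst size, and residual service.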
Further, the step (7) is realized as follows:
Problem P1 minimizes the long-term average system utility over the CPU allocation α(t) = [α_1(t), α_2(t), ..., α_N(t)] and the communication offloading policy π(t) = {π_i^g(t)}, where T_i^max is the maximum transmission-and-processing delay requirement of the ith task. Condition C1 requires the queues to be in a stable state; condition C2 ensures that the transmission and processing time of each type of task stays within its maximum delay requirement: since a task is offloaded through three different communication technologies, the maximum of w_i^mmw, w_i^dsrc, and w_i^cv2i is used as the transmission-delay upper bound of the ith task; constraint C3 ensures that the CPU clock cycles used to process all tasks cannot exceed the total CPU computing resources available on the server; constraint C4 ensures that each task selects mmWave, DSRC, or CV2I to perform the computational task.
the Lypunov technique was used to solve this long-term stochastic constraint C1:
defining a second order Lypunov function L (t) and a 1-slot Lypunov drift amount DeltaL t :
Wherein q (t) ═ q 1 (t),q 2 (t),...q N (t)](ii) a Then, the desired system utility is added to the drift amount resulting in a drift plus penalty term, i.e.Wherein V is a non-negative parameter set by the system for trading off between system utility and queue backlog; for any given control parameter V ≧ 0 with respect to the offload workload α i Next, a drift plus penalty term is derived:
wherein, the first and the second end of the pipe are connected with each other,original time-averaged long-term queue length condition C 1 Absorbing as an optimization objective in an implicit manner, the optimization objective of the problem P1 is converted into F 2 (t):
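The drift-plus-penalty construction can be evaluated numerically as follows (a sketch assuming the second-order Lyapunov function is L(t) = ½ Σ_i q_i(t)²; the utility, V, and queue values are illustrative):

```python
def drift_plus_penalty(q_now, q_next, utility, V):
    """Lyapunov drift L(t+1) - L(t) with L(q) = 0.5 * sum(q_i^2),
    plus the V-weighted system utility (the penalty term)."""
    L = lambda q: 0.5 * sum(x * x for x in q)
    return (L(q_next) - L(q_now)) + V * utility

print(drift_plus_penalty(q_now=[2.0, 2.0], q_next=[3.0, 1.0], utility=1.5, V=10.0))
```

A larger V makes the controller care more about utility and tolerate more backlog; a smaller V keeps queues short at the cost of utility, which is exactly the trade-off the parameter V encodes.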
A DRL framework comprising states, actions, and rewards is employed to formulate the computing-resource-allocation and heterogeneous-communication-offloading policy problem in the VEC.

The state space s_t at time t consists of the queue backlogs q(t) together with the task-arrival and channel-related quantities of the heterogeneous communication technologies; since these components have dimensions N and 4N, the state space has dimension 5N.
The action space a_t at time t consists of the CPU allocations α_i(t) and the offloading proportions π_i^g(t), which must satisfy the constraints in formula (30): Σ_i α_i(t) ≤ 1 and Σ_g π_i^g(t) = 1. A virtual variable a_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function is applied at the output layer so that the N+1 variables sum to one; only the first N actions are kept as the CPU allocation. Similarly, a softmax function is applied to the three output actions π_i^mmw, π_i^dsrc, and π_i^cv2i of each task i, thereby enforcing Σ_g π_i^g(t) = 1. The action space therefore has dimension 4N+1 (N+1 allocation outputs plus 3N offloading proportions), and its dimension increases with the number of task types.
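The virtual-variable softmax trick for the CPU allocation can be sketched as follows (function name and logits are illustrative; the point is that dropping the extra softmax entry leaves N non-negative shares summing to at most one):

```python
import numpy as np

def cpu_allocation_from_logits(logits):
    """Softmax over N+1 network outputs; the last entry is the virtual
    variable a_{N+1}(t), dropped so the kept N shares sum to at most 1."""
    e = np.exp(logits - np.max(logits))   # numerically stable softmax
    p = e / e.sum()
    return p[:-1]                         # keep only the first N actions

alpha = cpu_allocation_from_logits(np.array([1.0, 0.5, 0.2, 0.0]))
print(alpha.sum() <= 1.0)                 # the virtual entry holds the slack
```

The slack absorbed by the virtual entry lets the learned allocation leave CPU capacity idle when that is optimal, instead of forcing the shares to sum exactly to one.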
The reward function r_t at time t is:

r_t(a_t, s_t) = -F_2(t)    (38)

r_t(a_t, s_t) describes the reward the environment feeds back to the agent after taking action a_t in state s_t. Let π(a_t|s_t) denote the distribution over actions the agent takes given state s_t. The expected long-term discounted return of the system is calculated as

J(π) = E_τ [ Σ_t γ^t r_t(a_t, s_t) ]

where the discount factor γ ∈ [0,1] expresses how much the agent values long-term versus short-term rewards: the higher the value, the more the agent values long-term rewards, and conversely the more it focuses on the current short-term reward; τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory generated by the agent following the action distribution π(a_t|s_t).
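The discounted return can be computed with a standard backward recursion (a sketch; the reward sequence is illustrative):

```python
def discounted_return(rewards, gamma):
    """Discounted return G = sum_t gamma^t * r_t, computed backwards:
    G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
```

With gamma near 1 the sum weights distant rewards almost as much as immediate ones; with gamma near 0 only the current slot matters, matching the description of the discount factor above.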
Further, the step (8) is realized as follows:
The SAC algorithm maximizes the optimization objective -F_2(t) while introducing the policy entropy H(π_t(·|s_t)) into the reward; the expected long-term discounted return of the model is then

E_τ [ Σ_t γ^t ( r_t(a_t, s_t) + β_t H(π_t(·|s_t)) ) ]    (40)

where β_t is the policy-entropy weight that trades off exploring feasible policies against the pure maximization objective. As the reward keeps changing, a fixed β_t can harm the stability of the whole training process, so automatically adjusting β_t during training is necessary. The reinforcement-learning problem is converted into a constrained one: maximize the expected return subject to a lower bound H̄ on the expected policy entropy. The purpose of setting the lower bound is to keep β_t H(π_t(·|s_t)) in equation (40) as large as necessary: when the agent has not yet learned the optimal action, β_t is increased to explore more of the action space; conversely, once the best strategy has been learned, β_t is reduced to cut exploration and accelerate the training of the model. The update of β_t is obtained by the Lagrange-multiplier method.
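The automatic adjustment of β_t can be sketched as a dual (Lagrange-multiplier) gradient step toward the entropy lower bound (a generic sketch of entropy-temperature tuning; the update form, learning rate, and values are illustrative, not the patent's exact derivation):

```python
def update_temperature(beta, entropy, target_entropy, lr=0.1):
    """Dual gradient step on the entropy weight: raise beta when the policy
    entropy is below its lower bound (explore more), lower it once the
    target is met (exploit and speed up training). Clipped at zero."""
    return max(beta - lr * (entropy - target_entropy), 0.0)

print(update_temperature(1.0, entropy=0.5, target_entropy=1.0))   # 1.05
```

When the measured entropy sits below the target the weight grows, amplifying the entropy bonus in equation (40); once exploration is sufficient the weight decays, which is the behavior described in the text.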
Advantageous effects: compared with the prior art, the task offloading strategy and resource allocation scheme adopted by the invention are superior to other offloading and resource allocation schemes in reducing system utility, controlling the stability of the base station queue, and guaranteeing the task transmission delay.
Drawings
FIG. 1 is a scene diagram of task offloading of heterogeneous network vehicle edge computing;
FIG. 2 is a diagram of a server queue model framework;
FIG. 3 is a graph of the number of task types, task arrival rates, and upper delay bounds for the inventive arrangements;
FIG. 4 is a graph of the number of task types, task burst size, and delay upper bound for the inventive scheme;
FIG. 5 is a diagram illustrating the relationship between server computing resources, heterogeneous communication technologies, and delay violation probability in accordance with aspects of the present invention;
FIG. 6 is a diagram illustrating the relationship between the number of task types, heterogeneous communication technologies and delay violation probability according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating communication bandwidth resources, heterogeneous communication technologies, and delay violation probability according to an embodiment of the present invention;
FIG. 8 is a graph of the complementary CDF (CCDF) of the task delay exceeding the delay bound for the inventive scheme;
FIG. 9 is a graph comparing queue backlog for an evenly distributed average offload policy, a randomly distributed random offload, and a heterogeneous communication distribution policy, in accordance with aspects of the present invention;
fig. 10 is a system utility comparison of the inventive arrangements with an evenly distributed average offload policy, randomly distributed random offload and a heterogeneous traffic distribution policy.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
In a vehicle edge computing network environment, aiming at the 5G-era Internet-of-Vehicles requirements of higher data rate, ultra-low delay, high reliability, and excellent user experience, the invention provides an ultra-reliable low-delay reinforcement-learning task offloading method based on heterogeneous communication technologies, improving the network performance of vehicle edge computing through an ultra-reliable low-delay task offloading scheme over the three heterogeneous communication technologies millimeter wave, DSRC, and CV2I. First, a vehicle edge computing task offloading model is proposed: a vehicle user can select millimeter wave, DSRC, and CV2I and allocate the proportion of the task offloaded over each communication technology so as to offload the task to the edge server with low delay, while CPU resources are allocated at the server based on the Lyapunov technique to ensure the reliability of the base station queues. Second, the upper bound of the offloading-and-processing delay under each communication technology is calculated using stochastic network calculus, where the delay comprises the communication transmission time and the server processing time. Finally, Soft Actor-Critic reinforcement learning is used to learn the offloading policy and the server CPU allocation policy of each task.
To ensure the stability of the base station queues, the scheme models aggregate task flows rather than the processing of a single task generated by a single vehicle. The total time is divided into T equal slots of length Δt, and the set of slots is denoted by T. The channel state is time-varying due to the mobility of the vehicle; the scheme assumes that the Channel State Information (CSI) and the distances between a vehicle and the base station and between the vehicle and an RSU do not change within one slot but differ across slots. Considering that there are N types of tasks the vehicle needs to process in the scene, a task of type i (i = 1, ..., N) is denoted task i. All vehicles can offload tasks directly to the cloud server connected to the base station through the three communication modes millimeter wave, DSRC, and CV2I, or first offload a task to an RSU, which then forwards it over a wired link to the cloud server connected to the base station; the set of communication technologies available for offloading is denoted G. For communication technology g, π_i^g represents the proportion of task i offloaded using g, where Σ_{g∈G} π_i^g = 1. Because the delay of wired transmission is small, the scheme does not consider the transmission time from the RSU to the base station. The base station has N queues, each assumed infinitely long; queue i stores only task i, and λ_i denotes the number of vehicles producing task i within the [s, t) time interval. The cloud server processes the tasks stored in the base station queues, and since the server is close to the base station, the transmission time from the base station to the server is ignored.
It is an object of the present invention to minimize delays, including task communication time and processing time, while meeting power consumption and cost effectiveness and queue stability requirements. The method specifically comprises the following steps:
step 1: constructing a vehicle edge calculation scene, as shown in fig. 1, wherein the scene is composed of a base station connected with a server, a plurality of road side units and a vehicle; a vehicle heterogeneous communication network is constructed by three communication technologies of millimeter waves, DSRC and CV2I, and a vehicle can unload tasks to a server for processing through the three communication technologies.
Step 2: construct the bounded-burst traffic model based on stochastic network calculus, and construct the dynamic model of the base station queues to ensure their stability, as shown in fig. 2.
At the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the interval [t, t+1). Given a time interval with 0 ≤ s ≤ t, define the bivariate cumulative process A_i(s,t) = A_i(t) - A_i(s) as the cumulative amount of task-i data arriving at queue i, assumed to follow the bounded-burst model, a stationary non-negative random process:

A_i(s,t) = λ_i [ ρ_i (t - s) + σ_i ]    (1)

where ρ_i is the task arrival rate and σ_i the task burst size, both constants, and λ_i follows a Poisson distribution.

Consider a server with N_E CPU processing cores whose maximum CPU processing clock rate is f_E cycles per second (frequency); processing each bit of data of a different application type occupies a different share of server resources, i.e., processing each bit of task i requires ω_i CPU clock cycles. The server allocates to the ith task the proportion α_i(t) of CPU clock-cycle resources, collected in the set α(t) = [α_1(t), α_2(t), ..., α_N(t)], each element of which must lie in the feasible set A. In this scheme the Lindley recursion is used to analyze the dynamic change of the queue length, so the queue length at the base station can be expressed as:

q_i(t+1) = [ q_i(t) + A_i(t) - α_i(t) f_E Δt / ω_i ]^+    (2)

where q_i(t) is the queue length of the ith task at the beginning of slot t and [x]^+ = max(x, 0). The reliability of the server is achieved by guaranteeing the stability of each queue, controlled by the following definition:

lim_{T→∞} (1/T) Σ_{t=0}^{T-1} E[ q_i(t) ] < ∞    (3)

The left-hand side describes the long-term time-averaged backlog of the queue, which means that strong stability of the queue corresponds to a finite average backlog and hence a finite average queuing delay.
And step 3: establish the communication transmission models of the three communication technologies (millimeter wave, DSRC, and CV2I) based on stochastic network calculus, and also establish the computation processing model of the CPU; take the min-plus convolution of the communication transmission model and the computation processing model via the concatenation (tandem-server) theorem to obtain the system service model; derive the upper bound on the delay-violation probability for offloading and processing under each communication technology; the delay includes the communication transmission time and the server computation processing time.
The small-scale fading in the millimeter-wave channel is very weak due to the propagation characteristics of the millimeter-wave band, so the amplitude of the channel coefficient (millimeter-wave band) is usually modeled as a random variable satisfying the Nakagami-m distribution. For a given transmission distance l and path loss index delta, according to the Shannon formula, the capacity of a millimeter wave band channel is tried to be calculated
WhereinRepresenting the signal-to-noise ratio, B representing the bandwidth, and a random variable ζ being the channel gain, being time-independent and being distributed according to a gamma distribution, i.e., ζ - Γ (M, M) -1 ) Where M is the Nakagami index, the probability density function (p.d.f.) of ζ isBy beta mmw (s, t) represents the total communication transmission available at the time interval mmWave of [ s, t ], and C is used (q) Indicates the channel capacity, ζ, of the q time slots (q) Representing the channel gain for the q slot. Then according to the literature the communication traffic of the millimeter wave is
wherein η = B·log₂e. Since the channel gain coefficients are independently and identically distributed, equation (5) can be further written as
Denote the task volume of task i offloaded to the server via communication technology g in the time interval [s, t). Since a task offloaded via mmWave must contend for millimeter-wave communication bandwidth resources with other types of tasks, by the leftover service theorem in stochastic network calculus, the mmWave communication transmission obtained by task i in the time interval [s, t) is the total mmWave communication transmission β_mmw(s, t) of that interval minus the task volume of all other mmWave-offloaded tasks j ≠ i. The corresponding ratio represents the proportion of task i's communication transmission carried by millimeter wave, so the communication transmission that millimeter wave can provide to task i is
Since the communication transmission provided by the millimeter wave is generally larger than the data volume of the transmitted task, the expression inside [·]^+ is greater than 0, so equation (7) can be rearranged as
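As a numerical illustration of the mmWave capacity model above, the sketch below draws the gamma-distributed channel gain ζ ~ Γ(M, M⁻¹) and evaluates the per-slot Shannon capacity; the bandwidth and SNR values are assumed for illustration only:

```python
import numpy as np

def mmwave_capacity(bandwidth_hz, snr, m, n_slots, rng):
    """Per-slot capacity C = B * log2(1 + SNR * zeta), with channel gain
    zeta ~ Gamma(M, 1/M) (Nakagami index M), i.i.d. across slots, E[zeta] = 1."""
    zeta = rng.gamma(shape=m, scale=1.0 / m, size=n_slots)
    return bandwidth_hz * np.log2(1.0 + snr * zeta)

rng = np.random.default_rng(0)
caps = mmwave_capacity(1e9, snr=10.0, m=3, n_slots=10_000, rng=rng)
```

Summing the per-slot capacities over [s, t) gives the interval's total mmWave transmission β_mmw(s, t).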
DSRC communication is based on the IEEE 802.11p standard protocol. In 802.11p the basic wireless access mode is based on the distributed coordination function, i.e., retransmission of a data packet after a collision uses an exponential backoff algorithm, so the DSRC communication delay is mainly the access delay of backing off after a collision. Denote the average access delay of collisions occurring when transmitting data via DSRC; following the literature, u is a constant, the tail index is of Pareto type, and R_dsrc is the DSRC communication bandwidth. Let β_dsrc(s, t) denote the total traffic that DSRC can provide in the time interval [s, t); according to network calculus theory, such a traffic model can be established as a latency-rate service model, as follows:
In the same way as the millimeter-wave communication transmission model was established, the communication transmission that DSRC can provide to task i is obtained from the leftover service theorem:
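The latency-rate service curve used for DSRC above has the simple form β(t) = R·[t − T]^+; a sketch with illustrative rate and latency arguments:

```python
def latency_rate_service(t, rate, latency):
    """Latency-rate service curve: beta(t) = R * max(t - T, 0),
    i.e. no guaranteed service before the latency T, then rate R thereafter."""
    return rate * max(t - latency, 0.0)

# No service is guaranteed until t exceeds the access latency:
early = latency_rate_service(2.0, rate=2.0, latency=3.0)  # 0.0
later = latency_rate_service(5.0, rate=2.0, latency=3.0)  # 2 * (5 - 3) = 4.0
```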
CV2I is a cellular-network-based communication technology. C-V2X communication was standardized in Release 14 of 3GPP, with two scheduling-based modes, Mode 3 and Mode 4, both of which pre-allocate communication resources in advance. It is therefore assumed that CV2I is based on a bandwidth-reservation mode, i.e., the communication bandwidth resources are pre-scheduled by the base station and allocated to the ith task; that is, task i offloaded via CV2I does not need to contend for communication traffic with another task j offloaded via the same technology. The cumulative traffic that CV2I can provide to the ith task in the time interval [s, t) is
wherein R_cv2i is the communication bandwidth reserved for the task. To facilitate derivation of the delay upper bound, equation (13) is cast into a unified form with equations (9) and (12). Thus, for task i, the communication traffic that CV2I can provide in the time interval [s, t) is:
It is worth noting the form of this expression: because the CV2I-based model reserves bandwidth resources for the transmitted task, there is no contention for communication bandwidth resources between tasks.
Next, a computation throughput model of the CPU is established to obtain the system throughput model, and based on the established traffic model and system throughput model, an upper bound on the delay violation probability for task offloading and server processing over each communication technology is derived using moment-generating-function (MGF) based stochastic network calculus. Denote the computation processing that the server CPU can provide in the time interval [s, t) to task i offloaded to the server; it equals the amount of tasks processed by the CPU in equation (2):
Denote the computation processing that the server CPU provides in the time interval [s, t). Note that only the computation processing of task i is considered here, so by the leftover service theorem in stochastic network calculus, task i must compete for the CPU-assigned computation throughput with the other tasks and with the backlogged, not-yet-processed tasks q_i(s) in the base station queue. Therefore, the computation processing that task i can obtain is calculated as:
The delay of a task is defined as the communication transmission time plus the server CPU processing time, so for task i the total processing of the system consists of the communication transmission and the CPU computation processing. According to the concatenation theorem in stochastic network calculus, the total processing that the system can provide to task i offloaded via communication technology g is the min-plus convolution of the transmission service of communication technology g and the CPU computation processing:
in the formulaThe operator is the minimum convolution operator, is the most important operator in the random network operation theory, and has the following operation rules:
Stochastic network calculus overcomes the limitation of deterministic envelopes, which only consider the worst case, by allowing the envelope to be violated with a certain small probability, so as to fully exploit the statistical properties of the arriving data stream. The probabilistic delay upper bound states that the task transmission and processing time W_i^g(t) exceeds the bound with probability less than ε_i, and is defined as follows:
the inequality of the above formula is based on the Chernoff inequality P (X is more than or equal to X) and less than or equal to e -θX E[e θx ]Defining the moment mother function of x, i.e. M, simultaneously x (θ)=E[e θx ]Then equation (20) can be converted to
where the bound is a bivariate function of the positive parameter θ and the target delay. To achieve a tighter probability upper bound, the minimum over θ is taken as the equivalent violation probability, and the inequality above can be converted into
To obtain the delay of task i, the corresponding MGF terms must first be calculated. Since the communication traffic models of the different communication technologies and the CPU computation throughput model have already been unified in form, i.e., service ∈ {comp, comm}, the formula can be conveniently derived; the final results are as follows:
Assume that the communication and computation resources available to offloaded tasks are much larger than the offloading rate of tasks offloaded via communication technology g. Combining the results of formula (25) and substituting into formula (23) yields the closed-form solution:
From the above equation it can be seen that the delay upper bound is determined by several factors. The first term indicates that the violation probability ε_i of exceeding the delay bound determines the size of the delay; in the second term, each sub-term is related to the size of the task burst; and the third term shows that the resources of communication technology g and the remaining computation resources of the server, together with the task volume of task i offloaded via communication technology g, jointly determine the delay upper bound.
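The Chernoff/MGF machinery behind equations (20)–(23) can be illustrated numerically: the optimized bound P(X ≥ x) ≤ min_θ e^{−θx}·M_X(θ) is computed on a grid of θ values. The exponential random variable below is an illustrative stand-in for the delay, not the patent's actual delay distribution:

```python
import math

def chernoff_bound(mgf, x, thetas):
    """Optimized Chernoff bound: min over theta of exp(-theta*x) * M_X(theta)."""
    return min(math.exp(-th * x) * mgf(th) for th in thetas)

# Illustration with X ~ Exp(1), whose MGF is M_X(theta) = 1/(1-theta), theta < 1.
mgf_exp = lambda th: 1.0 / (1.0 - th)
bound = chernoff_bound(mgf_exp, x=10.0, thetas=[i / 100 for i in range(1, 100)])
# True tail is e^{-10}; the optimized bound (attained at theta = 0.9) is
# 10 * e^{-9} -- valid though looser, as Chernoff bounds always are.
```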
Step 4: establish the vehicle edge computing system utility, which consists of two parts: a communication utility and a computation utility.
Communication utility: it is assumed that telecom operators charge only for cellular-based CV2I communication, while DSRC and millimeter wave operate in unlicensed bands. A unit cost c_comm per bit of data is defined for CV2I-based communication; given the workload offloaded via CV2I, the communication utility F_comm,i(t) attributed to the ith task is:
Computation utility: the computation utility of the system refers to the power cost incurred by the server for processing tasks, with a unit cost c_comp per unit of power consumed. To model a more realistic computing environment, Dynamic Voltage and Frequency Scaling (DVFS) is adopted to model CPU power consumption: under low load, or workloads highly bound by memory, DVFS lets the system run at a lower frequency and a correspondingly lower voltage, saving energy with almost no loss of performance. The server CPUs are jointly allocated to process the tasks in the queues; under the DVFS assumption, each CPU runs at a dynamic frequency, and since CPU power consumption is commonly calculated as the cube of the frequency, the power consumption of each CPU is proportional to the frequency cubed, where κ denotes the effective switched capacitance parameter. F_comp,i(t) denotes the computation utility of the CPU at time t:
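The cubic DVFS power model can be sketched as follows; the value of the effective switched capacitance κ is an assumed placeholder:

```python
def cpu_power(freq_hz, kappa=1e-28):
    """Dynamic CPU power under DVFS: P = kappa * f^3, where kappa is the
    effective switched capacitance parameter (value assumed for illustration)."""
    return kappa * freq_hz ** 3

# The cubic law is why DVFS saves so much energy:
# halving the clock frequency cuts dynamic power by a factor of 8.
ratio = cpu_power(2e9) / cpu_power(1e9)  # 8.0
```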
Finally, the utility function of the system is obtained as:
where the two coefficients are normalized weighting coefficients that ensure the magnitudes of the CPU computation utility and the communication utility are consistent.
Step 5: establish the optimization problem, whose objective is to minimize the system utility while guaranteeing the task offloading delay and the stability of the base station queues.
Based on the offloading/processing delay and the system utility function obtained above, the heterogeneous communication offloading strategy and the optimal CPU resource allocation problem are constructed. The optimization objective is to minimize the average delay of all tasks in the model while satisfying the power consumption and communication cost constraints and the queue stability requirement. The optimization problem is as follows:
where T_i^max is the maximum transmission-and-processing delay requirement of the ith task. The control variable α(t) = [α₁(t), α₂(t), ..., α_N(t)] allocates the CPU clock cycle resources, and the communication offloading policy is the other control variable. Among the above constraints, condition C1 keeps the queues in a stable state; condition C2 ensures that the transmission and processing time of each type of task stays within the maximum delay requirement — since a task may be offloaded via three different communication technologies, the maximum of the three delay bounds is used as the upper bound of the transmission delay of the ith task. Constraint C3 ensures that the CPU clock cycles used to process all tasks cannot exceed the total CPU computation resources available on the server. Constraint C4 ensures that each task selects mmWave, DSRC, or CV2I to perform the computational task.
Solving P1 for the optimal transmission offloading policy and CPU resource allocation policy is not easy, mainly because constraint C1 concerns the stability of the long-term time-averaged queue length: short-term decisions have a large impact on long-term queue stability, yet it is desirable to make decisions without knowing future information. Therefore, the Lyapunov technique is first adopted to handle this long-term stochastic constraint C1.
The Lyapunov framework is an efficient framework for designing online control algorithms without any a priori knowledge. Define the quadratic Lyapunov function L(t) and the one-slot Lyapunov drift ΔL_t:
where q(t) = [q₁(t), q₂(t), ..., q_N(t)]. The expected system utility is then added to the drift as a drift-plus-penalty term, where V is a non-negative parameter set by the system to trade off between system utility and queue backlog. For any given control parameter V ≥ 0 and offloading workload α_i, an upper bound on the drift-plus-penalty term can be derived.
The original time-averaged long-term queue length condition C1 is thereby absorbed into the optimization objective in an implicit manner, so the optimization objective of problem P1 is converted into F₂(t):
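The drift-plus-penalty conversion turns the long-term constraint into a per-slot scalar objective. A minimal sketch of the resulting greedy per-slot decision (the candidate actions, service vectors, and utility values are illustrative placeholders, not the patent's actual action set):

```python
def drift_plus_penalty_choice(q, arrivals, service_options, utility_options, V):
    """Pick the action minimizing V * F(t) + sum_i q_i * (A_i - b_i) this slot:
    a large V favors low utility cost, large backlogs q_i favor serving them."""
    best_k, best_val = None, float("inf")
    for k, (b, F) in enumerate(zip(service_options, utility_options)):
        val = V * F + sum(qi * (ai - bi) for qi, ai, bi in zip(q, arrivals, b))
        if val < best_val:
            best_k, best_val = k, val
    return best_k

# With a large backlog on queue 0, serving it wins despite equal utility cost:
choice = drift_plus_penalty_choice(
    q=[10.0, 1.0], arrivals=[2.0, 2.0],
    service_options=[[5.0, 0.0], [0.0, 5.0]], utility_options=[1.0, 1.0], V=1.0)
# choice == 0
```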
Although the original problem is simplified by the Lyapunov method, it is still far from easy to solve directly. On the one hand, the optimization problem P2 is not convex; on the other hand, its solution suffers from the curse of dimensionality due to the complexity of the vehicular environment. Therefore, a DRL framework comprising states, actions, and rewards is employed to formulate the computation resource allocation policy and the heterogeneous communication offloading policy in VEC.
State space s_t: the vehicle needs to observe the network resources and computation resources to decide the offloading rates over the heterogeneous communication technologies, and the server allocates CPU clock cycle resources to each type of task arriving at the base station by observing the queue lengths. In the model, the task size is the most fundamental factor influencing task delay and queue length, and the randomness of task arrivals affects system stability, task transmission delay, and cost, so a variable reflecting the data volume is taken as one of the states; the randomness of the task volume is reflected by the task number λ_i. In addition, the queue length affects the server's CPU resource decisions needed to achieve system stability. Equation (2) gives the queue-length update rule. Since A_i(t) is a random variable, and it is known in deep learning that environments with stochastic rewards are harder to learn than environments with deterministic rewards, the queue length plus the arrival is used as the queue-length state at time t: the randomness of A_i(t) can then be absorbed into the transition probability s_{t+1} ~ P(s_{t+1}|s_t, a_t), so that a deterministic reward is obtained from the environment. Thus Q_t = [q₁(t)+A₁(t), q₂(t)+A₂(t), ..., q_N(t)+A_N(t)] denotes the queue-length state of all tasks.
Another consideration is that the closed-form solution at time t calculated in equation (26) is influenced by the remaining communication and computation resources, both of which reflect the resources available when task i competes with the other tasks j. Equation (26) shows that the delay is largely governed by the relatively scarce resource surplus, so the communication and computation resources available to task i are also brought into the state. For the available communication resources, only the DSRC case is considered here, because CV2I is based on the bandwidth-reservation model (so there is no contention between tasks), and the abundance of millimeter-wave band resources need not be considered. The resource state of all tasks can be expressed as ξ_t = [ξ₁(t), ξ₂(t), ..., ξ_N(t)]. Thus, the state s_t at time t can be defined as:
Since the queue-length component is N-dimensional and the resource component is 4N-dimensional, the dimension of the state space is 5N.
Action space a_t: for each type of task, the server allocates CPU clock cycle resources α(t) = [α₁(t), α₂(t), ..., α_N(t)] to process task i, and the offloading decision determines what fraction of the tasks is offloaded via millimeter wave, DSRC, and CV2I. Thus, the action at time t is defined as:
where α_i(t) and the offloading fractions must satisfy the constraints in formula (30). To make it easier for the neural network output to satisfy the constraint on α_i(t), a virtual variable α_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function is applied at the output layer so that the N+1 variables satisfy:
Only the first N actions are taken. Similarly, by applying the softmax function to the output offloading actions of each task, the offloading constraint can be satisfied. The action space is therefore (N+1) + 3N = 4N+1 dimensional, and its dimension increases with the number of task types.
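The dummy-variable softmax trick described above can be sketched directly: the network emits N+1 raw values, softmax maps them onto the probability simplex, and dropping the (N+1)-th "slack" entry leaves N allocations that sum to strictly less than 1 (the raw outputs below are arbitrary example values):

```python
import math

def project_allocation(raw_outputs):
    """Softmax over N+1 raw outputs; the last (virtual) entry absorbs the
    slack, so the first N entries are positive and sum to strictly < 1."""
    m = max(raw_outputs)
    exps = [math.exp(x - m) for x in raw_outputs]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps][:-1]  # keep only the first N actions

alloc = project_allocation([0.5, -1.2, 2.0, 0.0])  # N=3 real actions + 1 dummy
```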
Reward function r_t: the system aims to improve performance in terms of utility, stability, and latency. Since reinforcement learning maximizes the long-term reward while we wish to minimize the optimization objective, the reward function of the model at time t is defined as the negative of the optimization objective:
r_t(a_t, s_t) = -F₂(t)    (38)
r_t(a_t, s_t) describes the reward the environment feeds back to the agent after taking action a_t in state s_t. π(a_t|s_t) denotes the distribution over actions the agent can take given state s_t. The expected long-term discounted return of the system is calculated as:
where γ ∈ [0, 1] is the discount factor expressing the agent's preference for long-term versus short-term rewards: a higher value means the agent cares more about long-term rewards, and vice versa for current short-term rewards. τ = (s₀, a₀, s₁, a₁, ...) is the state-action trajectory induced by the agent's action distribution π(a_t|s_t).
Step 6: Soft Actor-Critic reinforcement learning is used to learn the offloading policy and server CPU allocation policy for each task.
All values in the state space and action space are continuous variables, whereas conventional reinforcement learning methods can only handle discrete and low-dimensional variables. For control tasks with high-dimensional, continuous state and action spaces, neural networks are used to approximate values over the space. The Soft Actor-Critic (SAC) algorithm is a reinforcement learning algorithm suited to continuous state and action spaces; it introduces policy entropy into the reward, maximizing the reward while encouraging the agent to explore more feasible policies. SAC therefore has better robustness and stronger generalization capability.
While maximizing the optimization target -F₂(t), the SAC algorithm introduces the policy entropy into the reward, so the expected long-term discounted reward of the model becomes
β_t is the weight of the policy entropy, trading off exploration of feasible policies against maximization of the objective. As the reward keeps changing, a fixed β_t can harm the stability of the whole training, so automatically adjusting β_t during training is necessary. The reinforcement learning optimization problem can then be converted into:
Setting a lower limit on the entropy makes β_t·H(π_t(·|s_t)) in formula (40) as large as possible. When the agent has not yet learned the optimal action, β_t is increased to explore more of the action space; conversely, once the best strategy has been learned, β_t is reduced to cut down exploration and accelerate model training. Based on the Lagrange multiplier method, one obtains
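The automatic adjustment of β_t is a dual (Lagrange-multiplier) gradient step on the entropy constraint; a minimal sketch with an assumed learning rate:

```python
def update_temperature(beta, entropy, target_entropy, lr=0.1):
    """Dual gradient step on the entropy weight: beta grows when the policy
    entropy is below the target (push exploration), shrinks when above it."""
    return max(beta - lr * (entropy - target_entropy), 0.0)

# Entropy below target -> beta increases; above target -> beta decreases.
up = update_temperature(1.0, entropy=0.5, target_entropy=1.0)    # 1.05
down = update_temperature(1.0, entropy=2.0, target_entropy=1.0)  # 0.9
```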
where the result is the best strategy SAC has learned to maximize the expected long-term discounted reward.
The SAC training algorithm and testing algorithm for ultra-reliable low-delay task offloading reinforcement learning based on heterogeneous communication technologies are as follows:
Training-stage algorithm:
The SAC algorithm is based on an actor-critic network framework: the actor network performs policy optimization, the critic network evaluates the policy, and through continuous policy optimization and policy evaluation the policy π finally converges to the optimal policy π*. The actor network outputs the mean μ_φ(t) and covariance Σ_φ(t) of the Gaussian policy distribution, where φ denotes the neural network parameters of the actor network; the actor network samples from the high-dimensional Gaussian policy distribution and outputs the action a_t in the current state. The critic network outputs an approximation Q_θ(s_t, a_t) for policy evaluation; Q_θ(s_t, a_t) is the action-value function of taking action a_t in state s_t, defined as the expectation of the sum of future discounted rewards when action a_t is selected at the current time t and optimal actions are taken thereafter:
where θ denotes the neural network parameters of the critic network. The SAC algorithm introduces two critic networks with the same network structure to reduce over-estimation of Q_θ(s_t, a_t), i.e., they separately output two approximations, with θ₁ and θ₂ respectively denoting the parameters of the two critic networks. In addition, for faster and more stable training, two target critic networks with the same structure as the critic networks are introduced, with their own network parameters. The algorithm is described in detail below.
To study the trade-off between system utility and stability under the weighting factor V, the SAC model needs to be trained with different V values. First, the network parameters φ, θ₁, and θ₂ are initialized, and the target network parameters are initialized to the same values as θ₁ and θ₂. A replay buffer of sufficient size is constructed to store the data collected during training. To reduce the influence of randomness on the stability of the SAC algorithm, the same millimeter-wave channel coefficient ζ is used throughout. A total of K_max training rounds are performed; each round first empties all queues, and the initial state ξ₀ is set to the initial computation and communication resources, i.e., the communication resource is initially set to R_dsrc, the computation resources are all set to f_E, and all other initial states are set to 0.
Each round comprises T_max time steps. The algorithm runs from time step t = 0 to t = T_max. First, for each type of task, a vehicle number λ_i is randomly generated; letting the slot length be 1, i.e., t - s = 1, the task volume of each type of task arriving at the base station queue is calculated according to A_i(0) = λ_i[ρ_i(t-s) + σ_i], yielding the corresponding state components and thus the complete state s₀ at t = 0. The state s₀ is fed into the actor network, which outputs the policy mean μ_φ(0) and variance Σ_φ(0) of the high-dimensional Gaussian distribution at t = 0; the action a₀, comprising the (N+1)-dimensional α_{N+1}(0) component and the 3×N-dimensional offloading component, is then sampled from this distribution. A softmax operation is applied to α_{N+1}(0) to satisfy formula (37), and the first N values are taken to obtain the CPU allocation strategy; a softmax operation is applied to the offloading component to satisfy its constraint. According to the obtained action a₀, all queue lengths Q(0) are updated by formula (1); the transmission delays of each task over the different communication technologies are obtained from formula (26), and the next-moment state ξ₁ is obtained. The feedback r₀ is obtained from the environment once the vehicle and server take the action. Note that obtaining the next-moment state requires α(1) generated at t = 0; the complete state at t = 1 is then obtained, and the tuple (s₀, a₀, r₀, s₁) is stored in the replay buffer. While the number of samples collected in the replay buffer is below the required threshold, the vehicle and server simply continue to the next state s₂ and feed it into the actor network to start the next iteration at time step t = 2.
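The replay buffer used in the loop above can be sketched as a fixed-capacity FIFO with uniform sampling (the capacity and batch size here are arbitrary illustrative values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store; old transitions are evicted FIFO."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch of stored transitions."""
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=4)
for i in range(6):             # pushing past capacity evicts the oldest items
    buf.push(i, i, 0.0, i + 1)
```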
When the number of samples in the replay buffer reaches the threshold, the parameters φ, θ₁, θ₂ of the actor network, the 2 critic networks and the 2 target critic networks, together with the relative entropy weight β_t, are iteratively updated to maximize the objective function J(π) of equation (40). The model first randomly draws I tuples (s_t, a_t, r_t, s_{t+1}) from the replay buffer to form a mini-batch; the s_t of all mini-batch samples are fed into the actor network to obtain the Gaussian policy distribution, from which the action a_new and the policy entropy are sampled. The policy entropy weight β is then updated along its gradient; this can be regarded as a gradient step on the relative entropy weight factor β:
Next, given state s_t and action a_new, the 2 critic networks output the state-value functions, and the loss function of the actor network can be calculated as:
For the stability of training, and to make the actor network differentiable, a re-parameterization (resampling) technique is applied to the actions, i.e., a_t = f_φ(ε_t; s_t), where ε_t is an input noise vector sampled from some fixed distribution — here simply a Gaussian; the sampled value is multiplied by the covariance and added to the mean to obtain the final output action a_t.
To obtain the loss function of the critic networks, the s_t and a_t of the mini-batch samples are fed into the 2 critic networks respectively to obtain the action-state value functions based on s_t and a_t. The s_{t+1} in the tuples is fed into the actor network to obtain the new policy π'_φ, from which the action a_next is sampled; the 2 target critic networks then produce the target action-state value functions based on s_{t+1} and a_next, and the target value is:
The loss function of the critic networks is:
Next, K_u rounds of parameter updates are performed using the Adam optimizer. The relative entropy weight factor β and the actor network use the same learning rate α_A, and the learning rate of the two critic networks is α_C. Finally, after K_u rounds of updates, the two target critic network parameters are updated:
where τ is a constant satisfying τ ≪ 1.
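The target-network update above is Polyak averaging; sketched over flat parameter lists:

```python
def soft_update(target_params, source_params, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target.
    With tau << 1 the target network tracks the critic slowly,
    which stabilizes the bootstrap targets."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]

new_target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.1)  # [0.1, 2.0]
```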
After this training step finishes, the next time step begins. When t = T_max, the queues are emptied and re-initialized again, and the next round starts. When the maximum round K_max is reached, the optimal relative entropy weight factor β*, actor network parameters, critic network parameters, and target critic network parameters are obtained, i.e., the optimal strategy is obtained.
FIG. 3 depicts the delay upper bound as the arrival rate ρ_i increases, for different numbers of task types. As the task arrival rate increases, the delay performance of C-V2I-based offloading is better than that of the other two offloading modes, because when tasks are transmitted via C-V2I there is no contention for communication resources between tasks; only under heavy traffic does an increasing task arrival rate ρ_i strongly affect the C-V2I delay, because as the arrival rate grows the server CPU cannot process the tasks at the base station in time, causing the term in equation (14) to grow. As can be seen from FIG. 3, at low arrival rates the delay of millimeter wave is even lower than that of C-V2I, because the bandwidth of millimeter wave is very large. Another notable phenomenon in the figure is that for a relatively small number of task types, N = 5, the delay upper bounds of mmWave and CV2I have a descending stage. This occurs because, as shown by equation (26), the communication resources available to the initially offloaded tasks exceed the available CPU computation resources, so the delay is first dominated by the computation term; later, as the task arrival rate ρ_i grows, it becomes the main determinant of task delay.
FIG. 4 shows the delay upper bound as the burstiness σ_i increases. In this simulation, the arrival rates ρ_i of all tasks are set to 0.5 Mbps. As the burstiness σ_i is increased to about 5 Mbps, it can be seen in FIG. 4 that σ_i has a linear effect on the upper bound of the task delay. Furthermore, the delay upper bound of DSRC is larger than those of the other technologies, which means DSRC is not suitable for large-traffic vehicular scenarios.
FIG. 5 compares the violation probability of the delay upper bound for the different offloading communication technologies under different server CPU resources, with the number of task categories N = 5. As is apparent from FIG. 5, increasing CPU cycle resources can significantly reduce the probabilistic latency. In addition, the performance of millimeter-wave communication in a low-traffic task scenario (ρ = 0.5 Mbps) is superior to the other two offloading communication technologies, mainly because millimeter wave has a huge bandwidth advantage in the low-traffic scenario compared with the bandwidth C-V2I reserves for tasks.
FIG. 6 studies the impact of the number N of different task categories on the delay performance of the different heterogeneous communication technologies. In this simulation, the arrival rates of all tasks are set to ρ = 0.8 Mbps, and the total CPU processing capacity is f_E = 10⁴ GHz. It can be seen that when N = 10, the probabilistic delay of millimeter wave exceeds that of C-V2I, reflecting — consistently with FIGS. 3 and 4 — that at large traffic the delay performance of millimeter wave is not as good as C-V2I. On the other hand, the network performance of DSRC deteriorates rapidly as traffic increases.
FIG. 7 shows the random delay probability upper bound ε for various communication resources of mmWave, DSRC, and C-V2I. In this simulation, the same basic parameter settings are used for all communication technologies: the burstiness σ and arrival rate ρ are set to 0 Mbps and 0.5 Mbps respectively, the CPU cycle resources are all f_E = 10⁴ GHz, and each communication technology carries the offloading of N = 5 tasks. As can be seen from FIG. 7, the delay performance of each communication technology deteriorates to varying degrees as the communication resources decrease.
FIG. 8 shows the complementary cumulative distribution function (CCDF) of the delays of the three offloaded tasks under the SAC-based strategy for different V values; the CCDF reflects the probability that the task delay exceeds a given threshold. The latency of a task is defined here as the maximum offloading latency over the three different communication technologies. The figure shows that, under SAC strategies with different V values, a task offloading delay exceeding the delay requirement T = 50 ms is an extremely low-probability event. This reflects the validity of the SAC strategy proposed by the present invention and its ability to guarantee the offloading task latency requirements.
FIG. 9 shows the queue backlog corresponding to the offloaded 3D game under different values of the weight coefficient V. The 3D game is considered because it requires the most CPU cycle resources, whereas the other two tasks, VR and AR, can generally be processed in time because the CPU cycle resources they require are relatively small. As can be seen from FIG. 9, the SAC-based policy can guarantee the stability of the queue, and the larger V is, the smaller the final stable queue length. Another phenomenon that can be derived from the graph is that both the random-based strategy and the average-based strategy have smaller queue lengths than the SAC strategy, because neither strategy takes system utility into account.
FIG. 10 shows the system utility F(t) of the equal-allocation equal-offloading (EAEO) strategy based on an average distribution, the random-allocation random-offloading (RARO) strategy, and SAC strategies with different V values. As can be seen from the figure, the system utility F(t) of the EAEO and RARO policies is much greater than that of the SAC-based policy, which shows that in FIG. 9 the EAEO and RARO policies achieve queue-length stability by sacrificing system utility (i.e., CPU resources), thereby verifying the effectiveness of the SAC-based policy. FIG. 10 also shows the difference in utility F(t) for SAC-based strategies at different values of V. As V increases, the system utility F(t) decreases, because with larger V the system allocates more processing clock-rate resources f_E to ensure system stability. The additional allocation of CPU cycle resources lets the system offload over the millimeter-wave and DSRC communication technologies, making the overall return r_t smallest.
Claims (7)
1. A task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning, characterized by comprising the following steps:
(1) constructing a vehicle edge computing scenario consisting of a base station connected to a server, several roadside units, and vehicles; the three communication technologies of millimeter wave, DSRC, and CV2I form a heterogeneous vehicular communication network, and a vehicle can offload tasks to the server for processing through any of the three technologies;
(2) constructing a bounded burst-type traffic model based on stochastic network calculus theory;
(3) constructing a dynamic model of the base-station queue to guarantee the stability of the base-station queue;
(4) establishing communication transmission models for the three communication technologies of millimeter wave, DSRC, and CV2I based on stochastic network calculus theory, and establishing a computation processing model of the CPU; combining the communication transmission model and the computation processing model by min-plus convolution via the concatenation (series) theorem to obtain the system processing model;
(5) deriving an upper bound on the delay-violation probability for offloading and processing over each communication technology, where the delay comprises the communication transmission time and the server computation processing time;
(6) establishing the vehicle edge computing system utility, which consists of a communication utility and a computation utility;
(7) formulating an optimization problem whose objective is to minimize the system utility while guaranteeing the task offloading delay and the stability of the base-station queue;
(8) using Soft Actor-Critic reinforcement learning to learn the offloading policy and the server CPU allocation policy for each task.
2. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (2) is implemented as follows:
assume that the vehicle has K types of tasks to process; at the beginning of each slot t, A_i(t) is the amount of task data accumulated into queue i over the time interval [t, t+1). Given a time interval 0 ≤ s ≤ t, define the bivariate cumulative process A_i(s, t) = A_i(t) − A_i(s) = A_i(t − s) (by stationarity) as the cumulative amount of the i-th task arriving at queue i; A_i(s, t) is a bounded burst-type traffic model and satisfies the stationary non-negative stochastic process:

A_i(s, t) = λ_i [ρ_i (t − s) + σ_i]    (1)

where ρ_i is the task arrival rate and σ_i is the task burst size, both constants, and λ_i follows a Poisson distribution, representing the number of vehicles producing the i-th task in the time interval [s, t).
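The bounded burst arrival process of equation (1) can be sampled per slot as in the sketch below; all parameter values are hypothetical, and only the functional form λ_i[ρ_i(t − s) + σ_i] with Poisson-distributed λ_i comes from the claim.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def arrival_amount(rho, sigma, lam_mean, interval=1):
    """Sample A_i(s, t) = lambda_i * (rho_i * (t - s) + sigma_i) for one interval,
    where lambda_i ~ Poisson(lam_mean) counts the vehicles producing task i."""
    lam = rng.poisson(lam_mean)
    return lam * (rho * interval + sigma)

# Hypothetical parameters: rate 2 Mbit/slot, burst 0.5 Mbit, 5 vehicles on average.
A = arrival_amount(rho=2.0, sigma=0.5, lam_mean=5)
```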
3. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (3) is implemented as follows:
the queue length of the base station evolves as:

q_i(t+1) = [q_i(t) − α_i(t) f^E / ω_i]^+ + A_i(t)    (2)

where q_i(t) is the queue length of the i-th task at the beginning of slot t, f^E is the maximum CPU processing clock rate of the server, ω_i denotes the number of CPU clock cycles required to process each bit of task i, α_i(t) denotes the fraction of CPU clock cycles the server allocates to the i-th task, and [x]^+ = max(x, 0). The stability of all queues is governed by the following definition:

lim sup_{t→∞} (1/t) Σ_{τ=0}^{t−1} Σ_i E[q_i(τ)] < ∞    (3)

The left-hand side of equation (3) describes the long-term time-averaged backlog of the queues; equation (3) means that strong stability of the queues corresponds to a finite average backlog, and hence a finite average queuing delay.
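The per-slot queue dynamics described above can be simulated as in the following sketch; the parameter values are hypothetical, and the update rule q ← [q − α f^E/ω]^+ + A follows the definitions of f^E, ω_i, α_i(t), and [x]^+ given in the claim.

```python
import numpy as np

def queue_step(q, alpha, A, f_E, omega):
    """One-slot queue update: serve alpha_i * f_E / omega_i bits of each task,
    clamp at zero, then add the new arrivals A_i."""
    served = alpha * f_E / omega
    return np.maximum(q - served, 0.0) + A

# Hypothetical 3-task example (queue lengths in bits, omega in cycles/bit, f_E in Hz).
q = np.array([1e6, 5e5, 2e6])
alpha = np.array([0.5, 0.2, 0.3])        # CPU shares, summing to at most 1
omega = np.array([500.0, 300.0, 800.0])  # CPU cycles required per bit
A = np.array([1e5, 1e5, 1e5])            # arrivals this slot
q_next = queue_step(q, alpha, A, f_E=1e9, omega=omega)
```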
4. The task offloading method based on the heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (4) is implemented as follows:
let β^mmw(s, t) denote the total communication transmission capacity that millimeter wave can provide over the time interval [s, t), C^(q) the channel capacity of slot q, and ζ^(q) the channel gain of slot q; γ denotes the signal-to-noise ratio, B the bandwidth, and l and δ the transmission distance and path-loss exponent, respectively, with η = B log₂ e. Using φ_i^mmw to denote the proportion of task i transmitted over millimeter wave, the millimeter-wave transmission capacity available to the i-th task is:

β_i^mmw(s, t) = φ_i^mmw β^mmw(s, t)
The delay-rate (latency-rate) model in stochastic network calculus is used to characterize the total traffic that DSRC communication can serve within the time interval [s, t):

β^dsrc(s, t) = R_dsrc [(t − s) − d_dsrc]^+

where R_dsrc is the DSRC communication bandwidth and d_dsrc is the average access delay incurred by collisions when data is transmitted over DSRC. Using φ_i^dsrc to denote the proportion of task i transmitted over DSRC, the DSRC transmission capacity available to the i-th task is:

β_i^dsrc(s, t) = φ_i^dsrc β^dsrc(s, t)
Analogously, β_i^cv2i(s, t) denotes the transmission capacity that CV2I can provide the i-th task over the time interval [s, t).
Let β_i^E(s, t) denote the amount of computation that the server CPU can provide within the time interval [s, t) for processing task i offloaded to the server; it equals the CPU processing amount in equation (2). Let the set 𝒢 denote the communication technologies available for offloading. Use A_i^g(s, t) to denote the amount of task i offloaded to the server via communication technology g over [s, t), φ_i^g the proportion of task i communicated via technology g, and q_i(s) the backlog of queue i not yet processed before time s; β_i^{E,g}(s, t) denotes the computation capacity that the server CPU provides within [s, t) for the portion of task i offloaded via technology g.
The delay of a task is the sum of the communication transmission time and the server CPU processing time: the communication link and the CPU serve task i in series. By the concatenation theorem, the total service that the system can provide to task i offloaded via communication technology g is the min-plus convolution of the communication transmission capacity β_i^g of technology g and the CPU computation capacity β_i^{E,g}:

β_i^{tot,g}(s, t) = (β_i^g ⊗ β_i^{E,g})(s, t)

where ⊗ is the min-plus convolution operator, the most important operator in stochastic network calculus, with the following rule:

(f ⊗ g)(s, t) = inf_{s ≤ u ≤ t} { f(s, u) + g(u, t) }
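The min-plus convolution used to combine the transmission and CPU service models can be sketched as below for bivariate service curves over discrete slots; the two toy curves (a latency-rate transmission curve and a constant-rate CPU curve) are illustrative, not the patent's models.

```python
def min_plus_conv(f, g, s, t):
    """Min-plus convolution of two bivariate service curves:
    (f ⊗ g)(s, t) = min over u in [s, t] of f(s, u) + g(u, t)."""
    return min(f(s, u) + g(u, t) for u in range(s, t + 1))

# Toy curves: a latency-rate transmission curve (rate 2, latency 1 slot)
# and a constant-rate CPU curve (rate 1.5); values are illustrative only.
beta_comm = lambda s, t: max(0.0, 2.0 * ((t - s) - 1))
beta_cpu = lambda s, t: 1.5 * (t - s)
end_to_end = min_plus_conv(beta_comm, beta_cpu, 0, 4)  # end-to-end service over [0, 4]
```

The minimum is attained at the intermediate time u that balances how much of the interval is spent in transmission versus computation, which is exactly why serving two stages in series yields the convolution rather than the sum.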
5. The task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (5) is implemented as follows:
use W_i^g(t) to denote the delay of task i offloaded via communication technology g, and W_i^{g,ub} to denote the upper bound on the delay of task i transmitted via technology g: the probability that the task transmission and processing time W_i^g(t) exceeds W_i^{g,ub} is less than ε_i, defined as

P{ W_i^g(t) > W_i^{g,ub} } ≤ ε_i

Assuming that the offloaded tasks can obtain communication and computation resources far greater than the rate at which tasks are offloaded via technology g, a closed-form solution for W_i^{g,ub} is obtained, in which the first term is determined by the delay-violation probability bound ε_i, the second term is related to the size of the task burst, and the third term is jointly determined by the residual communication resources of technology g, the residual computation resources of the server, and the amount of task i offloaded via technology g.
6. The task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (7) is implemented as follows:
where T_i^max is the maximum transmission-and-processing delay requirement of the i-th task, the control variable α(t) = [α_1(t), α_2(t), ..., α_N(t)] allocates the CPU clock-cycle resources, and φ(t) is the communication offloading policy. Condition C1 requires the queues to remain stable. Condition C2 ensures that the transmission and processing time of each type of task stays within its maximum delay requirement; since a task may be offloaded through three different communication technologies, the maximum of W_i^{mmw,ub}, W_i^{dsrc,ub}, and W_i^{cv2i,ub} is taken as the upper bound of the transmission delay of the i-th task. Constraint C3 ensures that the CPU clock cycles used to process all tasks cannot exceed the total CPU computation resources available on the server. Constraint C4 ensures that each task selects millimeter wave, DSRC, or CV2I to perform the computation task.
The Lyapunov technique is used to handle the long-term stochastic constraint C1. Define the quadratic Lyapunov function L(t) and the one-slot Lyapunov drift ΔL_t:

L(t) = (1/2) Σ_i q_i(t)²,    ΔL_t = E[ L(t+1) − L(t) | q(t) ]

where q(t) = [q_1(t), q_2(t), ..., q_N(t)]. The expected system utility is then added to the drift to obtain the drift-plus-penalty term ΔL_t + V E[F(t) | q(t)], where V is a non-negative parameter set by the system to trade off between system utility and queue backlog. For any given control parameter V ≥ 0 and offloading workload α_i, an upper bound on the drift-plus-penalty term is derived:
The upper bound absorbs the original time-averaged long-term queue-length condition C1 into the objective in an implicit manner, so the optimization objective of problem P1 is converted into minimizing, in each slot,

F_2(t) = V F(t) + Σ_i q_i(t) [ A_i(t) − α_i(t) f^E / ω_i ]
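As a sketch of the drift-plus-penalty step, the following shows one common per-slot objective of the Lyapunov framework: V times the system utility plus the queue-weighted difference between arrivals and service. The patent's exact F_2(t) may differ; all numbers here are hypothetical.

```python
import numpy as np

def drift_plus_penalty(q, arrivals, service, utility, V):
    """Per-slot drift-plus-penalty objective:
    V * F(t) + sum_i q_i(t) * (A_i(t) - service_i(t)).
    Minimizing this each slot trades system utility against queue backlog;
    larger V weights the utility term more heavily."""
    return V * utility + float(np.dot(q, np.asarray(arrivals) - np.asarray(service)))

obj = drift_plus_penalty(q=np.array([2.0, 3.0]),
                         arrivals=[1.0, 1.0],
                         service=[0.5, 2.0],
                         utility=4.0,
                         V=2.0)
```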
A DRL framework containing states, actions, and rewards is employed to formulate the computing resource allocation policy and the heterogeneous communication offloading policy problem in the VEC:
the state space s_t at time t is constructed from the queue and task information; since its components are N-dimensional and 4N-dimensional respectively, the dimension of the state space is 5N.
The action space a_t at time t consists of the CPU allocation fractions α_i(t) and the offloading proportions φ_i^g(t), which must satisfy the constraints in formula (30): Σ_i α_i(t) ≤ 1 and Σ_{g∈𝒢} φ_i^g(t) = 1. A virtual variable α_{N+1}(t) is added so that the deep neural network outputs an (N+1)-dimensional action, and a softmax function at the output layer makes the N+1 variables sum to one; only the first N actions are then kept, which enforces Σ_i α_i(t) ≤ 1. Similarly, applying a softmax function to the outputs φ_i^mmw(t), φ_i^dsrc(t), and φ_i^cv2i(t) of each task i enforces Σ_g φ_i^g(t) = 1. The action space therefore has dimension N + 1 + 3N = 4N + 1, and it grows with the number of task types.
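The dummy-action softmax trick described above can be sketched as follows; N and the raw network outputs (logits) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

N = 3  # number of task types (hypothetical)
logits = np.array([0.2, -1.0, 0.5, 0.0])  # N + 1 raw network outputs
alpha_full = softmax(logits)              # sums to exactly 1 across N + 1 entries
alpha = alpha_full[:N]                    # keep the first N CPU shares;
                                          # the dummy entry absorbs the slack
```

Because the discarded (N+1)-th entry is strictly positive, the kept N shares always sum to strictly less than one, which realizes the inequality constraint without any clipping.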
The reward function r_t at time t is:
r_t(a_t, s_t) = −F_2(t)    (38)
r_t(a_t, s_t) is the reward that the environment feeds back to the agent after it takes action a_t in state s_t. With π(a_t|s_t) denoting the distribution over actions the agent takes given state s_t, the expected long-term discounted return of the system is computed as:

J(π) = E_{τ∼π} [ Σ_t γ^t r_t(a_t, s_t) ]    (39)

where γ ∈ [0, 1] is the discount factor indicating whether the agent focuses on long-term or short-term rewards: the larger the value, the more the agent values long-term rewards, and conversely the more it values the current short-term reward; τ = (s_0, a_0, s_1, a_1, ...) is the state-action trajectory generated by the agent according to the action distribution π(a_t|s_t).
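The expected discounted return along one trajectory can be computed with the usual backward recursion; this sketch uses arbitrary rewards and γ = 0.5.

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t, accumulated backwards along one trajectory."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```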
7. The task offloading method based on heterogeneous communication technology ultra-reliable low-latency reinforcement learning of claim 1, wherein the step (8) is implemented as follows:
The SAC algorithm maximizes the optimization objective −F_2(t) while introducing the policy entropy H(π_t(·|s_t)) into the reward; the expected long-term discounted return of the model then becomes

J(π) = E_{τ∼π} [ Σ_t γ^t ( r_t(a_t, s_t) + β_t H(π_t(·|s_t)) ) ]    (40)

where β_t is the weight of the policy entropy, which trades off between exploring feasible policies and maximizing the optimization objective. Because the reward keeps changing, a fixed β_t can affect the stability of the whole training, so automatically adjusting β_t during training is necessary; the reinforcement learning optimization problem is therefore converted into a constrained problem by setting a lower bound on the expected policy entropy, so that β_t H(π_t(·|s_t)) in formula (40) is made as large as possible. When the agent has not yet learned the optimal action, β_t is increased to explore more of the action space; conversely, if the best strategy has already been learned, β_t is decreased to reduce exploration and accelerate the training of the model. β_t can then be obtained based on the Lagrange multiplier method.
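A common form of the automatic entropy-weight adjustment via the Lagrange multiplier method, as used in standard SAC implementations, can be sketched as follows; the patent's exact formulation is not reproduced, and the sample log-probabilities and target entropy are hypothetical.

```python
import numpy as np

def temperature_loss(log_beta, log_probs, target_entropy):
    """Dual loss for the entropy weight: L(beta) = beta * (H_current - H_target),
    with H_current estimated as -mean(log pi(a|s)). Gradient descent on log_beta
    raises beta when the current entropy is below the target (more exploration
    needed) and lowers it otherwise (exploit the learned policy)."""
    beta = np.exp(log_beta)
    current_entropy = -np.mean(log_probs)
    return beta * (current_entropy - target_entropy)

loss = temperature_loss(log_beta=0.0, log_probs=[-2.0, -2.0], target_entropy=1.0)
```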
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210756389.7A CN115118783A (en) | 2022-06-30 | 2022-06-30 | Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210756389.7A CN115118783A (en) | 2022-06-30 | 2022-06-30 | Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115118783A true CN115118783A (en) | 2022-09-27 |
Family
ID=83330488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210756389.7A Pending CN115118783A (en) | 2022-06-30 | 2022-06-30 | Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115118783A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117130693A (en) * | 2023-10-26 | 2023-11-28 | 之江实验室 | Tensor unloading method, tensor unloading device, computer equipment and storage medium |
CN117130693B (en) * | 2023-10-26 | 2024-02-13 | 之江实验室 | Tensor unloading method, tensor unloading device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113242568B (en) | Task unloading and resource allocation method in uncertain network environment | |
CN113950066B (en) | Single server part calculation unloading method, system and equipment under mobile edge environment | |
CN111867139B (en) | Deep neural network self-adaptive back-off strategy implementation method and system based on Q learning | |
CN113296845B (en) | Multi-cell task unloading algorithm based on deep reinforcement learning in edge computing environment | |
CN112860350A (en) | Task cache-based computation unloading method in edge computation | |
CN110717300B (en) | Edge calculation task allocation method for real-time online monitoring service of power internet of things | |
CN111711666B (en) | Internet of vehicles cloud computing resource optimization method based on reinforcement learning | |
CN111132074B (en) | Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment | |
CN113543074A (en) | Joint computing migration and resource allocation method based on vehicle-road cloud cooperation | |
CN113727306B (en) | Decoupling C-V2X network slicing method based on deep reinforcement learning | |
CN116541106B (en) | Computing task unloading method, computing device and storage medium | |
CN115118783A (en) | Task unloading method based on heterogeneous communication technology ultra-reliable low-delay reinforcement learning | |
CN114928611B (en) | IEEE802.11p protocol-based energy-saving calculation unloading optimization method for Internet of vehicles | |
Shaodong et al. | Multi-step reinforcement learning-based offloading for vehicle edge computing | |
CN113452625B (en) | Deep reinforcement learning-based unloading scheduling and resource allocation method | |
CN115052262A (en) | Potential game-based vehicle networking computing unloading and power optimization method | |
Omland | Deep Reinforcement Learning for Computation Offloading in Mobile Edge Computing | |
CN117834643B (en) | Deep neural network collaborative reasoning method for industrial Internet of things | |
Sun et al. | EC-DDPG: DDPG-Based Task Offloading Framework of Internet of Vehicle for Mission Critical Applications | |
CN115134242B (en) | Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy | |
Ma et al. | Deep Reinforcement Learning-based Edge Caching and Multi-link Cooperative Communication in Internet-of-Vehicles | |
CN117793805B (en) | Dynamic user random access mobile edge computing resource allocation method and system | |
Wang et al. | Joint Optimization for MEC Computation Offloading and Resource Allocation in IoV Based on Deep Reinforcement Learning | |
CN117793801B (en) | Vehicle-mounted task unloading scheduling method and system based on hybrid reinforcement learning | |
Zhao et al. | A Novel Multi-Criteria Contribution Evaluation Scheme for Federated Learning in Internet of Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||