CN111866807A - Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning - Google Patents

Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning

Info

Publication number
CN111866807A
CN111866807A
Authority
CN
China
Prior art keywords
vehicle
unloading
rsu
task
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010571179.1A
Other languages
Chinese (zh)
Other versions
CN111866807B (en)
Inventor
李致远
彭二帅
潘森杉
毕俊蕾
张威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202010571179.1A
Publication of CN111866807A
Application granted
Publication of CN111866807B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30: Services specially adapted for particular environments, situations or purposes
    • H04W4/40: Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44: Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • H04W24/06: Testing, supervising or monitoring using simulated traffic
    • H04W28/00: Network traffic management; Network resource management
    • H04W28/02: Traffic management, e.g. flow control or congestion control
    • H04W28/08: Load balancing or load distribution
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning, which comprises the following steps: 1. obtain information such as the RSUs accessible to the vehicle and the vehicle task information; 2. divide the unloading time slots of the vehicle-mounted tasks according to the RSU information from step 1; 3. convert the vehicle-mounted task unloading time slot decision method into a mathematical problem; 4. solve the mathematical problem of step 3 with a deep reinforcement learning method; 5. deploy the algorithm to the SDN controller. The invention can make full use of the unloading time slot to reduce the network transmission delay caused by unloading. The unloading time slot decision method fully considers factors including the relative position of the vehicle and the RSU, the number of vehicles connected to the RSU, and the number of vehicle-mounted tasks the RSU must receive, and can effectively reduce the unloading delay of vehicle tasks.

Description

Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of vehicle-mounted mobile edge computing and relates to a vehicle-mounted task unloading time slot decision method. It is suitable for small-cell base station environments and is particularly suitable for load balancing among small-cell base stations in a local area network.
Background
With the rapid development of Internet of Things technology, Mobile Edge Computing (MEC) has become an important component of it. Users can access mobile edge computing through wireless access points such as base stations and Road Side Units (RSUs). MEC can provide computing, storage and other resources to users. These capabilities find wide application in vehicular networks: Vehicle Edge Computing (VEC) is a network model that has developed in recent years.
Applications in the vehicular network can make vehicle travel more convenient and safer. With the continuous development of vehicle applications, applications that require strong computing power and a large amount of storage space, such as real-time road analysis, automatic driving and virtual reality, are becoming more numerous, and the data content that needs to be transmitted is also growing. Current mainstream research on vehicle task unloading focuses on the allocation of computing resources. Most vehicle-mounted task unloading time slot decisions are made by random selection, which cannot make full use of the unloading time slot to reduce the network transmission delay caused by unloading. Factors influencing the task unloading time slot include the relative position of the current vehicle and the RSU, the number of vehicles connected to the current RSU, and the number of vehicle-mounted tasks the current RSU must receive.
In view of the above, it is desirable to provide a vehicle-mounted task unloading time slot decision method that can cope with the unloading of vehicle-mounted tasks while taking the various influencing factors into account.
Disclosure of Invention
Aiming at the above problems, the invention provides a software-defined vehicle-mounted task unloading time slot decision method based on deep reinforcement learning. Through a Software-Defined Network (SDN), the method obtains global state perception data of the network, such as the number of vehicles connected to each RSU, the load state of the MEC (Mobile Edge Computing) server and the network delay of each RSU, and on this basis constructs an adaptive optimization decision combined with a deep reinforcement learning model, giving suggestions for local unloading, global unloading and the optimal vehicle-mounted task unloading time slot, so as to solve the problem of excessive delay caused by the unloading of vehicle-mounted tasks. The method comprises the following steps:
step 1, obtaining information: the set r of RSUs accessible to the vehicle, the vehicle tasks Q requesting unloading in the RSU area, and the network bandwidth b of the RSUs;
step 2, dividing unloading time slots of the vehicle-mounted tasks according to the RSU information in the step 1;
step 3, modeling the vehicle-mounted task unloading time slot decision method;
step 4, solving the model expression in the step 3 by using a deep reinforcement learning method;
step 5, deploying the algorithm to the SDN controller.
Further, the information in step 1 includes:
① the unloading tasks in the RSU area are denoted Q = {Q_1, …, Q_i, …, Q_n}, where Q_i represents the task of the i-th vehicle;
② the sizes of the vehicle-mounted tasks are denoted M = {M_1, …, M_i, …, M_n}, where M_i represents the size of Q_i;
③ the delay constraints of the vehicle-mounted tasks are denoted T = {T_1, …, T_i, …, T_n}, where T_i is the delay constraint of Q_i;
④ the set of RSUs accessible to vehicles is defined as r = {R_1, …, R_i, …, R_n}, where R_i represents the i-th RSU;
⑤ the number of vehicle-mounted tasks admitted by each RSU is rA = {R_1A, …, R_iA, …, R_nA}, where R_iA represents the number of vehicle-mounted tasks admitted by the i-th RSU;
⑥ the bandwidths of the RSUs are denoted B = {B_1, …, B_i, …, B_n}, where B_i represents the network bandwidth of R_i.
further, the unloading time slot dividing method of the vehicle-mounted task in the step 2 comprises the following steps:
step 2.1, collecting link bandwidth of RSU, and recording as W; collecting the average signal power of the RSU, and recording the average signal power as P; collecting noise work of RSUThe rate is marked as N; recording link loss power of RSU and vehicle as Lp
Step 2.2, the transmission rate v of the vehicle and the RSU can be expressed as:
Figure BDA0002549632880000021
wherein [ L ]P]D is the distance of the vehicle from the RSU in km and f is the signal frequency of the RSU in MHz at 32.45+20lgd +20 lgf.
Step 2.3, the transmission delay of the onboard task with size M can be expressed as:
Figure BDA0002549632880000022
Step 2.4, because the network transmission delay is influenced by the relative distance between the vehicle and the RSU, dividing the coverage area of each RSU into n task unloading time slots Gap1,…,Gapi,…GapnWhere any slot is denoted by g, g ∈ [ Gap [ ]1,…,Gapi,…Gapn]. For convenience of calculation and description, the transmission rates in the same region are assumed to be the same. For convenience of calculation, the RSU is used as a ground vertical point, and g is the distance between the unloading time slot and the vertical point. Then
Figure BDA0002549632880000023
Wherein high is the vertical height of the RSU and the ground;
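By way of illustration only (the following sketch is not part of the patent text), formulas (1)-(3) can be exercised in Python; the helper name slot_delay and every numeric value below are hypothetical placeholders:

```python
import math

def slot_delay(M_bits, W_hz, P_watt, N_watt, g_km, high_km, f_mhz):
    """Delay of unloading an M-bit task from slot distance g, per formulas (1)-(3)."""
    d = math.sqrt(g_km ** 2 + high_km ** 2)                        # formula (3)
    loss_db = 32.45 + 20 * math.log10(d) + 20 * math.log10(f_mhz)  # [L_p] in dB
    L_p = 10 ** (loss_db / 10)                                     # linear link loss
    v = W_hz * math.log2(1 + P_watt / (N_watt * L_p))              # formula (1)
    return M_bits / v                                              # formula (2)

# Hypothetical example: 8-Mbit task, 10-MHz link, RSU 10 m high, 5.9-GHz band
print(slot_delay(M_bits=8e6, W_hz=10e6, P_watt=1.0,
                 N_watt=1e-9, g_km=0.05, high_km=0.01, f_mhz=5900))
```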
further, the method for modeling the vehicle-mounted task unloading time slot decision method in the step 3 comprises the following steps:
step 3.1, define the offload slot decision as L ═ L1,…,Li,…,Ln},LiThe unloading time slot decision is obtained by representing the unloading task selection positions of the ith vehicle, wherein the combination of the unloading task selection positions of all vehicles is the unloading time slot decision;
and 3.2, determining the unloading decision of a single task. Single slot decision L for unloading certain vehicle-mounted taskiI.e. selection of an offload slot g, i.e. pair
Figure BDA0002549632880000031
Must have Li∈[Gap1,…,Gapi,…Gapn]
And 3.3, as can be seen from the formulas (1) and (2), the transmission delay of the vehicle-mounted task is determined by the bandwidth b of the RSU, the unloading time slot decision l and the size m of the vehicle-mounted task, and the transmission delay of the vehicle-mounted task can be rewritten as:
Figure BDA0002549632880000032
the link bandwidth W of the RSU is replaced by the bandwidth b of the RSU by the expression (3); representing the relative distance between the vehicle and the RSU;
And 3.4, rewriting the transmission delay of the vehicle-mounted task again by the formula (3) as follows:
Figure BDA0002549632880000033
wherein [ L ] isP]=32.45+20lgd+20lgf;
Step 3.5, converting the decision method of the unloading time slot of the vehicle-mounted task into a solving formula (5), Di(b,l,Mi) Indicating the transmission delay of the ith onboard task.
Figure BDA0002549632880000034
Wherein, z represents whether to unload the task, z is 1 represents unloading the task, and z is 0 represents unloading the task; MAXrARepresents the maximum value of rA; the value of rA is influenced by the decision of unloading time slot of the vehicle-mounted task, and rA is less than or equal to MAXrAIndicating that rA cannot exceed the maximum number of on-board tasks to be accessed.
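Continuing the earlier sketch (still illustrative, not from the patent), objective (5) can be evaluated for one candidate slot decision as follows; total_unload_delay and its argument names are invented for the example and reuse the hypothetical slot_delay helper defined above:

```python
def total_unload_delay(decision, tasks_mbits, b_hz, max_rA, **link):
    """Objective (5): summed delay of the unloaded tasks for one slot decision.

    decision: slot distance l_i (km) per vehicle, or None to keep the task local (z=0).
    Returns float('inf') when the admission constraint rA <= MAX_rA is violated.
    """
    rA = sum(1 for l in decision if l is not None)
    if rA > max_rA:                          # constraint rA <= MAX_rA
        return float('inf')
    total = 0.0
    for l, m in zip(decision, tasks_mbits):
        if l is not None:                    # z = 1: this task is unloaded
            total += slot_delay(M_bits=m * 1e6, W_hz=b_hz, g_km=l, **link)
    return total

# Hypothetical: three vehicles, the third keeps its task local (z = 0)
link = dict(P_watt=1.0, N_watt=1e-9, high_km=0.01, f_mhz=5900)
print(total_unload_delay([0.05, 0.10, None], [8, 4, 16], b_hz=10e6, max_rA=2, **link))
```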
Further, the specific steps of solving the formula (5) by using the deep reinforcement learning method in the step 4 are as follows:
step 4.1, build Markov State space
S={t,rV,rD,rA}
Wherein the various parameters are specified below:
① the delay constraints of the vehicle-mounted tasks are denoted t = {T_1, …, T_i, …, T_n}, where T_i is the delay constraint of task Q_i;
② the set of RSUs accessible to vehicles is defined as r = {R_1, …, R_i, …, R_n}; any unloading time slot of each RSU in r is denoted by g, g ∈ [Gap_1, …, Gap_i, …, Gap_n]. Since the unloading rates of vehicle tasks differ across unloading time slots, the set of unloading rates of all unloading time slots in r is expressed as rV = {R_1G_1V, …, R_iG_jV, …, R_nG_nV}, where R_iG_jV represents the transmission rate of the j-th unloading time slot of the i-th RSU;
③ the transmission delays of the vehicle-mounted tasks in each unloading time slot of each RSU in r are expressed as rD = {R_1G_1D, …, R_iG_jD, …, R_nG_nD}, where R_iG_jD represents the transmission delay of a vehicle-mounted task in the j-th unloading time slot of the i-th RSU;
④ the number of vehicle-mounted tasks admitted by each RSU is rA = {R_1A, …, R_iA, …, R_nA}.
Step 4.2, build the Markov action space
A = {(a, b) | a ∈ [1, n] ∩ N+, b ∈ [1, n] ∩ N+}
where the various parameters are specified below:
① a represents the RSU accessed by the vehicle when the vehicle-mounted unloading task is executed;
② b represents the unloading time slot of the RSU accessed by the vehicle when the vehicle-mounted unloading task is executed;
③ N+ represents the positive integers.
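As a concrete (and purely illustrative) reading of the state and action spaces of steps 4.1-4.2, one possible encoding in Python is sketched below; all field names and shapes are assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OffloadState:
    """Markov state S = {t, rV, rD, rA} from step 4.1 (one possible flattening)."""
    t:  np.ndarray   # delay constraints of the tasks, shape (n_tasks,)
    rV: np.ndarray   # transmission rate of each slot of each RSU, shape (n_rsu, n_slots)
    rD: np.ndarray   # transmission delay in each slot of each RSU, shape (n_rsu, n_slots)
    rA: np.ndarray   # number of tasks already admitted per RSU, shape (n_rsu,)

    def vector(self) -> np.ndarray:
        """Flatten into the feature vector phi(S) consumed by the Actor network."""
        return np.concatenate([self.t, self.rV.ravel(), self.rD.ravel(), self.rA])

# An action is the pair (a, b): the index of the chosen RSU and of its unloading
# time slot, both drawn from [1, n] ∩ N+ as in step 4.2.
```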
Step 4.3, establish the Markov reward function reward:
reward=(η)×base+(2(η)-1)×delay(rD,t)+access(rA)
wherein the various parameters are specified below:
① (η) is a step function:
(η) = 1 if the vehicle-mounted task is unloaded successfully, and (η) = 0 if the unloading fails;
when (η) = 1 the unloading of the vehicle-mounted task succeeded, and when (η) = 0 it failed;
② base is a constant representing the basic reward; (η) × base means that the basic reward is obtained when the unloading of the vehicle-mounted task succeeds and is not obtained when it fails;
③ delay(rD, t) represents the reward or penalty obtained for performing a vehicle-mounted unloading task:
delay(rD, t) = Rward × (rD - t)
where rD represents the time taken to unload the vehicle-mounted task and t represents its unloading time constraint; a reward is obtained when unloading is completed within the constraint time t, otherwise a penalty is obtained; Rward is the reward or penalty value;
④ access(rA) is used to judge whether the current RSU can admit more vehicle-mounted tasks:
access(rA) = 0 if rA ≤ MAX_rA; otherwise access(rA) cancels the other terms so that reward = 0
where MAX_rA represents the maximum number of vehicle-mounted tasks the current RSU can admit. When more vehicle-mounted tasks can be admitted, i.e. rA ≤ MAX_rA, access(rA) has no influence on the reward function reward; when rA > MAX_rA, access(rA) forces reward to equal 0, i.e. no reward is obtained.
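A minimal Python sketch of this reward function follows; it is illustrative only, the values of base and Rward are assumptions, and the over-capacity branch mirrors the rule above that access(rA) forces reward to 0:

```python
def reward_fn(rD: float, t: float, rA: int, max_rA: int,
              base: float = 10.0, rward: float = 1.0) -> float:
    """reward = (eta)*base + (2*(eta)-1)*delay(rD,t) + access(rA), per step 4.3."""
    eta = 1 if rD <= t else 0            # step function: 1 = unloading succeeded
    delay_term = rward * (rD - t)        # delay(rD, t), following the formula literally
    r = eta * base + (2 * eta - 1) * delay_term
    if rA > max_rA:                      # access(rA): RSU over capacity, no reward
        return 0.0
    return r

print(reward_fn(rD=0.2, t=0.5, rA=3, max_rA=5))   # 9.7 under these assumed values
```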
Step 4.4, according to the Markov model in the step 4.1-4.3, using DDPG-HER algorithm to solve the optimal unloading time slot decision, and the concrete solving steps are as follows:
step 4.4.1, establishing an Actor current network, an Actor target network, a Critic current network and a Critic target network, wherein the descriptions of the four networks are as follows:
① the parameter of the Actor current network is θ; θ also refers to the neural network itself, which is responsible for updating the parameter θ and for generating the current action A according to the current state S. Action A acts on the current state S, which represents the set of information such as the unloading time slot decision currently being made for a certain vehicle, the position of the vehicle and which decisions have already been made. This produces a state S' and a reward R, the reward R being obtained from the reward function reward;
② the parameter of the Actor target network is θ'; θ' also refers to a neural network, responsible for selecting an action A' from the experience replay pool and updating θ';
③ the parameter of the Critic current network is ω; ω also refers to a neural network, responsible for calculating the current Q value, which measures the quality of the selected action. Note: this Q value is different from Q_i, which earlier denoted the task of the i-th vehicle;
④ the parameter of the Critic target network is ω'; ω' also refers to a neural network, responsible for calculating the target Q value, i.e. Q'.
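For readers who want a concrete picture of the four networks, a minimal PyTorch sketch is given below; the patent does not specify layer sizes, activations or the action encoding, so everything here is an assumption:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state vector phi(S) to an action (a, b) in [0, 1]^2 (scaled to indices later)."""
    def __init__(self, state_dim: int, action_dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid())
    def forward(self, s): return self.net(s)

class Critic(nn.Module):
    """Estimates Q(S, A) for a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, a): return self.net(torch.cat([s, a], dim=-1))

# The target networks theta' and omega' start as copies of the current networks
actor, critic = Actor(state_dim=32), Critic(state_dim=32)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```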
Step 4.4.2, train the Actor current network, the Actor target network, the Critic current network and the Critic target network. The specific steps are as follows:
step 4.4.2.1, first obtaining an initialization state S, and the Actor current network generating action A according to the state S;
step 4.4.2.2, calculate the reward R according to state S and action a, and get the next state S';
step 4.4.2.3, storing { S, A, S' } in an experience playback pool;
step 4.4.2.4, recording the current state as S';
step 4.4.2.5, calculating the current Q value and the target Q value;
step 4.4.2.6, updating the Critic current network parameter omega;
step 4.4.2.7, updating the current network parameter theta of the Actor;
in step 4.4.2.8, if the current state S' is the termination state, the iteration is completed, otherwise go to step 4.4.2.2.
Step 4.4.3, the optimal unloading time slot is obtained from the trained network.
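The training loop of step 4.4.2 could look like the following simplified DDPG update, continuing the PyTorch sketch above; the hindsight-relabelling part of DDPG-HER is omitted for brevity, and all hyperparameters are assumptions:

```python
import random
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005
opt_actor  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = []   # experience replay pool of (S, A, R, S', done) tensor tuples

def train_step(batch_size: int = 64):
    """One pass over steps 4.4.2.5-4.4.2.7 on a sampled mini-batch (replay must be pre-filled)."""
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    # Step 4.4.2.5: current Q value and target Q value y = R + gamma * Q'(S', A', omega')
    with torch.no_grad():
        y = r + GAMMA * (1 - done) * critic_target(s2, actor_target(s2)).squeeze(-1)
    # Step 4.4.2.6: update the Critic current network omega towards y
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    # Step 4.4.2.7: update the Actor current network theta using the Critic's judgement
    actor_loss = -critic(s, actor(s)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    # Soft-update the target networks theta' and omega'
    for tgt, src in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```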
Further, the specific method for deploying the algorithm to the SDN controller in step 5 is as follows:
and after the DDPG-HER algorithm training is completed, saving the current network of the Actor and deploying the current network to the SDN controller. When unloading is required, the SDN controller determines the optimal unloading time slot for the vehicle-mounted task according to the current state information of the network and the nodes.
The invention has the beneficial effects that:
aiming at the defects of the prior art, the invention divides the coverage area of the RSU into a plurality of intervals to accurately select the unloading time slot, calculates the optimal unloading decision by reasonable analysis and modeling and simultaneously using a DDP-HER algorithm, and reduces the network delay caused by unloading of the vehicle task.
Drawings
FIG. 1 is a flowchart of an on-board task offload slot decision process
FIG. 2 is a flow chart of the DDPG-HER algorithm
Detailed Description
The invention will be further explained with reference to the drawings.
The present invention is further described below. It should be noted that this embodiment is based on the present technical solution and gives detailed implementation procedures and steps, but the scope of protection of the present invention is not limited to this embodiment.
As shown in FIG. 1, assume that vehicle i is now ready to unload its vehicle-mounted task Q_i. The specific implementation flow of the invention is then as follows:
(1) An SDN controller is used. Information in each local area network, including the set r of RSUs, the vehicle tasks Q requesting unloading in the RSU area and the network bandwidth b, is aggregated into the SDN controller whenever it changes. When vehicle i prepares to unload its vehicle-mounted task Q_i, its request information is sent to the SDN controller;
(2) According to the divided unloading time slots, vehicle i, preparing to unload its vehicle-mounted task Q_i, uses formula (4)
D_i(b, l, M_i) = M_i / (b · log2(1 + P / (N · L_p)))
to calculate the unloading delay produced by unloading in the different time slots;
(3) The SDN controller aggregates the unloading task Q_i of vehicle i and the unloading tasks of the other vehicles and, according to formula (5)
min Σ_{i=1..n} z · D_i(b, l, M_i)    s.t.  z ∈ {0, 1},  rA ≤ MAX_rA
converts the vehicle-mounted task unloading time slot decision method into solving the above expression;
(4) the above expression is solved using the DDPG-HER algorithm. The method comprises the following specific steps:
1. The initialization state S, i.e. the state of each RSU and the completion status of all vehicle tasks, is obtained first. The Actor current network generates an action A according to the state S, where A is the unloading time slot selected for the task of a certain vehicle. The specific method is: compute the feature vector φ(S) of state S and take the action
A = π_θ(φ(S)) + 𝒩
where π_θ denotes the strategy (action) generated by the neural network θ (the Actor current network), which selects a time slot for unloading the vehicle task according to information such as the current state of the RSUs, and 𝒩 represents exploration noise;
2. The reward R is calculated from the current state S and the action A, and a new state S' is generated. After the time slot for unloading the vehicle task is selected, the state of each RSU and the completion status of all vehicle tasks change, and the new state is defined as S';
3. {S, A, S'} is stored in the experience replay pool, the aim being to train the neural network better. The Actor target network θ' selects an action A' according to S' in the experience pool;
4. recording the current state as S';
5. Calculate the current Q value and the target Q value:
y = R + γ · Q'(S', A', ω')
Q(S, A, ω) is the current Q value, computed by inputting the state S and the action A into the Critic current network ω; y is the target Q value, where Q'(S', A', ω') is calculated on the same principle as Q(S, A, ω); γ is the discount factor.
6. Update the Critic current network ω using the current Q value and the target Q value:
ω ← ω + (y - Q(S, A, ω))
y represents a more accurate Q value, and ω + (y - Q(S, A, ω)) means that the Critic current network ω updates itself using the error between y and its own estimate.
The Critic current network ω then helps the Actor current network θ to update:
θ ← θ - TD(S, A, ω)
TD(S, A, ω) represents the error, calculated by ω, between the action A selected in state S and the optimal action; θ - TD(S, A, ω) means that the Actor current network θ eliminates this error.
If the current state S' is the termination state, the iteration is finished and the Actor current network makes the optimal unloading time slot decision; otherwise, go to step 2.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, obtaining information: the set r of RSUs accessible to the vehicle, the vehicle tasks Q requesting unloading in the RSU area, and the network bandwidth b of the RSUs;
step 2, dividing unloading time slots of the vehicle-mounted tasks according to the RSU information in the step 1;
step 3, modeling the vehicle-mounted task unloading time slot decision method;
and 4, solving the model expression in the step 3 by using a deep reinforcement learning method.
2. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning according to claim 1, wherein the information in the step 1 specifically comprises:
① the unloading tasks in the RSU area, denoted Q = {Q_1, …, Q_i, …, Q_n}, where Q_i represents the task of the i-th vehicle;
② the sizes of the vehicle-mounted tasks, denoted M = {M_1, …, M_i, …, M_n}, where M_i represents the size of Q_i;
③ the delay constraints of the vehicle-mounted tasks, denoted t = {T_1, …, T_i, …, T_n}, where T_i is the delay constraint of Q_i;
④ the set of RSUs accessible to vehicles, denoted r = {R_1, …, R_i, …, R_n};
⑤ the number of vehicle-mounted tasks admitted by each RSU, denoted rA = {R_1A, …, R_iA, …, R_nA};
⑥ the bandwidths of the RSUs, denoted B = {B_1, …, B_i, …, B_n}, where B_i represents the network bandwidth of R_i.
3. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning according to claim 1, wherein the unloading time slot dividing method for the vehicle-mounted task in the step 2 is as follows:
Step 2.1, collect the link bandwidth of the RSU, denoted W; collect the average signal power of the RSU, denoted P; collect the noise power of the RSU, denoted N; denote the link loss between the RSU and the vehicle as L_p;
Step 2.2, the transmission rate v between the vehicle and the RSU can be expressed as:
v = W · log2(1 + P / (N · L_p))    (1)
where [L_p] = 32.45 + 20 lg d + 20 lg f, d is the distance between the vehicle and the RSU, and f is the signal frequency of the RSU;
Step 2.3, the transmission delay of a vehicle-mounted task of size M can be expressed as:
D = M / v    (2)
Step 2.4, according to the influence of the relative distance between the vehicle and the RSU on the network delay, the coverage area of each RSU is divided into n task unloading time slots Gap_1, …, Gap_i, …, Gap_n, where any slot is denoted by g, g ∈ [Gap_1, …, Gap_i, …, Gap_n].
4. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning according to claim 3, wherein the method for modeling the vehicle-mounted task unloading time slot decision method in the step 3 is as follows:
Step 3.1, define the unloading decision as L = {L_1, …, L_i, …, L_n}, where L_i indicates the position at which the i-th vehicle selects to unload its task; taking the point on the ground directly below the RSU as the vertical point, g represents the distance between the unloading time slot and this vertical point; then
d = sqrt(g² + high²)    (3)
where high is the vertical height of the RSU above the ground;
Step 3.2, determine the unloading decision of a single task: the unloading time slot decision L_i of a vehicle-mounted task is the selection of an unloading time slot g, i.e. for ∀ L_i ∈ L there must be L_i ∈ [Gap_1, …, Gap_i, …, Gap_n];
Step 3.3, the transmission delay of the vehicle-mounted task can be determined by the bandwidth b of the RSU, the unloading time slot decision l and the size m of the vehicle-mounted task, so the transmission delay of the vehicle-mounted task can be rewritten as:
D = D(b, l, m)
where the link bandwidth W of the RSU is replaced by the bandwidth b of the RSU, and the relative distance between the vehicle and the RSU is represented by the decision l;
Step 3.4, by formula (3), the transmission delay of the vehicle-mounted task is rewritten again as:
D(b, l, M) = M / (b · log2(1 + P / (N · L_p)))    (4)
where L_p = 32.45 + 20 lg l (km) + 20 lg f (MHz);
Step 3.5, the vehicle-mounted task unloading time slot decision method is converted into solving formula (5), where D_i(b, l, M_i) denotes the transmission delay of the i-th vehicle-mounted task:
min Σ_{i=1..n} z · D_i(b, l, M_i)    s.t.  z ∈ {0, 1},  rA ≤ MAX_rA    (5)
where MAX_rA represents the maximum value of rA; the value of rA is influenced by the vehicle-mounted task unloading time slot decision, and rA ≤ MAX_rA means that rA cannot exceed the maximum number of vehicle-mounted tasks that can be admitted.
5. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning of claim 4 is characterized in that the specific steps of solving the formula (5) by using the deep reinforcement learning method in the step 4 are as follows:
step 4.1, build Markov State space
S={t,rV,rD,rA}
Wherein the various parameters are specified below:
① the delay constraints of the vehicle-mounted tasks are denoted t = {T_1, …, T_i, …, T_n}, where T_i is the delay constraint of task Q_i;
② the set of RSUs accessible to vehicles is defined as r = {R_1, …, R_i, …, R_n}; any unloading time slot of each RSU in r is denoted by g, g ∈ [Gap_1, …, Gap_i, …, Gap_n]; since the unloading rates of vehicle tasks differ across unloading time slots, the set of unloading rates of all unloading time slots in r is denoted rV = {R_1G_1V, …, R_iG_jV, …, R_nG_nV}, where R_iG_jV represents the transmission rate of the j-th unloading time slot of the i-th RSU;
③ the transmission delays of the vehicle-mounted tasks in each unloading time slot of each RSU in r are expressed as rD = {R_1G_1D, …, R_iG_jD, …, R_nG_nD}, where R_iG_jD represents the transmission delay of a vehicle-mounted task in the j-th unloading time slot of the i-th RSU;
④ the number of vehicle-mounted tasks admitted by each RSU is rA = {R_1A, …, R_iA, …, R_nA};
Step 4.2, build the Markov action space
A = {(a, b) | a ∈ [1, n] ∩ N+, b ∈ [1, n] ∩ N+}
where the various parameters are specified below:
① a represents the RSU accessed by the vehicle when the vehicle-mounted unloading task is executed;
② b represents the unloading time slot of the RSU accessed by the vehicle when the vehicle-mounted unloading task is executed;
③ N+ represents the positive integers;
and 4.3, establishing a Markov reward function reward:
reward=(η)×base+(2(η)-1)×delay(rD,t)+access(rA)
wherein the various parameters are specified below:
(η) is a step function:
(η) = 1 if the vehicle-mounted task is unloaded successfully, and (η) = 0 if the unloading fails;
when (η) = 1 the unloading of the vehicle-mounted task succeeded, and when (η) = 0 it failed; base is a constant representing the basic reward, and (η) × base indicates that the basic reward is obtained when the unloading of the vehicle-mounted task succeeds and is not obtained when it fails;
delay(rD, t) represents the reward or penalty obtained for performing a vehicle-mounted unloading task:
delay(rD, t) = Rward × (rD - t)
where rD represents the time taken to unload the vehicle-mounted task and t represents its unloading time constraint; a reward is obtained when unloading is completed within the constraint time t, otherwise a penalty is obtained; Rward is the reward or penalty value;
access(rA) is used to judge whether the current RSU can admit more vehicle-mounted tasks:
access(rA) = 0 if rA ≤ MAX_rA; otherwise access(rA) cancels the other terms so that reward = 0;
where MAX_rA represents the maximum number of vehicle-mounted tasks the current RSU can admit; when more vehicle-mounted tasks can be admitted, i.e. rA ≤ MAX_rA, access(rA) has no influence on the reward function reward; when rA > MAX_rA, access(rA) forces reward to equal 0, i.e. no reward is obtained;
and 4.4, solving the optimal unloading time slot by using a DDPG-HER algorithm according to the Markov model in the step 4.1-4.3.
6. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning according to claim 5, wherein the specific implementation of the step 4.4 comprises the following steps:
step 4.4.1, establishing an Actor current network, an Actor target network, a Critic current network and a Critic target network, wherein the descriptions of the four networks are as follows:
① the parameter of the Actor current network is θ; θ also refers to the neural network itself, responsible for updating the parameter θ and for generating the current action A according to the current state S; the action A acts on the current state S to produce a state S' and a reward R, the reward R being obtained from the reward function reward;
② the parameter of the Actor target network is θ'; θ' also refers to a neural network, responsible for selecting an action A' from the experience replay pool and updating θ';
③ the parameter of the Critic current network is ω; ω also refers to a neural network, responsible for calculating the current Q value, which measures the quality of the selected action;
④ the parameter of the Critic target network is ω'; ω' also refers to a neural network, responsible for calculating the target Q value, i.e. Q';
step 4.4.2, training an Actor current network, an Actor target network, a Critic current network and a Critic target network, and specifically comprises the following steps:
step 4.4.2.1, first obtaining an initialization state S, and the Actor current network generating action A according to the state S;
step 4.4.2.2, calculate the reward R according to state S and action a, and get the next state S';
step 4.4.2.3, storing { S, A, S' } in an experience playback pool;
step 4.4.2.4, recording the current state as S';
step 4.4.2.5, calculating the current Q value and the target Q value;
Step 4.4.2.6, updating the Critic current network parameter omega;
step 4.4.2.7, updating the current network parameters of the Actor;
step 4.4.2.8, if the current state S' is the termination state, the iteration is finished, otherwise go to step 4.4.2.2;
and 4.4.3, calculating the optimal unloading time slot by the trained network.
7. The deep reinforcement learning-based software defined vehicle-mounted task fine-grained unloading method according to claim 1, further comprising a step 5 of deploying an algorithm to an SDN controller.
8. The software-defined vehicle-mounted task fine-grained unloading method based on deep reinforcement learning according to claim 7, wherein the specific method in the step 5 is as follows:
and after the DDPG-HER algorithm training is completed, saving the current network of the Actor and deploying the current network to the SDN controller. When unloading is required, the SDN controller determines the optimal unloading time slot for the vehicle-mounted task according to the current state information of the network and the nodes.
CN202010571179.1A 2020-06-22 2020-06-22 Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning Active CN111866807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010571179.1A CN111866807B (en) 2020-06-22 2020-06-22 Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010571179.1A CN111866807B (en) 2020-06-22 2020-06-22 Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111866807A true CN111866807A (en) 2020-10-30
CN111866807B CN111866807B (en) 2022-10-28

Family

ID=72987863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010571179.1A Active CN111866807B (en) 2020-06-22 2020-06-22 Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111866807B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714178A (en) * 2020-12-25 2021-04-27 北京信息科技大学 Task unloading method and device based on vehicle-mounted edge calculation
CN113422795A (en) * 2021-05-06 2021-09-21 江苏大学 Vehicle-mounted edge task centralized scheduling and resource allocation joint optimization method based on deep reinforcement learning
CN113645273A (en) * 2021-07-06 2021-11-12 南京邮电大学 Internet of vehicles task unloading method based on service priority
CN114116047A (en) * 2021-11-09 2022-03-01 吉林大学 V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109067842A (en) * 2018-07-06 2018-12-21 电子科技大学 Calculating task discharging method towards car networking
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109756378A (en) * 2019-01-12 2019-05-14 大连理工大学 A kind of intelligence computation discharging method under In-vehicle networking
CN110557769A (en) * 2019-09-12 2019-12-10 南京邮电大学 C-RAN calculation unloading and resource allocation method based on deep reinforcement learning
CN110798842A (en) * 2019-01-31 2020-02-14 湖北工业大学 Heterogeneous cellular network flow unloading method based on multi-user deep reinforcement learning
CN110891253A (en) * 2019-10-14 2020-03-17 江苏大学 Community popularity-based vehicle-mounted delay tolerant network routing method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109067842A (en) * 2018-07-06 2018-12-21 电子科技大学 Calculating task discharging method towards car networking
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109756378A (en) * 2019-01-12 2019-05-14 大连理工大学 A kind of intelligence computation discharging method under In-vehicle networking
CN110798842A (en) * 2019-01-31 2020-02-14 湖北工业大学 Heterogeneous cellular network flow unloading method based on multi-user deep reinforcement learning
CN110557769A (en) * 2019-09-12 2019-12-10 南京邮电大学 C-RAN calculation unloading and resource allocation method based on deep reinforcement learning
CN110891253A (en) * 2019-10-14 2020-03-17 江苏大学 Community popularity-based vehicle-mounted delay tolerant network routing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG, Ershuai: "Research and System Implementation of Optimal Control and Scheduling of Vehicle-Mounted Edge Resources Based on Load Prediction", China Master's Theses Full-text Database *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714178A (en) * 2020-12-25 2021-04-27 北京信息科技大学 Task unloading method and device based on vehicle-mounted edge calculation
CN112714178B (en) * 2020-12-25 2023-05-12 北京信息科技大学 Task unloading method and device based on vehicle-mounted edge calculation
CN113422795A (en) * 2021-05-06 2021-09-21 江苏大学 Vehicle-mounted edge task centralized scheduling and resource allocation joint optimization method based on deep reinforcement learning
CN113645273A (en) * 2021-07-06 2021-11-12 南京邮电大学 Internet of vehicles task unloading method based on service priority
CN114116047A (en) * 2021-11-09 2022-03-01 吉林大学 V2I unloading method for vehicle-mounted computation-intensive application based on reinforcement learning
CN114116047B (en) * 2021-11-09 2023-11-03 吉林大学 V2I unloading method for vehicle-mounted computation intensive application based on reinforcement learning

Also Published As

Publication number Publication date
CN111866807B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111866807B (en) Software definition vehicle-mounted task fine-grained unloading method based on deep reinforcement learning
CN110213796B (en) Intelligent resource allocation method in Internet of vehicles
CN110427690A (en) A kind of method and device generating ATO rate curve based on global particle swarm algorithm
CN114650567A (en) Unmanned aerial vehicle-assisted V2I network task unloading method
CN113904948A (en) 5G network bandwidth prediction system and method based on cross-layer multi-dimensional parameters
CN113687875A (en) Vehicle task unloading method and device in Internet of vehicles
CN114374741A (en) Dynamic grouping internet-of-vehicle caching method based on reinforcement learning under MEC environment
CN112598146A (en) Method and device for determining parking position, electronic equipment and readable storage medium
CN115376031A (en) Road unmanned aerial vehicle routing inspection data processing method based on federal adaptive learning
CN113422795B (en) Vehicle-mounted edge task centralized scheduling and resource allocation joint optimization method based on deep reinforcement learning
CN114339842A (en) Method and device for designing dynamic trajectory of unmanned aerial vehicle cluster under time-varying scene based on deep reinforcement learning
Cui et al. Model-free based automated trajectory optimization for UAVs toward data transmission
CN113709249A (en) Safe balanced unloading method and system for driving assisting service
CN103906077A (en) Road side unit placement method based on affinity propagation algorithm
CN116301038A (en) Unmanned aerial vehicle power transmission line autonomous inspection method based on track optimization planning
CN114520991B (en) Unmanned aerial vehicle cluster-based edge network self-adaptive deployment method
Yu et al. Real-time holding control for transfer synchronization via robust multiagent reinforcement learning
CN115550357A (en) Multi-agent multi-task cooperative unloading method
CN114169463A (en) Autonomous prediction lane information model training method and device
CN108985658B (en) Internet of vehicles collaborative downloading method based on fuzzy judgment and client expectation
US20230351205A1 (en) Scheduling for federated learning
CN114328547A (en) Vehicle-mounted high-definition map data source selection method and device
CN113194444B (en) Communication computing resource optimization method, device, system and storage medium
CN113815647B (en) Vehicle speed planning method, device, equipment and medium
CN115002725A (en) Unmanned aerial vehicle-assisted Internet of vehicles resource allocation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant