CN114625504A - Internet of vehicles edge computing service migration method based on deep reinforcement learning - Google Patents

Internet of vehicles edge computing service migration method based on deep reinforcement learning Download PDF

Info

Publication number
CN114625504A
Authority
CN
China
Prior art keywords
vehicle
migration
vehicles
complexity
sample
Prior art date
Legal status
Pending
Application number
CN202210232318.7A
Other languages
Chinese (zh)
Inventor
肖春来
刘迪
赵洪祥
张德干
张捷
张婷
王法玉
陈洪涛
朴铭杰
高星江
李荭娜
李思强
Current Assignee
Huadian Heavy Machinery Co ltd
Original Assignee
Tianjin University of Technology
Priority date
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202210232318.7A priority Critical patent/CN114625504A/en
Publication of CN114625504A publication Critical patent/CN114625504A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4812 Task transfer initiation or dispatching by interrupt, e.g. masked
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Abstract

An Internet of Vehicles edge computing service migration method based on deep reinforcement learning is disclosed. Mobile Edge Computing (MEC) is one of the key technologies for reducing the network delay of vehicles; because vehicles move, the services they request have to be migrated frequently between different MEC servers to guarantee their strict quality-of-service requirements. However, because vehicle movement is uncertain, frequent migration adds cost and time delay, so designing a good migration method is very challenging. The method minimizes the completion time of service migration while satisfying the migration cost. Deep reinforcement learning is used to construct an improved deep deterministic policy gradient algorithm in the Internet of Vehicles to optimize the cost and time delay of vehicle task migration. Meanwhile, a centralized-training, distributed-execution method is used to solve the high-dimensionality problem that arises during vehicle task migration in the Internet of Vehicles.

Description

Internet of vehicles edge computing service migration method based on deep reinforcement learning
Technical Field
The invention belongs to the field of the Internet of Things, and particularly relates to an Internet of Vehicles edge computing service migration method based on deep reinforcement learning.
Background
Mobile Edge Computing (MEC) is a promising technology for accommodating the explosive growth of delay-sensitive and computation-intensive mobile applications such as augmented reality (AR), real-time video processing, and the Internet of Vehicles, whose requirements the limited resources of mobile devices can hardly meet; it has therefore attracted a great deal of research interest in academia and industry in recent years. Unlike traditional cloud computing, MEC deploys computing and storage resources at the network edge, close to mobile users. Because services are provided by a nearby edge server rather than a remote cloud, service response latency can be significantly reduced.
However, because user mobility is unpredictable, reducing transmission delay and maintaining a good user quality of experience (QoE) requires far more than deploying mobile applications and services at the network edge. When a user is far from the edge server on which its service is deployed, a large service response delay or even a service interruption may occur. To ensure the continuity of the service, an effective service migration policy is needed to decide when and where to migrate the service.
There is currently little work on distributed task migration in MEC. The traditional approach migrates tasks by predicting the user's location, but vehicle mobility in the Internet of Vehicles is hard to predict. Other methods apply the deep Q-network (DQN) to task migration; DQN handles complex state spaces well, but it cannot meet the task-migration requirements of multi-user edge computing, because as the number of users grows the dimensions of the system state space and behavior space grow exponentially. In the multi-user Internet of Vehicles scenario, combining the states of all vehicles into one global state makes the multi-user environment unstable and ignores the influence the vehicles have on one another. Designing an efficient migration policy that minimizes migration cost and time delay in such a multi-user distributed environment is therefore challenging. The invention accordingly proposes an adaptive-weight deep deterministic policy gradient algorithm for the multi-user Internet of Vehicles scenario and, on the basis of this algorithm, adopts a centralized offline-training, distributed-execution method to solve the task migration problem in the Internet of Vehicles.
Disclosure of Invention
The invention aims to solve the high-dimensionality problem that arises during vehicle task migration in the Internet of Vehicles, and provides an Internet of Vehicles edge computing service migration method based on deep reinforcement learning. The method studies the service migration problem of vehicles in the Internet of Vehicles in a dynamic environment and minimizes the completion time of service migration while satisfying the migration cost. Deep reinforcement learning is used to construct an improved deep deterministic policy gradient algorithm in the Internet of Vehicles to optimize the cost and time delay of vehicle task migration, and a centralized-training, distributed-execution method is used to solve the high-dimensionality problem during vehicle task migration in the Internet of Vehicles.
Technical scheme of the invention
An Internet of Vehicles edge computing service migration method based on deep reinforcement learning mainly comprises the following steps:
1, establishing a system model:
1.1, establishing a return delay model;
the system comprises M {1,2, 3., M } mobile edge servers, N {1,2, 3., N } mobile vehicles, wherein the mobile vehicles change from one time slot to the next time slot according to a Markov model, the invention considers that the length of each time slot of a time slot model T {1,2, 3., T } is epsilon, the time slot model is regarded as a continuous time sampling, the time intervals between the sampling are equal, and due to the mobility of the vehicles, the service must be migrated across the edge servers in order to ensure the continuity of the service. The virtualization of the container is utilized to manage the computing tasks in the edge server, so that the flexible scheduling of the vehicle computing tasks is realized,
Figure BDA0003538927640000031
indicating whether vehicle m is connected to mobile edge server n at time t,
Figure BDA0003538927640000032
service Task representing vehicle n at time tnWhether it is executed on top of the mobile edge server m;
due to the limited computing resources of the MEC servers, when the computing load of the local MEC server of the mobile vehicle is high, the computing tasks of the vehicle are transmitted to the MEC servers with less computing tasks nearby through the backhaul link, and the transmission delay between the MEC servers is calculated by using cn/CmWherein c isnRepresents the size of n input data of the vehicle, and CmThen the output link bandwidth of MEC edge server m is represented so the backhaul delay of the vehicle can be represented as
Figure BDA0003538927640000033
Figure BDA0003538927640000034
In the above formula, λ represents a positive coefficient, d (m)1,m2) Representing edge servers m1And edge server m2The number of hops in between.
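For readability, a plausible form of the backhaul delay that is consistent with the quantities defined above is given below; the exact expression appears only as an image in the original, so this reconstruction is an assumption:

\[ T_n^{\mathrm{back}}(t) = \lambda \, d(m_1, m_2) \, \frac{c_n}{C_m} \]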
1.2, establishing a communication delay model;
Good wireless-communication quality improves the efficiency of service migration, and that quality can be improved through spectrum resource management, so allocating an appropriate amount of spectrum resources to each vehicle is very important. S_m denotes the spectrum resources available to mobile edge server m, and all vehicles connected to m share these resources; the invention uses spe_{n,m}(t) to denote the spectrum proportion allocated by MEC server m to vehicle n at time t. Since the returned data is relatively small and negligible, the transmission delay of the returned result is not considered. According to Shannon's theorem, the data transmission rate between vehicle n and edge server m is expressed by a formula (image in the original) in which P_n is the transmission power of vehicle n, G_{n,m}(t) is the channel gain between vehicle n and edge server m at time t, and the remaining term is the white-noise power; the transmission delay of the input data is then expressed by a second formula (image in the original).
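A hedged reconstruction of the two formulas from the quantities named above (σ² is an assumed symbol for the white-noise power, and R_{n,m} and T^tran are assumed names for the rate and the input-transmission delay):

\[ R_{n,m}(t) = spe_{n,m}(t)\, S_m \log_2\!\left(1 + \frac{P_n\, G_{n,m}(t)}{\sigma^2}\right), \qquad T_{n,m}^{\mathrm{tran}}(t) = \frac{c_n}{R_{n,m}(t)} \]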
1.3, establishing a calculation delay model;
All vehicles within the coverage of an MEC server share its computing resources, which helps the vehicles handle their offloaded tasks. F_m denotes the computing capacity of MEC server m, and φ_n(t) denotes the CPU cycles required by Task_n at time t. The time Task_n requires to complete on MEC server m is therefore expressed by a formula (image in the original) whose remaining term indicates how many tasks are being executed on MEC server m. The formula shows that the execution delay of the MEC server increases in proportion to the number of executing tasks, so the computing resources of the target MEC server also need to be considered when performing service migration of the vehicle.
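A plausible form of the computation delay that matches the statement that execution delay grows in proportion to the number of executing tasks (K_m(t) is an assumed symbol for that number; the original gives the formula only as an image):

\[ T_{n,m}^{\mathrm{comp}}(t) = \frac{\phi_n(t)\, K_m(t)}{F_m} \]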
1.4, establishing a migration cost model;
To satisfy vehicle service continuity, service migration between multiple MEC servers is required, and migrating across servers incurs an additional migration cost. Assume vehicle n migrates all of its offloaded tasks from MEC server m1 to m2; the cost of vehicle n migrating Task_n from m1 to m2 at time t is expressed by a formula (image in the original) in which χ is a positive coefficient and |o_n| is the image size of vehicle n's offloaded task.
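A hedged reconstruction of the per-slot migration cost, consistent with the later statement that the cost is zero when the serving MEC server does not change and proportional to the task image size otherwise (the coefficient is written χ here and α in the problem description; whether the hop count also enters is unknown from the text):

\[ \mathrm{Cost}_n(t) = \begin{cases} \chi\, |o_n|, & \text{if the serving MEC server changes at } t \\ 0, & \text{otherwise} \end{cases} \]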
1.5, description of the problem;
For a moving vehicle n, the task completion time T_n includes the computation delay, the backhaul delay, and the communication delay (the sum is expressed by a formula given as an image). The total migration cost of vehicle n follows from the cost model: according to formula (5), the migration cost is 0 when the MEC server is not changed and α|o_n| otherwise, so summing over time gives the total migration cost of vehicle n (formula image). Migration decisions are made at each time period, and each period has a migration Cost budget Cost_b, so the migration cost budget of the entire system is expressed by summing the per-period budgets (formula image). On the premise of meeting the migration cost budget, the average delay of the system is minimized through learning, and the optimization formula is expressed by an equation given as an image in the original.
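Putting the pieces together, a hedged reconstruction of the optimization problem using only the quantities defined above (T_n(t) per-slot completion time, Cost_n(t) per-slot migration cost, Cost_b per-period budget) is:

\[ \min \; \frac{1}{N T} \sum_{t=1}^{T} \sum_{n=1}^{N} T_n(t) \quad \text{s.t.} \quad \sum_{n=1}^{N} \mathrm{Cost}_n(t) \le \mathrm{Cost}_b, \;\; \forall t \]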
2, adaptive-weight deep deterministic policy gradient algorithm:
2.1, improving the deep deterministic policy gradient algorithm;
The deep deterministic policy gradient algorithm adopts an experience replay mechanism, which satisfies the assumption that samples are independently distributed and allows fast convergence; however, it draws samples from the replay storage at random, ignoring the different importance of each sample, so the sampling efficiency is not high. A prioritized experience replay mechanism was proposed later, which evaluates the importance of a sample by the absolute value of its TD error, but a large TD error disturbs the training result. Samples of low complexity contribute little to the learning of the neural network, while training samples of high complexity are difficult for the neural network to understand in the early learning stage. Therefore, each state sample in the replay storage is assigned a priority weight, the sampling probabilities are set according to the assigned priority weights, and an adaptive-weight experience replay mechanism is proposed.
The complexity CF(s_i) of sample i mainly comprises an importance function RF(r_i, DE_i) of the sample's return value and a use-frequency function SUF(num_i) of the sample.
The importance of the sample return value is expressed as:
RF(r_i, DE_i) = |DE_i| * RW(r_i) + α (10)
In the above formula, DE_i = Q(s_i, a_i; θ_c) − (r_i + μQ′(s′_i, a′_i; θ_c′)) denotes the TD error, and Q(s_i, a_i; θ_c) is the value of the Critic component's evaluate-network. α denotes a small positive number, which prevents a sample from never being selected when the time difference is 0. RW(r_i) denotes the weight of the corresponding reward (formula given as an image in the original); for stability, r_i ∈ [−1, 1] and RW(r_i) > 0 are set.
To prevent overfitting, a function of the number of times a sample has been used is added: as the number of times a sample is used increases, the probability that it is selected next becomes lower. SUF(num_i) is expressed by a formula (image in the original) in which num_i denotes the number of times the replay-storage sample s_i has been used and p, q are constants greater than 0. The complexity function is then expressed by a formula (image in the original) containing a hyper-parameter that weights the two terms, and the sampling probability of a sample is computed from the sample complexity defined by the invention (formula image).
In that formula, Ψ ∈ [0, 1] denotes an exponential random factor: Ψ = 1 corresponds to priority sampling and Ψ = 0 to uniform sampling, so the factor balances priority sampling against uniform sampling and thereby prevents overfitting. Because directly sampling the replay storage would introduce a distribution error, the invention uses the importance sampling weight w_i (formula image) to correct this deviation and applies a normalization operation to reduce the TD error; in that formula, D denotes the replay storage capacity and β denotes the compensation coefficient.
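The following is a minimal Python sketch of the adaptive-weight experience replay described above. Since the patent gives the corresponding formulas only as images, the concrete forms of RW(r), SUF(num) and of their combination into CF(s_i), as well as all parameter values, are assumptions made for illustration; the sampling probability and importance-sampling weight follow the standard prioritized-replay pattern that the text describes.

```python
import numpy as np

class AdaptiveWeightReplay:
    def __init__(self, capacity, alpha_eps=0.01, p=1.0, q=1.0, psi=0.6, beta=0.4):
        self.capacity = capacity
        self.storage = []            # transitions, e.g. (s, a, r, s')
        self.complexity = []         # CF(s_i) for each stored sample
        self.num_used = []           # how many times each sample has been drawn
        self.alpha_eps = alpha_eps   # the small positive alpha of eq. (10)
        self.p, self.q = p, q        # constants of the use-frequency function
        self.psi = psi               # exponential random factor (1: priority, 0: uniform)
        self.beta = beta             # compensation coefficient of the IS weight

    def add(self, transition):
        if len(self.storage) >= self.capacity:           # drop the oldest sample when full
            self.storage.pop(0); self.complexity.pop(0); self.num_used.pop(0)
        self.complexity.append(max(self.complexity, default=1.0))  # new samples get max priority
        self.storage.append(transition)
        self.num_used.append(0)

    def sample(self, batch_size):
        cf = np.asarray(self.complexity) ** self.psi
        probs = cf / cf.sum()                             # adaptive sampling probability
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        w = (len(self.storage) * probs[idx]) ** (-self.beta)   # importance-sampling weights
        w = w / w.max()                                   # normalisation against the largest weight
        for i in idx:
            self.num_used[i] += 1
        return [self.storage[i] for i in idx], idx, w

    def update_complexity(self, idx, td_errors, rewards):
        for i, de, r in zip(idx, td_errors, rewards):
            rw = 1.0 + abs(r)                             # assumed reward weight RW(r) > 0
            rf = abs(de) * rw + self.alpha_eps            # importance of the return value, eq. (10)
            suf = 1.0 / (self.p + self.q * self.num_used[i])  # assumed use-frequency function
            self.complexity[i] = rf * suf                 # assumed combination into CF(s_i)
```

A buffer of this kind drops in wherever DDPG's uniform replay buffer would sit; only sample() and update_complexity() differ from the standard mechanism.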
2.2, establishing a self-adaptive weight deep learning framework;
The framework of centralized training and distributed execution is applied to the proposed AWDDPG (Adaptive-Weight Deep Deterministic Policy Gradient) algorithm. During the offline centralized training phase, the observation states and behaviors of the other vehicles are saved in the experience replay buffer in addition to the local observation state; combining the behaviors and observation states increases the number of training samples generated in each phase and also increases the cooperative communication between agents. When the network parameters (given as formula images in the original) are updated, the Actor component evaluates them against the samples acquired with the adaptive weights. After obtaining the global information, each moving vehicle learns its own state-behavior value function; meanwhile, after the behaviors of the other vehicles have been learned, they are held fixed during the offline training stage, which effectively handles the influence of the other vehicles' behaviors on the environment. In the decision phase, because the Actor only needs the local observation state (formula image), the vehicle can select an action without knowing the information of other vehicles.
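As a minimal sketch of the information flow implied by this framework (module names and tensor layouts are assumptions, not the patent's notation): the centralized critic consumes every vehicle's observation and behavior during offline training, while each vehicle's Actor acts on its local observation alone at execution time.

```python
# Hedged sketch of centralized training / distributed execution; the modules and
# tensor shapes are illustrative assumptions.
import torch

def centralized_critic_input(all_obs, all_acts):
    # During offline training the critic sees every vehicle's observation and behavior.
    return torch.cat(list(all_obs) + list(all_acts), dim=-1)

def distributed_action(actor_n, local_obs_n):
    # During execution, vehicle n selects its action from its local observation only.
    with torch.no_grad():
        return actor_n(local_obs_n)
```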
3, a computing service migration method based on deep reinforcement learning:
3.1, description of migration method steps;
First, the relevant parameters of the algorithm are input, including the batch size, replay storage size, discount factor, soft-update coefficient, exponent, hyper-parameters, and the use count and complexity of the samples; the parameters are initialized and data warm-up is performed. The following steps are then executed in a loop. The state is initialized; each vehicle selects and executes its action, receives the reward obtained after the action is executed, and obtains the next state. Then, for each vehicle in turn, the sample is stored in the replay storage and the related parameters are set; samples are adaptively selected according to the formulas above, and the time-difference error and importance sampling weight are computed; the weights are updated and the complexity is recalculated; the evaluate-network parameters of the Critic are updated by minimizing the loss function; the evaluate-network parameters of the Actor are updated by minimizing the policy objective equation; finally, the target-network parameters of the Critic and the Actor are updated. After the loop ends, the method ends.
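Below is a hedged Python sketch of this training loop, reusing the AdaptiveWeightReplay sketch given earlier; the environment interface (reset/step), the actor and critic modules, their constructor arguments, and all hyper-parameter values are assumptions made for illustration, and only the ordering of the steps follows the description above.

```python
import torch

def soft_update(target, source, tau):
    # Soft (Polyak) update of target-network parameters.
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)

def train_awddpg(env, actors, actor_targets, critic, critic_target, buffer,
                 episodes=500, steps=200, batch_size=64, gamma=0.95, tau=0.01):
    n = len(actors)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    actor_opts = [torch.optim.Adam(a.parameters(), lr=1e-4) for a in actors]
    for _ in range(episodes):
        obs = env.reset()                                      # list of per-vehicle observations
        for _ in range(steps):
            with torch.no_grad():
                acts = [a(o) for a, o in zip(actors, obs)]     # each vehicle acts on its local state
            next_obs, reward, done = env.step(acts)            # execute the migration actions
            buffer.add((torch.cat(obs), torch.cat(acts), float(reward), torch.cat(next_obs)))
            obs = next_obs
            if len(buffer.storage) >= batch_size:
                batch, idx, w = buffer.sample(batch_size)      # adaptive-weight sample selection
                s = torch.stack([b[0] for b in batch]);  a = torch.stack([b[1] for b in batch])
                r = torch.tensor([b[2] for b in batch]); s2 = torch.stack([b[3] for b in batch])
                with torch.no_grad():                          # target value from the target networks
                    a2 = torch.cat([at(o) for at, o in
                                    zip(actor_targets, torch.chunk(s2, n, dim=1))], dim=1)
                    y = r.unsqueeze(1) + gamma * critic_target(torch.cat([s2, a2], dim=1))
                td = y - critic(torch.cat([s, a], dim=1))      # time-difference error
                w = torch.as_tensor(w, dtype=torch.float32).unsqueeze(1)
                critic_loss = (w * td.pow(2)).mean()           # importance-weighted Critic loss
                critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
                buffer.update_complexity(idx, td.detach().squeeze(1).tolist(), r.tolist())
                for i, (actor, opt) in enumerate(zip(actors, actor_opts)):
                    a_i = list(torch.chunk(a, n, dim=1))       # substitute vehicle i's fresh action
                    a_i[i] = actor(torch.chunk(s, n, dim=1)[i])
                    actor_loss = -critic(torch.cat([s, torch.cat(a_i, dim=1)], dim=1)).mean()
                    opt.zero_grad(); actor_loss.backward(); opt.step()
                soft_update(critic_target, critic, tau)        # target-network updates
                for at, ac in zip(actor_targets, actors):
                    soft_update(at, ac, tau)
            if done:
                break
```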
3.2, complexity analysis;
The number of moving vehicles N and the batch size K are the main factors determining the time complexity of adaptive-weight sampling, so the time complexity of adaptive-weight sampling is denoted O(NK). Since the time complexity of offline training is proportional to the size of the training data and the training time, only the time complexity of execution needs to be considered. The complexity of execution is mainly determined by the structure of the neural network, the size of the state space and the size of the action space; without considering the neural-network structure, the computational complexity is O(|A| × |S|), where |A| denotes the number of behaviors and |S| the number of states. After a DNN is added, the environment and parameter settings of the system have a great influence on the computational complexity and make it difficult to estimate, so the complexity of the AWDDPG algorithm can be expressed as O(NK + |A| × |S|).
The invention has the advantages and positive effects that:
Mobile edge computing is one of the key technologies for reducing the delay of vehicle networks, and because vehicles move, the services they request should be migrated frequently between different servers to guarantee their strict quality-of-service requirements. However, because vehicle movement is uncertain, frequent migration adds cost and time delay, so designing a good migration method is very challenging. The invention minimizes the completion time of service migration while satisfying the migration cost. Deep reinforcement learning is used to construct an improved deep deterministic policy gradient algorithm in the Internet of Vehicles to optimize the cost and time delay of vehicle task migration. Meanwhile, a centralized-training, distributed-execution method is used to solve the high-dimensionality problem during vehicle task migration in the Internet of Vehicles.
Drawings
FIG. 1 is a reward diagram for the AWDDPG and DDPG algorithms;
FIG. 2 is a graph of the loss function of the AWDDPG algorithm;
FIG. 3 is a graph of average completion times for different input data sizes;
FIG. 4 is a graph of average completion times for different numbers of vehicles;
FIG. 5 is a graph of average completion times for different numbers of mobile edge servers;
FIG. 6 is a graph of average completion times for different migration cost budgets;
FIG. 7 is a graph of average migration costs for different input data sizes;
FIG. 8 is a graph of the migration resource ratios for different numbers of vehicles;
FIG. 9 is a flowchart of a method for migrating the edge computing service of the Internet of vehicles based on deep reinforcement learning according to the present invention.
Detailed Description
Example 1:
In the experiments, a large number of simulations were carried out with Matlab 2018a to verify the performance of the AWDDPG distributed task migration algorithm in the Internet of Vehicles. The robustness of the algorithm under different parameters was tested through experiments, and the proposed algorithm was compared with other algorithms to prove its effectiveness.
Referring to fig. 9, the method for migrating the edge computing service in the internet of vehicles based on deep reinforcement learning mainly includes the following key steps:
1, establishing a system model:
1.1, establishing a return delay model;
The system comprises M = {1, 2, 3, ..., M} mobile edge servers and N = {1, 2, 3, ..., N} mobile vehicles, and each mobile vehicle changes from one time slot to the next according to a Markov model. The invention considers that each slot of the time-slot model T = {1, 2, 3, ..., T} has length ε, so the model can be regarded as sampling continuous time at equal intervals. Because of vehicle mobility, services have to be migrated across edge servers to ensure service continuity. The computing tasks in the edge server are managed through container virtualization, so flexible scheduling of the vehicle computing tasks is achieved. A binary indicator (formula shown as an image in the original) records whether vehicle n is connected to mobile edge server m at time t, and a second indicator (formula image) records whether the service Task_n of vehicle n is executed on mobile edge server m at time t.
Because the computing resources of the MEC server are limited, when the computing load of the mobile vehicle's local MEC server is high, the computing tasks of the vehicle can be transmitted over the backhaul link to a nearby MEC server with fewer computing tasks. The transmission delay between MEC servers can be taken as c_n/C_m, where c_n is the size of vehicle n's input data and C_m is the output link bandwidth of MEC edge server m. The backhaul delay of the vehicle can then be expressed by a formula (image in the original) in which λ is a positive coefficient and d(m1, m2) is the number of hops between edge server m1 and edge server m2.
1.2, establishing a communication delay model;
Good wireless-communication quality improves the efficiency of service migration, and that quality can be improved through spectrum resource management, so allocating an appropriate amount of spectrum resources to each vehicle is very important. S_m denotes the spectrum resources available to mobile edge server m, and all vehicles connected to m share these resources; the invention uses spe_{n,m}(t) to denote the spectrum proportion allocated by MEC server m to vehicle n at time t. Since the returned data is relatively small and negligible, the transmission delay of the returned result is not considered. According to Shannon's theorem, the data transmission rate between vehicle n and edge server m can be expressed by a formula (image in the original) in which P_n is the transmission power of vehicle n, G_{n,m}(t) is the channel gain between vehicle n and edge server m at time t, and the remaining term is the white-noise power; the transmission delay of the input data can then be expressed by a second formula (image in the original).
1.3, establishing a calculation delay model;
All vehicles within the coverage of the MEC server share its computing resources, which helps the vehicles handle their offloaded tasks. F_m denotes the computing capacity of MEC server m, and φ_n(t) denotes the CPU cycles required by Task_n at time t. Therefore, the time Task_n requires to complete on MEC server m can be expressed by a formula (image in the original) whose remaining term indicates how many tasks are executing on MEC server m. As the formula shows, the execution delay of the MEC server increases in proportion to the number of executing tasks, so the computing resources of the target MEC server also need to be considered when performing service migration of the vehicle.
1.4, establishing a migration cost model;
In order to satisfy the continuity of vehicle service, service migration between multiple MEC servers is required, and migrating across servers incurs an additional migration cost. Assume vehicle n migrates all of its offloaded tasks from MEC server m1 to m2; the cost of vehicle n migrating Task_n from m1 to m2 at time t is expressed by a formula (image in the original) in which χ is a positive coefficient and |o_n| is the image size of vehicle n's offloaded task.
1.5, description of the problem;
For a moving vehicle n, the task completion time T_n includes the computation, backhaul and communication delays, and can be expressed by a formula given as an image. The total migration cost of vehicle n follows from the cost model: according to formula (5), the migration cost is 0 when the MEC server is not changed and α|o_n| otherwise, so the total migration cost of vehicle n can be derived by summation (formula image). Migration decisions are made at each time period, and each period has a migration Cost budget Cost_b, so the migration cost budget of the entire system can be expressed by summing the per-period budgets (formula image). On the premise of meeting the migration cost budget, the average delay of the system can be minimized through learning, and the optimization formula can be expressed by an equation given as an image in the original.
2, adaptive-weight deep deterministic policy gradient algorithm:
2.1, improving the deep deterministic policy gradient algorithm;
The deep deterministic policy gradient algorithm adopts an experience replay mechanism, which satisfies the assumption that samples are independently distributed and allows fast convergence; however, it draws samples from the replay storage at random, ignoring the different importance of each sample, so the sampling efficiency is not high. A prioritized experience replay mechanism was proposed later, which evaluates the importance of a sample by the absolute value of its TD error, but a large TD error disturbs the training result. Samples of low complexity contribute little to the learning of the neural network, while training samples of high complexity are difficult for the neural network to understand in the early stage of learning. Therefore, each state sample in the replay storage is assigned a priority weight, the sampling probabilities are set according to the assigned priority weights, and an adaptive-weight experience replay mechanism is provided.
The complexity CF(s_i) of sample i mainly comprises an importance function RF(r_i, DE_i) of the sample's return value and a use-frequency function SUF(num_i) of the sample.
The importance of the sample return value is expressed as:
RF(r_i, DE_i) = |DE_i| * RW(r_i) + α (10)
In the above formula, DE_i = Q(s_i, a_i; θ_c) − (r_i + μQ′(s′_i, a′_i; θ_c′)) denotes the TD error, and Q(s_i, a_i; θ_c) is the value of the Critic component's evaluate-network. α denotes a small positive number, which can prevent a sample from never being selected when the time difference is 0. RW(r_i) denotes the weight of the corresponding reward (formula given as an image in the original); for stability, r_i ∈ [−1, 1] and RW(r_i) > 0 are set.
To prevent overfitting, a function of the number of times a sample has been used is added: as the number of times a sample is used increases, the probability that it is selected next becomes lower. SUF(num_i) can be expressed by a formula (image in the original) in which num_i denotes the number of times the replay-storage sample s_i has been used and p and q are constants greater than 0. The complexity function can then be expressed by a formula (image in the original) containing a hyper-parameter that weights the two terms, and the sampling probability of a sample can be determined by the sample complexity defined by the invention (formula image).
In that formula, Ψ ∈ [0, 1] denotes an exponential random factor: Ψ = 1 corresponds to priority sampling and Ψ = 0 to uniform sampling, so the exponential random factor ensures a balance between priority sampling and uniform sampling and thereby prevents overfitting. Because directly sampling the replay storage would introduce a distribution error, the invention uses the importance sampling weight w_i (formula image) to correct this deviation and applies a normalization operation to reduce the TD error; in that formula, D denotes the replay storage capacity and β denotes the compensation coefficient.
2.2, establishing a self-adaptive weight deep learning framework;
The framework of centralized training and distributed execution is applied to the proposed AWDDPG (Adaptive-Weight Deep Deterministic Policy Gradient) algorithm. During the offline centralized training phase, the observation states and behaviors of the other vehicles are saved in the experience replay cache in addition to the local observation state. Combining the behaviors and observation states increases the number of training samples generated in each phase and also increases the cooperative communication between agents. When the network parameters (given as formula images in the original) are updated, the Actor component evaluates them against the samples collected with the adaptive weights. Having obtained the global information, each moving vehicle can learn its own state-behavior value function. Meanwhile, once the behaviors of the other vehicles have been learned, they are held stationary during the offline training phase, so the influence of other vehicles' behaviors on the environment can be effectively handled. In the decision phase, because the Actor only needs the local observation state (formula image), the vehicle can select an action without knowing the information of other vehicles.
3, a computing service migration method based on deep reinforcement learning:
3.1, description of migration method steps;
First, the relevant parameters of the algorithm are input, such as the batch size, replay storage size, discount factor, soft-update coefficient, exponent, hyper-parameters, and the use count and complexity of the samples. The parameters are then initialized and data warm-up is performed. The following steps are then executed in a loop. The state is first initialized; each vehicle then selects its action, executes the corresponding action, receives the reward obtained after the action is executed, and obtains the next state. Next, the following steps are executed in turn for each vehicle: the sample is stored in the replay storage and the related parameters are set; samples are adaptively selected according to the formulas, and the time-difference error and importance sampling weight are calculated; the weights are then updated and the complexity is calculated; the evaluate-network parameters of the Critic are updated by minimizing the loss function; the evaluate-network parameters of the Actor are updated by minimizing the policy objective equation; finally, the target-network parameters of the Critic and the Actor are updated. After the loop is over, the method ends.
3.2, complexity analysis;
The number N of moving vehicles and the batch size K are the main factors determining the time complexity of adaptive-weight sampling, so the time complexity of adaptive-weight sampling can be expressed as O(NK). Since the time complexity of offline training is proportional to the size of the training data and the training time, only the time complexity of execution needs to be considered. The complexity of execution is mainly determined by the structure of the neural network, the size of the state space and the size of the action space. Without considering the neural-network structure, the computational complexity is O(|A| × |S|), where |A| denotes the number of behaviors and |S| the number of states. After a DNN is added, the environment and parameter settings of the system have a great influence on the computational complexity and make it difficult to estimate. Therefore, the complexity of the AWDDPG algorithm can be expressed as O(NK + |A| × |S|).
Simulation experiment:
Consider vehicles moving randomly within the coverage of multiple MEC servers, where each vehicle's trajectory follows a random walk model. Each vehicle has its own computation-intensive and delay-sensitive task, and this task is offloaded to an MEC server for execution. The invention uses the hold-out method to separate training data from validation data in a 4:1 ratio, and the two sets are completely independent. For each vehicle, its Critic component is given 4 fully connected hidden layers with [2048, 1024, 512, 256] neurons. For the Actor component, the invention deploys 2 fully connected hidden layers with neuron counts [1024, 512] and [512, 256], the output layer of the Actor component is activated by a tanh function, and the neurons of the other layers are activated with the ReLU function (a hedged code sketch of this architecture is given after the parameter tables). The specific experimental parameter settings are shown in Tables 1 and 2.
TABLE 1 Experimental parameters
(Table 1 is reproduced as an image in the original publication.)
TABLE 2 AWDDPG parameter settings
(Table 2 is reproduced as an image in the original publication.)
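A hedged PyTorch sketch of the per-vehicle networks described above follows. The observation and action dimensions are assumptions, and reading the Actor's "[1024, 512] and [512, 256]" as the (in, out) shapes of its two fully connected hidden layers is an interpretation of the text rather than a confirmed design.

```python
import torch.nn as nn

class CriticNet(nn.Module):
    # Four fully connected hidden layers with [2048, 1024, 512, 256] neurons,
    # ReLU activations, and a scalar Q-value output.
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1))
    def forward(self, joint_obs_and_actions):
        return self.net(joint_obs_and_actions)

class ActorNet(nn.Module):
    # Tanh-activated output layer as stated in the text; the other layers use ReLU.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.ReLU(),   # assumed input projection
            nn.Linear(1024, 512), nn.ReLU(),       # hidden layer 1: [1024, 512]
            nn.Linear(512, 256), nn.ReLU(),        # hidden layer 2: [512, 256]
            nn.Linear(256, act_dim), nn.Tanh())
    def forward(self, local_obs):
        return self.net(local_obs)
```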
Fig. 1 shows the convergence of the proposed AWDDPG algorithm, and Fig. 2 shows the rewards harvested by the AWDDPG and DDPG algorithms during the centralized training phase. Because the DDPG algorithm uses the standard, unimproved experience replay mechanism and ignores useful training samples, AWDDPG can choose better training samples in different training phases, which makes the algorithm converge faster. It can also be seen from Fig. 1 that, after a period of training, once Critic and Actor adjust the evaluate-network and target-network parameters to gradually approach the optimal policy, the AWDDPG algorithm converges in a short time and reaches a higher and more stable level. Fig. 2 shows the difference between the value estimated by the Critic and the actual reward value; it can be seen that the Critic's estimate gets closer to the true value as the number of iterations increases.
The performance of the proposed AWDDPG distributed task migration algorithm is verified by comparison with other algorithms. Other algorithms include DDPG, Extensive Service Migration (ESM), Always Migration (AM), counterfactual multi-agent (COMA), and Never Migration (NM).
First, the average completion time of the six algorithms is compared with the other variables fixed, for different input data sizes, different numbers of vehicles, and different numbers of MEC servers. As shown in Fig. 3, the average completion time increases as the size of the input data increases, mainly because a larger input task adds to the computation delay of the vehicle's offloaded task. The average completion time of the AWDDPG algorithm is the lowest of all the algorithms. The average completion times of the AM and NM algorithms are high: the AM algorithm performs service migration whenever the vehicle leaves the coverage of the edge server, which causes frequent service migration, and as the input data becomes larger the migration frequency also rises, so the average completion time gradually increases. The delay of AWDDPG is lower than that of the DDPG algorithm because the invention improves on DDPG. For the NM algorithm, once an edge server is initially selected, several vehicles may select the same MEC server for service and never migrate, so the server's resource utilization is low, which increases the average completion time of the system. The ESM algorithm targets a single-agent scenario and does not perform well in a multi-user scenario; its average completion time clearly increases as the input data size increases. COMA uses the Actor-Critic algorithm and adopts centralized training with distributed execution, but it ignores the experience replay mechanism.
The AWDDPG proposed by the invention adds an experience replay mechanism on the basis of COMA, which reduces the correlation between samples, and designs an adaptive-weight sampling method to increase sampling efficiency, which greatly improves the convergence speed and stability of the algorithm, so the average completion time of the AWDDPG algorithm is the lowest. The analysis of Fig. 4 is similar to that of Fig. 3. Fig. 5 shows that the average completion time of all algorithms decreases as the number of MEC servers increases, because the resources available to the vehicles also grow with the number of MEC servers; the average completion time of the proposed AWDDPG algorithm is again the lowest. As can be seen in Fig. 6, when the migration cost budget of each phase is shifted from low to high, five of the algorithms benefit: as the migration cost budget increases, the average completion time of the computing task decreases. Since the NM algorithm performs no service migration, its average completion time does not change. These experiments show that the AWDDPG algorithm performs better on the average completion time metric.
Next, the performance of the AWDDPG algorithm is verified through the average migration cost. Fig. 7 shows that the average migration cost of five of the algorithms increases as the size of the input data increases, so the migration cost is considered to be related mainly to the image size of the migrated data. Because the AM algorithm performs service migration every time, its migration cost grows in proportion to the size of the input data. The NM algorithm never migrates and therefore incurs no migration cost. Compared with the ESM, COMA, and DDPG algorithms, the AWDDPG algorithm proposed by the invention finds a better migration strategy, so its migration cost is the lowest.
Finally, the performance of the AWDDPG algorithm is verified through the migration resource occupancy ratio. As can be seen from Fig. 8, for different numbers of vehicles the migration resource occupancy ratios of the five algorithms each stabilize at a certain value. Because the AM algorithm performs service migration every time, its migration resource usage is the largest. The NM algorithm never migrates, so it has no migration resource occupancy. Compared with the ESM, COMA, and DDPG algorithms, the AWDDPG algorithm proposed by the invention finds a better migration strategy, so its migration resource occupancy ratio is the lowest.

Claims (10)

1. A method for migrating Internet of vehicles edge computing services based on deep reinforcement learning is characterized by comprising the following steps:
1, establishing a system model:
1.1, establishing a return delay model;
1.2, establishing a communication delay model;
1.3, establishing a calculation delay model;
1.4, establishing a migration cost model;
1.5, description of the problem;
2, adaptive-weight deep deterministic policy gradient algorithm:
2.1, improving the deep deterministic policy gradient algorithm;
2.2, establishing a self-adaptive weight deep learning framework;
and 3, a computing service migration method based on deep reinforcement learning:
3.1, description of migration method steps;
and 3.2, complexity analysis.
2. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the establishing of the backhaul delay model in step 1.1 is as follows: the system includes M = {1, 2, 3, ..., M} mobile edge servers and N = {1, 2, 3, ..., N} mobile vehicles, the mobile vehicles change from one time slot to the next according to a Markov model, the length of each time slot of the time-slot model T = {1, 2, 3, ..., T} is considered to be ε, the time-slot model is regarded as sampling continuous time at equal intervals, and container virtualization is used to manage the computing tasks in the edge servers so as to realize flexible scheduling of the vehicle computing tasks; a binary indicator (formula image) indicates whether vehicle n is connected to mobile edge server m at time t, and a second indicator (formula image) indicates whether the service Task_n of vehicle n is executed on mobile edge server m at time t; when the computing load of the local MEC server of the mobile vehicle is high, the computing tasks of the vehicle are transmitted through the backhaul link to a nearby MEC server with fewer computing tasks, the transmission delay between MEC servers is taken as c_n/C_m, where c_n denotes the size of vehicle n's input data and C_m denotes the output link bandwidth of MEC edge server m, and the backhaul delay of the vehicle is expressed by a formula (image in the original) in which λ denotes a positive coefficient and d(m1, m2) denotes the number of hops between edge server m1 and edge server m2.
3. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the method for establishing the communication delay model in step 1.2 is as follows: S_m denotes the spectrum resources available to mobile edge server m, all vehicles connected to m share the spectrum resources, and spe_{n,m}(t) denotes the spectrum proportion allocated by MEC server m to vehicle n at time t; since the returned data is small, the transmission delay of the returned result is not considered; according to Shannon's theorem, the data transmission rate between vehicle n and edge server m is expressed by a formula (image in the original) in which P_n denotes the transmission power of vehicle n, G_{n,m}(t) denotes the channel gain between vehicle n and edge server m at time t, and the remaining term denotes the white-noise power, and the transmission delay of the input data is expressed by a second formula (image in the original).
4. the deep reinforcement learning-based migration method for edge computing services in internet of vehicles according to claim 1, wherein the method for establishing the computation delay model in step 1.3 is that all vehicles within the coverage of the MEC server share the computing resources to assist the vehicles in handling the unloading tasks of the vehicles, FmUsed to represent the computing power of MEC server m, phin(t) denotes the Task at time tnRequired CPU cycles, therefore, TasknThe required time to complete on MEC server m is expressed as:
Figure FDA0003538927630000031
in the above formula
Figure FDA0003538927630000032
Indicating how many tasks are being executed on the MEC server m, as seen from the above equation, the execution delay of the MEC server increases in proportion to the number of executing tasks, so that the computing resources of the target MEC server also need to be considered when performing service migration of the vehicle.
5. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the method for establishing the migration cost model in step 1.4 is as follows: to satisfy the continuity of vehicle service, service migration needs to be performed between multiple MEC servers; assuming that vehicle n migrates all of its offloaded tasks from MEC server m1 to m2, the cost of vehicle n migrating Task_n from m1 to m2 at time t is expressed by a formula (image in the original) in which χ is a positive coefficient and |o_n| denotes the image size of vehicle n's offloaded task.
6. the deep reinforcement learning-based IOV edge computing service migration method according to claim 1, wherein the problem of step 1.5 is described as follows, for a moving vehicle n, the task completion time TnIncluding computation, backhaul and communication delays, expressed as:
Figure FDA0003538927630000036
Figure FDA0003538927630000037
representing the total migration cost of the vehicle n, according to the formula (5), the migration cost is 0 when the MEC server is not changed, otherwise the migration cost is α | onL, so get the total migration cost of vehicle n:
Figure FDA0003538927630000038
migration decisions are made at each time period, while a migration Cost budget Cost is made at each time periodbAnd therefore migration cost budget for the entire system
Figure FDA0003538927630000039
Expressed as:
Figure FDA0003538927630000041
on the premise of meeting the migration cost budget, the average delay of the system is minimized through learning, and the optimization formula is expressed as follows:
Figure FDA0003538927630000042
7. the method for migrating the edge computing services in the internet of vehicles based on deep reinforcement learning as claimed in claim 1, wherein the improved deep deterministic gradient algorithm in step 2.1 is that each state sample in the replay storage is assigned a priority weight, their sampling probability is set according to the assigned priority weight, an adaptive weight empirical replay mechanism is proposed,
complexity of sample i CF(s)i) Which mainly comprises the importance function RF (r) of the sample return valuei,DEi) And a use frequency function SUF (num) on the samplei),
The importance of the sample return value is expressed as:
RF(ri,DEi)=|DEi|*RW(ri)+α (10)
in the above formula, DEi=Q(si,ai;θc)-(ri+μQ'(s'i,a'i;θc') Denotes TD error,Q(si,ai;θc) Is the value of critical component evaluate-network, alpha represents a small positive number, preventing the situation of not being able to sample when the time difference is 0, RW (r)i) Representing the weight of the corresponding reward, r being set for stabilityi∈[-1,1]While RW (r)i)>0,
Figure FDA0003538927630000043
In order to prevent the over-fitting phenomenon, a function related to the number of times of using the sample is added, and as the number of times of using the sample increases, the probability that the sample is selected next becomes lower, SUF (num)i) Expressed as:
Figure FDA0003538927630000051
in the above formula, numiIndicating playback of stored samples siP, q are constants greater than 0, so the complexity function is expressed as:
Figure FDA0003538927630000052
in the above formula, the first and second carbon atoms are,
Figure FDA0003538927630000057
representing a hyper-parameter, calculating a sampling probability of a sample by a defined sample complexity:
Figure FDA0003538927630000053
in the above formula, [ phi ] E [0,1 ∈]Denotes an exponential random factor, Ψ ═ 1 denotes priority sampling, Ψ ═ 0 denotes uniform sampling, and the exponential random factor ensures a balance of priority sampling and uniform sampling, thereby preventing overfittingOccurrence of a phenomenon, since a distribution error occurs if samples in the playback memory are directly sampled, importance sampling weight w is usediTo correct for this deviation, a normalization operation is used to reduce the TD error,
Figure FDA0003538927630000054
in the above equation, D represents the playback storage capacity, and β represents the compensation coefficient.
8. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the establishing of the adaptive-weight deep learning architecture in step 2.2 is as follows: the framework of centralized training and distributed execution is applied to the proposed AWDDPG algorithm; in the offline centralized training phase, the observation states and behaviors of other vehicles are saved in the experience replay buffer in addition to the local observation state, and combining the behaviors and observation states increases the number of training samples generated in each phase while also increasing the cooperative communication between agents; when the network parameters (formula images in the original) are updated, the Actor component evaluates them against the samples acquired with the adaptive weights; after obtaining the global information, each moving vehicle learns its own state-behavior value function, and after the behaviors of other vehicles have been learned, each moving vehicle is held fixed in the offline training phase so as to effectively handle the influence of other vehicles' behaviors on the environment; in the decision phase, because the Actor only needs the local observation state (formula image), the vehicle can select an action without knowing the information of other vehicles.
9. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the migration method steps of step 3.1 are described as follows: first, the relevant parameters of the algorithm are input, including the batch size, replay storage size, discount factor, soft-update coefficient, exponent, hyper-parameters, and the use count and complexity of the samples; the parameters are initialized and data warm-up is performed; the following steps are then executed in a loop: the state is initialized, each vehicle selects and executes its action, receives the reward obtained after the action is executed and obtains the next state; then, for each vehicle in turn, the sample is stored in the replay storage and the related parameters are set, samples are adaptively selected according to the formula, the time-difference error and importance sampling weight are calculated, the weights are updated and the complexity is calculated, the evaluate-network parameter of the Critic is updated by minimizing the loss function, the evaluate-network parameter of the Actor is updated by minimizing the policy objective equation, and finally the target-network parameters of the Critic and the Actor are updated; after the loop ends, the method ends.
10. The Internet of Vehicles edge computing service migration method based on deep reinforcement learning according to claim 1, wherein the complexity analysis of step 3.2 is as follows: the number of moving vehicles N and the batch size K are the main factors determining the time complexity of adaptive-weight sampling, which is denoted O(NK); since the time complexity of offline training is proportional to the size of the training data and the training time, only the time complexity of execution needs to be considered; the complexity of execution is mainly determined by the structure of the neural network, the size of the state space and the size of the action space, and without considering the neural-network structure the computational complexity is O(|A| × |S|), where |A| denotes the number of behaviors and |S| denotes the number of states; after a DNN is added, the environment and parameter settings of the system have a great influence on the computational complexity and it is difficult to estimate, so the complexity of the AWDDPG algorithm is expressed as O(NK + |A| × |S|).
CN202210232318.7A 2022-03-09 2022-03-09 Internet of vehicles edge computing service migration method based on deep reinforcement learning Pending CN114625504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232318.7A CN114625504A (en) 2022-03-09 2022-03-09 Internet of vehicles edge computing service migration method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210232318.7A CN114625504A (en) 2022-03-09 2022-03-09 Internet of vehicles edge computing service migration method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114625504A true CN114625504A (en) 2022-06-14

Family

ID=81899365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232318.7A Pending CN114625504A (en) 2022-03-09 2022-03-09 Internet of vehicles edge computing service migration method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114625504A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134242A (en) * 2022-06-27 2022-09-30 天津理工大学 Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy
CN115134242B (en) * 2022-06-27 2023-08-22 天津理工大学 Vehicle-mounted computing task unloading method based on deep reinforcement learning strategy
CN115550944A (en) * 2022-08-18 2022-12-30 重庆大学 Dynamic service placement method based on edge calculation and deep reinforcement learning in Internet of vehicles
CN115550944B (en) * 2022-08-18 2024-02-27 重庆大学 Dynamic service placement method based on edge calculation and deep reinforcement learning in Internet of vehicles
CN115934192A (en) * 2022-12-07 2023-04-07 江苏信息职业技术学院 B5G/6G network-oriented vehicle networking multi-type task cooperative unloading method
CN115934192B (en) * 2022-12-07 2024-03-26 江苏信息职业技术学院 B5G/6G network-oriented internet of vehicles multi-type task cooperation unloading method
CN116016514A (en) * 2022-12-28 2023-04-25 北京工业大学 Intelligent self-adaptive arrangement method for edge computing service
CN116016514B (en) * 2022-12-28 2024-04-19 北京工业大学 Intelligent self-adaptive arrangement method for edge computing service

Similar Documents

Publication Publication Date Title
CN114625504A (en) Internet of vehicles edge computing service migration method based on deep reinforcement learning
CN113573324B (en) Cooperative task unloading and resource allocation combined optimization method in industrial Internet of things
CN112380008B (en) Multi-user fine-grained task unloading scheduling method for mobile edge computing application
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN113568675A (en) Internet of vehicles edge calculation task unloading method based on layered reinforcement learning
CN114285853B (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN112272390B (en) Processing method and system for task unloading and bandwidth allocation based on physical layer
CN113364859B (en) MEC-oriented joint computing resource allocation and unloading decision optimization method in Internet of vehicles
CN113973113B (en) Distributed service migration method for mobile edge computing
CN111132074A (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
CN111970154B (en) Unloading decision and resource allocation method based on deep reinforcement learning and convex optimization
Fragkos et al. Artificial intelligence enabled distributed edge computing for Internet of Things applications
CN114205353B (en) Calculation unloading method based on hybrid action space reinforcement learning algorithm
CN116233926A (en) Task unloading and service cache joint optimization method based on mobile edge calculation
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN115134778A (en) Internet of vehicles calculation unloading method based on multi-user game and federal learning
Jeong et al. Deep reinforcement learning-based task offloading decision in the time varying channel
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
CN112689296B (en) Edge calculation and cache method and system in heterogeneous IoT network
CN111930435B (en) Task unloading decision method based on PD-BPSO technology
Chu et al. Multiuser computing offload algorithm based on mobile edge computing in the internet of things environment
Hossain et al. Edge orchestration based computation peer offloading in MEC-enabled networks: a fuzzy logic approach
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN116137724A (en) Task unloading and resource allocation method based on mobile edge calculation
Xu et al. Decentralized multi-agent reinforcement learning for task offloading under uncertainty

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220713

Address after: 300409 No.8 Jingshun Road, Beichen Science Park, Beichen District, Tianjin

Applicant after: HUADIAN HEAVY MACHINERY Co.,Ltd.

Address before: 300384 No. 391 Binshui West Road, Xiqing District, Tianjin

Applicant before: TIANJIN University OF TECHNOLOGY