CN116501483A

CN116501483A - Vehicle edge calculation task scheduling method based on multi-agent reinforcement learning

Info

Publication number: CN116501483A
Application number: CN202211608461.8A
Authority: CN
Inventors: 陈竹; 刘奇; 刘剑群; 吴朝亮; 马颂华
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2022-12-14
Filing date: 2022-12-14
Publication date: 2023-07-28

Abstract

The invention provides a vehicle edge calculation task scheduling method based on multi-agent reinforcement learning, and relates to the technical field of edge task scheduling. By using each road side unit RSU as an agent, the road side units RSU can cooperate with other agents in a communication range, so that a multi-agent environment is established. Then, the task scheduling problem in the environment is abstracted into a calculation problem of minimizing energy consumption cost under the condition of guaranteeing each constraint condition. And deducing an optimal scheduling strategy by using a Markov game idea, and training the optimal scheduling strategy by constructing a multi-agent-based reinforcement learning algorithm-MADQN-TS algorithm model to obtain a final decision model. Therefore, under the condition of limited communication and calculation resources, resources in the network are effectively scheduled to achieve the optimal utilization rate, and the system energy consumption is promoted to be minimized, so that the problem of explosion of decision space dimensions is relieved.

Description

Vehicle edge calculation task scheduling method based on multi-agent reinforcement learning

Technical Field

The invention relates to the technical field of edge task scheduling, in particular to a vehicle edge calculation task scheduling method based on multi-agent reinforcement learning.

Background

Real-time task scheduling research under the vehicle-mounted ad hoc network can be generally divided into two research directions according to a scheduling algorithm, and a traditional scheduling algorithm and an intelligent learning algorithm. The conventional algorithm is usually a static solution of a complex optimization problem, and cannot realize an optimal decision based on a dynamic environment.

Intelligent learning algorithms can solve the dynamic solution problem, so many researchers use deep learning and reinforcement learning techniques to solve the resource scheduling problem in edge computing, including single agent learning and multi-agent learning. For example, yu et al describe the offloading decision-making problem as a multi-labeled classification problem and use deep supervised learning techniques to make task offloading decisions. Miao et al predict user computing tasks based on LSTM algorithm to optimize the edge computing offload model. Qu, ning, lu, and Shen et al, from different perspectives, adapt to dynamic environments based on DQN techniques (deep learning based Q-learning algorithms) to solve task offloading problems. The algorithm has strong parallel processing and learning capabilities, but cannot be well adapted to high-concurrency real-time tasks and multi-node clusters, and the problem of explosion of decision space dimension often exists.

In addition, in the calculation of the vehicle edge, the demand of the vehicle for data communication and calculation load is large, and the existing network resources often cannot meet the requirement of high time delay.

Disclosure of Invention

The invention aims to provide a vehicle edge calculation task scheduling method based on multi-agent reinforcement learning, which optimizes task scheduling problems in vehicle edge calculation through an improved multi-agent reinforcement learning algorithm MADQN-TS, improves network utilization rate, promotes system energy consumption to be minimized, and relieves the problem of decision space dimension explosion.

Embodiments of the present invention are implemented as follows:

the embodiment of the application provides a vehicle edge calculation task scheduling method based on multi-agent reinforcement learning, which comprises the following steps:

receiving real-time task information and resource demand information corresponding to all vehicle terminals;

classifying tasks and measuring the resource requirement of each task;

dividing the task into single atomic tasks according to the type of the task, and placing the single atomic tasks into a queue to be scheduled;

sequentially extracting tasks from a queue to be scheduled, performing task scheduling by using a preset MADQN-TS algorithm model, and distributing each task to a corresponding road side unit RSU;

the road side unit RSU executes the corresponding task, obtains and transmits the processing result back to the corresponding vehicle terminal.

In some embodiments of the present invention, the step of performing task scheduling by using a preset MADQN-TS algorithm model and allocating each task to a corresponding road side unit RSU includes:

based on the RSU cooperation idea, calculating the time delay and energy consumption of each RSU for processing the corresponding task;

constructing a joint optimization formula under multiple constraint conditions according to time delay and energy consumption, and defining a joint optimization problem;

converting the joint optimization problem into rewards about time delay and energy consumption based on a Markov game idea, and deducing to obtain a state transfer function;

solving a state transfer function by using a preset MADQN-TS algorithm model to obtain a scheduling result;

and carrying out local processing/migration to another adjacent RSU for processing according to the scheduling result.

In some embodiments of the present invention, when each RSU processes a corresponding task, a local processing delay, a local processing energy consumption, a migration processing delay, and a migration processing energy consumption need to be calculated;

local processing delayThe formula of (2) is: />Wherein c _k Representing RSU _k Computing resources required for executing a task, +.>Representing RSU _k Is provided;

local processing energy consumptionThe formula of (2) is: />Wherein κ represents RSU _k A coefficient of relation between processing power and power consumption, D _k Representing RSU _k And D _k ＝{d _k ,c _k }，d _k Representing data size, c _k Representing computing resources required to perform a task;

migration processing time delay T _k,m The formula of (2) is:wherein f _k,m Representing RSU _m Assigned to RSU _k Is x _k,m Representing RS U _m And RSU _k Communication rate between the two; and->Wherein B is _k Representing bandwidth, p _k Representing RSU _k Transmission power h of (1) _k,m Represents the channel attenuation coefficient, θ represents the path loss coefficient, l _k,m Representing RSU _m And RSU _k A distance therebetween; f is as above _k,m The formula needs to be satisfied: />Representing RSU _m The sum of all distributed computing power is smaller than RSU _m Maximum computing power F per se _m Wherein o _k,m =1, means RSU _m Migrating data to RSU _k Processing;

migration processing energy consumption E _k,m The formula of (2) is:wherein e _m Representing RSU _m The energy consumption per unit of computation capacity.

In some embodiments of the present invention, the above-mentioned joint optimization formula under the multi-constraint condition constructed according to the time delay and the energy consumption is:

s.t.C1:o _k,j ∈{0,1}

C2:

C3:

C4:0≤f _k,m ≤F _m

C5:0≤p _k ≤P

wherein,,weights representing time delays +.>Weight factor, T, representing energy consumption _k And E is _k The real time delay and the real energy consumption are represented; when o _k,0 When =1, this indicates that the task is selected to be executed locally, at this time +.>When o _k,m When=1, migration execution is indicated, at which time T _k ＝T _k,m ,E _k ＝E _k,m The method comprises the steps of carrying out a first treatment on the surface of the Constraint C1 and constraint C2 represent migration to only one RSU, constraint C3 and constraint C4 represent that the allocated computing power cannot exceed the RUS _m Maximum computational power on, constraint C5 ensures transmission power p _k The upper limit is not exceeded.

In some embodiments of the present invention, the step of deriving the state transfer function by converting the joint optimization problem into rewards for time delay and energy consumption based on the markov game concept includes:

calculating a normalized delay difference according to the local processing delay and the migration processing delay, and calculating a normalized energy consumption difference according to the local processing energy consumption and the migration processing energy consumption;

and combining the normalized delay difference and the normalized energy consumption difference to obtain a reward formula:wherein (1)>Representing RSU _k Final rewards, ->Indicating a delay reward->Representing an energy consumption reward;

according to rewardsComputing System overhead->The formula is: />Wherein ω represents a discount factor, represents the degree of influence of past rewards on the current rewards, τ represents past time slots, +.>Representing slave status +.>To state->Is a reward of (a);

according to system overheadDeriving a state transfer function->The formula is: />

In some embodiments of the present invention, the constructing of the MADQN-TS algorithm model includes:

establishing an estimation Actor network and an estimation Critic network, and inputting the current state of the RSU to the estimation Actor networkOutput action->Inputting the current state S and action A of all RSUs to the estimated Critic network, and outputting predicted valuesWherein (1)> Representing estimating Critic network parameters;

establishing a target Actor network and a target Critic network, and inputting the next state of the RSU to the target Actor networkThen output the next action +.>The next state S 'and action A' of all RSUs are input to the target Critic network, and the target value +.>Wherein (1)> Representing target Critic network parameters;

based on predicted valuesAnd target value->Estimating the Actor network parameter by using a random gradient descent method>Estimating Critic network parameters +.>Updating the loss function, and empirically storing the related state, action, loss function and network parameters to form an empirical replay mechanism;

acquiring relevant experience data from the experience replay mechanism to train and train the target Actor network parametersAnd target Critic network parameters +.>Performing cyclic updating;

the Actor network, critic network, and empirical replay mechanism described above constitute the MADQN-TS algorithm model described above.

In some embodiments of the invention, the target Actor network parameters described aboveThe updated formula of (2) is: target Actor network parameters->Target Critic network parameters->The updated formula of (2) is: />Wherein lambda is E [0,1 ]]。

Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:

the embodiment of the application provides a vehicle edge calculation task scheduling method based on multi-agent reinforcement learning. By using each road side unit RSU as an agent, the road side units RSU can cooperate with other agents in a communication range, so that a multi-agent environment is established. Then, the task scheduling problem in the environment is abstracted into a calculation problem of minimizing energy consumption cost under the condition of guaranteeing each constraint condition. And deducing an optimal scheduling strategy by using a Markov game idea, and training the optimal scheduling strategy by constructing a multi-agent-based reinforcement learning algorithm-MADQN-TS algorithm model to obtain a final decision model. Therefore, under the condition of limited communication and calculation resources, resources in the network are effectively scheduled to achieve the optimal utilization rate, and the system energy consumption is promoted to be minimized, so that the problem of explosion of decision space dimensions is relieved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is an interactive schematic diagram of a three-layer model of an edge system;

FIG. 2 is a flowchart of an embodiment of a vehicle edge computing task scheduling method based on multi-agent reinforcement learning provided by the present invention;

FIG. 3 is a schematic diagram of task scheduling in an embodiment of a vehicle edge computing task scheduling method based on multi-agent reinforcement learning according to the present invention;

FIG. 4 is a schematic diagram of the MADQN algorithm;

FIG. 5 is a vehicle edge computing task scheduling scenario diagram based on RSU collaboration;

FIG. 6 is a schematic diagram of an Actor network;

FIG. 7 is a schematic diagram of a Critic network;

FIG. 8 is a frame diagram of a MADQN-TS algorithm model;

FIG. 9 is a chart of a convergence alignment of the MADQN-TS algorithm and the DQN-TS algorithm;

FIG. 10 is a graph comparing energy costs of MADQN-TS algorithm and DQN-TS algorithm;

FIG. 11 is a graph of average resource utilization versus several algorithms;

FIG. 12 is a graph of average task failure rates for several algorithms.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

Examples

Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The various embodiments and features of the embodiments described below may be combined with one another without conflict.

In the edge computing scenario, there are many system models, and the most classical is a "cloud-edge-end" three-layer system model, which is divided into a lower layer, an upper layer and a middle layer as shown in fig. 1. The lower layer is an Internet of things layer and comprises various side intelligent devices with starvation of resources, limited energy and excessive time delay, such as vehicles, monitoring and the like; the middle layer is an edge layer, is a heart of whole edge calculation, is connected with a cloud end and an Internet of things end, and plays a role of a bridge, such as a Road Side Unit (RSU), an edge server and the like. The cloud end can be connected upwards to send a request and receive a result, the internet of things layer can be connected downwards to receive a task request of the terminal, the task is selected to be calculated locally or further uploaded to the cloud end, and a final calculation result is returned to the terminal equipment; the cloud end is used as a three-layer core brain and is provided with a storage center and a calculation center, and can calculate, process and store large-scale tasks and return results.

The resource allocation and task scheduling problems that the present invention aims to solve typically occur in the middle tier. The task request sent by the intelligent device reaches the middle layer, and the middle layer is required to judge whether the task is finished by uploading the task request to the cloud end in a cross-layer mode or finishing the task by transversely cooperating with an adjacent edge server. The scheduling process measures the conditions such as computing resources, network transmission overhead, user experience time length and the like required by the task so as to effectively schedule.

Specifically, referring to fig. 2 and 3, an embodiment of the present application provides a vehicle edge computing task scheduling method based on multi-agent reinforcement learning, which includes the following steps:

step S1: and receiving real-time task information and resource demand information corresponding to all the vehicle terminals.

In the above steps, the task release of the edge device is the first step in the task scheduling process. The initiator of the task is an internet of things device at the edge of the network, including smartphones, surveillance cameras, industrial sensors, vehicles, etc. The devices have different forms and uses, have the characteristics of isomerism and dynamicity, and also have different types of generated tasks and resource requirements of the tasks. The method and the device receive real-time task information and resource demand information corresponding to all vehicle terminals through the road side unit RSU, and then perform vehicle edge calculation task scheduling.

Step S2: tasks are classified and the resource requirements of each task are measured.

In the above steps, after the edge node receives the task, the task is classified according to different task types, where the task includes an independent task type, a periodic task type, an sporadic task type, a multi-frame task type, a real-time task type, and the like. The resource requirements of each task are then measured. Specific resource requirements that may be included include CPU resources, GPU resources, FPGA resources, etc., provide for edge node selection for subsequent task scheduling.

Step S3: according to the type of the task, the task is divided into single atomic tasks and put into a queue to be scheduled.

In the above steps, after the task is measured by the resource demand, the scheduler can more accurately know the resources required by the task and the existing resource status of the edge node. The decision process is mainly divided into two steps, wherein the first step is to select different task processing modes according to the types of the tasks, divide the tasks into single atomic tasks, and if the tasks are not separable, do not process. The second step is to put the task into the queue to be scheduled and wait for allocation.

Step S4: and sequentially extracting tasks from the queues to be scheduled, performing task scheduling by using a preset MADQN-TS algorithm model, and distributing each task to a corresponding road side unit RSU.

In the steps, the MADQN-TS algorithm is improved on the basis of the MADQN algorithm (multi-agent-based depth Q-learning algorithm). The principle of MADQN algorithm is shown in FIG. 4, assuming that there are n agents in the scene, the state s of the scene should be a combination of n agents, s { can be used ₁ ,s ₂ ,...,s _n Represented by }, i.es _j And (1.ltoreq.j.ltoreq.n) represents the confidence state of the jth agent. Agent A _j The input of (2) is s _j The output is AND A _j Q value corresponding to the action in (Q) _j ，q _j ＝[Q(s _j ),...,Q(s _j )]. The Q values of all actions are obtained by the connection of the neural network shown in the leftmost a-graph in FIG. 4, and can be expressed as +.>Upon agent decision (local process or migration process), an action is selected according to q. The right-most c-chart in fig. 4 shows the details of the computation between two hidden layers. I+1th hidden layer->The input of (a) comprises two parts, one is the last hidden layer +.>And information from other hidden units +.>Information of other hidden units->Can be expressed as: -:>then the i+1th hidden layerCan be expressed as:>where σ represents a nonlinear activation function, typically using a RELU function,is a weight matrix representing shared parameters between agents.

MADQN is generally considered to have a hidden layer d ⁱ⁺¹ ＝σ(T ⁱ h ⁱ ) Is a structured DQN (deep Q-learning algorithm), where h ⁱ Is all thatRepresentation of (i.e.)>T ⁱ Can be expressed as:

wherein,,

based on the above principle, the specific model construction and task scheduling process includes:

step S4-1: based on the RSU cooperation idea, the time delay and the energy consumption of each RSU for processing the corresponding task are calculated.

In the above steps, each road side unit RSU is first treated as an agent, so that it can cooperate with other agents in the communication range, thereby establishing a multi-agent environment.

From the perspective of the RSUs, it is assumed that k= {1,2, K } RSUs, task data offloaded by other RSUs within a certain range is acceptable to each RSU. Thus there is m= {1,2,..m } for each RSU that can migrate, where M must be less than K. Suppose RSU _k Data of (2) is D _k ＝{d _k ,c _k }, where d _k And c _k Representing the data size and the required computing resources, respectively. By p _k Representing RSU _k Transmission power, p _k Satisfy p _k ∈[0,P]Where P represents the maximum transmission power it allows. When RSU _k With RSU _m Is less than r _k Can communicate at that time. In each time slot, each RSU can only communicate with one RSU, the communication rate between the two depending on bandwidth, transmission rate and noise interference. RSU (RSU) _k With RSU _m The communication rate X between them can be expressed as:wherein B is _k Representing bandwidth, p _k Representing RSU _k Transmission power h of (1) _k,m Represents the channel attenuation coefficient, θ represents the path loss coefficient, l _k,m Representing RSU _m And RSU _k Distance between them.

During the scheduling process, the RSU can flexibly decide whether the data is to be calculated locally or migrated to other RSUs (as shown in fig. 5). Therefore, can use omicron _k,j E {0,1}, j e {0,1,2,..Calculating the decision, when the decision _k,0 When=1, data D is represented _k Processing at local, omicron _k,m When=1, data migration to RSU is indicated _m And (5) performing treatment. It should be noted that when each RSU selects only one migration object, the formula needs to be satisfied:

further, in decision making, time delay and energy consumption of the RSU need to be considered, including local processing time delay, local processing energy consumption, migration processing time delay and migration processing energy consumption. The specific calculation is as follows:

migration processing time delay T _k,m The formula of (2) is:wherein f _k,m Representing RSU _m Assigned to RSU _k Is x _k,m Representing RS U _m And RSU _k Communication rate between the two; and->Wherein B is _k Representing bandwidth, p _k Representing RSU _k Transmission power h of (1) _k,m Represents the channel attenuation coefficient, θ represents the path loss coefficient, l _k,m Representing RSU _m And RSU _k A distance therebetween; f is as above _k,m The formula needs to be satisfied: />Because of the RSU _m Data of a plurality of RSUs can be received, but the resources of each RSU are limited, so that the sum of all allocated computing power needs to be smaller than RSU _m Maximum computing power F per se _m Wherein o _k,m =1, means RSU _m Migrating data to RSU _k Processing;

Step S4-2: and constructing a joint optimization formula under multiple constraint conditions according to the time delay and the energy consumption, and defining a joint optimization problem.

In the above steps, in order to fully utilize the computing power while reducing the energy consumption, it is necessary to optimize the weighted sum of the time delay and the energy consumption, i.e., the system cost. The joint optimization formula is as follows:

s.t.C1:o _k,j ∈{0,1}

C2:

C3:

C4:0≤f _k,m ≤F _m

C5:0≤p _k ≤P

wherein,,weights representing time delays +.>Weight factor, T, representing energy consumption _k And E is _k The real time delay and the real energy consumption are represented; when o _k,0 When =1, this indicates that the task is selected to be executed locally, at this time +.>When o _k,m When=1, migration execution is indicated, at which time T _k ＝T _k,m ,E _k ＝E _k,m The method comprises the steps of carrying out a first treatment on the surface of the Constraint C1 and constraint C2 represent migration to only one RSU, constraint C3 and constraint C4 represent that the allocated computing power cannot exceed the RUS _m Maximum computational power on, constraint C5 ensures transmission power p _k The upper limit is not exceeded. Thereby determining a joint optimization problem.

Step S4-3: the joint optimization problem is converted into rewards on time delay and energy consumption based on the Markov game idea, and a state transfer function is obtained through deduction.

In the above steps, each RSU in the environment can be regarded as an Agent. In the interaction process of a plurality of agents, the RSU executes different actions to change the state, and corresponding rewards are obtained. The above-described joint optimization problem can thus be solved by maximizing the jackpot. The description of the state, action, prize, and next state in the transition process is as follows:

(1) Status of

At time slot t, RSU _k Can observe the state of oneselfThe state includes the original computational decision o _k,m Data size d _k Required computing resource c _k And distance->Thus state->Can be expressed as +.>Wherein (1)>Representing RSU _k And the distance between all interactable RSUs.

(2) Action

At time slot t, RSU _k Action of (2)Including executable computational decision->And executable transmit power +.>The action space can thus be expressed as +.>Wherein->When->When expressed in RSU _k Is executed locally, is added up>Time-indicating RSU _k Migrating data to RSU _m And executing on the computer. Similarly, let go of> Time is represented in RSU _k Is executed locally, is added up>Time indication migration to RSU with power p _m And executing on the computer.

(3) Rewards

At time slot t, RSU _k Is awarded of (a)Comprising delay rewards->And energy consumption rewards->According to the above joint optimization problem, first, a normalized delay difference is calculated from the local processing delay and the migration processing delay>And calculating normalized energy consumption difference according to the local processing energy consumption and the migration processing energy consumption>

Then, a reward formula is obtained by combining the normalized delay difference and the normalized energy consumption difference:wherein (1)>Representing RSU _k Final rewards, ->Indicating a delay reward->Indicating an energy consumption prize. The significance of the reward formula is: when RSU _k Selecting when executed locally, rewards +.>When RSU _k When selecting to execute data migration, rewarding +.>And is a non-zero value. If the migration latency is smaller than the local computation latency,/->Is a positive reward; otherwise, go (L)>Is a negative prize. Similarly, if the energy consumption of migration is less than the local computing energy consumption, < >>Is a positive reward; otherwise, go (L)>Is a negative prize.

Thereafter, according to rewardsComputing System overhead->The formula is: />Wherein ω represents a discount factor, represents the degree of influence of past rewards on the current rewards, τ represents past time slots, +.>Representing slave status +.>To stateIs a reward for (a). Thereby translating the joint optimization problem described above into rewards for latency and energy consumption.

Finally, according to the system overheadDeriving a state transfer function->The formula is: />At time slot t, RSU _k State transition function->By performing actions->Transition to the next State +.>Is expressed as the probability ofIn the interaction process of the two RSUs, the final convergence to the optimal state transfer function +.>I.e., maximizing cumulative rewards. Therefore, the optimal solution for maximizing the accumulated consideration is calculated, and the scheduling result can be obtained.

Step S4-4: and solving the state transfer function by using a preset MADQN-TS algorithm model to obtain a scheduling result.

In the above steps, the MADQN-TS algorithm is built based on the Actor-Critic framework. For each RSU, it is a separate agent, an Actor network, a Critic network, and an experience replay mechanism. Wherein, the Actor network is used for generating actions, and the Critic network is used for guiding the Actor to act better.

For example, referring to fig. 6, the actor network is composed of an input layer, three fully connected layers with ReLU activation function, and an output layer. The input to the Actor network is the statusThe output is action->Furthermore, to learn more knowledge from an unknown environmentRSUs require balanced development and exploration. Development means that the RSU will take the action with the greatest value as possible to use the learned knowledge, while exploration means that the RSU gets the unknown knowledge by random action. During the learning process, an epsilon-greed algorithm is employed to balance the actions taken.

Referring to FIG. 7, the critic network uses the Q value to evaluate the performance of the action. The Critic network comprises an input layer, three full-connection layers with ReLU activation function and an output layer with one node. The inputs to the Critic network are the states and actions of all RSUs, and the output is the Q value.

Specifically, referring to FIG. 8, the MADQN-TS algorithm model is constructed to include an Actor network, a Critic network, and an empirical replay mechanism.

Firstly, establishing an estimation Actor network and an estimation Critic network, and inputting the current state of an RSU to the estimation Actor networkOutput action->Inputting the current state S and action A of all RSUs to the estimated Critic network, and outputting predicted valuesWherein (1)> Representing estimating Critic network parameters.

Then, a target Actor network and a target Critic network are established, and the next state of the RSU is input to the target Actor networkThen output the next action +.>The next state S 'and action A' of all RSUs are input to the target Critic network, and the target value +.>Wherein (1)> Representing target Critic network parameters.

Thereafter, based on the predicted valueAnd target value->Estimating the Actor network parameter by using a random gradient descent method>Estimating Critic network parameters +.>And updating the loss function, and empirically storing the related states, actions, loss function and network parameters to form an empirical replay mechanism. Wherein, estimating the Actor network parameter +.>The updated formula of (2) is: />π _k Representing the policy. Furthermore, by minimizing the loss function of the current Q value and the next Q value +.>Can update RSU _k Is a Critic network. Its loss function->Can be expressed as: />The update formula is: />

Finally, the relevant experience data is obtained from the experience replay mechanism to train, and the target Actor network parameters are trainedAnd target Critic network parameters +.>And circularly updating to obtain the optimal MADQN-TS algorithm model. Wherein, the target Actor network parameter +.>The updated formula of (2) is: target Actor network parameters->Target Critic network parameters->The updated formula of (2) is: />Wherein lambda is E [0,1 ]]。

Specifically, the training process of the MADQN-TS algorithm model is shown in Table 1 below:

TABLE 1MADQN-TS algorithm model training process

In the training process of MADQN-TS algorithm model, each Actor network needs its own state and Q value of corresponding Critic network, and Critic network needs the state and behavior of all Actor networks. After the training process is completed, the executing process only needs the Actor networks, and each Actor network can deduce effective actions according to the state of the Actor network. In the MIn the model of the ADQN-TS algorithm, the mechanism of empirical replay is important if each sample is expressed as a formulaIts history sequence is expressed as +.>Because the training data in the samples must be independently co-distributed, the use of an empirical replay mechanism can disrupt the time dependence. During the training process, the generated states, actions, rewards, and next actions and states are stored in the cache pool. During the training process, the Actor network and the Critic network randomly select experience data to train.

Step S5: the road side unit RSU executes the corresponding task, obtains and transmits the processing result back to the corresponding vehicle terminal.

And the road side unit RSU performs local processing or migrates to another adjacent RSU for processing according to the scheduling result. And after the processing is finished, the processing result is transmitted back to the corresponding vehicle terminal, so that the vehicle edge calculation task scheduling is finished.

In order to verify the performance of the MADQN-TS algorithm, the application constructs an edge simulation system taking vehicle edge calculation as a scene on the basis of an edge calculation simulator edge cloudsim, and performs experimental analysis on the algorithm. EdgeCloudsim was developed based on Cloudsim, where experiments can be performed in an environment that considers both computing resources and network resources. The specific experimental software and hardware environments are shown in table 2 below.

Table 2 experiment software and hardware environment parameter table

Name of the name	Parameters (parameters)
		CPU	Intel(R)Xeon(R)CPU E5-2630v4@2.20GHz
GPU	NVIDIA GeForce RTX 2080Ti
		Operating system	Ubuntu18.04
Memory	128GB
		Training software	TensorFlow
Emulator	EdgeCloudsim
		Data statistics	MATLAB

Vehicle applications are generally of two types, one being vehicle-mounted safety-based applications, such as obstacle early warning; the other type is entertainment information application, mainly related applications such as voice recognition, video processing, online games and the like. In order to simulate the applications commonly found in real vehicles, the networking application program generated by using the simulator is required in the experiment, and three traffic_ MANAGEMENT, DANGER _ ASSESSMENT, INFOTAINMENT are mainly provided, and the main parameters of the three traffic_ MANAGEMENT, DANGER _ ASSESSMENT, INFOTAINMENT are shown in table 3.

Table 3 application data simulation parameter table

	TRAFFIC	DANGER	INFOTAINMENT
				Utilization (%)	30	35	35
Delay sensitivity(s)	0.5	0.8	0.25
				Poisson distribution parameters(s)	3	5	15
Average data upload (KB)	20	40	20
				Average data download (KB)	20	20	80
Task scheduling length (KB)	3000	10000	20000
				Number of cores required	1	1	1
Edge resource utilization number	6	20	40

In addition, in the experiment, the situation of two bidirectional two lanes is considered, the length of each lane is 1000 meters, the width of each lane is 4 meters, the RSU is arranged on one side of a road, and the vehicle moves back and forth on each lane. The speed data used a portion of the GAIA open dataset containing chinese west An Didi express speeds. The data set contains the GPS coordinates and real-time speed of the drip cart over a month and thousands of regions and roads. Three loads are randomly selected, and the average speed of the vehicle on the road is calculated by statistics. The experiment will randomly select the vehicle speed from among three speeds of 17.7, 35.8 and 52.6 km/h. The coverage radii of the RSU and the vehicle are set to 500 meters and 250 meters, respectively. From which the latency constraints of the task are randomly generated. Detailed parameters table 4 of vehicle and RUS used in the experiment.

TABLE 4 vehicle and RSU detailed parameters

Parameter name	Parameter symbol	Parameter value
			RSU number	K	10
Number of RSUs that each RSU can cooperate with	M	2-4
			RSU cooperative coverage radius	μ _k,m	500m
RSU computing power	F _k	100GHz/s
			Energy consumption of RSU computing power	e _k	1W/GHz
Data size	d _k	50KB-10000KB
			Required computing resources	c _k	1-9GHz
Maximum transmission power	P	300mW

Based on the above system parameters, a scene environment of the training model is constructed, and model training parameters of the DQN-TS algorithm are set as shown in Table 5 below.

TABLE 5DQN-TS algorithm model training parameters

Parameter name	Parameter value
		Actor network learning rate	10 ^-4
Critic network learning rate	10 ^-3
		Coefficient of discount	0.95
Initializing sampling probabilities	0.5
		Sampling decay rate of memory A	-0.0001
Sampling decay rate of memory B	0.0001
		Initializing epsilon	1
Attenuation coefficient of epsilon	0.995

In the verification process, the convergence of the MADQN-TS algorithm and the DQN-TS algorithm is detected, and a convergence comparison chart shown in figure 9 is obtained. Wherein, the abscissa represents the iteration times, and the ordinate represents the result of normalization processing of the reward functions of the two algorithms. In particular, from the standpoint of convergence speed, MADQN-TS converges substantially 300 times and DQN-TS converges 400 times, with MADQN-TS converging faster. MADQN-TS was 21.9% higher than DQN-TS in terms of return. It follows that the MADQN-TS algorithm achieves better convergence due to the joint optimization of the weighted sum of time and energy consumption, and the empirical replay mechanism.

FIG. 10 is a graph of energy consumption cost versus MADQN-TS algorithm and DQN-TS algorithm. It can be seen that the energy costs accumulate over time, and when the time accumulation is long enough, the rate of increase of the energy costs will tend to stabilize, because the task scheduling process in the system has reached a relatively balanced point. The system energy consumption of the MADQN-TS algorithm is reduced by 22.8% compared with DQN-TS.

In addition, the MADQN-TS algorithm is compared with the DQN-TS algorithm, the ML-based algorithm, the SMA-based algorithm and the Random algorithm, and a comparison chart of the obtained average resource utilization rate is shown in FIG. 11. It can be seen that the resource utilization of MADQN-TS is substantially consistent with DQN-TS when the amount of tasks is small, because the computing power of the RSU can handle these tasks (local processing). However, when the tasks are enough, some RSUs cannot process the tasks, and the tasks can be migrated to other RSUs to run, so that the average CPU resource utilization of the overall RSU is increased. And the more tasks, the more obvious the increase, the more suitable the MADQN-TS algorithm is for places with high vehicle density, such as city centers.

The resulting average task failure rate versus graph is shown in fig. 12. It can be seen that as tasks increase, the failure rate of the tasks increases, but by reasonable scheduling of the MADQN-TS algorithm, the failure rate is much lower than other algorithms. The task failure rate of the MADQN-TS algorithm is lower than that of the DQN-TS because the RSUs cooperate such that some tasks that would otherwise be due to insufficient RSU computing power can be migrated to be performed on adjacent relatively idle RSUs.

The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. The vehicle edge calculation task scheduling method based on multi-agent reinforcement learning is characterized by comprising the following steps of:

classifying tasks and measuring the resource requirement of each task;

and the road side unit RSU executes the corresponding task, obtains and transmits the processing result back to the corresponding vehicle terminal.

2. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning according to claim 1, wherein the step of performing task scheduling using a preset MADQN-TS algorithm model and assigning each task to a corresponding road side unit RSU comprises:

3. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning according to claim 2, wherein when each RSU processes a corresponding task, local processing delay, local processing energy consumption, migration processing delay and migration processing energy consumption are calculated;

migration processing time delay T _k,m The formula of (2) is:wherein f _k,m Representing RSU _m Assigned to RSU _k Is x _k,m Representing RSU _m And RSU _k Communication rate between the two; and->Wherein B is _k Representing bandwidth, p _k Representing RSU _k Transmission power h of (1) _k,m Represents the channel attenuation coefficient, θ represents the path loss coefficient, l _k,m Representing RSU _m And RSU _k A distance therebetween; said f _k,m The formula needs to be satisfied: />Representing RSU _m The sum of all distributed computing power is smaller than RSU _m Maximum computing power F per se _m Wherein o _k,m =1, means RSU _m Migrating data to RSU _k Processing;

4. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning according to claim 3, wherein the joint optimization formula under the multi-constraint condition constructed according to time delay and energy consumption is:

s.t.C1:o _k,j ∈{0,1}

C4:0≤f _k,m ≤F _m

C5:0≤p _k ≤P

5. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning of claim 3, wherein the step of deriving a state transfer function based on markov game concept to convert the joint optimization problem into rewards on time delay and energy consumption comprises:

6. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning according to claim 1, wherein the constructing of the MADQN-TS algorithm model includes:

establishing a target Actor networkThe target Critic network inputs the next state of the RSU to the target Actor networkThen output the next action +.>The next state S 'and action A' of all RSUs are input to the target Critic network, and the target value +.>Wherein (1)> Representing target Critic network parameters;

acquiring relevant experience data from the experience replay mechanism for training and aiming at target Actor network parametersAnd target Critic network parameters +.>Performing cyclic updating;

the Actor network, critic network, and empirical replay mechanism constitute the MADQN-TS algorithm model.

7. The vehicle edge computing task scheduling method based on multi-agent reinforcement learning of claim 6, wherein the target Actor network parameter θ _π ′ _k The updated formula of (2) is: target Actor network parameter θ _π ′ _k The method comprises the steps of carrying out a first treatment on the surface of the Target Critic network parameter θ' _Qk The updated formula of (2) is: θ'. _Qk ＝λθ _Qk +(1-λ)θ′ _Qk The method comprises the steps of carrying out a first treatment on the surface of the Wherein lambda is E [0,1 ]]。