Disclosure of Invention
The invention aims to overcome the defect that the prior art cannot plan the continuous actions of the unmanned aerial vehicle or obtain an accurate scheduling strategy, and provides a scheduling optimization method and a scheduling optimization system for unmanned aerial vehicle-assisted mobile edge computing.
In order to solve the technical problems, the technical scheme of the invention is as follows:
the invention provides a scheduling optimization method for unmanned aerial vehicle-assisted mobile edge computing, which comprises the following steps:
S1: constructing an offloading model of the mobile edge computing system, wherein the model comprises an unmanned aerial vehicle and a plurality of user equipments;
S2: obtaining the energy consumption of the computing tasks according to the offloading model of the mobile edge computing system;
S3: establishing an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling, with the objective of minimizing the average energy consumption of the user equipments;
S4: converting the optimization problem into a Markov decision process, and defining a state space, an action space and a return function of the offloading model of the mobile edge computing system;
S5: constructing a deep neural network based on the SAC algorithm, and training the deep neural network by using the state space, the action space and the return function to obtain a trained deep neural network;
S6: carrying out scheduling optimization by using the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments.
The SAC algorithm is an off-policy stochastic policy algorithm based on the maximum entropy reinforcement learning framework and the Actor-Critic architecture. Its main characteristic is entropy regularization: entropy is a measure of the randomness of a policy, and increasing the entropy brings more exploration of policies. By training the policy to balance the expected return against the entropy, the learning speed of the network can be accelerated while the policy is prevented from converging to a local optimal solution. The purpose of the Actor network is to obtain both the maximum expected return and the maximum entropy, i.e. to explore other strategies in the strategy space while successfully completing the task. The combination of off-policy network updates with the Actor-Critic architecture achieves good performance on continuous-control benchmark tasks, with more stable and better convergence.
Preferably, in step S1, the offloading model of the mobile edge computing system is specifically:
the offloading model of the mobile edge computing system comprises a single unmanned aerial vehicle and N user equipments, wherein the unmanned aerial vehicle simultaneously serves at most K user equipments, and each user equipment chooses either to compute its computing task locally or to offload the computing task to the unmanned aerial vehicle for computation; the length and the width of the flight area of the unmanned aerial vehicle are set to X_max and Y_max respectively, the unmanned aerial vehicle flies at a constant speed v(t) at a fixed height h, the antenna emission angle is θ, and the maximum flying speed is v_max; the flight time of the unmanned aerial vehicle is T time slots, the length of each time slot is τ, and the time for completing a computing task at any moment cannot exceed the maximum time delay T_max;
Let the coordinates of the unmanned aerial vehicle be [X(t), Y(t), h] and the coordinates of the user equipment be [x_i(t), y_i(t), 0], i ∈ {1, 2, …, N}; let the flight distance and the horizontal direction angle of the unmanned aerial vehicle at time t be d(t) and θ_h(t) respectively, then X(t) = X(t-1) + d(t)cos(θ_h(t)) and Y(t) = Y(t-1) + d(t)sin(θ_h(t)); the maximum coverage of the unmanned aerial vehicle is R_max = h·tan(θ), and the flying speed is v(t) = d(t)/τ;
Defining the computing task at time t as:
I_i(t) = {D_i(t), F_i(t)}
where D_i(t) represents the amount of data to be transmitted when the computing task at time t is offloaded, and F_i(t) represents the amount of computation required to complete the computing task at time t;
Define α_i(t) ∈ {0, 1} as the selection strategy of the user equipment: α_i(t) = 0 indicates that the computing task at time t is computed locally, and α_i(t) = 1 indicates that the computing task at time t is offloaded.
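For illustration only, the model notation and unmanned aerial vehicle kinematics above can be sketched in Python as follows; all numeric parameter values and the helper names (uav_step, random_task) are assumptions for illustration and are not prescribed by the invention.

```python
import math
import random

# Illustrative model parameters (assumed values, not prescribed by the invention).
X_MAX, Y_MAX = 1000.0, 1000.0   # flight area length X_max and width Y_max (m)
H = 100.0                        # fixed flight height h (m)
THETA = math.pi / 4              # antenna emission angle theta
V_MAX = 20.0                     # maximum flying speed v_max (m/s)
TAU = 1.0                        # slot length tau (s)
N = 40                           # number of user equipments
K = 3                            # UAV serves at most K user equipments per slot

R_MAX = H * math.tan(THETA)      # maximum coverage radius R_max = h*tan(theta)

def uav_step(x, y, d, theta_h):
    """Update the UAV position from flight distance d(t) and horizontal angle theta_h(t)."""
    d = min(d, V_MAX * TAU)                  # flight distance limited by the maximum speed
    x_new = x + d * math.cos(theta_h)        # X(t) = X(t-1) + d(t)*cos(theta_h(t))
    y_new = y + d * math.sin(theta_h)        # Y(t) = Y(t-1) + d(t)*sin(theta_h(t))
    if not (0.0 <= x_new <= X_MAX and 0.0 <= y_new <= Y_MAX):
        return x, y                          # cancel the action if it would leave the flight area
    return x_new, y_new

def random_task():
    """One computing task I_i(t) = {D_i(t), F_i(t)} with assumed value ranges."""
    D = random.uniform(1e5, 1e6)   # data amount D_i(t) in bits
    F = random.uniform(1e8, 1e9)   # required computation F_i(t) in CPU cycles
    return D, F
```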
Preferably, in step S2, obtaining the energy consumption of the computing task according to the offloading model of the mobile edge computing system comprises:
when the user equipment selects to offload the computation, i.e. α_i(t) = 1, the distance between the user equipment and the unmanned aerial vehicle on the horizontal plane at this moment is:
each user equipment is equipped with a single antenna, and in order to avoid interference between user equipments, a frequency-division multiple access protocol is adopted for offloading; since the flying height of the unmanned aerial vehicle is fixed, a free-space channel model is adopted, and the uplink rate during offloading computation is:
where B represents the average bandwidth of the communication channel, P_Tr represents the transmission power for user equipment data offloading, and ρ represents a transmission power coefficient;
the time overhead for the user equipment to transmit the computing task is:
the time overhead for the unmanned aerial vehicle to process the computing task is:
where f_U(t) represents the computing capability of the unmanned aerial vehicle;
the total time overhead when the user equipment selects to offload the computation is:
the energy consumption when the user equipment selects to offload the computation is:
where the corresponding term denotes the energy consumption of the ith user equipment when it selects to offload the computation.
Preferably, in step S2, obtaining the energy consumption of the computing task according to the offloading model of the mobile edge computing system further comprises:
when the user equipment selects local computation, i.e. α_i(t) = 0;
the time overhead for the user equipment to process the computing task is:
where the corresponding term represents the computing capability of the user equipment;
the power consumption of the user equipment is set to
the energy consumption when the user equipment selects local computation is:
where k_i is a first constant and v_i is a second constant.
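Under the power model stated above (power proportional to k_i and to a power v_i of the computing frequency), the local-computation cost can be sketched as follows; the symbol f_i for the computing capability of the user equipment and the numeric values are assumptions for illustration.

```python
def local_cost(F, f_i, k_i=1e-27, v_i=3):
    """Local computation time and energy for a task needing F CPU cycles.

    Assumes the common model: power = k_i * f_i**v_i and time = F / f_i,
    so energy = k_i * f_i**(v_i - 1) * F.  k_i, v_i and f_i are illustrative.
    """
    t_loc = F / f_i                  # time overhead of processing the task locally
    p_loc = k_i * f_i ** v_i         # power consumption of the user equipment
    e_loc = p_loc * t_loc            # energy = power * time
    return t_loc, e_loc

# Example: a task of 5e8 cycles on a 1 GHz device.
print(local_cost(5e8, 1e9))
```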
Preferably, in step S3, with the objective of minimizing the average energy consumption of the user equipments, an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling is established, specifically:
defining a set of flight actions
and a set of user equipment scheduling strategies
then the optimization problem P is expressed as:
where E_i(t) represents the energy consumption of the user equipment: when α_i(t) = 1 it is the energy consumption of offloading computation, and when α_i(t) = 0 it is the energy consumption of local computation; a further constraint indicates that the unmanned aerial vehicle serves at most K user equipments simultaneously; and α_i(t)S_i(t) ≤ R_max indicates that a user equipment selecting to offload computation must lie within the maximum coverage of the unmanned aerial vehicle.
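One possible way to evaluate the objective of problem P (the average energy consumption of the user equipments over the flight time) and its two constraints for a candidate schedule is sketched below; the array layout and the helper name are assumptions for illustration.

```python
def average_energy(alpha, E_off, E_loc, S, R_max, K):
    """Average user-equipment energy for one candidate schedule.

    alpha[t][i] in {0, 1}: selection strategy; E_off / E_loc[t][i]: offloading / local energy;
    S[t][i]: horizontal UAV-to-UE distance.  Returns None if a constraint is violated.
    """
    T, N = len(alpha), len(alpha[0])
    total = 0.0
    for t in range(T):
        if sum(alpha[t]) > K:                    # the UAV serves at most K UEs per slot
            return None
        for i in range(N):
            if alpha[t][i] and S[t][i] > R_max:  # alpha_i(t) * S_i(t) <= R_max
                return None
            total += E_off[t][i] if alpha[t][i] else E_loc[t][i]
    return total / (T * N)                       # average energy over slots and UEs
```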
Preferably, in step S4, the state space and the action space of the designed offloading model of the mobile edge computing system are specifically:
in the offloading model of the mobile edge computing system, the unmanned aerial vehicle and the user equipments are treated as an agent; in each time slot, the agent observes the current state s(t) from the environment, the current state s(t) corresponds to a current action a(t), and the unmanned aerial vehicle executes the current action a(t) in the action space, interacts with the environment, and returns the current reward r(t) and a new state s(t+1);
for the state space, in each time slot the position of each user equipment is fixed, and only the position information of the unmanned aerial vehicle needs to be considered; since the unmanned aerial vehicle needs to arrive at a specific destination at the end of each flight cycle, the distance between the unmanned aerial vehicle and the specific destination is denoted d'(t), and the current state expression in the state space is s(t) = {X(t), Y(t), h, d'(t)};
for the action space, the position coordinates [X(t+1), Y(t+1), h] of the unmanned aerial vehicle at the next moment are calculated according to the flight distance d(t) and the horizontal direction angle θ_h(t) of the unmanned aerial vehicle; together with the selection strategy of the user equipments, the current action expression in the action space is a(t) = {θ_h(t), d(t), α_i(t)}.
Preferably, in step S4, the designed reward function of the offloading model of the mobile edge computing system is specifically:
the reward function is used to evaluate the quality of the action taken by the agent in the current state, and is specifically:
r(t) = R_energy + R_des + P_out + P_speed
where r(t) represents the current reward, R_energy represents the return associated with the optimization problem, R_des represents the reward for the unmanned aerial vehicle flying back to the specific destination, with R_des = k/d'(t), k being the reward coefficient; P_out represents the penalty for the unmanned aerial vehicle flying out of the flight area, and P_speed represents the penalty for the unmanned aerial vehicle flying overspeed.
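The per-slot reward r(t) = R_energy + R_des + P_out + P_speed could be computed along the lines of the sketch below; the sign conventions, penalty magnitudes and the weight values are assumptions for illustration and are not prescribed by the invention.

```python
def reward(energy, d_des, out_of_area, overspeed, k=10.0,
           energy_weight=1.0, out_penalty=-50.0, speed_penalty=-50.0):
    """Per-slot reward: energy return + destination return + penalties (assumed values)."""
    r_energy = -energy_weight * energy            # lower UE energy gives a larger return
    r_des = k / max(d_des, 1e-6)                  # R_des = k / d'(t), larger when nearer the destination
    p_out = out_penalty if out_of_area else 0.0   # penalty for leaving the flight area
    p_speed = speed_penalty if overspeed else 0.0 # penalty for exceeding v_max
    return r_energy + r_des + p_out + p_speed
```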
Preferably, in step S5, the constructed deep neural network comprises an experience buffer, an Actor network, a first Critic network, a second Critic network, a first Critic target network and a second Critic target network;
in each time slot, the input of the Actor network is the current state s(t), and the corresponding current action a(t) is output, giving the current scheduling strategy π_φ; the inputs of the first Critic network and the second Critic network are both the current state s(t) and the current action a(t), and each outputs a Q value; after the unmanned aerial vehicle executes the current action a(t), a new state s(t+1) is generated and the current reward r(t) is obtained, and then [s(t), a(t), r(t), s(t+1)] is stored in the experience buffer; the first Critic target network and the second Critic target network serve as copies of the first Critic network and the second Critic network respectively, an objective function is set, and the smaller of the two Q values is selected to calculate a target value for updating the network parameters of the first Critic network and the second Critic network; at the end of the time slot, the network parameters of the Actor network and the Critic networks are updated in real time according to the current scheduling strategy, and random samples are drawn from the experience buffer to update the network parameters of the Critic target networks;
the loss function for an Actor network is:
the loss function of the first Critic network and the second Critic network is:
the objective function of the first and second Critic target networks is:
where φ represents the network parameters of the Actor network; θ_i represents the network parameters of the ith Critic network, which outputs the Q value of the ith Critic network (θ_1 and the first Critic network when i = 1, θ_2 and the second Critic network when i = 2); the new action used in the losses is computed according to the current scheduling strategy π_φ; the target value is computed as above, with α representing the entropy regularization coefficient; and the Q value of the ith Critic target network refers to the first Critic target network when i = 1 and to the second Critic target network when i = 2.
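The loss functions referenced above follow the standard SAC formulation; written out with the symbols defined here (the target value y(t), the Q notation Q_θi and the sampled action ã(t) are notation added for illustration, not the patent's exact formulas), they take the usual form:

```latex
% Standard SAC losses (a sketch under the symbol definitions above).
J_\pi(\phi) = \mathbb{E}\Big[\alpha \log \pi_\phi\big(\tilde a(t)\mid s(t)\big)
              - \min_{i=1,2} Q_{\theta_i}\big(s(t),\tilde a(t)\big)\Big],
\qquad \tilde a(t) \sim \pi_\phi(\cdot\mid s(t))

J_Q(\theta_i) = \mathbb{E}\Big[\tfrac12\big(Q_{\theta_i}(s(t),a(t)) - y(t)\big)^2\Big],\quad i=1,2

y(t) = r(t) + \gamma\Big(\min_{i=1,2} Q_{\theta_i'}\big(s(t{+}1),\tilde a(t{+}1)\big)
       - \alpha \log \pi_\phi\big(\tilde a(t{+}1)\mid s(t{+}1)\big)\Big)
```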
Preferably, the optimal scheduling strategy expression of the constructed deep neural network is as follows:
where π represents the optimal scheduling strategy, α represents the entropy regularization coefficient, π_φ represents the scheduling strategy, and γ represents the discount factor; H represents the entropy, calculated as H(π_φ(·|s(t))) = E[-log π_φ(·|s(t))].
The invention also provides a scheduling optimization system for unmanned aerial vehicle-assisted mobile edge computing, which comprises:
the model building module is used for building an offloading model of the mobile edge computing system, wherein the model comprises an unmanned aerial vehicle and a plurality of user equipments;
the energy consumption calculation module is used for obtaining the energy consumption of the computing tasks according to the offloading model of the mobile edge computing system;
the optimization problem establishing module is used for establishing an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling, with the objective of minimizing the average energy consumption of the user equipments;
the optimization problem transformation module is used for transforming the optimization problem into a Markov decision process and defining a state space, an action space and a return function of the offloading model of the mobile edge computing system;
the network construction and training module is used for constructing a deep neural network based on a deep reinforcement learning algorithm, and training the deep neural network by using the state space, the action space and the return function to obtain a trained deep neural network;
and the scheduling optimization module performs scheduling optimization by using the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the offloading model of the mobile edge computing system constructed by the invention comprises an unmanned aerial vehicle and a plurality of user equipments, and an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling is established based on the energy consumption for completing the computing tasks, with the objective of minimizing the average energy consumption of the user equipments; the optimization problem is non-convex and difficult to solve by traditional methods, so it is converted into a Markov decision process, and a state space, an action space and a return function of the offloading model of the mobile edge computing system are defined; the deep neural network constructed based on the SAC algorithm is trained by using the state space, the action space and the return function, and the trained deep neural network can be used for scheduling optimization to obtain an optimal scheduling strategy; the continuous actions of the unmanned aerial vehicle can thus be planned, a reasonable and accurate flight trajectory and user equipment selection strategy are obtained, the complexity is low, the convergence is strong, and the average computing energy consumption of the user equipments is reduced.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a scheduling optimization method for unmanned aerial vehicle-assisted mobile edge computing, as shown in fig. 1, including:
S1: constructing an offloading model of the mobile edge computing system, wherein the model comprises an unmanned aerial vehicle and a plurality of user equipments;
S2: obtaining the energy consumption of the computing tasks according to the offloading model of the mobile edge computing system;
S3: establishing an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling, with the objective of minimizing the average energy consumption of the user equipments;
S4: converting the optimization problem into a Markov decision process, and defining a state space, an action space and a return function of the offloading model of the mobile edge computing system;
S5: constructing a deep neural network based on the SAC algorithm, and training the deep neural network by using the state space, the action space and the return function to obtain a trained deep neural network;
S6: carrying out scheduling optimization by using the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments.
In a specific implementation process, an offloading model of the mobile edge computing system is constructed, in which a single unmanned aerial vehicle carrying a mobile edge computing (MEC) server flies within a specified area to provide edge computing for the user equipments; the energy consumption for completing each computing task is calculated according to the offloading model of the mobile edge computing system; an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling is established with the objective of minimizing the average energy consumption of the user equipments; the problem of planning the unmanned aerial vehicle flight trajectory and of each user equipment selecting offloading computation or local computation is converted into a Markov decision process, and a state space, an action space and a return function of the offloading model of the mobile edge computing system are defined; a deep neural network is constructed based on the SAC algorithm and trained by using the state space, the action space and the return function, and scheduling optimization is then performed with the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments, thereby solving the non-convex optimization problem, planning the continuous actions of the unmanned aerial vehicle, obtaining a reasonable and accurate flight trajectory and user equipment selection strategy, and reducing the average computing energy consumption of the user equipments.
Example 2
The embodiment provides a scheduling optimization method for unmanned aerial vehicle-assisted mobile edge computing, which comprises the following steps:
S1: constructing an offloading model of the mobile edge computing system, wherein the model comprises an unmanned aerial vehicle and a plurality of user equipments;
as shown in fig. 2, the offloading model of the mobile edge computing system includes a single unmanned aerial vehicle and N user equipments, the unmanned aerial vehicle simultaneously serving at most K user equipments, and each user equipment choosing either to compute its computing task locally or to offload the computing task to the unmanned aerial vehicle for computation; the length and the width of the flight area of the unmanned aerial vehicle are set to X_max and Y_max respectively, the unmanned aerial vehicle flies at a constant speed v(t) at a fixed height h, the antenna emission angle is θ, and the maximum flying speed is v_max; the flight time of the unmanned aerial vehicle is T time slots, the length of each time slot is τ, and the time for completing a computing task at any moment cannot exceed the maximum time delay T_max;
Let the coordinates of the unmanned aerial vehicle be [X(t), Y(t), h] and the coordinates of the user equipment be [x_i(t), y_i(t), 0], i ∈ {1, 2, …, N}; let the flight distance and the horizontal direction angle of the unmanned aerial vehicle at time t be d(t) and θ_h(t) respectively, then X(t) = X(t-1) + d(t)cos(θ_h(t)) and Y(t) = Y(t-1) + d(t)sin(θ_h(t)); the maximum coverage of the unmanned aerial vehicle is R_max = h·tan(θ), and the flying speed is v(t) = d(t)/τ;
Defining the computing task at time t as:
I_i(t) = {D_i(t), F_i(t)}
where D_i(t) represents the amount of data to be transmitted when the computing task at time t is offloaded, and F_i(t) represents the amount of computation required to complete the computing task at time t;
Define α_i(t) ∈ {0, 1} as the selection strategy of the user equipment: α_i(t) = 0 indicates that the computing task at time t is computed locally, and α_i(t) = 1 indicates that the computing task at time t is offloaded.
S2: obtaining the energy consumption of the computing tasks according to the offloading model of the mobile edge computing system;
when the user equipment selects to offload the computation, i.e. α_i(t) = 1, the distance between the user equipment and the unmanned aerial vehicle on the horizontal plane at this moment is:
each user equipment is equipped with a single antenna, and in order to avoid interference between user equipments, a frequency-division multiple access protocol is adopted for offloading; since the flying height of the unmanned aerial vehicle is fixed, a free-space channel model is adopted, and the uplink rate during offloading computation is:
where B represents the average bandwidth of the communication channel, P_Tr represents the transmission power for user equipment data offloading, and ρ represents a transmission power coefficient;
the time overhead for the user equipment to transmit the computing task is:
the time overhead for the unmanned aerial vehicle to process the computing task is:
where f_U(t) represents the computing capability of the unmanned aerial vehicle;
the total time overhead when the user equipment selects to offload the computation is:
the energy consumption when the user equipment selects to offload the computation is:
where the corresponding term denotes the energy consumption of the ith user equipment when it selects to offload the computation.
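The quantities involved in the offloading case (horizontal distance, uplink rate, transmission and processing time, offloading energy) can be sketched as follows under a free-space channel assumption; the exact rate expression, the noise power sigma2 and all numeric values are illustrative assumptions and not the invention's formulas.

```python
import math

def offload_cost(x_uav, y_uav, h, x_i, y_i, D, F, f_uav,
                 B=1e6, P_tr=0.1, rho=1e-4, sigma2=1e-13):
    """Offloading time and UE energy for one task (free-space channel sketch).

    Assumes an uplink rate of the form B*log2(1 + P_tr*rho / (sigma2*(S^2 + h^2)));
    B, P_tr, rho and sigma2 are illustrative values.
    """
    S = math.hypot(x_uav - x_i, y_uav - y_i)        # horizontal UAV-to-UE distance S_i(t)
    rate = B * math.log2(1 + P_tr * rho / (sigma2 * (S**2 + h**2)))
    t_tr = D / rate                                 # time to transmit the task data D_i(t)
    t_proc = F / f_uav                              # time for the UAV to process F_i(t) cycles
    t_off = t_tr + t_proc                           # total time overhead when offloading
    e_off = P_tr * t_tr                             # UE energy: only the transmission consumes UE power
    return t_off, e_off
```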
when the user equipment selects local computation, i.e. α_i(t) = 0;
the time overhead for the user equipment to process the computing task is:
where the corresponding term represents the computing capability of the user equipment;
the power consumption of the user equipment is set to
the energy consumption when the user equipment selects local computation is:
where k_i is a first constant and v_i is a second constant; in this embodiment, v_i is 3.
S3: establishing an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling, with the objective of minimizing the average energy consumption of the user equipments;
defining a set of flight actions
and a set of user equipment scheduling strategies
the optimization problem P is then expressed as:
where E_i(t) represents the energy consumption of the user equipment: when α_i(t) = 1 it is the energy consumption of offloading computation, and when α_i(t) = 0 it is the energy consumption of local computation; a further constraint indicates that the unmanned aerial vehicle serves at most K user equipments simultaneously; and α_i(t)S_i(t) ≤ R_max indicates that a user equipment selecting to offload computation must lie within the maximum coverage of the unmanned aerial vehicle.
S4: converting the optimization problem into a Markov decision process, and defining a state space, an action space and a return function of the offloading model of the mobile edge computing system;
in the offloading model of the mobile edge computing system, the unmanned aerial vehicle and the user equipments are treated as an agent; in each time slot, the agent observes the current state s(t) from the environment, the current state s(t) corresponds to a current action a(t), and the unmanned aerial vehicle executes the current action a(t) in the action space, interacts with the environment, and returns the current reward r(t) and a new state s(t+1);
for the state space, in each time slot the position of each user equipment is fixed, and only the position information of the unmanned aerial vehicle needs to be considered; since the unmanned aerial vehicle needs to arrive at a specific destination at the end of each flight cycle, the distance between the unmanned aerial vehicle and the specific destination is denoted d'(t), and the current state expression in the state space is s(t) = {X(t), Y(t), h, d'(t)}; the state space of this embodiment is 4-dimensional;
for the action space, the position coordinates [X(t+1), Y(t+1), h] of the unmanned aerial vehicle at the next moment are calculated according to the flight distance d(t) and the horizontal direction angle θ_h(t) of the unmanned aerial vehicle; together with the selection strategy of the user equipments, the current action expression in the action space is a(t) = {θ_h(t), d(t), α_i(t)}; the action space of this embodiment is (N+2)-dimensional.
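The 4-dimensional state and the (N+2)-dimensional action described above can be packed into vectors as in the sketch below; the destination coordinates and the thresholding of α_i(t) to {0, 1} are assumptions about one possible encoding.

```python
import math

def build_state(x, y, h, x_des, y_des):
    """State s(t) = {X(t), Y(t), h, d'(t)}: UAV position plus distance to the destination."""
    d_des = math.hypot(x_des - x, y_des - y)
    return [x, y, h, d_des]                              # 4-dimensional state

def split_action(a, N):
    """Action a(t) = {theta_h(t), d(t), alpha_1(t), ..., alpha_N(t)}: (N+2)-dimensional."""
    theta_h, d = a[0], a[1]
    alpha = [1 if v > 0.5 else 0 for v in a[2:2 + N]]    # threshold continuous outputs to {0, 1}
    return theta_h, d, alpha
```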
The reward function is used for evaluating the quality of the action taken by the agent in the current state, and specifically comprises the following steps:
r(t)=Rerengy+Rdes+Pout+Pspeed
wherein R (t) represents the current reward, RerengyRepresenting the return of the optimization problem, RdesIndicating a return of the drone to a particular destination, RdesK/d' (t), k being the reward factor; poutRepresents a penalty of the unmanned aerial vehicle flying out of the flight area, PspeedAnd the penalty of flying overspeed of the unmanned aerial vehicle is represented.
S5: constructing a deep neural network based on a SAC algorithm, and training the deep neural network by using a state space, an action space and a return function to obtain a trained deep neural network;
the SAC algorithm is an off-line random strategy algorithm based on a maximum entropy reinforcement learning framework and an Actor-Critic network, and is mainly characterized by entropy regularization, wherein entropy is a measure of strategy randomness, and the increase of entropy can bring more strategy exploration, and the expected return and the entropy value are balanced through training strategies, so that the network learning speed can be accelerated, and meanwhile, the strategy convergence to a local optimal solution is avoided; the purpose of the Actor network is to obtain the maximum return expectation and the maximum entropy, i.e. explore other strategies in the strategy space while successfully completing the task; the combination of the network update in an offline mode and the Actor-Critic network achieves good performance on a continuous control reference task, and is more stable and better in convergence.
The constructed deep neural network comprises an experience buffer area, an Actor network, a first Critic network, a second Critic network, a first Critic target network and a second Critic target network;
in each time slot, the input of the Actor network is the current state s (t), and the corresponding current action a (t) is output to obtain the current scheduling strategy piφ(ii) a The input of the first Critic network and the input of the second Critic network are both the current state s (t) and the current action a (t), and Q values are respectively output; after the unmanned plane executes the current action a (t), a new state s (t +1) is generated, and the current return r (t) is obtained, and then [ s (t), a (t), r (t), s (t +1) ]]Storing in an experience buffer; the first Critic target network and the second Critic target network are respectively used as copies of the first Critic network and the second Critic network, and target functions are setSelecting the smaller Q value of the two Q values to calculate a target value for updating the network parameters of the first Critic network and the second Critic network; when the time slot is finished, updating network parameters of the Actor network and the Critic network in real time according to the current scheduling strategy, and randomly sampling from an experience buffer area to update the network parameters of the Critic target network;
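One possible realization of the Actor, the two Critic networks and the two Critic target networks is sketched below with PyTorch; the hidden sizes, the squashed-Gaussian actor, the deep-copy initialization of the target networks and the value N = 40 (as in the later simulation example) are common SAC choices assumed for illustration.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Outputs a squashed-Gaussian policy pi_phi(a|s) over the (N+2)-dimensional action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.net(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                               # reparameterized sample
        a = torch.tanh(u)                                # squash the action to [-1, 1]
        log_prob = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1)
        return a, log_prob

class Critic(nn.Module):
    """Q network: maps a state-action pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

# Two Critic networks plus their target copies, as described above.
state_dim, action_dim = 4, 40 + 2
actor = Actor(state_dim, action_dim)
critic1, critic2 = Critic(state_dim, action_dim), Critic(state_dim, action_dim)
target1, target2 = copy.deepcopy(critic1), copy.deepcopy(critic2)
```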
the loss function for an Actor network is:
the loss function of the first Critic network and the second Critic network is:
the objective function of the first and second Critic target networks is:
where φ represents the network parameters of the Actor network; θ_i represents the network parameters of the ith Critic network, which outputs the Q value of the ith Critic network (θ_1 and the first Critic network when i = 1, θ_2 and the second Critic network when i = 2); the new action used in the losses is computed according to the current scheduling strategy π_φ; the target value is computed as above, with α representing the entropy regularization coefficient; and the Q value of the ith Critic target network refers to the first Critic target network when i = 1 and to the second Critic target network when i = 2;
S6: carrying out scheduling optimization by using the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments.
The optimal scheduling strategy expression is as follows:
where π represents the optimal scheduling strategy, α represents the entropy regularization coefficient, π_φ represents the scheduling strategy, and γ represents the discount factor; H represents the entropy, calculated as H(π_φ(·|s(t))) = E[-log π_φ(·|s(t))].
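Given the symbol definitions above, the expression is presumably the standard maximum-entropy objective of SAC, reproduced here as a sketch rather than as the patent's exact formula:

```latex
% Standard maximum-entropy objective (assumed form).
\pi = \arg\max_{\pi_\phi}\;
      \mathbb{E}\Big[\sum_{t} \gamma^{t}\,
      \big( r(t) + \alpha\, H\big(\pi_\phi(\cdot \mid s(t))\big) \big)\Big],
\qquad
H\big(\pi_\phi(\cdot\mid s(t))\big) = \mathbb{E}\big[-\log \pi_\phi(\cdot\mid s(t))\big]
```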
In a specific implementation process, each episode of the deep neural network constructed based on the SAC algorithm lasts from the moment the unmanned aerial vehicle departs from the starting point until it arrives at the destination or the maximum time T expires; before each episode starts, the starting position and the end position of the unmanned aerial vehicle are initialized, and the number of user equipments, namely the value of N, is randomly initialized. In the initial stage, the scheduling strategy is far from the optimal scheduling strategy, so the entropy regularization coefficient α is set to 1, making the agent explore more actions in the initial stage to prevent falling into a local optimal solution; α is updated while the network parameters are updated, and as the number of iterations increases the algorithm gradually converges to the optimal solution. As shown in fig. 3, in each time slot, the agent outputs an action a(t), namely the flight direction and distance of the unmanned aerial vehicle, according to the observed state information s(t), and the user equipment selects local computation or offloading computation; if the flight distance of the unmanned aerial vehicle is greater than the maximum distance d_max, d(t) is set to d_max; and if the next position of the unmanned aerial vehicle exceeds the specified area, the flight action is cancelled. The corresponding current reward r(t) and the state s(t+1) at the next moment are obtained according to the current action, and [s(t), a(t), r(t), s(t+1)] is stored in the experience buffer; at the end of each time slot, K groups of experiences are randomly sampled from the experience buffer to update the network parameters. The SAC algorithm comprises a parameterized Actor network that outputs the strategy π_φ(·|s(t)), namely the state information s(t) is input to the Actor network and the corresponding action a(t) ~ π_φ(·|s(t)) is output; in addition, there are two parameterized Critic networks, also called Q networks: the state information s(t) input to the Actor network and the corresponding obtained action a(t) are jointly input into the first Critic network and the second Critic network, which output their respective Q values, and the smaller of the two is selected to evaluate the performance of the Actor network and prevent overestimation. Here φ and θ_i respectively represent the parameters of the Actor network and the Critic networks. Similar to other DRL algorithms, the SAC algorithm also sets an experience buffer for training the deep neural network parameters, and also sets target networks and soft updates. The target networks are copies of the first and second Critic networks respectively, with target Q values, and θ'_i represents the parameters of the first and second Critic target networks. "Soft" update means that the parameters of the target networks are updated by slowly tracking the trained network parameters, i.e. φ' ← τφ + (1-τ)φ' and θ'_i ← τθ_i + (1-τ)θ'_i, where τ ≤ 1. The difference is that the actions used for updating the Actor network and the Critic networks come from the current policy and are not sampled from the experience buffer.
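The per-slot update and the soft target update described above could be organized as in the following sketch, continuing the PyTorch modules defined earlier; the mini-batch layout, the optimizer container and all hyper-parameters (including the fixed α, whereas the embodiment initializes α to 1 and updates it during training) are assumptions for illustration.

```python
import torch

def soft_update(target, source, tau=0.005):
    """theta'_i <- tau*theta_i + (1 - tau)*theta'_i for every tracked parameter."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1 - tau).add_(tau * sp.data)

def sac_update(batch, actor, critics, targets, optimizers, gamma=0.99, alpha=0.2):
    """One gradient step on a sampled mini-batch [s, a, r, s'] (illustrative sketch)."""
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in batch)
    with torch.no_grad():
        a2, logp2 = actor(s2)                                   # new action from the current policy
        q_next = torch.min(targets[0](s2, a2), targets[1](s2, a2))
        y = r + gamma * (q_next - alpha * logp2)                # target value
    for critic, opt in zip(critics, optimizers["critic"]):
        loss_q = ((critic(s, a) - y) ** 2).mean()               # Critic loss
        opt.zero_grad(); loss_q.backward(); opt.step()
    a_new, logp = actor(s)
    loss_pi = (alpha * logp - torch.min(critics[0](s, a_new),   # Actor loss: return plus entropy
                                        critics[1](s, a_new))).mean()
    optimizers["actor"].zero_grad(); loss_pi.backward(); optimizers["actor"].step()
    for tgt, crt in zip(targets, critics):
        soft_update(tgt, crt)                                    # soft update of the target Critics
```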
As shown in fig. 4, scheduling optimization is performed taking a single unmanned aerial vehicle serving 40 user equipments as an example, and the figure shows the trajectory of the unmanned aerial vehicle under different scheduling optimization methods; trajectory 1 is the flight trajectory obtained by the method of this embodiment, which jointly optimizes the unmanned aerial vehicle trajectory and the user equipment scheduling, trajectory 2 is the flight trajectory of the unmanned aerial vehicle when only the user equipment scheduling is optimized, and trajectory 3 is the flight trajectory of the unmanned aerial vehicle under random user equipment scheduling; the triangles represent trajectory 1, the diamonds represent trajectory 2, the squares represent trajectory 3, and trajectory 2 coincides with trajectory 3. As shown in fig. 5, the average energy consumption of the user equipments is compared for the three scheduling methods: the triangles represent the average energy consumption of the user equipments when jointly optimizing the unmanned aerial vehicle trajectory and the user equipment scheduling as provided by this embodiment, the circles represent the average energy consumption when only the user equipment scheduling is optimized, and the squares represent the average energy consumption under random user equipment scheduling; reflecting the different roles of different user equipments in reality, the sizes of the computing tasks are randomly generated in this embodiment, and the maximum number of user equipments served by the unmanned aerial vehicle is K = 3; as can be seen from the figures, the average energy consumption of the user equipments under the scheduling optimization method that jointly optimizes the unmanned aerial vehicle trajectory and the user equipment scheduling is much smaller than under the method that only optimizes the user equipment scheduling and under the random user equipment scheduling method.
Example 3
The present embodiment provides a scheduling optimization system for unmanned aerial vehicle-assisted mobile edge computing, as shown in fig. 6, including:
the model building module is used for building an offloading model of the mobile edge computing system, wherein the model comprises an unmanned aerial vehicle and a plurality of user equipments;
the energy consumption calculation module is used for obtaining the energy consumption of the computing tasks according to the offloading model of the mobile edge computing system;
the optimization problem establishing module is used for establishing an optimization problem jointly considering the unmanned aerial vehicle trajectory and the user equipment scheduling, with the objective of minimizing the average energy consumption of the user equipments;
the optimization problem transformation module is used for transforming the optimization problem into a Markov decision process and defining a state space, an action space and a return function of the offloading model of the mobile edge computing system;
the network construction and training module is used for constructing a deep neural network based on a deep reinforcement learning algorithm, and training the deep neural network by using the state space, the action space and the return function to obtain a trained deep neural network;
and the scheduling optimization module performs scheduling optimization by using the trained deep neural network to obtain an optimal scheduling strategy, namely the flight trajectory of the unmanned aerial vehicle and the selection strategy of the user equipments.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.