CN114698125A

CN114698125A - Method, device and system for optimizing computation offload of mobile edge computing network

Info

Publication number: CN114698125A
Application number: CN202210619336.0A
Authority: CN
Inventors: 魏楚元; 何航; 任涛
Original assignee: Beijing University of Civil Engineering and Architecture
Current assignee: Beijing University of Civil Engineering and Architecture
Priority date: 2022-06-02
Filing date: 2022-06-02
Publication date: 2022-07-01

Abstract

The invention provides a calculation unloading optimization method, a device and a system of a mobile edge computing network, which are based on a distributed execution-centralized unloading framework of deep reinforcement learning, reduce the complexity of calculating time for solving the original target optimization problem, and avoid dimension disasters possibly faced by a traditional numerical optimization algorithm in a large-scale heterogeneous mobile edge computing network; by defining a loss function, an advantage function and a multi-agent reinforcement learning algorithm, the data sampling efficiency and the model training speed are improved, the average system cost in a network is reduced, and the service quality of calculation-intensive application is improved.

Description

Method, device and system for optimizing computation offload of mobile edge computing network

Technical Field

The invention relates to the technical field of edge computing, in particular to a computing unloading optimization method, a computing unloading optimization device and a computing unloading optimization system of a mobile edge computing network.

Background

With the explosion of computationally intensive mobile applications such as online gaming, autopilot, virtual reality, etc., it is becoming more and more imperative that mobile devices provide low-latency services for these applications. However, mobile devices typically have very limited computational resources and energy reserves, which present significant challenges to meeting the latency and computational requirements of applications. Thanks to the advent of 5G technology, mobile edge computing is considered a promising technology to address the above challenges by offloading computationally intensive-delay sensitive tasks to nearby edge nodes. However, in a conventional MEC (Mobile Edge Computing) system, an Edge server is usually deployed on a ground base station at a fixed location, which is high in deployment cost and poor in flexibility, and is not suitable for accessing scenes with dynamically changing requirements, such as event relay, traffic management, emergency rescue, and the like. Therefore, a heterogeneous mobile edge computing network facing ground vehicles and unmanned aerial vehicle assistance is receiving more and more attention from both academic and industrial circles.

Due to the high flexibility of ground vehicles and unmanned aerial vehicles and the convenience in deployment, the heterogeneous mobile edge computing network can adapt to a rapidly changing network environment and provide services for hot spots or emergency rescue activities as required. However, the requirements of high flexibility and dynamic change also bring difficult problems such as real-time decision, large-scale user association, and resource allocation under strict scheduling constraints to the heterogeneous mobile edge computing network.

In the existing research and invention, part of the method is based on the traditional numerical optimization method, for example, a convex optimization method and a heuristic search algorithm are adopted to solve the problems of task unloading and resource allocation in the multi-server mobile edge computing network, and the computing rate of the wireless power supply mobile edge computing network is maximized based on a coordinate reduction method; another part is a deep learning based approach, such as online incremental learning based on a deep neural network to solve the computational offloading and resource management problems of a dynamic heterogeneous mobile edge computing network.

Although the approximate solution can be obtained based on the traditional numerical optimization method, a large amount of iterations are usually needed to obtain a more ideal local optimal solution, the calculation complexity of problem solution is higher, and the method is not suitable for a dynamically changing environment. Most methods based on deep learning have low data sampling efficiency and the problem of low model convergence speed.

Disclosure of Invention

In order to solve the above problem, an embodiment of the present invention provides a computation offload optimization method for a mobile edge computing network, where the mobile edge computing network includes a ground vehicle and an unmanned aerial vehicle, and the method includes: constructing a system model of the moving edge computing network, and determining an optimization objective function of the model based on minimizing an average system cost; converting the optimization objective function based on the average system cost minimization into an optimization objective function based on the average reward maximization according to the state, action and reward elements of a Markov decision model; determining a distributed execution and centralized training framework of multi-agent deep reinforcement learning, and determining a loss function and an advantage function of training; training of the system model is performed according to a multi-agent reinforcement learning algorithm.

Optionally, the building a system model of the mobile edge computing network includes: establishing a network model comprising a plurality of ground vehicles, unmanned aerial vehicles and mobile equipment; establishing a communication model according to the network model, wherein the communication model comprises a mobile device-ground vehicle channel model and a mobile device-unmanned aerial vehicle channel model; and establishing a calculation model according to the communication model, wherein the calculation model comprises the calculation of local calculation cost, ground vehicle edge calculation cost and unmanned aerial vehicle edge calculation cost.

Optionally, the determining an optimization objective function of the model based on average system cost minimization comprises: determining an average system cost of all mobile devices in a plurality of time slices according to the local calculated cost, the ground vehicle edge calculated cost and the unmanned aerial vehicle edge calculated cost; and simultaneously unloading decision variables of the mobile equipment, so that the average system cost is minimum to obtain an optimized objective function.

Optionally, the transforming the optimization objective function based on the average system cost reduction into the optimization objective function based on the average reward maximization according to the state, action and reward elements of the markov decision model includes: determining the track of the mobile equipment in a plurality of time slices according to the state, action and reward elements of the Markov decision model, and calculating the probability of the track and the total reward; the state comprises task information, channel state and electric quantity information of the mobile equipment, and the action comprises unloading indication, transmission power and distributed computing capacity of the mobile equipment; calculating an average reward according to the probability of the track occurrence and the total reward, and determining an optimization objective function based on the maximization of the average reward.

Optionally, the determining a distributed execution and centralized training framework of multi-agent deep reinforcement learning, and determining a loss function and a merit function of training, comprises: constructing a distributed execution and centralized training framework of multi-agent deep reinforcement learning based on an Actor-Critic algorithm; determining a merit function using the generalized merit estimate in place of the total reward, and determining a penalty function using the offline policy in place of the online policy.

Optionally, the performing training of the system model according to a multi-agent reinforcement learning algorithm comprises: each mobile device interacts with the mobile edge computing network based on the observed local state to generate batch learning experience; training a sharing strategy based on the batch learning experience according to the generalized advantage estimation and the importance sampling; and each mobile device shares the sharing strategy to interact with the mobile edge computing network.

The embodiment of the invention provides a calculation unloading optimization device of a mobile edge calculation network, wherein the mobile edge calculation network comprises a ground vehicle and an unmanned aerial vehicle, and the device comprises: a model construction module for constructing a system model of the moving edge computing network and determining an optimization objective function of the model based on an average system cost minimization; the Markov decision conversion module is used for converting the optimization objective function based on the average system cost reduction into the optimization objective function based on the average reward maximization according to the state, the action and the reward elements of the Markov decision model; the system comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining a distributed execution and centralized training framework of multi-agent deep reinforcement learning and determining a loss function and an advantage function of training; a training module for performing training of the system model according to a multi-agent reinforcement learning algorithm.

The embodiment of the invention provides a computing unloading optimization system of a mobile edge computing network, which is used for executing the computing unloading optimization method of the mobile edge computing network.

The embodiment of the invention is based on a distributed execution-centralized unloading framework of deep reinforcement learning, reduces the complexity of computing time for solving the original target optimization problem, and avoids dimension disasters possibly faced by the traditional numerical optimization algorithm in a large-scale heterogeneous mobile edge computing network; by defining a loss function, an advantage function and a multi-agent reinforcement learning algorithm, the data sampling efficiency and the model training speed are improved, the average system cost in a network is reduced, and the service quality of calculation-intensive application is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flowchart of a computation offload optimization method for a mobile edge computing network according to an embodiment of the present invention;

FIG. 2 is a system model diagram of a heterogeneous mobile edge computing network according to an embodiment of the present invention;

FIG. 3 is a diagram of a distributed execution-centralized training framework provided by an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a computation offload optimization apparatus of a mobile edge computing network according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In order to solve the problems of task unloading and resource allocation of a ground vehicle and unmanned aerial vehicle assisted heterogeneous mobile edge computing network, the embodiment of the invention provides a computation unloading optimization algorithm based on deep reinforcement learning aiming at a multi-user multi-edge node scene.

Referring to fig. 1, a flow diagram of a computation offload optimization method for a mobile edge computing network is shown, where the mobile edge computing network includes a ground vehicle and an unmanned aerial vehicle, and the method includes the following steps:

s102, a system model of the mobile edge computing network is constructed, and an optimization objective function of the model based on average system cost minimization is determined.

The mobile edge computing network comprises a plurality of ground vehicles, unmanned planes and mobile equipment.

Illustratively, a system model of a mobile edge computing network may be constructed in the following manner, including: firstly, establishing a network model comprising a plurality of ground vehicles, unmanned aerial vehicles and mobile equipment; secondly, establishing a communication model according to the network model, wherein the communication model can comprise a mobile device-ground vehicle channel model and a mobile device-unmanned aerial vehicle channel model, and each channel model comprises channel gain and unloading transmission rate; then, a calculation model is established according to the communication model, wherein the calculation model can comprise the calculation of local calculation cost, ground vehicle edge calculation cost and unmanned aerial vehicle edge calculation cost, and each calculation cost comprises the delay and the energy consumption for executing tasks.

Illustratively, the optimization objective function of the above model based on the minimization of the average system cost may be determined in the following manner, including: firstly, determining the average system cost of all mobile equipment in a plurality of time slices according to the local calculation cost, the ground vehicle edge calculation cost and the unmanned aerial vehicle edge calculation cost; and secondly, unloading decision variables of the mobile equipment are connected, so that the average system cost is minimized to obtain an optimized objective function. The offload decision variables include: variables of the execution position of the decision task (local, ground vehicle and unmanned aerial vehicle), the transmitting power and the computing resources of the local, ground vehicle and unmanned aerial vehicle.

And S104, converting the optimization objective function based on the average system cost reduction into the optimization objective function based on the average reward maximization according to the state, the action and the reward elements of the Markov decision model.

Three elements, namely states, actions and rewards based on a Markov decision model are defined for the optimization problem in the steps and are converted into an optimization objective function based on the maximization of average rewards.

Due to the high coupling between offload decision variables and several limiting factors in the system, the optimization problem is an NP-hard (non-deterministic polynomial) problem, and thus the conventional numerical optimization method usually faces problems of high computational time complexity and dimensional disaster.

In order to avoid the above problems, the embodiment of the present invention defines three elements based on the markov decision model as follows: status, action, reward. Specifically, the state of each mobile device includes its task information, channel state, and power information; the actions of each mobile device include offloading indications, transmission power, and allocated computing power; the average system cost is minimized by optimizing the unloading decision and the resource allocation in the embodiment of the invention.

Based on this, the above steps may include: determining the tracks of the mobile equipment in a plurality of time slices according to the state, the action and the reward elements of the Markov decision model, and calculating the probability of track occurrence and the total reward; then, an average reward is calculated according to the probability of the track occurrence and the total reward, and an optimization objective function based on the maximization of the average reward is determined.

And S106, determining a distributed execution and centralized training framework of the multi-agent deep reinforcement learning, and determining a loss function and an advantage function of the training.

In the embodiment of the invention, a distributed execution and centralized training framework based on deep reinforcement learning is designed, and a loss function and an advantage function of training are determined. And aiming at the optimization problem based on the maximized average reward, deep reinforcement learning is adopted to train the network model. In view of the needs of large-scale user association and real-time decision making, the embodiment of the invention can adopt an Actor-Critic (Actor-evaluator) -based algorithm to build a distributed execution and centralized training framework for computation offloading and resource allocation scheduling.

The embodiment of the invention uses the deep reinforcement learning algorithm suitable for the multi-agent, and greatly reduces the computational complexity of problem solving through distributed execution and centralized training.

Optionally, a distributed execution and centralized training framework of multi-agent deep reinforcement learning can be built based on an Actor-Critic algorithm; then, a merit function is determined using the generalized merit estimate in place of the total reward, and a penalty function is determined using the offline policy in place of the online policy. The Actor may be responsible for generating actions and interacting with the environment based on an offline policy function; critic uses a device responsible for assessing the performance of Actor and directing the action of Actor at the next stage.

And S108, training the system model according to the multi-agent reinforcement learning algorithm.

Based on the above steps, the embodiment of the present invention can be implemented by using a reinforcement learning method suitable for Multi-agents, such as shared Multi-Agent proximity Policy Optimization (smapp), Multi-Agent Deep Deterministic Policy Gradient algorithm (maddppg), QMix algorithm, and the like.

Illustratively, the embodiment of the present invention employs SMAPPO based on centralized training and distributed execution framework, and the whole framework can be divided into three parts: distributed execution, data collection, and centralized training.

Based on the centralized training and distributed execution framework described above, the training process may be performed in the following manner: each mobile device interacts with a mobile edge computing network based on the observed local state to generate batch learning experience; training a sharing strategy based on batch learning experience according to generalized advantage estimation and importance sampling; and each mobile device shares the sharing strategy to interact with the mobile edge computing network. The utilization rate of experience and the convergence speed of the model can be further improved by using a near-end optimization algorithm introducing a merit function and importance sampling.

The calculation unloading optimization method of the mobile edge computing network provided by the embodiment of the invention is based on a distributed execution-centralized unloading framework of deep reinforcement learning, reduces the complexity of calculating time for solving the original target optimization problem, and avoids dimension disasters possibly faced by a traditional numerical optimization algorithm in a large-scale heterogeneous mobile edge computing network; by defining a loss function, an advantage function and a multi-agent reinforcement learning algorithm, the data sampling efficiency and the model training speed are improved, the average system cost in a network is reduced, and the service quality of calculation-intensive application is improved.

Furthermore, the method can solve the problems of computation unloading and resource allocation in the heterogeneous mobile edge computing network assisted by ground vehicles and unmanned aerial vehicles, and through the distributed execution-centralized unloading framework based on deep reinforcement learning in the step S104 and the step S106, the computation time complexity for solving the original target optimization problem is reduced, and the dimension disaster possibly faced by the traditional numerical optimization algorithm in the large-scale heterogeneous mobile edge computing network is avoided; by means of the loss function and the advantage function defined in the step S106, the importance sampling introduced in the step S108 and other methods, the data sampling efficiency and the model training speed are improved, the average system cost in the network is greatly reduced, and the service quality of the calculation-intensive application is improved.

Exemplary processes of the above steps are described in detail below.

(1) And constructing a system model of the ground vehicle and unmanned aerial vehicle assisted heterogeneous mobile edge computing network and giving an optimization objective function based on average system cost minimization.

The method comprises the following steps of constructing a system model of a ground vehicle and unmanned aerial vehicle assisted heterogeneous mobile edge computing network:

1. establishing a network model

Referring to fig. 2, a system model diagram of a heterogeneous mobile edge computing network is shown, wherein the system model diagram is included in a vehicle-oriented and unmanned aerial vehicle-oriented auxiliary mobile edge computing networkMA mobile device,VGround vehicle andUand erecting an unmanned aerial vehicle. The ground vehicle and the drone may be represented as a set, respectivelyV={1,2,…, VThe sum setU={1,2,…, U}. The mobile devices are randomly distributed over the ground and may be represented as a setM={1,2,…, M}. The overall system time is divided equally intoNA time slice, expressed as a setN={1,2,…, N}. Mobile deviceiIn time slicenWill randomly generate a task

Expressed as:

in the formula (I), the compound is shown in the specification,

represents the input data size;

indicating the number of clock cycles required to complete a 1-bit task;

indicating completion of a task

Is determined.

In this embodiment, a full offloading strategy is adopted, i.e. the generated tasks are either executed locally on the mobile device or completely offloaded to an edge node (i.e. a ground vehicle or a drone) for remote execution. Mobile deviceiIn time slicenMay be made of variables

Expressed, defined as follows:

in the formula (I), the compound is shown in the specification,

representing a local computation;

representing a ground vehicle edge calculation;

representing unmanned aerial vehicle edge calculations.

2. Establishing a communication model

1) Mobile device-ground vehicle channel model

Mobile deviceiAnd a land vehiclejIn time slicenThe channel gain of (d) is expressed as:

,

in the formula (I), the compound is shown in the specification,

representing mobile devicesiAnd a land vehiclejIn time slicenThe distance between, expressed as:

,

in the formula (I), the compound is shown in the specification,

representing mobile devicesiIn time slicenAt the position of (1) in coordinates of

;

Indicating land vehiclesjIn time slicenAt the position of (1) in coordinates of

。

According to the Shannon formula, the mobile deviceiAnd a land vehiclejIn time slicenThe offload transfer rate in between can be expressed as:

,

in the formula (I), the compound is shown in the specification,

representing mobile devicesiIn time slicenThe transmit power of (a);W _jrepresenting mobile devicesiAnd a land vehiclejThe bandwidth of the channel in between the two,σ ²representing the amount of noise power in the channel.

2) Mobile device-unmanned aerial vehicle channel model

Mobile deviceiAnd unmanned aerial vehiclekIn time slicenThe channel gain of (d) is expressed as:

in the formula (I), the compound is shown in the specification,ζ _LOSandζ _NLOSrespectively represent the excess loss of line-of-sight and non-line-of-sight links,

representing mobile devicesiAnd unmanned aerial vehiclekIn time slicenIs calculated as:

,

in the formula (I), the compound is shown in the specification,

indicating unmanned aerial vehiclekIn time slicenOf the position of (a).

According to the Shannon formula, the mobile deviceiAnd unmanned aerial vehiclekIn time slicenThe offload transfer rate in between can be expressed as:

,

in the formula (I), the compound is shown in the specification,

representing mobile devicesiAnd unmanned aerial vehiclekThe channel bandwidth in between.

3. Building a computational model

1) Local calculation: when in use

The task is executed locally. The latency of executing a task locally is expressed as:

，

in the formula (I), the compound is shown in the specification,

is a mobile deviceiIn time slicenThe local computing resources of (a). The delay should satisfy the following condition:

.

accordingly, the locally calculated energy consumption may be expressed as:

，

in the formula (I), the compound is shown in the specification,κeffective switched capacitance depending on chip architecture, ζ represents the energy consumption index, in accordance withExperience typically takes ζ = 3. In summary, the locally computed weighted cost can be expressed as:

,

in the formula (I), the compound is shown in the specification,

and

respectively representing locally calculated delay weight and energy consumption weight, and having

,

，

。

2) Ground vehicle edge calculation: when in use

Off-loading of tasks to ground vehiclesjIs executed. Offloading tasks to ground vehiclesjThe transmission delay of (d) may be expressed as:

,

accordingly, the transmission energy consumption can be expressed as:

,

vehicle for working on groundjThe calculated delay of (d) can be expressed as:

.

accordingly, the calculated energy consumption may be expressed as:

.

in the formula (I), the compound is shown in the specification,

indicating land vehiclesjThe operating power of (c). In overview, the weighted cost computed for the edge of the ground vehicle can be expressed as:

,

in the formula (I), the compound is shown in the specification,

and

respectively representing delay weight and energy consumption weight calculated at edge of ground vehicle, and having

,

，

。

3) And (3) unmanned plane edge calculation: when in use

With tasks off-loaded to the dronekIs executed. Offloading tasks to unmanned aerial vehiclekThe transmission delay of (d) can be expressed as:

,

accordingly, the transmission energy consumption can be expressed as:

,

task is at unmanned aerial vehiclekThe calculated delay of (d) can be expressed as:

.

accordingly, the calculated energy consumption may be expressed as:

.

in the formula (I), the compound is shown in the specification,

indicating unmanned aerial vehiclekThe operating power of (c). To sum up, the weighted cost computed by the drone edge may be expressed as:

,

in the formula (I), the compound is shown in the specification,

and

respectively represent the delay weight and the energy consumption weight of the edge calculation of the unmanned plane, and have

,

，

。

From the above, the mobile deviceiIn time slicenThe system cost of (a) may be expressed as:

thus, all mobile devices are inNThe average system cost in a time slice can be expressed as:

based on the established system model, unloading decision variables of all mobile devices are optimized in a combined mode

Minimizing the average system cost of the ground vehicle and drone assisted mobile edge computing network, thus said optimizing the objective functionPComprises the following steps:

.

in the formula (I), the compound is shown in the specification,C1 is an unload index constraint;C2 is the transmission power constraint;C3、C4 andC5 represent the assigned computational capability constraints of the mobile device, the ground vehicle and the drone, respectively;C6、C7、C8 means that the delay to complete a task should not be greater than its maximum tolerable delay;C9 denotes that the total energy consumption of the mobile device should be less than the maximum available energy budget of the mobile device from the start time to the current time;C10 indicates that the total energy consumption of the ground vehicle from the start time to the current time should be within the maximum available energy budget for the ground vehicle;C11 means that the total energy consumption from the start time to the current time after the drone is unloaded should not be greater than the maximum available energy budget for the drone.

And 2, defining three elements, namely states, actions and rewards based on a Markov decision model for the optimization problem in the step 1, and converting the three elements into an optimization objective function based on average reward maximization.

Due to offloading decision variables

The optimization problem is an NP-hard problem, and thus, the conventional numerical optimization method usually faces problems of high computational time complexity and dimensional disaster. In order to avoid the above problems, the present invention defines three elements based on a markov decision model as follows:

status. The status of each mobile device includes its task information, channel status and power information. Thus, the mobile deviceiIn time slicenThe state of (c) can be expressed as:

in the formula (I), the compound is shown in the specification,

representing mobile devicesiIn time slicenThe current remaining capacity of electricity.

And (6) acting. The actions of each mobile device include offloading indications, transmission power, and allocated computing power. Mobile deviceiIn time slicenThe actions of (a) may be expressed as:

and (6) awarding. The average system cost is minimized by optimizing the offloading decisions and resource allocation. Thus, the mobile deviceiIn time slicenThe reward of (a) may be expressed as:

in the formula (I), the compound is shown in the specification,

representing mobile devicesiIn time slicenWeighted cost of (2).

Based on the definition of the three elements in the Markov decision model, the mobile deviceiIn thatNThe trajectory of each time slice may be represented as:

accordingly, the probability of a track occurrence and the total reward may be expressed as:

in the formula (I), the compound is shown in the specification,θis a network parameter of the Actor,

indicating a state

The probability of occurrence.

The average reward may be expressed as:

thus, the original optimization problem can be translated into an optimization objective function based on maximizing the average reward, as follows:

and 3, designing a distributed execution and centralized training framework based on deep reinforcement learning aiming at the Markov decision problem in the step 2, and determining a loss function and an advantage function of training.

For problem P1, the present embodiment employs deep reinforcement learning to train the network model. In view of the needs of large-scale user association and real-time decision, a distributed execution and centralized training framework is built for computation unloading and resource allocation scheduling by adopting an Actor-Critic-based algorithm. For the above optimization problem, the gradient of the objective function can be expressed as:

=

=

wherein the content of the first and second substances,Bis the small batch size per sample. To add a benchmark and add a suitable confidence level, the present embodiment introduces a generalized dominance estimate instead of the total reward. The merit function is defined as follows:

wherein the content of the first and second substances,

indicating a state

In the form of a desired reward for the user,γa discount factor that represents a future reward,

mobile deviceiIn time slicen ^’The prize of (1). Thus, the gradient

It can become:

in order to improve the efficiency of data sampling, we choose to omit the replacement online policy using the offline policy, and the loss function of the Actor can be expressed as:

wherein, the first and the second end of the pipe are connected with each other,θ ^’is an Actor network parameter on each mobile device,θis the Actor network parameter to be trained,εrepresenting the clipping factor (a fraction between 0 and 1),cliprepresenting the clipping function, the function is defined as follows:

the calculation formula of (c) is as follows:

furthermore, the loss function of Critic can be expressed as:

and 4, defining a shared multi-agent near-end strategy optimization algorithm and an execution process aiming at the distributed execution and centralized training framework in the step 3.

On the basis of step 3, a shared multi-agent near-end optimization algorithm based on centralized training and distributed execution frameworks is provided.

Referring to the schematic diagram of the distributed execution-centralized training framework shown in fig. 3, the entire framework can be divided into three parts from bottom to top: distributed execution, data collection, and centralized training.

(1) Firstly, each user equipment interacts with a heterogeneous mobile edge computing network based on local states observed by the user equipment, and batch learning experience is generated.

(2) These learning experiences are then used to train a shared strategy and value function by employing generalized dominance estimates and importance sampling.

(3) Finally, each mobile device shares the trained policy continuation and context interactions.

Illustratively, the shared multi-agent near-end policy optimization is performed as follows:

1: initializing ActorπAnd CriticVUsage parameterθ ^’ ←θAndΦ ^’ ←Φ(ii) a And initializing an experience pool.

2: for the rounde=1ToERepeatedly perform

3: for time slicen=1ToNRepeatedly perform

4: for mobile devicesi=1ToMRepeatedly perform

5: interacting with the environment, storing experience tuples

To experience pool

6: end the cycle

7: end the cycle

8: for the number of updatest=1ToTRepeatedly perform

9: for the number of sampling timess=1ToSRepeatedly perform

10: random selectionBA tuple of experiences

11: computing dominant function, computing Actor loss and criticic loss

12: gradient computation with gradient descent by Adam optimizer▽θAnd▽Φ;

13: update ActorπAnd CriticVUsage parameterθ ^’ ←θAndΦ ^’ ←Φ;

14: end the cycle

15: end the cycle

16: end the cycle

The embodiment of the invention can solve the problems of computation unloading and resource allocation in a ground vehicle and unmanned aerial vehicle assisted heterogeneous mobile edge computing network, and through the distributed execution-centralized unloading framework based on deep reinforcement learning in the step 2 and the step 3, the computation time complexity for solving the original target optimization problem is reduced, and the dimension disaster possibly faced by the traditional numerical optimization algorithm in a large-scale heterogeneous mobile edge computing network is avoided. By means of the loss function and the advantage function defined in the step 3, the importance sampling method introduced in the step 4 and the like, the data sampling efficiency and the model training speed are improved, the average system cost in the network is greatly reduced, and the service quality of the calculation intensive application is improved.

Fig. 4 is a schematic structural diagram of a computation offload optimization apparatus of a mobile edge computing network in an embodiment of the present invention, where the mobile edge computing network includes a ground vehicle and an unmanned aerial vehicle, and the apparatus includes:

a model building module 401, configured to build a system model of the moving edge computing network, and determine an optimization objective function of the model based on an average system cost minimization;

a markov decision transformation module 402, configured to transform the optimization objective function based on average system cost reduction into an optimization objective function based on average reward maximization according to the state, action and reward elements of a markov decision model;

a determining module 403, configured to determine a distributed execution and centralized training framework of multi-agent deep reinforcement learning, and determine a loss function and an advantage function of training;

a training module 404 for performing training of the system model according to a multi-agent reinforcement learning algorithm.

Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by instructing a control device to implement the methods, and the programs may be stored in a computer-readable storage medium, and when executed, the programs may include the processes of the above method embodiments, where the storage medium may be a memory, a magnetic disk, an optical disk, and the like.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for optimizing computation offload of a mobile edge computing network, wherein the mobile edge computing network comprises a ground vehicle and an unmanned aerial vehicle, and the method comprises the following steps:

constructing a system model of the moving edge computing network, and determining an optimization objective function of the model based on minimizing an average system cost;

converting the optimization objective function based on the average system cost minimization into an optimization objective function based on the average reward maximization according to the state, action and reward elements of a Markov decision model;

determining a distributed execution and centralized training framework of multi-agent deep reinforcement learning, and determining a loss function and an advantage function of training;

performing training of the system model according to a multi-agent reinforcement learning algorithm.

2. The method of claim 1, wherein constructing the system model of the mobile edge computing network comprises:

establishing a network model comprising a plurality of ground vehicles, unmanned aerial vehicles and mobile equipment;

establishing a communication model according to the network model, wherein the communication model comprises a mobile device-ground vehicle channel model and a mobile device-unmanned aerial vehicle channel model;

and establishing a calculation model according to the communication model, wherein the calculation model comprises the calculation of local calculation cost, ground vehicle edge calculation cost and unmanned aerial vehicle edge calculation cost.

3. The method of claim 2, wherein determining an optimization objective function for the model based on an average system cost minimization comprises:

determining an average system cost of all mobile devices in a plurality of time slices according to the local calculated cost, the ground vehicle edge calculated cost and the unmanned aerial vehicle edge calculated cost;

and simultaneously unloading decision variables of the mobile equipment, so that the average system cost is minimum to obtain an optimized objective function.

4. The method of claim 1, wherein transforming the optimization objective function based on mean system cost reduction into an optimization objective function based on mean reward maximization according to state, action and reward elements of a Markov decision model comprises:

determining the track of the mobile equipment in a plurality of time slices according to the state, action and reward elements of the Markov decision model, and calculating the probability of the track and the total reward; the state comprises task information, channel state and electric quantity information of the mobile equipment, and the action comprises unloading indication, transmission power and distributed computing capacity of the mobile equipment;

calculating an average reward according to the probability of the track occurrence and the total reward, and determining an optimization objective function based on the maximization of the average reward.

5. The method of any of claims 1-4, wherein determining a distributed execution and centralized training framework for multi-agent deep reinforcement learning, and determining a loss function and a merit function for training, comprises:

constructing a distributed execution and centralized training framework of multi-agent deep reinforcement learning based on an Actor-Critic algorithm;

determining a merit function using the generalized merit estimate in place of the total reward, and determining a penalty function using the offline policy in place of the online policy.

6. The method of any one of claims 1-4, wherein said performing training of said system model according to a multi-agent reinforcement learning algorithm comprises:

each mobile device interacts with the mobile edge computing network based on the observed local state to generate batch learning experience;

training a sharing strategy based on the batch learning experience according to generalized advantage estimation and importance sampling;

and each mobile device shares the sharing strategy to interact with the mobile edge computing network.

7. The method of claim 4, wherein the mobile device is a mobile deviceiIn time slicenState of (1)Expressed as:

wherein the content of the first and second substances,

which is indicative of the size of the input data,

indicating the number of clock cycles required to complete a 1-bit task,

indicating completion of a task

Is determined by the maximum allowable delay of the delay,

representing mobile devicesiIn time slicenThe current amount of remaining power of the battery,

representing mobile devices

And a land vehicle

In time slice

The channel gain of (a) is determined,

representing mobile devices

And unmanned aerial vehiclekIn time slice

The channel gain of (a);

mobile deviceiIn time slicenIs represented as:

wherein the content of the first and second substances,

representing mobile devices

In time slice

The load-shedding decision variable of (a) is,

representing mobile devices

In time slice

The transmission power of (a) is set,

representing mobile devices

In time slice

The local computing resources of (a) are,

representing mobile devices

In time slice

The ground vehicle computing resources of (a) are,

representing mobile devices

In time slice

The unmanned aerial vehicle computing resources of (1);

mobile deviceiIn time slicenThe reward of (a) is expressed as:

wherein the content of the first and second substances,

for mobile devices

In time slice

The system cost of (a);

mobile deviceiIn thatNThe trace for each time slice is represented as:

the probability of occurrence of a track and the total reward are expressed as:

wherein, the first and the second end of the pipe are connected with each other,

indicating a state

The probability of the occurrence of the event is,

a network parameter representing Actor;

the average reward is expressed as:

wherein the content of the first and second substances,Eindicates a desire;

the optimization objective function based on maximizing the average reward is expressed as:

。

8. the method of claim 7, wherein the mobile device is a mobile deviceiIn time slicenThe state of (a) is represented as:

the merit function is expressed as:

wherein the content of the first and second substances,

indicating a state

In the form of a desired reward of (a),γa discount factor that represents a future reward,

mobile deviceiIn time slice

The reward of (1);

the loss function of Actor is expressed as:

wherein the content of the first and second substances,

indicates the Actor network parameter on each mobile device,

indicates the Actor network parameters to be trained,

the calculation formula of (c) is as follows:

the loss function for Critic is expressed as:

。

9. a computational offload optimization apparatus for a mobile edge computing network, the mobile edge computing network comprising a ground vehicle, a drone, the apparatus comprising:

a model construction module for constructing a system model of the moving edge computing network and determining an optimization objective function of the model based on an average system cost minimization;

the Markov decision conversion module is used for converting the optimization objective function based on the average system cost reduction into an optimization objective function based on the average reward maximization according to the state, the action and the reward elements of the Markov decision model;

the system comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining a distributed execution and centralized training framework of multi-agent deep reinforcement learning and determining a loss function and an advantage function of training;

a training module for performing training of the system model according to a multi-agent reinforcement learning algorithm.

10. A computing offload optimization system of a mobile edge computing network, the system being configured to perform the computing offload optimization method of the mobile edge computing network according to any of claims 1 to 8.