CN113873022A

CN113873022A - Mobile edge network intelligent resource allocation method capable of dividing tasks

Info

Publication number: CN113873022A
Application number: CN202111112170.5A
Authority: CN
Inventors: 沈斐; 唐亮; 卜智勇; 赵宇; 其他发明人请求不公开姓名
Original assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Current assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2021-12-31

Abstract

The invention relates to a mobile edge network intelligent resource allocation method capable of dividing tasks, which comprises the following steps: dividing a serial task generated by a terminal to obtain a plurality of subtasks, and establishing an unloading task model; respectively establishing a time delay model and an energy consumption model for the subtasks according to two execution modes of local execution or unloading, and defining an unloading joint target optimization function based on the multi-user serial dependent task; under a multi-server scene, establishing a Markov game model according to the cooperative competition relationship of multiple users to wireless communication and computing resources, and optimizing the unloading combined target optimization function; in a time-varying environment, each terminal is used as an independent agent to execute a reinforcement learning algorithm to solve the Markov game model based on part of system state information, and an unloading strategy, sub-channel selection, transmitting power and resource allocation amount are determined. The invention is beneficial to reasonably distributing server resources and fully using fragmented resources, ensures the terminal user experience and improves the stability of network operation.

Description

Mobile edge network intelligent resource allocation method capable of dividing tasks

Technical Field

The invention relates to the technical field of edge calculation and artificial intelligence, in particular to a mobile edge network intelligent resource allocation method capable of dividing tasks.

Background

With the continuous development of communication technology, a great amount of emerging internet interactive applications emerge, and the requirements of the application programs on data transmission, the computing power of mobile equipment and time delay are continuously increased, so that the application programs are not suitable for being executed on intelligent equipment with poor computing power and limited battery capacity. In addition, a single cloud architecture requires long-distance data transmission, and the requirements of a terminal side on low time delay and large bandwidth in an ultra-dense wireless network under a next-generation communication framework are difficult to meet. Therefore, the moving edge calculation technology is an important solution to solve the above problems as a specific implementation mode in edge calculation. The Mobile Edge Computing (MEC) sinks part of service capability of the cloud to an Edge node near the user, and provides resource services such as Computing, caching and the like for the user. The user can unload part of the calculation intensive tasks to the server of the edge node for execution, thereby reducing the time delay generated in the data transmission process, relieving the transmission pressure of the backbone network and ensuring the effective execution of the tasks.

Because the MEC server has limited resources, competition of computing and communication resources exists among a plurality of devices for carrying out edge computing task unloading, at present, a plurality of research works aiming at the problems of task unloading and resource allocation exist, for example, a patent document with the application number of 202010171454.0 discloses a task unloading method based on a mobile edge computing scene, and an optimization objective equation with minimized system overhead is determined according to task information to be processed and system real-time parameter information; the optimization objective equation is decomposed into two sub-problems: task offloading and channel allocation sub-problems and transmission power and edge server resource allocation sub-problems; and solving the sub-problems to obtain a final task unloading scheme, so that the overall overhead of the system is minimized. However, the method is oriented to a single-server unloading scene, the problem solving dimensionality is high, the multi-terminal requirements in a dense network cannot be met, and the algorithm expansibility is poor.

The defects of the prior art are mainly reflected in four aspects, one is that the scene is too simple, most researches are oriented to a single/multi-terminal single-server scene, the problems of calculation and communication resource competition among devices are considered, and the problems of unloading server selection, load balance among servers, resource scheduling and distribution and the like are ignored; and secondly, the unloading task is not divisible, the existing research is limited to 0-1 unloading of an atomic task which is not divisible, the potential parallelism among the divisible tasks is ignored, and the fragmented resources of the server cannot be effectively utilized. Thirdly, the optimization target is too single, only two aspects of time delay and energy consumption are considered, other factors influencing the system performance are ignored, and tasks with different emergency degrees need to be processed differently; finally, the central offloading policy has poor adaptability to dynamic environments, a unified decision is made based on the collected global information, and the central control node needs to bear huge calculation and flow pressure, which is likely to become a bottleneck of the whole system.

Disclosure of Invention

The invention aims to solve the technical problem of providing an intelligent resource allocation method of a mobile edge network capable of dividing tasks, which is beneficial to reasonably allocating server resources and fully using fragmented resources, and is beneficial to improving the task unloading execution performance, ensuring the terminal user experience and improving the network operation stability.

The invention considers and solves the following technical problems:

1) the task unloading scene of the multi-terminal multi-server has the problems of calculation and communication resource competition among users, selection of unloading servers, load balance among servers, resource scheduling and distribution and the like, and has higher complexity compared with a single/multi-terminal single-server scene;

2) the serial tasks have strict constraint relation and need to be executed in sequence, and the execution sequence cannot be disturbed. Determining a proper sub-channel, transmission power and calculation resource amount for each subtask selecting the unloading strategy;

3) the design of the optimization objective function needs to meet the time delay requirements and the emergency degree of different tasks. In the environment with time-varying system state, the unloading problem of multiple terminals is solved in a distributed self-organizing manner, the instability of the multi-terminal environment is reduced, and meanwhile, the long-term reward of each terminal user is ensured.

The technical scheme adopted by the invention for solving the technical problems is as follows: the method for distributing the intelligent resources of the mobile edge network capable of dividing tasks comprises the following steps:

(1) dividing a serial task generated by a terminal to obtain a plurality of subtasks, and establishing an unloading task model;

(2) respectively establishing a time delay model and an energy consumption model for the subtasks according to two execution modes of local execution or unloading, and defining an unloading joint target optimization function based on the multi-user serial dependent task;

(3) under a multi-server scene, establishing a Markov game model according to the cooperative competition relationship of multiple users to wireless communication and computing resources, and optimizing the unloading combined target optimization function;

(4) in a time-varying environment, each terminal is used as an independent agent to execute a reinforcement learning algorithm to solve the Markov game model based on part of system state information, and an unloading strategy, sub-channel selection, transmitting power and resource allocation amount are determined.

The multiple subtasks in the step (1) have interdependencies, and data interaction exists among the multiple subtasks.

When the unloading task model is established in the step (1), each subtask is specified to be only unloaded to a certain MEC server for execution, but different subtasks in one application can be unloaded to different MEC servers; when an adjacent subtask is offloaded to the same or a different MEC server, the output data of the previous subtask is transferred to the MEC server to which the next subtask is offloaded through a wired connection.

The unloading joint objective optimization function in the step (2) is P:

wherein, T_iIndicating the time delay for completing the ith subtask, E_iRepresents the terminal energy consumption, δ, for completing the ith subtask_iIndicating the priority of the ith sub-task,χ₁,χ₂represents the weight of time delay and energy consumption, and₁,χ₂∈[0,1]，χ₁+χ ₂1, the unloading joint objective optimization function satisfies the following constraint conditions: constraint condition 1, the execution position of the application subtask is a local or edge server; constraint 2, the entry subtask and the exit subtask of the task can only be executed locally; constraint 3, the subtask can start to execute only when the execution of its predecessor subtask is finished; constraint condition 4, each subtask can only select one sub-channel frequency to transmit data to the server; constraint 5, the total amount of computing resources that can be allocated by all subtasks selected to offload to the edge server must not exceed the maximum resource ownership; constraint 6, the transmit power of the terminal device when inputting data to the edge server must not exceed its maximum transmit power.

The step (3) is specifically as follows: determining a known state space, an action space and a reward function; modeling a task unloading and resource allocation decision process of the multiple terminals into a Markov decision process, namely, in each time slot, the terminals observe the local environment state of the terminals and then independently take action according to different strategies adopted by the local environment state; according to the task execution condition, each agent can obtain the reward of environment feedback, and the agent is transferred to a new state according to the actions of all related agents; the decision process for all coupled terminals is modeled as a markov game process, i.e. at any time slot, each terminal is targeted to take the best action while maximizing the long-term prize.

The step (4) is specifically as follows: each terminal is used as an independent intelligent agent, and all changes except the terminal is used as an environment; each terminal independently operates an Actor-Critic reinforcement learning framework; all terminals are trained based on current partial environment data, and an optimal unloading and resource allocation strategy is selected through a reinforcement learning algorithm, so that a convergence state is achieved; and the terminal distributes the subtasks to the server nodes specified by the unloading strategy according to the unloading strategy and obtains the appropriate resource amount based on the resource distribution strategy.

The Actor-Critic reinforcement learning framework comprises a Critic network and an Actor network, the Critic network is trained on the basis of a Value-based function, and the input of the Critic network comprises the current state, a selected action and the state of the next step; the Critic network adopts a Temporal-Difference updating mode, namely, after a new round of training is started, parameters are updated after the round is finished; the Critic network estimates the value of each state-action and feeds back the time difference value to the Actor network; the loss function of the Critic network is defined as a square value of time difference, and the loss function guides the updating process of parameters; the training of the Actor network is based on a Policy-based function, and the action or the probability of the action is output according to the input state, wherein the Actor network adopts a Monte-Carlo updating mode, namely, updating is carried out once after each action is executed; the loss function of the Actor network is designed based on the time difference error calculated by the Critic network.

Advantageous effects

Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects: the invention considers the competition relationship of simultaneously submitting unloading requests by multiple terminals and the mutual influence between task unloading and server resource allocation, takes the task priority, the average completion time of application and the average energy consumption of the mobile terminal as evaluation indexes, and formulates a combined task unloading and resource allocation mechanism among a plurality of selfish and coupled users in a fringe network into a random game. Each user learns optimal offloading decisions distributively by observing their local network environment, with the goal of improving task execution performance without having to know all state information by selecting sub-channels, transmit power, and the amount of allocated computational resources. A multi-agent reinforcement learning framework is designed to solve the problem of random game. The strategy follows a first-come first-serve principle, resources of the edge server are reasonably distributed, and waiting time of tasks on the server is reduced, so that users can obtain better task unloading results, and user experience and application performance are improved.

Drawings

FIG. 1 is a diagram of a task offloading scenario for a multi-terminal multi-server oriented ultra-dense network in an embodiment of the present invention;

FIG. 2 is a flow chart of an embodiment of the present invention;

FIG. 3 is a model diagram of serial task offloading in a multi-terminal multi-server scenario, according to an embodiment of the invention;

FIG. 4 is an Actor-critical architecture diagram based on a multi-agent reinforcement learning algorithm in an embodiment of the present invention.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

The embodiment of the invention relates to a mobile edge network intelligent resource allocation method capable of dividing tasks, which comprises the following steps: dividing a serial task generated by a terminal to obtain a plurality of subtasks, and establishing an unloading task model; respectively establishing a time delay model and an energy consumption model for the subtasks according to two execution modes of local execution or unloading, and defining an unloading joint target optimization function based on the multi-user serial dependent task; under a multi-server scene, establishing a Markov game model according to the cooperative competition relationship of multiple users to wireless communication and computing resources, and optimizing the unloading combined target optimization function; in a time-varying environment, each terminal is used as an independent agent to execute a reinforcement learning algorithm to solve the Markov game model based on part of system state information, and an unloading strategy, sub-channel selection, transmitting power and resource allocation amount are determined.

On the premise of meeting the dependency relationship among the subtasks, the invention reasonably schedules the subtasks, fully utilizes the fragmented resources of the local user and the server, improves the performance of the application program and the experience of the terminal user, and solves the problem of multi-terminal task unloading when the communication and the computing resources of the edge server are limited. And formulating a task unloading mechanism of a plurality of selfish users in the network into a random game process. A multi-agent reinforcement learning framework is designed to solve the problem of random game. Each user learns the optimal decision of local calculation or edge calculation by observing the local network environment, and the aim is to determine the optimal unloading strategy and the optimal resource allocation scheme by selecting the sub-channel and the transmitting power without knowing all state information, thereby reducing the average task completion delay and the average terminal energy consumption.

This process is described in detail below in conjunction with fig. 2.

S1, establishing a serial task dividing and unloading model generated by the mobile terminal

The application program can be automatically divided into a plurality of subtasks with mutual dependency relationship, data interaction exists among the subtasks, and the serial mobile application program with dependency constraint relationship is used as a research object. The set of mobile terminals is denoted as MDs ═ {1,2, …, N }, where N denotes the number of submitted offload requests. Assuming that the application Task generated by each end user is uniformly divisible into n subtasks, the Task set is denoted as Task ═ {1,2, …, n }. Suppose subtask 0 and subtask n +1 are virtual subtasks, represent subtasks for data input and result output, and are fixed for local execution. Suppose each application is represented by a quadruplet < MD_i,d_i,c_i,δ_i>, [ i ] e.MDs, where MD_iThe representative application being generated by terminal i, d_i＝{d_i,1,d_i,2,...,d_i,nDenotes the size of each subtask input data of terminal i, c_i＝{c_i,1,c_i,2,...,c_i,nDenotes the CPU cycle, δ, required to compute the subtask_iIndicating the task priority of the terminal generating the application. The invention adopts a linear linked list L ═ { V, ED } to represent the dependency relationship between subtasks, each node j ∈ V represents a subtask of the mobile application program, and each directed edge e (j-1, j) ∈ ED represents the dependency relationship between subtask j-1 and subtask j. The jth subtask can start execution, except that enough computation, storage, and network resources are allocated, and its predecessor subtask j-1 is already executed, and the subtask offload model is shown in FIG. 3.

The time range is divided into a number of time slots, assuming that the system operates on a time slot structure. Each end user uses the respective local observation information to distributively select task execution decisions at each time slot k.

The edge server is deployed at the edge of a network close to the mobile terminal, and provides services such as calculation, network and storage for task unloading. Considering a multi-edge server scenario deployed in different locations in the ultra-dense network shown in fig. 1, a server set is denoted as S ═ {1,2]. Each server can be represented as a triple: < s, F_s,B_s>, S ∈ S, where S denotes the number of the server; f_sThe maximum computing power of the server s, representing the number of instructions executed per second; b is_sThe network bandwidth of the communication between the mobile terminal and the edge server at the current moment is represented; the uplink channel resources are divided equally into K_sAnd the unloading subtask selects the kth subchannel to upload unloading data according to the strategy. The processing power and transmission capacity of all servers are assumed to be consistent and do not change as the task volume increases.

The terminals constantly collect data and perform computationally intensive tasks, and for each application run by a terminal, the offload policy can be expressed as an n-dimensional vector, X_i＝{x_i,1,x_i,2,...,x_i,nIn which x_i,j0 denotes that the subtask j of application i is executed locally, x_i,jS ∈ S denotes that the subtask is offloaded to the edge server S for execution.

For each application program run by the terminal, the channel resource allocation strategy can be expressed as an n-dimensional vector:

wherein

Representing whether subtask j of application i transmits offload data to edge server x through kth subchannel_i,jThe above. When x is_i,jWhen 0, the task is executed locally, at which time

For each application program run by the terminal, the computing resource allocation strategy can be expressed as an n-dimensional vector:

wherein

x_i,jS stands for edge server s to assign ψ F to subtask j of application i_sIn which F_sRepresenting the maximum computing resource ownership of the edge server s. x is the number of_i,jWhen equal to 0, there is

Each subtask is specified to be only offloaded to one MEC server for execution, but different subtasks in one application can be offloaded to different MEC servers. When the adjacent subtasks are unloaded to the same or different MEC servers, the output data of the previous subtask is transmitted to the MEC server unloaded by the next subtask through wired connection, and the transmission energy consumption is 0.

S2, respectively establishing time delay and energy consumption models for the subtasks according to the local or unloading execution modes, and defining a combined objective optimization function

S21, establishing local execution time model

Local execution means that the subtask (i, j) is executed on the mobile terminal, ST_i,j,FT_i,jRespectively representing the start execution time and the end time of the subtask (i, j). Wherein, ST_i,jExpressed as:

ST_i,j＝FT_i,j-1+T_i,j-1,j,others，

wherein, T_i,j-1,jRepresents the data transmission time between subtasks (i, j-1) and (i, j):

wherein d is_i,j-1Indicating the size of the data transfer between subtasks (i, j-1) and (i, j).

Adopting orthogonal frequency division multiple access technology as an uplink access scheme, and for a server s, the working frequency band B thereof_sIs divided into K_sA plurality of equally divided frequency bands. To ensure orthogonality of uplink transmissions between user applications associated with the same server, each user is assigned to a sub-band for transmitting data to the edge server. So that server s can serve up to K simultaneously_sAnd (4) users. Each user and server has an antenna for uplink transmission. Order to

Representing the sub-bands K, K ∈ [1, K ]_s]The uplink channel gain between upper user i and server s captures the effects of path loss, shadowing, and antenna gain. p is a radical of_i,jRepresenting the wireless transmission power when the user i uploads the subtask j request to the server, wherein p is more than or equal to 0_i,j≤p_maxWhen x is_i,jWhen not equal to s, p_i,j0. Since users transmitting to the same server use different sub-bands, uplink intra-cell interference can be ignored, but these users are still affected by inter-cell interference. In this case, the signal-to-noise ratio from user i to server s on subband k is expressed as:

wherein σ²Is the background noise variance;

representing the cumulative interference within the cell for all users associated with other servers on subband k. Since each subtask j of user i transmits data only on a single subband, the rate at which the subtask j of user i transmits data to server s is:

wherein, B_i,j,sRepresenting the actual communication bandwidth after being attenuated by environmental interference and user collisions.

When a user performs his task locally, it is assumed that the user can now use all of the computing resources for sub-task execution. f. of_i ^lRepresenting the total computing power of end user i, using

Representing the time at which the subtask (i, j) executes locally, then:

wherein c is_i,jIndicating the CPU cycles required to apply the jth sub-task of i.

Thus, the sub-task (i, j) is completed locally at the user by the time

S22, establishing unloading execution time model

The unloading execution includes three phases: the time to transmit the request to the MEC server over the uplink, the time to execute the task on the MEC server, and the time to return the results of the task execution from the MEC server to the user over the downlink. Since the size of the result is usually much smaller than the request, while the downstream data rate is much higher than the upstream data rate, the delay in the transmission of the result is omitted here.

The start execution time of the subtask (i, j) on the edge server is likewise ST_i,jAnd (4) showing. Each MEC server is capable of providing computation offload services for multiple subtasks simultaneously. Computing resource provided by each MEC server to associated subtask sharing

And (6) quantizing. One feasible computing resource allocation strategy must satisfy the computing resource constraints:

by using

Represents the time at which the subtask (i, j) executes at the edge server s:

thus, the subtask (i, j) offload execution time is:

for the entire task generated by user i, its completion time can be expressed as:

T_i＝FT_i,n+1-ST_i,0，

where 0, n +1 represent the entry and exit subtasks of the task, respectively.

S23, establishing a calculation energy consumption model

The modeling of the invention only considers the energy consumption of the edge users, because the end user equipment is usually powered by a battery with limited energy and is sensitive to the energy consumption; the edge server is usually connected with an edge gateway such as a base station, and is powered by alternating current of a power grid, so that the energy consumption requirements on calculation and communication are relaxed.

During the entire serial task edge offload process, the energy consumption of the end-user device comes from two parts: the calculated energy consumption and the wireless communication energy consumption may be expressed as:

wherein E_iIs the total energy consumption of user i during the edge offload,

is the energy consumption caused by the local computation of the subtask,

representing the energy consumption caused by the user wirelessly communicating with the edge server.

Energy consumption model using calculation cycle

And (4) showing. Where τ is the energy coefficient dependent on the chip structure, setting

f represents the current CPU frequency. Thus, the computational energy consumption of application i in locally executing subtask j

The calculation is as follows:

the total computational energy consumption of the mobile user i to complete the whole task is then equal to the sum of the computational energy consumption of all locally executed sub-tasks, i.e.:

s24, establishing a transmission energy consumption model

The transmission energy consumption is mainly generated by data transmission between a mobile terminal user and an edge server, and for an application program generated by a certain user i, when two adjacent subtasks are executed on a mobile local server or the edge server, the transmission energy consumption is zero; there is a data transfer power consumption only when two adjacent subtasks are executed at different locations. The energy consumption generated when the mobile terminal user i sends the subtask j to transmit data to the edge server is represented as:

thus, the total transmission energy consumption for a mobile terminal user i to complete the entire task is expressed as:

thus, the total energy consumption of all end users in the system can be expressed as:

s25, combining the task execution time model and the energy consumption model, defining the unloading target optimization function based on the multi-terminal serial task

Latency and power consumption are two keys to task execution. If an end user chooses to offload his computational tasks, it must request spectrum and computational resources from the gateway, thereby reducing the resources that other users can allocate. And a larger transmit power means a higher transmission rate, less transmission delay, but more interference to other end users. The time model and the energy consumption model established under the serial task unloading scene are influenced by the unloading strategy and cannot reach the minimum value simultaneously through independent calculation. The present invention designs an optimal joint computation offload scheme and provides an efficient resource allocation solution between end users.

And (3) jointly executing time delay, energy consumption constraint and newly introduced task priority delta, quantifying and unifying the three dimensions into system utility evaluation unloading performance, and simultaneously serving as a reward mechanism to feed back a training neural network. According to the analysis of the above calculation model and communication model, considering the offloading policy, channel selection policy, transmission power and calculation resource allocation, an offloading optimization objective function of user i is defined as:

the system cost function comprises time delay cost and energy consumption cost for executing all tasks at a certain moment; chi shape₁，χ₂Respectively representing the weight occupied by the task completion time delay and the terminal energy consumption, and having x₁,χ₂∈[0,1]，χ₁+χ ₂1. The orientation of the sub-utilities is determined by adjusting this parameter during the training process, e.g. with more focus on execution latency in latency sensitive scenarios and with more focus on energy consumption in energy limited devices. The objective optimization function P satisfies the following constraints:

C2:x_i,0＝0,x_i,n+1＝0

wherein the constraint C1 indicates that the execution position of the application subtask can be 0 or s; c2 indicates that the entry subtask and the exit subtask of the task can only be executed locally; constraint C3 ensures that a subtask (i, j) can only begin execution until it waits for its predecessor subtask (i, j-1) to complete; constraint C4 limits each subtask to select only one subchannel frequency for transmitting data to the server; constraint C5 indicates that the total amount of computing resources that can be allocated by all the subtasks selected for offloading to the edge server s must not exceed the maximum resource ownership; the constraint C6 indicates that the transmit power at which the terminal device inputs data to the edge server must not exceed its maximum transmit power.

From the above optimization problem, in a complex scenario of multiple users and multiple servers, the optimization problem not only needs to consider the actual unloading decision of the device, but also needs to consider the resource allocation scheme of the edge server to the subtask, and the two are mutually coupled and mutually influenced, and meanwhile, due to the dependence constraint of the task itself, the unloading problem becomes very difficult.

S3, under the scene of multiple servers, establishing a Markov game model according to the cooperative competition relationship of multiple terminals to wireless communication and computing resources

According to the objective optimization function defined in step S2, the present invention aims to solve the optimal offloading policy, sub-channel selection policy, transmission power and computational resource allocation policy to minimize the system cost during task execution. Each end user can only observe local information and know channel state information through feedback of the server, so that a multi-agent Markov game model, also called a random game, is formed.

The random game theory is well suitable for being applied to a multi-terminal multi-server edge unloading scene. And under the condition that a plurality of interested selfish terminals do not share information, the unloading strategy is selected in a distributed mode. After the terminal performs the corresponding action, the reward value fed back from the system environment is obtained, and the next state is entered, wherein the next state depends on the joint action made by all the terminals. Under the time-varying environment, the above process is repeated continuously, and convergence to the Nash equilibrium state is finally expected. Under the Nash equilibrium state, higher income cannot be obtained in any terminal network through changing strategies, and the network parameters and the long-term discount rewards of the system are optimized. In a considered multi-terminal scenario, when multiple terminals autonomously select offloading behaviors according to a policy, they may compete for limited channel and server resources, striving for maximum revenue for themselves. By definition, decisions among multiple terminals in this scenario form a non-cooperative gaming process. Each terminal will take all changes except itself as part of the environment, regardless of the benefit of the other terminals. In the non-cooperative game, the unloading behaviors of all the terminals are mutually restricted and mutually influenced.

The task offload and resource allocation decision Process for each end-user is modeled as a Markov Decision Process (MDP) to accurately describe each end-user decision Process. At each time slot theta, the end-user observes its local environment state st_i(θ)∈ST_iThen taking action a independently according to different strategies adopted by the algorithm_i(θ)∈A_i. Each agent receives a reward r of environmental feedback according to the task execution situation_i(θ)＝r_i(st_i(θ),a₁(θ),...,a_N(theta)), based on the actions of all relevant agents, transition to a new state st_i(θ+1)∈ST_i. The future state in MDP depends only on the current state and is independent of the historical state. At any time slot θ, the goal of each end user is to take the best action while maximizing the long-term rewards.

The exact definition of the state space, action space and bonus functions in the Markov game is given below:

1) state space: define the state space of user i as st_iAnd (theta) including state information of the user i, other users and the MEC server, such as residual channel resources, computing resources and the like. Thus, the state space of the system is defined as:

ST(θ)＝{st₁(θ),...,st_i(θ),...,st_N(θ)}，

wherein st_i(θ)＝{st_i,1(θ),...,st_i,j(θ),...,st_i,n(θ)},i∈MDs,j∈Task_i。

2) An action space: for user i, action space a_i,j(k) Including the offload decision for subtask j, the transmit power, the uplink channel allocated by the MEC server, and the computational resources allocated by the MEC server. The motion space of the system is thus defined as:

A(θ)＝{a₁(θ),...,a_i(θ),...,a_N(θ)}，

wherein, a_i(θ)＝{a_i,1(θ),...,a_i,j(θ),...,a_i,n(θ)},i∈MDs,j∈Task_i。

In the multi-terminal IoT edge computing network under consideration, each end user i is treated as an agent whose actions taken include, at each time slot θ, an offload decision X_iSelection of sub-channel CH_iSelecting P for the transmission power level_iAnd assigned computational resource F_iI.e. a_i(θ)∈A_i＝X_i×CH_i×P_i×F_i. Therefore, the action space for calculating the offload game is:

3) reward function: the reward is feedback of the environment to the agent after the agent takes action. Reward function r_iThe design of (θ) directly guides the learning process. The invention aims to reduce the task execution cost of each user terminal to the minimum according to the resource limit of a server and a task execution delay threshold value. In particular, the system cost is considered a negative reward function in the problem, so the long-term reward must be minimized here. Rewards are set according to constraints and goals of the tasks, including priority task completion, delay constraints and energy consumption, algorithms ensure that allocated resources enable terminal applications with higher priority to execute fully earlier; the lower the mission delay and energy consumption, the higher the reward.

Next, by selecting an appropriate operation at each time slot, consideration is given to minimizing the long-term reward v_i(θ):

Wherein, λ ∈ [0,1 ]]Represents a discount factor, v_i(theta) represents the sum of the long-term discount rewards, which can be used to measure the actions taken by the end-user i, and tau is the slot index starting from the slot theta.

The optimal computation offload problem for end user i is then expressed as:

the design of the computation offload scheme for multi-terminal edge computing networks contains the above-mentioned N sub-problems, which correspond to all the sub-tasks of the N end users. Each end-user does not have the status and off-loading information of other end-users, so the present invention first models the optimization problem using non-cooperative random gaming, and then proposes a multi-agent reinforcement learning framework to solve the problem.

S4, based on the known partial system state information, each terminal user independently executes a reinforcement learning algorithm to determine a task unloading strategy and a resource allocation amount, and the problem of game is solved

The multi-agent depth deterministic policy gradient (MADDPG) is utilized to find the best policy for the MDP. The core of MADDPG is the Actor-Critic architecture, as shown in FIG. 4. The Critic part of each agent can acquire action information of all the other agents, centralized training and non-centralized execution are carried out, namely, global Critic capable of being observed is introduced during training to guide Actor training, and action is taken only by using an Actor with local observation during testing. off-line was trained for centralization and on-line was performed for decentralization.

Critic network: the Critic network is based on a Value-based function, i.e., a Q function. The inputs to the Critic network include the current state, the selected action, and the next state. Critic is a multi-layer fully-connected neural network structure. The Critic network adopts a Temporal-Difference update mode, namely, after a new round of training is started, parameters are updated after waiting for the end of the round. Critic estimates the value of each state-action and feeds back the time difference value to the Actor. Calculation taking into account the time difference: td _ error ═ r + λ × Q (st', a) -Q (st, a). The loss function of the Critic network is defined as the square value of the time difference, and the loss function guides the updating process of the parameters.

An Actor network: the training of the Actor network is based on a Policy-based function, and an action or probability of the action is output according to the input state. The Actor is also a multi-layer fully-connected neural network. The network adopts a Monte-Carlo updating mode, namely, after each action is executed, updating is carried out, and the process does not need to wait until the round is finished. The loss function of Actor is designed based on the time difference error calculated by Critic. And the Actor selects the action according to the value output by the softmax function, updates the parameters according to the Critic score and modifies the action selection probability.

The Actor selects and executes an action according to the current state. Critic scores the performance of Actor according to the current state and the environmental feedback reward value produced by the action. In the initial stage of learning, the Actor randomly selects an action, and Critic randomly scores the action. Due to environmental feedback, i.e., the existence of reward function, Critic scoring becomes more accurate and Actor performs better and better. In the parameter updating stage, the Actor updates its action strategy, namely, the Actor network parameters, according to the Critic score. Critic adjusts its own scoring strategy and network parameters according to the reward function given by the system by calculating the Q value. The Actor-Critic relates to two neural networks, the two neural networks interact with each other, loop iteration is carried out, parameters are updated in a continuous state, and network performance is improved.

In the invention, each end user runs an independent Actor-criticic algorithm to learn respective optimal strategies. In particular, the selection of the optimal action depends on the Q function, which is defined as being in the state st_iTake action a_iThe optimal expected value of time. Since the state transition probabilities are difficult to obtain in practice, the average rate of return over multiple samples may approximately represent the expected cumulative reward. This is achieved by using a monte carlo learning method and sampling the same Q function by different strategies. However, all-purposeThe monte carlo learning becomes complex by oversampling the complete interaction segment to compute the mean return. The time difference is used to recursively update the Q-value function by learning its estimated value on the basis of other estimated values, expressed as:

wherein the content of the first and second substances,

indicating the best cumulative benefit for the next time slot. α represents a learning rate, and the learning rate α is set to ensure convergence of Q learning_kThe method comprises the following steps:

wherein alpha is_ini,α_endRespectively, an initial value and a final value of a given alpha, the epsilon being the maximum number of iterations of the learning algorithm.

In order to avoid the situations of gradient disappearance and gradient explosion, which cause model degradation, the invention adopts an empirical replay strategy. And storing experience data obtained in the environment exploration process of the intelligent agent in an experience pool, and randomly sampling and updating network parameters in the subsequent deep neural network training process. The experience pool of user i may be M_i＝m^i-M+1,...,mⁱWhere M represents the size of the experience pool, the stored experience data tuple is represented as:

an epsilon-greedy method is adopted as an action selection strategy, the problem of exploration and utilization in reinforcement learning is mainly solved, an intelligent agent selects an optimal action corresponding to a maximum Q function according to a probability 1-epsilon, and the probability epsilon belongs to [0,1 ∈]A random action is selected.

As can be easily found, the invention adopts a distributed intelligent reinforcement learning algorithm to dynamically determine the unloading strategy, the sub-channel selection, the transmitting power and the multi-server computing resource allocation scheme of the multi-terminal divisible serial tasks, thereby optimizing the task execution time delay and the terminal energy consumption and improving the system efficiency.

The invention is oriented to a multi-terminal multi-server scene, fully considers the competition relationship of the multi-terminal and the coupling relationship of task unloading and resource allocation decision, solves the problem of multi-terminal task unloading when the edge server communication and the computing resources are limited, and aims to reduce the average task completion time and the average terminal energy consumption by establishing a computing model and an energy consumption model.

The invention establishes a divisible serial task model and designs an intelligent unloading strategy. On the premise of meeting the dependency relationship among the subtasks, the strategy reasonably schedules the subtasks, makes full use of fragmented resources of the local user and the server, and improves performance and user experience.

The method defines task priorities to represent different time efficiency emergency degrees of tasks, positions system cost by combining three factors of the task priorities, task execution time delay and task execution energy consumption, normalizes multi-objective optimization into single-objective optimization in a linear weighting mode, and models the single-objective optimization into a mixed integer nonlinear programming problem.

The invention designs a distributed resource allocation algorithm based on multi-agent reinforcement learning, which aims to minimize system cost and explore an optimal unloading strategy, so that a terminal user can achieve a balance state in a self-organizing manner in a time-varying environment. Each user learns and adapts to the environmental data as a separate agent, treating other users as part of the environment.

Experiments show that compared with the traditional atomic task 0-1 unloading strategy and the traditional central task unloading algorithm, under the scenes of different numbers of user tasks and different numbers of edge servers, the method can realize lower task execution cost, namely effectively reduce time delay and energy consumption. In addition, different priorities and maximum endurance times are set for the tasks, and it can be found that the tasks with higher priorities can be scheduled to be executed earlier, and under a certain time constraint, the task completion rate of the algorithm is the highest.

Claims

1. A mobile edge network intelligent resource allocation method capable of dividing tasks is characterized by comprising the following steps:

2. The method for intelligent resource allocation of a mobile edge network capable of dividing tasks according to claim 1, wherein there is an interdependence relationship between the plurality of sub-tasks in the step (1), and there is data interaction between the plurality of sub-tasks.

3. The method for intelligent resource allocation of a mobile edge network capable of dividing tasks according to claim 1, wherein when the unloading task model is established in step (1), it is specified that each subtask can only be unloaded to a certain MEC server for execution, but different subtasks in an application can be unloaded to different MEC servers; when an adjacent subtask is offloaded to the same or a different MEC server, the output data of the previous subtask is transferred to the MEC server to which the next subtask is offloaded through a wired connection.

4. The method of claim 1, wherein the offloading joint in step (2) isThe objective optimization function is P:

wherein, T_iIndicating the time delay for completing the ith subtask, E_iRepresents the terminal energy consumption, δ, for completing the ith subtask_iIndicating the priority, χ, of the ith subtask₁,χ₂Represents the weight of time delay and energy consumption, and₁,χ₂∈[0,1]，χ₁+χ₂1, the unloading joint objective optimization function satisfies the following constraint conditions: constraint condition 1, the execution position of the application subtask is a local or edge server; constraint 2, the entry subtask and the exit subtask of the task can only be executed locally; constraint 3, the subtask can start to execute only when the execution of its predecessor subtask is finished; constraint condition 4, each subtask can only select one sub-channel frequency to transmit data to the server; constraint 5, the total amount of computing resources that can be allocated by all subtasks selected to offload to the edge server must not exceed the maximum resource ownership; constraint 6, the transmit power of the terminal device when inputting data to the edge server must not exceed its maximum transmit power.

5. The method for allocating intelligent resources of a mobile edge network capable of dividing tasks according to claim 1, wherein the step (3) is specifically as follows: determining a known state space, an action space and a reward function; modeling a task unloading and resource allocation decision process of the multiple terminals into a Markov decision process, namely, in each time slot, the terminals observe the local environment state of the terminals and then independently take action according to different strategies adopted by the local environment state; according to the task execution condition, each agent can obtain the reward of environment feedback, and the agent is transferred to a new state according to the actions of all related agents; the decision process for all coupled terminals is modeled as a markov game process, i.e. at any time slot, each terminal is targeted to take the best action while maximizing the long-term prize.

6. The method for allocating intelligent resources of a mobile edge network capable of dividing tasks according to claim 1, wherein the step (4) is specifically as follows: each terminal is used as an independent intelligent agent, and all changes except the terminal is used as an environment; each terminal independently operates an Actor-Critic reinforcement learning framework; all terminals are trained based on current partial environment data, and an optimal unloading and resource allocation strategy is selected through a reinforcement learning algorithm, so that a convergence state is achieved; and the terminal distributes the subtasks to the server nodes specified by the unloading strategy according to the unloading strategy and obtains the appropriate resource amount based on the resource distribution strategy.

7. The intelligent resource allocation method for a task-partitionable mobile edge network according to claim 6, wherein the Actor-Critic reinforcement learning framework comprises a Critic network and an Actor network, the training of the Critic network is based on a Value-based function, and the input of the Critic network comprises the current state, the selected action and the next state; the Critic network adopts a Temporal-Difference updating mode, namely, after a new round of training is started, parameters are updated after the round is finished; the Critic network estimates the value of each state-action and feeds back the time difference value to the Actor network; the loss function of the Critic network is defined as a square value of time difference, and the loss function guides the updating process of parameters; the training of the Actor network is based on a Policy-based function, and the action or the probability of the action is output according to the input state, wherein the Actor network adopts a Monte-Carlo updating mode, namely, updating is carried out once after each action is executed; the loss function of the Actor network is designed based on the time difference error calculated by the Critic network.