CN117098189A - Computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning


Info

Publication number: CN117098189A
Authority: CN (China)
Application number: CN202311101336.2A
Other languages: Chinese (zh)
Inventors: 李云, 张剑鑫, 姚枝秀, 夏士超, 吴广富
Applicant and current assignee: Chongqing University of Posts and Telecommunications
Legal status: Pending

Classifications

    • H04W 28/0942: Network traffic management; load balancing or load distribution managed using policies based on measured or predicted load of entities or links
    • H04L 67/10: Network arrangements or protocols for supporting network services or applications; protocols in which an application is distributed across nodes in the network
    • H04W 28/0975: Load management based on metrics or performance parameters; Quality of Service [QoS] parameters for reducing delays
    • H04W 28/20: Central resource management; negotiating wireless communication parameters; negotiating bandwidth


Abstract

The invention relates to a computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning, which comprises the following steps: create a mobile edge system in a multi-base-station multi-user MEC network environment; the mobile edge system serves multiple users with orthogonal frequency division multiple access, and a local task computing model and an edge task computing model are created for the users; establish a resource allocation model that aims to minimize the energy consumption of the mobile edge system under the maximum tolerable delay constraint of each task; define the global state model, local observation state model, action model and reward model of the system with a decentralized Markov decision process. In the distributed execution stage, the actor network of each agent selects actions to interact with the environment according to its local observation state; in the centralized training stage, the critic network evaluates the actions selected by the actor network according to shared samples drawn from the shared experience replay pool, and the output evaluation value guides the networks to update their parameters.

Description

Computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning.
Background
With the development of Internet of Things technologies such as virtual reality, autonomous driving and telemedicine, intelligent devices are becoming increasingly popular and generate a large number of computation-intensive, delay-sensitive tasks. However, user devices with limited battery capacity and computing resources can hardly meet the computing and latency requirements of these tasks. Mobile edge computing (MEC) significantly reduces task processing delay and energy consumption by deploying computing resources at the network edge close to users: a user offloads its computing load to an MEC server and receives the returned result, which removes the limits that scarce device resources place on task processing.
However, in densely deployed MEC network scenarios, the complex spatial correlation and the dynamics of the wireless network state pose significant challenges to efficient computation offloading and resource allocation strategies. Researchers at home and abroad have studied this problem intensively; some main results are:
(1) A multi-UAV-assisted mobile edge computing trajectory planning algorithm based on multi-agent deep reinforcement learning (Wang L, Wang K, Pan C, et al. Multi-Agent Deep Reinforcement Learning-Based Trajectory Planning for Multi-UAV Assisted Mobile Edge Computing. IEEE Transactions on Cognitive Communications and Networking, 2021, 7(1): 73-84): the algorithm considers a UAV-assisted MEC architecture and optimizes the UAV trajectories, thereby improving fairness with respect to user geographic locations and UAV loads while reducing the overall energy consumption of the users.
(2) A small-cell-network computation offloading and interference coordination algorithm based on multi-agent deep reinforcement learning (Huang X Y, Leng S P, Maharjan S, et al. Multi-agent deep reinforcement learning for computation offloading and interference coordination in small cell networks. IEEE Transactions on Vehicular Technology, 2021, 70(9): 9282-9293): the algorithm considers offloading strategies and inter-cell interference coordination in a multi-base-station MEC network scenario, aiming to minimize system energy consumption under delay constraints.
However, the above studies ignore the potential spatial correlation in MEC networks: the uplink transmission rate achievable by a user depends not only on the radio resource allocation policy of its base station but also on inter-cell interference, and the closer two base stations are, the stronger the inter-cell interference between them. The wireless network state information of neighboring base stations should therefore receive more attention when computing offloading and resource allocation policies.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning, which addresses the computation offloading and resource allocation problems in a multi-base-station multi-user MEC network environment and comprises:
S1: create a mobile edge system in a multi-base-station multi-user MEC network environment. The system comprises a plurality of users and $N$ base stations configured with MEC servers; the set of users under the coverage of base station $S_n$ is expressed as $\mathcal{M}_n=\{1,2,\ldots,M_n\}$, where $M_n$ denotes the number of users under the coverage of $S_n$ and $m_n\in\mathcal{M}_n$ denotes a user under the coverage of $S_n$.
S2: the mobile edge system serves the users with orthogonal frequency division multiple access, and each cell fully reuses the spectrum resources.
S3: according to the tasks generated by the users in each time slot, create a local task computing model of the users and an edge task computing model of the users.
S4: according to the local and edge task computing models of the users, establish a resource allocation model whose objective is to minimize the energy consumption of the mobile edge system under the maximum tolerable delay constraint of the tasks.
S5: treat each base station in the mobile edge system as an agent configured with an actor-critic network, and define the global state model, the local observation state model, the action model and the reward model of the mobile edge system with a decentralized Markov decision process according to the resource allocation model.
S6: the agents learn under a distributed-execution, centralized-training framework. In the distributed execution stage, each agent extracts its local observation state from the global state model of the current mobile edge system, feeds it into the actor network and outputs the selected action; each agent executes the selected action to interact with the environment of the mobile edge system, obtaining the reward fed back by the environment and the local observation state of the next moment. The generated sample data are stored in an experience replay pool as shared samples of all agents.
In the centralized training stage, each agent samples shared sample information from the experience replay pool, then uses the critic network to evaluate the action selected by the actor network, outputs the evaluation value and updates the parameters of the actor network and the critic network.
Preferably, creating the local task computing model of the user and the edge task computing model of the user includes:
Under the coverage of base station $S_n$, the task generated by user $m_n$ in the $t$-th time slot is defined as

$w_{m_n}(t)=\{d_{m_n}(t),\,c_{m_n}(t),\,\tau_{m_n}^{\max}(t)\},$

where $d_{m_n}(t)$ denotes the data amount of the task, $c_{m_n}(t)$ denotes the number of CPU cycles required per bit of task data, and $\tau_{m_n}^{\max}(t)$ denotes the maximum tolerable delay of the task.
Creating the local task computing model: define $f_{m_n}^{loc}(t)$ as the CPU frequency user $m_n$ assigns to task $w_{m_n}(t)$. The delay and energy consumption of computing the task locally are

$t_{m_n}^{loc}(t)=\frac{d_{m_n}(t)\,c_{m_n}(t)}{f_{m_n}^{loc}(t)},\qquad e_{m_n}^{loc}(t)=\xi_{m_n}\,d_{m_n}(t)\,c_{m_n}(t)\left(f_{m_n}^{loc}(t)\right)^2,$

where $\xi_{m_n}$ is the effective energy coefficient determined by user $m_n$'s chip architecture, $t_{m_n}^{loc}(t)$ is the delay of executing task $w_{m_n}(t)$ locally, and $e_{m_n}^{loc}(t)$ is the corresponding local computation energy.
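As an illustrative sketch (not code from the patent), the local computing model can be written in a few lines of Python; the function name is hypothetical, and the default coefficient follows the value $\xi=10^{-19}$ given in the simulation section:

```python
def local_cost(d_bits, c_cycles_per_bit, f_loc, xi=1e-19):
    """Local execution delay (s) and energy (J) of one task.

    d_bits: task data amount; c_cycles_per_bit: CPU cycles per bit;
    f_loc: CPU frequency assigned to the task (cycles/s);
    xi: effective energy coefficient of the user's chip (W*s^2/cycle^2).
    """
    cycles = d_bits * c_cycles_per_bit        # total CPU cycles of the task
    t_loc = cycles / f_loc                    # local computation delay
    e_loc = xi * cycles * f_loc ** 2          # local computation energy
    return t_loc, e_loc
```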
Creating the edge task computing model: define $\mathcal{K}=\{1,2,\ldots,K\}$ as the set of $K$ mutually orthogonal wireless subchannels of the mobile edge system, each with transmission bandwidth $B_0$. Define the channel allocation decision variable $\beta_{m_n}^k(t)\in\{0,1\}$: $\beta_{m_n}^k(t)=1$ means base station $S_n$ assigns wireless subchannel $k$ to user $m_n$ at time $t$ for transmitting data, otherwise $\beta_{m_n}^k(t)=0$.

Define the uplink transmission rate of user $m_n$ on wireless subchannel $k$ as

$r_{m_n}^k(t)=B_0\log_2\!\left(1+\frac{p_{m_n}(t)\,h_{n,m_n}^k(t)}{\sigma^2+I_k(t)}\right),\qquad I_k(t)=\sum_{n'\neq n}\sum_{m_{n'}\in\mathcal{M}_{n'}}\beta_{m_{n'}}^k(t)\,p_{m_{n'}}(t)\,h_{n,m_{n'}}^k(t).$

The transmission delay and energy consumption of uploading task $w_{m_n}(t)$ to the MEC server of base station $S_n$ are

$t_{m_n}^{tr}(t)=\frac{d_{m_n}(t)}{r_{m_n}^k(t)},\qquad e_{m_n}^{tr}(t)=p_{m_n}(t)\,t_{m_n}^{tr}(t),$

where $t_{m_n}^{tr}(t)$ and $e_{m_n}^{tr}(t)$ denote the transmission delay and the transmission energy of the upload, $p_{m_n}(t)$ is the transmission power of user $m_n$ at time $t$, $h_{n,m_n}^k(t)$ is the instantaneous channel gain between base station $S_n$ and user $m_n$ on wireless subchannel $k$, $\sigma^2$ is the noise power, $I_k(t)$ is the inter-cell interference suffered by user $m_n$ on subchannel $k$, $p_{m_{n'}}(t)$ is the transmission power of user $m_{n'}$ at time $t$, $\mathcal{M}_{n'}$ is the user set under the coverage of base station $S_{n'}$, $m_{n'}$ is a user under the coverage of $S_{n'}$, and $h_{n,m_{n'}}^k(t)$ is the instantaneous channel gain between base station $S_n$ and user $m_{n'}$ on subchannel $k$.
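A minimal sketch of the uplink rate and upload cost, assuming the 20 MHz bandwidth of the simulation section as the default and a placeholder noise value; all names are illustrative:

```python
import math

def uplink_rate(p_tx, h_gain, interference, b0=20e6, noise=1e-13):
    """Achievable uplink rate (bit/s) of a user on one subchannel.

    interference is the inter-cell interference I_k(t): the sum of p * h
    over the co-channel users of the neighbouring base stations.
    """
    sinr = p_tx * h_gain / (noise + interference)
    return b0 * math.log2(1.0 + sinr)

def transmission_cost(d_bits, rate, p_tx):
    """Upload delay (s) and transmission energy (J) of one task."""
    t_tr = d_bits / rate
    return t_tr, p_tx * t_tr
```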
Define $f_{m_n}^{mec}(t)$ as the CPU frequency the MEC server of base station $S_n$ allocates to task $w_{m_n}(t)$. The delay and energy consumption of computing the task at base station $S_n$ are

$t_{m_n}^{mec}(t)=\frac{d_{m_n}(t)\,c_{m_n}(t)}{f_{m_n}^{mec}(t)},\qquad e_{m_n}^{mec}(t)=\zeta_n\,d_{m_n}(t)\,c_{m_n}(t)\left(f_{m_n}^{mec}(t)\right)^2,$

where $t_{m_n}^{mec}(t)$ is the computation delay of task $w_{m_n}(t)$ at base station $S_n$, $e_{m_n}^{mec}(t)$ is the corresponding computation energy, and $\zeta_n$ is the effective energy coefficient determined by the chip architecture of the MEC server of base station $S_n$.
The total delay and energy consumption when user $m_n$ uploads task $w_{m_n}(t)$ to the MEC server of base station $S_n$ for edge computing are

$t_{m_n}^{off}(t)=t_{m_n}^{tr}(t)+t_{m_n}^{mec}(t),\qquad e_{m_n}^{off}(t)=e_{m_n}^{tr}(t)+e_{m_n}^{mec}(t),$

where $t_{m_n}^{off}(t)$ and $e_{m_n}^{off}(t)$ denote the total delay and the energy consumption of offloading task $w_{m_n}(t)$ to base station $S_n$ for edge computing.
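Continuing the sketch, the edge-side cost adds the server computation to the upload cost; the default coefficient follows the simulation value $\zeta_n=10^{-20}$, and the result downlink is neglected as stated in the text:

```python
def edge_cost(d_bits, c_cycles_per_bit, f_mec, t_tr, e_tr, zeta=1e-20):
    """Total delay and energy when the task is offloaded to the MEC server."""
    cycles = d_bits * c_cycles_per_bit
    t_mec = cycles / f_mec                    # server computation delay
    e_mec = zeta * cycles * f_mec ** 2        # server computation energy
    return t_tr + t_mec, e_tr + e_mec
```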
Preferably, define the offloading decision variable $\alpha_{m_n}(t)\in\{0,1\}$: $\alpha_{m_n}(t)=1$ means user $m_n$ chooses in the $t$-th time slot to upload task $w_{m_n}(t)$ to the MEC server of base station $S_n$ for edge computing, and $\alpha_{m_n}(t)=0$ means user $m_n$ chooses to compute the task locally. The resource allocation model is

$\min_{\alpha(t),\,\beta(t),\,p(t),\,f(t)}\;\sum_{n=1}^{N}\sum_{m_n\in\mathcal{M}_n}\Big[\big(1-\alpha_{m_n}(t)\big)\,e_{m_n}^{loc}(t)+\alpha_{m_n}(t)\,e_{m_n}^{off}(t)\Big]$

subject to constraints C1-C8: C1: $\alpha_{m_n}(t)\in\{0,1\}$; C2: $\beta_{m_n}^k(t)\in\{0,1\}$; C3: $\sum_{m_n\in\mathcal{M}_n}\beta_{m_n}^k(t)\le 1$; C4: $\sum_{k\in\mathcal{K}}\beta_{m_n}^k(t)\le 1$; C5: $f_{\min}^{loc}\le f_{m_n}^{loc}(t)\le f_{\max}^{loc}$; C6: $f_{\min}^{mec}\le f_{m_n}^{mec}(t)\le f_{\max}^{mec}$; C7: $p_{\min}\le p_{m_n}(t)\le p_{\max}$; C8: $T_{m_n}(t)\le\tau_{m_n}^{\max}(t)$.

Here $\alpha(t)$ is the set of offloading decision variables of all users in the mobile edge system and $\alpha_n(t)$ that of the users under base station $S_n$; $\beta(t)$ is the set of channel allocation decision variables of all users in the $t$-th slot and $\beta_n(t)$ that of the users under base station $S_n$; $p(t)$ is the set of transmission powers of all users, and $f(t)$ is the set of CPU computing resources allocated to the tasks of all users; $f_{\min}^{loc}$ and $f_{\max}^{loc}$ are the minimum and maximum local CPU frequencies of a user, $f_{\min}^{mec}$ and $f_{\max}^{mec}$ the minimum and maximum CPU frequencies of an MEC server, and $p_{\min}$ and $p_{\max}$ the minimum and maximum transmission powers of a user. Constraint C3 states that each wireless subchannel under the coverage of base station $S_n$ can be allocated to at most one user at a time, and constraint C4 states that each user can use at most one wireless subchannel to upload a task at a time. $T_{m_n}(t)$ is the total delay for completing the task: $T_{m_n}(t)=t_{m_n}^{off}(t)$ when $\alpha_{m_n}(t)=1$, and $T_{m_n}(t)=t_{m_n}^{loc}(t)$ when $\alpha_{m_n}(t)=0$.
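A hedged sketch of how the objective and the delay constraint C8 combine per task, reusing the helper functions above; this is an evaluation aid under illustrative names, not a solver for the mixed-integer problem:

```python
def task_cost(alpha, t_loc, e_loc, t_off, e_off):
    """Completion delay and energy of one task under offloading decision alpha."""
    return (t_off, e_off) if alpha == 1 else (t_loc, e_loc)

def objective(per_task_costs, tau_max_list):
    """Total system energy of a slot, plus feasibility w.r.t. constraint C8."""
    feasible = all(t <= tau for (t, _), tau in zip(per_task_costs, tau_max_list))
    return sum(e for _, e in per_task_costs), feasible
```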
Preferably, the global state model, the local observation state model, the action model and the reward model of the mobile edge system are defined with a decentralized Markov decision process according to the resource allocation model:

Defining the global state model: the global state contains the task information generated by all users in the mobile edge system and the states of the wireless subchannels, and is expressed as $s(t)=\{W(t),h(t)\}$, where $W(t)=\{W_1(t),\ldots,W_N(t)\}$ is the set of task information of all users in the mobile edge system, $W_n(t)$ the set of task information of all users under base station $S_n$, $h(t)=\{h_1(t),\ldots,h_N(t)\}$ the set of instantaneous channel gains between all base stations and all users, and $h_n(t)$ the set of instantaneous channel gains between base station $S_n$ and all users in the edge system.

Defining the local observation state model: in the distributed, partially observable MEC environment, the local observation state is expressed as $o_n(t)=\{W_n(t),h_n(t)\}$, where $o_n(t)$ is the local observation state of base station $S_n$ at time slot $t$, $W_n(t)$ is the set of computing tasks of all users under the coverage of $S_n$, and $h_n(t)$ is the set of instantaneous channel gains between $S_n$ and all users in the mobile edge system, containing the gains $h_{n,m_n}(t)$ between $S_n$ and the users $m_n$ under its own coverage and the gains $h_{n,m_{n'}}(t)$ between $S_n$ and the users $m_{n'}$ under the coverage of the other base stations $S_{n'}$.

Action model: the action space of an agent comprises four actions, the offloading decision variables, the channel allocation decision variables, the transmission power allocation and the CPU computing resource allocation, written as $a_n(t)=\{\alpha_n(t),\beta_n(t),p_n(t),f_n(t)\}$, where $\alpha_n(t)$ is the set of offloading decision variables of all users under the coverage of base station $S_n$, $\beta_n(t)$ the set of channel allocation decision variables of those users, $p_n(t)$ the set of their transmission powers, and $f_n(t)$ the set of CPU computing resources allocated to them.

Reward function: the negative of the energy consumption is given to the agent as reward, and an additional bonus is granted to an agent whose decision meets the maximum tolerable delay requirement of a task. The reward of an agent is

$r_n(t)=-E_n(t)+\Omega_n(t),\qquad \Omega_n(t)=\eta\sum_{m_n\in\mathcal{M}_n}H\!\left(\tau_{m_n}^{\max}(t)-T_{m_n}(t)\right),$

where $H(\cdot)$ is the unit step function, $\eta$ the bonus coefficient, $r_n(t)$ the reward of agent $S_n$, and $E_n(t)$ the total energy consumed to compute the tasks of all users under the coverage of base station $S_n$.
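A short sketch of the reward, assuming the bonus coefficient $\eta=5$ from the simulation section and implementing the unit step $H(\cdot)$ as a comparison:

```python
def agent_reward(delays, energies, tau_max_list, eta=5.0):
    """Reward of agent S_n: negative total energy plus eta for each task
    whose completion delay meets its maximum tolerable delay (unit step H)."""
    e_n = sum(energies)
    omega_n = eta * sum(1.0 for t, tau in zip(delays, tau_max_list) if t <= tau)
    return -e_n + omega_n
```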
Preferably, each agent extracting its local observation state from the global state model of the current mobile edge system, feeding it into the actor network and outputting the selected action includes:

For the discrete actions in the action set, i.e. the offloading decision and the channel allocation decision, continuous parameters $g_1$ and $g_2$ are used to construct the corresponding embedding tables $\varepsilon^1$ and $\varepsilon^2$, where $\varepsilon^1$ denotes the embedding table of the offloading decisions and $\varepsilon^2$ the embedding table of the channel allocation decisions. Each agent feeds its observed local observation state into the actor network to obtain the encoded discrete action $e_n$ and the encoded continuous action $z_n$; the encoded continuous action $z_n$ is decoded to obtain the continuous action $x_n$, and the discrete action $u_n$ is obtained by looking up the tables $\varepsilon^1$ and $\varepsilon^2$ with the encoded discrete action $e_n$. Agent $S_n$ interacts with the environment of the mobile edge system according to the continuous action $x_n$ and the discrete action $u_n$; the continuous actions comprise the transmission power allocation and the CPU computing resource allocation.
Preferably, the actor network comprises a discrete action encoding network, a continuous action encoding network and a continuous action decoding network;

the discrete action encoding network and the continuous action encoding network obtain the encoded discrete action and the encoded continuous action, respectively, from the local observation state observed by the agent; the discrete action encoding network, the continuous action encoding network and the continuous action decoding network are all fully connected neural networks with two hidden layers;

the continuous action decoding network concatenates the encoded discrete action $e_n$, the local observation state $o_n(t)$ and the encoded continuous action $z_n$ as its input, and decodes $z_n$ to obtain the continuous action $x_n$.
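The hybrid-action actor can be sketched in PyTorch as follows; the hidden sizes follow the simulation section, while the class name, the dimensions and the single shared embedding table (the patent uses two tables, $\varepsilon^1$ and $\varepsilon^2$) are simplifying assumptions:

```python
import torch
import torch.nn as nn

class HybridActor(nn.Module):
    """Sketch of the hybrid-action actor network.

    The encoders map the local observation o_n to an encoded discrete action
    e_n and an encoded continuous action z_n; e_n is decoded by a nearest-row
    lookup in a learnable embedding table, and the decoder turns
    [e_n, o_n, z_n] into the continuous actions (power, CPU frequency)."""

    def __init__(self, obs_dim, n_discrete, emb_dim, cont_dim):
        super().__init__()
        self.table = nn.Parameter(torch.randn(n_discrete, emb_dim))  # embedding table

        def mlp(out_dim, h1=64, h2=32):                # two hidden layers (64, 32)
            return nn.Sequential(nn.Linear(obs_dim, h1), nn.ReLU(),
                                 nn.Linear(h1, h2), nn.ReLU(),
                                 nn.Linear(h2, out_dim))

        self.discrete_enc = mlp(emb_dim)
        self.cont_enc = mlp(emb_dim)
        self.cont_dec = nn.Sequential(                 # input: [e_n, o_n, z_n]
            nn.Linear(2 * emb_dim + obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, cont_dim), nn.Tanh())

    def forward(self, o_n):                            # o_n: (batch, obs_dim)
        e_n = self.discrete_enc(o_n)                   # encoded discrete action
        z_n = self.cont_enc(o_n)                       # encoded continuous action
        x_n = self.cont_dec(torch.cat([e_n, o_n, z_n], dim=-1))
        u_n = torch.cdist(e_n, self.table).argmin(dim=-1)  # nearest table row
        return u_n, x_n, e_n, z_n
```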
Preferably, evaluating the action selected by the actor network with the critic network includes:

treating each agent as a node and regarding two base stations whose distance is smaller than a set threshold $D$ as connected, an undirected graph $G(V,E,A)$ is created, where $V$ denotes the set of vertices in the graph, $E$ the set of edges of $G$, and $A$ the adjacency matrix of $G$;

the channel gains between agent $S_n$ and all users under the coverage of all the remaining agents in the mobile edge system are taken as the node features of agent $S_n$. According to the node features of agent $S_n$ and the undirected graph $G(V,E,A)$, a GAT model aggregates the neighbor nodes of agent node $S_n$ to obtain the weighted node features of $S_n$. The action selected by agent $S_n$ under its local observation state is then evaluated, and the evaluation value output, according to the weighted node features of $S_n$, the encoded discrete actions $e$ of all agents, the continuous actions $x$ of all agents, and the tasks of all users in the mobile edge system.
Preferably, the critic network comprises a GAT model and an evaluation model;

the GAT model aggregates the neighbor nodes of agent node $S_n$; the evaluation model concatenates the weighted node features of agent $S_n$, the encoded discrete actions $e$ of all agents, the continuous actions $x$ of all agents and the tasks of all users in the mobile edge system as its input, and outputs the evaluation value of agent $S_n$; the evaluation model is a fully connected neural network with two hidden layers.
Preferably, updating the parameters of the actor network and the critic network includes:

at the beginning of centralized training, each agent copies its actor network $\pi$ and critic network $Q$ to create a target actor network $\pi'$ and a target critic network $Q'$;

the actor network updates its parameters $\theta_n$ with the goal of maximizing the action value function, which can be expressed as

$J(\pi_n)=\frac{1}{B}\sum_{b=1}^{B}Q_n^{\omega_n}\!\left(o^b,e^b,x^b\right)\Big|_{(e_n^b,\,x_n^b)=\pi_n(o_n^b)},$

where $B$ is the number of samples agent $S_n$ draws from the experience replay pool, $b$ the index of a sample, $Q_n^{\omega_n}$ the value function, i.e. the critic network, $\theta_n$ the parameters of the actor network, $\omega_n$ the parameters of the critic network, $\pi_n$ the actor network, $o_n^b$ the local observation state of agent $S_n$ in the $b$-th shared sample, $o^b$ the set of local observation states of all agents in the $b$-th shared sample, $e^b$ and $x^b$ the sets of discrete and continuous actions of all agents in the $b$-th shared sample, and $J(\pi_n)$ the action value function;

the critic network updates the parameters $\omega_n$ by minimizing the loss function of agent $S_n$, which can be expressed as

$L(\omega_n)=\frac{1}{B}\sum_{b=1}^{B}\left(y_n^b-Q_n^{\omega_n}(o^b,e^b,x^b)\right)^2,\qquad y_n^b=r_n^b+\gamma\,Q_n^{\omega_n'}\!\left(o'^b,e'^b,x'^b\right),$

where $L(\omega_n)$ is the loss function of agent $S_n$, $\gamma$ the discount factor, $\pi'_n$ the target actor network of agent $S_n$ and $Q_n^{\omega_n'}$ its target critic network; the parameters $\theta'_n$ and $\omega'_n$ of the target networks are soft-updated as $\theta'_n=\tau\theta_n+(1-\tau)\theta'_n$ and $\omega'_n=\tau\omega_n+(1-\tau)\omega'_n$, where $\tau$ is the update rate; $r_n^b$ is the reward fed back by the environment after agent $S_n$ executes the action selected by the actor network; $o'^b$ is the set of local observation states at the next moment obtained after all agents execute the actions selected by their actor networks; and $e'^b$ and $x'^b$ are the sets of discrete and continuous actions selected by the target actor networks for $o'^b$.
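A minimal PyTorch sketch of the critic update and the soft target update; the batch layout, the way target-actor actions are obtained, and the default $\gamma$ and $\tau$ values are assumptions:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actions, batch, gamma=0.95):
    """MSE between Q(o, e, x) and the target y = r_n + gamma * Q'(o', e', x')."""
    o, e, x, r_n, o_next = batch              # shared samples of all agents
    e_next, x_next = target_actions           # chosen by the target actor networks
    with torch.no_grad():
        y = r_n + gamma * target_critic(o_next, e_next, x_next)
    return F.mse_loss(critic(o, e, x), y)

def soft_update(net, target_net, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter."""
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```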
The invention has at least the following beneficial effects:
Aiming at the computation offloading and resource allocation problems in a multi-base-station multi-user MEC network scenario, the invention proposes a computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning whose goal is to minimize system energy consumption under delay constraints. First, a multi-base-station MEC network scenario is constructed and a joint optimization problem of computation offloading and resource allocation is formulated. Each base station in the multi-base-station multi-user MEC network scenario is treated as an agent configured with the algorithm for training; the multi-base-station MEC scenario is modeled as an undirected graph, and a critic network based on GAT (graph attention network) mines the potential spatial correlation between network states, so that each agent selectively attends to the information of the other agents in its neighborhood and an intelligent cooperative offloading scheme is formulated. Finally, each base station obtains the optimal computation offloading and resource allocation strategy by feeding its local observation information into the trained actor network. Simulation experiments show that, compared with the baseline algorithms, the algorithm clearly improves the average reward value, the unit system energy consumption and the task completion rate.
Drawings
FIG. 1 is a schematic diagram of the method architecture of the present invention;
FIG. 2 is a schematic diagram of the actor-critic network architecture of the present invention;
FIG. 3 shows how the average reward value, the unit system energy consumption and the task completion rate vary with the number of agents under different algorithms;
FIG. 4 shows how the average reward value, the unit system energy consumption and the task completion rate vary with the bandwidth under different algorithms;
FIG. 5 shows how the average reward value, the unit system energy consumption and the task completion rate vary with the task data amount under different algorithms.
Detailed Description
The following describes embodiments of the present invention with reference to specific examples; other advantages and effects of the invention will readily become apparent to those skilled in the art from this disclosure. The invention may also be practiced or applied in other, different embodiments, and the details of this description may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with each other as long as they do not conflict.
The drawings are for illustration only; they are schematic rather than physical and are not intended to limit the invention. To better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product. It will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
Referring to FIG. 1, the present invention provides a computation offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning, which comprises:
S1: create a mobile edge system in a multi-base-station multi-user MEC network environment. The system comprises a plurality of users and $N$ base stations configured with MEC servers; the set of users under the coverage of base station $S_n$ is expressed as $\mathcal{M}_n=\{1,2,\ldots,M_n\}$, where $M_n$ denotes the number of users under the coverage of $S_n$ and $m_n\in\mathcal{M}_n$ denotes a user under the coverage of $S_n$.
S2: the mobile edge system serves the users with orthogonal frequency division multiple access, and each cell fully reuses the spectrum resources.
S3: according to the tasks generated by the users in each time slot, create a local task computing model of the users and an edge task computing model of the users.
S4: according to the local and edge task computing models of the users, establish a resource allocation model whose objective is to minimize the energy consumption of the mobile edge system under the maximum tolerable delay constraint of the tasks.
S5: treat each base station in the mobile edge system as an agent configured with an actor-critic network, and define the global state model, the local observation state model, the action model and the reward model of the mobile edge system with a decentralized Markov decision process according to the resource allocation model.
S6: the agents learn under a distributed-execution, centralized-training framework. In the distributed execution stage, each agent extracts its local observation state from the global state model of the current mobile edge system, feeds it into the actor network and outputs the selected action; each agent executes the selected action to interact with the environment of the mobile edge system, obtaining the reward fed back by the environment and the local observation state of the next moment. The generated sample data are stored in an experience replay pool as shared samples of all agents.
In the centralized training stage, each agent samples shared sample information from the experience replay pool, then uses the critic network to evaluate the action selected by the actor network, outputs the evaluation value and updates the parameters of the actor network and the critic network.
Preferably, this embodiment models a multi-base-station multi-user MEC network scenario with $N$ base stations configured with MEC servers. Define $\mathcal{S}=\{S_1,S_2,\ldots,S_N\}$ as the set of base stations (BSs) and $\mathcal{M}_n=\{1,2,\ldots,M_n\}$ as the set of users under the coverage of base station $S_n$. The invention divides the time of the MEC system into $T$ slots of equal length, denoted by the set $\mathcal{T}=\{1,2,\ldots,T\}$. At time $t\in\mathcal{T}$, the computing task generated by user $m_n\in\mathcal{M}_n$ is defined as $w_{m_n}(t)=\{d_{m_n}(t),c_{m_n}(t),\tau_{m_n}^{\max}(t)\}$, where $d_{m_n}(t)$ denotes the data amount of the computing task, $c_{m_n}(t)$ the number of CPU cycles required per bit of the task in cycles/bit, and $\tau_{m_n}^{\max}(t)$ the maximum tolerable delay of the task.
The invention considers indivisible tasks: to complete its computing task, a user can either execute the task locally or offload it to the MEC server at the base station for execution.
Define the offloading decision variable $\alpha_{m_n}(t)\in\{0,1\}$: $\alpha_{m_n}(t)=1$ means user $m_n$ chooses in the $t$-th time slot to upload task $w_{m_n}(t)$ to the MEC server of base station $S_n$ for edge computing, and $\alpha_{m_n}(t)=0$ means user $m_n$ chooses to compute the task locally.
When a UE chooses to offload a task to the MEC server for processing, three phases are involved: (1) the task data are uploaded to the MEC server over a wireless channel; (2) the MEC server allocates corresponding computing resources to process the uploaded task; (3) the MEC server returns the processing result to the corresponding user. Since the size of the computation result is usually much smaller than that of the input data, the invention ignores the delay and energy consumption of the third phase.
The system serves multiple UEs with orthogonal frequency division multiple access (OFDMA), and each cell fully reuses the spectrum resources. Interference between different UEs under the same BS is thus effectively suppressed, but inter-cell interference still exists, and it is stronger between two base stations that are closer to each other.
Assume the system has $K$ mutually orthogonal wireless subchannels and define $\mathcal{K}=\{1,2,\ldots,K\}$, each subchannel having transmission bandwidth $B_0$. Define the channel allocation decision variable $\beta_{m_n}^k(t)\in\{0,1\}$: $\beta_{m_n}^k(t)=1$ means base station $S_n$ assigns wireless subchannel $k$ to user $m_n$ at time $t$ for transmitting data, otherwise $\beta_{m_n}^k(t)=0$. Before task $w_{m_n}(t)$ finishes uploading, user $m_n$ occupies channel $k$ continuously. According to the definitions above, the uplink transmission rate of user $m_n$ on wireless subchannel $k$ is

$r_{m_n}^k(t)=B_0\log_2\!\left(1+\frac{p_{m_n}(t)\,h_{n,m_n}^k(t)}{\sigma^2+I_k(t)}\right),\qquad I_k(t)=\sum_{n'\neq n}\sum_{m_{n'}\in\mathcal{M}_{n'}}\beta_{m_{n'}}^k(t)\,p_{m_{n'}}(t)\,h_{n,m_{n'}}^k(t).$

The transmission delay and energy consumption of user $m_n$ uploading task $w_{m_n}(t)$ to the MEC server of base station $S_n$ are

$t_{m_n}^{tr}(t)=\frac{d_{m_n}(t)}{r_{m_n}^k(t)},\qquad e_{m_n}^{tr}(t)=p_{m_n}(t)\,t_{m_n}^{tr}(t),$

where $p_{m_n}(t)$ is the transmission power of user $m_n$ at time $t$, $h_{n,m_n}^k(t)$ the instantaneous channel gain between base station $S_n$ and user $m_n$ on subchannel $k$, $\sigma^2$ the noise power, and $I_k(t)$ the inter-cell interference suffered by user $m_n$ on subchannel $k$, summed over the co-channel users $m_{n'}\in\mathcal{M}_{n'}$ of the other base stations $S_{n'}$.
Define $f_{m_n}^{mec}(t)$ as the CPU frequency the MEC server of base station $S_n$ allocates to task $w_{m_n}(t)$. The computation delay and energy consumption at base station $S_n$ are

$t_{m_n}^{mec}(t)=\frac{d_{m_n}(t)\,c_{m_n}(t)}{f_{m_n}^{mec}(t)},\qquad e_{m_n}^{mec}(t)=\zeta_n\,d_{m_n}(t)\,c_{m_n}(t)\left(f_{m_n}^{mec}(t)\right)^2,$

where $\zeta_n$ is the effective energy coefficient determined by the chip architecture of the MEC server of base station $S_n$. The total delay and energy consumption of user $m_n$ uploading task $w_{m_n}(t)$ to the MEC server of base station $S_n$ for edge computing are then

$t_{m_n}^{off}(t)=t_{m_n}^{tr}(t)+t_{m_n}^{mec}(t),\qquad e_{m_n}^{off}(t)=e_{m_n}^{tr}(t)+e_{m_n}^{mec}(t).$
Creating the local task computing model: define $f_{m_n}^{loc}(t)$ as the CPU frequency user $m_n$ assigns to task $w_{m_n}(t)$. The delay and energy consumption of computing the task locally are

$t_{m_n}^{loc}(t)=\frac{d_{m_n}(t)\,c_{m_n}(t)}{f_{m_n}^{loc}(t)},\qquad e_{m_n}^{loc}(t)=\xi_{m_n}\,d_{m_n}(t)\,c_{m_n}(t)\left(f_{m_n}^{loc}(t)\right)^2,$

where $\xi_{m_n}$ is the effective energy coefficient determined by user $m_n$'s chip architecture.
Preferably, the optimization objective of the invention is to minimize the energy consumption of the system without exceeding the maximum tolerable delay of any task. The objective is achieved by jointly optimizing the offloading decisions, the channel allocation, the power allocation and the computing resource allocation; the specific resource allocation model is as follows:
$\min_{\alpha(t),\,\beta(t),\,p(t),\,f(t)}\;\sum_{n=1}^{N}\sum_{m_n\in\mathcal{M}_n}\Big[\big(1-\alpha_{m_n}(t)\big)\,e_{m_n}^{loc}(t)+\alpha_{m_n}(t)\,e_{m_n}^{off}(t)\Big]$

subject to constraints C1-C8: C1: $\alpha_{m_n}(t)\in\{0,1\}$; C2: $\beta_{m_n}^k(t)\in\{0,1\}$; C3: $\sum_{m_n\in\mathcal{M}_n}\beta_{m_n}^k(t)\le 1$; C4: $\sum_{k\in\mathcal{K}}\beta_{m_n}^k(t)\le 1$; C5: $f_{\min}^{loc}\le f_{m_n}^{loc}(t)\le f_{\max}^{loc}$; C6: $f_{\min}^{mec}\le f_{m_n}^{mec}(t)\le f_{\max}^{mec}$; C7: $p_{\min}\le p_{m_n}(t)\le p_{\max}$; C8: $T_{m_n}(t)\le\tau_{m_n}^{\max}(t)$.

Here $\alpha(t)$ is the set of offloading decision variables of all users in the mobile edge system and $\alpha_n(t)$ that of the users under base station $S_n$; $\beta(t)$ is the set of channel allocation decision variables of all users in the $t$-th slot and $\beta_n(t)$ that of the users under base station $S_n$; $p(t)$ is the set of transmission powers of all users, and $f(t)$ the set of CPU computing resources allocated to the tasks of all users; $f_{\min}^{loc}$ and $f_{\max}^{loc}$ are the minimum and maximum local CPU frequencies of a user, $f_{\min}^{mec}$ and $f_{\max}^{mec}$ the minimum and maximum CPU frequencies of an MEC server, and $p_{\min}$ and $p_{\max}$ the minimum and maximum transmission powers of a user. Constraint C3 states that each wireless subchannel under the coverage of base station $S_n$ can be allocated to at most one user at a time, and constraint C4 states that each user can use at most one wireless subchannel to upload a task at a time. $T_{m_n}(t)$ is the total delay for completing the task: $T_{m_n}(t)=t_{m_n}^{off}(t)$ when $\alpha_{m_n}(t)=1$, and $T_{m_n}(t)=t_{m_n}^{loc}(t)$ when $\alpha_{m_n}(t)=0$.
Because the resource allocation model is a mixed-integer nonlinear programming problem, and complete state information is difficult to obtain in a dynamically changing, distributedly deployed MEC scenario, traditional optimization algorithms can hardly solve it effectively. The computation offloading and resource allocation problem is Markovian, so the invention converts the resource allocation model into a decentralized partially observable Markov decision process (Dec-POMDP), treats each base station as an agent and solves the problem with reinforcement learning. Considering the influence of the potential spatial correlation between MEC servers on each other's performance, the invention combines the graph attention network (GAT) and multi-agent reinforcement learning (MARL) into a GAT hybrid-action multi-agent reinforcement learning method for solving the resource allocation model. The solution process is as follows:
The invention treats each base station equipped with an MEC server as an agent. Since an agent cannot obtain the complete state information of the environment in the distributed MEC network environment, the problem is converted into a Dec-POMDP, which can be represented by the tuple $\langle\mathcal{S},\mathcal{O},\mathcal{A},\mathcal{P},R\rangle$, where $\mathcal{S}$ is the global state space, $\mathcal{O}$ the set of observation spaces of the agents, $\mathcal{A}$ the set of action spaces of the agents, $\mathcal{P}$ the state transition function, and $R$ the reward function. At time $t$, agent $n$ obtains the observation information $o_n(t)\in\mathcal{O}_n$ from the environment state $s(t)\in\mathcal{S}$, then selects the offloading and resource allocation action $a_n(t)\in\mathcal{A}_n$ according to its existing policy and interacts with the environment; the environment transitions to the next state $s(t+1)$ according to the joint action of all agents, after which the agent receives the reward value $r_n(t)\in R_n$ returned by the environment and the observation information at time $t+1$. The elements of the tuple are defined as follows:
Defining the global state model: the global state contains the task information generated by all users in the mobile edge system and the states of the wireless subchannels, and is expressed as $s(t)=\{W(t),h(t)\}$, where $W(t)=\{W_1(t),\ldots,W_N(t)\}$ is the set of task information of all users in the mobile edge system, $W_n(t)$ the set of task information of all users under base station $S_n$, $h(t)=\{h_1(t),\ldots,h_N(t)\}$ the set of instantaneous channel gains between all base stations and all users, and $h_n(t)$ the set of instantaneous channel gains between base station $S_n$ and all users in the edge system.

Defining the local observation state model: in the distributed, partially observable MEC environment, the local observation state is expressed as $o_n(t)=\{W_n(t),h_n(t)\}$, where $o_n(t)$ is the local observation state of base station $S_n$ at time slot $t$, $W_n(t)$ is the set of computing tasks of all users under the coverage of $S_n$, and $h_n(t)$ is the set of instantaneous channel gains between $S_n$ and all users in the mobile edge system, containing the gains $h_{n,m_n}(t)$ between $S_n$ and the users $m_n$ under its own coverage and the gains $h_{n,m_{n'}}(t)$ between $S_n$ and the users $m_{n'}$ under the coverage of the other base stations $S_{n'}$.

Action model: the action space of an agent comprises four actions, the offloading decision variables, the channel allocation decision variables, the transmission power allocation and the CPU computing resource allocation, written as $a_n(t)=\{\alpha_n(t),\beta_n(t),p_n(t),f_n(t)\}$, where $\alpha_n(t)$ is the set of offloading decision variables of all users under the coverage of base station $S_n$, $\beta_n(t)$ the set of channel allocation decision variables of those users, $p_n(t)$ the set of their transmission powers, and $f_n(t)$ the set of CPU computing resources allocated to them.

Reward function: the negative of the energy consumption is given to the agent as reward, and an additional bonus is granted to an agent whose decision meets the maximum tolerable delay requirement of a task. The reward of an agent is

$r_n(t)=-E_n(t)+\Omega_n(t),\qquad \Omega_n(t)=\eta\sum_{m_n\in\mathcal{M}_n}H\!\left(\tau_{m_n}^{\max}(t)-T_{m_n}(t)\right),$

where $H(\cdot)$ is the unit step function, $\eta$ the bonus coefficient, $r_n(t)$ the reward of agent $S_n$, and $E_n(t)$ the total energy consumed to compute the tasks of all users under the coverage of base station $S_n$.
In order to mine the potential spatial correlation of the MEC network and obtain better computation offloading and resource allocation strategies, the invention proposes the GAT-HMARL algorithm, combining GAT and MARL, to solve the Dec-POMDP problem. The GAT-HMARL architecture is shown in FIG. 1; each base station is treated as an agent configured with an actor-critic network. First, since the scenario involves both discrete actions (offloading decisions and channel allocation) and continuous actions (power allocation and computing resource allocation), hybrid discrete-continuous action encoding is adopted in the actor network for accurate action selection. Second, GAT is embedded in the critic network to guide each agent to attend selectively to the state information of the other agents in its neighborhood, mine the potential spatial correlation and improve the accuracy of evaluating the selected actions. Finally, the agents learn under a distributed-execution, centralized-training framework: during distributed execution, an agent only needs to select the corresponding offloading and resource allocation action according to its local observation information and current policy; during centralized training, the agents obtain the sample data of all agents from the experience replay pool by mini-batch sampling and use them to guide the network parameter updates, thereby learning better computation offloading and resource allocation strategies.
Preferably, each agent extracting its local observation state from the global state model of the current mobile edge system, feeding it into the actor network and outputting the selected action includes:

For the discrete actions in the action set, i.e. the offloading decision and the channel allocation decision, continuous parameters $g_1$ and $g_2$ are used to construct the corresponding embedding tables $\varepsilon^1$ (offloading decisions) and $\varepsilon^2$ (channel allocation decisions). Each agent feeds its observed local observation state into the actor network to obtain the encoded discrete action $e_n$ and the encoded continuous action $z_n$; $z_n$ is decoded to obtain the continuous action $x_n$, and the discrete action $u_n$ is obtained by looking up $\varepsilon^1$ and $\varepsilon^2$ with $e_n$. Agent $S_n$ then interacts with the environment of the mobile edge system according to $x_n$ and $u_n$; the continuous actions comprise the transmission power allocation and the CPU computing resource allocation. The output encoded discrete action $e_n$ is compared with the continuous parameters in each embedding table, and the row index of the closest set of parameters is taken as the corresponding discrete action.
Preferably, the actor network comprises a discrete action encoding network, a continuous action encoding network and a continuous action decoding network;

the discrete action encoding network and the continuous action encoding network obtain the encoded discrete action and the encoded continuous action, respectively, from the local observation state observed by the agent; the discrete action encoding network, the continuous action encoding network and the continuous action decoding network are all fully connected neural networks with two hidden layers;

the continuous action decoding network concatenates the encoded discrete action $e_n$, the local observation state $o_n(t)$ and the encoded continuous action $z_n$ as its input, and decodes $z_n$ to obtain the continuous action $x_n$.
Preferably, evaluating the action selected by the actor network with the critic network includes:

treating each agent as a node and regarding two base stations whose distance is smaller than a set threshold $D$ as connected, an undirected graph $G(V,E,A)$ is created, where $V$ denotes the set of vertices in the graph, $E$ the set of edges of $G$, and $A$ the adjacency matrix of $G$;

the channel gains between agent $S_n$ and all users under the coverage of all the remaining agents in the mobile edge system are taken as the node features of agent $S_n$. According to the node features of agent $S_n$ and the undirected graph $G(V,E,A)$, a GAT model aggregates the neighbor nodes of agent node $S_n$ to obtain the weighted node features of $S_n$. The action selected by agent $S_n$ under its local observation state is then evaluated, and the evaluation value output, according to the weighted node features of $S_n$, the encoded discrete actions $e$ of all agents, the continuous actions $x$ of all agents, and the tasks of all users in the mobile edge system.
Preferably, the critic network comprises a GAT model and an evaluation model;

the GAT model aggregates the neighbor nodes of agent node $S_n$; the evaluation model concatenates the weighted node features of agent $S_n$, the encoded discrete actions $e$ of all agents, the continuous actions $x$ of all agents and the tasks of all users in the mobile edge system as its input, and outputs the evaluation value of agent $S_n$; the evaluation model is a fully connected neural network with two hidden layers.
In a multi-base-station MEC environment, there is potential spatial correlation between MEC network states, and the closer the MEC servers are, the stronger their interaction. To obtain better computation offloading and resource allocation strategies, an agent should selectively perceive the feature information of the other agents in its neighborhood, and mine and fully exploit the potential spatial correlation. The invention constructs the multi-agent environment as an undirected graph $G(V,E,A)$, where $V$ is the set of vertices (each vertex in the graph is an agent), $E$ the set of edges and $A$ the adjacency matrix of the graph. For agent $n$, the channel gains between agent $n$ and all users under agent $n'$ are taken as the node features $h_{n'}$ of node $n'$. By embedding GAT in the critic network, the attention coefficients of the remaining agents are obtained, the feature information is aggregated according to these coefficients, and the agent is guided to perceive the feature information of its neighbor nodes purposefully, mine the potential spatial correlation and evaluate the selected actions more accurately. The network architecture is shown in FIG. 2.
Specifically, the attention coefficient of agent $n$ with respect to agent $n'$ is computed as

$e_{nn'}=\mathrm{att}(W_1h_n,\;W_1h_{n'}),$

where $\mathrm{att}(\cdot)$ denotes the attention mechanism, $W_1$ is the corresponding learnable weight matrix, and the attention coefficient $e_{nn'}$ indicates the importance of the features of node $n'$ to node $n$. Because each node has different neighbor nodes with different features, the attention coefficients are asymmetric, i.e. $e_{nn'}\neq e_{n'n}$. To make them comparable across different nodes, define $\mathcal{N}_n$ as the set of neighbor nodes of node $n$, whose elements are determined by the adjacency matrix $A$, and normalize the attention coefficients with the activation function to obtain the corresponding attention weights

$\delta_{nn'}=\frac{\exp\!\big(\mathrm{LeakyReLU}(e_{nn'})\big)}{\sum_{j\in\mathcal{N}_n}\exp\!\big(\mathrm{LeakyReLU}(e_{nj})\big)}.$

To make the learning process of self-attention more stable, multi-head attention is applied: on the basis of the normalized attention coefficients, the $L$ independent attention mechanisms of agent $n$ are fused into the latent feature vector

$\tilde h_n=\Big\Vert_{l=1}^{L}\,\sigma\Big(\sum_{n'\in\mathcal{N}_n}\delta_{nn'}^{\,l}\,W_1^{\,l}\,h_{n'}\Big),$

where $\sigma$ is a nonlinear function, $\Vert$ denotes the concatenation operation, $l$ is the index of an attention head, $L$ the number of heads in the multi-head attention mechanism, and $\tilde h_n$ the weighted node features of agent $S_n$.
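The computation above corresponds to a standard GAT layer. A single-head PyTorch sketch follows (the patent concatenates $L$ such heads; the class name and sizes are illustrative, and the adjacency matrix is assumed to include self-loops so that every row has at least one neighbour):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """One attention head: e_nn' = att(W1 h_n, W1 h_n'), softmax-normalised
    over the neighbours given by the adjacency matrix A."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)           # W1
        self.att = nn.Linear(2 * out_dim, 1, bias=False)           # att(.)

    def forward(self, h, adj):
        """h: (N, in_dim) node features; adj: (N, N) adjacency matrix."""
        wh = self.w1(h)                                            # W1 h_n
        n = wh.size(0)
        pairs = torch.cat([wh.repeat_interleave(n, dim=0),
                           wh.repeat(n, 1)], dim=-1).view(n, n, -1)
        e = F.leaky_relu(self.att(pairs)).squeeze(-1)              # e_nn'
        e = e.masked_fill(adj == 0, float('-inf'))                 # neighbours only
        delta = torch.softmax(e, dim=-1)                           # attention weights
        return F.elu(delta @ wh)                                   # aggregated features
```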
In the distributed execution stage, for agent $n$, the local observation $o_n$ obtained from the environment at time $t$ is fed into the actor network, which outputs, according to the observed value and the current policy, the encoded discrete action $e_n$ and the continuous action $x_n$; decoding $e_n$ yields the actual discrete action $u_n$. The agent executes the selected action to interact with the environment, obtaining the reward $r_n$ fed back by the environment and the observation information $o'_n$ at time $t+1$. All agents simultaneously store their sample data in the experience replay pool, forming the shared sample $\{o_1,\ldots,o_N,\,e_1,\ldots,e_N,\,x_1,\ldots,x_N,\,r_1,\ldots,r_N,\,o'_1,\ldots,o'_N\}$.
In the centralized training stage, agent $n$ samples shared sample information from the experience replay pool, i.e. the local observation information $o=(o_1,o_2,\ldots,o_N)$, encoded discrete actions $e=(e_1,e_2,\ldots,e_N)$ and continuous actions $x=(x_1,x_2,\ldots,x_N)$ of all agents in the environment at that moment. To better perceive the feature information of the other agents in the neighborhood, the critic network processes the node feature information $h_n$ with GAT to obtain $\tilde h_n$, then evaluates the action selected by the agent under its local observation according to $\tilde h_n$, $W$, $e$ and $x$, outputting the evaluation value $Q_n$. By processing the information through the graph attention mechanism, the critic network can selectively perceive the local observation information of the other agents in the neighborhood, extract more effective information for agent $n$ and increase the evaluation accuracy. Here $W$ denotes the tasks of all users in the mobile edge system.
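The shared experience replay pool can be sketched as a simple container of joint transitions; the class name and capacity are illustrative:

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Each entry stores the joint data (o, e, x, r, o') of all N agents."""

    def __init__(self, capacity=100000):
        self.pool = deque(maxlen=capacity)

    def store(self, obs, e, x, rewards, next_obs):
        # each argument is a tuple over agents 1..N
        self.pool.append((obs, e, x, rewards, next_obs))

    def sample(self, batch_size):
        """Mini-batch sampling used in the centralized training stage."""
        return random.sample(self.pool, batch_size)
```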
Each agent obtains a sample set by mini-batch sampling from the experience replay pool, which is used to compute the gradient of the optimization objective. Let $\pi=\{\pi_1,\pi_2,\ldots,\pi_N\}$ and $Q=\{Q_1,Q_2,\ldots,Q_N\}$ denote the actor and critic networks, with parameters $\theta=\{\theta_1,\theta_2,\ldots,\theta_N\}$ and $\omega=\{\omega_1,\omega_2,\ldots,\omega_N\}$. To alleviate the overestimation problem, the corresponding target networks $\pi'$ and $Q'$ are created in the initialization phase of centralized training and their parameters are copied from the actor and critic networks.

During training, the actor network updates its parameters $\theta_n$ with the goal of maximizing the action value function, which can be expressed as

$J(\pi_n)=\frac{1}{B}\sum_{b=1}^{B}Q_n^{\omega_n}\!\left(o^b,e^b,x^b\right)\Big|_{(e_n^b,\,x_n^b)=\pi_n(o_n^b)},$

where $B$ is the number of samples agent $S_n$ draws from the experience replay pool, $b$ the index of a sample, $Q_n^{\omega_n}$ the value function, i.e. the critic network, $\theta_n$ the parameters of the actor network, $\omega_n$ the parameters of the critic network, $\pi_n$ the actor network, $o_n^b$ the local observation state of agent $S_n$ in the $b$-th shared sample, $o^b$ the set of local observation states of all agents in the $b$-th shared sample, $e^b$ and $x^b$ the sets of discrete and continuous actions of all agents in the $b$-th shared sample, and $J(\pi_n)$ the action value function.

The critic network updates the parameters $\omega_n$ by minimizing the loss function of agent $S_n$, which can be expressed as

$L(\omega_n)=\frac{1}{B}\sum_{b=1}^{B}\left(y_n^b-Q_n^{\omega_n}(o^b,e^b,x^b)\right)^2,\qquad y_n^b=r_n^b+\gamma\,Q_n^{\omega_n'}\!\left(o'^b,e'^b,x'^b\right),$

where $L(\omega_n)$ is the loss function of agent $S_n$, $\gamma$ the discount factor, $\pi'_n$ the target actor network of agent $S_n$ and $Q_n^{\omega_n'}$ its target critic network; the parameters $\theta'_n$ and $\omega'_n$ of the target networks are soft-updated as $\theta'_n=\tau\theta_n+(1-\tau)\theta'_n$ and $\omega'_n=\tau\omega_n+(1-\tau)\omega'_n$, where $\tau$ is the update rate; $r_n^b$ is the reward fed back by the environment after agent $S_n$ executes the action selected by its actor network; $o'^b$ is the set of local observation states at the next moment obtained after all agents execute the actions selected by their actor networks; and $e'^b$ and $x'^b$ are the sets of discrete and continuous actions selected by the target actor networks for $o'^b$.
To verify the effectiveness of the method provided by this embodiment, simulation experiments were carried out. The proposed method (GAT-HMARL) was simulated and verified on the PyTorch platform with the Adam optimizer. To assess the performance of the proposed algorithm, the comparison algorithms are:
1) DDPG: each base station is treated as an agent; the agents learn their policies independently from their own local observation information and interact with the environment, with no information exchanged between agents.
2) MADDPG: each base station is treated as an agent; the agents share an experience replay pool, through which they can obtain the information of the other agents during the training stage and use it for learning.
The invention considers a multi-base-station MEC network scenario: 6 base stations are randomly distributed in a 1000 m × 1000 m area, the coverage radius of a base station is set to 200 m, and each base station serves 5 users uniformly distributed within its coverage. The channel gain is set proportionally to $d^{-2}$, where $d$ is the transmission distance between user $m_n$ and base station $S_n$ in the scenario. The transmission bandwidth $B_0$ is 20 MHz; the background noise power $N_0$ is -20 dBm; the bonus coefficient $\eta$ is set to 5; the transmission power of a user lies between 13 and 33 dBm; the data amount of a user's task lies between 15 and 20 Mbit, and the number of CPU cycles required per bit of input task is 700 cycles/bit; the maximum tolerable delay of a task lies between 15 and 20 s; the local CPU frequency range is 100-1000 MHz; the CPU frequency range of an MEC server is 2000-10000 MHz; the local effective energy coefficient $\xi_{m_n}$ is $10^{-19}\,\mathrm{W\cdot s^2/cycle^2}$; the effective energy coefficient $\zeta_n$ of an MEC server is $10^{-20}\,\mathrm{W\cdot s^2/cycle^2}$. When the distance between two base stations is smaller than the preset distance $D=400$ m, the entry $(n,n')$ of the adjacency matrix $A$ is set to 1, otherwise to 0.
The actor networks in the algorithms each consist of two fully connected hidden layers with 64 and 32 neurons, respectively, and the critic networks each consist of two fully connected hidden layers with 256 and 128 neurons, respectively; all are activated with ReLU. In the Gat-HMARL algorithm the present invention uses an attention layer with two heads, where the attention mechanism att(·) is a single-layer feed-forward neural network activated by LeakyReLU, and the continuous-action decoding network of Gat-HMARL consists of two fully connected hidden layers with 128 and 64 neurons, respectively. During training, each episode was set to 100 time steps, and the exploration rate decays from 0.1 to 0.05 over 10000 time steps.
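For illustration, a minimal PyTorch module with the stated actor layer sizes might look as follows; the two-head output layout for the encoded discrete and continuous actions is an assumption, since the text above only fixes the hidden-layer widths and activations.

```python
# Hedged sketch of the actor trunk: two FC hidden layers (64, 32) with ReLU,
# plus assumed heads emitting the encoded discrete and continuous actions.
import torch
import torch.nn as nn

class ActorBody(nn.Module):
    def __init__(self, obs_dim: int, e_dim: int, z_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.e_head = nn.Linear(32, e_dim)  # encoded discrete action e_n
        self.z_head = nn.Linear(32, z_dim)  # encoded continuous action z_n

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        return self.e_head(h), self.z_head(h)
```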
Fig. 3 shows the trend of the average reward value, unit system energy consumption, and task completion rate of the three algorithms as the number of agents increases. As the figure shows, with more agents both the average reward value of the system and the unit system energy consumption gradually increase, while the task completion rate gradually decreases. This is because as the number of agents increases, the total number of users in the system also increases, so the attainable upper bound of the reward rises and the average reward value of the system increases. However, inter-cell interference grows and cooperation among agents becomes more difficult, so the unit system energy consumption gradually increases and the task completion rate gradually drops. Fig. 3 also shows the performance advantage of the Gat-HMARL algorithm intuitively: its task completion rate is much higher than that of the other two algorithms while its unit system energy consumption is the lowest, so it obtains a higher average reward value. In addition, the unit-system-energy-consumption curve shows that the DDPG curve rises noticeably faster than those of the other two algorithms, because the agents running DDPG exchange no information; as the number of agents grows, inter-cell interference and the instability of the training environment increasingly degrade its performance.
Fig. 4 shows the trend of the average reward value, unit system energy consumption, and task completion rate of the three algorithms as the bandwidth increases. As the figure shows, the average reward value of the system and the task completion rate gradually increase with the bandwidth, while the unit system energy consumption gradually decreases. This is because as the bandwidth grows, the users' transmission rates keep rising and the transmission energy consumption keeps falling, so offloaded tasks satisfy the delay constraint more easily, and processing a task on the MEC server also consumes less computation energy than local computation. As shown in Fig. 4, the Gat-HMARL algorithm outperforms the other two algorithms in average reward value, unit system energy consumption, and task completion rate, because Gat-HMARL's mining of the spatial correlation of the wireless-network state and its accurate handling of actions allow the base stations to make full use of the spectrum resources, further reducing the delay and energy consumption of the users' uplink transmissions.
Fig. 5 shows the trend of the average reward value, unit system energy consumption, and task completion rate of the three algorithms as the task data volume increases. In this experiment, the task data volume of each UE was generated randomly within a given range, with the ranges set in turn to [5, 10] Mbit, [10, 15] Mbit, [15, 20] Mbit, [20, 25] Mbit, and [25, 30] Mbit, and the maximum tolerable task delay was set randomly within [15, 20] s. As Fig. 5 shows, as the task volume increases, the unit system energy consumption tends to rise overall while the average reward value and the task completion rate tend to fall. Clearly, the larger a user's task data volume, the harder it is to satisfy the delay constraint and the greater the resulting energy consumption, so the task completion rate and the average reward value decrease. Gat-HMARL performs best among the three compared algorithms, followed by MADDPG and then DDPG. The unit-system-energy-consumption curves show that the gap between the three algorithms gradually widens as the task data volume grows: when tasks become too large to complete locally, more and more users offload their computation, and the higher transmission powers used to reduce transmission delay further increase inter-cell interference. The Gat-HMARL algorithm handles computation offloading and resource allocation actions accurately through its hybrid continuous-discrete action encoding, while its mining of the spatial correlation of the wireless-network state helps the base stations extract more valuable information and cooperate better, reducing the system energy consumption while maintaining the task completion rate.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "front", and "rear" that indicate an orientation or positional relationship are based on the orientations or positional relationships shown in the drawings; they are used only for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely exemplary and should not be construed as limiting the present invention; their specific meanings can be understood by those of ordinary skill in the art according to the specific circumstances.

Claims (9)

1. A computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning, characterized by comprising the following steps:
S1: creating a mobile edge system in a multi-base-station, multi-user MEC network environment, the system comprising a plurality of users and N base stations configured with MEC servers; the set of users under the coverage of base station S_n is expressed as {1, 2, …, M_n}, where M_n represents the number of users under the coverage of base station S_n and m_n represents a user under the coverage of base station S_n;
S2: the mobile edge system adopts orthogonal frequency division multiple access to provide services for the plurality of users, and each cell fully multiplexes the spectrum resources;
S3: according to the tasks generated by each user in each time slot, creating a local task computing model and an edge task computing model of the user;
S4: according to the local task computing model and the edge task computing model of the users, establishing a resource allocation model that targets the energy consumption of the mobile edge system under the maximum tolerable delay constraint of the tasks;
S5: treating each base station in the mobile edge system as an agent configured with an actor-critic network, and defining, according to the resource allocation model, the global state model, local observation state model, action model, and reward model of the mobile edge system using a distributed Markov decision process;
S6: the agents learn under a framework of distributed execution and centralized training; in the distributed execution stage, each agent extracts its local observation state from the global state model of the current mobile edge system, inputs it into its actor network, and outputs the selected action; each agent executes the selected action to interact with the environment of the mobile edge system, obtaining the reward fed back by the environment and the local observation state at the next moment; the resulting sample data are stored in an experience replay pool as shared samples for all agents;
in the centralized training stage, each agent samples shared sample information from the experience replay pool, then uses its critic network to evaluate the action selected by the actor network, outputs an evaluation value, and updates the parameters of the actor network and the critic network.
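As an illustrative aside (not part of the claim language), a shared experience replay pool of the kind recited in step S6 could be sketched as follows; the transition fields and the capacity are assumptions.

```python
# Hedged sketch of a shared experience replay pool: each transition stores the
# joint local observations, the hybrid (discrete e / continuous x) actions,
# the rewards, and the next joint observations.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "obs e x rewards next_obs")

class SharedReplayPool:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), batch_size)
```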
2. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 1, wherein creating the local task computing model of a user and the edge task computing model of a user comprises:
defining the task generated by user m_n under the coverage of base station S_n at the t-th time slot as a triple consisting of the data volume of the task, the number of CPU cycles required per bit of task data, and the maximum tolerable delay of the task;
creating the local task computing model: defining f_{m_n}^{loc} as the CPU computation frequency that user m_n assigns to the task, the delay and energy consumption for user m_n to compute the task locally are as follows:
wherein κ_{m_n} represents the effective energy coefficient related to the chip architecture of user m_n; t_{m_n}^{loc} represents the delay for user m_n to execute the task locally; e_{m_n}^{loc} represents the energy consumption for user m_n to execute the task locally;
creating the edge task computing model: defining K mutually orthogonal wireless sub-channels for the mobile edge system, each with transmission bandwidth B_0; defining a channel allocation decision variable β_{m_n,k}(t): β_{m_n,k}(t) = 1 represents that base station S_n assigns wireless sub-channel k to user m_n at time t for transmitting data, and β_{m_n,k}(t) = 0 otherwise;
defining r_{m_n,k}(t) as the uplink transmission rate of user m_n on wireless sub-channel k, the transmission delay and energy consumption for user m_n to upload the task to the MEC server of base station S_n are as follows:
wherein t_{m_n}^{tr} represents the transmission delay of uploading the task of user m_n to the MEC server of base station S_n; e_{m_n}^{tr} represents the transmission energy consumption of uploading the task of user m_n to the MEC server of base station S_n; p_{m_n}(t) represents the transmission power of user m_n at time t; h_{m_n,k}(t) represents the instantaneous channel gain between base station S_n and user m_n on wireless sub-channel k; σ² represents the noise power; I_k(t) represents the inter-cell interference experienced by user m_n on wireless sub-channel k; p_{m_{n'}}(t) represents the transmission power of user m_{n'} at time t; M_{n'} represents the set of users under the coverage of base station S_{n'}, and m_{n'} represents a user under the coverage of base station S_{n'}; h_{m_{n'},k}(t) represents the instantaneous channel gain between base station S_n and user m_{n'} on wireless sub-channel k;
defining f_{m_n}^{mec} as the CPU computation frequency that the MEC server of base station S_n allocates to the task, the computation delay and energy consumption of base station S_n for the task are as follows:
wherein t_{m_n}^{mec} represents the computation delay of base station S_n for the task; e_{m_n}^{mec} represents the computation energy consumption of base station S_n for the task; ξ_n represents the effective energy coefficient related to the chip architecture of the MEC server of base station S_n;
the total delay and energy consumption for user m_n to upload the task to the MEC server of base station S_n for edge computation are as follows:
wherein t_{m_n}^{off} represents the total delay when user m_n uploads the task to base station S_n for edge computation; e_{m_n}^{off} represents the energy consumption when user m_n uploads the task to base station S_n for edge computation.
3. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 2, wherein an offloading decision variable α_{m_n}(t) is defined: α_{m_n}(t) = 1 represents that user m_n selects, at the t-th time slot, to upload the task to the MEC server of base station S_n for edge computation, and α_{m_n}(t) = 0 represents that user m_n selects to compute the task locally; then
The resource allocation model includes:
wherein C1–C8 represent the constraint conditions; α is the set of offloading decision variables of all users in the mobile edge system; α_n represents the set of offloading decision variables of all users under the coverage of base station S_n; β(t) represents the set of channel allocation decision variables of all users in the mobile edge system at the t-th time slot; β_n(t) represents the set of channel allocation decision variables of all users under the coverage of base station S_n at the t-th time slot; p represents the set of transmission powers of all users in the mobile edge system; f represents the set of CPU computing resources allocated to their tasks by all users in the mobile edge system; f_{min}^{loc} and f_{max}^{loc} are respectively the minimum and maximum local CPU frequencies of a user; f_{min}^{mec} and f_{max}^{mec} are respectively the minimum and maximum CPU frequencies of the MEC server; p_{min} and p_{max} are respectively the minimum and maximum transmission powers of a user; constraint C3 represents that each wireless sub-channel under the coverage of base station S_n can be allocated to at most one user at the same time; constraint C4 represents that each user can use at most one wireless sub-channel at the same time to upload a task; T_{m_n}(t) is the total delay of completing the task: when α_{m_n}(t) = 1, T_{m_n}(t) equals the edge-computing total delay t_{m_n}^{off}(t), and when α_{m_n}(t) = 0, it equals the local computing delay t_{m_n}^{loc}(t).
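The optimization problem itself did not survive extraction; schematically, and as an assumption consistent with the stated objective and constraints, it takes the form

$$\min_{\boldsymbol{\alpha},\,\boldsymbol{\beta},\,\boldsymbol{p},\,\boldsymbol{f}}\ \sum_{t}\sum_{n=1}^{N} E_{n}(t)\qquad \text{s.t. } C1\text{–}C8,$$

where E_n(t) is the total energy consumption of all users under the coverage of base station S_n, as defined in the reward model below.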
4. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 3, wherein the global state model, local observation state model, action model, and reward model of the mobile edge system are defined by a distributed Markov decision process according to the resource allocation model:
defining the global state model: the global state comprises the task information generated by all users in the mobile edge system and the states of the wireless sub-channels; the global state is expressed as s(t) = {W(t), h(t)}, where W(t) represents the set of task information of all users in the mobile edge system and W_n(t) represents the set of task information of all users under the coverage of base station S_n; h(t) represents the set of instantaneous channel gains between all base stations and all users in the mobile edge system, and h_n(t) represents the set of instantaneous channel gains between base station S_n and all users in the edge system;
defining the local observation state model: in the distributed, partially observable MEC environment, the local observation state is expressed as o_n(t) = {W_n(t), h_n(t)}, where o_n(t) represents the local observation state of base station S_n at time slot t; W_n(t) represents the set of computing tasks of all users under the coverage of base station S_n; h_n(t) represents the set of instantaneous channel gains between base station S_n and all users in the mobile edge system, comprising the instantaneous channel gains between base station S_n and the users m_n under its own coverage and the instantaneous channel gains between base station S_n and the users m_{n'} under the coverage of the other base stations S_{n'};
defining the action model: the action space of an agent comprises four actions, namely the offloading decision variables, the channel allocation decision variables, the transmission power allocation, and the CPU computing resource allocation, expressed as a_n(t) = {α_n(t), β_n(t), p_n(t), f_n(t)}, where α_n(t) represents the set of offloading decision variables of all users under the coverage of base station S_n; β_n(t) represents the set of channel allocation decision variables of all users under the coverage of base station S_n; p_n(t) represents the set of transmission powers of all users under the coverage of base station S_n; f_n(t) represents the set of CPU computing resources allocated by base station S_n to all users under its coverage;
defining the reward function: the negative of the energy consumption is given to the agent as a reward, and an additional reward is given to an agent whose decision satisfies the maximum tolerable delay requirement of the task; the reward of the agent is specifically expressed as:
r_n(t) = −E_n(t) + Ω_n(t)
wherein H(·) represents the unit step function, η is the reward coefficient, r_n(t) represents the reward of agent S_n, E_n(t) represents the total energy consumption required by all users under the coverage of base station S_n to compute their tasks, and Ω_n(t) is the additional delay-satisfaction reward.
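The expression for Ω_n(t) was stripped; given that H(·) and η appear in the wherein-clause, a hedged reconstruction (the maximum-tolerable-delay symbol τ_{m_n}^{max} is an assumption) would be

$$\Omega_{n}(t)=\eta\sum_{m_{n}\in M_{n}} H\!\left(\tau_{m_n}^{max}(t)-T_{m_n}(t)\right),$$

i.e. each user whose total completion delay stays within its maximum tolerable delay contributes an extra reward η.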
5. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 4, wherein each agent extracting its local observation state from the global state model of the current mobile edge system, inputting it into the actor network, and outputting the selected action comprises:
for the discrete actions in the action set, namely the offloading decision and the channel allocation decision, continuous parameters g_1 and g_2 are used respectively to construct the corresponding embedding tables, one representing the offloading decisions and the other representing the channel allocation decisions; each agent inputs the observed local observation state into the actor network to obtain the encoded discrete action e_n and the encoded continuous action z_n; the encoded continuous action z_n is decoded to obtain the continuous action x_n; the discrete action u_n is obtained by looking up the embedding tables according to the encoded discrete action e_n; agent S_n interacts with the environment of the mobile edge system according to the continuous action x_n and the discrete action u_n, the continuous actions comprising transmission power allocation and CPU computing resource allocation.
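For illustration, the table lookup recited above can be sketched as a nearest-neighbour match between the encoded discrete action and the rows of an embedding table; the function name and the distance metric are assumptions.

```python
# Hedged sketch: recover a discrete action index from the actor's encoded
# discrete action e_n by nearest-neighbour lookup in an embedding table
# (rows = candidate discrete actions, columns = embedding dimensions).
import torch

def lookup_discrete(e_n: torch.Tensor, table: torch.Tensor) -> int:
    distances = torch.cdist(e_n.unsqueeze(0), table)  # (1, num_actions)
    return int(distances.argmin())
```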
6. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 4, wherein the actor network comprises a discrete action encoding network, a continuous action encoding network, and a continuous action decoding network;
the discrete action encoding network and the continuous action encoding network are used respectively to obtain the encoded discrete action and the encoded continuous action from the local observation state observed by the agent; the discrete action encoding network, the continuous action encoding network, and the continuous action decoding network are all fully connected neural networks comprising two hidden layers;
the continuous action decoding network concatenates the encoded discrete action e_n, the local observation state o_n(t), and the encoded continuous action z_n as input, and decodes the encoded continuous action z_n to obtain the continuous action x_n.
7. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 5, wherein evaluating the action selected by the actor network with the critic network comprises:
taking the agents as nodes and regarding any two base stations whose distance is smaller than a set threshold D as connected, creating an undirected graph G(V, E, A), wherein V represents the set of vertices in the graph, E represents the set of edges in graph G, and A is the adjacency matrix of graph G;
taking the channel gains between agent S_n and all users under the coverage of all other agents in the mobile edge system as the node features of agent S_n; according to the node features of agent S_n and the undirected graph G(V, E, A), using the GAT model to aggregate over the neighbor nodes of agent node S_n to obtain the weighted node features of agent S_n; evaluating the action selected by agent S_n in its local observation state according to the weighted node features of agent S_n, the encoded discrete actions e of all agents, the continuous actions x of all agents, and the tasks of all users in the mobile edge system, and outputting an evaluation value.
8. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 7, wherein the critic network comprises a GAT model and an evaluation model;
the GAT model is used to aggregate the neighbor nodes of agent node S_n; the evaluation model concatenates the weighted node features of agent S_n, the encoded discrete actions e of all agents, the continuous actions x of all agents, and the tasks of all users in the mobile edge system as input, and outputs the evaluation value of agent S_n; the evaluation model is a fully connected neural network comprising two hidden layers.
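As a hedged illustration of the GAT aggregation recited above, here is a single attention head in PyTorch, where att(·) is a single-layer feed-forward network with LeakyReLU activation as stated in the description; the two-head variant would concatenate the outputs of two such heads. Self-loops are added as an assumption so that isolated nodes still produce valid attention weights.

```python
# Hedged sketch of one GAT attention head over agent nodes: `feats` is the
# (N, in_dim) node-feature matrix, `A` the (N, N) adjacency matrix from claim 7.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATHead(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.att = nn.Linear(2 * out_dim, 1, bias=False)  # att(.) as a single FF layer

    def forward(self, feats: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        h = self.W(feats)                                   # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat([h.repeat_interleave(n, dim=0),   # every (i, j) pair
                           h.repeat(n, 1)], dim=1)
        scores = F.leaky_relu(self.att(pairs)).view(n, n)
        adj = A + torch.eye(n, device=A.device)             # self-loops avoid empty rows
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                # attention coefficients
        return alpha @ h                                    # weighted node features
```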
9. The computing offloading and resource allocation method based on GAT hybrid-action multi-agent reinforcement learning of claim 8, wherein updating the parameters of the actor network and the critic network comprises:
at the initial stage of centralized training, each agent copies its actor network π and critic network Q to create a target actor network π′ and a target critic network Q′;
the actor network updates its network parameters θ_n with the goal of maximizing the action value function, which can be expressed as:
wherein B represents the number of samples that agent S_n draws from the experience replay pool, and b represents the serial number of a sample; Q_n^π(·) represents the value function, namely the critic network; θ_n represents the parameters of the actor network; ω_n represents the parameters of the critic network; π_n represents the actor network; o_n^b represents the local observation state of agent S_n in the b-th shared sample; o^b represents the set of local observation states of all agents in the b-th shared sample; e^b and x^b represent the discrete action set and the continuous action set of all agents in the b-th shared sample; J(π_n) represents the action value function;
the critic network updates the parameter ω_n by minimizing the loss function of agent S_n, which can be expressed as:
wherein r_n^b represents the reward of agent S_n in the b-th shared sample, and γ represents the discount factor; π′_n represents the target actor network of agent S_n, and Q_n^{π′} represents the target critic network of agent S_n; the parameters θ′_n and ω′_n of the target actor network π′_n and the target critic network Q_n^{π′} are soft-updated as θ′_n = τθ_n + (1 − τ)θ′_n and ω′_n = τω_n + (1 − τ)ω′_n, where τ is the update rate; the reward r_n^b is the environment feedback obtained when agent S_n executes the action selected by its actor network to interact with the environment of the mobile edge system; o′^b represents the set of local observation states at the next moment obtained after all agents execute the actions selected by their actor networks to interact with the environment of the mobile edge system; e′^b and x′^b represent the discrete action set and the continuous action set selected by the target actor networks for o′^b.
CN202311101336.2A 2023-08-30 2023-08-30 Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning Pending CN117098189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311101336.2A CN117098189A (en) 2023-08-30 2023-08-30 Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311101336.2A CN117098189A (en) 2023-08-30 2023-08-30 Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN117098189A true CN117098189A (en) 2023-11-21

Family

ID=88773428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311101336.2A Pending CN117098189A (en) 2023-08-30 2023-08-30 Computing unloading and resource allocation method based on GAT hybrid action multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN117098189A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649027A (en) * 2024-01-25 2024-03-05 深圳宇翊技术股份有限公司 Data processing method and system based on intelligent station
CN117649027B (en) * 2024-01-25 2024-05-07 深圳宇翊技术股份有限公司 Data processing method and system based on intelligent station
CN117956523A (en) * 2024-03-22 2024-04-30 北京新源恒远科技发展有限公司 Task processing method for edge calculation of Internet of vehicles


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination