CN113676954A - Large-scale user task offloading method and device, computer equipment and storage medium

Large-scale user task offloading method and device, computer equipment and storage medium

Info

Publication number
CN113676954A
Authority
CN
China
Prior art keywords
base station
target
task
training
attribute information
Prior art date
Legal status
Granted
Application number
CN202110783668.8A
Other languages
Chinese (zh)
Other versions
CN113676954B (en)
Inventor
张旭
古博
林梓淇
丁北辰
姜善成
韩瑜
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110783668.8A priority Critical patent/CN113676954B/en
Publication of CN113676954A publication Critical patent/CN113676954A/en
Application granted granted Critical
Publication of CN113676954B publication Critical patent/CN113676954B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00: Network traffic management; Network resource management
    • H04W 28/02: Traffic management, e.g. flow control or congestion control
    • H04W 28/08: Load balancing or load distribution
    • H04W 28/09: Management thereof
    • H04W 28/0925: Management thereof using policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/50: Allocation or scheduling criteria for wireless resources
    • H04W 72/52: Allocation or scheduling criteria for wireless resources based on load
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application relates to a large-scale user task offloading method and apparatus, a computer device and a storage medium, which are applicable to the field of computer technology. The method comprises the following steps: acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device; acquiring attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station; inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, and determining a target base station corresponding to the target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network; and offloading the target task to the target base station. With this method, multiple terminal devices can be effectively prevented from crowding onto the same computing resources, avoiding the situation where tasks are difficult to complete because base station resources are insufficient.

Description

Large-scale user task offloading method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of resource allocation technologies in the field of communications, and in particular, to a method and an apparatus for offloading a large-scale user task, a computer device, and a storage medium.
Background
With the continuous development of communication technology, a large number of emerging mobile applications, such as cloud gaming, Virtual Reality (VR) and Augmented Reality (AR), have been promoted. Such applications require substantial computing resources and low latency to run properly. Task offloading technology was developed for this purpose: communication technology is used to offload the computation-intensive tasks of a terminal device to a server side with sufficient computing resources for processing, and the server side then transmits the computation results back to the terminal device, thereby jointly optimizing computing capacity and time delay. However, in cloud computing the offloading-side server is far away from the terminal device, so its transmission delay is far higher than the delay that the computing task can tolerate, and the terminal-device experience is poor. Therefore, in recent years, offloading the computation-intensive tasks of terminal devices to edge base stations with sufficient computing resources for processing has become a hot research issue.
In conventional methods, classical algorithms represented by convex optimization, game theory and the like do not allow multiple terminal devices to communicate with one another when those devices offload tasks simultaneously.
Therefore, with the above conventional methods, when multiple terminal devices offload tasks at the same time, they may all offload their tasks to the same base station, so that the base station's resources become insufficient and the tasks are difficult to complete.
Disclosure of Invention
Therefore, it is necessary to provide a large-scale user task offloading method, apparatus, computer device and storage medium that can solve the problem of coordinating multiple terminal devices when they offload tasks.
In a first aspect, a large-scale user task offloading method is provided. The method includes: acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device; acquiring attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station; inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model; and offloading the target task to the target base station.
In one embodiment, inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station includes: inputting the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a target actor network, and outputting the identification information of the target base station; and inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into a target critic network, and outputting the target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the preset deep reinforcement learning model includes a reward function, and after inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determining the target base station corresponding to the target task, the method further includes: calculating a target reward value by using the reward function, the target reward value being used to represent the time delay data and energy consumption data corresponding to offloading the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model includes a target actor network and a target critic network, each of which includes at least two graph convolutional neural network layers, and inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station includes: inputting the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network, performing feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target actor network, and outputting the identification information of the target base station based on the extracted features; and inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performing feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target critic network, and outputting, based on the extracted features, the target evaluation value corresponding to the identification information of the target base station.
In one embodiment, obtaining the attribute information of the plurality of candidate base stations associated with the terminal device includes: sending, by the terminal device, broadcast information to the base stations, the broadcast information being used to instruct each base station to send its attribute information to the terminal device; and after receiving the attribute information sent by each base station, determining the attribute information of the plurality of candidate base stations associated with the terminal device according to the position information of the terminal device and the position information of the base station included in each piece of attribute information.
In one embodiment, the training process of the preset deep reinforcement learning model includes: acquiring a training set corresponding to the preset deep reinforcement learning model, the training set including attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal device corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and the probability distribution of each candidate base station being selected by the adjacent terminal devices; and training a deep reinforcement learning network by taking the attribute information of the training tasks, the attribute information of the plurality of candidate base stations corresponding to the training tasks, the channel estimation information from the terminal device corresponding to the training tasks to each candidate base station, the identification information of the training base stations corresponding to the training tasks and the probability distribution of each candidate base station being selected by the adjacent terminal devices as input, so as to obtain the preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and training the deep reinforcement learning network by taking the attribute information of the training tasks, the attribute information of the plurality of candidate base stations corresponding to the training tasks, the channel estimation information from the terminal device corresponding to the training tasks to each candidate base station, the identification information of the training base stations corresponding to the training tasks and the probability distribution of each candidate base station being selected by the adjacent terminal devices as input to obtain the preset deep reinforcement learning model includes: inputting the attribute information of a training task, the attribute information of the plurality of candidate base stations corresponding to the training task and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the probability distribution of each candidate base station being selected by the adjacent terminal devices and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for offloading the training task to the training base station, the training evaluation value being used to represent the degree of matching of offloading the training task to the corresponding training base station; calculating, by using the reward function, a training reward value corresponding to offloading the training task to the training base station, the training reward value being used to represent the time delay data and energy consumption data corresponding to offloading the training task to the training base station; training the initial critic network according to the training reward value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
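As one illustration of the training flow described in this embodiment, the sketch below performs a single actor-critic update step. The module classes, tensor shapes and field names are assumptions made for illustration and do not correspond to the application's actual implementation.

```python
# Illustrative sketch only: hypothetical actor/critic modules and batch layout.
import torch
import torch.nn.functional as F

def train_step(actor, critic, target_critic, batch, actor_opt, critic_opt, gamma=0.99):
    # batch fields (assumed names): observation built from task/base-station/channel
    # features, neighbours' selection distributions, chosen action, reward, next step
    obs, neigh_dist, action, reward, next_obs, next_neigh_dist = batch

    # Critic update: regress towards reward + discounted target-critic value
    with torch.no_grad():
        next_action = actor(next_obs)
        target_q = reward + gamma * target_critic(next_obs, next_action, next_neigh_dist)
    q = critic(obs, action, neigh_dist)
    critic_loss = F.mse_loss(q, target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the critic's evaluation of the actor's own action
    pred_action = actor(obs)
    actor_loss = -critic(obs, pred_action, neigh_dist).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```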
In a second aspect, a large-scale user task offloading device is provided, the device comprising:
the first acquisition module is used for acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device;
a second obtaining module, configured to obtain attribute information of multiple candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station;
the determining module is used for inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used to represent the degree of matching of offloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model;
and the offloading module is used for offloading the target task to the target base station.
In a third aspect, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the large-scale user task offloading method described in any of the first aspects above.
In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the large-scale user task offloading method described in any of the first aspects above.
According to the above large-scale user task offloading method and apparatus, computer device and storage medium, task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device are acquired; attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station are acquired; the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station are input into a preset deep reinforcement learning model, a target base station corresponding to the target task is determined, and a target evaluation value corresponding to the identification information of the target base station is output, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the model; and the target task is offloaded to the target base station. In this method, the terminal device acquires not only the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices, but also the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station, so the terminal device can clearly determine to which base station its neighboring terminal devices offload their tasks, which ultimately ensures that offloading is mutually coordinated among the neighboring base stations. The terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determines the target base station corresponding to the target task by combining all of this information. Because the preset deep reinforcement learning model comprises a graph convolutional neural network, the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations is solved.
In addition, in this method, the terminal devices communicate with their neighboring terminal devices, so that cooperative decision-making among terminal devices is realized and the overall performance of the system is optimized; multiple terminal devices are effectively prevented from crowding onto the same computing resources, and the situation where tasks are difficult to complete because base station resources are insufficient is avoided. Furthermore, the preset deep reinforcement learning model can also output a target evaluation value, so that the suitability of offloading the target task to the target base station can be evaluated.
Drawings
FIG. 1 is a diagram of an application environment for a large-scale user task offloading method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a large-scale user task offloading method, according to one embodiment;
FIG. 3 is a schematic structural diagram of a deep reinforcement learning model in the large-scale user task offloading method in one embodiment;
FIG. 4 is a schematic diagram illustrating a structure of a graph convolution neural network in a large-scale user task offloading method according to another embodiment;
FIG. 5 is a flow diagram illustrating a large-scale user task offloading method, according to one embodiment;
FIG. 6 is a diagram illustrating a deep reinforcement learning model in a large-scale user task offloading method in an embodiment;
FIG. 7 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 8 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 9 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 10 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 11 is a block diagram of a large-scale user task offloading device in one embodiment;
FIG. 12 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 13 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 14 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 15 is a block diagram of a large-scale user task offloading device in one embodiment;
FIG. 16 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The large-scale user task offloading method provided by the application can be applied to the application environment shown in fig. 1, where the terminal device 102 communicates with the base station 104 over a network. The terminal device acquires the attribute information of the plurality of candidate base stations corresponding to it through communication with the base stations, according to its own position information. The terminal device 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the base station 104 may be implemented by a server cluster formed by a plurality of base stations.
In an embodiment, as shown in fig. 2, a large-scale user task offloading method is provided, which is described by taking the method as an example of being applied to the terminal device in fig. 1, and includes the following steps:
Step 201, the terminal device obtains task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device.
Specifically, the terminal device may obtain attribute information of a target task to be offloaded, where the task attribute information of the target task may include a data size of the target task, identification information of the target task, and the like. In addition, the terminal device may further obtain, through communication connection with the neighboring terminal device, probability distribution that each candidate base station is selected by the neighboring terminal device of the corresponding device.
In step 202, the terminal device obtains attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
Specifically, the terminal device may send signals to the surrounding base stations in a broadcast manner and receive the attribute information returned by each base station; the attribute information returned by each base station may include the location information of that base station. The terminal device determines the plurality of candidate base stations corresponding to it according to its own position information and the position information of each base station, and determines the attribute information corresponding to these candidate base stations. The terminal device then determines the channel estimation information between itself and each candidate base station according to its own attribute information and the attribute information of the plurality of candidate base stations associated with it.
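As one illustration of how channel estimation information could be derived from the exchanged attribute information, the sketch below computes a distance-based path loss and a Shannon-capacity rate estimate. The path-loss model and all parameter values are assumptions made for illustration, not values specified by the application.

```python
import math

def estimate_channel(ue_pos, bs_pos, tx_power_dbm=23.0, noise_dbm=-100.0,
                     bandwidth_hz=10e6):
    """Rough channel estimate between a terminal device and one candidate base station.

    The log-distance path-loss constants below are illustrative assumptions."""
    d = max(1.0, math.dist(ue_pos, bs_pos))                # distance in metres
    path_loss_db = 128.1 + 37.6 * math.log10(d / 1000.0)   # assumed urban macro model
    snr_db = tx_power_dbm - path_loss_db - noise_dbm
    snr = 10 ** (snr_db / 10.0)
    rate_bps = bandwidth_hz * math.log2(1.0 + snr)         # achievable transmission rate r_ij
    return {"distance_m": d, "snr_db": snr_db, "rate_bps": rate_bps}
```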
Step 203, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, the attribute information of a plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determines a target base station corresponding to the target task, and outputs a target evaluation value corresponding to the identification information of the target base station.
The target evaluation value is used to represent the degree of matching of offloading the target task to the target base station. The preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model.
Specifically, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, performs feature extraction at least twice on the input data by using the graph convolutional neural network in the preset deep reinforcement learning model, and determines the target base station corresponding to the target task based on the extracted features.
Deep reinforcement learning models, a hotspot of current research, have been widely used in various research fields. As shown in FIG. 3, a deep reinforcement learning model is used to learn a policy for a specific application scenario. It usually takes the observable state information $s_t$ of the environment as input; the terminal device evaluates this state and takes a corresponding action $a_t$, which acts on the environment and produces a feedback reward $r_t$ that is used to improve the policy. These steps are repeated until the terminal device can freely cope with dynamic changes in the environment. In general, reinforcement learning can be divided into two categories. One is the value-based approach (such as the DQN algorithm), which aims to maximize the return obtained by each action taken, so the higher the reward of an action, the more easily that action is selected. The other is the policy-based approach, which aims to directly learn a parameterized policy $\pi_\theta$. The parameter $\theta$ of the policy-based method can be updated by backward gradient propagation on the objective

$$J(\theta) = \mathbb{E}_{s \sim p_\pi,\, a \sim \pi_\theta}\left[ R(s, a) \right]$$

where $p_\pi$ is the state distribution probability. The gradient can be calculated according to the following formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p_\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) \right]$$

where $\pi_\theta(a_t \mid s_t)$ denotes the probability of selecting action $a_t$ in a given state $s_t$. The model parameters are then updated by backward gradient propagation:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the step size used in the learning process.
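The parameter update above can be illustrated with a few lines of code. The sketch below performs one generic REINFORCE-style gradient step and is only an illustration of policy-based updating, not the algorithm claimed by the application; the network and variable names are assumptions.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, returns):
    """One generic policy-gradient update: theta <- theta + alpha * grad J(theta).

    policy_net is assumed to map a batch of states to action logits; the sampled
    return is used here in place of Q^pi(s_t, a_t)."""
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
    loss = -(chosen * returns).mean()   # gradient ascent on J(theta) via minimizing -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```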
In the embodiment of the present application, the preset deep reinforcement learning model is obtained mainly by improving a graph-learning-based multi-terminal-device distributed reinforcement learning algorithm (MAGCAC) within the deep reinforcement learning framework. The preset deep reinforcement learning model is used to determine, from a plurality of base stations, the base station that gives the shortest time delay in the target task offloading process while satisfying the preset constraint condition.
In addition, graph convolutional networks (GCNs) have been a research focus since they appeared in 2017 and have achieved remarkable results in many fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted with a convolutional neural network (CNN), a recurrent neural network (RNN) or the like; as a result, a great deal of work on graph learning theory has emerged. FIG. 4 shows a multi-layer graph convolutional network, which takes graph structure features as input and outputs the corresponding features after graph convolution. The layer-by-layer computation is as follows:

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ denotes the adjacency matrix of the graph structure with added self-connections, $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the corresponding degree matrix, and $W^{(l)}$ is a learnable weight parameter matrix. $\sigma(\cdot)$ is an activation function, e.g., ReLU$(\cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the feature extracted by the $l$-th graph convolutional layer; when $l = 0$, $H^{(0)} = X$ is the input graph structure feature.
Step 204, the terminal device offloads the target task to the target base station.
Specifically, after the target base station corresponding to the target task is determined, the terminal device may offload the target task to the target base station, and after the target base station calculates the target task, the calculation result is sent to the terminal device.
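Steps 201 to 204 can be summarized, at a very high level, by the following sketch. The object methods and field names are placeholders for the information items described above and do not correspond to any concrete API; this is only an illustration of the flow.

```python
def offload_target_task(device, task, drl_model):
    # Step 201: task attributes and neighbours' base-station selection distributions
    task_info = task.attributes()
    neighbour_dist = device.collect_neighbour_selection_distributions()

    # Step 202: candidate base-station attributes and channel estimates
    candidates = device.discover_candidate_base_stations()
    channels = {bs.id: device.estimate_channel(bs) for bs in candidates}

    # Step 203: the preset deep reinforcement learning model (with GCN feature
    # extraction) picks the target base station and a target evaluation value
    target_bs_id, evaluation = drl_model.decide(task_info, neighbour_dist,
                                                candidates, channels)

    # Step 204: offload the task and wait for the computed result
    result = device.offload(task, target_bs_id)
    return target_bs_id, evaluation, result
```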
In the above task offloading method, the terminal device obtains the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices; obtains the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station; inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, and determines the target base station corresponding to the target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the model; and offloads the target task to the target base station. In this method, the terminal device acquires not only the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices, but also the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station, so the terminal device can clearly determine to which base station its neighboring terminal devices offload their tasks, which ultimately ensures that offloading is mutually coordinated among the neighboring base stations. The terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determines the target base station corresponding to the target task by combining all of this information. Because the preset deep reinforcement learning model comprises a graph convolutional neural network, the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations is solved.
In addition, in this method, the terminal devices communicate with their neighboring terminal devices, so that cooperative decision-making among terminal devices is realized and the overall performance of the system is optimized; multiple terminal devices are effectively prevented from crowding onto the same computing resources, and the situation where tasks are difficult to complete because base station resources are insufficient is avoided. Furthermore, the preset deep reinforcement learning model can also output a target evaluation value, so that the suitability of offloading the target task to the target base station can be evaluated.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, as shown in fig. 5, the step 203 of inputting the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of a plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station may include the following steps:
step 501, the terminal device inputs task attribute information, probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, attribute information of a plurality of candidate base stations and channel estimation information between the terminal device and each candidate base station into the target actor network, and outputs identification information of the target base station.
Specifically, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network; the terminal device may perform feature extraction on the input data by using the at least two feature extraction layers included in the target actor network, process the extracted features with a fully connected layer in the target actor network, and finally output the identification information of the target base station.
Specifically, in the embodiment of the present application, the preset deep reinforcement learning model is mainly improved on the basis of a graph-learning-based multi-terminal-device distributed reinforcement learning algorithm (MAGCAC). The algorithm takes each terminal device as an agent and the whole edge computing system as the environment, and is divided into an actor network and a critic network.
In the embodiment of the application, the observation state refers to the model's observation of the environment, and whether the features chosen for the observation state are reasonable directly influences whether the terminal device can learn an effective coping strategy. The algorithm regards both the terminal devices and the base stations in the system as nodes, so a corresponding graph structure G is drawn according to the connectivity between terminal devices and base stations. For ease of implementation, a terminal device is regarded as a special base station in the embodiments of the present application; that is, since a terminal device in the system does not support completing the computation task entirely locally, its feature information as a base station is set to 0. It should be noted that the embodiments of the present application only consider the connectivity between terminal devices and base stations, not the connectivity among terminal devices. Therefore, the node features of the terminal devices and of the base stations are recorded separately, and the graph structure corresponding to terminal device i is built from these node features together with the adjacency relation between terminal device i and its connectable base stations. In the embodiment of the present application, the graph structure $G_i(t)$ at time t is taken as the state observation information $o_i(t)$ of terminal device i, i.e., $o_i(t) = G_i(t)$.
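A sketch of how the observation graph for terminal device i could be assembled is shown below; the zero feature row for the terminal-device node and the restriction to device-to-base-station edges follow the description above, while the concrete array shapes and field layout are assumptions.

```python
import numpy as np

def build_observation(bs_features, connectivity):
    """Build the graph observation o_i(t) for one terminal device.

    bs_features:  (M, D) array with base-station features (computing power f_j,
                  achievable rate r_ij(t), ...).
    connectivity: length-M 0/1 vector saying which base stations are connectable."""
    M, D = bs_features.shape
    ue_row = np.zeros((1, D))                            # terminal-device node features = 0
    X = np.concatenate([ue_row, bs_features], axis=0)    # (M+1, D) node feature matrix

    A = np.zeros((M + 1, M + 1))
    A[0, 1:] = connectivity                              # UE <-> base-station links only;
    A[1:, 0] = connectivity                              # no UE-to-UE edges are considered
    return X, A
```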
During task offloading, the time delay and energy consumption are mainly influenced by the following factors: the base station computing power $f_j$, the achievable transmission rate $r_{i,j}(t)$, and how crowded the base station's computing resources are. The computing power of the connectable base stations and the achievable transmission rates are therefore taken as the main observed state information for terminal device i. As for the crowding of base station computing resources, the cooperation situation among neighboring devices needs to be determined.

At time t, the terminal device evaluates the current state information to derive the corresponding action

$$a_i(t) = \pi_i\big(o_i(t)\big)$$

where the action $a_i(t)$ is a one-hot encoding: the base station selected for offloading is denoted as 1 and the others are denoted as 0. However, since the DDPG algorithm requires actions to be continuous, the embodiment of the present application re-expresses the DDPG output and discretizes it into the above one-hot encoded form.
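The discretization of the continuous DDPG output into a masked one-hot action could look as follows; the mask encodes which base stations are connectable for this agent, and the function and argument names are illustrative assumptions.

```python
import numpy as np

def to_one_hot_action(actor_output, mask):
    """Turn a continuous actor output into the one-hot offloading action a_i(t).

    actor_output: per-base-station scores produced by the actor network (float array).
    mask: 1 for connectable base stations, 0 otherwise (each agent's mask differs)."""
    scores = np.where(mask > 0, actor_output, -np.inf)   # forbid unreachable base stations
    action = np.zeros_like(actor_output)
    action[np.argmax(scores)] = 1.0                      # selected base station -> 1
    return action
```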
In addition, as shown in fig. 6, in the embodiment of the present application the actor network structure in the MAGCAC algorithm takes the graph structure G as input, uses two layers of GCNs to extract features, and finally uses a multilayer perceptron (MLP) as the output layer. Since the action space of each agent is different, the output of the multilayer perceptron is multiplied by the mask of the corresponding agent to obtain the final action.
Thus, when agent i determines its policy $\pi_{\theta_i}$, the following gradient can be calculated:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[ \nabla_{\theta_i} \pi_{\theta_i}\big(a_i \mid o_i\big)\, \nabla_{a_i} Q_i\big(o_i, a_i, p_{G_i}\big) \Big|_{a_i = \pi_{\theta_i}(o_i)} \right]$$

Similarly, the critic network structure in the MAGCAC algorithm also takes the graph structure G as input, uses two layers of GCNs to extract features, and finally uses a multilayer perceptron (MLP) as the output layer. The loss function of the critic network can therefore be calculated as

$$L(\phi_i) = \mathbb{E}\left[ \big( Q_i(o_i, a_i, p_{G_i}) - y_i \big)^2 \right]$$

where the target action value $y_i$ is calculated as follows:

$$y_i = r_i + \gamma\, Q_i'\big(o_i', a_i', p_{G_i}'\big)\Big|_{a_i' = \pi_{\theta_i}'(o_i')}$$

Here $p_{G_i}$ represents the probability distribution of each base station being selected by the terminal devices adjacent to terminal device i, and $G_i$ denotes the set of neighboring terminal devices of terminal device i.
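A compact sketch of actor and critic networks with two graph-convolution layers followed by an MLP, as described for FIG. 6, is given below. The layer sizes, the mean pooling over node features, and the way the neighbours' selection distribution is concatenated into the critic input are assumptions made for illustration, not the application's actual architecture.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two graph-convolution layers; A_hat is a normalized adjacency matrix."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, X, A_hat):
        H = torch.relu(A_hat @ self.w1(X))
        return torch.relu(A_hat @ self.w2(H))

class Actor(nn.Module):
    def __init__(self, in_dim, hid_dim, n_actions):
        super().__init__()
        self.gcn = TwoLayerGCN(in_dim, hid_dim, hid_dim)
        self.mlp = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, n_actions))

    def forward(self, X, A_hat, mask):
        h = self.gcn(X, A_hat).mean(dim=0)   # pool node features (assumed pooling)
        return self.mlp(h) * mask            # mask out unreachable base stations

class Critic(nn.Module):
    def __init__(self, in_dim, hid_dim, n_actions):
        super().__init__()
        self.gcn = TwoLayerGCN(in_dim, hid_dim, hid_dim)
        # the critic also sees the chosen action and the neighbours' selection distribution
        self.mlp = nn.Sequential(nn.Linear(hid_dim + 2 * n_actions, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, 1))

    def forward(self, X, A_hat, action, neighbour_dist):
        h = self.gcn(X, A_hat).mean(dim=0)
        return self.mlp(torch.cat([h, action, neighbour_dist], dim=-1))
```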
step 502, the terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of the target base station, and probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device into the target critic network, and outputs a target evaluation value corresponding to the identification information of the target base station.
Specifically, the terminal device inputs the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performs feature extraction on the input data by using the at least two feature extraction layers in the critic network, and outputs the target evaluation value corresponding to the identification information of the target base station.
Step 503, the terminal device calculates a target reward value by using the reward function.
The target reward value is used to represent the time delay data and energy consumption data corresponding to offloading the target task to the target base station.
Specifically, the reward value is used to represent the task delay and energy consumption corresponding to offloading the target task to the target base station: the higher the reward value, the shorter the task delay and the smaller the energy consumption of offloading the target task to the target base station.
Illustratively, in the embodiment of the present application the reward function is designed so that the task delay is minimized under the constraint of meeting the energy consumption budget. Given the action $a_i(t)$, the corresponding reward is calculated according to the following formula:

$$r_i(t) = -\,\tau_i(t) + \max\!\Big( P_{\min},\; \min\big(0,\; \bar{\epsilon}_i - \epsilon_i(t)\big) \Big)$$

where $\tau_i(t)$ is the task delay, $\epsilon_i(t)$ is the energy consumption, $\bar{\epsilon}_i$ is the energy consumption budget, and $P_{\min}$ is a non-positive number representing the bound on the energy consumption penalty. The reward function can thus always aim at minimizing task delay while taking battery energy consumption safety into account. When the energy consumption $\epsilon_i(t)$ is less than $\bar{\epsilon}_i$, the reward of the energy consumption part of the reward function is 0; that is, under the condition that energy consumption safety is ensured, the embodiment of the present application places no specific limitation on the energy consumption of task transmission. When the energy consumption $\epsilon_i(t)$ is higher than $\bar{\epsilon}_i$, this part is a negative number, i.e., a penalty, and this penalty is bounded below by $P_{\min}$.
Therefore, under the guidance of this reward function, the terminal device can learn an excellent task offloading strategy that takes both task delay and transmission energy consumption into account, and can offload a given task to a suitable base station.
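The reward just described, in which delay is always penalized while energy consumption is only penalized once it exceeds the budget and that penalty is bounded, could be written as follows. The symbol names mirror the ones used in the formula above; the default value of the penalty bound is an arbitrary illustrative choice.

```python
def reward(delay, energy, energy_budget, penalty_floor=-1.0):
    """r_i(t) = -delay + max(penalty_floor, min(0, energy_budget - energy)).

    penalty_floor is the non-positive bound on the energy-consumption penalty."""
    energy_term = max(penalty_floor, min(0.0, energy_budget - energy))
    return -delay + energy_term
```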
In the embodiment of the application, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network and outputs the identification information of the target base station. The terminal device then inputs the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network and outputs the target evaluation value corresponding to the identification information of the target base station, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station. In addition, the terminal device calculates a target reward value using the reward function. In this way it can be guaranteed that the task delay of offloading the target task to the target base station is the shortest while the energy consumption constraint condition is satisfied.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target critic network, each of which includes at least two graph convolutional neural network layers, so that the task delay of offloading the target task to the target base station is the shortest and the energy consumption constraint condition is satisfied. In this case, step 203 of "inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station" may include the following contents:
the method comprises the steps that the terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, at least two layers of graph convolution neural networks in the target actor network are used for carrying out feature extraction on input data at least twice, and identification information of the target base station is output based on extracted features.
The method comprises the steps that a terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal devices of corresponding devices into a target critic network, at least two times of feature extraction is carried out on input data through at least two layers of graph convolutional neural networks in the target critic network, and a target evaluation value corresponding to the identification information of the target base station is output based on extracted features.
Here, graph convolutional networks (GCNs) have been a research focus since they appeared in 2017 and have achieved remarkable results in many fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted with a convolutional neural network (CNN), a recurrent neural network (RNN) or the like; as a result, a great deal of work on graph learning theory has emerged. FIG. 4 shows a multi-layer graph convolutional network, which takes graph structure features as input and outputs the corresponding features after graph convolution. The layer-by-layer computation is as follows:

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ denotes the adjacency matrix of the graph structure with added self-connections, $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the corresponding degree matrix, and $W^{(l)}$ is a learnable weight parameter matrix. $\sigma(\cdot)$ is an activation function, e.g., ReLU$(\cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the feature extracted by the $l$-th graph convolutional layer; when $l = 0$, $H^{(0)} = X$ is the input graph structure feature.
Specifically, the actor network structure takes task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal device and each candidate base station as input, and uses two layers of GCNs to extract features from the input information, and finally calculates the extracted features by using a multi-layer Perceptron (MLP) and outputs identification information of a target base station.
The method comprises the steps that a terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal devices of corresponding devices into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output.
The target evaluation value is used for representing the matching degree of unloading the target task to the target base station.
Specifically, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device is input into the target critic network. The terminal device performs at least two times of feature extraction on input data by using at least two layers of graph convolutional neural networks in the critic network, calculates the extracted features by using a Multilayer Perceptron (MLP), and outputs a target evaluation value corresponding to the identification information of the target base station.
Here, the loss function of the target critic network can be calculated as

$$L(\phi_i) = \mathbb{E}\left[ \big( Q_i(o_i, a_i, p_{G_i}) - y_i \big)^2 \right]$$

where the target action value $y_i$ is calculated as follows:

$$y_i = r_i + \gamma\, Q_i'\big(o_i', a_i', p_{G_i}'\big)\Big|_{a_i' = \pi_{\theta_i}'(o_i')}$$

and $p_{G_i}$ represents the probability distribution of each candidate base station being selected by the terminal devices adjacent to terminal device i, with $G_i$ denoting the set of neighboring terminal devices of terminal device i.
In the embodiment of the application, the terminal device inputs the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network, performs feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target actor network, and outputs the identification information of the target base station based on the extracted features. The terminal device inputs the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performs feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target critic network, and outputs, based on the extracted features, the target evaluation value corresponding to the identification information of the target base station, which is used to represent the degree of matching of offloading the target task to the target base station. In this method, performing feature extraction at least twice on the input data with the at least two graph convolutional layers in the target actor network ensures the accuracy of the features extracted by the target actor network, and therefore ensures that the identification of the target base station output by the target actor network is more accurate. Likewise, performing feature extraction at least twice on the input data with the at least two graph convolutional layers in the target critic network ensures the accuracy of the target evaluation value output by the target critic network.
In an alternative embodiment of the present application, as shown in fig. 7, the "acquiring attribute information of multiple candidate base stations associated with the terminal device" in step 202 includes:
in step 701, the terminal device sends broadcast information to the base station.
The broadcast information is used for instructing each base station to send attribute information of the base station to the terminal equipment.
Specifically, the terminal device may transmit the broadcast information to base stations around each terminal device before offloading the target task.
After receiving the broadcast information sent by the terminal device, each base station may send attribute information of the base station to the terminal device, and establish a connection with the terminal device.
Step 702, the terminal device receives the attribute information sent by each base station, and determines the attribute information of a plurality of candidate base stations associated with the terminal device according to the position information of the terminal device and the position information of the base station included in each attribute information.
Specifically, the attribute information sent by each base station may include location information of each base station, and after receiving the attribute information sent by each base station, the terminal device may determine the location of each base station according to the location information of each base station included in each attribute information. The terminal device may select, from among the base stations that have received the attribute information, a base station that is relatively close to the terminal device as a plurality of base stations corresponding to the terminal device, according to the position information of the terminal device and the position information of each base station, and determine the attribute information of a plurality of candidate base stations corresponding to the terminal device.
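A minimal sketch of this distance-based selection is given below (plain Python; the dictionary fields and the choice of the k nearest base stations are assumptions for illustration).

```python
import math

def select_candidate_base_stations(device_pos, base_stations, k=3):
    """Pick the k base stations closest to the terminal device from the
    attribute information received in response to the broadcast."""
    def distance(bs):
        return math.dist(device_pos, bs["position"])
    return sorted(base_stations, key=distance)[:k]

base_stations = [
    {"id": 1, "position": (0.0, 50.0), "cpu_freq": 10e9},
    {"id": 2, "position": (120.0, 30.0), "cpu_freq": 8e9},
    {"id": 3, "position": (40.0, 10.0), "cpu_freq": 12e9},
]
candidates = select_candidate_base_stations((35.0, 20.0), base_stations, k=2)
print([bs["id"] for bs in candidates])   # ids of the nearest candidate base stations
```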
In the embodiment of the application, the terminal device sends broadcast information to the base stations, receives the attribute information sent by each base station, and determines the attribute information of the plurality of candidate base stations corresponding to the terminal device according to the position information of the terminal device and the position information of the base stations included in each piece of attribute information. In this method, the terminal device determines, by sending broadcast information to the base stations and receiving the attribute information sent by each base station, the base stations that can establish a connection with the terminal device, and then determines the attribute information of the plurality of candidate base stations corresponding to the terminal device from the connected base stations according to the position information of the terminal device and the position information of the base stations included in each piece of attribute information. This ensures that the candidate base stations corresponding to the terminal device can establish a stable connection with the terminal device and are close to the terminal device, so that the task delay required for offloading the target task to the target base station is the shortest while the energy consumption constraint condition is met.
In an alternative embodiment of the present application, as shown in fig. 8, the training process of the preset deep reinforcement learning model may include the following steps:
step 801, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model.
The training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information from terminal equipment corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment.
Specifically, before training a preset deep reinforcement learning model, the terminal device needs to obtain a training set corresponding to the preset deep reinforcement learning model. The terminal device may obtain attribute information of a plurality of training tasks, where the attribute information of the plurality of tasks may include data size information of each training task and identification information of each training task. The terminal equipment can also acquire the attribute information of the candidate base stations corresponding to the training task through the communication connection with the base stations. The terminal device may calculate the time delay data and the energy consumption data for offloading each training task to each base station according to a preset algorithm, and thereby determine the target base station corresponding to each training task and the identification information of the target base station from the plurality of candidate base stations according to the calculated time delay data and energy consumption data.
Illustratively, in the embodiment of the present application, an edge computing system is defined, which is deployed with N micro Base Stations (BSs) and can provide computing services for large-scale Mobile Internet-of-Things Devices (MDs) in the system. For convenience of description, the set of base stations is denoted as N = {1, 2, ..., N}, the set of mobile Internet-of-Things devices is denoted as M = {1, 2, ..., M}, and time is discretized into τ different time intervals (time slots), denoted as T = {1, 2, ..., τ}. Meanwhile, because the base stations differ in deployment position and signal coverage capability, each base station can serve different terminal devices; in addition, the base stations to which a terminal device can connect differ because the positions of the terminal devices differ. Then, at time t, the set of connectable base stations of terminal device i is denoted as N_i(t), and the set of serviceable terminal devices of base station j is denoted as M_j(t). At this time, for any base station j, if a terminal device in its signal coverage area offloads a task to the base station, it is marked as 1; otherwise, it is marked as 0, which may be specifically expressed as:
$$a_{i,j}(t) = \begin{cases} 1, & \text{terminal device } i \text{ offloads its task to base station } j \text{ at time } t \\ 0, & \text{otherwise} \end{cases}$$
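For illustration only, the bookkeeping behind this indicator and the serviceable set M_j(t) might look like the following sketch (names and data layout are assumptions).

```python
def offload_indicator(choice, i, j):
    """a_{i,j}(t): 1 if terminal device i offloads its slot-t task to base station j."""
    return 1 if choice.get(i) == j else 0

def served_devices(choice, j):
    """M_j(t): the set of terminal devices whose slot-t task is offloaded to base station j."""
    return {i for i, bs in choice.items() if bs == j}

choice_t = {0: 2, 1: 2, 2: 1}                 # device -> selected base station at slot t
print(offload_indicator(choice_t, 0, 2))      # 1
print(served_devices(choice_t, 2))            # {0, 1}
```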
Exemplarily, taking a community scene with an edge computing system deployed as an example, a plurality of mobile Internet-of-Things devices, including smart watches, smart glasses, smart phones and the like, are randomly distributed at arbitrary positions in the community. Each device generates a computing task k of a specific size at the beginning of each time slot τ, offloads the task after local preprocessing to a selected edge base station for further computation and analysis, and finally the base station returns the processed result to the terminal device. Two points should be noted in this process: first, the data to be offloaded after preprocessing by the terminal device is inseparable, namely it is submitted as a whole to the selected base station for computation and analysis; second, since the analysis result computed by the base station is much smaller than the data to be offloaded, the downlink transmission delay can be ignored when modelling the computing task.
In the preprocessing step, the terminal device generally needs to encrypt and pack the generated task data, and then offload the task data to the base station for processing. For convenience of description, denote the data size that terminal device $i$ needs to process locally as $d_i^{l}(t)$ and the data size to be offloaded to the base station for processing as $d_i^{o}(t)$. Correspondingly, for the task generated at time $t$, the numbers of CPU cycles required to process a unit amount of data locally and at the base station are denoted as $c_i^{l}$ and $c_i^{o}$, respectively. The time delay consumed in the local preprocessing is then:

$$T_i^{l}(t) = \frac{c_i^{l}\, d_i^{l}(t)}{f_i}$$

wherein $f_i$ represents the CPU frequency of terminal device $i$. The energy consumption spent in local processing is as follows:

$$E_i^{l}(t) = \kappa_i\, f_i^{2}\, c_i^{l}\, d_i^{l}(t)$$

wherein $\kappa_i$ is the power consumption coefficient of the corresponding device, and this coefficient typically depends on the chip architecture.
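Under the notation above, the local preprocessing delay and energy could be computed as in this short sketch (all parameter values are illustrative assumptions).

```python
def local_delay(data_local, cycles_per_bit_local, cpu_freq):
    """T_i^l(t) = c_i^l * d_i^l(t) / f_i : local preprocessing delay."""
    return cycles_per_bit_local * data_local / cpu_freq

def local_energy(data_local, cycles_per_bit_local, cpu_freq, kappa):
    """E_i^l(t) = kappa_i * f_i^2 * c_i^l * d_i^l(t) : local preprocessing energy."""
    return kappa * cpu_freq**2 * cycles_per_bit_local * data_local

d_local = 2e6          # bits to preprocess locally
c_local = 500.0        # CPU cycles per bit for local preprocessing
f_i = 1.5e9            # terminal CPU frequency (Hz)
kappa = 1e-27          # chip-dependent power consumption coefficient
print(local_delay(d_local, c_local, f_i))            # seconds
print(local_energy(d_local, c_local, f_i, kappa))    # joules
```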
In this scenario, since the task to be unloaded is not separable, the unloading delay usually includes two parts, respectively: transmission delay and computation delay. First, the transmission delay refers to the time taken for the terminal device i to transmit the preprocessed task to the selected base station j. Therefore, for the terminal device i, the transmission delay at the time t is specifically:
$$T_{i,j}^{tx}(t) = \frac{d_i^{o}(t)}{r_{i,j}(t)}$$

wherein $d_i^{o}(t)$ is the size of the content to be transmitted and $r_{i,j}(t)$ is the uplink rate that can be achieved between terminal device $i$ and base station $j$, which is specifically calculated as follows:

$$r_{i,j}(t) = B \log_2\!\left(1 + \frac{p^{tx}\, h_{i,j}(t)}{\sigma^2 + I_{i,j}}\right)$$

wherein $B$ represents the bandwidth available for data transmission between the terminal device and the connectable base station, and $h_{i,j}(t)$ represents the channel gain between terminal device $i$ and the selected base station $j$. In addition, the terminal devices uniformly transmit their tasks with power $p^{tx}$, during which the noise power is expressed as $\sigma^2$ and the interference power at the base station can be represented as $I_{i,j}$. The channel gain is calculated as follows:

$$h_{i,j}(t) = X\, \beta_{i,j}\, \tilde{\beta}_{i,j}\, d_{i,j}^{-\zeta}$$

wherein $X$ represents an adjustment factor for the path loss; $\beta_{i,j}$ and $\tilde{\beta}_{i,j}$ represent the fast fading gain coefficient and the slow fading gain coefficient, respectively; $d_{i,j}$ represents the distance between terminal device $i$ and base station $j$; and $\zeta$ is the path loss coefficient.
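A numerical sketch of this uplink model (channel gain, achievable rate and transmission delay) follows; every parameter value is an illustrative assumption.

```python
import math

def channel_gain(x_factor, fast_fading, slow_fading, distance, zeta):
    """h_{i,j} = X * beta_{i,j} * beta~_{i,j} * d_{i,j}^(-zeta)."""
    return x_factor * fast_fading * slow_fading * distance ** (-zeta)

def uplink_rate(bandwidth, p_tx, gain, noise_power, interference):
    """r_{i,j}(t) = B * log2(1 + p_tx * h / (sigma^2 + I))."""
    return bandwidth * math.log2(1.0 + p_tx * gain / (noise_power + interference))

def transmission_delay(data_offload, rate):
    """T_{i,j}^tx(t) = d_i^o(t) / r_{i,j}(t)."""
    return data_offload / rate

h = channel_gain(x_factor=1.0, fast_fading=0.9, slow_fading=0.8, distance=100.0, zeta=3.0)
r = uplink_rate(bandwidth=10e6, p_tx=0.2, gain=h, noise_power=1e-13, interference=1e-13)
print(transmission_delay(data_offload=1e6, rate=r))   # seconds to upload 1 Mbit
```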
Secondly, the computation delay of the task generated by the terminal device i at the time t on the edge server can be expressed as follows:
$$T_{i,j}^{c}(t) = \frac{c_i^{o}\, d_i^{o}(t)}{f_{i,j}(t)}$$

wherein $c_i^{o}$ represents the number of CPU cycles required to compute a unit amount of task data at the base station, and $f_{i,j}(t) = f_j / \sum_{k} a_{k,j}(t)$ represents the CPU frequency allocated to terminal device $i$ on base station $j$ at time $t$; that is, when a plurality of tasks are offloaded to the same base station, the base station evenly distributes its computing power among these tasks.

Thus, the total time delay required for the task on terminal device $i$ from preprocessing to completion of the computation is:

$$T_i(t) = T_i^{l}(t) + T_{i,j}^{tx}(t) + T_{i,j}^{c}(t)$$
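The edge-side computation delay and the resulting total delay can be illustrated as follows (function names and numbers are assumptions).

```python
def edge_compute_delay(data_offload, cycles_per_bit_edge, bs_cpu_freq, n_tasks_on_bs):
    """T_{i,j}^c(t) = c_i^o * d_i^o(t) / f_{i,j}(t), with f_{i,j}(t) = f_j / |M_j(t)|
    when the base station splits its CPU evenly among the offloaded tasks."""
    f_share = bs_cpu_freq / n_tasks_on_bs
    return cycles_per_bit_edge * data_offload / f_share

def total_delay(t_local, t_tx, t_comp):
    """Total delay from local preprocessing to completion of the edge computation."""
    return t_local + t_tx + t_comp

t_comp = edge_compute_delay(1e6, 800.0, 10e9, n_tasks_on_bs=4)
print(total_delay(t_local=0.7, t_tx=0.005, t_comp=t_comp))   # seconds
```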
in addition, for the terminal device, the energy consumption spent by the task in the offloading process generally includes two parts, namely the energy consumption required for transmitting the task to the base station and the energy consumption required for receiving the task when the base station transmits the calculation result back to the terminal device. The data volume of the calculation result is very small compared with the data volume to be transmitted, so the receiving energy consumption can be ignored. Therefore, when the terminal device i unloads the task, the transmission energy consumption is as follows:
$$E_{i,j}^{tx}(t) = p^{tx}\, T_{i,j}^{tx}(t)$$

The total energy consumption is:

$$E_i(t) = E_i^{l}(t) + E_{i,j}^{tx}(t)$$
In addition, when the terminal device in the mobile edge system offloads a task, power consumption is inevitable. However, a large instantaneous battery discharge power is harmful, and for this reason a battery safety threshold $E_i^{safe}$ is introduced here; namely, when the terminal device offloads a task, the energy consumption should satisfy the following condition:

$$E_i(t) \le E_i^{safe}$$
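The transmission energy and the battery safety check described above might be expressed as in this sketch (the threshold name e_safe is assumed for illustration).

```python
def transmission_energy(p_tx, t_tx):
    """E_{i,j}^tx(t) = p_tx * T_{i,j}^tx(t): energy spent uploading the task."""
    return p_tx * t_tx

def offload_allowed(e_local, e_tx, e_safe):
    """Energy constraint: the per-slot energy must stay below the battery safety threshold."""
    return e_local + e_tx <= e_safe

e_tx = transmission_energy(p_tx=0.2, t_tx=0.005)
print(offload_allowed(e_local=1.1e-3, e_tx=e_tx, e_safe=5e-3))   # True
```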
therefore, when the terminal device unloads the task, the minimum total time delay should be achieved under the condition of meeting the energy consumption constraint condition. This optimization problem is defined as follows:
$$\min_{\{a_{i,j}(t)\}} \; \sum_{t=1}^{\tau} \sum_{i=1}^{M} T_i(t)$$

$$\text{s.t.} \quad E_i(t) \le E_i^{safe}, \quad \forall i,\; \forall t$$

$$a_{i,j}(t) \in \{0, 1\}, \quad \sum_{j \in N_i(t)} a_{i,j}(t) = 1, \quad \forall i,\; \forall t$$
based on the above, the terminal device may calculate the time delay data and the energy consumption data corresponding to offloading each training task to each base station, and determine the target base station corresponding to each training task and the identification information of the target base station from the plurality of base stations according to the calculated time delay data and energy consumption data. The task unloading time delay corresponding to the unloading of each training task to the target base station is shortest, and the preset energy consumption constraint condition is met.
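The label-generation step described here, namely choosing for each training task the candidate base station with the smallest total delay among those satisfying the energy constraint, can be sketched as follows (the data layout is an assumption).

```python
def pick_target_base_station(candidates, delay_fn, energy_fn, e_safe):
    """Return the id of the candidate base station with the smallest total
    offloading delay among those satisfying the energy constraint."""
    feasible = [(delay_fn(bs), bs["id"]) for bs in candidates
                if energy_fn(bs) <= e_safe]
    if not feasible:
        return None                      # no candidate meets the constraint
    return min(feasible)[1]

candidates = [{"id": 1, "delay": 0.9, "energy": 2e-3},
              {"id": 2, "delay": 0.6, "energy": 6e-3},
              {"id": 3, "delay": 0.7, "energy": 3e-3}]
best = pick_target_base_station(candidates,
                                delay_fn=lambda bs: bs["delay"],
                                energy_fn=lambda bs: bs["energy"],
                                e_safe=5e-3)
print(best)   # 3: lowest delay among the energy-feasible candidates
```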
Step 802, the terminal device trains the deep reinforcement learning network to obtain a preset deep reinforcement learning model by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device as input.
Specifically, the terminal device may input the attribute information of each training task, the attribute information of the plurality of candidate base stations corresponding to each training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into the untrained deep reinforcement learning network, and train the deep reinforcement learning network with the identification information of the training base station corresponding to each training task as the supervision target, thereby obtaining the preset deep reinforcement learning model.
Furthermore, when the preset deep reinforcement learning model is trained, an Adam optimizer can be selected to optimize the preset deep reinforcement learning model, so that the preset deep reinforcement learning model can converge rapidly.
When the Adam optimizer is used for optimizing the preset deep reinforcement learning model, a learning rate can be set for the optimizer, and the optimal learning rate can be selected by adopting a learning rate range test technology. The learning rate selection process of this test technology is as follows: first, the learning rate is set to a small value; then the preset deep reinforcement learning model is simply iterated over the training sample data several times, the learning rate is increased after each iteration is completed, and the training loss is recorded each time; finally, a learning rate range test chart is drawn. A typical ideal learning rate range test chart comprises three regions: in the first region the learning rate is too small and the loss is basically unchanged, in the second region the loss converges quickly, and in the last region the learning rate is too large so that the loss begins to diverge. The learning rate corresponding to the lowest point in the learning rate range test chart can then be used as the optimal learning rate.
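A compact sketch of such a learning rate range test is given below (PyTorch and a toy regression model are assumed; the schedule and step count are illustrative).

```python
import torch
import torch.nn as nn

def lr_range_test(model, make_batch, lr_start=1e-6, lr_end=1.0, steps=100):
    """Increase the learning rate geometrically over a few iterations, record the
    training loss at each step, and return the learning rate at the lowest loss."""
    factor = (lr_end / lr_start) ** (1.0 / steps)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)
    history = []
    lr = lr_start
    for _ in range(steps):
        x, y = make_batch()
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        lr *= factor                              # raise the learning rate after each iteration
        for group in optimizer.param_groups:
            group["lr"] = lr
    return min(history, key=lambda p: p[1])[0]    # learning rate at the lowest recorded loss

model = nn.Linear(8, 1)
best_lr = lr_range_test(model, lambda: (torch.randn(32, 8), torch.randn(32, 1)))
print(best_lr)
```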
In the embodiment of the application, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model, and trains a deep reinforcement learning network to obtain the preset deep reinforcement learning model by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station as input. In the embodiment of the application, the preset deep reinforcement learning model is obtained based on training of the training set, and the preset deep reinforcement learning model can be ensured to be more accurate, so that the target task unloaded to the target base station, which is obtained based on the preset deep reinforcement learning model, is ensured to be more accurate.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function. As shown in fig. 9, in step 802, "training a deep reinforcement learning network to obtain the preset deep reinforcement learning model by using, as input, attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal device corresponding to the training task to each candidate base station, identification information of the training base station corresponding to the training task, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device" may include the following steps:
step 901, the terminal device inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station to the initial actor network, and outputs the identifier of the training base station corresponding to the training task.
Wherein, the initial actor network may include a first actor network and a second actor network.
Step 902, the terminal device inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, and the identification of the training base station corresponding to the training task into the initial critic network, performs feature extraction on the input data by using the initial critic network, and outputs a training evaluation value for unloading the training task to the training base station.
The training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task.
Step 903, the terminal device uses the reward function to calculate a training reward value corresponding to the training base station to unload the training task.
The training return value is used for representing time delay data and energy consumption data corresponding to unloading of the training task to the training base station.
And 904, training the initial critic network by the terminal equipment according to the training return value to obtain a target critic network.
Step 905, the terminal device trains the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
The specific training and execution process may include the steps of:
1. The model comprises a plurality of terminal devices (intelligent agents), and each terminal device comprises an actor network part and a critic network part, wherein the actor/critic network comprises a first actor/critic network and a second actor/critic network. The second actor/critic network is completely replicated from the first actor/critic network prior to training; during training, the second actor/critic network is updated according to a certain rule, for example, if A represents a parameter of the first actor/critic network and B represents the corresponding parameter of the second actor/critic network, then B = αB + (1 − α)A (an illustrative sketch of this update, together with the training steps below, is given after step 4).
2. For convenience of representation, the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station are called as state information; the "identification of the training base station corresponding to the training task" is referred to as an action, and the "probability distribution of the candidate base stations being selected" is referred to as a joint action.
3. The execution flow comprises the following steps: firstly, each terminal device acquires attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and inputs this state information into a first actor network to obtain an identifier of the training base station corresponding to the training task and a corresponding return value. Meanwhile, each terminal device obtains the identifier of the corresponding base station selected by the adjacent terminal device through the communication module, and calculates the probability distribution of each candidate base station. At this time, the attribute information of the training task in the environment, the attribute information of the plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station are updated to the next time, and can be acquired by the terminal device. Finally, the terminal device combines the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task, the selected probability distribution of each base station, the corresponding return value, the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station into a complete experience, and stores the complete experience in respective independent experience pools for subsequent training.
4. Training process: typically, a complete training process involves multiple cycles from training a critic's network to training an actor's network, and both are dependent on each other.
Training a critic network: firstly, inputting attribute information of a training task obtained by random sampling from the experience pool, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, identification of the training base station corresponding to the training task and selected probability distribution information of each base station into a first critic network in a corresponding model by each terminal equipment to obtain a critic value; then inputting the attribute information of the training task at the next moment in the experience, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into a second actor network in a corresponding model to obtain the identification of the training base station corresponding to the training task at the next moment; then each terminal device obtains the identification of the training base station corresponding to the training task of the adjacent terminal device through an obtaining module, and calculates the probability distribution of each candidate base station to be selected; and finally, inputting the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task and the selected probability distribution information of each candidate base station into a second critic network in the submodel, and calculating to obtain a comment value of the next moment. At this time, the loss is calculated by using the comment value, the return value obtained by sampling and the comment value at the next moment, and the gradient is further calculated to update the first critic network in the terminal device.
Training an actor network: firstly, each terminal device inputs the attribute information of the training task obtained by sampling, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station into a first actor network in a corresponding model and obtains the identification of the training base station corresponding to the training task, and inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task and the selected probability distribution information of each candidate base station into a first critic network in the corresponding terminal device to obtain the corresponding comment value. Then, loss is calculated according to the comment values, and gradient is further calculated to update the first actor network in the corresponding terminal equipment.
And finally, updating the second actor/critic network in the terminal equipment according to the second actor/critic network updating mode in the step 1.
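The training procedure of steps 1 to 4 above can be illustrated end-to-end with the following sketch (PyTorch is assumed; the stand-in linear networks, dimensions and hyper-parameters are placeholders for the graph-convolution actor/critic networks described in the application).

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in first/second actor and critic networks (step 1).
actor, actor_t = nn.Linear(4, 3), nn.Linear(4, 3)                # first / second actor
critic, critic_t = nn.Linear(4 + 3 + 3, 1), nn.Linear(4 + 3 + 3, 1)
actor_t.load_state_dict(actor.state_dict())                      # second nets start as copies
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
pool = deque(maxlen=100_000)                                     # per-device experience pool (step 3)

def soft_update(target, online, alpha=0.99):
    """Step 1: B <- alpha * B + (1 - alpha) * A."""
    with torch.no_grad():
        for b, a in zip(target.parameters(), online.parameters()):
            b.mul_(alpha).add_((1.0 - alpha) * a)

def train_step(batch_size=32, gamma=0.95):
    """Step 4: one critic update followed by one actor update."""
    s, a, joint, r, s2, joint2 = (torch.stack(x) for x in zip(*random.sample(pool, batch_size)))
    # Critic: TD target from the second networks, MSE loss on the first critic.
    with torch.no_grad():
        a2 = torch.softmax(actor_t(s2), dim=-1)
        y = r + gamma * critic_t(torch.cat([s2, a2, joint2], dim=-1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a, joint], dim=-1)), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor: push the policy towards actions the first critic scores highly.
    a_new = torch.softmax(actor(s), dim=-1)
    actor_loss = -critic(torch.cat([s, a_new, joint], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Finally, let the second networks slowly track the first ones.
    soft_update(actor_t, actor)
    soft_update(critic_t, critic)

# Execution flow (step 3): interact, store experiences, then train.
for _ in range(64):
    s = torch.randn(4)
    a = torch.softmax(actor(s), dim=-1).detach()
    joint = torch.softmax(torch.randn(3), dim=-1)   # neighbours' base-station selection distribution
    pool.append((s, a, joint, torch.randn(1), torch.randn(4), joint))
train_step()
```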
In order to better explain the large-scale user task offloading method provided by the present application, the present application provides an illustrative embodiment of the overall flow aspect of the large-scale user task offloading method, as shown in fig. 10, the method includes:
step 1001, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model.
In step 1002, the terminal device trains a deep reinforcement learning network to obtain a preset deep reinforcement learning model by using attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device as input.
Step 1003, the terminal device obtains task attribute information of the target task to be unloaded and probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device.
In step 1004, the terminal device sends broadcast information to the base station.
Step 1005, the terminal device receives the attribute information sent by each base station, and determines the attribute information of a plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station according to the position information of the terminal device and the position information of the base station included in each attribute information.
Step 1006, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal device and each candidate base station into the target actor network, performs at least two times of feature extraction on the input data by using at least two layers of graph convolution neural networks in the target actor network, and outputs the identification information of the target base station based on the extracted features.
Step 1007, the terminal device inputs the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station, and the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device into the target critic network, performs at least twice feature extraction on the input data by using at least two-layer graph convolutional neural network in the target critic network, and outputs a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In step 1008, the terminal device calculates a target return value using the return function.
It should be understood that although the various steps in the flowcharts of fig. 2, 5, and 7-10 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not limited to be performed in a strict order unless explicitly stated in the embodiments of the present application, and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 5, and 7-10 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or at least partially with other steps or with at least some of the other steps.
In one embodiment of the present application, as shown in fig. 11, there is provided a large-scale user task offloading device 1100, including: a first obtaining module 1110, a second obtaining module 1120, a determining module 1130, and an uninstalling module 1140, wherein:
a first obtaining module 1110, configured to obtain task attribute information of a target task to be offloaded and the probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device.
A second obtaining module 1120, configured to obtain attribute information of multiple candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
A determining module 1130, configured to input the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of multiple candidate base stations, and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to a target task, and output a target evaluation value corresponding to the identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolution neural network, and the graph convolution neural network is used for performing at least twice feature extraction on input data of the preset deep reinforcement learning model;
an offloading module 1140 for offloading the target task to the target base station.
In an embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target critic network, as shown in fig. 12, the determining module 1130 includes: a first output unit 1131, and a second output unit 1132, wherein:
a first output unit 1131, configured to input the task attribute information, the attribute information of the multiple candidate base stations, and the channel estimation information between the terminal device and each candidate base station to the target actor network, and output the identification information of the target base station.
A second output unit 1132, configured to input, to the target critic network, the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station, and the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, and output a target evaluation value corresponding to the identification information of the target base station.
In an embodiment of the present application, the predetermined deep reinforcement learning model includes a reward function, as shown in fig. 13, the determining module 1130 further includes: a calculation unit 1133, wherein:
a calculating unit 1133, configured to calculate a target return value by using a return function, where the target return value is used to represent time delay data and energy consumption data corresponding to offloading of a target task to a target base station.
In an embodiment of the application, the preset deep reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph convolutional neural networks, the target critic network includes at least two layers of graph convolutional neural networks, and the determining module 1130 is specifically configured to input task attribute information, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, perform feature extraction on input data at least twice by using at least two layers of graph convolutional neural networks in the target actor network, and output identification information of the target base station based on extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In an embodiment of the present application, as shown in fig. 14, the second obtaining module 1120 includes a sending unit 1121 and a receiving unit 1122, where:
a sending unit 1121, configured to send broadcast information to the base stations by the terminal device, where the broadcast information is used to instruct each base station to send attribute information of the base station to the terminal device;
the receiving unit 1122 is configured to receive the attribute information transmitted by each base station, and determine the attribute information of the plurality of candidate base stations corresponding to the terminal device according to the position information of the terminal device and the position information of the base station included in each attribute information.
In an embodiment of the present application, as shown in fig. 15, the large-scale user task offloading device 1100 further includes: a third acquisition module 1150 and a training module 1160, wherein
A third obtaining module 1150, configured to obtain a training set corresponding to the preset deep reinforcement learning model, where the training set includes attribute information of multiple training tasks, attribute information of multiple candidate base stations corresponding to the training tasks, channel estimation information from a terminal device corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device.
The training module 1160 is configured to train the deep reinforcement learning network by using, as inputs, attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between a terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device, to obtain a preset deep reinforcement learning model.
In an embodiment of the application, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and the training module 1160 is specifically configured to input attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information from a terminal device corresponding to the training task to each candidate base station to an initial actor network, and output an identifier of a training base station corresponding to the training task; inputting the attribute information of a training task, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from terminal equipment corresponding to the training task to each candidate base station, probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment and identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on input data by using the initial critic network, outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task; calculating a training return value corresponding to the training task unloaded to the training base station by using a return function, wherein the training return value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station; training an initial critic network according to the training return value to obtain a target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
For specific limitations of the large-scale user task offloading device, reference may be made to the above limitations of the task offloading method, which are not described herein again. The modules in the task uninstalling device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 16. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a task offloading method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 16 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring task attribute information of a target task to be unloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset depth reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for carrying out at least twice feature extraction on input data of the preset depth reinforcement learning model; and unloading the target task to the target base station.
In one embodiment, the pre-set deep reinforcement learning model comprises a network of target actors and a network of target critics, and the processor when executing the computer program further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, and outputting identification information of the target base station; and inputting the task attribute information, the attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of the target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the predetermined deep reinforcement learning model includes a reward function, and the processor executes the computer program to further perform the following steps: and calculating a target return value by using a return function, wherein the target return value is used for representing time delay data and energy consumption data corresponding to the unloading of the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network and a target commentator network, the target actor network comprises at least two layers of graph convolutional neural networks, the target commentator network comprises at least two layers of graph convolutional neural networks, and the processor executes the computer program and further realizes the following steps: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, performing at least twice feature extraction on input data by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the terminal equipment sends broadcast information to the base stations, and the broadcast information is used for indicating each base station to send attribute information of the base station to the terminal equipment; and determining the attribute information of a plurality of candidate base stations related to the terminal equipment according to the position information of the terminal equipment and the position information of the base station included in each attribute information after receiving the attribute information sent by each base station.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a training set corresponding to a preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information from terminal equipment corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain a preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network, a target commentator network and a reward function, and the processor, when executing the computer program, further implements the following steps: inputting the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of a training task, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from terminal equipment corresponding to the training task to each candidate base station, probability distribution selected by adjacent terminal equipment of corresponding equipment of each candidate base station and identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on input data by using the initial critic network, outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task; calculating a training return value corresponding to the training task unloaded to the training base station by using a return function, wherein the training return value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station; training an initial critic network according to the training return value to obtain a target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring task attribute information of a target task to be unloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset depth reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for carrying out at least twice feature extraction on input data of the preset depth reinforcement learning model; and unloading the target task to the target base station.
In one embodiment, the pre-defined deep reinforcement learning model includes a network of target actors and a network of target critics, the computer program when executed by the processor further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, and outputting identification information of the target base station; and inputting the task attribute information, the attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of the target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the predetermined deep reinforcement learning model includes a reward function, and the computer program when executed by the processor further implements the steps of: and calculating a target return value by using a return function, wherein the target return value is used for representing time delay data and energy consumption data corresponding to the unloading of the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network and a target commentator network, the target actor network comprises at least two layers of graph convolutional neural networks, the target commentator network comprises at least two layers of graph convolutional neural networks, and when being executed by the processor, the computer program further realizes the following steps: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, performing at least twice feature extraction on input data by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In one embodiment, the computer program when executed by the processor further performs the steps of: the terminal equipment sends broadcast information to the base stations, and the broadcast information is used for indicating each base station to send attribute information of the base station to the terminal equipment; and determining the attribute information of a plurality of candidate base stations related to the terminal equipment according to the position information of the terminal equipment and the position information of the base station included in each attribute information after receiving the attribute information sent by each base station.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set includes attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal equipment corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment; and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model.
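Concretely, every element of such a training set bundles the five pieces of information listed above. The field names in the following sketch are illustrative, not taken from the application.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TrainingSample:
        task_attributes: List[float]                # e.g. data size, required CPU cycles, deadline
        candidate_bs_attributes: List[List[float]]  # one attribute vector per candidate base station
        channel_estimates: List[float]              # terminal equipment -> each candidate base station
        training_bs_id: int                         # identification information of the training base station
        neighbour_selection_dist: List[float]       # probability of each candidate being selected by adjacent terminals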
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and the computer program, when executed by the processor, further implements the following steps: inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for characterizing the degree of matching for unloading the training task to the corresponding training base station; calculating, by using the reward function, a training reward value corresponding to unloading the training task to the training base station, wherein the training reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading the training task to the training base station; training the initial critic network according to the training reward value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
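The following sketch shows one update step that is consistent with the training flow just described, reusing the TargetActor and TargetCritic classes from the earlier sketch: the critic is regressed towards the training reward value produced by the reward function, and the actor is pushed towards base stations the critic evaluates highly. The mean-squared-error loss, the policy-gradient form of the actor update and the batch layout are all assumptions; the application does not prescribe particular losses or optimizers.

    import torch
    import torch.nn.functional as F

    def train_step(actor, critic, actor_opt, critic_opt, batch):
        state = batch["state"]                    # task, base station and channel features
        neighbour_dist = batch["neighbour_dist"]  # neighbours' selection distribution
        reward = batch["reward"]                  # training reward value from the reward function

        # Critic update: fit its training evaluation value to the observed reward value.
        scores = actor.net(state)
        action_id = scores.argmax(dim=-1)         # identification of the training base station
        value = critic(state, action_id, neighbour_dist).squeeze(-1)
        critic_loss = F.mse_loss(value, reward)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor update: a simple policy-gradient step weighted by an advantage
        # built from the reward value and the critic's evaluation value.
        log_probs = F.log_softmax(actor.net(state), dim=-1)
        chosen_log_prob = log_probs.gather(-1, action_id.unsqueeze(-1)).squeeze(-1)
        advantage = (reward - value).detach()
        actor_loss = -(chosen_log_prob * advantage).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()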
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and all of them fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A large-scale user task offloading method, comprising:
acquiring task attribute information of a target task to be unloaded and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station;
inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for characterizing the degree of matching for unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, wherein the graph convolutional neural network is used for performing feature extraction at least twice on input data of the preset deep reinforcement learning model;
and unloading the target task to the target base station.
2. The method of claim 1, wherein the preset deep reinforcement learning model comprises a target actor network and a target critic network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station comprises:
inputting the task attribute information, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the target actor network, and outputting the identification information of the target base station;
and inputting the task attribute information, the attribute information of the candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
3. The method of claim 2, wherein the preset deep reinforcement learning model includes a reward function, and the step of inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model to determine the target base station corresponding to the target task further includes:
and calculating a target reward value by using the reward function, wherein the target reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading of the target task to the target base station.
4. The method of claim 1, wherein the preset deep reinforcement learning model comprises a target actor network and a target critic network, the target actor network comprises at least two layers of the graph convolutional neural network, the target critic network comprises at least two layers of the graph convolutional neural network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station comprises:
inputting the task attribute information, the attribute information of the candidate base stations and channel estimation information between the terminal equipment and each candidate base station into the target actor network, performing feature extraction on input data at least twice by using at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features;
inputting the task attribute information, the attribute information of the candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, performing feature extraction on the input data at least twice by using the at least two layers of graph convolutional neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
5. The method of claim 1, wherein the acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment comprises:
sending, by the terminal equipment, broadcast information to the base stations, wherein the broadcast information is used for instructing each base station to send its attribute information to the terminal equipment;
and receiving the attribute information sent by each base station, and determining, according to the position information of the terminal equipment and the position information of the base station included in each piece of attribute information, the attribute information of the plurality of candidate base stations associated with the terminal equipment.
6. The method according to claim 1, wherein the training process of the preset deep reinforcement learning model is as follows:
acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal equipment corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model.
7. The method of claim 6, wherein the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and the training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model comprises:
inputting the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station to an initial actor network, and outputting the identification of the training base station corresponding to the training task;
inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for characterizing the degree of matching for unloading the training task to the corresponding training base station;
calculating a training reward value corresponding to unloading of the training task to the training base station by using the reward function, wherein the training reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading of the training task to the training base station;
training the initial critic network according to the training reward value to obtain the target critic network;
and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
8. A large-scale user task offloading apparatus, the apparatus comprising:
a first acquisition module, configured to acquire task attribute information of a target task to be unloaded and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
a second acquisition module, configured to acquire attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each of the candidate base stations;
a determining module, configured to input the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to the target task, and output a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for characterizing the degree of matching for unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, wherein the graph convolutional neural network is used for performing feature extraction at least twice on input data of the preset deep reinforcement learning model;
and the unloading module is used for unloading the target task to the target base station.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110783668.8A 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium Active CN113676954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113676954A true CN113676954A (en) 2021-11-19
CN113676954B CN113676954B (en) 2023-07-18

Family

ID=78538882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110783668.8A Active CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113676954B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135094A (en) * 2019-05-22 2019-08-16 长沙理工大学 A kind of virtual plant Optimization Scheduling based on shrink space harmony algorithm
CN110347500A (en) * 2019-06-18 2019-10-18 东南大学 For the task discharging method towards deep learning application in edge calculations environment
CN112367353A (en) * 2020-10-08 2021-02-12 大连理工大学 Mobile edge computing unloading method based on multi-agent reinforcement learning
CN112202928A (en) * 2020-11-16 2021-01-08 绍兴文理学院 Credible unloading cooperative node selection system and method for sensing edge cloud block chain network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024168531A1 (en) * 2023-02-14 2024-08-22 华为技术有限公司 Communication method and apparatus

Also Published As

Publication number Publication date
CN113676954B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN109151864B (en) Migration decision and resource optimal allocation method for mobile edge computing ultra-dense network
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
Li et al. Energy-aware task offloading with deadline constraint in mobile edge computing
CN109951873B (en) Task unloading mechanism under asymmetric and uncertain information in fog computing of Internet of things
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
CN110740473B (en) Management method for mobile edge calculation and edge server
CN107708152B (en) Task unloading method of heterogeneous cellular network
CN113572804B (en) Task unloading system, method and device based on edge collaboration
CN112988285B (en) Task unloading method and device, electronic equipment and storage medium
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
CN111813539A (en) Edge computing resource allocation method based on priority and cooperation
WO2024174426A1 (en) Task offloading and resource allocation method based on mobile edge computing
Huda et al. Deep reinforcement learning-based computation offloading in uav swarm-enabled edge computing for surveillance applications
KR20210147240A (en) Energy Optimization Scheme of Mobile Devices for Mobile Augmented Reality Applications in Mobile Edge Computing
Lakew et al. Adaptive partial offloading and resource harmonization in wireless edge computing-assisted IoE networks
Dai et al. Deep reinforcement learning for edge computing and resource allocation in 5G beyond
CN116455768A (en) Cloud edge end collaborative CNN reasoning method and system for global time delay optimization
CN114698125A (en) Method, device and system for optimizing computation offload of mobile edge computing network
CN113676954A (en) Large-scale user task unloading method and device, computer equipment and storage medium
Li et al. Computation offloading strategy for IoT using improved particle swarm algorithm in edge computing
CN114995990A (en) Method and device for unloading computing tasks, electronic equipment and computer storage medium
CN114025359A (en) Resource allocation and computation unloading method, system, device and medium based on deep reinforcement learning
Hou et al. Cache control of edge computing system for tradeoff between delays and cache storage costs
Xie et al. Backscatter-aided hybrid data offloading for mobile edge computing via deep reinforcement learning
CN116419325A (en) Task unloading, resource allocation and track planning method and system for collaborative calculation of multiple unmanned aerial vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant