CN113676954B - Large-scale user task unloading method, device, computer equipment and storage medium - Google Patents

Large-scale user task unloading method, device, computer equipment and storage medium

Info

Publication number
CN113676954B
Authority
CN
China
Prior art keywords
base station
target
task
training
attribute information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110783668.8A
Other languages
Chinese (zh)
Other versions
CN113676954A (en)
Inventor
张旭
古博
林梓淇
丁北辰
姜善成
韩瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110783668.8A
Publication of CN113676954A
Application granted
Publication of CN113676954B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/09Management thereof
    • H04W28/0925Management thereof using policies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/52Allocation or scheduling criteria for wireless resources based on load
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application relates to a large-scale user task unloading method, a large-scale user task unloading device, computer equipment and a storage medium, and is suitable for the technical field of computers. The method comprises the following steps: acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment, and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, and determining a target base station corresponding to a target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network; and unloading the target task to the target base station. The method can effectively prevent a plurality of terminal devices from occupying computing resources, and avoid the phenomenon that the base station resources are insufficient and tasks are difficult to complete.

Description

Large-scale user task unloading method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of communications and resource allocation technologies, and in particular, to a method and apparatus for offloading large-scale user tasks, a computer device, and a storage medium.
Background
With the continuous development of communication technology, a great number of emerging mobile applications, such as cloud gaming, Virtual Reality (VR) and Augmented Reality (AR), are being promoted. Terminal devices alone struggle to satisfy the computing and latency demands of such applications. Task offloading technology has therefore been developed: it uses communication technology to offload computationally intensive tasks from the terminal device to a server with sufficient computing resources for processing, and the server then returns the computation result to the terminal device, thereby jointly optimizing computing capability and time delay. However, in cloud computing the offloading server and the terminal device are far away from each other, so the transmission delay is far higher than the delay that the computing task can tolerate, resulting in a poor experience for the terminal device. In recent years, offloading the computationally intensive tasks of the terminal device to an edge base station with sufficient computing resources for processing has therefore become a research hotspot.
Conventional algorithms, represented by convex optimization, game theory and the like, do not allow multiple terminal devices to communicate with one another when they offload tasks simultaneously.
Therefore, with these conventional methods, when a plurality of terminal devices offload tasks at the same time, several of them may offload their tasks to the same base station simultaneously, leaving the base station with insufficient resources and making the tasks difficult to complete.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for offloading tasks for a large-scale user, which can solve the problem of how to offload tasks cooperatively by a plurality of terminal devices.
In a first aspect, a method for offloading large-scale user tasks is provided, the method comprising: acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment, and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading a target task to a target base station, wherein the preset deep reinforcement learning model comprises a graph convolution neural network, and the graph convolution neural network is used for carrying out feature extraction on input data of the preset deep reinforcement learning model at least twice; and unloading the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model includes a target actor network and a target criticism network, inputs task attribute information, probability distribution of each candidate base station being selected by an adjacent terminal device of a corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determines a target base station corresponding to a target task, and outputs a target evaluation value corresponding to identification information of the target base station, including: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station to a target actor network, and outputting identification information of the target base station; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into the target criticism network, and the target evaluation value corresponding to the identification information of the target base station is output.
In one embodiment, the preset deep reinforcement learning model includes a return function, inputs task attribute information, probability distribution of each candidate base station selected by a neighboring terminal device of a corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determines a target base station corresponding to the target task, and further includes: and calculating a target return value by using the return function, wherein the target return value is used for representing delay data and energy consumption data corresponding to unloading the target task to the target base station.
In one embodiment, the preset depth reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph convolution neural networks, the target critic network includes at least two layers of graph convolution neural networks, task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal equipment and each candidate base station are input into the preset depth reinforcement learning model, a target base station corresponding to a target task is determined, and a target evaluation value corresponding to identification information of the target base station is output, including: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station into a target actor network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; inputting task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In one embodiment, acquiring attribute information of a plurality of candidate base stations associated with a terminal device includes: the terminal equipment sends broadcast information to the base stations, wherein the broadcast information is used for indicating each base station to send attribute information of the base stations to the terminal equipment; and receiving attribute information sent by each base station, and determining attribute information of a plurality of candidate base stations associated with the terminal equipment according to the position information of the terminal equipment and the position information of the base stations included in each attribute information.
In one embodiment, the training process of the preset deep reinforcement learning model is as follows: acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information between the terminal equipment corresponding to each training task and each candidate base station, identification information of the training base station corresponding to each training task, and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; and training the deep reinforcement learning network by taking, as inputs, the attribute information of a training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, to obtain the preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and training the deep reinforcement learning network with, as inputs, the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, to obtain the preset deep reinforcement learning model, includes: inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into an initial actor network, and outputting an identifier of the training base station corresponding to the training task; inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, the probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, and the identifier of the training base station corresponding to the training task into an initial critic network, extracting features of the input data by using the initial critic network, and outputting a training evaluation value for offloading the training task to the training base station, wherein the training evaluation value is used for representing the degree of matching between the training task and the training base station; calculating, by using the return function, a training return value corresponding to offloading the training task to the training base station, wherein the training return value is used for representing the time delay data and energy consumption data corresponding to offloading the training task to the training base station; training the initial critic network according to the training return value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
In a second aspect, there is provided a large-scale user task offloading apparatus, comprising:
the first acquisition module is used for acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment;
a second acquisition module, configured to acquire attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station;
the determining module is used for inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station, wherein the target evaluation value is used for representing matching degree of unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for extracting features of input data of the preset deep reinforcement learning model at least twice;
and the unloading module is used for unloading the target task to the target base station.
In a third aspect, there is provided a computer device comprising a memory storing a computer program and a processor implementing a method of large-scale user task offloading as any one of the first aspects above when the computer program is executed by the processor.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of large scale user task offloading as in any of the first aspects above.
The large-scale user task offloading method, apparatus, computer device and storage medium acquire the task attribute information of a target task to be offloaded and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device; acquire the attribute information of a plurality of candidate base stations associated with the terminal equipment and the channel estimation information between the terminal equipment and each candidate base station; input the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to the target task, and output a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for representing the degree of matching of offloading the target task to the target base station, the preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used for performing feature extraction at least twice on the input data of the preset deep reinforcement learning model; and offload the target task to the target base station. In the method, the terminal equipment not only acquires the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding device, but also acquires the attribute information of a plurality of candidate base stations associated with the terminal equipment and the channel estimation information between the terminal equipment and each candidate base station, so the terminal equipment can clearly determine to which base station each adjacent terminal device offloads its task, which finally ensures cooperative offloading among adjacent terminal devices. The terminal equipment inputs the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, and the target base station corresponding to the target task is determined. The terminal equipment combines the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station, and determines the target base station corresponding to the target task based on the preset deep reinforcement learning model; because the preset deep reinforcement learning model comprises a graph convolutional neural network, this solves the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations.
In addition, in the method, through the mutual communication between the terminal equipment and its neighboring terminal equipment, cooperative decision-making among the terminal devices is realized, the optimal overall performance of the system is further achieved, the situation in which a plurality of terminal devices compete for the same computing resources is effectively prevented, and the phenomenon that tasks are difficult to complete due to insufficient base station resources is avoided. In addition, the preset deep reinforcement learning model can also output a target evaluation value, so that the degree to which the target task matches the target base station can be evaluated.
Drawings
FIG. 1 is an application environment diagram of a large-scale user task offloading method in one embodiment;
FIG. 2 is a flow diagram of a large-scale user task offloading method, in one embodiment;
FIG. 3 is a schematic diagram of a deep reinforcement learning model in a large-scale user task offloading method according to an embodiment;
FIG. 4 is a schematic diagram of a graph convolutional neural network in a large-scale user task offloading method according to another embodiment;
FIG. 5 is a flow diagram of a large-scale user task offloading method, in one embodiment;
FIG. 6 is a schematic diagram of a deep reinforcement learning model in a large-scale user task offloading method according to one embodiment;
FIG. 7 is a flow diagram of a method of large-scale user task offloading in one embodiment;
FIG. 8 is a flow diagram of a method of large-scale user task offloading in one embodiment;
FIG. 9 is a flow diagram of a method of large-scale user task offloading in one embodiment;
FIG. 10 is a flow diagram of a method of large-scale user task offloading in one embodiment;
FIG. 11 is a block diagram of a large-scale user task offloading device, in one embodiment;
FIG. 12 is a block diagram of a large-scale user task offloading apparatus in one embodiment;
FIG. 13 is a block diagram of a large-scale user task offloading device, in one embodiment;
FIG. 14 is a block diagram of a large-scale user task offloading device, in one embodiment;
FIG. 15 is a block diagram of a large-scale user task offloading device, in one embodiment;
fig. 16 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The large-scale user task unloading method provided by the application can be applied to an application environment shown in figure 1. Wherein the terminal device 102 communicates with the base station 104 via a network. The terminal equipment acquires attribute information of a plurality of candidate base stations corresponding to the terminal equipment through communication with the base stations according to the position information of the terminal equipment. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the base station 104 may be a server cluster formed by a plurality of base stations.
In one embodiment, as shown in fig. 2, a method for offloading large-scale user tasks is provided, and the method is applied to the terminal device in fig. 1 for illustration, and includes the following steps:
step 201, the terminal device obtains task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal devices of the corresponding device.
Specifically, the terminal device may obtain attribute information of a target task to be offloaded, where the task attribute information of the target task may include a data size of the target task, identification information of the target task, and the like. In addition, the terminal device can also obtain probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device through communication connection with the adjacent terminal device.
In step 202, the terminal device obtains attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
Specifically, the terminal device may transmit signals to surrounding base stations by broadcasting, and receive the attribute information returned by each base station. The attribute information returned by each base station may include the location information of that base station. The terminal device determines the plurality of candidate base stations corresponding to the terminal device according to its own location information and the location information of each base station, and determines the attribute information corresponding to these candidate base stations. The terminal device then determines the channel estimation information between the terminal device and each candidate base station according to the attribute information of the terminal device and the attribute information of the plurality of candidate base stations associated with the terminal device.
In step 203, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determines a target base station corresponding to the target task, and outputs a target evaluation value corresponding to the identification information of the target base station.
The target evaluation value is used for representing the matching degree of unloading the target task to the target base station. The preset deep reinforcement learning model comprises a graph convolution neural network, and the graph convolution neural network is used for carrying out feature extraction at least twice on input data of the preset deep reinforcement learning model.
Specifically, the terminal device inputs task attribute information, probability distribution of each candidate base station selected by adjacent terminal devices of the corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, the terminal device performs feature extraction on input data at least twice by using a graph convolution neural network in the preset deep reinforcement learning model, and determines a target base station corresponding to a target task based on the extracted features.
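As a concrete reading of the step above, the following sketch shows how the four pieces of information could be gathered and passed to a trained model that returns one score per candidate base station. It is only an illustration: the function name select_target_base_station, the dictionary layout of the observation, and the stand-in policy_model are the editor's assumptions, not part of the patent.

```python
import numpy as np

def select_target_base_station(task_info, neighbor_choice_probs,
                               bs_attributes, channel_estimates, policy_model):
    """Illustration of step 203: gather the observation pieces, feed them to a
    learned model, and pick the best-scoring candidate base station.

    task_info            : 1-D array describing the task (e.g. data size).
    neighbor_choice_probs: (num_neighbors, num_bs) probabilities that each
                           neighboring terminal device picks each base station.
    bs_attributes        : (num_bs, attr_dim) candidate base station attributes.
    channel_estimates    : (num_bs,) channel estimation to each base station.
    policy_model         : callable returning one score per candidate base station.
    """
    observation = {
        "task": np.asarray(task_info, dtype=float),
        "neighbors": np.asarray(neighbor_choice_probs, dtype=float),
        "base_stations": np.asarray(bs_attributes, dtype=float),
        "channels": np.asarray(channel_estimates, dtype=float),
    }
    scores = policy_model(observation)      # one score per candidate base station
    target_bs = int(np.argmax(scores))      # identification of the target base station
    return target_bs, scores[target_bs]     # target base station and its evaluation value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_model = lambda obs: rng.random(obs["base_stations"].shape[0])  # stand-in model
    bs_id, value = select_target_base_station(
        task_info=[5.0e6], neighbor_choice_probs=rng.random((3, 4)),
        bs_attributes=rng.random((4, 2)), channel_estimates=rng.random(4),
        policy_model=dummy_model)
    print("offload to base station", bs_id, "with evaluation", value)
```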
A deep reinforcement learning model is a hotspot of current research and has been widely used in various research fields. As shown in fig. 3, a deep reinforcement learning model is used in a specific application scenario to learn a coping strategy: it usually takes observable state information of the environment (state $s_t$) as input, the terminal device evaluates this state and then takes a corresponding action ($a_t$), which acts on the environment, and the resulting feedback (reward $r_t$) is used to improve the strategy. This cycle is repeated until the terminal device can freely cope with the dynamic changes of the environment. In general, reinforcement learning can be divided into two categories. One is the value-based approach (such as the DQN algorithm), which aims to maximize the return on each action taken, so the higher the return of an action, the more readily that action is selected. The other is the policy-based approach, which aims to directly learn a parameterized policy $\pi_\theta$. The parameter $\theta$ in the policy-based method can be updated by backward gradient propagation of the objective
$$J(\theta)=\mathbb{E}_{s\sim p^{\pi},\,a\sim\pi_\theta}\left[R(s,a)\right]$$
where $p^{\pi}$ is the state distribution probability. The gradient can be calculated according to the following formula:
$$\nabla_\theta J(\theta)=\mathbb{E}_{s\sim p^{\pi},\,a\sim\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,Q^{\pi}(s_t,a_t)\right]$$
where $\pi_\theta(a_t\mid s_t)$ represents the probability of selecting action $a_t$ given state $s_t$. The model parameters are then updated by backward gradient conduction:
$$\theta\leftarrow\theta+\alpha\,\nabla_\theta J(\theta)$$
where $\alpha$ is the step size set in the learning process.
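Read as code, the three formulas above reduce to a few lines. The sketch below is a generic REINFORCE-style update for a linear softmax policy, shown only to make the roles of the gradient and the step size $\alpha$ concrete; it is not the specific network used later in this application.

```python
import numpy as np

def policy_gradient_step(theta, action_features, action, q_value, alpha=0.01):
    """One update theta <- theta + alpha * grad_theta log pi_theta(a|s) * Q(s, a)
    for a linear softmax policy pi_theta(a|s) proportional to exp(phi(s,a) . theta).

    theta           : (d,) policy parameters.
    action_features : (A, d) feature vector phi(s, a) for each of the A actions.
    action          : index of the action actually taken.
    q_value         : estimated return Q(s, a) for that action.
    """
    logits = action_features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # pi_theta(.|s)
    grad_log_pi = action_features[action] - probs @ action_features  # grad log pi_theta(a|s)
    return theta + alpha * q_value * grad_log_pi           # gradient ascent on J(theta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.zeros(3)
    theta = policy_gradient_step(theta, rng.random((4, 3)), action=2, q_value=1.5)
    print(theta)
```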
In the embodiment of the application, a Multi-terminal equipment distributed reinforcement learning algorithm (Multi-Agent Graph Learning based Actor Critic Reinforcement Learning, MAGCAC) based on graph learning in a deep reinforcement learning model is mainly improved to obtain a preset deep reinforcement learning model. The preset deep reinforcement learning model is used for determining the base station which has the shortest time delay required in the unloading process of the target task and has the energy consumption meeting the preset constraint condition from a plurality of base stations.
In addition, graph convolutional networks (Graph Convolutional Networks, GCN) have been a research hotspot since their introduction in 2017 and have achieved remarkable results in a variety of fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the like; as a result, a large body of work on graph learning theory has emerged. Fig. 4 shows a multi-layer graph convolutional network, which takes graph structural features as input, outputs the corresponding features after graph convolution, and is calculated layer by layer as follows:
$$H^{(l+1)}=\sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$
where $\tilde{A}=A+I_N$ is the adjacency matrix of the graph structure augmented with self-connections, and $I_N$ is the identity matrix; $\tilde{D}$ is the corresponding degree matrix with $\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}$; $W^{(l)}$ is a learnable weight parameter matrix; $\sigma(\cdot)$ is an activation function, e.g., $\mathrm{ReLU}(\cdot)$; and $H^{(l)}\in\mathbb{R}^{N\times D}$ is the feature extracted by the $l$-th graph convolutional layer, with $H^{(0)}=X$, where $X$ is the input graph structural feature.
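A minimal NumPy rendering of this layer-wise propagation rule is sketched below, assuming the standard form with self-connections, symmetric normalization and a ReLU activation; the function name and the tensor shapes are illustrative only.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    A : (N, N) adjacency matrix of the graph structure.
    H : (N, D) node features from the previous layer (H0 = X, the input features).
    W : (D, D_out) learnable weight matrix of this layer.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-connections
    d = A_tilde.sum(axis=1)                    # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^-1/2
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation

# Two stacked layers, mirroring the "at least two feature extractions" in the text.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = (rng.random((6, 6)) > 0.5).astype(float); A = np.maximum(A, A.T)
    X = rng.random((6, 4))
    H1 = gcn_layer(A, X, rng.random((4, 8)))
    H2 = gcn_layer(A, H1, rng.random((8, 8)))
    print(H2.shape)  # (6, 8)
```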
In step 204, the terminal device offloads the target task to the target base station.
Specifically, after determining the target base station corresponding to the target task, the terminal device may offload the target task to the target base station, and after the target base station calculates the target task, send the calculation result to the terminal device.
In the task offloading method, a terminal device acquires the task attribute information of a target task to be offloaded and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device; acquires the attribute information of a plurality of candidate base stations associated with the terminal equipment and the channel estimation information between the terminal equipment and each candidate base station; inputs the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, and determines a target base station corresponding to the target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used for performing feature extraction at least twice on the input data of the preset deep reinforcement learning model; and offloads the target task to the target base station. In the method, the terminal equipment not only acquires the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding device, but also acquires the attribute information of a plurality of candidate base stations associated with the terminal equipment and the channel estimation information between the terminal equipment and each candidate base station, so the terminal equipment can clearly determine to which base station each adjacent terminal device offloads its task, which finally ensures cooperative offloading among adjacent terminal devices. The terminal equipment inputs the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, and the target base station corresponding to the target task is determined. The terminal equipment combines the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding device, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station, and determines the target base station corresponding to the target task based on the preset deep reinforcement learning model; because the preset deep reinforcement learning model comprises a graph convolutional neural network, this solves the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations.
In addition, in the method, through the mutual communication between the terminal equipment and its neighboring terminal equipment, cooperative decision-making among the terminal devices is realized, the optimal overall performance of the system is further achieved, the situation in which a plurality of terminal devices compete for the same computing resources is effectively prevented, and the phenomenon that tasks are difficult to complete due to insufficient base station resources is avoided. In addition, the preset deep reinforcement learning model can also output a target evaluation value, so that the degree to which the target task matches the target base station can be evaluated.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critics network, and a return function, as shown in fig. 5, in the step 203, the task attribute information, probability distribution of each candidate base station being selected by an adjacent terminal device of a corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station are input into the preset deep reinforcement learning model, a target base station corresponding to a target task is determined, and a target evaluation value corresponding to identification information of the target base station is output, and may include the following steps:
in step 501, the terminal device inputs task attribute information, probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, and outputs identification information of the target base station.
Specifically, the terminal device inputs task attribute information, probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, the terminal device may perform feature extraction on the input data by using at least two feature extraction layers included in the target actor network, calculate the extracted features by using a full connection layer in the target actor network, and finally output identification information of the target base station.
Specifically, in the embodiment of the application, the preset deep reinforcement learning model is mainly improved based on a multi-agent distributed reinforcement learning algorithm based on graph learning (Multi-Agent Graph Learning based Actor Critic Reinforcement Learning, MAGCAC). The algorithm takes each terminal device as an agent, takes the whole edge computing system as the environment, and is divided into an actor network and a critic network.
In the embodiment of the application, the observation state refers to the model's observation of the environment, and whether the features chosen for the observation state are reasonable directly affects whether the terminal device can learn an effective coping strategy. The algorithm regards both the terminal devices and the base stations in the system as nodes, and draws a corresponding graph structure G according to the connectivity between terminal devices and base stations. For convenience of implementation, the terminal device is regarded as a special base station in the embodiment of the application; that is, the system does not support computing a task entirely locally on the terminal device, so the feature information of the terminal device acting as a base station is set to 0. It should be noted that, in the embodiment of the present application, only the connectivity between terminal devices and base stations is considered, and the connectivity between terminal devices is not considered. The node features of the terminal devices and of the base stations are therefore recorded separately, and the graph structure corresponding to terminal device i is built from these node features together with the connectivity described above.
In the embodiment of the application, the graph structure at time t is used as the state observation information at time t. In the task offloading process, the time delay and the energy consumption are mainly influenced by the following factors: the computing capability $f_j$ of each base station, the achievable transmission rate $r_{i,j}(t)$, and how crowded the base station's computing resources are. Thus, the computing power of the connectable base stations and the achievable transmission rates are taken as the main observation state information for terminal device i, while the occupancy of base station computing resources depends on the cooperation among the neighboring devices.
At time t, the terminal device derives a corresponding action by evaluating the current state information.
The action is a one-hot encoding over the candidate base stations: the base station selected for offloading is marked 1 and the others are marked 0. However, since actions are required to be continuous in the DDPG algorithm, the embodiment of the application re-represents the continuous output of the DDPG algorithm and discretizes it into the one-hot encoded form described above.
Furthermore, as shown in fig. 6, in the embodiment of the present application, the actor network structure in the MAGCAC algorithm takes the graph structure G as input, uses two layers of GCN to extract features, and finally uses a multi-layer perceptron (Multilayer Perceptron, MLP) to produce the output. Since the action space of each agent is different, the output of the multi-layer perceptron is multiplied by the mask of the corresponding agent to obtain the final action.
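A compact sketch of this actor structure is given below: two graph convolutions, an MLP head, an agent-specific mask, and discretization to a one-hot action. The mean pooling of node features, the weight layout and all names are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def _gcn(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A+I) D^-1/2 H W), as in the GCN sketch above."""
    A_t = A + np.eye(A.shape[0])
    D_is = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    return np.maximum(D_is @ A_t @ D_is @ H @ W, 0.0)

def actor_forward(A, X, weights, action_mask):
    """Sketch of the actor network described above.

    weights     : dict with GCN weights 'W1' (D, D1), 'W2' (D1, D2)
                  and MLP weights 'W3' (D2, num_bs), 'b3' (num_bs,).
    action_mask : (num_bs,) 1 for base stations this agent can reach, else 0.
    """
    H1 = _gcn(A, X, weights["W1"])                # first feature extraction
    H2 = _gcn(A, H1, weights["W2"])               # second feature extraction
    g = H2.mean(axis=0)                           # pool node features into a graph feature
    logits = g @ weights["W3"] + weights["b3"]    # MLP head scores every base station
    logits = np.where(action_mask > 0, logits, -np.inf)  # mask unreachable base stations
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                   # probability of selecting each base station
    action = np.zeros_like(probs)
    action[np.argmax(probs)] = 1.0                # discretize to the one-hot action
    return action, probs
```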
Thus, when agent i determines its policy, the corresponding policy gradient can be calculated and used to update the actor network parameters.
Similarly, the critic network structure in the MAGCAC algorithm takes the graph structure G as input, uses two layers of GCN to extract features, and finally uses a multi-layer perceptron (Multilayer Perceptron, MLP) to produce the output. The loss function of the critic network can then be calculated as
$$L(\phi_i)=\mathbb{E}\Big[\big(y_i-Q_{\phi_i}(o_i(t),a_i(t))\big)^2\Big]$$
where $o_i(t)$ and $a_i(t)$ denote the observation and one-hot action of terminal device $i$ at time $t$, and $y_i$ is the target action value, calculated as
$$y_i=r_i(t)+\gamma\,Q'_{\phi_i}\big(o_i(t+1),a_i(t+1)\big)$$
with $r_i(t)$ the return value, $\gamma$ the discount factor and $Q'$ the target critic. In addition, the probability distribution of each base station being selected by the neighboring terminal devices of terminal device $i$ is part of the critic input, where $G_i$ denotes the set of neighboring terminal devices of terminal device $i$.
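Under the usual one-step temporal-difference assumption, the critic's training signal sketched above can be computed as follows; the function and argument names, and the discount value, are illustrative only.

```python
import numpy as np

def critic_loss(q_values, rewards, next_q_values, gamma=0.95):
    """Sketch of the critic training signal: the target action value combines
    the immediate return with the discounted estimate of the target critic.

    q_values      : (B,) critic outputs Q(o_t, a_t) for a batch of transitions.
    rewards       : (B,) return values r_i(t) from the return function.
    next_q_values : (B,) target-critic estimates for the next time step.
    gamma         : discount factor (illustrative value).
    """
    y = rewards + gamma * next_q_values   # target action value
    td_error = y - q_values
    return np.mean(td_error ** 2)         # mean squared TD error

if __name__ == "__main__":
    print(critic_loss(np.array([1.0, 0.5]), np.array([-0.2, -0.1]),
                      np.array([0.9, 0.4])))
```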
step 502, the terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of a target base station, and probability distribution of each candidate base station selected by adjacent terminal devices of the corresponding devices to the target criticism network, and outputs a target evaluation value corresponding to the identification information of the target base station.
Specifically, the terminal device inputs the above information, including the probability distribution of each candidate base station being selected by the adjacent terminal devices of the corresponding device, into the target critic network, performs feature extraction on the input data by using at least two feature extraction layers in the critic network, and outputs the target evaluation value corresponding to the identification information of the target base station.
In step 503, the terminal device calculates a target return value by using the return function.
The target return value is used for representing delay data and energy consumption data corresponding to unloading the target task to the target base station.
Specifically, the report value is used for representing the task time delay condition and the energy consumption condition corresponding to the task unloading of the target task to the target base station. The higher the return value is, the shorter the task time delay corresponding to the task unloading to the target base station is, and the lower the energy consumption is.
Illustratively, in the embodiment of the application, the reward function is designed to minimize task latency under the constraint of satisfying the energy consumption budget. For a given action, the corresponding return function is calculated according to the following formula:
$$r_i(t)=-T_i(t)+\max\Big(p_{\min},\ \min\big(0,\ \varepsilon_i^{\mathrm{budget}}-\varepsilon_i(t)\big)\Big)$$
where $T_i(t)$ is the task delay, $\varepsilon_i(t)$ is the transmission energy consumption, $\varepsilon_i^{\mathrm{budget}}$ is the energy consumption budget, and $p_{\min}$, a non-positive number, is the lower limit of the energy consumption penalty. This return function always aims at minimizing task delay while taking battery energy consumption safety into account. When the energy consumption $\varepsilon_i(t)$ is lower than $\varepsilon_i^{\mathrm{budget}}$, the energy consumption part of the reward is 0; that is, the embodiment of the application places no specific limit on task transmission energy consumption as long as energy consumption safety is guaranteed. When the energy consumption $\varepsilon_i(t)$ is higher than $\varepsilon_i^{\mathrm{budget}}$, this part becomes negative, and the penalty is bounded below by $p_{\min}$. Therefore, under the guidance of this return function, the terminal device can learn an excellent task offloading strategy that takes both task delay and transmission energy consumption into account, and offload a given task to a suitable base station.
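A small sketch of a return function with exactly this structure is given below; the argument names and the example penalty floor are the editor's, not the patent's.

```python
def return_value(task_delay, energy_used, energy_budget, penalty_floor=-1.0):
    """Return function with the structure described above: minimize task delay,
    with an energy term that is 0 while consumption stays within the budget and
    a bounded negative penalty once the budget is exceeded. penalty_floor is a
    non-positive lower limit on the energy penalty (value here is illustrative).
    """
    energy_term = min(0.0, energy_budget - energy_used)   # 0 if within budget, else negative
    energy_term = max(penalty_floor, energy_term)          # penalty is bounded below
    return -task_delay + energy_term

# Within budget: the reward is just the negative delay.
print(return_value(task_delay=0.3, energy_used=0.8, energy_budget=1.0))   # -0.3
# Over budget: a bounded penalty is added.
print(return_value(task_delay=0.3, energy_used=1.6, energy_budget=1.0))   # -0.9
```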
In the embodiment of the application, the terminal device inputs task attribute information, probability distribution that each candidate base station is selected by the adjacent terminal device of the corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, and outputs identification information of the target base station. And then, the terminal equipment inputs probability distribution selected by the adjacent terminal equipment of the corresponding equipment of each candidate base station to a target criticism network, and outputs a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for representing the matching degree of unloading the target task to the target base station. In addition, the terminal device calculates a target return value by using the return function. Therefore, the task time delay for unloading the target task to the target base station is ensured to be shortest, and the energy consumption constraint condition is met.
In an optional embodiment of the present application, the preset depth reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph convolution neural networks, the target critic network includes at least two layers of graph convolution neural networks, a task time delay of unloading a target task to a target base station is shortest, and an energy consumption constraint condition is satisfied, and the "inputting task attribute information, probability distribution of each candidate base station being selected by a neighboring terminal device of a corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station in the step 203 into the preset depth reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to identification information of the target base station" may include:
The terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, at least two layers of graph convolution neural networks in the target actor network are utilized to extract characteristics of input data at least twice, and identification information of the target base station is output based on the extracted characteristics.
The terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, at least two-layer graph convolution neural networks in the target critic network are utilized for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
The graph convolutional network (Graph Convolutional Networks, GCN) has been a research hotspot since 2017 and has achieved remarkable results in various fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the like; as a result, a large body of work on graph learning theory has emerged. Fig. 4 shows a multi-layer graph convolutional network, which takes graph structural features as input, outputs the corresponding features after graph convolution, and is calculated layer by layer as follows:
$$H^{(l+1)}=\sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$
where $\tilde{A}=A+I_N$ is the adjacency matrix of the graph structure augmented with self-connections, and $I_N$ is the identity matrix; $\tilde{D}$ is the corresponding degree matrix with $\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}$; $W^{(l)}$ is a learnable weight parameter matrix; $\sigma(\cdot)$ is an activation function, e.g., $\mathrm{ReLU}(\cdot)$; and $H^{(l)}\in\mathbb{R}^{N\times D}$ is the feature extracted by the $l$-th graph convolutional layer, with $H^{(0)}=X$, where $X$ is the input graph structural feature.
Specifically, the actor network structure takes task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station as input, extracts characteristics of the input information by using two layers of GCNs, calculates the extracted characteristics by using a multi-layer perceptron (Multilayer Perceptron, MLP), and outputs identification information of a target base station.
The terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target criticism network, and at least two layers of graph convolution neural networks in the target criticism network are utilized to extract at least two times of characteristics of input data, so that a target evaluation value corresponding to the identification information of the target base station is output.
The target evaluation value is used for representing the matching degree of unloading the target task to the target base station.
Specifically, the probability distribution of each candidate base station being selected by the adjacent terminal devices of the corresponding device is also input into the target critic network. The terminal device performs feature extraction on the input data at least twice by using the at least two layers of graph convolutional neural networks in the critic network, calculates the extracted features by using a multi-layer perceptron (Multilayer Perceptron, MLP), and outputs the target evaluation value corresponding to the identification information of the target base station.
The loss function of the target critic network can be calculated as
$$L(\phi_i)=\mathbb{E}\Big[\big(y_i-Q_{\phi_i}(o_i(t),a_i(t))\big)^2\Big]$$
where $y_i$ is the target action value, calculated as
$$y_i=r_i(t)+\gamma\,Q'_{\phi_i}\big(o_i(t+1),a_i(t+1)\big)$$
with $r_i(t)$ the return value and $\gamma$ the discount factor. Here, the probability distribution of each candidate base station being selected by the adjacent terminal devices of terminal device $i$ is included in the critic input, and $G_i$ denotes the set of neighboring terminal devices of terminal device $i$.
in the embodiment of the application, the terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, at least two layers of graph convolution neural networks in the target actor network are utilized to extract characteristics of input data at least twice, and identification information of the target base station is output based on the extracted characteristics. The terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, at least two layers of graph convolution neural networks in the target critic network are utilized for carrying out feature extraction on input data at least twice, a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features, and the target evaluation value is used for representing matching degree of unloading a target task to the target base station. In the method, the input data is subjected to at least two times of feature extraction by utilizing at least two layers of graph convolution neural networks in the target actor network, so that the accuracy of the features extracted by the target actor network is ensured, and the accuracy of the identification of the target base station output by the target actor network is ensured to be higher. In addition, at least two layers of graph convolution neural networks in the target criticism network are utilized to conduct feature extraction on input data at least twice, and accuracy of target evaluation values output by the target criticism network is guaranteed.
In an optional embodiment of the present application, as shown in fig. 7, the "obtaining attribute information of a plurality of candidate base stations associated with a terminal device" in step 202 includes:
in step 701, the terminal device transmits broadcast information to the base station.
The broadcast information is used for indicating each base station to send attribute information of the base station to the terminal equipment.
Specifically, the terminal device may transmit broadcast information to the base stations around it before offloading the target task.
After receiving the broadcast information sent by the terminal equipment, each base station may send attribute information of the base station to the terminal equipment, and establish a connection with the terminal equipment.
In step 702, the terminal device receives the attribute information sent by each base station, and determines attribute information of a plurality of candidate base stations associated with the terminal device according to the location information of the terminal device and the location information of the base stations included in each attribute information.
Specifically, the attribute information sent by each base station may include the position information of that base station, so that the terminal device, after receiving the attribute information, can determine the position of each base station. Based on its own location information and the location information of the base stations, the terminal device may then select, from among the base stations whose attribute information it has received, the base stations that are relatively close to it as the plurality of candidate base stations corresponding to the terminal device, and thereby determine the attribute information of the plurality of candidate base stations.
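A minimal sketch of this distance-based selection is shown below; the attribute layout (an 'id' and a 'pos' field) and the number of candidates k are assumptions introduced for illustration:

```python
import math

def select_candidate_base_stations(device_pos, base_station_attrs, k=5):
    """Pick the k base stations closest to the terminal device.

    device_pos: (x, y) position of the terminal device.
    base_station_attrs: list of dicts, each holding at least an 'id' and a 'pos' (x, y)
                        taken from the attribute information each base station sent back.
    Returns the attribute dicts of the k nearest candidate base stations.
    """
    def distance(attrs):
        bx, by = attrs["pos"]
        return math.hypot(device_pos[0] - bx, device_pos[1] - by)

    return sorted(base_station_attrs, key=distance)[:k]
```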
In the embodiment of the application, the terminal device sends broadcast information to the base stations, receives the attribute information sent by each base station, and determines the attribute information of the plurality of candidate base stations corresponding to the terminal device according to the location information of the terminal device and the location information of the base stations included in each piece of attribute information. In this method, by sending broadcast information and receiving the attribute information each base station sends back, the terminal device determines which base stations can establish a connection with it; from among these connected base stations, it then determines the attribute information of the plurality of candidate base stations according to the location information of the terminal device and the location information included in each piece of attribute information. This ensures that the candidate base stations corresponding to the terminal device can be stably connected to the terminal device and are relatively close to it, which helps to minimize the task delay required for offloading the target task to the target base station while satisfying the energy consumption constraint.
In an alternative embodiment of the present application, as shown in fig. 8, the training process of the preset deep reinforcement learning model may include the following:
step 801, a terminal device acquires a training set corresponding to a preset deep reinforcement learning model.
The training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information between terminal equipment corresponding to the training tasks and each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment.
Specifically, before training a preset deep reinforcement learning model, the terminal device needs to acquire a training set corresponding to the preset deep reinforcement learning model. The terminal device may obtain attribute information of a plurality of training tasks, where the attribute information of the plurality of tasks may include data size information of each training task and identification information of each training task. The terminal equipment can also obtain attribute information of each candidate base station corresponding to the training task through communication connection with the base station. The terminal device can calculate time delay data and energy consumption data for unloading each training task to each base station according to a preset algorithm, so that the target base station corresponding to each training task and the identification information of the target base station are determined from a plurality of candidate base stations according to the calculated time delay data and energy consumption data.
Illustratively, in the embodiment of the present application, an edge computing system is defined in which N micro Base Stations (BS) are deployed to provide computing services for large-scale Mobile internet of things Devices (MD) within the system. For convenience of description, the set of base stations may be denoted as N = {1, 2, ..., N}, the set of mobile internet of things devices as M = {1, 2, ..., M}, and time is discretized into τ different time intervals (time slots), denoted as T = {1, 2, ..., τ}. Meanwhile, since the base stations are deployed at different locations and differ in signal coverage capability, each base station can serve different terminal devices; likewise, the base stations to which a terminal device can connect differ with the location of that terminal device. Thus, at time t, the set of connectable base stations of terminal device i is denoted as N_i(t), and the set of serviceable terminal devices of base station j is denoted as M_j(t). On this basis, for any base station j, a terminal device within the signal coverage of base station j is marked as 1 if it offloads its task to that base station, and as 0 otherwise; these indicators for base station j are collected in I_j(t).
Taking a community scenario equipped with the edge computing system as an example, various mobile internet of things devices, including smart watches, smart glasses, smart phones and the like, are randomly distributed at arbitrary positions in the community. At the beginning of each time slot, a computing task k of a specific size is generated; after local preprocessing it is offloaded to a selected edge base station for further computation and analysis, and the processed result is finally returned by the base station to the terminal device. Two points should be noted in this process: first, the data to be offloaded after preprocessing by the terminal device is inseparable, i.e., it is handed over as a whole to a selected base station for computation and analysis; second, the analysis result computed by the base station is much smaller than the data to be offloaded, so the transmission delay of the downlink can be ignored when calculating the task delay.
In the preprocessing step, the terminal device usually needs to encrypt and pack the data it generates before offloading it to the base station for processing. For convenience of description, the task data of terminal device i at time t is split into a portion to be processed locally and a portion to be offloaded to the base station, each with its own data size, and the numbers of CPU cycles required per unit of data volume for local computation and for base-station computation are defined accordingly. The delay consumed in local preprocessing is then:
where f_i represents the CPU frequency of terminal device i. The energy consumption spent in local processing is:
where κ_i is the energy consumption coefficient of the corresponding device; this coefficient usually depends on the chip architecture.
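The two formulas referenced above appear as images in the original text. With assumed notation, $d_i^{\mathrm{loc}}(t)$ for the locally processed data size and $c_i^{\mathrm{loc}}(t)$ for the CPU cycles required per unit of data, their usual form in this kind of model is:

$$T_i^{\mathrm{loc}}(t) = \frac{c_i^{\mathrm{loc}}(t)\, d_i^{\mathrm{loc}}(t)}{f_i}, \qquad E_i^{\mathrm{loc}}(t) = \kappa_i\, f_i^{2}\, c_i^{\mathrm{loc}}(t)\, d_i^{\mathrm{loc}}(t),$$

which is consistent with $f_i$ being the CPU frequency of terminal device i and $\kappa_i$ the chip-dependent energy coefficient.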
In this scenario, since the task to be offloaded is not separable, its offload latency usually includes two parts, namely: transmission delay and computation delay. First, the transmission delay refers to the time it takes for the terminal device i to transmit the preprocessed task to the selected base station j. Therefore, for the terminal device i, the transmission delay at the time t is specifically:
where the numerator is the size of the content to be transmitted, and r_{i,j}(t) is the uplink rate achievable between terminal device i and base station j, calculated as follows:
where B represents the bandwidth available when data is transferred between the terminal device and a connectable base station, and the channel gain between terminal device i and the selected base station j also enters the expression. In addition, each terminal device transmits its task with a unified power p_tx, the noise power is denoted as σ², and the interference power at the base station side is denoted as I_{i,j}. The channel gain is calculated as follows:
where X represents an adjustment factor for the path loss; β_{i,j} and a corresponding slow-fading coefficient represent the fast-fading and slow-fading gain coefficients, respectively; d_{i,j} represents the distance between terminal device i and base station j; and ζ is the path loss coefficient.
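The rate and channel-gain formulas are likewise images in the original. A plausible reconstruction under the usual Shannon-capacity model, with $h_{i,j}(t)$ as an assumed symbol for the channel gain and $\hat\beta_{i,j}$ for the slow-fading coefficient, is:

$$r_{i,j}(t) = B \log_2\!\Big(1 + \frac{p^{\mathrm{tx}}\, h_{i,j}(t)}{\sigma^2 + I_{i,j}}\Big), \qquad h_{i,j}(t) = X\, \beta_{i,j}\, \hat\beta_{i,j}\, d_{i,j}^{-\zeta}.$$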
Secondly, the computation delay of the task generated by the terminal device i at the time t on the edge server can be expressed as follows:
where the cycle term denotes the number of CPU cycles required per unit of task data when computing at the base station side, and f_{i,j}(t) = f_j / Σ(I_j(t)) represents the share of the CPU frequency of base station j allocated to terminal device i at time t; that is, when multiple tasks are offloaded to the same base station, the base station distributes its computing power evenly among those tasks.
The total delay required from preprocessing to computation completion for the task on terminal device i is therefore:
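Collecting the delay components discussed above into assumed notation ($d_i^{\mathrm{off}}(t)$ for the offloaded data size, $c_i^{\mathrm{off}}(t)$ for the per-unit CPU cycles at the base station), the transmission delay, the computation delay, and the total delay referenced by the image formulas would take the form:

$$T_{i,j}^{\mathrm{tx}}(t) = \frac{d_i^{\mathrm{off}}(t)}{r_{i,j}(t)}, \qquad T_{i,j}^{\mathrm{comp}}(t) = \frac{c_i^{\mathrm{off}}(t)\, d_i^{\mathrm{off}}(t)}{f_{i,j}(t)}, \qquad T_i(t) = T_i^{\mathrm{loc}}(t) + T_{i,j}^{\mathrm{tx}}(t) + T_{i,j}^{\mathrm{comp}}(t).$$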
Furthermore, the energy consumed by a terminal device when a task is offloaded generally includes two parts: the energy required to transmit the task to the base station, and the energy required to receive the computation result when the base station transmits it back to the terminal device. Since the data volume of the computation result is very small compared with the data volume to be transmitted, the receiving energy consumption is negligible. Then, when terminal device i offloads a task, its transmission energy consumption is:
Its total energy consumption is:
Furthermore, when a terminal device in the mobile edge system offloads a task, energy consumption is inevitably incurred. However, an excessively large instantaneous discharge power is harmful to the battery; for this reason, a battery safety factor is introduced here, i.e., when the terminal device offloads a task, its energy consumption should satisfy the following condition:
Therefore, when the terminal device offloads tasks, the goal is to achieve the minimum total delay while satisfying the energy consumption constraint. This optimization problem is defined as follows:
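With the same assumed notation, the transmission energy, the total energy, the battery-safety constraint, and the delay-minimization problem described above can be sketched as follows; in particular, the exact form of the safety bound $\rho_i E_i^{\max}$ is an assumption, since the original constraint appears only as an image:

$$E_{i,j}^{\mathrm{tx}}(t) = p^{\mathrm{tx}}\, T_{i,j}^{\mathrm{tx}}(t), \qquad E_i(t) = E_i^{\mathrm{loc}}(t) + E_{i,j}^{\mathrm{tx}}(t) \le \rho_i\, E_i^{\max},$$

$$\min_{\{I_{i,j}(t)\}} \; \sum_{t \in \mathcal{T}} \sum_{i \in \mathcal{M}} T_i(t) \quad \text{s.t.} \;\; \sum_{j \in N_i(t)} I_{i,j}(t) = 1, \quad E_i(t) \le \rho_i\, E_i^{\max}.$$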
based on the above, the terminal device may calculate the delay data and the energy consumption data corresponding to the offloading of each training task to each base station, and determine, from the plurality of base stations, the target base station and the identification information of the target base station corresponding to each training task according to the calculated delay data and the energy consumption data. The task unloading time delay corresponding to unloading each training task to the target base station is shortest, and the preset energy consumption constraint condition is met.
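A minimal sketch of how the training label (the target base station for a training task) could be derived from the computed delay and energy data is given below; the helper functions compute_delay and compute_energy, the dict-based attributes, and the scalar energy budget are assumptions introduced for illustration:

```python
def label_training_task(task, candidate_base_stations, energy_budget,
                        compute_delay, compute_energy):
    """Return the id of the base station with the smallest task delay among the
    candidates whose offloading energy satisfies the energy constraint."""
    best_id, best_delay = None, float("inf")
    for bs in candidate_base_stations:
        delay = compute_delay(task, bs)    # preprocessing + transmission + computation delay
        energy = compute_energy(task, bs)  # local + transmission energy
        if energy <= energy_budget and delay < best_delay:
            best_id, best_delay = bs["id"], delay
    return best_id
```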
Step 802, the terminal device trains the deep reinforcement learning network by taking as input the attribute information of the training tasks, the attribute information of the plurality of candidate base stations corresponding to the training tasks, the channel estimation information between the terminal device corresponding to the training tasks and each candidate base station, the identification information of the training base stations corresponding to the training tasks, and the probability distribution of each candidate base station being selected by the neighboring terminal devices of the corresponding device, so as to obtain the preset deep reinforcement learning model.
Specifically, the terminal device may input the attribute information of each training task, the attribute information of the plurality of candidate base stations corresponding to each training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into an untrained deep reinforcement learning network, and train the network with the identification information of the training base station corresponding to each training task serving as the reference (gold standard), thereby obtaining the preset deep reinforcement learning model.
Further, when the preset deep reinforcement learning model is trained, the Adam optimizer can be selected to optimize the preset deep reinforcement learning model, so that the preset deep reinforcement learning model can be quickly converged.
When the Adam optimizer is used to optimize the preset deep reinforcement learning model, a learning rate needs to be set for the optimizer, and the optimal learning rate can be selected with a learning rate range test. The learning rate selection process of this test is as follows: first, the learning rate is set to a very small value; the preset deep reinforcement learning model is then iterated for a few steps on the training sample data, the learning rate is increased after each iteration, and each training loss is recorded; finally, a learning rate range test chart is drawn. An ideal learning rate range test chart generally contains three regions: in the first region the learning rate is too small and the loss is essentially unchanged, in the second region the loss decreases quickly and converges, and in the last region the learning rate is too large and the loss starts to diverge. The learning rate corresponding to the lowest point of the learning rate range test chart can therefore be taken as the optimal learning rate.
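The learning rate range test described above can be sketched as follows; the geometric growth factor, the bounds, and the train_step callback are assumptions, not details taken from the original:

```python
def lr_range_test(train_step, optimizer, min_lr=1e-7, max_lr=1.0, num_iters=100):
    """Run a few training iterations while geometrically increasing the learning rate,
    recording (lr, loss) pairs so the curve can be plotted and the best lr chosen."""
    growth = (max_lr / min_lr) ** (1.0 / num_iters)
    lr, history = min_lr, []
    for _ in range(num_iters):
        for group in optimizer.param_groups:   # works for PyTorch optimizers such as Adam
            group["lr"] = lr
        loss = train_step()                    # one short iteration on the training samples
        history.append((lr, loss))
        lr *= growth                           # increase the learning rate after each iteration
    return history  # pick the lr around the lowest point of the resulting curve
```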
In the embodiment of the application, the terminal equipment acquires a training set corresponding to a preset deep reinforcement learning model, takes attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between the terminal equipment corresponding to the training task and each candidate base station as inputs, trains the deep reinforcement learning network, and obtains the preset deep reinforcement learning model. In an embodiment of the application, the preset deep reinforcement learning model is obtained based on training set training, so that the preset deep reinforcement learning model can be ensured to be more accurate, and the target task can be ensured to be more accurately unloaded to the target base station based on the preset deep reinforcement learning model.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function. As shown in fig. 9, in step 802, "training the deep reinforcement learning network by taking as input the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal device corresponding to the training task and each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station being selected by the neighboring terminal devices of the corresponding device, to obtain the preset deep reinforcement learning model" may include the following steps:
Step 901, the terminal device inputs attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal device corresponding to the training task and each candidate base station to an initial actor network, and outputs an identifier of the training base station corresponding to the training task.
The initial actor network may include a first actor network and a second actor network.
In step 902, the terminal device inputs attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information between the terminal device corresponding to the training task and each candidate base station, probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device, and identification of the training base station corresponding to the training task into an initial criticizing network, performs feature extraction on the input data by using the initial criticizing network, and outputs a training evaluation value of unloading the training task to the training base station.
The training evaluation value is used for representing the matching degree of offloading the training task to the corresponding training base station.
In step 903, the terminal device uses the return function to calculate a training return value corresponding to offloading the training task to the training base station.
The training return value is used for representing the delay data and energy consumption data corresponding to offloading the training task to the training base station.
And step 904, the terminal equipment trains the initial criticism network according to the training return value to obtain the target criticism network.
In step 905, the terminal device trains the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
The specific training and execution process may include the following steps:
1. The model comprises a plurality of terminal devices (agents), and each terminal device contains an actor network and a critic network. The actor/critic network consists of a first actor/critic network and a second actor/critic network. Before training, the second actor/critic network is an exact copy of the first actor/critic network; during training, the second actor/critic network is updated according to a certain rule, for example, with A denoting the parameters of the first actor/critic network and B the parameters of the second actor/critic network, B = αB + (1 − α)A.
2. For convenience of representation, the attribute information of the training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal equipment corresponding to the training task and each candidate base station are referred to as status information; the "identity of the training base station corresponding to the training task" is referred to as an action, and the "probability distribution that each candidate base station is selected" is referred to as a joint action.
3. The execution flow is as follows: firstly, each terminal device acquires attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and inputs the information into a first actor network to acquire an identification of the training base station corresponding to the training task and a corresponding return value. Meanwhile, each terminal device obtains the identification of the corresponding base station selected by the adjacent terminal device through the communication module, and the probability distribution of the selected candidate base stations is calculated according to the identification. At this time, attribute information of the training task in the environment, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal device corresponding to the training task and each candidate base station are updated to the next time, and can be acquired by the terminal device. And finally, the terminal equipment combines the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, the identification of the training base station corresponding to the training task, the selected probability distribution of each base station, the corresponding return value and the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into a complete experience, and stores the complete experience in independent experience pools for subsequent training.
4. Training process: the complete training process typically alternates, over multiple loops, between training the critic network and training the actor network, and the two are interdependent.
Training a critic network: firstly, each terminal device inputs attribute information of a training task obtained by random sampling in the experience pool, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information between the terminal device corresponding to the training task and each candidate base station, an identification of the training base station corresponding to the training task and selected probability distribution information of each base station into a first criticism network in a corresponding model to obtain a comment value; then, attribute information of a training task at the next moment in the experience, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between terminal equipment corresponding to the training task and each candidate base station are input into a second actor network in a corresponding model, and identification of the training base station corresponding to the training task at the next moment is obtained; then each terminal device acquires the identification of the training base station corresponding to the training task of the adjacent terminal device through the acquisition module, and calculates the probability distribution of each candidate base station being selected; and finally, inputting the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, the identification of the training base station corresponding to the training task and the selected probability distribution information of each candidate base station into a second criticism network in the submodel, and calculating to obtain a comment value of the next moment. At this time, the loss is calculated by using the comment value, the return value obtained by sampling, and the comment value at the next time, and the gradient is further calculated to update the first criticizing network in the terminal device.
Training the actor network: firstly, each terminal device inputs the attribute information of the training task obtained by sampling, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into a first actor network in a corresponding model, obtains the identification of the training base station corresponding to the training task, and inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal device corresponding to the training task and each candidate base station, the identification of the training base station corresponding to the training task and the probability distribution information of each candidate base station selected into a first evaluator network in the corresponding terminal device, so as to obtain a corresponding comment value. And then calculating loss according to the comment value, and further calculating gradient to update the first actor network in the corresponding terminal equipment.
And finally, updating the second actor/critic network in the terminal equipment according to the second actor/critic network updating mode in the step 1.
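For illustration only, the following PyTorch-style sketch combines the second-network update rule from step 1 with one training iteration covering the critic update, the actor update, and the soft update of the second (target) networks. The replay-buffer interface, the network call signatures, the discount factor gamma, the mean-squared-error loss, and the use of a differentiable actor output for the discrete base-station choice are all assumptions rather than details from the original:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_update(second_net, first_net, alpha=0.99):
    """Second (target) network parameters B <- alpha * B + (1 - alpha) * A."""
    for b, a in zip(second_net.parameters(), first_net.parameters()):
        b.mul_(alpha).add_((1.0 - alpha) * a)

def train_one_iteration(buffer, actor, critic, target_actor, target_critic,
                        actor_opt, critic_opt, gamma=0.95, alpha=0.99):
    # Sample a stored experience: state s, neighbors' selection distribution p,
    # chosen base-station action a, return value r, and the next-time quantities.
    s, p, a, r, s_next, p_next = buffer.sample()

    # Critic update: regress Q(s, a, p) onto r + gamma * Q'(s', a', p').
    with torch.no_grad():
        a_next = target_actor(s_next)                  # second actor chooses the next action
        y = r + gamma * target_critic(s_next, a_next, p_next)
    critic_loss = F.mse_loss(critic(s, a, p), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize the critic's evaluation of the actor's own choice.
    # For a discrete base-station selection this assumes a differentiable relaxation
    # (e.g. a softmax over candidate base stations) as the actor output.
    actor_loss = -critic(s, actor(s), p).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the second (target) actor/critic networks.
    soft_update(target_actor, actor, alpha)
    soft_update(target_critic, critic, alpha)
```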
In order to better illustrate the large-scale user task offloading method provided in the present application, an embodiment is provided to illustrate the overall flow of the method. As shown in fig. 10, the method includes:
In step 1001, the terminal device obtains a training set corresponding to the preset deep reinforcement learning model.
In step 1002, the terminal device trains the deep reinforcement learning network by taking as input the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal device corresponding to the training task and each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station being selected by the neighboring terminal devices of the corresponding device, so as to obtain the preset deep reinforcement learning model.
In step 1003, the terminal device acquires task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by neighboring terminal devices of the corresponding device.
In step 1004, the terminal device transmits broadcast information to the base station.
In step 1005, the terminal device receives the attribute information sent by each base station, and determines attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station according to the location information of the terminal device and the location information of the base station included in each attribute information.
In step 1006, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal device and each candidate base station into the target actor network, performs at least two feature extraction on the input data by using at least two layers of graph convolution neural networks in the target actor network, and outputs the identification information of the target base station based on the extracted features.
In step 1007, the terminal device inputs the task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of the target base station, and probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device into the target critic network, performs feature extraction on the input data at least twice by using at least two layers of graph convolution neural networks in the target critic network, and outputs a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In step 1008, the terminal device calculates a target return value using the return function.
It should be understood that, although the steps in the flowcharts of fig. 2, 5, and 7-10 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps may be performed in other orders, unless explicitly stated in the embodiments of the present application, and are not limited to the exact order. Moreover, at least some of the steps of fig. 2, 5, and 7-10 may include multiple steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor does the order in which the steps or stages are performed necessarily occur sequentially, but may be performed alternately or alternately with other steps or at least a portion of the steps or stages in other steps.
In one embodiment of the present application, as shown in fig. 11, there is provided a large-scale user task offloading apparatus 1100, comprising: a first acquisition module 1110, a second acquisition module 1120, a determination module 1130, and an offloading module 1140, wherein:
the first obtaining module 1110 obtains task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by a neighboring terminal device of the corresponding device.
A second obtaining module 1120, configured to obtain attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
A determining module 1130, configured to input task attribute information, probability distribution of each candidate base station selected by a neighboring terminal device of a corresponding device, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to a target task, and output a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading a target task to a target base station, wherein the preset deep reinforcement learning model comprises a graph convolution neural network, and the graph convolution neural network is used for carrying out feature extraction on input data of the preset deep reinforcement learning model at least twice;
An offloading module 1140 is used to offload the target task to the target base station.
In one embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target criticism network, as shown in fig. 12, the determining module 1130 includes: a first output unit 1131, and a second output unit 1132, wherein:
a first output unit 1131, configured to input task attribute information, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, and output identification information of the target base station.
A second output unit 1132, configured to input task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of the target base station, and probability distribution that each candidate base station is selected by a neighboring terminal device of the corresponding device into the target criticizing home network, and output a target evaluation value corresponding to the identification information of the target base station.
In one embodiment of the present application, the preset deep reinforcement learning model includes a return function, as shown in fig. 13, the determining module 1130 further includes: a computing unit 1133, wherein:
The calculating unit 1133 is configured to calculate a target return value by using the return function, where the target return value is used to characterize the delay data and the energy consumption data corresponding to offloading the target task to the target base station.
In one embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph convolution neural networks, the target critic network includes at least two layers of graph convolution neural networks, the determining module 1130 is specifically configured to input task attribute information, attribute information of a plurality of candidate base stations, and channel estimation information between a terminal device and each candidate base station into the target actor network, perform feature extraction on input data at least twice by using the at least two layers of graph convolution neural networks in the target actor network, and output identification information of a target base station based on the extracted features; inputting task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In one embodiment of the present application, as shown in fig. 14, the second obtaining module 1120 includes a sending unit 1121 and a receiving unit 1122, where:
a transmitting unit 1121 configured to transmit broadcast information to the base stations, the broadcast information being used to instruct each base station to transmit attribute information of the base station to the terminal device;
and a receiving unit 1122 configured to receive the attribute information transmitted from each base station, and determine attribute information of a plurality of candidate base stations corresponding to the terminal device based on the location information of the terminal device and the location information of the base station included in each attribute information.
In one embodiment of the present application, as shown in fig. 15, the foregoing large-scale user task offloading apparatus 1100 further includes: a third acquisition module 1150 and a training module 1160, wherein
The third obtaining module 1150 obtains a training set corresponding to the preset deep reinforcement learning model, where the training set includes attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information between a terminal device corresponding to the training tasks and each candidate base station, identification information of the training base station corresponding to the training tasks, and probability distribution that each candidate base station is selected by an adjacent terminal device of the corresponding device.
The training module 1160 is configured to train the deep reinforcement learning network to obtain the preset deep reinforcement learning model by taking as input the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal device corresponding to the training task and each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station being selected by the neighboring terminal devices of the corresponding device.
In one embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and the training module 1160 is specifically configured to: input the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into an initial actor network, and output the identification of the training base station corresponding to the training task; input the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information between the terminal device corresponding to the training task and each candidate base station, the probability distribution of each candidate base station being selected by the neighboring terminal devices of the corresponding device, and the identification of the training base station corresponding to the training task into an initial critic network, perform feature extraction on the input data by using the initial critic network, and output a training evaluation value for offloading the training task to the training base station, the training evaluation value being used for representing the matching degree of offloading the training task to the corresponding training base station; calculate, by using the return function, a training return value corresponding to offloading the training task to the training base station, the training return value being used for representing the delay data and energy consumption data corresponding to offloading the training task to the training base station; train the initial critic network according to the training return value to obtain the target critic network; and train the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
Specific limitations regarding the large-scale user task offloading apparatus may be found in the above limitations regarding the task offloading method, and will not be described in detail herein. The various modules in the task offloading apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 16. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a task offloading method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of: acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment, and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for extracting features of input data of the preset deep reinforcement learning model at least twice; and unloading the target task to the target base station.
In one embodiment, the predetermined depth reinforcement learning model includes a target actor network and a target critic network, and the processor when executing the computer program further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station to a target actor network, and outputting identification information of the target base station; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into the target criticism network, and the target evaluation value corresponding to the identification information of the target base station is output.
In one embodiment, the predetermined deep reinforcement learning model includes a return function, and the processor when executing the computer program further performs the steps of: and calculating a target return value by using the return function, wherein the target return value is used for representing delay data and energy consumption data corresponding to unloading the target task to the target base station.
In one embodiment, the preset depth reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph roll-up neural networks, the target critic network includes at least two layers of graph roll-up neural networks, and the processor when executing the computer program further implements the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station into a target actor network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; inputting task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In one embodiment, the processor when executing the computer program further performs the steps of: the terminal equipment sends broadcast information to the base stations, wherein the broadcast information is used for indicating each base station to send attribute information of the base stations to the terminal equipment; and receiving attribute information sent by each base station, and determining attribute information of a plurality of candidate base stations associated with the terminal equipment according to the position information of the terminal equipment and the position information of the base stations included in each attribute information.
In one embodiment, the processor when executing the computer program further performs the steps of: acquiring a training set corresponding to a preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information between terminal equipment corresponding to the training tasks and each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; and training the deep reinforcement learning network by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of a training base station corresponding to the training task from terminal equipment corresponding to the training task to channel estimation information among the candidate base stations and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment as inputs to obtain a preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and the processor when executing the computer program further implements the steps of: the method comprises the steps of inputting attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between terminal equipment corresponding to the training task and each candidate base station to an initial actor network, and outputting an identifier of the training base station corresponding to the training task; inputting attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information between terminal equipment corresponding to the training task and each candidate base station, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment and identification of the training base station corresponding to the training task into an initial commentator network, extracting characteristics of input data by utilizing the initial commentator network, outputting training evaluation values for unloading the training task to the training base stations, wherein the training evaluation values are used for representing matching degree of the training base stations corresponding to the task; the training report value corresponding to the training task is unloaded to the training base station by using the report function calculation, and the training report value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station; training the initial criticism network according to the training return value to obtain a target criticism network; and training the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment, and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for extracting features of input data of the preset deep reinforcement learning model at least twice; and unloading the target task to the target base station.
In one embodiment, the pre-set depth reinforcement learning model includes a target actor network and a target critic network, the computer program when executed by the processor further implementing the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station to a target actor network, and outputting identification information of the target base station; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into the target criticism network, and the target evaluation value corresponding to the identification information of the target base station is output.
In one embodiment, the predetermined deep reinforcement learning model includes a reward function that when executed by the processor further performs the steps of: and calculating a target return value by using the return function, wherein the target return value is used for representing delay data and energy consumption data corresponding to unloading the target task to the target base station.
In one embodiment, the preset depth reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph roll-up neural networks, the target critic network includes at least two layers of graph roll-up neural networks, and the computer program when executed by the processor further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between terminal equipment and each candidate base station into a target actor network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; inputting task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between terminal equipment and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, carrying out feature extraction on input data at least twice by utilizing at least two layers of graph convolution neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In one embodiment, the computer program when executed by the processor further performs the steps of: the terminal equipment sends broadcast information to the base stations, wherein the broadcast information is used for indicating each base station to send attribute information of the base stations to the terminal equipment; and receiving attribute information sent by each base station, and determining attribute information of a plurality of candidate base stations associated with the terminal equipment according to the position information of the terminal equipment and the position information of the base stations included in each attribute information.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a training set corresponding to a preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information between terminal equipment corresponding to the training tasks and each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; and training the deep reinforcement learning network by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of a training base station corresponding to the training task from terminal equipment corresponding to the training task to channel estimation information among the candidate base stations and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment as inputs to obtain a preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network and a return function, and the computer program when executed by the processor further performs the steps of: inputting attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between terminal equipment corresponding to the training task and each candidate base station to an initial actor network, and outputting an identifier of the training base station corresponding to the training task; inputting attribute information of the training task, attribute information of the plurality of candidate base stations corresponding to the training task, channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, and the identifier of the training base station corresponding to the training task into an initial critic network, extracting features of the input data by utilizing the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing a matching degree of unloading the training task to the training base station; calculating, by using the return function, a training return value corresponding to unloading the training task to the training base station, wherein the training return value is used for representing time delay data and energy consumption data corresponding to unloading the training task to the training base station; training the initial critic network according to the training return value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
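The following sketch illustrates one way the training steps above could fit together, under stated assumptions: the return function is taken as a negated weighted sum of time delay and energy consumption, the initial critic network is fitted to the training return value with a squared error, and the initial actor network is updated with a policy-gradient rule weighted by the training evaluation value. The embodiment does not prescribe these particular loss forms, and the `actor` and `critic` arguments are assumed to be graph-convolution networks such as the earlier sketch (the critic without the softmax, producing one evaluation value per candidate base station).

```python
# Illustrative sketch of one training iteration under the assumptions stated above.
import torch

def return_value(delay_s, energy_j, w_delay=0.5, w_energy=0.5):
    # Lower time delay and energy consumption yield a higher training return value.
    return -(w_delay * delay_s + w_energy * energy_j)

def train_step(actor, critic, actor_opt, critic_opt, feats, adj, delay_s, energy_j):
    probs = actor(feats, adj)                    # selection probability per candidate base station
    action = torch.multinomial(probs, 1)         # sampled training base station (index)
    ret = torch.tensor(return_value(delay_s, energy_j))

    # Critic update: fit the training evaluation value to the observed training return value.
    q = critic(feats, adj)[action]
    critic_loss = (q - ret).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: reinforce choices whose (detached) evaluation value is high.
    q_detached = critic(feats, adj)[action].detach()
    actor_loss = -(torch.log(probs[action]) * q_detached).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return float(critic_loss), float(actor_loss)
```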
Those skilled in the art will appreciate that all or part of the flows in the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the computer program may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered as falling within the scope of this specification.
The foregoing examples represent only a few embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (9)

1. A method of large-scale user task offloading, the method comprising:
acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment;
acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station;
inputting the task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of the plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to identification information of the target base station, wherein the target evaluation value is used for representing matching degree of unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for extracting features of input data of the preset deep reinforcement learning model at least twice;
Unloading the target task to the target base station;
the preset deep reinforcement learning model includes a target actor network and a target critic network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to identification information of the target base station includes:
inputting the task attribute information, the attribute information of the plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station to the target actor network, and outputting the identification information of the target base station;
and inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
2. The method according to claim 1, wherein the preset deep reinforcement learning model includes a return function, and the inputting the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, and determining a target base station corresponding to the target task, further includes:
and calculating a target return value by using the return function, wherein the target return value is used for representing time delay data and energy consumption data corresponding to unloading the target task to the target base station.
3. The method according to claim 1, wherein the preset deep reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of the graph convolution neural network, the target critic network includes at least two layers of the graph convolution neural network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to identification information of the target base station includes:
Inputting the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the target actor network, extracting features of input data at least twice by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting the identification information of the target base station based on the extracted features;
inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, carrying out feature extraction on the input data at least twice by utilizing the at least two layers of the graph convolution neural network in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
4. The method according to claim 1, wherein the acquiring attribute information of a plurality of candidate base stations associated with the terminal device comprises:
The terminal equipment sends broadcast information to the base stations, wherein the broadcast information is used for indicating each base station to send attribute information of the base stations to the terminal equipment;
and receiving attribute information sent by each base station, and determining attribute information of a plurality of candidate base stations associated with the terminal equipment according to the position information of the terminal equipment and the position information of the base stations included in the attribute information.
5. The method of claim 1, wherein the training process of the preset deep reinforcement learning model is:
acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information between terminal equipment corresponding to the training tasks and each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment;
and training the deep reinforcement learning network by taking attribute information of the training task, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information between terminal equipment corresponding to the training task and each candidate base station, identification information of the training base station corresponding to the training task and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment as inputs to obtain the preset deep reinforcement learning model.
6. The method according to claim 5, wherein the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and the training the deep reinforcement learning network by taking attribute information of the training task, attribute information of the plurality of candidate base stations corresponding to the training task, channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, identification information of the training base station corresponding to the training task, and probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment as inputs to obtain the preset deep reinforcement learning model includes:
inputting attribute information of the training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between terminal equipment corresponding to the training task and each candidate base station to an initial actor network, and outputting an identifier of the training base station corresponding to the training task;
inputting attribute information of the training task, attribute information of the plurality of candidate base stations corresponding to the training task, channel estimation information between the terminal equipment corresponding to the training task and each candidate base station, probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment, and the identification of the training base station corresponding to the training task into an initial critic network, extracting features of the input data by utilizing the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing a matching degree of unloading the training task to the training base station;
Calculating a training return value corresponding to the training task unloaded to the training base station by using the return function, wherein the training return value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station;
training the initial critic network according to the training return value to obtain the target critic network;
and training the initial actor network according to the training evaluation value and the training return value to obtain the target actor network.
7. A large-scale user task offloading apparatus, the apparatus comprising:
the first acquisition module is used for acquiring task attribute information of a target task to be offloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment;
a second acquisition module, configured to acquire attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station;
the determining module is used for inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for extracting features of input data of the preset deep reinforcement learning model at least twice;
The unloading module is used for unloading the target task to the target base station;
the preset deep reinforcement learning model comprises a target actor network and a target critic network, and the determining module comprises: a first output unit and a second output unit;
a first output unit, configured to input the task attribute information, attribute information of the plurality of candidate base stations, and channel estimation information between the terminal device and each of the candidate base stations to the target actor network, and output identification information of the target base station;
and the second output unit is used for inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202110783668.8A 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium Active CN113676954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113676954A CN113676954A (en) 2021-11-19
CN113676954B true CN113676954B (en) 2023-07-18

Family

ID=78538882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110783668.8A Active CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113676954B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135094A (en) * 2019-05-22 2019-08-16 长沙理工大学 A kind of virtual plant Optimization Scheduling based on shrink space harmony algorithm
CN110347500A (en) * 2019-06-18 2019-10-18 东南大学 For the task discharging method towards deep learning application in edge calculations environment
CN112202928A (en) * 2020-11-16 2021-01-08 绍兴文理学院 Credible unloading cooperative node selection system and method for sensing edge cloud block chain network
CN112367353A (en) * 2020-10-08 2021-02-12 大连理工大学 Mobile edge computing unloading method based on multi-agent reinforcement learning

Also Published As

Publication number Publication date
CN113676954A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
CN112181666B (en) Equipment assessment and federal learning importance aggregation method based on edge intelligence
Chen et al. Label-less learning for traffic control in an edge network
CN109951897A (en) A kind of MEC discharging method under energy consumption and deferred constraint
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
CN113543176A (en) Unloading decision method of mobile edge computing system based on assistance of intelligent reflecting surface
CN110798718B (en) Video recommendation method and device
CN113572804B (en) Task unloading system, method and device based on edge collaboration
CN112261674A (en) Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
CN113626104B (en) Multi-objective optimization unloading strategy based on deep reinforcement learning under edge cloud architecture
CN113760511B (en) Vehicle edge calculation task unloading method based on depth certainty strategy
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
Liu et al. Dynamic multichannel sensing in cognitive radio: Hierarchical reinforcement learning
CN116187483A (en) Model training method, device, apparatus, medium and program product
CN114723057A (en) Neural network collaborative reasoning method for multi-access edge computing system
CN116708443A (en) Multi-level calculation network task scheduling method and device
Zhang et al. Calibrated bandit learning for decentralized task offloading in ultra-dense networks
Lakew et al. Adaptive partial offloading and resource harmonization in wireless edge computing-assisted ioe networks
CN113946423B (en) Multi-task edge computing, scheduling and optimizing method based on graph attention network
CN113676954B (en) Large-scale user task unloading method, device, computer equipment and storage medium
CN115802370A (en) Communication method and device
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
Xie et al. Backscatter-aided hybrid data offloading for mobile edge computing via deep reinforcement learning
CN114071527B (en) Energy saving method and device of base station and base station
CN113326112B (en) Multi-unmanned aerial vehicle task unloading and migration method based on block coordinate descent method
CN116647880B (en) Base station cooperation edge computing and unloading method and device for differentiated power service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant