CN113676954A - Large-scale user task offloading method and device, computer equipment and storage medium

Large-scale user task offloading method and device, computer equipment and storage medium

Info

Publication number
CN113676954A
Authority
CN
China
Prior art keywords
base station
target
task
training
attribute information
Prior art date
Legal status
Granted
Application number
CN202110783668.8A
Other languages
Chinese (zh)
Other versions
CN113676954B (en)
Inventor
张旭
古博
林梓淇
丁北辰
姜善成
韩瑜
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110783668.8A priority Critical patent/CN113676954B/en
Publication of CN113676954A publication Critical patent/CN113676954A/en
Application granted granted Critical
Publication of CN113676954B publication Critical patent/CN113676954B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00: Network traffic management; Network resource management
    • H04W 28/02: Traffic management, e.g. flow control or congestion control
    • H04W 28/08: Load balancing or load distribution
    • H04W 28/09: Management thereof
    • H04W 28/0925: Management thereof using policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/50: Allocation or scheduling criteria for wireless resources
    • H04W 72/52: Allocation or scheduling criteria for wireless resources based on load
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application relates to a large-scale user task offloading method and apparatus, a computer device and a storage medium, which are applicable to the field of computer technology. The method comprises the following steps: acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device; acquiring attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station; inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, and determining a target base station corresponding to the target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network; and offloading the target task to the target base station. With this method, multiple terminal devices can be effectively prevented from crowding onto the same computing resources, avoiding the situation where tasks are difficult to complete because base station resources are insufficient.

Description

Large-scale user task offloading method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of resource allocation technologies in the field of communications, and in particular, to a method and an apparatus for offloading a large-scale user task, a computer device, and a storage medium.
Background
With the continuous development of communication technology, a large number of emerging mobile applications, such as cloud gaming, Virtual Reality (VR) and Augmented Reality (AR), have been promoted. Such applications require substantial computing resources and low latency to run properly. Task offloading technology was developed for this purpose: communication technology is used to offload the computation-intensive tasks of a terminal device to a server side with sufficient computing resources for processing, and the server side then transmits the computation results back to the terminal device, thereby jointly optimizing computing capacity and time delay. However, in cloud computing the offloading-side server is far away from the terminal device, so its transmission delay is far higher than the delay that the computing task can tolerate, and the terminal-device experience is poor. Therefore, in recent years, offloading the computation-intensive tasks of terminal devices to edge base stations with sufficient computing resources for processing has become a hot research issue.
In conventional methods, classical algorithms represented by convex optimization, game theory and the like do not allow multiple terminal devices to communicate with one another when those devices offload tasks simultaneously.
Therefore, with the above conventional methods, when multiple terminal devices offload tasks at the same time, they may all offload their tasks to the same base station, so that the base station's resources become insufficient and the tasks are difficult to complete.
Disclosure of Invention
Therefore, it is necessary to provide a large-scale user task offloading method, apparatus, computer device and storage medium that can solve the problem of coordinating multiple terminal devices when they offload tasks.
In a first aspect, a large-scale user task offloading method is provided. The method includes: acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device; acquiring attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station; inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model; and offloading the target task to the target base station.
In one embodiment, inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station includes: inputting the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a target actor network, and outputting the identification information of the target base station; and inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into a target critic network, and outputting the target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the preset deep reinforcement learning model includes a reward function, and after inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determining the target base station corresponding to the target task, the method further includes: calculating a target reward value by using the reward function, the target reward value being used to represent the time delay data and energy consumption data corresponding to offloading the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model includes a target actor network and a target critic network, each of which includes at least two graph convolutional neural network layers, and inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station includes: inputting the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network, performing feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target actor network, and outputting the identification information of the target base station based on the extracted features; and inputting the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performing feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target critic network, and outputting, based on the extracted features, the target evaluation value corresponding to the identification information of the target base station.
In one embodiment, obtaining the attribute information of the plurality of candidate base stations associated with the terminal device includes: sending, by the terminal device, broadcast information to the base stations, the broadcast information being used to instruct each base station to send its attribute information to the terminal device; and after receiving the attribute information sent by each base station, determining the attribute information of the plurality of candidate base stations associated with the terminal device according to the position information of the terminal device and the position information of the base station included in each piece of attribute information.
In one embodiment, the training process of the preset deep reinforcement learning model includes: acquiring a training set corresponding to the preset deep reinforcement learning model, the training set including attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal device corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and the probability distribution of each candidate base station being selected by the adjacent terminal devices; and training a deep reinforcement learning network by taking the attribute information of the training tasks, the attribute information of the plurality of candidate base stations corresponding to the training tasks, the channel estimation information from the terminal device corresponding to the training tasks to each candidate base station, the identification information of the training base stations corresponding to the training tasks and the probability distribution of each candidate base station being selected by the adjacent terminal devices as input, so as to obtain the preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and training the deep reinforcement learning network by taking the attribute information of the training tasks, the attribute information of the plurality of candidate base stations corresponding to the training tasks, the channel estimation information from the terminal device corresponding to the training tasks to each candidate base station, the identification information of the training base stations corresponding to the training tasks and the probability distribution of each candidate base station being selected by the adjacent terminal devices as input to obtain the preset deep reinforcement learning model includes: inputting the attribute information of a training task, the attribute information of the plurality of candidate base stations corresponding to the training task and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the probability distribution of each candidate base station being selected by the adjacent terminal devices and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for offloading the training task to the training base station, the training evaluation value being used to represent the degree of matching of offloading the training task to the corresponding training base station; calculating, by using the reward function, a training reward value corresponding to offloading the training task to the training base station, the training reward value being used to represent the time delay data and energy consumption data corresponding to offloading the training task to the training base station; training the initial critic network according to the training reward value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
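As one illustration of the training flow described in this embodiment, the sketch below performs a single actor-critic update step. The module classes, tensor shapes and field names are assumptions made for illustration and do not correspond to the application's actual implementation.

```python
# Illustrative sketch only: hypothetical actor/critic modules and batch layout.
import torch
import torch.nn.functional as F

def train_step(actor, critic, target_critic, batch, actor_opt, critic_opt, gamma=0.99):
    # batch fields (assumed names): observation built from task/base-station/channel
    # features, neighbours' selection distributions, chosen action, reward, next step
    obs, neigh_dist, action, reward, next_obs, next_neigh_dist = batch

    # Critic update: regress towards reward + discounted target-critic value
    with torch.no_grad():
        next_action = actor(next_obs)
        target_q = reward + gamma * target_critic(next_obs, next_action, next_neigh_dist)
    q = critic(obs, action, neigh_dist)
    critic_loss = F.mse_loss(q, target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the critic's evaluation of the actor's own action
    pred_action = actor(obs)
    actor_loss = -critic(obs, pred_action, neigh_dist).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```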
In a second aspect, a large-scale user task offloading device is provided, the device comprising:
the first acquisition module is used for acquiring task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device;
a second obtaining module, configured to obtain attribute information of multiple candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station;
the determining module is used for inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used to represent the degree of matching of offloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model;
and the offloading module is used for offloading the target task to the target base station.
In a third aspect, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the large-scale user task offloading method described in any of the first aspects above.
In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the large-scale user task offloading method described in any of the first aspects above.
According to the above large-scale user task offloading method and apparatus, computer device and storage medium, task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device are acquired; attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station are acquired; the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station are input into a preset deep reinforcement learning model, a target base station corresponding to the target task is determined, and a target evaluation value corresponding to the identification information of the target base station is output, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the model; and the target task is offloaded to the target base station. In this method, the terminal device acquires not only the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices, but also the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station, so the terminal device can clearly determine to which base station its neighboring terminal devices offload their tasks, which ultimately ensures that offloading is mutually coordinated among the neighboring base stations. The terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determines the target base station corresponding to the target task by combining all of this information. Because the preset deep reinforcement learning model comprises a graph convolutional neural network, the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations is solved.
In addition, in this method, the terminal devices communicate with their neighboring terminal devices, so that cooperative decision-making among terminal devices is realized and the overall performance of the system is optimized; multiple terminal devices are effectively prevented from crowding onto the same computing resources, and the situation where tasks are difficult to complete because base station resources are insufficient is avoided. Furthermore, the preset deep reinforcement learning model can also output a target evaluation value, so that the suitability of offloading the target task to the target base station can be evaluated.
Drawings
FIG. 1 is a diagram of an application environment for a large-scale user task offloading method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a large-scale user task offloading method, according to one embodiment;
FIG. 3 is a schematic structural diagram of a deep reinforcement learning model in the large-scale user task offloading method in one embodiment;
FIG. 4 is a schematic diagram illustrating a structure of a graph convolution neural network in a large-scale user task offloading method according to another embodiment;
FIG. 5 is a flow diagram illustrating a large-scale user task offloading method, according to one embodiment;
FIG. 6 is a diagram illustrating a deep reinforcement learning model in a large-scale user task offloading method in an embodiment;
FIG. 7 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 8 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 9 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 10 is a flow diagram that illustrates a method for offloading large-scale user tasks, according to one embodiment;
FIG. 11 is a block diagram of a large-scale user task offloading device in one embodiment;
FIG. 12 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 13 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 14 is a block diagram of the architecture of a large-scale user task offloading device in one embodiment;
FIG. 15 is a block diagram of a large-scale user task offloading device in one embodiment;
FIG. 16 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The large-scale user task offloading method provided by the application can be applied to the application environment shown in fig. 1, where the terminal device 102 communicates with the base station 104 over a network. The terminal device acquires the attribute information of the plurality of candidate base stations corresponding to it through communication with the base stations, according to its own position information. The terminal device 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the base station 104 may be implemented by a server cluster formed by a plurality of base stations.
In an embodiment, as shown in fig. 2, a large-scale user task offloading method is provided, which is described by taking the method as an example of being applied to the terminal device in fig. 1, and includes the following steps:
Step 201, the terminal device obtains task attribute information of a target task to be offloaded and the probability distribution of each candidate base station being selected by the terminal devices adjacent to the present device.
Specifically, the terminal device may obtain attribute information of a target task to be offloaded, where the task attribute information of the target task may include a data size of the target task, identification information of the target task, and the like. In addition, the terminal device may further obtain, through communication connection with the neighboring terminal device, probability distribution that each candidate base station is selected by the neighboring terminal device of the corresponding device.
In step 202, the terminal device obtains attribute information of a plurality of candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
Specifically, the terminal device may send signals to the surrounding base stations in a broadcast manner and receive the attribute information returned by each base station; the attribute information returned by each base station may include the location information of that base station. The terminal device determines the plurality of candidate base stations corresponding to it according to its own position information and the position information of each base station, and determines the attribute information corresponding to these candidate base stations. The terminal device then determines the channel estimation information between itself and each candidate base station according to its own attribute information and the attribute information of the plurality of candidate base stations associated with it.
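As one illustration of how channel estimation information could be derived from the exchanged attribute information, the sketch below computes a distance-based path loss and a Shannon-capacity rate estimate. The path-loss model and all parameter values are assumptions made for illustration, not values specified by the application.

```python
import math

def estimate_channel(ue_pos, bs_pos, tx_power_dbm=23.0, noise_dbm=-100.0,
                     bandwidth_hz=10e6):
    """Rough channel estimate between a terminal device and one candidate base station.

    The log-distance path-loss constants below are illustrative assumptions."""
    d = max(1.0, math.dist(ue_pos, bs_pos))                # distance in metres
    path_loss_db = 128.1 + 37.6 * math.log10(d / 1000.0)   # assumed urban macro model
    snr_db = tx_power_dbm - path_loss_db - noise_dbm
    snr = 10 ** (snr_db / 10.0)
    rate_bps = bandwidth_hz * math.log2(1.0 + snr)         # achievable transmission rate r_ij
    return {"distance_m": d, "snr_db": snr_db, "rate_bps": rate_bps}
```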
Step 203, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, the attribute information of a plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determines a target base station corresponding to the target task, and outputs a target evaluation value corresponding to the identification information of the target base station.
The target evaluation value is used to represent the degree of matching of offloading the target task to the target base station. The preset deep reinforcement learning model comprises a graph convolutional neural network, and the graph convolutional neural network is used to perform feature extraction at least twice on the input data of the preset deep reinforcement learning model.
Specifically, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, performs feature extraction at least twice on the input data by using the graph convolutional neural network in the preset deep reinforcement learning model, and determines the target base station corresponding to the target task based on the extracted features.
Deep reinforcement learning models, a hotspot of current research, have been widely used in various research fields. As shown in FIG. 3, a deep reinforcement learning model is used to learn a policy for a specific application scenario. It usually takes the observable state information $s_t$ of the environment as input; the terminal device evaluates this state and takes a corresponding action $a_t$, which acts on the environment and produces a feedback reward $r_t$ that is used to improve the policy. These steps are repeated until the terminal device can freely cope with dynamic changes in the environment. In general, reinforcement learning can be divided into two categories. One is the value-based approach (such as the DQN algorithm), which aims to maximize the return obtained by each action taken, so the higher the reward of an action, the more easily that action is selected. The other is the policy-based approach, which aims to directly learn a parameterized policy $\pi_\theta$. The parameter $\theta$ of the policy-based method can be updated by backward gradient propagation on the objective

$$J(\theta) = \mathbb{E}_{s \sim p_\pi,\, a \sim \pi_\theta}\left[ R(s, a) \right]$$

where $p_\pi$ is the state distribution probability. The gradient can be calculated according to the following formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p_\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) \right]$$

where $\pi_\theta(a_t \mid s_t)$ denotes the probability of selecting action $a_t$ in a given state $s_t$. The model parameters are then updated by backward gradient propagation:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the step size used in the learning process.
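The parameter update above can be illustrated with a few lines of code. The sketch below performs one generic REINFORCE-style gradient step and is only an illustration of policy-based updating, not the algorithm claimed by the application; the network and variable names are assumptions.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, returns):
    """One generic policy-gradient update: theta <- theta + alpha * grad J(theta).

    policy_net is assumed to map a batch of states to action logits; the sampled
    return is used here in place of Q^pi(s_t, a_t)."""
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
    loss = -(chosen * returns).mean()   # gradient ascent on J(theta) via minimizing -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```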
In the embodiment of the present application, the preset deep reinforcement learning model is obtained mainly by improving a graph-learning-based multi-terminal-device distributed reinforcement learning algorithm (MAGCAC) within the deep reinforcement learning framework. The preset deep reinforcement learning model is used to determine, from a plurality of base stations, the base station that gives the shortest time delay in the target task offloading process while satisfying the preset constraint condition.
In addition, graph convolutional networks (GCNs) have been a research focus since they appeared in 2017 and have achieved remarkable results in many fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted with a convolutional neural network (CNN), a recurrent neural network (RNN) or the like; as a result, a great deal of work on graph learning theory has emerged. FIG. 4 shows a multi-layer graph convolutional network, which takes graph structure features as input and outputs the corresponding features after graph convolution. The layer-by-layer computation is as follows:

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ denotes the adjacency matrix of the graph structure with added self-connections, $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the corresponding degree matrix, and $W^{(l)}$ is a learnable weight parameter matrix. $\sigma(\cdot)$ is an activation function, e.g., ReLU$(\cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the feature extracted by the $l$-th graph convolutional layer; when $l = 0$, $H^{(0)} = X$ is the input graph structure feature.
Step 204, the terminal device offloads the target task to the target base station.
Specifically, after the target base station corresponding to the target task is determined, the terminal device may offload the target task to the target base station, and after the target base station calculates the target task, the calculation result is sent to the terminal device.
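Steps 201 to 204 can be summarized, at a very high level, by the following sketch. The object methods and field names are placeholders for the information items described above and do not correspond to any concrete API; this is only an illustration of the flow.

```python
def offload_target_task(device, task, drl_model):
    # Step 201: task attributes and neighbours' base-station selection distributions
    task_info = task.attributes()
    neighbour_dist = device.collect_neighbour_selection_distributions()

    # Step 202: candidate base-station attributes and channel estimates
    candidates = device.discover_candidate_base_stations()
    channels = {bs.id: device.estimate_channel(bs) for bs in candidates}

    # Step 203: the preset deep reinforcement learning model (with GCN feature
    # extraction) picks the target base station and a target evaluation value
    target_bs_id, evaluation = drl_model.decide(task_info, neighbour_dist,
                                                candidates, channels)

    # Step 204: offload the task and wait for the computed result
    result = device.offload(task, target_bs_id)
    return target_bs_id, evaluation, result
```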
In the above task offloading method, the terminal device obtains the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices; obtains the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station; inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, and determines the target base station corresponding to the target task, wherein the preset deep reinforcement learning model comprises a graph convolutional neural network used to perform feature extraction at least twice on the input data of the model; and offloads the target task to the target base station. In this method, the terminal device acquires not only the task attribute information of the target task to be offloaded and the probability distribution of each candidate base station being selected by the adjacent terminal devices, but also the attribute information of the plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station, so the terminal device can clearly determine to which base station its neighboring terminal devices offload their tasks, which ultimately ensures that offloading is mutually coordinated among the neighboring base stations. The terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model and determines the target base station corresponding to the target task by combining all of this information. Because the preset deep reinforcement learning model comprises a graph convolutional neural network, the problem of inconsistent action spaces caused by different terminal devices being able to connect to different base stations is solved.
In addition, in this method, the terminal devices communicate with their neighboring terminal devices, so that cooperative decision-making among terminal devices is realized and the overall performance of the system is optimized; multiple terminal devices are effectively prevented from crowding onto the same computing resources, and the situation where tasks are difficult to complete because base station resources are insufficient is avoided. Furthermore, the preset deep reinforcement learning model can also output a target evaluation value, so that the suitability of offloading the target task to the target base station can be evaluated.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, as shown in fig. 5, the step 203 of inputting the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of a plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station may include the following steps:
step 501, the terminal device inputs task attribute information, probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, attribute information of a plurality of candidate base stations and channel estimation information between the terminal device and each candidate base station into the target actor network, and outputs identification information of the target base station.
Specifically, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network; the terminal device may perform feature extraction on the input data by using the at least two feature extraction layers included in the target actor network, process the extracted features with a fully connected layer in the target actor network, and finally output the identification information of the target base station.
Specifically, in the embodiment of the present application, the preset deep reinforcement learning model is mainly improved on the basis of a graph-learning-based multi-terminal-device distributed reinforcement learning algorithm (MAGCAC). The algorithm takes each terminal device as an agent and the whole edge computing system as the environment, and is divided into an actor network and a critic network.
In the embodiment of the application, the observation state refers to the model's observation of the environment, and whether the features chosen for the observation state are reasonable directly influences whether the terminal device can learn an effective coping strategy. The algorithm regards both the terminal devices and the base stations in the system as nodes, so a corresponding graph structure G is drawn according to the connectivity between terminal devices and base stations. For ease of implementation, a terminal device is regarded as a special base station in the embodiments of the present application; that is, since a terminal device in the system does not support completing the computation task entirely locally, its feature information as a base station is set to 0. It should be noted that the embodiments of the present application only consider the connectivity between terminal devices and base stations, not the connectivity among terminal devices. Therefore, the node features of the terminal devices and of the base stations are recorded separately, and the graph structure corresponding to terminal device i is built from these node features together with the adjacency relation between terminal device i and its connectable base stations. In the embodiment of the present application, the graph structure $G_i(t)$ at time t is taken as the state observation information $o_i(t)$ of terminal device i, i.e., $o_i(t) = G_i(t)$.
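A sketch of how the observation graph for terminal device i could be assembled is shown below; the zero feature row for the terminal-device node and the restriction to device-to-base-station edges follow the description above, while the concrete array shapes and field layout are assumptions.

```python
import numpy as np

def build_observation(bs_features, connectivity):
    """Build the graph observation o_i(t) for one terminal device.

    bs_features:  (M, D) array with base-station features (computing power f_j,
                  achievable rate r_ij(t), ...).
    connectivity: length-M 0/1 vector saying which base stations are connectable."""
    M, D = bs_features.shape
    ue_row = np.zeros((1, D))                            # terminal-device node features = 0
    X = np.concatenate([ue_row, bs_features], axis=0)    # (M+1, D) node feature matrix

    A = np.zeros((M + 1, M + 1))
    A[0, 1:] = connectivity                              # UE <-> base-station links only;
    A[1:, 0] = connectivity                              # no UE-to-UE edges are considered
    return X, A
```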
During task offloading, the time delay and energy consumption are mainly influenced by the following factors: the base station computing power $f_j$, the achievable transmission rate $r_{i,j}(t)$, and how crowded the base station's computing resources are. The computing power of the connectable base stations and the achievable transmission rates are therefore taken as the main observed state information for terminal device i. As for the crowding of base station computing resources, the cooperation situation among neighboring devices needs to be determined.

At time t, the terminal device evaluates the current state information to derive the corresponding action

$$a_i(t) = \pi_i\big(o_i(t)\big)$$

where the action $a_i(t)$ is a one-hot encoding: the base station selected for offloading is denoted as 1 and the others are denoted as 0. However, since the DDPG algorithm requires actions to be continuous, the embodiment of the present application re-expresses the DDPG output and discretizes it into the above one-hot encoded form.
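The discretization of the continuous DDPG output into a masked one-hot action could look as follows; the mask encodes which base stations are connectable for this agent, and the function and argument names are illustrative assumptions.

```python
import numpy as np

def to_one_hot_action(actor_output, mask):
    """Turn a continuous actor output into the one-hot offloading action a_i(t).

    actor_output: per-base-station scores produced by the actor network (float array).
    mask: 1 for connectable base stations, 0 otherwise (each agent's mask differs)."""
    scores = np.where(mask > 0, actor_output, -np.inf)   # forbid unreachable base stations
    action = np.zeros_like(actor_output)
    action[np.argmax(scores)] = 1.0                      # selected base station -> 1
    return action
```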
In addition, as shown in fig. 6, in the embodiment of the present application the actor network structure in the MAGCAC algorithm takes the graph structure G as input, uses two layers of GCNs to extract features, and finally uses a multilayer perceptron (MLP) as the output layer. Since the action space of each agent is different, the output of the multilayer perceptron is multiplied by the mask of the corresponding agent to obtain the final action.
Thus, when agent i determines its policy $\pi_{\theta_i}$, the following gradient can be calculated:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[ \nabla_{\theta_i} \pi_{\theta_i}\big(a_i \mid o_i\big)\, \nabla_{a_i} Q_i\big(o_i, a_i, p_{G_i}\big) \Big|_{a_i = \pi_{\theta_i}(o_i)} \right]$$

Similarly, the critic network structure in the MAGCAC algorithm also takes the graph structure G as input, uses two layers of GCNs to extract features, and finally uses a multilayer perceptron (MLP) as the output layer. The loss function of the critic network can therefore be calculated as

$$L(\phi_i) = \mathbb{E}\left[ \big( Q_i(o_i, a_i, p_{G_i}) - y_i \big)^2 \right]$$

where the target action value $y_i$ is calculated as follows:

$$y_i = r_i + \gamma\, Q_i'\big(o_i', a_i', p_{G_i}'\big)\Big|_{a_i' = \pi_{\theta_i}'(o_i')}$$

Here $p_{G_i}$ represents the probability distribution of each base station being selected by the terminal devices adjacent to terminal device i, and $G_i$ denotes the set of neighboring terminal devices of terminal device i.
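A compact sketch of actor and critic networks with two graph-convolution layers followed by an MLP, as described for FIG. 6, is given below. The layer sizes, the mean pooling over node features, and the way the neighbours' selection distribution is concatenated into the critic input are assumptions made for illustration, not the application's actual architecture.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two graph-convolution layers; A_hat is a normalized adjacency matrix."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, X, A_hat):
        H = torch.relu(A_hat @ self.w1(X))
        return torch.relu(A_hat @ self.w2(H))

class Actor(nn.Module):
    def __init__(self, in_dim, hid_dim, n_actions):
        super().__init__()
        self.gcn = TwoLayerGCN(in_dim, hid_dim, hid_dim)
        self.mlp = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, n_actions))

    def forward(self, X, A_hat, mask):
        h = self.gcn(X, A_hat).mean(dim=0)   # pool node features (assumed pooling)
        return self.mlp(h) * mask            # mask out unreachable base stations

class Critic(nn.Module):
    def __init__(self, in_dim, hid_dim, n_actions):
        super().__init__()
        self.gcn = TwoLayerGCN(in_dim, hid_dim, hid_dim)
        # the critic also sees the chosen action and the neighbours' selection distribution
        self.mlp = nn.Sequential(nn.Linear(hid_dim + 2 * n_actions, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, 1))

    def forward(self, X, A_hat, action, neighbour_dist):
        h = self.gcn(X, A_hat).mean(dim=0)
        return self.mlp(torch.cat([h, action, neighbour_dist], dim=-1))
```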
step 502, the terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of the target base station, and probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device into the target critic network, and outputs a target evaluation value corresponding to the identification information of the target base station.
Specifically, the terminal device inputs the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performs feature extraction on the input data by using the at least two feature extraction layers in the critic network, and outputs the target evaluation value corresponding to the identification information of the target base station.
Step 503, the terminal device calculates a target reward value by using the reward function.
The target reward value is used to represent the time delay data and energy consumption data corresponding to offloading the target task to the target base station.
Specifically, the reward value is used to represent the task delay and energy consumption corresponding to offloading the target task to the target base station: the higher the reward value, the shorter the task delay and the smaller the energy consumption of offloading the target task to the target base station.
Illustratively, in the embodiment of the present application the reward function is designed so that the task delay is minimized under the constraint of meeting the energy consumption budget. Given the action $a_i(t)$, the corresponding reward is calculated according to the following formula:

$$r_i(t) = -\,\tau_i(t) + \max\!\Big( P_{\min},\; \min\big(0,\; \bar{\epsilon}_i - \epsilon_i(t)\big) \Big)$$

where $\tau_i(t)$ is the task delay, $\epsilon_i(t)$ is the energy consumption, $\bar{\epsilon}_i$ is the energy consumption budget, and $P_{\min}$ is a non-positive number representing the bound on the energy consumption penalty. The reward function can thus always aim at minimizing task delay while taking battery energy consumption safety into account. When the energy consumption $\epsilon_i(t)$ is less than $\bar{\epsilon}_i$, the reward of the energy consumption part of the reward function is 0; that is, under the condition that energy consumption safety is ensured, the embodiment of the present application places no specific limitation on the energy consumption of task transmission. When the energy consumption $\epsilon_i(t)$ is higher than $\bar{\epsilon}_i$, this part is a negative number, i.e., a penalty, and this penalty is bounded below by $P_{\min}$.
Therefore, under the guidance of this reward function, the terminal device can learn an excellent task offloading strategy that takes both task delay and transmission energy consumption into account, and can offload a given task to a suitable base station.
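The reward just described, in which delay is always penalized while energy consumption is only penalized once it exceeds the budget and that penalty is bounded, could be written as follows. The symbol names mirror the ones used in the formula above; the default value of the penalty bound is an arbitrary illustrative choice.

```python
def reward(delay, energy, energy_budget, penalty_floor=-1.0):
    """r_i(t) = -delay + max(penalty_floor, min(0, energy_budget - energy)).

    penalty_floor is the non-positive bound on the energy-consumption penalty."""
    energy_term = max(penalty_floor, min(0.0, energy_budget - energy))
    return -delay + energy_term
```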
In the embodiment of the application, the terminal device inputs the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network and outputs the identification information of the target base station. The terminal device then inputs the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network and outputs the target evaluation value corresponding to the identification information of the target base station, the target evaluation value being used to represent the degree of matching of offloading the target task to the target base station. In addition, the terminal device calculates a target reward value using the reward function. In this way it can be guaranteed that the task delay of offloading the target task to the target base station is the shortest while the energy consumption constraint condition is satisfied.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target critic network, each of which includes at least two graph convolutional neural network layers, so that the task delay of offloading the target task to the target base station is the shortest and the energy consumption constraint condition is satisfied. In this case, step 203 of "inputting the task attribute information, the probability distribution of each candidate base station being selected by the adjacent terminal devices, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station" may include the following contents:
the method comprises the steps that the terminal equipment inputs task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, at least two layers of graph convolution neural networks in the target actor network are used for carrying out feature extraction on input data at least twice, and identification information of the target base station is output based on extracted features.
The method comprises the steps that a terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal devices of corresponding devices into a target critic network, at least two times of feature extraction is carried out on input data through at least two layers of graph convolutional neural networks in the target critic network, and a target evaluation value corresponding to the identification information of the target base station is output based on extracted features.
Here, graph convolutional networks (GCNs) have been a research focus since they appeared in 2017 and have achieved remarkable results in many fields. Generally, the structure of a graph is quite irregular and has no translational invariance, so its features cannot be extracted with a convolutional neural network (CNN), a recurrent neural network (RNN) or the like; as a result, a great deal of work on graph learning theory has emerged. FIG. 4 shows a multi-layer graph convolutional network, which takes graph structure features as input and outputs the corresponding features after graph convolution. The layer-by-layer computation is as follows:

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ denotes the adjacency matrix of the graph structure with added self-connections, $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the corresponding degree matrix, and $W^{(l)}$ is a learnable weight parameter matrix. $\sigma(\cdot)$ is an activation function, e.g., ReLU$(\cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the feature extracted by the $l$-th graph convolutional layer; when $l = 0$, $H^{(0)} = X$ is the input graph structure feature.
Specifically, the actor network structure takes task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal device and each candidate base station as input, and uses two layers of GCNs to extract features from the input information, and finally calculates the extracted features by using a multi-layer Perceptron (MLP) and outputs identification information of a target base station.
The method comprises the steps that a terminal device inputs task attribute information, attribute information of a plurality of candidate base stations, channel estimation information between the terminal device and each candidate base station, identification information of a target base station and probability distribution of each candidate base station selected by adjacent terminal devices of corresponding devices into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output.
The target evaluation value is used for representing the matching degree of unloading the target task to the target base station.
Specifically, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device is input into the target critic network. The terminal device performs at least two times of feature extraction on input data by using at least two layers of graph convolutional neural networks in the critic network, calculates the extracted features by using a Multilayer Perceptron (MLP), and outputs a target evaluation value corresponding to the identification information of the target base station.
Here, the loss function of the target critic network can be calculated as

$$L(\phi_i) = \mathbb{E}\left[ \big( Q_i(o_i, a_i, p_{G_i}) - y_i \big)^2 \right]$$

where the target action value $y_i$ is calculated as follows:

$$y_i = r_i + \gamma\, Q_i'\big(o_i', a_i', p_{G_i}'\big)\Big|_{a_i' = \pi_{\theta_i}'(o_i')}$$

and $p_{G_i}$ represents the probability distribution of each candidate base station being selected by the terminal devices adjacent to terminal device i, with $G_i$ denoting the set of neighboring terminal devices of terminal device i.
In the embodiment of the application, the terminal device inputs the task attribute information, the attribute information of the plurality of candidate base stations and the channel estimation information between the terminal device and each candidate base station into the target actor network, performs feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target actor network, and outputs the identification information of the target base station based on the extracted features. The terminal device inputs the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station being selected by the adjacent terminal devices into the target critic network, performs feature extraction at least twice on the input data by using the at least two graph convolutional layers in the target critic network, and outputs, based on the extracted features, the target evaluation value corresponding to the identification information of the target base station, which is used to represent the degree of matching of offloading the target task to the target base station. In this method, performing feature extraction at least twice on the input data with the at least two graph convolutional layers in the target actor network ensures the accuracy of the features extracted by the target actor network, and therefore ensures that the identification of the target base station output by the target actor network is more accurate. Likewise, performing feature extraction at least twice on the input data with the at least two graph convolutional layers in the target critic network ensures the accuracy of the target evaluation value output by the target critic network.
In an alternative embodiment of the present application, as shown in fig. 7, the "acquiring attribute information of multiple candidate base stations associated with the terminal device" in step 202 includes:
in step 701, the terminal device sends broadcast information to the base station.
The broadcast information is used for instructing each base station to send attribute information of the base station to the terminal equipment.
Specifically, the terminal device may transmit the broadcast information to base stations around each terminal device before offloading the target task.
After receiving the broadcast information sent by the terminal device, each base station may send attribute information of the base station to the terminal device, and establish a connection with the terminal device.
Step 702, the terminal device receives the attribute information sent by each base station, and determines the attribute information of a plurality of candidate base stations associated with the terminal device according to the position information of the terminal device and the position information of the base station included in each attribute information.
Specifically, the attribute information sent by each base station may include location information of each base station, and after receiving the attribute information sent by each base station, the terminal device may determine the location of each base station according to the location information of each base station included in each attribute information. The terminal device may select, from among the base stations that have received the attribute information, a base station that is relatively close to the terminal device as a plurality of base stations corresponding to the terminal device, according to the position information of the terminal device and the position information of each base station, and determine the attribute information of a plurality of candidate base stations corresponding to the terminal device.
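A minimal sketch of this distance-based selection is given below (plain Python; the dictionary fields and the choice of the k nearest base stations are assumptions for illustration).

```python
import math

def select_candidate_base_stations(device_pos, base_stations, k=3):
    """Pick the k base stations closest to the terminal device from the
    attribute information received in response to the broadcast."""
    def distance(bs):
        return math.dist(device_pos, bs["position"])
    return sorted(base_stations, key=distance)[:k]

base_stations = [
    {"id": 1, "position": (0.0, 50.0), "cpu_freq": 10e9},
    {"id": 2, "position": (120.0, 30.0), "cpu_freq": 8e9},
    {"id": 3, "position": (40.0, 10.0), "cpu_freq": 12e9},
]
candidates = select_candidate_base_stations((35.0, 20.0), base_stations, k=2)
print([bs["id"] for bs in candidates])   # ids of the nearest candidate base stations
```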
In the embodiment of the application, the terminal device sends broadcast information to the base stations, receives the attribute information sent by each base station, and determines the attribute information of the plurality of candidate base stations corresponding to the terminal device according to the position information of the terminal device and the position information of the base stations included in each piece of attribute information. In this method, the terminal device determines, by sending broadcast information to the base stations and receiving the attribute information sent by each base station, the base stations that can establish a connection with the terminal device, and then determines the attribute information of the plurality of candidate base stations corresponding to the terminal device from the connected base stations according to the position information of the terminal device and the position information of the base stations included in each piece of attribute information. This ensures that the candidate base stations corresponding to the terminal device can establish a stable connection with the terminal device and are close to the terminal device, so that the task delay required for offloading the target task to the target base station is the shortest while the energy consumption constraint condition is met.
In an alternative embodiment of the present application, as shown in fig. 8, the training process of the preset deep reinforcement learning model may include the following steps:
step 801, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model.
The training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information from terminal equipment corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment.
Specifically, before training a preset deep reinforcement learning model, the terminal device needs to obtain a training set corresponding to the preset deep reinforcement learning model. The terminal device may obtain attribute information of a plurality of training tasks, where the attribute information of the plurality of tasks may include data size information of each training task and identification information of each training task. The terminal equipment can also acquire the attribute information of the candidate base stations corresponding to the training task through the communication connection with the base stations. The terminal device may calculate the time delay data and the energy consumption data for offloading each training task to each base station according to a preset algorithm, and thereby determine the target base station corresponding to each training task and the identification information of the target base station from the plurality of candidate base stations according to the calculated time delay data and energy consumption data.
Illustratively, in the embodiment of the present application, an edge computing system is defined, which is deployed with N micro Base Stations (BSs) and can provide computing services for large-scale Mobile Internet-of-Things Devices (MDs) in the system. For convenience of description, the set of base stations is denoted as N = {1, 2, ..., N}, the set of mobile Internet-of-Things devices is denoted as M = {1, 2, ..., M}, and time is discretized into τ different time intervals (time slots), denoted as T = {1, 2, ..., τ}. Meanwhile, because the base stations differ in deployment position and signal coverage capability, each base station can serve different terminal devices; in addition, the base stations to which a terminal device can connect differ because the positions of the terminal devices differ. Then, at time t, the set of connectable base stations of terminal device i is denoted as N_i(t), and the set of serviceable terminal devices of base station j is denoted as M_j(t). At this time, for any base station j, if a terminal device in its signal coverage area offloads a task to the base station, it is marked as 1; otherwise, it is marked as 0, which may be specifically expressed as:
$$a_{i,j}(t) = \begin{cases} 1, & \text{terminal device } i \text{ offloads its task to base station } j \text{ at time } t \\ 0, & \text{otherwise} \end{cases}$$
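For illustration only, the bookkeeping behind this indicator and the serviceable set M_j(t) might look like the following sketch (names and data layout are assumptions).

```python
def offload_indicator(choice, i, j):
    """a_{i,j}(t): 1 if terminal device i offloads its slot-t task to base station j."""
    return 1 if choice.get(i) == j else 0

def served_devices(choice, j):
    """M_j(t): the set of terminal devices whose slot-t task is offloaded to base station j."""
    return {i for i, bs in choice.items() if bs == j}

choice_t = {0: 2, 1: 2, 2: 1}                 # device -> selected base station at slot t
print(offload_indicator(choice_t, 0, 2))      # 1
print(served_devices(choice_t, 2))            # {0, 1}
```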
Exemplarily, taking a community scene with an edge computing system deployed as an example, a plurality of mobile Internet-of-Things devices, including smart watches, smart glasses, smart phones and the like, are randomly distributed at arbitrary positions in the community. Each device generates a computing task k of a specific size at the beginning of each time slot τ, offloads the task after local preprocessing to a selected edge base station for further computation and analysis, and finally the base station returns the processed result to the terminal device. Two points should be noted in this process: first, the data to be offloaded after preprocessing by the terminal device is inseparable, namely it is submitted as a whole to the selected base station for computation and analysis; second, since the analysis result computed by the base station is much smaller than the data to be offloaded, the downlink transmission delay can be ignored when modelling the computing task.
In the preprocessing step, the terminal device generally needs to encrypt and pack the generated task data, and then offload the task data to the base station for processing. For convenience of description, denote the data size that terminal device $i$ needs to process locally as $d_i^{l}(t)$ and the data size to be offloaded to the base station for processing as $d_i^{o}(t)$. Correspondingly, for the task generated at time $t$, the numbers of CPU cycles required to process a unit amount of data locally and at the base station are denoted as $c_i^{l}$ and $c_i^{o}$, respectively. The time delay consumed in the local preprocessing is then:

$$T_i^{l}(t) = \frac{c_i^{l}\, d_i^{l}(t)}{f_i}$$

wherein $f_i$ represents the CPU frequency of terminal device $i$. The energy consumption spent in local processing is as follows:

$$E_i^{l}(t) = \kappa_i\, f_i^{2}\, c_i^{l}\, d_i^{l}(t)$$

wherein $\kappa_i$ is the power consumption coefficient of the corresponding device, and this coefficient typically depends on the chip architecture.
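Under the notation above, the local preprocessing delay and energy could be computed as in this short sketch (all parameter values are illustrative assumptions).

```python
def local_delay(data_local, cycles_per_bit_local, cpu_freq):
    """T_i^l(t) = c_i^l * d_i^l(t) / f_i : local preprocessing delay."""
    return cycles_per_bit_local * data_local / cpu_freq

def local_energy(data_local, cycles_per_bit_local, cpu_freq, kappa):
    """E_i^l(t) = kappa_i * f_i^2 * c_i^l * d_i^l(t) : local preprocessing energy."""
    return kappa * cpu_freq**2 * cycles_per_bit_local * data_local

d_local = 2e6          # bits to preprocess locally
c_local = 500.0        # CPU cycles per bit for local preprocessing
f_i = 1.5e9            # terminal CPU frequency (Hz)
kappa = 1e-27          # chip-dependent power consumption coefficient
print(local_delay(d_local, c_local, f_i))            # seconds
print(local_energy(d_local, c_local, f_i, kappa))    # joules
```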
In this scenario, since the task to be unloaded is not separable, the unloading delay usually includes two parts, respectively: transmission delay and computation delay. First, the transmission delay refers to the time taken for the terminal device i to transmit the preprocessed task to the selected base station j. Therefore, for the terminal device i, the transmission delay at the time t is specifically:
$$T_{i,j}^{tx}(t) = \frac{d_i^{o}(t)}{r_{i,j}(t)}$$

wherein $d_i^{o}(t)$ is the size of the content to be transmitted and $r_{i,j}(t)$ is the uplink rate that can be achieved between terminal device $i$ and base station $j$, which is specifically calculated as follows:

$$r_{i,j}(t) = B \log_2\!\left(1 + \frac{p^{tx}\, h_{i,j}(t)}{\sigma^2 + I_{i,j}}\right)$$

wherein $B$ represents the bandwidth available for data transmission between the terminal device and the connectable base station, and $h_{i,j}(t)$ represents the channel gain between terminal device $i$ and the selected base station $j$. In addition, the terminal devices uniformly transmit their tasks with power $p^{tx}$, during which the noise power is expressed as $\sigma^2$ and the interference power at the base station can be represented as $I_{i,j}$. The channel gain is calculated as follows:

$$h_{i,j}(t) = X\, \beta_{i,j}\, \tilde{\beta}_{i,j}\, d_{i,j}^{-\zeta}$$

wherein $X$ represents an adjustment factor for the path loss; $\beta_{i,j}$ and $\tilde{\beta}_{i,j}$ represent the fast fading gain coefficient and the slow fading gain coefficient, respectively; $d_{i,j}$ represents the distance between terminal device $i$ and base station $j$; and $\zeta$ is the path loss coefficient.
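A numerical sketch of this uplink model (channel gain, achievable rate and transmission delay) follows; every parameter value is an illustrative assumption.

```python
import math

def channel_gain(x_factor, fast_fading, slow_fading, distance, zeta):
    """h_{i,j} = X * beta_{i,j} * beta~_{i,j} * d_{i,j}^(-zeta)."""
    return x_factor * fast_fading * slow_fading * distance ** (-zeta)

def uplink_rate(bandwidth, p_tx, gain, noise_power, interference):
    """r_{i,j}(t) = B * log2(1 + p_tx * h / (sigma^2 + I))."""
    return bandwidth * math.log2(1.0 + p_tx * gain / (noise_power + interference))

def transmission_delay(data_offload, rate):
    """T_{i,j}^tx(t) = d_i^o(t) / r_{i,j}(t)."""
    return data_offload / rate

h = channel_gain(x_factor=1.0, fast_fading=0.9, slow_fading=0.8, distance=100.0, zeta=3.0)
r = uplink_rate(bandwidth=10e6, p_tx=0.2, gain=h, noise_power=1e-13, interference=1e-13)
print(transmission_delay(data_offload=1e6, rate=r))   # seconds to upload 1 Mbit
```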
Secondly, the computation delay of the task generated by the terminal device i at the time t on the edge server can be expressed as follows:
$$T_{i,j}^{c}(t) = \frac{c_i^{o}\, d_i^{o}(t)}{f_{i,j}(t)}$$

wherein $c_i^{o}$ represents the number of CPU cycles required to compute a unit amount of task data at the base station, and $f_{i,j}(t) = f_j / \sum_{k} a_{k,j}(t)$ represents the CPU frequency allocated to terminal device $i$ on base station $j$ at time $t$; that is, when a plurality of tasks are offloaded to the same base station, the base station evenly distributes its computing power among these tasks.

Thus, the total time delay required for the task on terminal device $i$ from preprocessing to completion of the computation is:

$$T_i(t) = T_i^{l}(t) + T_{i,j}^{tx}(t) + T_{i,j}^{c}(t)$$
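The edge-side computation delay and the resulting total delay can be illustrated as follows (function names and numbers are assumptions).

```python
def edge_compute_delay(data_offload, cycles_per_bit_edge, bs_cpu_freq, n_tasks_on_bs):
    """T_{i,j}^c(t) = c_i^o * d_i^o(t) / f_{i,j}(t), with f_{i,j}(t) = f_j / |M_j(t)|
    when the base station splits its CPU evenly among the offloaded tasks."""
    f_share = bs_cpu_freq / n_tasks_on_bs
    return cycles_per_bit_edge * data_offload / f_share

def total_delay(t_local, t_tx, t_comp):
    """Total delay from local preprocessing to completion of the edge computation."""
    return t_local + t_tx + t_comp

t_comp = edge_compute_delay(1e6, 800.0, 10e9, n_tasks_on_bs=4)
print(total_delay(t_local=0.7, t_tx=0.005, t_comp=t_comp))   # seconds
```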
in addition, for the terminal device, the energy consumption spent by the task in the offloading process generally includes two parts, namely the energy consumption required for transmitting the task to the base station and the energy consumption required for receiving the task when the base station transmits the calculation result back to the terminal device. The data volume of the calculation result is very small compared with the data volume to be transmitted, so the receiving energy consumption can be ignored. Therefore, when the terminal device i unloads the task, the transmission energy consumption is as follows:
$$E_{i,j}^{tx}(t) = p^{tx}\, T_{i,j}^{tx}(t)$$

The total energy consumption is:

$$E_i(t) = E_i^{l}(t) + E_{i,j}^{tx}(t)$$
In addition, when the terminal device in the mobile edge system offloads a task, power consumption is inevitable. However, a large instantaneous battery discharge power is harmful, and for this reason a battery safety threshold $E_i^{safe}$ is introduced here; namely, when the terminal device offloads a task, the energy consumption should satisfy the following condition:

$$E_i(t) \le E_i^{safe}$$
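The transmission energy and the battery safety check described above might be expressed as in this sketch (the threshold name e_safe is assumed for illustration).

```python
def transmission_energy(p_tx, t_tx):
    """E_{i,j}^tx(t) = p_tx * T_{i,j}^tx(t): energy spent uploading the task."""
    return p_tx * t_tx

def offload_allowed(e_local, e_tx, e_safe):
    """Energy constraint: the per-slot energy must stay below the battery safety threshold."""
    return e_local + e_tx <= e_safe

e_tx = transmission_energy(p_tx=0.2, t_tx=0.005)
print(offload_allowed(e_local=1.1e-3, e_tx=e_tx, e_safe=5e-3))   # True
```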
therefore, when the terminal device unloads the task, the minimum total time delay should be achieved under the condition of meeting the energy consumption constraint condition. This optimization problem is defined as follows:
$$\min_{\{a_{i,j}(t)\}} \; \sum_{t=1}^{\tau} \sum_{i=1}^{M} T_i(t)$$

$$\text{s.t.} \quad E_i(t) \le E_i^{safe}, \quad \forall i,\; \forall t$$

$$a_{i,j}(t) \in \{0, 1\}, \quad \sum_{j \in N_i(t)} a_{i,j}(t) = 1, \quad \forall i,\; \forall t$$
based on the above, the terminal device may calculate the time delay data and the energy consumption data corresponding to offloading each training task to each base station, and determine the target base station corresponding to each training task and the identification information of the target base station from the plurality of base stations according to the calculated time delay data and energy consumption data. The task unloading time delay corresponding to the unloading of each training task to the target base station is shortest, and the preset energy consumption constraint condition is met.
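The label-generation step described here, namely choosing for each training task the candidate base station with the smallest total delay among those satisfying the energy constraint, can be sketched as follows (the data layout is an assumption).

```python
def pick_target_base_station(candidates, delay_fn, energy_fn, e_safe):
    """Return the id of the candidate base station with the smallest total
    offloading delay among those satisfying the energy constraint."""
    feasible = [(delay_fn(bs), bs["id"]) for bs in candidates
                if energy_fn(bs) <= e_safe]
    if not feasible:
        return None                      # no candidate meets the constraint
    return min(feasible)[1]

candidates = [{"id": 1, "delay": 0.9, "energy": 2e-3},
              {"id": 2, "delay": 0.6, "energy": 6e-3},
              {"id": 3, "delay": 0.7, "energy": 3e-3}]
best = pick_target_base_station(candidates,
                                delay_fn=lambda bs: bs["delay"],
                                energy_fn=lambda bs: bs["energy"],
                                e_safe=5e-3)
print(best)   # 3: lowest delay among the energy-feasible candidates
```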
Step 802, the terminal device trains the deep reinforcement learning network to obtain a preset deep reinforcement learning model by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device as input.
Specifically, the terminal device may input the attribute information of each training task, the attribute information of the plurality of candidate base stations corresponding to each training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station into the untrained deep reinforcement learning network, and train the deep reinforcement learning network with the identification information of the training base station corresponding to each training task as the supervision target, thereby obtaining the preset deep reinforcement learning model.
Furthermore, when the preset deep reinforcement learning model is trained, an Adam optimizer can be selected to optimize the preset deep reinforcement learning model, so that the preset deep reinforcement learning model can converge rapidly.
When the Adam optimizer is used for optimizing the preset deep reinforcement learning model, a learning rate can be set for the optimizer, and the optimal learning rate can be selected by adopting a learning rate range test technology. The learning rate selection process of this test technology is as follows: first, the learning rate is set to a small value; then the preset deep reinforcement learning model is simply iterated over the training sample data several times, the learning rate is increased after each iteration is completed, and the training loss is recorded each time; finally, a learning rate range test chart is drawn. A typical ideal learning rate range test chart comprises three regions: in the first region the learning rate is too small and the loss is basically unchanged, in the second region the loss converges quickly, and in the last region the learning rate is too large so that the loss begins to diverge. The learning rate corresponding to the lowest point in the learning rate range test chart can then be used as the optimal learning rate.
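A compact sketch of such a learning rate range test is given below (PyTorch and a toy regression model are assumed; the schedule and step count are illustrative).

```python
import torch
import torch.nn as nn

def lr_range_test(model, make_batch, lr_start=1e-6, lr_end=1.0, steps=100):
    """Increase the learning rate geometrically over a few iterations, record the
    training loss at each step, and return the learning rate at the lowest loss."""
    factor = (lr_end / lr_start) ** (1.0 / steps)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)
    history = []
    lr = lr_start
    for _ in range(steps):
        x, y = make_batch()
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        lr *= factor                              # raise the learning rate after each iteration
        for group in optimizer.param_groups:
            group["lr"] = lr
    return min(history, key=lambda p: p[1])[0]    # learning rate at the lowest recorded loss

model = nn.Linear(8, 1)
best_lr = lr_range_test(model, lambda: (torch.randn(32, 8), torch.randn(32, 1)))
print(best_lr)
```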
In the embodiment of the application, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model, and trains a deep reinforcement learning network to obtain the preset deep reinforcement learning model by taking attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station as input. In the embodiment of the application, the preset deep reinforcement learning model is obtained based on training of the training set, and the preset deep reinforcement learning model can be ensured to be more accurate, so that the target task unloaded to the target base station, which is obtained based on the preset deep reinforcement learning model, is ensured to be more accurate.
In an optional embodiment of the present application, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function. As shown in fig. 9, in step 802, "training a deep reinforcement learning network to obtain the preset deep reinforcement learning model by using, as input, attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal device corresponding to the training task to each candidate base station, identification information of the training base station corresponding to the training task, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device" may include the following steps:
step 901, the terminal device inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal device corresponding to the training task and each candidate base station to the initial actor network, and outputs the identifier of the training base station corresponding to the training task.
Wherein, the initial actor network may include a first actor network and a second actor network.
Step 902, the terminal device inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, and the identification of the training base station corresponding to the training task into the initial critic network, performs feature extraction on the input data by using the initial critic network, and outputs a training evaluation value for unloading the training task to the training base station.
The training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task.
Step 903, the terminal device uses the reward function to calculate a training reward value corresponding to the training base station to unload the training task.
The training return value is used for representing time delay data and energy consumption data corresponding to unloading of the training task to the training base station.
And 904, training the initial critic network by the terminal equipment according to the training return value to obtain a target critic network.
Step 905, the terminal device trains the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
The specific training and execution process may include the steps of:
1. The model comprises a plurality of terminal devices (intelligent agents), and each terminal device comprises an actor network part and a critic network part, wherein the actor/critic network comprises a first actor/critic network and a second actor/critic network. The second actor/critic network is completely replicated from the first actor/critic network prior to training; during training, the second actor/critic network is updated according to a certain rule, for example, if A represents a parameter of the first actor/critic network and B represents the corresponding parameter of the second actor/critic network, then B = αB + (1 − α)A (an illustrative sketch of this update, together with the training steps below, is given after step 4).
2. For convenience of representation, the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station are called as state information; the "identification of the training base station corresponding to the training task" is referred to as an action, and the "probability distribution of the candidate base stations being selected" is referred to as a joint action.
3. The execution flow comprises the following steps: firstly, each terminal device acquires attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and inputs this state information into a first actor network to obtain an identifier of the training base station corresponding to the training task and a corresponding return value. Meanwhile, each terminal device obtains the identifier of the corresponding base station selected by the adjacent terminal device through the communication module, and calculates the probability distribution of each candidate base station. At this time, the attribute information of the training task in the environment, the attribute information of the plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station are updated to the next time, and can be acquired by the terminal device. Finally, the terminal device combines the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task, the selected probability distribution of each base station, the corresponding return value, the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station into a complete experience, and stores the complete experience in respective independent experience pools for subsequent training.
4. Training process: typically, a complete training process involves multiple cycles from training a critic's network to training an actor's network, and both are dependent on each other.
Training a critic network: firstly, inputting attribute information of a training task obtained by random sampling from the experience pool, attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, identification of the training base station corresponding to the training task and selected probability distribution information of each base station into a first critic network in a corresponding model by each terminal equipment to obtain a critic value; then inputting the attribute information of the training task at the next moment in the experience, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into a second actor network in a corresponding model to obtain the identification of the training base station corresponding to the training task at the next moment; then each terminal device obtains the identification of the training base station corresponding to the training task of the adjacent terminal device through an obtaining module, and calculates the probability distribution of each candidate base station to be selected; and finally, inputting the attribute information of the training task corresponding to the next moment, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task and the selected probability distribution information of each candidate base station into a second critic network in the submodel, and calculating to obtain a comment value of the next moment. At this time, the loss is calculated by using the comment value, the return value obtained by sampling and the comment value at the next moment, and the gradient is further calculated to update the first critic network in the terminal device.
Training an actor network: firstly, each terminal device inputs the attribute information of the training task obtained by sampling, the attribute information of a plurality of candidate base stations corresponding to the training task, and the channel estimation information from the terminal device corresponding to the training task to each candidate base station into a first actor network in a corresponding model and obtains the identification of the training base station corresponding to the training task, and inputs the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal device corresponding to the training task to each candidate base station, the identification of the training base station corresponding to the training task and the selected probability distribution information of each candidate base station into a first critic network in the corresponding terminal device to obtain the corresponding comment value. Then, loss is calculated according to the comment values, and gradient is further calculated to update the first actor network in the corresponding terminal equipment.
And finally, updating the second actor/critic network in the terminal equipment according to the second actor/critic network updating mode in the step 1.
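The training procedure of steps 1 to 4 above can be illustrated end-to-end with the following sketch (PyTorch is assumed; the stand-in linear networks, dimensions and hyper-parameters are placeholders for the graph-convolution actor/critic networks described in the application).

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in first/second actor and critic networks (step 1).
actor, actor_t = nn.Linear(4, 3), nn.Linear(4, 3)                # first / second actor
critic, critic_t = nn.Linear(4 + 3 + 3, 1), nn.Linear(4 + 3 + 3, 1)
actor_t.load_state_dict(actor.state_dict())                      # second nets start as copies
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
pool = deque(maxlen=100_000)                                     # per-device experience pool (step 3)

def soft_update(target, online, alpha=0.99):
    """Step 1: B <- alpha * B + (1 - alpha) * A."""
    with torch.no_grad():
        for b, a in zip(target.parameters(), online.parameters()):
            b.mul_(alpha).add_((1.0 - alpha) * a)

def train_step(batch_size=32, gamma=0.95):
    """Step 4: one critic update followed by one actor update."""
    s, a, joint, r, s2, joint2 = (torch.stack(x) for x in zip(*random.sample(pool, batch_size)))
    # Critic: TD target from the second networks, MSE loss on the first critic.
    with torch.no_grad():
        a2 = torch.softmax(actor_t(s2), dim=-1)
        y = r + gamma * critic_t(torch.cat([s2, a2, joint2], dim=-1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a, joint], dim=-1)), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor: push the policy towards actions the first critic scores highly.
    a_new = torch.softmax(actor(s), dim=-1)
    actor_loss = -critic(torch.cat([s, a_new, joint], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Finally, let the second networks slowly track the first ones.
    soft_update(actor_t, actor)
    soft_update(critic_t, critic)

# Execution flow (step 3): interact, store experiences, then train.
for _ in range(64):
    s = torch.randn(4)
    a = torch.softmax(actor(s), dim=-1).detach()
    joint = torch.softmax(torch.randn(3), dim=-1)   # neighbours' base-station selection distribution
    pool.append((s, a, joint, torch.randn(1), torch.randn(4), joint))
train_step()
```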
In order to better explain the large-scale user task offloading method provided by the present application, the present application provides an illustrative embodiment of the overall flow aspect of the large-scale user task offloading method, as shown in fig. 10, the method includes:
step 1001, a terminal device obtains a training set corresponding to a preset deep reinforcement learning model.
In step 1002, the terminal device trains a deep reinforcement learning network to obtain a preset deep reinforcement learning model by using attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between the terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device as input.
Step 1003, the terminal device obtains task attribute information of the target task to be unloaded and probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device.
In step 1004, the terminal device sends broadcast information to the base station.
Step 1005, the terminal device receives the attribute information sent by each base station, and determines the attribute information of a plurality of candidate base stations associated with the terminal device and the channel estimation information between the terminal device and each candidate base station according to the position information of the terminal device and the position information of the base station included in each attribute information.
Step 1006, the terminal device inputs the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal device and each candidate base station into the target actor network, performs at least two times of feature extraction on the input data by using at least two layers of graph convolution neural networks in the target actor network, and outputs the identification information of the target base station based on the extracted features.
Step 1007, the terminal device inputs the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station, and the probability distribution of each candidate base station selected by the adjacent terminal device of the corresponding device into the target critic network, performs at least twice feature extraction on the input data by using at least two-layer graph convolutional neural network in the target critic network, and outputs a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
In step 1008, the terminal device calculates a target return value using the return function.
It should be understood that although the various steps in the flowcharts of fig. 2, 5, and 7-10 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not limited to be performed in a strict order unless explicitly stated in the embodiments of the present application, and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 5, and 7-10 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or at least partially with other steps or with at least some of the other steps.
In one embodiment of the present application, as shown in fig. 11, there is provided a large-scale user task offloading device 1100, including: a first obtaining module 1110, a second obtaining module 1120, a determining module 1130, and an uninstalling module 1140, wherein:
a first obtaining module 1110, configured to obtain task attribute information of a target task to be offloaded and the probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device.
A second obtaining module 1120, configured to obtain attribute information of multiple candidate base stations associated with the terminal device and channel estimation information between the terminal device and each candidate base station.
A determining module 1130, configured to input the task attribute information, the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, the attribute information of multiple candidate base stations, and the channel estimation information between the terminal device and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to a target task, and output a target evaluation value corresponding to the identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station, wherein the preset deep reinforcement learning model comprises a graph convolution neural network, and the graph convolution neural network is used for performing at least twice feature extraction on input data of the preset deep reinforcement learning model;
an offloading module 1140 for offloading the target task to the target base station.
In an embodiment of the present application, the preset deep reinforcement learning model includes a target actor network and a target critic network, as shown in fig. 12, the determining module 1130 includes: a first output unit 1131, and a second output unit 1132, wherein:
a first output unit 1131, configured to input the task attribute information, the attribute information of the multiple candidate base stations, and the channel estimation information between the terminal device and each candidate base station to the target actor network, and output the identification information of the target base station.
A second output unit 1132, configured to input, to the target critic network, the task attribute information, the attribute information of the plurality of candidate base stations, the channel estimation information between the terminal device and each candidate base station, the identification information of the target base station, and the probability distribution of each candidate base station selected by the neighboring terminal device of the corresponding device, and output a target evaluation value corresponding to the identification information of the target base station.
In an embodiment of the present application, the predetermined deep reinforcement learning model includes a reward function, as shown in fig. 13, the determining module 1130 further includes: a calculation unit 1133, wherein:
a calculating unit 1133, configured to calculate a target return value by using a return function, where the target return value is used to represent time delay data and energy consumption data corresponding to offloading of a target task to a target base station.
In an embodiment of the application, the preset deep reinforcement learning model includes a target actor network and a target critic network, the target actor network includes at least two layers of graph convolutional neural networks, the target critic network includes at least two layers of graph convolutional neural networks, and the determining module 1130 is specifically configured to input task attribute information, attribute information of a plurality of candidate base stations, and channel estimation information between the terminal device and each candidate base station into the target actor network, perform feature extraction on input data at least twice by using at least two layers of graph convolutional neural networks in the target actor network, and output identification information of the target base station based on extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In an embodiment of the present application, as shown in fig. 14, the second obtaining module 1120 includes a sending unit 1121 and a receiving unit 1122, where:
a sending unit 1121, configured to send broadcast information to the base stations by the terminal device, where the broadcast information is used to instruct each base station to send attribute information of the base station to the terminal device;
the receiving unit 1122 is configured to receive the attribute information transmitted by each base station, and determine the attribute information of the plurality of candidate base stations corresponding to the terminal device according to the position information of the terminal device and the position information of the base station included in each attribute information.
In an embodiment of the present application, as shown in fig. 15, the large-scale user task offloading device 1100 further includes: a third acquisition module 1150 and a training module 1160, wherein
A third obtaining module 1150, configured to obtain a training set corresponding to the preset deep reinforcement learning model, where the training set includes attribute information of multiple training tasks, attribute information of multiple candidate base stations corresponding to the training tasks, channel estimation information from a terminal device corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device.
The training module 1160 is configured to train the deep reinforcement learning network by using, as inputs, attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, identification information of the training base station corresponding to the training task and channel estimation information between a terminal device corresponding to the training task and each candidate base station, and probability distribution of each candidate base station selected by an adjacent terminal device of the corresponding device, to obtain a preset deep reinforcement learning model.
In an embodiment of the application, the preset deep reinforcement learning model includes a target actor network, a target critic network, and a return function, and the training module 1160 is specifically configured to input attribute information of a training task, attribute information of a plurality of candidate base stations corresponding to the training task, and channel estimation information from a terminal device corresponding to the training task to each candidate base station to an initial actor network, and output an identifier of a training base station corresponding to the training task; inputting the attribute information of a training task, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from terminal equipment corresponding to the training task to each candidate base station, probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment and identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on input data by using the initial critic network, outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task; calculating a training return value corresponding to the training task unloaded to the training base station by using a return function, wherein the training return value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station; training an initial critic network according to the training return value to obtain a target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
For specific limitations of the large-scale user task offloading device, reference may be made to the above limitations of the task offloading method, which are not described herein again. The modules in the task uninstalling device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 16. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a task offloading method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 16 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring task attribute information of a target task to be unloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset depth reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for carrying out at least twice feature extraction on input data of the preset depth reinforcement learning model; and unloading the target task to the target base station.
In one embodiment, the pre-set deep reinforcement learning model comprises a network of target actors and a network of target critics, and the processor when executing the computer program further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, and outputting identification information of the target base station; and inputting the task attribute information, the attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of the target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the predetermined deep reinforcement learning model includes a reward function, and the processor executes the computer program to further perform the following steps: and calculating a target return value by using a return function, wherein the target return value is used for representing time delay data and energy consumption data corresponding to the unloading of the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network and a target commentator network, the target actor network comprises at least two layers of graph convolutional neural networks, the target commentator network comprises at least two layers of graph convolutional neural networks, and the processor executes the computer program and further realizes the following steps: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, performing at least twice feature extraction on input data by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In one embodiment, the processor, when executing the computer program, further performs the steps of: the terminal equipment sends broadcast information to the base stations, and the broadcast information is used for indicating each base station to send attribute information of the base station to the terminal equipment; and determining the attribute information of a plurality of candidate base stations related to the terminal equipment according to the position information of the terminal equipment and the position information of the base station included in each attribute information after receiving the attribute information sent by each base station.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a training set corresponding to a preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to the training tasks, channel estimation information from terminal equipment corresponding to the training tasks to each candidate base station, identification information of the training base stations corresponding to the training tasks and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain a preset deep reinforcement learning model.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network, a target commentator network and a reward function, and the processor, when executing the computer program, further implements the following steps: inputting the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task and the channel estimation information between the terminal equipment corresponding to the training task and each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of a training task, the attribute information of a plurality of candidate base stations corresponding to the training task, channel estimation information from terminal equipment corresponding to the training task to each candidate base station, probability distribution selected by adjacent terminal equipment of corresponding equipment of each candidate base station and identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on input data by using the initial critic network, outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for representing the matching degree of unloading the training task to the training base station corresponding to the task; calculating a training return value corresponding to the training task unloaded to the training base station by using a return function, wherein the training return value is used for representing time delay data and energy consumption data corresponding to the training task unloaded to the training base station; training an initial critic network according to the training return value to obtain a target critic network; and training the initial actor network according to the training evaluation value and the training return value to obtain a target actor network.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring task attribute information of a target task to be unloaded and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment; acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station; inputting task attribute information, probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to a target task, and outputting a target evaluation value corresponding to identification information of the target base station; the target evaluation value is used for representing the matching degree of unloading the target task to the target base station; the preset depth reinforcement learning model comprises a graph convolution neural network, wherein the graph convolution neural network is used for carrying out at least twice feature extraction on input data of the preset depth reinforcement learning model; and unloading the target task to the target base station.
In one embodiment, the pre-defined deep reinforcement learning model includes a network of target actors and a network of target critics, the computer program when executed by the processor further performs the steps of: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, and outputting identification information of the target base station; and inputting the task attribute information, the attribute information of a plurality of candidate base stations, channel estimation information between the terminal equipment and each candidate base station, identification information of the target base station and probability distribution of each candidate base station selected by adjacent terminal equipment of corresponding equipment into a target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
In one embodiment, the predetermined deep reinforcement learning model includes a reward function, and the computer program when executed by the processor further implements the steps of: and calculating a target return value by using a return function, wherein the target return value is used for representing time delay data and energy consumption data corresponding to the unloading of the target task to the target base station.
In one embodiment, the preset deep reinforcement learning model comprises a target actor network and a target commentator network, the target actor network comprises at least two layers of graph convolutional neural networks, the target commentator network comprises at least two layers of graph convolutional neural networks, and when being executed by the processor, the computer program further realizes the following steps: inputting task attribute information, attribute information of a plurality of candidate base stations and channel estimation information between the terminal equipment and each candidate base station into a target actor network, performing at least twice feature extraction on input data by utilizing at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features; the task attribute information, the attribute information of a plurality of candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment are input into a target critic network, at least two-layer graph convolutional neural network in the target critic network is used for carrying out feature extraction on input data at least twice, and a target evaluation value corresponding to the identification information of the target base station is output based on the extracted features.
In one embodiment, the computer program when executed by the processor further performs the steps of: the terminal equipment sends broadcast information to the base stations, and the broadcast information is used for indicating each base station to send attribute information of the base station to the terminal equipment; and determining the attribute information of a plurality of candidate base stations related to the terminal equipment according to the position information of the terminal equipment and the position information of the base station included in each attribute information after receiving the attribute information sent by each base station.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set includes attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal equipment corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment; and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model.
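Concretely, every element of such a training set bundles the five pieces of information listed above. The field names in the following sketch are illustrative, not taken from the application.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TrainingSample:
        task_attributes: List[float]                # e.g. data size, required CPU cycles, deadline
        candidate_bs_attributes: List[List[float]]  # one attribute vector per candidate base station
        channel_estimates: List[float]              # terminal equipment -> each candidate base station
        training_bs_id: int                         # identification information of the training base station
        neighbour_selection_dist: List[float]       # probability of each candidate being selected by adjacent terminals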
In one embodiment, the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and the computer program, when executed by the processor, further implements the following steps: inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station into an initial actor network, and outputting the identification of the training base station corresponding to the training task; inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for characterizing the degree of matching for unloading the training task to the corresponding training base station; calculating, by using the reward function, a training reward value corresponding to unloading the training task to the training base station, wherein the training reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading the training task to the training base station; training the initial critic network according to the training reward value to obtain the target critic network; and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
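The following sketch shows one update step that is consistent with the training flow just described, reusing the TargetActor and TargetCritic classes from the earlier sketch: the critic is regressed towards the training reward value produced by the reward function, and the actor is pushed towards base stations the critic evaluates highly. The mean-squared-error loss, the policy-gradient form of the actor update and the batch layout are all assumptions; the application does not prescribe particular losses or optimizers.

    import torch
    import torch.nn.functional as F

    def train_step(actor, critic, actor_opt, critic_opt, batch):
        state = batch["state"]                    # task, base station and channel features
        neighbour_dist = batch["neighbour_dist"]  # neighbours' selection distribution
        reward = batch["reward"]                  # training reward value from the reward function

        # Critic update: fit its training evaluation value to the observed reward value.
        scores = actor.net(state)
        action_id = scores.argmax(dim=-1)         # identification of the training base station
        value = critic(state, action_id, neighbour_dist).squeeze(-1)
        critic_loss = F.mse_loss(value, reward)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor update: a simple policy-gradient step weighted by an advantage
        # built from the reward value and the critic's evaluation value.
        log_probs = F.log_softmax(actor.net(state), dim=-1)
        chosen_log_prob = log_probs.gather(-1, action_id.unsqueeze(-1)).squeeze(-1)
        advantage = (reward - value).detach()
        actor_loss = -(chosen_log_prob * advantage).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()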
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and all of them fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A large-scale user task offloading method, comprising:
acquiring task attribute information of a target task to be unloaded and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each candidate base station;
inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determining a target base station corresponding to the target task, and outputting a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for characterizing the degree of matching for unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, wherein the graph convolutional neural network is used for performing feature extraction at least twice on input data of the preset deep reinforcement learning model;
and unloading the target task to the target base station.
2. The method of claim 1, wherein the preset deep reinforcement learning model comprises a target actor network and a target critic network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station comprises:
inputting the task attribute information, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the target actor network, and outputting the identification information of the target base station;
and inputting the task attribute information, the attribute information of the candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station.
3. The method of claim 2, wherein the preset deep reinforcement learning model includes a reward function, and the step of inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model to determine the target base station corresponding to the target task further includes:
and calculating a target reward value by using the reward function, wherein the target reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading of the target task to the target base station.
4. The method of claim 1, wherein the preset deep reinforcement learning model comprises a target actor network and a target critic network, the target actor network comprises at least two layers of the graph convolutional neural network, the target critic network comprises at least two layers of the graph convolutional neural network, and the inputting the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the candidate base stations and the channel estimation information between the terminal equipment and each candidate base station into the preset deep reinforcement learning model, determining the target base station corresponding to the target task, and outputting the target evaluation value corresponding to the identification information of the target base station comprises:
inputting the task attribute information, the attribute information of the candidate base stations and channel estimation information between the terminal equipment and each candidate base station into the target actor network, performing feature extraction on input data at least twice by using at least two layers of graph convolution neural networks in the target actor network, and outputting identification information of the target base station based on the extracted features;
inputting the task attribute information, the attribute information of the candidate base stations, the channel estimation information between the terminal equipment and each candidate base station, the identification information of the target base station and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment into the target critic network, performing feature extraction on the input data at least twice by using the at least two layers of graph convolutional neural networks in the target critic network, and outputting a target evaluation value corresponding to the identification information of the target base station based on the extracted features.
5. The method of claim 1, wherein the acquiring attribute information of a plurality of candidate base stations associated with the terminal equipment comprises:
sending, by the terminal equipment, broadcast information to the base stations, wherein the broadcast information is used for instructing each base station to send its attribute information to the terminal equipment;
and receiving the attribute information sent by each base station, and determining, according to the position information of the terminal equipment and the position information of the base station included in each piece of attribute information, the attribute information of the plurality of candidate base stations associated with the terminal equipment.
6. The method according to claim 1, wherein the training process of the preset deep reinforcement learning model is as follows:
acquiring a training set corresponding to the preset deep reinforcement learning model, wherein the training set comprises attribute information of a plurality of training tasks, attribute information of a plurality of candidate base stations corresponding to each training task, channel estimation information from the terminal equipment corresponding to each training task to each candidate base station, identification information of the training base station corresponding to each training task, and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
and training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model.
7. The method of claim 6, wherein the preset deep reinforcement learning model includes a target actor network, a target critic network and a reward function, and the training the deep reinforcement learning network by taking the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the identification information of the training base station corresponding to the training task, and the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment as input, so as to obtain the preset deep reinforcement learning model comprises:
inputting the attribute information of the training task, the attribute information of a plurality of candidate base stations corresponding to the training task and the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station to an initial actor network, and outputting the identification of the training base station corresponding to the training task;
inputting the attribute information of the training task, the attribute information of the plurality of candidate base stations corresponding to the training task, the channel estimation information from the terminal equipment corresponding to the training task to each candidate base station, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, and the identification of the training base station corresponding to the training task into an initial critic network, performing feature extraction on the input data by using the initial critic network, and outputting a training evaluation value for unloading the training task to the training base station, wherein the training evaluation value is used for characterizing the degree of matching for unloading the training task to the corresponding training base station;
calculating a training reward value corresponding to unloading of the training task to the training base station by using the reward function, wherein the training reward value is used for characterizing the time delay data and energy consumption data corresponding to unloading of the training task to the training base station;
training the initial critic network according to the training reward value to obtain the target critic network;
and training the initial actor network according to the training evaluation value and the training reward value to obtain the target actor network.
8. A large-scale user task offloading apparatus, the apparatus comprising:
a first acquisition module, configured to acquire task attribute information of a target task to be unloaded and a probability distribution of each candidate base station selected by adjacent terminal equipment of the corresponding equipment;
a second acquisition module, configured to acquire attribute information of a plurality of candidate base stations associated with the terminal equipment and channel estimation information between the terminal equipment and each of the candidate base stations;
a determining module, configured to input the task attribute information, the probability distribution of each candidate base station selected by the adjacent terminal equipment of the corresponding equipment, the attribute information of the plurality of candidate base stations, and the channel estimation information between the terminal equipment and each candidate base station into a preset deep reinforcement learning model, determine a target base station corresponding to the target task, and output a target evaluation value corresponding to the identification information of the target base station, wherein the target evaluation value is used for characterizing the degree of matching for unloading the target task to the target base station; the preset deep reinforcement learning model comprises a graph convolutional neural network, wherein the graph convolutional neural network is used for performing feature extraction at least twice on input data of the preset deep reinforcement learning model;
and the unloading module is used for unloading the target task to the target base station.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110783668.8A 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium Active CN113676954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110783668.8A CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113676954A true CN113676954A (en) 2021-11-19
CN113676954B CN113676954B (en) 2023-07-18

Family

ID=78538882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110783668.8A Active CN113676954B (en) 2021-07-12 2021-07-12 Large-scale user task unloading method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113676954B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135094A (en) * 2019-05-22 2019-08-16 长沙理工大学 A kind of virtual plant Optimization Scheduling based on shrink space harmony algorithm
CN110347500A (en) * 2019-06-18 2019-10-18 东南大学 For the task discharging method towards deep learning application in edge calculations environment
CN112367353A (en) * 2020-10-08 2021-02-12 大连理工大学 Mobile edge computing unloading method based on multi-agent reinforcement learning
CN112202928A (en) * 2020-11-16 2021-01-08 绍兴文理学院 Credible unloading cooperative node selection system and method for sensing edge cloud block chain network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024168531A1 (en) * 2023-02-14 2024-08-22 华为技术有限公司 Communication method and apparatus

Also Published As

Publication number Publication date
CN113676954B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN109151864B (en) Migration decision and resource optimal allocation method for mobile edge computing ultra-dense network
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
Li et al. Energy-aware task offloading with deadline constraint in mobile edge computing
CN109951873B (en) Task unloading mechanism under asymmetric and uncertain information in fog computing of Internet of things
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
CN110740473B (en) Management method for mobile edge calculation and edge server
CN107708152B (en) Task unloading method of heterogeneous cellular network
CN113572804B (en) Task unloading system, method and device based on edge collaboration
CN112988285B (en) Task unloading method and device, electronic equipment and storage medium
CN114585006B (en) Edge computing task unloading and resource allocation method based on deep learning
CN111813539A (en) Edge computing resource allocation method based on priority and cooperation
WO2024174426A1 (en) Task offloading and resource allocation method based on mobile edge computing
Huda et al. Deep reinforcement learning-based computation offloading in uav swarm-enabled edge computing for surveillance applications
KR20210147240A (en) Energy Optimization Scheme of Mobile Devices for Mobile Augmented Reality Applications in Mobile Edge Computing
Lakew et al. Adaptive partial offloading and resource harmonization in wireless edge computing-assisted IoE networks
Dai et al. Deep reinforcement learning for edge computing and resource allocation in 5G beyond
CN116455768A (en) Cloud edge end collaborative CNN reasoning method and system for global time delay optimization
CN114698125A (en) Method, device and system for optimizing computation offload of mobile edge computing network
CN113676954A (en) Large-scale user task unloading method and device, computer equipment and storage medium
Li et al. Computation offloading strategy for IoT using improved particle swarm algorithm in edge computing
CN114995990A (en) Method and device for unloading computing tasks, electronic equipment and computer storage medium
CN114025359A (en) Resource allocation and computation unloading method, system, device and medium based on deep reinforcement learning
Hou et al. Cache control of edge computing system for tradeoff between delays and cache storage costs
Xie et al. Backscatter-aided hybrid data offloading for mobile edge computing via deep reinforcement learning
CN116419325A (en) Task unloading, resource allocation and track planning method and system for collaborative calculation of multiple unmanned aerial vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant