CN113407345A - Target-driven calculation unloading method based on deep reinforcement learning - Google Patents

Target-driven calculation unloading method based on deep reinforcement learning

Info

Publication number
CN113407345A
Authority
CN
China
Prior art keywords
node
task
network
computation
unloading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110712564.8A
Other languages
Chinese (zh)
Other versions
CN113407345B (en)
Inventor
韦云凯
王炜中
冷甦鹏
杨鲲
刘强
沈军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202110712564.8A priority Critical patent/CN113407345B/en
Publication of CN113407345A publication Critical patent/CN113407345A/en
Application granted granted Critical
Publication of CN113407345B publication Critical patent/CN113407345B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • G06F9/5088Techniques for rebalancing the load in a distributed system involving task migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a target-driven computation offloading method based on deep reinforcement learning, which is applied to wireless communication fields such as 5G/6G and the Internet of Things and aims to solve the problem of low computation offloading efficiency caused by the failure of the prior art to distinguish task types. A task information enhancement module based on an MoE hybrid expert system is adopted to remarkably improve the expressive power of task information features; the module significantly increases the influence of a task's delay-sensitivity features on the computation offloading decision, thereby improving the discrimination between different types of computation tasks. The reward mechanism of the deep reinforcement learning can be customized for a specific wireless network scenario and adaptively adjusted according to network characteristics.

Description

Target-driven calculation unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of wireless communication such as 5G/6G and the Internet of Things, and particularly relates to a target-driven computation offloading technique.
Background
The development of wireless communication technologies and applications such as 5G/6G and the Internet of Things has driven two trends: (1) the intelligence of networks keeps increasing, so that a large amount of intelligent computation, such as intelligent image recognition and intelligent data analysis, needs to be carried out on network devices; (2) at the same time, the number and scale of network devices are growing rapidly, and the proportion and number of lightweight devices keep increasing. Together, these two trends lead to a direct consequence: the large amount of intelligent computing demand poses a severe challenge to these resource-constrained lightweight devices.
To solve this problem, computation offloading techniques have emerged. In computation offloading, the computation tasks of a lightweight device are transferred to suitable nodes with surplus computation resources, so that the lightweight device offloads its computation to those resource-surplus nodes. In this process, the lightweight device is called the task node, and the resource-surplus node that completes the task is called the computing node.
Based on the destination of the computation result, computation offloading can be divided into two modes: source-driven computation offloading and target-driven computation offloading. In source-driven computation offloading, the computation result is finally returned to the task node, so the problem mainly concerns how to split the offloading proportion between local computation and computation at the computing node, and how to select the offloading node. In target-driven computation offloading, the computation result needs to be delivered to a remote target node, so not only the computation allocation ratio but also a suitable computation offloading path towards the target node must be considered. At present, research on computation offloading in industry and academia can basically be classified as the source-driven mode. In fact, target-driven computation offloading also exists widely in various wireless networks such as 5G/6G and the Internet of Things, but related research is currently very scarce.
Furthermore, target-driven computation offloading faces different requirements in different application scenarios. In a given wireless communication application scenario, different types of computation tasks often coexist, and this diversity usually implies differentiated delay sensitivity: some tasks are of an urgent type with a high degree of delay sensitivity, while others are periodic or ordinary tasks without strict delay requirements. Meanwhile, the energy of wireless communication network nodes is generally limited. Therefore, when the result of computation offloading is not returned to the source node but delivered to another target node, how to reasonably allocate resources and formulate a targeted computation offloading strategy for each type of computation task is of great significance for the efficient operation of computation offloading in 5G/6G, Internet of Things and similar networks.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a target-driven computation offloading method based on deep reinforcement learning, which combines an MoE-based hybrid expert system with a deep reinforcement learning framework, reasonably allocates computation resources and plans an end-to-end offloading path, and maintains load balance and prolongs network survivability while meeting the delay requirements of various types of tasks.
The technical scheme adopted by the invention is as follows: a target-driven computation unloading method based on deep reinforcement learning models a wireless communication scene into a network comprising a source node, a target node, a computation node and a common node, wherein the source node is a computation task issuing node, the target node is a computation task result destination node, the computation node is a computation server node, and the common node is a node for providing relay service;
modeling the computation task offloading process from the source node to the destination node as a Markov decision process, and, starting from the source node, having the current node compute the next-hop selection and the computation offloading strategy through a neural network obtained by deep reinforcement learning until the computation offloading task is completed; the input of the deep reinforcement learning network is the Markov state space, recorded as the observation state, and its output is the optimal computation offloading strategy under the corresponding observation state.
The optimal computation offloading strategy specifically consists of the proportion of the computation task to be offloaded at the current node and the corresponding next-hop node; if the current node is a common node, the offloading proportion is 0.
The reward of the Markov decision process is a function of the overall task delay and the change in energy variance.
The observation state includes task type characteristics and common characteristics, where the task type characteristics are non-numerical characteristics representing task priority or delay sensitivity, and the common characteristics are the other characteristics apart from the task type characteristics.
The method further comprises processing the observation state input to the deep reinforcement learning network with a task information enhancement module, specifically: the task information enhancement module is based on an MoE hybrid expert system, and the MoE hybrid expert system comprises sub-networks and a gating network; the sub-networks comprise a plurality of expert networks, each expert network corresponding to the computation offloading strategy of a current task type, and the input of the expert networks is the common input characteristics; the input of the gating network is the task type characteristics, and its output is the weight corresponding to each expert network's output; the outputs of the expert networks are weighted by the corresponding weights and summed to obtain the output of the MoE hybrid expert system.
The method further includes concatenating the common input features after the output of the MoE hybrid expert system.
The task type characteristics are represented using One-Hot coding.
The deep reinforcement learning network discretizes its continuous-action offloading ratio A_prop into 11 actions from 0.0 to 1.0, which, combined with the number of nodes N, generates an 11 × N two-dimensional discrete action space; the best action screened from this two-dimensional discrete action space is the best next hop and computation offloading strategy.
The system further comprises a central server; according to the <S, A, R, S'> data collected by each node, the central server integrates the global data and trains a deep learning neural network applicable to all nodes, and then transmits the network parameters to each node;
wherein S represents the state space, A represents the action space, R represents the reward, and S' represents the next state space in the Markov transition process.
The method further comprises a training server, which locally simulates and records the target-driven computation offloading process based on the collected state space of the current node, learns the optimal target-driven computation offloading strategy offline, and, after the deep reinforcement learning neural network of the current node has been updated, broadcasts its parameters to the other nodes.
The invention has the following beneficial effects: to meet the target-driven computation offloading demands in wireless communication networks such as 5G/6G and the Internet of Things, the invention provides a target-driven computation offloading mechanism based on deep reinforcement learning, so that computation resources in the wireless communication network can be allocated reasonably, computation offloading decisions can be made specifically for different delay-sensitivity types, and the network lifetime can be prolonged under resource-constrained conditions while task delay is guaranteed. The method has the following advantages:
1. The invention provides a task information enhancement module based on an MoE hybrid expert system, which remarkably improves the expressive power of task information features. Extensive experiments show that, compared with a neural network without the MoE module, the task information enhancement module significantly increases the influence of a task's delay-sensitivity features on the computation offloading decision, thereby improving the discrimination between different types of computation tasks;
2. The reward mechanism of the deep reinforcement learning can be customized for a specific wireless network scenario and adaptively adjusted according to network characteristics: in energy-constrained scenarios it balances the requirements of uniform energy distribution and computation task delay, while in energy-sufficient scenarios it plans computation resources reasonably to guarantee the delay of high-priority computation tasks;
3. the distributed computation unloading mechanism not only ensures the timeliness of the computation unloading strategy, but also reduces the task burden of the computation unloading decision.
Drawings
FIG. 1 is a schematic of the computational offload of the present invention;
FIG. 2 is a flow chart of a single offload decision of the present invention;
FIG. 3 is a schematic diagram of a neural network structure according to the present invention.
Detailed Description
The present invention will now be further described in order to make its objects, technical solutions and advantages more apparent. The proposed mechanism, DRL-DDCO (deep-reinforcement-learning-based destination-driven computation offloading), adopts deep reinforcement learning and jointly considers task delay and network survivability, learning the mapping between the network information environment and the benefit fed back by decisions, thereby realizing a target-driven computation offloading mechanism that is individualized and differentiated for different task types and different network environments.
The overall technical scheme of the invention comprises a deep reinforcement learning framework facing a target-driven computation unloading strategy and a task information enhancement module based on a hybrid expert system (MoE).
In the target-oriented computing offloading deep reinforcement learning framework, a real wireless communication scene is modeled into a network constructed by four types of nodes, namely a computing task issuing node (source node), a computing task result destination node (target node), a computing server node (computing node) and a common node capable of providing relay service.
In the target-driven computation offloading mode, the transmission and offloading of computation tasks are carried out in a coordinated manner, i.e., the task is offloaded while being forwarded, and the forwarding node and offloading strategy are determined hop by hop, based on deep reinforcement learning, at the nodes along the path. That is, the target-driven computation offloading mode designed by the invention forwards, offloads and decides simultaneously as the task travels towards the target.
Specifically, in the target-driven computation offloading process, the mechanism of the invention first models the offloading process as a Markov Decision Process (MDP); on this basis, starting from the source node, the next-hop selection and the computation offloading proportion are obtained by the current node through the deep reinforcement learning neural network, until the computation offloading task is completed. The input of the deep reinforcement learning network is the Markov state space, referred to as the observation state for short, and the output is the optimal next hop and offloading proportion under the corresponding observation state.
In each decision round, as shown in fig. 1, all state transitions, computation offloading strategies and the corresponding rewards received are stored to train the neural network in deep reinforcement learning that fits the mapping from states to action values. The converged neural network has both memory and generalization capability. In the decision process, the neural network can predict the potential subsequent state transitions and thus search for the optimal offloading strategy based on the current state. In this way, as shown in fig. 2, the DRL-DDCO model can progressively compute the optimal computation offloading policy during offloading and revise it according to the network environment and task information.
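By way of illustration only, the following minimal Python sketch shows how stored <S, A, R, S'> transitions could be replayed to compute a Double-DQN training target; the class and variable names are hypothetical and the patent does not publish code, so this is an assumed PyTorch-style realization rather than the claimed implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores <S, A, R, S'> transitions collected during offloading decisions."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def ddqn_loss(online_net: nn.Module, target_net: nn.Module, batch, gamma=0.95):
    """Double-DQN target: the online net picks the argmax action, the target net evaluates it."""
    s, a, r, s_next, done = batch
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        best_next = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_next).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.smooth_l1_loss(q, target)
```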
The task information enhancement module based on the MoE hybrid expert system mainly strengthens the expression of delay-sensitivity features in different types of computation offloading tasks. By combining multiple expert sub-networks, the MoE hybrid expert system can reflect the mapping between the differences among task types and their corresponding decisions, so as to output more expressive task information features; the decision system can then learn a unified action strategy from the differentiated feature information and the reward data fed back by decisions, forming an intelligent offloading decision system for all task types.
The network structure of the present invention is shown in fig. 3, and the main contents include:
1. deep reinforcement learning framework for target-oriented computing unloading strategy
In a wireless communication network, the computation offloading decision problem in the target-driven mode often faces background traffic interference, and it is difficult for a distributed network to make centralized decisions. To address these problems, the invention introduces the Double Deep Q-Network (DDQN) deep reinforcement learning algorithm so that the agent adaptively learns the relation between computation offloading decisions and the target benefit, thereby formulating the computation offloading strategy reasonably.
(1) Reinforcement learning module
The overall model of DRL-DDCO basically follows the reinforcement learning paradigm. Before explaining it, the target-driven computation offloading decision process first needs to be described; it is easy to prove that this process satisfies the Markov property, and the proof is omitted in the invention.
a) Markov Decision Process (MDP)
For reinforcement learning, the target-driven computation offloading scenario first needs to be modeled as a Markov decision process, which includes determining the state space (S), the action space (A), the transition probability (P) and the corresponding reward (R), i.e., the classical quadruple <S, A, P, R>. The transition probability P defaults to 1 in this path-finding problem, because the network is assumed to be reliable and deterministic, and transmission failures or errors are outside the scope of the invention. The other main components are as follows:
S = (I_nearby, T, Topo)
A = (A_node, A_prop)
R = f(D, ΔVar)
where D represents the delay required to complete the task offloading, and ΔVar represents the change in the variance of the residual total energy of the nodes adjacent to the current node. The state space mainly consists of three parts of features: (1) I_nearby represents the locally collected network state, including the number of surrounding nodes and the information on their respective computation resources and energy reserves; (2) T represents the task information received by the computing node, including the data volume, computation load and other characteristics of the computation task; (3) Topo represents the network topology information, including the Dijkstra distance from each node to the task target point, which can be obtained by communication algorithms commonly used in networks. These three features constitute a complete state space and uniquely determine the optimal computation offloading strategy. It can therefore also be shown that the target-driven computation offloading process satisfies the Markov property, i.e., that modeling the whole process as an MDP is valid.
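As a purely illustrative sketch (the field names and dimensions below are assumptions, not the patent's exact encoding), the observation state could be assembled from the three parts I_nearby, T and Topo as follows:

```python
import numpy as np

def build_observation(nearby_nodes, task, dijkstra_dist):
    """Concatenate the three state-space parts: I_nearby, T, Topo.

    nearby_nodes:  list of dicts with 'cpu' and 'energy' per neighbour (assumed fields)
    task:          dict with 'data_volume', 'compute_load', 'type_onehot'
    dijkstra_dist: list of hop distances from each neighbour to the target node
    """
    i_nearby = [len(nearby_nodes)]
    for n in nearby_nodes:
        i_nearby += [n["cpu"], n["energy"]]
    t = [task["data_volume"], task["compute_load"]] + list(task["type_onehot"])
    topo = list(dijkstra_dist)
    return np.array(i_nearby + t + topo, dtype=np.float32)
```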
The action space contains two sub-actions: (1) the selection of the next-hop node A_node, i.e., the next short-term destination of the computation task, which may be the node where the computation offloading is to be performed or may merely act as a relay node forwarding the computation task; (2) the selection of the offloading ratio A_prop at the current node; since it is a ratio, A_prop ∈ [0, 1], and the offloading ratio equals 0 if the action is a pure relay. Moreover, this offloading ratio is computed with respect to the initial demand, a setting which helps the agent distinguish between actions.
b) Reward setting
The reward is a function of the overall task delay and the change in energy variance.
For task T_j (j being a natural number), its delay D_j denotes the total time from the issuing of the task until its computation is completed and the result is delivered to the target node, which includes not only the delay caused by computation offloading but also the data transmission delay and the signal propagation delay. The energy variance change ΔVar is the energy variance after offloading minus the energy variance before offloading; if the variance decreases, the regional energy distribution becomes more uniform, ΔVar < 0, and the agent receives positive feedback, and vice versa. The overall reward is set as follows:

R(D_j, ΔVar) = −α · D_j · s_j − β · ΔVar

where s_j denotes the delay sensitivity of computation task T_j; the higher it is, the more urgent the task. Its specific value needs to be set before training according to the task urgency in the actual application scenario; for example, if the task urgency is divided into 7 levels from 0 to 6, s_j takes the value between 0 and 6 corresponding to the task's urgency. α and β are the reward coefficients for delay and variance, respectively. Usually α + β = 1 is set, their values may be chosen from empirical values, and the parameters are adjusted during training according to the training effect.
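A minimal sketch of this reward, with illustrative parameter values (the function and argument names are assumptions made for clarity):

```python
def reward(delay, delta_var, sensitivity, alpha=0.5, beta=0.5):
    """Reward R(D_j, dVar) = -alpha * D_j * s_j - beta * dVar.

    delay:       overall task delay D_j
    delta_var:   change in residual-energy variance (negative = more uniform)
    sensitivity: delay-sensitivity level s_j (e.g. 0..6)
    alpha, beta: reward coefficients, typically alpha + beta = 1
    """
    return -alpha * delay * sensitivity - beta * delta_var

# e.g. an urgent task (s_j = 6) finished in 0.8 s while making energy more uniform
r = reward(delay=0.8, delta_var=-0.05, sensitivity=6)
```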
Assume that task T_j needs to be transmitted over κ hops (κ being a natural number) in the whole process from the offloading source node to the final destination node, the hops being sequentially denoted h_1, h_2, ..., h_κ, and that the delay introduced by task T_j at the k-th hop (1 ≤ k ≤ κ) is denoted D_j(h_k). The overall delay D_j of task T_j is then:

D_j = Σ_{k=1}^{κ} D_j(h_k)

D_j(h_k) denotes all the delay generated at the h_k-th hop node on the offloading path and comprises four parts: transmission delay, propagation delay, offloading delay and waiting delay, i.e.

D_j(h_k) = D_j^trans(h_k) + D_j^prop(h_k) + D_j^off(h_k) + D_j^wait(h_k)

The transmission delay D_j^trans(h_k) is given by:

D_j^trans(h_k) = L_j(h_k) / V_{N(T_j,h_k)}^trans

where L_j(h_k) is the amount of data that computation task T_j needs to transmit at hop h_k, N(T_j, h_k) is the node corresponding to the h_k-th hop of task T_j, and the denominator V_{N(T_j,h_k)}^trans is the transmission rate at that node.
Before introducing the data volume of the computation task, the yield ratio is defined as

η_j = L_j^res / L_j^init

i.e., the ratio of the data volume L_j^res of the computation result to the initial data volume L_j^init of the task, which characterizes the compression ratio of the task data volume achieved by computation. On this basis, the data volume L_j(h_k) of the computation task at hop h_k is defined as follows: R_j(h_k) denotes the computation load of the task that has not yet been completed after the offloading action, and the initial computation load of the task is denoted C_j^init. When R_j(h_k) = 0, the computation of the task has been completed, and the data volume to be transmitted at the h_k-th hop is the result data volume η_j · L_j^init; otherwise, λ denotes the remaining ratio of the data volume after the task is offloaded, and λ · L_j^init is the remaining data volume of the task after offloading.
The computation offloading delay D_j^off(h_k) is given by:

D_j^off(h_k) = C_j(h_k) / V_{N(T_j,h_k)}^comp

where C_j(h_k) is the computation load offloaded at the h_k-th hop node, which is one of the strategies that needs to be decided, and V_{N(T_j,h_k)}^comp is the computation rate of the node at the h_k-th hop of task T_j.
The propagation delay D_j^prop(h_k) is given by:

D_j^prop(h_k) = W(h_k, h_{k+1}) / v

where W(h_k, h_{k+1}) represents the distance between the h_k-th hop node and the h_{k+1}-th hop node, and v represents the electromagnetic wave propagation speed, generally assumed in the invention to be 2/3 of the speed of light.
Finally, the waiting delay D_j^wait(h_k) arises in multi-task computation offloading scenarios: a good computation offloading allocation algorithm needs to coordinate computation tasks according to how busy the nodes are, so as to reduce D_j^wait(h_k). To record the waiting delay, the invention keeps track of the time points of the relevant events. Let TP_j(h_k) denote the time point at which task T_j arrives at the h_k-th hop node, and let τ(N(T_j,h_k)) denote the remaining time, at that arrival instant, until the node finishes offloading the computation load of the previously accepted task. The waiting delay is then

D_j^wait(h_k) = max(0, τ(N(T_j,h_k)))

When τ(N(T_j,h_k)) ≤ 0, the h_k-th hop node has already finished offloading the computation of the previous task by the time it receives task T_j, so D_j^wait(h_k) = 0. After the computation of task T_j is completed, the time point is updated:

TP_j(h_{k+1}) = TP_j(h_k) + D_j(h_k)
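The per-hop delay model above can be illustrated with the following sketch, which assumes simple scalar inputs and the hypothetical field names shown; it mirrors the four components D_j^trans, D_j^prop, D_j^off and D_j^wait and their sum, not the patent's actual code:

```python
def hop_delay(data_bits, tx_rate, distance_m, offload_load, comp_rate,
              arrival_time, node_free_time, light_frac=2.0 / 3.0):
    """Per-hop delay D_j(h_k) = transmission + propagation + offloading + waiting."""
    d_trans = data_bits / tx_rate                     # L_j(h_k) / V^trans
    d_prop = distance_m / (light_frac * 3.0e8)        # W(h_k, h_k+1) / v
    d_off = offload_load / comp_rate                  # C_j(h_k) / V^comp
    d_wait = max(0.0, node_free_time - arrival_time)  # wait until the node is idle
    return d_trans + d_prop + d_off + d_wait

def total_delay(hops):
    """Overall delay D_j = sum over the hops of D_j(h_k); hops is a list of keyword dicts."""
    return sum(hop_delay(**h) for h in hops)
```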
the energy consumption variance formula is as follows:
△Var(hk)=Varafter-Varbefore
Figure BDA0003133512450000087
Figure BDA0003133512450000088
wherein Λ denotes the Nth (T)j,hk) Set of adjacent nodes, HlThen refers to the remaining energy on node l, and
Figure BDA0003133512450000089
and
Figure BDA00031335124500000810
respectively refer to the average value of residual energy of all nodes in the node set Lambda before and after the unloading action, H (A)node) Representing the remaining energy in the selected node. And a specific calculation task TjThe total energy consumption formula is as follows:
Figure BDA00031335124500000811
where E_j^off(h_k) denotes the energy consumed by computation offloading on the h_k-th hop node. For ease of reading, f_{N(T_j,h_k)} is used below to denote the computation rate of the h_k-th hop node; the computation offloading energy is then

E_j^off(h_k) = ν · (f_{N(T_j,h_k)})² · C_j(h_k)

i.e., the CPU computation rate of the h_k-th hop node is converted into an energy consumption rate, and the energy consumed by offloading is this rate multiplied by the offloaded computation amount C_j(h_k); the coefficient ν is normally set to 10^-11. The transmission energy consumption is set as follows:
E_j^trans(h_k) = p_{N(T_j,h_k)} · D_j^trans(h_k)

That is, the transmission energy consumed on node N(T_j,h_k) equals the energy consumption rate per unit time, i.e., the transmit power p_{N(T_j,h_k)}, multiplied by the transmission delay D_j^trans(h_k). The transmission rate used in the transmission delay is calculated as

V_{N(T_j,h_k)}^trans = W · log2(1 + p_{N(T_j,h_k)} · h / N_0)

where N_0 represents the complex Gaussian white noise variance, h represents the channel gain, and W represents the channel bandwidth.
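For illustration, a sketch of the energy terms under the assumptions stated above (the E_off = ν·f²·C form and the variance computed over the neighbour set follow the description, and the function names are hypothetical):

```python
import statistics

def offload_energy(cpu_freq, offloaded_load, nu=1e-11):
    """Computation offloading energy, assumed model: E_off = nu * f^2 * C."""
    return nu * cpu_freq ** 2 * offloaded_load

def transmission_energy(tx_power, d_trans):
    """Transmission energy: transmit power multiplied by the transmission delay."""
    return tx_power * d_trans

def delta_var(neighbour_energy_before, neighbour_energy_after):
    """Change in residual-energy variance over the neighbour set (negative = more uniform)."""
    return (statistics.pvariance(neighbour_energy_after)
            - statistics.pvariance(neighbour_energy_before))
```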
(2) DNN module
The neural network in deep reinforcement learning mainly serves to associate a large state space with actions and to predict the value of each action in a given state. Using a neural network to fully exploit the deep connection between state features and action features in the target-driven computation offloading scenario not only provides the network with memory, but its generalization capability also allows it to handle newly appearing computation tasks in the network, which well solves the multi-task reuse problem of traditional algorithm models. The neural network structure in the decision framework is shown in the upper half of fig. 3:
a) input processing
The input of the neural network consists of the features of the state space in the MDP, but the data need some preprocessing before being input, because differences in scale and distribution among the data affect the convergence direction of the training process. Likewise, in the execution stage, the data must be processed in the same way before being fed into the network to obtain a result. Numerical features such as available computation resources, task data volume and computation load are normalized to a value between 0 and 1 using the maximum and minimum values; the normalization formula is:

x_norm = (x − x_min) / (x_max − x_min)
for features where the magnitude of some values is not intuitively significant, such as surplus energy, the values must generally be combined with other data around to make a visual sense. The invention therefore discretizes or binarizes similar features, such as marking nodes above the average energy level as 1 and nodes below the average level as 0.
b) Output processing
The output of the neural network is the Q value of each action in the corresponding state; because the action space is two-dimensional, the output is a two-dimensional space. In a complicated target-driven computation offloading scenario, a good action selection may not follow a normal distribution, so unreasonable actions are easily obtained by sampling. The invention therefore adopts a method based on the Double Deep Q-Network (DDQN), discretizing the continuous-action offloading ratio A_prop into 11 actions from 0.0 to 1.0 and combining them with the number of nodes N to obtain an 11 × N two-dimensional discrete action space. The advantage of discretizing the action space is that it increases the robustness of the model against disturbances and, in addition, facilitates action screening.
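For illustration, selecting the best joint action from the 11 × N matrix of Q values could look like the following sketch (array shapes and names are assumptions):

```python
import numpy as np

PROPORTIONS = np.linspace(0.0, 1.0, 11)  # 11 discrete offloading ratios: 0.0, 0.1, ..., 1.0

def best_action(q_values):
    """q_values: array of shape (11, N) holding Q(s, a) for each (ratio, next-hop) pair.

    Returns the offloading ratio and next-hop index of the highest-valued action.
    """
    ratio_idx, node_idx = np.unravel_index(np.argmax(q_values), q_values.shape)
    return PROPORTIONS[ratio_idx], int(node_idx)
```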
For the training process of the decision model, the invention provides two modes: online training and offline training. Online training has the advantage that the computation offloading model can be adjusted in real time according to the scenario characteristics, but it occupies more transmission resources; offline training does not require transmitting large amounts of training data, but it also responds to the scenario more slowly.
The online Training method adopts a CTDE (Centralized Training Distributed Execution) Training module, and specifically comprises the following steps:
in a complex computation offloading scenario, on one hand, a Distributed Training (DT) creates an additional computation burden for each computation server, and because of differences between computation resources and received tasks, model convergence progress between the computation servers is not uniform, and network parameters on nodes with faster convergence speed oscillate due to data provided by other lagging nodes. In other words, the problem that convergence progress is difficult to unify in the distributed training mode can cause the overall network convergence situation to be blocked. Aiming at the situation, the invention provides a working mode of Centralized Training Distributed Execution (CTDE), namely, in the working process, each computing node collects the collected < S, A, R, S' > data set to a certain central server, and then the central server integrates global data and trains a neural network suitable for all the computing nodes. Where S' represents the next state in the markov transition. And finally, the network parameters are transmitted to each computing node by the central training server.
Of course, CTDE also has its own shortcoming: it needs to transfer a large amount of data, including the transition records on each computing node and the iteratively updated network parameters. In industrial Internet of Things scenarios where some links are already strained, this imposes an extra burden on the transmission links. To address this problem, the invention also proposes an alternative, namely offline training. The training process of the neural network then takes place mainly on an external training server: based on the collected server information and network link information, the server locally simulates and records the target-driven computation offloading process, learns the optimal target-driven computation offloading strategy offline from the collected records, and broadcasts the parameters of the neural network to each real node after convergence.
The server information collected here is specifically: the service node provides information about service capabilities, typically including the computing power, transmission capabilities, available energy, topological relationships, etc. of the node.
(4) Action space search optimization module
The Action space Search Optimization (ASO) module is inspired by the forbidden (tabu) options of the ant colony algorithm: before the agent searches the redundant action space, some intuitively invalid action options are screened out according to certain rules, and the corresponding action is then selected from the screened action set. The rules for screening out invalid actions are as follows:
First, screen out actions related to non-adjacent nodes, including nodes that are not adjacent in the original topology and nodes that have failed because their energy is exhausted; passing the computation task to such nodes is clearly unreasonable, so these actions are filtered out.
Second, screen out actions related to nodes already recorded on the computation offloading path, i.e., nodes that have been traversed; even if offloading on such a node were possible, the corresponding computation should have been done when the node was first visited, so the same node should not be traversed twice, which also prevents back-and-forth jumps.
Third, screen out offloading actions that exceed the remaining computation load of the computation task; the offloaded computation amount should match the actual computation amount, so it is more reasonable to constrain such actions by the remaining computation load.
Based on these three criteria, when an action is selected, and when the optimal action for the next state is selected during training updates, some unproductive actions can be removed through this screening, which reduces invalid records in the memory bank and improves its quality. The data in the memory bank is equivalent to a data set, and the quality of the data set directly determines how well the final neural network converges. Experimental results also show that the neural network is difficult to converge in a larger action space if no restrictions are imposed on the selectable actions.
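A sketch of such invalid-action masking applied to the 11 × N Q-value matrix before the argmax (the rule encodings and argument names are illustrative assumptions):

```python
import numpy as np

def mask_invalid_actions(q_values, neighbours, visited, remaining_load, proportions):
    """Set Q-values of intuitively invalid actions to -inf before the argmax.

    q_values:       (len(proportions), N) action-value matrix
    neighbours:     boolean array of length N, True if node i is a live adjacent node
    visited:        boolean array of length N, True if node i is already on the path
    remaining_load: uncompleted computation load, as a ratio of the initial load
    proportions:    the discrete offloading ratios, e.g. np.linspace(0, 1, 11)
    """
    q = q_values.copy()
    q[:, ~np.asarray(neighbours)] = -np.inf                    # rule 1: non-adjacent or dead nodes
    q[:, np.asarray(visited)] = -np.inf                        # rule 2: nodes already traversed
    q[np.asarray(proportions) > remaining_load, :] = -np.inf   # rule 3: offload more than remains
    return q
```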
2. Task information enhancement module based on the MoE hybrid expert system
The hybrid expert system (MoE) is a neural network and belongs to the family of hybrid-network models. It is suitable for situations where different mapping relations coexist within a data set. The neural network corresponding to the MoE module is shown in the lower half of fig. 3. The MoE model mainly consists of two parts: one part is a set of small sub-networks (Experts), recorded as expert networks, each of which can be trained to specialize on part of the data set and thus accurately describe the mapping relation in that data; the other part is a gating network (Manager), which can be composed of a DNN and finally outputs a distribution of probabilities through a SoftMax layer.
The input of the gating network is the task type characteristics, which specifically refer to the non-numerical characteristics of a task representing its priority or delay sensitivity and are usually represented by One-Hot coding; the input of the sub-networks is the common input characteristics, i.e., the remaining characteristics of the task other than the task type characteristics, which are usually numerical characteristics. The common input characteristics and the task type characteristics together form the input features of the neural network, i.e., the state features of the transitions.
Unlike a general neural network, the invention trains multiple models on separated data, each model being called an expert; the gating network is used to choose which experts to use, and the actual output of the MoE is the combination of the expert network outputs weighted by the gating network.
Each expert model can fit a different function (various linear or nonlinear functions), so the MoE model handles well the situation where different data sources require different mapping relations. In the problem addressed by the invention, the MoE can thus handle the different mappings between action values and input states in the decision processes of different task types. Those skilled in the art will appreciate that the multiple expert networks of the sub-networks in the MoE model are employed in the invention to each fit the decision relationship of a single task type.
Finally, because enhancing the task type information may weaken the expression of the common features, the common input features are concatenated after the output of the MoE when making the offloading decision, so as to prevent information loss. The result of this concatenation is taken as the input of the double deep Q-network.
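By way of illustration, a minimal PyTorch-style sketch of such a task information enhancement module is shown below; layer widths, the number of experts and all names are assumptions, not values disclosed by the patent:

```python
import torch
import torch.nn as nn

class TaskInfoMoE(nn.Module):
    """Task-information enhancement: experts over common features, gate over the task-type one-hot."""

    def __init__(self, common_dim, type_dim, hidden_dim=64, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(common_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts))
        self.gate = nn.Sequential(nn.Linear(type_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, common_feats, type_onehot):
        weights = self.gate(type_onehot)                                     # (batch, num_experts)
        outs = torch.stack([e(common_feats) for e in self.experts], dim=1)   # (batch, E, H)
        mixed = (weights.unsqueeze(-1) * outs).sum(dim=1)                    # weighted sum of experts
        # concatenate the raw common features to avoid losing their information,
        # then feed the result to the double deep Q-network
        return torch.cat([mixed, common_feats], dim=-1)
```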
For the MoE hybrid expert system module, the loss function is a weighted combination of the errors of the individual expert outputs, where p_i is the proportion assigned to the i-th expert output under the control of the gate and is produced through a SoftMax layer:

p_i = exp(g_i) / Σ_j exp(g_j)

with g_i denoting the i-th output of the gating network. The update gradients of this model explain why the system allows each expert to learn different parameters. For the gradient of an expert, the larger the proportion p_i assigned to that expert for a given piece of data, the larger its update gradient, a process called "specialization". For the gate, the proportion p_i corresponding to each expert is regulated as follows: if, for some data, the loss of a certain expert's output is higher than the average loss, the corresponding p_i is reduced, indicating that this expert does not predict this part of the data well; conversely, if the expert is good at predicting the mapping relation of this part of the data, its output proportion p_i is increased. From the perspective of the gradient formulas, the MoE system thus realizes the specialization of each expert and the use of different expert combinations for different mapping relations through the differentiation of gradient update directions.
In the computation offloading scenario, different types of computation tasks also have different delay requirements. Some tasks have no strict delay requirement but still need to offload computation because of their own resource limitations; other urgent tasks have very high delay requirements and must keep the task delay low at all times, such as a forest-fire alarm task. Although a single network could distinguish such tasks by adding One-Hot task-level features, the actual effect is small, because for different task state inputs most of the network still shares one set of parameters, so a small number of additional feature inputs have little influence on the network output. The invention therefore adopts the MoE system, which can combine different expert network outputs for different kinds of tasks. Since the task features and network features that each expert network focuses on differ, for example urgent computation tasks demand more computation resources while computation tasks with loose delay requirements care more about the distribution of energy consumption, the MoE-based task information enhancement module enables the decision model to formulate an individualized computation offloading strategy for each type of task.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A target-driven computation unloading method based on deep reinforcement learning is characterized in that a wireless communication scene is modeled into a network comprising a source node, a target node, a computation node and a common node, wherein the source node is a computation task issuing node, the target node is a computation task result destination node, the computation node is a computation server node, and the common node is a node for providing relay service;
modeling a computation task unloading process from a source node to a destination node into a Markov decision process, and calculating a selection and computation unloading strategy of a next hop by a current node through a neural network obtained by deep reinforcement learning from the source node until a computation unloading task is completed; the input of the deep reinforcement learning network is a Markov state space, the Markov state space is recorded as an observation state, and the optimal calculation unloading strategy under the corresponding observation state is output.
2. The method according to claim 1, wherein the calculation strategy specifically includes a ratio of the current node to be unloaded with respect to the calculation task and a next-hop node corresponding thereto, and if the current node is a normal node, the unloading ratio is 0.
3. The method of claim 2, wherein the reward of the Markov decision process is a function of the overall time delay and the variance of energy of the task.
4. The method according to claim 3, wherein the observation state comprises task type characteristics and common characteristics, the task type characteristics being non-numerical characteristics representing task priority or delay sensitivity, and the common characteristics being the other characteristics apart from the task type characteristics.
5. The target-driven computation offloading method based on deep reinforcement learning of claim 4, further comprising processing the observation state input to the deep reinforcement learning network by using a task information enhancement module, specifically: the task information enhancement module is based on a MoE hybrid expert system, and the MoE hybrid expert system comprises: a sub-network and a gate control network; the sub-network comprises a plurality of expert networks, each expert network corresponds to a calculation unloading strategy of the current task type, and the input of the expert networks is a common input characteristic; the input of the gate control network is a task type characteristic, and the output is a weight corresponding to the output of the expert network; and the output of each expert network is respectively subjected to weighted summation with the corresponding weight to obtain the output of the MoE hybrid expert system.
6. The deep reinforcement learning-based target-driven computation offload method according to claim 5, further comprising stitching common input features behind the output of the MoE hybrid expert system.
7. The method for unloading target-driven computation based on deep reinforcement learning of claim 6, wherein the task type features are represented by One-Hot coding.
8. The method according to claim 7, wherein the deep reinforcement learning network discretizes its continuous-action offloading ratio A_prop into 11 actions from 0.0 to 1.0, which, combined with the number of nodes N, generates an 11 × N two-dimensional discrete action space; the best action screened from this two-dimensional discrete action space is the best next hop and computation offloading strategy.
9. The target-driven computation offload method based on deep reinforcement learning of claim 8, further comprising a central server, wherein the central server trains a deep learning neural network applicable to all the computation nodes according to the collected <S, A, R, S'> data and after integrating the global data, and then transmits the network parameters to each computing node;
wherein S represents a state space, A represents an action space, R represents a reward, and S' represents a next state space in the Markov transition process.
10. The method according to claim 9, further comprising a training server for locally simulating and recording a target-driven computation offload process based on the collected state space of the current node, learning an optimal target-driven computation offload strategy offline, and broadcasting parameters of the current node to other nodes after updating the deep-reinforcement-learned neural network.
CN202110712564.8A 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning Active CN113407345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712564.8A CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110712564.8A CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113407345A true CN113407345A (en) 2021-09-17
CN113407345B CN113407345B (en) 2023-12-15

Family

ID=77679545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712564.8A Active CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113407345B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421929A (en) * 2022-11-04 2022-12-02 北京大学 MoE model training method, device, equipment and storage medium
CN115457781A (en) * 2022-09-13 2022-12-09 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN116149759A (en) * 2023-04-20 2023-05-23 深圳市吉方工控有限公司 UEFI (unified extensible firmware interface) drive unloading method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109151077A (en) * 2018-10-31 2019-01-04 电子科技大学 One kind being based on goal-oriented calculating discharging method
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109391681A (en) * 2018-09-14 2019-02-26 重庆邮电大学 V2X mobility prediction based on MEC unloads scheme with content caching
CN111615121A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Ground mobile station multi-hop task calculation unloading processing method
US20210045007A1 (en) * 2019-08-08 2021-02-11 At&T Intellectual Property I, L.P. Management of overload condition for 5g or other next generation wireless network
US20210117860A1 (en) * 2019-10-17 2021-04-22 Ambeent Wireless Method and system for distribution of computational and storage capacity using a plurality of moving nodes in different localities: a new decentralized edge architecture

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109391681A (en) * 2018-09-14 2019-02-26 重庆邮电大学 V2X mobility prediction based on MEC unloads scheme with content caching
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109151077A (en) * 2018-10-31 2019-01-04 电子科技大学 One kind being based on goal-oriented calculating discharging method
US20210045007A1 (en) * 2019-08-08 2021-02-11 At&T Intellectual Property I, L.P. Management of overload condition for 5g or other next generation wireless network
US20210117860A1 (en) * 2019-10-17 2021-04-22 Ambeent Wireless Method and system for distribution of computational and storage capacity using a plurality of moving nodes in different localities: a new decentralized edge architecture
CN111615121A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Ground mobile station multi-hop task calculation unloading processing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEIZHONG WANG等: "Deep Reinforcement Learning Empowered Destination Driven Computation Offloading in IoT", 2020 IEEE 20TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT), pages 834 - 840 *
YUNKAI WEI等: "Destination Driven Computation Offloading in Internet of Things", 2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), pages 1 - 6 *
李波; 黄鑫; 薛端; 侯严严; 裴以建: "Offloading algorithm for vehicular cloud computing based on DTN", Journal of Yunnan University (Natural Sciences Edition), no. 02 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457781A (en) * 2022-09-13 2022-12-09 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN115457781B (en) * 2022-09-13 2023-07-11 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN115421929A (en) * 2022-11-04 2022-12-02 北京大学 MoE model training method, device, equipment and storage medium
CN116149759A (en) * 2023-04-20 2023-05-23 深圳市吉方工控有限公司 UEFI (unified extensible firmware interface) drive unloading method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113407345B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN113407345A (en) Target-driven calculation unloading method based on deep reinforcement learning
Lei et al. Deep reinforcement learning for autonomous internet of things: Model, applications and challenges
Park et al. From population games to payoff dynamics models: A passivity-based approach
Wauters et al. A learning-based optimization approach to multi-project scheduling
Zhang et al. Towards engineering cognitive digital twins with self-awareness
US20090228407A1 (en) Distributed cognitive architecture
Saeedvand et al. Novel hybrid algorithm for Team Orienteering Problem with Time Windows for rescue applications
US20200311560A1 (en) Integrated Intelligence Systems and Processes
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Chen et al. A multi-task-learning-based transfer deep reinforcement learning design for autonomic optical networks
CN114710439A (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
de Oliveira et al. Predicting response time in sdn-fog environments for iiot applications
Senouci et al. Call admission control in cellular networks: a reinforcement learning solution
Lunze et al. Introduction to networked control systems
Legien et al. Agent-based decision support system for technology recommendation
Xuan et al. Multi-agent deep reinforcement learning algorithm with self-adaption division strategy for VNF-SC deployment in SDN/NFV-Enabled Networks
CN115225512A (en) Multi-domain service chain active reconstruction mechanism based on node load prediction
Yadav E-MOGWO Algorithm for Computation Offloading in Fog Computing.
De Zarza et al. Decentralized Platooning Optimization for Trucks: A MILP and ADMM-based Convex Approach to Minimize Latency and Energy Consumption
Lin et al. Hypergraph-Based Autonomous Networks: Adaptive Resource Management and Dynamic Resource Scheduling
Yusupbekov et al. The Challenge of Adaptation in Future Networking Environment: Engineering Methodology
Park et al. Learning with delayed payoffs in population games using Kullback-Leibler divergence regularization
Sefati et al. Meet User’s Service Requirements in Smart Cities Using Recurrent Neural Networks and Optimization Algorithm
Dadone Fuzzy control of flexible manufacturing systems
Chen et al. Distributed Satellite Task Scheduling Models and Methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant