CN113407345B - Target driving calculation unloading method based on deep reinforcement learning - Google Patents

Target driving calculation unloading method based on deep reinforcement learning

Info

Publication number
CN113407345B
Authority
CN
China
Prior art keywords
node
task
network
calculation
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110712564.8A
Other languages
Chinese (zh)
Other versions
CN113407345A (en)
Inventor
韦云凯
王炜中
冷甦鹏
杨鲲
刘强
沈军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou filed Critical Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202110712564.8A priority Critical patent/CN113407345B/en
Publication of CN113407345A publication Critical patent/CN113407345A/en
Application granted granted Critical
Publication of CN113407345B publication Critical patent/CN113407345B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • G06F9/5088Techniques for rebalancing the load in a distributed system involving task migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a target-driven computation offloading method based on deep reinforcement learning, applied to wireless communication fields such as 5G/6G and the Internet of Things, and addresses the low efficiency of computation offloading in the prior art caused by failing to distinguish task types. A task information enhancement module based on a MoE hybrid expert system significantly improves the expressive power of task features and increases the influence of delay-sensitivity features on offloading decisions, thereby sharpening the distinction between different types of computing tasks. The deep reinforcement learning reward mechanism can be customized for a specific wireless network scenario and adaptively adjusted according to network characteristics.

Description

Target driving calculation unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of wireless communication, such as 5G/6G and the Internet of Things, and particularly relates to a target-driven computation offloading technique.
Background
The development of wireless communication technologies and applications such as 5G/6G and the Internet of Things has induced two trends: (1) networks are becoming increasingly intelligent, and network devices need to perform a large amount of intelligent computation, such as intelligent image recognition and intelligent data analysis; (2) at the same time, the number and scale of network devices are growing rapidly, and the proportion and number of lightweight devices keep increasing. Together, these two trends lead to a direct consequence: large intelligent computing demands pose significant challenges to these resource-constrained lightweight devices.
To address this problem, computation offloading techniques have been developed. In computation offloading, the computing tasks of lightweight devices are transferred to suitable, resource-rich nodes, which complete the computation on their behalf. In this process, the lightweight device is referred to as a task node, and the resource-rich node that completes the task is referred to as a computing node.
Based on the destination of the computation result, computation offloading can be divided into two modes: source-driven computation offloading and target-driven computation offloading. In source-driven computation offloading, the computation results are ultimately returned to the task node; such tasks mainly need to consider the allocation of the offloading proportion between local execution and execution at the computing node, and the selection of the offloading node. In target-driven computation offloading, the computation result must be transmitted to a remote target node, so not only the allocation of the computation proportion but also the selection of a suitable offloading path toward the target node must be considered. Currently, most research on computation offloading in industry and academia can essentially be categorized as the source-driven model. In practice, target-driven computation offloading is widely applicable in various wireless networks such as 5G/6G and the Internet of Things, but the related research remains very scarce.
Moreover, target-driven computation offloading faces different application scenarios and may exhibit different requirements. In a specific wireless communication application scenario there are often different types of computing tasks, and this diversity usually implies differences in delay sensitivity: some computing tasks are urgent and highly delay-sensitive, while others are periodic or ordinary tasks with no strict delay requirement, and the nodes hosting them are typically energy-constrained. Therefore, when the offloading result must be delivered to another target node rather than returned to the source node, reasonably allocating resources and formulating an offloading strategy tailored to each type of computing task is of great significance for the efficient operation of computation offloading in 5G/6G, Internet-of-Things and similar networks.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a target-driven computation offloading method based on deep reinforcement learning, which combines a MoE-based hybrid expert system with a deep reinforcement learning framework to reasonably allocate computing resources and plan end-to-end offloading paths, so that load balance is maintained and network survivability is prolonged while the delay requirements of various tasks are met.
The technical scheme adopted by the invention is as follows: a target-driven computation offloading method based on deep reinforcement learning, which models a wireless communication scenario as a network comprising source nodes, target nodes, computing nodes and common nodes, wherein a source node is a node that issues computing tasks, a target node is the destination node of the computation result, a computing node is a computing server node, and a common node is a node that provides relay service;
the computation task offloading process from the source node to the target node is modeled as a Markov decision process; starting from the source node, the current node computes the next-hop selection and the offloading strategy through a neural network trained by deep reinforcement learning, until the offloading task is completed; the input of the deep reinforcement learning network is the Markov state space, referred to as the observed state, and the output is the optimal next-hop and computation offloading strategy under the corresponding observed state.
The optimal offloading strategy specifically consists of the proportion of the computation task to be offloaded at the current node and the corresponding next-hop node; if the current node is a common node, the offloading proportion is 0.
The reward of the Markov decision process is a function of the overall task delay and the change in energy variance.
The observed state includes task type features and common input features: the task type features are non-numerical features representing task priority or delay sensitivity, and the common input features are the remaining features after the task type features are removed.
The method further comprises processing the observed state that is input to the deep reinforcement learning network with a task information enhancement module, specifically: the task information enhancement module is based on a MoE hybrid expert system, and the MoE hybrid expert system comprises sub-networks and a gating network; the sub-networks comprise a plurality of expert networks, each expert network corresponding to the offloading strategy of one task type, and the input of each expert network is the common input features; the input of the gating network is the task type features, and its output is the weight for the output of each expert network; the outputs of the expert networks, weighted by the corresponding weights and summed, form the output of the MoE hybrid expert system.
The method further comprises concatenating the common input features after the output of the MoE hybrid expert system.
The task type features are represented by One-Hot encoding.
The deep reinforcement learning network discretizes its continuous offloading-proportion action A_prop into 11 actions from 0.0 to 1.0 and, combined with the node count N, generates an 11 × N two-dimensional discrete action space; the best action screened from this two-dimensional discrete action space gives the best next hop and computation offloading strategy.
The method further comprises a central server, wherein the central server integrates global data from the <S, A, R, S'> data collected by each node and trains a deep learning neural network applicable to all nodes; the network parameters are then transmitted to each node;
where S represents the state space, A represents the action space, R represents the reward, and S' represents the next state space in the Markov transition process.
The method further comprises a training server, which locally simulates and records the target-driven computation offloading process based on the collected state space of the current node, learns the optimal offloading strategy offline, and broadcasts the parameters of the updated deep-reinforcement-learned neural network of the current node to the other nodes.
The invention has the following beneficial effects: to meet the target-driven computation offloading requirements in wireless communication networks such as 5G/6G and the Internet of Things, the invention provides a target-driven computation offloading mechanism based on deep reinforcement learning, so that computing resources in the wireless network can be allocated reasonably, offloading decisions can be tailored to different delay-sensitivity types, and the network lifetime can be prolonged in resource-constrained scenarios while task delay requirements are guaranteed; the method has the following advantages:
1. The task information enhancement module based on the MoE hybrid expert system significantly improves the expressive power of task features; extensive experiments show that, compared with a neural network without the MoE module, it significantly increases the influence of the delay-sensitivity features on offloading decisions, thereby sharpening the distinction between different types of computing tasks;
2. The reward mechanism of the deep reinforcement learning can be customized for a specific wireless network scenario and adaptively adjusted according to network characteristics: in energy-constrained scenarios it balances the requirements of uniform energy distribution and task delay, and in energy-sufficient scenarios it plans computing resources reasonably to guarantee the delay of high-priority computing tasks;
3. The distributed computation offloading mechanism not only ensures the timeliness of offloading policies, but also reduces the burden of offloading decisions.
Drawings
FIG. 1 is a schematic diagram of a computational offload of the present invention;
FIG. 2 is a single offload decision flow chart of the present invention;
FIG. 3 is a schematic diagram of a neural network according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is further described below. DRL-DDCO (deep-reinforcement-learning-based destination-driven computation offloading) adopts deep reinforcement learning, jointly considers task delay and network survivability, learns the mapping between the network environment and the benefit fed back by decisions, and realizes a personalized and differentiated target-driven computation offloading mechanism for different task types and different network environments.
The overall technical scheme of the invention consists of a deep reinforcement learning framework oriented to the target-driven computation offloading strategy and a task information enhancement module based on a hybrid expert system (Mixture of Experts, MoE).
In the deep reinforcement learning framework oriented to target-driven computation offloading, the real wireless communication scenario is modeled as a network built from four types of nodes: nodes that issue computing tasks (source nodes), destination nodes of the computation results (target nodes), computing server nodes (computing nodes), and common nodes that can provide relay service.
In the target-driven computation offloading mode, the transmission and offloading of computing tasks proceed cooperatively, i.e., "offloading while forwarding"; the forwarding node and the offloading policy are determined hop by hop, based on deep reinforcement learning, at the nodes along the traversed path. That is, the target-driven computation offloading mode designed by the invention realizes an "offload while forwarding, decide while moving" pattern.
Specifically, in the target-driven computation offloading process, the mechanism first models the offloading process as a Markov decision process (Markov Decision Process, MDP); on this basis, starting from the source node, the current node computes the next-hop selection and the offloading proportion through the deep-reinforcement-learned neural network until the offloading task is completed. The input of the deep reinforcement learning network is the Markov state space, referred to simply as the observed state, and the output is the optimal next hop and offloading proportion under the corresponding observed state.
In one decision round, as shown in FIG. 1, all state transitions, offloading policies and the corresponding rewards received are stored and used to train the neural network in deep reinforcement learning that fits the mapping between states and action values. The converged neural network has the ability to memorize and to generalize: during decision making, it can predict the subsequent potential state transitions from the current state in order to search for the optimal offloading policy. In this way, as shown in FIG. 2, the DRL-DDCO model can progressively compute the optimal offloading policy during the offloading process and correct it according to the network environment and task information.
The task information enhancement module based on the MoE hybrid expert system mainly strengthens the expression of the delay-sensitivity features in different types of offloading tasks. By combining several expert sub-networks, the MoE hybrid expert system can reflect the different mappings between different task types and their corresponding decisions, and thus outputs more expressive task features; the decision system can then learn a unified action policy from these differentiated features and the reward feedback of decisions, forming an intelligent offloading decision system covering all task types.
The network structure of the invention is shown in FIG. 3; the main contents include:
1. deep reinforcement learning framework oriented to target drive computing unloading strategy
In a wireless communication network, target-driven computation offloading decisions often face problems such as background traffic interference and the difficulty of centralized decision making in a distributed network. To address these problems, the invention introduces a double deep Q-network (DDQN) algorithm so that the agent can adaptively learn the relation between offloading decisions and the target gain, and thereby formulate reasonable offloading strategies.
(1) Reinforcement learning module
The overall model of DRL-DDCO essentially conforms to the reinforcement learning paradigm. Before explaining the reinforcement learning formulation, it should be noted that the target-driven offloading decision process can be shown fairly easily to satisfy the Markov property; the proof is therefore omitted in the present invention.
a) Markov decision process (MDP, Markov Decision Process)
For reinforcement learning, the target-driven computation offloading scenario first needs to be modeled as a Markov decision process, which requires determining a state space (S), an action space (A), a transition probability (P) and a corresponding reward (R), i.e., the classical quadruple <S, A, P, R>. For this path-finding class of problem the transition probability P defaults to 1, because the network is assumed to be reliable and transmission failures or errors are outside the scope of the invention. The other main components are as follows:
S = (I_nearby, T, Topo)
A = (A_node, A_prop)
R = f(D, ΔVar)
where D represents the delay required to complete the offloaded task and ΔVar represents the change in the variance of the total residual energy of the nodes neighboring the current node. The state space consists of three groups of features: (1) I_nearby represents the locally collected network state, including the number of surrounding nodes, their respective computing resources and their energy reserves; (2) T represents the task information received by the computing node, including the data amount, the computation amount and other characteristics of the computing task; (3) Topo represents the network topology information, including the Dijkstra distance between each node and the target node, which can be obtained through commonly used network communication algorithms. These three groups of features constitute a complete state space and uniquely determine the optimal offloading strategy; it can therefore also be shown that the target-driven offloading process satisfies the Markov property, i.e., that modeling the whole process as an MDP is valid.
The action space consists of two sub-actions: (1) selection of the next-hop node A_node, i.e., the next short-term destination of the computing task; the selected node may carry out computation offloading there, or merely act as a relay that forwards the task; (2) selection of the offloading proportion A_prop at the current node; being a proportion, A_prop ∈ [0, 1], and the proportion equals 0 if the node only relays. The offloading proportion is defined with respect to the initially required computation amount, which helps the agent distinguish between actions.
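As a minimal illustrative sketch (not part of the claimed method), the observed state and the two-part action described above could be represented as follows; all class and field names are assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class ObservedState:
    neighbor_info: List[float]  # I_nearby: per-neighbor computing resources and residual energy
    task_info: List[float]      # T: data amount, computation amount, delay-sensitivity level, ...
    topo_info: List[float]      # Topo: Dijkstra distance from each neighbor to the target node

    def to_vector(self) -> List[float]:
        # Concatenate the three feature groups into the network input vector
        return self.neighbor_info + self.task_info + self.topo_info

@dataclass
class OffloadAction:
    next_hop: int      # A_node: index of the selected next-hop node
    proportion: float  # A_prop in [0, 1]; 0 means the node only relays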
b) Reward setting
The reward is a function of the overall task delay and the change in energy variance.
For a task T_j (j a natural number), its delay D_j is the total time required from the release of the task until the computation is completed and the result is delivered to the target node; this includes not only the delay caused by computation offloading but also the data transmission delay and the signal propagation delay. The change in energy variance is the energy variance after offloading minus the energy variance before offloading: if the variance decreases and the regional energy distribution becomes more uniform, ΔVar < 0 and the agent receives positive feedback, and vice versa. The overall reward is set as follows:
R(D_j, ΔVar) = -α · D_j · s_j - β · ΔVar
where s_j represents the delay-sensitivity level of computing task T_j: the higher it is, the more urgent the task. Its specific value depends on the actual application scenario and is set according to the urgency of the task before training; for example, if task urgency is divided into 7 levels from 0 to 6, s_j takes the value between 0 and 6 corresponding to the task's urgency. α and β denote the reward coefficients of the delay and of the variance change, respectively. Typically α + β = 1 is set; α and β can be chosen from empirical values and adjusted according to the training effect during training.
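As an illustrative sketch of the reward above (the function name and the default coefficient values are assumptions, with α + β = 1 as in the description):

def reward(delay: float, sensitivity: int, delta_var: float,
           alpha: float = 0.5, beta: float = 0.5) -> float:
    # delay: overall task delay D_j; sensitivity: urgency level s_j (e.g. 0..6)
    # delta_var: change of the residual-energy variance around the current node
    return -alpha * delay * sensitivity - beta * delta_var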
Assume that task T_j requires κ hops in total from the offloading source node to the final target node, marked in order as h_1, h_2, ..., h_κ, and denote by D_j(h_k) the delay introduced at the k-th hop (1 ≤ k ≤ κ). The overall delay D_j of task T_j is then:
D_j = Σ_{k=1}^{κ} D_j(h_k)
where D_j(h_k) refers to all the delay generated at the k-th node on the offloading path and consists of four parts: transmission delay, propagation delay, offloading delay and waiting delay. The transmission delay D_j^tx(h_k) is:
D_j^tx(h_k) = L_j(h_k) / V^tx(N(T_j, h_k))
where L_j(h_k) represents the amount of data that task T_j has to transmit at the k-th hop, N(T_j, h_k) represents the node corresponding to the k-th hop of task T_j, and the denominator V^tx(N(T_j, h_k)) is the transmission rate at that node.
Before introducing the amount of task data, the yield is defined: the yield λ is the ratio of the data amount of the result of a computing task to the initial data amount of the task, i.e., the compression ratio of the task data after the computation is completed. On this basis, the amount of task data at the k-th hop is defined as:
L_j(h_k) = (R_j(h_k) / C_j^init) · L_j^init + (1 - R_j(h_k) / C_j^init) · λ · L_j^init
where R_j(h_k) refers to the amount of computation of the task that has not yet been completed after the offloading action, C_j^init is the initial computation amount of the task, and L_j^init is its initial data amount. When R_j(h_k) = 0, i.e., the task computation has been completed, the amount of data to be transmitted at h_k is λ · L_j^init, where λ is the remaining ratio of the data amount after the task is offloaded and λ · L_j^init is the residual data amount after offloading.
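A small sketch of this per-hop data amount, assuming the linear mix of raw and compressed data in the formula above (all names are illustrative):

def hop_data_amount(initial_data: float, initial_comp: float,
                    remaining_comp: float, yield_ratio: float) -> float:
    # The portion not yet computed travels as raw data; the computed portion
    # is compressed by the yield ratio (lambda)
    raw_fraction = remaining_comp / initial_comp
    return raw_fraction * initial_data + (1 - raw_fraction) * yield_ratio * initial_data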
The computation offloading delay D_j^off(h_k) is:
D_j^off(h_k) = C_j(h_k) / V_j^cmp(h_k)
where C_j(h_k) refers to the amount of computation offloaded at the k-th hop node, which is one of the quantities the policy has to decide, and V_j^cmp(h_k) represents the computation rate of task T_j at the k-th hop node.
The propagation delay D_j^prop(h_k) is:
D_j^prop(h_k) = W(h_k, h_{k+1}) / v
where W(h_k, h_{k+1}) represents the distance between the k-th hop node and the (k+1)-th hop node, and v represents the electromagnetic-wave propagation speed, which in the present invention is usually taken to be 2/3 of the speed of light.
Finally, the waiting delay D_j^wait(h_k): in a multi-task offloading scenario, a good offloading allocation algorithm needs to coordinate computing tasks according to how busy the nodes are in order to reduce D_j^wait(h_k). To record the waiting delay, the invention keeps track of a time stamp TP for each hop. TP_j(h_k) represents the time at which task T_j arrives at the k-th hop node, and τ(N(T_j, h_k)) represents the time at which that node finishes offloading the computation of the previous task. When τ(N(T_j, h_k)) - TP_j(h_k) ≤ 0, the k-th hop node has already completed the offloading of the previous task when it receives T_j, and then D_j^wait(h_k) = 0; otherwise D_j^wait(h_k) = τ(N(T_j, h_k)) - TP_j(h_k). After task T_j is handled at the node, the time stamp is updated:
TP_j(h_{k+1}) = TP_j(h_k) + D_j(h_k)
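Putting the four parts together, a sketch of the per-hop delay D_j(h_k) described above (all parameter names are illustrative assumptions):

def hop_delay(data_amount: float, tx_rate: float,
              offloaded_comp: float, comp_rate: float,
              distance: float, wave_speed: float,
              arrival_time: float, node_free_time: float) -> float:
    d_tx = data_amount / tx_rate                       # transmission delay
    d_prop = distance / wave_speed                     # propagation delay (v ~ 2/3 of light speed)
    d_off = offloaded_comp / comp_rate                 # offloading delay
    d_wait = max(0.0, node_free_time - arrival_time)   # waiting delay
    return d_tx + d_prop + d_off + d_wait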
The change of the residual-energy variance is computed as follows:
ΔVar(h_k) = Var_after - Var_before
where Var_before and Var_after are the variances of the residual energy of the nodes in Λ before and after the offloading action, e.g. Var_before = (1/|Λ|) · Σ_{l∈Λ} (H_l - H̄_before)²; Λ denotes the set of nodes adjacent to N(T_j, h_k), H_l refers to the residual energy of node l, H̄_before and H̄_after are the mean residual energy of all nodes in the set Λ before and after the offloading action, and H(A_node) represents the residual energy of the selected node. The total energy consumption of computing task T_j is specified as follows:
E_j = Σ_{k=1}^{κ} [E_j^cmp(h_k) + E_j^tx(h_k)]
where E_j^cmp(h_k) represents the energy consumed by computation offloading at the k-th hop node. For readability, let f(h_k) denote the computation rate of the k-th hop node; the computation offloading energy is expressed as:
E_j^cmp(h_k) = ν · f(h_k)² · C_j(h_k)
where f(h_k) is the CPU computation rate of the k-th hop node, ν · f(h_k)² is the energy consumed per unit of computation at that rate, and the energy consumed by computation offloading equals this rate multiplied by the offloaded computation amount C_j(h_k); the coefficient ν is usually set to 10^-11. The transmission energy is set as follows: the transmission energy E_j^tx(h_k) consumed at node N(T_j, h_k) equals the energy consumed per unit time (the transmission power) multiplied by the transmission delay D_j^tx(h_k); in the formula for the transmission rate used there, N_0 represents the variance of the complex Gaussian white noise, h represents the channel gain, and W represents the channel bandwidth.
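A sketch of this per-hop energy model and of the ΔVar computation (names are illustrative assumptions; ν = 10^-11 as in the description):

from statistics import pvariance
from typing import Dict

NU = 1e-11  # coefficient of the computation-energy model

def hop_energy(cpu_rate: float, offloaded_comp: float,
               tx_power: float, tx_delay: float) -> float:
    e_cmp = NU * cpu_rate ** 2 * offloaded_comp   # computation offloading energy
    e_tx = tx_power * tx_delay                    # transmission energy
    return e_cmp + e_tx

def delta_var(energy_before: Dict[int, float], energy_after: Dict[int, float]) -> float:
    # Change of residual-energy variance over the neighbor set; a negative value
    # (energy becoming more uniform) yields positive reward feedback
    return pvariance(energy_after.values()) - pvariance(energy_before.values())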
(2) DNN module
The neural network in deep reinforcement learning mainly serves to associate the large state space with actions and to predict the value of each action in a given state. The neural network is used to fully mine the deep connection between the state features and the action features in the target-driven offloading scenario and to exploit the memorization function of the network; its generalization ability also allows it to handle computing tasks newly appearing in the network, which well alleviates the model-reuse problem of traditional algorithms. The neural network structure in the decision framework is shown in the upper half of FIG. 3:
a) Input processing
The input of the neural network is the features of the state space in the MDP, but some preprocessing of the data is required before input, because differences in dimension and distribution between the data affect the direction in which the training process converges. Similarly, in the execution stage the data must be processed in the same way before being fed into the network to obtain a result. For some numerical features, such as available computing resources, the task data amount and the computation amount, the values are normalized to a value between 0 and 1 using the maximum and minimum values; the normalization formula is as follows:
x' = (x - x_min) / (x_max - x_min)
For features whose raw magnitude is not intuitively meaningful on its own, such as residual energy, the values generally only become meaningful when combined with the surrounding data. The invention therefore discretizes or binarizes such features, for example marking nodes above the average energy level as 1 and nodes below the average level as 0.
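A sketch of this input preprocessing (function names are illustrative assumptions):

from typing import List

def min_max_normalize(x: float, x_min: float, x_max: float) -> float:
    # Map a numerical feature to [0, 1]; guard against a degenerate range
    return 0.0 if x_max == x_min else (x - x_min) / (x_max - x_min)

def binarize_energy(levels: List[float]) -> List[int]:
    # Mark nodes above the neighborhood average energy as 1, others as 0
    avg = sum(levels) / len(levels)
    return [1 if e > avg else 0 for e in levels]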
b) Output processing
The output of the neural network is the Q value of each action in the corresponding state; because the action space is two-dimensional, the output is also a two-dimensional space. For complex target-driven offloading scenarios, the optimal action selection can hardly be expected to follow a normal distribution perfectly, so unreasonable actions would very easily be drawn during sampling. The invention therefore adopts a method based on the double deep Q-network (Double Deep Q Network, DDQN): the continuous offloading-proportion action A_prop is discretized into 11 actions from 0.0 to 1.0 and, combined with the node count N, an 11 × N two-dimensional discrete action space is created. Discretizing the action space not only increases the robustness of the model against interference but also makes active action screening convenient.
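As an illustrative sketch of the 11 × N discrete action space and of greedy selection over it (shapes and names are assumptions, using PyTorch):

import torch

PROPORTIONS = [i / 10.0 for i in range(11)]  # 0.0, 0.1, ..., 1.0

def best_action(q_values: torch.Tensor):
    # q_values: tensor of shape (11, N) holding Q(s, a) for every
    # (offloading proportion, next-hop node) pair; returns the greedy action
    flat_index = torch.argmax(q_values).item()
    prop_idx, node_idx = divmod(flat_index, q_values.shape[1])
    return PROPORTIONS[prop_idx], node_idx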
For the training of the decision model, the invention provides two modes: online training and offline training. The advantage of online training is that the offloading model can be adjusted in real time according to the scene characteristics, but it occupies larger transmission resources; offline training does not need to transmit large amounts of training data, but reacts somewhat more slowly to the scene.
For online training, the invention adopts a CTDE (Centralized Training, Distributed Execution) training module, specifically:
in a complex computing offload scenario, the distributed training method (Distributed Training, DT) generates additional computing burden for each computing server, and because of the difference between computing resources and received tasks, the model convergence progress between computing servers is not uniform, and network parameters on nodes with faster convergence speed oscillate due to data provided by other lagging nodes. In other words, the problem that the convergence progress is difficult to unify in the distributed training mode may cause the overall network convergence situation to be blocked. Aiming at the situation, the invention provides a working mode of adopting Centralized Training Distributed Execution (CTDE), namely, during the working process, collected < S, A, R, S' > data sets are summarized to a certain central server by each computing node, and then the central server integrates global data and trains a neural network applicable to all computing nodes. Where S' represents the next state in the markov transition process. And finally, the network parameters are transmitted to each computing node by the central training server.
Of course, CTDE also has its own shortcoming: it needs to propagate large amounts of data, including the transition records and the iterated network parameters of each computing node. In industrial Internet-of-Things scenarios where some links are scarce, this may impose an additional burden on the transmission links. For this problem the invention also proposes an alternative, namely offline training. In this case the training of the neural network mainly takes place on an external training server: based on the collected server information and network link information, the server simulates and records the target-driven offloading process locally, learns the optimal offloading strategy offline from the collected records, and broadcasts the parameters of the neural network to each real node after the network has converged.
The server information collected here specifically refers to the information about the service capability provided by the service nodes, typically including the computing capacity, transmission power, available energy and topology of each node.
(4) Action space search optimization module
The action space search optimization (Action Search Optimization, ASO) module is inspired by the tabu option of the ant colony algorithm: before the agent explores a redundant action space, certain rules are used to screen out some intuitively invalid action options, and the corresponding action is then selected from the filtered action set. The screening rules for invalid actions are as follows:
(1) Actions related to non-adjacent nodes are filtered out, including nodes that are not adjacent in the original topology and nodes that have failed because their energy is exhausted. Passing a computing task to such a node is obviously unreasonable, so these actions are filtered out;
(2) Actions related to nodes already recorded on the offloading path, i.e., nodes that have already been traversed, are screened out. Even if offloading at such a node were possible, the corresponding computation should have been completed the first time the node was traversed, so the same node should not be traversed twice; this prevents hopping back and forth;
(3) Actions whose offloading amount exceeds the remaining computation of the task are screened out. The executed offloading action has to match the actual computation amount, so taking the remaining computation as the bound of the offloading action is reasonable.
Based on these three criteria, when selecting actions during execution and when selecting the best action of the next state during training and updating, some worthless actions can be removed by this manual screening, which reduces invalid records in the memory bank and improves its quality. The data in the memory bank is equivalent to the data set, and the quality of the data set directly determines how well the final neural network converges. Experimental results also show that, if the selected actions are not restricted, the neural network is difficult to converge in a larger action space.
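A sketch of the three screening rules applied as a mask over the 11 × N Q-value table (all names are illustrative assumptions, using PyTorch):

import torch

def mask_invalid_actions(q_values: torch.Tensor, adjacent: torch.Tensor,
                         visited: torch.Tensor, remaining_comp: float,
                         comp_per_step: float) -> torch.Tensor:
    # q_values: (11, N); adjacent/visited: bool tensors of shape (N,);
    # comp_per_step: computation amount corresponding to one 0.1 proportion step
    masked = q_values.clone()
    masked[:, ~adjacent] = float('-inf')        # rule 1: non-adjacent or failed nodes
    masked[:, visited] = float('-inf')          # rule 2: nodes already on the path
    for p in range(masked.shape[0]):            # rule 3: offloading more than what remains
        if p * comp_per_step > remaining_comp:
            masked[p, :] = float('-inf')
    return masked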
2. Task information enhancement module based on MoE hybrid expert system
The hybrid expert system (MoE) is a neural network and belongs to the family of hybrid network models. It is suited to handling data sets in which different mapping relations coexist. The neural network corresponding to the MoE module is shown in the lower half of FIG. 3. The MoE model consists of two parts: one part is a set of smaller sub-networks (Experts), called expert networks, which can be trained in a specialized way on parts of the data set and can therefore accurately describe the mapping relation within that data; the other part is a gating network (Manager), which may be a DNN and finally outputs a distribution of probabilities through a SoftMax layer.
The input of the gating network is the task type features, which specifically refer to the non-numerical features in the task representing task priority or delay sensitivity and are usually represented by One-Hot codes; the input of the sub-networks is the common input features, i.e., the remaining features of the task after the task type features are removed, which are usually numerical features; the common input features and the task type features together form the input features of the neural network, i.e., the state features of the transition.
Each expert network in the sub-networks outputs a mapping result of the same dimension, and the gating network outputs the corresponding weight of each result. The difference from an ordinary neural network is that several models are trained on separated data, each model is called an expert, the gating network is used to choose which experts to use, and the actual output of the MoE is the combination of the expert-network outputs weighted by the gating network.
Each expert model can fit a different function (various linear or nonlinear functions), so the MoE model handles well the problem of different mapping relations caused by different data sources. In the present invention, MoE handles well the different mapping relations between action values and input states in the decision processes of different task types.
Those skilled in the art will appreciate that, in the present invention, each of the multiple expert networks among the sub-networks of the MoE model is employed to fit the decision relationship of a single type of task.
Finally, because enhancing the task type information can weaken the expression of the common features, the common input features are concatenated after the output of the MoE during the offloading decision, in order to prevent loss of information. The result of this concatenation is used as the input of the double deep Q-network.
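As an illustration of this structure, a minimal PyTorch sketch of the task information enhancement module (expert networks over the common features, a gating network over the One-Hot task type, weighted summation, and concatenation of the common features); layer sizes and names are assumptions, not the patented implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskInfoEnhancement(nn.Module):
    def __init__(self, common_dim: int, task_type_dim: int,
                 expert_out_dim: int, num_experts: int):
        super().__init__()
        # One expert per task type; each maps the common features to an enhanced embedding
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(common_dim, 64), nn.ReLU(),
                          nn.Linear(64, expert_out_dim))
            for _ in range(num_experts))
        # Gating network: One-Hot task type -> one weight per expert (via SoftMax)
        self.gate = nn.Linear(task_type_dim, num_experts)

    def forward(self, common_feats: torch.Tensor, task_type_onehot: torch.Tensor):
        weights = F.softmax(self.gate(task_type_onehot), dim=-1)                   # (B, E)
        expert_outs = torch.stack([e(common_feats) for e in self.experts], dim=1)  # (B, E, D)
        mixed = torch.sum(weights.unsqueeze(-1) * expert_outs, dim=1)              # weighted sum
        # Concatenate the common features after the MoE output, as the DDQN input
        return torch.cat([mixed, common_feats], dim=-1)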
For the MoE hybrid expert system module, its loss function is the gating-weighted sum of the individual expert losses:
E = Σ_i p_i · E_i,  with E_i = ||y - o_i||²
where o_i is the output of the i-th expert, y is the target output, and p_i is the proportion of the i-th expert output by the SoftMax layer under the control of the gate:
p_i = exp(g_i) / Σ_j exp(g_j)
where g_i is the gating network output for the i-th expert. Why does the system let each expert learn different parameters? The answer lies in the update gradients of this model. For the gradient of an expert:
∂E/∂o_i = -2 · p_i · (y - o_i)
As can be seen from this gradient formula, for a given piece of data, the expert with the larger proportion p_i also receives the larger update gradient; this process is called "specialization". For the gate, which controls the proportion p_i of each expert, the gradient formula is:
∂E/∂g_i = p_i · (E_i - E)
This gradient formula shows that, for some data, if the loss of an expert's output is higher than the weighted average loss, the corresponding p_i is reduced, i.e., that expert does not predict this part of the data well; conversely, if an expert predicts the mapping relation of this part of the data well, its output proportion p_i is increased. Seen from the formulas, it is the differentiation of the gradient update directions that lets the MoE system achieve the specialization of each expert and different expert combinations for different mapping relations.
In the computation offloading scenario, different types of computing tasks also have inconsistent delay requirements. Some tasks have no delay requirement but still need to offload their computation because of their own resource limitations; other urgent tasks have very strict delay requirements and the task delay must be kept low at all times, similar to the alarm task of a forest fire. The same network could distinguish them by adding a task-level One-Hot feature, but the actual effect is minimal, because the vast majority of the network still shares one set of parameters for the different task-state inputs, and a small number of feature inputs does not greatly change the network output. The invention therefore proposes to use the MoE system, which can combine different expert-network outputs for different classes of tasks. Because the task features that each expert network focuses on differ (for example, urgent computing tasks place greater demands on computing resources, while computing tasks with loose delay requirements pay more attention to the energy-consumption distribution), the MoE-based task information enhancement module enables the decision model to formulate personalized offloading strategies for different types of tasks.
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and it should be understood that the scope of the invention is not limited to such specific statements and embodiments. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (8)

1. A target-driven computation offloading method based on deep reinforcement learning, characterized in that a wireless communication scenario is modeled as a network comprising source nodes, target nodes, computing nodes and common nodes, wherein a source node is a node that issues computing tasks, a target node is the destination node of the computation result, a computing node is a computing server node, and a common node is a node providing relay service;
the computation task offloading process from the source node to the target node is modeled as a Markov decision process; starting from the source node, the current node computes the next-hop selection and the offloading strategy through a neural network trained by deep reinforcement learning, until the offloading task is completed; the input of the deep reinforcement learning network is the Markov state space, referred to as the observed state, and the output is the optimal computation offloading strategy under the corresponding observed state;
the observed state includes task type features and common input features: the task type features are non-numerical features representing task priority or delay sensitivity, and the common input features are the remaining features after the task type features are removed;
the method further comprises processing the observed state that is input to the deep reinforcement learning network with a task information enhancement module, specifically: the task information enhancement module is based on a MoE hybrid expert system, and the MoE hybrid expert system comprises sub-networks and a gating network; the sub-networks comprise a plurality of expert networks, each expert network corresponding to the offloading strategy of one task type, and the input of each expert network is the common input features; the input of the gating network is the task type features, and its output is the weight for the output of each expert network; the outputs of the expert networks, weighted by the corresponding weights and summed, form the output of the MoE hybrid expert system.
2. The target-driven computation offloading method based on deep reinforcement learning according to claim 1, wherein the offloading strategy specifically consists of the proportion of the computation task to be offloaded at the current node and the corresponding next-hop node, and the offloading proportion is 0 if the current node is a common node.
3. The target-driven computation offloading method based on deep reinforcement learning according to claim 2, wherein the reward of the Markov decision process is a function of the overall task delay and the change in energy variance.
4. The target-driven computation offloading method based on deep reinforcement learning according to claim 3, further comprising concatenating the common input features after the output of the MoE hybrid expert system.
5. The target-driven computation offloading method based on deep reinforcement learning according to claim 4, wherein the task type features are represented using One-Hot encoding.
6. The target-driven computation offloading method based on deep reinforcement learning according to claim 5, wherein the deep reinforcement learning network discretizes its continuous offloading-proportion action A_prop into 11 actions from 0.0 to 1.0 and, combined with the node count N, generates an 11 × N two-dimensional discrete action space; the best action screened from this two-dimensional discrete action space gives the best next hop and computation offloading strategy.
7. The target-driven computation offloading method based on deep reinforcement learning according to claim 6, further comprising a central server, wherein the central server integrates global data from the <S, A, R, S'> data collected by each computing node to train a deep learning neural network applicable to all computing nodes, and then transmits the network parameters to each computing node;
where S represents the state space, A represents the action space, R represents the reward, and S' represents the next state space in the Markov transition process.
8. The target-driven computation offloading method based on deep reinforcement learning according to claim 7, further comprising a training server, which locally simulates and records the target-driven computation offloading process based on the collected state space of the current node, learns the optimal offloading strategy offline, and broadcasts the parameters of the updated deep-reinforcement-learned neural network of the current node to the other nodes.
CN202110712564.8A 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning Active CN113407345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712564.8A CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110712564.8A CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113407345A CN113407345A (en) 2021-09-17
CN113407345B true CN113407345B (en) 2023-12-15

Family

ID=77679545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712564.8A Active CN113407345B (en) 2021-06-25 2021-06-25 Target driving calculation unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113407345B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925778B (en) * 2022-06-10 2024-08-09 安徽工业大学 Reinforcement learning optimization method, method and device for large discrete action space
CN115457781B (en) * 2022-09-13 2023-07-11 内蒙古工业大学 Intelligent traffic signal lamp control method based on multi-agent deep reinforcement learning
CN115421929A (en) * 2022-11-04 2022-12-02 北京大学 MoE model training method, device, equipment and storage medium
CN116149759B (en) * 2023-04-20 2023-07-14 深圳市吉方工控有限公司 UEFI (unified extensible firmware interface) drive unloading method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109151077A (en) * 2018-10-31 2019-01-04 电子科技大学 One kind being based on goal-oriented calculating discharging method
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109391681A (en) * 2018-09-14 2019-02-26 重庆邮电大学 V2X mobility prediction based on MEC unloads scheme with content caching
CN111615121A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Ground mobile station multi-hop task calculation unloading processing method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11032735B2 (en) * 2019-08-08 2021-06-08 At&T Intellectual Property I, L.P. Management of overload condition for 5G or other next generation wireless network
US11977961B2 (en) * 2019-10-17 2024-05-07 Ambeent Wireless Method and system for distribution of computational and storage capacity using a plurality of moving nodes in different localities: a new decentralized edge architecture

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109391681A (en) * 2018-09-14 2019-02-26 重庆邮电大学 V2X mobility prediction based on MEC unloads scheme with content caching
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109151077A (en) * 2018-10-31 2019-01-04 电子科技大学 One kind being based on goal-oriented calculating discharging method
CN111615121A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Ground mobile station multi-hop task calculation unloading processing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning Empowered Destination Driven Computation Offloading in IoT;Weizhong Wang等;2020 IEEE 20th International Conference on Communication Technology (ICCT);第834-840页 *
Destination Driven Computation Offloading in Internet of Things;Yunkai Wei等;2019 IEEE Global Communications Conference (GLOBECOM);第1-6页 *
基于DTN的车载云计算卸载算法 (Computation offloading algorithm for vehicular cloud computing based on DTN);李波;黄鑫;薛端;侯严严;裴以建;云南大学学报(自然科学版)(第02期);全文 *

Also Published As

Publication number Publication date
CN113407345A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113407345B (en) Target driving calculation unloading method based on deep reinforcement learning
Lei et al. Deep reinforcement learning for autonomous internet of things: Model, applications and challenges
Qi et al. Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach
Lei et al. A multi-action deep reinforcement learning framework for flexible Job-shop scheduling problem
Wang et al. Adaptive and large-scale service composition based on deep reinforcement learning
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN114896899B (en) Multi-agent distributed decision method and system based on information interaction
CN117787186B (en) Multi-target chip layout optimization method based on hierarchical reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Li et al. Multi-swarm cuckoo search algorithm with Q-learning model
CN116126534A (en) Cloud resource dynamic expansion method and system
CN115225512B (en) Multi-domain service chain active reconfiguration mechanism based on node load prediction
Yadav E-MOGWO Algorithm for Computation Offloading in Fog Computing.
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
Tariq et al. Dynamic Resource Allocation in IoT Enhanced by Digital Twins and Intelligent Reflecting Surfaces
Chouikhi et al. Energy-Efficient Computation Offloading Based on Multi-Agent Deep Reinforcement Learning for Industrial Internet of Things Systems
Khan et al. Communication in Multi-Agent Reinforcement Learning: A Survey
CN113435475A (en) Multi-agent communication cooperation method
Park et al. Learning with delayed payoffs in population games using Kullback-Leibler divergence regularization
Yao et al. Performance Optimization in Serverless Edge Computing Environment using DRL-Based Function Offloading
Tang et al. Joint Optimization of Vehicular Sensing and Vehicle Digital Twins Deployment for DT-Assisted IoVs
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
CN112464104B (en) Implicit recommendation method and system based on network self-cooperation
CN118170154B (en) Unmanned aerial vehicle cluster dynamic obstacle avoidance method based on multi-agent reinforcement learning
CN118784547A (en) Route optimization method based on graph neural network and deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant