CN117852621A - Module combined model-free computing and unloading method and device in multi-environment MEC - Google Patents
- Publication number
- CN117852621A CN117852621A CN202410017052.3A CN202410017052A CN117852621A CN 117852621 A CN117852621 A CN 117852621A CN 202410017052 A CN202410017052 A CN 202410017052A CN 117852621 A CN117852621 A CN 117852621A
- Authority
- CN
- China
- Prior art keywords
- environment
- model
- strategy
- description information
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
Abstract
The application provides a module-combined model-free computation offloading method and device in a multi-environment MEC, and relates to the field of machine learning. In the method, a policy device acquires environment description information and current state description information of a target edge computing environment, and calls a pre-trained strategy model to process the environment description information and the state description information, obtaining a task offloading strategy that considers both the target edge computing environment and the state description information. Therefore, unlike a conventional strategy model that generates the task offloading strategy only from the current state description information, the current environment description information is also incorporated, so that the generated task offloading strategy takes both the target edge computing environment and the states in that environment into consideration.
Description
Technical Field
The application relates to the field of machine learning, in particular to a model-free computing and unloading method and device for module combination in a multi-environment MEC.
Background
The proliferation of User Equipment (UE) has led to a wide spread of mobile applications, such as some computationally intensive and latency-focused applications: mobile payment, online gaming, smart medicine, augmented reality, and the like. User devices are mostly equipped with limited computational resources and energy budgets, which results in a large gap between the capabilities of the user device and the application requirements. With the help of fast-evolving high-speed wireless communication technologies today, mobile edge computing (Mobile Edge Computing, MEC) is an effective way to alleviate this problem by offloading the user equipment's tasks to a Base Station (BS) where powerful edge computing resources are deployed nearby for computing.
One of the key issues of MEC is task offloading, making dynamic decisions (e.g., offloading tasks, transmit power) based on time-varying MEC states (e.g., task requirements, energy budget, radio conditions), improving the computational efficiency of mobile applications. In order to make the best decision for computational offloading, conventional approaches are mainly based on mathematical programming development, which depends to a large extent on the reliability of the MEC system model. Heuristic search-based methods (e.g., genetic algorithms and particle swarm optimization) are methods that enable near optimal computational offloading in the case of unpredictable system models when they are not available or reliable. However, with the rapid development of wireless network speeds, these approaches have become increasingly difficult to find efficient offload solutions in a short time. Thus, when MEC states and offloading decisions have high dimensionality, it is becoming mainstream to develop model-free computational offloading methods using reinforcement learning or more commonly deep reinforcement learning.
In the practical process, the efficient calculation unloading method based on the deep reinforcement learning under various scenes developed at present is found, such as multi-intelligent user equipment MEC, multi-base station MEC, vehicle auxiliary MEC, satellite integrated MEC and the like; most of these methods only consider compute offload in a single MEC environment with constant bandwidth, edge Server (ES) capacity, task type, etc. However, these methods cannot accommodate various highly diverse MEC environments in real-world scenarios.
Disclosure of Invention
In order to overcome at least one of the defects in the prior art, the application provides a model-free computing unloading method and device for a module combination in a multi-environment MEC, which specifically comprise the following steps:
in a first aspect, the present application provides a method for model-free computing offload of module combinations in a multi-environmental MEC, the method comprising:
acquiring environment description information and current state description information of a target edge computing environment;
and calling a pre-trained strategy model to process the environment description information and the state description information to obtain a task unloading strategy considering both the target edge computing environment and the state description information.
With reference to the optional implementation manner of the first aspect, the policy model includes a plurality of policy layers connected in series, each policy layer includes a plurality of sub-models that are independent of each other, and the invoking the pre-trained policy model processes the environment description information and the state description information to obtain a task offloading policy that considers the target edge computing environment and the state description information, including:
inputting state embedded features of the state description information into the plurality of policy layers;
for any adjacent layer in a plurality of strategy layers, screening the characteristics output by each sub-model in the previous strategy layer according to the environment embedded characteristics of the environment description information and the state embedded characteristics, and determining the input characteristics of each sub-model in the next strategy layer;
And determining a task unloading strategy considering both the target edge computing environment and the state description information according to the output results of the strategy layers.
With reference to the optional implementation manner of the first aspect, the filtering the output feature of each sub-model in the previous policy layer according to the environment embedded feature of the environment description information and the state embedded feature, to determine the input feature of each sub-model in the next policy layer includes:
generating a weight vector of each sub-model in the next strategy layer according to the environment embedded feature of the environment description information and the state embedded feature;
and weighting the output characteristics of each sub-model in the previous strategy layer according to the weight vector of each sub-model in the next strategy layer to obtain the input characteristics of each sub-model in the next strategy layer.
With reference to the optional implementation manner of the first aspect, the policy model further includes a plurality of weight layers connected in series, where the plurality of weight layers are in one-to-one correspondence with portions of the plurality of policy layers; the generating a weight vector of each sub-model in the next policy layer according to the environment embedded feature of the environment description information and the state embedded feature comprises the following steps:
Acquiring fusion characteristics between the state embedded characteristics and environment embedded characteristics;
inputting the fusion features into the plurality of weight layers;
and for any adjacent layer in the weight layers, multiplying the weight vector output by the previous weight layer with the fusion characteristic, and inputting the multiplied weight vector into the next weight layer to obtain the weight vector of the strategy layer corresponding to the next weight layer.
With reference to the optional implementation manner of the first aspect, the acquiring a fusion feature between the state embedded feature and the environment embedded feature includes:
multiplying the state embedded feature by the environment embedded feature element by element to obtain the fusion feature.
With reference to the optional implementation manner of the first aspect, the policy model further includes a first encoder and a second encoder, and the method further includes:
processing the state description information through the first encoder to obtain a state embedding feature of the state description information;
and processing the environment description information through the second encoder to obtain the environment embedded feature of the environment description information.
With reference to the optional implementation manner of the first aspect, the method further includes a training method of the policy model, where the training method includes:
Acquiring a plurality of strategy models to be trained and an evaluation model to be trained of each strategy model to be trained, wherein the strategy models to be trained respectively correspond to different edge computing environments;
for each strategy model to be trained, interacting with a corresponding edge computing environment through the strategy model to be trained to obtain task unloading experience aiming at the current state of the edge computing environment, and caching the task unloading experience into an experience pool;
sampling experience from the experience pool after the task unloading experience collected by the experience pool meets a preset condition, and updating the plurality of strategy models to be trained and the corresponding evaluation models to be trained according to the sampling experience;
and if the plurality of strategy models to be trained and the corresponding evaluation models to be trained do not reach the convergence condition, returning to the step of, for each strategy model to be trained, interacting with the corresponding edge computing environment through the strategy model to be trained to obtain task unloading experience for the current state of the edge computing environment, until the convergence condition is met, and taking the strategy models to be trained after the final iteration as the pre-trained strategy models.
With reference to the optional implementation manner of the first aspect, updating the plurality of policy models to be trained and the corresponding evaluation models to be trained according to the sampling experience includes:
and respectively updating a plurality of strategy layers and a plurality of weight layers in each strategy model to be trained alternately according to the sampling experience, and updating the evaluation model to be trained corresponding to each strategy model to be trained.
With reference to the optional implementation manner of the first aspect, the task offloading experience includes environment description information of an edge computing environment corresponding to the policy model to be trained.
In a second aspect, the present application further provides a module-combined model-free computation offloading device in a multi-environment MEC, the device comprising:
the information acquisition module is used for acquiring environment description information and current state description information of the target edge computing environment;
and the strategy generation module is used for calling a pre-trained strategy model to process the environment description information and the state description information so as to obtain a task unloading strategy considering the target edge computing environment and the state description information.
With reference to the optional implementation manner of the second aspect, the policy model includes a plurality of policy layers connected in series, each policy layer includes a plurality of sub-models independent from each other, and the policy generation module is further specifically configured to:
Inputting state embedded features of the state description information into the plurality of policy layers;
for any adjacent layer in a plurality of strategy layers, screening the characteristics output by each sub-model in the previous strategy layer according to the environment embedded characteristics of the environment description information and the state embedded characteristics, and determining the input characteristics of each sub-model in the next strategy layer;
and determining a task unloading strategy considering both the target edge computing environment and the state description information according to the output results of the strategy layers.
With reference to the optional implementation manner of the second aspect, the policy generation module is further specifically configured to:
generating a weight vector of each sub-model in the next strategy layer according to the environment embedded feature of the environment description information and the state embedded feature;
and weighting the output characteristics of each sub-model in the previous strategy layer according to the weight vector of each sub-model in the next strategy layer to obtain the input characteristics of each sub-model in the next strategy layer.
With reference to the optional implementation manner of the second aspect, the policy model further includes a plurality of weight layers connected in series, where the plurality of weight layers are in one-to-one correspondence with portions of the plurality of policy layers; the policy generation module is further specifically configured to:
Acquiring fusion characteristics between the state embedded characteristics and environment embedded characteristics;
inputting the fusion features into the plurality of weight layers;
and for any adjacent layer in the weight layers, multiplying the weight vector output by the previous weight layer with the fusion characteristic, and inputting the multiplied weight vector into the next weight layer to obtain the weight vector of the strategy layer corresponding to the next weight layer.
With reference to the optional implementation manner of the second aspect, the policy generation module is further specifically configured to:
multiplying the state embedded feature by the environment embedded feature element by element to obtain the fusion feature.
With reference to the optional implementation manner of the second aspect, the policy model further includes a first encoder and a second encoder, and the policy generation module is further configured to:
processing the state description information through the first encoder to obtain a state embedding feature of the state description information;
and processing the environment description information through the second encoder to obtain the environment embedded feature of the environment description information.
With reference to the optional implementation manner of the second aspect, a training method of the policy model is further involved, and the apparatus further includes:
an experience collection module, configured to obtain a plurality of strategy models to be trained and an evaluation model to be trained for each strategy model to be trained, wherein the plurality of strategy models to be trained correspond to different edge computing environments respectively;
the experience collection module is further configured to, for each strategy model to be trained, interact with the corresponding edge computing environment through the strategy model to be trained to obtain task unloading experience for the current state of the edge computing environment, and cache the task unloading experience into an experience pool;
the model updating module is used for sampling experience from the experience pool after the task unloading experience collected by the experience pool meets a preset condition, and updating the plurality of strategy models to be trained and the corresponding evaluation models to be trained respectively according to the sampling experience;
and the model updating module is further configured to, if the plurality of strategy models to be trained and the respective corresponding evaluation models to be trained do not reach the convergence condition, return to interacting with the corresponding edge computing environment through each strategy model to be trained to obtain task unloading experience for the current state of the edge computing environment, until the convergence condition is satisfied, and take the strategy models to be trained after the final iteration as the pre-trained strategy models.
With reference to the optional implementation manner of the second aspect, the model updating module is further specifically configured to:
and respectively updating a plurality of strategy layers and a plurality of weight layers in each strategy model to be trained alternately according to the sampling experience, and updating the evaluation model to be trained corresponding to each strategy model to be trained.
With reference to the optional implementation manner of the second aspect, the task offloading experience includes environment description information of the edge computing environment corresponding to the policy model to be trained.
Compared with the prior art, the application has the following beneficial effects:
the embodiment provides a model-free computing unloading method and device for module combination in a multi-environment MEC. In the method, a policy device acquires environment description information and current state description information of a target edge computing environment; and calling a pre-trained strategy model to process the environment description information and the state description information to obtain a task unloading strategy considering both the target edge computing environment and the state description information. Therefore, compared with the conventional strategy model, the task unloading strategy is generated only according to the current state description information, and the current environment description information is combined, so that the generated task unloading strategy takes the target edge computing environment and the states in the environment into consideration.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a model structure according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of training principles provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Icon: 101-base station; 102-an edge server; 103-user equipment; 201-an information acquisition module; 202-a policy generation module; 301-memory; 302-a processor; 303-a communication unit; 304-a system bus.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present application, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As introduced in the background, efficient deep-reinforcement-learning-based computation offloading methods have been developed for various scenarios, such as multi-smart-user-equipment MEC, multi-base-station MEC, vehicle-assisted MEC, satellite-integrated MEC, and the like; however, most of these methods only consider computation offloading in a single MEC environment with constant bandwidth, Edge Server (ES) capacity, task type, etc. As a result, these methods cannot accommodate the various highly diverse MEC environments of real-world scenarios. The problems are specifically expressed as follows:
(1) Experience exploration and learning is inefficient because much experience may be required to learn deep reinforcement learning based offloading strategies even in a single MEC environment.
(2) Training of deep-reinforcement-learning-based offloading strategies is disturbed during experience learning, because gradients derived from exploration experience in different MEC environments may contradict each other, which results in a significant degradation of computation offloading performance.
Based on the findings of the above technical problems, the inventors have made creative efforts to propose the following technical solutions to solve or improve the above problems. It should be noted that the above prior art solutions have drawbacks, which are obtained by the inventor after practice and careful study, and therefore the discovery process of the above problems and the solutions presented in the following embodiments of the present application for the above problems should be all contributions of the inventor to the present application during the inventive process, and should not be construed as technical matters known to those skilled in the art.
Given that this application relates to reinforcement learning in the machine learning field, the possibly related specialized concepts are explained below in order to make the scheme described later easier to understand.
Reinforcement learning, a machine learning method, aims to let an intelligent system interact with the environment to learn how to make optimal decisions. In reinforcement learning, the intelligent system is called an "agent" and takes a series of actions to maximize the future cumulative reward by observing the state of the environment and the reward signal. It will be appreciated that the core idea of reinforcement learning is that agents learn through trial and error: the agent initially knows nothing about the environment and, by interacting with it and observing the results, gradually learns which actions yield higher rewards. The agent uses a function called a "policy" to decide which action should be taken in a given state. The policy may be deterministic or probabilistic. In summary, the core components of reinforcement learning include:
environment (Environment): the external environment with which the agent interacts may be a real physical environment or a virtual simulated environment.
State (State): the current observations of the environment are used to characterize the environment.
Action (Action): the actions taken by the agent in a given state.
Rewards (Reward): A signal the agent obtains from the environment as a result of its action, used to evaluate the quality of that action.
Policy (Policy): the manner in which the agent selects the action based on the current state may be a deterministic map or a probabilistic distribution.
Value Function (Value Function): A function that measures the long-term value of a certain state or state-action pair for the agent. It is used to evaluate the quality of a policy.
Model (Model): an internal model of the environment for predicting state transitions and reward signals of the environment.
The goal of reinforcement learning is to optimize the strategy or value function to maximize the cumulative rewards that an agent receives during interaction with the environment. The development of reinforcement learning has progressed through value function methods, policy gradient methods, the Actor-Critic architecture, and deep reinforcement learning, which are exemplarily described below:
(1) Value function methods, the dominant approach in early reinforcement learning. The most classical algorithm is Q-learning, which learns the optimal strategy by iteratively updating the value function of state-action pairs. The value function method focuses on evaluating the value of an action and improves the decision by selecting the action with the highest value.
(2) Policy gradient methods. Building on value function methods, attention turned to how to optimize the policy directly. Unlike the value function method, the policy gradient method directly learns a parameterized representation of the policy, rather than inferring the policy indirectly by learning a value function. The core idea of the policy gradient approach is to update the parameters of the policy along the gradient of the expected return so that the policy produces high-return behavior in the environment. Specifically, the policy gradient method is trained by:
Defining a strategy: first, a parameterized policy function, such as a neural network, is selected. The policy function receives the environment state as input and outputs a probability distribution over the possible actions.
Collecting samples: the current strategy is used to interact with the environment to generate a series of trajectories of states, actions, and rewards. The Monte Carlo method is typically used for sampling, i.e., samples are collected by interacting with the environment multiple times.
Calculating the gradient: for each sample trajectory, its corresponding gradient is calculated. The computation of the gradient uses importance sampling techniques to adjust the gradient according to the actions selected in the trajectory and their probability under the strategy.
Updating the strategy: the gradients of all sample trajectories are averaged and the parameters of the strategy are updated with a gradient step; the goal is to maximize the expected reward.
Iterative training: steps 2 to 4 are repeated, and the strategy is gradually improved through interaction with the environment and parameter updates until a predetermined stopping condition is reached (e.g., convergence or the maximum number of iterations); a minimal code sketch of this training loop is given after this overview.
(3) Actor-Critic architecture. Building on the value function method and the strategy gradient method, the reinforcement learning method of the Actor-Critic architecture combines the advantages of both. The Actor-Critic architecture divides the agent into two parts: an Actor responsible for generating actions and a Critic responsible for evaluating the value of those actions. The actor generates an action according to the strategy, and the critic provides a feedback signal through an estimated value function. In the Actor-Critic architecture, the actor updates its policy parameters by gradient ascent to maximize the long-term cumulative reward, while the critic uses the value function to evaluate the value of the action and provides gradient signals to the actor through updates of the value function. This architecture allows the agent to evaluate the value function and optimize the strategy at the same time during learning, thereby making decision improvement more efficient.
(4) Deep reinforcement learning (Deep Reinforcement Learning). With the advent of deep learning methods, reinforcement learning also began to combine with deep neural networks, forming deep reinforcement learning. Deep reinforcement learning uses deep neural networks to approximate value functions or policy functions, so that high-dimensional, complex state and action spaces can be handled. Typical algorithms include the Deep Q-Network (DQN), Deep Deterministic Policy Gradient (DDPG), and the like.
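Following the policy-gradient training steps listed in item (2) above, the loop below is a minimal, illustrative REINFORCE-style sketch. The environment interface (env.reset() and env.step() returning a (state, reward, done) triple), the network sizes, and the hyperparameters are assumptions for illustration, not part of this application:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Parameterized policy: maps a state to a distribution over actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def train_reinforce(env, policy, episodes=1000, gamma=0.99, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        # (2) collect one trajectory with the current policy
        logps, rewards = [], []
        state, done = env.reset(), False
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            logps.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())  # assumed env API
            rewards.append(reward)
        # (3) discounted return for every step of the trajectory
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        # (4) gradient step that maximizes the expected return
        loss = -(torch.stack(logps) * returns).sum()
        opt.zero_grad(); loss.backward(); opt.step()
```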
Based on the description of reinforcement learning in the foregoing embodiments, this embodiment mainly relates to an improvement of the existing deep reinforcement learning method. Before the improvement of this embodiment is specifically described, the edge computing environment involved is further described below. As shown in FIG. 1, the edge computing environment includes a base station 101. The base station 101 is associated with an edge server 102, whose computing frequency is f_ES, and with U user devices 103, whose computing frequency is f_UE. The user equipment 103 may be, but is not limited to, a smartphone, a laptop computer, a smart watch, and the like. The user device 103 may offload computing tasks to the edge server 102 for computation or may perform the computation locally. In this regard, a task offloading policy specifying local execution or offloading to the edge server 102 may be formulated by the policy device for each user device 103 based on preset constraints.
Assuming that the communication bandwidth of the system is B MHz, the system time is divided into N equal-length time periods of fixed duration (in seconds). Each user equipment 103, denoted u, generates a computing task at the beginning of the nth time period, expressed as:
where the first element represents the data size of the computing task, c represents the CPU cycles required to complete the computing task, and τ represents the maximum tolerable delay of the computing task.
In addition, the present embodiment assumes that the computing task is inseparable and introduces a binary variable to represent whether the computing task is executed locally at the user equipment u or offloaded to the edge server for edge computing. The available energy budget of the user equipment u in the nth time period is denoted e_UE, covering the energy consumed by local computation or by the wireless transmission of the computing task. Both the user equipment u and the edge server deploy task queues to prevent task loss, and their queue sizes at the beginning of the nth time period are recorded respectively.
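For readability, the task model described above can be summarized as follows under assumed notation (the symbols Λ_u^n, d_u^n, and x_u^n are illustrative and not necessarily the symbols used in this embodiment):

```latex
% Illustrative task model under assumed notation.
\[
  \Lambda_u^n = \bigl(d_u^n,\; c,\; \tau\bigr), \qquad x_u^n \in \{0,1\},
\]
where $d_u^n$ is the input data size of the task, $c$ the CPU cycles required to
complete it, $\tau$ its maximum tolerable delay, and $x_u^n = 1$ means the task is
offloaded to the edge server while $x_u^n = 0$ means it is executed locally on UE $u$.
```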
(1) Based on the above edge computing environment, the present embodiment provides the following communication model of the edge computing environment:
The block-fading model used in this embodiment represents the channel gain from the user equipment u to the base station in the nth time period, and the calculation formula is as follows:
where |·| denotes taking the modulus of the expression inside the symbol, the small-scale fading component can be modeled as a first-order Gaussian-Markov process according to the Jakes model, and the large-scale fading component includes path loss and log-normal shadowing. According to the LTE standard, the large-scale fading is modeled as:
where the distance term represents the distance of the user equipment u from the base station in the nth time period, and z represents a log-normal shadowing random variable.
Further, in this embodiment, the signal-to-interference-plus-noise ratio of the uplink from the user equipment u to the base station in the nth time period is calculated as follows:
where the radio transmit power of the user equipment u in the nth time period should not exceed the maximum transmit power P_UE, and the noise term represents additive white Gaussian noise.
In the nth time period, the transmission rate from the user equipment u to the base station is calculated as follows:
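As an illustration, under the standard block-fading and Shannon-rate assumptions commonly used in MEC modeling (the symbols h_u^n, g_u^n, p_u^n, γ_u^n, and σ² below are assumptions rather than the notation of this embodiment), these quantities typically take the form:

```latex
% Illustrative communication model under assumed notation.
\begin{aligned}
  h_u^n &= \tilde{h}_u^n\,\sqrt{g_u^n}
    &&\text{(small-scale fading $\tilde{h}_u^n$ times large-scale fading $g_u^n$)},\\
  \gamma_u^n &= \frac{p_u^n\,\lvert h_u^n\rvert^2}{\sigma^2},
    \qquad 0 \le p_u^n \le P_{\mathrm{UE}}
    &&\text{(uplink SINR, bounded transmit power)},\\
  r_u^n &= B \log_2\!\bigl(1 + \gamma_u^n\bigr)
    &&\text{(achievable uplink transmission rate)}.
\end{aligned}
```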
(2) Based on the above communication model, the present embodiment further provides an edge computing model of the edge computing environment:
Assume that the binary variable defined above indicates offloading; then the computing task will be offloaded from the user device u to the edge server. The task delay of offloading to the edge server includes the transmission duration, the queuing duration at the edge server, and the execution duration, whose respective calculation formulas are as follows:
where one term denotes the size of the tasks that are queued ahead of the computing task and still waiting to be executed by the edge server; its calculation formula is as follows:
This comprises the queue size at the beginning of the nth time period together with the data size of the tasks offloaded before the computing task. In the above expression, the indicator takes the value 1 if its condition holds, and 0 otherwise.
Combining the transmission duration, the queuing duration, and the execution duration of offloading to the edge server, the task delay of edge computing is calculated as follows:
Correspondingly, the energy consumption of the user equipment u is:
(3) Based on the communication model and the edge computing model, this embodiment also provides the following local computing model of the edge computing environment:
Assume instead that the binary variable indicates local execution; then the computing task will run locally at the user equipment u. Similarly, the task delay of local computation includes the queuing duration and the execution duration, whose respective calculation formulas are as follows:
Combining the local queuing duration and execution duration, the task delay of local computation is expressed as follows:
Correspondingly, the energy consumption of the user equipment u is:
where ζ depends on the energy efficiency coefficient of the chip architecture of the user equipment u.
In the nth time period, integrating the above computing models, the task delay of the computing task and the energy consumption of the user equipment u are calculated as follows:
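As an illustration, a commonly used form of this delay/energy model (under the same assumed notation as above; the queue terms q_u^n and q_ES^n are likewise assumptions, not the patent's symbols) is:

```latex
% Illustrative delay and energy model under assumed notation.
\begin{aligned}
  t_u^{n,\mathrm{off}} &= \frac{d_u^n}{r_u^n} + \frac{q_{\mathrm{ES}}^n}{f_{\mathrm{ES}}}
      + \frac{c\,d_u^n}{f_{\mathrm{ES}}},
  &\quad e_u^{n,\mathrm{off}} &= p_u^n\,\frac{d_u^n}{r_u^n},\\
  t_u^{n,\mathrm{loc}} &= \frac{q_u^n}{f_{\mathrm{UE}}} + \frac{c\,d_u^n}{f_{\mathrm{UE}}},
  &\quad e_u^{n,\mathrm{loc}} &= \zeta\,c\,d_u^n\,f_{\mathrm{UE}}^2,\\
  t_u^n &= x_u^n\,t_u^{n,\mathrm{off}} + (1-x_u^n)\,t_u^{n,\mathrm{loc}},
  &\quad e_u^n &= x_u^n\,e_u^{n,\mathrm{off}} + (1-x_u^n)\,e_u^{n,\mathrm{loc}}.
\end{aligned}
```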
In summary, the variables involved in this embodiment can be divided into time-invariant external variables and time-varying internal variables, where the time-invariant external variables constitute the environment description information υ_env:
υ_env = {P_task, B, f_UE, f_ES}
where P_task represents the probability distribution of the task input data size. This embodiment distinguishes different edge computing environments by attaching subscripts to υ_env.
The time-varying internal variables are collected into a set; each element therein is time-dependent and transitions across time periods as follows:
(4) Based on the communication model, the edge computing model, and the local computing model, the task offloading problem for the edge computing environment in this embodiment is formulated as follows:
Without loss of generality, the goal of the computation offloading problem is defined as minimizing, over multiple time periods, the weighted cost of the average task delay t_n and the UE energy consumption e_n. It should be noted that, since this embodiment involves different edge computing scenarios, the superscript υ_env of t_n and e_n is omitted here for convenience of description; that is, for each edge computing environment, the average task delay and energy consumption can be expressed as:
the task offloading policy of the user equipment u in the nth time period is expressed asTaking the task offloading policy as the optimization variable in this embodiment, then for each edge computing environment, v env The optimization problem of (2) can be expressed as:
P1:
C1:
C2:
C3:
C4:
Thus, for multiple edge computing environments, the optimization problem becomes:
P2:
The optimization problem for multiple edge computing environments is likewise subject to the above constraints C1-C4.
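As an illustration, a common way to write this weighted-cost objective for a single environment (P1) and for multiple environments (P2) is given below; the weights λ_t and λ_e and the environment count V are assumptions, and the exact form of constraints C1-C4 is not reproduced here:

```latex
% Illustrative objectives under assumed notation.
\[
  \text{P1:}\quad \min_{\{x_u^n,\;p_u^n\}}\;
    \frac{1}{N}\sum_{n=1}^{N}\bigl(\lambda_t\,t_n + \lambda_e\,e_n\bigr)
  \qquad\text{s.t. C1--C4},
\]
\[
  \text{P2:}\quad \min_{\{x_u^n,\;p_u^n\}}\;
    \frac{1}{V}\sum_{v=1}^{V}\;\frac{1}{N}\sum_{n=1}^{N}
      \bigl(\lambda_t\,t_n^{(v)} + \lambda_e\,e_n^{(v)}\bigr)
  \qquad\text{s.t. C1--C4 for every environment } \upsilon_{\mathrm{env}}^{(v)}.
\]
```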
It should be noted that the current task offloading policy depends strictly on the internal variables produced under the previous task offloading policies (n' < n). Thus, the above problem P1 can be converted into an MDP (Markov decision process) problem. To this end, the present embodiment defines a five-tuple <S, A, Pr, R, γ> to represent the MDP. S in the five-tuple represents the state space; each time period includes one state s_n ∈ S, defined as:
A in the five-tuple represents the action space; each time period includes an action a_n ∈ A, defined as:
Pr in the five-tuple represents the state transition probability, which is not known a priori; it determines the transition probability Pr(s_{n+1} | s_n, a_n) from s_n to s_{n+1} when executing a_n.
R in the five-tuple represents the instant reward function: following the optimization objective and constraints, an instant reward R(s_n, a_n) is generated in each time period after taking a_n in s_n.
γ in the five-tuple represents the discount factor, with value range γ ∈ (0, 1), used to determine the influence of future rewards.
It should be appreciated that under the MDP problem, the P1 problem for the single environment described above may be translated into the following P3 problem:
in the P3 problem, the goal is to obtain a policy pi such that v in a single edge computing environment env Maximizing the expected long-term return of (a). Similarly, for the above-described multiple edge computing environmentsThe P2 problem in (a) can be converted into the following P4 problem:
in the P4 problem, the goal is to obtain a policy pi, such that in multiple edge computing environmentsMaximizing the expected long-term return in comparison to a single environment.
Based on the above optimization problem, and in order to adapt to different edge computing environments, this embodiment builds on the Actor-Critic framework of DRL and proposes a policy model to be trained that contains multiple layers of freely combinable sub-models, serving as the Actor in the Actor-Critic framework. A training framework comprising multiple strategy models to be trained and their respective evaluation models to be trained is constructed for reinforcement learning training. During training, the multiple strategy models to be trained interact with different edge computing environments, so that task offloading experience in different edge computing environments is learned and the trained strategy model can adapt to various edge computing environments.
In this embodiment, the result obtained by training a strategy model to be trained is referred to as a strategy model. Based on the strategy model obtained by training, this embodiment provides a module-combined model-free computation offloading method for multi-environment MEC. In the method, a policy device acquires environment description information and current state description information of a target edge computing environment, and calls the pre-trained strategy model to process the environment description information and the state description information, obtaining a task offloading strategy that considers both the target edge computing environment and the state description information. Therefore, unlike a conventional strategy model that generates the task offloading strategy only from the current state description information, the current environment description information is also incorporated, so that the generated task offloading strategy takes both the target edge computing environment and the states in that environment into consideration.
The policy device implementing the module combination model-free computing offloading method in the multi-environment MEC may be, but is not limited to, a mobile terminal, a tablet computer, a laptop computer, a desktop computer, a server, and the like. In some embodiments, the server may be a single server or a group of servers. The server farm may be centralized or distributed (e.g., the servers may be distributed systems). In some embodiments, the server may be local or remote to the user terminal. In some embodiments, the server may be implemented on a cloud platform; by way of example only, the Cloud platform may include a private Cloud, public Cloud, hybrid Cloud, community Cloud (Community Cloud), distributed Cloud, cross-Cloud (Inter-Cloud), multi-Cloud (Multi-Cloud), or the like, or any combination thereof. In some embodiments, the server may be implemented on an electronic device having one or more components.
In order to make the solution provided by this embodiment clearer, the following details of the steps of the method are described with reference to fig. 2. It should be understood that the operations of the flow diagrams may be performed out of order and that steps that have no logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art. As shown in fig. 2, the method includes:
SA101, acquiring environment description information and current state description information of the target edge computing environment.
The environment description information comprises the probability distribution of the task input data size for the multiple user devices in the target edge computing environment, the communication bandwidth, the computing frequency of the edge server, and the computing frequency of the user devices. The state description information includes, for each user equipment, the computing task it generates, the bandwidth, the computing frequency of the user equipment, the computing frequency of the edge server, the current remaining energy, the queue length at the user equipment, and the queue length at the edge server.
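For concreteness, the two inputs could look roughly like the following in code; all field names, units, and values here are illustrative assumptions rather than a format defined by this embodiment:

```python
# Illustrative structures only; field names, units, and values are assumptions.
env_description = {
    "task_size_distribution": ("uniform", 0.5, 2.0),  # P_task, e.g. task size range in MB
    "bandwidth_mhz": 10.0,                            # B
    "f_ue_ghz": 1.0,                                  # computing frequency of the UEs
    "f_es_ghz": 10.0,                                 # computing frequency of the edge server
}

state_description = {
    "task_sizes": [1.2, 0.8, 1.5],        # current task of each user equipment
    "bandwidth_mhz": 10.0,
    "f_ue_ghz": 1.0,
    "f_es_ghz": 10.0,
    "remaining_energy": [0.7, 0.9, 0.4],  # remaining energy budget of each UE
    "ue_queue_len": [2, 0, 1],            # queue length at each UE
    "es_queue_len": 3,                    # queue length at the edge server
}
```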
SA102, invoking a pre-trained strategy model to process the environment description information and the state description information, and obtaining a task unloading strategy considering the target edge computing environment and the state description information.
The policy model includes a plurality of policy layers in series, each policy layer including a plurality of sub-models independent of each other. It can be understood that, in this embodiment, a plurality of sub-models in a plurality of policy layers are arranged and combined in a dynamic management manner, so as to obtain sub-policy models that can adapt to different edge computing environments. The manner in which the multiple sub-models in the multiple policy layers are combined is described below. In an alternative embodiment, step SA102 may include:
SA102-1 inputs state embedded features of state description information into a plurality of policy layers.
In an alternative embodiment, the policy model includes a first encoder and a second encoder, and the policy device processes the state description information through the first encoder to obtain a state embedded feature of the state description information; and processing the environment description information through a second encoder to obtain the environment embedded feature of the environment description information.
The policy model is referred to as MC-Actor in this embodiment, and its structure is shown in FIG. 3. Denote the state description information as s_n and the first encoder as Encoder1; after s_n is input into Encoder1 for processing, the state embedded feature e_s of s_n is obtained. Denote the environment description information as υ_env and the second encoder as Encoder2; after υ_env is input into Encoder2 for processing, the environment embedded feature e_v is obtained.
Based on the description of the embedded status feature and the embedded environment feature, step SA102 further includes:
SA102-2, for any adjacent layer in a plurality of strategy layers, filters the characteristics output by each sub-model in the previous strategy layer according to the environment embedded characteristics and the state embedded characteristics of the environment description information, and determines the input characteristics of each sub-model in the next strategy layer.
It should be understood that the model architecture of the policy model provided in this embodiment is fixed; that is, for any two adjacent policy layers, the data transfer relationship between their sub-models is fixed at model design time. In this case, in order to still be able to adjust the connection relationship between the sub-models, the present embodiment adjusts the data input into each sub-model, thereby indirectly adjusting the manner in which the sub-models are combined. Thus, an alternative embodiment of step SA102-2 includes:
SA102-2-1 generates a weight vector of each sub-model in the next policy layer according to the environment embedded features and the state embedded features of the environment description information.
The strategy model further comprises a plurality of weight layers connected in series, and the weight layers are in one-to-one correspondence with parts of the strategy layers. The policy device obtains a fusion feature between the state embedded feature and the environment embedded feature. For example, the state embedded feature is multiplied element by element with the environment embedded feature to obtain the fusion feature.
Further, the policy device inputs the fusion feature into a plurality of weight layers; and for any adjacent layer in the weight layers, multiplying the weight vector output by the previous weight layer by the fusion characteristic, and inputting the multiplied weight vector into the next weight layer to obtain the weight vector of the strategy layer corresponding to the next weight layer.
SA102-2-2 weights the output characteristics of each sub-model in the previous policy layer according to the weight vector of each sub-model in the next policy layer to obtain the input characteristics of each sub-model in the next policy layer.
Illustratively, with continued reference to FIG. 3, the policy model is assumed to include L policy layers, each policy layer including M sub-models. Each sub-model in the l-th layer is denoted as F_{l,m}, m ∈ {1, 2, …, M}, and has a given input dimension and output dimension. In this example, each policy layer satisfies the following convention conditions:
(1) the sub-models F_{l,m}, m ∈ {1, 2, …, M}, in the same layer should have the same dimensions;
(2) the output dimension of layer l should be equal to the input dimension of the next layer l+1.
Based on the above convention, since the state embedded feature e_s is generated by the first encoder (which takes the state description information s_n as input and outputs e_s), the input dimension of the first policy layer is equal to D, the dimension of e_s. For the last policy layer, the action taken for each user device comprises an offloading-position decision variable and the wireless transmission power, each drawn from a Gaussian distribution whose shape is determined by the two parameters μ and σ; therefore, with U user devices in total, the output dimension of the last policy layer is 4U.
The above examples describe the policy layers in the policy model. With continued reference to FIG. 3, the policy model further includes a dynamic management model composed of a plurality of weight layers. Since the weight layers are used to weight the results output by the policy layers, the number of weight layers is smaller than the number of policy layers; in this example, L-1 weight layers are provided for L policy layers. For the l-th of the L-1 weight layers, its output is a set of weights indexed by i, j ∈ {1, 2, …, M}; this output is responsible for dynamically combining the outputs of the sub-models F_{l,j}, j ∈ {1, 2, …, M}, in the l-th policy layer to serve as the inputs of the sub-models F_{l+1,i}, i ∈ {1, 2, …, M}, in the (l+1)-th policy layer. Taking the k-th input of sub-model F_{l+1,i} in the (l+1)-th policy layer as an example:
where the weighted term represents the k-th output of the j-th sub-model F_{l,j} in the l-th layer.
With continued reference to FIG. 3, for the first weight layer W_1, the state embedded feature e_s is multiplied element-by-element with the environment embedded feature e_v, and the result of this multiplication is used as the input of W_1, which generates its output. This output is fed into the second weight layer W_2 to obtain a D-dimensional output. The result of the element-by-element multiplication of e_s and e_v is then multiplied element-by-element with the output of W_2 to obtain the input of the third weight layer W_3. Similarly, the input and output of every weight layer can be obtained. Thus, whatever the environment description information υ_env of the target edge computing environment, the reusable sub-models of the corresponding policy layers can be selected through the weights output by the weight layers, so that appropriate sub-models are combined according to the environment description information of the target edge computing environment.
It should be noted that each sub-model $F_{l,m}$ in the $l$-th policy layer can be implemented using mathematical functions with different equations or neural networks with different hidden architectures, provided that the dimensions of its input and output satisfy the above-mentioned requirements on $D_l^{in}$ and $D_l^{out}$.
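For illustration only, the following PyTorch-style sketch shows one way the policy layers and weight layers described above could be wired together. The class and argument names (ModularPolicy, dims, M), the use of nn.Linear sub-models, the softmax normalization of the combination weights, and the mean-pooling at the output are assumptions made for this example rather than details taken from the patent, and the chaining of weight-layer inputs is simplified relative to FIG. 3.

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """Sketch: L policy layers of M parallel sub-models whose outputs are recombined
    by L-1 weight layers conditioned on the fused state/environment embedding."""

    def __init__(self, dims, M):
        # dims[0] is the embedding dimension D; dims[-1] would be 4U in the example above
        # (mu and sigma for the two action components of each of the U user devices).
        super().__init__()
        self.M = M
        self.layers = nn.ModuleList([
            nn.ModuleList([nn.Linear(dims[l], dims[l + 1]) for _ in range(M)])
            for l in range(len(dims) - 1)
        ])
        # L-1 weight layers, each emitting an M x M combination matrix.
        self.weight_layers = nn.ModuleList([
            nn.Linear(dims[0], M * M) for _ in range(len(dims) - 2)
        ])

    def forward(self, e_s, e_v):
        fusion = e_s * e_v                                             # element-wise fusion feature
        outs = torch.stack([f(e_s) for f in self.layers[0]], dim=1)    # (B, M, dims[1])
        w_in = fusion
        for l, wl in enumerate(self.weight_layers):
            # Normalized combination weights; softmax is an implementation choice, not from the patent.
            w = torch.softmax(wl(w_in).view(-1, self.M, self.M), dim=-1)
            # k-th input of sub-model i in the next layer = sum_j w[i, j] * k-th output of sub-model j.
            ins = torch.einsum('bij,bjk->bik', w, outs)
            outs = torch.stack(
                [f(ins[:, i]) for i, f in enumerate(self.layers[l + 1])], dim=1)
            w_in = w_in * fusion                                       # simplified chaining of weight-layer inputs
        return outs.mean(dim=1)                                        # Gaussian parameters for the actions
```

For instance, with a 64-dimensional embedding, three policy layers and four sub-models per layer, `ModularPolicy(dims=[64, 64, 64, 4 * U], M=4)` (with U the number of user devices) would produce the 4U-dimensional Gaussian parameters described above.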
Based on the description of the policy model in the above embodiment, the step SA102 further includes:
SA102-3 obtains a task unloading strategy considering the target edge computing environment and the state description information according to the output results of the strategy layers.
In this way, through the policy model provided in this embodiment, the processing procedure of the policy layer on the state description information is controlled by using the environment description information, so as to adapt to the current task computing environment, thereby obtaining the optimal task unloading policy.
In addition, this embodiment also provides a training method for the policy model, which can be implemented by the policy device. In some embodiments, the training method may also be implemented by other electronic devices capable of providing sufficient computing power. The training method is described in detail below in connection with specific embodiments, and specifically includes:
SB101, obtaining a plurality of strategy models to be trained and an evaluation model to be trained of each strategy model to be trained.
The strategy models to be trained respectively correspond to different edge computing environments.
SB102, for each strategy model to be trained, interacting with the corresponding edge computing environment through the strategy model to be trained, obtaining task unloading experience aiming at the current state of the edge computing environment, and caching the task unloading experience into an experience pool.
The task offloading experience comprises environment description information of the edge computing environment corresponding to the strategy model to be trained.
SB103, sampling the sampling experience from the experience pool after the task unloading experience collected by the experience pool meets the preset condition, and updating a plurality of strategy models to be trained and corresponding evaluation models to be trained according to the sampling experience.
SB104, judging whether the plurality of strategy models to be trained and the respective corresponding evaluation models to be trained reach the convergence condition; if yes, executing step SB105, and if not, returning to step SB102.
SB105, taking the strategy model to be trained after the current iteration as a pre-trained strategy model.
As shown in FIG. 4, in this specific embodiment, similar to existing deep reinforcement learning models based on the Actor-Critic architecture, each policy model to be trained serves as an Actor to be trained, and each evaluation model to be trained serves as a Critic to be trained. Each to-be-trained Actor is responsible for interacting with its own edge computing environment, and the corresponding Critic to be trained is used to evaluate the influence of the task offloading strategy generated by that Actor on the long-term reward.
Unlike existing deep reinforcement learning models based on the Actor-Critic architecture, the training process aims to improve the efficiency of experience exploration: $V$ to-be-trained Actors are activated simultaneously to interact with their respective edge computing environments, and the interaction experiences $(s_n, a_n, r_n, s_{n+1})$ together with the corresponding environment description information $v_{env}$ are stored in an experience pool $\xi$, from which batches $\varepsilon$ are sampled to update the model parameters of the Actors to be trained and the Critics to be trained according to a preset loss function. After training is completed, the resulting Actor can adapt to a variety of edge computing environments to generate an optimal task offloading strategy.
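The experience-collection loop described above might look roughly as follows; the environment interface (current_state, step, description) and actor.act are assumed for illustration and are not defined by the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Shared pool xi holding (s_n, a_n, r_n, s_{n+1}, v_env) tuples from all actors."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, v_env):
        self.buffer.append((s, a, r, s_next, v_env))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def collect_experience(actors, envs, pool, steps_per_actor=1):
    """Each of the V actors interacts with its own edge-computing environment and
    stores the transition together with that environment's description v_env."""
    for actor, env in zip(actors, envs):
        s = env.current_state()
        for _ in range(steps_per_actor):
            a = actor.act(s)                                  # sample a task-offloading action
            s_next, r = env.step(a)                           # offload, observe reward (e.g. delay/energy)
            pool.add(s, a, r, s_next, env.description())      # v_env attached to the experience
            s = s_next
```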
The above embodiments describe a model training framework constructed to train out a strategic model that can accommodate a variety of edge computing environments. On this basis, the present embodiment further describes a loss function for enabling the model to converge.
First, the task offloading strategy generated by the to-be-trained Actor is regarded as the operation of a function, denoted $\pi_\phi$, where $\phi$ represents the parameters to be optimized; the evaluation of the task offloading strategy by the Critic to be trained is likewise regarded as the operation of a function, denoted $Q_\theta$ (also known as the Q-value function), where $\theta$ represents the parameters to be optimized. The value computed by $Q_\theta$ is the long-term reward that can finally be achieved when tasks are offloaded according to the strategy provided by the current $\pi_\phi$, and its mathematical expression is:

$$Q_\theta(s_n, a_n) = \mathbb{E}\!\left[\sum_{t=n}^{\infty} \gamma^{\,t-n}\, r_t \,\middle|\, s_n, a_n\right]$$

where $\gamma \in (0,1)$ is the discount factor and $r_t$ is the reward obtained at step $t$.
The purpose of training the to-be-trained Actor is to maximize the long-term reward that is finally obtained when task offloading is carried out according to the strategy generated by the finally obtained policy model.
It should be noted that the result generated by $\pi_\phi$ satisfies a multidimensional Gaussian distribution $N_\phi(\mu, \sigma)$, meaning that the mean $\mu$ and covariance $\sigma$ of each element of the task offloading policy $a_n$ are approximated by the output of $\pi_\phi$; that is, $\pi_\phi(a_n \mid s_n)$ denotes the Gaussian probability of selecting task offloading policy $a_n$ in the state represented by $s_n$. Accordingly, the following loss function $J_\pi(\phi)$ is provided for the Actor to be trained:

$$J_\pi(\phi) = \mathbb{E}_{s_n \sim \xi}\!\left[ D_{KL}\!\left( \pi_\phi(\cdot \mid s_n) \,\middle\|\, \frac{\exp\!\big(Q_\theta(s_n, \cdot)\big)}{Z_\theta(s_n)} \right) \right]$$
The Critic to be trained is provided with the following loss function $J_Q(\theta)$:

$$J_Q(\theta) = \mathbb{E}_{(s_n, a_n) \sim \xi}\!\left[ \frac{1}{2}\Big( Q_\theta(s_n, a_n) - \big( r_n + \gamma\, \mathbb{E}_{s_{n+1}}\big[ V_{\bar{\psi}}(s_{n+1}) \big] \big) \Big)^{2} \right]
Wherein $D_{KL}$ denotes the KL divergence, $Z_\theta(s_n)$ denotes the distribution function that normalizes $\exp\!\big(Q_\theta(s_n, a_n)\big)$, $V_\psi$ denotes the soft state value function parameterized by $\psi$, and $V_{\bar{\psi}}$ denotes the target network of $V_\psi$ introduced for training stability. $\psi$ may be optimized by minimizing:

$$J_V(\psi) = \mathbb{E}_{s_n \sim \xi}\!\left[ \frac{1}{2}\Big( V_\psi(s_n) - \mathbb{E}_{a_n \sim \pi_\phi}\big[ Q_\theta(s_n, a_n) - \log \pi_\phi(a_n \mid s_n) \big] \Big)^{2} \right]$$
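As a rough, hedged sketch of how these three losses could be computed in practice, the following PyTorch-style code assumes a reparameterized Gaussian actor returning (mu, sigma), a critic module taking (state, action), and separate value and target-value networks; the function and field names are illustrative, not from the patent. The KL-based actor objective is implemented in its usual simplified form, minimizing log pi(a|s) minus Q(s, a) over reparameterized samples.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic, value_net, target_value_net, gamma=0.99):
    """Soft actor-critic style losses for one sampled batch epsilon.
    batch fields: s, a, r, s_next (tensors); actor(s) returns (mu, sigma)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic loss J_Q(theta): soft Bellman residual against the target value network.
    with torch.no_grad():
        q_target = r + gamma * target_value_net(s_next).squeeze(-1)
    q_loss = F.mse_loss(critic(s, a).squeeze(-1), q_target)

    # Actor loss J_pi(phi): reparameterized sample so the gradient flows through the action.
    mu, sigma = actor(s)
    dist = torch.distributions.Normal(mu, sigma)
    a_new = dist.rsample()                                  # reparameterization trick
    log_prob = dist.log_prob(a_new).sum(-1)
    actor_loss = (log_prob - critic(s, a_new).squeeze(-1)).mean()

    # Value loss J_V(psi): regress V(s) onto E[Q(s, a_new) - log pi(a_new|s)].
    with torch.no_grad():
        v_target = critic(s, a_new).squeeze(-1) - log_prob
    value_loss = F.mse_loss(value_net(s).squeeze(-1), v_target)

    return actor_loss, q_loss, value_loss
```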
Further, the to-be-trained Actor includes a plurality of policy layers, each of which includes a plurality of sub-models; in order to learn to dynamically decide whether to reuse a sub-model for different edge computing environments, the to-be-trained Actor further includes a plurality of weight layers, and its input includes the environment description information and the state description information.
Because the to-be-trained Actor comprises a plurality of policy layers and a plurality of weight layers, the parameters of both need to be updated during training. The Actor network to be trained outputs a plurality of probability distributions and samples from these distributions to determine specific actions. However, the sampling operation itself is not differentiable, which means that the back-propagation algorithm cannot be used directly to update the parameters of the Actor network, since back-propagation relies on differentiability to compute gradients. This example therefore uses the Gumbel-Softmax reparameterization method to solve the back-propagation difficulty caused by non-differentiable sampling: the sampling process is approximated by introducing additional noise variables, so that samples drawn from the probability distribution become differentiable and the model can be trained end to end.
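A minimal sketch of the Gumbel-Softmax reparameterization mentioned above, assuming the module-selection logits are produced by the weight layers; PyTorch also provides a built-in torch.nn.functional.gumbel_softmax with equivalent behavior.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_weights(logits, tau=1.0, hard=False):
    """Differentiable, reparameterized sampling of module-selection weights.
    logits: (..., M) unnormalized scores for the M sub-models of a layer."""
    # Gumbel noise g = -log(-log(u)), u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    y = F.softmax((logits + g) / tau, dim=-1)              # soft, differentiable sample
    if hard:
        # Straight-through variant: discrete one-hot forward, soft gradient backward.
        idx = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, idx, 1.0)
        y = (y_hard - y).detach() + y
    return y
```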
For ease of description, $\phi$ and $\hat{\phi}$ are used to represent the network parameters in the policy layers and the weight layers, respectively, so that the loss function of the offloading policy $\pi$ can be rewritten as:

$$J_\pi(\phi, \hat{\phi}) = \mathbb{E}_{s_n \sim \xi}\!\left[ D_{KL}\!\left( \pi_{\phi,\hat{\phi}}(\cdot \mid s_n) \,\middle\|\, \frac{\exp\!\big(Q_\theta(s_n, \cdot)\big)}{Z_\theta(s_n)} \right) \right]$$
Furthermore, it was found during development that some sub-models are selected and used many times during training, while others are hardly selected and trained at all. To avoid such module degradation, in this embodiment a regularization term $R$ is further designed and added to the rewritten loss function of the to-be-trained Actor.
$R$ is designed by taking into account the long-term sum $\sum_{t=1}^{T} w_{l,t}^{\,j}$ of the combination weights of each module $j$ of layer $l$, where $T$ is the time range. To prevent some sub-models from being exclusively selected and trained, $R$ is defined so that minimizing it drives the cumulative combination weights of any two modules $j$ and $j'$ of any layer $l$ toward each other.
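Since the exact formula of R is not reproduced above, the following is only one plausible realization, assuming R penalizes imbalance between the time-summed combination weights of the sub-models in each layer (for example, using the column sums of the M x M combination matrices as per-sub-model weights); the tensor layout and function name are illustrative.

```python
import torch

def module_balance_regularizer(weight_history):
    """weight_history: tensor (T, L, M) of combination weights per step, layer, sub-model.
    Penalizes pairwise differences of the time-summed weights within each layer, so that
    no sub-model is exclusively selected while others are starved of training."""
    cum = weight_history.sum(dim=0)                  # (L, M): long-term sum over T steps
    diff = cum.unsqueeze(-1) - cum.unsqueeze(-2)     # (L, M, M): pairwise j vs j' differences
    return diff.pow(2).mean()

# Usage sketch: total_loss = actor_loss + lambda_reg * module_balance_regularizer(w_hist)
```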
Moreover, this embodiment also alleviates the problem of overfitting by introducing Dropout. Dropout is a regularization technique for training neural networks: by randomly removing a portion of the neurons during training, overfitting of the network can be reduced. In this embodiment, whole sub-models are therefore randomly discarded with a certain probability, so that each sub-model is switched on or off probabilistically during each training iteration. In the concrete implementation, a vector $\rho_l$ is maintained in which each element is a Bernoulli random variable that equals 0 (the corresponding sub-model is dropped) with probability $p_{drop}$ and equals 1 otherwise. The $k$-th input of sub-model $F_{l+1,i}$ can then be expressed as:

$$x_{l+1,i}^{(k)} = \sum_{j=1}^{M} \rho_l^{\,j}\, w_l^{\,i,j}\, y_{l,j}^{(k)}$$
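A small sketch of how such sub-model Dropout could be applied to the weighted combination, reusing the (B, M, M) combination matrices and (B, M, D) sub-model outputs from the earlier sketch; the keep-mask convention and function name are assumptions for illustration.

```python
import torch

def dropout_submodule_mix(weights, sub_outputs, p_drop=0.1, training=True):
    """weights: (B, M, M) combination matrix W_l; sub_outputs: (B, M, D) outputs of layer l.
    Randomly switches whole sub-models off with probability p_drop during training."""
    if training:
        keep = (torch.rand(sub_outputs.shape[1], device=sub_outputs.device) > p_drop).float()
        sub_outputs = sub_outputs * keep.view(1, -1, 1)      # rho_l applied per sub-model
    # k-th input of sub-model i in layer l+1 = sum_j rho_j * w[i, j] * y[j, k]
    return torch.einsum('bij,bjk->bik', weights, sub_outputs)
```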
Based on the above loss functions, the $V$ to-be-trained Actors interact with their respective edge computing environments in parallel, which improves the efficiency of experience exploration. The task offloading experiences obtained during exploration are cached in the experience pool $\xi$; after the cached experiences satisfy a preset condition, samples are drawn from $\xi$ and used to update the parameters of the Actors to be trained and the Critics to be trained.
It should be noted that, to improve training stability, the policy layers and the weight layers in the Actor to be trained are updated alternately according to preset periods during training, where the update period of the policy-layer network parameters $\phi$ is denoted as $\Gamma_\phi$ and the update period of the weight-layer network parameters $\hat{\phi}$ is denoted as $\Gamma_{\hat{\phi}}$.
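One possible reading of this alternating schedule is sketched below with two optimizers over the disjoint parameter groups; the strict phase-based alternation and the names are simplifications chosen for the example, not details fixed by the patent.

```python
def alternating_actor_update(step, loss, opt_policy, opt_weight, period=2):
    """Within each window of `period` steps, only the policy-layer parameters (phi) are
    stepped; in the next window, only the weight-layer parameters, so the two groups
    are never updated together."""
    opt_policy.zero_grad()
    opt_weight.zero_grad()
    loss.backward()
    if (step // period) % 2 == 0:
        opt_policy.step()      # phase 1: only the policy layers move
    else:
        opt_weight.step()      # phase 2: only the weight layers move
```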
Based on the same inventive concept as the module-combined model-free computing and unloading method in the multi-environment MEC provided in this embodiment, this embodiment further provides a module-combined model-free computing and unloading device in a multi-environment MEC. The device comprises at least one software functional module that may be stored in the memory 301 in the form of software or firmware of the policy device. The processor 302 in the policy device is used to execute the executable modules stored in the memory 301. For example, the module-combined model-free computing and unloading device in the multi-environment MEC includes software function modules and computer programs. Referring to FIG. 5, divided by function, the module-combined model-free computing and unloading device in the multi-environment MEC may include:
an information acquisition module 201, configured to acquire environment description information and current state description information of a target edge computing environment;
The policy generation module 202 is configured to invoke a pre-trained policy model to process the environment description information and the state description information, so as to obtain a task offloading policy that takes into account the target edge computing environment and the state description information.
In this embodiment, the information obtaining module 201 is configured to implement step SA101 in fig. 2, and the policy generating module 202 is configured to implement step SA102 in fig. 2. Therefore, for details of each module, reference may be made to specific implementation manners of corresponding steps, which are not described in detail in this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
It should also be appreciated that the above embodiments, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application.
Accordingly, the present embodiment also provides a storage medium storing a computer program which, when executed by a processor, implements the model-combined model-free computing offload method in the multi-environment MEC provided in the present embodiment. The storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The embodiment provides a policy device for implementing a model-combined model-free computing unloading method in a multi-environment MEC. As shown in fig. 6, the policy device may include a processor 302 and a memory 301. The memory 301 stores a computer program, and the processor reads and executes the computer program corresponding to the above embodiment in the memory 301 to realize the model-free calculation unloading method of the module combination in the multi-environment MEC provided in the present embodiment.
With continued reference to fig. 6, the policy device further comprises a communication unit 303. The memory 301, the processor 302 and the communication unit 303 are electrically connected to each other directly or indirectly through a system bus 304 to realize data transmission or interaction.
The memory 301 may be an information recording device based on any electronic, magnetic, optical or other physical principle for recording execution instructions, data, etc. In some embodiments, the memory 301 may be, but is not limited to, volatile memory, non-volatile memory, storage drives, and the like.
In some embodiments, the volatile memory may be random access memory (Random Access Memory, RAM); in some embodiments, the non-volatile memory may be Read-Only Memory (ROM), programmable ROM (Programmable Read-Only Memory, PROM), erasable ROM (Erasable Programmable Read-Only Memory, EPROM), electrically erasable ROM (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash memory, or the like; in some embodiments, the storage drive may be a magnetic disk drive, a solid state disk, any type of storage disk (e.g., optical disk, DVD, etc.), or a similar storage medium, or a combination thereof, etc.
The communication unit 303 is used for transmitting and receiving data through a network. In some embodiments, the network may include a wired network, a wireless network, a fiber optic network, a telecommunications network, an intranet, the Internet, a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), a wireless local area network (Wireless Local Area Networks, WLAN), a metropolitan area network (Metropolitan Area Network, MAN), a public switched telephone network (Public Switched Telephone Network, PSTN), a Bluetooth network, a ZigBee network, a near field communication (Near Field Communication, NFC) network, or the like, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of the service request processing system may connect to the network to exchange data and/or information.
The processor 302 may be an integrated circuit chip with signal processing capabilities and may include one or more processing cores (e.g., a single-core processor or a multi-core processor). By way of example only, the processors may include a central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a special instruction set Processor (Application Specific Instruction-set Processor, ASIP), a graphics processing unit (Graphics Processing Unit, GPU), a physical processing unit (Physics Processing Unit, PPU), a digital signal Processor (Digital signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), a programmable logic device (Programmable Logic Device, PLD), a controller, a microcontroller unit, a reduced instruction set computer (Reduced Instruction set Computing, RISC), a microprocessor, or the like, or any combination thereof.
It will be appreciated that the structure shown in fig. 6 is merely illustrative. The policy device may also have more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6. The components shown in fig. 6 may be implemented in hardware, software, or a combination thereof.
It should be understood that the apparatus and method disclosed in the above embodiments may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing is merely various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for model-free computational offloading of module combinations in a multi-environmental MEC, the method comprising:
acquiring environment description information and current state description information of a target edge computing environment;
and calling a pre-trained strategy model to process the environment description information and the state description information to obtain a task unloading strategy considering both the target edge computing environment and the state description information.
2. The model-free computing offload method of module combination in a multi-environment MEC according to claim 1, wherein the policy model includes a plurality of policy layers connected in series, each policy layer includes a plurality of sub-models independent of each other, and the invoking the pre-trained policy model processes the environment description information and the state description information to obtain a task offload policy that takes into account the target edge computing environment and the state description information, including:
Inputting state embedded features of the state description information into the plurality of policy layers;
for any adjacent layer in a plurality of strategy layers, screening the characteristics output by each sub-model in the previous strategy layer according to the environment embedded characteristics of the environment description information and the state embedded characteristics, and determining the input characteristics of each sub-model in the next strategy layer;
and determining a task unloading strategy considering both the target edge computing environment and the state description information according to the output results of the strategy layers.
3. The method for model-free computing and offloading of module combination in a multi-environment MEC according to claim 2, wherein the screening the output feature of each sub-model in the previous policy layer according to the environment embedded feature of the environment description information and the state embedded feature to determine the input feature of each sub-model in the next policy layer comprises:
generating a weight vector of each sub-model in the next strategy layer according to the environment embedded feature of the environment description information and the state embedded feature;
and weighting the output characteristics of each sub-model in the previous strategy layer according to the weight vector of each sub-model in the next strategy layer to obtain the input characteristics of each sub-model in the next strategy layer.
4. The model-free computing offload method of module combination in a multi-environment MEC of claim 3, wherein the policy model further comprises a plurality of weight layers in series, the plurality of weight layers being in one-to-one correspondence with portions of the plurality of policy layers; the generating a weight vector of each sub-model in the next policy layer according to the environment embedded feature of the environment description information and the state embedded feature comprises the following steps:
acquiring fusion characteristics between the state embedded characteristics and environment embedded characteristics;
inputting the fusion features into the plurality of weight layers;
and for any adjacent layer in the weight layers, multiplying the weight vector output by the previous weight layer with the fusion characteristic, and inputting the multiplied weight vector into the next weight layer to obtain the weight vector of the strategy layer corresponding to the next weight layer.
5. The method for model-free computing offloading of module combinations in a multi-environment MEC of claim 4, wherein the obtaining a fusion feature between the state embedded feature and the environment embedded feature comprises:
multiplying the state embedded feature by the environment embedded feature element by element to obtain the fusion feature.
6. The model-free computing offload method of module combining in a multi-environment MEC of claim 2, wherein the policy model further comprises a first encoder and a second encoder, the method further comprising:
Processing the state description information through the first encoder to obtain a state embedding feature of the state description information;
and processing the environment description information through the second encoder to obtain the environment embedded feature of the environment description information.
7. The model-combined model-free computational offload method of a multi-environment MEC of claim 1, further comprising a training method of the policy model, the training method comprising:
acquiring a plurality of strategy models to be trained and an evaluation model to be trained of each strategy model to be trained, wherein the strategy models to be trained respectively correspond to different edge computing environments;
for each strategy model to be trained, interacting with a corresponding edge computing environment through the strategy model to be trained to obtain task unloading experience aiming at the current state of the edge computing environment, and caching the task unloading experience into an experience pool;
sampling experience from the experience pool after the task unloading experience collected by the experience pool meets a preset condition, and updating the plurality of strategy models to be trained and the corresponding evaluation models to be trained according to the sampling experience;
And if the plurality of strategy models to be trained and the corresponding evaluation models to be trained do not reach the convergence condition, returning to the step of, for each strategy model to be trained, interacting with the corresponding edge computing environment through the strategy model to be trained to obtain task unloading experience aiming at the current state of the edge computing environment, until the convergence condition is met, and taking the strategy models to be trained after the current iteration as the pre-trained strategy models.
8. The model-free computing and unloading method of module combination in a multi-environment MEC according to claim 7, wherein updating the plurality of policy models to be trained and the respective corresponding evaluation models to be trained according to the sampling experience comprises:
and respectively updating a plurality of strategy layers and a plurality of weight layers in each strategy model to be trained alternately according to the sampling experience, and updating the evaluation model to be trained corresponding to each strategy model to be trained.
9. The model-combined model-free computing offload method of the multi-environment MEC of claim 7, wherein the task offload experience includes environment description information of the edge computing environment to which the policy model to be trained corresponds.
10. A model-free computing offload device for modular combination in a multi-environment MEC, the device comprising:
the information acquisition module is used for acquiring environment description information and current state description information of the target edge computing environment;
and the strategy generation module is used for calling a pre-trained strategy model to process the environment description information and the state description information so as to obtain a task unloading strategy considering the target edge computing environment and the state description information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410017052.3A CN117852621A (en) | 2024-01-04 | 2024-01-04 | Module combined model-free computing and unloading method and device in multi-environment MEC |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117852621A true CN117852621A (en) | 2024-04-09 |
Family
ID=90539776
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410017052.3A Pending CN117852621A (en) | 2024-01-04 | 2024-01-04 | Module combined model-free computing and unloading method and device in multi-environment MEC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117852621A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |