CN115914227B - Edge internet of things proxy resource allocation method based on deep reinforcement learning - Google Patents
- Publication number
- CN115914227B CN115914227B CN202211401605.2A CN202211401605A CN115914227B CN 115914227 B CN115914227 B CN 115914227B CN 202211401605 A CN202211401605 A CN 202211401605A CN 115914227 B CN115914227 B CN 115914227B
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- state
- deep reinforcement
- time
- edge node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an edge internet of things proxy resource allocation method based on deep reinforcement learning, relating to the technical field of the internet of things, and comprising the following steps: first, a terminal device x collects data in the environment and transmits the data to a deep reinforcement learning network model; the model then derives an optimal allocation strategy from the data; finally, the data are sent to an edge node e for calculation according to the optimal allocation strategy, realizing edge internet of things proxy resource allocation. The method addresses the problems that edge internet of things proxy resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support the optimal resource configuration of the complex power internet of things.
Description
Technical Field
The invention relates to the technical field of the Internet of things, in particular to a method for distributing proxy resources of an edge Internet of things based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Reasonable resource allocation is an important guarantee for efficiently supporting the power business of the edge internet of things proxy; the electric power internet of things is an important component of the national industrial internet; constructing an efficient, safe and reliable sensing layer has become an important construction task in the power industry; however, current electric power internet of things equipment has limited computing capacity and cannot effectively perform large-scale rapid computing locally; the edge internet of things proxy, as the core equipment of the internet of things sensing layer, connects the internet of things terminals with the cloud; with the access of various data such as voice, video and images, as well as high-frequency data collection and heterogeneous data storage, how to dynamically and adaptively deploy internet of things terminal tasks on a suitable edge internet of things proxy node is a key problem at the present stage.
At present, key problems of the edge internet of things proxy are mainly embodied in two aspects; firstly, because of interdependence among the internet of things agents at different edges, the existing combined optimization method generally adopts an approximate algorithm or a heuristic algorithm to solve the deployment scheme, thus not only requiring longer running time, but also having limited performance; secondly, a plurality of edge nodes exist in the edge internet of things proxy environment, and the resource capacity of an edge server is limited; therefore, different edge nodes need to cooperate through distributed decision to realize optimal resource allocation so as to support efficient and reliable information interaction.
The appearance of the multi-layer network model provides a new solution for the optimal configuration of communication network resources: training a network model through a multi-layer network achieves an accurate and efficient solution; currently, some researchers have performed research and analysis in this direction; one prior-art scheme, based on a convolutional neural network, realizes reasonable allocation of internet of things resources and efficient interaction and coordination of terminal data and network tasks by edge equipment; another scheme optimizes a Q-learning network with Bayesian methods, rationalizing and ordering resource allocation in the network and resisting DDoS network attacks; in addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing of the industrial internet of things and ensures low-delay, high-reliability data interaction; considering the heterogeneity of network devices, the prior art mostly adopts deep learning networks to match network servers with user requests and allocate the optimal amount of resources to user devices; however, it should be noted that, due to the network structure of deep network models, a mismatch between computing power and the problem being processed easily arises when updating and iterating the network state, which limits computing efficiency and is insufficient to support the optimal resource configuration of the complex power internet of things.
Disclosure of Invention
The invention aims to address the defects in the prior art by providing an edge internet of things proxy resource allocation method based on deep reinforcement learning, solving the problems that edge internet of things proxy resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support the optimal resource configuration of the complex power internet of things.
The technical scheme of the invention is as follows:
an edge internet of things proxy resource allocation method based on deep reinforcement learning comprises the following steps:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
Further, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, a feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the amount stored in the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
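The interaction loop of steps S101-S107 can be sketched as follows. This is a minimal illustration assuming a toy environment and a stand-in Q-value function; the patent's real state [F, M, B], reward and ANN structures are not reproduced here.

```python
import random
from collections import deque

N_ACTIONS = 4  # assumed toy action-space size

def epsilon_greedy(q_values, epsilon, rng=random):
    """Step S104: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_episode(env_step, q_func, steps, epsilon=0.1, pool_size=100):
    """Steps S101-S107: interact and store (s_t, a_t, sigma_{t+1}, s_{t+1}) in pool O."""
    pool = deque(maxlen=pool_size)      # experience pool O (step S103)
    s = 0                               # toy initial system state (step S101)
    for _ in range(steps):
        a = epsilon_greedy(q_func(s), epsilon)   # step S104
        sigma, s_next = env_step(s, a)           # step S105: reward + next state
        pool.append((s, a, sigma, s_next))       # step S106: store transition
        s = s_next                               # step S107: state update
    return pool
```

Once the pool reaches the preset size, N transitions would be sampled from it to train the two ANNs, as described in step S107.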
Further, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, obtained by subtracting the computing resources allocated to each task in the vector M from the server's total computing resources g_d;
the system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the calculation formula of the reward σ_{t+1} in step S105 is as follows:
wherein:
r is the reward function;
A is the objective function value in the current time-t state;
A′ is the objective function value reached after taking system action a_t in the current system state s_t;
A″ is the value calculated under fully local offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
Further, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through the back-propagation of Loss(θ), using the RMSprop optimizer to reduce the loss function Loss(θ);
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, finishing the training of the real-time ANN and the delayed ANN to obtain the trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
Further, the calculation formula of the target value y of the state-action pair in step S1072 is as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)

wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the expression of the loss function Loss(θ) in step S1073 is as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_t, a_t, θ)]²

wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
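Steps S1071-S1073 follow the standard DQN update. Below is a minimal sketch, assuming the fluctuation coefficient is the usual discount factor gamma, of the target value and the mean-squared loss over the N sampled sequences.

```python
def td_target(sigma_next, q_next_values, gamma=0.9):
    """Step S1072: y = sigma_{t+1} + gamma * max_a Q(s_{t+1}, a, theta')."""
    return sigma_next + gamma * max(q_next_values)

def dqn_loss(targets, estimates):
    """Step S1073: Loss(theta) = (1/N) * sum_n (y_n - Q(s_t, a_t, theta))^2."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, estimates)) / n
```

In step S1074 this loss would be minimized over the real-time ANN's parameters θ with RMSprop, while the delayed ANN's θ′ supplies the `q_next_values`.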
Further, the deep reinforcement learning network model performance indexes in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3.
Further, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e in the current interaction round;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e in the previous interaction round;
is the deployment variable of terminal device x and migration edge node j in the current interaction round;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data transmitted.
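Since the cost expressions themselves appear only as figures in the source, the sketch below shows one plausible reading of the variable lists above, with `u[x]` the data amount, `tau[x][e]` the transmission delay, and `h[x][e]` a 0/1 deployment variable; the exact formulas and the unit migration cost are assumptions, not the patent's expressions.

```python
def delay_cost(u, tau, h):
    """c1 (assumed form): sum of u_x * tau_xe over pairs deployed this round."""
    return sum(u[x] * tau[x][e] * h[x][e]
               for x in range(len(u)) for e in range(len(h[x])))

def migration_cost(h_prev, h_now, unit_cost=1.0):
    """c2 (assumed form): charge unit_cost whenever a task leaves its previous node."""
    cost = 0.0
    for x in range(len(h_now)):
        for e in range(len(h_now[x])):
            if h_prev[x][e] and not h_now[x][e]:  # task migrated away from e
                cost += unit_cost
    return cost
```

A load cost c_3 would similarly penalize concentrating too much of the transmitted data on one edge node, per the overload rationale given in the embodiment.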
Further, the calculation of the reliability includes the following steps:
step A1: storing the interaction data of terminal device x and edge node e in a sliding window, updated in real time;
step A2: according to the historical interaction data of terminal device x and edge node e, calculating the time decay degree and the resource allocation rate of the current interaction using an expected value based on Bayesian trust evaluation;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
Further, the reliability T_ex(t) is calculated as follows:

N_ex(t) = 1 − P_ex(t)

wherein:
U is the amount of effective information in the sliding window;
w is the current interaction information;
is the time decay degree;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
Further, the expression of the time decay degree in step A2 is as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:

H_ex(t) = source_ex(t) / source_e(t)

wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
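A hedged sketch of steps A1-A3: the exponential decay exp(-Δt_w), the Beta-mean satisfaction (s+1)/(s+f+2), and the decay- and allocation-weighted combination are common Bayesian-trust choices assumed here, since the source gives only the variable lists for T_ex(t), not the closed-form expression.

```python
import math

def time_decay(dt_w):
    """Decay of the w-th record, assumed exponential in dt_w = t - t_w."""
    return math.exp(-dt_w)

def allocation_rate(source_ex, source_e):
    """H_ex(t) = source_ex(t) / source_e(t): share of e's resources offered to x."""
    return source_ex / source_e

def positive_satisfaction(successes, failures):
    """P_ex: expected success probability under a Beta(1, 1) prior (assumed)."""
    return (successes + 1) / (successes + failures + 2)

def reliability(window):
    """T_ex(t) (assumed form): weighted mean satisfaction over the window.

    window: list of (dt_w, source_ex, source_e, successes, failures) records.
    """
    num = den = 0.0
    for dt_w, src_ex, src_e, s, f in window:
        weight = time_decay(dt_w) * allocation_rate(src_ex, src_e)
        num += weight * positive_satisfaction(s, f)
        den += weight
    return num / den if den else 0.0
```

The negative satisfaction follows directly as N_ex = 1 - P_ex, matching the one formula the source does give.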
Compared with the prior art, the invention has the beneficial effects that:
1. according to the edge internet of things proxy resource allocation method based on deep reinforcement learning, an optimal allocation strategy is calculated by using a deep reinforcement learning network model, terminal data is transmitted to an edge node e for calculation according to the optimal allocation strategy, calculation pressure of field devices is effectively relieved, storage difficulty caused by large data volume in a resource allocation process is avoided, reliable and efficient information interaction of a communication network is guaranteed, and better information interaction support service is provided for the electric internet of things.
2. The edge internet of things proxy resource allocation method based on deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning so that their advantages complement each other, and can support optimal strategy solution over large amounts of data.
3. The edge internet of things proxy resource allocation method based on deep reinforcement learning comprises a real-time ANN and a delay ANN, wherein after training for a certain number of times, parameters of the delay ANN are updated to parameters of the real-time ANN, timeliness of a delay ANN value function is guaranteed, and correlation among states is reduced.
4. The edge internet of things proxy resource allocation method based on deep reinforcement learning takes global cost and reliability as performance judgment indexes of a network model, and provides judgment basis for searching an optimal strategy for the network model.
5. According to the edge internet of things proxy resource allocation method based on deep reinforcement learning, the interactive information is updated by adopting a sliding window mechanism, the interactive information with longer interval time is directly abandoned, the calculation cost of the user terminal is reduced, the reliability calculation ensures the safety of the user terminal in the task unloading process, and a guarantee is provided for establishing a good interactive environment.
6. The edge internet of things proxy resource allocation method based on deep reinforcement learning calculates various interaction quality values between the user terminal and the edge server, prepares for reliability calculation, and provides a judgment basis for searching an optimal strategy for a network model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of a method for implementing a deep reinforcement learning network model according to the present invention.
FIG. 3 is a flow chart of a training method for real-time ANN and delayed ANN according to the present invention.
FIG. 4 is a flowchart of a reliability calculation method according to the present invention.
FIG. 5 is a schematic view of a sliding window according to the present invention.
FIG. 6 is a diagram of a deep reinforcement learning network according to the present invention.
FIG. 7 is a diagram illustrating parameters of a deep reinforcement learning network model according to an embodiment of the present invention.
FIG. 8 is a graph illustrating network performance at different learning rates for a deep reinforcement learning network model in accordance with an embodiment of the present invention.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Example 1
Referring to fig. 1, a method for allocating proxy resources of an edge internet of things based on deep reinforcement learning includes:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
preferably, in this embodiment, the data collected by the terminal device x is data such as voice, video, and image of the user terminal;
Preferably, in this embodiment, Python 3 + TensorFlow 1.0 is used as the simulation experiment platform, on an Intel Core i7-5200U with 16 GB of memory; 50 terminal devices x and 5 edge nodes e are set in the simulation test environment, uniformly distributed in a 15 km × 15 km grid;
preferably, in this embodiment, terminal device x sends a task request to the edge nodes every 1 hour, and the edge nodes decide in a distributed manner which server performs the task; the load of terminal device x comes from a real load data set, in which terminal task load approximately follows a 24-hour periodic distribution due to tidal effects, with random fluctuations caused by environmental factors.
Preferably, in the present embodiment, fig. 7 shows the deep reinforcement learning network model parameters.
Step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
In this embodiment, as shown in fig. 2, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, a feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the amount stored in the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
In this embodiment, specifically, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, obtained by subtracting the computing resources allocated to each task in the vector M from the server's total computing resources g_d;
the system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the calculation formula of the reward σ_{t+1} in step S105 is as follows:
wherein:
r is the reward function;
A is the objective function value in the current time-t state;
A′ is the objective function value reached after taking system action a_t in the current system state s_t;
A″ is the value calculated under fully local offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
In this embodiment, as shown in fig. 3, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through the back-propagation of Loss(θ), using the RMSprop optimizer to reduce the loss function Loss(θ);
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, finishing the training of the real-time ANN and the delayed ANN to obtain the trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
In this embodiment, specifically, the calculation formula of the target value y of the state-action pair in step S1072 is as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)

wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the expression of the loss function Loss(θ) in step S1073 is as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_t, a_t, θ)]²

wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
In this embodiment, specifically, the performance indexes of the deep reinforcement learning network model in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3.
In this embodiment, in order to achieve efficient task processing, three factors are considered: the delay cost c_1, the migration cost c_2 and the load cost c_3. Since terminal device x needs to send the collected data to edge node e for processing, a time delay is generated during data transmission; while processing a task, edge node e may also decide whether to send the task to a migration edge node j, but migration costs arise because the model must be redeployed for the migrated task; due to the limited capacity of edge node e, if too many tasks are deployed on the same edge node e, it is often overloaded, resulting in load costs.
In the present embodiment, specifically, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e in the current interaction round;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e in the previous interaction round;
is the deployment variable of terminal device x and migration edge node j in the current interaction round;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data transmitted.
In this embodiment, specifically, as shown in fig. 4, the calculation of the reliability includes the following steps:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
In this embodiment, considering that interaction experience from long ago is not sufficient to update the current reliability value in time, more attention should be paid to the latest interaction behaviour, so a sliding window mechanism is adopted to update the interaction information; as shown in fig. 5, when the interaction information of the next time slot arrives, the record with the longest interval in the window is discarded and the effective interaction information is recorded in the window, reducing the calculation overhead of the user terminal;
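The sliding-window update described above can be sketched with a fixed-capacity deque; the window size U is an assumption, as the embodiment does not fix a value.

```python
from collections import deque

def make_window(capacity):
    """Step A1: a sliding window holding at most `capacity` interaction records."""
    return deque(maxlen=capacity)

def record_interaction(window, slot_info):
    """When the next slot's record arrives, the record with the longest
    interval is discarded automatically once the window is full
    (the deque's maxlen behaviour), as in fig. 5."""
    window.append(slot_info)
    return window
```

Only the records still inside the window then contribute to the reliability calculation, which keeps the user terminal's calculation overhead bounded.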
step A2: according to the historical interaction data of terminal device x and edge node e, calculating the time decay degree and the resource allocation rate of the current interaction using an expected value based on Bayesian trust evaluation;
in this embodiment, since the reliability of the edge server is dynamically updated, the further the historical interaction information is from the current time, the smaller its influence on the current reliability evaluation; the time decay function represents the degree of decay from the information obtained in the w-th interaction to the current interaction slot, where Δt_w = t − t_w and t_w is the end time of the w-th interaction slot; in addition, the amount of computing resources the edge server can provide in each interaction affects the update of the interaction information;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
In this embodiment, the reliability T_ex(t) is calculated as follows:

N_ex(t) = 1 - P_ex(t)

wherein:
u is the amount of effective information in the sliding window;
w is the current interaction information;
is the degree of time decay;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
In this embodiment, the time decay degree in step A2 is expressed as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
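Steps A1 to A3 can be illustrated together in code. The patent's exact formulas for the time decay function and for T_ex(t) are given as images and are not reproduced in this text, so the sketch below makes loudly labeled assumptions: an exponential decay form, a standard Beta-posterior expectation for the Bayesian satisfaction values P_ex and N_ex, and an illustrative decay-and-rate weighted combination. It is a sketch of the idea, not the patent's formula:

```python
import math

def time_decay(dt, lam=0.5):
    # ASSUMED exponential form: older interactions (larger dt = t - t_w)
    # contribute less. The patent's actual decay function is not shown here.
    return math.exp(-lam * dt)

def bayes_satisfaction(successes, failures):
    # Expected value of a Beta(successes + 1, failures + 1) posterior:
    # a standard Bayesian trust estimate for P_ex(t); N_ex(t) = 1 - P_ex(t).
    p = (successes + 1) / (successes + failures + 2)
    return p, 1 - p

def reliability(records, now, lam=0.5):
    """Illustrative T_ex(t): satisfaction weighted by time decay and
    resource allocation rate over the sliding window.
    records = [(t_w, success, H_w), ...] ordered oldest first."""
    s = f = 0
    score = weight = 0.0
    for t_w, success, h in records:
        s += success
        f += not success
        p, n = bayes_satisfaction(s, f)
        w = time_decay(now - t_w, lam) * h   # decay * allocation rate H_ex(t_w)
        score += w * (p - n)                 # positive minus negative satisfaction
        weight += w
    return score / weight if weight else 0.0
```

A window of successful interactions yields a higher reliability than one of failures, and information from earlier slots is discounted by the decay term.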
In this embodiment, the optimal allocation strategy is solved using a deep reinforcement learning network model, as shown in fig. 6. The model comprises two neural networks: the first, called the real-time ANN, computes the estimated value Q(s_t, a_t, θ), where θ denotes the parameters of the real-time ANN and is updated each time the estimate of the current state is computed; the second, called the delay ANN, computes the value Q(s_{t+1}, a_{t+1}, θ') of the next state, which is used to compute the target value y.
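A minimal sketch of this two-network arrangement, using tiny linear Q-networks as stand-ins for the real-time and delay ANNs. All names and dimensions here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3   # assumed sizes, for illustration only

class QNet:
    """Tiny linear Q-network standing in for the real-time / delay ANNs."""
    def __init__(self):
        self.W = rng.normal(0.0, 0.1, (N_ACTIONS, STATE_DIM))

    def q_values(self, s):
        # One Q-value per action for state vector s.
        return self.W @ s

    def copy_from(self, other):
        # Delayed parameter sync: theta' <- theta.
        self.W = other.W.copy()

realtime_ann = QNet()   # computes Q(s_t, a_t, theta); theta updated every step
delay_ann = QNet()      # computes Q(s_{t+1}, a_{t+1}, theta') for the target y
delay_ann.copy_from(realtime_ann)
```

The real-time ANN is trained at every step, while the delay ANN's parameters θ' are only overwritten periodically via `copy_from`, which stabilizes the target values.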
In this embodiment, the influence of different learning rates on the deep reinforcement learning network model is tested, as shown in fig. 8. When the learning rate is set to 0.01, the network loss function cannot converge effectively and the function value oscillates noticeably. In contrast, when the learning rate is set to 0.0001, divergence is effectively suppressed and the network converges within about 60 iterations, albeit more slowly. Overall, the setting of 0.0001 gives the best resource allocation performance: the loss function decreases steadily, and the network converges more stably and to a better result.
The foregoing examples merely represent specific embodiments of the present application and are described in detail, but they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several variations and modifications without departing from the technical solution of the present application, all of which fall within its protection scope.
This background section is provided to generally present the context of the invention. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.
Claims (3)
1. An edge internet of things proxy resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: according to the optimal allocation strategy, the data are sent to an edge node e for calculation, and the edge internet of things proxy resource allocation is realized;
the training method of the deep reinforcement learning network model in the step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: according to the current system state s_t, selecting a system action a_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, the feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: according to the current system state s_t, the system action a_t, the reward σ_{t+1}, and the next system state s_{t+1}, calculating a state transition sequence Δ_t and storing the state transition sequence Δ_t into the experience pool O;
step S107: judging whether the amount stored in the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delay ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104;
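The loop of steps S101 to S107 can be sketched as a skeleton. The environment hook `step_env` and value hook `q_values` are hypothetical placeholders for the system model; they are not names from the patent:

```python
import random
from collections import deque

def epsilon_greedy(q_values, eps, rng=random):
    # Step S104: explore with probability eps, otherwise pick the best action.
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_training(step_env, q_values, steps=100, eps=0.2,
                 pool_capacity=50, pool_min=10, batch=4):
    """Skeleton of steps S101-S107: interact, store transition sequences
    Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1}) in the experience pool O,
    and sample minibatches once the pool reaches a preset size."""
    pool = deque(maxlen=pool_capacity)    # experience pool O (step S103)
    s = 0                                 # initial system state (step S101)
    batches = []
    for _ in range(steps):
        a = epsilon_greedy(q_values(s), eps)        # step S104
        sigma, s_next = step_env(s, a)              # step S105
        pool.append((s, a, sigma, s_next))          # step S106
        if len(pool) >= pool_min:                   # step S107
            batches.append(random.sample(list(pool), batch))
        s = s_next                                  # s_t <- s_{t+1}
    return pool, batches
```

Each sampled minibatch would then be fed to the training procedure of step S107 (steps S1071 onward); here the sampled batches are simply collected to show the control flow.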
the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of each state-action pair and the value Q(s_{t+1}, a_{t+1}, θ') of the next state;
step S1072: according to the value Q(s_{t+1}, a_{t+1}, θ') and the reward σ_{t+1}, calculating the target value y of the state-action pair;
step S1073: according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y, calculating the Loss function Loss(θ);
step S1074: adjusting the parameter θ of the real-time ANN through the back-propagation of the Loss, and reducing the Loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameter θ' of the delay ANN was last updated equals a set value; if so, updating the parameter θ' of the delay ANN and entering step S1077; otherwise, going to step S1076;
step S1076: judging whether the training of the N state transition sequences is finished, if so, extracting the N state transition sequences from the experience pool O again, returning to the step S1071, otherwise, returning to the step S1071;
step S1077: testing the performance index of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delay ANN is finished and the trained deep reinforcement learning network model is obtained; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071;
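Steps S1072 and S1073 amount to the standard DQN target and mean-squared loss. A sketch, assuming the fluctuation coefficient plays the role of a discount factor (named `gamma` here; the patent's exact symbol is given as an image):

```python
def td_targets(batch, q_delay, gamma=0.9):
    # Step S1072: y = sigma_{t+1} + gamma * max_a' Q(s_{t+1}, a', theta'),
    # where q_delay is the delay ANN's value function.
    return [sigma + gamma * max(q_delay(s_next))
            for (_, _, sigma, s_next) in batch]

def mse_loss(batch, targets, q_realtime):
    # Step S1073: Loss(theta) = (1/N) * sum_n (y_n - Q(s_n, a_n, theta))^2,
    # where q_realtime is the real-time ANN's value function.
    n = len(batch)
    return sum((y - q_realtime(s)[a]) ** 2
               for (s, a, _, _), y in zip(batch, targets)) / n
```

Step S1074 would then back-propagate this loss through the real-time ANN and apply an optimizer such as RMSprop; that part depends on the network implementation and is omitted.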
the performance index of the deep reinforcement learning network model in step S1077 includes: global cost and reliability;
the global cost includes a delay cost c 1 Migration cost c 2 And load cost c 3 ;
The delay cost c_1 is expressed as follows:
wherein:
t is the interaction time index;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e at the current interaction time;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e at the last interaction time;
is the deployment variable of terminal device x and migration edge node j at the current interaction time;
The load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data sent;
the reliability calculation includes the following steps:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
step A2: according to historical interaction data of the terminal equipment x and the edge node e, calculating the time attenuation degree and the resource allocation rate of current interaction by adopting an expected value based on Bayesian trust evaluation;
step A3: according to the time decay degree and the resource allocation rate, calculating the reliability T_ex(t);
The reliability T_ex(t) is calculated as follows:
N_ex(t) = 1 - P_ex(t)
wherein:
u is the amount of effective information in the sliding window;
w is the current interaction information;
is the degree of time decay;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e;
the expression of the time decay degree in step A2 is as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
The calculation formula of the resource allocation rate in step A2 is as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
2. The edge internet of things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing-resource allocation vector;
B is the remaining computing-resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server and g_d is its total computing resources; each element of the vector M is the amount of computing resources allocated to the corresponding task;
The system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing-resource allocation scheme of terminal device x;
the reward σ_{t+1} in step S105 is calculated as follows:
wherein:
R is the reward function;
a is the objective function value in the state at the current time t;
a′ is the objective function value reached after taking system action a_t in the current system state s_t;
a″ is the objective value computed when all tasks are executed locally;
The state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
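The state, action, and transition containers defined in claim 2 might be represented as follows. The field names and types are illustrative, not from the patent:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class SystemState:
    """Local offloading state s = [F, M, B] of claim 2 (field names assumed)."""
    F: List[int]      # offloading decision vector
    M: List[float]    # computing-resource allocation vector
    B: List[float]    # remaining computing resources b_d of each MEC server

def make_transition(s_t: Any, a_t: Tuple, sigma: float, s_next: Any) -> Tuple:
    # State transition sequence Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1}),
    # the unit stored in the experience pool O in step S106.
    return (s_t, a_t, sigma, s_next)
```

The action a_t = [x, μ, k] is kept as a plain tuple here, since only the transition structure matters for the experience pool.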
3. The edge internet of things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the calculation formula of the target value y of the state-action pair in step S1072 is as follows:
wherein:
is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ');
Q(s_{t+1}, a_{t+1}, θ') is the value of the next system state;
max Q(s_{t+1}, a_{t+1}, θ') is the maximum value of the next system state;
The expression of the Loss function Loss(θ) in step S1073 is as follows:
wherein:
N is the number of state transition sequences extracted each time;
n is the index of a state transition sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115914227A CN115914227A (en) | 2023-04-04 |
CN115914227B true CN115914227B (en) | 2024-03-19 |
Family
ID=86493215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211401605.2A Active CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115914227B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112134916A (en) * | 2020-07-21 | 2020-12-25 | 南京邮电大学 | Cloud edge collaborative computing migration method based on deep reinforcement learning |
CN113890653A (en) * | 2021-08-30 | 2022-01-04 | 广东工业大学 | Multi-agent reinforcement learning power distribution method for multi-user benefits |
CN114490057A (en) * | 2022-01-24 | 2022-05-13 | 电子科技大学 | MEC unloaded task resource allocation method based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220180174A1 (en) * | 2020-12-07 | 2022-06-09 | International Business Machines Corporation | Using a deep learning based surrogate model in a simulation |
- 2022-11-10: CN application CN202211401605.2A filed, patent CN115914227B, status Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112134916A (en) * | 2020-07-21 | 2020-12-25 | 南京邮电大学 | Cloud edge collaborative computing migration method based on deep reinforcement learning |
CN113890653A (en) * | 2021-08-30 | 2022-01-04 | 广东工业大学 | Multi-agent reinforcement learning power distribution method for multi-user benefits |
CN114490057A (en) * | 2022-01-24 | 2022-05-13 | 电子科技大学 | MEC unloaded task resource allocation method based on deep reinforcement learning |
Non-Patent Citations (6)
Title |
---|
Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation; Bo Feng et al.; 2020 5th Asia Conference on Power and Electrical Engineering (ACPEE); 2020-06-30; full text *
A deep Q-network method with maximum upper-confidence-bound experience sampling; Zhu Fei, Wu Wen, Liu Quan, Fu Yuchen; Journal of Computer Research and Development; 2018-08-15 (08); full text *
Distributed cooperative jamming power allocation algorithm based on multi-agent deep reinforcement learning; Rao Ning et al.; Acta Electronica Sinica; 2022-06-30; full text *
Wireless network resource allocation algorithm based on deep reinforcement learning; Li Ziheng, Meng Chao; Communications Technology; 2020-08-10 (08); full text *
A scheduling optimization method based on deep reinforcement learning; Deng Zhilong, Zhang Qiwei, Cao Hao, Gu Zhiyang; Journal of Northwestern Polytechnical University; 35(6); full text *
Also Published As
Publication number | Publication date |
---|---|
CN115914227A (en) | 2023-04-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112486690B (en) | Edge computing resource allocation method suitable for industrial Internet of things | |
CN111629380B (en) | Dynamic resource allocation method for high concurrency multi-service industrial 5G network | |
CN113568727B (en) | Mobile edge computing task allocation method based on deep reinforcement learning | |
CN110365514A (en) | SDN multistage mapping method of virtual network and device based on intensified learning | |
CN112579194B (en) | Block chain consensus task unloading method and device based on time delay and transaction throughput | |
CN111711666B (en) | Internet of vehicles cloud computing resource optimization method based on reinforcement learning | |
CN112395090B (en) | Intelligent hybrid optimization method for service placement in mobile edge calculation | |
US20230153124A1 (en) | Edge network computing system with deep reinforcement learning based task scheduling | |
CN114357455B (en) | Trust method based on multidimensional attribute trust evaluation | |
CN113891276A (en) | Information age-based mixed updating industrial wireless sensor network scheduling method | |
CN113115368B (en) | Base station cache replacement method, system and storage medium based on deep reinforcement learning | |
CN113692021A (en) | 5G network slice intelligent resource allocation method based on intimacy | |
CN109298930A (en) | A kind of cloud workflow schedule method and device based on multiple-objection optimization | |
CN114423023B (en) | Mobile user-oriented 5G network edge server deployment method | |
CN109379747B (en) | Wireless network multi-controller deployment and resource allocation method and device | |
CN115914227B (en) | Edge internet of things proxy resource allocation method based on deep reinforcement learning | |
CN116055406B (en) | Training method and device for congestion window prediction model | |
CN113543160A (en) | 5G slice resource allocation method and device, computing equipment and computer storage medium | |
CN104601424B (en) | The passive transacter of master and method of probabilistic model are utilized in equipment control net | |
CN114500561B (en) | Power Internet of things network resource allocation decision-making method, system, equipment and medium | |
Bensalem et al. | Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach | |
CN113596138B (en) | Heterogeneous information center network cache allocation method based on deep reinforcement learning | |
TW202327380A (en) | Method and system for federated reinforcement learning based offloading optimization in edge computing | |
CN114980324A (en) | Slice-oriented low-delay wireless resource scheduling method and system | |
CN113342474A (en) | Method, device and storage medium for forecasting customer flow and training model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |