CN115914227A - Edge Internet of things agent resource allocation method based on deep reinforcement learning - Google Patents
- Publication number
- CN115914227A (application CN202211401605.2A)
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- deep reinforcement
- state
- time
- things
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Telephonic Communication Services (AREA)
Abstract
The invention discloses an edge Internet of Things agent resource allocation method based on deep reinforcement learning, relating to the technical field of the Internet of Things. The method comprises the following steps: first, a terminal device x collects data in the environment and transmits the data to a deep reinforcement learning network model; the deep reinforcement learning network model then obtains an optimal allocation strategy from the data; finally, the data are sent to an edge node e for calculation according to the optimal allocation strategy, thereby realizing edge Internet of Things agent resource allocation. The invention solves the problems that edge Internet of Things agent resource allocation takes a long time, performance is limited, and the prior art is insufficient to support optimal resource configuration of a complex, dynamic Internet of Things.
Description
Technical Field
The invention relates to the technical field of Internet of things, in particular to an edge Internet of things agent resource allocation method based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Reasonable resource allocation is an important guarantee for the edge Internet of Things agent to efficiently support power services. The power Internet of Things is an important component of the national industrial internet, and building an efficient, safe and reliable sensing layer has become a major construction task in the power industry. However, the computing power of existing power Internet of Things equipment is limited and cannot effectively carry out large-scale, rapid local computation. The edge Internet of Things agent, as the core equipment of the Internet of Things sensing layer, connects the Internet of Things terminals with the cloud side. With the access of diverse data such as voice, video and images, high-frequency data acquisition and heterogeneous data storage, how to dynamically and adaptively deploy the tasks of Internet of Things terminals onto suitable edge Internet of Things agent nodes is a key problem at the present stage.
At present, the key problems of the edge Internet of Things agent are mainly embodied in two aspects. First, because different edge Internet of Things agents depend on one another, existing combinatorial optimization methods generally solve for a deployment scheme with approximate or heuristic algorithms, which not only require long running times but also offer limited performance. Second, multiple edge nodes exist in the edge Internet of Things agent environment and the resource capacity of each edge server is limited; different edge nodes therefore need to cooperate through distributed decisions to achieve optimal resource allocation and support efficient, reliable information interaction.
The emergence of multi-layer network models provides a new solution for the optimal configuration of communication network resources: training a network model over multiple layers yields accurate and efficient solutions, and some researchers have already studied this direction. One prior-art scheme is based on a convolutional neural network and realizes reasonable allocation of Internet of Things resources and efficient interaction and coordination of terminal data and network tasks by the edge equipment. Another scheme uses Bayesian optimization of a Q-learning network to rationalize and order resource allocation in the network and to resist DDoS attacks. In addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing of industrial Internet of Things networks and ensures low-delay, high-reliability data interaction. Considering the heterogeneity of network equipment, the prior art mostly adopts deep learning networks to match network servers with user requests and to allocate an optimal amount of resources to the user equipment. However, owing to the structure of such deep network models, updating and iterating the network state easily leads to a mismatch between computing capability and the problem being processed; computing efficiency is limited and is insufficient to support optimal resource configuration of a complex, dynamic Internet of Things.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing an edge Internet of Things agent resource allocation method based on deep reinforcement learning, thereby solving the problems that edge Internet of Things agent resource allocation takes a long time, performance is limited, and the prior art is insufficient to support optimal resource configuration of a complex, dynamic Internet of Things.
The technical scheme of the invention is as follows:
an edge Internet of things agent resource allocation method based on deep reinforcement learning comprises the following steps:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy from the data by the deep reinforcement learning network model;
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, thereby realizing edge Internet of Things agent resource allocation.
Further, the training method of the deep reinforcement learning network model in step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t by using an ε-greedy strategy;
step S105: the environment feeds back a reward σ_(t+1) and the next system state s_(t+1) according to the system action a_t;
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_(t+1) and the next system state s_(t+1), and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the storage capacity of the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, and finishing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_(t+1) and returning to step S104.
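For illustration only, the following sketch shows one possible realization of the interaction and experience-collection loop of steps S101–S107. The environment interface (`env.reset`, `env.step`, `env.sample_action`), the `best_action` method of the Q-network, the helper `train_minibatch`, and the pool size, batch size and ε value are all assumptions made for the sketch; they are not fixed by the patent text.

```python
import random
from collections import deque

def collect_and_train(env, real_time_ann, delayed_ann, train_minibatch,
                      total_steps=10000, pool_capacity=2000,
                      batch_size=32, epsilon=0.1):
    experience_pool = deque(maxlen=pool_capacity)   # experience pool O (step S103)
    s_t = env.reset()                               # initial system state s (step S101)
    for _ in range(total_steps):
        # Step S104: epsilon-greedy selection of the system action a_t
        if random.random() < epsilon:
            a_t = env.sample_action()
        else:
            a_t = real_time_ann.best_action(s_t)
        # Step S105: the environment feeds back the reward and the next state
        s_next, sigma_next = env.step(a_t)
        # Step S106: store the state transition sequence Delta_t in the pool
        experience_pool.append((s_t, a_t, sigma_next, s_next))
        # Step S107: once the pool is large enough, train on N sampled sequences
        if len(experience_pool) >= batch_size:
            minibatch = random.sample(list(experience_pool), batch_size)
            train_minibatch(real_time_ann, delayed_ann, minibatch)
        s_t = s_next                                # otherwise continue interacting
```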
Further, the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, G_d is its total computing resources, and the computing resources allocated to each task are given by the corresponding entries of the allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_(t+1) in step S105 is calculated as follows:
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value of the next state reached after taking system action a_t in the current system state s_t;
A″ is the objective value calculated when all tasks are partially offloaded;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_(t+1), s_(t+1)).
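For concreteness, the state, action and transition objects defined above can be represented as plain containers. The sketch below only illustrates one possible data layout; the field names and types are assumptions, since the patent does not prescribe an encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SystemState:          # s = [F, M, B]
    F: List[int]            # offloading decision vector
    M: List[float]          # computing resource allocation vector
    B: List[float]          # remaining computing resources b_d of each MEC server

@dataclass
class SystemAction:         # a_t = [x, mu, k]
    x: int                  # index of the terminal device
    mu: int                 # offloading scheme chosen for terminal device x
    k: float                # computing resources allocated to terminal device x

# State transition sequence Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1})
Transition = Tuple[SystemState, SystemAction, float, SystemState]
```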
Further, the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t; θ) of the state–action pair and the value Q(s_(t+1), a_(t+1); θ′) of the next state;
step S1072: calculating the target value y of the state–action pair from the value Q(s_(t+1), a_(t+1); θ′) of the next state and the reward σ_(t+1);
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t; θ) of the state–action pair and the target value y;
step S1074: adjusting the parameter θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameter θ′ of the delayed ANN was last updated equals a set value; if so, updating the parameter θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether the training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indices of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delayed ANN is finished and the trained deep reinforcement learning network model is obtained; otherwise, N state transition sequences are re-extracted from the experience pool O and the process returns to step S1071.
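Steps S1071–S1075 amount to one gradient step of a two-network (real-time/delayed) deep Q-learning update. The sketch below shows how such a step could look; it uses PyTorch for brevity even though the embodiment runs on TensorFlow 1.0, and the discount factor `gamma`, the tensor layout of the minibatch and the network API are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed setup: real_time_ann and delayed_ann are torch.nn.Module Q-networks
# mapping a batch of state tensors to one Q-value per action; the minibatch is
# a tuple of batched tensors (states, action indices, rewards, next states).
def dqn_update(real_time_ann, delayed_ann, optimizer, minibatch, gamma=0.9):
    states, actions, rewards, next_states = minibatch
    q_all = real_time_ann(states)                               # Q(s_t, ·; theta)
    q_est = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; theta), step S1071
    with torch.no_grad():
        q_next = delayed_ann(next_states).max(dim=1).values     # max Q(s_{t+1}, a_{t+1}; theta')
    y = rewards + gamma * q_next                                # target value y, step S1072
    loss = F.mse_loss(q_est, y)                                 # Loss(theta), step S1073
    optimizer.zero_grad()
    loss.backward()                                             # loss back-propagation, step S1074
    optimizer.step()                                            # RMSprop adjusts theta
    return loss.item()

# Step S1075: every set number of updates, copy theta into the delayed ANN's theta'
def sync_delayed(real_time_ann, delayed_ann):
    delayed_ann.load_state_dict(real_time_ann.state_dict())
```

A matching optimizer would be created, for example, as `torch.optim.RMSprop(real_time_ann.parameters(), lr=1e-4)`; the learning rate here is only an example consistent with the values discussed later in the embodiment.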
Further, the target value y of the state–action pair in step S1072 is calculated as follows:
wherein:
Q(s_(t+1), a_(t+1); θ′) is the value of the next state of the system;
max Q(s_(t+1), a_(t+1); θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
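The printed formulas for y and Loss(θ) are reproduced only as images in the original publication. With the symbols defined above, the standard two-network DQN expressions they correspond to would read as follows; the discount factor γ is an assumption, as the text does not name it.

```latex
y_n = \sigma_{t+1}^{(n)} + \gamma \,\max_{a_{t+1}} Q\bigl(s_{t+1}^{(n)}, a_{t+1}; \theta'\bigr)
\qquad
\mathrm{Loss}(\theta) = \frac{1}{N}\sum_{n=1}^{N}\Bigl(y_n - Q\bigl(s_t^{(n)}, a_t^{(n)}; \theta\bigr)\Bigr)^{2}
```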
Further, the performance indices of the deep reinforcement learning network model in step S1077 include global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
Further, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
x is the terminal device set;
e is the edge node set;
u_x is the amount of data sent;
the deployment variable of terminal device x and edge node e in the current interaction period;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
the deployment variable of terminal device x and edge node e in the previous interaction period;
the deployment variable of terminal device x and migration edge node j in the current interaction period;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data sent.
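The exact expressions for c_1, c_2 and c_3 likewise appear only as images in the original publication. The sketch below illustrates how costs of this kind could be accumulated from the deployment variables described above; the aggregation forms (delay as data volume times per-link delay, migration counted when a task changes edge node between consecutive interactions, load as excess data volume over an assumed node capacity) are assumptions consistent with the surrounding definitions, not the patent's own formulas.

```python
def delay_cost(u, tau, deploy):
    """c_1 (assumed form): sum of u_x * tau_xe over deployed (device, node) pairs,
    accumulated over all interactions. `deploy` is a list of placements, each a
    list of (x, e) pairs."""
    return sum(u[x] * tau[x][e] for placement in deploy for (x, e) in placement)

def migration_cost(deploy, unit_cost=1.0):
    """c_2 (assumed form): counts tasks whose hosting edge node changed between
    consecutive interactions, i.e. migration from node e to node j."""
    cost = 0.0
    for prev, curr in zip(deploy[:-1], deploy[1:]):
        prev_map, curr_map = dict(prev), dict(curr)
        cost += unit_cost * sum(1 for x, e in prev_map.items()
                                if curr_map.get(x, e) != e)
    return cost

def load_cost(u, deploy, capacity):
    """c_3 (assumed form): penalizes edge nodes whose assigned data volume
    exceeds an assumed per-node capacity."""
    cost = 0.0
    for placement in deploy:
        per_node = {}
        for x, e in placement:
            per_node[e] = per_node.get(e, 0.0) + u[x]
        cost += sum(max(0.0, load - capacity[e]) for e, load in per_node.items())
    return cost
```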
Further, the calculation of the reliability comprises the following steps:
step A1: storing the interaction data of terminal device x and edge node e in a sliding window that is updated in real time;
step A2: calculating the time decay degree and the resource allocation rate of the current interaction from the historical interaction data of terminal device x and edge node e, using an expected value based on Bayesian trust evaluation;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
Further, the reliability T_ex(t) is calculated as follows:
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective records in the sliding window;
w is the current interaction record;
H_ex(t_w) is the resource allocation rate;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
Further, the time decay degree in step A2 is expressed as follows:
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
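The reliability computation of steps A1–A3 combines a sliding window of recent interactions with time-decayed, Bayesian-style satisfaction counts. A compact sketch is given below; the exponential decay, the way decayed satisfactions are aggregated into T_ex(t), and the window length are assumptions, since the corresponding formulas appear only as images in the original text.

```python
import math
from collections import deque

class ReliabilityTracker:
    """Sliding-window trust record for one (terminal device x, edge node e) pair."""

    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)   # step A1: keep only recent interactions

    def record(self, end_time, provided, total, success):
        # provided/total -> resource allocation rate H_ex(t_w) of this interaction (step A2)
        self.window.append((end_time, provided / total, success))

    def reliability(self, now):
        """Step A3: time-decayed positive vs. negative service satisfaction.
        The aggregation below (weighted success ratio) is an assumed form."""
        pos = neg = 0.0
        for end_time, alloc_rate, success in self.window:
            decay = math.exp(-(now - end_time))   # assumed time-decay function of Delta t_w
            weight = decay * alloc_rate
            if success:
                pos += weight                     # contributes to P_ex(t)
            else:
                neg += weight                     # contributes to N_ex(t)
        total = pos + neg
        return pos / total if total > 0 else 0.5  # neutral prior when no history
```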
Compared with the prior art, the invention has the following beneficial effects:
1. An optimal allocation strategy is calculated with a deep reinforcement learning network model, and the terminal data are transmitted to an edge node e for calculation according to that strategy. This effectively relieves the computing pressure on field equipment, avoids the storage difficulty caused by the large data volume of the resource allocation process, ensures reliable and efficient information interaction of the communication network, and provides better information interaction support services for the power Internet of Things.
2. The deep reinforcement learning network model combines the perception capability of deep learning with the decision capability of reinforcement learning so that their advantages complement each other, and it can support optimal strategy solving over large data volumes.
3. The neural network comprises a real-time ANN and a delayed ANN; after a certain number of training steps, the parameters of the delayed ANN are updated to the parameters of the real-time ANN, which keeps the value function of the delayed ANN up to date and reduces the correlation between states.
4. The method takes global cost and reliability as the performance criteria of the network model, providing a judgment basis for the network model to seek the optimal strategy.
5. The method updates interaction information with a sliding-window mechanism and directly discards interaction records whose interval is too long, which reduces the computational overhead of the user terminal; by calculating the reliability, it ensures the safety of the user terminal during task offloading and helps establish a good interaction environment.
6. The method calculates several interaction quality values between the user terminal and the edge server, preparing for the reliability calculation and providing a judgment basis for the network model to seek the optimal strategy.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flowchart of an implementation method of the deep reinforcement learning network model according to the present invention.
FIG. 3 is a flow chart of a method for training a real-time ANN and a delayed ANN of the present invention.
FIG. 4 is a flowchart of a reliability calculation method according to the present invention.
FIG. 5 is a schematic view of a sliding window according to the present invention.
FIG. 6 is a diagram of a deep reinforcement learning network structure according to the present invention.
FIG. 7 is a diagram illustrating deep reinforcement learning network model parameters according to an embodiment of the present invention.
Fig. 8 is a network performance curve diagram of the deep reinforcement learning network model in the embodiment of the present invention at different learning rates.
Detailed Description
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example one
Referring to fig. 1, a method for allocating an agent resource of an edge internet of things based on deep reinforcement learning includes:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
preferably, in this embodiment, the data collected by the terminal device x are data such as voice, video and images from the user terminal;
preferably, in this embodiment, Python 3 + TensorFlow 1.0 is used as the simulation experiment platform, the hardware is an Intel Core i7-5200U processor with 16 GB of memory, and the simulation test environment contains 50 terminal devices x and 5 edge nodes e uniformly distributed over a 15 km × 15 km grid;
preferably, in this embodiment, each terminal device x sends a task request to an edge node e every hour, and the edge nodes determine in a distributed manner which server executes each task; the load of the terminal devices x is derived from a real load data set, in which the terminal task load follows a roughly 24-hour periodic distribution owing to tidal effects but also fluctuates randomly owing to environmental factors.
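For illustration, a synthetic test environment mirroring the numbers just described could be generated as follows. The device and node counts and the grid size come from the embodiment; the sinusoid-plus-noise load model is an assumption, since the embodiment draws its loads from a real load data set.

```python
import numpy as np

# 50 terminal devices and 5 edge nodes uniformly placed on a 15 km x 15 km grid.
rng = np.random.default_rng(0)
terminal_xy = rng.uniform(0.0, 15.0, size=(50, 2))   # km
edge_xy = rng.uniform(0.0, 15.0, size=(5, 2))        # km

def hourly_load(hour, base=1.0, amplitude=0.5, noise=0.1):
    # Roughly 24-hour periodic ("tidal") load with random fluctuation (assumed model).
    tide = base + amplitude * np.sin(2 * np.pi * hour / 24.0)
    return np.maximum(0.0, tide + noise * rng.standard_normal(50))

loads = np.stack([hourly_load(h) for h in range(24)])  # one request per device per hour
```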
Preferably, in this embodiment, FIG. 7 lists the parameters of the deep reinforcement learning network model.
Step S2: obtaining an optimal allocation strategy from the data by the deep reinforcement learning network model;
Step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, thereby realizing edge Internet of Things agent resource allocation.
In this embodiment, specifically, as shown in FIG. 2, the training method of the deep reinforcement learning network model in step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t by using an ε-greedy strategy;
step S105: the environment feeds back a reward σ_(t+1) and the next system state s_(t+1) according to the system action a_t;
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_(t+1) and the next system state s_(t+1), and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the storage capacity of the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, and finishing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_(t+1) and returning to step S104.
In this embodiment, specifically, the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, G_d is its total computing resources, and the computing resources allocated to each task are given by the corresponding entries of the allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_(t+1) in step S105 is calculated as follows:
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value of the next state reached after taking system action a_t in the current system state s_t;
A″ is the objective value calculated when all tasks are partially offloaded;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_(t+1), s_(t+1)).
In this embodiment, specifically, as shown in FIG. 3, the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t; θ) of the state–action pair and the value Q(s_(t+1), a_(t+1); θ′) of the next state;
step S1072: calculating the target value y of the state–action pair from the value Q(s_(t+1), a_(t+1); θ′) of the next state and the reward σ_(t+1);
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t; θ) of the state–action pair and the target value y;
step S1074: adjusting the parameter θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameter θ′ of the delayed ANN was last updated equals a set value; if so, updating the parameter θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether the training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indices of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delayed ANN is finished and the trained deep reinforcement learning network model is obtained; otherwise, N state transition sequences are re-extracted from the experience pool O and the process returns to step S1071.
In this embodiment, specifically, the target value y of the state–action pair in step S1072 is calculated as follows:
wherein:
Q(s_(t+1), a_(t+1); θ′) is the value of the next state of the system;
max Q(s_(t+1), a_(t+1); θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
In this embodiment, specifically, the performance indices of the deep reinforcement learning network model in step S1077 include global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
In this embodiment, in order to realize efficient task processing, three factors are considered: the delay cost c_1, the migration cost c_2 and the load cost c_3. Since terminal device x needs to send the collected data to edge node e for processing, a time delay is incurred during data transmission. When a task is processed, edge node e may also decide whether to hand the task over to a migration edge node j; however, since a migrated task requires redeploying the model, a migration cost is incurred. Because the capacity of edge node e is limited, if too many tasks are deployed on the same edge node e, that node tends to be overloaded, which results in a load cost.
In this embodiment, specifically, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
x is the terminal device set;
e is the edge node set;
u_x is the amount of data sent;
the deployment variable of terminal device x and edge node e in the current interaction period;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
the deployment variable of terminal device x and edge node e in the previous interaction period;
the deployment variable of terminal device x and migration edge node j in the current interaction period;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data sent.
In this embodiment, specifically, as shown in FIG. 4, the calculation of the reliability comprises the following steps:
step A1: storing the interaction data of terminal device x and edge node e in a sliding window that is updated in real time;
in this embodiment, considering that interaction experience separated by a long interval is not sufficient to update the current reliability value in time, more attention should be paid to the most recent interaction behavior, so the interaction information is updated with a sliding-window mechanism; as shown in FIG. 5, when the next time slot of interaction information arrives, the record with the longest interval in the window is discarded and the valid interaction information is recorded in the window, thereby reducing the computational overhead of the user terminal;
step A2: calculating the time decay degree and the resource allocation rate of the current interaction from the historical interaction data of terminal device x and edge node e, using an expected value based on Bayesian trust evaluation;
in this embodiment, since the reliability of the edge server is updated dynamically, the further the historical interaction information is from the current time, the smaller its influence on the current reliability evaluation; the time decay function therefore represents the degree of attenuation of the information from the w-th interaction to the time slot of the current interaction, where Δt_w = t − t_w and t_w is the end time of the w-th interaction time slot; the amount of computing resources that the edge server can provide in each interaction also influences the updating of the interaction information;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
In this embodiment, specifically, the reliability T_ex(t) is calculated as follows:
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective records in the sliding window;
w is the current interaction record;
H_ex(t_w) is the resource allocation rate;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
In this embodiment, specifically, the time decay degree in step A2 is expressed as follows:
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
In this embodiment, a deep reinforcement learning network model is used to solve for the optimal allocation strategy. As shown in FIG. 6, the model comprises two neural networks: the first, called the real-time ANN, calculates the estimated value Q(s_t, a_t; θ) of the current state–action pair, where θ denotes the parameters of the real-time ANN and is updated every time the estimate of the current state is calculated; the second, called the delayed ANN, calculates the value Q(s_(t+1), a_(t+1); θ′) of the next state, which is in turn used to calculate the target value y.
In this embodiment, the influence of different learning rates on the deep reinforcement learning network model is tested. As shown in FIG. 8, when the learning rate is set to 0.01 the network loss function cannot converge effectively and the function value oscillates noticeably. In contrast, when the learning rate is set to 0.0001, the divergence of the network is effectively suppressed and the network converges at about 60 iterations, although the convergence becomes noticeably slower. Clearly, with the learning rate set to 0.0001 the resource allocation performance is best: the loss function decreases quickly, the network converges more stably, and the convergence effect is better.
The above-mentioned embodiments only express the specific embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, without departing from the technical idea of the present application, several changes and modifications can be made, which are all within the protection scope of the present application.
The background section is provided to present the context of the invention in general, and work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.
Claims (10)
1. An edge internet of things agent resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy from the data by the deep reinforcement learning network model; and
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, thereby realizing edge internet of things agent resource allocation.
2. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 1, wherein the training method of the deep reinforcement learning network model in step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t by using an ε-greedy strategy;
step S105: the environment feeds back a reward σ_(t+1) and the next system state s_(t+1) according to the system action a_t;
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_(t+1) and the next system state s_(t+1), and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the storage capacity of the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, and finishing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_(t+1) and returning to step S104.
3. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 2, wherein the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, G_d is its total computing resources, and the computing resources allocated to each task are given by the corresponding entries of the allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_(t+1) in step S105 is calculated as follows:
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value of the next state reached after taking system action a_t in the current system state s_t;
A″ is the objective value calculated when all tasks are partially offloaded;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_(t+1), s_(t+1)).
4. the method for allocating the proxy resource of the edge internet of things based on the deep reinforcement learning of claim 3, wherein the training method for the real-time ANN and the delayed ANN in the step S107 comprises the following steps:
step S1071: for the N state transition sequences, obtaining an estimated value Q(s) of a state action pair according to the state transition sequences t ,a t θ) and value of the next state Q(s) t+1 ,a t+1 ,θ');
Step S1072: value Q(s) according to the next state t+1 ,a t+1 Theta') and prize sigma t+1 Calculating to obtain a target value y of the state action pair;
step S1073: an estimate Q(s) from the state action pair t ,a t Theta) and a target value y, and calculating to obtain a Loss function Loss (theta);
step S1074: adjusting a parameter theta of the real-time ANN through a Loss back propagation mechanism, and reducing a Loss function Loss (theta) by using an optimizer RMSprop;
step S1075: judging whether the step number of the parameter theta 'of the last updating delay ANN is equal to a set value or not, if so, updating the parameter theta' of the delay ANN, and entering the step S1077; otherwise, go to step S1076;
step S1076: judging whether the training of the N state transition sequences is finished, if so, extracting the N state transition sequences again from the experience pool O, returning to the step S1071, and otherwise, returning to the step S1071;
step S1077: testing the performance index of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement, if so, finishing the real-time ANN and delayed ANN training to obtain a trained deep reinforcement learning network model; otherwise, N state transition sequences are re-extracted from the experience pool O, and the process returns to step S1071.
5. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 4, wherein the target value y of the state–action pair in step S1072 is calculated as follows:
wherein:
Q(s_(t+1), a_(t+1); θ′) is the value of the next state of the system;
max Q(s_(t+1), a_(t+1); θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
6. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 5, wherein the performance indices of the deep reinforcement learning network model in step S1077 include global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
7. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 6, wherein the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
x is the terminal device set;
e is the edge node set;
u_x is the amount of data sent;
the deployment variable of terminal device x and edge node e in the current interaction period;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
the deployment variable of terminal device x and edge node e in the previous interaction period;
the deployment variable of terminal device x and migration edge node j in the current interaction period;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data sent.
8. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 6, wherein the calculation of the reliability comprises the following steps:
step A1: storing the interaction data of terminal device x and edge node e in a sliding window that is updated in real time;
step A2: calculating the time decay degree and the resource allocation rate of the current interaction from the historical interaction data of terminal device x and edge node e, using an expected value based on Bayesian trust evaluation;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
9. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 8, wherein the reliability T_ex(t) is calculated as follows:
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective records in the sliding window;
w is the current interaction record;
H_ex(t_w) is the resource allocation rate;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
10. The edge internet of things agent resource allocation method based on deep reinforcement learning according to claim 9, wherein the time decay degree in step A2 is expressed as follows:
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115914227A true CN115914227A (en) | 2023-04-04 |
CN115914227B CN115914227B (en) | 2024-03-19 |
Family
ID=86493215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211401605.2A Active CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115914227B (en) |
- 2022-11-10: CN application CN202211401605.2A granted as patent CN115914227B (active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112134916A (en) * | 2020-07-21 | 2020-12-25 | 南京邮电大学 | Cloud edge collaborative computing migration method based on deep reinforcement learning |
US20220180174A1 (en) * | 2020-12-07 | 2022-06-09 | International Business Machines Corporation | Using a deep learning based surrogate model in a simulation |
CN113890653A (en) * | 2021-08-30 | 2022-01-04 | 广东工业大学 | Multi-agent reinforcement learning power distribution method for multi-user benefits |
CN114490057A (en) * | 2022-01-24 | 2022-05-13 | 电子科技大学 | MEC unloaded task resource allocation method based on deep reinforcement learning |
Non-Patent Citations (5)
Title |
---|
BO FENG et al.: "Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation", 2020 5TH ASIA CONFERENCE ON POWER AND ELECTRICAL ENGINEERING (ACPEE), 30 June 2020 (2020-06-30) *
ZHU Fei; WU Wen; LIU Quan; FU Yuchen: "A Deep Q-Network Method with Upper-Confidence-Bound Experience Sampling" (in Chinese), Journal of Computer Research and Development, no. 08, 15 August 2018 (2018-08-15) *
LI Ziheng; MENG Chao: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (in Chinese), Communications Technology, vol. 53, no. 008, 31 December 2020 (2020-12-31) *
LI Ziheng; MENG Chao: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (in Chinese), Communications Technology, no. 08, 10 August 2020 (2020-08-10) *
RAO Ning et al.: "Distributed Cooperative Jamming Power Allocation Algorithm Based on Multi-Agent Deep Reinforcement Learning" (in Chinese), Acta Electronica Sinica, 30 June 2022 (2022-06-30) *
Also Published As
Publication number | Publication date |
---|---|
CN115914227B (en) | 2024-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113568727B (en) | Mobile edge computing task allocation method based on deep reinforcement learning | |
US8250198B2 (en) | Capacity planning for data center services | |
CN112579194B (en) | Block chain consensus task unloading method and device based on time delay and transaction throughput | |
CN113115368B (en) | Base station cache replacement method, system and storage medium based on deep reinforcement learning | |
Zhang et al. | Joint optimization of cooperative edge caching and radio resource allocation in 5G-enabled massive IoT networks | |
CN114357455B (en) | Trust method based on multidimensional attribute trust evaluation | |
EP3547625A1 (en) | Method and system for sending request for acquiring data resource | |
CN112395090B (en) | Intelligent hybrid optimization method for service placement in mobile edge calculation | |
CN109634744A (en) | A kind of fine matching method based on cloud platform resource allocation, equipment and storage medium | |
CN108390775B (en) | User experience quality evaluation method and system based on SPICE | |
CN113543160B (en) | 5G slice resource allocation method, device, computing equipment and computer storage medium | |
CN114500561B (en) | Power Internet of things network resource allocation decision-making method, system, equipment and medium | |
CN110072130A (en) | A kind of HAS video segment method for pushing based on HTTP/2 | |
KR20180027995A (en) | Method and apparatus for future prediction in Internet of thing | |
CN111901134B (en) | Method and device for predicting network quality based on recurrent neural network model (RNN) | |
CN109379747B (en) | Wireless network multi-controller deployment and resource allocation method and device | |
CN117539648A (en) | Service quality management method and device for electronic government cloud platform | |
Bensalem et al. | Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach | |
CN115914227B (en) | Edge internet of things proxy resource allocation method based on deep reinforcement learning | |
CN110191362B (en) | Data transmission method and device, storage medium and electronic equipment | |
CN114980324A (en) | Slice-oriented low-delay wireless resource scheduling method and system | |
CN110933119B (en) | Method and equipment for updating cache content | |
CN118245809B (en) | Batch size adjustment method in distributed data parallel online asynchronous training | |
CN117834643B (en) | Deep neural network collaborative reasoning method for industrial Internet of things | |
CN110650187B (en) | Node type determination method for edge node and target network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |