CN115914227A - Edge Internet of things agent resource allocation method based on deep reinforcement learning - Google Patents

Edge Internet of things agent resource allocation method based on deep reinforcement learning

Info

Publication number
CN115914227A
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
state
time
things
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211401605.2A
Other languages
Chinese (zh)
Other versions
CN115914227B (en)
Inventor
钟加勇
田鹏
吕小红
吴彬
籍勇亮
李俊杰
宫林
何迎春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd, State Grid Corp of China SGCC, State Grid Chongqing Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
Priority to CN202211401605.2A priority Critical patent/CN115914227B/en
Publication of CN115914227A publication Critical patent/CN115914227A/en
Application granted granted Critical
Publication of CN115914227B publication Critical patent/CN115914227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an edge Internet of Things agent resource allocation method based on deep reinforcement learning, and relates to the technical field of the Internet of Things. The method comprises the following steps: first, a terminal device x collects data in the environment and transmits the data to a deep reinforcement learning network model; the deep reinforcement learning network model obtains an optimal allocation strategy according to the data; finally, the data are sent to an edge node e for calculation according to the optimal allocation strategy, realizing edge Internet of Things agent resource allocation. The invention solves the problems that edge Internet of Things agent resource allocation takes a long time, that its performance is limited, and that the prior art is insufficient to support the optimal resource configuration of a complex, dynamic Internet of Things.

Description

Edge Internet of things agent resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of Internet of things, in particular to an edge Internet of things agent resource allocation method based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Reasonable resource allocation is an important guarantee for efficiently supporting the power services of the edge Internet of Things agent. The power Internet of Things is an important component of the national industrial Internet, and building an efficient, safe and reliable sensing layer has become a key construction task in the power industry. However, the computing power of existing power Internet of Things devices is limited and cannot effectively carry out large-scale, rapid local computation. The edge Internet of Things agent, as the core equipment of the Internet of Things sensing layer, connects the Internet of Things terminals with the cloud side. With the access of voice, video, image and other data, the acquisition of high-frequency data and the storage of heterogeneous data, how to dynamically and adaptively deploy the tasks of Internet of Things terminals onto suitable edge Internet of Things agent nodes is a key problem at the present stage.
At present, the key problems of the edge Internet of Things agent are mainly embodied in two aspects. First, because edge Internet of Things agents at different edges depend on one another, existing combinatorial optimization methods generally adopt approximate or heuristic algorithms to solve for a deployment scheme, which not only requires a long running time but also offers limited performance. Second, multiple edge nodes exist in the edge Internet of Things agent environment and the resource capacity of each edge server is limited; different edge nodes therefore need to cooperate through distributed decision-making to achieve optimal resource allocation and support efficient, reliable information interaction.
The emergence of multi-layer network models provides a new approach to the optimal configuration of communication network resources: a network model is trained through a multi-layer network to achieve an accurate and efficient solution, and several lines of research already exist. One prior-art scheme is based on a convolutional neural network and achieves reasonable allocation of Internet of Things resources and efficient interaction and coordination of terminal data and network tasks by edge devices. Another scheme uses Bayesian optimization of a Q-learning network to rationalize and order resource allocation in the network and to resist DDoS attacks. In addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing in industrial Internet of Things networks and ensures low-delay, high-reliability data interaction. Considering the heterogeneity of network devices, the prior art mostly adopts deep learning networks to match network servers with user requests and to allocate an optimal amount of resources to user devices. However, owing to the structure of such deep network models, a mismatch between computing capability and the problem being processed easily arises as the network state is updated and iterated; the computing efficiency is therefore limited and insufficient to support the optimal resource configuration of a complex, dynamic Internet of Things.
Disclosure of Invention
The invention aims to provide an edge Internet of Things agent resource allocation method based on deep reinforcement learning that addresses the defects of the prior art, namely that edge Internet of Things agent resource allocation takes a long time, that its performance is limited, and that the prior art is insufficient to support the optimal resource configuration of a complex, dynamic Internet of Things.
The technical scheme of the invention is as follows:
an edge Internet of things agent resource allocation method based on deep reinforcement learning comprises the following steps:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal distribution strategy by a deep reinforcement learning network model according to the data;
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the allocation of the edge Internet of Things agent resources.
Further, the training method of the deep reinforcement learning network model in the step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: according to the current system state s_t, selecting a system action a_t by using an ε-greedy strategy (a minimal sketch of this selection step is given after these steps);
step S105: the environment, according to the system action a_t, feeds back a reward σ_{t+1} and the next system state s_{t+1};
step S106: according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, calculating a state transition sequence Δ_t, and storing the state transition sequence Δ_t into the experience pool O;
step S107: judging whether the storage capacity of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
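The ε-greedy selection in step S104 can be illustrated with the minimal Python sketch below. It is not the patented implementation: the q_network callable and the enumeration of candidate actions [x, μ, k] are assumed placeholders.

```python
import random

def select_action(state, candidate_actions, q_network, epsilon):
    """epsilon-greedy selection for step S104 (illustrative sketch).

    state             -- current system state s_t = [F, M, B]
    candidate_actions -- assumed list of feasible actions a = [x, mu, k]
    q_network         -- assumed callable returning Q(s, a, theta) from the real-time ANN
    epsilon           -- exploration probability
    """
    if random.random() < epsilon:
        # explore: pick a random feasible action
        return random.choice(candidate_actions)
    # exploit: pick the action with the highest estimated value under the real-time ANN
    return max(candidate_actions, key=lambda a: q_network(state, a))
```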
Further, the system state S in step S101 is a local offload state, and the expression is as follows:
s = [F, M, B]
wherein:
F is an offloading decision vector;
M is a computing resource allocation vector;
B is a residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, b_d = G_d − (the computing resources allocated in the vector M to the tasks on that server), and G_d is the total computing resources of the d-th MEC server;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is the terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_{t+1} in step S105 is calculated by the reward function R (the formula is given as an image in the original):
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value in the next state after the system action a_t is taken in the current system state s_t;
A″ is the calculated value under all partial offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
further, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) of the next state and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, finishing the training of the real-time ANN and the delayed ANN to obtain the trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
Further, the calculation formula of the target value y of the state action pair in step S1072 is as follows:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
Loss(θ) = (1/N) Σ_{n=1}^{N} [ y_n − Q(s_{t,n}, a_{t,n}, θ) ]²
wherein:
N is the number of state transition sequences extracted each time;
n is the index of a state transition sequence.
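A minimal sketch of one training update (steps S1071-S1075) is given below. It assumes a TF2-style tf.keras API rather than the TensorFlow 1.0 platform named in the embodiment, treats actions as integer indices into the network output, and uses γ as the fluctuation (discount) coefficient of the reconstructed target-value formula above; these are illustrative assumptions, not the patented implementation.

```python
import tensorflow as tf

def train_step(realtime_ann, delayed_ann, optimizer, batch, gamma):
    """One update on a mini-batch of N transitions (s_t, a_t, sigma_{t+1}, s_{t+1}).

    realtime_ann / delayed_ann -- assumed tf.keras models mapping a state to Q-values per action
    optimizer                  -- e.g. tf.keras.optimizers.RMSprop(learning_rate=1e-4)
    batch                      -- tuple of arrays (states, actions, rewards, next_states)
    gamma                      -- fluctuation (discount) coefficient
    """
    states, actions, rewards, next_states = batch
    # target value y = sigma_{t+1} + gamma * max_a Q(s_{t+1}, a, theta') from the delayed ANN
    next_q = delayed_ann(next_states)                               # shape (N, num_actions)
    y = tf.convert_to_tensor(rewards, dtype=tf.float32) + gamma * tf.reduce_max(next_q, axis=1)
    with tf.GradientTape() as tape:
        q_all = realtime_ann(states)                                # Q(s_t, ., theta), real-time ANN
        one_hot = tf.one_hot(actions, q_all.shape[-1])
        q_taken = tf.reduce_sum(q_all * one_hot, axis=1)            # Q(s_t, a_t, theta)
        loss = tf.reduce_mean(tf.square(y - q_taken))               # Loss(theta)
    grads = tape.gradient(loss, realtime_ann.trainable_variables)
    optimizer.apply_gradients(zip(grads, realtime_ann.trainable_variables))
    return float(loss)

def sync_delayed_ann(realtime_ann, delayed_ann):
    """Step S1075: copy the real-time ANN parameters theta into the delayed ANN parameters theta'."""
    delayed_ann.set_weights(realtime_ann.get_weights())
```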
Further, the deep reinforcement learning network model performance index in step S1077 includes: global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3;
Further, the expression of the delay cost c_1 is as follows (the formula is given as an image in the original):
wherein:
T is the number of interactions;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
the deployment variable of the terminal device x and the edge node e at the current interaction time (its symbol is given as an image in the original);
τ_xe is the transmission delay between the terminal device x and the edge node e;
the expression of the migration cost c_2 is as follows (the formula is given as an image in the original):
wherein:
j is a migration edge node;
the deployment variable of the terminal device x and the edge node e at the previous interaction time (its symbol is given as an image in the original);
the deployment variable of the terminal device x and the migration edge node j at the current interaction time (its symbol is given as an image in the original);
the expression of the load cost c_3 is as follows (the formula is given as an image in the original):
wherein:
u_x is the amount of data sent.
Further, the calculation of the reliability includes the steps of:
step A1: storing the interactive data of the terminal device x and the edge node e in a sliding window, and updating in real time;
step A2: calculating the time attenuation degree and the resource allocation rate of the current interaction by adopting an expected value based on Bayesian trust evaluation according to historical interaction data of the terminal device x and the edge node e;
step A3: calculating the reliability T_ex(t) according to the time attenuation degree and the resource allocation rate.
Further, the calculation formula of the reliability T_ex(t) is as follows (the formulas for T_ex(t) and for the positive service satisfaction P_ex(t) are given as images in the original):
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective information items in the sliding window;
w indexes the current interaction information;
the degree of temporal decay (its symbol is given as an image in the original);
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between the terminal device x and the edge node e;
f_ex(t) is the number of failed historical interactions between the terminal device x and the edge node e.
Further, the expression of the degree of temporal attenuation in step A2 is as follows:
(the decay function is given as an image in the original)
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that the edge node e can provide to the terminal device x in the current time slot;
source_e(t) is the total amount of resources that the edge node e can provide in the current time slot.
Compared with the prior art, the invention has the beneficial effects that:
1. An optimal allocation strategy is computed with the deep reinforcement learning network model, and terminal data are transmitted to an edge node e for calculation according to that strategy. This effectively relieves the computing pressure on field devices, avoids the storage difficulty caused by large data volumes during resource allocation, guarantees reliable and efficient information interaction in the communication network, and provides better information interaction support services for the power Internet of Things.
2. The deep reinforcement learning network model combines the perception capability of deep learning with the decision-making capability of reinforcement learning, so their advantages complement each other and optimal strategy solving for large data volumes can be supported.
3. The neural network comprises a real-time ANN and a delayed ANN; after a certain number of training steps, the parameters of the delayed ANN are updated to the parameters of the real-time ANN, which keeps the delayed ANN value function up to date and reduces the correlation between states.
4. The method uses the global cost and the reliability as performance indexes of the network model, providing a judgment basis for the network model to seek the optimal strategy.
5. The method updates interaction information with a sliding window mechanism and directly discards interaction information whose interval is too long, reducing the computation overhead of the user terminal; by calculating the reliability, it ensures the safety of the user terminal during task offloading and helps establish a good interaction environment.
6. The method calculates several interaction quality values between the user terminal and the edge server, preparing for the reliability calculation and providing a judgment basis for the network model to seek the optimal strategy.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flowchart of an implementation method of the deep reinforcement learning network model according to the present invention.
FIG. 3 is a flow chart of a method for training a real-time ANN and a delayed ANN of the present invention.
FIG. 4 is a flowchart of a reliability calculation method according to the present invention.
FIG. 5 is a schematic view of a sliding window according to the present invention.
FIG. 6 is a diagram of a deep reinforcement learning network structure according to the present invention.
FIG. 7 is a diagram illustrating deep reinforcement learning network model parameters according to an embodiment of the present invention.
Fig. 8 is a network performance curve diagram of the deep reinforcement learning network model in the embodiment of the present invention at different learning rates.
Detailed Description
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example one
Referring to fig. 1, a method for allocating an agent resource of an edge internet of things based on deep reinforcement learning includes:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
preferably, in this embodiment, the data collected by the terminal device x is data such as voice, video, image, etc. of the user terminal;
preferably, in this embodiment, Python 3 with TensorFlow 1.0 is used as the simulation experiment platform; the hardware is an Intel Core i7-5200U with 16 GB of memory; 50 terminal devices x and 5 edge nodes e are set in the simulation test environment, and the terminal devices x and the edge nodes e are uniformly distributed over a 15 km × 15 km grid;
preferably, in this embodiment, the terminal device x sends a task request to the edge node e every hour, and the edge nodes e determine, in a distributed manner, the server that executes the task; the load of the terminal device x is derived from a real load data set in which the load of the terminal task follows a roughly 24-hour periodic distribution due to tidal effects, but also fluctuates randomly due to environmental factors.
Preferably, in this embodiment, fig. 7 illustrates deep reinforcement learning network model parameters.
Step S2: obtaining an optimal distribution strategy by a deep reinforcement learning network model according to the data;
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the allocation of the edge Internet of Things agent resources.
In this embodiment, specifically, as shown in fig. 2, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: according to the current system state s_t, selecting a system action a_t by using an ε-greedy strategy;
step S105: the environment, according to the system action a_t, feeds back a reward σ_{t+1} and the next system state s_{t+1};
step S106: according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, calculating a state transition sequence Δ_t, and storing the state transition sequence Δ_t into the experience pool O;
step S107: judging whether the storage capacity of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104 (this interaction loop is sketched below).
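To make the interaction loop of steps S101-S107 concrete, a minimal Python sketch follows. The environment object env, the select_action helper and the train_on_batch routine are assumed placeholders, and the pool threshold and batch size are illustrative values, not parameters disclosed by the patent.

```python
import random
from collections import deque

def collect_and_train(env, select_action, train_on_batch, episodes=200,
                      pool_capacity=10000, pool_threshold=1000, batch_size=32,
                      epsilon=0.1):
    """Sketch of steps S101-S107: interact, store transitions in pool O, train once the pool is full enough."""
    experience_pool = deque(maxlen=pool_capacity)        # experience pool O
    state = env.reset()                                   # step S101: initial system state s
    for _ in range(episodes):
        action = select_action(state, epsilon)                        # step S104: epsilon-greedy a_t
        reward, next_state = env.step(action)                         # step S105: sigma_{t+1}, s_{t+1}
        experience_pool.append((state, action, reward, next_state))   # step S106: store Delta_t
        if len(experience_pool) >= pool_threshold:                    # step S107: pool reached preset value
            batch = random.sample(list(experience_pool), batch_size)  # extract N transitions
            train_on_batch(batch)                                     # train real-time ANN and delayed ANN
        state = next_state                                            # otherwise continue from s_{t+1}
```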
In this embodiment, specifically, the system state s in step S101 is a local offload state, expressed as follows:
s = [F, M, B]
wherein:
F is an offloading decision vector;
M is a computing resource allocation vector;
B is a residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, b_d = G_d − (the computing resources allocated in the vector M to the tasks on that server), and G_d is the total computing resources of the d-th MEC server;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is the terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_{t+1} in step S105 is calculated by the reward function R (the formula is given as an image in the original):
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value in the next state after the system action a_t is taken in the current system state s_t;
A″ is the calculated value under all partial offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
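For readability, the state s = [F, M, B], the action a_t = [x, μ, k] and the transition Δ_t can be represented with small container types, as in the illustrative Python sketch below; the field types are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SystemState:
    """System state s = [F, M, B] (local offload state)."""
    F: List[int]      # offloading decision vector
    M: List[float]    # computing resource allocation vector
    B: List[float]    # residual computing resources b_d of each MEC server

@dataclass
class SystemAction:
    """System action a_t = [x, mu, k]."""
    x: int            # terminal device index
    mu: int           # offloading scheme chosen for terminal device x
    k: float          # computing resources allocated to terminal device x

# state transition sequence Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1})
Transition = Tuple[SystemState, SystemAction, float, SystemState]
```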
in this embodiment, specifically, as shown in fig. 3, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) of the next state and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance index of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement, if so, finishing the real-time ANN and delayed ANN training to obtain a trained deep reinforcement learning network model; otherwise, N state transition sequences are re-extracted from the experience pool O, and the process returns to step S1071.
In this embodiment, specifically, the formula for calculating the target value y of the state-action pair in step S1072 is as follows:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
Loss(θ) = (1/N) Σ_{n=1}^{N} [ y_n − Q(s_{t,n}, a_{t,n}, θ) ]²
wherein:
N is the number of state transition sequences extracted each time;
n is the index of a state transition sequence.
In this embodiment, specifically, the performance index of the deep reinforcement learning network model in step S1077 includes: global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
In this embodiment, in order to realize efficient task processing, three factors are considered: the delay cost c_1, the migration cost c_2 and the load cost c_3. Because the terminal device x needs to send the collected data to the edge node e for processing, a time delay is incurred while the data are transmitted. When a task is processed, the edge node e may also decide whether to send the task to a migration edge node j; however, since the migrated task requires the model to be redeployed, a migration cost is incurred. Because the capacity of an edge node e is limited, if too many tasks are deployed on the same edge node e it tends to be overloaded, which produces a load cost.
In this embodiment, specifically, the expression of the delay cost c_1 is as follows (the formula is given as an image in the original):
wherein:
T is the number of interactions;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
the deployment variable of the terminal device x and the edge node e at the current interaction time (its symbol is given as an image in the original);
τ_xe is the transmission delay between the terminal device x and the edge node e;
the expression of the migration cost c_2 is as follows (the formula is given as an image in the original):
wherein:
j is a migration edge node;
the deployment variable of the terminal device x and the edge node e at the previous interaction time (its symbol is given as an image in the original);
the deployment variable of the terminal device x and the migration edge node j at the current interaction time (its symbol is given as an image in the original);
the expression of the load cost c_3 is as follows (the formula is given as an image in the original):
wherein:
u_x is the amount of data sent.
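A hedged sketch of how the global cost could be evaluated is given below. The patent gives the exact expressions for c_1, c_2 and c_3 only as images, so the forms used here, namely a delay cost of data volume times transmission delay over current deployments, a migration cost counting redeployments at a unit cost, and a load cost penalizing data placed beyond each node's capacity, are assumptions chosen to match the variable descriptions; unit_migration_cost and capacity are likewise hypothetical parameters.

```python
def global_cost(deploy_now, deploy_prev, data_amount, delay, capacity,
                unit_migration_cost=1.0):
    """Illustrative global cost = c1 + c2 + c3 for one interaction round (assumed forms).

    deploy_now / deploy_prev -- dicts {terminal x: edge node e} for the current / previous round
    data_amount              -- dict {terminal x: u_x}
    delay                    -- dict {(x, e): tau_xe} transmission delays
    capacity                 -- dict {edge node e: assumed processing capacity}
    """
    # c1: assumed delay cost -- data volume weighted by the transmission delay of the current deployment
    c1 = sum(data_amount[x] * delay[(x, e)] for x, e in deploy_now.items())

    # c2: assumed migration cost -- a unit cost whenever a terminal's task moves to another node
    c2 = unit_migration_cost * sum(1 for x, e in deploy_now.items()
                                   if deploy_prev.get(x) is not None and deploy_prev[x] != e)

    # c3: assumed load cost -- penalize the data volume placed beyond each node's capacity
    load = {}
    for x, e in deploy_now.items():
        load[e] = load.get(e, 0.0) + data_amount[x]
    c3 = sum(max(0.0, l - capacity[e]) for e, l in load.items())

    return c1 + c2 + c3
```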
In this embodiment, specifically, as shown in fig. 4, the calculating of the reliability includes the following steps:
step A1: storing the interactive data of the terminal device x and the edge node e in a sliding window, and updating in real time;
in this embodiment, considering that interaction experience from a long time ago is not sufficient to update the current reliability value in time, more attention should be paid to the most recent interaction behaviour, so the interaction information is updated with a sliding window mechanism; as shown in fig. 5, when the next time slot of interaction information arrives, the record with the longest interval in the window is discarded and the valid interaction information is recorded in the window, thereby reducing the computation overhead of the user terminal;
step A2: calculating the time attenuation degree and the resource allocation rate of the current interaction by adopting an expected value based on Bayesian trust evaluation according to historical interaction data of the terminal device x and the edge node e;
in this embodiment, since the reliability of the edge server is dynamically updated, the longer the historical interaction information is from the current time, the smaller the influence on the current reliability evaluation is, and the time decay function is defined as:
(the time decay function is given as an image in the original)
representing the degree of attenuation of information from the w-th interaction to the time slot of the current interaction, where Δt_w = t − t_w and t_w is the end time of the w-th interaction time slot; the amount of computing resources that the edge server can provide in each interaction also influences the updating of the interaction information;
step A3: calculating the reliability T_ex(t) according to the time attenuation degree and the resource allocation rate.
In this embodiment, specifically, the calculation formula of the reliability T_ex(t) is as follows (the formulas for T_ex(t) and for the positive service satisfaction P_ex(t) are given as images in the original):
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective information items in the sliding window;
w indexes the current interaction information;
the degree of temporal decay (its symbol is given as an image in the original);
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between the terminal device x and the edge node e;
f_ex(t) is the number of failed historical interactions between the terminal device x and the edge node e.
In this embodiment, specifically, the expression of the time attenuation degree in step A2 is as follows:
(the decay function is given as an image in the original)
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that the edge node e can provide to the terminal device x in the current time slot;
source_e(t) is the total amount of resources that the edge node e can provide in the current time slot.
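The sliding-window reliability evaluation of steps A1-A3 can be sketched as below. Because the patent gives the decay function, the satisfaction terms and the T_ex(t) formula only as images, this sketch uses assumed forms: an exponential time decay exp(−λ·Δt_w), a Bayesian expectation (s+1)/(s+f+2) for the positive satisfaction, and a decay- and allocation-weighted average for the reliability; the parameters lam and epsilon are illustrative.

```python
import math
from collections import deque

class ReliabilityEvaluator:
    """Illustrative sliding-window trust evaluation for one (terminal x, edge node e) pair."""

    def __init__(self, window_size, lam=0.1, epsilon=0.5):
        self.window = deque(maxlen=window_size)  # step A1: sliding window of interaction records
        self.lam = lam                           # assumed decay rate of the time-decay function
        self.epsilon = epsilon                   # fluctuation coefficient (assumed weighting)

    def record(self, end_time, allocated, total, success):
        """Store one interaction: its end time t_w, resources offered, and whether it succeeded."""
        self.window.append((end_time, allocated, total, success))

    def reliability(self, now):
        """Step A3: combine time decay, resource allocation rate and Bayesian satisfaction."""
        successes = sum(1 for _, _, _, ok in self.window if ok)   # s_ex(t)
        failures = len(self.window) - successes                   # f_ex(t)
        # assumed Bayesian trust expectation for the positive satisfaction P_ex(t)
        p_pos = (successes + 1) / (successes + failures + 2)
        n_neg = 1 - p_pos                                          # N_ex(t) = 1 - P_ex(t)
        score, weight = 0.0, 0.0
        for end_time, allocated, total, _ in self.window:
            decay = math.exp(-self.lam * (now - end_time))         # assumed time-decay form
            h_rate = allocated / total if total > 0 else 0.0       # resource allocation rate H_ex
            score += decay * h_rate * (p_pos - self.epsilon * n_neg)  # assumed combination
            weight += decay
        return score / weight if weight > 0 else 0.0
```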
In this embodiment, a deep reinforcement learning network model is used to solve for the optimal allocation strategy. As shown in fig. 6, the deep reinforcement learning network model comprises two neural networks. The first neural network, called the real-time ANN, calculates the estimated value Q(s_t, a_t, θ) of the current state-action pair, where θ denotes the parameters of the real-time ANN and is updated each time the estimated value of the current state is calculated. The second neural network, called the delayed ANN, calculates the value Q(s_{t+1}, a_{t+1}, θ′) of the next state, which is used to calculate the target value y.
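A minimal construction of the two networks is sketched below. The tf.keras API, the layer sizes and the state and action dimensions are illustrative assumptions: the embodiment reports TensorFlow 1.0, and the actual architecture and the parameters of fig. 7 are given only as an image.

```python
import tensorflow as tf

def build_q_network(state_dim, num_actions, hidden=64):
    """Assumed fully connected Q-network mapping a state vector to one value per action."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(num_actions),   # Q(s, ., theta)
    ])

# real-time ANN (parameters theta) and delayed ANN (parameters theta'); dimensions are placeholders
realtime_ann = build_q_network(state_dim=16, num_actions=8)
delayed_ann = build_q_network(state_dim=16, num_actions=8)
delayed_ann.set_weights(realtime_ann.get_weights())   # start both networks from the same parameters

# RMSprop optimizer; 0.0001 is the learning rate the embodiment reports as performing best
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
```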
In this embodiment, the influence of different learning rates on the deep reinforcement learning network model is tested. As shown in fig. 8, when the learning rate is set to 0.01 the network loss function cannot converge effectively and the function value oscillates noticeably. In contrast, when the learning rate is set to 0.0001 the divergence of the network is effectively suppressed and the network converges within about 60 iterations, although the convergence speed becomes noticeably slower. Overall, with the learning rate set to 0.0001 the resource allocation performance is best: the loss function decreases quickly, the network converges more stably, and the convergence effect is better.
The above-mentioned embodiments only express the specific embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, without departing from the technical idea of the present application, several changes and modifications can be made, which are all within the protection scope of the present application.
The background section is provided to present the context of the invention in general, and work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.

Claims (10)

1. An edge internet of things agent resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal distribution strategy by a deep reinforcement learning network model according to the data;
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the allocation of the edge Internet of Things agent resources.
2. The method for allocating the proxy resource of the edge internet of things based on the deep reinforcement learning of claim 1, wherein the training method of the deep reinforcement learning network model in the step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delay ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: according to the current system state s_t, selecting a system action a_t by using an ε-greedy strategy;
step S105: the environment, according to the system action a_t, feeds back a reward σ_{t+1} and the next system state s_{t+1};
step S106: according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, calculating a state transition sequence Δ_t, and storing the state transition sequence Δ_t into the experience pool O;
step S107: judging whether the storage capacity of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
3. The method for allocating the proxy resource of the edge internet of things based on the deep reinforcement learning of claim 2, wherein the system state s in the step S101 is a local offload state, expressed as follows:
s = [F, M, B]
wherein:
F is an offloading decision vector;
M is a computing resource allocation vector;
B is a residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, b_d = G_d − (the computing resources allocated in the vector M to the tasks on that server), and G_d is the total computing resources of the d-th MEC server;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is the terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_{t+1} in step S105 is calculated by the reward function R (the formula is given as an image in the original):
wherein:
R is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value in the next state after the system action a_t is taken in the current system state s_t;
A″ is the calculated value under all partial offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
4. the method for allocating the proxy resource of the edge internet of things based on the deep reinforcement learning of claim 3, wherein the training method for the real-time ANN and the delayed ANN in the step S107 comprises the following steps:
step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) of the next state and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance index of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement, if so, finishing the real-time ANN and delayed ANN training to obtain a trained deep reinforcement learning network model; otherwise, N state transition sequences are re-extracted from the experience pool O, and the process returns to step S1071.
5. The method for allocating the edge internet of things proxy resource based on deep reinforcement learning of claim 4, wherein the target value y of the state action pair in the step S1072 is calculated by the following formula:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the expression of the loss function Loss(θ) in step S1073 is as follows:
Loss(θ) = (1/N) Σ_{n=1}^{N} [ y_n − Q(s_{t,n}, a_{t,n}, θ) ]²
wherein:
N is the number of state transition sequences extracted each time;
n is the index of a state transition sequence.
6. The method of claim 5, wherein the deep reinforcement learning network model performance indexes in step S1077 include: global cost and reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
7. The method for allocating the edge internet of things proxy resource based on deep reinforcement learning of claim 6, wherein the expression of the delay cost c_1 is as follows (the formula is given as an image in the original):
wherein:
T is the number of interactions;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
the deployment variable of the terminal device x and the edge node e at the current interaction time (its symbol is given as an image in the original);
τ_xe is the transmission delay between the terminal device x and the edge node e;
the expression of the migration cost c_2 is as follows (the formula is given as an image in the original):
wherein:
j is a migration edge node;
the deployment variable of the terminal device x and the edge node e at the previous interaction time (its symbol is given as an image in the original);
the deployment variable of the terminal device x and the migration edge node j at the current interaction time (its symbol is given as an image in the original);
the expression of the load cost c_3 is as follows (the formula is given as an image in the original):
wherein:
u_x is the amount of data sent.
8. The method for allocating the proxy resource of the edge internet of things based on the deep reinforcement learning as claimed in claim 6, wherein the calculating of the reliability comprises the following steps:
step A1: storing the interactive data of the terminal device x and the edge node e in a sliding window, and updating in real time;
step A2: calculating the time attenuation degree and the resource allocation rate of the current interaction by adopting an expected value based on Bayesian trust evaluation according to historical interaction data of the terminal device x and the edge node e;
step A3: calculating the reliability T_ex(t) according to the time attenuation degree and the resource allocation rate.
9. The method for allocating the proxy resources of the edge internet of things based on the deep reinforcement learning of claim 8, wherein the calculation formula of the reliability T_ex(t) is as follows (the formulas for T_ex(t) and for the positive service satisfaction P_ex(t) are given as images in the original):
N_ex(t) = 1 − P_ex(t)
wherein:
u is the number of effective information items in the sliding window;
w indexes the current interaction information;
the degree of temporal decay (its symbol is given as an image in the original);
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between the terminal device x and the edge node e;
f_ex(t) is the number of failed historical interactions between the terminal device x and the edge node e.
10. The method for allocating the edge internet of things proxy resource based on the deep reinforcement learning of claim 9, wherein the expression of the time attenuation degree in the step A2 is as follows:
(the decay function is given as an image in the original)
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in the step A2 is as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that the edge node e can provide to the terminal device x in the current time slot;
source_e(t) is the total amount of resources that the edge node e can provide in the current time slot.
CN202211401605.2A 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning Active CN115914227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115914227A true CN115914227A (en) 2023-04-04
CN115914227B CN115914227B (en) 2024-03-19

Family

ID=86493215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211401605.2A Active CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115914227B (en)

Citations (4)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
US20220180174A1 (en) * 2020-12-07 2022-06-09 International Business Machines Corporation Using a deep learning based surrogate model in a simulation
CN113890653A (en) * 2021-08-30 2022-01-04 广东工业大学 Multi-agent reinforcement learning power distribution method for multi-user benefits
CN114490057A (en) * 2022-01-24 2022-05-13 电子科技大学 MEC unloaded task resource allocation method based on deep reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BO FENG 等: "Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation", 2020 5TH ASIA CONFERENCE ON POWER AND ELECTRICAL ENGINEERING (ACPEE), 30 June 2020 (2020-06-30) *
朱斐; 吴文; 刘全; 伏玉琛: "A Deep Q-Network Method Based on Upper-Confidence-Bound Experience Sampling" (一种最大置信上界经验采样的深度Q网络方法), Journal of Computer Research and Development (计算机研究与发展), no. 08, 15 August 2018 (2018-08-15) *
李孜恒; 孟超: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (基于深度强化学习的无线网络资源分配算法), Communications Technology (通信技术), vol. 53, no. 008, 31 December 2020 (2020-12-31) *
李孜恒; 孟超: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (基于深度强化学习的无线网络资源分配算法), Communications Technology (通信技术), no. 08, 10 August 2020 (2020-08-10) *
饶宁 et al.: "Distributed Cooperative Jamming Power Allocation Algorithm Based on Multi-Agent Deep Reinforcement Learning" (基于多智能体深度强化学习的分布式协同干扰功率分配算法), Acta Electronica Sinica (电子学报), 30 June 2022 (2022-06-30) *

Also Published As

Publication number Publication date
CN115914227B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN113568727B (en) Mobile edge computing task allocation method based on deep reinforcement learning
US8250198B2 (en) Capacity planning for data center services
CN112579194B (en) Block chain consensus task unloading method and device based on time delay and transaction throughput
CN113115368B (en) Base station cache replacement method, system and storage medium based on deep reinforcement learning
Zhang et al. Joint optimization of cooperative edge caching and radio resource allocation in 5G-enabled massive IoT networks
CN114357455B (en) Trust method based on multidimensional attribute trust evaluation
EP3547625A1 (en) Method and system for sending request for acquiring data resource
CN112395090B (en) Intelligent hybrid optimization method for service placement in mobile edge calculation
CN109634744A (en) A kind of fine matching method based on cloud platform resource allocation, equipment and storage medium
CN108390775B (en) User experience quality evaluation method and system based on SPICE
CN113543160B (en) 5G slice resource allocation method, device, computing equipment and computer storage medium
CN114500561B (en) Power Internet of things network resource allocation decision-making method, system, equipment and medium
CN110072130A (en) A kind of HAS video segment method for pushing based on HTTP/2
KR20180027995A (en) Method and apparatus for future prediction in Internet of thing
CN111901134B (en) Method and device for predicting network quality based on recurrent neural network model (RNN)
CN109379747B (en) Wireless network multi-controller deployment and resource allocation method and device
CN117539648A (en) Service quality management method and device for electronic government cloud platform
Bensalem et al. Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach
CN115914227B (en) Edge internet of things proxy resource allocation method based on deep reinforcement learning
CN110191362B (en) Data transmission method and device, storage medium and electronic equipment
CN114980324A (en) Slice-oriented low-delay wireless resource scheduling method and system
CN110933119B (en) Method and equipment for updating cache content
CN118245809B (en) Batch size adjustment method in distributed data parallel online asynchronous training
CN117834643B (en) Deep neural network collaborative reasoning method for industrial Internet of things
CN110650187B (en) Node type determination method for edge node and target network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant