CN115914227B - Edge internet of things proxy resource allocation method based on deep reinforcement learning

Edge internet of things proxy resource allocation method based on deep reinforcement learning

Info

Publication number
CN115914227B
CN115914227B (application CN202211401605.2A)
Authority
CN
China
Prior art keywords
reinforcement learning
state
deep reinforcement
time
edge node
Prior art date
Legal status
Active
Application number
CN202211401605.2A
Other languages
Chinese (zh)
Other versions
CN115914227A (en)
Inventor
钟加勇
田鹏
吕小红
吴彬
籍勇亮
李俊杰
宫林
何迎春
Current Assignee
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd, State Grid Corp of China SGCC, State Grid Chongqing Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
Priority to CN202211401605.2A priority Critical patent/CN115914227B/en
Publication of CN115914227A publication Critical patent/CN115914227A/en
Application granted granted Critical
Publication of CN115914227B publication Critical patent/CN115914227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 — Reducing energy consumption in communication networks
    • Y02D30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an edge Internet of Things proxy resource allocation method based on deep reinforcement learning, which relates to the technical field of the Internet of Things and comprises the following steps: first, a terminal device x collects data in the environment and transmits the data to a deep reinforcement learning network model; the deep reinforcement learning network model then obtains an optimal allocation strategy according to the data; finally, the data are sent to an edge node e for computation according to the optimal allocation strategy, realizing edge Internet of Things proxy resource allocation. The method solves the problems that edge Internet of Things proxy resource allocation takes a long time and has limited performance, and that the prior art is insufficient to support optimal resource configuration of the complex power Internet of Things.

Description

Edge internet of things proxy resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of the Internet of things, in particular to a method for distributing proxy resources of an edge Internet of things based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Reasonable resource allocation is an important guarantee for efficiently supporting the power services of the edge Internet of Things proxy; the power Internet of Things is an important component of the national industrial Internet, and constructing an efficient, safe and reliable sensing layer has become an important construction task in the power industry; however, current power Internet of Things devices have limited computing capacity and cannot effectively perform large-scale rapid computation locally; the edge Internet of Things proxy, as the core device of the Internet of Things sensing layer, plays the role of connecting Internet of Things terminals with the cloud; with the access of diverse data such as voice, video and images, as well as high-frequency data collection and heterogeneous data storage, how to dynamically and adaptively deploy the tasks of Internet of Things terminals on suitable edge Internet of Things proxy nodes is a key problem at the present stage.
At present, the key problems of the edge Internet of Things proxy mainly lie in two aspects. First, because of the interdependence among different edge Internet of Things proxies, existing combinatorial optimization methods generally adopt approximate or heuristic algorithms to solve the deployment scheme, which not only requires a long running time but also yields limited performance. Second, multiple edge nodes exist in the edge Internet of Things proxy environment, and the resource capacity of an edge server is limited; therefore, different edge nodes need to cooperate through distributed decision-making to achieve optimal resource allocation and support efficient, reliable information interaction.
The emergence of multi-layer network models provides a new solution for the optimal configuration of communication network resources: training a network model through multiple layers can yield accurate and efficient solutions. Several researchers have studied this direction. One prior-art scheme is based on a convolutional neural network and realizes reasonable allocation of Internet of Things resources and efficient interaction and coordination between edge devices, terminal data and network tasks; another scheme uses Bayesian optimization of a Q-learning network to rationalize and order resource allocation in the network and to resist DDoS attacks; in addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing in industrial Internet of Things networks and ensures low-delay, high-reliability data interaction. Considering the heterogeneity of network devices, the prior art mostly adopts deep learning networks to match network servers with user requests and to allocate the optimal amount of resources to user devices. However, owing to the structure of deep network models, a mismatch between computing power and the problem being processed easily arises when the network state is updated and iterated, which limits computing efficiency and is insufficient to support optimal resource configuration of the complex power Internet of Things.
Disclosure of Invention
The invention aims to address the above defects in the prior art by providing an edge Internet of Things proxy resource allocation method based on deep reinforcement learning, thereby solving the problems that edge Internet of Things proxy resource allocation takes a long time and has limited performance, and that the prior art is insufficient to support optimal resource configuration of the complex power Internet of Things.
The technical scheme of the invention is as follows:
an edge internet of things proxy resource allocation method based on deep reinforcement learning comprises the following steps:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
Further, the training method of the deep reinforcement learning network model in the step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
Step S104: according to the current system state s_t, selecting a system action a_t using an ε-greedy policy;
Step S105: obtaining, from the environment, the feedback reward σ_{t+1} and the next system state s_{t+1} according to the system action a_t;
Step S106: calculating a state transition sequence Δ_t according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t into the experience pool O;
Step S107: judging whether the storage amount of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
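The loop of steps S101 to S107 can be illustrated with a short, simplified sketch. The Python code below is only an illustration under assumed interfaces: the environment object (with reset() and step()) and the q_value_fn callable are hypothetical names, not part of the patent.

```python
import random
from collections import deque

class ExperienceCollector:
    """Minimal sketch of steps S101-S107: epsilon-greedy action selection and
    storage of state transition sequences in the experience pool O."""

    def __init__(self, env, q_value_fn, n_actions, pool_size=10000, epsilon=0.1):
        self.env = env                       # hypothetical environment with reset()/step()
        self.q_value_fn = q_value_fn         # real-time ANN: state -> list of Q-values
        self.n_actions = n_actions
        self.pool = deque(maxlen=pool_size)  # experience pool O (step S103)
        self.epsilon = epsilon

    def choose_action(self, state):
        # epsilon-greedy policy over the current system state s_t (step S104)
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        q_values = self.q_value_fn(state)
        return max(range(self.n_actions), key=lambda a: q_values[a])

    def fill_pool(self, preset_amount=64):
        state = self.env.reset()                        # system state s (step S101)
        while len(self.pool) < preset_amount:           # storage check (step S107)
            action = self.choose_action(state)          # step S104
            reward, next_state = self.env.step(action)  # sigma_{t+1}, s_{t+1} (step S105)
            self.pool.append((state, action, reward, next_state))  # Delta_t (step S106)
            state = next_state                          # s_t <- s_{t+1}
        return list(self.pool)
```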
Further, the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the remaining computing resource vector, B = [b_1, b_2, b_3, ..., b_d, ...], where b_d is the remaining computing resources of the d-th MEC server, g_d is its total computing resources, and the computing resources allocated to each task are taken from the corresponding entries of the allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the reward σ_{t+1} in step S105 is calculated by the reward function r from the objective function values, wherein:
r is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value of the next state reached after taking system action a_t in the current system state s_t;
A″ is the objective function value when all tasks are computed locally;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
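For readability, the state, action and transition structures above can be written out as plain data containers. This is only an illustrative encoding, not the patent's data format; field names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SystemState:
    F: List[int]     # offloading decision vector
    M: List[float]   # computing resource allocation vector
    B: List[float]   # remaining computing resources b_d, one entry per MEC server

@dataclass
class SystemAction:
    x: int        # terminal device
    mu: int       # offloading scheme chosen for device x
    k: float      # computing resources allocated to device x

# State transition sequence Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1}) stored in pool O.
Transition = Tuple[SystemState, SystemAction, float, SystemState]
```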
Further, the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
Step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of each state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state according to the state transition sequences;
Step S1072: calculating the target value y of the state-action pair according to the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
Step S1073: calculating the loss function Loss(θ) according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
Step S1074: adjusting the parameter θ of the real-time ANN through the back-propagation of the loss, and reducing the loss function Loss(θ) using the RMSprop optimizer;
Step S1075: judging whether the number of training steps since the parameter θ′ of the delayed ANN was last updated equals a set value; if so, updating the parameter θ′ of the delayed ANN to the real-time ANN parameter θ and proceeding to step S1077; otherwise, proceeding to step S1076;
Step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
Step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
Step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delayed ANN is finished, and a trained deep reinforcement learning network model is obtained; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
Further, the target value y of the state-action pair in step S1072 is calculated as follows:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_n, a_n, θ)]²
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
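A single training step of the real-time ANN (steps S1071 to S1074), using the target value and loss above, can be sketched as follows. A linear Q-approximation in NumPy stands in for the two ANNs; the values of γ, the learning rate and the RMSprop constants are assumptions, not the patent's actual network or hyper-parameters.

```python
import numpy as np

def dqn_update(theta, theta_delayed, batch, gamma=0.9, lr=1e-4,
               rms_cache=None, decay=0.9, eps=1e-8):
    """One mini-batch update over N transitions (s_t, a_t, sigma_{t+1}, s_{t+1}).

    theta / theta_delayed: weight matrices (n_features x n_actions) of a linear
    Q-approximation Q(s, a, theta) = (s @ theta)[a]; illustrative stand-ins only.
    """
    if rms_cache is None:
        rms_cache = np.zeros_like(theta)

    states      = np.array([s  for s, a, r, s2 in batch], dtype=float)
    actions     = np.array([a  for s, a, r, s2 in batch])
    rewards     = np.array([r  for s, a, r, s2 in batch], dtype=float)
    next_states = np.array([s2 for s, a, r, s2 in batch], dtype=float)

    # Target value y = sigma_{t+1} + gamma * max Q(s_{t+1}, a_{t+1}, theta')   (step S1072)
    y = rewards + gamma * (next_states @ theta_delayed).max(axis=1)

    # Estimated value Q(s_t, a_t, theta) and Loss(theta) = (1/N) sum (y - Q)^2 (step S1073)
    q = (states @ theta)[np.arange(len(batch)), actions]
    loss = np.mean((y - q) ** 2)

    # Gradient of the loss w.r.t. theta, followed by an RMSprop step           (step S1074)
    grad = np.zeros_like(theta)
    err = -2.0 * (y - q) / len(batch)
    for i, a in enumerate(actions):
        grad[:, a] += err[i] * states[i]
    rms_cache = decay * rms_cache + (1 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(rms_cache) + eps)
    return theta, rms_cache, loss
```

When the set number of steps of step S1075 is reached, the delayed parameters would simply be overwritten with the real-time ones, e.g. theta_delayed = theta.copy().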
Further, the performance indexes of the deep reinforcement learning network model in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3.
Further, the delay cost c_1 is computed over all interactions from the following quantities:
t is the interaction index;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
a deployment variable indicates whether terminal device x is deployed on edge node e at the current interaction;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is computed from the following quantities:
j is a migration edge node;
a deployment variable indicates whether terminal device x was deployed on edge node e at the previous interaction;
a deployment variable indicates whether terminal device x is deployed on the migration edge node j at the current interaction;
the load cost c_3 is computed from the following quantity:
u_x is the amount of data sent.
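The three cost components combine into a single global cost. Since the patent's exact expressions are not reproduced above, the sketch below only illustrates the general structure — per-interaction terms weighted by the deployment variables — and the particular forms of the migration and load terms (change of deployment between interactions, squared per-node traffic) are assumptions, not the patent's formulas.

```python
def global_cost(lam, u, tau, T, X, E):
    """Structural sketch of c1 + c2 + c3.

    lam[t][x][e]: deployment variable of device x on edge node e at interaction t (0/1)
    u[x]:         amount of data sent by device x
    tau[x][e]:    transmission delay between device x and edge node e
    """
    c1 = sum(u[x] * lam[t][x][e] * tau[x][e]            # delay cost c1
             for t in range(T) for x in X for e in E)
    c2 = sum(lam[t - 1][x][e] * (1 - lam[t][x][e])      # migration cost c2 (assumed form)
             for t in range(1, T) for x in X for e in E)
    c3 = sum(sum(u[x] * lam[t][x][e] for x in X) ** 2   # load cost c3 (assumed form)
             for t in range(T) for e in E)
    return c1 + c2 + c3
```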
Further, the calculation of the reliability includes the steps of:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
step A2: according to historical interaction data of the terminal equipment x and the edge node e, calculating the time attenuation degree and the resource allocation rate of current interaction by adopting an expected value based on Bayesian trust evaluation;
step A3: according to the time attenuation degree and the resource allocation rate, calculating the reliability T_ex(t).
Further, the reliability T_ex(t) is calculated from the following quantities, where N_ex(t) = 1 − P_ex(t):
u is the amount of effective information in the sliding window;
w indexes the current interaction information;
the time attenuation degree is as defined in step A2;
H_ex(t_w) is the resource allocation rate;
ε is a fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
Further, the time attenuation degree in step A2 is determined by Δt_w, the time interval from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
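A compact sketch of the reliability bookkeeping in steps A1 to A3 is given below. The exponential decay form and the Beta-expectation (s+1)/(s+f+2) used for the service satisfaction are assumed stand-ins for expressions not reproduced above; only the allocation rate source_ex(t)/source_e(t) comes directly from the text.

```python
import math
from collections import deque

class TrustEvaluator:
    """Sliding-window reliability estimate T_ex(t) for one (device x, edge node e) pair."""

    def __init__(self, window=10, decay_lambda=0.1):
        self.window = deque(maxlen=window)   # (end time t_w, success flag, H_ex(t_w))
        self.decay_lambda = decay_lambda     # assumed decay constant

    def record(self, end_time, success, source_ex, source_e):
        # H_ex(t) = source_ex(t) / source_e(t): resource allocation rate of this interaction
        self.window.append((end_time, success, source_ex / source_e))

    def reliability(self, now):
        if not self.window:
            return 0.5                                    # neutral value with no history
        s = sum(1 for _, ok, _ in self.window if ok)      # s_ex(t): successful interactions
        f = len(self.window) - s                          # f_ex(t): failed interactions
        p_pos = (s + 1) / (s + f + 2)                     # Bayesian (Beta) expectation, assumed
        num = den = 0.0
        for end_time, ok, alloc_rate in self.window:
            w = math.exp(-self.decay_lambda * (now - end_time)) * alloc_rate  # decay * H_ex
            num += w * (p_pos if ok else 1.0 - p_pos)
            den += w
        return num / den if den else 0.5
```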
Compared with the prior art, the invention has the beneficial effects that:
1. according to the edge internet of things proxy resource allocation method based on deep reinforcement learning, an optimal allocation strategy is calculated by using a deep reinforcement learning network model, terminal data is transmitted to an edge node e for calculation according to the optimal allocation strategy, calculation pressure of field devices is effectively relieved, storage difficulty caused by large data volume in a resource allocation process is avoided, reliable and efficient information interaction of a communication network is guaranteed, and better information interaction support service is provided for the electric internet of things.
2. The edge Internet of Things proxy resource allocation method based on deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning so that the two complement each other, and can support optimal strategy solution over large amounts of data.
3. The edge internet of things proxy resource allocation method based on deep reinforcement learning comprises a real-time ANN and a delay ANN, wherein after training for a certain number of times, parameters of the delay ANN are updated to parameters of the real-time ANN, timeliness of a delay ANN value function is guaranteed, and correlation among states is reduced.
4. The edge internet of things proxy resource allocation method based on deep reinforcement learning takes global cost and reliability as performance judgment indexes of a network model, and provides judgment basis for searching an optimal strategy for the network model.
5. According to the edge internet of things proxy resource allocation method based on deep reinforcement learning, the interactive information is updated by adopting a sliding window mechanism, the interactive information with longer interval time is directly abandoned, the calculation cost of the user terminal is reduced, the reliability calculation ensures the safety of the user terminal in the task unloading process, and a guarantee is provided for establishing a good interactive environment.
6. The edge internet of things proxy resource allocation method based on deep reinforcement learning calculates various interaction quality values between the user terminal and the edge server, prepares for reliability calculation, and provides a judgment basis for searching an optimal strategy for a network model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of a method for implementing a deep reinforcement learning network model according to the present invention.
FIG. 3 is a flow chart of a training method for real-time ANN and delayed ANN according to the present invention.
FIG. 4 is a flowchart of a reliability calculation method according to the present invention.
FIG. 5 is a schematic view of a sliding window according to the present invention.
FIG. 6 is a diagram of a deep reinforcement learning network according to the present invention.
FIG. 7 is a diagram illustrating parameters of a deep reinforcement learning network model according to an embodiment of the present invention.
FIG. 8 is a graph illustrating network performance at different learning rates for a deep reinforcement learning network model in accordance with an embodiment of the present invention.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Example 1
Referring to fig. 1, a method for allocating proxy resources of an edge internet of things based on deep reinforcement learning includes:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
preferably, in this embodiment, the data collected by the terminal device x is data such as voice, video, and image of the user terminal;
preferably, in this embodiment, Python 3 + TensorFlow 1.0 is used as the simulation platform, with an Intel Core i7-5200U CPU and 16 GB of memory as the hardware; 50 terminal devices x and 5 edge nodes e are set in the simulation test environment, where the terminal devices x and the edge nodes e are uniformly distributed over a 15 km × 15 km area;
preferably, in this embodiment, each terminal device x sends a task request to an edge node e once every hour, and the edge nodes e decide in a distributed manner which server performs each task; the load of the terminal devices x comes from a real load data set, in which the load of terminal tasks approximately follows a 24-hour periodic distribution due to tidal effects, with random fluctuations caused by environmental factors.
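The simulation layout described above can be reproduced, for orientation, with a few lines of NumPy. The random seed, the random (rather than regular-grid) placement and the exact load profile are assumptions; only the device and node counts, the 15 km × 15 km area and the 24-hour tidal pattern come from the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)                     # assumed seed
AREA_KM, N_DEVICES, N_EDGE_NODES, HOURS = 15.0, 50, 5, 24

device_xy = rng.uniform(0.0, AREA_KM, size=(N_DEVICES, 2))     # terminal devices x
edge_xy   = rng.uniform(0.0, AREA_KM, size=(N_EDGE_NODES, 2))  # edge nodes e

# Tidal 24-hour load profile with small random fluctuation, one request per device per hour.
hours = np.arange(HOURS)
load = 1.0 + 0.5 * np.sin(2 * np.pi * hours / HOURS) + rng.normal(0.0, 0.05, HOURS)
```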
Preferably, in the present embodiment, fig. 7 is a deep reinforcement learning network model parameter.
Step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
In this embodiment, as shown in fig. 2, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
Step S104: according to the current system state s_t, selecting a system action a_t using an ε-greedy policy;
Step S105: obtaining, from the environment, the feedback reward σ_{t+1} and the next system state s_{t+1} according to the system action a_t;
Step S106: calculating a state transition sequence Δ_t according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t into the experience pool O;
Step S107: judging whether the storage amount of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
In this embodiment, specifically, the system state S in step S101 is a local unloading state, and the expression is as follows:
s=[F,M,B]
wherein:
f is a discharge decision vector;
m is a computing resource allocation vector;
b is the residual computing resource vector; b= [ B ] 1 ,b 2 ,b 3 …b d ,…],Wherein b d G for the remaining computing resources of the d-th MEC server d For total computing resources>Allocating computing resources for each task in the vector M to the computing resources;
system action a in step S104 t The expression of (2) is as follows:
a t =[x,μ,k]
wherein:
x is a terminal device;
μ is the discharge scheme of terminal device x;
k is a computing resource allocation scheme of the terminal equipment x;
the prize sigma in the step S105 t+1 The calculation formula of (2) is as follows:
wherein:
r is a reward function;
a is an objective function value in the current time t state;
a' is the current system state s t Take system action a t The objective function value when the next state is reached;
a' is the calculated value under all local unloading;
the state transition sequence delta in the step S106 t The expression of (2) is as follows:
Δ t =(s t ,a tt+1 ,s t+1 )。
In this embodiment, as shown in fig. 3, the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
Step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of each state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state according to the state transition sequences;
Step S1072: calculating the target value y of the state-action pair according to the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
Step S1073: calculating the loss function Loss(θ) according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
Step S1074: adjusting the parameter θ of the real-time ANN through the back-propagation of the loss, and reducing the loss function Loss(θ) using the RMSprop optimizer;
Step S1075: judging whether the number of training steps since the parameter θ′ of the delayed ANN was last updated equals a set value; if so, updating the parameter θ′ of the delayed ANN to the real-time ANN parameter θ and proceeding to step S1077; otherwise, proceeding to step S1076;
Step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
Step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
Step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delayed ANN is finished and a trained deep reinforcement learning network model is obtained; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
In this embodiment, specifically, the calculation formula of the target value y of the state action pair in step S1072 is as follows:
wherein:
is maxQ(s) t+1 ,a t+1 A fluctuation coefficient of θ');
Q(s t+1 ,a t+1 θ') is the value of the next state of the system;
maxQ(s t+1 ,a t+1 θ') is the maximum value of the next state of the system;
the expression of the Loss function Loss (θ) in step S1073 is as follows:
wherein:
n is the number value of state transition sequences extracted each time;
n is the sequence number of the state transition sequence.
In this embodiment, specifically, the performance index of the deep reinforcement learning network model in step S1077 includes: global cost and reliability;
the global cost includes a delay cost c 1 Migration cost c 2 And load cost c 3
In this embodiment, in order to achieve efficient task processing, three factors are considered: delay cost c 1 Migration cost c 2 And load cost c 3 The method comprises the steps of carrying out a first treatment on the surface of the Since the terminal device x needs to send the collected data to the edge node e for processing, a time delay is generated in the transmission of the data during the process; while processing a task, edge node e may also decide whether to send the task to the migrating edge nodePoint j, however, migration costs may result due to the need to redeploy the model for the migration task; due to the limited capacity of the edge node e, if too many tasks are deployed on the same edge node e, the edge node e is often overloaded, resulting in load costs.
In the present embodiment, specifically, the delay cost c 1 The expression of (2) is as follows:
wherein:
t is the interaction times;
x is a terminal equipment set;
e is an edge node set;
u x is the amount of data sent;
the deployment variables of the terminal equipment x and the edge node e in the current interaction time are obtained;
τ xe the transmission delay of the terminal equipment x and the edge node e;
the migration cost c 2 The expression of (2) is as follows:
wherein:
j is a migration edge node;
the deployment variables of the terminal equipment x and the edge node e in the last interaction time are obtained;
for terminal equipment x and migration edge node in current interaction timeDeployment variables for j;
the load cost c 3 The expression of (2) is as follows:
wherein:
u x is the amount of data transmitted.
In this embodiment, specifically, as shown in fig. 4, the calculation of the reliability includes the following steps:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
In this embodiment, considering that interaction experience separated by a long time interval is not sufficient to update the current reliability value in time, the most recent interaction behaviour should receive more attention, so a sliding window mechanism is adopted to update the interaction information; as shown in fig. 5, when the interaction information of the next time slot arrives, the time slot record with the longest interval in the window is discarded and the effective interaction information is recorded in the window, thereby reducing the calculation overhead of the user terminal;
step A2: according to historical interaction data of the terminal equipment x and the edge node e, calculating the time attenuation degree and the resource allocation rate of current interaction by adopting an expected value based on Bayesian trust evaluation;
In this embodiment, since the reliability of the edge server is dynamically updated, the longer the historical interaction information is from the current time, the smaller its influence on the current reliability evaluation; the time decay function represents the degree of decay from the information obtained in the w-th interaction to the information of the current interaction slot, where Δt_w = t − t_w and t_w is the end time of the w-th interaction time slot; in addition, the amount of computing resources that the edge server can provide during each interaction also influences the updating of the interaction information;
step A3: according to the time attenuation degree and the resource allocation rate, calculating the reliability T_ex(t).
In this embodiment, specifically, the reliability T_ex(t) is calculated from the following quantities, where N_ex(t) = 1 − P_ex(t):
u is the amount of effective information in the sliding window;
w indexes the current interaction information;
the time attenuation degree is as defined in step A2;
H_ex(t_w) is the resource allocation rate;
ε is a fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
In this embodiment, specifically, the time attenuation degree in step A2 is determined by Δt_w, the time interval from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
In this embodiment, the optimal allocation strategy is solved by using the deep reinforcement learning network model; as shown in fig. 6, the deep reinforcement learning network model includes two neural networks: the first neural network, called the real-time ANN, is used to calculate the estimated value Q(s_t, a_t, θ), where θ denotes the parameters of the real-time ANN, which are updated each time the estimated value of the current state is calculated; the second neural network, called the delayed ANN, is used to calculate the value Q(s_{t+1}, a_{t+1}, θ′) of the next state, which is then used to calculate the target value y.
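The two-network arrangement described above can be summarised in a few lines; a single linear layer stands in for each ANN, and the synchronisation interval is an assumed value, not the patent's.

```python
import numpy as np

class DoubleNetwork:
    """Real-time ANN (theta) and delayed ANN (theta') pair with periodic synchronisation."""

    def __init__(self, n_features, n_actions, sync_every=100):
        self.theta = np.zeros((n_features, n_actions))   # real-time ANN parameters theta
        self.theta_delayed = self.theta.copy()           # delayed ANN parameters theta'
        self.sync_every = sync_every                     # set value of step S1075 (assumed)
        self.steps = 0

    def q_realtime(self, state):
        return state @ self.theta                 # estimated value Q(s_t, a_t, theta)

    def q_delayed(self, next_state):
        return next_state @ self.theta_delayed    # Q(s_{t+1}, a_{t+1}, theta') for target y

    def after_update(self):
        # Every sync_every training steps, overwrite theta' with theta.
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.theta_delayed = self.theta.copy()
```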
In this embodiment, the influence of different learning rates on the deep reinforcement learning network model was tested; as shown in fig. 8, when the learning rate is set to 0.01, the network loss function cannot converge effectively and the function value oscillates noticeably. In contrast, when the learning rate is set to 0.0001, the divergence of the network is effectively alleviated and the network converges within 60 iterations, although the convergence speed is noticeably slower. Overall, with the setting of 0.0001 the resource allocation performance is best, the loss function decreases steadily, and the network converges more stably with a better convergence effect.
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.
This background section is provided to generally present the context of the invention. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.

Claims (3)

1. The edge internet of things proxy resource allocation method based on deep reinforcement learning is characterized by comprising the following steps of:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: according to the optimal allocation strategy, the data are sent to an edge node e for calculation, and the edge internet of things proxy resource allocation is realized;
the training method of the deep reinforcement learning network model in the step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
Step S104: according to the current system state s_t, selecting a system action a_t using an ε-greedy policy;
Step S105: obtaining, from the environment, the feedback reward σ_{t+1} and the next system state s_{t+1} according to the system action a_t;
Step S106: calculating a state transition sequence Δ_t according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t into the experience pool O;
Step S107: judging whether the storage amount of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104;
the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
Step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of each state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state according to the state transition sequences;
Step S1072: calculating the target value y of the state-action pair according to the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
Step S1073: calculating the loss function Loss(θ) according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
Step S1074: adjusting the parameter θ of the real-time ANN through the back-propagation of the loss, and reducing the loss function Loss(θ) using the RMSprop optimizer;
Step S1075: judging whether the number of training steps since the parameter θ′ of the delayed ANN was last updated equals a set value; if so, updating the parameter θ′ of the delayed ANN to the real-time ANN parameter θ and proceeding to step S1077; otherwise, proceeding to step S1076;
Step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
Step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
Step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delayed ANN is finished and a trained deep reinforcement learning network model is obtained; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071;
the performance indexes of the deep reinforcement learning network model in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3;
the delay cost c_1 is computed over all interactions from the following quantities:
t is the interaction index;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
a deployment variable indicates whether terminal device x is deployed on edge node e at the current interaction;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is computed from the following quantities:
j is a migration edge node;
a deployment variable indicates whether terminal device x was deployed on edge node e at the previous interaction;
a deployment variable indicates whether terminal device x is deployed on the migration edge node j at the current interaction;
the load cost c_3 is computed from the following quantity:
u_x is the amount of data sent;
the calculation of the reliability includes the following steps:
Step A1: storing the interaction data of terminal device x and edge node e in a sliding window, and updating it in real time;
Step A2: calculating the time attenuation degree and the resource allocation rate of the current interaction according to the historical interaction data of terminal device x and edge node e, using an expected value based on Bayesian trust evaluation;
Step A3: calculating the reliability T_ex(t) according to the time attenuation degree and the resource allocation rate;
the reliability T_ex(t) is calculated from the following quantities, where N_ex(t) = 1 − P_ex(t):
u is the amount of effective information in the sliding window;
w indexes the current interaction information;
the time attenuation degree is as defined in step A2;
H_ex(t_w) is the resource allocation rate;
ε is a fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e;
the time attenuation degree in step A2 is determined by Δt_w, the time interval from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
2. The edge Internet of Things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the remaining computing resource vector, B = [b_1, b_2, b_3, ..., b_d, ...], where b_d is the remaining computing resources of the d-th MEC server, g_d is its total computing resources, and the computing resources allocated to each task are taken from the corresponding entries of the allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the reward σ_{t+1} in step S105 is calculated by the reward function r from the objective function values, wherein:
r is the reward function;
A is the objective function value in the current state at time t;
A′ is the objective function value of the next state reached after taking system action a_t in the current system state s_t;
A″ is the objective function value when all tasks are computed locally;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
3. The edge Internet of Things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the target value y of the state-action pair in step S1072 is calculated as follows:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_n, a_n, θ)]²
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
CN202211401605.2A 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning Active CN115914227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115914227A CN115914227A (en) 2023-04-04
CN115914227B true CN115914227B (en) 2024-03-19

Family

ID=86493215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211401605.2A Active CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115914227B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
CN113890653A (en) * 2021-08-30 2022-01-04 广东工业大学 Multi-agent reinforcement learning power distribution method for multi-user benefits
CN114490057A (en) * 2022-01-24 2022-05-13 电子科技大学 MEC unloaded task resource allocation method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220180174A1 (en) * 2020-12-07 2022-06-09 International Business Machines Corporation Using a deep learning based surrogate model in a simulation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
CN113890653A (en) * 2021-08-30 2022-01-04 广东工业大学 Multi-agent reinforcement learning power distribution method for multi-user benefits
CN114490057A (en) * 2022-01-24 2022-05-13 电子科技大学 MEC unloaded task resource allocation method based on deep reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation; Bo Feng et al.; 2020 5th Asia Conference on Power and Electrical Engineering (ACPEE); 2020-06-30; full text *
A deep Q-network method with upper-confidence-bound experience sampling; 朱斐, 吴文, 刘全, 伏玉琛; Journal of Computer Research and Development; 2018-08-15 (08); full text *
Distributed cooperative jamming power allocation algorithm based on multi-agent deep reinforcement learning; 饶宁 et al.; Acta Electronica Sinica; 2022-06-30; full text *
Wireless network resource allocation algorithm based on deep reinforcement learning; 李孜恒, 孟超; Communications Technology; 2020-08-10 (08); full text *
邓志龙, 张琦玮, 曹皓, 谷志阳. A scheduling optimization method based on deep reinforcement learning. Journal of Northwestern Polytechnical University. 35(6), full text. *

Also Published As

Publication number Publication date
CN115914227A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN112486690B (en) Edge computing resource allocation method suitable for industrial Internet of things
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
CN113568727B (en) Mobile edge computing task allocation method based on deep reinforcement learning
CN110365514A (en) SDN multistage mapping method of virtual network and device based on intensified learning
CN112579194B (en) Block chain consensus task unloading method and device based on time delay and transaction throughput
CN111711666B (en) Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN112395090B (en) Intelligent hybrid optimization method for service placement in mobile edge calculation
US20230153124A1 (en) Edge network computing system with deep reinforcement learning based task scheduling
CN114357455B (en) Trust method based on multidimensional attribute trust evaluation
CN113891276A (en) Information age-based mixed updating industrial wireless sensor network scheduling method
CN113115368B (en) Base station cache replacement method, system and storage medium based on deep reinforcement learning
CN113692021A (en) 5G network slice intelligent resource allocation method based on intimacy
CN109298930A (en) A kind of cloud workflow schedule method and device based on multiple-objection optimization
CN114423023B (en) Mobile user-oriented 5G network edge server deployment method
CN109379747B (en) Wireless network multi-controller deployment and resource allocation method and device
CN115914227B (en) Edge internet of things proxy resource allocation method based on deep reinforcement learning
CN116055406B (en) Training method and device for congestion window prediction model
CN113543160A (en) 5G slice resource allocation method and device, computing equipment and computer storage medium
CN104601424B (en) The passive transacter of master and method of probabilistic model are utilized in equipment control net
CN114500561B (en) Power Internet of things network resource allocation decision-making method, system, equipment and medium
Bensalem et al. Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach
CN113596138B (en) Heterogeneous information center network cache allocation method based on deep reinforcement learning
TW202327380A (en) Method and system for federated reinforcement learning based offloading optimization in edge computing
CN114980324A (en) Slice-oriented low-delay wireless resource scheduling method and system
CN113342474A (en) Method, device and storage medium for forecasting customer flow and training model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant