CN115914227B - Edge internet of things proxy resource allocation method based on deep reinforcement learning - Google Patents
- Publication number
- CN115914227B CN115914227B CN202211401605.2A CN202211401605A CN115914227B CN 115914227 B CN115914227 B CN 115914227B CN 202211401605 A CN202211401605 A CN 202211401605A CN 115914227 B CN115914227 B CN 115914227B
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- state
- deep reinforcement
- time
- edge node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an edge internet of things proxy resource allocation method based on deep reinforcement learning, relating to the technical field of the internet of things, and comprising the following steps: first, a terminal device x collects data in the environment and transmits the data to a deep reinforcement learning network model; the model then derives an optimal allocation strategy from the data; finally, the data are sent to an edge node e for calculation according to the optimal allocation strategy, realizing edge internet of things proxy resource allocation. The method addresses the problems that edge internet of things proxy resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support the optimal resource configuration of the complex power internet of things.
Description
Technical Field
The invention relates to the technical field of the Internet of things, in particular to a method for distributing proxy resources of an edge Internet of things based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Reasonable resource allocation is an important guarantee for efficiently supporting the power business of the edge internet of things proxy; the electric power internet of things is an important component of the national industrial internet; constructing an efficient, safe and reliable sensing layer has become an important construction task in the power industry; however, current electric power internet of things equipment has limited computing capacity and cannot effectively perform large-scale rapid computing locally; the edge internet of things proxy, as the core equipment of the internet of things sensing layer, connects the internet of things terminals with the cloud; with the access of various data such as voice, video and images, as well as high-frequency data collection and heterogeneous data storage, how to dynamically and adaptively deploy internet of things terminal tasks on a suitable edge internet of things proxy node is a key problem at the present stage.
At present, key problems of the edge internet of things proxy are mainly embodied in two aspects; firstly, because of interdependence among the internet of things agents at different edges, the existing combined optimization method generally adopts an approximate algorithm or a heuristic algorithm to solve the deployment scheme, thus not only requiring longer running time, but also having limited performance; secondly, a plurality of edge nodes exist in the edge internet of things proxy environment, and the resource capacity of an edge server is limited; therefore, different edge nodes need to cooperate through distributed decision to realize optimal resource allocation so as to support efficient and reliable information interaction.
The appearance of the multi-layer network model provides a new solution for the optimal configuration of communication network resources: training a network model through a multi-layer network achieves an accurate and efficient solution; currently, some researchers have performed research and analysis in this direction; one prior-art scheme, based on a convolutional neural network, realizes reasonable allocation of internet of things resources and efficient interaction and coordination of terminal data and network tasks by edge equipment; another scheme optimizes a Q-learning network with Bayesian methods, rationalizing and ordering resource allocation in the network and resisting DDoS network attacks; in addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing of the industrial internet of things and ensures low-delay, high-reliability data interaction; considering the heterogeneity of network devices, the prior art mostly adopts deep learning networks to match network servers with user requests and allocate the optimal amount of resources to user devices; however, it should be noted that, due to the network structure of deep network models, a mismatch between computing power and the problem being processed easily arises when updating and iterating the network state, which limits computing efficiency and is insufficient to support the optimal resource configuration of the complex power internet of things.
Disclosure of Invention
The invention aims to address the defects in the prior art by providing an edge internet of things proxy resource allocation method based on deep reinforcement learning, solving the problems that edge internet of things proxy resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support the optimal resource configuration of the complex power internet of things.
The technical scheme of the invention is as follows:
an edge internet of things proxy resource allocation method based on deep reinforcement learning comprises the following steps:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
Further, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, a feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the amount stored in the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
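The interaction loop of steps S101-S107 can be sketched as follows. This is a minimal illustration assuming a toy environment and a stand-in Q-value function; the patent's real state [F, M, B], reward and ANN structures are not reproduced here.

```python
import random
from collections import deque

N_ACTIONS = 4  # assumed toy action-space size

def epsilon_greedy(q_values, epsilon, rng=random):
    """Step S104: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_episode(env_step, q_func, steps, epsilon=0.1, pool_size=100):
    """Steps S101-S107: interact and store (s_t, a_t, sigma_{t+1}, s_{t+1}) in pool O."""
    pool = deque(maxlen=pool_size)      # experience pool O (step S103)
    s = 0                               # toy initial system state (step S101)
    for _ in range(steps):
        a = epsilon_greedy(q_func(s), epsilon)   # step S104
        sigma, s_next = env_step(s, a)           # step S105: reward + next state
        pool.append((s, a, sigma, s_next))       # step S106: store transition
        s = s_next                               # step S107: state update
    return pool
```

Once the pool reaches the preset size, N transitions would be sampled from it to train the two ANNs, as described in step S107.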
Further, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, obtained by subtracting the computing resources allocated to each task in the vector M from the server's total computing resources g_d;
the system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the calculation formula of the reward σ_{t+1} in step S105 is as follows:
wherein:
r is the reward function;
A is the objective function value in the current time-t state;
A′ is the objective function value reached after taking system action a_t in the current system state s_t;
A″ is the value calculated under fully local offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
Further, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through the back-propagation of Loss(θ), using the RMSprop optimizer to reduce the loss function Loss(θ);
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, finishing the training of the real-time ANN and the delayed ANN to obtain the trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
Further, the calculation formula of the target value y of the state-action pair in step S1072 is as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)

wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the expression of the loss function Loss(θ) in step S1073 is as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_t, a_t, θ)]²

wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
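Steps S1071-S1073 follow the standard DQN update. Below is a minimal sketch, assuming the fluctuation coefficient is the usual discount factor gamma, of the target value and the mean-squared loss over the N sampled sequences.

```python
def td_target(sigma_next, q_next_values, gamma=0.9):
    """Step S1072: y = sigma_{t+1} + gamma * max_a Q(s_{t+1}, a, theta')."""
    return sigma_next + gamma * max(q_next_values)

def dqn_loss(targets, estimates):
    """Step S1073: Loss(theta) = (1/N) * sum_n (y_n - Q(s_t, a_t, theta))^2."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, estimates)) / n
```

In step S1074 this loss would be minimized over the real-time ANN's parameters θ with RMSprop, while the delayed ANN's θ′ supplies the `q_next_values`.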
Further, the deep reinforcement learning network model performance indexes in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3.
Further, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e in the current interaction round;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e in the previous interaction round;
is the deployment variable of terminal device x and migration edge node j in the current interaction round;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data transmitted.
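Since the cost expressions themselves appear only as figures in the source, the sketch below shows one plausible reading of the variable lists above, with `u[x]` the data amount, `tau[x][e]` the transmission delay, and `h[x][e]` a 0/1 deployment variable; the exact formulas and the unit migration cost are assumptions, not the patent's expressions.

```python
def delay_cost(u, tau, h):
    """c1 (assumed form): sum of u_x * tau_xe over pairs deployed this round."""
    return sum(u[x] * tau[x][e] * h[x][e]
               for x in range(len(u)) for e in range(len(h[x])))

def migration_cost(h_prev, h_now, unit_cost=1.0):
    """c2 (assumed form): charge unit_cost whenever a task leaves its previous node."""
    cost = 0.0
    for x in range(len(h_now)):
        for e in range(len(h_now[x])):
            if h_prev[x][e] and not h_now[x][e]:  # task migrated away from e
                cost += unit_cost
    return cost
```

A load cost c_3 would similarly penalize concentrating too much of the transmitted data on one edge node, per the overload rationale given in the embodiment.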
Further, the calculation of the reliability includes the following steps:
step A1: storing the interaction data of terminal device x and edge node e in a sliding window, updated in real time;
step A2: according to the historical interaction data of terminal device x and edge node e, calculating the time decay degree and the resource allocation rate of the current interaction using an expected value based on Bayesian trust evaluation;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
Further, the reliability T_ex(t) is calculated as follows:

N_ex(t) = 1 − P_ex(t)

wherein:
U is the amount of effective information in the sliding window;
w is the current interaction information;
is the time decay degree;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
Further, the expression of the time decay degree in step A2 is as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:

H_ex(t) = source_ex(t) / source_e(t)

wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
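A hedged sketch of steps A1-A3: the exponential decay exp(-Δt_w), the Beta-mean satisfaction (s+1)/(s+f+2), and the decay- and allocation-weighted combination are common Bayesian-trust choices assumed here, since the source gives only the variable lists for T_ex(t), not the closed-form expression.

```python
import math

def time_decay(dt_w):
    """Decay of the w-th record, assumed exponential in dt_w = t - t_w."""
    return math.exp(-dt_w)

def allocation_rate(source_ex, source_e):
    """H_ex(t) = source_ex(t) / source_e(t): share of e's resources offered to x."""
    return source_ex / source_e

def positive_satisfaction(successes, failures):
    """P_ex: expected success probability under a Beta(1, 1) prior (assumed)."""
    return (successes + 1) / (successes + failures + 2)

def reliability(window):
    """T_ex(t) (assumed form): weighted mean satisfaction over the window.

    window: list of (dt_w, source_ex, source_e, successes, failures) records.
    """
    num = den = 0.0
    for dt_w, src_ex, src_e, s, f in window:
        weight = time_decay(dt_w) * allocation_rate(src_ex, src_e)
        num += weight * positive_satisfaction(s, f)
        den += weight
    return num / den if den else 0.0
```

The negative satisfaction follows directly as N_ex = 1 - P_ex, matching the one formula the source does give.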
Compared with the prior art, the invention has the beneficial effects that:
1. according to the edge internet of things proxy resource allocation method based on deep reinforcement learning, an optimal allocation strategy is calculated by using a deep reinforcement learning network model, terminal data is transmitted to an edge node e for calculation according to the optimal allocation strategy, calculation pressure of field devices is effectively relieved, storage difficulty caused by large data volume in a resource allocation process is avoided, reliable and efficient information interaction of a communication network is guaranteed, and better information interaction support service is provided for the electric internet of things.
2. The edge internet of things proxy resource allocation method based on deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning so that their advantages complement each other, and can support optimal strategy solution over large amounts of data.
3. The edge internet of things proxy resource allocation method based on deep reinforcement learning comprises a real-time ANN and a delay ANN, wherein after training for a certain number of times, parameters of the delay ANN are updated to parameters of the real-time ANN, timeliness of a delay ANN value function is guaranteed, and correlation among states is reduced.
4. The edge internet of things proxy resource allocation method based on deep reinforcement learning takes global cost and reliability as performance judgment indexes of a network model, and provides judgment basis for searching an optimal strategy for the network model.
5. According to the edge internet of things proxy resource allocation method based on deep reinforcement learning, the interactive information is updated by adopting a sliding window mechanism, the interactive information with longer interval time is directly abandoned, the calculation cost of the user terminal is reduced, the reliability calculation ensures the safety of the user terminal in the task unloading process, and a guarantee is provided for establishing a good interactive environment.
6. The edge internet of things proxy resource allocation method based on deep reinforcement learning calculates various interaction quality values between the user terminal and the edge server, prepares for reliability calculation, and provides a judgment basis for searching an optimal strategy for a network model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of a method for implementing a deep reinforcement learning network model according to the present invention.
FIG. 3 is a flow chart of a training method for real-time ANN and delayed ANN according to the present invention.
FIG. 4 is a flowchart of a reliability calculation method according to the present invention.
FIG. 5 is a schematic view of a sliding window according to the present invention.
FIG. 6 is a diagram of a deep reinforcement learning network according to the present invention.
FIG. 7 is a diagram illustrating parameters of a deep reinforcement learning network model according to an embodiment of the present invention.
FIG. 8 is a graph illustrating network performance at different learning rates for a deep reinforcement learning network model in accordance with an embodiment of the present invention.
Detailed Description
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with examples.
Example 1
Referring to fig. 1, a method for allocating proxy resources of an edge internet of things based on deep reinforcement learning includes:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
preferably, in this embodiment, the data collected by the terminal device x is data such as voice, video, and image of the user terminal;
Preferably, in this embodiment, Python 3 + TensorFlow 1.0 is used as the simulation experiment platform, on an Intel Core i7-5200U with 16 GB of memory; 50 terminal devices x and 5 edge nodes e are set in the simulation test environment, uniformly distributed in a 15 km × 15 km grid;
preferably, in this embodiment, terminal device x sends a task request to the edge nodes every 1 hour, and the edge nodes decide in a distributed manner which server performs the task; the load of terminal device x comes from a real load data set, in which terminal task load approximately follows a 24-hour periodic distribution due to tidal effects, with random fluctuations caused by environmental factors.
Preferably, in the present embodiment, fig. 7 shows the deep reinforcement learning network model parameters.
Step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: and sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the proxy resource allocation of the edge Internet of things.
In this embodiment, as shown in fig. 2, the training method of the deep reinforcement learning network model in step S1 includes the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, a feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: calculating a state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the amount stored in the experience pool O has reached a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
In this embodiment, specifically, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

wherein:
F is the offloading decision vector;
M is the computing resource allocation vector;
B is the residual computing resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server, obtained by subtracting the computing resources allocated to each task in the vector M from the server's total computing resources g_d;
the system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing resource allocation scheme of terminal device x;
the calculation formula of the reward σ_{t+1} in step S105 is as follows:
wherein:
r is the reward function;
A is the objective function value in the current time-t state;
A′ is the objective function value reached after taking system action a_t in the current system state s_t;
A″ is the value calculated under fully local offloading;
the state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
In this embodiment, as shown in fig. 3, the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ′) of the next state;
step S1072: calculating the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ′) and the reward σ_{t+1};
step S1073: calculating the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through the back-propagation of Loss(θ), using the RMSprop optimizer to reduce the loss function Loss(θ);
step S1075: judging whether the number of steps since the parameters θ′ of the delayed ANN were last updated equals a set value; if so, updating the parameters θ′ of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indexes of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, finishing the training of the real-time ANN and the delayed ANN to obtain the trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
In this embodiment, specifically, the calculation formula of the target value y of the state-action pair in step S1072 is as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ′)

wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ′);
Q(s_{t+1}, a_{t+1}, θ′) is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ′) is the maximum value of the next state of the system;
the expression of the loss function Loss(θ) in step S1073 is as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} [y_n − Q(s_t, a_t, θ)]²

wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
In this embodiment, specifically, the performance indexes of the deep reinforcement learning network model in step S1077 include: global cost and reliability;
the global cost includes a delay cost c_1, a migration cost c_2 and a load cost c_3.
In this embodiment, in order to achieve efficient task processing, three factors are considered: the delay cost c_1, the migration cost c_2 and the load cost c_3. Since terminal device x needs to send the collected data to edge node e for processing, a time delay is generated during data transmission; while processing a task, edge node e may also decide whether to send the task to a migration edge node j, but migration costs arise because the model must be redeployed for the migrated task; due to the limited capacity of edge node e, if too many tasks are deployed on the same edge node e, it is often overloaded, resulting in load costs.
In the present embodiment, specifically, the delay cost c_1 is expressed as follows:
wherein:
t is the number of interactions;
X is the terminal device set;
E is the edge node set;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e in the current interaction round;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e in the previous interaction round;
is the deployment variable of terminal device x and migration edge node j in the current interaction round;
the load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data transmitted.
In this embodiment, specifically, as shown in fig. 4, the calculation of the reliability includes the following steps:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
In this embodiment, considering that interaction experience from long ago is not sufficient to update the current reliability value in time, more attention should be paid to the latest interaction behaviour, so a sliding window mechanism is adopted to update the interaction information; as shown in fig. 5, when the interaction information of the next time slot arrives, the record with the longest interval in the window is discarded and the effective interaction information is recorded in the window, reducing the calculation overhead of the user terminal;
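The sliding-window update described above can be sketched with a fixed-capacity deque; the window size U is an assumption, as the embodiment does not fix a value.

```python
from collections import deque

def make_window(capacity):
    """Step A1: a sliding window holding at most `capacity` interaction records."""
    return deque(maxlen=capacity)

def record_interaction(window, slot_info):
    """When the next slot's record arrives, the record with the longest
    interval is discarded automatically once the window is full
    (the deque's maxlen behaviour), as in fig. 5."""
    window.append(slot_info)
    return window
```

Only the records still inside the window then contribute to the reliability calculation, which keeps the user terminal's calculation overhead bounded.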
step A2: according to the historical interaction data of terminal device x and edge node e, calculating the time decay degree and the resource allocation rate of the current interaction using an expected value based on Bayesian trust evaluation;
in this embodiment, since the reliability of the edge server is dynamically updated, the further the historical interaction information is from the current time, the smaller its influence on the current reliability evaluation; the time decay function represents the degree of decay from the information obtained in the w-th interaction to the current interaction slot, where Δt_w = t − t_w and t_w is the end time of the w-th interaction slot; in addition, the amount of computing resources the edge server can provide in each interaction affects the update of the interaction information;
step A3: calculating the reliability T_ex(t) from the time decay degree and the resource allocation rate.
In this embodiment, the reliability T_ex(t) is calculated as follows:

N_ex(t) = 1 - P_ex(t)

wherein:
u is the amount of effective information in the sliding window;
w is the current interaction information;
is the degree of time decay;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.
In this embodiment, the time decay degree in step A2 is expressed as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
the calculation formula of the resource allocation rate in step A2 is as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
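Steps A1 to A3 can be illustrated together in code. The patent's exact formulas for the time decay function and for T_ex(t) are given as images and are not reproduced in this text, so the sketch below makes loudly labeled assumptions: an exponential decay form, a standard Beta-posterior expectation for the Bayesian satisfaction values P_ex and N_ex, and an illustrative decay-and-rate weighted combination. It is a sketch of the idea, not the patent's formula:

```python
import math

def time_decay(dt, lam=0.5):
    # ASSUMED exponential form: older interactions (larger dt = t - t_w)
    # contribute less. The patent's actual decay function is not shown here.
    return math.exp(-lam * dt)

def bayes_satisfaction(successes, failures):
    # Expected value of a Beta(successes + 1, failures + 1) posterior:
    # a standard Bayesian trust estimate for P_ex(t); N_ex(t) = 1 - P_ex(t).
    p = (successes + 1) / (successes + failures + 2)
    return p, 1 - p

def reliability(records, now, lam=0.5):
    """Illustrative T_ex(t): satisfaction weighted by time decay and
    resource allocation rate over the sliding window.
    records = [(t_w, success, H_w), ...] ordered oldest first."""
    s = f = 0
    score = weight = 0.0
    for t_w, success, h in records:
        s += success
        f += not success
        p, n = bayes_satisfaction(s, f)
        w = time_decay(now - t_w, lam) * h   # decay * allocation rate H_ex(t_w)
        score += w * (p - n)                 # positive minus negative satisfaction
        weight += w
    return score / weight if weight else 0.0
```

A window of successful interactions yields a higher reliability than one of failures, and information from earlier slots is discounted by the decay term.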
In this embodiment, the optimal allocation strategy is solved using a deep reinforcement learning network model, as shown in fig. 6. The model comprises two neural networks: the first, called the real-time ANN, computes the estimated value Q(s_t, a_t, θ), where θ denotes the parameters of the real-time ANN and is updated each time the estimate of the current state is computed; the second, called the delay ANN, computes the value Q(s_{t+1}, a_{t+1}, θ') of the next state, which is used to compute the target value y.
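A minimal sketch of this two-network arrangement, using tiny linear Q-networks as stand-ins for the real-time and delay ANNs. All names and dimensions here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3   # assumed sizes, for illustration only

class QNet:
    """Tiny linear Q-network standing in for the real-time / delay ANNs."""
    def __init__(self):
        self.W = rng.normal(0.0, 0.1, (N_ACTIONS, STATE_DIM))

    def q_values(self, s):
        # One Q-value per action for state vector s.
        return self.W @ s

    def copy_from(self, other):
        # Delayed parameter sync: theta' <- theta.
        self.W = other.W.copy()

realtime_ann = QNet()   # computes Q(s_t, a_t, theta); theta updated every step
delay_ann = QNet()      # computes Q(s_{t+1}, a_{t+1}, theta') for the target y
delay_ann.copy_from(realtime_ann)
```

The real-time ANN is trained at every step, while the delay ANN's parameters θ' are only overwritten periodically via `copy_from`, which stabilizes the target values.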
In this embodiment, the influence of different learning rates on the deep reinforcement learning network model is tested, as shown in fig. 8. When the learning rate is set to 0.01, the network loss function cannot converge effectively and the function value oscillates noticeably. In contrast, when the learning rate is set to 0.0001, divergence is effectively suppressed and the network converges within about 60 iterations, albeit more slowly. Overall, the setting of 0.0001 gives the best resource allocation performance: the loss function decreases steadily, and the network converges more stably and to a better result.
The foregoing examples merely represent specific embodiments of the present application and are described in detail, but they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several variations and modifications without departing from the technical solution of the present application, all of which fall within its protection scope.
This background section is provided to generally present the context of the invention. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.
Claims (3)
1. An edge internet of things proxy resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: collecting data in the environment by the terminal equipment x and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by a deep reinforcement learning network model according to the data;
step S3: according to the optimal allocation strategy, the data are sent to an edge node e for calculation, and the edge internet of things proxy resource allocation is realized;
the training method of the deep reinforcement learning network model in the step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: according to the current system state s_t, selecting a system action a_t using an ε-greedy policy;
step S105: obtaining from the environment, according to the system action a_t, the feedback reward σ_{t+1} and the next system state s_{t+1};
step S106: according to the current system state s_t, the system action a_t, the reward σ_{t+1}, and the next system state s_{t+1}, calculating a state transition sequence Δ_t and storing the state transition sequence Δ_t into the experience pool O;
step S107: judging whether the amount stored in the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delay ANN, completing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104;
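The loop of steps S101 to S107 can be sketched as a skeleton. The environment hook `step_env` and value hook `q_values` are hypothetical placeholders for the system model; they are not names from the patent:

```python
import random
from collections import deque

def epsilon_greedy(q_values, eps, rng=random):
    # Step S104: explore with probability eps, otherwise pick the best action.
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_training(step_env, q_values, steps=100, eps=0.2,
                 pool_capacity=50, pool_min=10, batch=4):
    """Skeleton of steps S101-S107: interact, store transition sequences
    Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1}) in the experience pool O,
    and sample minibatches once the pool reaches a preset size."""
    pool = deque(maxlen=pool_capacity)    # experience pool O (step S103)
    s = 0                                 # initial system state (step S101)
    batches = []
    for _ in range(steps):
        a = epsilon_greedy(q_values(s), eps)        # step S104
        sigma, s_next = step_env(s, a)              # step S105
        pool.append((s, a, sigma, s_next))          # step S106
        if len(pool) >= pool_min:                   # step S107
            batches.append(random.sample(list(pool), batch))
        s = s_next                                  # s_t <- s_{t+1}
    return pool, batches
```

Each sampled minibatch would then be fed to the training procedure of step S107 (steps S1071 onward); here the sampled batches are simply collected to show the control flow.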
the training method for the real-time ANN and the delayed ANN in step S107 includes the following steps:
step S1071: for the N state transition sequences, obtaining the estimated value Q(s_t, a_t, θ) of each state-action pair and the value Q(s_{t+1}, a_{t+1}, θ') of the next state;
step S1072: according to the value Q(s_{t+1}, a_{t+1}, θ') and the reward σ_{t+1}, calculating the target value y of the state-action pair;
step S1073: according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y, calculating the Loss function Loss(θ);
step S1074: adjusting the parameter θ of the real-time ANN through the back-propagation of the Loss, and reducing the Loss function Loss(θ) with the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameter θ' of the delay ANN was last updated equals a set value; if so, updating the parameter θ' of the delay ANN and entering step S1077; otherwise, going to step S1076;
step S1076: judging whether the training of the N state transition sequences is finished, if so, extracting the N state transition sequences from the experience pool O again, returning to the step S1071, otherwise, returning to the step S1071;
step S1077: testing the performance index of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirement; if so, the training of the real-time ANN and the delay ANN is finished and the trained deep reinforcement learning network model is obtained; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071;
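Steps S1072 and S1073 amount to the standard DQN target and mean-squared loss. A sketch, assuming the fluctuation coefficient plays the role of a discount factor (named `gamma` here; the patent's exact symbol is given as an image):

```python
def td_targets(batch, q_delay, gamma=0.9):
    # Step S1072: y = sigma_{t+1} + gamma * max_a' Q(s_{t+1}, a', theta'),
    # where q_delay is the delay ANN's value function.
    return [sigma + gamma * max(q_delay(s_next))
            for (_, _, sigma, s_next) in batch]

def mse_loss(batch, targets, q_realtime):
    # Step S1073: Loss(theta) = (1/N) * sum_n (y_n - Q(s_n, a_n, theta))^2,
    # where q_realtime is the real-time ANN's value function.
    n = len(batch)
    return sum((y - q_realtime(s)[a]) ** 2
               for (s, a, _, _), y in zip(batch, targets)) / n
```

Step S1074 would then back-propagate this loss through the real-time ANN and apply an optimizer such as RMSprop; that part depends on the network implementation and is omitted.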
the performance index of the deep reinforcement learning network model in step S1077 includes: global cost and reliability;
the global cost includes a delay cost c 1 Migration cost c 2 And load cost c 3 ;
The delay cost c_1 is expressed as follows:
wherein:
t is the interaction time index;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
is the deployment variable of terminal device x and edge node e at the current interaction time;
τ_xe is the transmission delay between terminal device x and edge node e;
the migration cost c_2 is expressed as follows:
wherein:
j is a migration edge node;
is the deployment variable of terminal device x and edge node e at the last interaction time;
is the deployment variable of terminal device x and migration edge node j at the current interaction time;
The load cost c_3 is expressed as follows:
wherein:
u_x is the amount of data sent;
the reliability calculation includes the following steps:
step A1: storing the interaction data of the terminal equipment x and the edge node e in a sliding window, and updating in real time;
step A2: according to historical interaction data of the terminal equipment x and the edge node e, calculating the time attenuation degree and the resource allocation rate of current interaction by adopting an expected value based on Bayesian trust evaluation;
step A3: according to the time decay degree and the resource allocation rate, calculating the reliability T_ex(t);
The reliability T_ex(t) is calculated as follows:
N_ex(t) = 1 - P_ex(t)
wherein:
u is the amount of effective information in the sliding window;
w is the current interaction information;
is the degree of time decay;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient;
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;
f_ex(t) is the number of failed historical interactions between terminal device x and edge node e;
the expression of the time decay degree in step A2 is as follows:
wherein:
Δt_w is the time interval from the end of the w-th interaction to the start of the current interaction;
The calculation formula of the resource allocation rate in step A2 is as follows:
wherein:
source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;
source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
2. The edge internet of things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the system state s in step S101 is the local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is the offloading decision vector;
M is the computing-resource allocation vector;
B is the remaining computing-resource vector, B = [b_1, b_2, b_3, …, b_d, …], where b_d is the remaining computing resources of the d-th MEC server and g_d is its total computing resources; each element of the vector M is the amount of computing resources allocated to the corresponding task;
The system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is a terminal device;
μ is the offloading scheme of terminal device x;
k is the computing-resource allocation scheme of terminal device x;
the reward σ_{t+1} in step S105 is calculated as follows:
wherein:
R is the reward function;
a is the objective function value in the state at the current time t;
a′ is the objective function value reached after taking system action a_t in the current system state s_t;
a″ is the objective value computed when all tasks are executed locally;
The state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
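The state, action, and transition containers defined in claim 2 might be represented as follows. The field names and types are illustrative, not from the patent:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class SystemState:
    """Local offloading state s = [F, M, B] of claim 2 (field names assumed)."""
    F: List[int]      # offloading decision vector
    M: List[float]    # computing-resource allocation vector
    B: List[float]    # remaining computing resources b_d of each MEC server

def make_transition(s_t: Any, a_t: Tuple, sigma: float, s_next: Any) -> Tuple:
    # State transition sequence Delta_t = (s_t, a_t, sigma_{t+1}, s_{t+1}),
    # the unit stored in the experience pool O in step S106.
    return (s_t, a_t, sigma, s_next)
```

The action a_t = [x, μ, k] is kept as a plain tuple here, since only the transition structure matters for the experience pool.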
3. The edge internet of things proxy resource allocation method based on deep reinforcement learning according to claim 1, wherein the calculation formula of the target value y of the state-action pair in step S1072 is as follows:
wherein:
is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ');
Q(s_{t+1}, a_{t+1}, θ') is the value of the next system state;
max Q(s_{t+1}, a_{t+1}, θ') is the maximum value of the next system state;
The expression of the Loss function Loss(θ) in step S1073 is as follows:
wherein:
N is the number of state transition sequences extracted each time;
n is the index of a state transition sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211401605.2A CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115914227A CN115914227A (en) | 2023-04-04 |
CN115914227B true CN115914227B (en) | 2024-03-19 |
Family
ID=86493215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211401605.2A Active CN115914227B (en) | 2022-11-10 | 2022-11-10 | Edge internet of things proxy resource allocation method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115914227B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112134916A (en) * | 2020-07-21 | 2020-12-25 | 南京邮电大学 | Cloud edge collaborative computing migration method based on deep reinforcement learning |
CN113890653A (en) * | 2021-08-30 | 2022-01-04 | 广东工业大学 | Multi-agent reinforcement learning power distribution method for multi-user benefits |
CN114490057A (en) * | 2022-01-24 | 2022-05-13 | 电子科技大学 | MEC unloaded task resource allocation method based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220180174A1 (en) * | 2020-12-07 | 2022-06-09 | International Business Machines Corporation | Using a deep learning based surrogate model in a simulation |
- 2022-11-10: CN application CN202211401605.2A filed, patent CN115914227B, status Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112134916A (en) * | 2020-07-21 | 2020-12-25 | 南京邮电大学 | Cloud edge collaborative computing migration method based on deep reinforcement learning |
CN113890653A (en) * | 2021-08-30 | 2022-01-04 | 广东工业大学 | Multi-agent reinforcement learning power distribution method for multi-user benefits |
CN114490057A (en) * | 2022-01-24 | 2022-05-13 | 电子科技大学 | MEC unloaded task resource allocation method based on deep reinforcement learning |
Non-Patent Citations (6)
Title |
---|
Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation; Bo Feng et al.; 2020 5th Asia Conference on Power and Electrical Engineering (ACPEE); 2020-06-30; full text *
A deep Q-network method with maximum upper-confidence-bound experience sampling; Zhu Fei, Wu Wen, Liu Quan, Fu Yuchen; Journal of Computer Research and Development; 2018-08-15 (08); full text *
Distributed cooperative jamming power allocation algorithm based on multi-agent deep reinforcement learning; Rao Ning et al.; Acta Electronica Sinica; 2022-06-30; full text *
Wireless network resource allocation algorithm based on deep reinforcement learning; Li Ziheng, Meng Chao; Communications Technology; 2020-08-10 (08); full text *
A scheduling optimization method based on deep reinforcement learning; Deng Zhilong, Zhang Qiwei, Cao Hao, Gu Zhiyang; Journal of Northwestern Polytechnical University; 35(6); full text *
Also Published As
Publication number | Publication date |
---|---|
CN115914227A (en) | 2023-04-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112486690B (en) | Edge computing resource allocation method suitable for industrial Internet of things | |
CN111629380B (en) | Dynamic resource allocation method for high concurrency multi-service industrial 5G network | |
CN113568727B (en) | Mobile edge computing task allocation method based on deep reinforcement learning | |
CN110365514A (en) | SDN multistage mapping method of virtual network and device based on intensified learning | |
CN112579194B (en) | Block chain consensus task unloading method and device based on time delay and transaction throughput | |
CN111711666B (en) | Internet of vehicles cloud computing resource optimization method based on reinforcement learning | |
CN112395090B (en) | Intelligent hybrid optimization method for service placement in mobile edge calculation | |
US20230153124A1 (en) | Edge network computing system with deep reinforcement learning based task scheduling | |
CN114357455B (en) | Trust method based on multidimensional attribute trust evaluation | |
CN113891276A (en) | Information age-based mixed updating industrial wireless sensor network scheduling method | |
CN113115368B (en) | Base station cache replacement method, system and storage medium based on deep reinforcement learning | |
CN113692021A (en) | 5G network slice intelligent resource allocation method based on intimacy | |
CN109298930A (en) | A kind of cloud workflow schedule method and device based on multiple-objection optimization | |
CN114423023B (en) | Mobile user-oriented 5G network edge server deployment method | |
CN109379747B (en) | Wireless network multi-controller deployment and resource allocation method and device | |
CN115914227B (en) | Edge internet of things proxy resource allocation method based on deep reinforcement learning | |
CN116055406B (en) | Training method and device for congestion window prediction model | |
CN113543160A (en) | 5G slice resource allocation method and device, computing equipment and computer storage medium | |
CN104601424B (en) | The passive transacter of master and method of probabilistic model are utilized in equipment control net | |
CN114500561B (en) | Power Internet of things network resource allocation decision-making method, system, equipment and medium | |
Bensalem et al. | Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach | |
CN113596138B (en) | Heterogeneous information center network cache allocation method based on deep reinforcement learning | |
TW202327380A (en) | Method and system for federated reinforcement learning based offloading optimization in edge computing | |
CN114980324A (en) | Slice-oriented low-delay wireless resource scheduling method and system | |
CN113342474A (en) | Method, device and storage medium for forecasting customer flow and training model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |