CN115914227A - Edge Internet of Things agent resource allocation method based on deep reinforcement learning - Google Patents

Edge Internet of Things agent resource allocation method based on deep reinforcement learning

Info

Publication number
CN115914227A
CN115914227A (application CN202211401605.2A)
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
state
time
terminal device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211401605.2A
Other languages
Chinese (zh)
Other versions
CN115914227B (en)
Inventor
钟加勇
田鹏
吕小红
吴彬
籍勇亮
李俊杰
宫林
何迎春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd, State Grid Corp of China SGCC, State Grid Chongqing Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd
Priority to CN202211401605.2A priority Critical patent/CN115914227B/en
Publication of CN115914227A publication Critical patent/CN115914227A/en
Application granted granted Critical
Publication of CN115914227B publication Critical patent/CN115914227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an edge Internet of Things agent resource allocation method based on deep reinforcement learning, and relates to the technical field of the Internet of Things. The method comprises the following steps: a terminal device x first collects data from the environment and transmits the data to a deep reinforcement learning network model; the deep reinforcement learning network model then derives an optimal allocation strategy from the data; finally, the data is sent to an edge node e for computation according to the optimal allocation strategy, realizing edge Internet of Things agent resource allocation. The invention solves the problems that edge Internet of Things agent resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support optimal resource configuration for the complex, dynamic Internet of Things.

Description

A resource allocation method for edge IoT agents based on deep reinforcement learning

Technical Field

The present invention relates to the technical field of the Internet of Things, and in particular to an edge IoT agent resource allocation method based on deep reinforcement learning.

Background Art

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

Reasonable resource allocation is an important guarantee for efficiently supporting the power services carried by edge IoT agents. The power IoT is an important part of the national industrial Internet, and building an efficient, secure and reliable perception layer has become an important construction task in the power industry. However, the computing power of current power IoT devices is limited and cannot effectively handle large, fast computing tasks locally. The edge IoT agent, as the core device of the IoT perception layer, connects IoT terminals to the cloud. With the access of diverse data such as voice, video and images, the collection of high-frequency data and the storage of heterogeneous data, how to dynamically and adaptively deploy the tasks of IoT terminals on appropriate edge IoT agent nodes is a key issue at this stage.

At present, the key problems of edge IoT agents are mainly reflected in two aspects. First, because IoT agents at different edges depend on one another, existing combinatorial optimization methods generally rely on approximate or heuristic algorithms to solve the deployment problem, which not only requires a long running time but also delivers limited performance. Second, there are multiple edge nodes in the edge IoT agent environment, and the resource capacity of each edge server is limited; different edge nodes therefore need to cooperate through distributed decision-making to achieve the optimal resource allocation that supports efficient and reliable information interaction.

The emergence of multi-layer network models provides a new solution for the optimal configuration of communication network resources: the network model is trained through a multi-layer network to reach an accurate and efficient solution, and several researchers have already studied and analyzed this direction. One existing scheme is based on convolutional neural networks and achieves reasonable allocation of IoT resources and efficient interaction and coordination between edge devices, terminal data and network tasks. Another scheme uses Bayesian optimization of a Q-learning network to rationalize and order resource allocation in the network so as to resist DDoS attacks. In addition, the introduction of deep spatio-temporal residual networks effectively supports load balancing in industrial IoT networks and ensures low-latency, highly reliable data interaction. Considering the heterogeneity of network devices, the existing technology mostly uses deep learning networks to match network servers with user requests and to allocate the best amount of resources to user devices. It should be noted, however, that because of the structure of deep network models, updating and iterating the network state easily runs into a mismatch between computing power and the problem being processed, which limits computing efficiency and is insufficient to support the optimal resource configuration of the complex power IoT.

Summary of the Invention

The purpose of the present invention is to address the above deficiencies in the prior art by providing an edge IoT agent resource allocation method based on deep reinforcement learning, which solves the problems that edge IoT agent resource allocation takes a long time, its performance is limited, and the prior art is insufficient to support the optimal resource configuration of the complex power IoT.

The technical solution of the present invention is as follows:

A resource allocation method for edge IoT agents based on deep reinforcement learning, comprising:

Step S1: a terminal device x collects data from the environment and transmits the data to a deep reinforcement learning network model;

Step S2: the deep reinforcement learning network model derives the optimal allocation strategy from the data;

Step S3: according to the optimal allocation strategy, the data is sent to an edge node e for computation, realizing edge IoT agent resource allocation.
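
As a minimal illustration of the three-step flow above (steps S1 to S3), the sketch below wires a terminal device, a trained allocation model and a set of edge nodes together. The class and function names (AllocationModel, EdgeNode, run_allocation) are illustrative assumptions, not names taken from the patent, and the model here is a stand-in that always picks the first node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EdgeNode:
    node_id: int

    def compute(self, data: bytes) -> str:
        # Placeholder for the edge-side computation of the offloaded task.
        return f"edge node {self.node_id} processed {len(data)} bytes"


class AllocationModel:
    """Stand-in for the trained deep reinforcement learning network model."""

    def best_edge_node(self, data: bytes, nodes: List[EdgeNode]) -> EdgeNode:
        # The real model maps the collected data (system state) to an
        # allocation strategy; here we simply pick the first node.
        return nodes[0]


def run_allocation(data: bytes, model: AllocationModel, nodes: List[EdgeNode]) -> str:
    # Step S1: terminal device x has collected `data` and hands it to the model.
    # Step S2: the model produces the allocation decision.
    target = model.best_edge_node(data, nodes)
    # Step S3: the data is sent to the selected edge node e for computation.
    return target.compute(data)


if __name__ == "__main__":
    print(run_allocation(b"sensor-readings", AllocationModel(), [EdgeNode(0), EdgeNode(1)]))
```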

Furthermore, the training method of the deep reinforcement learning network model used in step S1 comprises the following steps:

Step S101: initialize the system state s of the deep reinforcement learning network model;

Step S102: initialize the real-time ANN and the delayed ANN of the deep reinforcement learning network model;

Step S103: initialize the experience pool O of the deep reinforcement learning network model;

Step S104: according to the current system state s_t, select a system action a_t using the ε-greedy strategy;

Step S105: the environment feeds back the reward σ_{t+1} and the next system state s_{t+1} according to the system action a_t;

Step S106: compute the state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and store Δ_t in the experience pool O;

Step S107: determine whether the amount stored in the experience pool O has reached the preset value; if so, draw N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, update the current system state s_t to the next system state s_{t+1} and return to step S104.
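
A compact sketch of the interaction loop in steps S101 to S107 follows. It uses a toy environment and a linear numpy Q-function so it runs on its own; the environment dynamics, the preset pool size and the train_batch routine are placeholders (the real update of the two ANNs is described later, in steps S1071 to S1078), so treat this as an illustration of the loop structure rather than the patent's implementation.

```python
import random
from collections import deque, namedtuple

import numpy as np

# Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1})
Transition = namedtuple("Transition", "s a r s_next")

STATE_DIM, N_ACTIONS = 8, 5          # toy sizes, not taken from the patent
rng = np.random.default_rng(0)

theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))   # real-time ANN (linear stand-in), step S102
theta_delayed = theta.copy()                                  # delayed ANN, step S102
experience_pool = deque(maxlen=10_000)                        # experience pool O, step S103
PRESET_POOL_SIZE, BATCH_N, EPSILON = 500, 32, 0.1


def q_values(params, s):
    return params @ s                                         # Q(s, ·)


def step_environment(s, a):
    """Toy environment: returns the reward σ_{t+1} and the next state s_{t+1} (step S105)."""
    s_next = np.clip(s + rng.normal(scale=0.05, size=STATE_DIM), 0.0, 1.0)
    reward = -abs(float(s[a % STATE_DIM]) - 0.5)              # placeholder reward
    return reward, s_next


def train_batch(batch):
    """Placeholder for the real-time / delayed ANN update (steps S1071 to S1078)."""
    pass


s = rng.uniform(size=STATE_DIM)                               # step S101: initial system state
for t in range(2_000):
    # Step S104: ε-greedy action selection on the real-time ANN.
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax(q_values(theta, s)))
    # Step S105: environment feedback.
    r, s_next = step_environment(s, a)
    # Step S106: store the transition Δ_t in the experience pool O.
    experience_pool.append(Transition(s, a, r, s_next))
    # Step S107: once the pool reaches the preset size, sample N transitions and train.
    if len(experience_pool) >= PRESET_POOL_SIZE:
        train_batch(random.sample(list(experience_pool), BATCH_N))
    s = s_next
```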

Furthermore, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

where:

F is the offloading decision vector;

M is the computing resource allocation vector;

B is the remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …] [formula image], where b_d is the remaining computing resources of the d-th MEC server, i.e. the total computing resources G_d minus the computing resources allocated to each task in the computing resource allocation vector M;

The system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

where:

x is the terminal device;

μ is the offloading scheme of terminal device x;

k is the computing resource allocation scheme of terminal device x;

The reward σ_{t+1} in step S105 is calculated by the following formula:

[formula image]

where:

r is the reward function;

A is the objective function value in the state at the current time t;

A' is the objective function value of the next state reached after taking system action a_t in the current system state s_t;

A'' is the computed value under the all-local offloading condition;

The state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
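
To make the state, action and transition definitions above concrete, here is a small sketch of how they might be laid out in code. The field names and the flattening of s = [F, M, B] into one input vector are illustrative assumptions; the patent only fixes the composition of s, a_t and Δ_t.

```python
from typing import NamedTuple

import numpy as np


class Action(NamedTuple):
    x: int      # index of the terminal device
    mu: int     # offloading scheme μ chosen for terminal device x
    k: int      # computing resource allocation scheme for terminal device x


class Transition(NamedTuple):
    s: np.ndarray       # current system state s_t
    a: Action           # system action a_t
    reward: float       # reward σ_{t+1}
    s_next: np.ndarray  # next system state s_{t+1}


def make_state(F: np.ndarray, M: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Flatten s = [F, M, B] into a single vector for the ANN input."""
    return np.concatenate([F, M, B])


def remaining_resources(G_d: float, allocated_to_tasks: np.ndarray) -> float:
    """Remaining resources b_d of one MEC server: total G_d minus what its tasks were allocated."""
    return G_d - float(np.sum(allocated_to_tasks))
```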

Furthermore, the training method of the real-time ANN and the delayed ANN in step S107 comprises the following steps:

Step S1071: for the N state transition sequences, obtain from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ') of the next state;

Step S1072: compute the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ') of the next state and the reward σ_{t+1};

Step S1073: compute the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;

Step S1074: adjust the parameters θ of the real-time ANN through back-propagation of the loss, and reduce the loss function Loss(θ) using the RMSprop optimizer;

Step S1075: determine whether the number of steps since the parameters θ' of the delayed ANN were last updated equals the set value; if so, update the parameters θ' of the delayed ANN and go to step S1077; otherwise, go to step S1076;

Step S1076: determine whether training on the N state transition sequences has finished; if so, draw N new state transition sequences from the experience pool O and return to step S1071; otherwise, return to step S1071;

Step S1077: test the performance indicators of the deep reinforcement learning network model to obtain a test result;

Step S1078: determine whether the test result meets the requirements; if so, the training of the real-time ANN and the delayed ANN ends and the trained deep reinforcement learning network model is obtained; otherwise, draw N new state transition sequences from the experience pool O and return to step S1071.

Furthermore, the target value y of the state-action pair in step S1072 is calculated as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ')

where:

γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ');

Q(s_{t+1}, a_{t+1}, θ') is the value of the next state of the system;

max Q(s_{t+1}, a_{t+1}, θ') is the maximum value of the next state of the system;

The loss function Loss(θ) in step S1073 is expressed as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} ( y_n − Q(s_t, a_t, θ) )²

where:

N is the number of state transition sequences drawn each time;

n is the index of a state transition sequence.
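
The sketch below implements one pass of steps S1071 to S1078 on a minibatch, with a linear Q-function standing in for the two ANNs so the gradients can be written out by hand. The target y = σ_{t+1} + γ·max Q(s_{t+1}, a', θ'), the mean-squared loss and the periodic copy of θ into θ' follow the formulas and steps above; the values of γ, the RMSprop constants and the synchronization interval are my assumptions, and the patent's reference setup runs on python3 + tensorflow1.0 rather than plain numpy.

```python
import numpy as np

STATE_DIM, N_ACTIONS = 8, 5
GAMMA = 0.9                     # fluctuation coefficient γ of max Q(s_{t+1}, a_{t+1}, θ') -- assumed value
LR, RHO, EPS = 1e-4, 0.9, 1e-8  # RMSprop hyper-parameters -- assumed values
SYNC_EVERY = 100                # steps between updates of the delayed parameters θ' -- assumed value

rng = np.random.default_rng(1)
theta = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))   # real-time ANN parameters θ
theta_delayed = theta.copy()                                  # delayed ANN parameters θ'
rms_cache = np.zeros_like(theta)
step_counter = 0


def train_on_batch(batch):
    """One update; batch is a list of (s, a, reward, s_next) transitions."""
    global theta, theta_delayed, rms_cache, step_counter
    grad = np.zeros_like(theta)
    losses = []
    for s, a, reward, s_next in batch:
        q_est = theta[a] @ s                                   # Q(s_t, a_t, θ)  (step S1071)
        y = reward + GAMMA * np.max(theta_delayed @ s_next)    # target value y  (step S1072)
        err = q_est - y
        losses.append(err ** 2)                                # Loss(θ) term    (step S1073)
        grad[a] += 2.0 * err * s / len(batch)                  # d(mean squared error)/dθ
    # Step S1074: RMSprop step on the real-time parameters θ.
    rms_cache = RHO * rms_cache + (1.0 - RHO) * grad ** 2
    theta -= LR * grad / (np.sqrt(rms_cache) + EPS)
    # Step S1075: periodically copy θ into the delayed ANN θ'.
    step_counter += 1
    if step_counter % SYNC_EVERY == 0:
        theta_delayed = theta.copy()
    return float(np.mean(losses))


# Example call on a random batch of N = 32 transitions.
batch = [(rng.uniform(size=STATE_DIM), int(rng.integers(N_ACTIONS)),
          float(rng.normal()), rng.uniform(size=STATE_DIM)) for _ in range(32)]
print("batch loss:", train_on_batch(batch))
```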

Furthermore, the performance indicators of the deep reinforcement learning network model in step S1077 include the global cost and the reliability;

The global cost comprises the delay cost c_1, the migration cost c_2 and the load cost c_3.

Furthermore, the delay cost c_1 is given by the following expression:

[formula image]

where:

t is the number of interactions;

X is the set of terminal devices;

E is the set of edge nodes;

u_x is the amount of data sent;

[symbol] is the deployment variable of terminal device x and edge node e in the current interaction time;

τ_{xe} is the transmission delay between terminal device x and edge node e;

The migration cost c_2 is given by the following expression:

[formula image]

where:

j is the migration edge node;

[symbol] is the deployment variable of terminal device x and edge node e in the previous interaction time;

[symbol] is the deployment variable of terminal device x and the migration edge node j in the current interaction time;

The load cost c_3 is given by the following expression:

[formula image]

where:

u_x is the amount of data sent.
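
The three cost terms are defined in the patent only through the image formulas above, so the sketch below is one plausible reading under stated assumptions: the delay cost sums data volume times transmission delay over the deployed terminal-node pairs, the migration cost charges every task whose serving node changed between consecutive interactions, and the load cost penalizes the square of the data volume concentrated on each edge node. The array layout and the quadratic load penalty are assumptions made for illustration, not the patent's exact expressions.

```python
import numpy as np


def global_costs(u, tau, deploy_prev, deploy_now):
    """Illustrative delay / migration / load costs for one interaction round.

    u           : (n_terminals,) data volume u_x sent by each terminal device
    tau         : (n_terminals, n_edges) transmission delay τ_xe
    deploy_prev : (n_terminals, n_edges) 0/1 deployment variables of the previous interaction
    deploy_now  : (n_terminals, n_edges) 0/1 deployment variables of the current interaction
    """
    # Delay cost c1: data volume weighted by the delay of the link actually used.
    c1 = float(np.sum(u[:, None] * deploy_now * tau))
    # Migration cost c2: data volume of every task whose serving edge node changed.
    migrated = (deploy_prev != deploy_now).any(axis=1) & deploy_prev.any(axis=1)
    c2 = float(np.sum(u[migrated]))
    # Load cost c3: quadratic penalty on the data volume gathered at each edge node
    # (an assumed form of the overload penalty described in the text).
    per_node_load = (u[:, None] * deploy_now).sum(axis=0)
    c3 = float(np.sum(per_node_load ** 2))
    return c1, c2, c3


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n_x, n_e = 6, 3
    u = rng.uniform(1.0, 5.0, size=n_x)
    tau = rng.uniform(0.01, 0.2, size=(n_x, n_e))
    prev = np.eye(n_e)[rng.integers(n_e, size=n_x)]
    now = np.eye(n_e)[rng.integers(n_e, size=n_x)]
    print(global_costs(u, tau, prev, now))
```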

Furthermore, the calculation of the reliability includes the following steps:

Step A1: store the interaction data of terminal device x and edge node e in a sliding window, and update it in real time;

Step A2: according to the historical interaction data of terminal device x and edge node e, calculate the time decay degree and the resource allocation rate of the current interaction using the expected value based on Bayesian trust evaluation;

Step A3: calculate the reliability T_ex(t) from the time decay degree and the resource allocation rate.

Furthermore, the reliability T_ex(t) is calculated by the following formulas:

[formula image: T_ex(t)]

[formula image: P_ex(t)]

N_ex(t) = 1 − P_ex(t)

where:

U is the number of valid records in the sliding window;

w is the current interaction record;

[symbol] is the time decay degree;

H_ex(t_w) is the resource allocation rate;

ε is the fluctuation coefficient of [symbol];

P_ex(t_w) is the positive service satisfaction of the current interaction;

N_ex(t_w) is the negative service satisfaction of the current interaction;

s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;

f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.

Furthermore, the time decay degree in step A2 is given by the following expression:

[formula image]

where:

Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;

The resource allocation rate in step A2 is calculated as follows:

H_ex(t) = source_ex(t) / source_e(t)

where:

source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;

source_e(t) is the total amount of resources that edge node e can provide in the current time slot.
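
Because the closed forms of T_ex(t), P_ex(t_w) and the time decay degree are given only as images, the sketch below fills them in with common choices under stated assumptions: an exponential decay exp(−λ·Δt_w) for the time decay degree, the Beta-expectation form (s + 1)/(s + f + 2) for the Bayesian positive-satisfaction term, and a decay- and allocation-weighted average of P_ex − ε·N_ex over the U records of the sliding window. These choices are mine, made to match the variable list above; they are not the patent's exact formulas.

```python
import math
from collections import deque

# Each record in the sliding window describes one past interaction between
# terminal device x and edge node e:
# (end time t_w, successes s_ex, failures f_ex, resources offered to x, total resources of e)
Window = deque  # sliding window of interaction records

LAMBDA = 0.1   # assumed decay rate of the time decay degree
EPSILON = 0.5  # assumed fluctuation coefficient ε applied to the negative satisfaction


def time_decay(now: float, t_w: float) -> float:
    """Assumed exponential decay over Δt_w = now - t_w."""
    return math.exp(-LAMBDA * (now - t_w))


def positive_satisfaction(s: int, f: int) -> float:
    """Assumed Beta-expectation form of the Bayesian trust value P_ex."""
    return (s + 1.0) / (s + f + 2.0)


def reliability(window: Window, now: float, U: int = 10) -> float:
    records = list(window)[-U:]                      # keep only the U most recent valid records
    num, den = 0.0, 0.0
    for t_w, s, f, offered, total in records:
        phi = time_decay(now, t_w)                   # time decay degree
        h = offered / total if total else 0.0        # resource allocation rate H_ex(t_w)
        p = positive_satisfaction(s, f)              # P_ex(t_w)
        n = 1.0 - p                                  # N_ex(t_w)
        num += phi * h * (p - EPSILON * n)
        den += phi * h
    return num / den if den else 0.0


if __name__ == "__main__":
    win = Window(maxlen=10)
    win.append((1.0, 8, 1, 4.0, 10.0))   # (t_w, s_ex, f_ex, source_ex, source_e)
    win.append((2.0, 5, 3, 2.0, 10.0))
    print("T_ex(t) ≈", round(reliability(win, now=3.0), 3))
```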

Compared with the prior art, the present invention has the following beneficial effects:

1. In the edge IoT agent resource allocation method based on deep reinforcement learning, the deep reinforcement learning network model computes the optimal allocation strategy, and the terminal data is transmitted to edge node e for computation according to that strategy. This effectively relieves the computing pressure on field devices, avoids the storage difficulties caused by large data volumes during resource allocation, guarantees reliable and efficient information interaction in the communication network, and provides better information interaction support services for the power IoT.

2. The deep reinforcement learning network model combines the perception ability of deep learning with the decision-making ability of reinforcement learning so that the two complement each other, supporting optimal strategy solving over large amounts of data.

3. The neural networks include a real-time ANN and a delayed ANN; after a certain number of training steps, the parameters of the delayed ANN are updated to those of the real-time ANN, which keeps the value function of the delayed ANN up to date and reduces the correlation between states.

4. Global cost and reliability are used as the performance indicators of the network model, providing the judgment basis for the network model to seek the optimal strategy.

5. A sliding window mechanism is used to update the interaction information and directly discard interaction records separated by long intervals, which reduces the computational overhead of the user terminal; the reliability calculation guarantees the security of the user terminal during task offloading and helps establish a good interaction environment.

6. The method calculates the interaction quality values between the user terminal and the edge server, preparing for the reliability calculation and providing the judgment basis for the network model to seek the optimal strategy.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a flow chart of the implementation method of the deep reinforcement learning network model in the present invention.

FIG. 3 is a flow chart of the training method of the real-time ANN and the delayed ANN in the present invention.

FIG. 4 is a flow chart of the reliability calculation method in the present invention.

FIG. 5 is a schematic diagram of the sliding window in the present invention.

FIG. 6 is a structural diagram of the deep reinforcement learning network in the present invention.

FIG. 7 shows the parameters of the deep reinforcement learning network model in an embodiment of the present invention.

FIG. 8 is a graph of network performance at different learning rates for the deep reinforcement learning network model in an embodiment of the present invention.

Detailed Description

It should be noted that relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes it.

The features and performance of the present invention are further described in detail below in conjunction with the embodiments.

Embodiment 1

Referring to FIG. 1, a resource allocation method for edge IoT agents based on deep reinforcement learning comprises:

Step S1: a terminal device x collects data from the environment and transmits the data to the deep reinforcement learning network model;

Preferably, in this embodiment, the data collected by terminal device x are user-terminal data such as voice, video and images;

Preferably, in this embodiment, python3 + tensorflow1.0 is used as the simulation platform, running on an Intel Core i7-5200U with 16 GB of memory; 50 terminal devices x and 5 edge nodes e are set up in the simulation environment, uniformly distributed over a 15 km × 15 km grid;

Preferably, in this embodiment, terminal device x sends a task request to edge node e once every hour, and the edge nodes e decide in a distributed manner which server executes each task. The load of terminal device x comes from a real load data set in which, owing to the tidal effect, the load of terminal tasks roughly follows a 24-hour periodic distribution but also fluctuates randomly because of environmental factors.
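
The experimental setup in this embodiment (50 terminal devices, 5 edge nodes, a 15 km × 15 km grid, hourly task requests with a roughly 24-hour periodic load plus random fluctuation) can be mocked up as below. The sinusoidal synthetic load and the distance-proportional delay are stand-ins I chose for illustration; the real load data set mentioned in the text is not reproduced in the patent.

```python
import numpy as np

rng = np.random.default_rng(3)

N_TERMINALS, N_EDGES = 50, 5
GRID_KM = 15.0

# Uniformly place terminal devices x and edge nodes e on the 15 km x 15 km grid.
terminal_xy = rng.uniform(0.0, GRID_KM, size=(N_TERMINALS, 2))
edge_xy = rng.uniform(0.0, GRID_KM, size=(N_EDGES, 2))

# Transmission delay proxy: proportional to the terminal-to-edge distance (assumed model).
dist = np.linalg.norm(terminal_xy[:, None, :] - edge_xy[None, :, :], axis=-1)
tau = 0.01 * dist  # τ_xe, illustrative scale


def hourly_load(hour: int) -> np.ndarray:
    """Synthetic 24-hour periodic load with random fluctuation (stand-in for the real data set)."""
    base = 1.0 + 0.5 * np.sin(2.0 * np.pi * hour / 24.0)
    return np.clip(base + rng.normal(scale=0.1, size=N_TERMINALS), 0.1, None)


# One simulated day of hourly task requests; u is the data volume u_x of each request.
for hour in range(24):
    u = hourly_load(hour)
    print(f"hour {hour:02d}: mean load {u.mean():.2f}, mean delay proxy {tau.mean():.3f}")
```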

Preferably, in this embodiment, FIG. 7 lists the parameters of the deep reinforcement learning network model.

Step S2: the deep reinforcement learning network model derives the optimal allocation strategy from the data;

Step S3: according to the optimal allocation strategy, the data is sent to edge node e for computation, realizing edge IoT agent resource allocation.

In this embodiment, specifically, as shown in FIG. 2, the training method of the deep reinforcement learning network model in step S1 comprises the following steps:

Step S101: initialize the system state s of the deep reinforcement learning network model;

Step S102: initialize the real-time ANN and the delayed ANN of the deep reinforcement learning network model;

Step S103: initialize the experience pool O of the deep reinforcement learning network model;

Step S104: according to the current system state s_t, select a system action a_t using the ε-greedy strategy;

Step S105: the environment feeds back the reward σ_{t+1} and the next system state s_{t+1} according to the system action a_t;

Step S106: compute the state transition sequence Δ_t from the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and store Δ_t in the experience pool O;

Step S107: determine whether the amount stored in the experience pool O has reached the preset value; if so, draw N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, completing the training of the deep reinforcement learning network model; otherwise, update the current system state s_t to the next system state s_{t+1} and return to step S104.

In this embodiment, specifically, the system state s in step S101 is the local offloading state, expressed as follows:

s = [F, M, B]

where:

F is the offloading decision vector;

M is the computing resource allocation vector;

B is the remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …] [formula image], where b_d is the remaining computing resources of the d-th MEC server, i.e. the total computing resources G_d minus the computing resources allocated to each task in the computing resource allocation vector M;

The system action a_t in step S104 is expressed as follows:

a_t = [x, μ, k]

where:

x is the terminal device;

μ is the offloading scheme of terminal device x;

k is the computing resource allocation scheme of terminal device x;

The reward σ_{t+1} in step S105 is calculated by the following formula:

[formula image]

where:

r is the reward function;

A is the objective function value in the state at the current time t;

A' is the objective function value of the next state reached after taking system action a_t in the current system state s_t;

A'' is the computed value under the all-local offloading condition;

The state transition sequence Δ_t in step S106 is expressed as follows:

Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).

In this embodiment, specifically, as shown in FIG. 3, the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:

Step S1071: for the N state transition sequences, obtain from each sequence the estimated value Q(s_t, a_t, θ) of the state-action pair and the value Q(s_{t+1}, a_{t+1}, θ') of the next state;

Step S1072: compute the target value y of the state-action pair from the value Q(s_{t+1}, a_{t+1}, θ') of the next state and the reward σ_{t+1};

Step S1073: compute the loss function Loss(θ) from the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;

Step S1074: adjust the parameters θ of the real-time ANN through back-propagation of the loss, and reduce the loss function Loss(θ) using the RMSprop optimizer;

Step S1075: determine whether the number of steps since the parameters θ' of the delayed ANN were last updated equals the set value; if so, update the parameters θ' of the delayed ANN and go to step S1077; otherwise, go to step S1076;

Step S1076: determine whether training on the N state transition sequences has finished; if so, draw N new state transition sequences from the experience pool O and return to step S1071; otherwise, return to step S1071;

Step S1077: test the performance indicators of the deep reinforcement learning network model to obtain a test result;

Step S1078: determine whether the test result meets the requirements; if so, the training of the real-time ANN and the delayed ANN ends and the trained deep reinforcement learning network model is obtained; otherwise, draw N new state transition sequences from the experience pool O and return to step S1071.

In this embodiment, specifically, the target value y of the state-action pair in step S1072 is calculated as follows:

y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ')

where:

γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ');

Q(s_{t+1}, a_{t+1}, θ') is the value of the next state of the system;

max Q(s_{t+1}, a_{t+1}, θ') is the maximum value of the next state of the system;

The loss function Loss(θ) in step S1073 is expressed as follows:

Loss(θ) = (1/N) · Σ_{n=1}^{N} ( y_n − Q(s_t, a_t, θ) )²

where:

N is the number of state transition sequences drawn each time;

n is the index of a state transition sequence.

In this embodiment, specifically, the performance indicators of the deep reinforcement learning network model in step S1077 include the global cost and the reliability;

The global cost comprises the delay cost c_1, the migration cost c_2 and the load cost c_3.

In this embodiment, three factors are considered in order to achieve efficient task processing: the delay cost c_1, the migration cost c_2 and the load cost c_3. Because terminal device x needs to send the collected data to edge node e for processing, the data transmission introduces a time delay. When processing a task, edge node e may also decide whether to forward the task to a migration edge node j; however, because migrating a task requires redeploying the model, a migration cost is incurred. Since the capacity of edge node e is limited, deploying too many tasks on the same edge node e tends to overload it, which produces a load cost.

In this embodiment, specifically, the delay cost c_1 is given by the following expression:

[formula image]

where:

t is the number of interactions;

X is the set of terminal devices;

E is the set of edge nodes;

u_x is the amount of data sent;

[symbol] is the deployment variable of terminal device x and edge node e in the current interaction time;

τ_{xe} is the transmission delay between terminal device x and edge node e;

The migration cost c_2 is given by the following expression:

[formula image]

where:

j is the migration edge node;

[symbol] is the deployment variable of terminal device x and edge node e in the previous interaction time;

[symbol] is the deployment variable of terminal device x and the migration edge node j in the current interaction time;

The load cost c_3 is given by the following expression:

[formula image]

where:

u_x is the amount of data sent.

In this embodiment, specifically, as shown in FIG. 4, the calculation of the reliability includes the following steps:

Step A1: store the interaction data of terminal device x and edge node e in a sliding window, and update it in real time;

In this embodiment, because interaction experience separated by a long interval is not sufficient to update the current reliability value in time, more attention should be paid to the most recent interaction behavior, so a sliding window mechanism is used to update the interaction information. As shown in FIG. 5, when the interaction information of the next time slot arrives, the record of the time slot with the longest elapsed time is discarded from the window and the valid interaction information is recorded in the window, which reduces the computational overhead of the user terminal;

Step A2: according to the historical interaction data of terminal device x and edge node e, calculate the time decay degree and the resource allocation rate of the current interaction using the expected value based on Bayesian trust evaluation;

In this embodiment, because the reliability of the edge server is updated dynamically, the longer the historical interaction information is from the current time, the smaller its influence on the current reliability evaluation. The time decay function [formula image] represents the degree to which the information obtained from the w-th interaction has decayed by the current interaction time slot, where Δt_w = t − t_w and t_w is the end time of the w-th interaction time slot. The amount of computing resources that the edge server can provide during each interaction also affects the update of the interaction information;

Step A3: calculate the reliability T_ex(t) from the time decay degree and the resource allocation rate.

In this embodiment, specifically, the reliability T_ex(t) is calculated by the following formulas:

[formula image: T_ex(t)]

[formula image: P_ex(t)]

N_ex(t) = 1 − P_ex(t)

where:

U is the number of valid records in the sliding window;

w is the current interaction record;

[symbol] is the time decay degree;

H_ex(t_w) is the resource allocation rate;

ε is the fluctuation coefficient of [symbol];

P_ex(t_w) is the positive service satisfaction of the current interaction;

N_ex(t_w) is the negative service satisfaction of the current interaction;

s_ex(t) is the number of successful historical interactions between terminal device x and edge node e;

f_ex(t) is the number of failed historical interactions between terminal device x and edge node e.

In this embodiment, specifically, the time decay degree in step A2 is given by the following expression:

[formula image]

where:

Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;

The resource allocation rate in step A2 is calculated as follows:

H_ex(t) = source_ex(t) / source_e(t)

where:

source_ex(t) is the amount of resources that edge node e can provide to terminal device x in the current time slot;

source_e(t) is the total amount of resources that edge node e can provide in the current time slot.

In this embodiment, the deep reinforcement learning network model is used to solve for the optimal allocation strategy. As shown in FIG. 6, the model contains two neural networks. The first, called the real-time ANN, computes the estimated value Q(s_t, a_t, θ) of the current state-action pair, where θ denotes the parameters of the real-time ANN and is updated every time the estimate of the current state is computed. The second, called the delayed ANN, computes the value Q(s_{t+1}, a_{t+1}, θ') of the next state, which is used to compute the target value y.

In this embodiment, the influence of different learning rates on the deep reinforcement learning network model was tested. As shown in FIG. 8, when the learning rate is set to 0.01, the network loss function does not converge effectively and the loss value oscillates noticeably. When the learning rate is set to 0.0001, the dispersion of the network is effectively improved and the network converges effectively at around 60 iterations, although the convergence speed is noticeably slower. Overall, a learning rate of 0.0001 gives the best resource allocation performance: the loss function drops quickly, the network converges more stably, and the convergence result is better.

The above embodiments only express specific implementations of the present application; their descriptions are relatively specific and detailed, but they should not be understood as limiting the scope of protection of the present application. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the technical solution of the present application, and these all fall within the scope of protection of the present application.

This Background section is provided to present the context of the invention in general terms. The work of the presently named inventors, to the extent described in this Background section, as well as aspects of the description that did not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted to be prior art against the present invention.

Claims (10)

1. An edge Internet of Things agent resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: collecting data in the environment by a terminal device x, and transmitting the data to a deep reinforcement learning network model;
step S2: obtaining an optimal allocation strategy by the deep reinforcement learning network model according to the data;
step S3: sending the data to an edge node e for calculation according to the optimal allocation strategy, so as to realize the allocation of the edge Internet of Things agent resources.
2. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 1, wherein the training method of the deep reinforcement learning network model in step S1 comprises the following steps:
step S101: initializing a system state s of the deep reinforcement learning network model;
step S102: initializing a real-time ANN and a delayed ANN of the deep reinforcement learning network model;
step S103: initializing an experience pool O of the deep reinforcement learning network model;
step S104: selecting a system action a_t according to the current system state s_t by using an ε-greedy strategy;
step S105: feeding back, by the environment, a reward σ_{t+1} and a next system state s_{t+1} according to the system action a_t;
step S106: calculating a state transition sequence Δ_t according to the current system state s_t, the system action a_t, the reward σ_{t+1} and the next system state s_{t+1}, and storing the state transition sequence Δ_t in the experience pool O;
step S107: judging whether the storage amount of the experience pool O reaches a preset value; if so, extracting N state transition sequences from the experience pool O to train the real-time ANN and the delayed ANN, thereby finishing the training of the deep reinforcement learning network model; otherwise, updating the current system state s_t to the next system state s_{t+1} and returning to step S104.
3. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 2, wherein the system state s in step S101 is a local offloading state, expressed as follows:
s = [F, M, B]
wherein:
F is an offloading decision vector;
M is a computing resource allocation vector;
B is a remaining computing resource vector, B = [b_1, b_2, b_3, …, b_d, …] [formula image], where b_d is the remaining computing resources of the d-th MEC server, i.e. the total computing resources G_d minus the computing resources allocated to each task in the computing resource allocation vector M;
the system action a_t in step S104 is expressed as follows:
a_t = [x, μ, k]
wherein:
x is the terminal device;
μ is the offloading scheme of the terminal device x;
k is the computing resource allocation scheme of the terminal device x;
the reward σ_{t+1} in step S105 is calculated by the following formula:
[formula image]
wherein:
r is a reward function;
A is the objective function value in the state at the current time t;
A' is the objective function value of the next state reached after taking the system action a_t in the current system state s_t;
A'' is the computed value under the all-local offloading condition;
the state transition sequence Δ_t in step S106 is expressed as follows:
Δ_t = (s_t, a_t, σ_{t+1}, s_{t+1}).
4. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 3, wherein the training method for the real-time ANN and the delayed ANN in step S107 comprises the following steps:
step S1071: for the N state transition sequences, obtaining an estimated value Q(s_t, a_t, θ) of the state-action pair and a value Q(s_{t+1}, a_{t+1}, θ') of the next state according to each state transition sequence;
step S1072: calculating a target value y of the state-action pair according to the value Q(s_{t+1}, a_{t+1}, θ') of the next state and the reward σ_{t+1};
step S1073: calculating a loss function Loss(θ) according to the estimated value Q(s_t, a_t, θ) of the state-action pair and the target value y;
step S1074: adjusting the parameters θ of the real-time ANN through a loss back-propagation mechanism, and reducing the loss function Loss(θ) by using the RMSprop optimizer;
step S1075: judging whether the number of steps since the parameters θ' of the delayed ANN were last updated equals a set value; if so, updating the parameters θ' of the delayed ANN and proceeding to step S1077; otherwise, proceeding to step S1076;
step S1076: judging whether the training on the N state transition sequences is finished; if so, re-extracting N state transition sequences from the experience pool O and returning to step S1071; otherwise, returning to step S1071;
step S1077: testing the performance indicators of the deep reinforcement learning network model to obtain a test result;
step S1078: judging whether the test result meets the requirements; if so, finishing the training of the real-time ANN and the delayed ANN to obtain a trained deep reinforcement learning network model; otherwise, re-extracting N state transition sequences from the experience pool O and returning to step S1071.
5. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 4, wherein the target value y of the state-action pair in step S1072 is calculated by the following formula:
y = σ_{t+1} + γ · max Q(s_{t+1}, a_{t+1}, θ')
wherein:
γ is the fluctuation coefficient of max Q(s_{t+1}, a_{t+1}, θ');
Q(s_{t+1}, a_{t+1}, θ') is the value of the next state of the system;
max Q(s_{t+1}, a_{t+1}, θ') is the maximum value of the next state of the system;
the loss function Loss(θ) in step S1073 is expressed as follows:
Loss(θ) = (1/N) · Σ_{n=1}^{N} ( y_n − Q(s_t, a_t, θ) )²
wherein:
N is the number of state transition sequences extracted each time;
n is the sequence number of a state transition sequence.
6. The method of claim 5, wherein the performance indicators of the deep reinforcement learning network model in step S1077 include: a global cost and a reliability;
the global cost comprises a delay cost c_1, a migration cost c_2 and a load cost c_3.
7. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 6, wherein the delay cost c_1 is given by the following expression:
[formula image]
wherein:
t is the number of interactions;
X is the set of terminal devices;
E is the set of edge nodes;
u_x is the amount of data sent;
[symbol] is the deployment variable of the terminal device x and the edge node e in the current interaction time;
τ_{xe} is the transmission delay between the terminal device x and the edge node e;
the migration cost c_2 is given by the following expression:
[formula image]
wherein:
j is a migration edge node;
[symbol] is the deployment variable of the terminal device x and the edge node e in the previous interaction time;
[symbol] is the deployment variable of the terminal device x and the migration edge node j in the current interaction time;
the load cost c_3 is given by the following expression:
[formula image]
wherein:
u_x is the amount of data sent.
8. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 6, wherein the calculation of the reliability comprises the following steps:
step A1: storing the interaction data of the terminal device x and the edge node e in a sliding window, and updating it in real time;
step A2: calculating the time decay degree and the resource allocation rate of the current interaction by using an expected value based on Bayesian trust evaluation according to the historical interaction data of the terminal device x and the edge node e;
step A3: calculating the reliability T_ex(t) according to the time decay degree and the resource allocation rate.
9. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 8, wherein the reliability T_ex(t) is calculated by the following formulas:
[formula image: T_ex(t)]
[formula image: P_ex(t)]
N_ex(t) = 1 − P_ex(t)
wherein:
U is the number of valid records in the sliding window;
w is the current interaction record;
[symbol] is the time decay degree;
H_ex(t_w) is the resource allocation rate;
ε is the fluctuation coefficient of [symbol];
P_ex(t_w) is the positive service satisfaction of the current interaction;
N_ex(t_w) is the negative service satisfaction of the current interaction;
s_ex(t) is the number of successful historical interactions between the terminal device x and the edge node e;
f_ex(t) is the number of failed historical interactions between the terminal device x and the edge node e.
10. The edge Internet of Things agent resource allocation method based on deep reinforcement learning according to claim 9, wherein the time decay degree in step A2 is given by the following expression:
[formula image]
wherein:
Δt_w is the time gap from the end of the w-th interaction to the start of the current interaction;
the resource allocation rate in step A2 is calculated as follows:
H_ex(t) = source_ex(t) / source_e(t)
wherein:
source_ex(t) is the amount of resources that the edge node e can provide to the terminal device x in the current time slot;
source_e(t) is the total amount of resources that the edge node e can provide in the current time slot.
CN202211401605.2A 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning Active CN115914227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211401605.2A CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115914227A true CN115914227A (en) 2023-04-04
CN115914227B CN115914227B (en) 2024-03-19

Family

ID=86493215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211401605.2A Active CN115914227B (en) 2022-11-10 2022-11-10 Edge internet of things proxy resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115914227B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 A cloud-edge collaborative computing migration method based on deep reinforcement learning
US20220180174A1 (en) * 2020-12-07 2022-06-09 International Business Machines Corporation Using a deep learning based surrogate model in a simulation
CN113890653A (en) * 2021-08-30 2022-01-04 广东工业大学 Multi-agent reinforcement learning power allocation method for multi-user benefit
CN114490057A (en) * 2022-01-24 2022-05-13 电子科技大学 A Deep Reinforcement Learning-Based Resource Allocation Method for MEC Offloaded Tasks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BO FENG 等: "Influence analysis of neutral point grounding mode on the single-phase grounding fault characteristics of distribution network with distributed generation", 2020 5TH ASIA CONFERENCE ON POWER AND ELECTRICAL ENGINEERING (ACPEE), 30 June 2020 (2020-06-30) *
朱斐;吴文;刘全;伏玉琛;: "一种最大置信上界经验采样的深度Q网络方法", 计算机研究与发展, no. 08, 15 August 2018 (2018-08-15) *
李孜恒;孟超: "基于深度强化学习的无线网络资源分配算法", 通信技术, vol. 53, no. 008, 31 December 2020 (2020-12-31) *
李孜恒;孟超;: "基于深度强化学习的无线网络资源分配算法", 通信技术, no. 08, 10 August 2020 (2020-08-10) *
饶宁等: "基于多智能体深度强化学习的分布式协同 干扰功率分配算法", 电子学报, 30 June 2022 (2022-06-30) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118916180A (en) * 2024-09-30 2024-11-08 苏州元脑智能科技有限公司 Resource scheduling method, device, program product and storage medium

Also Published As

Publication number Publication date
CN115914227B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN114219097B (en) Federal learning training and predicting method and system based on heterogeneous resources
CN113610303A (en) Load prediction method and system
CN112579194B (en) Block chain consensus task unloading method and device based on time delay and transaction throughput
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN114340016B (en) Power grid edge calculation unloading distribution method and system
CN111813506A (en) A resource-aware computing migration method, device and medium based on particle swarm optimization
CN113055923B (en) Mobile network traffic prediction method, device and device
CN115809147B (en) Multi-edge cooperative cache scheduling optimization method, system and model training method
CN117539648A (en) Service quality management method and device for electronic government cloud platform
CN107566535A (en) Adaptive load balancing strategy based on user concurrent access timing planning in a kind of web map service
CN115914227A (en) Edge Internet of things agent resource allocation method based on deep reinforcement learning
Lorido-Botran et al. ImpalaE: Towards an optimal policy for efficient resource management at the edge
CN111901134B (en) Method and device for predicting network quality based on recurrent neural network model (RNN)
CN112532459A (en) Bandwidth resource adjusting method, device and equipment
Huang The value-of-information in matching with queues
CN116566696A (en) A security evaluation system and method based on cloud computing
CN116954866A (en) Edge cloud task scheduling method and system based on deep reinforcement learning
CN104113590A (en) Copy selection method based on copy response time prediction
TW202327380A (en) Method and system for federated reinforcement learning based offloading optimization in edge computing
CN116074331A (en) Block data synchronization method and related product
Preetham et al. Resource provisioning in cloud using ARIMA and LSTM technique
CN119071746B (en) Efficient SMS distribution method and system based on 5G
CN114398162B (en) A collaborative scheduling method for edge computing tasks
CN118631764B (en) Communication method and system of intelligent equipment in service operation range
CN118283705A (en) Load balancing method, server, load balancing device, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant