CN111367657A

CN111367657A - Computing resource collaborative cooperation method based on deep reinforcement learning

Info

Publication number: CN111367657A
Application number: CN202010107300.5A
Authority: CN
Inventors: 陈沛锐; 于秀兰
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2020-07-03
Anticipated expiration: 2040-02-21
Also published as: CN111367657B

Abstract

The invention relates to a computing resource collaboration method based on deep reinforcement learning, and belongs to the field of edge computing resource allocation. The method is: deploy edge servers in a honeycomb-like area in a dense 5G area; treat each edge server as an agent, and record a sample of computing resource status and corresponding actions for a period of time; replay from experience at each time t Randomly select state samples to obtain experience tuples, and then store each experience tuple in experience replay to accumulate experience; obtain new experience tuples for the Q value at the same time, and fill in the experience tuples; iterate over the Q value and bring in Two networks, target‑net and eval‑net, are trained to obtain the optimal approximate solution for collaborative decision-making. The invention breaks the association between state samples, makes the samples independent of each other, and improves the utilization rate of computing resources when they cooperate with each other.

Description

A computing resource collaboration method based on deep reinforcement learning

技术领域technical field

本发明属于边缘计算资源分配领域，涉及一种基于深度强化学习的计算资源协同合作方法。The invention belongs to the field of edge computing resource allocation, and relates to a computing resource collaboration method based on deep reinforcement learning.

背景技术Background technique

目前，物联网(IoT)是由互联网技术扩展而来，通过无线网络将无处不在的移动设备(MDs)与传感器连接起来。物联网已广泛应用于许多领域，移动互联网应用中的数据量呈指数级增长。为了提高效率，追求低延迟已经成为一种趋势。但是，数据从终端设备上传至云端，计算后再传回终端设备。这种传统的云计算技术已经不能满足人们对计算效率的高要求。Currently, the Internet of Things (IoT) is an extension of Internet technology that connects ubiquitous mobile devices (MDs) with sensors via wireless networks. The Internet of Things has been widely used in many fields, and the amount of data in mobile Internet applications has grown exponentially. In order to improve efficiency, the pursuit of low latency has become a trend. However, the data is uploaded from the terminal device to the cloud, calculated and then transmitted back to the terminal device. This traditional cloud computing technology can no longer meet people's high requirements for computing efficiency.

5G可以连接无数智能设备，实现数据共享和交互。此外，5G可以提供物联网的基本思路，扩大物联网的覆盖范围，数十亿个MDs接入互联网。对数据服务的需求激增，正对服务供应商和移动网络运营商构成新的挑战。5G中的许多应用，如人脸识别、自然语言处理等，都可以在终端中操作。因此，为了开发计算量卸载，我们需要对计算量卸载和相关的无线电资源分配进行联合管理，这实际上已经引起了很多研究者的关注。它来得正是时候。边缘计算(Edge Computing,EC)能够使移动终端主动地将计算任务转移到附近的边缘服务器，从而边缘人工智能协同合作也应允而生，为解决日益增长的大规模集群计算需求和实现高效的计算协同转移与合作提供了一种有效的方法。5G can connect countless smart devices for data sharing and interaction. In addition, 5G can provide the basic idea of IoT, expand the coverage of IoT, and connect billions of MDs to the Internet. The surge in demand for data services is posing new challenges for service providers and mobile network operators. Many applications in 5G, such as face recognition, natural language processing, etc., can be operated in the terminal. Therefore, in order to exploit computational offloading, we need to jointly manage computational offloading and related radio resource allocation, which has actually attracted the attention of many researchers. It came just in time. Edge computing (EC) enables mobile terminals to actively transfer computing tasks to nearby edge servers, so that edge artificial intelligence collaboration is also possible, in order to solve the growing demand for large-scale cluster computing and achieve efficient computing. Collaborative transfer and cooperation provide an effective method.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明的目的在于提供一种基于深度强化学习的计算资源协同合作方法。In view of this, the purpose of the present invention is to provide a computing resource cooperation method based on deep reinforcement learning.

为达到上述目的，本发明提供如下技术方案：To achieve the above object, the present invention provides the following technical solutions:

一种基于深度强化学习的计算资源协同合作方法，该方法包括以下步骤：A computing resource collaboration method based on deep reinforcement learning, the method includes the following steps:

步骤一：为无缝链接，将边缘服务器形成蜂窝状部署于5G网络密集的区域；Step 1: For seamless connection, the edge servers are deployed in a cellular configuration in areas with dense 5G networks;

步骤二：视每个边缘服务器为一个智能体，将记录一段时刻的计算资源状态和对应的动作作为样本放入experience replay中；Step 2: Treat each edge server as an agent, and record the computing resource status and corresponding actions for a period of time as samples into experience replay;

步骤三：为增加样本的的独立性，在每一个时刻t中从experience replay随机选择状态样本获得经验元组，然后将每一次的经验元组存放入experience replay中积攒经验并存储；Step 3: In order to increase the independence of samples, randomly select state samples from experience replay to obtain experience tuples at each time t, and then store each experience tuple in experience replay to accumulate experience and store;

步骤四：再通过目标网络target net和评估网络eval net对Q值迭代同时得到新的状态再次放入experience replay中，利用损失函数更新权重参数，最终得到最优近似解解，获得边缘服务器协作的最优决策。Step 4: Then, through the target network target net and the evaluation network eval net, the Q value is iterated and the new state is obtained and put into the experience replay again, and the weight parameters are updated with the loss function, and finally the optimal approximate solution is obtained, and the cooperative edge server is obtained. optimal decision.

可选的，所述步骤一中，边缘服务器在接收协同计算结果上花费的精力和时间忽略不计；Optionally, in the step 1, the energy and time spent by the edge server in receiving the collaborative calculation result are ignored;

考虑系统模型，N个移动用户通过无线链路将计算任务卸载到边缘服务器；每个用户有M个独立的任务需要完成；Considering the system model, N mobile users offload computing tasks to edge servers through wireless links; each user has M independent tasks to complete;

为对任务建模，使用蜂窝网络形状来最大化边缘服务器的覆盖利用率；通过合作优化每个边缘任务的卸载决策和服务器计算资源分配，以及任务的传输和接收，制定以最小化完成计算任务的能耗和计算资源充分利用为目标的优化案例。To model tasks, the cellular network shape is used to maximize the coverage utilization of edge servers; offloading decisions and server computing resource allocation for each edge task are optimized cooperatively, as well as task transmission and reception, formulated to minimize completion of computing tasks An optimization case aiming at the full utilization of energy consumption and computing resources.

可选的，所述步骤二中，视每个边缘服务器为一个智能体，将其每时刻的CPU、任务量和能耗的计算资源状态作为一个状态样本，其中合作伙伴的cpu空闲配置文件被定义为终端设备的数据，合作伙伴在持续时间t∈[0,T]内计算这些数据，表示为U_bit(t)；Optionally, in the second step, each edge server is regarded as an agent, and the computing resource status of its CPU, task volume and energy consumption at each moment is taken as a status sample, in which the partner's cpu idle configuration file is Defined as the data of the terminal equipment, the partners calculate these data in the duration t∈[0,T], denoted as U _bit (t);

其中有空闲CPU的协作边缘服务器信息如下：协同边缘服务器CPU状态信息是指CPU随时间的状态，通过定义如下的协同者CPU事件空间、进程和纪元来纪录，其中α＝{α₁,α₂}，表示协同边缘服务器的CPU状态空间样本，α₁和α₂分别协同边缘服务器从忙碌到闲，再由空闲到忙碌；然后将协同的边缘处理器进程定义为协处理器事件序列的时间瞬间：The information of collaborative edge servers with idle CPUs is as follows: The collaborative edge server CPU status information refers to the state of the CPU over time, which is recorded by defining the following collaborative CPU event space, process and epoch, where α={α ₁ , α ₂ }, represents the CPU state space sample of the cooperative edge server, α ₁ and α ₂ cooperate with the edge server from busy to idle, and then from idle to busy respectively; then the cooperative edge processor process is defined as the time instant of the co-processor event sequence :

两个连续事件之间伴随着较长的时间间隔T_k＝s_k-s_k-1，其中

称为一个epoch；Two consecutive events are accompanied by a long time interval T _k =s _k -s _k-1 , where

called an epoch;

CPU的进程是允许离线设计协作计算策略，合作者CPU进程的一个样本路径，对于每个epoch k，让I_k表示CPU状态指示器，其中值1和0分别表示空闲状态和繁忙状态；服务器的CPU空闲配置如下：The process of the CPU is a sample path of the partner CPU process that allows offline design of collaborative computing strategies. For each epoch k, let I _k denote the CPU status indicator, where the values 1 and 0 indicate idle and busy states, respectively; the server's The CPU idle configuration is as follows:

边缘协作者对没有CPU空闲的边缘服务器的特性具有非因果知识；假设合作者在CPU中处理之前，为存储卸载的数据保留一个q位缓冲区；The edge collaborators have acausal knowledge of the characteristics of edge servers with no CPU idle; the collaborators are assumed to reserve a q-bit buffer for storing offloaded data before processing in the CPU;

考虑边缘服务器上的两种数据到达形式；一个任务到达假设输入L-bit在t＝0时间到达，因此边缘服务器CPU的事件空间和进程遵循上诉；另一方面，突发的数据到达形成一个随机过程；对于突发数据到达的情况，用

表示组合事件空间，α₃表示新的任务状态到达边缘服务器，对应的过程是一个变量序列：Consider two forms of data arrival on the edge server; a task arrival assumes that the input L-bit arrives at time t = 0, so the event space and process of the edge server CPU follow the appeal; on the other hand, burst data arrivals form a random process; for the case of burst data arrival, use

represents the combined event space, α ₃ represents the arrival of the new task state to the edge server, and the corresponding process is a variable sequence:

{α₁,α₂,α₃...}表示事件序列的时间瞬间；而且，每时每刻

让L_k表示数据到达的大小，L_k＝0表示α₁和α₂状态，L_k≠0表示α₃的状态；

此外否则到达截止日期的任务无法计算，然后输入总数据

然后将其每时刻的计算资源状态作为一个状态样本；{α ₁ , α ₂ , α ₃ ...} represent the time instants of the sequence of events; moreover, every moment

Let L _k denote the size of data arrival, L _k = 0 for α ₁ and α ₂ states, and L _k ≠ 0 for α ₃ states;

Also otherwise tasks that have reached their due date cannot be counted, then enter the total data

Then take its computing resource status at every moment as a status sample;

通过选择状态样本将选择动作来表示如何在两个不同的相邻边缘服务器之间协同合作；对应两个不同相邻状态之间的特定变化/移动；用变量v表示不同时间状态的编号v＝1,2，…NM+3N，然后再考虑行动a(t)＝{a_v(t)}，动作1×(NM+3N)取决于v的选择，对于操作的选择，有以下的行动；Actions will be selected by selecting state samples to represent how to cooperate between two different adjacent edge servers; corresponding to a specific change/movement between two different adjacent states; use variable v to represent the number of different time states v = 1,2,...NM+3N, then consider the action a(t)={a _v (t)}, the action 1×(NM+3N) depends on the choice of v, for the choice of action, there are the following actions;

当1≤v≤NM相应的动作a_v(t)意味着如何改变任务x_nm卸载决策；具体来说,使用：When 1 ≤ v ≤ NM the corresponding action a _v (t) means how to change the task x _nm offloading decision; specifically, use:

如a_v(t)＝1,则

If a _v (t) = 1, then

如a_v(t)＝0,则

If a _v (t) = 0, then

是整型运算，mod(v,M)是余数运算，找到相应的服务器任务v；

is an integer operation, mod(v, M) is a remainder operation, find the corresponding server task v;

当NM+1≤v≤NM+N和NM+N+1≤v≤NM+2N时，相应的动作a_v(t)表示为边缘服务器安排协同计算资源，动作为：When NM+1≤v≤NM+N and NM+N+1≤v≤NM+2N, the corresponding action a _v (t) is expressed as arranging collaborative computing resources for the edge server, and the action is:

如a_v(t)＝1,则

If a _v (t) = 1, then

如a_v(t)＝0,则

If a _v (t) = 0, then

其中更新计算资源由

其中C^co为边缘服务器的CPU处理计算任务的周期数，C^co,max为CPU计算的最大周期，N^do,tot为CPU核数。where the update computing resources are

Among them, C ^co is the number of cycles that the CPU of the edge server processes computing tasks, C ^co,max is the maximum cycle of CPU computing, and N ^do,tot is the number of CPU cores.

可选的，所述步骤三中，基于CPU和计算任务作为状态样本和定义动作的选择，从而定义任意边缘服务器的系统状态，初始阶段，对于相应的状态采取对应的动作，选取其中的状态样本作为给定时间的state，并采取一个特定的行动后的动作，为求到最大累积奖励，即Q值；Optionally, in the third step, the system state of any edge server is defined based on the selection of CPU and computing tasks as state samples and defined actions. In the initial stage, corresponding actions are taken for the corresponding states, and the state samples are selected. As a state at a given time, and take a specific action after the action, in order to obtain the maximum cumulative reward, that is, the Q value;

Q_π(s,a)＝E_π[r_t+1+γr_t+1+...|A_t＝a,S_t＝s]Q _π (s,a)=E _π [r _t+1 +γr _t+1 +...|A _t =a,S _t =s]

Q(s,a)←Q(s,a)+δ[r+γmax_aQ(s`,a`)-Q(s,a)]Q(s,a)←Q(s,a)+δ[r+γmax _a Q(s`,a`)-Q(s,a)]

其中Q(s,a)为动作状态函数值，t时刻开始折扣，γ衰变对Q函数的影响，γ越接近1代表它对后面的决策越有影响，γ越接近0代表它越看重当前的利益价值；Where Q(s, a) is the value of the action state function, the discount starts at time t, and the influence of γ decay on the Q function, the closer γ is to 1, the more influence it has on subsequent decisions, the closer γ is to 0, the more it values the current benefit value;

这时便会有一个经验元组表D_t＝(e₁,…,e_t)来记录每组state和action的值；这时在experience replay中经验元组e₁＝(s_t,a,r_t,s_t+1)没有满，并在experience replay中每一个时间步t中随机选择状态样本获得经验元组，然后将每一次的经验元组存放入experience replay中积攒经验。At this time, there will be an experience tuple table D _t =(e ₁ ,...,e _t ) to record the value of each group of state and action; then in experience replay, the experience tuple e ₁ =(s _t ,a, r _t , s _t+1 ) are not full, and randomly select state samples at each time step t in experience replay to obtain experience tuples, and then store each experience tuple in experience replay to accumulate experience.

可选的，所述步骤四中，采取与之前不同的方法，将其Experience replay中经验元组e₁＝(s_t,a,r_t,s_t+1)带入两个相同神经网络中训练，分别是目标网络target net和评估网络eval net；targetnet的输出值Q^tar表示当前边缘服务器的状态样本s下选择行为a时的衰减得分，即：Optionally, in the fourth step, a different method is adopted, and the experience tuple e ₁ =(s _t , a, r _t , s _t+1 ) in its Experience replay is brought into two identical neural networks Training is the target network target net and the evaluation network eval net respectively; the output value Q ^tar of the targetnet represents the decay score when the behavior a is selected under the current state sample s of the edge server, namely:

其中r和s`分别表示边缘服务器当前状态s下采取行为a时的相应得分和相应的下一个观测状态；γ则是衰减因子；a`为边缘服务器在状态s`时采取的行为，w`为target net的权重参数；where r and s` represent the corresponding score and the corresponding next observation state when the edge server takes action a in the current state s, respectively; γ is the decay factor; a` is the action taken by the edge server in state s`, w` is the weight parameter of target net;

evalnet的输出值Q^eval表示当前边缘服务器的状态样本s下采取行动a时的得分：The output value of evalnet Q ^eval represents the score of taking action a under the current state sample s of the edge server:

其中w为eval net的权重参数；Where w is the weight parameter of eval net;

又采用了ε-贪心策略获取行为a，在基于网络产生决策同时，以一定的概率协作邻近单个边缘服务器的同时也可能探索多个协作的边缘服务器；不断更新经验池中的经验元组，并作为target net和eval net的输入，得出Q^eval和Q^tar；将Q^eval和Q^tar的差值作为损失函数Loss function，以梯度下降法更新评估网络的权重参数；为训练收敛，目标网络的权重参数是以每隔一段固定的时间通过把评估网络的权重参数复制过来的方法更新，模型如下：The ε-greedy strategy is also used to obtain behavior a. While making decisions based on the network, it is possible to explore multiple cooperative edge servers while cooperating with a single edge server with a certain probability; constantly update the experience tuples in the experience pool, and As the input of target net and eval net, Q ^eval and Q ^tar are obtained; the difference between Q ^eval and Q ^tar is used as the loss function Loss function, and the weight parameters of the evaluation network are updated by gradient descent method; for training convergence, the target network's weight parameters The weight parameters are updated by copying the weight parameters of the evaluation network at regular intervals. The model is as follows:

其中s_t和a分别表示边缘服务器当前状态和当前所作出的动作，r表示采取这个行动所得到的奖励reward，γ是期望的折算discount因子，s_t+1表示未来下一步所处的状态，w用于深度神经网络拟合的向量；where s _t and a represent the current state and current action of the edge server respectively, r represents the reward obtained by taking this action, γ is the expected discount factor, and s _t+1 represents the state of the next step in the future, w vector for deep neural network fitting;

然后，利用梯度下降算法最小化目标网络输出与预测之间的差异，即主网输出：Then, the gradient descent algorithm is used to minimize the difference between the target network output and the prediction, i.e. the main network output:

Loss＝(Q^traget(s_t,a)-Q^pre(s_t,a,w))² Loss=(Q ^target (s _t ,a)-Q ^pre (s _t ,a,w)) ²

最终利用经验元组对两个神经网络的训练，不断迭代Q值，使其在边缘服务器受限计算资源状态的情况下边缘服务器将会得到一个最优近似解的解，作为边缘服务器协同合作的最优策略。Finally, the two neural networks are trained using the experience tuple, and the Q value is continuously iterated, so that the edge server will obtain an optimal approximate solution under the condition that the edge server is limited by the computing resource state. optimal strategy.

本发明的有益效果在于：将采集到的计算资源状态和对应的选择动作作为样本放入经验池中，然后从经验池中随机抽取一个样本进行两个行神经网络训练。第一部分这种处理打破了样品之间的联系，使样品彼此独立。固定Q目标网络:要计算网络的目标值，需要已有的Q值。使用较慢的更新网络来提供此Q值。这提高了训练的稳定性和收敛性，以进一步提高了边缘服务器之间的协作效率与成本。The beneficial effect of the invention is that the collected computing resource state and the corresponding selection action are put into the experience pool as samples, and then a sample is randomly selected from the experience pool for two-line neural network training. The first part of this treatment breaks the connection between the samples and makes the samples independent of each other. Fixed-Q target network: To calculate the target value of the network, an existing Q value is required. A slower update network is used to provide this Q value. This improves the stability and convergence of training to further improve the efficiency and cost of collaboration between edge servers.

本发明的其他优点、目标和特征在某种程度上将在随后的说明书中进行阐述，并且在某种程度上，基于对下文的考察研究对本领域技术人员而言将是显而易见的，或者可以从本发明的实践中得到教导。本发明的目标和其他优点可以通过下面的说明书来实现和获得。Other advantages, objects, and features of the present invention will be set forth in the description that follows, and will be apparent to those skilled in the art based on a study of the following, to the extent that is taught in the practice of the present invention. The objectives and other advantages of the present invention may be realized and attained by the following description.

附图说明Description of drawings

为了使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本发明作优选的详细描述，其中：In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be preferably described in detail below with reference to the accompanying drawings, wherein:

图1为基于深度强化学习的计算资源协同合作方法流程图；Fig. 1 is a flow chart of a computing resource collaboration method based on deep reinforcement learning;

图2为边缘服务器部署示意图；Figure 2 is a schematic diagram of edge server deployment;

图3为需求最优解的协作示意图。Figure 3 is a schematic diagram of the collaboration of the demand for the optimal solution.

具体实施方式Detailed ways

以下通过特定的具体实例说明本发明的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本发明的其他优点与功效。本发明还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本发明的精神下进行各种修饰或改变。需要说明的是，以下实施例中所提供的图示仅以示意方式说明本发明的基本构想，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。The embodiments of the present invention are described below through specific specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the drawings provided in the following embodiments are only used to illustrate the basic idea of the present invention in a schematic manner, and the following embodiments and features in the embodiments can be combined with each other without conflict.

其中，附图仅用于示例性说明，表示的仅是示意图，而非实物图，不能理解为对本发明的限制；为了更好地说明本发明的实施例，附图某些部件会有省略、放大或缩小，并不代表实际产品的尺寸；对本领域技术人员来说，附图中某些公知结构及其说明可能省略是可以理解的。Among them, the accompanying drawings are only used for exemplary description, and represent only schematic diagrams, not physical drawings, and should not be construed as limitations of the present invention; in order to better illustrate the embodiments of the present invention, some parts of the accompanying drawings will be omitted, The enlargement or reduction does not represent the size of the actual product; it is understandable to those skilled in the art that some well-known structures and their descriptions in the accompanying drawings may be omitted.

本发明实施例的附图中相同或相似的标号对应相同或相似的部件；在本发明的描述中，需要理解的是，若有术语“上”、“下”、“左”、“右”、“前”、“后”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此附图中描述位置关系的用语仅用于示例性说明，不能理解为对本发明的限制，对于本领域的普通技术人员而言，可以根据具体情况理解上述术语的具体含义。The same or similar numbers in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there are terms “upper”, “lower”, “left” and “right” , "front", "rear" and other indicated orientations or positional relationships are based on the orientations or positional relationships shown in the accompanying drawings, and are only for the convenience of describing the present invention and simplifying the description, rather than indicating or implying that the indicated device or element must be It has a specific orientation, is constructed and operated in a specific orientation, so the terms describing the positional relationship in the accompanying drawings are only used for exemplary illustration, and should not be construed as a limitation of the present invention. situation to understand the specific meaning of the above terms.

本发明解决上述技术问题的技术方案是：The technical scheme that the present invention solves the above-mentioned technical problems is:

请参阅图1～图3，下面结合附图进行说明，本发明包括以下几个步骤：Please refer to FIG. 1 to FIG. 3 . The following description will be given in conjunction with the accompanying drawings. The present invention includes the following steps:

步骤一：边缘服务器在接收协同计算结果上花费的精力和时间可以忽略不计，因为它们通常比卸载对应的小得多。扩展当前分析以包含开销是简单而乏味的。考虑系统模型，N个移动用户通过无线链路将他们的计算任务卸载到边缘服务器。每个用户有M个独立的任务需要完成。为了对任务建模，使用蜂窝网络形状来最大化边缘服务器的覆盖利用率。然后，通过合作优化每个边缘任务的卸载决策和服务器计算资源分配，以及任务的传输和接收，制定了以最小化完成计算任务的能耗和计算资源充分利用为目标的优化案例如图2所示。Step 1: Edge servers spend negligible effort and time in receiving collaborative computation results, as they are usually much smaller than their offload counterparts. Extending the current analysis to include overhead is simple and tedious. Considering the system model, N mobile users offload their computing tasks to edge servers through wireless links. Each user has M independent tasks to complete. To model the task, the cellular network shape is used to maximize the coverage utilization of edge servers. Then, by cooperatively optimizing the offloading decision of each edge task and the allocation of server computing resources, as well as the transmission and reception of tasks, an optimization case with the goal of minimizing the energy consumption of completing computing tasks and making full use of computing resources is formulated as shown in Figure 2. Show.

步骤二：视每个边缘服务器为一个智能体，将其每时刻的CPU、任务量、能耗等计算资源状态作为一个状态样本，其中合作伙伴的cpu空闲配置文件被定义为终端设备的数据(以比特为单位)，合作伙伴可以在持续时间t∈[0,T]内计算这些数据，它被表示为U_bit(t)。其中有空闲CPU的协作边缘服务器信息如下：协同边缘服务器CPU状态信息是指CPU随时间的状态，可以通过定义如下的协同者CPU事件空间、进程和纪元来纪录，其中α＝{α₁,α₂}，表示协同边缘服务器的CPU状态空间样本，α₁和α₂分别协同边缘服务器从忙碌到闲，再由空闲到忙碌。然后可以将协同的边缘处理器进程定义为协处理器事件序列的时间瞬间：Step 2: Treat each edge server as an agent, and take its computing resource status such as CPU, task volume, energy consumption, etc. as a status sample at each moment, in which the partner's cpu idle configuration file is defined as the data of the terminal device ( in bits), the partners can compute these data for the duration t∈[0,T], which is denoted as U _bit (t). The information of the collaborative edge servers with idle CPUs is as follows: The collaborative edge server CPU status information refers to the state of the CPU over time, which can be recorded by defining the following collaborative CPU event space, process and epoch, where α={α ₁ ,α ₂ }, represents the CPU state space sample of the cooperative edge server, α ₁ and α ₂ cooperate with the edge server from busy to idle, and then from idle to busy respectively. A co-processor process can then be defined as a time instant of a sequence of co-processor events:

两个连续事件之间伴随着较长的时间间隔T_k＝s_k-s_k-1，其中

称为一个epoch。Two consecutive events are accompanied by a long time interval T _k =s _k -s _k-1 , where

called an epoch.

CPU的进程是允许离线设计协作计算策略，如合作者CPU进程的一个样本路径，对于每个epoch k，让I_k表示CPU状态指示器，其中值1和0分别表示空闲状态和繁忙状态。然而，用f_h表示协作者的常数CPU频率，再用C表示需要计算用户的1位输入数据中CPU的周期数。基于上述定义，服务器的CPU空闲配置如下：The CPU's process is a sample path that allows offline design of collaborative computing strategies, such as collaborator CPU processes. For each epoch k, let _Ik denote the CPU status indicator, where the values 1 and 0 represent idle and busy states, respectively. However, let f _h denote the constant CPU frequency of the collaborator, and let C denote the number of CPU cycles required to calculate the user's 1-bit input data. Based on the above definition, the server's CPU idle configuration is as follows:

边缘协作者对没有CPU空闲的边缘服务器的特性具有非因果知识。最后，假设合作者在CPU中处理之前，为存储卸载的数据保留一个q位缓冲区。Edge collaborators have acausal knowledge of the characteristics of edge servers with no CPU idle. Finally, assume that the collaborator reserves a q-bit buffer for storing offloaded data before processing in the CPU.

考虑边缘服务器上的两种数据到达形式。一个任务到达假设输入L-bit在t＝0时间到达，因此边缘服务器CPU的事件空间和进程遵循上诉。另一方面，突发的数据到达形成一个随机过程。对于简短的规范，定义一个结合两个进程的数据到达和边缘服务器CPU的随机进程是很有用的。因此对于突发任务到达的组合随机过程。对于突发数据到达的情况，用

表示组合事件空间，α₁,α₂上述已经介绍，α₃表示新的任务状态到达边缘服务器，对应的过程是一个变量序列:Consider two forms of data arrival at the edge server. A task arrival assumes that the input L-bit arrives at time t=0, so the event space and process of the edge server CPU follow the appeal. On the other hand, bursts of data arrivals form a random process. For a short specification, it is useful to define a random process that combines the data arrivals of the two processes and the CPU of the edge server. Hence the combinatorial random process for burst task arrivals. For burst data arrival, use

Represents the combined event space, α ₁ , α ₂ have been introduced above, α ₃ indicates that the new task state arrives at the edge server, and the corresponding process is a variable sequence:

{α₁,α₂,α₃...}表示事件序列的时间瞬间。而且，每时每刻

让L_k表示数据到达的大小，L_k＝0表示α₁和α₂状态，L_k≠0表示α₃的状态。另外，

此外否则到达截止日期的任务无法计算，然后输入总数据

然后将其每时刻的CPU、任务量等计算资源状态作为一个状态样本。{α ₁ , α ₂ , α ₃ ...} represent the time instants of the sequence of events. And, every moment

Let L _k denote the size of data arrival, L _k = 0 for the α ₁ and α ₂ states, and L _k ≠ 0 for the α ₃ state. in addition,

Then, its computing resource status such as CPU and task amount at each moment is taken as a status sample.

通过选择状态样本将选择动作来表示如何在两个不同的相邻边缘服务器之间协同合作。具体来说，对应两个不同相邻状态之间的特定变化/移动。用变量v表示不同时间状态的编号v＝1,2，…NM+3N，然后再考虑行动a(t)＝{a_v(t)}，动作1×(NM+3N)取决于v的选择，对于操作的选择，有以下的行动。Actions are selected by selecting state samples to represent how to cooperate between two different adjacent edge servers. Specifically, corresponds to a specific change/movement between two different adjacent states. Use the variable v to represent the number of different time states v = 1, 2, ... NM + 3N, and then consider the action a(t) = {a _v (t)}, the action 1 × (NM + 3N) depends on the choice of v , for the selection of operations, there are the following actions.

当1≤v≤NM相应的动作a_v(t)意味着如何改变任务x_nm卸载决策。具体来说,使用：When 1≤v≤NM the corresponding action a _v (t) means how to change the offloading decision of task x _nm . Specifically, use:

如a_v(t)＝1,则

If a _v (t) = 1, then

如a_v(t)＝0,则

If a _v (t) = 0, then

是整型运算，mod(v,M)是余数运算，然后我们可以找到相应的服务器任务v。

is an integer operation, mod(v, M) is a remainder operation, and then we can find the corresponding server task v.

如a_v(t)＝1,则

If a _v (t) = 1, then

如a_v(t)＝0,则

If a _v (t) = 0, then

其中更新计算资源由

步骤三：基于CPU和计算任务作为状态样本和定义了动作的选择，从而定义了任意边缘服务器的系统状态，初始阶段，对于相应的状态采取对应的动作，选取其中的状态样本作为给定时间的state，并采取了一个特定的行动后的动作，为了求到最大累积奖励，也就是Q值。Step 3: Based on the CPU and computing tasks as state samples and defining the selection of actions, the system state of any edge server is defined. In the initial stage, corresponding actions are taken for the corresponding states, and the state samples are selected as the state samples at a given time. state, and take an action after a specific action, in order to obtain the maximum cumulative reward, which is the Q value.

其中Q(s,a)为动作状态函数值，t时刻开始折扣，γ衰变对Q函数的影响，γ越接近1代表它对后面的决策越有影响，γ越接近0代表它越看重当前的利益价值。Where Q(s, a) is the value of the action state function, the discount starts at time t, and the influence of γ decay on the Q function, the closer γ is to 1, the more influence it has on subsequent decisions, the closer γ is to 0, the more it values the current benefit value.

这时便会有一个经验元组表D_t＝(e₁,…,e_t)来记录每组state和action的值。这时在experience replay中经验元组e₁＝(s_t,a,r_t,s_t+1)没有满，并在experience replay中每一个时间步t中随机选择状态样本获得经验元组，然后将每一次的经验元组存放入experience replay中积攒经验。At this time, there will be an experience tuple table D _t = (e ₁ ,...,e _t ) to record the value of each group of state and action. At this time, the experience tuple e ₁ =(s _t , a, r _t , s _t+1 ) is not full in the experience replay, and randomly selects a state sample at each time step t in the experience replay to obtain the experience tuple, and then Store each experience tuple in experience replay to accumulate experience.

步骤四：采取与之前不同的方法，将其Experience replay中经验元组e₁＝(s_t,a,r_t,s_t+1)带入两个相同神经网络(三个卷积层和两个全连接层)中训练，分别是targetnet(目标网络)和eval net(评估网络)。target net的输出值Q^tar表示当前边缘服务器的状态样本s下选择行为a时的衰减得分，即：Step 4: Take a different method from before, and bring the experience tuple e ₁ = (s _t , a, r _t , s _t+1 ) in its Experience replay into two identical neural networks (three convolutional layers and two A fully connected layer) are trained in targetnet (target network) and eval net (evaluation network). The output value Q ^tar of the target net represents the decay score when the behavior a is selected under the current state sample s of the edge server, namely:

其中r和s`分别表示边缘服务器当前状态s下采取行为a时的相应得分和相应的下一个观测状态；γ则是衰减因子；a`为边缘服务器在状态s`时采取的行为，w`为target net的权重参数。where r and s` represent the corresponding score and the corresponding next observation state when the edge server takes action a in the current state s, respectively; γ is the decay factor; a` is the action taken by the edge server in state s`, w` is the weight parameter of target net.

其中w为eval net的权重参数。where w is the weight parameter of eval net.

又采用了ε-贪心策略(ε从1至0逐渐减少)获取行为a，在基于网络产生决策同时，又能以一定的概率协作邻近单个边缘服务器的同时也可能探索多个协作的边缘服务器，为了避免最优解的问题。这里便会不断更新经验池中的经验元组，并作为target net和evalnet的输入，得出Q^eval和Q^tar。将Q^eval和Q^tar的差值作为Loss function(损失函数)，以梯度下降法更新评估网络的权重参数。为了训练收敛，目标网络的权重参数是以每隔一段固定的时间通过把评估网络的权重参数复制过来的方法更新，模型如下：It also adopts the ε-greedy strategy (ε gradually decreases from 1 to 0) to obtain behavior a. While making decisions based on the network, it can cooperate with a certain probability to be adjacent to a single edge server, and it is also possible to explore multiple cooperative edge servers. In order to avoid the problem of optimal solution. Here, the experience tuples in the experience pool will be continuously updated and used as the input of target net and evalnet to obtain Q ^eval and Q ^tar . Taking the difference between Q ^eval and Q ^tar as the Loss function (loss function), the weight parameters of the evaluation network are updated by the gradient descent method. For training convergence, the weight parameters of the target network are updated by copying the weight parameters of the evaluation network at regular intervals. The model is as follows:

其中s_t和a分别表示边缘服务器当前状态和当前所作出的动作，r表示采取这个行动所得到的奖励reward，γ是期望的折算discount因子，s_t+1表示未来下一步所处的状态，w用于深度神经网络拟合的向量。where s _t and a represent the current state and current action of the edge server respectively, r represents the reward obtained by taking this action, γ is the expected discount factor, and s _t+1 represents the state of the next step in the future, w Vector for deep neural network fitting.

然后，利用梯度下降算法最小化目标网络输出与预测之间的差异：Then, use the gradient descent algorithm to minimize the difference between the target network output and the prediction:

Loss＝(Q^traget(s_t,a)-Q^eval(s_t,a,w))² Loss=(Q ^target (s _t ,a)-Q ^eval (s _t ,a,w)) ²

最后说明的是，以上实施例仅用以说明本发明的技术方案而非限制，尽管参照较佳实施例对本发明进行了详细说明，本领域的普通技术人员应当理解，可以对本发明的技术方案进行修改或者等同替换，而不脱离本技术方案的宗旨和范围，其均应涵盖在本发明的权利要求范围当中。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be Modifications or equivalent replacements, without departing from the spirit and scope of the technical solution, should all be included in the scope of the claims of the present invention.

Claims

1. a computing resource collaboration method based on deep reinforcement learning, is characterized in that: the method comprises the following steps:

Step 1: For seamless connection, the edge servers are deployed in a cellular configuration in areas with dense 5G networks;

Step 2: Treat each edge server as an agent, and record the computing resource status and corresponding actions for a period of time as samples into experiencereplay;

Step 3: In order to increase the independence of samples, randomly select state samples from experiencereplay to obtain experience tuples at each time t, and then store each experience tuple in experiencereplay to accumulate experience and store;

Step 4: Then through the target network targetnet and the evaluation network evalnet to iterate the Q value and get the new state and put it into experiencereplay again, use the loss function to update the weight parameters, finally get the optimal approximate solution, and obtain the optimal decision for edge server collaboration .

2. A kind of computing resource collaboration method based on deep reinforcement learning according to claim 1, is characterized in that: in described step 1, the energy and time spent by edge server on receiving collaborative computing result are neglected;

Considering the system model, N mobile users offload computing tasks to edge servers through wireless links; each user has M independent tasks to complete;

To model tasks, the cellular network shape is used to maximize the coverage utilization of edge servers; offloading decisions and server computing resource allocation for each edge task are optimized cooperatively, as well as task transmission and reception, formulated to minimize completion of computing tasks An optimization case aiming at the full utilization of energy consumption and computing resources.

3. A kind of computing resource collaboration method based on deep reinforcement learning according to claim 1, it is characterized in that: in described step 2, regard each edge server as an agent, its CPU, task at every moment The computing resource state of the amount and energy consumption is taken as a state sample, in which the partner's cpu idle profile is defined as the data of the terminal device, and the partner computes these data for the duration t ∈ [0, T], denoted as U _bit (t);

The information of collaborative edge servers with idle CPUs is as follows: The collaborative edge server CPU status information refers to the state of the CPU over time, which is recorded by defining the following collaborative CPU event space, process and epoch, where α={α ₁ , α ₂ }, represents the CPU state space sample of the cooperative edge server, α ₁ and α ₂ cooperate with the edge server from busy to idle, and then from idle to busy respectively; then the cooperative edge processor process is defined as the time instant of the co-processor event sequence :

Two consecutive events are accompanied by a long time interval T _k =s _k -s _k-1 , where

called an epoch;

The process of the CPU is a sample path of the partner CPU process that allows offline design of collaborative computing strategies. For each epoch k, let I _k denote the CPU status indicator, where the values 1 and 0 indicate idle and busy states, respectively; the server's The CPU idle configuration is as follows:

The edge collaborators have acausal knowledge of the characteristics of edge servers with no CPU idle; the collaborators are assumed to reserve a q-bit buffer for storing offloaded data before processing in the CPU;

Consider two forms of data arrival on the edge server; a task arrival assumes that the input L-bit arrives at time t=0, so the event space and process of the edge server CPU follow the appeal; on the other hand, burst data arrivals form a random process; for the case of burst data arrival, use

{α ₁ , α ₂ , α ₃ ...} represent the time instants of the sequence of events; moreover, every moment

Then take its computing resource status at every moment as a status sample;

Actions will be selected by selecting state samples to represent how to cooperate between two different adjacent edge servers; corresponding to a specific change/movement between two different adjacent states; use variable v to represent the number of different time states v = 1,2,...NM+3N, then consider the action a(t)={a _v (t)}, the action 1×(NM+3N) depends on the choice of v, for the choice of action, there are the following actions;

When 1 ≤ v ≤ NM the corresponding action a _v (t) means how to change the task x _nm offloading decision; specifically, use:

If a _v (t) = 1, then

If a _v (t) = 0, then

When NM+1≤v≤NM+N and NM+N+1≤v≤NM+2N, the corresponding action a _v (t) is expressed as arranging collaborative computing resources for the edge server, and the action is:

If a _v (t) = 1, then

If a _v (t) = 0, then

where the update computing resources are

4. a kind of computing resource collaboration method based on deep reinforcement learning according to claim 1, is characterized in that: in described step 3, based on CPU and computing task as the selection of state sample and definition action, thereby define any edge The system state of the server, in the initial stage, take the corresponding action for the corresponding state, select the state sample as the state at a given time, and take a specific action after the action, in order to obtain the maximum cumulative reward, that is, the Q value;

Q _π (s,a)=E _π [r _t+1 +γr _t+1 +...|A _t =a,S _t =s]

Q(s,a)←Q(s,a)+δ[r+γmax _a Q(s`,a`)-Q(s,a)]

Where Q(s, a) is the value of the action state function, the discount starts at time t, and the influence of γ decay on the Q function, the closer γ is to 1, the more influence it has on subsequent decisions, the closer γ is to 0, the more it values the current benefit value;

At this time, there will be an experience tuple table D _t =(e ₁ ,...,e _t ) to record the value of each group of state and action; then in experience replay, the experience tuple e ₁ =(s _t ,a, r _t , s _t+1 ) are not full, and randomly select state samples at each time step t in experience replay to obtain experience tuples, and then store each experience tuple in experience replay to accumulate experience.

5. a kind of computing resource collaboration method based on deep reinforcement learning according to claim 1, is characterized in that: in described step 4, adopt different method from before, experience tuple e ₁ = (s _t , a, r _t , s _t+1 ) are brought into two identical neural networks for training, namely the target network target net and the evaluation network eval net; the output value Q ^tar of the target net represents the current state sample of the edge server The decay score when behavior a is selected under s, that is:

where r and s` represent the corresponding score and the corresponding next observation state when the edge server takes action a in the current state s, respectively; γ is the decay factor; a` is the action taken by the edge server in state s`, w` is the weight parameter of target net;

The output value Q ^eval of eval net represents the score of taking action a under the current state sample s of the edge server:

Where w is the weight parameter of eval net;

The ε-greedy strategy is also used to obtain behavior a. While making decisions based on the network, it is possible to explore multiple cooperative edge servers while cooperating with a single edge server with a certain probability; constantly update the experience tuples in the experience pool, and As the input of target net and eval net, Q ^eval and Q ^tar are obtained; the difference between Q ^eval and Q ^tar is used as the loss function Loss function, and the weight parameters of the evaluation network are updated by gradient descent method; for training convergence, the target network's weight parameters The weight parameters are updated by copying the weight parameters of the evaluation network at regular intervals. The model is as follows:

where s _t and a represent the current state and current action of the edge server respectively, r represents the reward obtained by taking this action, γ is the expected discount factor, and s _t+1 represents the state of the next step in the future, w vector for deep neural network fitting;

Then, the gradient descent algorithm is used to minimize the difference between the target network output and the prediction, i.e. the main network output:

Loss=(Q ^target (s _t ,a)-Q ^pre (s _t ,a,w)) ²

Finally, the two neural networks are trained using the experience tuple, and the Q value is continuously iterated, so that the edge server will obtain an optimal approximate solution under the condition that the edge server is limited by the computing resource state. optimal strategy.