CN115767562B - Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission


Info

Publication number: CN115767562B
Authority: CN (China)
Prior art keywords: user, network, deployment, time slot, service function
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202211012894.7A
Other languages: Chinese (zh)
Other versions: CN115767562A
Inventors: Wang Kan (王侃), Yuan Peng (袁鹏), Zhou Hongfang (周红芳), Li Junhuai (李军怀), Wang Huaijun (王怀军)
Current and original assignee: Xi'an University of Technology (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Priority and filing date: 2022-08-23; application filed by Xi'an University of Technology
Publication of CN115767562A: 2023-03-07; grant and publication of CN115767562B: 2024-06-21

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 — Reducing energy consumption in communication networks
    • Y02D 30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a service function chain (SFC) deployment method based on reinforcement learning joint coordinated multi-point (CoMP) transmission. The method first describes an edge network model and the channel characteristics of servers and users in the edge network, and uses beamforming to eliminate communication interference between multiple servers and users; it then establishes a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget; models the long-term optimization problem; decouples the long-term problem into a slot-by-slot optimization problem; and finally formulates a sub-optimization problem for solving the reward function, reducing the complexity of searching the action space. The invention eliminates wireless link interference between users with CoMP-based zero-forcing beamforming, and then uses an Actor-Critic algorithm based on the natural gradient method to decouple the long-term dynamic SFC deployment problem into a slot-by-slot optimization problem that is solved online.

Description

Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission
Technical Field
The invention belongs to the technical field of communications, and particularly relates to a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission.
Background
Under the software-defined networking (SDN) architecture, 6G communication systems are expected to improve network quality of service through the emerging network function virtualization (NFV) technology, which allows service functions to be deployed directly on commodity servers using virtual machine or container technologies instead of proprietary hardware. This lets a 6G operator flexibly scale the network on demand simply by adding commodity servers. With NFV, the logical sequence of service functions that a data packet traverses can be cascaded into an SFC, and flexible customization and deployment of diverse services is realized through the orchestration of virtualized network functions (VNFs).
Currently, much work has studied SFC deployment for traditional network architectures such as content distribution networks and centralized cloud computing networks. Unlike these architectures, deploying SFCs in a 6G edge network places users and VNFs close to each other, so services are obtained nearby and the quality of computation-intensive and delay-sensitive services improves; meanwhile, with NFV technology, edge servers can provide richer VNF orchestration combinations and thus more complex and flexible service types.
At present, research on SFC deployment in edge networks mainly targets single-slot static networks, ignoring both the interference characteristics of wireless channels in the edge network and the dynamics of cache, computing, and communication resources in edge servers. The invention therefore proposes an SFC deployment method for 6G wireless edge networks based on online Actor-Critic learning combined with CoMP beamforming, which uses CoMP-based zero-forcing beamforming to eliminate wireless link interference among multiple SFCs.
Disclosure of Invention
The invention aims to provide a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, which uses CoMP-based zero-forcing beamforming to eliminate wireless link interference among users, and then uses an Actor-Critic algorithm based on the natural gradient method to decouple the long-term dynamic SFC deployment problem into a slot-by-slot optimization problem that is solved online.
The technical scheme adopted by the invention is a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, implemented according to the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 2, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users;
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget;
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3;
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and finally obtaining the optimal solution.
The present invention is also characterized in that,
Step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers; the edge servers are interconnected through X2 links, and each edge server can provide several different virtual functions using virtual machine technology;
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users; each user is assumed to be served by exactly one service function chain (SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the baseband processing function vBBU.
Step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs; let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$; the signal $u_{m,t}$ received by user $m$ in time slot $t$ can then be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$; the beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix;
2.4, the zero-forcing beamforming of the RRHs eliminates wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices; the beamforming matrix of user $m$ can therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable;
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2; only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ can be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$, and the effective beamforming matrix $\Sigma_{m,t}$ can be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t};$$
with this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, namely:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$.
Step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$; $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link cannot exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$, where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively;
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$.
Step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead;
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$; then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$; the total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead; in the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming;
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment horizon; with $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC.
Step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$;
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value;
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
Step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation; a neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and $Q^{\pi}(s_t, a_t)$ is the action-value function;
6.2, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return;
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool;
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$;
6.5, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$;
6.6, the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
Step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; an $L_p$ ($0 < p < 1$) norm penalty function is then introduced to force the relaxed variables back to 0-1 integers; collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$;
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate;
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (SCA) technique converts $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$;
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right);$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
The beneficial effect of the service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission is that long-term dynamic deployment of SFCs can be completed without interference while guaranteeing the quality of service of users, further reducing the operating cost of the edge servers and the wireless transmission power cost of the RRHs during deployment.
Drawings
Fig. 1 is a schematic diagram of a system model of SFC deployment in combination with CoMP beamforming in a wireless edge network according to the present invention;
fig. 2 is a schematic flow chart of an algorithm for SFC online deployment based on Actor-Critic learning in a wireless edge network according to the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
Referring to figs. 1-2, fig. 1 is a schematic diagram of the system model of SFC deployment with joint CoMP beamforming in a wireless edge network, showing two SFC instances, an edge network, the mapping between multiple service functions and edge servers, and two CoMP beams for two different users; fig. 2 is the flowchart of the Actor-Critic algorithm used to solve the SFC online deployment problem in the wireless edge network. The embodiment below describes the SFC online deployment method based on the Actor-Critic algorithm in detail.
The invention discloses a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, which is implemented according to the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (Remote Radio Head, RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers. The edge servers are interconnected through X2 links, and each edge server may provide several different virtual functions (e.g., caching, computing, and firewall) using virtual machine technology.
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users. Each user is assumed to be served by exactly one service function chain (Service Function Chaining, SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the virtualized baseband unit (vBBU).
Step 2, on the basis of step 1, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users;
Step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs. Let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$. The signal $u_{m,t}$ received by user $m$ in time slot $t$ may be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$. The beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
2.4, the zero-forcing beamforming of the RRHs is used to eliminate wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices. The beamforming matrix of user $m$ may therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable.
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2. Only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ may be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$. The effective beamforming matrix $\Sigma_{m,t}$ may be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t}.$$
With this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, i.e.:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget;
Step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$. The indicators $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link must not exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$, where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively.
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$
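A compact way to read Constraints 4 and 6-9 is as elementwise checks on binary placement tensors. The sketch below assumes shapes x[m, l, n], z[m, l, n, s], per-server capacity cap[n], bandwidth bw[n, s], and migration cost c_mig[n, s]; all of these names, and the per-server (rather than per-function) capacity, are simplifying assumptions of this illustration.

```python
import numpy as np

def feasible(x, z, R, cap, bw, c_mig, x_prev, C_mig_th):
    """Schematic check of Constraints 4 and 6-9 of step 3."""
    if not np.all(x.sum(axis=2) == 1):                    # Constraint 4: one server per VNF
        return False
    pairs = np.einsum('mln,mls->mlns', x[:, :-1], x[:, 1:])
    if not np.array_equal(z, pairs):                      # Constraint 8: z = x_l * x_{l+1}
        return False
    if np.any(np.einsum('m,mln->n', R, x) > cap):         # Constraint 6: processing capacity
        return False
    if np.any(np.einsum('m,mlns->ns', R, z) > bw):        # Constraint 7: link bandwidth
        return False
    mig = np.einsum('mln,mls,ns->', x_prev, x, c_mig)     # Constraint 9: migration budget
    return mig <= C_mig_th
```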
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3;
Step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead.
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$; then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$. The total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead. In the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming.
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment horizon. With $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC.
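Under the same assumed tensor shapes as above, the per-slot overhead $C_t$ of step 4.2 can be evaluated directly. The split of the four terms and the placement of the trade-off coefficient $\eta$ follow the reconstruction given here, so this is a sketch rather than the patent's exact expression.

```python
import numpy as np

def slot_cost(x, y, z, R, P_on, p_serve, V_blocks, eta):
    """Per-slot SFC deployment overhead C_t (sketch of step 4.2)."""
    flow = np.einsum('m,mlns->', R, z)          # inter-server data-flow overhead
    turn_on = np.einsum('fn,fn->', P_on, y)     # power to turn service functions on
    serve = np.einsum('mln,mln->', p_serve, x)  # power to serve users
    tx = sum(np.trace(V @ V.conj().T).real
             for V in V_blocks)                 # RRH beamforming transmit power
    return flow + eta * (turn_on + serve + tx)
```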
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$. This is because the original action space additionally contains the continuous beamforming variables, whose dimension is too high; the action space is therefore processed to reduce its dimension.
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value.
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
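The reward of steps 5.3-5.4 can then be wrapped as follows; `solve_P1` stands in for the inner solver of step 7, and the penalty constant is an assumed value, both purely illustrative.

```python
INFEASIBLE_PENALTY = -1e3    # assumed magnitude for the negative penalty of step 5.3

def reward(state, action, solve_P1):
    """r(s_t, a_t) of steps 5.3-5.4: negative per-slot cost if the fixed
    deployment action admits a feasible inner solution, else a penalty."""
    cost, ok = solve_P1(state, action)   # assumed to return (C_t, feasibility flag)
    return -cost if ok else INFEASIBLE_PENALTY
```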
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation. A neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and $Q^{\pi}(s_t, a_t)$ is the action-value function.
6.2, to break the temporal correlation between samples, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return.
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool.
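A minimal PyTorch sketch of the Critic update of steps 6.2-6.3 follows; the batch layout, the differential (average-return) TD target, and the module signatures are assumptions of this illustration, with the learning rate $\alpha_c$ carried by the optimizer.

```python
import torch

def critic_update(Q, Q_target, optimizer, batch, avg_return):
    """One Critic step: replay batch + target network (steps 6.2-6.3)."""
    s, a, r, s2, a2 = batch                        # sampled from the replay pool
    with torch.no_grad():                          # target uses the frozen network w'
        td_target = r - avg_return + Q_target(s2, a2)
    loss = torch.mean((td_target - Q(s, a)) ** 2)  # Loss(w) of step 6.2
    optimizer.zero_grad()
    loss.backward()                                # gradient of Loss(w) w.r.t. w
    optimizer.step()                               # w update of step 6.3
    return loss.item()
```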
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$.
6.5, to avoid $J(\pi_\theta)$ becoming trapped in local optima when training along the standard gradient direction, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$.
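The natural-gradient update of step 6.5 can be sketched in numpy as below; estimating the Fisher matrix from sampled score vectors and adding a damping term are standard implementation choices assumed here, not prescribed by the patent.

```python
import numpy as np

def natural_gradient_step(theta, grad_J, scores, alpha_a=1e-2, damping=1e-3):
    """theta <- theta + alpha_a * F(theta)^{-1} grad_J (step 6.5).

    `scores` are sampled score vectors grad_theta log pi_theta(a|s); their
    outer-product average estimates the Fisher information matrix F(theta).
    """
    F = np.mean([np.outer(g, g) for g in scores], axis=0)
    F += damping * np.eye(F.shape[0])      # damping for numerical stability
    return theta + alpha_a * np.linalg.solve(F, grad_J)
```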
6.6, the algorithm flow is shown in fig. 2: the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and solving the reward function set in step 5.3 to obtain the asymptotically optimal solution of $P_1$.
Step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; however, the convex problem obtained after relaxation cannot guarantee that its optimum is a 0-1 integer solution, so the relaxed problem is not equivalent to the original problem, and an $L_p$ ($0 < p < 1$) norm penalty function is introduced to force the relaxed variables back to 0-1 integers. Collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$.
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate.
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (Successive Convex Approximation, SCA) technique is adopted to convert $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$.
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right).$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
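The penalty-plus-SCA loop of steps 7.1-7.4 can be summarized as follows. The smoothed penalty $P_\delta(y) = \sum_i \left( (y_i+\delta)^p + (1-y_i+\delta)^p \right)$, the oracle `solve_relaxed`, and the final rounding are assumptions of this sketch; the parameter update $\delta_{v+1} = \eta\delta_v$ follows step 7.2.

```python
import numpy as np

def lp_penalty_sca(solve_relaxed, y0, p=0.5, delta=1e-3, eta=2.0, iters=20):
    """Drive relaxed 0-1 variables to integers (sketch of steps 7.1-7.4).

    `solve_relaxed(w)` is an assumed oracle that solves the relaxed convex
    problem with the linearized penalty weights w added to its objective
    and returns the optimal y of the (v+1)-th SCA iteration.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        # gradient of the smoothed L_p penalty at the previous iterate y^v
        grad = p * ((y + delta) ** (p - 1) - (1 - y + delta) ** (p - 1))
        y = solve_relaxed(grad)    # convex subproblem of step 7.4
        delta *= eta               # penalty-parameter update of step 7.2
    return np.round(y)             # asymptotically a 0-1 integer solution (step 7.5)
```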

Claims (5)

1. A service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, characterized by comprising the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 2, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users; step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs; let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$; the signal $u_{m,t}$ received by user $m$ in time slot $t$ can then be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$; the beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix;
2.4, the zero-forcing beamforming of the RRHs eliminates wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices; the beamforming matrix of user $m$ can therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable;
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2; only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ can be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$, and the effective beamforming matrix $\Sigma_{m,t}$ can be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t};$$
with this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, namely:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$;
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget; step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$; $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link cannot exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$,
where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively;
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$;
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3; step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead;
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$;
then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$; the total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead; in the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming;
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment process; with $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC;
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and finally obtaining the optimal solution.
2. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers; the edge servers are interconnected through X2 links, and each edge server can provide several different virtual functions using virtual machine technology;
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users; each user is assumed to be served by exactly one service function chain (SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the baseband processing function vBBU.
3. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$;
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value;
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
4. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation; a neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and
$Q^{\pi}(s_t, a_t)$ is the action-value function;
6.2, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return;
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool;
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$;
6.5, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$;
6.6, the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
5. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; an $L_p$ ($0 < p < 1$) norm penalty function is then introduced to force the relaxed variables back to 0-1 integers; collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$;
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate;
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (SCA) technique converts $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$;
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right);$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
Application CN202211012894.7A, priority date 2022-08-23, filing date 2022-08-23: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission. Status: Active. Granted as CN115767562B.

Priority Applications (1)

Application number: CN202211012894.7A; priority date: 2022-08-23; filing date: 2022-08-23; title: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission

Publications (2)

CN115767562A, published 2023-03-07
CN115767562B, published 2024-06-21

Family

ID: 85349254

Family Applications (1)

CN202211012894.7A (granted, Active), filed 2022-08-23: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission

Country Status (1)

CN: CN115767562B

Families Citing this family (2)

* Cited by examiner, † Cited by third party

CN116599687B * (priority 2023-03-15, published 2023-11-24, Unit 61660 of the Chinese People's Liberation Army): Low-communication-delay cascade vulnerability scanning probe deployment method and system
CN117938669B * (priority 2024-03-25, published 2024-06-18, Guizhou University): Network function chain self-adaptive arrangement method for 6G general intelligent service

Patent Citations (2)

* Cited by examiner, † Cited by third party

WO2022109184A1 * (priority 2020-11-20, published 2022-05-27, Intel Corporation): Service function chaining policies for 5G systems
CN113573320A * (priority 2021-07-06, published 2021-10-29, Xi'an University of Technology): SFC deployment method based on improved actor-critic algorithm in edge network

Also Published As

CN115767562A, published 2023-03-07


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant