CN115767562B - Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission


Info

Publication number: CN115767562B
Authority: CN (China)
Prior art keywords: user, network, deployment, time slot, service function
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202211012894.7A
Other languages: Chinese (zh)
Other versions: CN115767562A
Inventors: Wang Kan (王侃), Yuan Peng (袁鹏), Zhou Hongfang (周红芳), Li Junhuai (李军怀), Wang Huaijun (王怀军)
Current and original assignee: Xi'an University of Technology (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Priority and filing date: 2022-08-23; application filed by Xi'an University of Technology
Publication of CN115767562A: 2023-03-07; grant and publication of CN115767562B: 2024-06-21

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 — Reducing energy consumption in communication networks
    • Y02D 30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a service function chain (SFC) deployment method based on reinforcement learning joint coordinated multi-point (CoMP) transmission. The method first describes an edge network model and the channel characteristics of servers and users in the edge network, and uses beamforming to eliminate communication interference between multiple servers and users; it then establishes a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget; models the long-term optimization problem; decouples the long-term problem into a slot-by-slot optimization problem; and finally formulates a sub-optimization problem for solving the reward function, reducing the complexity of searching the action space. The invention eliminates wireless link interference between users with CoMP-based zero-forcing beamforming, and then uses an Actor-Critic algorithm based on the natural gradient method to decouple the long-term dynamic SFC deployment problem into a slot-by-slot optimization problem that is solved online.

Description

Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission
Technical Field
The invention belongs to the technical field of communications, and particularly relates to a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission.
Background
Under the software-defined networking (SDN) architecture, 6G communication systems are expected to improve network quality of service through the emerging network function virtualization (NFV) technology, which allows service functions to be deployed directly on commodity servers using virtual machine or container technologies instead of proprietary hardware. This lets a 6G operator flexibly scale the network on demand simply by adding commodity servers. With NFV, the logical sequence of service functions that a data packet traverses can be cascaded into an SFC, and flexible customization and deployment of diverse services is realized through the orchestration of virtualized network functions (VNFs).
Currently, much work has studied SFC deployment for traditional network architectures such as content distribution networks and centralized cloud computing networks. Unlike these architectures, deploying SFCs in a 6G edge network places users and VNFs close to each other, so services are obtained nearby and the quality of computation-intensive and delay-sensitive services improves; meanwhile, with NFV technology, edge servers can provide richer VNF orchestration combinations and thus more complex and flexible service types.
At present, research on SFC deployment in edge networks mainly targets single-slot static networks, ignoring both the interference characteristics of wireless channels in the edge network and the dynamics of cache, computing, and communication resources in edge servers. The invention therefore proposes an SFC deployment method for 6G wireless edge networks based on online Actor-Critic learning combined with CoMP beamforming, which uses CoMP-based zero-forcing beamforming to eliminate wireless link interference among multiple SFCs.
Disclosure of Invention
The invention aims to provide a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, which uses CoMP-based zero-forcing beamforming to eliminate wireless link interference among users, and then uses an Actor-Critic algorithm based on the natural gradient method to decouple the long-term dynamic SFC deployment problem into a slot-by-slot optimization problem that is solved online.
The technical scheme adopted by the invention is a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, implemented according to the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 2, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users;
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget;
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3;
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and finally obtaining the optimal solution.
The present invention is also characterized in that,
Step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers; the edge servers are interconnected through X2 links, and each edge server can provide several different virtual functions using virtual machine technology;
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users; each user is assumed to be served by exactly one service function chain (SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the baseband processing function vBBU.
Step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs; let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$; the signal $u_{m,t}$ received by user $m$ in time slot $t$ can then be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$; the beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix;
2.4, the zero-forcing beamforming of the RRHs eliminates wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices; the beamforming matrix of user $m$ can therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable;
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2; only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ can be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$, and the effective beamforming matrix $\Sigma_{m,t}$ can be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t};$$
with this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, namely:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$.
Step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$; $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link cannot exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$, where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively;
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$.
Step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead;
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$; then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$; the total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead; in the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming;
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment horizon; with $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC.
Step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$;
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value;
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
Step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation; a neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and $Q^{\pi}(s_t, a_t)$ is the action-value function;
6.2, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return;
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool;
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$;
6.5, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$;
6.6, the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
Step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; an $L_p$ ($0 < p < 1$) norm penalty function is then introduced to force the relaxed variables back to 0-1 integers; collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$;
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate;
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (SCA) technique converts $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$;
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right);$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
The beneficial effect of the service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission is that long-term dynamic deployment of SFCs can be completed without interference while guaranteeing the quality of service of users, further reducing the operating cost of the edge servers and the wireless transmission power cost of the RRHs during deployment.
Drawings
Fig. 1 is a schematic diagram of a system model of SFC deployment in combination with CoMP beamforming in a wireless edge network according to the present invention;
fig. 2 is a schematic flow chart of an algorithm for SFC online deployment based on Actor-Critic learning in a wireless edge network according to the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
Referring to figs. 1-2, fig. 1 is a schematic diagram of the system model of SFC deployment with joint CoMP beamforming in a wireless edge network, showing two SFC instances, an edge network, the mapping between multiple service functions and edge servers, and two CoMP beams for two different users; fig. 2 is the flowchart of the Actor-Critic algorithm used to solve the SFC online deployment problem in the wireless edge network. The embodiment below describes the SFC online deployment method based on the Actor-Critic algorithm in detail.
The invention discloses a service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, which is implemented according to the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (Remote Radio Head, RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers. The edge servers are interconnected through X2 links, and each edge server may provide several different virtual functions (e.g., caching, computing, and firewall) using virtual machine technology.
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users. Each user is assumed to be served by exactly one service function chain (Service Function Chaining, SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the virtualized baseband unit (vBBU).
Step 2, on the basis of step 1, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users;
Step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs. Let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$. The signal $u_{m,t}$ received by user $m$ in time slot $t$ may be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$. The beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
2.4, the zero-forcing beamforming of the RRHs is used to eliminate wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices. The beamforming matrix of user $m$ may therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable.
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2. Only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ may be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$. The effective beamforming matrix $\Sigma_{m,t}$ may be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t}.$$
With this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, i.e.:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget;
Step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$. The indicators $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link must not exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$, where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively.
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$
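A compact way to read Constraints 4 and 6-9 is as elementwise checks on binary placement tensors. The sketch below assumes shapes x[m, l, n], z[m, l, n, s], per-server capacity cap[n], bandwidth bw[n, s], and migration cost c_mig[n, s]; all of these names, and the per-server (rather than per-function) capacity, are simplifying assumptions of this illustration.

```python
import numpy as np

def feasible(x, z, R, cap, bw, c_mig, x_prev, C_mig_th):
    """Schematic check of Constraints 4 and 6-9 of step 3."""
    if not np.all(x.sum(axis=2) == 1):                    # Constraint 4: one server per VNF
        return False
    pairs = np.einsum('mln,mls->mlns', x[:, :-1], x[:, 1:])
    if not np.array_equal(z, pairs):                      # Constraint 8: z = x_l * x_{l+1}
        return False
    if np.any(np.einsum('m,mln->n', R, x) > cap):         # Constraint 6: processing capacity
        return False
    if np.any(np.einsum('m,mlns->ns', R, z) > bw):        # Constraint 7: link bandwidth
        return False
    mig = np.einsum('mln,mls,ns->', x_prev, x, c_mig)     # Constraint 9: migration budget
    return mig <= C_mig_th
```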
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3;
Step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead.
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$; then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$. The total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead. In the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming.
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment horizon. With $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC.
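Under the same assumed tensor shapes as above, the per-slot overhead $C_t$ of step 4.2 can be evaluated directly. The split of the four terms and the placement of the trade-off coefficient $\eta$ follow the reconstruction given here, so this is a sketch rather than the patent's exact expression.

```python
import numpy as np

def slot_cost(x, y, z, R, P_on, p_serve, V_blocks, eta):
    """Per-slot SFC deployment overhead C_t (sketch of step 4.2)."""
    flow = np.einsum('m,mlns->', R, z)          # inter-server data-flow overhead
    turn_on = np.einsum('fn,fn->', P_on, y)     # power to turn service functions on
    serve = np.einsum('mln,mln->', p_serve, x)  # power to serve users
    tx = sum(np.trace(V @ V.conj().T).real
             for V in V_blocks)                 # RRH beamforming transmit power
    return flow + eta * (turn_on + serve + tx)
```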
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$. This is because the original action space additionally contains the continuous beamforming variables, whose dimension is too high; the action space is therefore processed to reduce its dimension.
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value.
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
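The reward of steps 5.3-5.4 can then be wrapped as follows; `solve_P1` stands in for the inner solver of step 7, and the penalty constant is an assumed value, both purely illustrative.

```python
INFEASIBLE_PENALTY = -1e3    # assumed magnitude for the negative penalty of step 5.3

def reward(state, action, solve_P1):
    """r(s_t, a_t) of steps 5.3-5.4: negative per-slot cost if the fixed
    deployment action admits a feasible inner solution, else a penalty."""
    cost, ok = solve_P1(state, action)   # assumed to return (C_t, feasibility flag)
    return -cost if ok else INFEASIBLE_PENALTY
```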
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation. A neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and $Q^{\pi}(s_t, a_t)$ is the action-value function.
6.2, to break the temporal correlation between samples, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return.
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool.
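A minimal PyTorch sketch of the Critic update of steps 6.2-6.3 follows; the batch layout, the differential (average-return) TD target, and the module signatures are assumptions of this illustration, with the learning rate $\alpha_c$ carried by the optimizer.

```python
import torch

def critic_update(Q, Q_target, optimizer, batch, avg_return):
    """One Critic step: replay batch + target network (steps 6.2-6.3)."""
    s, a, r, s2, a2 = batch                        # sampled from the replay pool
    with torch.no_grad():                          # target uses the frozen network w'
        td_target = r - avg_return + Q_target(s2, a2)
    loss = torch.mean((td_target - Q(s, a)) ** 2)  # Loss(w) of step 6.2
    optimizer.zero_grad()
    loss.backward()                                # gradient of Loss(w) w.r.t. w
    optimizer.step()                               # w update of step 6.3
    return loss.item()
```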
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$.
6.5, to avoid $J(\pi_\theta)$ becoming trapped in local optima when training along the standard gradient direction, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$.
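The natural-gradient update of step 6.5 can be sketched in numpy as below; estimating the Fisher matrix from sampled score vectors and adding a damping term are standard implementation choices assumed here, not prescribed by the patent.

```python
import numpy as np

def natural_gradient_step(theta, grad_J, scores, alpha_a=1e-2, damping=1e-3):
    """theta <- theta + alpha_a * F(theta)^{-1} grad_J (step 6.5).

    `scores` are sampled score vectors grad_theta log pi_theta(a|s); their
    outer-product average estimates the Fisher information matrix F(theta).
    """
    F = np.mean([np.outer(g, g) for g in scores], axis=0)
    F += damping * np.eye(F.shape[0])      # damping for numerical stability
    return theta + alpha_a * np.linalg.solve(F, grad_J)
```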
6.6, the algorithm flow is shown in fig. 2: the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and solving the reward function set in step 5.3 to obtain the asymptotically optimal solution of $P_1$.
Step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; however, the convex problem obtained after relaxation cannot guarantee that its optimum is a 0-1 integer solution, so the relaxed problem is not equivalent to the original problem, and an $L_p$ ($0 < p < 1$) norm penalty function is introduced to force the relaxed variables back to 0-1 integers. Collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$.
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate.
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (Successive Convex Approximation, SCA) technique is adopted to convert $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$.
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right).$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
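The penalty-plus-SCA loop of steps 7.1-7.4 can be summarized as follows. The smoothed penalty $P_\delta(y) = \sum_i \left( (y_i+\delta)^p + (1-y_i+\delta)^p \right)$, the oracle `solve_relaxed`, and the final rounding are assumptions of this sketch; the parameter update $\delta_{v+1} = \eta\delta_v$ follows step 7.2.

```python
import numpy as np

def lp_penalty_sca(solve_relaxed, y0, p=0.5, delta=1e-3, eta=2.0, iters=20):
    """Drive relaxed 0-1 variables to integers (sketch of steps 7.1-7.4).

    `solve_relaxed(w)` is an assumed oracle that solves the relaxed convex
    problem with the linearized penalty weights w added to its objective
    and returns the optimal y of the (v+1)-th SCA iteration.
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        # gradient of the smoothed L_p penalty at the previous iterate y^v
        grad = p * ((y + delta) ** (p - 1) - (1 - y + delta) ** (p - 1))
        y = solve_relaxed(grad)    # convex subproblem of step 7.4
        delta *= eta               # penalty-parameter update of step 7.2
    return np.round(y)             # asymptotically a 0-1 integer solution (step 7.5)
```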

Claims (5)

1. A service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission, characterized by comprising the following steps:
Step 1, describing the edge network model, including the characteristics of the edge servers, network virtual functions, users, and service function chains;
Step 2, describing the channel characteristics of servers and users in the edge network, and using beamforming to eliminate communication interference between multiple servers and users; step 2 is specifically implemented according to the following steps:
2.1, Rayleigh fading and path loss exist between user $m$ and the RRHs; let $H_{m,n,t} \in \mathbb{C}^{L_n \times L_m}$ denote the channel matrix between user $m$ and RRH $n$, where $L_n$ is the number of transmit antennas of RRH $n$ and $L_m$ the number of receive antennas of user $m$; the signal $u_{m,t}$ received by user $m$ in time slot $t$ can then be expressed as
$$u_{m,t} = H_{m,t}^{H} V_{m,t} s_{m,t} + H_{m,t}^{H} \sum_{m' \neq m} V_{m',t} s_{m',t} + n_{m,t},$$
where $H_{m,t} = [H_{m,1,t}^{T}, \ldots, H_{m,N,t}^{T}]^{T} \in \mathbb{C}^{L \times L_m}$ is the channel matrix between user $m$ and all RRHs in time slot $t$, $(\cdot)^{H}$ denotes the conjugate transpose, and $L = \sum_{n=1}^{N} L_n$ is the total number of antennas of all RRHs; $V_{m,t} \in \mathbb{C}^{L \times d_m}$ is the beamforming matrix of all RRHs toward user $m$, with $d_m$ the number of data streams received by user $m$; $I$ denotes the identity matrix; $s_{m,t}$ is drawn from a Gaussian random codebook with zero mean and covariance $I_{d_m}$; and $n_{m,t}$ is white Gaussian noise with covariance $\sigma^{2} I_{L_m}$;
2.2, by successively encoding the received signals $u_{m,t}$ of step 2.1 against the Gaussian random codebook, the second (interference) term of the above formula can be removed, and the received data rate $R_{m,t}$ of user $m$ in time slot $t$ can be expressed as
$$R_{m,t} = \log_2 \left| I_{L_m} + \frac{1}{\sigma^{2}} H_{m,t}^{H} V_{m,t} V_{m,t}^{H} H_{m,t} \right|;$$
2.3, let $p_{f,n}^{m}$ and $P_{m,n}$ denote, respectively, the service-function processing power and the wireless transmission power that edge server $n$ provides to user $m$, and let $a_{m,n,t} \in \{0,1\}$ indicate whether user $m$ uses the vBBU VNF instance of edge server $n$; the beamforming matrices of all RRHs toward user $m$ should then satisfy
$$\mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \le a_{m,n,t} P_{m,n}, \quad \forall n \in \mathcal{N},$$
where $V_{m,n,t}$ is the block of $V_{m,t}$ associated with RRH $n$ and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix;
2.4, the zero-forcing beamforming of the RRHs eliminates wireless interference between SFCs: the channel matrices of all users are stacked and QR-decomposed as
$$H_t = \left[ H_{1,t}, H_{2,t}, \ldots, H_{M,t} \right] = Q_t R_t,$$
where the columns of $Q_t$ form a set of orthogonal bases and $R_t$ is a full-rank upper triangular matrix whose remaining upper-triangular blocks may be arbitrary non-zero matrices; the beamforming matrix of user $m$ can therefore be expressed as $V_{m,t} = Q_{m,t} W_{m,t}$, where $Q_{m,t}$ consists of the columns of $Q_t$ assigned to user $m$ and $W_{m,t}$ is the design variable;
2.5, to eliminate interference, the zero-forcing conditions $H_{m,t}^{H} Q_{m',t} = 0$ must hold for every interfering user $m'$ not handled by the successive encoding of step 2.2; only the first $L_m$ rows associated with user $m$ affect the received data rate, so $H_{m,t}$ can be simplified to $H_{m,t} = Q_{m,t} R_{m,m,t}$, with $R_{m,m,t}$ the corresponding diagonal block of $R_t$, and the effective beamforming matrix $\Sigma_{m,t}$ can be defined as
$$\Sigma_{m,t} = \frac{1}{\sigma^{2}} R_{m,m,t}^{H} W_{m,t} W_{m,t}^{H} R_{m,m,t};$$
with this substitution, the constraint established in step 2.3 is equivalent to:
Constraint 1: the per-RRH transmit power limit of step 2.3, rewritten in terms of $\Sigma_{m,t}$;
Constraint 2: $\Sigma_{m,t} \succeq 0$;
2.6, for correct data decoding, the received data rate $R_{m,t}$ of step 2.2 must be no smaller than the data rate threshold $R_{m,th}$, namely:
Constraint 3: $R_{m,t} = \log_2 \left| I_{L_m} + \Sigma_{m,t} \right| \ge R_{m,th}$;
Step 3, establishing a mathematical model under limits on the number of VNF instantiations per server, server processing capacity, physical link bandwidth, VNF routing, and the VNF migration budget; step 3 is specifically implemented according to the following steps:
3.1, let $\mathcal{N}_f$ denote the set of edge servers capable of providing service function $f$; each service function is assumed to be deployed on exactly one edge server, namely:
Constraint 4: $\sum_{n \in \mathcal{N}_f} x_{l,n,t}^{m} = 1$, where $x_{l,n,t}^{m} \in \{0,1\}$ indicates whether service function $f_l^{m}$ is deployed on edge server $n$ and $y_{f,n,t} \in \{0,1\}$ indicates whether service function $f$ is provided by edge server $n$ in time slot $t$; $x_{l,n,t}^{m}$ and $y_{f,n,t}$ satisfy:
Constraint 5: $x_{l,n,t}^{m} \le y_{f,n,t}$ for $f = f_l^{m}$;
3.2, the total data rate of the service flows handled by a VNF instance cannot exceed the processing capacity $C_{f,n}$ of that instance, namely:
Constraint 6: $\sum_{m \in \mathcal{M}} R_{m,t}\, x_{l,n,t}^{m} \le C_{f,n}$;
3.3, the total data rate of the service flows traversing a link cannot exceed its bandwidth $B_{n,s}$, namely:
Constraint 7: $\sum_{m \in \mathcal{M}} \sum_{l} R_{m,t}\, z_{l,n,s,t}^{m} \le B_{n,s}$,
where $z_{l,n,s,t}^{m}$ indicates whether $f_l^{m}$ and $f_{l+1}^{m}$ are deployed on edge servers $n$ and $s$, respectively;
3.4, only when $x_{l,n,t}^{m}$ and $x_{l+1,s,t}^{m}$ are both 1 in time slot $t$ can $z_{l,n,s,t}^{m}$ take the value 1; the relationship between them can be described as:
Constraint 8: $z_{l,n,s,t}^{m} = x_{l,n,t}^{m} \cdot x_{l+1,s,t}^{m}$;
3.5, let $c_{n,s}$ denote the cost of migrating a service between edge servers $n$ and $s$; the total service migration cost of the system cannot exceed the migration threshold $C_{mig,th}$, namely:
Constraint 9: $\sum_{m}\sum_{l}\sum_{n}\sum_{s} c_{n,s}\, x_{l,n,t-1}^{m}\, x_{l,s,t}^{m} \le C_{mig,th}$;
Step 4, modeling the long-term optimization problem under the resource constraints established in steps 1-3; step 4 is specifically implemented according to the following steps:
4.1, the total system overhead is defined to comprise the data-flow overhead and the power consumption overhead;
4.2, first, the RRH wireless transmission power consumption is written as $\sum_{m}\sum_{n} \mathrm{Tr}( V_{m,n,t} V_{m,n,t}^{H} )$;
then $P_{f,n}$ is defined as the energy consumed by edge server $n$ to turn on service function $f$, and $p_{f,n}^{m}$ as the energy consumed by edge server $n$ to maintain service function $f_l^{m}$; the total overhead of deploying the SFCs in time slot $t$ is
$$C_t = \sum_{m}\sum_{l}\sum_{n}\sum_{s} R_{m,t}\, z_{l,n,s,t}^{m} + \eta \Big( \sum_{f}\sum_{n} P_{f,n}\, y_{f,n,t} + \sum_{m}\sum_{l}\sum_{n} p_{f,n}^{m}\, x_{l,n,t}^{m} + \sum_{m}\sum_{n} \mathrm{Tr}\left( V_{m,n,t} V_{m,n,t}^{H} \right) \Big),$$
where $\eta$ is a trade-off coefficient between the data-flow overhead and the power consumption overhead; in the above formula, the first term is the data-flow overhead between edge servers, the second the power overhead of turning service functions on, the third the power overhead of providing service functions to users, and the fourth the wireless transmission power consumed by RRH beamforming;
4.3, step 4.2 establishes the overhead of deploying the SFCs in a single time slot $t$; on this basis, the long-term dynamic SFC deployment overhead is defined as the average system overhead per slot over the whole deployment process; with $T$ denoting the total number of time slots, the minimization of the long-term dynamic SFC deployment overhead is denoted $P_0$:
$$P_0: \quad \min \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} C_t,$$
where the variables $x_{l,n,t}^{m}$, $y_{f,n,t}$, and $\Sigma_{m,t}$ in $C_t$ are subject to Constraints 1-9 established in steps 2 and 3; solving $P_0$ yields the concrete deployment result of each time slot's SFC;
Step 5, constructing a Markov decision process (MDP) model and decoupling the long-term optimization problem into a slot-by-slot optimization problem;
Step 6, learning the optimal SFC deployment policy online, slot by slot, with an Actor-Critic reinforcement learning algorithm based on the natural gradient;
Step 7, establishing a sub-optimization problem for solving the reward function when searching the action space, reducing the search complexity of the action space, and finally obtaining the optimal solution.
2. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 1 is specifically implemented according to the following steps:
Step 1.1, in the edge network each edge server is connected to a remote radio head (RRH); the index $n \in \mathcal{N} = \{1, 2, \ldots, N\}$ simultaneously denotes the $n$-th edge server and its RRH, where $\mathcal{N}$ is the set of servers in the edge network and $N$ the total number of servers; the edge servers are interconnected through X2 links, and each edge server can provide several different virtual functions using virtual machine technology;
Step 1.2, let $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ denote the $m$-th user in the edge network, where $\mathcal{M}$ is the set of users and $M$ the total number of users; each user is assumed to be served by exactly one service function chain (SFC), defined as
$$\mathcal{F}_m = \left( f_1^{m}, \ldots, f_l^{m}, \ldots, f_{|\mathcal{F}_m|}^{m} \right),$$
where $f_1^{m}$ is the first service function of user $m$'s SFC, $f_l^{m}$ the $l$-th service function, and $f_{|\mathcal{F}_m|}^{m}$ the last service function, designated as the baseband processing function vBBU.
3. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 5 is specifically implemented according to the following steps:
5.1, establish the MDP four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where the state space $\mathcal{S}$ has four elements: the wireless channel matrices between the users and the RRHs, the processing capacities of the VNF instances, the link bandwidths between edge servers, and the SFC deployment result of the previous time slot, namely
$$s_t = \left\{ H_{m,t},\; C_{f,n,t},\; B_{n,s,t},\; x_{l,n,t-1}^{m} \right\};$$
5.2, define the action as the SFC deployment decision of the current slot, $a_t = \{ x_{l,n,t}^{m} \}$;
5.3, define $r(s_t, a_t)$ as the reward function corresponding to $(s_t, a_t)$; if the action $a_t$ taken admits no feasible solution, the reward function is set to a negative penalty value;
5.4, given the action $a_t$, solve the maximum of the reward function $r(s_t, a_t)$; the maximum-reward problem is denoted $P_1$, namely
$$P_1: \quad \max \; r(s_t, a_t) \quad \text{s.t. Constraints 1-9},$$
where the deployment variables fixed by $a_t$ are treated as given parameters; in this way $P_0$ is converted into $P_1$ so as to solve for the concrete deployment result of each time slot's SFC.
4. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 6 is specifically implemented according to the following steps:
6.1, an Actor neural network outputs the deployment policy, and a Critic neural network evaluates each policy through Q-value approximation; a neural network $w$ approximates the action-value function, i.e. $Q_w(s_t, a_t) \approx Q^{\pi}(s_t, a_t)$, where $Q_w(s_t, a_t)$ represents the expected return accumulated over subsequent states after taking action $a_t$ in state $s_t$ and
$Q^{\pi}(s_t, a_t)$ is the action-value function;
6.2, experience replay and target-network techniques are adopted to improve training stability; the loss function of the Critic network can be defined as
$$\mathrm{Loss}(w) = \mathbb{E}_{\mathcal{B}} \left[ \left( r_t - \hat{J} + Q_{w'}(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) \right)^{2} \right],$$
where $\mathbb{E}[\cdot]$ is the expectation operator, $\mathcal{B}$ is the experience replay pool, $w'$ is the target-network model in time slot $t$, and $\hat{J}$ is an estimate of the expected average return;
6.3, taking the gradient of $\mathrm{Loss}(w)$ with respect to $w$, $w$ is updated as
$$w \leftarrow w - \alpha_c \frac{1}{I} \sum_{i=1}^{I} \nabla_w \mathrm{Loss}_i(w),$$
where $\alpha_c$ is the learning rate of the Critic network and $I$ is the number of samples drawn from the experience replay pool;
6.4, based on the parameterized policy $\pi_\theta$, the expected average return is defined as
$$J(\pi_\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),$$
where $d^{\pi_\theta}(s)$ is the steady-state distribution of state $s$;
6.5, the Actor network is trained with the natural gradient method, and the update of the network model $\theta$ becomes
$$\theta \leftarrow \theta + \alpha_a F(\theta)^{-1} \nabla_\theta J(\pi_\theta),$$
where $\alpha_a$ is the learning rate of the Actor network, $F(\theta)$ is the Fisher information matrix, and $\nabla_\theta J(\pi_\theta)$ is the gradient of $J(\pi_\theta)$ with respect to $\theta$;
6.6, the Actor and Critic networks are integrated so that training of the neural networks proceeds along the natural gradient direction, driving the neural network model toward the global optimum.
5. The service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission according to claim 1, characterized in that step 7 is specifically implemented according to the following steps:
7.1, relax the 0-1 variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ to convert $P_1$ into a convex problem; an $L_p$ ($0 < p < 1$) norm penalty function is then introduced to force the relaxed variables back to 0-1 integers; collecting the relaxed variables into the vector $y$, the asymptotically optimal sub-problem $P_{1-S}$ of $P_1$ is obtained as
$$P_{1-S}: \quad \max_{y} \; r(s_t, a_t) - \sigma P_{\delta}(y),$$
where $\sigma$ is the penalty parameter, $\delta$ is an arbitrarily small positive number, $P_{\delta}(y)$ is the smoothed $L_p$ penalty term, and the variables $x_{l,n,t}^{m}$ and $z_{l,n,s,t}^{m}$ satisfy Constraints 1-9 with the 0-1 restriction relaxed to the interval $[0,1]$;
7.2, the penalty parameter is updated iteratively as $\delta_{v+1} = \eta \delta_v$ ($\eta > 1$), so that the penalty term $P_{\delta}(y)$ converges to 0 at a linear rate;
7.3, because the penalty term in $P_{1-S}$ is non-convex, $P_{1-S}$ is difficult to solve directly; the successive convex approximation (SCA) technique converts $P_{1-S}$ into a convex problem by a first-order Taylor expansion of the penalty term, i.e.
$$P_{\delta}(y) \approx P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}),$$
where $y^{v}$ is the optimal solution of the previous SCA iteration and $\nabla_y P_{\delta}(y^{v})$ is the gradient of $P_{\delta}(y)$ at $y^{v}$;
7.4, in the $(v+1)$-th SCA iteration, $P_{1-S}$ finally becomes the convex problem
$$P_{1-S}^{v+1}: \quad \max_{y} \; r(s_t, a_t) - \sigma \left( P_{\delta}(y^{v}) + \nabla_y P_{\delta}(y^{v})^{T} (y - y^{v}) \right);$$
7.5, following the above steps, $P_{1-S}$ is solved as the asymptotically optimal approximation of $P_1$; solving $P_{1-S}$ gives the asymptotically optimal solution of $P_1$ and hence the maximum of the reward function, and the deployment result of each time slot's SFC is finally obtained according to the maximum reward.
Application CN202211012894.7A, priority date 2022-08-23, filing date 2022-08-23: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission. Status: Active. Granted as CN115767562B.

Priority Applications (1)

Application number: CN202211012894.7A; priority date: 2022-08-23; filing date: 2022-08-23; title: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission

Publications (2)

CN115767562A, published 2023-03-07
CN115767562B, published 2024-06-21

Family

ID: 85349254

Family Applications (1)

CN202211012894.7A (granted, Active), filed 2022-08-23: Service function chain deployment method based on reinforcement learning joint coordinated multi-point transmission

Country Status (1)

CN: CN115767562B

Families Citing this family (2)

* Cited by examiner, † Cited by third party

CN116599687B * (priority 2023-03-15, published 2023-11-24, Unit 61660 of the Chinese People's Liberation Army): Low-communication-delay cascade vulnerability scanning probe deployment method and system
CN117938669B * (priority 2024-03-25, published 2024-06-18, Guizhou University): Network function chain self-adaptive arrangement method for 6G general intelligent service

Patent Citations (2)

* Cited by examiner, † Cited by third party

WO2022109184A1 * (priority 2020-11-20, published 2022-05-27, Intel Corporation): Service function chaining policies for 5G systems
CN113573320A * (priority 2021-07-06, published 2021-10-29, Xi'an University of Technology): SFC deployment method based on improved actor-critic algorithm in edge network

Also Published As

CN115767562A, published 2023-03-07


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant