CN114615744A - Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method - Google Patents

Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Info

Publication number
CN114615744A
CN114615744A (application CN202210185185.2A)
Authority
CN
China
Prior art keywords
base station
network
time
resource
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210185185.2A
Other languages
Chinese (zh)
Inventor
赵楠
任凡
杜威
陈金莲
陈哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202210185185.2A priority Critical patent/CN114615744A/en
Publication of CN114615744A publication Critical patent/CN114615744A/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/53 Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/70 Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 Local resource management
    • H04W 72/50 Allocation or scheduling criteria for wireless resources
    • H04W 72/54 Allocation or scheduling criteria for wireless resources based on quality criteria

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources. According to the differentiated service requirements of mobile edge network slice users, and considering constraints such as the network slice communication-sensing-computing resources, user equipment delay and energy consumption, a collaborative optimization problem model is established with maximization of the total throughput of the user equipment as the optimization target. On this basis, the optimization problem is modeled as a multi-agent stochastic game, a knowledge-transfer reinforcement learning algorithm among the multiple agents is studied, the exploration efficiency and scalability of the collaborative optimization strategy are improved, and collaborative optimization of network slice communication-sensing-computing resources is realized under diversified service scenarios.

Description

Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources.
Background
With the rapid development of 5G mobile communication and the Internet of Things, massive IoT devices and ubiquitous connectivity requirements are driving mobile communication toward intelligent evolution. As a new network paradigm, the mobile edge network can significantly reduce transmission energy consumption, network congestion and processing delay by migrating communication and computing resources to the network edge, and promotes the deep fusion of communication, sensing and computing.
In a mobile edge network, network slicing can meet diversified low-delay and high-reliability requirements by sharing communication, sensing, computing and infrastructure resources. Existing network slice resource allocation methods have been studied mainly from two aspects. The first is the type of resource being optimized: compared with single-resource optimization strategies, multi-dimensional resource collaborative optimization is more complex, mainly adopts a centralized mode, and incurs more communication and control overhead. The second is the resource allocation mode: compared with static allocation, dynamic allocation strategies change with the network environment and can achieve real-time dynamic optimization of resources. In an actual mobile edge network, slice performance is often related to multiple types of resources whose complex dependencies are difficult to capture with an exact mathematical model; high-dimensional dynamic network states, such as the randomness of wireless channel states and the time-varying traffic of slice users, also restrict improvement of the network slicing quality of service.
Reinforcement learning, as a model-free method with powerful decision-making capability in high-dimensional spaces, is regarded as one of the promising solutions to the above problems. However, existing work pays little attention to the safe and efficient optimization of multi-dimensional resources under communication-sensing-computing fusion.
Disclosure of Invention
In order to overcome the problem that high-dimensional dynamic networks restrict the quality of service of network slices, the invention aims to provide a safe and efficient optimization method for multi-dimensional network slice resources under communication-sensing-computing fusion.
In order to achieve the above purpose, the invention adopts the following technical scheme: a knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
Step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model;
Step 1, constructing the network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
Assume that M base stations share K resource blocks and F computing resources and can support access by N edge sensing devices. At time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device: if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j. Since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices.
Likewise, a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t: if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j. Since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices.
At time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs. Therefore, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
The objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively. In the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations.
Step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs.
Step 2.1, modeling the multi-agent stochastic game: the optimization problem is modeled as a multi-agent stochastic game in which each base station acts as an agent.
The state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t.
The action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f.
The reward function of each base station is defined as follows: the reward r_i^t of the ith base station at time t reflects the sum of the throughputs of all EDs served by the ith base station.
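As an illustration of this stochastic-game formulation, the Python sketch below shows one possible way to hold each base-station agent's state, action and reward; the container and field names are hypothetical and only mirror the quantities listed above.

    from dataclasses import dataclass
    from typing import Dict, List

    SLICES = ("e", "u", "m")  # eMBB, mMTC, uRLLC

    @dataclass
    class BSState:
        # sensing-computing task demands observed per slice and edge device
        tasks: Dict[str, List[float]]          # slice type -> per-ED task sizes

    @dataclass
    class BSAction:
        num_resource_blocks: Dict[str, int]    # per-slice resource blocks owned
        num_compute_units: Dict[str, int]      # per-slice computing resources Y_i^l(t)
        num_edge_devices: Dict[str, int]       # per-slice admitted EDs
        rb_allocation: Dict[int, int]          # resource block k -> ED j (-1 if unused)
        cpu_allocation: Dict[int, int]         # computing resource f -> ED j (-1 if unused)

    def reward(throughputs: List[float]) -> float:
        # r_i^t: sum of the throughputs of all EDs served by base station i
        return sum(throughputs)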
Step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1. At time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state.
The student mode is based on a deep deterministic policy gradient model. The Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode.
A multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M). With the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n. The final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding. The student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time.
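Since the attention formulas are not reproduced here, the sketch below illustrates a standard scaled dot-product attention over the teacher histories, with P1, P2 and P3 assumed to act as query, key and value projections and Ps as the advice decoder; these roles are an assumption for illustration, not a statement of the filed design.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_advice(h_student, teacher_histories, teacher_params, P1, P2, P3, Ps):
        """Aggregate the teachers' policy parameters into one advice vector.

        h_student:         (D,)     hidden state of the student base station
        teacher_histories: (M-1, D) encoded histories of the teacher base stations
        teacher_params:    (M-1, P) policy-network parameters theta_n of the teachers
        P1, P2, P3, Ps:    learnable projections (roles assumed: query/key/value/decode)
        """
        D = teacher_histories.shape[1]
        query = P1 @ h_student                          # (d,)
        keys = teacher_histories @ P2.T                 # (M-1, d)
        weights = softmax(keys @ query / np.sqrt(D))    # attention weights over teachers
        values = teacher_params @ P3.T                  # (M-1, d)
        return Ps @ (weights @ values)                  # linearly decoded weighted sum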
the gain in base station learning performance from student mode is defined herein as a student reward
Figure BDA0003522981460000055
The student Actor-criticic network is trained using a trained attention selection model. By minimizing the student's loss function
Figure BDA0003522981460000056
Updating student Critic network parameters of base station i
Figure BDA0003522981460000057
Figure BDA0003522981460000058
Wherein the content of the first and second substances,
Figure BDA0003522981460000059
is given by the parameter
Figure BDA00035229814600000510
The student goal Critic network of (1),
Figure BDA00035229814600000511
and
Figure BDA00035229814600000512
respectively representing the hidden state at the current time t and the hidden state at the next time t,
Figure BDA00035229814600000513
and
Figure BDA00035229814600000514
respectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,
Figure BDA00035229814600000515
and
Figure BDA00035229814600000516
representing the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,
Figure BDA00035229814600000517
γ is the discount factor for the student reward function.
The student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters.
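As a sketch of this student-mode update, the PyTorch-style code below implements a generic DDPG-like Critic loss against a target Critic together with a deterministic policy-gradient Actor step over hidden states; the interfaces, names and hyperparameters are assumptions for illustration, not the filed implementation.

    import torch
    import torch.nn as nn

    def student_update(actor, critic, target_critic, batch, gamma, actor_opt, critic_opt):
        """One DDPG-style update of the student Actor-Critic of one base station.

        batch: dict of tensors 'h' (hidden states), 'a' (student actions),
               'r' (student rewards), 'h_next' (next hidden states).
        """
        h, a, r, h_next = batch["h"], batch["a"], batch["r"], batch["h_next"]

        # Critic: minimise the squared TD error against the student target Critic.
        with torch.no_grad():
            y = r + gamma * target_critic(h_next, actor(h_next))
        critic_loss = nn.functional.mse_loss(critic(h, a), y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: deterministic policy gradient, ascend the Critic's value.
        actor_loss = -critic(h, actor(h)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()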
in the self-learning mode, if the student Actor of the base station i selects the self-learning mode, the student Actor will hide the base station i
Figure BDA00035229814600000520
And sending the data to a self-learning network module, wherein each base station independently performs resource optimization action decision by adopting a Deep Q Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network. The current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network. After a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network.
The weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor.
Meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy. In its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B. During learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network. Continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum. In addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.
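For the self-learning mode, the sketch below shows a generic DQN update with a target value network and the experience replay memory B described above, assuming hypothetical q_net and target_net modules and stored 4-tuple transitions (with actions stored as long indices); it illustrates the technique rather than the exact filed networks.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    replay_B = deque(maxlen=100_000)   # experience replay memory B of transitions (h, a, r, h_next)

    def dqn_step(q_net, target_net, optimizer, batch_size, gamma):
        """Sample a mini-batch from B and minimise the squared TD error of the current value network."""
        if len(replay_B) < batch_size:
            return
        h, a, r, h_next = map(torch.stack, zip(*random.sample(replay_B, batch_size)))
        with torch.no_grad():
            # target: self-learning reward + discounted max Q of the target value network
            target = r + gamma * target_net(h_next).max(dim=1).values
        q = q_net(h).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # every so many rounds, copy the current value network's weights into the target value network:
    # target_net.load_state_dict(q_net.state_dict())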
The advantages of the method are as follows. The constructed network slice communication-sensing-computing fusion resource collaborative optimization model is complete; since the network environment of the model is unknown, each base station finds the optimal allocation strategy for user resources and computing resources through continuous interaction with the environment, which gives the method high practical value. By combining transfer learning with deep reinforcement learning, the existing knowledge of an already-trained base station can be reused and quickly transferred into the training process of other base stations, preserving prior work and providing a timeliness advantage; this improves the decision efficiency of deep reinforcement learning in a high-dimensional space, effectively strengthens the learning and generalization capability of the base stations, avoids the complexity of manually allocating resources to operating base stations in an uncertain environment, and enables the base stations to complete collaborative resource optimization more safely and efficiently.
Drawings
FIG. 1: multi-agent knowledge migration reinforcement learning framework
Detailed Description
The present invention will be described in further detail below with reference to an example, for the purpose of facilitating understanding and practice of the invention by those of ordinary skill in the art; it is to be understood that the embodiment described here is illustrative and the invention is not to be construed as limited thereto.
According to the differentiated service requirements of mobile edge network slice users, and considering constraints such as the network slice communication-sensing-computing resources, user equipment delay and energy consumption, a collaborative optimization problem model is established with maximization of the total throughput of the user equipment as the optimization target. On this basis, the optimization problem is modeled as a multi-agent stochastic game, a knowledge-transfer reinforcement learning algorithm among the multiple agents is studied, the exploration efficiency and scalability of the collaborative optimization strategy are improved, and collaborative optimization of network slice communication-sensing-computing resources is realized under diversified service scenarios.
Step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model;
step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model, specifically as follows:
Assume that M = 5 base stations share K = 100 resource blocks and F = 100 computing resources and can support access by N = 75 edge sensing devices. At time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device: if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j. Since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices.
Likewise, a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t: if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j. Since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices.
At time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs. Then, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
The objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively; the total user delay T and the total energy consumption E are determined according to the actual engineering conditions. In the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations.
Step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent deep reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs.
Step 2.1, modeling the multi-agent stochastic game: the optimization problem is modeled as a multi-agent stochastic game in which each base station acts as an agent.
The state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t.
The action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f.
The reward function of each base station is defined as follows: the reward r_i^t of the ith base station at time t reflects the sum of the throughputs of all EDs served by the ith base station.
Step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1. At time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state. In actual engineering, setting z too small leads to an insufficient amount of data for network training, while setting it too large lowers the training efficiency of the system.
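To illustrate how the most recent z state-action pairs could be encoded into the hidden state by the LSTM unit b_i, the sketch below uses a standard torch.nn.LSTM; the dimensions and the concatenation of states and actions are assumptions chosen for illustration.

    import torch
    import torch.nn as nn

    class HistoryEncoder(nn.Module):
        """LSTM unit b_i: encode the last z (state, action) pairs into a hidden state."""
        def __init__(self, state_dim, action_dim, hidden_dim):
            super().__init__()
            self.lstm = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)

        def forward(self, states, actions):
            # states: (batch, z, state_dim), actions: (batch, z, action_dim)
            x = torch.cat([states, actions], dim=-1)
            _, (h_n, _) = self.lstm(x)
            return h_n[-1]                     # (batch, hidden_dim) hidden state

    # e.g. z = 8 recent steps; too small starves training of data, too large slows training
    encoder = HistoryEncoder(state_dim=16, action_dim=8, hidden_dim=64)
    hidden = encoder(torch.randn(1, 8, 16), torch.randn(1, 8, 8))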
The student mode is based on a deep deterministic policy gradient model. The Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode. When the threshold G is less than 0.5, base station i is more inclined to select the student mode; conversely, when G is greater than 0.5, base station i is more inclined to select the self-learning mode.
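A minimal sketch of this mode-selection rule, assuming the Actor exposes the probability P'_ss of choosing the student mode directly:

    def choose_mode(p_student: float, G: float = 0.5) -> str:
        """Select the behaviour mode of base station i from P'_ss and threshold G."""
        # P'_ss > G -> student mode (request advice from the other base stations);
        # otherwise -> self-learning mode. A smaller G favours the student mode.
        return "student" if p_student > G else "self-learning"

    mode = choose_mode(p_student=0.7, G=0.5)   # -> "student"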
A multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M). With the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n. The final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding. The student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time.
the gain in base station learning performance from student mode is defined herein as student rewards
Figure BDA0003522981460000108
The student Actor-criticic network is trained using a trained attention selection model. By minimizing the student's loss function
Figure BDA0003522981460000109
Updating student Critic network parameters of base station i
Figure BDA00035229814600001010
Figure BDA00035229814600001011
Wherein the content of the first and second substances,
Figure BDA00035229814600001012
is given by the parameter
Figure BDA00035229814600001013
The student goal Critic network of (1),
Figure BDA00035229814600001014
and
Figure BDA00035229814600001015
respectively representing the hidden state at the current time t and the hidden state at the next time t,
Figure BDA00035229814600001016
and
Figure BDA00035229814600001017
respectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,
Figure BDA00035229814600001018
and
Figure BDA00035229814600001019
representing the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,
Figure BDA0003522981460000111
γ is the discount factor for the student reward function.
The student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters.
in the self-learning mode, if the student Actor of the base station i selects the self-learning mode, the student Actor will hide the base station i
Figure BDA0003522981460000114
And sending the data to a self-learning network module, wherein each base station independently performs resource optimization action decision by adopting a Deep Q Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network. The current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network. After a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network.
The weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor.
Meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy. In its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B. During learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network. Continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum. In addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.

Claims (1)

1. A knowledge-transfer reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model;
step 1, constructing the network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
assuming that M base stations share K resource blocks and F computing resources and can support access by N edge sensing devices; at time t, the ith base station owns, for slice type l, a number of resource blocks, computing resources Y_i^l(t) and Edge Devices (EDs), where l ∈ {e, u, m}: l = e denotes the enhanced Mobile Broadband (eMBB) slice type, l = u denotes the massive Machine Type Communication (mMTC) slice type, and l = m denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type;
at time t, a binary resource block allocation variable is defined for the ith base station serving the jth edge device, with 1 ≤ i ≤ M, 1 ≤ j ≤ N and 1 ≤ k ≤ K; if it equals 1, base station i allocates the kth resource block to ED j; if it equals 0, base station i does not allocate the kth resource block to ED j; since each resource block is allocated to at most one edge device, the allocation variables of each resource block sum to at most one over all edge devices;
a binary computing resource allocation variable is defined for the ith base station serving the jth edge device at time t, with 1 ≤ f ≤ F; if it equals 1, base station i allocates computing resource f to ED j; if it equals 0, base station i does not allocate computing resource f to ED j; since each computing resource is allocated to at most one edge device, the allocation variables of each computing resource sum to at most one over all edge devices;
at time t, the performance differences of the three slice types l ∈ {e, u, m} are considered: the eMBB slice focuses on the sum of the throughputs of all its EDs, the uRLLC slice emphasizes the sum of the delays of all its EDs, and, since most mMTC devices are dormant at any given time, the mMTC slice only considers the sum of the throughputs of all its EDs; then, to balance the differentiated slice requirements, the sum of the throughputs of all users is maximized as the optimization target under the constraints on communication resources, sensing resources, computing resources, total user delay and total energy consumption, and the network slice communication-sensing-computing fusion multidimensional resource collaborative optimization model of step 1 is as follows:
the objective is to maximize the sum of the throughputs of all EDs at time t, subject to constraints C1-C5, where C1, C2, C3, C4 and C5 constrain the communication resource K, the sensing resource N, the computing resource F, the total user delay T and the total energy consumption E, respectively; in the model, the number of resource blocks, the computing resources Y_i^l(t) and the edge sensing devices owned by base station i at time t are defined for each slice type l ∈ {e, u, m}, where l = e denotes the eMBB slice type, l = u the mMTC slice type and l = m the uRLLC slice type; the binary resource block allocation variable and the binary computing resource allocation variable indicate, respectively, whether base station i allocates the kth resource block and the computing resource f to the jth edge device at time t; the objective sums the throughputs of all EDs at time t, while constraints C4 and C5 bound the sums of the delays and of the energy consumptions of all EDs at time t; and M is the total number of base stations;
step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, a knowledge-transfer-based multi-agent reinforcement learning method is used to optimize the number of resource blocks, the computing resources Y_i^l(t) and the number of edge devices owned by base station i at time t, together with the binary resource block allocation variables and the binary computing resource allocation variables, so as to obtain the optimized sum of the throughputs of all EDs;
Step 2.1, modeling the multi-agent random game process: modeling the optimization problem into a multi-agent random game process, and equivalently using each base station as an agent;
the state of each base station is defined as follows: the state of the ith base station at time t consists of the user sensing-computing tasks of the three network slices observed when the ith base station serves the jth edge device at time t;
the action of each base station is defined as follows: the action of the ith base station at time t comprises the numbers of resource blocks, computing resources Y_i^l(t) and edge sensing devices owned by base station i for the three network slices, the user resource block allocation strategy for allocating the kth resource block to the jth edge device, and the computing resource allocation strategy for allocating computing resource f;
The reward function of each base station is defined as:
Figure FDA0003522981450000037
wherein r isi tThe reward function of the ith base station at the moment t is represented, and the sum of the throughputs of all EDs in the ith base station is reflected;
step 2.2, according to the multi-agent stochastic game model of step 2.1, a multi-agent knowledge-transfer reinforcement learning model is designed, as shown in Fig. 1; at time t, base station i observes its state from the network environment; under the Actor-Critic framework, before taking an action the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy;
at time t, base station i uses a long short-term memory (LSTM) unit b_i to encode the historical knowledge of the most recent z states and actions as its hidden state;
the student mode is based on a deep deterministic policy gradient model; the Actor network outputs the probability P'_ss of selecting the student mode; when this probability exceeds a threshold G, i.e. P'_ss > G, base station i selects the student mode and sends advice requests to the other base stations; conversely, when P'_ss ≤ G, base station i selects the self-learning mode;
a multi-head attention mechanism model is designed in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information (states, actions and so on) and the policy network parameters θ_n of each teacher base station n (1 ≤ n ≤ M); with the learnable parameters P_1, P_2 and P_3 of the multi-head attention mechanism, the assigned attention weights are obtained, where D is the dimension of the history information vector of teacher base station n; the final policy advice is the linearly transformed weighted sum, where P_s is the learnable parameter for policy parameter decoding; the student base station i then uses its hidden state and the parameters from the multi-head attention mechanism model to obtain its action at this time;
the gain in the base station's learning performance brought by the student mode is defined as the student reward; the student Actor-Critic network is trained with the trained attention selection model: the student Critic network parameters of base station i are updated by minimizing the student loss function, which measures the expected squared error between the state-action value of the student Critic network and the target formed from the student reward and the discounted state-action value of a separately parameterized student target Critic network, evaluated at the hidden states and student policies of base station i at the current time t and at the next time instant, where E[·] denotes expectation and γ is the discount factor for the student reward function;
the student Actor network, whose policy is parameterized by μ, then performs a policy gradient update of its policy parameters;
in the self-learning mode, if the student Actor of base station i selects self-learning, it sends the hidden state of base station i to the self-learning network module, and each base station independently makes its resource optimization action decisions using the Deep Q-Network (DQN) method;
the DQN algorithm framework consists of a current value network and a target value network; the current value network uses a weighted state-action value function, evaluated at the hidden state of base station i at time t and the action generated by the current value network, to approximate the optimal state-action value function; the target value network uses its own weighted state-action value function, evaluated at the hidden state of base station i at the next time instant and the action generated by the target value network, to improve the performance of the whole network; after a certain number of rounds, the weights of the current value network are copied to update the weights of the target value network;
the weights of the current value network are updated by gradient descent to minimize the loss function, where r_i^t is the self-learning reward function and γ is the discount factor;
meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy; in its current hidden state, base station i performs an action, earns the reward r_i^t, and transitions to the hidden state at the next time instant; the deep neural network stores this state-transition information in the experience replay memory B; during learning, mini-batches of transition samples are drawn at random from the experience replay memory B to train the neural network; continuously reducing the correlation among training samples helps the base station learn and train better and keeps the optimal strategy from falling into a local minimum; in addition, neural networks tend to overfit some of the experience data, and randomly drawing mini-batch transition samples also effectively reduces overfitting.
CN202210185185.2A 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method Withdrawn CN114615744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210185185.2A CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210185185.2A CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Publications (1)

Publication Number Publication Date
CN114615744A true CN114615744A (en) 2022-06-10

Family

ID=81858654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210185185.2A Withdrawn CN114615744A (en) 2022-02-28 2022-02-28 Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method

Country Status (1)

Country Link
CN (1) CN114615744A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361301A (en) * 2022-10-09 2022-11-18 之江实验室 Distributed computing network cooperative traffic scheduling system and method based on DQN
US12021751B2 (en) 2022-10-09 2024-06-25 Zhejiang Lab DQN-based distributed computing network coordinate flow scheduling system and method
CN115496208A (en) * 2022-11-15 2022-12-20 清华大学 Unsupervised multi-agent reinforcement learning method with collaborative mode diversity guidance

Similar Documents

Publication Publication Date Title
Wang et al. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach
Huang et al. Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing
Yu et al. Toward resource-efficient federated learning in mobile edge computing
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
CN109474980A (en) A kind of wireless network resource distribution method based on depth enhancing study
Chen et al. Multiuser computation offloading and resource allocation for cloud–edge heterogeneous network
CN114615744A (en) Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method
Wang et al. Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC
CN111611062B (en) Cloud-edge collaborative hierarchical computing method and cloud-edge collaborative hierarchical computing system
CN111475274A (en) Cloud collaborative multi-task scheduling method and device
CN113364859B (en) MEC-oriented joint computing resource allocation and unloading decision optimization method in Internet of vehicles
Meng et al. Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems
CN113573363B (en) MEC calculation unloading and resource allocation method based on deep reinforcement learning
CN105743985A (en) Virtual service migration method based on fuzzy logic
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
KR20230007941A (en) Edge computational task offloading scheme using reinforcement learning for IIoT scenario
Li et al. Computation offloading with reinforcement learning in d2d-mec network
Li et al. Task computation offloading for multi-access edge computing via attention communication deep reinforcement learning
Wang et al. Improving the performance of tasks offloading for internet of vehicles via deep reinforcement learning methods
Zhu et al. Computing offloading decision based on multi-objective immune algorithm in mobile edge computing scenario
Iqbal et al. Convolutional neural network-based deep Q-network (CNN-DQN) resource management in cloud radio access network
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN110392377A (en) A kind of 5G super-intensive networking resources distribution method and device
Cui et al. Resource-Efficient DNN Training and Inference for Heterogeneous Edge Intelligence in 6G

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20220610