CN114189870A - Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning - Google Patents

Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Info

Publication number
CN114189870A
Authority
CN
China
Prior art keywords
network
local
dnn
cell
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111512524.5A
Other languages
Chinese (zh)
Inventor
Wang Xiaoming
Hu Jing
Xu Youyun
Li Dapeng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111512524.5A priority Critical patent/CN114189870A/en
Publication of CN114189870A publication Critical patent/CN114189870A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473Wireless resource allocation based on the type of the allocated resource the resource being transmission power

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems. The method comprises the following steps: Step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem; Step 2: state acquisition; Step 3: sub-channel allocation and power allocation; Step 4: feedback acquisition and parameter updating; Step 5: a decision-driven mechanism. The method effectively reduces the input and output dimensions, signaling overhead and computational complexity of the network, well guarantees the service satisfaction level of multi-cell eMBB and URLLC users, and further improves the performance of the whole system.

Description

Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication, and in particular to a resource allocation method, based on multi-agent deep reinforcement learning, for handling simultaneous multi-cell eMBB and URLLC transmission, so as to improve the service satisfaction level of eMBB and URLLC users across multiple cells.
Background
The 6G network is envisioned as a globally connected world integrating terrestrial radio and satellite communication, able to flexibly support services in different application scenarios such as enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) with the support of technologies such as global satellite positioning systems and telecommunication satellite systems. Service applications in 6G such as immersive cloud XR, holographic communication and sensory interconnection place even higher requirements on eMBB and URLLC. How to use limited system resources to meet the different requirements of these two services has become a key issue in wireless communication networks. It is therefore important to solve the resource allocation problem when eMBB and URLLC coexist.
A literature search shows that X. Wang et al. published a paper entitled "Joint Scheduling of URLLC and eMBB Traffic in 5G Wireless Networks" in IEEE Conference on Computer Communications, pp. 1970-1978, April 2018, which proposed a linear model, a convex model and a threshold model to evaluate the loss of eMBB data rate, and jointly optimized the bandwidth allocation of eMBB users and the resource preemption positions of URLLC traffic under the assumption that URLLC traffic arrives at a stable rate. In practical applications, however, URLLC traffic is time-varying, and a long-term optimal solution cannot be obtained with this method. As the number of users and the system scale increase, the method also suffers from long run time and high computational complexity, so researchers have considered applying reinforcement learning, with its strong computational capability and learning ability, to the wireless network resource allocation problem.
A patent search shows that CN109561504A discloses a URLLC and eMBB resource multiplexing method based on deep reinforcement learning. It first collects the URLLC and eMBB packet information, channel information and queue information of M mini-slots as training data; it then builds a URLLC and eMBB resource multiplexing model based on deep reinforcement learning and trains the model parameters with the training data. After training, the URLLC and eMBB packet information, channel information and queue information of the current mini-slot are fed into the trained model, and a resource multiplexing decision is finally obtained, thereby realizing a reasonable allocation and utilization of time-frequency resources and power. However, that invention only considers a resource allocation scheme for a single-cell eMBB and URLLC system. In practical application scenarios, since every cell occupies the same spectrum resources, users in a cell will inevitably be interfered by neighboring cells; improving system performance by reasonably allocating the sub-channels and powers of multi-cell eMBB and URLLC user systems has therefore become a current research focus.
Disclosure of Invention
The invention provides a multi-cell eMBB and URLLC user system resource allocation method based on multi-agent deep reinforcement learning. It solves the multi-cell eMBB and URLLC user system resource allocation problem with a multi-agent deep reinforcement learning approach, uses centralized training with distributed execution of multiple agents to keep global control while reducing the dimensionality of the complex task, and effectively improves system performance while reducing time cost. Specifically, the sub-channel and power allocation schemes of each cell are output by a dueling deep Q-network (DDQN) and a deep deterministic policy gradient (DDPG) network, respectively, and the allocation strategies are then adjusted according to the feedback of the system so as to maximize the service satisfaction level of the multi-cell eMBB and URLLC users.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems, and comprises the following steps:
step 1: and constructing a multi-agent network for solving the problem of multi-cell eMBB and URLLC user system resource allocation.
Specifically, a multi-cell eMBB and URLLC user system contains N base stations, and each base station serves M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations then differ according to the specific requirements of the users: the time domain is divided into 1-millisecond time slots for transmitting eMBB traffic, and each time slot is further divided into 7 small time slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. In order to maximize the service satisfaction level of eMBB and URLLC users under limited spectrum resources, this patent constructs a multi-agent deep reinforcement learning network for solving the sub-channel and power allocation problem of multi-cell eMBB and URLLC users. First, N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information. Then, a centralized training network is established at the center based on the DDQN and the DDPG; its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are updated in turn.
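As a purely illustrative sketch of this Step 1 construction (not the patent's implementation), the following Python/PyTorch snippet instantiates, per base station, one local dueling Q-DNN for the discrete sub-channel allocation and one local actor DNN for the continuous power allocation. The layer sizes, the state dimension M·L, the size of the discrete action space and the values of N, M and L are all assumed values used only for illustration.

```python
import torch
import torch.nn as nn

class DuelingQDNN(nn.Module):
    """Local Q-DNN: maps the local state to Q-values over the
    discrete sub-channel allocation actions (dueling architecture)."""
    def __init__(self, state_dim, n_subchannel_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                          # state-value stream
        self.advantage = nn.Linear(hidden, n_subchannel_actions)   # advantage stream

    def forward(self, s):
        h = self.feature(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)                # combine the two streams

class ActorDNN(nn.Module):
    """Local actor DNN: maps the local state to a continuous power
    allocation action for each sub-channel, bounded by the power budget."""
    def __init__(self, state_dim, n_subchannels, p_max=1.0, hidden=128):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_subchannels), nn.Sigmoid())        # output in (0, 1)

    def forward(self, s):
        return self.p_max * self.net(s)                            # scale to (0, p_max]

# One Q-DNN and one actor DNN per base station (N local agents); numbers are assumptions.
N, M, L = 3, 6, 4                      # assumed cells, users per cell, sub-channels
state_dim = M * L                      # channel gains of M users on L sub-channels
n_subchannel_actions = 16              # assumed size of the discrete action space
local_q_dnns = [DuelingQDNN(state_dim, n_subchannel_actions) for _ in range(N)]
local_actors = [ActorDNN(state_dim, L) for _ in range(N)]
```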
Step 2: state acquisition. The channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations is taken as the current state s_t of the cell; the state of the n-th base station at time t is:

[formula image: definition of the local state s_n(t)]

Step 3: sub-channel allocation and power allocation. The local neural network takes the state obtained in Step 2 as input and then outputs a local sub-channel allocation action and a local power allocation action; for the n-th base station at time t, the sub-channel allocation action C_n(t) and the power allocation action P_n(t) are, respectively:

[formula images: definitions of C_n(t) and P_n(t)]
in particular, at the beginning of each time slot, the local state s obtainedn(t) are sent to the corresponding local Q-DNN n' networkNetworks and operator DNN' networks. Selecting an action from the local sub-channel allocation action space by the local Q-DNN n' network by adopting an e-greedy strategy
Figure BDA0003398164280000042
As a subchannel allocation scheme within the current time slot. Wherein, the e-greedy strategy refers to randomly selecting an action from the sub-channel allocation action space with the probability of being
Figure BDA0003398164280000043
Or selecting the action with the maximum estimated Q value with the probability 1-epsilon as
Figure BDA0003398164280000044
To balance the exploration of new actions with the exploitation of known actions. At the same time, the local operator DNN' network is also activated, using the same state as input, according to which
Figure BDA0003398164280000045
To output a corresponding power allocation action, wherein mu(s)n(t);θ′n) Is a policy function of the local actor DNN n 'network, θ'nIs a network parameter of the local operator DNN',
Figure BDA0003398164280000046
representing a random noise process and following a positive distribution. Finally, the local network output joint sub-channel and power allocation action is as follows:
a(t)={a1(t),a2(t),...,aN(t)}=
{[C1(t),P1(t)],[C2(t),P2(t)],...,[CN(t),PN(t)]}。
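The following sketch illustrates the Step 3 action selection for one local agent: an ε-greedy choice over the Q-DNN's discrete sub-channel actions, and a DDPG-style power action from the actor DNN with additive Gaussian exploration noise clipped to the power budget. It reuses the hypothetical `DuelingQDNN`/`ActorDNN` classes and variables from the previous sketch; the values of ε, the noise scale and the clipping range are assumptions, not the patent's parameters.

```python
import random
import torch

def select_actions(q_dnn, actor, s_n, epsilon=0.1, noise_std=0.05, p_max=1.0):
    """Return (sub-channel action index C_n, power vector P_n) for one base station."""
    with torch.no_grad():
        q_values = q_dnn(s_n)                            # Q-value per discrete action
        if random.random() < epsilon:                    # explore: random sub-channel action
            c_n = random.randrange(q_values.shape[-1])
        else:                                            # exploit: argmax of estimated Q
            c_n = int(torch.argmax(q_values).item())
        p_n = actor(s_n)                                 # deterministic policy mu(s; theta')
        p_n = p_n + noise_std * torch.randn_like(p_n)    # Gaussian exploration noise
        p_n = p_n.clamp(0.0, p_max)                      # keep within the power budget
    return c_n, p_n

# Joint action a(t) = {[C_1(t), P_1(t)], ..., [C_N(t), P_N(t)]} over all base stations.
s = torch.rand(state_dim)                                # placeholder local state
joint_action = [select_actions(q, a, s) for q, a in zip(local_q_dnns, local_actors)]
```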
and 4, step 4: feedback acquisition and parameter updating.
After each cell executes the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network. The local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network. After the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image].
at the central network, a multi-agent network is established based on the DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan
Figure BDA0003398164280000051
Wherein the content of the first and second substances,
Figure BDA0003398164280000052
Figure BDA0003398164280000053
and
Figure BDA0003398164280000054
are network parameters of the central network target Q-DNN. And then assigning the parameters of the central network Q-DNN to the corresponding target Q-DNN every X steps, as follows:
Figure BDA0003398164280000055
and
Figure BDA0003398164280000056
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function

[formula image: critic loss function]

where

y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-),

and δ^- is the network parameter of the target critic DNN. The network parameter of the target critic DNN is then updated in a soft-update manner as

δ^- ← τ_c δ + (1 - τ_c) δ^-,

where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN. Thereafter, the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and are updated as

[formula image: policy-gradient update of θ_n]

Similar to the target critic network, the network parameters of the target actor DNN are updated as

θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,

where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
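As a compact, assumption-laden illustration of the centralized training in Step 4, the sketch below shows a first-in first-out experience pool, the periodic hard copy of the central Q-DNN parameters to its target network every X steps, and the soft update of the target critic/actor parameters. The loss minimization and policy-gradient steps themselves are omitted, the pool size, X and τ values are placeholders, and the `DuelingQDNN`/`ActorDNN` classes from the first sketch are reused.

```python
from collections import deque
import copy
import torch

experience_pool = deque(maxlen=100_000)     # first-in first-out experience memory D

def store_global_experience(states, actions, global_reward, next_states):
    """Central network stores {s_1..s_N, a_1..a_N, r(t), s'_1..s'_N} in D."""
    experience_pool.append((states, actions, global_reward, next_states))

def hard_update(target_net, online_net):
    """Every X steps: assign the central Q-DNN parameters to the target Q-DNN."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau):
    """Target critic/actor: theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * op)

# Example: create target networks as copies, then update them during training.
central_q = DuelingQDNN(state_dim, n_subchannel_actions)
target_q = copy.deepcopy(central_q)
central_actor, target_actor = ActorDNN(state_dim, L), ActorDNN(state_dim, L)
target_actor.load_state_dict(central_actor.state_dict())

X, step, tau_n = 100, 0, 0.01
step += 1
if step % X == 0:
    hard_update(target_q, central_q)                   # periodic hard copy for the Q-DNN
soft_update(target_actor, central_actor, tau_n)        # soft update for the target actor
```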
Step 5: decision-driven mechanism.
The invention designs a decision-driven mechanism: by monitoring the state of the system, a new round of learning is triggered only when the states of two consecutive time slots differ noticeably; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.

Specifically, a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) the state of the previous time slot. By monitoring the current state and comparing it with that of the previous time slot, the mechanism decides whether to perform a new round of learning: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
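The decision-driven mechanism of Step 5 can be summarized by the small helper below, which reuses the previous action when the state has barely changed and only triggers a new round of learning otherwise; the threshold value and the use of an L2 norm over the flattened state are assumptions consistent with the description above.

```python
import torch

def decide_action(s_t, s_prev, prev_action, learn_fn, rho=1e-3):
    """Reuse the previous action when the state barely changed;
    otherwise run a new round of learning to obtain a fresh action."""
    state_diff = torch.norm(s_t - s_prev, p=2).item()
    if state_diff <= rho:
        return prev_action          # state almost unchanged: keep the last allocation
    return learn_fn(s_t)            # state changed noticeably: learn a new action
```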
The invention has the beneficial effects that: the method designs multiple agents that execute in a distributed manner locally and are trained centrally at the center, based on DDQN and DDPG networks; it solves the sub-channel allocation and power allocation problem of multi-cell eMBB and URLLC user systems well, and effectively reduces the input and output dimensions, signaling overhead and computational complexity of the network. Compared with common reinforcement learning methods, the service satisfaction level of multi-cell eMBB and URLLC users is improved, and the performance of the whole network is further improved. The method combines sub-channel allocation and power allocation in one multi-agent deep reinforcement learning network to improve the system performance of simultaneous multi-cell eMBB and URLLC transmission, and maximizes the service satisfaction of multi-cell eMBB and URLLC users while taking inter-cell co-channel interference into account.
Drawings
Fig. 1 is a diagram illustrating a multi-cell eMBB and URLLC multiplexing scenario according to the present invention.
Fig. 2 is a block diagram of multi-cell eMBB and URLLC user system resource allocation based on multi-agent deep reinforcement learning.
FIG. 3 is a schematic diagram of information interaction between a multi-agent network and a multi-cell system according to the present invention.
Fig. 4 is a diagram illustrating the comparison of multi-cell eMBB and URLLC user service satisfaction in accordance with the present invention.
FIG. 5 is a graphical representation of the time cost per execution of the method of the present invention compared to other methods.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
The invention relates to a joint sub-channel distribution and power distribution method of a multi-cell eMBB and URLLC user system based on multi-agent deep reinforcement learning.
A multi-cell eMBB and URLLC user system contains N base stations, and each base station serves M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations then differ according to the specific requirements of the users: in this patent, the time domain is divided into 1-millisecond time slots for transmitting eMBB traffic, and each time slot is further divided into 7 small time slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. The goal is to maximize the service satisfaction level of eMBB and URLLC users under the condition of limited spectrum resources.
The method is realized by the following steps:
step 1: and constructing a multi-agent network for solving the problem of multi-cell eMBB and URLLC user system resource allocation.
Specifically, the method comprises the following steps: N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information; then, a centralized training network is established at the center based on the DDQN and the DDPG, its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are then updated. Finally, the agents reach reward maximization through continuous learning.
Step 2: based on the interference between adjacent cells in the multi-cell eMBB and URLLC user system, establish the signal-to-interference-plus-noise ratio (SINR) and achievable data rate of each eMBB user and each URLLC user, and set the target reward.
Specifically, the SINR received by eMBB user b from base station n on the l-th sub-channel in the k-th small time slot is:

[formula image]

and the SINR received by URLLC user u from base station n on the l-th sub-channel in the k-th small time slot is:

[formula image]

where the quantities in the formulas denote, respectively, the channel allocation index of user m, the channel gain in the k-th small time slot, and the transmit power on the l-th sub-channel received from base station n in the k-th small time slot, and N_0 denotes the noise power.
Then, according to the Shannon formula, the transmission rates achieved by eMBB user b and URLLC user u in the k-th small time slot on the l-th sub-channel of base station n are, respectively:

[formula images]

Finally, the rate achieved by all eMBB users of base station n in the t-th time slot is obtained:

[formula image]

as well as the rate achieved by all URLLC users of base station n in the t-th time slot:

[formula image]
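To make the rate model concrete, the following sketch computes, for one user on one sub-channel in one small time slot, an SINR of the usual form (desired received power over inter-cell interference plus noise) and the corresponding Shannon rate. Since the exact index structure of the patent's formulas is only given in the figures, the variable layout and all numbers here are illustrative assumptions.

```python
import math

def sinr(desired_gain, desired_power, interfering_gains, interfering_powers, noise_power):
    """SINR = (g * p) / (sum over interfering cells of g_i * p_i + N0)."""
    interference = sum(g * p for g, p in zip(interfering_gains, interfering_powers))
    return (desired_gain * desired_power) / (interference + noise_power)

def shannon_rate(bandwidth_hz, sinr_value):
    """Achievable rate on one sub-channel: B * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1.0 + sinr_value)

# Example: one eMBB user served by its base station while two neighbouring
# cells reuse the same sub-channel (all numbers are illustrative).
gamma = sinr(desired_gain=0.8, desired_power=0.5,
             interfering_gains=[0.1, 0.05], interfering_powers=[0.5, 0.5],
             noise_power=1e-3)
rate = shannon_rate(bandwidth_hz=3e6 / 4, sinr_value=gamma)   # 3 MHz shared by L = 4 sub-channels
print(f"SINR = {gamma:.2f}, rate = {rate / 1e6:.2f} Mbit/s")
```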
the objective reward of the invention is to realize the maximization of the service satisfaction level of multi-cell eMBB and URLLC users, and the service satisfaction level of the eMBB and URLLC users at a base station n is respectively measured by the following formula.
Figure BDA0003398164280000092
And
Figure BDA0003398164280000093
wherein the content of the first and second substances,
Figure BDA0003398164280000094
is the lowest rate requirement of all eMBB users of the base station n at the t-th time slot,
Figure BDA0003398164280000095
is the arrival of the user at the t-th slot URLLC of the base station n.
In order to convert the multi-objective problem into a single-objective problem, the service satisfaction level of the multi-cell eMBB and URLLC users is set as the target reward, and the optimization problem is formulated as:

P1: [formula image: objective function]

s.t. C1: [formula image], C2: [formula image],

where the quantity in constraint C2 denotes the maximum transmit power of base station n.
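The exact satisfaction metrics are defined in the figures; as an assumption-laden illustration of the kind of target reward described here, the snippet below caps the ratio of achieved rate to required rate (eMBB) and of served to arrived packets (URLLC) at 1 and combines the per-cell levels into a single scalar reward, turning the multi-objective problem into a single-objective one. The capping and the equal weighting are assumptions, not the patent's exact formulas.

```python
def embb_satisfaction(achieved_rate, min_rate_requirement):
    """Assumed eMBB metric: fraction of the minimum rate requirement met, capped at 1."""
    return min(achieved_rate / min_rate_requirement, 1.0)

def urllc_satisfaction(served_packets, arrived_packets):
    """Assumed URLLC metric: fraction of arrived packets served in time, capped at 1."""
    if arrived_packets == 0:
        return 1.0
    return min(served_packets / arrived_packets, 1.0)

def global_reward(embb_levels, urllc_levels, w_embb=0.5, w_urllc=0.5):
    """Single-objective reward: weighted sum of the average satisfaction levels."""
    avg_embb = sum(embb_levels) / len(embb_levels)
    avg_urllc = sum(urllc_levels) / len(urllc_levels)
    return w_embb * avg_embb + w_urllc * avg_urllc

# Example for two cells (illustrative numbers only).
r_t = global_reward(embb_levels=[0.9, 0.8], urllc_levels=[1.0, 0.95])
```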
Step 3: state setting. The channel gain information of all users in each cell on the different sub-channels is taken as the current state s_t; the state of the n-th base station at time t is:

[formula image: definition of the local state s_n(t)]
and 4, step 4: sub-channel allocation and power allocation: the local neural network takes the state obtained in step 3 as an input, and then outputs a local sub-channel allocation action and a local power allocation action, for example, the sub-channel and the power allocation action of the nth base station at time t are respectively:
Figure BDA0003398164280000101
and
Figure BDA0003398164280000102
specifically, at the beginning of each time slot, the obtained local state sn (t) is sent to the corresponding local Q-DNN 'network and the operator DNN' network. Local Q-DNN n' network adopts E-greedy strategy to distribute action from local sub-channelSelecting an action in space
Figure BDA0003398164280000103
As a subchannel allocation scheme within the current time slot. Wherein, the e-greedy strategy refers to randomly selecting an action from the sub-channel allocation action space with the probability of being
Figure BDA0003398164280000104
Or selecting the action with the maximum estimated Q value with the probability 1-epsilon as
Figure BDA0003398164280000105
To balance the exploration of new actions with the exploitation of known actions. At the same time, the local operator DNN' network is also activated, using the same state as input, according to which
Figure BDA0003398164280000106
To output a corresponding power allocation action, wherein mu(s)n(t);θ′n) Is a policy function of the local actor DNN n 'network, θ'nIs a network parameter of the local operator DNN',
Figure BDA0003398164280000107
representing a random noise process and following a positive distribution. Finally, the local network output joint sub-channel and power allocation action is as follows:
a(t)={a1(t),a2(t),...,aN(t)}=
{[C1(t),P1(t)],[C2(t),P2(t)],...,[CN(t),PN(t)]}。
and 5: feedback acquisition and parameter updating.
After each cell executes the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network. The local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network. After the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image].
in the central network, the patent establishes a multi-agent network based on DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan
Figure BDA0003398164280000112
Wherein the content of the first and second substances,
Figure BDA0003398164280000113
Figure BDA0003398164280000114
and
Figure BDA0003398164280000115
the network parameters of the central network target Q-DNN n are calculated, and then the parameters of the central network Q-DNN are assigned to the corresponding target Q-DNN every X steps, as shown in the following:
Figure BDA0003398164280000116
and
Figure BDA0003398164280000117
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function

[formula image: critic loss function]

where

y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-),

and δ^- is the network parameter of the target critic DNN. The network parameter of the target critic DNN is then updated in a soft-update manner as

δ^- ← τ_c δ + (1 - τ_c) δ^-,

where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN. Thereafter, the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and are updated as

[formula image: policy-gradient update of θ_n]

Similar to the target critic network, the network parameters of the target actor DNN are updated as

θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,

where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
Step 6: decision-driven mechanism.
The decision-driven mechanism of the invention monitors the state of the system and triggers a new round of learning only when the states of two consecutive time slots differ noticeably; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.

Specifically, a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) the state of the previous time slot. The decision-driven module decides whether to perform a new round of learning by monitoring the current state and comparing it with the state of the previous time slot: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
As shown in fig. 1-5, considering multi-cell eMBB and URLLC system scenarios, a sub-channel and power allocation scheme for each user is jointly optimized, and main parameters of the simulation scenario of this embodiment are shown in table 1.
TABLE 1 Main parameters of the system

[table image: main simulation parameters]
Fig. 4 and Fig. 5 compare the algorithm of the present invention with other methods in terms of the multi-cell eMBB and URLLC user service satisfaction level and the cost per execution. It can be seen from the figures that the system performance obtained by the proposed MADRL and MADRL-DD algorithms is slightly lower than that of an exhaustive method and much higher than that of a general reinforcement learning algorithm and a random method. In addition, the performance of the MADRL-DD algorithm is nearly identical to that of the MADRL algorithm, which shows that the decision-driven module effectively reduces the time cost and computational overhead while guaranteeing the service satisfaction level of the multi-cell eMBB and URLLC users.
The method effectively reduces the input and output dimension, signaling overhead and computational complexity of the network, well ensures the service satisfaction level of multi-cell eMBB and URLLC users, and further improves the performance of the whole network.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (6)

1. A multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems, characterized in that the multi-cell multi-service resource allocation method comprises the following steps:
Step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem;
Step 2: state acquisition: taking the channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations as the current state s_t of the cell;
Step 3: sub-channel allocation and power allocation: the local neural network takes the state obtained in Step 2 as input and then outputs a local sub-channel allocation action and a local power allocation action;
Step 4: feedback acquisition and parameter updating: transmitting the sub-channel allocation action and power allocation action obtained in Step 3 to the system, which gives a reward and moves to the next state; the local network continuously collects the current state, the current action, the current reward and the next state and uploads them to the experience memory pool, and sample data are extracted from the experience memory pool to train the central network parameters, which are then transmitted to the local network;
Step 5: a decision-driven mechanism monitors the state of the system; when the states of two consecutive time slots differ noticeably, a new learning process is triggered; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.
2. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the construction of the multi-agent network in Step 1 specifically comprises the following steps:
Step 1-1: N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information;
Step 1-2: a centralized training network is established at the center based on the DDQN and the DDPG; its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are then updated;
Step 1-3: the agents reach reward maximization through continuous learning.
3. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the specific steps of the sub-channel allocation and power allocation in Step 3 are as follows:
Step 3-1: the local state s_n(t) obtained at the beginning of each time slot is sent to the corresponding local Q-DNN n' network and actor DNN n' network;
Step 3-2: the local Q-DNN n' network selects an action C_n(t) from the local sub-channel allocation action space using an ε-greedy strategy as the sub-channel allocation scheme for the current time slot;
Step 3-3: at the same time, the local actor DNN n' network is activated and, using the same state as input, outputs the corresponding power allocation action P_n(t) = μ(s_n(t); θ'_n) + n_t, where μ(s_n(t); θ'_n) is the policy function of the local actor DNN n' network, θ'_n is the network parameter of the local actor DNN n', and n_t represents a random noise process following a normal distribution;
Step 3-4: finally, the joint sub-channel and power allocation action output by the local networks is:
a(t) = {a_1(t), a_2(t), ..., a_N(t)} = {[C_1(t), P_1(t)], [C_2(t), P_2(t)], ..., [C_N(t), P_N(t)]}.
4. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the feedback acquisition and parameter updating in Step 4 specifically include:
Step 4-1: after each cell executes the joint sub-channel and power allocation action a_n(t) of Step 3, it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network;
Step 4-2: the local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network;
Step 4-3: after the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image];
Step 4-4: in the central network, a batch of sample data is selected from the experience memory pool, the central network parameters are updated, and finally the updated network parameters are downloaded to the local side to update the local network parameters.
5. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 4, wherein Step 4-4 is specifically as follows:
Step 4-4-1: parameter update of the local Q-DNN n': at time t, a batch of sample data is selected from the experience memory pool, and the network parameters α_n and β_n of the central Q-DNN n are updated by minimizing the loss function [formula image], where the target value is computed with the network parameters α_n^- and β_n^- of the central target Q-DNN n; the parameters of the central Q-DNN are then assigned to the corresponding target Q-DNN every X steps, i.e. α_n^- ← α_n and β_n^- ← β_n; finally, the updated network parameters α_n and β_n are downloaded to the local side to update the Q-DNN n' network parameters;
Step 4-4-2: parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function [formula image], where y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-) and δ^- is the network parameter of the target critic DNN;
Step 4-4-3: the network parameter of the target critic DNN is updated in a soft-update manner as
δ^- ← τ_c δ + (1 - τ_c) δ^-,
where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN;
Step 4-4-4: the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and updated as [formula image]; the network parameters of the target actor DNN are updated as
θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,
where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN;
Step 4-4-5: the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
6. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the decision-driven mechanism in Step 5 is specifically:
Step 5-1: a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) denotes the state of the previous time slot;
Step 5-2: the decision-driven module decides whether to perform a new round of learning by monitoring the current state and comparing it with the state of the previous time slot: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
CN202111512524.5A 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning Pending CN114189870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512524.5A CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512524.5A CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114189870A true CN114189870A (en) 2022-03-15

Family

ID=80604542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512524.5A Pending CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114189870A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115038155A (en) * 2022-05-23 2022-09-09 香港中文大学(深圳) Ultra-dense multi-access-point dynamic cooperative transmission method
CN115038155B (en) * 2022-05-23 2023-02-07 香港中文大学(深圳) Ultra-dense multi-access-point dynamic cooperative transmission method
CN115002720A (en) * 2022-06-02 2022-09-02 中山大学 Internet of vehicles channel resource optimization method and system based on deep reinforcement learning
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN116367223B (en) * 2023-03-30 2024-01-02 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN117614573A (en) * 2024-01-23 2024-02-27 中国人民解放军战略支援部队航天工程大学 Combined power channel allocation method, system and equipment based on deep reinforcement learning
CN117614573B (en) * 2024-01-23 2024-03-26 中国人民解放军战略支援部队航天工程大学 Combined power channel allocation method, system and equipment based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN114189870A (en) Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning
CN102612085B (en) Sub-band dependent resource management
CN111628855B (en) Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN112601284B (en) Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
US20150036626A1 (en) Method of grouping users to reduce interference in mimo-based wireless network
CN113163451A (en) D2D communication network slice distribution method based on deep reinforcement learning
WO2023179010A1 (en) User packet and resource allocation method and apparatus in noma-mec system
CN106792451B (en) D2D communication resource optimization method based on multi-population genetic algorithm
CN107343268B (en) Non-orthogonal multicast and unicast transmission beamforming method and system
CN105873214A (en) Resource allocation method of D2D communication system based on genetic algorithm
CN112566261A (en) Deep reinforcement learning-based uplink NOMA resource allocation method
CN113596785A (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN111787543A (en) 5G communication system resource allocation method based on improved wolf optimization algorithm
CN105530203B (en) The connection control method and system of D2D communication links
CN113099425B (en) High-energy-efficiency unmanned aerial vehicle-assisted D2D resource allocation method
CN114423028A (en) CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN102316596B (en) Control station and method for scheduling resource block
CN106851726A (en) A kind of cross-layer resource allocation method based on minimum speed limit constraint
Wadhai et al. Performance analysis of hybrid channel allocation scheme for mobile cellular network
CN113242602B (en) Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN115633402A (en) Resource scheduling method for mixed service throughput optimization
CN115915454A (en) SWIPT-assisted downlink resource allocation method and device
CN113162662B (en) User clustering and power distribution method under CF-mMIMO

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination