CN114189870A - Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning - Google Patents

Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Info

Publication number
CN114189870A
Authority
CN
China
Prior art keywords
network
local
dnn
cell
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111512524.5A
Other languages
Chinese (zh)
Inventor
Wang Xiaoming
Hu Jing
Xu Youyun
Li Dapeng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111512524.5A priority Critical patent/CN114189870A/en
Publication of CN114189870A publication Critical patent/CN114189870A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473Wireless resource allocation based on the type of the allocated resource the resource being transmission power

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems. The method comprises the following steps: Step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem; Step 2: state acquisition; Step 3: sub-channel allocation and power allocation; Step 4: feedback acquisition and parameter updating; Step 5: a decision-driven mechanism. The method effectively reduces the input and output dimensions, signaling overhead and computational complexity of the network, well guarantees the service satisfaction level of multi-cell eMBB and URLLC users, and further improves the performance of the whole system.

Description

Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication, and in particular to a resource allocation method, based on multi-agent deep reinforcement learning, for handling simultaneous multi-cell eMBB and URLLC transmission, so as to improve the service satisfaction level of eMBB and URLLC users across multiple cells.
Background
The 6G network is envisioned as a globally connected world integrating terrestrial radio and satellite communication, able to flexibly support services in different application scenarios such as enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) with the support of technologies such as global satellite positioning systems and telecommunication satellite systems. Service applications in 6G such as immersive cloud XR, holographic communication and sensory interconnection place even higher requirements on eMBB and URLLC. How to use limited system resources to meet the different requirements of these two services has become a key issue in wireless communication networks. It is therefore important to solve the resource allocation problem when eMBB and URLLC coexist.
A literature search shows that X. Wang et al. published a paper entitled "Joint Scheduling of URLLC and eMBB Traffic in 5G Wireless Networks" in IEEE Conference on Computer Communications, pp. 1970-1978, April 2018, which proposed a linear model, a convex model and a threshold model to evaluate the loss of eMBB data rate, and jointly optimized the bandwidth allocation of eMBB users and the resource preemption positions of URLLC traffic under the assumption that URLLC traffic arrives at a stable rate. In practical applications, however, URLLC traffic is time-varying, and a long-term optimal solution cannot be obtained with this method. As the number of users and the system scale increase, the method also suffers from long run time and high computational complexity, so researchers have considered applying reinforcement learning, with its strong computational capability and learning ability, to the wireless network resource allocation problem.
A patent search shows that CN109561504A discloses a URLLC and eMBB resource multiplexing method based on deep reinforcement learning. It first collects the URLLC and eMBB packet information, channel information and queue information of M mini-slots as training data; it then builds a URLLC and eMBB resource multiplexing model based on deep reinforcement learning and trains the model parameters with the training data. After training, the URLLC and eMBB packet information, channel information and queue information of the current mini-slot are fed into the trained model, and a resource multiplexing decision is finally obtained, thereby realizing a reasonable allocation and utilization of time-frequency resources and power. However, that invention only considers a resource allocation scheme for a single-cell eMBB and URLLC system. In practical application scenarios, since every cell occupies the same spectrum resources, users in a cell will inevitably be interfered by neighboring cells; improving system performance by reasonably allocating the sub-channels and powers of multi-cell eMBB and URLLC user systems has therefore become a current research focus.
Disclosure of Invention
The invention provides a multi-cell eMBB and URLLC user system resource allocation method based on multi-agent deep reinforcement learning. It solves the multi-cell eMBB and URLLC user system resource allocation problem with a multi-agent deep reinforcement learning approach, uses centralized training with distributed execution of multiple agents to keep global control while reducing the dimensionality of the complex task, and effectively improves system performance while reducing time cost. Specifically, the sub-channel and power allocation schemes of each cell are output by a dueling deep Q-network (DDQN) and a deep deterministic policy gradient (DDPG) network, respectively, and the allocation strategies are then adjusted according to the feedback of the system so as to maximize the service satisfaction level of the multi-cell eMBB and URLLC users.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems, and comprises the following steps:
step 1: and constructing a multi-agent network for solving the problem of multi-cell eMBB and URLLC user system resource allocation.
Specifically, a multi-cell eMBB and URLLC user system contains N base stations, and each base station serves M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations then differ according to the specific requirements of the users: the time domain is divided into 1-millisecond time slots for transmitting eMBB traffic, and each time slot is further divided into 7 small time slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. In order to maximize the service satisfaction level of eMBB and URLLC users under limited spectrum resources, this patent constructs a multi-agent deep reinforcement learning network for solving the sub-channel and power allocation problem of multi-cell eMBB and URLLC users. First, N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information. Then, a centralized training network is established at the center based on the DDQN and the DDPG; its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are updated in turn.
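As a purely illustrative sketch of this Step 1 construction (not the patent's implementation), the following Python/PyTorch snippet instantiates, per base station, one local dueling Q-DNN for the discrete sub-channel allocation and one local actor DNN for the continuous power allocation. The layer sizes, the state dimension M·L, the size of the discrete action space and the values of N, M and L are all assumed values used only for illustration.

```python
import torch
import torch.nn as nn

class DuelingQDNN(nn.Module):
    """Local Q-DNN: maps the local state to Q-values over the
    discrete sub-channel allocation actions (dueling architecture)."""
    def __init__(self, state_dim, n_subchannel_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                          # state-value stream
        self.advantage = nn.Linear(hidden, n_subchannel_actions)   # advantage stream

    def forward(self, s):
        h = self.feature(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)                # combine the two streams

class ActorDNN(nn.Module):
    """Local actor DNN: maps the local state to a continuous power
    allocation action for each sub-channel, bounded by the power budget."""
    def __init__(self, state_dim, n_subchannels, p_max=1.0, hidden=128):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_subchannels), nn.Sigmoid())        # output in (0, 1)

    def forward(self, s):
        return self.p_max * self.net(s)                            # scale to (0, p_max]

# One Q-DNN and one actor DNN per base station (N local agents); numbers are assumptions.
N, M, L = 3, 6, 4                      # assumed cells, users per cell, sub-channels
state_dim = M * L                      # channel gains of M users on L sub-channels
n_subchannel_actions = 16              # assumed size of the discrete action space
local_q_dnns = [DuelingQDNN(state_dim, n_subchannel_actions) for _ in range(N)]
local_actors = [ActorDNN(state_dim, L) for _ in range(N)]
```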
Step 2: state acquisition. The channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations is taken as the current state s_t of the cell; the state of the n-th base station at time t is:

[formula image: definition of the local state s_n(t)]

Step 3: sub-channel allocation and power allocation. The local neural network takes the state obtained in Step 2 as input and then outputs a local sub-channel allocation action and a local power allocation action; for the n-th base station at time t, the sub-channel allocation action C_n(t) and the power allocation action P_n(t) are, respectively:

[formula images: definitions of C_n(t) and P_n(t)]
in particular, at the beginning of each time slot, the local state s obtainedn(t) are sent to the corresponding local Q-DNN n' networkNetworks and operator DNN' networks. Selecting an action from the local sub-channel allocation action space by the local Q-DNN n' network by adopting an e-greedy strategy
Figure BDA0003398164280000042
As a subchannel allocation scheme within the current time slot. Wherein, the e-greedy strategy refers to randomly selecting an action from the sub-channel allocation action space with the probability of being
Figure BDA0003398164280000043
Or selecting the action with the maximum estimated Q value with the probability 1-epsilon as
Figure BDA0003398164280000044
To balance the exploration of new actions with the exploitation of known actions. At the same time, the local operator DNN' network is also activated, using the same state as input, according to which
Figure BDA0003398164280000045
To output a corresponding power allocation action, wherein mu(s)n(t);θ′n) Is a policy function of the local actor DNN n 'network, θ'nIs a network parameter of the local operator DNN',
Figure BDA0003398164280000046
representing a random noise process and following a positive distribution. Finally, the local network output joint sub-channel and power allocation action is as follows:
a(t)={a1(t),a2(t),...,aN(t)}=
{[C1(t),P1(t)],[C2(t),P2(t)],...,[CN(t),PN(t)]}。
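The following sketch illustrates the Step 3 action selection for one local agent: an ε-greedy choice over the Q-DNN's discrete sub-channel actions, and a DDPG-style power action from the actor DNN with additive Gaussian exploration noise clipped to the power budget. It reuses the hypothetical `DuelingQDNN`/`ActorDNN` classes and variables from the previous sketch; the values of ε, the noise scale and the clipping range are assumptions, not the patent's parameters.

```python
import random
import torch

def select_actions(q_dnn, actor, s_n, epsilon=0.1, noise_std=0.05, p_max=1.0):
    """Return (sub-channel action index C_n, power vector P_n) for one base station."""
    with torch.no_grad():
        q_values = q_dnn(s_n)                            # Q-value per discrete action
        if random.random() < epsilon:                    # explore: random sub-channel action
            c_n = random.randrange(q_values.shape[-1])
        else:                                            # exploit: argmax of estimated Q
            c_n = int(torch.argmax(q_values).item())
        p_n = actor(s_n)                                 # deterministic policy mu(s; theta')
        p_n = p_n + noise_std * torch.randn_like(p_n)    # Gaussian exploration noise
        p_n = p_n.clamp(0.0, p_max)                      # keep within the power budget
    return c_n, p_n

# Joint action a(t) = {[C_1(t), P_1(t)], ..., [C_N(t), P_N(t)]} over all base stations.
s = torch.rand(state_dim)                                # placeholder local state
joint_action = [select_actions(q, a, s) for q, a in zip(local_q_dnns, local_actors)]
```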
and 4, step 4: feedback acquisition and parameter updating.
After each cell executes the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network. The local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network. After the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image].
at the central network, a multi-agent network is established based on the DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan
Figure BDA0003398164280000051
Wherein the content of the first and second substances,
Figure BDA0003398164280000052
Figure BDA0003398164280000053
and
Figure BDA0003398164280000054
are network parameters of the central network target Q-DNN. And then assigning the parameters of the central network Q-DNN to the corresponding target Q-DNN every X steps, as follows:
Figure BDA0003398164280000055
and
Figure BDA0003398164280000056
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function

[formula image: critic loss function]

where

y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-),

and δ^- is the network parameter of the target critic DNN. The network parameter of the target critic DNN is then updated in a soft-update manner as

δ^- ← τ_c δ + (1 - τ_c) δ^-,

where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN. Thereafter, the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and are updated as

[formula image: policy-gradient update of θ_n]

Similar to the target critic network, the network parameters of the target actor DNN are updated as

θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,

where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
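As a compact, assumption-laden illustration of the centralized training in Step 4, the sketch below shows a first-in first-out experience pool, the periodic hard copy of the central Q-DNN parameters to its target network every X steps, and the soft update of the target critic/actor parameters. The loss minimization and policy-gradient steps themselves are omitted, the pool size, X and τ values are placeholders, and the `DuelingQDNN`/`ActorDNN` classes from the first sketch are reused.

```python
from collections import deque
import copy
import torch

experience_pool = deque(maxlen=100_000)     # first-in first-out experience memory D

def store_global_experience(states, actions, global_reward, next_states):
    """Central network stores {s_1..s_N, a_1..a_N, r(t), s'_1..s'_N} in D."""
    experience_pool.append((states, actions, global_reward, next_states))

def hard_update(target_net, online_net):
    """Every X steps: assign the central Q-DNN parameters to the target Q-DNN."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau):
    """Target critic/actor: theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * op)

# Example: create target networks as copies, then update them during training.
central_q = DuelingQDNN(state_dim, n_subchannel_actions)
target_q = copy.deepcopy(central_q)
central_actor, target_actor = ActorDNN(state_dim, L), ActorDNN(state_dim, L)
target_actor.load_state_dict(central_actor.state_dict())

X, step, tau_n = 100, 0, 0.01
step += 1
if step % X == 0:
    hard_update(target_q, central_q)                   # periodic hard copy for the Q-DNN
soft_update(target_actor, central_actor, tau_n)        # soft update for the target actor
```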
Step 5: decision-driven mechanism.
The invention designs a decision-driven mechanism: by monitoring the state of the system, a new round of learning is triggered only when the states of two consecutive time slots differ noticeably; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.

Specifically, a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) the state of the previous time slot. By monitoring the current state and comparing it with that of the previous time slot, the mechanism decides whether to perform a new round of learning: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
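The decision-driven mechanism of Step 5 can be summarized by the small helper below, which reuses the previous action when the state has barely changed and only triggers a new round of learning otherwise; the threshold value and the use of an L2 norm over the flattened state are assumptions consistent with the description above.

```python
import torch

def decide_action(s_t, s_prev, prev_action, learn_fn, rho=1e-3):
    """Reuse the previous action when the state barely changed;
    otherwise run a new round of learning to obtain a fresh action."""
    state_diff = torch.norm(s_t - s_prev, p=2).item()
    if state_diff <= rho:
        return prev_action          # state almost unchanged: keep the last allocation
    return learn_fn(s_t)            # state changed noticeably: learn a new action
```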
The invention has the beneficial effects that: the method designs multiple agents that execute in a distributed manner locally and are trained centrally at the center, based on DDQN and DDPG networks; it solves the sub-channel allocation and power allocation problem of multi-cell eMBB and URLLC user systems well, and effectively reduces the input and output dimensions, signaling overhead and computational complexity of the network. Compared with common reinforcement learning methods, the service satisfaction level of multi-cell eMBB and URLLC users is improved, and the performance of the whole network is further improved. The method combines sub-channel allocation and power allocation in one multi-agent deep reinforcement learning network to improve the system performance of simultaneous multi-cell eMBB and URLLC transmission, and maximizes the service satisfaction of multi-cell eMBB and URLLC users while taking inter-cell co-channel interference into account.
Drawings
Fig. 1 is a diagram illustrating a multi-cell eMBB and URLLC multiplexing scenario according to the present invention.
Fig. 2 is a block diagram of multi-cell eMBB and URLLC user system resource allocation based on multi-agent deep reinforcement learning.
FIG. 3 is a schematic diagram of information interaction between a multi-agent network and a multi-cell system according to the present invention.
Fig. 4 is a diagram illustrating the comparison of multi-cell eMBB and URLLC user service satisfaction in accordance with the present invention.
FIG. 5 is a graphical representation of the time cost per execution of the method of the present invention compared to other methods.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
The invention relates to a joint sub-channel distribution and power distribution method of a multi-cell eMBB and URLLC user system based on multi-agent deep reinforcement learning.
A multi-cell eMBB and URLLC user system contains N base stations, and each base station serves M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations then differ according to the specific requirements of the users: in this patent, the time domain is divided into 1-millisecond time slots for transmitting eMBB traffic, and each time slot is further divided into 7 small time slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. The goal is to maximize the service satisfaction level of eMBB and URLLC users under the condition of limited spectrum resources.
The method is realized by the following steps:
step 1: and constructing a multi-agent network for solving the problem of multi-cell eMBB and URLLC user system resource allocation.
Specifically, the method comprises the following steps: N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information; then, a centralized training network is established at the center based on the DDQN and the DDPG, its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are then updated. Finally, the agents reach reward maximization through continuous learning.
Step 2: based on the interference between adjacent cells in the multi-cell eMBB and URLLC user system, establish the signal-to-interference-plus-noise ratio (SINR) and achievable data rate of each eMBB user and each URLLC user, and set the target reward.
Specifically, the SINR received by eMBB user b from base station n on the l-th sub-channel in the k-th small time slot is:

[formula image]

and the SINR received by URLLC user u from base station n on the l-th sub-channel in the k-th small time slot is:

[formula image]

where the quantities in the formulas denote, respectively, the channel allocation index of user m, the channel gain in the k-th small time slot, and the transmit power on the l-th sub-channel received from base station n in the k-th small time slot, and N_0 denotes the noise power.
Then, according to the Shannon formula, the transmission rates achieved by eMBB user b and URLLC user u in the k-th small time slot on the l-th sub-channel of base station n are, respectively:

[formula images]

Finally, the rate achieved by all eMBB users of base station n in the t-th time slot is obtained:

[formula image]

as well as the rate achieved by all URLLC users of base station n in the t-th time slot:

[formula image]
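To make the rate model concrete, the following sketch computes, for one user on one sub-channel in one small time slot, an SINR of the usual form (desired received power over inter-cell interference plus noise) and the corresponding Shannon rate. Since the exact index structure of the patent's formulas is only given in the figures, the variable layout and all numbers here are illustrative assumptions.

```python
import math

def sinr(desired_gain, desired_power, interfering_gains, interfering_powers, noise_power):
    """SINR = (g * p) / (sum over interfering cells of g_i * p_i + N0)."""
    interference = sum(g * p for g, p in zip(interfering_gains, interfering_powers))
    return (desired_gain * desired_power) / (interference + noise_power)

def shannon_rate(bandwidth_hz, sinr_value):
    """Achievable rate on one sub-channel: B * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1.0 + sinr_value)

# Example: one eMBB user served by its base station while two neighbouring
# cells reuse the same sub-channel (all numbers are illustrative).
gamma = sinr(desired_gain=0.8, desired_power=0.5,
             interfering_gains=[0.1, 0.05], interfering_powers=[0.5, 0.5],
             noise_power=1e-3)
rate = shannon_rate(bandwidth_hz=3e6 / 4, sinr_value=gamma)   # 3 MHz shared by L = 4 sub-channels
print(f"SINR = {gamma:.2f}, rate = {rate / 1e6:.2f} Mbit/s")
```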
the objective reward of the invention is to realize the maximization of the service satisfaction level of multi-cell eMBB and URLLC users, and the service satisfaction level of the eMBB and URLLC users at a base station n is respectively measured by the following formula.
Figure BDA0003398164280000092
And
Figure BDA0003398164280000093
wherein the content of the first and second substances,
Figure BDA0003398164280000094
is the lowest rate requirement of all eMBB users of the base station n at the t-th time slot,
Figure BDA0003398164280000095
is the arrival of the user at the t-th slot URLLC of the base station n.
In order to convert the multi-objective problem into a single-objective problem, the service satisfaction level of the multi-cell eMBB and URLLC users is set as the target reward, and the optimization problem is formulated as:

P1: [formula image: objective function]

s.t. C1: [formula image], C2: [formula image],

where the quantity in constraint C2 denotes the maximum transmit power of base station n.
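The exact satisfaction metrics are defined in the figures; as an assumption-laden illustration of the kind of target reward described here, the snippet below caps the ratio of achieved rate to required rate (eMBB) and of served to arrived packets (URLLC) at 1 and combines the per-cell levels into a single scalar reward, turning the multi-objective problem into a single-objective one. The capping and the equal weighting are assumptions, not the patent's exact formulas.

```python
def embb_satisfaction(achieved_rate, min_rate_requirement):
    """Assumed eMBB metric: fraction of the minimum rate requirement met, capped at 1."""
    return min(achieved_rate / min_rate_requirement, 1.0)

def urllc_satisfaction(served_packets, arrived_packets):
    """Assumed URLLC metric: fraction of arrived packets served in time, capped at 1."""
    if arrived_packets == 0:
        return 1.0
    return min(served_packets / arrived_packets, 1.0)

def global_reward(embb_levels, urllc_levels, w_embb=0.5, w_urllc=0.5):
    """Single-objective reward: weighted sum of the average satisfaction levels."""
    avg_embb = sum(embb_levels) / len(embb_levels)
    avg_urllc = sum(urllc_levels) / len(urllc_levels)
    return w_embb * avg_embb + w_urllc * avg_urllc

# Example for two cells (illustrative numbers only).
r_t = global_reward(embb_levels=[0.9, 0.8], urllc_levels=[1.0, 0.95])
```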
Step 3: state setting. The channel gain information of all users in each cell on the different sub-channels is taken as the current state s_t; the state of the n-th base station at time t is:

[formula image: definition of the local state s_n(t)]
and 4, step 4: sub-channel allocation and power allocation: the local neural network takes the state obtained in step 3 as an input, and then outputs a local sub-channel allocation action and a local power allocation action, for example, the sub-channel and the power allocation action of the nth base station at time t are respectively:
Figure BDA0003398164280000101
and
Figure BDA0003398164280000102
specifically, at the beginning of each time slot, the obtained local state sn (t) is sent to the corresponding local Q-DNN 'network and the operator DNN' network. Local Q-DNN n' network adopts E-greedy strategy to distribute action from local sub-channelSelecting an action in space
Figure BDA0003398164280000103
As a subchannel allocation scheme within the current time slot. Wherein, the e-greedy strategy refers to randomly selecting an action from the sub-channel allocation action space with the probability of being
Figure BDA0003398164280000104
Or selecting the action with the maximum estimated Q value with the probability 1-epsilon as
Figure BDA0003398164280000105
To balance the exploration of new actions with the exploitation of known actions. At the same time, the local operator DNN' network is also activated, using the same state as input, according to which
Figure BDA0003398164280000106
To output a corresponding power allocation action, wherein mu(s)n(t);θ′n) Is a policy function of the local actor DNN n 'network, θ'nIs a network parameter of the local operator DNN',
Figure BDA0003398164280000107
representing a random noise process and following a positive distribution. Finally, the local network output joint sub-channel and power allocation action is as follows:
a(t)={a1(t),a2(t),...,aN(t)}=
{[C1(t),P1(t)],[C2(t),P2(t)],...,[CN(t),PN(t)]}。
and 5: feedback acquisition and parameter updating.
After each cell executes the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network. The local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network. After the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image].
in the central network, the patent establishes a multi-agent network based on DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan
Figure BDA0003398164280000112
Wherein the content of the first and second substances,
Figure BDA0003398164280000113
Figure BDA0003398164280000114
and
Figure BDA0003398164280000115
the network parameters of the central network target Q-DNN n are calculated, and then the parameters of the central network Q-DNN are assigned to the corresponding target Q-DNN every X steps, as shown in the following:
Figure BDA0003398164280000116
and
Figure BDA0003398164280000117
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function

[formula image: critic loss function]

where

y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-),

and δ^- is the network parameter of the target critic DNN. The network parameter of the target critic DNN is then updated in a soft-update manner as

δ^- ← τ_c δ + (1 - τ_c) δ^-,

where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN. Thereafter, the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and are updated as

[formula image: policy-gradient update of θ_n]

Similar to the target critic network, the network parameters of the target actor DNN are updated as

θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,

where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
Step 6: decision-driven mechanism.
The decision-driven mechanism of the invention monitors the state of the system and triggers a new round of learning only when the states of two consecutive time slots differ noticeably; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.

Specifically, a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) the state of the previous time slot. The decision-driven module decides whether to perform a new round of learning by monitoring the current state and comparing it with the state of the previous time slot: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
As shown in fig. 1-5, considering multi-cell eMBB and URLLC system scenarios, a sub-channel and power allocation scheme for each user is jointly optimized, and main parameters of the simulation scenario of this embodiment are shown in table 1.
TABLE 1 Main parameters of the system

[table image: main simulation parameters]
Fig. 4 and Fig. 5 compare the algorithm of the present invention with other methods in terms of the multi-cell eMBB and URLLC user service satisfaction level and the cost per execution. It can be seen from the figures that the system performance obtained by the proposed MADRL and MADRL-DD algorithms is slightly lower than that of an exhaustive method and much higher than that of a general reinforcement learning algorithm and a random method. In addition, the performance of the MADRL-DD algorithm is nearly identical to that of the MADRL algorithm, which shows that the decision-driven module effectively reduces the time cost and computational overhead while guaranteeing the service satisfaction level of the multi-cell eMBB and URLLC users.
The method effectively reduces the input and output dimension, signaling overhead and computational complexity of the network, well ensures the service satisfaction level of multi-cell eMBB and URLLC users, and further improves the performance of the whole network.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (6)

1. A multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems, characterized in that the multi-cell multi-service resource allocation method comprises the following steps:
Step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem;
Step 2: state acquisition: taking the channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations as the current state s_t of the cell;
Step 3: sub-channel allocation and power allocation: the local neural network takes the state obtained in Step 2 as input and then outputs a local sub-channel allocation action and a local power allocation action;
Step 4: feedback acquisition and parameter updating: transmitting the sub-channel allocation action and power allocation action obtained in Step 3 to the system, which gives a reward and moves to the next state; the local network continuously collects the current state, the current action, the current reward and the next state and uploads them to the experience memory pool, and sample data are extracted from the experience memory pool to train the central network parameters, which are then transmitted to the local network;
Step 5: a decision-driven mechanism monitors the state of the system; when the states of two consecutive time slots differ noticeably, a new learning process is triggered; otherwise, the action output in the previous time slot continues to be taken as the optimal resource allocation action of the current time slot.
2. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the construction of the multi-agent network in Step 1 specifically comprises the following steps:
Step 1-1: N Q-DNNs and N actor DNNs are established locally, and the local networks output local sub-channel allocation actions and power allocation actions according to the local channel state information;
Step 1-2: a centralized training network is established at the center based on the DDQN and the DDPG; its parameters are updated through the rewards fed back by the environment, and the parameters of the local networks are then updated;
Step 1-3: the agents reach reward maximization through continuous learning.
3. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the specific steps of the sub-channel allocation and power allocation in Step 3 are as follows:
Step 3-1: the local state s_n(t) obtained at the beginning of each time slot is sent to the corresponding local Q-DNN n' network and actor DNN n' network;
Step 3-2: the local Q-DNN n' network selects an action C_n(t) from the local sub-channel allocation action space using an ε-greedy strategy as the sub-channel allocation scheme for the current time slot;
Step 3-3: at the same time, the local actor DNN n' network is activated and, using the same state as input, outputs the corresponding power allocation action P_n(t) = μ(s_n(t); θ'_n) + n_t, where μ(s_n(t); θ'_n) is the policy function of the local actor DNN n' network, θ'_n is the network parameter of the local actor DNN n', and n_t represents a random noise process following a normal distribution;
Step 3-4: finally, the joint sub-channel and power allocation action output by the local networks is:
a(t) = {a_1(t), a_2(t), ..., a_N(t)} = {[C_1(t), P_1(t)], [C_2(t), P_2(t)], ..., [C_N(t), P_N(t)]}.
4. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the feedback acquisition and parameter updating in Step 4 specifically include:
Step 4-1: after each cell executes the joint sub-channel and power allocation action a_n(t) of Step 3, it moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network;
Step 4-2: the local network continuously collects the experience e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads it to the central network;
Step 4-3: after the central network receives the experiences uploaded by all the cells, the global information {s_1(t), s_2(t), ..., s_N(t), a_1(t), a_2(t), ..., a_N(t), r(t), s'_1(t), s'_2(t), ..., s'_N(t)} is stored in the experience pool D in a first-in first-out manner, where r(t) is the global reward obtained from the local rewards [formula image];
Step 4-4: in the central network, a batch of sample data is selected from the experience memory pool, the central network parameters are updated, and finally the updated network parameters are downloaded to the local side to update the local network parameters.
5. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 4, wherein Step 4-4 is specifically as follows:
Step 4-4-1: parameter update of the local Q-DNN n': at time t, a batch of sample data is selected from the experience memory pool, and the network parameters α_n and β_n of the central Q-DNN n are updated by minimizing the loss function [formula image], where the target value is computed with the network parameters α_n^- and β_n^- of the central target Q-DNN n; the parameters of the central Q-DNN are then assigned to the corresponding target Q-DNN every X steps, i.e. α_n^- ← α_n and β_n^- ← β_n; finally, the updated network parameters α_n and β_n are downloaded to the local side to update the Q-DNN n' network parameters;
Step 4-4-2: parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function [formula image], where y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-) and δ^- is the network parameter of the target critic DNN;
Step 4-4-3: the network parameter of the target critic DNN is updated in a soft-update manner as
δ^- ← τ_c δ + (1 - τ_c) δ^-,
where 0 < τ_c << 1 and τ_c denotes the learning rate of the target critic DNN;
Step 4-4-4: the network parameters θ_n of the central actor DNN are trained by maximizing the expectation of the global reward and updated as [formula image]; the network parameters of the target actor DNN are updated as
θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,
where 0 < τ_n << 1 and τ_n denotes the learning rate of the target actor DNN;
Step 4-4-5: the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
6. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the decision-driven mechanism in Step 5 is specifically:
Step 5-1: a state error threshold ρ is set, and the state difference ||s(t) - s(t-1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t-1) denotes the state of the previous time slot;
Step 5-2: the decision-driven module decides whether to perform a new round of learning by monitoring the current state and comparing it with the state of the previous time slot: if the difference does not exceed ρ, base station n keeps its previous output action a_n(t-1); otherwise, it outputs the action obtained after a new round of learning.
CN202111512524.5A 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning Pending CN114189870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512524.5A CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512524.5A CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114189870A true CN114189870A (en) 2022-03-15

Family

ID=80604542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512524.5A Pending CN114189870A (en) 2021-12-08 2021-12-08 Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114189870A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115038155A (en) * 2022-05-23 2022-09-09 香港中文大学(深圳) Ultra-dense multi-access-point dynamic cooperative transmission method
CN115038155B (en) * 2022-05-23 2023-02-07 香港中文大学(深圳) Ultra-dense multi-access-point dynamic cooperative transmission method
CN115002720A (en) * 2022-06-02 2022-09-02 中山大学 Internet of vehicles channel resource optimization method and system based on deep reinforcement learning
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN116367223B (en) * 2023-03-30 2024-01-02 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN117614573A (en) * 2024-01-23 2024-02-27 中国人民解放军战略支援部队航天工程大学 Combined power channel allocation method, system and equipment based on deep reinforcement learning
CN117614573B (en) * 2024-01-23 2024-03-26 中国人民解放军战略支援部队航天工程大学 Combined power channel allocation method, system and equipment based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN114189870A (en) Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning
CN102612085B (en) Sub-band dependent resource management
CN111628855B (en) Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN112601284B (en) Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
US20150036626A1 (en) Method of grouping users to reduce interference in mimo-based wireless network
CN113163451A (en) D2D communication network slice distribution method based on deep reinforcement learning
WO2023179010A1 (en) User packet and resource allocation method and apparatus in noma-mec system
CN106792451B (en) D2D communication resource optimization method based on multi-population genetic algorithm
CN107343268B (en) Non-orthogonal multicast and unicast transmission beamforming method and system
CN105873214A (en) Resource allocation method of D2D communication system based on genetic algorithm
CN112566261A (en) Deep reinforcement learning-based uplink NOMA resource allocation method
CN113596785A (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN111787543A (en) 5G communication system resource allocation method based on improved wolf optimization algorithm
CN105530203B (en) The connection control method and system of D2D communication links
CN113099425B (en) High-energy-efficiency unmanned aerial vehicle-assisted D2D resource allocation method
CN114423028A (en) CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN102316596B (en) Control station and method for scheduling resource block
CN106851726A (en) A kind of cross-layer resource allocation method based on minimum speed limit constraint
Wadhai et al. Performance analysis of hybrid channel allocation scheme for mobile cellular network
CN113242602B (en) Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN115633402A (en) Resource scheduling method for mixed service throughput optimization
CN115915454A (en) SWIPT-assisted downlink resource allocation method and device
CN113162662B (en) User clustering and power distribution method under CF-mMIMO

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination