CN114189870A - Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning - Google Patents
- Publication number
- CN114189870A (application number CN202111512524.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- local
- dnn
- cell
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/02—Resource partitioning among network components, e.g. reuse partitioning
- H04W16/10—Dynamic resource partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0453—Resources in frequency domain, e.g. a carrier in FDMA
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0473—Wireless resource allocation based on the type of the allocated resource the resource being transmission power
Abstract
The invention discloses a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems. The method comprises the following steps: step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem; step 2: state acquisition; step 3: sub-channel allocation and power allocation; step 4: feedback acquisition and parameter updating; step 5: a decision-driven mechanism. The method effectively reduces the input and output dimensions, signaling overhead, and computational complexity of the network, ensures the service satisfaction level of multi-cell eMBB and URLLC users, and thereby improves the performance of the whole system.
Description
Technical Field
The invention relates to the field of wireless communication, and in particular to a resource allocation method that handles simultaneous multi-cell eMBB and URLLC transmission via multi-agent deep reinforcement learning, so as to improve the service satisfaction level of eMBB and URLLC users across multiple cells.
Background
The 6G network is a world of global connection integrating terrestrial radio and satellite communication, and can flexibly adapt to services in different application scenarios such as enhanced mobile broadband (eMBB), ultra-reliable low-latency communications (URLLC), and the like, with the support of technologies such as global satellite positioning system, telecommunication satellite system, and the like. Service applications such as immersive cloud XR, holographic communication and sensory interconnection in 6G have higher requirements on eMBB and URLLC. How to utilize limited system resources to meet the different requirements of the two services becomes a key issue of wireless communication networks. Therefore, it is important to solve the problem of resource allocation in the coexistence of eMBB and URLLC.
It was found through retrieval that X. Wang et al. published a paper entitled "Joint Scheduling of URLLC and eMBB Traffic in 5G Wireless Networks" in IEEE Conference on Computer Communications (INFOCOM), pp. 1970-1978, April 2018, which proposed linear, convex, and threshold models to evaluate the loss of eMBB data rate, and jointly optimized the bandwidth allocation of eMBB users and the resource preemption locations of URLLC traffic under the assumption that URLLC traffic arrives steadily. However, in practical applications URLLC traffic is time-varying, so this method cannot obtain a long-term optimal solution. As the number of users and the system scale grow, the method also suffers from long run times and high computational complexity, so researchers have considered applying reinforcement learning, with its strong computational capacity and learning ability, to the wireless network resource allocation problem.
Through patent retrieval, CN109561504A was found to disclose a URLLC and eMBB resource multiplexing method based on deep reinforcement learning. First, the packet information, channel information, and queue information of URLLC and eMBB over M mini-slots are acquired as training data; a deep-reinforcement-learning-based URLLC and eMBB resource multiplexing model is then established, and its parameters are trained with these data. After training, the URLLC and eMBB packet information, channel information, and queue information of the current mini-slot are input into the trained model to obtain the resource multiplexing decision, thereby achieving reasonable allocation and utilization of time-frequency resources and power. However, that invention only considers a single-cell eMBB and URLLC resource allocation scheme. In practical scenarios, since all cells occupy the same spectrum resources, users in a cell inevitably suffer interference from neighboring cells; improving system performance by reasonably allocating the sub-channels and powers of multi-cell eMBB and URLLC user systems has therefore become a current research focus.
Disclosure of Invention
The invention provides a multi-cell eMBB and URLLC user system resource allocation method based on multi-agent deep reinforcement learning. It solves the multi-cell eMBB and URLLC resource allocation problem with multiple agents trained centrally and executed in a distributed manner, which provides global control while reducing the dimensionality of the complex task, and effectively improves system performance while reducing time cost. Specifically, the sub-channel and power allocation schemes of each cell are output by a joint dueling deep Q-network (DDQN) and deep deterministic policy gradient (DDPG) network, and the allocation strategies are then adjusted according to the feedback of the system to maximize the service satisfaction level of multi-cell eMBB and URLLC users.
In order to achieve the purpose, the invention is realized by the following technical scheme:
The invention relates to a multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning, which is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems and comprises the following steps.
Step 1: construct a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem.
Specifically, the multi-cell eMBB and URLLC user system contains N base stations, each serving M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations are then chosen according to user requirements: the time domain is divided into 1 millisecond time slots for transmitting eMBB traffic, and each slot is further divided into 7 mini-slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. To maximize the service satisfaction level of eMBB and URLLC users under limited spectrum resources, this patent constructs a multi-agent deep reinforcement learning network to solve the sub-channel and power allocation problem of multi-cell eMBB and URLLC users. First, N Q-DNNs and N actor DNNs are established locally; each local network outputs local sub-channel allocation actions and power allocation actions according to local channel state information. Then, a centralized training network is established at the center based on the DDQN and the DDPG; its parameters are updated through rewards fed back by the environment, and the local network parameters are updated in turn.
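As a concrete illustration, the system model above can be parameterized as in the sketch below. All numeric values other than the 1 ms slot, the 7 mini-slots, and the 3 MHz total bandwidth are hypothetical placeholders, and the state-dimension helper only reflects the description that each base station observes its users' gains on every sub-channel of every base station.

```python
# Hypothetical system-model constants; the patent fixes only the slot
# structure and the 3 MHz total bandwidth, the rest are placeholders.
N = 3          # number of base stations / cells
B, U = 2, 2    # eMBB and URLLC users per cell
M = B + U      # total users per cell (M = B + U)
L = 4          # sub-channels per base station

SLOT_MS = 1.0             # eMBB scheduling slot length (1 ms)
MINI_SLOTS_PER_SLOT = 7   # URLLC mini-slots per slot
TOTAL_BW_HZ = 3e6         # total system bandwidth (3 MHz)
BW_PER_SUBCHANNEL = TOTAL_BW_HZ / L  # assumed equal split

def local_state_dim(n_bs=N, users=M, subchannels=L):
    """Dimension of one base station's local state: channel gains of its
    users on every sub-channel of every (serving and interfering) BS."""
    return users * n_bs * subchannels
```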
Step 2: state acquisition: the channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations is taken as the current state s_t of the cell; the state s_n(t) of the nth base station at time t consists of the channel gains observed by that base station.
Step 3: sub-channel allocation and power allocation: the local neural network takes the state obtained in step 2 as input and then outputs a local sub-channel allocation action C_n(t) and a local power allocation action P_n(t) for the nth base station at time t.
in particular, at the beginning of each time slot, the local state s obtainedn(t) are sent to the corresponding local Q-DNN n' networkNetworks and operator DNN' networks. Selecting an action from the local sub-channel allocation action space by the local Q-DNN n' network by adopting an e-greedy strategyAs a subchannel allocation scheme within the current time slot. Wherein, the e-greedy strategy refers to randomly selecting an action from the sub-channel allocation action space with the probability of beingOr selecting the action with the maximum estimated Q value with the probability 1-epsilon asTo balance the exploration of new actions with the exploitation of known actions. At the same time, the local operator DNN' network is also activated, using the same state as input, according to whichTo output a corresponding power allocation action, wherein mu(s)n(t);θ′n) Is a policy function of the local actor DNN n 'network, θ'nIs a network parameter of the local operator DNN',representing a random noise process and following a positive distribution. Finally, the local network output joint sub-channel and power allocation action is as follows:
a(t)={a1(t),a2(t),...,aN(t)}=
{[C1(t),P1(t)],[C2(t),P2(t)],...,[CN(t),PN(t)]}。
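A minimal sketch of the two local action rules described above, assuming a discrete Q-value vector for sub-channel allocation and a continuous actor output for power; the Gaussian noise and the clipping of power to a feasible range are standard DDPG practice, and the clipping bounds are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Sub-channel allocation: random action with probability epsilon,
    otherwise the action with the largest estimated Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def actor_action(mu, noise_std, p_max):
    """Power allocation: deterministic policy output mu plus normally
    distributed exploration noise, clipped to the assumed range [0, p_max]."""
    noisy = np.asarray(mu) + rng.normal(0.0, noise_std, size=np.shape(mu))
    return np.clip(noisy, 0.0, p_max)
```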
and 4, step 4: feedback acquisition and parameter updating.
After each cell receives the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and produces a local reward r_n(t), which is then fed back to the local network. The local network continuously collects experiences e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads them to the central network. After receiving them, the central network stores the global information {s_1(t), ..., s_N(t), a_1(t), ..., a_N(t), r(t), s'_1(t), ..., s'_N(t)} in the experience pool D in a first-in, first-out manner.
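The first-in, first-out experience pool D can be sketched with a bounded deque; the capacity and the tuple layout are illustrative, not specified in the patent.

```python
from collections import deque

class ExperiencePool:
    """FIFO experience pool D for global transitions; when full, the
    oldest entry is evicted first (deque maxlen semantics)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, states, actions, reward, next_states):
        # One global transition {s(t), a(t), r(t), s'(t)} from all N cells.
        self.buffer.append((states, actions, reward, next_states))

    def __len__(self):
        return len(self.buffer)
```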
at the central network, a multi-agent network is established based on the DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan,
Wherein the content of the first and second substances,
andare network parameters of the central network target Q-DNN. And then assigning the parameters of the central network Q-DNN to the corresponding target Q-DNN every X steps, as follows:
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
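The every-X-steps assignment of the central Q-DNN parameters to the target Q-DNN is a hard update; in the sketch below the parameters are modeled as plain dictionaries for illustration (a real implementation would copy network tensors).

```python
def maybe_sync_target(step, X, online_params, target_params):
    """Hard target update: every X steps, copy the central Q-DNN
    parameters to the target Q-DNN; otherwise leave the target unchanged."""
    if step % X == 0:
        return dict(online_params)  # target <- online
    return target_params
```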
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function between the critic output and the target value y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-), where δ^- is the network parameter of the target critic DNN.
Then, the network parameters of the target critic DNN are updated in a soft-update manner:
δ^- ← τ_c δ + (1 − τ_c) δ^-, where 0 < τ_c ≪ 1 and τ_c represents the learning rate of the target critic DNN network. Thereafter, the network parameter θ_n of the central actor DNN is trained by maximizing the expectation of the global reward.
Similar to the target critic network, the network parameters of the target actor DNN are updated as θ_n^- ← τ_n θ_n + (1 − τ_n) θ_n^-, where 0 < τ_n ≪ 1 and τ_n represents the learning rate of the target actor DNN network. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
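The soft updates δ^- ← τδ + (1 − τ)δ^- used for both the target critic and target actor can be sketched generically; parameters are again modeled as dictionaries purely for illustration.

```python
def soft_update(target, online, tau):
    """Soft (Polyak) update: target <- tau * online + (1 - tau) * target,
    with 0 < tau << 1, applied parameter by parameter."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in target}
```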
Step 5: decision-driven mechanism.
The invention designs a decision-driven mechanism that monitors the state of the system: when the states of two consecutive time slots differ noticeably, a new round of learning is triggered; otherwise, the action output in the previous time slot is reused as the optimal resource allocation action for the current time slot.
Specifically, a state error threshold ρ is set, and the state change Δ(t) = ||s(t) − s(t−1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t−1) the state of the previous time slot. By monitoring the current state and comparing it with that of the previous slot, the mechanism decides whether to perform a new round of learning: if Δ(t) ≤ ρ, base station n outputs a_n(t−1), the action of the previous slot; otherwise it outputs the action obtained after a new round of learning.
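Under this reading, the decision-driven mechanism reduces to a threshold test on the state change; `learn_fn` below is a hypothetical stand-in for a full learning round of the multi-agent network.

```python
import numpy as np

def decision_driven_action(s_t, s_prev, a_prev, rho, learn_fn):
    """If the state change ||s(t) - s(t-1)||_2 is within the threshold rho,
    reuse the previous slot's action; otherwise run a new learning round."""
    if np.linalg.norm(np.asarray(s_t) - np.asarray(s_prev)) <= rho:
        return a_prev
    return learn_fn(s_t)
```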
The beneficial effects of the invention are as follows. Based on DDQN and DDPG networks, the method designs multiple agents that execute locally in a distributed manner and are trained centrally; it solves the sub-channel allocation and power allocation problems of multi-cell eMBB and URLLC user systems well, and effectively reduces the input and output dimensions, signaling overhead, and computational complexity of the network. Compared with common reinforcement learning methods, it improves the service satisfaction level of multi-cell eMBB and URLLC users and thereby the performance of the whole network. By combining sub-channel allocation and power allocation in one multi-agent deep reinforcement learning network, the system performance of simultaneous multi-cell eMBB and URLLC transmission is improved, and the service satisfaction of multi-cell eMBB and URLLC users is maximized while accounting for inter-cell co-channel interference.
Drawings
Fig. 1 is a diagram illustrating a multi-cell eMBB and URLLC multiplexing scenario according to the present invention.
Fig. 2 is a block diagram of multi-cell eMBB and URLLC user system resource allocation based on multi-agent deep reinforcement learning.
FIG. 3 is a schematic diagram of information interaction between a multi-agent network and a multi-cell system according to the present invention.
Fig. 4 is a diagram illustrating the comparison of multi-cell eMBB and URLLC user service satisfaction in accordance with the present invention.
FIG. 5 is a graphical representation of the time cost per execution of the method of the present invention compared to other methods.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
The invention relates to a joint sub-channel distribution and power distribution method of a multi-cell eMBB and URLLC user system based on multi-agent deep reinforcement learning.
The multi-cell eMBB and URLLC user system contains N base stations, each serving M users randomly distributed in its cell, of which B are eMBB users and U are URLLC users, with M = B + U. Each user is equipped with one antenna for receiving and transmitting data, and each base station has L sub-channels. Transmission durations are chosen according to user requirements: in this patent, the time domain is divided into 1 millisecond time slots for transmitting eMBB traffic, and each slot is further divided into 7 mini-slots for transmitting URLLC traffic. In each time slot, D_u URLLC packets arrive, each of size Z_u bytes. The total bandwidth of the multi-cell system is assumed to be 3 MHz. The objective is to maximize the service satisfaction level of eMBB and URLLC users under limited spectrum resources.
The method is realized by the following steps.
Step 1: construct a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem.
Specifically: N Q-DNNs and N actor DNNs are established locally, and each local network outputs local sub-channel allocation and power allocation actions according to local channel state information; a centralized training network is then established at the center based on the DDQN and the DDPG, whose parameters are updated through rewards fed back by the environment and are in turn used to update the local network parameters. Finally, the agents maximize the reward through continuous learning.
Step 2: based on the interference between adjacent cells in the multi-cell eMBB and URLLC user system, establish the signal-to-interference-plus-noise ratio (SINR) and the achieved data rate of each eMBB and URLLC user, and set the target reward.
Specifically, the SINR of the lth sub-channel received by eMBB user b from base station n in the kth mini-slot, and the SINR of the lth sub-channel received by URLLC user u from base station n in the kth mini-slot, are computed from the channel allocation index of user m, the channel gain in the kth mini-slot, and the transmit power of the lth sub-channel received from base station n in the kth mini-slot, where N_0 denotes the noise power.
Then, according to the Shannon formula, the transmission rates achieved by eMBB user b and URLLC user u in the kth mini-slot on the lth sub-channel of base station n are obtained. Summing these gives the rate achieved by all eMBB users of base station n in the tth time slot, and the rate achieved by all URLLC users of base station n in the tth time slot.
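A sketch of the Shannon-formula rate computation implied above; passing a per-sub-channel bandwidth is an assumption about how the 3 MHz total is divided among the L sub-channels.

```python
import numpy as np

def shannon_rate(bandwidth_hz, sinr_value):
    """Achievable rate on one sub-channel: B * log2(1 + SINR)."""
    return bandwidth_hz * np.log2(1.0 + sinr_value)

def cell_rate(bandwidth_hz, sinr_per_subchannel):
    """Sum-rate over the allocated sub-channels of one user group."""
    return sum(shannon_rate(bandwidth_hz, s) for s in sinr_per_subchannel)
```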
the objective reward of the invention is to realize the maximization of the service satisfaction level of multi-cell eMBB and URLLC users, and the service satisfaction level of the eMBB and URLLC users at a base station n is respectively measured by the following formula.
Andwherein the content of the first and second substances,is the lowest rate requirement of all eMBB users of the base station n at the t-th time slot,is the arrival of the user at the t-th slot URLLC of the base station n.
In order to convert the multi-objective problem into a single-objective problem, the service satisfaction level of the multi-cell eMBB and URLLC users is set as the target reward of the optimization problem.
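A hypothetical scalarization of the two satisfaction levels into a single reward; the capped-ratio satisfaction measures and the equal 0.5/0.5 weights below are assumptions for illustration, not taken from the patent.

```python
def satisfaction_embb(rate, rate_min):
    """eMBB satisfaction at base station n: achieved rate relative to the
    minimum rate requirement, capped at 1 (assumed normalization)."""
    return min(rate / rate_min, 1.0)

def satisfaction_urllc(served_packets, arrived_packets):
    """URLLC satisfaction: fraction of arrived packets served in the slot."""
    return served_packets / arrived_packets if arrived_packets else 1.0

def reward(rate, rate_min, served, arrived, w_e=0.5, w_u=0.5):
    """Single-objective target reward as a weighted sum (weights assumed)."""
    return w_e * satisfaction_embb(rate, rate_min) + w_u * satisfaction_urllc(served, arrived)
```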
and 4, step 4: sub-channel allocation and power allocation: the local neural network takes the state obtained in step 3 as an input, and then outputs a local sub-channel allocation action and a local power allocation action, for example, the sub-channel and the power allocation action of the nth base station at time t are respectively:
and
Specifically, at the beginning of each time slot, the obtained local state s_n(t) is sent to the corresponding local Q-DNN n' network and actor DNN n' network. The local Q-DNN n' network uses an ε-greedy strategy to select an action C_n(t) from the local sub-channel allocation action space as the sub-channel allocation scheme for the current time slot. Here the ε-greedy strategy means randomly selecting an action from the sub-channel allocation action space with probability ε, or selecting the action with the largest estimated Q value with probability 1 − ε, so as to balance the exploration of new actions with the exploitation of known ones. At the same time, the local actor DNN n' network is activated with the same state as input and outputs the corresponding power allocation action P_n(t) = μ(s_n(t); θ'_n) + Δn(t), where μ(s_n(t); θ'_n) is the policy function of the local actor DNN n' network, θ'_n is its network parameter, and Δn(t) represents a random noise process following a normal distribution. Finally, the local networks output the joint sub-channel and power allocation action:
a(t) = {a_1(t), a_2(t), ..., a_N(t)} = {[C_1(t), P_1(t)], [C_2(t), P_2(t)], ..., [C_N(t), P_N(t)]}.
and 5: feedback acquisition and parameter updating.
After each cell receives the joint sub-channel and power allocation action a_n(t), it moves from the current state s_n(t) to the next state s'_n(t) and produces a local reward r_n(t), which is then fed back to the local network. The local network continuously collects experiences e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads them to the central network; after receiving them, the central network stores the global information {s_1(t), ..., s_N(t), a_1(t), ..., a_N(t), r(t), s'_1(t), ..., s'_N(t)} in the experience pool D in a first-in, first-out manner.
in the central network, the patent establishes a multi-agent network based on DDQN and DDPG for updating local network parameters. Parameter update for local Q-DNN n': at time t, a part of sample data is selected from the experience memory pool, and the network parameter alpha of the central network Q-DNN n is updated by minimizing the following loss functionnAnd betan,
Wherein the content of the first and second substances,
andthe network parameters of the central network target Q-DNN n are calculated, and then the parameters of the central network Q-DNN are assigned to the corresponding target Q-DNN every X steps, as shown in the following:
finally the updated network parameter alphanAnd betanAnd downloading to the local to realize the updating of the network parameters of the Q-DNN n'.
Parameter update of the local actor DNN n': at time t, using the same sample data, the network parameter δ of the central critic DNN is updated by minimizing the loss function between the critic output and the target value y(t) = r(t) + γQ(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-), where δ^- is the network parameter of the target critic DNN.
Then, the network parameters of the target critic DNN are updated in a soft-update manner:
δ^- ← τ_c δ + (1 − τ_c) δ^-, where 0 < τ_c ≪ 1 and τ_c represents the learning rate of the target critic DNN network. Thereafter, the network parameter θ_n of the central actor DNN is trained by maximizing the expectation of the global reward.
Similar to the target critic network, the network parameters of the target actor DNN are updated as θ_n^- ← τ_n θ_n + (1 − τ_n) θ_n^-, where 0 < τ_n ≪ 1 and τ_n represents the learning rate of the target actor DNN network. Finally, the updated network parameter θ_n is downloaded to the local side to update the actor DNN n' network parameters.
Step 6: a decision-driven mechanism.
The decision-driven mechanism of the invention monitors the state of the system: when the states of two consecutive time slots differ noticeably, a new round of learning is triggered; otherwise, the action output in the previous time slot is reused as the optimal resource allocation action for the current time slot.
Specifically: a state error threshold ρ is set, and the state change Δ(t) = ||s(t) − s(t−1)||_2 is computed, where s(t) denotes the state of the current time slot and s(t−1) the state of the previous time slot. By monitoring the current state and comparing it with that of the previous slot, the decision-driven module decides whether to perform a new round of learning: if Δ(t) ≤ ρ, base station n outputs a_n(t−1), the action of the previous slot; otherwise it outputs the action obtained after a new round of learning.
As shown in fig. 1-5, considering multi-cell eMBB and URLLC system scenarios, a sub-channel and power allocation scheme for each user is jointly optimized, and main parameters of the simulation scenario of this embodiment are shown in table 1.
TABLE 1 Main parameters of the System
Fig. 4 and Fig. 5 compare the algorithm of the present invention with other methods in terms of multi-cell eMBB and URLLC user service satisfaction level and per-execution time cost. The figures show that the system performance of the proposed MADRL and MADRL-DD algorithms is slightly lower than that of an exhaustive method but much higher than that of a common reinforcement learning algorithm and a random method. In addition, the performance of the MADRL-DD algorithm is very close to that of the MADRL algorithm, which shows that the decision-driven module effectively reduces time cost and computational overhead while preserving the service satisfaction level of multi-cell eMBB and URLLC users.
The method effectively reduces the input and output dimension, signaling overhead and computational complexity of the network, well ensures the service satisfaction level of multi-cell eMBB and URLLC users, and further improves the performance of the whole network.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.
Claims (6)
1. A multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning is suitable for the resource allocation problem of multi-cell eMBB and URLLC user systems, and is characterized in that: the multi-cell multi-service resource allocation method comprises the following steps:
step 1: constructing a multi-agent network for solving the multi-cell eMBB and URLLC user system resource allocation problem;
step 2: state acquisition: taking the channel gain information of the eMBB and URLLC users in the cell on the different sub-channels of the different base stations as the current state s_t of the cell;
step 3: sub-channel allocation and power allocation: the local neural network takes the state obtained in step 2 as input and then outputs a local sub-channel allocation action and a local power allocation action;
step 4: feedback acquisition and parameter update: the sub-channel allocation action and power allocation action obtained in step 3 are sent to the system, which gives a reward and moves to the next state; the local network continuously collects the current state, current action, current reward, and next state and uploads them to an experience memory pool; sample data are extracted from the experience memory pool to train the central network parameters, which are then delivered to the local network;
step 5: a decision-driven mechanism monitors the state of the system; when the states of two consecutive time slots differ noticeably, a new round of learning is triggered, and otherwise the action output in the previous time slot is reused as the optimal resource allocation action for the current time slot.
2. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein: the construction of the multi-agent network in the step 1 specifically comprises the following steps:
step 1-1: N Q-DNNs and N actor DNNs are established locally, and each local network outputs local sub-channel allocation actions and power allocation actions according to local channel state information;
step 1-2: a centralized training network is established in the center based on the DDQN and the DDPG, parameters of the network are updated through environment feedback rewards, and then the parameters of the local network are updated;
step 1-3: the agent reaches reward maximization through continuous learning.
3. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein: the specific steps of the sub-channel allocation and the power allocation in the step 3 are as follows:
step 3-1: at the beginning of each time slot, the obtained local state s_n(t) is sent to the corresponding local Q-DNN n' network and actor DNN n' network;
step 3-2: the local Q-DNN n' network uses an ε-greedy strategy to select an action C_n(t) from the local sub-channel allocation action space as the sub-channel allocation scheme for the current time slot;
step 3-3: at the same time, the local actor DNN n' network is activated with the same state as input and outputs the corresponding power allocation action P_n(t) = μ(s_n(t); θ'_n) + Δn(t), where μ(s_n(t); θ'_n) is the policy function of the local actor DNN n' network, θ'_n is its network parameter, and Δn(t) represents a random noise process following a normal distribution;
step 3-4: finally, the local networks output the joint sub-channel and power allocation action:
a(t) = {a_1(t), a_2(t), ..., a_N(t)} = {[C_1(t), P_1(t)], [C_2(t), P_2(t)], ..., [C_N(t), P_N(t)]}.
4. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the feedback acquisition and parameter update in step 4 specifically comprise the following steps:
step 4-1: after executing the joint sub-channel allocation and power allocation action a_n(t) of step 3, each cell moves from the current state s_n(t) to the next state s'_n(t) and obtains a local reward r_n(t), which is fed back to the local network;
step 4-2: the local network continuously collects experiences e_n = {s_n(t), a_n(t), r_n(t), s'_n(t)} and uploads them to the central network;
step 4-3: after receiving the uploaded experiences, the central network stores the global information {s_1(t), ..., s_N(t), a_1(t), ..., a_N(t), r(t), s'_1(t), ..., s'_N(t)} in the experience pool D in a first-in first-out manner, where r(t) is the global reward aggregated from the local rewards r_1(t), ..., r_N(t);
step 4-4: in the central network, a batch of sample data is selected from the experience memory pool to update the central network parameters, and the updated network parameters are finally downloaded to the local networks to update the local network parameters.
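The first-in-first-out experience pool D of step 4-3 can be sketched with a bounded deque; the capacity value is an illustrative assumption, not taken from the patent.

```python
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=10000):
        # a deque with maxlen discards the oldest entry first (FIFO, step 4-3)
        self.buffer = deque(maxlen=capacity)

    def store(self, global_experience):
        # one entry aggregates {s_n(t), a_n(t)} from every cell, the global
        # reward r(t), and the next states s'_n(t)
        self.buffer.append(global_experience)

    def sample(self, batch_size):
        # uniform mini-batch for the central-network update of step 4-4
        return random.sample(list(self.buffer), batch_size)
```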
5. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 4, wherein step 4-4 specifically comprises:
step 4-4-1: parameter update of the local Q-DNN n': at time t, a batch of sample data is selected from the experience memory pool, and the network parameters α_n and β_n of the central network Q-DNN n are updated by minimizing the loss function
L(α_n, β_n) = E[(y_n(t) - Q(s_n(t), C_n(t); α_n, β_n))²],
where y_n(t) = r(t) + γ Q(s'_n(t), argmax_C Q(s'_n(t), C; α_n, β_n); α_n^-, β_n^-),
and α_n^- and β_n^- are the network parameters of the central network target Q-DNN n; every X steps the parameters of the central network Q-DNN are assigned to the corresponding target Q-DNN, i.e., α_n^- ← α_n, β_n^- ← β_n;
finally, the updated network parameters α_n and β_n are downloaded to the local network to update the Q-DNN n' network parameters;
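The double-DQN update of step 4-4-1 can be illustrated numerically: the online Q-network selects the greedy next sub-channel action, while the target network evaluates it. The plain value arrays below are hypothetical stand-ins for the central Q-DNN (parameters α_n, β_n) and its target (α_n^-, β_n^-).

```python
import numpy as np

def ddqn_target(r, gamma, q_online_next, q_target_next):
    # double-DQN decoupling: the online network picks the greedy next action...
    a_star = int(np.argmax(q_online_next))
    # ...and the target network supplies its value for the bootstrap target
    return r + gamma * q_target_next[a_star]

def ddqn_loss(q_pred, y):
    # squared TD error minimized when updating the Q-DNN parameters
    return (y - q_pred) ** 2

y = ddqn_target(r=1.0, gamma=0.9,
                q_online_next=np.array([1.0, 3.0, 2.0]),
                q_target_next=np.array([0.5, 0.2, 0.9]))
# online argmax is action 1, so y = 1.0 + 0.9 * 0.2 = 1.18
```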
step 4-4-2: parameter update of the local actor DNN n': at time t, with the same sample data, the network parameter δ of the central network critic DNN is updated by minimizing the loss function
L(δ) = E[(y(t) - Q(s_1(t), ..., s_N(t), P_1(t), ..., P_N(t); δ))²],
where y(t) = r(t) + γ Q(s'_1(t), ..., s'_N(t), P'_1(t), ..., P'_N(t); δ^-) and δ^- is the network parameter of the target critic DNN;
step 4-4-3: the network parameter of the target critic DNN is updated in a soft-update manner:
δ^- ← τ_c δ + (1 - τ_c) δ^-,
where 0 < τ_c ≪ 1 and τ_c represents the learning rate of the target critic DNN network;
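The soft update of step 4-4-3 can be sketched directly; the same rule, with τ_n in place of τ_c, applies to the target actor in step 4-4-4.

```python
import numpy as np

def soft_update(delta, delta_target, tau_c):
    # delta^- <- tau_c * delta + (1 - tau_c) * delta^-
    # the target parameters track the online parameters slowly, which
    # stabilizes the bootstrapped critic target y(t)
    return tau_c * delta + (1.0 - tau_c) * delta_target

updated = soft_update(np.array([1.0]), np.array([0.0]), tau_c=0.01)
# one step moves the target only 1% of the way toward the online value
```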
step 4-4-4: the network parameters θ_n of the central network actor DNN n are trained by maximizing the expectation of the global reward, updating θ_n along the deterministic policy gradient
∇_{θ_n} J ≈ E[∇_{P_n} Q(s_1(t), ..., s_N(t), P_1(t), ..., P_N(t); δ) ∇_{θ_n} μ(s_n(t); θ_n)];
the network parameters of the target actor DNN are updated as
θ_n^- ← τ_n θ_n + (1 - τ_n) θ_n^-,
where 0 < τ_n ≪ 1 and τ_n represents the learning rate of the target actor DNN network;
step 4-4-5: the updated network parameters θ_n are downloaded to the local network to update the actor DNN n' network parameters.
6. The multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning as claimed in claim 1, wherein the decision driving mechanism in step 5 specifically comprises:
step 5-1: a state error threshold ρ is set, and the state error of the current time slot is computed as Δ(t) = ||s(t) - s(t-1)||_2, where s(t) represents the state of the current time slot and s(t-1) represents the state of the previous time slot;
step 5-2: the decision driving module determines whether to perform a new round of learning by monitoring the current state and comparing it with the state of the previous time slot: if Δ(t) > ρ, a new round of learning and decision is triggered; otherwise, the allocation decision of the previous time slot is retained.
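The trigger condition of the decision driving mechanism in claim 6 can be sketched as a single comparison of the state change against the threshold ρ; the threshold value below is an illustrative assumption.

```python
import numpy as np

def should_relearn(s_t, s_prev, rho):
    # a new round of learning is triggered only when the state has moved
    # (in L2 norm) by more than the threshold rho since the previous slot
    return float(np.linalg.norm(s_t - s_prev)) > rho
```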
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111512524.5A CN114189870A (en) | 2021-12-08 | 2021-12-08 | Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114189870A true CN114189870A (en) | 2022-03-15 |
Family
ID=80604542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111512524.5A Pending CN114189870A (en) | 2021-12-08 | 2021-12-08 | Multi-cell multi-service resource allocation method based on multi-agent deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114189870A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115038155A (en) * | 2022-05-23 | 2022-09-09 | 香港中文大学(深圳) | Ultra-dense multi-access-point dynamic cooperative transmission method |
CN115038155B (en) * | 2022-05-23 | 2023-02-07 | 香港中文大学(深圳) | Ultra-dense multi-access-point dynamic cooperative transmission method |
CN115002720A (en) * | 2022-06-02 | 2022-09-02 | 中山大学 | Internet of vehicles channel resource optimization method and system based on deep reinforcement learning |
CN116367223A (en) * | 2023-03-30 | 2023-06-30 | 广州爱浦路网络技术有限公司 | XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium |
CN116367223B (en) * | 2023-03-30 | 2024-01-02 | 广州爱浦路网络技术有限公司 | XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium |
CN117614573A (en) * | 2024-01-23 | 2024-02-27 | 中国人民解放军战略支援部队航天工程大学 | Combined power channel allocation method, system and equipment based on deep reinforcement learning |
CN117614573B (en) * | 2024-01-23 | 2024-03-26 | 中国人民解放军战略支援部队航天工程大学 | Combined power channel allocation method, system and equipment based on deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||