CN113316154A - Authorized and unauthorized D2D communication resource joint intelligent distribution method - Google Patents
- Publication number
- CN113316154A (application number CN202110581716.5A)
- Authority
- CN
- China
- Prior art keywords
- agent
- action
- user
- reward
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/02—Resource partitioning among network components, e.g. reuse partitioning
- H04W16/10—Dynamic resource partitioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/06—Testing, supervising or monitoring using simulated traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/0215—Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices
- H04W28/0221—Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices power availability or consumption
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources, belonging to the field of D2D communication. The invention comprises the following steps: S1: establishing a D2D user communication model; S2: establishing an objective function to be optimized; S3: establishing a multi-agent deep reinforcement learning D2D communication model; S4: setting the action set, state set and reward function of the multiple agents; S5: each agent takes an action according to its own Actor network and obtains the state, the reward and the next state; S6: calculating the TD error of the Critic network, updating the parameters of the Critic network, calculating the counterfactual baseline of each agent with the Critic network, updating the parameters of the Actor networks through the counterfactual baselines, and updating the state; S7: repeating steps S5-S6 until the target state is reached. Under the multi-agent deep reinforcement learning framework, the agents continuously interact with the environment, train and learn, and find the optimal strategy; the actions corresponding to the adopted strategy achieve the optimal throughput while the communication quality of WiFi and cellular users is guaranteed.
Description
Technical Field
The invention belongs to the field of D2D communication, and relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources.
Background
With the rapid popularization of intelligent terminals, the number of connected devices, including smartphones, tablets, Internet of Vehicles devices, Internet of Things devices and the like, is expected to reach 50 billion, and data traffic is expected to grow 1000-fold, making the evolution of wireless communication technology increasingly urgent. On the one hand, wireless communication performance indexes such as network capacity and spectrum efficiency need to be improved significantly to cope with the explosive growth of data traffic; on the other hand, cellular communication applications need to be expanded and the end-user experience improved.
Faced with explosively growing data traffic and a licensed spectrum that is already almost fully allocated, expanding into new spectrum to increase system capacity, so as to accommodate the rapid growth of mobile traffic and the diversification of services, has become the primary goal. The unlicensed spectrum mainly comprises the 2.4 GHz band, the 5 GHz band and the 60 GHz millimeter-wave band. Since free-space loss increases with frequency, only bands below 6 GHz are considered able to resist path loss well. Below 6 GHz, the 2.4 GHz band is already densely occupied by WiFi, Bluetooth and other wireless technologies and suffers complex interference, while the 5 GHz band offers close to 500 MHz of available spectrum, only part of which is occupied by WiFi, so its utilization is low. Therefore, D2D can be deployed in the 5 GHz band as long as the interference between D2D and WiFi is kept within an acceptable range. The 5 GHz unlicensed channels experience large fading and are suitable for short-range communication, and D2D, as a proximity service, fits this band very well. Compared with other short-range technologies operating in unlicensed bands (WiFi Direct, Bluetooth, Zigbee and the like), D2D communication has clear advantages: user pairing and the allocation of channels and power are controlled by the base station, so access is more efficient and more secure. Deploying D2D communication in the unlicensed band improves the utilization of unlicensed spectrum and enables seamless integration with the existing cellular system. Similar to the LTE-U technology, D2D communication initially operated only in the licensed spectrum and has no coexistence mechanism with the WiFi system; if the unlicensed band were accessed directly, the performance of the WiFi system would be seriously affected, so harmonious coexistence between D2D users and the original WiFi users must also be ensured when D2D communication is implemented in the unlicensed band.
Existing D2D communication is mainly deployed in the licensed band, and joint deployment of D2D in the unlicensed and licensed bands is rarely considered. Choosing between licensed and unlicensed operation for D2D users on the premise of guaranteeing the minimum communication quality of WiFi users, together with D2D communication resource allocation, is an NP-hard problem that traditional algorithms find difficult to solve. Using the currently very popular machine learning methods to solve this problem, which is hard for traditional algorithms, therefore has very important research significance.
Disclosure of Invention
In view of this, the invention provides a joint intelligent allocation method for authorized and unauthorized D2D communication resources, which solves the joint licensed and unlicensed spectrum selection and D2D communication resource allocation problem that is difficult to solve with traditional algorithms.
In order to achieve the purpose, the invention provides the following technical scheme:
a joint intelligent allocation method for authorized and unauthorized D2D communication resources comprises the following steps:
s1: establishing a D2D user communication model;
s2: establishing an objective function to be optimized;
s3: establishing a multi-agent deep reinforcement learning D2D communication model;
s4: setting an action set, a state set and a reward function of the multi-agent;
s5: the intelligent agent takes action according to the Actor network of the intelligent agent to obtain the state, the reward and the next state;
s6: and calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of each agent by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state.
S7: steps S5-S6 are repeated until the target state is reached.
Further, in step S1, the number of D2D pairs that can access the WiFi band is calculated; however, selecting which D2D pairs leave the licensed band is a mode-selection problem, and the power and channel selection of the D2D pairs remaining in the licensed band still has a serious impact on the cellular users.
Multiplexing the licensed band: in this mode, the two devices of a D2D pair may multiplex the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k multiplexing the channel of cellular user m is:

$$R_{k,m}^{D}=\log_{2}\!\left(1+\frac{p_{k,m}\,g_{k}}{N_{0}B_{C}+p_{m}^{C}h_{k,m}}\right)$$

where $p_{k,m}$ is the transmit power of the kth D2D pair, $p_{m}^{C}$ is the transmit power of cellular user m, $g_{k}$ is the channel gain of D2D pair k, $B_{C}$ is the licensed channel bandwidth, $N_{0}$ is the noise power density, and $h_{k,m}$ is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is multiplexed by D2D pair k is:

$$R_{m}^{C}=\log_{2}\!\left(1+\frac{p_{m}^{C}\,g_{m,B}}{N_{0}B_{C}+p_{k,m}h_{k,B}}\right)$$

where $g_{m,B}$ is the channel power gain between cellular user m and the base station and $h_{k,B}$ is the channel gain between the transmitter of D2D pair k and the base station.
The presence of D2D communication has a great influence on cellular and WiFi users. It is therefore proposed to first determine, under the constraint of the minimum WiFi user throughput, the maximum number of D2D pairs that can access the WiFi unlicensed band, and then to perform mode selection and resource allocation over all D2D users, so as to reduce as much as possible the degradation that D2D communication causes to cellular and WiFi users.
When $x_{i}=1$, D2D pair i multiplexes the channel of an uplink cellular user; when $x_{i}=0$, D2D pair i accesses the WiFi unlicensed band.

When $\theta_{i,m}=1$, D2D pair i multiplexes the channel of uplink cellular user m; $\theta_{i,m}=0$ indicates that D2D pair i does not multiplex the channel of uplink cellular user m.
One channel can be multiplexed by several D2D pairs, but each D2D pair can select only one channel for data transmission.
Further, in step S2, in order to maximize the total throughput of the cellular users and of the D2D users in the licensed band, the optimization problem is:

$$\max_{\{x_{k},\,\theta_{k,m},\,p_{k,m}\}}\ \sum_{m}R_{m}^{C}+\sum_{k}x_{k}\sum_{m}\theta_{k,m}R_{k,m}^{D}$$

$$\text{s.t.}\ \ x_{k}\in\{0,1\},\ \theta_{k,m}\in\{0,1\}$$

$$0\le p_{k,m}\le p_{\max}$$

$$R^{\mathrm{WiFi}}\ge R_{\min}^{\mathrm{WiFi}}$$

$$\mathrm{SINR}_{k}^{D}\ge\mathrm{SINR}_{\min}^{D},\ \ \mathrm{SINR}_{m}^{C}\ge\mathrm{SINR}_{\min}^{C}$$
The first constraint represents the D2D users' selection between licensed and unlicensed access, the second represents the power limit of the D2D users, the third represents the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and the cellular users meet the minimum signal-to-noise-ratio requirements.
Further, in step S3, in order to solve the NP-hard D2D communication resource allocation problem, a multi-agent reinforcement learning method, COMA (Counterfactual Multi-Agent policy gradients), is adopted. It models the multi-agent environment as a Markov game to optimize the policies while taking the behavior policies of the other agents into account. The method marginalizes the effect of a single agent on the reward by comparing the action taken by the agent at a time step t with all other actions it could have taken at t, which is realized by a centralized Critic; the value function is therefore the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action and one can observe whether the overall return increases or decreases relative to that default action: if it increases, the agent's current action is better than the default action; if it decreases, it is worse. This default action is referred to as the baseline. The remaining problem is how to determine the default action: confirming a default action in some way, and evaluating its quality, requires additional simulation and increases the computational complexity. Instead of using default actions, COMA computes the baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed. In this way COMA avoids designing extra default actions and extra simulation calculations, and each agent better learns strategies that require multi-agent coordination. COMA provides an efficient algorithm to perform credit assignment on the reward function, and the deep learning training process incurs a large computational overhead. The training process is therefore completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of its actions. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
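The counterfactual baseline described above can be sketched as follows; this is a minimal illustration assuming a tabular centralized Critic, and the function names and data layout are illustrative rather than taken from the patent:

```python
def counterfactual_baseline(q_joint, policy_j, joint_action, agent_j):
    """Marginalize agent j's action out of the centralized Q(s, u).

    q_joint      : callable mapping a joint-action tuple to Q(s, u)
    policy_j     : policy_j[a] = probability that agent j selects action a
    joint_action : tuple with every agent's current action
    agent_j      : index of the agent being evaluated
    """
    baseline = 0.0
    for a, prob in enumerate(policy_j):
        # swap only agent j's action; the other agents' actions stay fixed
        counterfactual = list(joint_action)
        counterfactual[agent_j] = a
        baseline += prob * q_joint(tuple(counterfactual))
    return baseline

def counterfactual_advantage(q_joint, policy_j, joint_action, agent_j):
    # A_j(s, u) = Q(s, u) - sum_a pi_j(a) * Q(s, (u_-j, a))
    return q_joint(joint_action) - counterfactual_baseline(
        q_joint, policy_j, joint_action, agent_j)
```

For example, `q_joint` can be a lookup into a small table of joint-action values and `policy_j` the probability vector produced by agent j's Actor network.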
Further, in step S4, in the RL model of the D2D underlay communication, the D2D agents interact with the environment and take corresponding actions according to their policies. At each time t, a D2D agent observes a state $s_t$ from the state space S and takes a corresponding action (mode selection, RB selection, power-level selection) from the action space A according to the policy π. After this action is performed, the environment enters a new state $s_{t+1}$ and the agent receives a reward.
State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time, $s_{t}=\{\mathrm{SINR}_{1}^{D},\mathrm{SINR}_{2}^{D},\ldots,\mathrm{SINR}_{K}^{D}\}$.

Action space A: the number of licensed/unlicensed mode selections is α = 2, the number of power levels is β = 10, and the number of RB selections is η = 20, so the size of each agent's action space is α × β × η = 2 × 10 × 20 = 400.
The reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the agent selects the unlicensed band, its reward is set to a positive value, but a large negative value is given when the number of D2D pairs exceeds the maximum admissible access number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user both larger than the set thresholds, the reward is the sum of the corresponding D2D rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent makes the SINR of the D2D pair or the cellular user smaller than the set threshold, the signal cannot be decoded and the reward is a negative value.
A penalty function is designed to limit the number of D2D pairs that access the unlicensed mode, and the SINRs of the D2D pairs and the cellular users (CUs) are likewise constrained through the reward.
Further, in step S5, the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters λ, φ_0, φ_1, ..., φ_J, θ_0, θ_1, ..., θ_J of the Actor and Critic networks are initialized first. Each agent takes the action with the highest probability under its own policy network as the action in the current state; the actions of all agents are combined so that, from the environment state s_t, the joint action u_t, the D2D SINR reward r_t and the next state s_{t+1} are obtained.
Further, in step S6, the Monte Carlo method provides a basis for model-free learning, but it is still limited to episodic, offline learning. The TD-error method bridges the gap between the Monte Carlo method and dynamic programming and is the core idea of RL: the TD method can also learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuing environment. The TD error is calculated from the Critic network as:

$$\delta_{t}=r_{t}+\gamma\,Q_{\lambda}(s_{t+1},u_{t+1})-Q_{\lambda}(s_{t},u_{t})$$

The policy parameters are updated based on the policy gradient, and the Critic parameters are updated by gradient ascent using the TD error:

$$\lambda_{t+1}=\lambda_{t}+\alpha_{\lambda}\nabla_{\lambda}Q_{\lambda}(s_{t},u_{t})\,\delta_{t}$$
This solves the credit assignment problem in the multi-agent setting. The COMA algorithm solves the credit assignment problem with a counterfactual baseline: it marginalizes the effect of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could take at t, through the centralized Critic, so that the value function is the same for all agents but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the jth agent is defined as:

$$A_{j}(s,u)=Q(s,u)-\sum_{u'_{j}}\pi_{j}\!\left(u'_{j}\mid\tau_{j}\right)Q\!\left(s,\left(u_{-j},u'_{j}\right)\right)$$

The jth agent updates its Actor network parameters through the counterfactual baseline according to:

$$\theta_{t+1}=\theta_{t}+\alpha_{\theta}\nabla_{\theta}\log\pi_{\theta}\!\left(u_{j}\mid\tau_{j}\right)A_{j}(s,u)$$
Each agent then updates its network according to the advantage function obtained from the Critic network.
Further, in step S7, the training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of the actions taken. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
The invention has the following beneficial effects: in the D2D resource allocation problem, the unlicensed band, the spectrum and the power are jointly allocated to the D2D users. The number of D2D pairs admissible to the unlicensed band is determined on the premise of guaranteeing the minimum WiFi communication quality; the D2D pairs entering the unlicensed band are then determined, and power and spectrum are allocated to the D2D pairs remaining in the licensed band, so that the throughput of the cellular users and of the D2D users in the licensed band is maximized. An effective multi-agent deep reinforcement learning algorithm is provided, solving an NP-hard problem that is difficult to solve with traditional algorithms.
Drawings
In order to make the object, technical scheme and beneficial effects of the invention clearer, the invention provides the following drawings for illustration:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a network model for D2D communication;
FIG. 3 is a diagram of an AC network framework model according to an embodiment of the present invention;
FIG. 4 is a COMA model diagram of a multi-agent deep reinforcement learning algorithm according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a joint intelligent allocation method for authorized and unauthorized D2D communication resources, aimed at uplink data transmission in a cellular network and a D2D network. To obtain the number of D2D pairs that may access the unlicensed band, a two-dimensional time Markov model is established and used to determine that number on the premise of guaranteeing the WiFi communication quality. After this number is obtained, a subset of D2D users must be selected to move to the unlicensed band, and power and spectrum must be allocated to the remaining D2D pairs in the licensed band. To maximize the throughput of the D2D users and cellular users in the licensed band, the multi-agent deep reinforcement learning COMA algorithm is proposed: the multi-agent environment is modeled as a Markov game to optimize the policies while the action policies of other agents are taken into account. The method marginalizes the influence of a single agent on the reward and compares the action taken by the agent at a time step t with all other actions it could take at t, which is realized by a centralized Critic; the value function of all agents is therefore the same, but each agent receives a customized error term based on its counterfactual actions. In this way COMA avoids designing additional default actions and additional simulation calculations, so each agent better learns strategies that require multi-agent coordination. COMA provides an efficient algorithm to perform credit assignment on the reward function, and the deep learning training process incurs a large computational overhead. Therefore the training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the base station to evaluate the quality of the actions taken. During distributed execution, the D2D users obtain counterfactual baselines from the BS to update their own Actor networks; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy. A flow chart of the multi-agent deep reinforcement learning method for joint licensed and unlicensed D2D communication resource allocation is shown in Fig. 1.
A network model for D2D communication is shown in Fig. 2. When a D2D user transmits in the licensed band, it multiplexes a channel with an existing cellular user, causing interference between the D2D user and the cellular user; when the D2D user chooses to operate in the unlicensed band, it affects the communication quality of the users in the WiFi band. Considering that D2D users deployed in the unlicensed band access it via LBT, the D2D users and the WiFi users can be modeled as a two-dimensional Markov model, and the number of D2D pairs that can access the unlicensed band can then be determined on the premise of guaranteeing the communication quality of the WiFi users. The communication links between devices share uplink resources, since channel interference is much easier to handle in the uplink than in the downlink. To maximize the system capacity in the licensed band, the same channel may be used by multiple D2D pairs, but each D2D pair can select only one channel for multiplexing. Licensed/unlicensed selection and power and spectrum allocation therefore have to be performed for the D2D users; this is an NP-hard problem, so a machine learning method is used to solve it: the D2D users are regarded as agents, the actions are the licensed/unlicensed selection and the power and spectrum selection, the joint state is the SINR of all D2D users, and a reasonable reward function is set for the multiple agents. The agents continuously interact with the environment to select actions, update states and update network parameters; they keep learning in the environment and select the corresponding actions with maximum reward.
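As a rough illustration of this admission step, the following sketch searches for the largest number of D2D pairs that keeps WiFi throughput above its minimum; the saturation-throughput function here is only an assumed placeholder for the result of the two-dimensional Markov (LBT) analysis, and all names and values are illustrative:

```python
def wifi_throughput(num_wifi, num_d2d):
    """Placeholder saturation-throughput model: more LBT contenders
    reduce the WiFi share. Replace with the Markov-chain result."""
    total_contenders = num_wifi + num_d2d
    return num_wifi / total_contenders  # normalized aggregate WiFi share

def max_unlicensed_d2d(num_wifi, wifi_min, limit=100):
    """Largest number of D2D pairs keeping WiFi throughput above wifi_min."""
    best = 0
    for n in range(limit + 1):
        if wifi_throughput(num_wifi, n) >= wifi_min:
            best = n
        else:
            break
    return best

print(max_unlicensed_d2d(num_wifi=10, wifi_min=0.6))  # prints 6 with this toy model
```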
As shown in fig. 1, a joint intelligent allocation method for authorized and unauthorized D2D communication resources includes the following steps:
s1: establishing a D2D user communication model;
s2: establishing an objective function to be optimized;
s3: establishing a multi-agent deep reinforcement learning D2D communication model;
s4: setting an action set, a state set and a reward function of the multi-agent;
s5: the intelligent agent takes action according to the Actor network of the intelligent agent to obtain the state, the reward and the next state;
s6: and calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of each agent by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state.
S7: steps S5-S6 are repeated until the target state is reached.
To improve spectral efficiency, D2D users reuse cellular user resources in the licensed band, thereby causing interference to cellular users. Multiplexing the licensed band: in this mode, the two devices of a D2D pair may multiplex the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k multiplexing the channel of cellular user m is:

$$R_{k,m}^{D}=\log_{2}\!\left(1+\frac{p_{k,m}\,g_{k}}{N_{0}B_{C}+p_{m}^{C}h_{k,m}}\right)$$

where $p_{k,m}$ is the transmit power of the kth D2D pair, $p_{m}^{C}$ is the transmit power of cellular user m, $g_{k}$ is the channel gain of D2D pair k, $B_{C}$ is the licensed channel bandwidth, $N_{0}$ is the noise power density, and $h_{k,m}$ is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is multiplexed by D2D pair k is:

$$R_{m}^{C}=\log_{2}\!\left(1+\frac{p_{m}^{C}\,g_{m,B}}{N_{0}B_{C}+p_{k,m}h_{k,B}}\right)$$

where $g_{m,B}$ is the channel power gain between cellular user m and the base station and $h_{k,B}$ is the channel gain between the transmitter of D2D pair k and the base station.
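A small numeric sketch of these two expressions, with example values for the powers and gains (all variable names and numbers below are illustrative assumptions):

```python
import math

def d2d_spectral_efficiency(p_km, g_k, p_cm, h_km, n0, b_c):
    """R^D_{k,m} = log2(1 + p_km * g_k / (N0 * B_C + p_cm * h_km))."""
    return math.log2(1.0 + p_km * g_k / (n0 * b_c + p_cm * h_km))

def cellular_spectral_efficiency(p_cm, g_mb, p_km, h_kb, n0, b_c):
    """R^C_m = log2(1 + p_cm * g_mB / (N0 * B_C + p_km * h_kB))."""
    return math.log2(1.0 + p_cm * g_mb / (n0 * b_c + p_km * h_kb))

# example: D2D pair k reusing the uplink resource block of cellular user m
print(d2d_spectral_efficiency(p_km=0.1, g_k=1e-6, p_cm=0.2, h_km=1e-9,
                              n0=1e-12, b_c=180e3))
print(cellular_spectral_efficiency(p_cm=0.2, g_mb=1e-7, p_km=0.1, h_kb=1e-9,
                                   n0=1e-12, b_c=180e3))
```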
The presence of D2D communication has a great influence on cellular and WiFi users. It is therefore proposed to first determine, under the constraint of the minimum WiFi user throughput, the maximum number of D2D pairs that can access the WiFi unlicensed band, and then to perform mode selection and resource allocation for the D2D users, so as to reduce as much as possible the degradation that D2D communication causes to cellular and WiFi users.

When $x_{i}=1$, D2D pair i multiplexes the channel of an uplink cellular user; when $x_{i}=0$, D2D pair i accesses the WiFi unlicensed band.

When $\theta_{i,m}=1$, D2D pair i multiplexes the channel of uplink cellular user m; $\theta_{i,m}=0$ indicates that D2D pair i does not multiplex the channel of uplink cellular user m.
In order to maximize the total throughput of the cellular users and of the D2D users in the licensed band, the optimization problem is:

$$\max_{\{x_{k},\,\theta_{k,m},\,p_{k,m}\}}\ \sum_{m}R_{m}^{C}+\sum_{k}x_{k}\sum_{m}\theta_{k,m}R_{k,m}^{D}$$

$$\text{s.t.}\ \ x_{k}\in\{0,1\},\ \theta_{k,m}\in\{0,1\}$$

$$0\le p_{k,m}\le p_{\max}$$

$$R^{\mathrm{WiFi}}\ge R_{\min}^{\mathrm{WiFi}}$$

$$\mathrm{SINR}_{k}^{D}\ge\mathrm{SINR}_{\min}^{D},\ \ \mathrm{SINR}_{m}^{C}\ge\mathrm{SINR}_{\min}^{C}$$

The first constraint represents the D2D users' selection between licensed and unlicensed access, the second represents the power limit of the D2D users, the third represents the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and the cellular users meet the minimum signal-to-noise-ratio requirements.
In order to solve the NP-hard problem in D2D communication resource allocation, the multi-agent deep reinforcement learning COMA method is used here. It models the multi-agent environment as a Markov game to optimize the policies while taking the behavior policies of the other agents into account. The method marginalizes the influence of a single agent on the reward by comparing the action taken by the agent at a time step t with all other actions it could have taken at t, which is realized by a centralized Critic; the value function is therefore the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action and one can observe whether the overall return increases or decreases relative to that default action: if it increases, the agent's current action is better than the default action; if it decreases, it is worse. This default action is referred to as the baseline. The remaining problem is how to determine the default action: confirming a default action in some way, and evaluating its quality, requires additional simulation and increases the computational complexity. Instead of using default actions, COMA computes the baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed. In this way COMA avoids designing extra default actions and extra simulation calculations, and each agent better learns strategies that require multi-agent coordination. COMA provides an efficient algorithm to perform credit assignment on the reward function, and the deep learning training process incurs a large computational overhead. The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of its actions. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy. The AC network model is shown in Fig. 3.
In the RL model of the D2D underlay communication, the D2D agents are built on the AC (Actor-Critic) network framework: they interact with the environment and take corresponding actions according to their policies. At each time t, a D2D agent observes a state $s_t$ from the state space S and takes a corresponding action (mode selection, RB selection, power-level selection) from the action space A according to the policy π. After this action is performed, the environment enters a new state $s_{t+1}$ and the agent receives a reward.
State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time, $s_{t}=\{\mathrm{SINR}_{1}^{D},\mathrm{SINR}_{2}^{D},\ldots,\mathrm{SINR}_{K}^{D}\}$.

Action space A: the number of licensed/unlicensed mode selections is α = 2, the number of power levels is β = 10, and the number of RB selections is η = 20, so the size of each agent's action space is α × β × η = 2 × 10 × 20 = 400.
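A sketch of this discrete per-agent action space (2 mode choices × 10 power levels × 20 resource blocks = 400 actions); the encoding below is an illustrative assumption:

```python
from itertools import product

MODES = [0, 1]                    # 0: WiFi unlicensed band, 1: reuse a licensed RB
POWER_LEVELS = list(range(10))    # 10 discrete transmit-power levels
RBS = list(range(20))             # 20 candidate resource blocks

ACTIONS = list(product(MODES, POWER_LEVELS, RBS))
assert len(ACTIONS) == 400        # alpha * beta * eta = 2 * 10 * 20

def decode_action(index):
    """Map a flat action index back to (mode, power level, resource block)."""
    return ACTIONS[index]

print(decode_action(123))
```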
The reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the agent selects the unlicensed band, its reward is set to a positive value, but a large negative value is given when the number of D2D pairs exceeds the maximum admissible access number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user both larger than the set thresholds, the reward is the sum of the corresponding D2D rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent makes the SINR of the D2D pair or the cellular user smaller than the set threshold, the signal cannot be decoded and the reward is a negative value.

A penalty function is designed to limit the number of D2D pairs that access the unlicensed mode, and the SINRs of the D2D pairs and the cellular users (CUs) are likewise constrained through the reward.
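A sketch of the three-part reward just described; the threshold, bonus and penalty values are illustrative assumptions, not values fixed by the patent:

```python
def reward(mode_unlicensed, num_unlicensed, max_unlicensed,
           d2d_sinr, cell_sinr, d2d_rate, cell_rate,
           sinr_min=5.0, bonus=1.0, overload_penalty=-10.0, decode_penalty=-5.0):
    """Three parts: mode selection, rates, and SINR feasibility."""
    if mode_unlicensed:
        # positive reward for off-loading to the unlicensed band, large penalty
        # if the admissible number of unlicensed D2D pairs is exceeded
        return bonus if num_unlicensed <= max_unlicensed else overload_penalty
    if d2d_sinr >= sinr_min and cell_sinr >= sinr_min:
        # both links decodable: reward is the D2D rate plus the rate of the
        # cellular user whose spectrum is reused
        return d2d_rate + cell_rate
    return decode_penalty  # SINR below threshold, signal cannot be decoded
```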
First, the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters λ, φ_0, φ_1, ..., φ_J, θ_0, θ_1, ..., θ_J of the Actor and Critic networks are initialized. Each agent takes the action with the highest probability under its own policy network as the action in the current state; the actions of all agents are combined so that, from the environment state s_t, the joint action u_t, the D2D SINR reward r_t and the next state s_{t+1} are obtained.
The Monte Carlo method provides a basis for model-free learning, but it is still limited to episodic, offline learning. The TD-error method bridges the gap between the Monte Carlo method and dynamic programming and is the core idea of RL: the TD method can also learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuing environment. The TD error is calculated from the Critic network as:

$$\delta_{t}=r_{t}+\gamma\,Q_{\lambda}(s_{t+1},u_{t+1})-Q_{\lambda}(s_{t},u_{t})$$

The policy parameters are updated based on the policy gradient, and the Critic parameters are updated by gradient ascent using the TD error:

$$\lambda_{t+1}=\lambda_{t}+\alpha_{\lambda}\nabla_{\lambda}Q_{\lambda}(s_{t},u_{t})\,\delta_{t}$$
This solves the credit assignment problem in the multi-agent setting. The COMA algorithm solves the credit assignment problem with a counterfactual baseline: it marginalizes the effect of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could take at t, through the centralized Critic, so that the value function is the same for all agents but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the jth agent is defined as:

$$A_{j}(s,u)=Q(s,u)-\sum_{u'_{j}}\pi_{j}\!\left(u'_{j}\mid\tau_{j}\right)Q\!\left(s,\left(u_{-j},u'_{j}\right)\right)$$

The jth agent updates its Actor network parameters through the counterfactual baseline according to:

$$\theta_{t+1}=\theta_{t}+\alpha_{\theta}\nabla_{\theta}\log\pi_{\theta}\!\left(u_{j}\mid\tau_{j}\right)A_{j}(s,u)$$
a schematic diagram of a multi-agent deep reinforcement learning COMA algorithm is shown in FIG. 4.
The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of the actions taken. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
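Steps S5-S7 can be tied together in a compact training-loop sketch; `env`, the agent objects and the Critic object are placeholders for the base-station and D2D-user implementations and are not defined in the patent:

```python
def train(env, agents, critic, episodes=1000, steps_per_episode=200):
    """Centralized training (Critic at the BS), decentralized execution (Actors at D2D pairs)."""
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps_per_episode):
            # S5: every agent picks an action from its own Actor network
            joint_action = [agent.act(state) for agent in agents]
            next_state, reward, done = env.step(joint_action)

            # S6: the centralized Critic computes the TD error and each agent's
            # counterfactual advantage, then the Actors are updated
            advantages = critic.update(state, joint_action, reward, next_state)
            for agent, adv in zip(agents, advantages):
                agent.update(state, adv)

            state = next_state      # S7: repeat until the target state is reached
            if done:
                break
    return agents
```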
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.
Claims (8)
1. A joint intelligent allocation method for authorized and unauthorized D2D communication resources comprises the following steps:
s1: establishing a D2D user communication model;
s2: establishing an objective function to be optimized;
s3: establishing a multi-agent deep reinforcement learning D2D communication model;
s4: setting an action set, a state set and a reward function of the multi-agent;
s5: the intelligent agent takes action according to the Actor network of the intelligent agent to obtain the state, the reward and the next state;
s6: and calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of each agent by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state.
S7: steps S5-S6 are repeated until the target state is reached.
2. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 1, wherein in step S1 the number of D2D pairs that can access the WiFi band is calculated; selecting which D2D pairs move to the WiFi band is a mode-selection problem, and the power and channel selection of the remaining D2D pairs still has a serious impact on cellular users (CUs).
Multiplexing the licensed band: in this mode, the two devices of a D2D pair may multiplex the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k multiplexing the channel of cellular user m is:

$$R_{k,m}^{D}=\log_{2}\!\left(1+\frac{p_{k,m}\,g_{k}}{N_{0}B_{C}+p_{m}^{C}h_{k,m}}\right)$$

where $p_{k,m}$ is the transmit power of the kth D2D pair, $p_{m}^{C}$ is the transmit power of cellular user m, $g_{k}$ is the channel gain of D2D pair k, $B_{C}$ is the licensed channel bandwidth, $N_{0}$ is the noise power density, and $h_{k,m}$ is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is multiplexed by D2D pair k is:

$$R_{m}^{C}=\log_{2}\!\left(1+\frac{p_{m}^{C}\,g_{m,B}}{N_{0}B_{C}+p_{k,m}h_{k,B}}\right)$$

where $g_{m,B}$ is the channel power gain between cellular user m and the base station and $h_{k,B}$ is the channel gain between the transmitter of D2D pair k and the base station.
The presence of D2D communication has a great influence on cellular and WiFi users. It is therefore proposed to first determine, under the constraint of the minimum WiFi user throughput, the maximum number of D2D pairs that can access the WiFi unlicensed band, and then to perform mode selection and resource allocation over all D2D users, so as to reduce as much as possible the degradation that D2D communication causes to cellular and WiFi users.

When $x_{i}=1$, D2D pair i multiplexes the channel of an uplink cellular user; when $x_{i}=0$, D2D pair i accesses the WiFi unlicensed band.

When $\theta_{i,m}=1$, D2D pair i multiplexes the channel of uplink cellular user m; $\theta_{i,m}=0$ indicates that D2D pair i does not multiplex the channel of uplink cellular user m.
3. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 2, wherein in step S2, in order to maximize the total throughput of the cellular users and of the D2D users in the licensed band, the optimization problem is:

$$\max_{\{x_{k},\,\theta_{k,m},\,p_{k,m}\}}\ \sum_{m}R_{m}^{C}+\sum_{k}x_{k}\sum_{m}\theta_{k,m}R_{k,m}^{D}$$

$$\text{s.t.}\ \ x_{k}\in\{0,1\},\ \theta_{k,m}\in\{0,1\}$$

$$0\le p_{k,m}\le p_{\max}$$

$$R^{\mathrm{WiFi}}\ge R_{\min}^{\mathrm{WiFi}}$$

$$\mathrm{SINR}_{k}^{D}\ge\mathrm{SINR}_{\min}^{D},\ \ \mathrm{SINR}_{m}^{C}\ge\mathrm{SINR}_{\min}^{C}$$

The first constraint represents the D2D users' selection between licensed and unlicensed access, the second represents the power limit of the D2D users, the third represents the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and the cellular users meet the minimum signal-to-noise-ratio requirements.
4. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 3, wherein in step S3, in order to solve the NP-hard D2D communication resource allocation problem, a multi-agent reinforcement learning method, COMA (Counterfactual Multi-Agent policy gradients), is adopted: the multi-agent environment is modeled as a Markov game to optimize the policies while the behavior policies of other agents are taken into account. The method marginalizes the effect of a single agent on the reward by comparing the action taken by the agent at a time step t with all other actions it could have taken at t, which is realized by a centralized Critic, so that the value function is the same for all agents but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action and it can be observed whether the overall return increases or decreases relative to that default action: if it increases, the agent's current action is better than the default action, and if it decreases, it is worse; this default action is referred to as the baseline. The remaining question is how to determine the default action: confirming a default action in some way, and evaluating its quality, requires additional simulation and increases the computational complexity. Instead of using default actions, COMA computes this baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed; in this way COMA avoids designing extra default actions and extra simulation calculations, and each agent better learns strategies that require multi-agent coordination. COMA provides an efficient algorithm to perform credit assignment on the reward function, and the deep learning training process incurs a large computational overhead. The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of its actions. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
5. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 4, wherein in step S4, in the RL model of the D2D underlay communication, the D2D agents interact with the environment and take corresponding actions according to their policies. At each time t, a D2D agent observes a state $s_t$ from the state space S and takes a corresponding action (mode selection, RB selection, power-level selection) from the action space A according to the policy π. After this action is performed, the environment enters a new state $s_{t+1}$ and the agent receives a reward. State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time, $s_{t}=\{\mathrm{SINR}_{1}^{D},\mathrm{SINR}_{2}^{D},\ldots,\mathrm{SINR}_{K}^{D}\}$.

Action space A: the number of mode selections is α = 2, the number of power levels is β = 10, and the number of RB selections is η = 20, so the size of each agent's action space is α × β × η = 2 × 10 × 20 = 400.
The reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the agent selects the unlicensed band, its reward is set to a positive value, but a large negative value is given when the number of D2D pairs exceeds the maximum admissible access number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user both larger than the set thresholds, the reward is the sum of the corresponding D2D rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent makes the SINR of the D2D pair or the cellular user smaller than the set threshold, the signal cannot be decoded and the reward is a negative value.

A penalty function is designed to limit the number of D2D pairs that access the unlicensed mode, and the SINRs of the D2D pairs and the cellular users (CUs) are likewise constrained through the reward.
6. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 5, wherein in step S5 the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters λ, φ_0, φ_1, ..., φ_J, θ_0, θ_1, ..., θ_J of the Actor and Critic networks are first initialized. Each agent takes the action with the highest probability under its own policy network as the action in the current state; the actions of all agents are combined so that, from the environment state s_t, the joint action u_t, the D2D SINR reward r_t and the next state s_{t+1} are obtained.
7. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 6, wherein in step S6 the Monte Carlo method provides a basis for model-free learning but is still limited to episodic, offline learning. The TD-error method bridges the gap between the Monte Carlo method and dynamic programming and is the core idea of RL: the TD method can also learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuing environment. The TD error is calculated from the Critic network as:

$$\delta_{t}=r_{t}+\gamma\,Q_{\lambda}(s_{t+1},u_{t+1})-Q_{\lambda}(s_{t},u_{t})$$

The policy parameters are updated based on the policy gradient, and the Critic parameters are updated by gradient ascent using the TD error:

$$\lambda_{t+1}=\lambda_{t}+\alpha_{\lambda}\nabla_{\lambda}Q_{\lambda}(s_{t},u_{t})\,\delta_{t}$$
This solves the credit assignment problem in the multi-agent setting. The COMA algorithm solves the credit assignment problem with a counterfactual baseline: it marginalizes the effect of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could take at t, through the centralized Critic, so that the value function is the same for all agents but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the jth agent is defined as:

$$A_{j}(s,u)=Q(s,u)-\sum_{u'_{j}}\pi_{j}\!\left(u'_{j}\mid\tau_{j}\right)Q\!\left(s,\left(u_{-j},u'_{j}\right)\right)$$

The jth agent updates its Actor network parameters through the counterfactual baseline according to:

$$\theta_{t+1}=\theta_{t}+\alpha_{\theta}\nabla_{\theta}\log\pi_{\theta}\!\left(u_{j}\mid\tau_{j}\right)A_{j}(s,u)$$

Each agent then updates its network according to the advantage function obtained from the Critic network.
8. The joint intelligent allocation method for authorized and unauthorized D2D communication resources of claim 7, wherein in step S7 the training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains each agent's policy at the BS to evaluate the quality of the actions taken. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after a sufficient number of training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110581716.5A CN113316154B (en) | 2021-05-26 | 2021-05-26 | Authorized and unauthorized D2D communication resource joint intelligent distribution method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110581716.5A CN113316154B (en) | 2021-05-26 | 2021-05-26 | Authorized and unauthorized D2D communication resource joint intelligent distribution method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113316154A true CN113316154A (en) | 2021-08-27 |
CN113316154B CN113316154B (en) | 2022-06-21 |
Family
ID=77375597
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110581716.5A Active CN113316154B (en) | 2021-05-26 | 2021-05-26 | Authorized and unauthorized D2D communication resource joint intelligent distribution method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113316154B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114363938A (en) * | 2021-12-21 | 2022-04-15 | 重庆邮电大学 | Cellular network flow unloading method |
CN114466386A (en) * | 2022-01-13 | 2022-05-10 | 重庆邮电大学 | Direct access method for D2D communication |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
CN115278593A (en) * | 2022-06-20 | 2022-11-01 | 重庆邮电大学 | Transmission method of unmanned aerial vehicle-non-orthogonal multiple access communication system based on semi-authorization-free protocol |
CN115515101A (en) * | 2022-09-23 | 2022-12-23 | 西北工业大学 | Decoupling Q learning intelligent codebook selection method for SCMA-V2X system |
CN116367332A (en) * | 2023-05-31 | 2023-06-30 | 华信咨询设计研究院有限公司 | Hierarchical control-based D2D resource allocation method under 5G system |
WO2024032228A1 (en) * | 2022-08-12 | 2024-02-15 | 华为技术有限公司 | Reinforcement learning training method and related device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160066337A1 (en) * | 2014-09-03 | 2016-03-03 | Futurewei Technologies, Inc. | System and Method for D2D Resource Allocation |
US20190124667A1 (en) * | 2017-10-23 | 2019-04-25 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Method for allocating transmission resources using reinforcement learning |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
WO2019231289A1 (en) * | 2018-06-01 | 2019-12-05 | Samsung Electronics Co., Ltd. | Method and apparatus for machine learning based wide beam optimization in cellular network |
CN110769514A (en) * | 2019-11-08 | 2020-02-07 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN111556572A (en) * | 2020-04-21 | 2020-08-18 | 北京邮电大学 | Spectrum resource and computing resource joint allocation method based on reinforcement learning |
US20210136785A1 (en) * | 2019-10-30 | 2021-05-06 | University Of Ottawa | System and method for joint power and resource allocation using reinforcement learning |
CN112822781A (en) * | 2021-01-20 | 2021-05-18 | 重庆邮电大学 | Resource allocation method based on Q learning |
- 2021-05-26: CN application CN202110581716.5A, granted as CN113316154B, status active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160066337A1 (en) * | 2014-09-03 | 2016-03-03 | Futurewei Technologies, Inc. | System and Method for D2D Resource Allocation |
US20190124667A1 (en) * | 2017-10-23 | 2019-04-25 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Method for allocating transmission resources using reinforcement learning |
WO2019231289A1 (en) * | 2018-06-01 | 2019-12-05 | Samsung Electronics Co., Ltd. | Method and apparatus for machine learning based wide beam optimization in cellular network |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
US20210136785A1 (en) * | 2019-10-30 | 2021-05-06 | University Of Ottawa | System and method for joint power and resource allocation using reinforcement learning |
CN110769514A (en) * | 2019-11-08 | 2020-02-07 | 山东师范大学 | Heterogeneous cellular network D2D communication resource allocation method and system |
CN111556572A (en) * | 2020-04-21 | 2020-08-18 | 北京邮电大学 | Spectrum resource and computing resource joint allocation method based on reinforcement learning |
CN112822781A (en) * | 2021-01-20 | 2021-05-18 | 重庆邮电大学 | Resource allocation method based on Q learning |
Non-Patent Citations (2)
Title |
---|
Y. Luo: "Dynamic resource allocations based on Q-learning for D2D communication in cellular networks", 2014 11th International Computer Conference on Wavelet Active Media Technology and Information Processing *
Hua Sizhong: "Research on a reward-penalty weighted algorithm for D2D communication resource reuse allocation", Application Research of Computers *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114363938A (en) * | 2021-12-21 | 2022-04-15 | 重庆邮电大学 | Cellular network flow unloading method |
CN114363938B (en) * | 2021-12-21 | 2024-01-26 | 深圳千通科技有限公司 | Cellular network flow unloading method |
CN114466386A (en) * | 2022-01-13 | 2022-05-10 | 重庆邮电大学 | Direct access method for D2D communication |
CN114466386B (en) * | 2022-01-13 | 2023-09-29 | 深圳市晨讯达科技有限公司 | Direct access method for D2D communication |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
CN115278593A (en) * | 2022-06-20 | 2022-11-01 | 重庆邮电大学 | Transmission method of unmanned aerial vehicle-non-orthogonal multiple access communication system based on semi-authorization-free protocol |
WO2024032228A1 (en) * | 2022-08-12 | 2024-02-15 | 华为技术有限公司 | Reinforcement learning training method and related device |
CN115515101A (en) * | 2022-09-23 | 2022-12-23 | 西北工业大学 | Decoupling Q learning intelligent codebook selection method for SCMA-V2X system |
CN116367332A (en) * | 2023-05-31 | 2023-06-30 | 华信咨询设计研究院有限公司 | Hierarchical control-based D2D resource allocation method under 5G system |
CN116367332B (en) * | 2023-05-31 | 2023-09-15 | 华信咨询设计研究院有限公司 | Hierarchical control-based D2D resource allocation method under 5G system |
Also Published As
Publication number | Publication date |
---|---|
CN113316154B (en) | 2022-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113316154B (en) | Authorized and unauthorized D2D communication resource joint intelligent distribution method | |
CN109474980B (en) | Wireless network resource allocation method based on deep reinforcement learning | |
Azari et al. | Risk-aware resource allocation for URLLC: Challenges and strategies with machine learning | |
US12067487B2 (en) | Method and apparatus employing distributed sensing and deep learning for dynamic spectrum access and spectrum sharing | |
Wu et al. | Secrecy-based energy-efficient data offloading via dual connectivity over unlicensed spectrums | |
CN105230070B (en) | A kind of wireless resource allocation methods and radio resource allocation apparatus | |
CN112351433A (en) | Heterogeneous network resource allocation method based on reinforcement learning | |
Bi et al. | Deep reinforcement learning based power allocation for D2D network | |
Sande et al. | Access and radio resource management for IAB networks using deep reinforcement learning | |
Yuan et al. | Deep reinforcement learning for resource allocation with network slicing in cognitive radio network | |
Kaur et al. | Intelligent spectrum management based on reinforcement learning schemes in cooperative cognitive radio networks | |
CN102984736A (en) | Optimizing method for wireless ubiquitous heterogeneous network resources | |
Jiang | Reinforcement learning-based spectrum sharing for cognitive radio | |
Banitalebi et al. | Distributed learning-based resource allocation for self-organizing c-v2x communication in cellular networks | |
CN110139282A (en) | A kind of energy acquisition D2D communication resource allocation method neural network based | |
Lall et al. | Multi-agent reinfocement learning for stochastic power management in cognitive radio network | |
Khuntia et al. | An actor-critic reinforcement learning for device-to-device communication underlaying cellular network | |
CN110049436B (en) | Distributed channel allocation and sharing method and system based on heterogeneous spectrum | |
CN115811788B (en) | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning | |
Chen et al. | Power allocation based on deep reinforcement learning in hetnets with varying user activity | |
CN115915454A (en) | SWIPT-assisted downlink resource allocation method and device | |
Li et al. | User scheduling and slicing resource allocation in industrial Internet of Things | |
Jiang et al. | Dueling double deep q-network based computation offloading and resource allocation scheme for internet of vehicles | |
Ginde et al. | A game-theoretic analysis of link adaptation in cellular radio networks | |
De Mari et al. | Energy-efficient proactive scheduling in ultra dense networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240707 Address after: 230000 B-1015, wo Yuan Garden, 81 Ganquan Road, Shushan District, Hefei, Anhui. Patentee after: HEFEI MINGLONG ELECTRONIC TECHNOLOGY Co.,Ltd. Country or region after: China Address before: 400065 No. 2, Chongwen Road, Nan'an District, Chongqing Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS Country or region before: China |
|
TR01 | Transfer of patent right |