CN113316154A - Authorized and unauthorized D2D communication resource joint intelligent distribution method - Google Patents

Authorized and unauthorized D2D communication resource joint intelligent distribution method

Info

Publication number
CN113316154A
CN113316154A
Authority
CN
China
Prior art keywords
agent
action
reward
authorization
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110581716.5A
Other languages
Chinese (zh)
Other versions
CN113316154B (en)
Inventor
裴二荣
徐成义
陶凯
宋珈锐
黄一格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Minglong Electronic Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110581716.5A priority Critical patent/CN113316154B/en
Publication of CN113316154A publication Critical patent/CN113316154A/en
Application granted granted Critical
Publication of CN113316154B publication Critical patent/CN113316154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/0215Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices
    • H04W28/0221Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices power availability or consumption

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources, belonging to the field of D2D communication. The invention comprises the following steps: S1: establishing a D2D user communication model; S2: establishing an objective function to be optimized; S3: establishing a multi-agent deep reinforcement learning D2D communication model; S4: setting the action set, state set, and reward function of the multiple agents; S5: each agent taking an action according to its own Actor network and obtaining the state, the reward, and the next state; S6: calculating the TD error of the Critic network, updating the Critic network parameters, the Critic network calculating a counterfactual baseline for each agent, updating the Actor network parameters through the counterfactual baseline, and updating the state; S7: repeating steps S5-S6 until the target state is reached. Under the multi-agent deep reinforcement learning framework, the agents continuously interact with the environment, continuously train and learn, and find the optimal strategy; the actions corresponding to the adopted strategy achieve the optimal throughput while guaranteeing the communication quality of WiFi and cellular users.

Description

Authorized and unauthorized D2D communication resource joint intelligent distribution method
Technical Field
The invention belongs to the field of D2D communication, and relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources.
Background
With the rapid popularization of intelligent terminals, the number of connected devices, including smart phones, tablets, vehicle networking devices, Internet of Things devices and the like, is expected to reach 50 billion, and data traffic is expected to grow 1000-fold, making the evolution of wireless communication technology ever more urgent. On the one hand, wireless communication performance indicators such as network capacity and spectrum efficiency need to be significantly improved to cope with the explosive growth of data traffic; on the other hand, cellular communication applications need to be expanded and the end-user experience improved.
Faced with explosively growing data traffic and an almost fully allocated licensed spectrum, expanding into new spectrum to increase system capacity, so as to keep up with the rapid growth of mobile traffic and the diversification of services, has become the current primary goal. The unlicensed spectrum mainly comprises the 2.4 GHz band, the 5 GHz band, and the 60 GHz millimeter-wave band. Since free-space loss increases with frequency, only bands below 6 GHz are considered able to resist path fading well. Below 6 GHz, the 2.4 GHz band is already densely occupied by WiFi, Bluetooth, and other wireless technologies and suffers from complex interference, whereas the 5 GHz band offers nearly 500 MHz of available spectrum, only part of which is occupied by WiFi, so its utilization is low. Therefore, D2D can be realized in the 5 GHz band as long as the interference between D2D and WiFi is kept within an acceptable range. The 5 GHz unlicensed channels experience large fading and are suitable for short-range communication; D2D, as a proximity service, matches this property well. Compared with other short-range technologies operating in the unlicensed bands (WiFi Direct, Bluetooth, ZigBee, etc.), D2D communication has clear advantages: user pairing and the allocation of channels and power are controlled by the base station, making access more efficient and more secure. Deploying D2D communication in the unlicensed band can improve unlicensed spectrum utilization and enables seamless integration with the existing cellular system. Similar to LTE-U, D2D communication initially operated only in the licensed spectrum and has no coexistence mechanism with WiFi systems; directly accessing the unlicensed band would seriously degrade WiFi performance, so harmonious coexistence between D2D users and incumbent WiFi users must also be ensured when D2D communication is implemented in the unlicensed band.
Existing D2D communication is mainly deployed in the licensed band, and joint deployment of D2D in both the unlicensed and licensed bands is rarely considered. Choosing licensed or unlicensed operation for each D2D user under the premise of guaranteeing a minimum WiFi communication quality, together with the D2D resource allocation itself, is an NP-hard problem that is difficult for traditional algorithms to solve. Therefore, applying currently popular machine learning methods to this problem, which traditional algorithms struggle with, has very important research significance.
Disclosure of Invention
In view of this, the invention provides a joint intelligent allocation method for authorized and unlicensed D2D communication resources, which solves the joint licensed/unlicensed spectrum selection and D2D communication resource allocation problem that is difficult for conventional algorithms to solve.
In order to achieve the purpose, the invention provides the following technical scheme:
a joint intelligent allocation method for authorized and unauthorized D2D communication resources comprises the following steps:
s1: establishing a D2D user communication model;
s2: establishing an objective function to be optimized;
s3: establishing a multi-agent deep reinforcement learning D2D communication model;
s4: setting an action set, a state set and a reward function of the multi-agent;
s5: the intelligent agent takes action according to the Actor network of the intelligent agent to obtain the state, the reward and the next state;
s6: and calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of each agent by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state.
S7: steps S5-S6 are repeated until the target state is reached.
Further, in step S1, the number of D2D pairs that can access the WiFi band is calculated; however, selecting which D2D pairs leave the licensed band is a mode-selection problem, and the power and channel selection of the remaining D2D pairs still has a serious impact on the cellular users.
Multiplexing the licensed band: in this mode, a D2D pair may reuse the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k reusing the channel of cellular user m is

(equation image)

where p_{k,m} is the transmit power of the k-th D2D pair, p^C_m is the transmit power of cellular user m, g_{k,m} is the channel gain of D2D pair k to cellular user m, B_C is the licensed channel bandwidth, N_0 is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is reused by D2D pair k is

(equation image)

where g_{m,B} is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between D2D transmitter k and the base station.
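Although the closed-form rate expressions appear only as equation images in the filing, the variable definitions above suggest a standard Shannon-capacity form. The following Python sketch is an assumption based on those definitions, not the patented formula itself; the function and variable names are illustrative:

import numpy as np

def d2d_rate(p_km, g_km, p_cm, h_km, bandwidth_hz, noise_density):
    # Rate of D2D pair k reusing the uplink channel of cellular user m.
    # Interference at the D2D receiver is assumed to come from cellular user m.
    sinr = (p_km * g_km) / (noise_density * bandwidth_hz + p_cm * h_km)
    return bandwidth_hz * np.log2(1.0 + sinr)

def cellular_rate(p_cm, g_mB, p_km, h_kB, bandwidth_hz, noise_density):
    # Rate of cellular user m whose channel is reused by D2D pair k.
    # Interference at the base station is assumed to come from D2D transmitter k.
    sinr = (p_cm * g_mB) / (noise_density * bandwidth_hz + p_km * h_kB)
    return bandwidth_hz * np.log2(1.0 + sinr)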
The presence of D2D communication has a large impact on cellular and WiFi users. It is therefore proposed to first determine, under the constraint of the minimum WiFi user throughput, the maximum number of D2D pairs that can access the unlicensed WiFi band, and then to perform mode selection and resource allocation over all D2D users, so as to minimize the degradation of cellular and WiFi users caused by D2D communication.
When x_i = 1, D2D pair i multiplexes the channel of an uplink cellular user; when x_i = 0, D2D pair i accesses the WiFi unlicensed band.
When θ_{i,m} = 1, D2D pair i multiplexes the channel of uplink cellular user m; when θ_{i,m} = 0, D2D pair i does not multiplex the channel of uplink cellular user m.
One channel can be multiplexed by a plurality of D2D users, and only one channel can be selected by one D2D for data transmission.
Further, in step S2, in order to maximize the joint throughput of the cellular users and the licensed-band D2D users, the optimization problem is

(equation image)

s.t. x_k ∈ {0,1}, θ_{k,m} ∈ {0,1}

0 ≤ p_{k,m} ≤ p_max

(minimum WiFi throughput constraint, equation image)

(minimum SINR constraint, equation image)

The first constraint represents the licensed/license-free access selection of the D2D users, the second represents the power limit of the D2D users, the third enforces the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and cellular users meet the minimum signal-to-noise ratio requirements.
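As an illustration of how a candidate allocation could be scored against this objective and its constraints, the sketch below sums the licensed-band rates and checks the four constraint groups. It is a simplified sketch with hypothetical helper names; the exact objective is the one given by the equation image above:

def licensed_band_throughput(x, theta, rate_d2d, rate_cell):
    # x[k]: 1 if D2D pair k stays on the licensed band, 0 if it moves to the
    # unlicensed WiFi band; theta[k][m]: 1 if pair k reuses cellular user m's channel.
    # rate_d2d[k][m] and rate_cell[m] are the spectral efficiencies defined above.
    K, M = len(x), len(rate_cell)
    d2d_sum = sum(x[k] * theta[k][m] * rate_d2d[k][m]
                  for k in range(K) for m in range(M))
    return d2d_sum + sum(rate_cell)

def constraints_ok(x, theta, p, p_max, wifi_tput, wifi_min, sinr_d2d, sinr_cell, sinr_min):
    # Binary variables, power limit, minimum WiFi throughput, minimum SINR.
    binary = all(v in (0, 1) for v in x) and all(v in (0, 1) for row in theta for v in row)
    power = all(0 <= v <= p_max for row in p for v in row)
    wifi = wifi_tput >= wifi_min
    sinr = all(s >= sinr_min for s in sinr_d2d) and all(s >= sinr_min for s in sinr_cell)
    return binary and power and wifi and sinr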
Further, in step S3, in order to solve the NP-hard problem of D2D communication resource allocation, a multi-agent reinforcement learning method, COMA (Counterfactual Multi-Agent), is adopted. The multi-agent environment is modeled as a Markov game to optimize the policies while taking the behavior policies of the other agents into account. The method marginalizes the effect of a single agent on the reward by comparing the action the agent takes at a time point t with all other actions it could have taken at t; this is achieved by a centralized Critic, so the value function is the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action to see whether the current action increases or decreases the overall score compared with the default action: if it increases, the agent's current action is better than the default action; if it decreases, it is worse. This default action is called the baseline. The remaining question is how to determine the default action; once a default action is fixed in some way, evaluating its quality requires additional simulation, which increases the computational complexity. Instead of using a default action, COMA computes the baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed. In this way, COMA avoids designing extra default actions and extra simulation, and each agent can better learn policies that require multi-agent coordination. COMA provides an efficient algorithm for credit assignment of the reward function, and the deep learning training process incurs a large computational overhead. The training process is therefore completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
Further, in step S4, in the RL model of D2D underlay communication, each D2D pair is an agent that interacts with the environment and takes actions according to its policy. At each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, takes a corresponding action from the action space A (mode selection, RB selection, power-level selection). After the action is executed, the environment enters a new state s_{t+1} and the agent receives a reward.
State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time t:

(equation image)
Action space A:

(equation image)

where the three components are the mode selection, the power-level selection, and the RB selection, respectively.
Here the number of mode (licensed/unlicensed) selections is 2, the number of power levels is 10, and the number of RB selections is 20, so the size of each agent's action space is α × β × η = 400.
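A flat action index in this 400-element space can be mapped back to a (mode, power level, RB) triple. The sketch below uses the sizes stated above (2, 10, 20); the encoding order is an assumption chosen only for illustration:

N_MODES, N_POWER_LEVELS, N_RBS = 2, 10, 20            # alpha, beta, eta
N_ACTIONS = N_MODES * N_POWER_LEVELS * N_RBS          # 400

def decode_action(action_index):
    # Split a flat index in [0, 400) into (mode, power_level, rb).
    mode = action_index // (N_POWER_LEVELS * N_RBS)    # 0: licensed, 1: unlicensed
    power_level = (action_index // N_RBS) % N_POWER_LEVELS
    rb = action_index % N_RBS
    return mode, power_level, rb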
The reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed with three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the mode selected by the agent enters the unlicensed band, its reward is set to a positive value, but a large negative value is given once the number of D2D pairs exceeds the maximum admissible number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user exceed the set thresholds, the reward is the sum of the agent's rate and the rate of the cellular user sharing the same spectrum; conversely, if the action makes the SINR of the D2D or cellular user fall below the set threshold, the reward is negative, because an SINR below the threshold means the signal cannot be decoded.
To limit the number of D2D pairs entering the license-free mode, a function is designed:

(equation image)

The SINRs of the D2D pairs and the CUs are constrained by:

(equation image)

where

(equation image)

(equation image)
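A minimal sketch of such a three-part reward, assuming hypothetical threshold names and simplified penalty values (the exact penalty functions are the ones shown in the equation images above):

def reward(mode, n_unlicensed, n_max_unlicensed, rate_d2d, rate_cell,
           sinr_d2d, sinr_cell, sinr_min, bonus=1.0, penalty=-10.0):
    # Part 1: mode selection -- positive reward for entering the unlicensed band,
    # large negative reward if the admissible number of unlicensed D2D pairs is exceeded.
    if mode == 1:  # unlicensed
        return bonus if n_unlicensed <= n_max_unlicensed else penalty
    # Parts 2 and 3: on the licensed band, reward the summed rate of the D2D pair
    # and the cellular user sharing its spectrum, provided both SINR thresholds hold.
    if sinr_d2d >= sinr_min and sinr_cell >= sinr_min:
        return rate_d2d + rate_cell
    return penalty  # below the SINR threshold the signal cannot be decoded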
Further, in step S5, the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters λ, φ_0, φ_1, ..., φ_J, θ_0, θ_1, ..., θ_J of the Actor and Critic networks are first initialized. Each agent takes the action with the highest probability under its own policy network as the action in the current state; combining the actions taken by all agents yields, from the environment state s_t, the joint action a_t, the D2D SINR reward r_t, and the next state s_{t+1}.
Further, in step S6, Monte Carlo methods provide a basis for model-free learning, but they are still limited to episodic, offline learning. The TD-error method bridges the gap between Monte Carlo methods and dynamic programming and is a core idea of RL. TD methods can likewise learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuous environment. The TD error is computed from the Critic network:

(equation image)

Parameter updating is carried out based on the policy gradient, and the TD error is used in a gradient-ascent update:

λ_{t+1} = λ_t + α_λ δ_t ∇_λ Q_λ(s_t, u_t)

To solve the credit-assignment problem among the multiple agents, the COMA algorithm uses a counterfactual baseline: it marginalizes the effect of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could have taken at t, through the centralized Critic, so the value function is the same for all agents, but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the j-th agent is defined as:

(equation image)

The j-th agent updates its own Actor network parameters through the counterfactual baseline according to:

(equation image)

The agent updates its network according to the advantage function obtained from the Critic network.
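The counterfactual advantage that the Critic hands back to each Actor can be sketched as follows. The sketch assumes the centralized Critic can evaluate Q(s, u) for any joint action; the names and the plain-Python form are illustrative, not the patented implementation:

def td_error(reward_t, q_next, q_now, gamma):
    # One-step TD error used to update the centralized Critic.
    return reward_t + gamma * q_next - q_now

def coma_advantage(q_of_joint, joint_action, policy_j, agent_j):
    # A_j(s, u): Q of the taken joint action minus the counterfactual baseline,
    # i.e. the expectation of Q over agent j's own actions while the other
    # agents' actions are kept fixed.
    q_taken = q_of_joint(joint_action)
    baseline = 0.0
    for alt_action, prob in enumerate(policy_j):   # policy_j[u] = pi_j(u | obs_j)
        counterfactual = list(joint_action)
        counterfactual[agent_j] = alt_action       # vary only agent j's action
        baseline += prob * q_of_joint(counterfactual)
    return q_taken - baseline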
Further, in step S7, the training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
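Putting steps S5-S7 together, the centralized-training/distributed-execution loop could look like the sketch below. The Environment, Agent, and Critic interfaces are hypothetical placeholders used only to show the flow of information between the D2D agents and the BS:

def train(env, agents, critic, num_episodes, gamma):
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # S5: each D2D agent acts with its own Actor (distributed execution).
            actions = [agent.act(state) for agent in agents]
            next_state, rewards, done = env.step(actions)

            # S6: centralized Critic at the BS -- TD error and parameter update.
            delta = critic.td_error(state, actions, rewards, next_state, gamma)
            critic.update(delta)

            # Each agent updates its Actor with its counterfactual advantage A_j(s, u).
            for j, agent in enumerate(agents):
                advantage = critic.counterfactual_advantage(state, actions, j)
                agent.update_actor(state, actions[j], advantage)

            # S7: repeat until the target state is reached.
            state = next_state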
The invention has the following beneficial effects: in the D2D resource allocation problem, the unlicensed band, the spectrum, and the power are jointly allocated to the D2D users. The number of D2D pairs that can access the unlicensed band is determined on the premise of guaranteeing the minimum communication quality of the WiFi users; the D2D pairs entering the unlicensed band are then selected, and power and spectrum are allocated to the D2D pairs remaining in the licensed band, so that the throughput of the cellular users and of the D2D users in the licensed band is maximized. An effective multi-agent deep reinforcement learning algorithm is provided, which solves the NP-hard problem that is difficult for traditional algorithms.
Drawings
In order to make the objects, technical solutions, and beneficial effects of the invention clearer, the following drawings are provided for illustration:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a network model for D2D communication;
FIG. 3 is a diagram of an AC network framework model according to an embodiment of the present invention;
FIG. 4 is a COMA model diagram of a multi-agent deep reinforcement learning algorithm according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a joint intelligent allocation method for authorized and unlicensed D2D communication resources, aimed at uplink data transmission in a cellular network and a D2D network. To obtain the number of D2D pairs that can access the unlicensed band, a two-dimensional Markov model is established and used to determine this number under the premise of guaranteeing WiFi communication quality. After the number of admissible pairs is obtained, a subset of D2D users is selected to move to the unlicensed band, and power and spectrum are allocated to the remaining D2D pairs in the licensed band. In order to maximize the throughput of the D2D users and cellular users in the licensed band, a multi-agent deep reinforcement learning method, the COMA algorithm, is proposed: the multi-agent environment is modeled as a Markov game to optimize the policies while considering the behavior policies of the other agents. The method marginalizes the influence of a single agent on the reward by comparing the action the agent takes at a time point t with all other actions it could take at t, which is achieved by a centralized Critic; the value function is therefore the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In this way, COMA avoids designing extra default actions and extra simulation, so each agent can better learn policies that require multi-agent coordination. COMA provides an efficient algorithm for credit assignment of the reward function, and the deep learning training process incurs a large computational overhead; therefore, training is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the base station obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D pair obtains its counterfactual baseline from the BS to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy. A flow chart of the multi-agent deep reinforcement learning method for jointly allocating authorized and unlicensed D2D communication resources in D2D communication is shown in FIG. 1.
A network model diagram of D2D communication is shown in FIG. 2. When a D2D user transmits in the licensed band, it reuses a channel of an existing cellular user, which causes interference between the D2D user and the cellular user; when a D2D user chooses to operate in the unlicensed band, it affects the communication quality of users in the WiFi band. Considering that D2D users deployed in the unlicensed band access the channel via LBT (listen-before-talk), the D2D and WiFi users can be modeled as a two-dimensional Markov model, from which the number of D2D pairs that can access the unlicensed band is determined under the premise of guaranteeing the communication quality of WiFi users. The communication links between devices share uplink resources, since channel interference is much easier to handle in the uplink than in the downlink. To maximize the system capacity in the licensed band, the same channel can be used by multiple D2D pairs, but each D2D pair can select only one channel for multiplexing. Therefore, licensed/unlicensed selection and power and spectrum allocation must be performed for the D2D users, which is an NP-hard problem, so a machine learning method is used to solve it: the D2D users are treated as agents, the actions are the licensed/unlicensed selection and the power and spectrum selections, the joint state is the SINR of all D2D users, and a reasonable reward function is set for the multiple agents. The agents continuously interact with the environment to select actions, update states, and update network parameters; through continuous learning in the environment, they select the actions with the maximum reward.
As shown in fig. 1, a joint intelligent allocation method for authorized and unauthorized D2D communication resources includes the following steps:
s1: establishing a D2D user communication model;
s2: establishing an objective function to be optimized;
s3: establishing a multi-agent deep reinforcement learning D2D communication model;
s4: setting an action set, a state set and a reward function of the multi-agent;
s5: the intelligent agent takes action according to the Actor network of the intelligent agent to obtain the state, the reward and the next state;
s6: and calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of each agent by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state.
S7: steps S5-S6 are repeated until the target state is reached.
To improve spectral efficiency, D2D users reuse cellular user resources in the licensed band, thereby causing interference to cellular users. Multiplexing the licensed band: in this mode, a D2D pair may reuse the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k reusing the channel of cellular user m is

(equation image)

where p_{k,m} is the transmit power of the k-th D2D pair, p^C_m is the transmit power of cellular user m, g_{k,m} is the channel gain of D2D pair k to cellular user m, B_C is the licensed channel bandwidth, N_0 is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is reused by D2D pair k is

(equation image)

where g_{m,B} is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between D2D transmitter k and the base station.
The presence of D2D communication has a large impact on cellular and WiFi users. It is therefore proposed to first determine, under the constraint of the minimum WiFi user throughput, the maximum number of D2D pairs that can access the unlicensed WiFi band, and then to perform mode selection and resource allocation over the D2D users, so as to minimize the degradation of cellular and WiFi users caused by D2D communication.
When x_i = 1, D2D pair i multiplexes the channel of an uplink cellular user; when x_i = 0, D2D pair i accesses the WiFi unlicensed band.
When θ_{i,m} = 1, D2D pair i multiplexes the channel of uplink cellular user m; when θ_{i,m} = 0, D2D pair i does not multiplex the channel of uplink cellular user m.
In order to maximize the joint throughput of the cellular users and the licensed-band D2D users, the optimization problem is

(equation image)

s.t. x_k ∈ {0,1}, θ_{k,m} ∈ {0,1}

0 ≤ p_{k,m} ≤ p_max

(minimum WiFi throughput constraint, equation image)

(minimum SINR constraint, equation image)

The first constraint represents the licensed/license-free access selection of the D2D users, the second represents the power limit of the D2D users, the third enforces the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and cellular users meet the minimum signal-to-noise ratio requirements.
In order to solve the NP-hard problem of D2D communication resource allocation, a multi-agent deep reinforcement learning method, COMA (Counterfactual Multi-Agent), is used here. The multi-agent environment is modeled as a Markov game to optimize the policies while considering the behavior policies of the other agents. The method marginalizes the effect of a single agent on the reward by comparing the action the agent takes at a time point t with all other actions it could have taken at t; this is realized by a centralized Critic, so the value function is the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action to see whether the current action increases or decreases the overall score compared with the default action: if it increases, the agent's current action is better than the default action; if it decreases, it is worse. This default action is called the baseline. The remaining question is how to determine the default action; once a default action is fixed in some way, evaluating its quality requires additional simulation, which increases the computational complexity. Instead of using a default action, COMA computes the baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed. In this way, COMA avoids designing extra default actions and extra simulation, and each agent can better learn policies that require multi-agent coordination. COMA provides an efficient algorithm for credit assignment of the reward function. The deep learning training process incurs a large computational overhead, so the training process is completed by the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy. The AC network framework model is shown in FIG. 3.
In the RL model of D2D underlay communication, each D2D agent is built on the AC (Actor-Critic) network framework: it interacts with the environment and takes actions according to its policy. At each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, takes a corresponding action from the action space A (mode selection, RB selection, power-level selection). After the action is executed, the environment enters a new state s_{t+1} and the agent receives a reward.
State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time t:

(equation image)
Action space A:

(equation image)

where the three components are the mode selection, the power-level selection, and the RB selection, respectively.
Here the number of mode (licensed/unlicensed) selections is 2, the number of power levels is 10, and the number of RB selections is 20, so the size of each agent's action space is α × β × η = 400.
The reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed with three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the mode selected by the agent enters the unlicensed band, its reward is set to a positive value, but a large negative value is given once the number of D2D pairs exceeds the maximum admissible number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user exceed the set thresholds, the reward is the sum of the agent's rate and the rate of the cellular user sharing the same spectrum; conversely, if the action makes the SINR of the D2D or cellular user fall below the set threshold, the reward is negative, because an SINR below the threshold means the signal cannot be decoded.
To limit the number of D2D pairs entering the license-free mode, a function is designed:

(equation image)

The SINRs of the D2D pairs and the CUs are constrained by:

(equation image)

where

(equation image)

(equation image)
First, the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters λ, φ_0, φ_1, ..., φ_J, θ_0, θ_1, ..., θ_J of the Actor and Critic networks are initialized. Each agent takes the action with the highest probability under its own policy network as the action in the current state; combining the actions taken by all agents yields, from the environment state s_t, the joint action a_t, the D2D SINR reward r_t, and the next state s_{t+1}.
The Monte Carlo method provides a basis for model-free learning, but it is still limited to episodic, offline learning. The TD-error method bridges the gap between Monte Carlo methods and dynamic programming and is a core idea of RL. TD methods can likewise learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuous environment. The TD error is computed from the Critic network:

(equation image)

Parameter updating is carried out based on the policy gradient, and the TD error is used in a gradient-ascent update:

λ_{t+1} = λ_t + α_λ δ_t ∇_λ Q_λ(s_t, u_t)

To solve the credit-assignment problem among the multiple agents, the COMA algorithm uses a counterfactual baseline: it marginalizes the effect of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could have taken at t, through the centralized Critic, so the value function is the same for all agents, but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the j-th agent is defined as:

(equation image)

The j-th agent updates its own Actor network parameters through the counterfactual baseline according to:

(equation image)
a schematic diagram of a multi-agent deep reinforcement learning COMA algorithm is shown in FIG. 4.
The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims (8)

1. A joint intelligent allocation method for authorized and license-free D2D communication resources, comprising the following steps:
S1: establishing a D2D user communication model;
S2: establishing the objective function to be optimized;
S3: establishing a multi-agent deep reinforcement learning D2D communication model;
S4: setting the action set, the state set, and the reward function of the multiple agents;
S5: each agent taking an action according to its own Actor network, and obtaining the state, the reward, and the next state;
S6: calculating the TD error of the Critic network and updating the Critic network parameters; the Critic network calculating a counterfactual baseline for each agent, updating the Actor network parameters through the counterfactual baseline, and updating the state;
S7: repeating steps S5-S6 until the target state is reached.
2. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 1, characterized in that: in step S1, the number of D2D pairs that can access the WiFi band is calculated, but selecting which D2D pairs go to the WiFi band is a mode-selection problem, and the power and channel selection of the remaining D2D pairs still has a serious impact on the CU users.
Multiplexing the licensed band: in this mode, a D2D pair may reuse the uplink channel of an existing cellular user for direct communication. The spectral efficiency of D2D pair k reusing the channel of cellular user m is:
(equation image)
where p_{k,m} is the transmit power of the k-th D2D pair, p^C_m is the transmit power of cellular user m, g_{k,m} is the channel gain of D2D pair k to cellular user m, B_C is the licensed channel bandwidth, N_0 is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k. The spectral efficiency of cellular user m whose channel is reused by D2D pair k is:
(equation image)
where g_{m,B} is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between D2D transmitter k and the base station.
The presence of D2D communication has a large impact on cellular and WiFi users, so it is proposed to first determine, while satisfying the minimum WiFi user throughput, the maximum number of D2D pairs that can access the WiFi band, and then to perform mode selection and resource allocation over all D2D users, so as to minimize the degradation of cellular and WiFi users caused by D2D communication.
When x_i = 1, D2D pair i multiplexes the channel of an uplink cellular user; when x_i = 0, D2D pair i accesses the WiFi unlicensed band.
When θ_{i,m} = 1, D2D pair i multiplexes the channel of uplink cellular user m; when θ_{i,m} = 0, D2D pair i does not multiplex the channel of uplink cellular user m.
3. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 2, characterized in that: in step S2, in order to maximize the joint throughput of the cellular users and the licensed-band D2D users, the optimization problem is
(equation image)
s.t. x_k ∈ {0,1}, θ_{k,m} ∈ {0,1}
0 ≤ p_{k,m} ≤ p_max
(minimum WiFi throughput constraint, equation image)
(minimum SINR constraint, equation image)
where the first constraint represents the licensed/license-free access selection of the D2D users, the second represents the power limit of the D2D users, the third enforces the minimum WiFi throughput requirement, and the fourth ensures that the D2D users and cellular users meet the minimum signal-to-noise ratio requirements.
4. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 3, characterized in that: in step S3, in order to solve the NP-hard problem of D2D communication resource allocation, a multi-agent reinforcement learning method, COMA (Counterfactual Multi-Agent), is adopted. The multi-agent environment is modeled as a Markov game to optimize the policies while considering the behavior policies of the other agents. The method marginalizes the influence of a single agent on the reward by comparing the action taken by the agent at a time point t with all other actions it could have taken at t; this is realized by a centralized Critic, so the value function is the same for all agents, but each agent receives a customized error term based on its counterfactual actions. In a cooperative multi-agent system, when evaluating how much an agent's action contributes, the agent's action can be replaced by a default action to see whether the current action increases or decreases the overall score compared with the default action: if it increases, the current action is better than the default action; if it decreases, it is worse. This default action is called the baseline. The remaining question is how to determine the default action; once a default action is fixed in some way, evaluating its quality requires additional simulation, which increases the computational complexity. Instead of using a default action, COMA computes the baseline by marginalizing over the current agent's policy using the current action-value function, so no additional simulation is needed. In this way, COMA avoids designing extra default actions and extra simulation, and each agent can better learn policies that require multi-agent coordination. COMA provides an efficient algorithm for credit assignment of the reward function, and the deep learning training process incurs a large computational overhead. The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
5. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 4, characterized in that: in step S4, in the RL model of the D2D underlay communication, each D2D agent interacts with the environment and takes actions according to its policy. At each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, takes a corresponding action from the action space A (mode selection, RB selection, power-level selection). After the action is executed, the environment enters a new state s_{t+1} and the agent receives a reward. State space S: at any time t, the system state is represented by the joint SINR values of all D2D pairs at that time t:
(equation image)
Action space A:
(equation image)
where the three components are the mode selection, the power-level selection, and the RB selection, respectively.
The number of mode selections is 2, the number of power levels is 10, and the number of RB selections is 20, so the size of each agent's action space is α × β × η = 400.
Reward function R: the reward function in RL drives the whole learning process, so a reasonable reward function is key. The reward function is designed with three parts: the mode selected by the D2D pair, the rates of the D2D and cellular users, and the SINRs of both. If the mode selected by the agent enters the unlicensed band, its reward is set to a positive value, but a large negative value is given once the number of D2D pairs exceeds the maximum admissible number. If the action taken by the agent makes the SINRs of the cellular user and the D2D user exceed the set thresholds, the reward is the sum of the agent's rate and the rate of the cellular user sharing the same spectrum; conversely, if the action makes the SINR of the D2D or cellular user fall below the set threshold, the reward is negative, because an SINR below the threshold means the signal cannot be decoded.
To limit the number of D2D pairs entering the license-free mode, a function is designed:
(equation image)
The SINRs of the D2D pairs and the CUs are constrained by:
(equation image)
where
(equation image)
(equation image)
6. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 5, characterized in that: in step S5, the hyper-parameters γ, α_θ, α_λ, α_φ, β, the initial state s_0, and the parameters of the Actor and Critic networks
(parameter list, equation image)
are first initialized. Each agent takes the action with the highest probability under its own policy network as the action in the current state; combining the actions taken by all agents yields, from the environment state s_t, the joint action a_t, the D2D SINR reward
(equation image)
and the next state s_{t+1}.
7. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 6, characterized in that: in step S6, Monte Carlo methods provide a basis for model-free learning, but they are still limited to episodic, offline learning. The TD-error method bridges the gap between Monte Carlo methods and dynamic programming and is a core idea of RL. TD methods can likewise learn in a model-free environment and can learn iteratively from value estimates (online learning), allowing training in a continuous environment. The TD error is computed from the Critic network:
(equation image)
Parameter updating is carried out based on the policy gradient, and the TD error is used in a gradient-ascent update:
λ_{t+1} = λ_t + α_λ δ_t ∇_λ Q_λ(s_t, u_t)
To solve the credit-assignment problem among the multiple agents, the COMA algorithm uses a counterfactual baseline: it marginalizes the influence of a single agent on the reward and compares the action taken by the agent at time t with all other actions it could have taken at t, through the centralized Critic, so the value function is the same for all agents, but each agent obtains a specific error term based on its counterfactual actions. The counterfactual baseline of the j-th agent is defined as:
(equation image)
The j-th agent updates its own Actor network parameters through the counterfactual baseline according to:
(equation image)
The agent updates its network according to the advantage function obtained from the Critic network.
8. The joint intelligent allocation method for authorized and license-free D2D communication resources according to claim 7, characterized in that: in step S7, the training process is completed by the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic at the BS obtains the agents' policies to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station and uses it to update its own Actor network; the Actor selects actions based on the states the agent observes from the environment, the agent continuously interacts with the environment, and after sufficiently many training iterations it converges to the action with the maximum reward value, thereby obtaining the optimal policy.
CN202110581716.5A 2021-05-26 2021-05-26 A joint intelligent allocation method for authorized and license-free D2D communication resources Active CN113316154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110581716.5A CN113316154B (en) 2021-05-26 2021-05-26 A joint intelligent allocation method for authorized and license-free D2D communication resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110581716.5A CN113316154B (en) 2021-05-26 2021-05-26 A joint intelligent allocation method for authorized and license-free D2D communication resources

Publications (2)

Publication Number Publication Date
CN113316154A true CN113316154A (en) 2021-08-27
CN113316154B CN113316154B (en) 2022-06-21

Family

ID=77375597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110581716.5A Active CN113316154B (en) 2021-05-26 2021-05-26 A joint intelligent allocation method for authorized and license-free D2D communication resources

Country Status (1)

Country Link
CN (1) CN113316154B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363938A (en) * 2021-12-21 2022-04-15 重庆邮电大学 A cellular network traffic offloading method
CN114390588A (en) * 2022-01-13 2022-04-22 重庆邮电大学 A hybrid access method for D2D-U communication
CN114466386A (en) * 2022-01-13 2022-05-10 重庆邮电大学 A direct access method for D2D communication
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning
CN115278593A (en) * 2022-06-20 2022-11-01 重庆邮电大学 Transmission method of unmanned aerial vehicle-non-orthogonal multiple access communication system based on semi-authorization-free protocol
CN115515101A (en) * 2022-09-23 2022-12-23 西北工业大学 Decoupling Q learning intelligent codebook selection method for SCMA-V2X system
CN116367332A (en) * 2023-05-31 2023-06-30 华信咨询设计研究院有限公司 Hierarchical control-based D2D resource allocation method under 5G system
WO2024032228A1 (en) * 2022-08-12 2024-02-15 华为技术有限公司 Reinforcement learning training method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160066337A1 (en) * 2014-09-03 2016-03-03 Futurewei Technologies, Inc. System and Method for D2D Resource Allocation
US20190124667A1 (en) * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 A joint resource allocation and power control method in D2D communication
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
WO2019231289A1 (en) * 2018-06-01 2019-12-05 Samsung Electronics Co., Ltd. Method and apparatus for machine learning based wide beam optimization in cellular network
CN110769514A (en) * 2019-11-08 2020-02-07 山东师范大学 A kind of heterogeneous cellular network D2D communication resource allocation method and system
CN111556572A (en) * 2020-04-21 2020-08-18 北京邮电大学 A joint allocation method of spectrum resources and computing resources based on reinforcement learning
US20210136785A1 (en) * 2019-10-30 2021-05-06 University Of Ottawa System and method for joint power and resource allocation using reinforcement learning
CN112822781A (en) * 2021-01-20 2021-05-18 重庆邮电大学 Resource allocation method based on Q learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160066337A1 (en) * 2014-09-03 2016-03-03 Futurewei Technologies, Inc. System and Method for D2D Resource Allocation
US20190124667A1 (en) * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning
WO2019231289A1 (en) * 2018-06-01 2019-12-05 Samsung Electronics Co., Ltd. Method and apparatus for machine learning based wide beam optimization in cellular network
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 A joint resource allocation and power control method in D2D communication
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
US20210136785A1 (en) * 2019-10-30 2021-05-06 University Of Ottawa System and method for joint power and resource allocation using reinforcement learning
CN110769514A (en) * 2019-11-08 2020-02-07 山东师范大学 A kind of heterogeneous cellular network D2D communication resource allocation method and system
CN111556572A (en) * 2020-04-21 2020-08-18 北京邮电大学 A joint allocation method of spectrum resources and computing resources based on reinforcement learning
CN112822781A (en) * 2021-01-20 2021-05-18 重庆邮电大学 Resource allocation method based on Q learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. LUO: "Dynamic resource allocations based on Q-learning for D2D communication in cellular networks", 《2014 11TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING》 *
HUA Sizhong: "Research on a reward-and-penalty weighted algorithm for D2D communication resource reuse allocation", 《Application Research of Computers》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363938A (en) * 2021-12-21 2022-04-15 重庆邮电大学 A cellular network traffic offloading method
CN114363938B (en) * 2021-12-21 2024-01-26 深圳千通科技有限公司 Cellular network flow unloading method
CN114390588A (en) * 2022-01-13 2022-04-22 重庆邮电大学 A hybrid access method for D2D-U communication
CN114466386A (en) * 2022-01-13 2022-05-10 重庆邮电大学 A direct access method for D2D communication
CN114466386B (en) * 2022-01-13 2023-09-29 深圳市晨讯达科技有限公司 Direct access method for D2D communication
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning
CN115278593A (en) * 2022-06-20 2022-11-01 重庆邮电大学 Transmission method of unmanned aerial vehicle-non-orthogonal multiple access communication system based on semi-authorization-free protocol
WO2024032228A1 (en) * 2022-08-12 2024-02-15 华为技术有限公司 Reinforcement learning training method and related device
CN115515101A (en) * 2022-09-23 2022-12-23 西北工业大学 Decoupling Q learning intelligent codebook selection method for SCMA-V2X system
CN116367332A (en) * 2023-05-31 2023-06-30 华信咨询设计研究院有限公司 Hierarchical control-based D2D resource allocation method under 5G system
CN116367332B (en) * 2023-05-31 2023-09-15 华信咨询设计研究院有限公司 Hierarchical control-based D2D resource allocation method under 5G system

Also Published As

Publication number Publication date
CN113316154B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN113316154A (en) Authorized and unauthorized D2D communication resource joint intelligent distribution method
US12067487B2 (en) Method and apparatus employing distributed sensing and deep learning for dynamic spectrum access and spectrum sharing
Yuan et al. Meta-reinforcement learning based resource allocation for dynamic V2X communications
Sun et al. Application of machine learning in wireless networks: Key techniques and open issues
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
CN105230070B (en) A radio resource allocation method and radio resource allocation device
CN112351433A (en) Heterogeneous network resource allocation method based on reinforcement learning
Chuang et al. Dynamic multiobjective approach for power and spectrum allocation in cognitive radio networks
CN109219025A (en) A kind of direct-connected communication resource allocation method of wireless terminal and device
Sande et al. Access and radio resource management for IAB networks using deep reinforcement learning
Banitalebi et al. Distributed learning-based resource allocation for self-organizing C-V2X communication in cellular networks
CN114126021A (en) Green cognitive radio power distribution method based on deep reinforcement learning
Lyu et al. NOMA-assisted on-demand transmissions for monitoring applications in industrial IoT networks
CN118828603A (en) A method for user association and resource allocation in cellular networks based on deep reinforcement learning
Kai et al. Multi-agent reinforcement learning based joint uplink–downlink subcarrier assignment and power allocation for D2D underlay networks
Jiang et al. Dueling double deep q-network based computation offloading and resource allocation scheme for internet of vehicles
Das et al. Reinforcement learning-based resource allocation for M2M communications over cellular networks
Lall et al. Multi-agent reinfocement learning for stochastic power management in cognitive radio network
CN113613337B (en) A user cooperative anti-jamming method for beamforming communication
Chen et al. Power allocation based on deep reinforcement learning in HetNets with varying user activity
CN115811788B (en) A D2D network distributed resource allocation method based on deep reinforcement learning and unsupervised learning
CN116133142B (en) Intelligent framework-based high-energy-efficiency uplink resource allocation method under QoS constraint
Saied Resource Allocation Management of D2D Communications in Cellular Networks
Jiang Reinforcement learning-based spectrum sharing for cognitive radio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240707

Address after: 230000 B-1015, wo Yuan Garden, 81 Ganquan Road, Shushan District, Hefei, Anhui.

Patentee after: HEFEI MINGLONG ELECTRONIC TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 400065 No. 2, Chongwen Road, Nan'an District, Chongqing

Patentee before: CHONGQING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS

Country or region before: China