CN113316154B - Authorized and unauthorized D2D communication resource joint intelligent distribution method - Google Patents

Authorized and unauthorized D2D communication resource joint intelligent distribution method Download PDF

Info

Publication number
CN113316154B
CN113316154B (application CN202110581716.5A)
Authority
CN
China
Prior art keywords
agent
action
state
user
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110581716.5A
Other languages
Chinese (zh)
Other versions
CN113316154A (en)
Inventor
裴二荣
徐成义
陶凯
宋珈锐
黄一格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110581716.5A priority Critical patent/CN113316154B/en
Publication of CN113316154A publication Critical patent/CN113316154A/en
Application granted granted Critical
Publication of CN113316154B publication Critical patent/CN113316154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02 Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10 Dynamic resource partitioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/06 Testing, supervising or monitoring using simulated traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W28/00 Network traffic management; Network resource management
    • H04W28/02 Traffic management, e.g. flow control or congestion control
    • H04W28/0215 Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices
    • H04W28/0221 Traffic management, e.g. flow control or congestion control based on user or device properties, e.g. MTC-capable devices power availability or consumption

Abstract

The invention relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources, belonging to the field of D2D communication. The method comprises the following steps: S1: establishing a D2D user communication model; S2: establishing an objective function to be optimized; S3: establishing a multi-agent deep reinforcement learning D2D communication model; S4: setting the action set, state set and reward function of the multi-agent system; S5: each agent takes an action according to its own Actor network and obtains the state, the reward and the next state; S6: calculating the TD error of the Critic network, updating the Critic network parameters, having the Critic network calculate the counterfactual baseline of each agent, updating the Actor network parameters with the counterfactual baseline, and updating the state; S7: repeating steps S5-S6 until the target state is reached. Under the multi-agent deep reinforcement learning framework, the agents continuously interact with the environment, train and learn, and find the optimal strategy; the actions corresponding to the adopted strategy achieve the optimal throughput while guaranteeing the communication quality of the WiFi and cellular users.

Description

Authorized and unauthorized D2D communication resource joint intelligent distribution method
Technical Field
The invention belongs to the field of D2D communication, and relates to a joint intelligent allocation method for authorized and unauthorized D2D communication resources.
Background
With the rapid popularization of intelligent terminals, the number of connected devices, including smart phones, tablets, Internet-of-Vehicles devices, Internet-of-Things devices and the like, is expected to reach 50 billion, and data traffic is expected to increase 1000-fold, making the evolution of wireless communication technology ever more urgent. On one hand, wireless communication performance indexes such as network capacity and throughput need to be significantly improved to cope with the explosive growth of data traffic; on the other hand, cellular communication applications need to be expanded and the end-user experience improved.
Facing explosive growth in data traffic while the licensed spectrum is almost fully allocated, expanding into new spectrum to increase system capacity, so as to accommodate the rapid growth of mobile traffic and the diversification of services, has become the current primary goal. The unlicensed spectrum mainly comprises the 2.4GHz band, the 5GHz band and the 60GHz millimeter-wave band. Since free-space loss increases with frequency, only bands below 6GHz are considered in order to better resist path fading. Among the bands below 6GHz, the 2.4GHz band is densely occupied by wireless technologies such as WiFi and Bluetooth and the interference is complex, while the 5GHz band offers nearly 500MHz of available spectrum, only part of which is occupied by WiFi, so its utilization is low. Therefore, as long as the interference between the D2D technology and WiFi is kept within an acceptable range, D2D can be deployed in the 5GHz band. The 5GHz unlicensed channels suffer large fading and are suitable for short-range communication; D2D, as a proximity service, fits this band very well. Compared with other short-range technologies operating in the unlicensed band (WiFi Direct, Bluetooth, the Zigbee protocol and the like), D2D communication has great advantages: user pairing and the allocation of channels and power are controlled by the base station, so access is more efficient and safer. Deploying D2D communication in the unlicensed band can improve unlicensed spectrum utilization and achieve seamless integration with the existing cellular system. Similar to the LTE-U technology, D2D communication originally operated only in the licensed spectrum and has no coexistence mechanism with the WiFi system; directly accessing the unlicensed band would seriously degrade WiFi performance, so harmonious coexistence between D2D users and the incumbent WiFi users must also be ensured when D2D communication is deployed in the unlicensed band.
Existing D2D communication is mainly deployed in the licensed band, and joint deployment of D2D in both the unlicensed and licensed bands has seldom been considered. Considering the licensed/unlicensed selection of D2D users and the allocation of D2D communication resources under the premise of guaranteeing the minimum communication quality of WiFi users is an NP-hard problem that traditional algorithms struggle to solve. Therefore, using the currently very popular machine learning methods to solve this problem, which is difficult for traditional algorithms, has very important research significance.
Disclosure of Invention
In view of this, the invention provides a joint intelligent allocation method for authorized and unlicensed D2D communication resources, which solves the joint licensed/unlicensed spectrum selection and D2D communication resource allocation problem that is difficult for conventional algorithms to solve.
In order to achieve the purpose, the invention provides the following technical scheme:
a joint intelligent allocation method for authorized and unauthorized D2D communication resources comprises the following steps:
S1: establishing a D2D user communication model;
S2: establishing an objective function to be optimized;
S3: establishing a D2D communication model of multi-agent deep reinforcement learning;
S4: setting an action set, a state set and a reward function of the multi-agent system;
S5: each agent takes an action according to its own Actor network and obtains the state, the reward and the next state;
S6: calculating the TD error of the Critic network, updating the Critic network parameters, calculating the counterfactual baseline of each agent with the Critic network, updating the Actor network parameters through the counterfactual baseline, and updating the state;
S7: repeating steps S5-S6 until the target state is reached.
Further, in step S1, a D2D communication process combining the licensed and unlicensed bands is established, and the number of D2D pairs that access the WiFi band is calculated. In the licensed-band multiplexing mode, two D2D users may multiplex the uplink of the same existing cellular user to communicate directly. The spectral efficiency of D2D pair k when it multiplexes the channel of cellular user m is:
Figure GDA0003608705510000021
where p_{k,m} is the transmit power of the kth D2D pair,
Figure GDA0003608705510000022
is the transmit power of cellular user m, B_C is the licensed sub-channel bandwidth,
Figure GDA0003608705510000023
is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k; x_i = 1 indicates that the ith D2D pair multiplexes the channel of an uplink cellular user, while x_i = 0 indicates that the ith D2D pair accesses the unlicensed WiFi band; θ_{i,m} = 1 indicates that the ith D2D pair multiplexes the channel of uplink cellular user m, and θ_{i,m} = 0 indicates that it does not; the spectral efficiency of cellular user m when multiplexed by D2D pair k is:
Figure GDA0003608705510000031
where
Figure GDA0003608705510000032
is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between the transmitter of D2D pair k and the base station; D2D communication has a great influence on cellular and WiFi users, so it is proposed that, after determining the maximum number of D2D pairs that can access the unlicensed WiFi band under the constraint of the WiFi users' minimum throughput, mode selection and resource allocation are performed for all D2D users, so as to minimize the degradation that D2D communication causes to cellular and WiFi users.
Further, in step S2, in order to maximize the system throughput of the cellular users and the licensed-band D2D users, the optimization problem is formulated as:
Figure GDA0003608705510000033
Figure GDA0003608705510000034
where m ∈ M = {1,2, …, M}; in constraint c1, x_k indicates the D2D user's choice between licensed and unlicensed access, and θ_{k,m} indicates the selection of the sub-channel multiplexed by the D2D pair; c2 represents the power limit of a D2D user, where p_max represents the maximum transmit power of the D2D transmitter; c3 indicates that the minimum WiFi throughput requirement is met, where S_W and
Figure GDA0003608705510000035
denote the WiFi throughput and its minimum threshold, respectively; c4 and c5 ensure that D2D users and cellular users meet their minimum signal-to-noise ratio requirements, where SINR_{D2D} and SINR_{CU} denote the signal-to-noise ratios of the D2D pair and of the CU, respectively, and
Figure GDA0003608705510000041
and
Figure GDA0003608705510000042
denote the minimum signal-to-noise ratio requirements that the D2D pair and the CU need to meet, respectively.
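As an illustration of how the rate expressions and the SINR constraints c4 and c5 of step S2 can be evaluated, the following Python sketch computes Shannon-type spectral efficiencies for one D2D pair and the cellular user whose uplink it multiplexes. The direct-link gain g_d2d, the function name, and all numerical values are hypothetical placeholders for illustration, not values taken from the patent.

```python
import numpy as np

def spectral_efficiencies(p_d2d, p_cu, g_d2d, g_cu_bs, h_cu_to_d2d, h_d2d_to_bs,
                          bandwidth_hz, noise_density):
    """Shannon-rate sketch for one D2D pair multiplexing one cellular user's uplink.

    All powers and gains are linear (not dB). The SINR model assumes the cellular
    user interferes at the D2D receiver and the D2D transmitter interferes at the
    base station, as described around constraints c4 and c5.
    """
    noise = noise_density * bandwidth_hz
    sinr_d2d = (p_d2d * g_d2d) / (noise + p_cu * h_cu_to_d2d)   # D2D receiver side
    sinr_cu = (p_cu * g_cu_bs) / (noise + p_d2d * h_d2d_to_bs)  # base-station side
    se_d2d = np.log2(1.0 + sinr_d2d)   # bit/s/Hz of the D2D pair
    se_cu = np.log2(1.0 + sinr_cu)     # bit/s/Hz of the multiplexed cellular user
    return sinr_d2d, sinr_cu, se_d2d, se_cu

# Hypothetical example values (placeholders only)
sinr_d, sinr_c, se_d, se_c = spectral_efficiencies(
    p_d2d=0.1, p_cu=0.2, g_d2d=1e-6, g_cu_bs=1e-7,
    h_cu_to_d2d=1e-9, h_d2d_to_bs=1e-9,
    bandwidth_hz=180e3, noise_density=3.98e-21)  # ~ -174 dBm/Hz

# Constraints c4 and c5: both links must exceed their minimum SINR thresholds
feasible = (sinr_d >= 1.0) and (sinr_c >= 1.0)   # 0 dB thresholds, assumed
print(se_d, se_c, feasible)
```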
Further, in step S3, in order to solve the NP-hard problem in D2D communication resource allocation, a multi-agent reinforcement learning method, the COMA (Counterfactual Multi-Agent) algorithm, is adopted. It first takes each D2D transmitter as an agent and models the multi-agent environment as a Markov game to optimize the strategy while taking the behavior strategies of the other agents into account. The idea is to marginalize out the influence of a single agent on the reward by comparing the action taken by the agent at a time t with all other actions it could have taken at t; this is implemented by a centralized Critic, so all agents share the same value function, but each agent receives a customized error term according to its counterfactual actions. COMA computes the baseline by marginalizing over the current agent's policy with the current action-value function, thereby avoiding the design of extra default actions and additional simulation computation. The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic obtains the agents' strategies at the BS to evaluate how good the taken actions are. During distributed execution, each D2D user obtains the advantage A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges to the action with the maximum reward value, thereby obtaining the optimal strategy.
Further, in step S4, each D2D agent interacts with the environment and takes the corresponding action according to its policy; at each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, selects the corresponding mode, RB and power level from the action space A; after executing this action, the environment enters a new state s_{t+1} and the agent receives a reward; thus, the state space, action space and reward function are set as follows:
state space S: at any time t, the system state is represented by the joint SINR value of all D2D at that time t as:
Figure GDA0003608705510000043
wherein
Figure GDA0003608705510000044
denotes the local observation state of the nth D2D transmitter at time t;
an action space A:
Figure GDA0003608705510000045
denoting the selection between the licensed and unlicensed bands, the power level selection and the RB selection, respectively, where the mode selection has α = 2 options, the power level has β = 10 options and the RB selection has η = 20 options, so the action space size of each agent is α × β × η = 400;
The reward function R: the reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D pair and the cellular user, and the signal-to-noise ratios of both. If the selected mode is access to the unlicensed band, the agent's reward is set to a positive value, but once the number of D2D pairs exceeds the admissible maximum, a large negative value is obtained. If the action taken by the agent makes the signal-to-noise ratios of both the cellular user and the D2D user larger than the set thresholds, the reward is the sum of the corresponding rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent causes the signal-to-noise ratio of the D2D pair or of the cellular user to fall below the set threshold, the reward is a negative value. To limit the number of users accessing the unlicensed band, the following function is designed:
Figure GDA0003608705510000051
limiting the SINR of D2D and CU, and obtaining the reward function of j-th D2D user as:
Figure GDA0003608705510000052
in the formula
Figure GDA0003608705510000053
Figure GDA0003608705510000054
Wherein r isiIndicating the instant prize earned by the jth D2D; n is a radical ofmaxA maximum value representing an allowed access to the unlicensed frequency band;
Figure GDA0003608705510000055
denote, respectively: the reward obtained by the ith agent when the number of agents accessing the unlicensed band satisfies the limit; the reward obtained by the ith agent when it accesses the unlicensed band but that limit is not satisfied; the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the minimum signal-to-noise ratio thresholds of both the ith agent and the multiplexed CU are met; and the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the signal-to-noise ratio of the ith agent or of the multiplexed CU is not met;
Figure GDA0003608705510000056
and
Figure GDA0003608705510000057
respectively representing the spectral efficiency of the ith D2D user pair and the CU multiplexed by the ith D2D pair.
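To make the three-part reward design above concrete, the sketch below implements a reward of the described shape: a positive constant for admitted unlicensed access, a large negative value once the admissible number N_max is exceeded, the D2D-plus-CU rate sum when both SINR thresholds are met in the licensed band, and a negative penalty otherwise. The constants (+1, -10, -1) and all names are illustrative assumptions, not values fixed by the patent.

```python
def d2d_reward(uses_unlicensed, n_unlicensed, n_max,
               sinr_d2d, sinr_cu, sinr_d2d_min, sinr_cu_min,
               rate_d2d, rate_cu):
    """Reward of one D2D agent, following the three-part design of step S4 (sketch)."""
    if uses_unlicensed:
        # Positive reward while the unlicensed band is not over-crowded,
        # large penalty once more than n_max agents have chosen it.
        return 1.0 if n_unlicensed <= n_max else -10.0
    # Licensed band: reward the D2D + multiplexed-CU rate sum when both SINR
    # thresholds hold, otherwise return a penalty.
    if sinr_d2d >= sinr_d2d_min and sinr_cu >= sinr_cu_min:
        return rate_d2d + rate_cu
    return -1.0
```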
Further, in step S5, the hyper-parameters γ, α_θ, α_λ,
Figure GDA0003608705510000058
and β, the state s_0, and the parameters λ,
Figure GDA0003608705510000059
θ_0, θ_1, ..., θ_M of the Actor and Critic networks are first initialized; each agent then takes the action with the maximum probability according to its own policy network, the actions taken by all agents in the environment state s_t are combined into the joint action u_t, and the D2D users receive the reward
Figure GDA00036087055100000510
and the next state s_{t+1}.
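A minimal sketch of the greedy action selection in step S5, together with one way to decode a flat action index into the (mode, power level, RB) triple of the 2 × 10 × 20 = 400-element action space, follows; the flat index layout and the array shapes are assumptions made for illustration.

```python
import numpy as np

def greedy_joint_action(policy_probs):
    """policy_probs: array of shape (n_agents, 400); each row is one agent's
    action distribution produced by its Actor network. Every agent takes its
    most probable action, and the tuple of choices forms the joint action u_t."""
    return tuple(int(a) for a in np.argmax(policy_probs, axis=1))

def decode_action(index, n_power=10, n_rb=20):
    """Map a flat action index in [0, 400) to (mode, power_level, rb),
    assuming index = mode * n_power * n_rb + power_level * n_rb + rb."""
    mode, rest = divmod(index, n_power * n_rb)
    power_level, rb = divmod(rest, n_rb)
    return mode, power_level, rb
```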
Further, in step S6, the training process is completed by the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains the agents' strategies at the base station to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges to the action with the maximum reward value, thereby obtaining the optimal strategy.
Further, in step S6, the TD error of the Critic network is calculated:
Figure GDA0003608705510000061
wherein
Figure GDA0003608705510000062
represents the maximum action value in state s_{t+1},
Figure GDA0003608705510000063
represents the action value, in state s_t, of the action selected according to the policy function
Figure GDA0003608705510000064
; the Critic network parameters are then updated using the gradient ascent method:
Figure GDA0003608705510000065
where α_λ denotes the learning rate of the Critic network, u_t denotes the joint action, and
Figure GDA0003608705510000066
denotes the gradient, with respect to λ, of the action-value function for joint action u_t in state s_t. The credit assignment problem is solved using the counterfactual baseline of the COMA algorithm: the influence of a single agent on the reward is marginalized out by comparing the action taken by the agent at a time t with all other actions it could have taken at t; this is implemented by the centralized Critic, so the value functions of all agents are the same, but each agent obtains a specific error term based on its counterfactual behavior. The counterfactual baseline of the jth agent is defined as:
Figure GDA0003608705510000067
where
Figure GDA0003608705510000068
denotes an alternative action that agent j could take instead of its current action, u^{-j} denotes the joint action of the agents other than agent j,
Figure GDA0003608705510000069
denotes the probability, in state s_t, of taking action
Figure GDA00036087055100000610
, and Q_λ(s_t, (u^{-j}, a'_j)) denotes the action value in state s_t when agent j takes action
Figure GDA00036087055100000611
while the actions of the other agents remain unchanged. The jth agent uses the counterfactual baseline A_j(s, u) to update its Actor network parameters according to the formula:
Figure GDA00036087055100000612
where
Figure GDA00036087055100000613
represents the policy parameter of the jth agent at time t, α_θ represents the learning rate of the agent, and
Figure GDA00036087055100000614
represents the policy gradient of the jth agent in state s_t; according to the advantage function A_j(s, u) obtained from the Critic network, the agent updates the parameters of its Actor network.
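The following numpy sketch summarizes the update rules of step S6: the TD error and gradient-ascent-style update of the centralized Critic, the counterfactual baseline obtained by marginalizing agent j's action under its policy while the other agents' actions are fixed, the advantage A_j(s, u) = Q(s, u) − baseline, and the policy-gradient update of each Actor. The tabular Q, the softmax policies and the learning-rate values are simplifying assumptions used only to make the equations executable; the patent uses neural networks for both the Critic and the Actors.

```python
import numpy as np

def coma_step(Q, policies, s, s_next, u, r, gamma=0.95, lr_critic=0.1, lr_actor=0.01):
    """One COMA-style update (tabular sketch, illustrative only).

    Q        : dict mapping (state, joint_action_tuple) -> value (centralized Critic)
    policies : list; policies[j][s] is agent j's action-probability vector in state s
    u        : tuple of the actions currently taken by all agents (joint action u_t)
    """
    n_agents = len(policies)

    # --- Critic: TD error using the maximum action value of the next state ---
    q_next = max((v for (st, _), v in Q.items() if st == s_next), default=0.0)
    td_error = r + gamma * q_next - Q.get((s, u), 0.0)
    Q[(s, u)] = Q.get((s, u), 0.0) + lr_critic * td_error

    advantages = []
    for j in range(n_agents):
        pi_j = np.asarray(policies[j][s], dtype=float)

        # --- Counterfactual baseline: marginalize agent j's action, others fixed ---
        baseline = sum(p * Q.get((s, u[:j] + (a_alt,) + u[j + 1:]), 0.0)
                       for a_alt, p in enumerate(pi_j))
        adv = Q.get((s, u), 0.0) - baseline          # A_j(s, u) = Q(s, u) - baseline
        advantages.append(adv)

        # --- Actor: policy-gradient ascent with the counterfactual advantage ---
        grad_log = -pi_j.copy()
        grad_log[u[j]] += 1.0                        # d log pi(u_j|s) / d logits (softmax)
        logits = np.log(pi_j + 1e-12) + lr_actor * adv * grad_log
        probs = np.exp(logits - logits.max())
        policies[j][s] = probs / probs.sum()

    return td_error, advantages
```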
Further, in step S7, steps S5-S6 are repeated until the target state is reached.
The invention has the following beneficial effects: in the D2D resource allocation problem, unlicensed-band access, spectrum and power are jointly allocated to D2D users. The number of D2D pairs that can access the unlicensed band is first determined under the premise of guaranteeing the minimum communication quality of WiFi users, the D2D pairs entering the unlicensed band are then determined, and power and spectrum are allocated to the D2D pairs remaining in the licensed band, so that the throughput of the cellular users and of the D2D users in the licensed band is maximized. At the same time, an effective multi-agent deep reinforcement learning algorithm is provided, which solves the NP-hard problem that is difficult for traditional algorithms.
Drawings
In order to make the objectives, technical solutions and beneficial effects of the invention clearer, the following drawings are provided for illustration:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of a network model for D2D communication;
FIG. 3 is a diagram of an AC network framework model according to an embodiment of the present invention;
FIG. 4 is a COMA model diagram of a multi-agent deep reinforcement learning algorithm according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a joint intelligent allocation method for authorized and unauthorized D2D communication resources, aiming at the problem of uplink data transmission in a cellular network and a D2D network. To obtain the number of D2D pairs that can access the unlicensed band, this number is determined by an established two-dimensional time Markov model under the premise of guaranteeing WiFi communication quality. After this number is obtained, in order to select a subset of D2D users for the unlicensed band and to perform power and spectrum allocation for the remaining D2D pairs in the licensed band, so as to maximize the throughput of the D2D users and cellular users in the licensed band, a multi-agent deep reinforcement learning method, the COMA algorithm, is proposed. The multi-agent environment is modeled as a Markov game to optimize the strategy while taking the action strategies of the other agents into account. The method marginalizes out the influence of a single agent on the reward by comparing the action taken by the agent at a time t with all other actions it could have taken at t, which is realized by a centralized Critic; thus the value function of all agents is the same, but each agent receives a customized error term based on its counterfactual actions. In this way COMA avoids designing additional default actions and additional simulation calculations, so that strategies requiring multi-agent coordination are learned better by each agent. COMA provides an efficient way to perform credit assignment on the reward function, but the deep learning training process incurs a large computational overhead. Therefore the training process is completed by the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains the agents' strategies at the base station to evaluate how good the taken actions are. During distributed execution, the D2D users obtain the counterfactual baselines from the BS to update their own Actor networks; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges to the action with the maximum reward value, thereby obtaining the optimal strategy. A flow chart of the multi-agent deep reinforcement learning method for joint licensed and unlicensed D2D communication resource allocation in D2D communication is shown in fig. 1.
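As a small illustration of how the admissible number of unlicensed D2D pairs could be derived once the two-dimensional Markov coexistence model has produced WiFi-throughput predictions, the following sketch scans those predictions against the WiFi minimum-throughput threshold. The input array, its values and the function name are assumptions for illustration; the Markov model itself is not reproduced here.

```python
def max_unlicensed_d2d(wifi_throughput_by_n, wifi_min_throughput):
    """wifi_throughput_by_n[n]: predicted WiFi throughput when n D2D pairs share the
    unlicensed band (e.g., produced by the two-dimensional Markov coexistence model).
    Returns the largest n that still keeps WiFi at or above its required minimum."""
    n_max = 0
    for n, throughput in enumerate(wifi_throughput_by_n):
        if throughput >= wifi_min_throughput:
            n_max = n
    return n_max

# Hypothetical example: WiFi throughput (Mbps) as 0..5 D2D pairs contend for the band
print(max_unlicensed_d2d([60, 52, 45, 39, 30, 22], wifi_min_throughput=35))  # -> 3
```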
A diagram of the network model of D2D communication is shown in fig. 2. When a D2D user transmits in the licensed band, it multiplexes the channel of an existing cellular user, which causes interference between the D2D user and the cellular user; when the D2D user chooses to operate in the unlicensed band, it affects the communication quality of the users in the WiFi band. Considering that D2D users access the unlicensed band with the LBT mechanism, the D2D users and WiFi users can be modeled as a two-dimensional Markov model, from which the number of D2D pairs that can access the unlicensed band while guaranteeing the WiFi users' communication quality can be determined. The communication links between devices are considered to share uplink resources, since channel interference is much easier to handle in the uplink than in the downlink. To maximize the system capacity deployed in the licensed band, the same channel can be used by multiple D2D pairs, but each D2D pair can multiplex only one channel. Therefore, licensed/unlicensed selection as well as power and spectrum allocation must be performed for the D2D users, which is an NP-hard problem, so a machine learning method is used to solve it: the D2D users are regarded as agents, the actions are the licensed/unlicensed selection together with the power and spectrum selection, the joint state is the SINR of all D2D users, and a reasonable reward function is set for the multiple agents. The agents continuously interact with the environment to select actions, update states and update network parameters; through continuous learning in the environment, the agents select the corresponding actions with the maximum reward. A sketch of such an interaction loop is given below.
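This is only a skeleton of the agent-environment interaction just described; `env` and `actors` are hypothetical objects whose interfaces (reset, step, act) are assumptions made for illustration.

```python
def run_episode(env, actors, n_steps=200):
    """Skeleton of the agent-environment interaction loop (sketch).

    env.reset() -> joint state (e.g., the SINR vector of all D2D pairs);
    env.step(joint_action) -> (next_state, rewards, done).
    """
    state = env.reset()
    trajectory = []
    for _ in range(n_steps):
        # Each D2D agent picks mode (licensed/unlicensed), power level and RB
        joint_action = tuple(actor.act(state) for actor in actors)
        next_state, rewards, done = env.step(joint_action)
        trajectory.append((state, joint_action, rewards, next_state))
        state = next_state
        if done:
            break
    return trajectory  # uploaded to the BS for centralized COMA training
```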
As shown in fig. 1, a joint intelligent allocation method for authorized and unauthorized D2D communication resources includes the following steps:
S1: establishing a D2D user communication model;
S2: establishing an objective function to be optimized;
S3: establishing a D2D communication model of multi-agent deep reinforcement learning;
S4: setting an action set, a state set and a reward function of the multi-agent system;
S5: each agent takes an action according to its own Actor network and obtains the state, the reward and the next state;
S6: calculating the TD error of the Critic network, updating the Critic network parameters, calculating the counterfactual baseline of each agent with the Critic network, updating the Actor network parameters through the counterfactual baseline, and updating the state;
S7: repeating steps S5-S6 until the target state is reached.
Firstly, the number of D2D pairs accessing the WiFi band is calculated. In the licensed-band multiplexing mode, two D2D users can reuse the uplink of the same existing cellular user to communicate directly. The spectral efficiency of D2D pair k when it multiplexes the channel of cellular user m is:
Figure GDA0003608705510000091
where p_{k,m} is the transmit power of the kth D2D pair,
Figure GDA0003608705510000092
is the transmit power of cellular user m, B_C is the licensed sub-channel bandwidth,
Figure GDA0003608705510000093
is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k; x_i = 1 indicates that the ith D2D pair multiplexes the channel of an uplink cellular user, while x_i = 0 indicates that the ith D2D pair accesses the unlicensed WiFi band; θ_{i,m} = 1 indicates that the ith D2D pair multiplexes the channel of uplink cellular user m, and θ_{i,m} = 0 indicates that it does not; the spectral efficiency of cellular user m when multiplexed by D2D pair k is:
Figure GDA0003608705510000094
where
Figure GDA0003608705510000095
is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between the transmitter of D2D pair k and the base station; D2D communication has a great influence on cellular and WiFi users, so it is proposed that, after determining the maximum number of D2D pairs that can access the unlicensed WiFi band under the constraint of the WiFi users' minimum throughput, mode selection and resource allocation are performed for all D2D users, so as to minimize the degradation that D2D communication causes to cellular and WiFi users.
In order to maximize the system throughput of the cellular users and the licensed-band D2D users, the optimization problem is formulated as:
Figure GDA0003608705510000096
Figure GDA0003608705510000101
Figure GDA0003608705510000102
where m ∈ M = {1,2, …, M}; in constraint c1, x_k indicates the D2D user's choice between licensed and unlicensed access, and θ_{k,m} indicates the selection of the sub-channel multiplexed by the D2D pair; c2 represents the power limit of a D2D user, where p_max represents the maximum transmit power of the D2D transmitter; c3 indicates that the minimum WiFi throughput requirement is met, where S_W and
Figure GDA0003608705510000103
denote the WiFi throughput and its minimum threshold, respectively; c4 and c5 ensure that D2D users and cellular users meet their minimum signal-to-noise ratio requirements, where SINR_{D2D} and SINR_{CU} denote the signal-to-noise ratios of the D2D pair and of the CU, respectively, and
Figure GDA0003608705510000104
and
Figure GDA0003608705510000105
denote the minimum signal-to-noise ratio requirements that the D2D pair and the CU need to meet, respectively.
In order to solve the NP-hard problem in D2D communication resource allocation, a multi-agent reinforcement learning method, the COMA (Counterfactual Multi-Agent) algorithm, is adopted. Each D2D transmitter is first taken as an agent, and the multi-agent environment is modeled as a Markov game to optimize the strategy while taking the behavior strategies of the other agents into account. The influence of a single agent on the reward is marginalized out by comparing the action taken by the agent at a time t with all other actions it could have taken at t; this is implemented by a centralized Critic, so all agents share the same value function, but each agent receives a customized error term according to its counterfactual actions. COMA computes the baseline by marginalizing over the current agent's policy with the current action-value function, thereby avoiding the design of extra default actions and additional simulation computation. The training process is completed at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is carried out at the BS, and the Critic obtains the agents' strategies at the BS to evaluate how good the taken actions are. During distributed execution, each D2D user obtains the advantage A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges to the action with the maximum reward value, thereby obtaining the optimal strategy. The AC network model is shown in fig. 3.
Further, in step S4, each D2D agent interacts with the environment and takes the corresponding action according to its policy; at each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, selects the corresponding mode, RB and power level from the action space A; after executing this action, the environment enters a new state s_{t+1} and the agent receives a reward; thus, the state space, action space and reward function are set as follows:
state space S: at any time t, the system state is represented by the joint SINR value of all D2D at that time t as:
Figure GDA0003608705510000111
wherein
Figure GDA0003608705510000112
denotes the local observation state of the nth D2D transmitter at time t;
an action space A:
Figure GDA0003608705510000113
denoting the selection between the licensed and unlicensed bands, the power level selection and the RB selection, respectively, where the mode selection has α = 2 options, the power level has β = 10 options and the RB selection has η = 20 options, so the action space size of each agent is α × β × η = 400;
The reward function R: the reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D pair and the cellular user, and the signal-to-noise ratios of both. If the selected mode is access to the unlicensed band, the agent's reward is set to a positive value, but once the number of D2D pairs exceeds the admissible maximum, a large negative value is obtained. If the action taken by the agent makes the signal-to-noise ratios of both the cellular user and the D2D user larger than the set thresholds, the reward is the sum of the corresponding rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent causes the signal-to-noise ratio of the D2D pair or of the cellular user to fall below the set threshold, the reward is a negative value. To limit the number of users accessing the unlicensed band, the following function is designed:
Figure GDA0003608705510000114
limiting the SINR of D2D and CU, and obtaining the reward function of j-th D2D user as:
Figure GDA0003608705510000115
in the formula
Figure GDA0003608705510000116
Figure GDA0003608705510000121
Wherein r isiIndicating the instant prize earned by the jth D2D; n is a radical ofmaxA maximum value representing an allowed access to the unlicensed frequency band;
Figure GDA0003608705510000122
denote, respectively: the reward obtained by the ith agent when the number of agents accessing the unlicensed band satisfies the limit; the reward obtained by the ith agent when it accesses the unlicensed band but that limit is not satisfied; the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the minimum signal-to-noise ratio thresholds of both the ith agent and the multiplexed CU are met; and the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the signal-to-noise ratio of the ith agent or of the multiplexed CU is not met;
Figure GDA0003608705510000123
and
Figure GDA0003608705510000124
respectively representing the spectral efficiency of the ith D2D user pair and the CU multiplexed by the ith D2D pair.
First, the hyper-parameters γ, α_θ, α_λ,
Figure GDA0003608705510000125
and β, the state s_0, and the parameters λ,
Figure GDA0003608705510000126
θ_0, θ_1, ..., θ_M of the Actor and Critic networks are first initialized; each agent then takes the action with the maximum probability according to its own policy network, the actions taken by all agents in the environment state s_t are combined into the joint action u_t, and the D2D users receive the reward
Figure GDA0003608705510000127
and the next state s_{t+1}. The AC network model is shown in fig. 3.
The training process is completed by the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is completed at the BS, and the Critic obtains the agents' strategies at the BS to evaluate how good the taken actions are. During distributed execution, each D2D user obtains A_j(s, u) from the base station to update its own Actor network; the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges to the action with the maximum reward value, thereby obtaining the optimal strategy.
Further, in step S6, the TD error is calculated by the Critic network:
Figure GDA0003608705510000128
wherein
Figure GDA0003608705510000129
represents the maximum action value in state s_{t+1},
Figure GDA00036087055100001210
represents the action value, in state s_t, of the action selected according to the policy function
Figure GDA00036087055100001211
; the Critic network parameters are then updated using the gradient ascent method:
Figure GDA00036087055100001212
where α_λ denotes the learning rate of the Critic network, u_t denotes the joint action, and
Figure GDA00036087055100001213
denotes the gradient, with respect to λ, of the action-value function for joint action u_t in state s_t. The credit assignment problem is solved using the counterfactual baseline of the COMA algorithm: the influence of a single agent on the reward is marginalized out by comparing the action taken by the agent at a time t with all other actions it could have taken at t; this is implemented by the centralized Critic, so the value functions of all agents are the same, but each agent obtains a specific error term based on its counterfactual behavior. The counterfactual baseline of the jth agent is defined as:
Figure GDA0003608705510000131
where
Figure GDA0003608705510000132
denotes an alternative action that agent j could take instead of its current action, u^{-j} denotes the joint action of the agents other than agent j,
Figure GDA0003608705510000133
denotes the probability, in state s_t, of taking action
Figure GDA0003608705510000134
, and Q_λ(s_t, (u^{-j}, a'_j)) denotes the action value in state s_t when agent j takes action
Figure GDA0003608705510000135
while the actions of the other agents remain unchanged. The jth agent uses the counterfactual baseline A_j(s, u) to update its Actor network parameters according to the formula:
Figure GDA0003608705510000136
where
Figure GDA0003608705510000137
represents the policy parameter of the jth agent at time t, α_θ represents the learning rate of the agent, and
Figure GDA0003608705510000138
represents the policy gradient of the jth agent in state s_t; according to the advantage function A_j(s, u) obtained from the Critic network, the agent updates the parameters of its Actor network. A schematic diagram of the COMA algorithm for multi-agent deep reinforcement learning is shown in FIG. 4.
Further, in step S7, steps S5-S6 are repeated until the target state is reached.
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims (1)

1. A joint intelligent allocation method for authorized and unauthorized D2D communication resources comprises the following steps:
s1: establishing a D2D user communication model: the number of D2D pairs accessing the WiFi band is obtained through calculation, and in the licensed-band multiplexing mode two D2D users can multiplex the uplink of the same existing cellular user to communicate directly; the spectral efficiency of D2D pair k when it multiplexes the channel of cellular user m is:
Figure FDA0003608705500000011
where p_{k,m} is the transmit power of the kth D2D pair,
Figure FDA0003608705500000012
is the transmit power of cellular user m, B_C is the licensed sub-channel bandwidth,
Figure FDA0003608705500000013
is the noise power density, and h_{k,m} is the interference power gain between cellular user m and the receiver of D2D pair k; x_i = 1 indicates that the ith D2D pair multiplexes the channel of an uplink cellular user, while x_i = 0 indicates that the ith D2D pair accesses the unlicensed WiFi band; θ_{i,m} = 1 indicates that the ith D2D pair multiplexes the channel of uplink cellular user m, and θ_{i,m} = 0 indicates that it does not; the spectral efficiency of cellular user m when multiplexed by D2D pair k is:
Figure FDA0003608705500000014
where
Figure FDA0003608705500000015
is the channel power gain between cellular user m and the base station, and h_{k,B} is the channel gain between the transmitter of D2D pair k and the base station; D2D communication has a great influence on cellular and WiFi users, so the method proposes, after determining the maximum number of D2D pairs capable of accessing the unlicensed WiFi band under the condition of meeting the minimum throughput of the WiFi users, performing mode selection and resource allocation for all D2D users so as to minimize the degradation of cellular and WiFi users caused by D2D communication;
s2: establishing an objective function to be optimized: in order to maximize the system throughput of the cellular users and the licensed-band D2D users, the optimization problem is formulated as:
Figure FDA0003608705500000016
Figure FDA0003608705500000021
where m ∈ M = {1,2, …, M}; in constraint c1, x_k indicates the D2D user's choice between licensed and unlicensed access, and θ_{k,m} indicates the selection of the sub-channel multiplexed by the D2D pair; c2 represents the power limit of a D2D user, where p_max represents the maximum transmit power of the D2D transmitter; c3 indicates that the minimum WiFi throughput requirement is met, where S_W and
Figure FDA0003608705500000022
denote the WiFi throughput and its minimum threshold, respectively; c4 and c5 ensure that D2D users and cellular users meet their minimum signal-to-noise ratio requirements, where SINR_{D2D} and SINR_{CU} denote the signal-to-noise ratios of the D2D pair and of the CU, respectively, and
Figure FDA0003608705500000023
and
Figure FDA0003608705500000024
denote the minimum signal-to-noise ratio requirements that the D2D pair and the CU need to meet, respectively;
s3: establishing a D2D communication model of multi-agent deep reinforcement learning: in order to solve the NP-hard problem in D2D communication resource allocation, a multi-agent reinforcement learning method, the COMA (Counterfactual Multi-Agent) algorithm, is adopted; each D2D transmitter is first taken as an agent, the influence of a single agent on the reward is marginalized out, and the action taken by the agent at a time t is compared with all other actions it could have taken at t; this is realized through a centralized Critic, so the value functions of all agents are the same, and each agent receives a customized error term according to its counterfactual actions; COMA computes the baseline by marginalizing over the current agent's policy with the current action-value function, thereby avoiding the design of additional default actions and additional simulation calculations; the training process is finished at the BS: the historical information collected by the D2D users during execution is uploaded to the BS, centralized training is finished at the BS, and the Critic obtains the agents' strategies at the BS to evaluate how good the taken actions are; during distributed execution, each D2D user obtains A_j(s, u) to update its own Actor network, wherein the Actor selects actions based on the state the agent observes from the environment, the agent continuously interacts with the environment, and after sufficient training it finally converges on the action with the maximum reward value, thereby obtaining the optimal strategy;
s4: setting the action set, state set and reward function of the multi-agent system: each D2D agent interacts with the environment and takes the corresponding action according to its policy; at each time t, the D2D agent observes a state s_t from the state space S and, according to the policy π, selects the corresponding mode, RB and power level from the action space A; after executing this action, the environment enters a new state s_{t+1} and the agent receives a reward; thus, the state space, action space and reward function are set as follows:
state space S: at any time t, the system state is represented by the joint SINR value of all D2D at that time t as:
Figure FDA0003608705500000031
wherein
Figure FDA0003608705500000032
denotes the local observation state of the nth D2D transmitter at time t;
an action space A:
Figure FDA0003608705500000033
denoting the selection between the licensed and unlicensed bands, the power level selection and the RB selection, respectively, where the mode selection has α = 2 options, the power level has β = 10 options and the RB selection has η = 20 options, so the action space size of each agent is α × β × η = 400;
The reward function R: the reward function is designed in three parts: the mode selected by the D2D pair, the rates of the D2D pair and the cellular user, and the signal-to-noise ratios of both. If the selected mode is access to the unlicensed band, the agent's reward is set to a positive value, but once the number of D2D pairs exceeds the admissible maximum, a large negative value is obtained. If the action taken by the agent makes the signal-to-noise ratios of both the cellular user and the D2D user larger than the set thresholds, the reward is the sum of the corresponding rate and the rate of the cellular user whose spectrum is multiplexed; conversely, if the action taken by the agent causes the signal-to-noise ratio of the D2D pair or of the cellular user to fall below the set threshold, the reward is a negative value. To limit the number of users accessing the unlicensed band, the following function is designed:
Figure FDA0003608705500000034
limiting the SINR of D2D and CU, and obtaining the reward function of j-th D2D user as:
Figure FDA0003608705500000035
in the formula
Figure FDA0003608705500000036
Figure FDA0003608705500000037
Wherein r isiIndicating the instant prize earned by the jth D2D; n is a radical ofmaxA maximum value representing an allowed access to the unlicensed frequency band;
Figure FDA0003608705500000038
denote, respectively: the reward obtained by the ith agent when the number of agents accessing the unlicensed band satisfies the limit; the reward obtained by the ith agent when it accesses the unlicensed band but that limit is not satisfied; the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the minimum signal-to-noise ratio thresholds of both the ith agent and the multiplexed CU are met; and the instantaneous reward obtained by the ith agent when it does not access the unlicensed band and the signal-to-noise ratio of the ith agent or of the multiplexed CU is not met;
Figure FDA0003608705500000041
and
Figure FDA0003608705500000042
respectively representing the spectral efficiencies of the ith D2D user pair and the CU multiplexed by the ith D2D pair;
s5: each agent takes an action according to its own Actor network and obtains the state, the reward and the next state: each agent takes the action with the maximum probability according to its own policy network, the actions taken by all agents in the environment state s_t are combined into the joint action u_t, and the D2D users receive the reward
Figure FDA0003608705500000043
and the next state s_{t+1};
S6: calculating TD error of the Critic network, updating parameters of the Critic network, calculating counterfactual baselines of all agents by the Critic network, updating parameters of the Actor network through the counterfactual baselines, and updating the state: calculating TD error from Critic network:
Figure FDA0003608705500000044
wherein
Figure FDA0003608705500000045
represents the maximum action value in state s_{t+1},
Figure FDA0003608705500000046
represents the action value, in state s_t, of the action selected according to the policy function
Figure FDA0003608705500000047
; the Critic network parameters are then updated using the gradient ascent method:
Figure FDA0003608705500000048
where α_λ denotes the learning rate of the Critic network, u_t denotes the joint action, and
Figure FDA00036087055000000415
denotes the gradient, with respect to λ, of the action-value function for joint action u_t in state s_t. The credit assignment problem is solved using the counterfactual baseline of the COMA algorithm: the influence of a single agent on the reward is marginalized out by comparing the action taken by the agent at a time t with all other actions it could have taken at t; this is implemented by the centralized Critic, so the value functions of all agents are the same, but each agent obtains a specific error term based on its counterfactual behavior. The counterfactual baseline of the jth agent is defined as:
Figure FDA0003608705500000049
where
Figure FDA00036087055000000410
denotes an alternative action that agent j could take instead of its current action, u^{-j} denotes the joint action of the agents other than agent j,
Figure FDA00036087055000000411
denotes the probability, in state s_t, of taking action
Figure FDA00036087055000000412
, and Q_λ(s_t, (u^{-j}, a'_j)) denotes the action value in state s_t when agent j takes action
Figure FDA00036087055000000413
while the actions of the other agents remain unchanged. The jth agent uses the counterfactual baseline A_j(s, u) to update its Actor network parameters according to the formula:
Figure FDA00036087055000000414
where
Figure FDA0003608705500000051
represents the policy parameter of the jth agent at time t, α_θ represents the learning rate of the agent, and
Figure FDA0003608705500000052
represents the policy gradient of the jth agent in state s_t; according to the advantage function A_j(s, u) obtained from the Critic network, the parameters of the Actor network are updated;
s7: steps S5-S6 are repeated until the target state is reached.
CN202110581716.5A 2021-05-26 2021-05-26 Authorized and unauthorized D2D communication resource joint intelligent distribution method Active CN113316154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110581716.5A CN113316154B (en) 2021-05-26 2021-05-26 Authorized and unauthorized D2D communication resource joint intelligent distribution method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110581716.5A CN113316154B (en) 2021-05-26 2021-05-26 Authorized and unauthorized D2D communication resource joint intelligent distribution method

Publications (2)

Publication Number Publication Date
CN113316154A CN113316154A (en) 2021-08-27
CN113316154B true CN113316154B (en) 2022-06-21

Family

ID=77375597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110581716.5A Active CN113316154B (en) 2021-05-26 2021-05-26 Authorized and unauthorized D2D communication resource joint intelligent distribution method

Country Status (1)

Country Link
CN (1) CN113316154B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363938B (en) * 2021-12-21 2024-01-26 深圳千通科技有限公司 Cellular network flow unloading method
CN114466386B (en) * 2022-01-13 2023-09-29 深圳市晨讯达科技有限公司 Direct access method for D2D communication
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning
CN117651346A (en) * 2022-08-12 2024-03-05 华为技术有限公司 Training method for reinforcement learning and related device
CN116367332B (en) * 2023-05-31 2023-09-15 华信咨询设计研究院有限公司 Hierarchical control-based D2D resource allocation method under 5G system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110769514A (en) * 2019-11-08 2020-02-07 山东师范大学 Heterogeneous cellular network D2D communication resource allocation method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3187015A4 (en) * 2014-09-03 2017-09-20 Huawei Technologies Co., Ltd. System and method for d2d resource allocation
FR3072851B1 (en) * 2017-10-23 2019-11-15 Commissariat A L'energie Atomique Et Aux Energies Alternatives REALIZING LEARNING TRANSMISSION RESOURCE ALLOCATION METHOD
US10505616B1 (en) * 2018-06-01 2019-12-10 Samsung Electronics Co., Ltd. Method and apparatus for machine learning based wide beam optimization in cellular network
CN110493826B (en) * 2019-08-28 2022-04-12 重庆邮电大学 Heterogeneous cloud wireless access network resource allocation method based on deep reinforcement learning
US11678272B2 (en) * 2019-10-30 2023-06-13 University Of Ottawa System and method for joint power and resource allocation using reinforcement learning
CN111556572B (en) * 2020-04-21 2022-06-07 北京邮电大学 Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN112822781B (en) * 2021-01-20 2022-04-12 重庆邮电大学 Resource allocation method based on Q learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN110769514A (en) * 2019-11-08 2020-02-07 山东师范大学 Heterogeneous cellular network D2D communication resource allocation method and system

Also Published As

Publication number Publication date
CN113316154A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113316154B (en) Authorized and unauthorized D2D communication resource joint intelligent distribution method
CN109474980B (en) Wireless network resource allocation method based on deep reinforcement learning
Luo et al. Dynamic resource allocations based on Q-learning for D2D communication in cellular networks
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
US20210326695A1 (en) Method and apparatus employing distributed sensing and deep learning for dynamic spectrum access and spectrum sharing
Gu et al. Dynamic path to stability in LTE-unlicensed with user mobility: A matching framework
CN111586646B (en) Resource allocation method for D2D communication combining uplink and downlink channels in cellular network
Amichi et al. Spreading factor allocation strategy for LoRa networks under imperfect orthogonality
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN114363908A (en) A2C-based unlicensed spectrum resource sharing method
CN112153744A (en) Physical layer security resource allocation method in ICV network
CN113453358B (en) Joint resource allocation method of wireless energy-carrying D2D network
Jiang Reinforcement learning-based spectrum sharing for cognitive radio
CN114126021A (en) Green cognitive radio power distribution method based on deep reinforcement learning
Lall et al. Multi-agent reinfocement learning for stochastic power management in cognitive radio network
CN103974266B (en) A kind of method of relay transmission, equipment and system
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN113330767A (en) Spectrum management device, electronic device, wireless communication method, and storage medium
CN107172574B (en) Power distribution method for D2D user to sharing frequency spectrum with cellular user
CN115866787A (en) Network resource allocation method integrating terminal direct transmission communication and multi-access edge calculation
CN115915454A (en) SWIPT-assisted downlink resource allocation method and device
CN114928857A (en) Direct connection anti-interference configuration method for mobile equipment of cellular communication network
CN114423070A (en) D2D-based heterogeneous wireless network power distribution method and system
Zhang et al. A convolutional neural network based resource management algorithm for NOMA enhanced D2D and cellular hybrid networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant