CN109474980B - Wireless network resource allocation method based on deep reinforcement learning - Google Patents

Wireless network resource allocation method based on deep reinforcement learning

Info

Publication number
CN109474980B
Authority
CN
China
Prior art keywords
reinforcement learning
eval
deep reinforcement
subcarrier
learning model
Prior art date
Legal status
Active
Application number
CN201811535056.1A
Other languages
Chinese (zh)
Other versions
CN109474980A
Inventor
张海君
刘启瑞
皇甫伟
董江波
隆克平
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201811535056.1A
Publication of CN109474980A
Application granted
Publication of CN109474980B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/143Downlink power control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/26TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service]
    • H04W52/265TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service] taking into account the quality of service QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/30TPC using constraints in the total amount of available transmission power
    • H04W52/34TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W52/346TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/53Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/543Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource

Abstract

The invention provides a wireless network resource allocation method based on deep reinforcement learning, which can maximize energy efficiency in a time-varying channel environment with low complexity. The method comprises the following steps: establishing a deep reinforcement learning model; modeling the time-varying channel environment between a base station and user terminals as a finite-state time-varying Markov channel, determining normalized channel coefficients, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users; according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining a return function based on the allocated downlink power, and feeding the return function back to the deep reinforcement learning model; training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and determining the locally optimal power allocation in the time-varying channel environment. The invention relates to the fields of wireless communication and artificial intelligence decision making.

Description

Wireless network resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication and artificial intelligence decision making, in particular to a wireless network resource allocation method based on deep reinforcement learning.
Background
Starting from the Long Term Evolution (LTE) era, the networking architecture has shifted from macro-only networks to macro-micro cooperation, and the sustainable development of the Macro Cell faces many challenges, such as unpredictable traffic growth, ubiquitous access demands, random hotspot deployment, and the great cost pressure of the macro cell itself. Small base stations (Small Cells), such as micro cells and home base stations, therefore show their advantages of accurate coverage and blind-area supplementation, and have gradually become an important means of cooperating with macro base stations in network deployment and of sharing the macro base stations' service load. The fifth generation of mobile communication is an extension of 4G; 5G is not a single radio access technology but a general term for the solution formed by the evolution and integration of several new radio access technologies and existing radio access technologies. 5G networks have now come into public view, and the user-experienced data rate is generally regarded as the most important 5G performance indicator. The technical features of 5G can be summarized by a few numbers: a 1000x capacity boost, support for more than 100 billion connections, a peak rate of 10 Gb/s, and a latency of 1 ms or less. The main 5G technologies include massive multi-antenna systems, novel multiple access techniques and ultra-dense networks, in which the deployment of small base stations together with macro base stations forms an ultra-dense heterogeneous network that provides ubiquitous services for users.
With the rapid growth in the number of mobile users, the deployment of small base stations is also becoming ultra-dense, and the energy consumption of the wireless communication field is very large. Given the serious environmental pollution and increasingly scarce energy in China, green communication is inevitably a direction worth researching and exploring. Therefore, on the basis of satisfying users' data demands and quality of service, achieving higher energy efficiency through a reasonable resource allocation scheme is an important research direction.
Disclosure of Invention
The invention aims to provide a wireless network resource allocation method based on deep reinforcement learning, so as to solve the problem that wireless resource allocation in a time-varying channel environment cannot be effectively realized in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a wireless network resource allocation method based on deep reinforcement learning, including:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model;
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold is within a preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
Further, the normalized channel coefficient is expressed as:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k; and σ_k^2 denotes the noise power on subcarrier k.
Further, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users comprises:
inputting the normalized channel coefficients into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and allocates subcarriers to the users;
where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a'; θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is obtained from the index of the action with the maximum return value.
Further, the downlink power allocated to a user is expressed as:
p_{n,k} = P'_k · (H_{n,k})^{-a} / Σ_{i=1}^{K_max} (H_{i,k})^{-a}
where p_{n,k} denotes the downlink transmit power allocated by the base station to user terminal n on subcarrier k; P'_k denotes the downlink transmit power allocated by the base station on subcarrier k; a denotes the attenuation factor; and K_max denotes the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network under the complexity that the current successive interference canceller can bear.
Further, determining the system energy efficiency based on the allocated downlink power comprises:
determining the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n;
determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result and the allocated downlink power;
determining the system energy efficiency according to the determined r_{n,k} and U_P(X).
Further, the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n is expressed as:
r_{n,k} = log2(1 + γ_{n,k})
γ_{n,k} = p_{n,k} H_{n,k} / (1 + H_{n,k} Σ_{i: |H_{i,k}| > |H_{n,k}|} p_{i,k})
where γ_{n,k} denotes the signal-to-interference-plus-noise ratio of the signal that user terminal n obtains on subcarrier k;
the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the circuit power consumption, ψ denotes the base station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k.
Further, the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k},  with  ee_{n,k} = B_SC · r_{n,k} / U_P(X)
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station.
Further, determining the return function based on the system energy efficiency and feeding the return function back to the deep reinforcement learning model comprises:
penalizing, through a weakly supervised algorithm based on value return, the system energy efficiency that does not satisfy the preset modeling constraints, according to the type of constraint that is violated, so as to obtain the return function after the deep reinforcement learning model makes its decision action, and feeding the return function back to the deep reinforcement learning model; the return function reward_t equals the system energy efficiency when all modeling constraints are satisfied, and is scaled by the corresponding penalty coefficient when a constraint is violated;
where reward_t denotes the return function calculated during the t-th training iteration; R_min denotes the minimum standard of user quality of service, i.e. the minimum downlink transmission rate; H_inter denotes the normalized channel coefficient corresponding to the shortest distance between the currently optimized base station and the nearest base station working at the same subcarrier frequency; I_k denotes the upper limit of cross-layer interference that the k-th subcarrier band can bear; and ξ_case1 ~ ξ_case3 denote the penalty coefficients applied to the system energy efficiency in the three cases that do not satisfy the modeling constraints.
Further, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and, if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or those values are higher than the preset threshold, taking the currently allocated downlink power as the locally optimal power allocation in the time-varying channel environment, comprises:
storing the return function, the channel environment, the decision action and the next state after the transition as a quadruple in the memory playback unit memory of the deep reinforcement learning model, where the memory is expressed as:
memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))
where s(t) denotes the state input during the t-th training of the deep reinforcement learning model; a(t) denotes the decision action made by the deep reinforcement learning model during the t-th training; r(t) denotes the return function reward_t obtained after the deep reinforcement learning model performs action a(t) during the t-th training; and s(t+1) denotes the next state, updated according to the finite-state time-varying Markov channel, used in the (t+1)-th training of the deep reinforcement learning model;
randomly selecting memory data from the memory playback unit of the deep reinforcement learning model for learning of the two convolutional neural networks and gradient-descent updating, where the gradient descent only updates the parameters of the convolutional neural network q_eval, and during training of the deep reinforcement learning model the parameter θ_target of q_target is updated to the parameter θ_eval of q_eval every fixed number of iterations;
if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
Further, the gradient descent update formula is expressed as:
θ_eval ← θ_eval + β [ r(t) + λ max_{a'} Q(s(t+1), a'; θ_target) − Q(s(t), a(t); θ_eval) ] ∇_{θ_eval} Q(s(t), a(t); θ_eval)
where β denotes the training learning rate; λ denotes the discount factor for the evaluation of the decision body's attitude; max_{a'} Q(s(t+1), a'; θ_target) denotes the maximum return that the convolutional neural network q_target with weights θ_target decides can be harvested by some action a' when the input is the next state s(t+1) of the current memory e(t); Q(s(t), a(t); θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a(t) with the state s(t) of the current memory e(t) as input; and ∇_{θ_eval} denotes the gradient operation performed with respect to the parameters θ_eval of the convolutional neural network.
The technical scheme of the invention has the following beneficial effects:
In this scheme, two convolutional neural networks q_eval and q_target are established to form a deep reinforcement learning model; the time-varying channel environment between the base station and the user terminals is modeled as a finite-state time-varying Markov channel, the normalized channel coefficients between the base station and the users are determined and input into the convolutional neural network q_eval, the action with the maximum output return value is selected as the decision action, and subcarriers are allocated to the users; according to the subcarrier allocation result, downlink power is allocated to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, the system energy efficiency is determined based on the allocated downlink power, a return function is determined based on the system energy efficiency, and the return function is fed back to the deep reinforcement learning model; the convolutional neural networks q_eval and q_target in the deep reinforcement learning model are trained according to the determined return function, and if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment. In this way, by modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel and using a deep reinforcement learning model on top of this highly complex time-varying channel, the computational complexity is shifted into the training of the deep reinforcement learning model, so that a decision action can be selected with low complexity, the locally optimal allocation of subcarriers from the base station to the user terminals in the time-varying channel environment is determined, and the energy efficiency in the time-varying channel environment is maximized.
Drawings
Fig. 1 is a schematic flowchart of a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a detailed flowchart of a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a wireless network resource allocation method based on deep reinforcement learning, aiming at the problem that the wireless resource allocation in a time-varying channel environment cannot be effectively realized in the prior art.
As shown in fig. 1, a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention includes:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model (Deep Q Network, DQN);
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
The wireless network resource allocation method based on deep reinforcement learning of this embodiment establishes two convolutional neural networks q_eval and q_target to form a deep reinforcement learning model; models the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determines the normalized channel coefficients between the base station and the users, inputs them into the convolutional neural network q_eval, selects the action with the maximum output return value as the decision action, and allocates subcarriers to the users; according to the subcarrier allocation result, allocates downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determines the system energy efficiency based on the allocated downlink power, determines a return function based on the system energy efficiency, and feeds the return function back to the deep reinforcement learning model; and trains the convolutional neural networks q_eval and q_target according to the determined return function, so that if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment. In this way, by modeling the time-varying channel environment as a finite-state time-varying Markov channel and using a deep reinforcement learning model, the computational complexity is shifted into the training of the model, so that a decision action is selected with low complexity, the locally optimal allocation of subcarriers from the base station to the user terminals in the time-varying channel environment is determined, and the energy efficiency in the time-varying channel environment is maximized.
Deep reinforcement learning in this embodiment is an artificial-intelligence decision method characterized by a decision body making sequential decisions in a dynamically changing environment; the states, actions and rewards required for deep reinforcement learning can be constructed so that, as the deep reinforcement learning model is trained, the decision body acts automatically and optimizes its decision actions. The wireless network resource allocation method based on deep reinforcement learning can simulate a time-varying channel environment, optimize the allocation of wireless network resources in a time-varying network scenario to the greatest extent with low computational complexity, and thereby jointly achieve fast decision making and improved energy efficiency. The trained deep reinforcement learning model can be used continuously to manage wireless resources in the time-varying channel environment and to make fast, high-return decisions. In large-scale wireless network optimization, the deep reinforcement learning model can be computed in a distributed manner, which further reduces complexity.
In order to better understand the method for allocating wireless network resources based on deep reinforcement learning in this embodiment, the method is described in detail, and the specific steps may include:
A11, constructing the deep reinforcement learning model DQN
In this embodiment, two convolutional neural networks q_eval and q_target with identical parameters are initially established to form a deep reinforcement learning model. The decision process of the deep reinforcement learning model is determined by a Q function Q(s, a; θ), where θ denotes the weight parameters of a convolutional neural network; the parameters of the convolutional neural networks q_eval and q_target are θ_eval and θ_target respectively, and the two are identical at initialization. The Q function Q(s, a; θ) denotes the return value obtained when the convolutional neural network with weights θ performs action a in state s.
In this embodiment, each convolutional neural network consists of two convolutional layers, two pooling layers and two fully connected layers. Each training input has shape [n_samples, N, K]: the first dimension n_samples is the number of input samples, and the second and third dimensions ([N, K]) describe one input sample, i.e. a normalized channel coefficient matrix of dimension [N, K]. In each training iteration, n_samples normalized channel coefficient matrices of dimension [N, K] are fed into the convolutional neural network; the output is, for the current channel state, all possible actions together with the return value Q_action_val obtained by each action. The data structure of Q_action_val is a one-dimensional vector of length Action_num, where Action_num denotes the number of all possible actions. Since n_samples channel states are input and each state yields the return values of all actions, the output is a two-dimensional matrix formed by n_samples one-dimensional vectors of length Action_num.
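The structure just described can be illustrated with a short PyTorch sketch of the Q-network used for q_eval and q_target. The kernel sizes, channel counts, hidden width and the default action count below are illustrative assumptions not specified in the patent; only the two-convolution/two-pooling/two-fully-connected layout and the [n_samples, N, K] → [n_samples, Action_num] mapping follow the description above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """q_eval / q_target: maps an [N, K] normalized channel matrix to Action_num return values."""
    def __init__(self, n_users=6, n_subcarriers=3, action_num=90):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=1),
            nn.Conv2d(16, 32, kernel_size=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=1),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 1, n_users, n_subcarriers)).numel()
        self.head = nn.Sequential(
            nn.Linear(flat, 64), nn.ReLU(),
            nn.Linear(64, action_num),      # one return value per possible allocation action
        )

    def forward(self, h_norm):              # h_norm: [n_samples, N, K]
        x = h_norm.unsqueeze(1)             # add a channel dimension -> [n_samples, 1, N, K]
        x = self.features(x)
        return self.head(x.flatten(1))      # -> [n_samples, Action_num]

q_eval = QNetwork()
q_target = QNetwork()
q_target.load_state_dict(q_eval.state_dict())   # identical parameters at initialization
```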
A12, modeling the time-varying channel environment between base station and user terminal as finite-state time-varying Markov channel, determining the normalized channel coefficient between base station and user, and inputting it to convolutional neural network qevalSelecting the action with the maximum output return value as a decision action, and allocating subcarriers for the user
In this embodiment, a plurality of common-frequency Small Base Stations (SBS) are deployed within a certain range, and the small base stations include an outdoor micro base station, a pico base station, and an indoor home base station. Within the coverage area of each small base station, 6 user terminals (UE) and 3 Subcarriers (SC) available in the non-orthogonal multiple access network are distributed in a certain area by taking the small base station as the center. In the embodiment, an independent deep reinforcement learning model is operated on each small base station, so that the effect of distributed processing is achieved. Initializing parameters of the small cell and the user terminal, including but not limited to: SBS and UEnNormalized channel coefficient H on subcarrier kn,kA channel bandwidth B and a sub-carrier channel bandwidth B allocated to the base stationSCThe circuit consumes power pkEtc., wherein the UEnIndicating user terminal n, SCkRepresenting a sub-carrier k while initializing a user-sub-carrier correlation matrix XN,KAnd Finite State time varying Markov Channel (FSMC) transition probability matrix
Figure BDA0001906660680000091
N represents a set of user terminals, and K represents a set of usable subcarriers under the current base station; user-subcarrier incidence matrix X obtained by initializationN,KAnd finite state time varying Markov channel transition probability matrix
Figure BDA0001906660680000092
Used for subsequent user association matrix optimization and calculation of updated channel state.
In this embodiment, the channel environment of the optimization scenario obtains its initial coordinates by randomly scattering the nodes in space; the initial normalized channel coefficient matrix is calculated, and the obtained values are quantized into ten levels with quantization boundaries bound_0, ..., bound_9, corresponding to the states of the finite-state time-varying Markov channel. The optimized scenario then evolves according to the time-varying Markov channel transition probability matrix P_FSMC. An element of the transition probability matrix P_FSMC is the probability transition indicator p_{i,j}, where i denotes the current state and j denotes the next state (the state after the action is performed in the current state), so that p_{i,j} denotes the probability of transitioning from the current state i to the next state j. It is stipulated that p_{i,j} takes its maximum value when i = j, i.e. the probability of keeping the original channel state is the largest, and the probability of transitioning to the second-nearest adjacent state is half the probability of transitioning to the nearest adjacent state; in each iteration, the environment is updated according to P_FSMC.
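The channel update described above can be simulated roughly as in the following sketch; the ten-state transition matrix values below are placeholders chosen only to respect the stated structure (staying in the current state is most likely, and moving two states away is half as likely as moving one state away), not values taken from the patent.

```python
import numpy as np

N_STATES = 10   # ten quantization levels bound_0 ... bound_9

def make_transition_matrix(p_stay=0.6, p_adj=0.15):
    """Row-stochastic FSMC matrix: staying put is most likely; a jump of two
    states is half as likely as a jump of one state."""
    P = np.zeros((N_STATES, N_STATES))
    for i in range(N_STATES):
        P[i, i] = p_stay
        for j in range(N_STATES):
            if abs(i - j) == 1:
                P[i, j] = p_adj
            elif abs(i - j) == 2:
                P[i, j] = p_adj / 2
        P[i] /= P[i].sum()               # renormalize rows truncated at the boundaries
    return P

def step_channel(states, P, rng):
    """states: [N, K] matrix of quantized channel-state indices; one FSMC step."""
    flat = states.ravel()
    nxt = np.array([rng.choice(N_STATES, p=P[s]) for s in flat])
    return nxt.reshape(states.shape)

rng = np.random.default_rng(0)
P = make_transition_matrix()
states = rng.integers(0, N_STATES, size=(6, 3))   # 6 users x 3 subcarriers
states = step_channel(states, P, rng)
```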
In this embodiment, the user-subcarrier association matrix X_{N,K} is described by the user-subcarrier allocation indicator x_{n,k}, which indicates whether user terminal n uses subcarrier k. In a specific application, for example, a binary 1 (x_{n,k} = 1) indicates that user terminal n uses subcarrier k, and a binary 0 (x_{n,k} = 0) indicates that user terminal n does not use subcarrier k, i.e. does not apply for resources on subcarrier k. All possible subcarrier allocations are counted as follows: the number of combinations C is introduced, and if the upper limit on the number of users multiplexed on a subcarrier of the non-orthogonal multiple access network is 2 and each user can use only one subcarrier (these numbers can be adjusted according to the practical application), there are Action_num possible allocations in total. For ease of explanation, this embodiment uses a small-capacity small-base-station network model as a simplified case for the calculation. The Action_num possible subcarrier allocation methods are stored in a list structure, denoted Action_list, whose list indices correspond to the possible subcarrier allocation methods, so that a subcarrier allocation method can be matched from its index value; this reduces the complexity of the DQN processing, and the DQN decision action is accordingly designed as an integer in [0, Action_num − 1]. Each subcarrier allocation method corresponds to a user-subcarrier association matrix X_{N,K} (an enumeration of Action_list is sketched below).
In this embodiment, the ratio of the channel gain to the noise power between the base station and the user terminal is used as the normalized channel coefficient, which is determined by the following formula:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k, calculated from Rayleigh fast fading and distance-dependent large-scale fading, with two layers of wall loss added because the typical service range of a small base station is an indoor environment; and σ_k^2 = E[|z_k|^2] denotes the noise power on subcarrier k, where E[·] denotes the mathematical expectation and z_k is additive white Gaussian noise with zero mean and variance σ_k^2.
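The following numpy sketch shows one way such normalized channel coefficients could be generated: Rayleigh fast fading combined with a distance power-law path loss and a fixed wall penetration loss, divided by the noise power. The path-loss exponent, the wall-loss value and the noise power used here are illustrative assumptions rather than parameters from the patent.

```python
import numpy as np

def normalized_channel(dist_m, noise_power_w, wall_loss_db=20.0,
                       path_loss_exp=3.0, rng=None):
    """H_{n,k} = h_{n,k} / sigma_k^2 for an [N, K] matrix of link distances."""
    rng = rng or np.random.default_rng()
    # Rayleigh fast fading: |g|^2 is exponentially distributed with unit mean.
    rayleigh_gain = rng.exponential(1.0, size=dist_m.shape)
    # Large-scale fading: distance power law plus (assumed) two layers of wall loss.
    path_gain = dist_m ** (-path_loss_exp) * 10 ** (-wall_loss_db / 10.0)
    h = rayleigh_gain * path_gain            # channel gain h_{n,k}
    return h / noise_power_w                 # normalized channel coefficient H_{n,k}

rng = np.random.default_rng(1)
distances = rng.uniform(5.0, 50.0, size=(6, 3))    # 6 users x 3 subcarriers, metres
H = normalized_channel(distances, noise_power_w=1e-13, rng=rng)
```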
In this embodiment, the normalized channel coefficients are input into the convolutional neural network q_eval, and the convolutional neural network q_eval selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and subcarriers are allocated to the users accordingly;
where the Q function Q(s, a'; θ_eval) denotes the return value obtained when the decision body of the convolutional neural network q_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is one of the possible matrices X_{N,K} representing the association of user terminals n with subcarriers k.
In this embodiment, the input of the deep reinforcement learning model DQN is the state s of the DQN decision body, i.e. the normalized channel coefficients (specifically, the two-dimensional normalized channel coefficient matrix H_{N,K}); the output is the one-dimensional vector Q_action_val. The action a' with the largest value in Q_action_val is selected as the decision action for subcarrier allocation (the optimal subcarrier allocation result); the index of the action with the largest value in Q_action_val is then matched in Action_list to obtain the current decision action X_{N,K}, i.e. the user-subcarrier association matrix X_{N,K} at which the subcarriers from the base station to the user terminals attain the locally optimal allocation. Matching the subcarrier allocation method from the index value in this way reduces the complexity of the DQN processing.
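Putting the pieces together, the decision step can be sketched as below, assuming the QNetwork and action_list from the earlier sketches; the optional ε-greedy exploration branch is a common DQN practice added here for illustration and is not described in the patent.

```python
import torch

def select_action(q_eval, H, action_list, epsilon=0.0, rng=None):
    """Pick a = argmax_a' Q(s, a'; theta_eval); optional epsilon-greedy exploration."""
    if rng is not None and epsilon > 0 and rng.random() < epsilon:
        a = int(rng.integers(len(action_list)))                 # random exploration
    else:
        state = torch.as_tensor(H, dtype=torch.float32).unsqueeze(0)   # [1, N, K]
        with torch.no_grad():
            q_values = q_eval(state)                            # [1, Action_num]
        a = int(q_values.argmax(dim=1).item())
    return a, action_list[a]                                    # decision index and X_{N,K}
```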
A13, according to the optimal subcarrier allocation result, based on the fractional order algorithm allocated by the fixed subcarriers, that is, allocating downlink power to the users multiplexed on each subcarrier under the same subcarrier according to the channel gain coefficient inverse proportion rule (wherein, the user with larger channel gain allocates smaller power, and the user with smaller channel gain allocates larger power).
In this embodiment, the downlink power allocated to the user is represented as:
Figure BDA0001906660680000111
wherein p isn,kIndicating that the base station is on a subcarrierk is the downlink transmitting power distributed to the user terminal n; p'kIndicating the downlink transmitting power distributed by the base station on the subcarrier k; a represents an attenuation factor with a constraint of 0<a<1, in the same sub-optimization process, the value of a is a fixed value and can not be changed according to different users or different subcarriers; kmaxRepresents the maximum number of users multiplexed on each subcarrier in a non-orthogonal multiple access network under the complexity that the current Successive Interference Cancellation (SIC) can bear.
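A sketch of this fractional power split, under the formula reconstructed above (the power of a subcarrier divided among its multiplexed users in inverse proportion to their normalized channel coefficients raised to the attenuation factor a):

```python
import numpy as np

def allocate_power(H, X, p_sub, a=0.4):
    """Fractional (FTPA-style) power split: users with larger H_{n,k} get less power.

    H: [N, K] normalized channel coefficients; X: [N, K] 0/1 association matrix;
    p_sub: length-K vector of per-subcarrier downlink powers P'_k; 0 < a < 1.
    """
    N, K = H.shape
    p = np.zeros((N, K))
    for k in range(K):
        users = np.flatnonzero(X[:, k])
        if users.size == 0:
            continue
        weights = H[users, k] ** (-a)            # inverse-proportional weighting
        p[users, k] = p_sub[k] * weights / weights.sum()
    return p
```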
A14, determining the maximum undistorted information transmission rate r from the base station subcarrier k to the user terminal nn,k
In this embodiment, the maximum undistorted information transmission rate r from the base station subcarrier k to the user terminal nn,kExpressed as:
rn,k=log2(1+γn,k)
Figure BDA0001906660680000112
wherein, γn,kRepresenting the signal-to-noise ratio, gamma, of the signal obtained by the user terminal n from the subcarrier kn,kRepresenting the signal-to-noise ratio of the signal obtained by the user terminal n from subcarrier k.
In this embodiment, in a non-orthogonal multiple access network, the normalized channel coefficients of users multiplexed on the same subcarrier are arranged in a descending order, and are represented as:
|H1,k|≥|H2,k|≥…≥|Hn,k|≥|Hn+1,k|≥…≥|HKmax,k|
based on the optimal decoding order of the successive interference canceller, when the ue i is located before j in the sequence, the interference from the ue j can be successfully decoded and removed, and the ue j receives the signal of the ue i and accepts the signal as interference. In the non-orthogonal multiple access network, considering the fairness among users and the principle of reducing co-channel interference, when allocating power, the user with good channel condition allocates less power, i.e. in the above example, if H isi,k>Hj,kThen p is allocatedi,k<pj,kIn accordance with the assignment rule of the fractional order algorithm in a 13.
The co-frequency interference and the calculation complexity are reduced as much as possible under the small base station scene, and the number of the multiplexed sub-carriers is predefined to be KmaxThe maximum information transmission rate for ue i and ue j is a logarithmic function of the Signal to Interference plus Noise Ratio (SINR). Chi shapeINNER=pi,kHj,kIndicating the intra-layer co-channel interference experienced by the user terminal j under the service of the current base station.
In this embodiment, the maximum transmission rates of user terminal i and user terminal j are expressed as:
r_{i,k} = log2(1 + γ_{i,k}),  r_{j,k} = log2(1 + γ_{j,k}),  γ_{i,k} = p_{i,k} H_{i,k},  γ_{j,k} = p_{j,k} H_{j,k} / (1 + p_{i,k} H_{j,k})
namely:
r_{i,k} = log2(1 + p_{i,k} H_{i,k}),  r_{j,k} = log2(1 + p_{j,k} H_{j,k} / (1 + p_{i,k} H_{j,k}))
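For an arbitrary number of multiplexed users, the per-user rates can be computed as in the sketch below, which follows the SIC ordering described above: a user is interfered only by co-multiplexed users whose normalized channel coefficient is larger than its own.

```python
import numpy as np

def rates(H, X, p):
    """Per-user-per-subcarrier rates r_{n,k} = log2(1 + gamma_{n,k}) under NOMA with SIC.

    A user on subcarrier k is interfered only by co-multiplexed users whose
    normalized channel coefficient is larger than its own.
    """
    N, K = H.shape
    r = np.zeros((N, K))
    for k in range(K):
        users = np.flatnonzero(X[:, k])
        for n in users:
            stronger = [i for i in users if H[i, k] > H[n, k]]
            interference = H[n, k] * p[stronger, k].sum() if stronger else 0.0
            gamma = p[n, k] * H[n, k] / (1.0 + interference)
            r[n, k] = np.log2(1.0 + gamma)
    return r
```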
A16, determining the system power consumption U_P(X)
In this embodiment, considering that the small base station has an energy recovery unit, the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the power consumed by the circuit, and ψ denotes the base station energy recovery coefficient, which can be modified according to the actual hardware properties.
A17, according to determined gamman,kAnd UP(X), determining the energy efficiency of the system
In the present example, based on the obtainedMaximum undistorted information transmission rate r from base station subcarrier k to user terminal nn,kAnd system power consumption UP(X) calculating the energy efficiency ee of the subcarriers k to the user terminal nn,k
Figure BDA0001906660680000124
Wherein the content of the first and second substances,
Figure BDA0001906660680000125
representing the subcarrier k channel bandwidth.
In this embodiment, the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k}
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station.
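Combining the rates with the power model gives the system energy efficiency used as the basis of the return function. The sketch below follows the reconstructed forms used in this document (ee_{n,k} = B_SC·r_{n,k}/U_P(X) summed over the active user-subcarrier pairs, with U_P(X) modelled as circuit power plus transmit power discounted by ψ), so the exact normalization should be checked against the original equations.

```python
def system_energy_efficiency(r, X, p, b_sc, p_circuit, psi):
    """EE(X) = sum_n sum_k x_{n,k} * B_SC * r_{n,k} / U_P(X)  (reconstructed form).

    All array arguments are [N, K] numpy arrays; p_circuit is the total circuit
    power and psi the base-station energy recovery coefficient.
    """
    transmit = (X * p).sum()
    u_p = p_circuit + (1.0 - psi) * transmit     # system power consumption U_P(X)
    return b_sc * (X * r).sum() / u_p
```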
A17, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model
In this embodiment, for system energy efficiency that does not meet preset modeling constraint conditions (the modeling constraint conditions are determined by factors such as an inter-user fairness principle, a minimum quality of service standard, and an upper limit of cross-layer interference), a weak supervision algorithm based on value return punishment is performed on the system energy efficiency according to types that do not meet the modeling constraint conditions, a return function after a deep reinforcement learning model makes a decision action is obtained, and the return function is fed back to the deep reinforcement learning model; wherein the reward function is represented as:
Figure BDA0001906660680000131
wherein, rewardtRepresenting a return function calculated during the t training; rminRepresents the minimum standard of user quality of Service (QoS), i.e. the minimum downlink transmission rate; hinnterThe normalized channel coefficient representing the shortest distance between the nearest base station operating at the same subcarrier frequency and the currently optimized base station may be calculated according to the method in step a 12; i iskRepresenting the cross-layer (cross-station) interference upper limit which the k sub-carrier frequency band can bear, setting and adjusting the interference upper limit according to specific application ξcase1~ξcase3Penalty coefficients for energy efficiency are represented for three cases that do not meet the modeling constraints.
In addition, the following should be noted: when the system energy efficiency is taken directly as the return function, x_{n,k} and the attenuation factor a must also satisfy further constraints. Combined with the above, the constraint conditions to be met by x_{n,k} and a are as follows:
Condition 1: Σ_{k∈K} x_{n,k} ≤ 1, ∀n ∈ N, which restricts a user terminal to be associated with at most one subcarrier at the same time.
Condition 2: Σ_{n∈N} x_{n,k} ≤ K_max, ∀k ∈ K, which limits the maximum number of users multiplexed on the same subcarrier in the non-orthogonal multiple access network to K_max, in order to reduce intra-station interference and the complexity of the successive interference canceller.
Condition 3: Σ_{k∈K} x_{n,k} B_SC r_{n,k} ≥ R_min, ∀n ∈ N, which is the QoS constraint: the information transmission rate of every user terminal served by the base station should exceed the minimum user quality-of-service limit.
Condition 4: Σ_{n∈N} x_{n,k} p_{n,k} ≤ P'_k, ∀k ∈ K, with the total transmit power not exceeding the peak power BS_peak of the small base station, which limits the maximum transmit power of the base station on subcarrier k.
Condition 5: Σ_{n∈N} x_{n,k} p_{n,k} H_inter ≤ I_k, ∀k ∈ K, which is an effective interference coordination mechanism limiting the interference of the currently optimized base station on other base stations.
Condition 6: 0 < a < 1, which is the limit on the attenuation factor used when allocating power.
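A penalized return function of the kind described above could look like the following sketch. The mapping of the three penalty coefficients to particular violations (QoS, per-subcarrier transmit power, cross-layer interference) and their numeric defaults are assumptions made here for illustration; the patent only states that each type of violation scales the system energy efficiency by its own coefficient ξ.

```python
def reward(ee, r, X, p, p_sub_max, h_inter, i_max, r_min,
           xi1=0.5, xi2=0.3, xi3=0.1):
    """Scale the system energy efficiency down when a modeling constraint is violated.

    All array arguments are numpy arrays; the xi coefficients and the
    constraint-to-coefficient mapping are assumed, not taken from the patent.
    """
    served = X.sum(axis=1) > 0
    qos_ok = ((X * r).sum(axis=1)[served] >= r_min).all()       # condition 3: minimum rate
    power_ok = ((X * p).sum(axis=0) <= p_sub_max).all()         # condition 4: per-subcarrier power
    interf_ok = ((X * p).sum(axis=0) * h_inter <= i_max).all()  # condition 5: cross-layer interference

    if qos_ok and power_ok and interf_ok:
        return ee
    if not qos_ok:
        return xi1 * ee
    if not power_ok:
        return xi2 * ee
    return xi3 * ee
```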
A18, storing the report function, channel environment, decision action and transition order state into DQN memory playback unit
In this embodiment, the reporting function, the channel environment, the decision action, and the transition order (transition state) are stored as a quadruple in the DQN memory playback unit memory, where the memory is represented as:
memory:D(t)={e(1),...,e(t)}
e(t)=(s(t),a(t),r(t),s(t+1))
wherein, s (t) represents the normalized channel coefficient (state) input during the t-th training of the model; a (t) represents the decision-making action made by the DQN when the deep reinforcement learning model is trained for the t time, namely a user-subcarrier correlation matrix; r (t) represents a reward function obtained after the action a (t) of the DQN is finished when the t training deep reinforcement learning model is trainedt(ii) a s (t +1) represents the normalized channel coefficient (secondary state) after updating according to the time-varying Markov channel in the finite state when the deep reinforcement learning model is trained for t +1 times.
In this embodiment, each group e (t) is stored by defining a memory playback class and setting the memory as a data structure of an object array or a dictionary.
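A minimal memory playback class of this kind stores each experience as the quadruple (s(t), a(t), r(t), s(t+1)) in a fixed-size buffer and samples uniformly at random; the capacity and the use of a deque below are implementation choices, not taken from the patent.

```python
import random
from collections import deque

class ReplayMemory:
    """Memory playback unit: stores e(t) = (s(t), a(t), r(t), s(t+1))."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```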
A19, training a deep reinforcement learning model by using a batch processing mode, and randomly selecting batch memory data with a fixed size from the DQN memory playback unit for learning and gradient descent updating of two convolutional neural networks.
In this embodiment, the memory data is processed by using a Loss function Loss (θ), which is expressed as:
Figure BDA0001906660680000144
the gradient descent update formula is expressed as:
Figure BDA0001906660680000145
wherein the content of the first and second substances,
Figure BDA0001906660680000146
represents a training learning rate; λ represents a discount factor for evaluation of the attitude of the decision body;
Figure BDA0001906660680000151
represents that when the input is the sub-state s (t +1) of the current memory e (t), the weight is thetatargetOf the convolutional neural network qtargetAn action a' which is decided to be capable of harvesting the maximum return; q(s), (t), a (t); thetaeval) Indicating that when the input is the state s (t) of the current memory e (t), the weight is θevalOf the convolutional neural network qevalPerforming the reward value obtained in act a (t);
Figure BDA0001906660680000152
represents a parameter of thetaevalThe convolutional neural network performs gradient descent operation, i.e. modifies the convolutional neural network qevalParameter theta ofevalMake the convolutional neural network qtargetAnd q isevalThe output of (c) is subtracted to a minimum.
In the present embodiment, the subtraction Q(s) (t), a (t); θeval) If the memory unit e (1) selects action 2, only updating [1,2 ] of two convolutional neural networks by gradient descent updating formula]The values of the positions are unchanged, the values corresponding to the rest of actions in the first dimension are unchanged, and in order to ensure the stability of training, the gradient descent only updates the convolutional neural network qevalThe parameter (c) of (c).
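The batch update can be sketched in PyTorch as follows: the TD target r(t) + λ·max_{a'} Q(s(t+1), a'; θ_target) is compared with Q(s(t), a(t); θ_eval) only at the chosen action, the squared error is minimized, and gradients flow only into q_eval; the optimizer choice and hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(q_eval, q_target, memory, optimizer, batch_size=32, gamma=0.9):
    """One gradient-descent update of q_eval from a random minibatch of memories."""
    batch = memory.sample(batch_size)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _, _ in batch])
    actions = torch.tensor([a for _, a, _, _ in batch], dtype=torch.int64)
    rewards = torch.tensor([r for _, _, r, _ in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(s2, dtype=torch.float32) for _, _, _, s2 in batch])

    # Q(s(t), a(t); theta_eval): only the column of the chosen action enters the loss.
    q_sa = q_eval(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # no gradient flows into q_target
        q_next = q_target(next_states).max(dim=1).values
    target = rewards + gamma * q_next              # TD target r(t) + lambda * max_a' Q_target

    loss = F.mse_loss(q_sa, target)                # Loss(theta_eval)
    optimizer.zero_grad()
    loss.backward()                                # gradient descent only on theta_eval
    optimizer.step()
    return float(loss.item())

def sync_target(q_eval, q_target):
    """Every C_max training iterations: theta_target <- theta_eval."""
    q_target.load_state_dict(q_eval.state_dict())
```

Here `optimizer` would be constructed over `q_eval.parameters()` only, for example `torch.optim.RMSprop(q_eval.parameters())`, so that the descent step never modifies θ_target, which is refreshed separately by `sync_target` every C_max iterations.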
A20, updating q every fixed times in the deep reinforcement learning model training processtargetParameter qevalThe parameters, expressed as:
Figure BDA0001906660680000153
wherein, CiterA counter for representing training is used for recording the training times; cmaxDenotes qtargetParameter and qevalUpdate interval of parameter, also CiterOf (2) and thus CiterIs equal to CmaxAnd then, the zero is reset.
A21, q updated by steps A19 and A20targetNetwork parameters and qevalIf the difference value between the system energy efficiency value which is continuously optimized for multiple times and a preset threshold (specified value) is within a preset range or is higher than the preset threshold, the deep reinforcement learning model can be considered to be suitable for wireless resource allocation in the time-varying channel environment, the currently allocated downlink power is locally optimal power allocation in the time-varying channel environment, the current deep reinforcement learning model achieves locally optimal allocation of network resources in the time-varying environment, and the obtained deep reinforcement learning model can be continuously used in the actual time-varying channel environment;
a22, otherwise, press
Figure BDA0001906660680000154
Update environment, judge Citer=CmaxIf true, let Citer=0、θtarget=θevalThen, step A12 is executed; otherwise, step a12 is directly executed until the difference between the recalculated system energy efficiency value and the preset threshold is within the preset range or higher than the preset threshold, at which time the best optimization in the time-varying channel environment is achieved.
In this embodiment, as the number of times of optimization t increases, the return value of the DQN model in the time-varying channel environment gradually tends from low to higher, and this process is a wireless network resource allocation method based on deep reinforcement learning, thereby implementing optimization of subcarrier and power allocation in the time-varying channel environment.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (1)

1. A wireless network resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model;
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold is within a preset range, or the system energy efficiency values obtained in several consecutive iterations are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment;
wherein the normalized channel coefficient is expressed as:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k; and σ_k^2 denotes the noise power on subcarrier k;
wherein inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users comprises:
inputting the normalized channel coefficients into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and allocates subcarriers to the users;
where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a'; θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is obtained from the index of the action with the maximum return value;
wherein the downlink power allocated to a user is expressed as:
p_{n,k} = P'_k · (H_{n,k})^{-α} / Σ_{i=1}^{K_max} (H_{i,k})^{-α}
where p_{n,k} denotes the downlink transmit power allocated by the base station to user terminal n on subcarrier k; P'_k denotes the downlink transmit power allocated by the base station on subcarrier k; α denotes the attenuation factor; and K_max denotes the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network under the complexity that the current successive interference canceller can bear;
wherein determining the system energy efficiency based on the allocated downlink power comprises:
determining the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n;
determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result and the allocated downlink power;
determining the system energy efficiency according to the determined r_{n,k} and U_P(X);
wherein the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n is expressed as:
r_{n,k} = log2(1 + γ_{n,k})
γ_{n,k} = p_{n,k} H_{n,k} / (1 + H_{n,k} Σ_{i: |H_{i,k}| > |H_{n,k}|} p_{i,k})
where γ_{n,k} denotes the signal-to-interference-plus-noise ratio of the signal that user terminal n obtains on subcarrier k;
the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the circuit power consumption, ψ denotes the base station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k;
wherein the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k},  with  ee_{n,k} = B_SC · r_{n,k} / U_P(X)
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station;
wherein determining the return function according to the system energy efficiency and feeding the return function back to the deep reinforcement learning model comprises:
penalizing, by a weakly supervised algorithm based on value return, the system energy efficiency that does not satisfy the preset modeling constraints, according to the type of constraint violated, to obtain the return function after the deep reinforcement learning model makes the decision action, and feeding the return function back to the deep reinforcement learning model; wherein the return function is expressed as:

reward_t = [formula image FDA0002364532410000031]

wherein reward_t represents the return function calculated in the t-th training; R_min represents the minimum standard of user quality of service, namely the minimum downlink transmission rate; H_inter represents the normalized channel coefficient corresponding to the shortest distance between the nearest base station operating on the same subcarrier frequency and the base station currently being optimized; I_k represents the upper limit of cross-layer interference that the k-th subcarrier frequency band can bear; ξ_case1~ξ_case3 represent the penalty coefficients applied to the system energy efficiency in the three cases that do not satisfy the modeling constraints;
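A hedged sketch of this penalty-based return: the exact piecewise expression is only available as an image, so the code below simply scales the system energy efficiency by a penalty coefficient for each violated constraint, with the three violation checks and the ξ coefficients passed in from outside:

```python
def compute_reward(ee, qos_violated, interference_violated, other_violated,
                   xi_case1=0.5, xi_case2=0.5, xi_case3=0.5):
    """Return function reward_t: the system energy efficiency, scaled down by a
    penalty coefficient for each modeling constraint that the decision violates.

    qos_violated          : True if some user's rate falls below R_min.
    interference_violated : True if the cross-layer interference caused on a
                            subcarrier band (judged via H_inter) exceeds I_k.
    other_violated        : True if the third modeling constraint is broken.
    xi_case1..xi_case3    : penalty coefficients for the three cases (values assumed).
    """
    penalty = 1.0
    if qos_violated:
        penalty *= xi_case1
    if interference_violated:
        penalty *= xi_case2
    if other_violated:
        penalty *= xi_case3
    return ee * penalty
```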
wherein training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and, if the difference between the system energy efficiency values obtained in a plurality of consecutive iterations and the preset threshold is within a preset range, or the system energy efficiency values obtained in a plurality of consecutive iterations are higher than the preset threshold, taking the currently allocated downlink power as the locally optimal power allocation in the time-varying channel environment, comprises:
storing the return function, the channel environment, the decision action and the next state transitioned to as a quadruple in the memory replay unit (memory) of the deep reinforcement learning model, wherein the memory is expressed as:
memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))
wherein s(t) represents the input state at the t-th training of the deep reinforcement learning model; a(t) represents the decision action made by the deep reinforcement learning model at the t-th training; r(t) represents the return function reward_t obtained after the deep reinforcement learning model performs action a(t) at the t-th training; s(t+1) represents the next state, updated according to the finite-state time-varying Markov channel, at the (t+1)-th training;
randomly selecting memory data from the memory replay unit of the deep reinforcement learning model for the learning and gradient-descent update of the two convolutional neural networks, wherein the gradient descent only updates the convolutional neural network q_eval, and, at fixed intervals during the training of the deep reinforcement learning model, the parameters θ_target of q_target are updated to the parameters θ_eval of q_eval;
if the difference between the system energy efficiency values obtained in a plurality of consecutive iterations and the preset threshold is within the preset range, or the system energy efficiency values obtained in a plurality of consecutive iterations are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment;
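A minimal experience-replay sketch of the storage and sampling steps above, assuming the quadruple layout e(t) = (s(t), a(t), r(t), s(t+1)); the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Memory replay unit D(t) = {e(1), ..., e(t)} of quadruples e(t)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)       # oldest experiences are discarded

    def store(self, state, action, reward, next_state):
        """Append one quadruple e(t) = (s(t), a(t), r(t), s(t+1))."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        """Randomly select stored memory data for the gradient-descent update of q_eval."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```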
wherein the gradient descent update formula is expressed as:

θ_eval ← θ_eval + β · [ r(t) + λ · max_{a'} Q(s(t+1), a'; θ_target) − Q(s(t), a(t); θ_eval) ] · ∇_{θ_eval} Q(s(t), a(t); θ_eval)

wherein β represents the training learning rate; λ represents the discount factor reflecting the decision body's attitude toward future returns; max_{a'} Q(s(t+1), a'; θ_target) represents, when the input is the next state s(t+1) of the current memory e(t), the maximum return that the convolutional neural network q_target with weights θ_target can obtain over the actions a' it may decide; Q(s(t), a(t); θ_eval) represents, when the input is the state s(t) of the current memory e(t), the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a(t); ∇_{θ_eval} represents the gradient operation performed with respect to the parameters θ_eval of the convolutional neural network.
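A PyTorch sketch of this update, written as one stochastic-gradient step on the squared temporal-difference error, which under plain SGD reduces to the update formula above; q_eval and q_target are assumed to be two networks of identical architecture mapping a batch of states to one Q-value per action, and the optimizer, discount value and sync interval are illustrative:

```python
import torch
import torch.nn as nn

def dqn_update(q_eval, q_target, optimizer, batch, discount=0.9):
    """One gradient-descent step on q_eval from a sampled batch of quadruples e(t)."""
    states, actions, rewards, next_states = zip(*batch)
    s  = torch.stack([torch.as_tensor(v, dtype=torch.float32) for v in states])
    s2 = torch.stack([torch.as_tensor(v, dtype=torch.float32) for v in next_states])
    a  = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r  = torch.as_tensor(rewards, dtype=torch.float32)

    # Target value: r(t) + lambda * max_a' Q(s(t+1), a'; theta_target); the target
    # network is never differentiated, so no gradient flows through this term.
    with torch.no_grad():
        target = r + discount * q_target(s2).max(dim=1).values

    # Predicted value Q(s(t), a(t); theta_eval) for the actions actually taken.
    prediction = q_eval(s).gather(1, a).squeeze(1)

    loss = nn.functional.mse_loss(prediction, target)   # squared TD error
    optimizer.zero_grad()
    loss.backward()                                      # gradients w.r.t. theta_eval only
    optimizer.step()
    return loss.item()

def sync_target(q_eval, q_target):
    """Every fixed number of training steps, overwrite theta_target with theta_eval."""
    q_target.load_state_dict(q_eval.state_dict())
```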
CN201811535056.1A 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning Active CN109474980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811535056.1A CN109474980B (en) 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN109474980A CN109474980A (en) 2019-03-15
CN109474980B true CN109474980B (en) 2020-04-28

Family

ID=65675169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811535056.1A Active CN109474980B (en) 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109474980B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113615277B (en) * 2019-03-27 2023-03-24 华为技术有限公司 Power distribution method and device based on neural network
CN109962728B (en) * 2019-03-28 2021-01-26 北京邮电大学 Multi-node joint power control method based on deep reinforcement learning
CN110084245B (en) * 2019-04-04 2020-12-25 中国科学院自动化研究所 Weak supervision image detection method and system based on visual attention mechanism reinforcement learning
CN110430613B (en) * 2019-04-11 2022-07-01 重庆邮电大学 Energy-efficiency-based resource allocation method for multi-carrier non-orthogonal multiple access system
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110167176B (en) * 2019-04-25 2021-06-01 北京科技大学 Wireless network resource allocation method based on distributed machine learning
CN110401975A (en) * 2019-07-05 2019-11-01 深圳市中电数通智慧安全科技股份有限公司 A kind of method, apparatus and electronic equipment of the transmission power adjusting internet of things equipment
CN110380776B (en) * 2019-08-22 2021-05-14 电子科技大学 Internet of things system data collection method based on unmanned aerial vehicle
CN110635833B (en) * 2019-09-25 2020-12-15 北京邮电大学 Power distribution method and device based on deep learning
CN111428903A (en) * 2019-10-31 2020-07-17 国家电网有限公司 Interruptible load optimization method based on deep reinforcement learning
CN110809306B (en) * 2019-11-04 2021-03-16 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN110972309B (en) * 2019-11-08 2022-07-19 厦门大学 Ultra-dense wireless network power distribution method combining graph signals and reinforcement learning
US11246173B2 (en) 2019-11-08 2022-02-08 Huawei Technologies Co. Ltd. Systems and methods for multi-user pairing in wireless communication networks
CN112988229B (en) * 2019-12-12 2022-08-05 上海大学 Convolutional neural network resource optimization configuration method based on heterogeneous computation
CN111211831A (en) * 2020-01-13 2020-05-29 东方红卫星移动通信有限公司 Multi-beam low-orbit satellite intelligent dynamic channel resource allocation method
CN111431646B (en) * 2020-03-31 2021-06-15 北京邮电大学 Dynamic resource allocation method in millimeter wave system
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112104400B (en) * 2020-04-24 2023-04-07 广西华南通信股份有限公司 Combined relay selection method and system based on supervised machine learning
CN111542107A (en) * 2020-05-14 2020-08-14 南昌工程学院 Mobile edge network resource allocation method based on reinforcement learning
CN111885720B (en) * 2020-06-08 2021-05-28 中山大学 Multi-user subcarrier power distribution method based on deep reinforcement learning
CN111867110B (en) * 2020-06-17 2023-10-03 三明学院 Wireless network channel separation energy-saving method based on switch switching strategy
CN111930501B (en) * 2020-07-23 2022-08-26 齐齐哈尔大学 Wireless resource allocation method based on unsupervised learning and oriented to multi-cell network
CN112770398A (en) * 2020-12-18 2021-05-07 北京科技大学 Far-end radio frequency end power control method based on convolutional neural network
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113490184B (en) * 2021-05-10 2023-05-26 北京科技大学 Random access resource optimization method and device for intelligent factory
CN113395757B (en) * 2021-06-10 2023-06-30 中国人民解放军空军通信士官学校 Deep reinforcement learning cognitive network power control method based on improved return function
CN114126025B (en) * 2021-11-02 2023-04-28 中国联合网络通信集团有限公司 Power adjustment method for vehicle-mounted terminal, vehicle-mounted terminal and server
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114360305A (en) * 2021-12-15 2022-04-15 广州创显科教股份有限公司 Classroom interactive teaching method and system based on 5G network
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121766A1 (en) * 2016-09-18 2018-05-03 Newvoicemedia, Ltd. Enhanced human/machine workforce management using reinforcement learning
US20180091981A1 (en) * 2016-09-23 2018-03-29 Board Of Trustees Of The University Of Arkansas Smart vehicular hybrid network systems and applications of same

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN105407535A (en) * 2015-10-22 2016-03-16 东南大学 High energy efficiency resource optimization method based on constrained Markov decision process
CN106909728A (en) * 2017-02-21 2017-06-30 电子科技大学 A kind of FPGA interconnection resources configuration generating methods based on enhancing study
CN108307510A (en) * 2018-02-28 2018-07-20 北京科技大学 A kind of power distribution method in isomery subzone network
CN108712748A (en) * 2018-04-12 2018-10-26 天津大学 A method of the anti-interference intelligent decision of cognitive radio based on intensified learning
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods
CN108989099A (en) * 2018-07-02 2018-12-11 北京邮电大学 Federated resource distribution method and system based on software definition Incorporate network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Integrated Networking, Caching, and Computing for Connected Vehicles: A Deep Reinforcement Learning Approach; Ying He et al.; IEEE Transactions on Vehicular Technology; 2017-10-06; full text *
Power Allocation in Multi-cell Networks Using Deep Reinforcement Learning; Yong Zhang et al.; 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall); 2018-08-30; full text *

Also Published As

Publication number Publication date
CN109474980A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109474980B (en) Wireless network resource allocation method based on deep reinforcement learning
CN110493826B (en) Heterogeneous cloud wireless access network resource allocation method based on deep reinforcement learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
CN107426820B (en) Resource allocation method for improving energy efficiency of multi-user game in cognitive D2D communication system
CN110708711A (en) Heterogeneous energy-carrying communication network resource allocation method based on non-orthogonal multiple access
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
CN105451322B (en) A kind of channel distribution and Poewr control method based on QoS in D2D network
AlQerm et al. Enhanced machine learning scheme for energy efficient resource allocation in 5G heterogeneous cloud radio access networks
CN107708157A (en) Intensive small cell network resource allocation methods based on efficiency
CN106792451B (en) D2D communication resource optimization method based on multi-population genetic algorithm
CN109982437B (en) D2D communication spectrum allocation method based on location-aware weighted graph
CN113316154B (en) Authorized and unauthorized D2D communication resource joint intelligent distribution method
Coskun et al. Three-stage resource allocation algorithm for energy-efficient heterogeneous networks
Shahid et al. Self-organized energy-efficient cross-layer optimization for device to device communication in heterogeneous cellular networks
Zhang et al. Resource optimization-based interference management for hybrid self-organized small-cell network
CN105490794B (en) The packet-based resource allocation methods of the Femto cell OFDMA double-layer network
Yu et al. Interference coordination strategy based on Nash bargaining for small‐cell networks
CN110139282B (en) Energy acquisition D2D communication resource allocation method based on neural network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN110677175A (en) Sub-channel scheduling and power distribution joint optimization method based on non-orthogonal multiple access system
CN114423028A (en) CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
Khodmi et al. Joint user-channel assignment and power allocation for non-orthogonal multiple access in a 5G heterogeneous ultra-dense networks
CN110677176A (en) Combined compromise optimization method based on energy efficiency and spectrum efficiency
CN109275163B (en) Non-orthogonal multiple access joint bandwidth and rate allocation method based on structured ordering characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant