CN109474980B - Wireless network resource allocation method based on deep reinforcement learning - Google Patents

Wireless network resource allocation method based on deep reinforcement learning

Info

Publication number
CN109474980B
Authority
CN
China
Prior art keywords
reinforcement learning
eval
deep reinforcement
subcarrier
learning model
Prior art date
Legal status
Active
Application number
CN201811535056.1A
Other languages
Chinese (zh)
Other versions
CN109474980A
Inventor
张海君
刘启瑞
皇甫伟
董江波
隆克平
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201811535056.1A
Publication of CN109474980A
Application granted
Publication of CN109474980B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/143Downlink power control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/26TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service]
    • H04W52/265TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service] taking into account the quality of service QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/30TPC using constraints in the total amount of available transmission power
    • H04W52/34TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W52/346TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/53Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/54Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/543Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource

Abstract

The invention provides a wireless network resource allocation method based on deep reinforcement learning, which can maximize energy efficiency in a time-varying channel environment with low complexity. The method comprises the following steps: establishing a deep reinforcement learning model; modeling the time-varying channel environment between a base station and user terminals as a finite-state time-varying Markov channel, determining normalized channel coefficients, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users; according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining a return function based on the allocated downlink power, and feeding the return function back to the deep reinforcement learning model; training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and determining the locally optimal power allocation in the time-varying channel environment. The invention relates to the fields of wireless communication and artificial intelligence decision making.

Description

Wireless network resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication and artificial intelligence decision making, in particular to a wireless network resource allocation method based on deep reinforcement learning.
Background
Starting from the Long Term Evolution (LTE) era, the networking architecture has shifted from macro-only networks to macro-micro cooperation, and the sustainable development of the Macro Cell faces many challenges, such as unpredictable traffic growth, ubiquitous access demands, random hotspot deployment, and the great cost pressure of the macro cell itself. Small base stations (Small Cells), such as micro cells and home base stations, therefore show their advantages of accurate coverage and blind-area supplementation, and have gradually become an important means of cooperating with macro base stations in network deployment and of sharing the macro base stations' service load. The fifth generation of mobile communication is an extension of 4G; 5G is not a single radio access technology but a general term for the solution formed by the evolution and integration of several new radio access technologies and existing radio access technologies. 5G networks have now come into public view, and the user-experienced data rate is generally regarded as the most important 5G performance indicator. The technical features of 5G can be summarized by a few numbers: a 1000x capacity boost, support for more than 100 billion connections, a peak rate of 10 Gb/s, and a latency of 1 ms or less. The main 5G technologies include massive multi-antenna systems, novel multiple access techniques and ultra-dense networks, in which the deployment of small base stations together with macro base stations forms an ultra-dense heterogeneous network that provides ubiquitous services for users.
With the rapid growth in the number of mobile users, the deployment of small base stations is also becoming ultra-dense, and the energy consumption of the wireless communication field is very large. Given the serious environmental pollution and increasingly scarce energy in China, green communication is inevitably a direction worth researching and exploring. Therefore, on the basis of satisfying users' data demands and quality of service, achieving higher energy efficiency through a reasonable resource allocation scheme is an important research direction.
Disclosure of Invention
The invention aims to provide a wireless network resource allocation method based on deep reinforcement learning, so as to solve the problem that wireless resource allocation in a time-varying channel environment cannot be effectively realized in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a wireless network resource allocation method based on deep reinforcement learning, including:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model;
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold is within a preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
Further, the normalized channel coefficient is expressed as:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k; and σ_k^2 denotes the noise power on subcarrier k.
Further, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users comprises:
inputting the normalized channel coefficients into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and allocates subcarriers to the users;
where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a'; θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is obtained from the index of the action with the maximum return value.
Further, the downlink power allocated to a user is expressed as:
p_{n,k} = P'_k · (H_{n,k})^{-a} / Σ_{i=1}^{K_max} (H_{i,k})^{-a}
where p_{n,k} denotes the downlink transmit power allocated by the base station to user terminal n on subcarrier k; P'_k denotes the downlink transmit power allocated by the base station on subcarrier k; a denotes the attenuation factor; and K_max denotes the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network under the complexity that the current successive interference canceller can bear.
Further, determining the system energy efficiency based on the allocated downlink power comprises:
determining the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n;
determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result and the allocated downlink power;
determining the system energy efficiency according to the determined r_{n,k} and U_P(X).
Further, the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n is expressed as:
r_{n,k} = log2(1 + γ_{n,k})
γ_{n,k} = p_{n,k} H_{n,k} / (1 + H_{n,k} Σ_{i: |H_{i,k}| > |H_{n,k}|} p_{i,k})
where γ_{n,k} denotes the signal-to-interference-plus-noise ratio of the signal that user terminal n obtains on subcarrier k;
the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the circuit power consumption, ψ denotes the base station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k.
Further, the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k},  with  ee_{n,k} = B_SC · r_{n,k} / U_P(X)
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station.
Further, determining the return function based on the system energy efficiency and feeding the return function back to the deep reinforcement learning model comprises:
penalizing, through a weakly supervised algorithm based on value return, the system energy efficiency that does not satisfy the preset modeling constraints, according to the type of constraint that is violated, so as to obtain the return function after the deep reinforcement learning model makes its decision action, and feeding the return function back to the deep reinforcement learning model; the return function reward_t equals the system energy efficiency when all modeling constraints are satisfied, and is scaled by the corresponding penalty coefficient when a constraint is violated;
where reward_t denotes the return function calculated during the t-th training iteration; R_min denotes the minimum standard of user quality of service, i.e. the minimum downlink transmission rate; H_inter denotes the normalized channel coefficient corresponding to the shortest distance between the currently optimized base station and the nearest base station working at the same subcarrier frequency; I_k denotes the upper limit of cross-layer interference that the k-th subcarrier band can bear; and ξ_case1 ~ ξ_case3 denote the penalty coefficients applied to the system energy efficiency in the three cases that do not satisfy the modeling constraints.
Further, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and, if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or those values are higher than the preset threshold, taking the currently allocated downlink power as the locally optimal power allocation in the time-varying channel environment, comprises:
storing the return function, the channel environment, the decision action and the next state after the transition as a quadruple in the memory playback unit memory of the deep reinforcement learning model, where the memory is expressed as:
memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))
where s(t) denotes the state input during the t-th training of the deep reinforcement learning model; a(t) denotes the decision action made by the deep reinforcement learning model during the t-th training; r(t) denotes the return function reward_t obtained after the deep reinforcement learning model performs action a(t) during the t-th training; and s(t+1) denotes the next state, updated according to the finite-state time-varying Markov channel, used in the (t+1)-th training of the deep reinforcement learning model;
randomly selecting memory data from the memory playback unit of the deep reinforcement learning model for learning of the two convolutional neural networks and gradient-descent updating, where the gradient descent only updates the parameters of the convolutional neural network q_eval, and during training of the deep reinforcement learning model the parameter θ_target of q_target is updated to the parameter θ_eval of q_eval every fixed number of iterations;
if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
Further, the gradient descent update formula is expressed as:
θ_eval ← θ_eval + β [ r(t) + λ max_{a'} Q(s(t+1), a'; θ_target) − Q(s(t), a(t); θ_eval) ] ∇_{θ_eval} Q(s(t), a(t); θ_eval)
where β denotes the training learning rate; λ denotes the discount factor for the evaluation of the decision body's attitude; max_{a'} Q(s(t+1), a'; θ_target) denotes the maximum return that the convolutional neural network q_target with weights θ_target decides can be harvested by some action a' when the input is the next state s(t+1) of the current memory e(t); Q(s(t), a(t); θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a(t) with the state s(t) of the current memory e(t) as input; and ∇_{θ_eval} denotes the gradient operation performed with respect to the parameters θ_eval of the convolutional neural network.
The technical scheme of the invention has the following beneficial effects:
In this scheme, two convolutional neural networks q_eval and q_target are established to form a deep reinforcement learning model; the time-varying channel environment between the base station and the user terminals is modeled as a finite-state time-varying Markov channel, the normalized channel coefficients between the base station and the users are determined and input into the convolutional neural network q_eval, the action with the maximum output return value is selected as the decision action, and subcarriers are allocated to the users; according to the subcarrier allocation result, downlink power is allocated to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, the system energy efficiency is determined based on the allocated downlink power, a return function is determined based on the system energy efficiency, and the return function is fed back to the deep reinforcement learning model; the convolutional neural networks q_eval and q_target in the deep reinforcement learning model are trained according to the determined return function, and if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range, or those values are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment. In this way, by modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel and using a deep reinforcement learning model on top of this highly complex time-varying channel, the computational complexity is shifted into the training of the deep reinforcement learning model, so that a decision action can be selected with low complexity, the locally optimal allocation of subcarriers from the base station to the user terminals in the time-varying channel environment is determined, and the energy efficiency in the time-varying channel environment is maximized.
Drawings
Fig. 1 is a schematic flowchart of a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a detailed flowchart of a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a wireless network resource allocation method based on deep reinforcement learning, aiming at the problem that the wireless resource allocation in a time-varying channel environment cannot be effectively realized in the prior art.
As shown in fig. 1, a method for allocating wireless network resources based on deep reinforcement learning according to an embodiment of the present invention includes:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model (Deep Q Network, DQN);
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment.
The wireless network resource allocation method based on deep reinforcement learning of this embodiment establishes two convolutional neural networks q_eval and q_target to form a deep reinforcement learning model; models the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determines the normalized channel coefficients between the base station and the users, inputs them into the convolutional neural network q_eval, selects the action with the maximum output return value as the decision action, and allocates subcarriers to the users; according to the subcarrier allocation result, allocates downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determines the system energy efficiency based on the allocated downlink power, determines a return function based on the system energy efficiency, and feeds the return function back to the deep reinforcement learning model; and trains the convolutional neural networks q_eval and q_target according to the determined return function, so that if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold is within the preset range or higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment. In this way, by modeling the time-varying channel environment as a finite-state time-varying Markov channel and using a deep reinforcement learning model, the computational complexity is shifted into the training of the model, so that a decision action is selected with low complexity, the locally optimal allocation of subcarriers from the base station to the user terminals in the time-varying channel environment is determined, and the energy efficiency in the time-varying channel environment is maximized.
Deep reinforcement learning in this embodiment is an artificial-intelligence decision method characterized by a decision body making sequential decisions in a dynamically changing environment; the states, actions and rewards required for deep reinforcement learning can be constructed so that, as the deep reinforcement learning model is trained, the decision body acts automatically and optimizes its decision actions. The wireless network resource allocation method based on deep reinforcement learning can simulate a time-varying channel environment, optimize the allocation of wireless network resources in a time-varying network scenario to the greatest extent with low computational complexity, and thereby jointly achieve fast decision making and improved energy efficiency. The trained deep reinforcement learning model can be used continuously to manage wireless resources in the time-varying channel environment and to make fast, high-return decisions. In large-scale wireless network optimization, the deep reinforcement learning model can be computed in a distributed manner, which further reduces complexity.
In order to better understand the method for allocating wireless network resources based on deep reinforcement learning in this embodiment, the method is described in detail, and the specific steps may include:
A11, constructing the deep reinforcement learning model DQN
In this embodiment, two convolutional neural networks q_eval and q_target with identical parameters are initially established to form a deep reinforcement learning model. The decision process of the deep reinforcement learning model is determined by a Q function Q(s, a; θ), where θ denotes the weight parameters of a convolutional neural network; the parameters of the convolutional neural networks q_eval and q_target are θ_eval and θ_target respectively, and the two are identical at initialization. The Q function Q(s, a; θ) denotes the return value obtained when the convolutional neural network with weights θ performs action a in state s.
In this embodiment, each convolutional neural network consists of two convolutional layers, two pooling layers and two fully connected layers. Each training input has shape [n_samples, N, K]: the first dimension n_samples is the number of input samples, and the second and third dimensions ([N, K]) describe one input sample, i.e. a normalized channel coefficient matrix of dimension [N, K]. In each training iteration, n_samples normalized channel coefficient matrices of dimension [N, K] are fed into the convolutional neural network; the output is, for the current channel state, all possible actions together with the return value Q_action_val obtained by each action. The data structure of Q_action_val is a one-dimensional vector of length Action_num, where Action_num denotes the number of all possible actions. Since n_samples channel states are input and each state yields the return values of all actions, the output is a two-dimensional matrix formed by n_samples one-dimensional vectors of length Action_num.
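The structure just described can be illustrated with a short PyTorch sketch of the Q-network used for q_eval and q_target. The kernel sizes, channel counts, hidden width and the default action count below are illustrative assumptions not specified in the patent; only the two-convolution/two-pooling/two-fully-connected layout and the [n_samples, N, K] → [n_samples, Action_num] mapping follow the description above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """q_eval / q_target: maps an [N, K] normalized channel matrix to Action_num return values."""
    def __init__(self, n_users=6, n_subcarriers=3, action_num=90):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=1),
            nn.Conv2d(16, 32, kernel_size=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=1),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 1, n_users, n_subcarriers)).numel()
        self.head = nn.Sequential(
            nn.Linear(flat, 64), nn.ReLU(),
            nn.Linear(64, action_num),      # one return value per possible allocation action
        )

    def forward(self, h_norm):              # h_norm: [n_samples, N, K]
        x = h_norm.unsqueeze(1)             # add a channel dimension -> [n_samples, 1, N, K]
        x = self.features(x)
        return self.head(x.flatten(1))      # -> [n_samples, Action_num]

q_eval = QNetwork()
q_target = QNetwork()
q_target.load_state_dict(q_eval.state_dict())   # identical parameters at initialization
```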
A12, modeling the time-varying channel environment between base station and user terminal as finite-state time-varying Markov channel, determining the normalized channel coefficient between base station and user, and inputting it to convolutional neural network qevalSelecting the action with the maximum output return value as a decision action, and allocating subcarriers for the user
In this embodiment, a plurality of common-frequency Small Base Stations (SBS) are deployed within a certain range, and the small base stations include an outdoor micro base station, a pico base station, and an indoor home base station. Within the coverage area of each small base station, 6 user terminals (UE) and 3 Subcarriers (SC) available in the non-orthogonal multiple access network are distributed in a certain area by taking the small base station as the center. In the embodiment, an independent deep reinforcement learning model is operated on each small base station, so that the effect of distributed processing is achieved. Initializing parameters of the small cell and the user terminal, including but not limited to: SBS and UEnNormalized channel coefficient H on subcarrier kn,kA channel bandwidth B and a sub-carrier channel bandwidth B allocated to the base stationSCThe circuit consumes power pkEtc., wherein the UEnIndicating user terminal n, SCkRepresenting a sub-carrier k while initializing a user-sub-carrier correlation matrix XN,KAnd Finite State time varying Markov Channel (FSMC) transition probability matrix
Figure BDA0001906660680000091
N represents a set of user terminals, and K represents a set of usable subcarriers under the current base station; user-subcarrier incidence matrix X obtained by initializationN,KAnd finite state time varying Markov channel transition probability matrix
Figure BDA0001906660680000092
Used for subsequent user association matrix optimization and calculation of updated channel state.
In this embodiment, the channel environment of the optimization scenario obtains its initial coordinates by randomly scattering the nodes in space; the initial normalized channel coefficient matrix is calculated, and the obtained values are quantized into ten levels with quantization boundaries bound_0, ..., bound_9, corresponding to the states of the finite-state time-varying Markov channel. The optimized scenario then evolves according to the time-varying Markov channel transition probability matrix P_FSMC. An element of the transition probability matrix P_FSMC is the probability transition indicator p_{i,j}, where i denotes the current state and j denotes the next state (the state after the action is performed in the current state), so that p_{i,j} denotes the probability of transitioning from the current state i to the next state j. It is stipulated that p_{i,j} takes its maximum value when i = j, i.e. the probability of keeping the original channel state is the largest, and the probability of transitioning to the second-nearest adjacent state is half the probability of transitioning to the nearest adjacent state; in each iteration, the environment is updated according to P_FSMC.
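The channel update described above can be simulated roughly as in the following sketch; the ten-state transition matrix values below are placeholders chosen only to respect the stated structure (staying in the current state is most likely, and moving two states away is half as likely as moving one state away), not values taken from the patent.

```python
import numpy as np

N_STATES = 10   # ten quantization levels bound_0 ... bound_9

def make_transition_matrix(p_stay=0.6, p_adj=0.15):
    """Row-stochastic FSMC matrix: staying put is most likely; a jump of two
    states is half as likely as a jump of one state."""
    P = np.zeros((N_STATES, N_STATES))
    for i in range(N_STATES):
        P[i, i] = p_stay
        for j in range(N_STATES):
            if abs(i - j) == 1:
                P[i, j] = p_adj
            elif abs(i - j) == 2:
                P[i, j] = p_adj / 2
        P[i] /= P[i].sum()               # renormalize rows truncated at the boundaries
    return P

def step_channel(states, P, rng):
    """states: [N, K] matrix of quantized channel-state indices; one FSMC step."""
    flat = states.ravel()
    nxt = np.array([rng.choice(N_STATES, p=P[s]) for s in flat])
    return nxt.reshape(states.shape)

rng = np.random.default_rng(0)
P = make_transition_matrix()
states = rng.integers(0, N_STATES, size=(6, 3))   # 6 users x 3 subcarriers
states = step_channel(states, P, rng)
```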
In this embodiment, the user-subcarrier association matrix X_{N,K} is described by the user-subcarrier allocation indicator x_{n,k}, which indicates whether user terminal n uses subcarrier k. In a specific application, for example, a binary 1 (x_{n,k} = 1) indicates that user terminal n uses subcarrier k, and a binary 0 (x_{n,k} = 0) indicates that user terminal n does not use subcarrier k, i.e. does not apply for resources on subcarrier k. All possible subcarrier allocations are counted as follows: the number of combinations C is introduced, and if the upper limit on the number of users multiplexed on a subcarrier of the non-orthogonal multiple access network is 2 and each user can use only one subcarrier (these numbers can be adjusted according to the practical application), there are Action_num possible allocations in total. For ease of explanation, this embodiment uses a small-capacity small-base-station network model as a simplified case for the calculation. The Action_num possible subcarrier allocation methods are stored in a list structure, denoted Action_list, whose list indices correspond to the possible subcarrier allocation methods, so that a subcarrier allocation method can be matched from its index value; this reduces the complexity of the DQN processing, and the DQN decision action is accordingly designed as an integer in [0, Action_num − 1]. Each subcarrier allocation method corresponds to a user-subcarrier association matrix X_{N,K} (an enumeration of Action_list is sketched below).
In this embodiment, the ratio of the channel gain to the noise power between the base station and the user terminal is used as the normalized channel coefficient, which is determined by the following formula:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k, calculated from Rayleigh fast fading and distance-dependent large-scale fading, with two layers of wall loss added because the typical service range of a small base station is an indoor environment; and σ_k^2 = E[|z_k|^2] denotes the noise power on subcarrier k, where E[·] denotes the mathematical expectation and z_k is additive white Gaussian noise with zero mean and variance σ_k^2.
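The following numpy sketch shows one way such normalized channel coefficients could be generated: Rayleigh fast fading combined with a distance power-law path loss and a fixed wall penetration loss, divided by the noise power. The path-loss exponent, the wall-loss value and the noise power used here are illustrative assumptions rather than parameters from the patent.

```python
import numpy as np

def normalized_channel(dist_m, noise_power_w, wall_loss_db=20.0,
                       path_loss_exp=3.0, rng=None):
    """H_{n,k} = h_{n,k} / sigma_k^2 for an [N, K] matrix of link distances."""
    rng = rng or np.random.default_rng()
    # Rayleigh fast fading: |g|^2 is exponentially distributed with unit mean.
    rayleigh_gain = rng.exponential(1.0, size=dist_m.shape)
    # Large-scale fading: distance power law plus (assumed) two layers of wall loss.
    path_gain = dist_m ** (-path_loss_exp) * 10 ** (-wall_loss_db / 10.0)
    h = rayleigh_gain * path_gain            # channel gain h_{n,k}
    return h / noise_power_w                 # normalized channel coefficient H_{n,k}

rng = np.random.default_rng(1)
distances = rng.uniform(5.0, 50.0, size=(6, 3))    # 6 users x 3 subcarriers, metres
H = normalized_channel(distances, noise_power_w=1e-13, rng=rng)
```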
In this embodiment, the normalized channel coefficients are input into the convolutional neural network q_eval, and the convolutional neural network q_eval selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and subcarriers are allocated to the users accordingly;
where the Q function Q(s, a'; θ_eval) denotes the return value obtained when the decision body of the convolutional neural network q_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is one of the possible matrices X_{N,K} representing the association of user terminals n with subcarriers k.
In this embodiment, the input of the deep reinforcement learning model DQN is the state s of the DQN decision body, i.e. the normalized channel coefficients (specifically, the two-dimensional normalized channel coefficient matrix H_{N,K}); the output is the one-dimensional vector Q_action_val. The action a' with the largest value in Q_action_val is selected as the decision action for subcarrier allocation (the optimal subcarrier allocation result); the index of the action with the largest value in Q_action_val is then matched in Action_list to obtain the current decision action X_{N,K}, i.e. the user-subcarrier association matrix X_{N,K} at which the subcarriers from the base station to the user terminals attain the locally optimal allocation. Matching the subcarrier allocation method from the index value in this way reduces the complexity of the DQN processing.
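Putting the pieces together, the decision step can be sketched as below, assuming the QNetwork and action_list from the earlier sketches; the optional ε-greedy exploration branch is a common DQN practice added here for illustration and is not described in the patent.

```python
import torch

def select_action(q_eval, H, action_list, epsilon=0.0, rng=None):
    """Pick a = argmax_a' Q(s, a'; theta_eval); optional epsilon-greedy exploration."""
    if rng is not None and epsilon > 0 and rng.random() < epsilon:
        a = int(rng.integers(len(action_list)))                 # random exploration
    else:
        state = torch.as_tensor(H, dtype=torch.float32).unsqueeze(0)   # [1, N, K]
        with torch.no_grad():
            q_values = q_eval(state)                            # [1, Action_num]
        a = int(q_values.argmax(dim=1).item())
    return a, action_list[a]                                    # decision index and X_{N,K}
```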
A13, according to the optimal subcarrier allocation result, based on the fractional order algorithm allocated by the fixed subcarriers, that is, allocating downlink power to the users multiplexed on each subcarrier under the same subcarrier according to the channel gain coefficient inverse proportion rule (wherein, the user with larger channel gain allocates smaller power, and the user with smaller channel gain allocates larger power).
In this embodiment, the downlink power allocated to the user is represented as:
Figure BDA0001906660680000111
wherein p isn,kIndicating that the base station is on a subcarrierk is the downlink transmitting power distributed to the user terminal n; p'kIndicating the downlink transmitting power distributed by the base station on the subcarrier k; a represents an attenuation factor with a constraint of 0<a<1, in the same sub-optimization process, the value of a is a fixed value and can not be changed according to different users or different subcarriers; kmaxRepresents the maximum number of users multiplexed on each subcarrier in a non-orthogonal multiple access network under the complexity that the current Successive Interference Cancellation (SIC) can bear.
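A sketch of this fractional power split, under the formula reconstructed above (the power of a subcarrier divided among its multiplexed users in inverse proportion to their normalized channel coefficients raised to the attenuation factor a):

```python
import numpy as np

def allocate_power(H, X, p_sub, a=0.4):
    """Fractional (FTPA-style) power split: users with larger H_{n,k} get less power.

    H: [N, K] normalized channel coefficients; X: [N, K] 0/1 association matrix;
    p_sub: length-K vector of per-subcarrier downlink powers P'_k; 0 < a < 1.
    """
    N, K = H.shape
    p = np.zeros((N, K))
    for k in range(K):
        users = np.flatnonzero(X[:, k])
        if users.size == 0:
            continue
        weights = H[users, k] ** (-a)            # inverse-proportional weighting
        p[users, k] = p_sub[k] * weights / weights.sum()
    return p
```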
A14, determining the maximum undistorted information transmission rate r from the base station subcarrier k to the user terminal nn,k
In this embodiment, the maximum undistorted information transmission rate r from the base station subcarrier k to the user terminal nn,kExpressed as:
rn,k=log2(1+γn,k)
Figure BDA0001906660680000112
wherein, γn,kRepresenting the signal-to-noise ratio, gamma, of the signal obtained by the user terminal n from the subcarrier kn,kRepresenting the signal-to-noise ratio of the signal obtained by the user terminal n from subcarrier k.
In this embodiment, in a non-orthogonal multiple access network, the normalized channel coefficients of users multiplexed on the same subcarrier are arranged in a descending order, and are represented as:
|H1,k|≥|H2,k|≥…≥|Hn,k|≥|Hn+1,k|≥…≥|HKmax,k|
based on the optimal decoding order of the successive interference canceller, when the ue i is located before j in the sequence, the interference from the ue j can be successfully decoded and removed, and the ue j receives the signal of the ue i and accepts the signal as interference. In the non-orthogonal multiple access network, considering the fairness among users and the principle of reducing co-channel interference, when allocating power, the user with good channel condition allocates less power, i.e. in the above example, if H isi,k>Hj,kThen p is allocatedi,k<pj,kIn accordance with the assignment rule of the fractional order algorithm in a 13.
The co-frequency interference and the calculation complexity are reduced as much as possible under the small base station scene, and the number of the multiplexed sub-carriers is predefined to be KmaxThe maximum information transmission rate for ue i and ue j is a logarithmic function of the Signal to Interference plus Noise Ratio (SINR). Chi shapeINNER=pi,kHj,kIndicating the intra-layer co-channel interference experienced by the user terminal j under the service of the current base station.
In this embodiment, the maximum transmission rates of user terminal i and user terminal j are expressed as:
r_{i,k} = log2(1 + γ_{i,k}),  r_{j,k} = log2(1 + γ_{j,k}),  γ_{i,k} = p_{i,k} H_{i,k},  γ_{j,k} = p_{j,k} H_{j,k} / (1 + p_{i,k} H_{j,k})
namely:
r_{i,k} = log2(1 + p_{i,k} H_{i,k}),  r_{j,k} = log2(1 + p_{j,k} H_{j,k} / (1 + p_{i,k} H_{j,k}))
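For an arbitrary number of multiplexed users, the per-user rates can be computed as in the sketch below, which follows the SIC ordering described above: a user is interfered only by co-multiplexed users whose normalized channel coefficient is larger than its own.

```python
import numpy as np

def rates(H, X, p):
    """Per-user-per-subcarrier rates r_{n,k} = log2(1 + gamma_{n,k}) under NOMA with SIC.

    A user on subcarrier k is interfered only by co-multiplexed users whose
    normalized channel coefficient is larger than its own.
    """
    N, K = H.shape
    r = np.zeros((N, K))
    for k in range(K):
        users = np.flatnonzero(X[:, k])
        for n in users:
            stronger = [i for i in users if H[i, k] > H[n, k]]
            interference = H[n, k] * p[stronger, k].sum() if stronger else 0.0
            gamma = p[n, k] * H[n, k] / (1.0 + interference)
            r[n, k] = np.log2(1.0 + gamma)
    return r
```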
A16, determining the system power consumption U_P(X)
In this embodiment, considering that the small base station has an energy recovery unit, the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the power consumed by the circuit, and ψ denotes the base station energy recovery coefficient, which can be modified according to the actual hardware properties.
A17, according to determined gamman,kAnd UP(X), determining the energy efficiency of the system
In the present example, based on the obtainedMaximum undistorted information transmission rate r from base station subcarrier k to user terminal nn,kAnd system power consumption UP(X) calculating the energy efficiency ee of the subcarriers k to the user terminal nn,k
Figure BDA0001906660680000124
Wherein the content of the first and second substances,
Figure BDA0001906660680000125
representing the subcarrier k channel bandwidth.
In this embodiment, the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k}
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station.
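Combining the rates with the power model gives the system energy efficiency used as the basis of the return function. The sketch below follows the reconstructed forms used in this document (ee_{n,k} = B_SC·r_{n,k}/U_P(X) summed over the active user-subcarrier pairs, with U_P(X) modelled as circuit power plus transmit power discounted by ψ), so the exact normalization should be checked against the original equations.

```python
def system_energy_efficiency(r, X, p, b_sc, p_circuit, psi):
    """EE(X) = sum_n sum_k x_{n,k} * B_SC * r_{n,k} / U_P(X)  (reconstructed form).

    All array arguments are [N, K] numpy arrays; p_circuit is the total circuit
    power and psi the base-station energy recovery coefficient.
    """
    transmit = (X * p).sum()
    u_p = p_circuit + (1.0 - psi) * transmit     # system power consumption U_P(X)
    return b_sc * (X * r).sum() / u_p
```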
A17, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model
In this embodiment, for system energy efficiency that does not meet preset modeling constraint conditions (the modeling constraint conditions are determined by factors such as an inter-user fairness principle, a minimum quality of service standard, and an upper limit of cross-layer interference), a weak supervision algorithm based on value return punishment is performed on the system energy efficiency according to types that do not meet the modeling constraint conditions, a return function after a deep reinforcement learning model makes a decision action is obtained, and the return function is fed back to the deep reinforcement learning model; wherein the reward function is represented as:
Figure BDA0001906660680000131
wherein, rewardtRepresenting a return function calculated during the t training; rminRepresents the minimum standard of user quality of Service (QoS), i.e. the minimum downlink transmission rate; hinnterThe normalized channel coefficient representing the shortest distance between the nearest base station operating at the same subcarrier frequency and the currently optimized base station may be calculated according to the method in step a 12; i iskRepresenting the cross-layer (cross-station) interference upper limit which the k sub-carrier frequency band can bear, setting and adjusting the interference upper limit according to specific application ξcase1~ξcase3Penalty coefficients for energy efficiency are represented for three cases that do not meet the modeling constraints.
In addition, the following should be noted: when the system energy efficiency is taken directly as the return function, x_{n,k} and the attenuation factor a must also satisfy further constraints. Combined with the above, the constraint conditions to be met by x_{n,k} and a are as follows:
Condition 1: Σ_{k∈K} x_{n,k} ≤ 1, ∀n ∈ N, which restricts a user terminal to be associated with at most one subcarrier at the same time.
Condition 2: Σ_{n∈N} x_{n,k} ≤ K_max, ∀k ∈ K, which limits the maximum number of users multiplexed on the same subcarrier in the non-orthogonal multiple access network to K_max, in order to reduce intra-station interference and the complexity of the successive interference canceller.
Condition 3: Σ_{k∈K} x_{n,k} B_SC r_{n,k} ≥ R_min, ∀n ∈ N, which is the QoS constraint: the information transmission rate of every user terminal served by the base station should exceed the minimum user quality-of-service limit.
Condition 4: Σ_{n∈N} x_{n,k} p_{n,k} ≤ P'_k, ∀k ∈ K, with the total transmit power not exceeding the peak power BS_peak of the small base station, which limits the maximum transmit power of the base station on subcarrier k.
Condition 5: Σ_{n∈N} x_{n,k} p_{n,k} H_inter ≤ I_k, ∀k ∈ K, which is an effective interference coordination mechanism limiting the interference of the currently optimized base station on other base stations.
Condition 6: 0 < a < 1, which is the limit on the attenuation factor used when allocating power.
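A penalized return function of the kind described above could look like the following sketch. The mapping of the three penalty coefficients to particular violations (QoS, per-subcarrier transmit power, cross-layer interference) and their numeric defaults are assumptions made here for illustration; the patent only states that each type of violation scales the system energy efficiency by its own coefficient ξ.

```python
def reward(ee, r, X, p, p_sub_max, h_inter, i_max, r_min,
           xi1=0.5, xi2=0.3, xi3=0.1):
    """Scale the system energy efficiency down when a modeling constraint is violated.

    All array arguments are numpy arrays; the xi coefficients and the
    constraint-to-coefficient mapping are assumed, not taken from the patent.
    """
    served = X.sum(axis=1) > 0
    qos_ok = ((X * r).sum(axis=1)[served] >= r_min).all()       # condition 3: minimum rate
    power_ok = ((X * p).sum(axis=0) <= p_sub_max).all()         # condition 4: per-subcarrier power
    interf_ok = ((X * p).sum(axis=0) * h_inter <= i_max).all()  # condition 5: cross-layer interference

    if qos_ok and power_ok and interf_ok:
        return ee
    if not qos_ok:
        return xi1 * ee
    if not power_ok:
        return xi2 * ee
    return xi3 * ee
```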
A18, storing the report function, channel environment, decision action and transition order state into DQN memory playback unit
In this embodiment, the reporting function, the channel environment, the decision action, and the transition order (transition state) are stored as a quadruple in the DQN memory playback unit memory, where the memory is represented as:
memory:D(t)={e(1),...,e(t)}
e(t)=(s(t),a(t),r(t),s(t+1))
wherein, s (t) represents the normalized channel coefficient (state) input during the t-th training of the model; a (t) represents the decision-making action made by the DQN when the deep reinforcement learning model is trained for the t time, namely a user-subcarrier correlation matrix; r (t) represents a reward function obtained after the action a (t) of the DQN is finished when the t training deep reinforcement learning model is trainedt(ii) a s (t +1) represents the normalized channel coefficient (secondary state) after updating according to the time-varying Markov channel in the finite state when the deep reinforcement learning model is trained for t +1 times.
In this embodiment, each group e (t) is stored by defining a memory playback class and setting the memory as a data structure of an object array or a dictionary.
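A minimal memory playback class of this kind stores each experience as the quadruple (s(t), a(t), r(t), s(t+1)) in a fixed-size buffer and samples uniformly at random; the capacity and the use of a deque below are implementation choices, not taken from the patent.

```python
import random
from collections import deque

class ReplayMemory:
    """Memory playback unit: stores e(t) = (s(t), a(t), r(t), s(t+1))."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```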
A19, training a deep reinforcement learning model by using a batch processing mode, and randomly selecting batch memory data with a fixed size from the DQN memory playback unit for learning and gradient descent updating of two convolutional neural networks.
In this embodiment, the memory data is processed by using a Loss function Loss (θ), which is expressed as:
Figure BDA0001906660680000144
the gradient descent update formula is expressed as:
Figure BDA0001906660680000145
wherein the content of the first and second substances,
Figure BDA0001906660680000146
represents a training learning rate; λ represents a discount factor for evaluation of the attitude of the decision body;
Figure BDA0001906660680000151
represents that when the input is the sub-state s (t +1) of the current memory e (t), the weight is thetatargetOf the convolutional neural network qtargetAn action a' which is decided to be capable of harvesting the maximum return; q(s), (t), a (t); thetaeval) Indicating that when the input is the state s (t) of the current memory e (t), the weight is θevalOf the convolutional neural network qevalPerforming the reward value obtained in act a (t);
Figure BDA0001906660680000152
represents a parameter of thetaevalThe convolutional neural network performs gradient descent operation, i.e. modifies the convolutional neural network qevalParameter theta ofevalMake the convolutional neural network qtargetAnd q isevalThe output of (c) is subtracted to a minimum.
In the present embodiment, the subtraction Q(s) (t), a (t); θeval) If the memory unit e (1) selects action 2, only updating [1,2 ] of two convolutional neural networks by gradient descent updating formula]The values of the positions are unchanged, the values corresponding to the rest of actions in the first dimension are unchanged, and in order to ensure the stability of training, the gradient descent only updates the convolutional neural network qevalThe parameter (c) of (c).
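The batch update can be sketched in PyTorch as follows: the TD target r(t) + λ·max_{a'} Q(s(t+1), a'; θ_target) is compared with Q(s(t), a(t); θ_eval) only at the chosen action, the squared error is minimized, and gradients flow only into q_eval; the optimizer choice and hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(q_eval, q_target, memory, optimizer, batch_size=32, gamma=0.9):
    """One gradient-descent update of q_eval from a random minibatch of memories."""
    batch = memory.sample(batch_size)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _, _ in batch])
    actions = torch.tensor([a for _, a, _, _ in batch], dtype=torch.int64)
    rewards = torch.tensor([r for _, _, r, _ in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(s2, dtype=torch.float32) for _, _, _, s2 in batch])

    # Q(s(t), a(t); theta_eval): only the column of the chosen action enters the loss.
    q_sa = q_eval(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # no gradient flows into q_target
        q_next = q_target(next_states).max(dim=1).values
    target = rewards + gamma * q_next              # TD target r(t) + lambda * max_a' Q_target

    loss = F.mse_loss(q_sa, target)                # Loss(theta_eval)
    optimizer.zero_grad()
    loss.backward()                                # gradient descent only on theta_eval
    optimizer.step()
    return float(loss.item())

def sync_target(q_eval, q_target):
    """Every C_max training iterations: theta_target <- theta_eval."""
    q_target.load_state_dict(q_eval.state_dict())
```

Here `optimizer` would be constructed over `q_eval.parameters()` only, for example `torch.optim.RMSprop(q_eval.parameters())`, so that the descent step never modifies θ_target, which is refreshed separately by `sync_target` every C_max iterations.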
A20, updating q every fixed times in the deep reinforcement learning model training processtargetParameter qevalThe parameters, expressed as:
Figure BDA0001906660680000153
wherein, CiterA counter for representing training is used for recording the training times; cmaxDenotes qtargetParameter and qevalUpdate interval of parameter, also CiterOf (2) and thus CiterIs equal to CmaxAnd then, the zero is reset.
A21, q updated by steps A19 and A20targetNetwork parameters and qevalIf the difference value between the system energy efficiency value which is continuously optimized for multiple times and a preset threshold (specified value) is within a preset range or is higher than the preset threshold, the deep reinforcement learning model can be considered to be suitable for wireless resource allocation in the time-varying channel environment, the currently allocated downlink power is locally optimal power allocation in the time-varying channel environment, the current deep reinforcement learning model achieves locally optimal allocation of network resources in the time-varying environment, and the obtained deep reinforcement learning model can be continuously used in the actual time-varying channel environment;
a22, otherwise, press
Figure BDA0001906660680000154
Update environment, judge Citer=CmaxIf true, let Citer=0、θtarget=θevalThen, step A12 is executed; otherwise, step a12 is directly executed until the difference between the recalculated system energy efficiency value and the preset threshold is within the preset range or higher than the preset threshold, at which time the best optimization in the time-varying channel environment is achieved.
In this embodiment, as the number of times of optimization t increases, the return value of the DQN model in the time-varying channel environment gradually tends from low to higher, and this process is a wireless network resource allocation method based on deep reinforcement learning, thereby implementing optimization of subcarrier and power allocation in the time-varying channel environment.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (1)

1. A wireless network resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
S101, establishing two convolutional neural networks q_eval and q_target with identical parameters to form a deep reinforcement learning model;
S102, modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users;
S103, according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a return function based on the system energy efficiency, and feeding the return function back to the deep reinforcement learning model;
S104, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold is within a preset range, or the system energy efficiency values obtained in several consecutive iterations are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment;
wherein the normalized channel coefficient is expressed as:
H_{n,k} = h_{n,k} / σ_k^2
where H_{n,k} denotes the normalized channel coefficient, i.e. the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k; and σ_k^2 denotes the noise power on subcarrier k;
wherein inputting the normalized channel coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to the users comprises:
inputting the normalized channel coefficients into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action through the decision formula
a = argmax_{a'} Q(s, a'; θ_eval)
and allocates subcarriers to the users;
where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a'; θ_eval) denotes the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a' in state s, the state s being the input normalized channel coefficient; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is obtained from the index of the action with the maximum return value;
wherein the downlink power allocated to a user is expressed as:
p_{n,k} = P'_k · (H_{n,k})^{-α} / Σ_{i=1}^{K_max} (H_{i,k})^{-α}
where p_{n,k} denotes the downlink transmit power allocated by the base station to user terminal n on subcarrier k; P'_k denotes the downlink transmit power allocated by the base station on subcarrier k; α denotes the attenuation factor; and K_max denotes the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network under the complexity that the current successive interference canceller can bear;
wherein determining the system energy efficiency based on the allocated downlink power comprises:
determining the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n;
determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result and the allocated downlink power;
determining the system energy efficiency according to the determined r_{n,k} and U_P(X);
wherein the maximum undistorted information transmission rate r_{n,k} from subcarrier k of the base station to user terminal n is expressed as:
r_{n,k} = log2(1 + γ_{n,k})
γ_{n,k} = p_{n,k} H_{n,k} / (1 + H_{n,k} Σ_{i: |H_{i,k}| > |H_{n,k}|} p_{i,k})
where γ_{n,k} denotes the signal-to-interference-plus-noise ratio of the signal that user terminal n obtains on subcarrier k;
the system power consumption U_P(X) is expressed as:
U_P(X) = Σ_{k∈K} ( p_k + (1 − ψ) Σ_{n∈N} x_{n,k} p_{n,k} )
where p_k denotes the circuit power consumption, ψ denotes the base station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k;
wherein the system energy efficiency is expressed as:
EE(X) = Σ_{n∈N} Σ_{k∈K} x_{n,k} ee_{n,k},  with  ee_{n,k} = B_SC · r_{n,k} / U_P(X)
where ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the channel bandwidth of subcarrier k, N denotes the set of user terminals, and K denotes the set of subcarriers available under the current base station;
wherein determining the return function according to the system energy efficiency and feeding the return function back to the deep reinforcement learning model comprises:
penalizing, by a weakly supervised algorithm based on value return, the system energy efficiency that does not satisfy the preset modeling constraints, according to the type of constraint violated, to obtain the return function after the deep reinforcement learning model makes the decision action, and feeding the return function back to the deep reinforcement learning model; wherein the return function is expressed as:

reward_t = [formula image FDA0002364532410000031]

wherein reward_t represents the return function calculated in the t-th training; R_min represents the minimum standard of user quality of service, namely the minimum downlink transmission rate; H_inter represents the normalized channel coefficient corresponding to the shortest distance between the nearest base station operating on the same subcarrier frequency and the base station currently being optimized; I_k represents the upper limit of cross-layer interference that the k-th subcarrier frequency band can bear; ξ_case1~ξ_case3 represent the penalty coefficients applied to the system energy efficiency in the three cases that do not satisfy the modeling constraints;
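A hedged sketch of this penalty-based return: the exact piecewise expression is only available as an image, so the code below simply scales the system energy efficiency by a penalty coefficient for each violated constraint, with the three violation checks and the ξ coefficients passed in from outside:

```python
def compute_reward(ee, qos_violated, interference_violated, other_violated,
                   xi_case1=0.5, xi_case2=0.5, xi_case3=0.5):
    """Return function reward_t: the system energy efficiency, scaled down by a
    penalty coefficient for each modeling constraint that the decision violates.

    qos_violated          : True if some user's rate falls below R_min.
    interference_violated : True if the cross-layer interference caused on a
                            subcarrier band (judged via H_inter) exceeds I_k.
    other_violated        : True if the third modeling constraint is broken.
    xi_case1..xi_case3    : penalty coefficients for the three cases (values assumed).
    """
    penalty = 1.0
    if qos_violated:
        penalty *= xi_case1
    if interference_violated:
        penalty *= xi_case2
    if other_violated:
        penalty *= xi_case3
    return ee * penalty
```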
wherein training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined return function, and, if the difference between the system energy efficiency values obtained in a plurality of consecutive iterations and the preset threshold is within a preset range, or the system energy efficiency values obtained in a plurality of consecutive iterations are higher than the preset threshold, taking the currently allocated downlink power as the locally optimal power allocation in the time-varying channel environment, comprises:
storing the return function, the channel environment, the decision action and the next state transitioned to as a quadruple in the memory replay unit (memory) of the deep reinforcement learning model, wherein the memory is expressed as:
memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))
wherein s(t) represents the input state at the t-th training of the deep reinforcement learning model; a(t) represents the decision action made by the deep reinforcement learning model at the t-th training; r(t) represents the return function reward_t obtained after the deep reinforcement learning model performs action a(t) at the t-th training; s(t+1) represents the next state, updated according to the finite-state time-varying Markov channel, at the (t+1)-th training;
randomly selecting memory data from the memory replay unit of the deep reinforcement learning model for the learning and gradient-descent update of the two convolutional neural networks, wherein the gradient descent only updates the convolutional neural network q_eval, and, at fixed intervals during the training of the deep reinforcement learning model, the parameters θ_target of q_target are updated to the parameters θ_eval of q_eval;
if the difference between the system energy efficiency values obtained in a plurality of consecutive iterations and the preset threshold is within the preset range, or the system energy efficiency values obtained in a plurality of consecutive iterations are higher than the preset threshold, the currently allocated downlink power is the locally optimal power allocation in the time-varying channel environment;
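A minimal experience-replay sketch of the storage and sampling steps above, assuming the quadruple layout e(t) = (s(t), a(t), r(t), s(t+1)); the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Memory replay unit D(t) = {e(1), ..., e(t)} of quadruples e(t)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)       # oldest experiences are discarded

    def store(self, state, action, reward, next_state):
        """Append one quadruple e(t) = (s(t), a(t), r(t), s(t+1))."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        """Randomly select stored memory data for the gradient-descent update of q_eval."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```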
wherein the gradient descent update formula is expressed as:

θ_eval ← θ_eval + β · [ r(t) + λ · max_{a'} Q(s(t+1), a'; θ_target) − Q(s(t), a(t); θ_eval) ] · ∇_{θ_eval} Q(s(t), a(t); θ_eval)

wherein β represents the training learning rate; λ represents the discount factor reflecting the decision body's attitude toward future returns; max_{a'} Q(s(t+1), a'; θ_target) represents, when the input is the next state s(t+1) of the current memory e(t), the maximum return that the convolutional neural network q_target with weights θ_target can obtain over the actions a' it may decide; Q(s(t), a(t); θ_eval) represents, when the input is the state s(t) of the current memory e(t), the return value obtained when the convolutional neural network q_eval with weights θ_eval performs action a(t); ∇_{θ_eval} represents the gradient operation performed with respect to the parameters θ_eval of the convolutional neural network.
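A PyTorch sketch of this update, written as one stochastic-gradient step on the squared temporal-difference error, which under plain SGD reduces to the update formula above; q_eval and q_target are assumed to be two networks of identical architecture mapping a batch of states to one Q-value per action, and the optimizer, discount value and sync interval are illustrative:

```python
import torch
import torch.nn as nn

def dqn_update(q_eval, q_target, optimizer, batch, discount=0.9):
    """One gradient-descent step on q_eval from a sampled batch of quadruples e(t)."""
    states, actions, rewards, next_states = zip(*batch)
    s  = torch.stack([torch.as_tensor(v, dtype=torch.float32) for v in states])
    s2 = torch.stack([torch.as_tensor(v, dtype=torch.float32) for v in next_states])
    a  = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r  = torch.as_tensor(rewards, dtype=torch.float32)

    # Target value: r(t) + lambda * max_a' Q(s(t+1), a'; theta_target); the target
    # network is never differentiated, so no gradient flows through this term.
    with torch.no_grad():
        target = r + discount * q_target(s2).max(dim=1).values

    # Predicted value Q(s(t), a(t); theta_eval) for the actions actually taken.
    prediction = q_eval(s).gather(1, a).squeeze(1)

    loss = nn.functional.mse_loss(prediction, target)   # squared TD error
    optimizer.zero_grad()
    loss.backward()                                      # gradients w.r.t. theta_eval only
    optimizer.step()
    return loss.item()

def sync_target(q_eval, q_target):
    """Every fixed number of training steps, overwrite theta_target with theta_eval."""
    q_target.load_state_dict(q_eval.state_dict())
```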
CN201811535056.1A 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning Active CN109474980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811535056.1A CN109474980B (en) 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN109474980A CN109474980A (en) 2019-03-15
CN109474980B true CN109474980B (en) 2020-04-28

Family

ID=65675169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811535056.1A Active CN109474980B (en) 2018-12-14 2018-12-14 Wireless network resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109474980B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113615277B (en) * 2019-03-27 2023-03-24 华为技术有限公司 Power distribution method and device based on neural network
CN109962728B (en) * 2019-03-28 2021-01-26 北京邮电大学 Multi-node joint power control method based on deep reinforcement learning
CN110084245B (en) * 2019-04-04 2020-12-25 中国科学院自动化研究所 Weak supervision image detection method and system based on visual attention mechanism reinforcement learning
CN110430613B (en) * 2019-04-11 2022-07-01 重庆邮电大学 Energy-efficiency-based resource allocation method for multi-carrier non-orthogonal multiple access system
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110167176B (en) * 2019-04-25 2021-06-01 北京科技大学 Wireless network resource allocation method based on distributed machine learning
CN110401975A (en) * 2019-07-05 2019-11-01 深圳市中电数通智慧安全科技股份有限公司 A kind of method, apparatus and electronic equipment of the transmission power adjusting internet of things equipment
CN110380776B (en) * 2019-08-22 2021-05-14 电子科技大学 Internet of things system data collection method based on unmanned aerial vehicle
CN110635833B (en) * 2019-09-25 2020-12-15 北京邮电大学 Power distribution method and device based on deep learning
CN111428903A (en) * 2019-10-31 2020-07-17 国家电网有限公司 Interruptible load optimization method based on deep reinforcement learning
CN110809306B (en) * 2019-11-04 2021-03-16 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN110972309B (en) * 2019-11-08 2022-07-19 厦门大学 Ultra-dense wireless network power distribution method combining graph signals and reinforcement learning
US11246173B2 (en) 2019-11-08 2022-02-08 Huawei Technologies Co. Ltd. Systems and methods for multi-user pairing in wireless communication networks
CN112988229B (en) * 2019-12-12 2022-08-05 上海大学 Convolutional neural network resource optimization configuration method based on heterogeneous computation
CN111211831A (en) * 2020-01-13 2020-05-29 东方红卫星移动通信有限公司 Multi-beam low-orbit satellite intelligent dynamic channel resource allocation method
CN111431646B (en) * 2020-03-31 2021-06-15 北京邮电大学 Dynamic resource allocation method in millimeter wave system
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112104400B (en) * 2020-04-24 2023-04-07 广西华南通信股份有限公司 Combined relay selection method and system based on supervised machine learning
CN111542107A (en) * 2020-05-14 2020-08-14 南昌工程学院 Mobile edge network resource allocation method based on reinforcement learning
CN111885720B (en) * 2020-06-08 2021-05-28 中山大学 Multi-user subcarrier power distribution method based on deep reinforcement learning
CN111867110B (en) * 2020-06-17 2023-10-03 三明学院 Wireless network channel separation energy-saving method based on switch switching strategy
CN111930501B (en) * 2020-07-23 2022-08-26 齐齐哈尔大学 Wireless resource allocation method based on unsupervised learning and oriented to multi-cell network
CN112770398A (en) * 2020-12-18 2021-05-07 北京科技大学 Far-end radio frequency end power control method based on convolutional neural network
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113490184B (en) * 2021-05-10 2023-05-26 北京科技大学 Random access resource optimization method and device for intelligent factory
CN113395757B (en) * 2021-06-10 2023-06-30 中国人民解放军空军通信士官学校 Deep reinforcement learning cognitive network power control method based on improved return function
CN114126025B (en) * 2021-11-02 2023-04-28 中国联合网络通信集团有限公司 Power adjustment method for vehicle-mounted terminal, vehicle-mounted terminal and server
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114360305A (en) * 2021-12-15 2022-04-15 广州创显科教股份有限公司 Classroom interactive teaching method and system based on 5G network
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121766A1 (en) * 2016-09-18 2018-05-03 Newvoicemedia, Ltd. Enhanced human/machine workforce management using reinforcement learning
US20180091981A1 (en) * 2016-09-23 2018-03-29 Board Of Trustees Of The University Of Arkansas Smart vehicular hybrid network systems and applications of same

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN105407535A (en) * 2015-10-22 2016-03-16 东南大学 High energy efficiency resource optimization method based on constrained Markov decision process
CN106909728A (en) * 2017-02-21 2017-06-30 电子科技大学 A kind of FPGA interconnection resources configuration generating methods based on enhancing study
CN108307510A (en) * 2018-02-28 2018-07-20 北京科技大学 A kind of power distribution method in isomery subzone network
CN108712748A (en) * 2018-04-12 2018-10-26 天津大学 A method of the anti-interference intelligent decision of cognitive radio based on intensified learning
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods
CN108989099A (en) * 2018-07-02 2018-12-11 北京邮电大学 Federated resource distribution method and system based on software definition Incorporate network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Integrated Networking, Caching, and Computing for Connected Vehicles: A Deep Reinforcement Learning Approach; Ying He et al.; IEEE Transactions on Vehicular Technology; 2017-10-06; full text *
Power Allocation in Multi-cell Networks Using Deep Reinforcement Learning; Yong Zhang et al.; 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall); 2018-08-30; full text *

Also Published As

Publication number Publication date
CN109474980A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109474980B (en) Wireless network resource allocation method based on deep reinforcement learning
CN110493826B (en) Heterogeneous cloud wireless access network resource allocation method based on deep reinforcement learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
CN107426820B (en) Resource allocation method for improving energy efficiency of multi-user game in cognitive D2D communication system
CN110708711A (en) Heterogeneous energy-carrying communication network resource allocation method based on non-orthogonal multiple access
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
CN105451322B (en) A kind of channel distribution and Poewr control method based on QoS in D2D network
AlQerm et al. Enhanced machine learning scheme for energy efficient resource allocation in 5G heterogeneous cloud radio access networks
CN107708157A (en) Intensive small cell network resource allocation methods based on efficiency
CN106792451B (en) D2D communication resource optimization method based on multi-population genetic algorithm
CN109982437B (en) D2D communication spectrum allocation method based on location-aware weighted graph
CN113316154B (en) Authorized and unauthorized D2D communication resource joint intelligent distribution method
Coskun et al. Three-stage resource allocation algorithm for energy-efficient heterogeneous networks
Shahid et al. Self-organized energy-efficient cross-layer optimization for device to device communication in heterogeneous cellular networks
Zhang et al. Resource optimization-based interference management for hybrid self-organized small-cell network
CN105490794B (en) The packet-based resource allocation methods of the Femto cell OFDMA double-layer network
Yu et al. Interference coordination strategy based on Nash bargaining for small‐cell networks
CN110139282B (en) Energy acquisition D2D communication resource allocation method based on neural network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN110677175A (en) Sub-channel scheduling and power distribution joint optimization method based on non-orthogonal multiple access system
CN114423028A (en) CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
Khodmi et al. Joint user-channel assignment and power allocation for non-orthogonal multiple access in a 5G heterogeneous ultra-dense networks
CN110677176A (en) Combined compromise optimization method based on energy efficiency and spectrum efficiency
CN109275163B (en) Non-orthogonal multiple access joint bandwidth and rate allocation method based on structured ordering characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant