CN109474980A - A wireless network resource allocation method based on deep reinforcement learning - Google Patents
A wireless network resource allocation method based on deep reinforcement learning
- Publication number
- CN109474980A CN109474980A CN201811535056.1A CN201811535056A CN109474980A CN 109474980 A CN109474980 A CN 109474980A CN 201811535056 A CN201811535056 A CN 201811535056A CN 109474980 A CN109474980 A CN 109474980A
- Authority
- CN
- China
- Prior art keywords
- eval
- depth
- subcarrier
- indicate
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/06—TPC algorithms
- H04W52/14—Separate analysis of uplink or downlink
- H04W52/143—Downlink power control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/18—TPC being performed according to specific parameters
- H04W52/24—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
- H04W52/241—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/Io
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/18—TPC being performed according to specific parameters
- H04W52/26—TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service]
- H04W52/265—TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service] taking into account the quality of service QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/30—TPC using constraints in the total amount of available transmission power
- H04W52/34—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
- H04W52/346—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/53—Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/54—Allocation or scheduling criteria for wireless resources based on quality criteria
- H04W72/542—Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/54—Allocation or scheduling criteria for wireless resources based on quality criteria
- H04W72/543—Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
Abstract
The present invention provides a wireless network resource allocation method based on deep reinforcement learning that can maximize the energy efficiency in a time-varying channel environment with low complexity. The method comprises: establishing a deep reinforcement learning model; modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to users; according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the reward function based on the allocated downlink power, and feeding the reward function back to the deep reinforcement learning model; and, according to the determined reward function, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model to determine the locally optimal power allocation under the time-varying channel environment. The present invention relates to the fields of wireless communication and artificial-intelligence decision making.
Description
Technical field
The present invention relates to the fields of wireless communication and artificial-intelligence decision making, and in particular to a wireless network resource allocation method based on deep reinforcement learning.
Background technique
In the Long Term Evolution (LTE) era, the networking architecture shifted from macro-only networks to macro-micro cooperation. Macro cells face many development challenges, such as unexpected traffic growth, ubiquitous access demands, randomly appearing hotspots, and the considerable cost pressure of the macro cells themselves. Small cells such as microcells and femtocells therefore stand out for their precise coverage and ability to fill blind spots, and have increasingly become an important complement to macro base stations in network deployment, sharing the macro base stations' service load. Fifth-generation (5G) mobile communication is the evolution beyond 4G: 5G is not a single radio access technology but the umbrella term for the integrated solution combining various new radio access technologies with the evolution of existing ones. 5G networks are now entering public view, and the industry generally regards user-experienced data rate as the most important 5G performance indicator. The technical characteristics of 5G can be summarized in a few numbers: a 1000x capacity increase, support for 100 billion+ connections, a peak rate of 10 Gb/s, and latency below 1 ms. Key 5G technologies include massive MIMO, novel multiple-access techniques, and ultra-dense networks, in which the deployment of small and macro base stations forms an ultra-dense heterogeneous network that provides ubiquitous service to users.
With the sharp increase in the number of mobile users, small-cell deployments are also becoming ultra-dense, and the energy consumed by wireless communications is enormous. Given China's serious environmental pollution and increasingly strained energy supply, green communication is a direction worth researching and exploring. Therefore, achieving higher energy efficiency through reasonable resource allocation, while still meeting user data demands and quality of service, is an important research direction. However, the prior art lacks an effective optimization method that accounts for the influence of time-varying channels, simulates a practical time-varying channel environment, and allocates network resources with low computational complexity to obtain higher energy efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a wireless network resource allocation method based on deep reinforcement learning, so as to solve the problem in the prior art that radio resource allocation in a time-varying channel environment cannot be realized effectively.
To solve the above technical problem, an embodiment of the present invention provides a wireless network resource allocation method based on deep reinforcement learning, comprising:
S101: establishing a deep reinforcement learning model composed of two convolutional neural networks q_eval and q_target with identical parameters;
S102: modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to users;
S103: according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining the reward function based on the system energy efficiency, and feeding the reward function back to the deep reinforcement learning model;
S104: according to the determined reward function, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold lies within a preset range, or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment.
Further, the normalized channel coefficient is expressed as:

H_{n,k} = h_{n,k} / σ_k²

where H_{n,k} is the normalized channel coefficient, denoting the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} denotes the channel gain between the base station and user terminal n on subcarrier k; and σ_k² denotes the noise power on subcarrier k.
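For illustration (not part of the patent text), the normalized channel coefficient H_{n,k} = h_{n,k}/σ_k², the ratio of channel gain to subcarrier noise power described above, can be sketched in a few lines; the array shapes below are illustrative:

```python
import numpy as np

def normalized_channel_coefficients(h, noise_power):
    """Normalize per-(user, subcarrier) channel gains h[n, k] by the
    noise power sigma2[k] of each subcarrier: H[n, k] = h[n, k] / sigma2[k]."""
    h = np.asarray(h, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return h / noise_power  # broadcasts sigma2[k] across all users

# Example: 2 user terminals, 3 subcarriers
h = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0]])
sigma2 = np.array([1.0, 2.0, 0.5])
H = normalized_channel_coefficients(h, sigma2)
```

The resulting matrix H is the state s fed into the convolutional neural network q_eval.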
Further, inputting the coefficients into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to users comprises:

inputting the normalized channel coefficients into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action through the decision formula

a = argmax_{a′} Q(s, a′; θ_eval)

and allocates subcarriers to the users;

where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a′; θ_eval) denotes the return obtained when the network with weights θ_eval executes action a′ in state s; the state s is the input normalized channel coefficients; and a denotes the decision action of the deep reinforcement learning model, i.e. the optimal subcarrier allocation result, which is obtained from the index of the action with the maximum return value.
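The greedy selection rule a = argmax_{a′} Q(s, a′; θ_eval) reduces to taking the index of the largest entry in the network's output vector Q_action_val; a minimal sketch (the vector values are illustrative):

```python
import numpy as np

def greedy_action(q_values):
    """Return the index of the action with the maximum predicted return,
    i.e. a = argmax_{a'} Q(s, a'; theta_eval). The index is later mapped
    to a subcarrier allocation via the action list."""
    return int(np.argmax(q_values))

q_out = np.array([0.1, 1.7, 0.4, 1.2])  # Q_action_val for one input state
a = greedy_action(q_out)                # picks index 1, the largest return
```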
Further, the downlink power allocated to a user is characterized as follows: p_{n,k} denotes the downlink transmit power allocated by the base station to user terminal n on subcarrier k; p′_k denotes the downlink transmit power allocated by the base station on subcarrier k; A denotes the decay factor; and K_max denotes the maximum number of users that can be multiplexed on each subcarrier in the non-orthogonal multiple access network, under the complexity that the current successive interference cancellation receiver can bear.
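The patent's exact power formula is not reproduced in this text. As an illustrative sketch only, under the assumption that the per-subcarrier power p′_k is split among the multiplexed users with weights inversely proportional to their normalized channel coefficients H_{n,k} (so that users with weaker channels receive more power, the usual non-orthogonal multiple access ordering):

```python
def inverse_ratio_power_split(p_k, coeffs):
    """Split the subcarrier power p_k among multiplexed users in inverse
    proportion to their normalized channel coefficients (assumption: this
    concretizes the patent's 'inverse ratio based on channel coefficients').
    `coeffs` maps a user id to its normalized coefficient H_{n,k}."""
    weights = {n: 1.0 / h for n, h in coeffs.items()}
    total = sum(weights.values())
    return {n: p_k * w / total for n, w in weights.items()}

# Two users multiplexed on one subcarrier carrying 8.0 units of power:
alloc = inverse_ratio_power_split(8.0, {"ue1": 1.0, "ue2": 3.0})
# ue1, with the weaker channel, receives three times the power of ue2
```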
Further, determining the system energy efficiency based on the allocated downlink power comprises:

determining the maximum distortion-free information rate r_{n,k} from the base station to user terminal n on subcarrier k;

determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result, and the allocated downlink power;

determining the system energy efficiency according to the determined r_{n,k} and U_P(X).
Further, the maximum distortion-free information rate r_{n,k} from the base station to user terminal n on subcarrier k is expressed as:

r_{n,k} = log2(1 + γ_{n,k})

where γ_{n,k} denotes the signal-to-noise ratio obtained by user terminal n from subcarrier k.

The system power consumption U_P(X) is characterized as follows: p_k denotes the circuit power consumption, ψ denotes the base-station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k.
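The Shannon-style rate formula r_{n,k} = log2(1 + γ_{n,k}) above is directly computable; a minimal sketch:

```python
import math

def achievable_rate(snr):
    """Maximum distortion-free information rate r_{n,k} = log2(1 + gamma_{n,k})
    in bit/s/Hz, for a given signal-to-noise ratio gamma (linear scale)."""
    return math.log2(1.0 + snr)

r = achievable_rate(3.0)  # log2(1 + 3) = 2.0 bit/s/Hz
```

Multiplying by the subcarrier bandwidth B_SC would give the rate in bit/s.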
Further, the system energy efficiency is characterized as follows: ee_{n,k} denotes the energy efficiency of subcarrier k to user terminal n, B_SC denotes the subcarrier channel bandwidth, N denotes the set of user terminals, and K denotes the set of subcarriers usable by the current base station.
Further, determining the reward function based on the system energy efficiency and feeding the reward function back to the deep reinforcement learning model comprises:

using a weakly supervised algorithm based on value return, penalizing any system energy efficiency that does not satisfy the preset modeling constraints according to the type of violated constraint, obtaining the reward function after the deep reinforcement learning model makes a decision action, and feeding the reward function back to the deep reinforcement learning model;

where reward_t denotes the reward function computed in the t-th training step; R_min denotes the minimum quality-of-service standard of a user, i.e. the minimum downlink transmission rate; H_inter denotes the normalized channel coefficient corresponding to the shortest distance between the currently optimized base station and the nearest base station operating on the same subcarrier frequency; I_k denotes the upper bound on the cross-layer interference that the k-th subcarrier band can bear; and ξ_case1 to ξ_case3 denote the penalty coefficients applied to the system energy efficiency in the three cases that violate the modeling constraints.
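The patent's exact reward formula is not reproduced in this text. As an illustrative sketch only: the weakly supervised penalty mechanism can be modeled as starting from the system energy efficiency and scaling it down by a coefficient for each violated constraint; the three constraint checks and the coefficients xi_rate, xi_interference, xi_power below are hypothetical stand-ins for ξ_case1 to ξ_case3, and the multiplicative form is an assumption:

```python
def penalized_reward(ee, rate, r_min, interference, i_k, power, p_max,
                     xi_rate=0.5, xi_interference=0.5, xi_power=0.5):
    """Sketch of a value-return reward: begin with the system energy
    efficiency `ee` and apply one penalty coefficient per violated
    constraint (minimum rate R_min, cross-layer interference cap I_k,
    power budget). Constraint set and coefficients are illustrative."""
    reward = ee
    if rate < r_min:
        reward *= xi_rate          # QoS (minimum downlink rate) violated
    if interference > i_k:
        reward *= xi_interference  # cross-layer interference cap violated
    if power > p_max:
        reward *= xi_power         # power budget violated (assumed case)
    return reward

# No constraint violated: the reward equals the energy efficiency.
ok = penalized_reward(10.0, rate=2.0, r_min=1.0, interference=0.1, i_k=0.5,
                      power=1.0, p_max=2.0)
# Rate constraint violated: the reward is scaled down by xi_rate.
bad = penalized_reward(10.0, rate=0.5, r_min=1.0, interference=0.1, i_k=0.5,
                       power=1.0, p_max=2.0)
```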
Further, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined reward function, and taking the currently allocated downlink power as the locally optimal power allocation under the time-varying channel environment if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold lies within the preset range or exceeds the preset threshold, comprises:

storing the reward function, the channel environment, the decision action and the next state reached as a four-tuple in the memory replay unit (memory) of the deep reinforcement learning model, where the memory is expressed as:

Memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))

where s(t) denotes the state input in the t-th training step of the deep reinforcement learning model; a(t) denotes the decision action made by the model in the t-th training step; r(t) denotes the reward function reward_t obtained after the model executes action a(t) in the t-th training step; and s(t+1) denotes the next state for the (t+1)-th training step, updated according to the finite-state time-varying Markov channel;

randomly sampling memory data from the replay unit of the deep reinforcement learning model for the learning and gradient-descent updates of the two convolutional neural networks, where gradient descent only updates the parameters of the convolutional neural network q_eval; every fixed number of training steps, the parameters θ_target of q_target are updated to the parameters θ_eval of q_eval;

if the difference between the system energy efficiency values obtained in several consecutive iterations and the preset threshold lies within the preset range, or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment.
Further, the gradient-descent update formula is expressed as:

θ_eval ← θ_eval − α · ∇_{θ_eval} [ r(t) + λ · max_{a′} Q(s(t+1), a′; θ_target) − Q(s(t), a(t); θ_eval) ]²

where α denotes the training learning rate; λ denotes the discount factor applied to the decision body's next state; max_{a′} Q(s(t+1), a′; θ_target) denotes the maximal return that the convolutional neural network q_target with weights θ_target can harvest when its input is the next state s(t+1) of the current memory e(t), achieved by the action a′ it selects; Q(s(t), a(t); θ_eval) denotes the return obtained by the convolutional neural network q_eval with weights θ_eval when executing action a(t) with the state s(t) of the current memory e(t) as input; and ∇_{θ_eval} denotes the gradient-descent operation on the network with parameters θ_eval.
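The update above can be checked numerically; a minimal sketch of the temporal-difference target y = r(t) + λ·max_{a′} Q(s(t+1), a′; θ_target) and the squared error that gradient descent minimizes, with plain lists standing in for the convolutional networks' outputs (values illustrative):

```python
def td_target(reward, discount, q_target_next):
    """y = r(t) + lambda * max_{a'} Q(s(t+1), a'; theta_target)."""
    return reward + discount * max(q_target_next)

def td_loss(reward, discount, q_target_next, q_eval_sa):
    """Squared temporal-difference error minimized when updating q_eval;
    q_eval_sa stands for Q(s(t), a(t); theta_eval)."""
    return (td_target(reward, discount, q_target_next) - q_eval_sa) ** 2

y = td_target(reward=1.0, discount=0.9, q_target_next=[0.5, 2.0, 1.0])
loss = td_loss(reward=1.0, discount=0.9, q_target_next=[0.5, 2.0, 1.0],
               q_eval_sa=2.0)
```

Holding θ_target fixed between periodic copies of θ_eval, as the patent describes, keeps this target stable during training.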
The beneficial effects of the above technical solution of the present invention are as follows:

In the above scheme, a deep reinforcement learning model composed of two convolutional neural networks q_eval and q_target is established; the time-varying channel environment between the base station and the user terminals is modeled as a finite-state time-varying Markov channel; the normalized channel coefficients between the base station and the users are determined and input into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action and allocates subcarriers to users; according to the subcarrier allocation result, downlink power is allocated to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients; the system energy efficiency is determined based on the allocated downlink power, the reward function is determined based on the system energy efficiency, and the reward function is fed back to the deep reinforcement learning model; according to the determined reward function, the convolutional neural networks q_eval and q_target in the deep reinforcement learning model are trained; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold lies within a preset range or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment. In this way, by modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, and by using a deep reinforcement learning model that shifts the computational complexity into the training of the model while accounting for the high complexity of the time-varying channel, decision actions can be chosen with low complexity, the locally optimal subcarrier allocation from the base station to the user terminals under the time-varying channel environment can be determined, and the energy efficiency in the time-varying channel environment can be maximized.
Brief description of the drawings

Fig. 1 is a schematic flow chart of the wireless network resource allocation method based on deep reinforcement learning provided by an embodiment of the present invention;

Fig. 2 is a detailed flow chart of the wireless network resource allocation method based on deep reinforcement learning provided by an embodiment of the present invention.
Specific embodiment
To make the technical problem to be solved, the technical solution and the advantages of the present invention clearer, detailed descriptions are given below in conjunction with the accompanying drawings and specific embodiments.

Aiming at the existing problem that radio resource allocation in a time-varying channel environment cannot be realized effectively, the present invention provides a wireless network resource allocation method based on deep reinforcement learning.
As shown in Fig. 1, the wireless network resource allocation method based on deep reinforcement learning provided by an embodiment of the present invention comprises:

S101: establishing a deep reinforcement learning model (Deep Q Network, DQN) composed of two convolutional neural networks q_eval and q_target with identical parameters;

S102: modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to users;

S103: according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining the reward function based on the system energy efficiency, and feeding the reward function back to the deep reinforcement learning model;

S104: according to the determined reward function, training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold lies within a preset range, or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment.
The wireless network resource allocation method based on deep reinforcement learning according to the embodiment of the present invention establishes a deep reinforcement learning model composed of two convolutional neural networks q_eval and q_target; models the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel; determines the normalized channel coefficients between the base station and the users and inputs them into the convolutional neural network q_eval, which selects the action with the maximum output return value as the decision action and allocates subcarriers to users; according to the subcarrier allocation result, allocates downlink power to the users multiplexed on each subcarrier in inverse proportion to their channel coefficients, determines the system energy efficiency based on the allocated downlink power, determines the reward function based on the system energy efficiency, and feeds the reward function back to the deep reinforcement learning model; and, according to the determined reward function, trains the convolutional neural networks q_eval and q_target in the model; if the difference between the system energy efficiency values obtained in several consecutive iterations and a preset threshold lies within a preset range or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment. In this way, by modeling the time-varying channel environment as a finite-state time-varying Markov channel and shifting the computational complexity into the training of the deep reinforcement learning model, decision actions can be chosen with low complexity, the locally optimal subcarrier allocation from the base station to the user terminals under the time-varying channel environment can be determined, and the energy efficiency in the time-varying channel environment can be maximized.

Deep reinforcement learning in this embodiment is a decision-making method based on artificial intelligence, characterized by the sequential decisions that a decision body makes in a dynamically changing environment. Constructing a deep reinforcement learning model requires states, actions and rewards, and the decision body can automatically optimize its decision actions while the model is being trained. The wireless network resource allocation method based on deep reinforcement learning described in this embodiment can simulate a time-varying channel environment and, with low computational complexity, maximally optimize the allocation of wireless network resources in time-varying network scenarios, so that fast decision making and improved energy efficiency are achieved together. The trained deep reinforcement learning model can continue to manage the radio resources of the time-varying channel environment and make fast, high-reward decisions. In large-scale radio network optimization, this deep reinforcement learning model can be run in a distributed fashion to reduce complexity.
For a better understanding of the wireless network resource allocation method based on deep reinforcement learning described in this embodiment, the method is described in detail below; the specific steps may include:
A11: constructing the deep reinforcement learning model DQN

In this embodiment, a deep reinforcement learning model composed of two convolutional neural networks q_eval and q_target with identical parameters is first established. The decision process of the deep reinforcement learning model is determined by the Q function Q(s, a; θ), where θ denotes the weight parameters of a convolutional neural network; the parameters of q_eval and q_target are θ_eval and θ_target respectively, initialized identically. The Q function Q(s, a; θ) denotes the return obtained when the convolutional neural network with weights θ executes action a in state s.

In this embodiment, each convolutional neural network consists of two convolutional layers, two pooling layers and two fully connected layers. The training input each time is [n_samples, N, K]: the first dimension n_samples denotes the number of input samples, and the second and third dimensions ([N, K]) denote one input sample, i.e. a normalized channel coefficient matrix of dimension [N, K]. Each training step inputs n_samples normalized channel coefficient matrices; for each [N, K] matrix fed into the convolutional neural network, the output is the return value Q_action_val obtained for every possible action under the current channel state. The data structure of Q_action_val is a one-dimensional vector [Action_num], where Action_num denotes the number of all possible actions. With n_samples input channel states, and the return values [Action_num] of all actions produced for each state, the output is a two-dimensional matrix composed of n_samples one-dimensional vectors [Action_num].
A12: modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them into the convolutional neural network q_eval, selecting the action with the maximum output return value as the decision action, and allocating subcarriers to users

In this embodiment, multiple co-frequency small base stations (SBS) are deployed within a certain range; small base stations include outdoor microcells, picocells and indoor femtocells. Within the range of each small base station, 6 user terminals (UE) and 3 available subcarriers (SC) of the non-orthogonal multiple access network are set, scattered randomly over a certain area centered on the small base station. In this embodiment, an independent deep reinforcement learning model runs on each small base station, achieving distributed processing. The parameters of the small base stations and the user terminals are initialized; the parameters include but are not limited to: the normalized channel coefficient H_{n,k} between the SBS and UE_n on subcarrier k, the channel bandwidth B allocated to this base station, the subcarrier channel bandwidth B_SC, the circuit power consumption p_k, etc., where UE_n denotes user terminal n and SC_k denotes subcarrier k. At the same time, the user-subcarrier association matrix X_{N,K} and the transition probability matrix of the finite-state time-varying Markov channel (Finite State Markov Channel, FSMC) are initialized, where N denotes the set of user terminals and K denotes the set of subcarriers usable by the current base station; the initialized user-subcarrier association matrix X_{N,K} and FSMC transition probability matrix serve the subsequent optimization of the user association matrix and the updating of the channel state.
In this embodiment, the optimized channel environment is a finite-state time-varying Markov channel. Initial coordinates are obtained by scattering points randomly in space, and the initial normalized channel coefficient matrix is computed; its values are quantized into ten levels with quantization boundaries bound_0, ..., bound_9. The optimization scenario evolves according to the FSMC transition probability matrix. The elements of the transition probability matrix are denoted by the probability transition indicator p_{i,j}, where i denotes the current state, j denotes the next state (the state reached after executing an action in the current state), and p_{i,j} denotes the probability of transferring from current state i to next state j. It is stipulated that p_{i,j} takes its maximum value when i = j, keeping the probability of remaining in the original channel state largest; the probability of transferring to the second-nearest adjacent state is half of the probability of transferring to the nearest adjacent state; and at each iteration the environment is updated according to the transition probability matrix.
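The per-iteration channel update can be sketched as sampling the next quantized state from one row of the FSMC transition probability matrix. The 3-state matrix below is illustrative only, following the stated rule that the self-transition probability dominates and the nearest neighbour is twice as likely as the second-nearest:

```python
import random

def step_fsmc(state, transition_matrix, rng=random.random):
    """Sample the next channel state j from row `state` of the FSMC
    transition probability matrix P, where P[i][j] = p_{i,j}."""
    u = rng()
    cumulative = 0.0
    for j, p in enumerate(transition_matrix[state]):
        cumulative += p
        if u < cumulative:
            return j
    return len(transition_matrix[state]) - 1  # guard against rounding

# Illustrative 3-state matrix: self-transition most likely, nearest
# neighbour twice as likely as the second-nearest.
P = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
]
next_state = step_fsmc(1, P)
```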
In the present embodiment, the subcarrier associated matrix X of user-N,KElement can distribute indicator x with user-subcarriern,k
It indicates, xn,kIndicate whether user terminal n uses subcarrier k, in a particular application, for example, binary one (x can be usedn,k=1)
Indicate that user terminal n uses subcarrier k, with Binary Zero (xn,k=0) it indicates that user terminal n does not use subcarrier k, that is, does not have
Apply to the resource for using subcarrier k.All possible subcarrier distribution calculation method is as follows:
Introduce the number of combinations C. Assume the upper limit on subcarrier reuse in the non-orthogonal multiple access network is 2, and that each user can use only one subcarrier (adjustable according to the practical application); the total number of allocation patterns is denoted Action_num. For ease of description, the present embodiment uses the simplified case of a low-capacity small-base-station network model. The Action_num possible subcarrier allocations are stored in a list, denoted Action_list; each list index corresponds to one possible subcarrier allocation, so that an allocation can be matched from its index value. To reduce the complexity of DQN processing, the DQN decision action is designed as an integer in [0, Action_num - 1], where each subcarrier allocation corresponds to one user-subcarrier association matrix X_{N,K}.
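A minimal sketch of building Action_list by brute-force enumeration (N users, K subcarriers, at most K_max = 2 users per subcarrier, exactly one subcarrier per user):

```python
from itertools import product

def enumerate_actions(n_users, n_subcarriers, k_max=2):
    """Action_list: every assignment in which each user occupies exactly
    one subcarrier and no subcarrier carries more than k_max users.
    Each entry is a user-subcarrier association matrix X (list of rows);
    the DQN then acts on the integer index in [0, Action_num - 1]."""
    action_list = []
    for assign in product(range(n_subcarriers), repeat=n_users):
        if all(assign.count(k) <= k_max for k in range(n_subcarriers)):
            X = [[1 if assign[n] == k else 0 for k in range(n_subcarriers)]
                 for n in range(n_users)]
            action_list.append(X)
    return action_list
```

For example, 4 users on 2 subcarriers with k_max = 2 admit only the assignments that split the users two-and-two, i.e., C(4,2) = 6 valid allocations.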
In the present embodiment, the ratio of the gain between the base station and the user terminal to the noise is used as the normalized channel coefficient, determined by the following formula:

H_{n,k} = h_{n,k} / σ_k²

where H_{n,k} is the normalized channel coefficient, i.e., the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} is the channel gain between the base station and user terminal n on subcarrier k, computed from Rayleigh fast fading and distance-dependent large-scale fading; since the typical service range of a small base station is an indoor environment, a two-wall penetration loss is added; σ_k² = E[|z_k|²] is the noise power on subcarrier k, where E[·] denotes the mathematical expectation and z_k is additive white Gaussian noise with mean 0 and variance σ_k².
In the present embodiment, the normalized channel coefficients are input to the convolutional neural network q_eval, which selects the action with the largest output return value as the decision action through the decision formula

a = argmax_{a'} Q(s, a'; θ_eval)

and allocates subcarriers to the users accordingly; where the Q function Q(s, a'; θ_eval) is the return value obtained by the convolutional neural network q_eval when the decision body executes action a' in state s, the state s being the input normalized channel coefficients; a is the decision action of the deep reinforcement learning model, i.e., the optimal subcarrier allocation result, one possible X_{N,K}, the association matrix of user terminals and subcarriers.
In the present embodiment, the input of the deep reinforcement learning model DQN is the state s of the DQN decision body, i.e., the normalized channel coefficients (specifically, the two-dimensional normalized channel coefficient matrix H_{N,K}); the output is a one-dimensional vector Q_action_val. The action a' with the largest value in Q_action_val is selected as the decision action for subcarrier allocation (the optimal subcarrier allocation result); therefore, the index of the largest-valued action in Q_action_val is matched against Action_list to obtain the current decision action X_{N,K}, i.e., the user-subcarrier association matrix at the locally optimal allocation of subcarriers from the base station to the user terminals. Matching the subcarrier allocation from the index value in this way reduces the complexity of DQN processing.
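The selection step reduces to an argmax over the DQN's output vector followed by a list lookup; a sketch with hypothetical variable names:

```python
def select_action(q_action_val, action_list):
    """Decision rule a = argmax_{a'} Q(s, a'; theta_eval): take the index
    of the largest value in the output vector Q_action_val and match it
    against Action_list to recover the user-subcarrier matrix X."""
    idx = max(range(len(q_action_val)), key=lambda i: q_action_val[i])
    return idx, action_list[idx]
```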
A13: According to the optimal subcarrier allocation result, allocate downlink power to the users multiplexed on each subcarrier by the fractional order algorithm under fixed subcarrier allocation, i.e., on the same subcarrier power is allocated in inverse proportion to the channel gain coefficients (a user with a larger channel gain is allocated less power, and a user with a smaller channel gain is allocated relatively more power).
In the present embodiment, the downlink power allocated to a user is expressed as:

where p_{n,k} is the downlink transmit power allocated by the base station to user terminal n on subcarrier k; p'_k is the downlink transmit power allocated by the base station on subcarrier k; a is the decay factor, constrained by 0 < a < 1, whose value is fixed within one optimization process and cannot change across users or subcarriers; K_max is the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network, under the complexity that the current successive interference cancellation (SIC) receiver can bear.
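The embodiment's exact expression is not reproduced here, but a standard fractional transmit power allocation (FTPA) rule matching the description (power inversely proportional to a power of the channel gain, with decay factor a in (0, 1)) can be sketched as:

```python
def ftpa_power(p_k, gains, a=0.4):
    """Fractional power allocation sketch: the weight H^(-a) shrinks as
    the normalized gain H grows, so the stronger user receives less of
    the subcarrier power budget p_k.  a = 0.4 is an assumed value; the
    text only requires 0 < a < 1, fixed within one optimization run."""
    weights = [g ** (-a) for g in gains]
    total = sum(weights)
    return [p_k * w / total for w in weights]
```

With gains [4.0, 1.0], the first (stronger) user is allocated strictly less power than the second, and the two shares sum to p_k.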
A14: Determine the maximum distortion-free information transmission rate r_{n,k} from the base station to user terminal n on subcarrier k.

In the present embodiment, the maximum distortion-free information transmission rate r_{n,k} from the base station to user terminal n on subcarrier k is expressed as:

r_{n,k} = log2(1 + γ_{n,k})

where γ_{n,k} is the signal-to-noise ratio obtained by user terminal n from subcarrier k.
In the present embodiment, in the non-orthogonal multiple access network, the normalized channel coefficients of the users multiplexed on the same subcarrier are sorted in descending order:

|H_{1,k}| ≥ |H_{2,k}| ≥ ... ≥ |H_{n,k}| ≥ |H_{n+1,k}| ≥ ... ≥ |H_{Kmax,k}|

Based on the optimal decoding order of the successive interference canceller, a user terminal i located before j in this order can successfully decode and remove the interference from user terminal j, whereas user terminal j receives the signal of user terminal i together with its own and treats it as interference. In the non-orthogonal multiple access network, considering fairness between users and the principle of reducing co-channel interference, a user with good channel conditions is allocated less power when power is distributed: in the above example, if H_{i,k} > H_{j,k}, then p_{i,k} < p_{j,k}, consistent with the allocation principle of the fractional order algorithm in A13.
Considering the reduction of co-channel interference and computational complexity in the small-base-station scene, the reuse number of each subcarrier is predefined as K_max = 2. The maximum information transmission rates of user terminals i and j are logarithmic functions of the signal-to-interference-plus-noise ratio (SINR). χ_INNER = p_{i,k}H_{j,k} denotes the intra-tier co-channel interference suffered by user terminal j under the service of the current base station.

In the present embodiment, the maximum transmission rates of user terminals i and j are expressed as:

r_{i,k} = log2(1 + γ_{i,k}), r_{j,k} = log2(1 + γ_{j,k}), γ_{i,k} = p_{i,k}H_{i,k}, γ_{j,k} = p_{j,k}H_{j,k} / (χ_INNER + 1)

that is:

r_{i,k} = log2(1 + p_{i,k}H_{i,k}), r_{j,k} = log2(1 + p_{j,k}H_{j,k} / (p_{i,k}H_{j,k} + 1))
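Under these definitions the pair rates can be computed directly; a sketch assuming user i is the stronger user (|H_i| ≥ |H_j|), so that SIC removes j's signal at i while j suffers the intra-tier interference χ_INNER = p_i · H_j:

```python
import math

def noma_pair_rates(p_i, p_j, H_i, H_j):
    """Rates of a two-user NOMA pair on one subcarrier.  Because H is
    normalized by the noise power, the noise term in the SINR is 1."""
    gamma_i = p_i * H_i                         # i decodes after SIC
    gamma_j = p_j * H_j / (p_i * H_j + 1.0)     # j sees i as interference
    return math.log2(1 + gamma_i), math.log2(1 + gamma_j)
```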
A16: Determine the system power consumption U_P(X).

In the present embodiment, considering that the small base station has an energy recovery unit, the system power consumption U_P(X) is expressed as:

where p_k is the circuit power consumption and ψ is the base station energy recovery coefficient, which can change according to the actual hardware properties.
A17: According to the determined γ_{n,k} and U_P(X), determine the system energy efficiency.

In the present embodiment, from the obtained maximum distortion-free information transmission rate r_{n,k} of the base station to user terminal n on subcarrier k and the system power consumption U_P(X), the energy efficiency ee_{n,k} of subcarrier k to user terminal n is calculated, where B_k denotes the channel bandwidth of subcarrier k.

In the present embodiment, the system energy efficiency is expressed as:

where ee_{n,k} is the energy efficiency of subcarrier k to user terminal n, B_k is the channel bandwidth of subcarrier k, N is the set of user terminals, and K is the set of subcarriers usable under the current base station.
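Assuming the per-pair form ee_{n,k} = B_k · r_{n,k} / U_P(X) (bits per joule), which is consistent with the quantities defined above but not spelled out in the available text, the system energy efficiency summed over the allocated pairs can be sketched as:

```python
def system_energy_efficiency(X, rates, B, U_P):
    """Sum B_k * r_{n,k} / U_P(X) over the user/subcarrier pairs selected
    by the association matrix X (X[n][k] == 1).  The per-pair form
    ee = B * r / U_P is an assumption for illustration."""
    return sum(B[k] * rates[n][k] / U_P
               for n in range(len(X))
               for k in range(len(X[0]))
               if X[n][k] == 1)
```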
A17: Determine the reward function based on the system energy efficiency, and feed the reward function back to the deep reinforcement learning model.

In the present embodiment, when the system energy efficiency does not meet the preset modeling constraints (which are determined by factors such as the user-fairness principle, the minimum quality-of-service standard, and the cross-tier interference upper limit), a weakly supervised algorithm based on value return punishes the system energy efficiency according to the type of constraint violated, yielding the reward function after the deep reinforcement learning model makes its decision action; the reward function is then fed back to the deep reinforcement learning model. The reward function is expressed as:

where reward_t is the reward computed at the t-th training step; R_min is the minimum quality-of-service (QoS) standard, i.e., the minimum downlink transmission rate; H_inter is the normalized channel coefficient corresponding to the shortest distance between the nearest base station working on the same subcarrier frequency and the currently optimized base station, computable by the method of step A12; I_k is the cross-tier (cross-station) interference upper limit that the k-th subcarrier band can bear, set and adjusted according to the concrete application; ξ_case1 ~ ξ_case3 are the penalty coefficients on the energy efficiency for the three cases of violating the modeling constraints.
It should further be understood that when the system energy efficiency is used directly as the reward function, x_{n,k} and a must also satisfy other constraints. Combined with the above, the constraints that x_{n,k} and a must satisfy are as follows:

where BS_peak is the peak power of the small base station. Condition 1, Σ_{k∈K} x_{n,k} = 1, forces a user terminal to be associated with only one subcarrier at a time. Condition 2, Σ_{n∈N} x_{n,k} ≤ K_max, limits the number of users multiplexed on the same subcarrier in the non-orthogonal multiple access network to K_max, in order to reduce intra-station interference and the complexity of the successive interference canceller. Condition 3, Σ_{k∈K} x_{n,k} r_{n,k} ≥ R_min, is the QoS constraint: the information transmission rate of every user terminal served by the base station must exceed the minimum QoS limit. Condition 4 limits the maximum transmit power from the base station on subcarrier k. Condition 5 is an effective interference coordination mechanism that limits the interference of the currently optimized base station on other base stations. Condition 6, 0 < a < 1, constrains the decay factor used when allocating power.
A18: Store the reward function, the channel environment, the decision action, and the next state transferred to, in the DQN memory replay unit.

In the present embodiment, the reward function, channel environment, decision action, and the next state (the state transferred to) are stored as a four-tuple in the DQN memory replay unit memory, expressed as:

Memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))

where s(t) is the normalized channel coefficients (the state) input at the t-th training of the model; a(t) is the decision action made by the DQN at the t-th training of the deep reinforcement learning model, i.e., the user-subcarrier association matrix; r(t) is the reward function reward_t obtained after the DQN makes action a(t) at the t-th training; s(t+1) is the normalized channel coefficients (the next state) at the (t+1)-th training, updated according to the finite-state time-varying Markov channel.

In the present embodiment, a memory replay class is defined, and the memory is set as an object array or dictionary data structure in which each tuple e(t) is stored.
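The memory replay class described above can be sketched with a bounded deque (the capacity value is an assumption):

```python
import random
from collections import deque

class ReplayMemory:
    """Memory D(t) = {e(1), ..., e(t)}, each e = (s, a, r, s_next)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old memories roll off

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Random fixed-size batch for the batch-mode training of A19."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```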
A19: Train the deep reinforcement learning model in batch mode: randomly draw a fixed-size batch of memories from the DQN memory replay unit for the learning and gradient-descent updates of the two convolutional neural networks.

In the present embodiment, the memories are processed with the loss function Loss(θ), expressed as:

Loss(θ) = E[(r(t) + λ max_{a'} Q(s(t+1), a'; θ_target) - Q(s(t), a(t); θ_eval))²]

and the gradient descent update is expressed as:

θ_eval ← θ_eval - η ∇_{θ_eval} Loss(θ)

where η is the training learning rate; λ is the discount factor assessing the decision body's next state; max_{a'} Q(s(t+1), a'; θ_target) is the maximal-reward action value found by the convolutional neural network q_target with weights θ_target when the input is the next state s(t+1) of the current memory e(t); Q(s(t), a(t); θ_eval) is the return value obtained by the convolutional neural network q_eval with weights θ_eval when executing action a(t) in the state s(t) of the current memory e(t); ∇_{θ_eval} denotes a gradient descent operation on the network with parameters θ_eval, i.e., the parameters θ_eval of q_eval are modified so as to minimize the difference between the outputs of q_target and q_eval.

In the present embodiment, the subtraction of Q(s(t), a(t); θ_eval) is an operation at the corresponding action index position: for example, if memory unit e(1) selected action 2, the gradient descent update only changes the value at position [1, 2] of the two networks' outputs, and the values corresponding to the remaining actions in the first dimension are unchanged. To guarantee training stability, gradient descent only updates the parameters of the convolutional neural network q_eval.
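The per-action-index update can be illustrated with tabular stand-ins for the two networks (dicts mapping a state to a list of action values, an assumption for illustration): only the entry at the taken action moves, and only q_eval is modified.

```python
def td_update(q_eval, q_target, batch, eta=0.1, lam=0.9):
    """One batch step: for each memory e = (s, a, r, s_next) the TD
    target is r + lam * max q_target[s_next], and the squared-error
    gradient step moves only q_eval[s][a].  eta and lam are the
    learning rate and discount factor."""
    for s, a, r, s_next in batch:
        target = r + lam * max(q_target[s_next])
        q_eval[s][a] += eta * (target - q_eval[s][a])  # other actions untouched
```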
A20: During the training of the deep reinforcement learning model, every fixed number of steps the parameters of q_target are updated to the parameters of q_eval, expressed as:

θ_target = θ_eval

where C_iter is the counter in training, recording the number of training steps; C_max is the update interval between the parameters of q_target and q_eval, and is also the period of C_iter; therefore C_iter is zeroed when it equals C_max.
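The periodic parameter copy of step A20 can be sketched as follows (the networks are again represented as state → action-value dicts, an assumption for illustration):

```python
def maybe_sync(q_eval, q_target, c_iter, c_max):
    """Increment the training counter C_iter; when it reaches C_max,
    copy theta_eval into theta_target and zero the counter."""
    c_iter += 1
    if c_iter == c_max:
        for s in q_eval:
            q_target[s] = list(q_eval[s])
        c_iter = 0
    return c_iter
```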
A21: With the q_target and q_eval network parameters obtained after the updates of steps A19 and A20, if for several consecutive optimizations the difference between the optimized system energy efficiency value and a preset threshold (a designated value) is within a preset range, or exceeds the preset threshold, the deep reinforcement learning model is considered applicable to wireless resource allocation in this time-varying channel environment: the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment, the network resource allocation of the current deep reinforcement learning model has reached the local optimum under this time-varying environment, and the resulting model can be used continuously in the practical time-varying channel environment.

A22: Otherwise, update the environment according to the transition probability matrix and judge whether C_iter = C_max holds; if so, set C_iter = 0 and θ_target = θ_eval, then execute step A12; otherwise, execute step A12 directly, until the difference between the recalculated system energy efficiency value and the preset threshold is within the preset range, or exceeds the preset threshold, at which point the optimum optimization under the time-varying channel environment is reached.
In the present embodiment, as the optimization count t increases, the return value of the DQN model in the time-varying channel environment gradually rises from a low level toward a higher one; this trend shows that the wireless network resource allocation method based on deep reinforcement learning can realize the optimization of subcarrier and power allocation in a time-varying channel environment.
It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.

The above is a preferred embodiment of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as within the protection scope of the present invention.
Claims (10)
1. A wireless network resource allocation method based on deep reinforcement learning, characterized by comprising:
S101: establishing a deep reinforcement learning model constituted by two convolutional neural networks q_eval and q_target with identical parameters;
S102: modeling the time-varying channel environment between the base station and the user terminals as a finite-state time-varying Markov channel, determining the normalized channel coefficients between the base station and the users, inputting them to the convolutional neural network q_eval, and selecting the action with the largest output return value as the decision action to allocate subcarriers to the users;
S103: according to the subcarrier allocation result, allocating downlink power to the users multiplexed on each subcarrier in inverse proportion to the channel coefficients, determining the system energy efficiency based on the allocated downlink power, determining a reward function based on the system energy efficiency, and feeding the reward function back to the deep reinforcement learning model;
S104: training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined reward function; if the difference between the system energy efficiency values obtained several consecutive times and a preset threshold is within a preset range, or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment.
2. The wireless network resource allocation method based on deep reinforcement learning according to claim 1, characterized in that the normalized channel coefficient is expressed as:

H_{n,k} = h_{n,k} / σ_k²

where H_{n,k} is the normalized channel coefficient, i.e., the normalized channel gain between the base station and user terminal n on subcarrier k; h_{n,k} is the channel gain between the base station and user terminal n on subcarrier k; σ_k² is the noise power on subcarrier k.
3. The wireless network resource allocation method based on deep reinforcement learning according to claim 2, characterized in that said inputting to the convolutional neural network q_eval and selecting the action with the largest output return value as the decision action to allocate subcarriers to the users comprises:
inputting the normalized channel coefficients to the convolutional neural network q_eval, which selects the action with the largest output return value as the decision action through the decision formula a = argmax_{a'} Q(s, a'; θ_eval) and allocates subcarriers to the users;
where θ_eval denotes the weight parameters of the convolutional neural network q_eval; the Q function Q(s, a'; θ_eval) is the return value obtained when the convolutional neural network q_eval with weights θ_eval executes action a' in state s, the state s being the input normalized channel coefficients; a is the decision action of the deep reinforcement learning model, i.e., the optimal subcarrier allocation result, which is obtained from the index of the action with the largest return value.
4. The wireless network resource allocation method based on deep reinforcement learning according to claim 3, characterized in that the downlink power allocated to a user is expressed as:

where p_{n,k} is the downlink transmit power allocated by the base station to user terminal n on subcarrier k; p'_k is the downlink transmit power allocated by the base station on subcarrier k; a is the decay factor; K_max is the maximum number of users multiplexed on each subcarrier in the non-orthogonal multiple access network, under the complexity that the current successive interference canceller can bear.
5. The wireless network resource allocation method based on deep reinforcement learning according to claim 4, characterized in that said determining the system energy efficiency based on the allocated downlink power comprises:
determining the maximum distortion-free information transmission rate r_{n,k} from the base station to user terminal n on subcarrier k;
determining the system power consumption U_P(X) according to the determined normalized channel coefficients between the base station and the users, the subcarrier allocation result, and the allocated downlink power;
determining the system energy efficiency according to the determined r_{n,k} and U_P(X).
6. The wireless network resource allocation method based on deep reinforcement learning according to claim 5, characterized in that the maximum distortion-free information transmission rate r_{n,k} from the base station to user terminal n on subcarrier k is expressed as:

r_{n,k} = log2(1 + γ_{n,k})

where γ_{n,k} is the signal-to-noise ratio obtained by user terminal n from subcarrier k;
the system power consumption U_P(X) is expressed as:

where p_k is the circuit power consumption, ψ is the base station energy recovery coefficient, and x_{n,k} indicates whether user terminal n uses subcarrier k.
7. The wireless network resource allocation method based on deep reinforcement learning according to claim 6, characterized in that the system energy efficiency is expressed as:

where ee_{n,k} is the energy efficiency of subcarrier k to user terminal n, B_k is the channel bandwidth of subcarrier k, N is the set of user terminals, and K is the set of subcarriers usable under the current base station.
8. The wireless network resource allocation method based on deep reinforcement learning according to claim 7, characterized in that said determining the reward function based on the system energy efficiency and feeding the reward function back to the deep reinforcement learning model comprises:
punishing a system energy efficiency that does not meet the preset modeling constraints with a weakly supervised algorithm based on value return, according to the type of constraint violated, to obtain the reward function after the deep reinforcement learning model makes its decision action, and feeding the reward function back to the deep reinforcement learning model; wherein the reward function is expressed as:

where reward_t is the reward computed at the t-th training step; R_min is the minimum quality-of-service standard, i.e., the minimum downlink transmission rate; H_inter is the normalized channel coefficient corresponding to the shortest distance between the nearest base station working on the same subcarrier frequency and the currently optimized base station; I_k is the cross-tier interference upper limit that the k-th subcarrier band can bear; ξ_case1 ~ ξ_case3 are the penalty coefficients on the system energy efficiency for the three cases of violating the modeling constraints.
9. The wireless network resource allocation method based on deep reinforcement learning according to claim 8, characterized in that said training the convolutional neural networks q_eval and q_target in the deep reinforcement learning model according to the determined reward function, such that if the difference between the system energy efficiency values obtained several consecutive times and a preset threshold is within a preset range or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment, comprises:
storing the reward function, the channel environment, the decision action, and the next state transferred to as a four-tuple in the memory replay unit memory of the deep reinforcement learning model, wherein the memory is expressed as:

Memory: D(t) = {e(1), ..., e(t)}
e(t) = (s(t), a(t), r(t), s(t+1))

where s(t) is the state input at the t-th training of the deep reinforcement learning model; a(t) is the decision action made by the deep reinforcement learning model at the t-th training; r(t) is the reward function reward_t obtained after the deep reinforcement learning model makes action a(t) at the t-th training; s(t+1) is the next state at the (t+1)-th training, updated according to the finite-state time-varying Markov channel;
randomly drawing memories from the memory replay unit of the deep reinforcement learning model for the learning and gradient-descent updates of the two convolutional neural networks, wherein gradient descent only updates the parameters of the convolutional neural network q_eval, and, every fixed number of steps during the training of the deep reinforcement learning model, the parameters θ_target of q_target are updated to the parameters θ_eval of q_eval;
if the difference between the system energy efficiency values obtained several consecutive times and the preset threshold is within the preset range, or exceeds the preset threshold, the currently allocated downlink power is the locally optimal power allocation under the time-varying channel environment.
10. The wireless network resource allocation method based on deep reinforcement learning according to claim 9, characterized in that the gradient descent update is expressed as:

θ_eval ← θ_eval - η ∇_{θ_eval} Loss(θ)

where η is the training learning rate; λ is the discount factor assessing the decision body's next state; max_{a'} Q(s(t+1), a'; θ_target) is the maximal-reward action value found by the convolutional neural network q_target with weights θ_target when the input is the next state s(t+1) of the current memory e(t); Q(s(t), a(t); θ_eval) is the return value obtained by the convolutional neural network q_eval with weights θ_eval when executing action a(t) in the state s(t) of the current memory e(t); ∇_{θ_eval} denotes a gradient descent operation on the network with parameters θ_eval.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811535056.1A CN109474980B (en) | 2018-12-14 | 2018-12-14 | Wireless network resource allocation method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109474980A true CN109474980A (en) | 2019-03-15 |
CN109474980B CN109474980B (en) | 2020-04-28 |
Family
ID=65675169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811535056.1A Active CN109474980B (en) | 2018-12-14 | 2018-12-14 | Wireless network resource allocation method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109474980B (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109962728A (en) * | 2019-03-28 | 2019-07-02 | 北京邮电大学 | A kind of multi-node combination Poewr control method based on depth enhancing study |
CN110035478A (en) * | 2019-04-18 | 2019-07-19 | 北京邮电大学 | A kind of dynamic multi-channel cut-in method under high-speed mobile scene |
CN110084245A (en) * | 2019-04-04 | 2019-08-02 | 中国科学院自动化研究所 | The Weakly supervised image detecting method of view-based access control model attention mechanism intensified learning, system |
CN110167176A (en) * | 2019-04-25 | 2019-08-23 | 北京科技大学 | A kind of wireless network resource distribution method based on distributed machines study |
CN110380776A (en) * | 2019-08-22 | 2019-10-25 | 电子科技大学 | A kind of Internet of things system method of data capture based on unmanned plane |
CN110401975A (en) * | 2019-07-05 | 2019-11-01 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of method, apparatus and electronic equipment of the transmission power adjusting internet of things equipment |
CN110430613A (en) * | 2019-04-11 | 2019-11-08 | 重庆邮电大学 | Resource allocation methods of the multicarrier non-orthogonal multiple access system based on efficiency |
CN110635833A (en) * | 2019-09-25 | 2019-12-31 | 北京邮电大学 | Power distribution method and device based on deep learning |
CN110809306A (en) * | 2019-11-04 | 2020-02-18 | 电子科技大学 | Terminal access selection method based on deep reinforcement learning |
CN110972309A (en) * | 2019-11-08 | 2020-04-07 | 厦门大学 | Ultra-dense wireless network power distribution method combining graph signals and reinforcement learning |
CN111211831A (en) * | 2020-01-13 | 2020-05-29 | 东方红卫星移动通信有限公司 | Multi-beam low-orbit satellite intelligent dynamic channel resource allocation method |
CN111428903A (en) * | 2019-10-31 | 2020-07-17 | 国家电网有限公司 | Interruptible load optimization method based on deep reinforcement learning |
CN111431646A (en) * | 2020-03-31 | 2020-07-17 | 北京邮电大学 | Dynamic resource allocation method in millimeter wave system |
CN111526592A (en) * | 2020-04-14 | 2020-08-11 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN111542107A (en) * | 2020-05-14 | 2020-08-14 | 南昌工程学院 | Mobile edge network resource allocation method based on reinforcement learning |
WO2020191686A1 (en) * | 2019-03-27 | 2020-10-01 | 华为技术有限公司 | Neural network-based power distribution method and device |
CN111867110A (en) * | 2020-06-17 | 2020-10-30 | 三明学院 | Wireless network channel separation energy-saving method based on switch switching strategy |
CN111885720A (en) * | 2020-06-08 | 2020-11-03 | 中山大学 | Multi-user subcarrier power distribution method based on deep reinforcement learning |
CN111930501A (en) * | 2020-07-23 | 2020-11-13 | 齐齐哈尔大学 | Wireless resource allocation method based on unsupervised learning and oriented to multi-cell network |
CN112104400A (en) * | 2020-04-24 | 2020-12-18 | 广西华南通信股份有限公司 | Combined relay selection method and system based on supervised machine learning |
CN112770398A (en) * | 2020-12-18 | 2021-05-07 | 北京科技大学 | Far-end radio frequency end power control method based on convolutional neural network |
WO2021088441A1 (en) * | 2019-11-08 | 2021-05-14 | Huawei Technologies Co., Ltd. | Systems and methods for multi-user pairing in wireless communication networks |
CN112988229A (en) * | 2019-12-12 | 2021-06-18 | 上海大学 | Convolutional neural network resource optimization configuration method based on heterogeneous computation |
CN113115355A (en) * | 2021-04-29 | 2021-07-13 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113395757A (en) * | 2021-06-10 | 2021-09-14 | 中国人民解放军空军通信士官学校 | Deep reinforcement learning cognitive network power control method based on improved return function |
CN113490184A (en) * | 2021-05-10 | 2021-10-08 | 北京科技大学 | Smart factory-oriented random access resource optimization method and device |
CN114126025A (en) * | 2021-11-02 | 2022-03-01 | 中国联合网络通信集团有限公司 | Power adjustment method for vehicle-mounted terminal, vehicle-mounted terminal and server |
CN114142912A (en) * | 2021-11-26 | 2022-03-04 | 西安电子科技大学 | Resource control method for guaranteeing time coverage continuity of high-dynamic air network |
CN114360305A (en) * | 2021-12-15 | 2022-04-15 | 广州创显科教股份有限公司 | Classroom interactive teaching method and system based on 5G network |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105407535A (en) * | 2015-10-22 | 2016-03-16 | 东南大学 | Energy-efficient resource optimization method based on a constrained Markov decision process |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Reinforcement-learning-based resource allocation method in ultra-dense networks |
CN106909728A (en) * | 2017-02-21 | 2017-06-30 | 电子科技大学 | Reinforcement-learning-based FPGA interconnection resource configuration generation method |
US20180091981A1 (en) * | 2016-09-23 | 2018-03-29 | Board Of Trustees Of The University Of Arkansas | Smart vehicular hybrid network systems and applications of same |
US20180121766A1 (en) * | 2016-09-18 | 2018-05-03 | Newvoicemedia, Ltd. | Enhanced human/machine workforce management using reinforcement learning |
CN108307510A (en) * | 2018-02-28 | 2018-07-20 | 北京科技大学 | Power allocation method in heterogeneous small-cell networks |
CN108712748A (en) * | 2018-04-12 | 2018-10-26 | 天津大学 | Reinforcement-learning-based anti-jamming intelligent decision method for cognitive radio |
CN108737057A (en) * | 2018-04-27 | 2018-11-02 | 南京邮电大学 | Deep-learning-based multicarrier cognitive NOMA resource allocation method |
CN108989099A (en) * | 2018-07-02 | 2018-12-11 | 北京邮电大学 | Joint resource allocation method and system based on a software-defined integrated network |
- 2018-12-14: CN application CN201811535056.1A granted as patent CN109474980B (legal status: Active)
Non-Patent Citations (2)
Title |
---|
Ying He et al.: "Integrated Networking, Caching, and Computing for Connected Vehicles: A Deep Reinforcement Learning Approach", IEEE Transactions on Vehicular Technology * |
Yong Zhang et al.: "Power Allocation in Multi-cell Networks Using Deep Reinforcement Learning", 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall) * |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020191686A1 (en) * | 2019-03-27 | 2020-10-01 | 华为技术有限公司 | Neural network-based power distribution method and device |
CN113615277A (en) * | 2019-03-27 | 2021-11-05 | 华为技术有限公司 | Power distribution method and device based on neural network |
CN113615277B (en) * | 2019-03-27 | 2023-03-24 | 华为技术有限公司 | Power distribution method and device based on neural network |
CN109962728A (en) * | 2019-03-28 | 2019-07-02 | 北京邮电大学 | Multi-node joint power control method based on deep reinforcement learning |
CN110084245A (en) * | 2019-04-04 | 2019-08-02 | 中国科学院自动化研究所 | Weakly supervised image detection method and system based on visual attention mechanism reinforcement learning |
CN110430613A (en) * | 2019-04-11 | 2019-11-08 | 重庆邮电大学 | Energy-efficiency-based resource allocation method for multicarrier non-orthogonal multiple access systems |
CN110035478A (en) * | 2019-04-18 | 2019-07-19 | 北京邮电大学 | Dynamic multi-channel access method for high-speed mobility scenarios |
CN110167176A (en) * | 2019-04-25 | 2019-08-23 | 北京科技大学 | Wireless network resource allocation method based on distributed machine learning |
CN110167176B (en) * | 2019-04-25 | 2021-06-01 | 北京科技大学 | Wireless network resource allocation method based on distributed machine learning |
CN110401975A (en) * | 2019-07-05 | 2019-11-01 | 深圳市中电数通智慧安全科技股份有限公司 | Method, apparatus and electronic device for adjusting the transmission power of Internet-of-Things devices |
CN110380776A (en) * | 2019-08-22 | 2019-10-25 | 电子科技大学 | UAV-based data collection method for Internet-of-Things systems |
CN110380776B (en) * | 2019-08-22 | 2021-05-14 | 电子科技大学 | Internet of things system data collection method based on unmanned aerial vehicle |
CN110635833A (en) * | 2019-09-25 | 2019-12-31 | 北京邮电大学 | Power distribution method and device based on deep learning |
CN110635833B (en) * | 2019-09-25 | 2020-12-15 | 北京邮电大学 | Power distribution method and device based on deep learning |
CN111428903A (en) * | 2019-10-31 | 2020-07-17 | 国家电网有限公司 | Interruptible load optimization method based on deep reinforcement learning |
CN110809306A (en) * | 2019-11-04 | 2020-02-18 | 电子科技大学 | Terminal access selection method based on deep reinforcement learning |
CN110972309B (en) * | 2019-11-08 | 2022-07-19 | 厦门大学 | Ultra-dense wireless network power distribution method combining graph signals and reinforcement learning |
US11246173B2 (en) | 2019-11-08 | 2022-02-08 | Huawei Technologies Co. Ltd. | Systems and methods for multi-user pairing in wireless communication networks |
CN110972309A (en) * | 2019-11-08 | 2020-04-07 | 厦门大学 | Ultra-dense wireless network power distribution method combining graph signals and reinforcement learning |
WO2021088441A1 (en) * | 2019-11-08 | 2021-05-14 | Huawei Technologies Co., Ltd. | Systems and methods for multi-user pairing in wireless communication networks |
CN112988229A (en) * | 2019-12-12 | 2021-06-18 | 上海大学 | Convolutional neural network resource optimization configuration method based on heterogeneous computation |
CN111211831A (en) * | 2020-01-13 | 2020-05-29 | 东方红卫星移动通信有限公司 | Multi-beam low-orbit satellite intelligent dynamic channel resource allocation method |
CN111431646B (en) * | 2020-03-31 | 2021-06-15 | 北京邮电大学 | Dynamic resource allocation method in millimeter wave system |
CN111431646A (en) * | 2020-03-31 | 2020-07-17 | 北京邮电大学 | Dynamic resource allocation method in millimeter wave system |
CN111526592B (en) * | 2020-04-14 | 2022-04-08 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN111526592A (en) * | 2020-04-14 | 2020-08-11 | 电子科技大学 | Non-cooperative multi-agent power control method used in wireless interference channel |
CN112104400A (en) * | 2020-04-24 | 2020-12-18 | 广西华南通信股份有限公司 | Combined relay selection method and system based on supervised machine learning |
CN111542107A (en) * | 2020-05-14 | 2020-08-14 | 南昌工程学院 | Mobile edge network resource allocation method based on reinforcement learning |
CN111885720B (en) * | 2020-06-08 | 2021-05-28 | 中山大学 | Multi-user subcarrier power distribution method based on deep reinforcement learning |
CN111885720A (en) * | 2020-06-08 | 2020-11-03 | 中山大学 | Multi-user subcarrier power distribution method based on deep reinforcement learning |
CN111867110B (en) * | 2020-06-17 | 2023-10-03 | 三明学院 | Wireless network channel separation energy-saving method based on switch switching strategy |
CN111867110A (en) * | 2020-06-17 | 2020-10-30 | 三明学院 | Wireless network channel separation energy-saving method based on switch switching strategy |
CN111930501A (en) * | 2020-07-23 | 2020-11-13 | 齐齐哈尔大学 | Wireless resource allocation method based on unsupervised learning and oriented to multi-cell network |
CN112770398A (en) * | 2020-12-18 | 2021-05-07 | 北京科技大学 | Remote radio unit power control method based on a convolutional neural network |
CN113115355A (en) * | 2021-04-29 | 2021-07-13 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113115355B (en) * | 2021-04-29 | 2022-04-22 | 电子科技大学 | Power distribution method based on deep reinforcement learning in D2D system |
CN113490184A (en) * | 2021-05-10 | 2021-10-08 | 北京科技大学 | Smart factory-oriented random access resource optimization method and device |
CN113395757A (en) * | 2021-06-10 | 2021-09-14 | 中国人民解放军空军通信士官学校 | Deep reinforcement learning cognitive network power control method based on improved return function |
CN113395757B (en) * | 2021-06-10 | 2023-06-30 | 中国人民解放军空军通信士官学校 | Deep reinforcement learning cognitive network power control method based on improved return function |
CN114126025B (en) * | 2021-11-02 | 2023-04-28 | 中国联合网络通信集团有限公司 | Power adjustment method for vehicle-mounted terminal, vehicle-mounted terminal and server |
CN114126025A (en) * | 2021-11-02 | 2022-03-01 | 中国联合网络通信集团有限公司 | Power adjustment method for vehicle-mounted terminal, vehicle-mounted terminal and server |
CN114142912A (en) * | 2021-11-26 | 2022-03-04 | 西安电子科技大学 | Resource control method for guaranteeing time coverage continuity of high-dynamic air network |
CN114360305A (en) * | 2021-12-15 | 2022-04-15 | 广州创显科教股份有限公司 | Classroom interactive teaching method and system based on 5G network |
CN114928549A (en) * | 2022-04-20 | 2022-08-19 | 清华大学 | Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN109474980B (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109474980A (en) | Wireless network resource allocation method based on deep reinforcement learning | |
CN109729528B (en) | D2D resource allocation method based on multi-agent deep reinforcement learning | |
CN110493826A (en) | Heterogeneous cloud radio access network resource allocation method based on deep reinforcement learning | |
CN109600178B (en) | Optimization method for energy consumption, time delay and minimization in edge calculation | |
Wang et al. | Joint interference alignment and power control for dense networks via deep reinforcement learning | |
CN106358308A (en) | Reinforcement-learning-based resource allocation method in ultra-dense networks | |
CN113596785B (en) | D2D-NOMA communication system resource allocation method based on deep Q network | |
CN110213776B (en) | WiFi unloading method based on Q learning and multi-attribute decision | |
Zhu et al. | Machine-learning-based opportunistic spectrum access in cognitive radio networks | |
CN107426773A (en) | Energy-efficiency-oriented distributed resource allocation method and device in wireless heterogeneous networks | |
Xu et al. | Resource allocation algorithm based on hybrid particle swarm optimization for multiuser cognitive OFDM network | |
CN106358300A (en) | Distributed resource allocation method in microcell networks | |
Zhang et al. | Deep learning based user association in heterogeneous wireless networks | |
Yu et al. | Interference coordination strategy based on Nash bargaining for small‐cell networks | |
CN114423028B (en) | CoMP-NOMA cooperative clustering and power distribution method based on multi-agent deep reinforcement learning | |
Ouamri et al. | Double deep q-network method for energy efficiency and throughput in a uav-assisted terrestrial network | |
CN107071881A (en) | Game-theory-based distributed energy allocation method for small-cell networks | |
CN110139282A (en) | Neural-network-based resource allocation method for energy-harvesting D2D communication | |
CN115811788B (en) | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning | |
CN105873127A (en) | Heuristic user connection load balancing method based on random decision | |
Li et al. | Dynamic power allocation in IIoT based on multi-agent deep reinforcement learning | |
CN112243283B (en) | Cell-Free Massive MIMO network clustering calculation method based on successful transmission probability | |
Chen et al. | A reinforcement learning based joint spectrum allocation and power control algorithm for D2D communication underlaying cellular networks | |
Chen et al. | A multi-agent reinforcement learning based power control algorithm for D2D communication underlaying cellular networks | |
CN115915454A (en) | SWIPT-assisted downlink resource allocation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||