CN108924944A - Dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm - Google Patents
Dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm Download PDF Info
- Publication number
- CN108924944A · CN201810797200.2A · CN201810797200A
- Authority
- CN
- China
- Prior art keywords
- value
- base station
- laa
- small base
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W74/00—Wireless channel access
- H04W74/08—Non-scheduled access, e.g. ALOHA
- H04W74/0808—Non-scheduled access, e.g. ALOHA using carrier sensing, e.g. carrier sense multiple access [CSMA]
Abstract
The present invention relates to a dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm, belonging to the field of communication technology, and including the steps: 1. set the state set and action set of the LAA small base station; 2. initialize the state and action Q values of the LAA small base station; 3. calculate the initial state value of the LAA small base station; 4. calculate the Logistic chaotic map sequence according to the formula, map it into the action-value set of the LAA small base station, and randomly select an action at(i); 5. after executing action at(i), obtain the environment reward value rt and enter the next state st+1; 6. update the action Q-value function of the LAA small base station; 7. let t ← t+1 and repeat steps 4–6 until the target is reached. Under the condition of guaranteeing user fairness, the present invention can improve the spectrum efficiency of the channel while expanding the system capacity of next-generation communication systems, providing users with better quality of service and improving user experience.
Description
Technical field
The invention belongs to the field of communication technology and relates to a dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm.
Background art
With its ubiquity and convenient access, wireless mobile communication plays an increasingly important role in future information and communication systems. With the rapid development of the mobile Internet and Internet-of-Things services, mobile data traffic has surged, leading to a shortage of licensed spectrum; operators therefore hope to exploit unlicensed spectrum to supplement the licensed spectrum. LTE-U (LTE-Unlicensed), proposed by 3GPP and also referred to as Licensed Assisted Access (LAA), aims to apply LTE technology to unlicensed bands (e.g. near 5 GHz) while maintaining the original LTE protocol specification as far as possible: small base stations are deployed in the unlicensed band, and LTE is allowed to operate cooperatively across the unlicensed and licensed bands through carrier aggregation, so as to increase cellular system capacity and improve the spectrum efficiency of the unlicensed band.
At present there are mainly two coexistence schemes for LTE and WiFi in the unlicensed band: duty-cycle muting (Duty Cycle Muting, DCM) and LAA. DCM was the first version of LTE-U, initially proposed by Ericsson and Qualcomm in 2013. This scheme shares the unlicensed spectrum with WiFi by keeping LTE periodically silent for a period of time; it does not require Listen Before Talk (LBT), and it is easy to deploy because no modification of the LTE protocol is needed, but at present it is used only in China, India, South Korea and the United States. The LAA scheme for LTE was first proposed at the Sophia Antipolis meeting in France in June 2014. This scheme seeks a long-term global solution; its key feature is that LTE must assess the channel condition before accessing the unlicensed spectrum, i.e. the Clear Channel Assessment (CCA) process of the LBT mechanism. This mechanism therefore requires modifications to the LTE protocol stack and support from equipment vendors. Telecommunication organizations such as 3GPP and ETSI are actively formulating standards for the LBT coexistence mechanism. We study the LBT-based coexistence mechanism between LTE and WiFi networks, i.e. the LAA mechanism. Owing to concerns about the performance of LBT-based LAA, some researchers have evaluated the performance of this coexistence mechanism. Studies have found that the contention window value of the LBT mechanism has a great influence on the performance of the coexistence system: a good backoff mechanism can generate a reasonable contention window value according to the actual load in the network, thereby improving the spectrum efficiency of the channel and giving users a better experience.
At present, existing backoff mechanisms, such as binary exponential backoff and fixed-contention-window backoff, lack a dynamic learning process and cannot flexibly adjust system parameters according to the real-time scenario, which objectively limits the improvement of the spectrum efficiency of the coexistence system's channel.
Therefore, designing a good backoff mechanism that can generate a reasonable contention window value according to the real-time network load, service type, and so on will help improve the spectrum efficiency of the channel while expanding the system capacity of next-generation communication systems, providing users with better quality of service and improving user experience.
Summary of the invention
In view of this, the purpose of the present invention is to provide a dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm. Through the Q-learning algorithm, the LAA small base station can flexibly adjust the contention window value of the LBT mechanism under which it coexists with the WiFi system according to factors such as the real-time network load and service type, maximizing the overall system throughput and improving the spectrum efficiency of the coexistence system under the condition of guaranteeing fairness between LTE and WiFi users, thereby improving user experience. The method is concise and efficient, and at the same time has good portability.
In order to achieve the above objectives, the present invention provides the following technical solutions:
The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm includes the following steps:
S1: set the state set and action set of the LAA small base station;
S2: at time t=0, initialize the state and action Q values of the LAA small base station to 0;
S3: calculate the state value of the initial state st of the LAA small base station;
S4: calculate the Logistic chaotic map sequence according to the formula, then map the sequence into the action-value set of the LAA small base station and randomly select an action at(i);
S5: after executing action at(i), the system obtains the environment reward value rt according to the formula and then enters the next state st+1;
S6: update the action Q-value function of the LAA small base station according to the formula;
S7: let t ← t+1 and repeat steps S4–S6 until the target state is reached.
Further, in step S1, the state set of the LAA small base station is expressed as the combination of system throughput and fairness, i.e. st={Rt,Ft}, where Rt denotes the total system throughput obtained in the unlicensed band at time t, i.e. the sum of LAA and WiFi user throughput, and Ft denotes the fairness function on the average throughputs, where Rt(s,l) and Rt(s,w) denote the LAA and WiFi user throughput, nl denotes the number of LAA small base stations, and nw denotes the number of WiFi users. According to the predefined throughput and fairness thresholds, the LAA small base station is divided into four states: low throughput with low fairness, low throughput with high fairness, high throughput with low fairness, and high throughput with high fairness. For the action set, the contention window value is taken as the action of the LAA small base station, and according to the Markov process with a finite action set, the action of the LAA small base station at any time t satisfies 16 ≤ at(i) ≤ 128.
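As an illustration, the four-state classification above can be computed from measured throughputs. The sketch below is hedged: the patent's exact fairness formula appears only as an image in the source, so Jain's fairness index over the per-user average throughputs is assumed here, and the threshold values are hypothetical.

```python
def jain_fairness(avg_laa: float, avg_wifi: float) -> float:
    """Assumed form of the fairness function Ft: Jain's index over the
    per-user average LAA and WiFi throughputs; equals 1 when perfectly fair."""
    den = 2.0 * (avg_laa ** 2 + avg_wifi ** 2)
    return (avg_laa + avg_wifi) ** 2 / den if den > 0 else 1.0

def classify_state(R_t: float, F_t: float, R_th: float, F_th: float) -> str:
    """Map (throughput Rt, fairness Ft) to one of the four predefined states
    using hypothetical thresholds R_th and F_th."""
    key = (R_t >= R_th, F_t >= F_th)
    return {(False, False): "low-throughput/low-fairness",
            (False, True):  "low-throughput/high-fairness",
            (True,  False): "high-throughput/low-fairness",
            (True,  True):  "high-throughput/high-fairness"}[key]
```

The closer the index is to 1, the fairer the split, matching the description of Ft later in the text.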
Further, in step S2, the state and action Q values of the LAA small base station are set to a zero matrix. For the LAA small base station, the solution target of the Markov decision process is to find an optimal policy π* such that the value V(s) of every state s simultaneously reaches its maximum; the state-value function is expressed as

Vπ(st) = r(st,at) + γ·Σ p(st+1|st,at)·Vπ(st+1)

where γ is the discount factor, r(st,at) denotes the reward value the LAA small base station obtains from the environment, and p(st+1|st,at) denotes the probability that the LAA small base station transfers to state st+1 after selecting action at in state st.
Further, in step S4, the target of the LAA small base station is to obtain a higher reward value, so chaotic motion, which is ergodic, regular and random in nature, is introduced as an optimization mechanism.
Three mapping systems are common in chaotic systems: the Logistic map, the Chebyshev map and the Henon map. The equation of the Logistic map is expressed as
zk+1=μzk(1−zk)
where 0 ≤ μ ≤ 4 is called the bifurcation parameter; when μ ∈ [3.5699456…, 4] the Logistic map works in a chaotic state, and μ = 4 is taken here; k denotes the iteration number, z is called the chaos variable, and the chaotic domain is (0,1).
Further, in step S5, after executing the selected action, the LAA small base station obtains a reward value from the environment. The reward value function is defined in terms of a weight factor ε with 0 < ε < 1, the minimum required threshold of the coexistence system throughput, and the minimum required threshold Ft° of the coexistence system fairness function.
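The reward formula itself is given only as an image in the source; purely as a hedged sketch, one bounded reward consistent with the description (weight factor ε trading off throughput against fairness relative to the two minimum thresholds) could look like:

```python
def reward(R_t: float, F_t: float, R_min: float, F_min: float,
           eps: float = 0.3) -> float:
    """Hypothetical bounded reward rt: +1/-1 components for meeting or missing
    the minimum throughput and fairness thresholds, weighted by eps; a smaller
    eps favors the fairness component, matching the text's description of ε."""
    r_thr = 1.0 if R_t >= R_min else -1.0
    r_fair = 1.0 if F_t >= F_min else -1.0
    return eps * r_thr + (1.0 - eps) * r_fair
```

Because rt stays within [−1, 1] it is a bounded function, which is what the Watkins convergence condition cited later requires.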
Further, in step S6, after obtaining the reward value from the environment, the LAA small base station needs to update the Q matrix; the update formula is

Qt+1(st,at) = (1−α)·Qt(st,at) + α·[rt + γ·max over a′ of Qt(st+1,a′)]

where α denotes the learning rate with 0 < α < 1 and γ denotes the discount factor with 0 ≤ γ < 1.
The beneficial effects of the present invention are: the contention window value of the LBT-based coexistence mechanism of LTE and WiFi in the unlicensed band is dynamically optimized through the Q-learning algorithm. Compared with traditional backoff algorithms, the Q-learning-based method of the present invention can dynamically optimize the contention window value under which LTE and WiFi coexist in the unlicensed band, and the LAA small base station can flexibly adjust the contention window value according to the real-time scenario of the network. As shown in Fig. 2, the LAA small base station, in some state, first selects and executes an action according to the current environment via the chaos-based Logistic map; it then observes the environment to obtain the reward value, updates the Q-value function according to the formula, and determines the action of the next state based on the current Q-value function, repeating the above steps until convergence. Under the condition of guaranteeing user fairness, the present invention can improve the spectrum efficiency of the channel while expanding the system capacity of next-generation communication systems, providing users with better quality of service and improving user experience.
Detailed description of the invention
In order to make the purpose, technical solution and beneficial effects of the present invention clearer, the present invention provides the following drawings for explanation:
Fig. 1 is a schematic flowchart of the dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to an embodiment of the present invention;
Fig. 2 is the interaction process model between Q-learning and the environment according to an embodiment of the present invention;
Fig. 3 is the network model diagram of LTE and WiFi coexistence according to an embodiment of the present invention.
Specific embodiment
A preferred embodiment of the present invention will be described in detail below in conjunction with the drawings.
Aiming at the LBT-based coexistence problem of LTE and WiFi in the unlicensed band (5 GHz), the present invention proposes a dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm. Compared with traditional backoff algorithms, the Q-learning algorithm in the present invention can dynamically optimize the contention window value under which LTE and WiFi coexist in the unlicensed band, and the LAA small base station can flexibly adjust the contention window value according to the real-time scenario of the network. As shown in Fig. 2, the LAA small base station, in some state, first selects and executes an action according to the current environment via the chaos-based Logistic map; it then observes the environment to obtain the reward value, updates the Q-value function according to the formula, determines the action of the next state based on the current Q-value function, and repeats the above steps until convergence.
Consider a coexistence scenario with multiple LAA small base stations and multiple WiFi access points (APs); the network model is shown in Fig. 3. Since an LAA small base station can operate on multiple unlicensed bands, and the main concern is the coexistence performance of LAA, the scenario under consideration can be simplified into a simpler one: on one specific unlicensed channel there are multiple LAA small base stations and one WiFi AP. Assume that the coexistence scenario under consideration contains nl LAA small base stations and one WiFi AP with nw users, where the network access of the WiFi users follows the IEEE 802.11 standard.
As shown in Fig. 1, the method for dynamically optimizing the contention window value of the LBT-based coexistence mechanism of LTE and WiFi in the unlicensed band includes the following steps:
100: set the state set and action set of the LAA small base station;
200: at time t=0, initialize the state and action Q values of the LAA small base station to 0;
300: calculate the state value of the initial state st of the LAA small base station;
400: calculate the Logistic chaotic map sequence according to the formula, then map the sequence into the action-value set of the LAA small base station and randomly select an action at(i);
500: after executing action at(i), the system obtains the environment reward value rt according to the formula and then enters the next state st+1;
600: update the action Q-value function of the LAA small base station according to the formula;
700: let t ← t+1 and repeat steps 400–600 until the target state is reached.
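The numbered steps above can be sketched end to end as follows. This is a minimal sketch, not the patented implementation: the environment is a stub (a real LAA small base station would measure throughput and fairness after contending with the chosen window), and the four-value action set {16, 32, 64, 128} is an assumed discretization of the 16–128 slot range.

```python
import random

CW_ACTIONS = [16, 32, 64, 128]   # assumed discretization of 16 <= at(i) <= 128
N_STATES = 4                     # the four throughput/fairness states (step 100)
ALPHA, GAMMA, MU = 0.5, 0.8, 4.0

Q = [[0.0] * len(CW_ACTIONS) for _ in range(N_STATES)]  # step 200: zero Q matrix
z = 0.37                         # logistic chaos variable, arbitrary value in (0, 1)

def env_step(state: int, cw: int):
    """Stub environment: returns (reward, next_state)."""
    return (1.0 if cw >= 32 else -1.0), random.randrange(N_STATES)

state = 0                        # step 300: initial state
for t in range(100):             # steps 400-700
    z = MU * z * (1 - z)                                   # step 400: logistic map
    a = min(int(z * len(CW_ACTIONS)), len(CW_ACTIONS) - 1) # map z to an action index
    r, nxt = env_step(state, CW_ACTIONS[a])                # step 500: reward, next state
    Q[state][a] = (1 - ALPHA) * Q[state][a] + ALPHA * (r + GAMMA * max(Q[nxt]))  # step 600
    state = nxt                                            # step 700
```

After the loop, Q holds the learned state–action values from which the contention window for each state is read off.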
The Q-learning algorithm is a reinforcement learning method that determines an optimal decision policy algorithmically and can be regarded as a form of asynchronous dynamic programming. During Q-learning iteration, the state set is defined as S; if the decision time is t, then st ∈ S denotes the state of the LAA small base station at time t. Likewise, the finite set of actions that the LAA small base station may execute is defined as A, and at ∈ A denotes its action at time t. The reward function r(st,at) denotes the reward value the LAA small base station obtains from the environment after executing action at in state st; the station then transfers from state st to st+1, and the Qt function is updated at the next decision time t+1. The Q-learning algorithm is in fact a solution method for Markov Decision Processes (MDPs).
In the coexistence network, the LAA small base station users coexist harmoniously with the WiFi users in the unlicensed band. Based on the working principle of the Q-learning algorithm, the state set is expressed as follows:
st={Rt,Ft}
where Rt denotes the total system throughput obtained in the unlicensed band at time t, i.e. Rt=Rt(s,l)+Rt(s,w), and Ft denotes the fairness function on the average throughputs, where Rt(s,l) (Rt(s,w)) denotes the LAA (WiFi) user throughput; the closer the value of Ft is to 1, the fairer the system. According to the predefined throughput and fairness thresholds, the LAA small base station is divided into four states: low throughput with low fairness, low throughput with high fairness, high throughput with low fairness, and high throughput with high fairness; the elements of the state set S are enumerated accordingly.
Taking the contention window value as the action, the action set of the LAA small base station is A={a(1),a(2),…,a(k)}, whose unit is the number of time slots. According to the Markov process with a finite action set, the action of the LAA small base station at any time t is defined to satisfy 16 ≤ at(i) ≤ 128.
The task facing the LAA small base station is to determine an optimal policy that maximizes the reward obtained. The LAA small base station can observe the environment according to its current state and make the best decision for the next state/action. The accumulated discounted reward value function of state st can be expressed as

Vπ(st) = r(st,at) + γ·Σ p(st+1|st,at)·Vπ(st+1)

where r(st,at) denotes the instant reward obtained when the LAA small base station selects action at in state st, γ denotes the discount factor with 0 ≤ γ < 1 (a discount factor close to 0 means the LAA small base station mainly considers immediate rewards), and p(st+1|st,at) denotes the probability that the LAA small base station transfers from state st to st+1 when selecting action at. The target of solving the MDP is to find an optimal policy π* such that the value V(s) of every state s simultaneously reaches its maximum. According to Bellman's principle of optimality, at least one optimal policy π* can be obtained when the total expected discounted reward of the LAA small base station is maximal, so that

V*(st) = max over at of [r(st,at) + γ·Σ p(st+1|st,at)·V*(st+1)]

where V*(st) denotes the maximum accumulated discounted reward obtained by the LAA small base station starting from state st and following the optimal policy π*. A given policy π is a function mapping the state space to the action space, i.e. π: st→at. Therefore the optimal policy can be expressed in the following form:
π*(st) = arg max over at of Q*(st,at)
The target of the LAA small base station is to obtain a higher reward value; therefore, in each state it will select the action with the higher Q value. However, in the initial stage of learning, state–action experience is scarce and the Q values cannot accurately represent the correct reinforcement values; in general, always taking the action with the highest Q value causes the LAA small base station to move along the same path and fail to explore other, better values, so it easily falls into a local optimum. To overcome this shortcoming, the LAA small base station must select actions with some randomness; chaotic motion, which is ergodic, regular and random in nature, is therefore introduced as an optimization mechanism to reduce the possibility that the action selection strategy of the LAA small base station falls into a locally optimal solution.
Common chaotic systems include the Logistic map, the Chebyshev map and the Henon map. For the Logistic mapping system, the equation is expressed as
zk+1=μzk(1−zk)
where 0 ≤ μ ≤ 4 is called the bifurcation parameter, k denotes the iteration number, z is called the chaos variable, and the chaotic domain is (0,1). When μ ∈ [3.5699456…, 4], the Logistic map works in a chaotic state; that is, the sequence generated under the Logistic map is aperiodic and non-convergent. The chaotic motion exhibited by a chaotic system appears to be randomly complex, but in fact it has inherent regularity.
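A short sketch of the map's behavior at μ = 4 follows; the initial value 0.37 is an arbitrary choice inside the chaotic domain, not a value from the patent.

```python
def logistic_sequence(z0: float, mu: float = 4.0, n: int = 20) -> list:
    """Iterate the logistic map z_{k+1} = mu * z_k * (1 - z_k) n times.
    For mu = 4 and a generic z0 in (0, 1), the iterates are aperiodic and
    non-convergent while remaining inside the chaotic domain."""
    zs, z = [], z0
    for _ in range(n):
        z = mu * z * (1 - z)
        zs.append(z)
    return zs
```

Mapping each iterate z into an index int(z * k) of the k-element action set then gives the chaos-driven random action selection used in step S4.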
Based on the reward value function, the LAA small base station iterates its selection strategy with high throughput and high fairness as the target. In the reward value function that the LAA small base station obtains from the environment, ε denotes a weight factor with 0 < ε < 1, where a smaller ε indicates that the Q-learning process favors the reward obtained from the fairness factor; the definition also involves the minimum required threshold of the coexistence system throughput and Ft°, the minimum required threshold of the coexistence system fairness function. It can be seen from the reward value function expression that rt is a bounded function, so by Watkins' convergence conditions the Q-learning process converges. Taking into account the throughput performance of the whole network and the network fairness factor, the reward value function makes the fairness function value as close to 1 as possible under the condition that the system throughput is higher than the minimum throughput threshold.
In the Q-learning algorithm, based on the policy π of the LAA small base station, the Q-value function is computed recursively at each time t:

Qπ(st,at) = r(st,at) + γ·Σ p(st+1|st,at)·Vπ(st+1)

Clearly, the Q value denotes the expected discounted reward obtained when the LAA small base station, in state st, follows policy π and executes action at. The aim is therefore to evaluate the Q value under the optimal policy π*. From the above formula, the relationship between the state-value function and the action-value function is as follows:

V*(st) = max over at of Q*(st,at)

However, in an uncertain environment the above Q-value function holds only under the optimal policy; under a non-optimal policy, the value learned by Q-learning varies (or does not converge). The corrected Q-value function is therefore calculated as follows:

Qt+1(st,at) = (1−α)·Qt(st,at) + α·[rt + γ·max over a′ of Qt(st+1,a′)]

where α denotes the learning rate with 0 < α < 1; a larger learning rate means less of the earlier training effect is retained. If every state–action pair can be visited repeatedly and the learning rate decreases according to a suitable schedule, the Q-learning algorithm converges to the optimal policy for any finite MDP. The learning rate and the discount factor jointly regulate the update of the Q matrix and thus affect the learning performance of the Q-learning algorithm; here α is set to 0.5 and γ to 0.8.
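The corrected update rule can be exercised directly. As a sketch (with α = 0.5 and γ = 0.8 as chosen in the text, and a hand-picked transition rather than real channel measurements), repeating the same transition shows Q(s,a) contracting toward its fixed point r + γ·max Q(s′), the behavior the Watkins condition guarantees:

```python
ALPHA, GAMMA = 0.5, 0.8   # learning rate and discount factor from the text

def q_update(Q, s, a, r, s_next):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (r + GAMMA * max(Q[s_next]))

# Repeating one transition (s=0, a=0, r=1, s'=1) drives Q[0][0] toward the
# fixed point 1 + 0.8 * max(Q[1]) = 1.0, since row Q[1] is never updated.
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    q_update(Q, 0, 0, 1.0, 1)
```

With α = 0.5 the error halves on every repetition, so after 50 updates Q[0][0] sits at the fixed point to within floating-point precision.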
Finally, it is stated that the above preferred embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to it without departing from the scope defined by the claims of the present invention.
Claims (6)
1. A dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm, characterized by including the following steps:
S1: set the state set and action set of the LAA small base station;
S2: at time t=0, initialize the state and action Q values of the LAA small base station to 0;
S3: calculate the state value of the initial state st of the LAA small base station;
S4: calculate the Logistic chaotic map sequence according to the formula, then map the sequence into the action-value set of the LAA small base station and randomly select an action at(i);
S5: after executing action at(i), the system obtains the environment reward value rt according to the formula and then enters the next state st+1;
S6: update the action Q-value function of the LAA small base station according to the formula;
S7: let t ← t+1 and repeat steps S4–S6 until the target state is reached.
2. The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to claim 1, characterized in that: in step S1, the state set of the LAA small base station is expressed as the combination of system throughput and fairness, i.e. st={Rt,Ft}, where Rt denotes the total system throughput obtained in the unlicensed band at time t, i.e. the sum of LAA and WiFi user throughput, and Ft denotes the fairness function on the average throughputs, where Rt(s,l) and Rt(s,w) denote the LAA and WiFi user throughput, nl denotes the number of LAA small base stations, and nw denotes the number of WiFi users; according to the predefined throughput and fairness thresholds, the LAA small base station is divided into four states: low throughput with low fairness, low throughput with high fairness, high throughput with low fairness, and high throughput with high fairness; for the action set, the contention window value is taken as the action of the LAA small base station, and according to the Markov process with a finite action set, the action of the LAA small base station at any time t satisfies 16 ≤ at(i) ≤ 128.
3. The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to claim 2, characterized in that: in step S2, the state and action Q values of the LAA small base station are set to a zero matrix; for the LAA small base station, the solution target of the Markov decision process is to find an optimal policy π* such that the value V(s) of every state s simultaneously reaches its maximum, the state-value function being expressed as

Vπ(st) = r(st,at) + γ·Σ p(st+1|st,at)·Vπ(st+1)

where γ is the discount factor, r(st,at) denotes the reward value the LAA small base station obtains from the environment, and p(st+1|st,at) denotes the probability that the LAA small base station transfers to state st+1 after selecting action at in state st.
4. The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to claim 3, characterized in that: in step S4, the Logistic map in chaotic motion is used as an optimization mechanism to select the action at(i); the equation of the Logistic mapping system is
zk+1=μzk(1−zk)
where 0 ≤ μ ≤ 4 is called the bifurcation parameter and μ=4 is taken here, k denotes the iteration number, z is called the chaos variable, and the chaotic domain is (0,1).
5. The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to claim 4, characterized in that: in step S5, after executing the selected action, the LAA small base station obtains a reward value from the environment; the reward value function is defined in terms of a weight factor ε with 0 < ε < 1, the minimum required threshold of the coexistence system throughput, and the minimum required threshold Ft° of the coexistence system fairness function.
6. The dynamic optimization method for the contention window value in LTE and WiFi coexistence based on a Q-learning algorithm according to claim 5, characterized in that: in step S6, after obtaining the reward value from the environment, the LAA small base station needs to update the Q matrix; the update formula is

Qt+1(st,at) = (1−α)·Qt(st,at) + α·[rt + γ·max over a′ of Qt(st+1,a′)]

where α denotes the learning rate with 0 < α < 1 and γ denotes the discount factor with 0 ≤ γ < 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810797200.2A CN108924944B (en) | 2018-07-19 | 2018-07-19 | LTE and WiFi coexistence competition window value dynamic optimization method based on Q-learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810797200.2A CN108924944B (en) | 2018-07-19 | 2018-07-19 | LTE and WiFi coexistence competition window value dynamic optimization method based on Q-learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108924944A true CN108924944A (en) | 2018-11-30 |
CN108924944B CN108924944B (en) | 2021-09-14 |
Family
ID=64414708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810797200.2A Active CN108924944B (en) | 2018-07-19 | 2018-07-19 | LTE and WiFi coexistence competition window value dynamic optimization method based on Q-learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108924944B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105306176A (en) * | 2015-11-13 | 2016-02-03 | 南京邮电大学 | Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol |
CN106332094A (en) * | 2016-09-19 | 2017-01-11 | 重庆邮电大学 | Q algorithm-based dynamic duty ratio coexistence method for LTE-U and Wi-Fi systems in unauthorized frequency band |
CN107094321A (en) * | 2017-03-31 | 2017-08-25 | 南京邮电大学 | A kind of vehicle-carrying communication MAC layer channel access method learnt based on multiple agent Q |
CN107426772A (en) * | 2017-07-04 | 2017-12-01 | 北京邮电大学 | A kind of dynamic contention window method of adjustment, device and equipment based on Q study |
Non-Patent Citations (4)
Title |
---|
CELIMUGE WU,ETC: "A MAC Protocol for Delay-sensitive VANET Applications With Self-learning Contention Scheme", 《THE 11TH ANNUAL IEEE CCNC - SMART SPACES AND WIRELESS NETWORKS》 * |
ZHAO HAI-TAO,ETC: "Research on Q-learning based Channel Access Control Algorithm for Internet of Vehicles", 《2016 INTERNATIONAL COMPUTER SYMPOSIUM》 * |
DU AIQIAN ET AL.: "Research on Q-learning-based Channel Access Technology in Vehicular Communication", 《COMPUTER TECHNOLOGY AND DEVELOPMENT》 *
LUO YING ET AL.: "Research on TCP Enhancement Algorithms Based on Ad Hoc Networks", 《COMMUNICATION & NETWORK》 *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766951A (en) * | 2019-01-18 | 2019-05-17 | 重庆邮电大学 | WiFi gesture recognition method based on time-frequency statistical properties
CN109862567A (en) * | 2019-03-28 | 2019-06-07 | 电子科技大学 | Method for a cellular mobile communication system to access unlicensed spectrum
CN109862567B (en) * | 2019-03-28 | 2019-12-27 | 电子科技大学 | Method for accessing non-authorized frequency spectrum of cellular mobile communication system |
CN110035559A (en) * | 2019-04-25 | 2019-07-19 | 重庆邮电大学 | Intelligent contention window size selection method based on chaotic Q-learning algorithm
CN110035559B (en) * | 2019-04-25 | 2023-03-10 | 重庆邮电大学 | Intelligent competition window size selection method based on chaotic Q-learning algorithm |
CN110336620B (en) * | 2019-07-16 | 2021-05-07 | 沈阳理工大学 | QL-UACW backoff method based on MAC layer fair access |
CN110336620A (en) * | 2019-07-16 | 2019-10-15 | 沈阳理工大学 | QL-UACW backoff method based on MAC-layer fair access
CN110933723A (en) * | 2019-11-21 | 2020-03-27 | 普联技术有限公司 | Roaming switching control method and device and wireless AP |
CN110933723B (en) * | 2019-11-21 | 2022-01-04 | 普联技术有限公司 | Roaming switching control method and device and wireless AP |
CN111163531A (en) * | 2019-12-16 | 2020-05-15 | 北京理工大学 | Unauthorized spectrum duty ratio coexistence method based on DDPG |
CN111163531B (en) * | 2019-12-16 | 2021-07-13 | 北京理工大学 | Unauthorized spectrum duty ratio coexistence method based on DDPG |
CN111342920A (en) * | 2020-01-10 | 2020-06-26 | 重庆邮电大学 | Channel selection method based on Q learning |
CN111342920B (en) * | 2020-01-10 | 2021-11-02 | 重庆邮电大学 | Channel selection method based on Q learning |
CN113316156A (en) * | 2021-05-26 | 2021-08-27 | 重庆邮电大学 | Intelligent coexistence method on unlicensed frequency band |
CN113946428A (en) * | 2021-11-02 | 2022-01-18 | Oppo广东移动通信有限公司 | Processor dynamic control method, electronic equipment and storage medium |
CN113946428B (en) * | 2021-11-02 | 2024-06-07 | Oppo广东移动通信有限公司 | Processor dynamic control method, electronic equipment and storage medium |
CN115134026A (en) * | 2022-06-29 | 2022-09-30 | 重庆邮电大学 | Intelligent unlicensed spectrum access method based on mean field |
CN115134026B (en) * | 2022-06-29 | 2024-01-02 | 绍兴市上虞区舜兴电力有限公司 | Intelligent unlicensed spectrum access method based on average field |
Also Published As
Publication number | Publication date |
---|---|
CN108924944B (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108924944A (en) | The dynamic optimization method of contention window value coexists in LTE and WiFi based on Q-learning algorithm | |
Nassar et al. | Reinforcement learning for adaptive resource allocation in fog RAN for IoT with heterogeneous latency requirements | |
Wydmański et al. | Contention window optimization in IEEE 802.11 ax networks with deep reinforcement learning | |
Guo et al. | Multi-agent reinforcement learning-based distributed channel access for next generation wireless networks | |
Kaur et al. | Energy-efficient resource allocation in cognitive radio networks under cooperative multi-agent model-free reinforcement learning schemes | |
CN110035559B (en) | Intelligent competition window size selection method based on chaotic Q-learning algorithm | |
CN105359605B (en) | The system and method for the primary resource block allocation plan based on greedy algorithm of cellular network with self-organizing link terminal | |
CN110519849B (en) | Communication and computing resource joint allocation method for mobile edge computing | |
CN105873214A (en) | Resource allocation method of D2D communication system based on genetic algorithm | |
Balakrishnan et al. | Deep reinforcement learning based traffic-and channel-aware OFDMA resource allocation | |
Sohaib et al. | Dynamic multichannel access via multi-agent reinforcement learning: Throughput and fairness guarantees | |
Barrachina-Muñoz et al. | Multi-armed bandits for spectrum allocation in multi-agent channel bonding WLANs | |
Casasole et al. | Qcell: Self-optimization of softwarized 5g networks through deep q-learning | |
Li et al. | A distributed ADMM approach with decomposition-coordination for mobile data offloading | |
Perlaza et al. | On the base station selection and base station sharing in self-configuring networks | |
Iturria-Rivera et al. | Cooperate or not Cooperate: Transfer Learning with Multi-Armed Bandit for Spatial Reuse in Wi-Fi | |
Anderson et al. | Optimization decomposition for scheduling and system configuration in wireless networks | |
Rivero-Angeles et al. | Differentiated backoff strategies for prioritized random access delay in multiservice cellular networks | |
Zhou et al. | Context-aware learning-based resource allocation for ubiquitous power IoT | |
Zou et al. | Resource multi-objective mapping algorithm based on virtualized network functions: RMMA | |
CN107995034B (en) | Energy and service cooperation method for dense cellular network | |
CN110035539A (en) | One kind being based on the matched resource optimal distribution method of correlated equilibrium regret value and device | |
Kosek-Szott et al. | Improving IEEE 802.11 ax UORA performance: Comparison of reinforcement learning and heuristic approaches | |
Hosey et al. | Q-learning for cognitive radios | |
Bikov et al. | Smart concurrent learning scheme for 5G network: QoS-aware radio resource allocation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||