CN111246502A - Energy threshold dynamic optimization method based on Q learning - Google Patents
Info
- Publication number
- CN111246502A (application number CN202010021376.6A)
- Authority
- CN
- China
- Prior art keywords
- action
- throughput
- fairness
- learning
- laa
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/06—Optimizing the usage of the radio link, e.g. header compression, information sizing, discarding information
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention relates to a Q-learning-based dynamic energy threshold optimization method and belongs to the technical field of communication. The agent is trained repeatedly with a Q-learning-based method, and the optimal action is then selected from the converged Q table to reach the target state. The algorithm dynamically adjusts the LAA energy detection threshold in real time according to the external environment, and the Q-learning-based algorithm keeps the coexistence system in a state of high throughput and high fairness.
Description
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a dynamic energy threshold optimization method based on Q learning.
Background
With the rapid development of mobile communication technology, the number of intelligent terminals of all kinds has grown explosively, and spectrum resources have become increasingly scarce as that number grows. Spectrum resources can be divided into licensed and unlicensed spectrum; the licensed spectrum suitable for communication transmission is becoming ever scarcer and more crowded, and simply improving spectrum utilization is not enough to relieve the shortage. The 3GPP standardization body has therefore proposed that utilizing unlicensed spectrum is an effective way to cope with the growing traffic.
Currently available unlicensed frequency bands include the 2.4 GHz industrial, scientific and medical (ISM) band and the 5 GHz U-NII (Unlicensed National Information Infrastructure) band. Several mature access technologies already operate in these bands, such as Wi-Fi, Bluetooth, radar, and D2D (Device-to-Device), the most prominent being Wi-Fi. Besides complying with the usage restrictions on unlicensed bands in different countries and regions, the main problem for LTE is how to ensure harmonious coexistence with Wi-Fi while using the band resources fairly.
The LTE and Wi-Fi technologies are two distinct wireless communication technologies, and the differences between their protocols cause negative effects when the networks are merged directly. To deploy LTE systems in unlicensed bands, the 3GPP standardization organization specified a Licensed-Assisted Access (LAA) technology in LTE Release 13. To address the harmonious coexistence of LTE and Wi-Fi, the LAA scheme changes the LTE access mechanism and adopts a Listen Before Talk (LBT) access mechanism. This mechanism requires every LTE device to detect the current channel state before accessing the channel, so LTE devices must compete with Wi-Fi devices for the channel, which requires a change to the LTE access protocol. The core of the LBT mechanism is Clear Channel Assessment (CCA), which senses the channel using Energy Detection (ED). In the LAA scheme, an LTE device detects the current channel state before accessing the channel: if the channel is detected as busy, the device waits for other devices to finish transmitting and looks for an opportunity to transmit; if the channel is detected as idle, the device immediately accesses the channel and transmits data. Energy detection is a simple and effective way to judge the channel state. Whether an LTE device accesses the channel for data transmission depends on the energy detection result, and the energy threshold of the LAA scheme directly affects that result.
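To make the energy-detection decision concrete, the following minimal Python sketch illustrates the CCA check an LAA device could perform before channel access; the function name and the example threshold/measurement values are illustrative assumptions, not values from the 3GPP specification.

```python
def channel_is_idle(measured_energy_dbm: float, energy_threshold_dbm: float) -> bool:
    """Clear Channel Assessment by energy detection: the channel is declared
    idle only if the sensed energy stays below the configured threshold."""
    return measured_energy_dbm < energy_threshold_dbm

# Example: with a -72 dBm threshold, a -80 dBm measurement lets the LAA device
# transmit, while a -65 dBm measurement forces it to defer and keep sensing.
print(channel_is_idle(-80.0, -72.0))  # True  -> access the channel and transmit
print(channel_is_idle(-65.0, -72.0))  # False -> wait for the channel to become free
```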
The reference threshold value specified by the 3GPP standardization organization is a fixed value, and does not take into consideration the real-time network environment and the like. Therefore, a dynamic energy threshold optimization scheme considering a real-time network environment is designed.
Disclosure of Invention
In view of this, the present invention provides a Q learning-based energy threshold dynamic optimization method, which is used to solve the fairness problem when two different networks coexist in an unlicensed frequency band.
In order to achieve the purpose, the invention provides the following technical scheme:
the energy threshold dynamic optimization method based on Q learning comprises the following steps:
s1: setting LAA SBSs action set A ═ { a ═ a1,a2...atAnd state set S ═ S1,s2...stInitializing a Q matrix to be a zero-order matrix, and randomly selecting an initial state by LAA SBSs;
s2: the LAA SBSs select an action a according to an epsilon-greedy selection strategyt;
S3: according to action atCalculating the throughput and fairness coefficient of the coexisting system corresponding to the currently selected action, and obtaining the currently selected action atIs given a prize of r(s)t,at);
S4: updating the Q table according to a Q table updating formula of Q learning, and enabling LAA SBSs to enter the next state;
s5: repeatedly executing step S2 and the following steps until the next state reaches the target state;
further, in step S1, for action set a ═ a1,a2...atIn which each action atValues representing different energy thresholds, S ═ S for the set of states1,s2...stEach state stAre all composed of throughput and fairness coefficients, i.e. st={Rt,Ft};
Further, in step S2, an action is chosen using an ε-greedy action selection policy. Unlike a purely random selection strategy or a purely greedy selection strategy, it avoids the excessive number of iterations caused by repeatedly selecting actions at random and the local optimum into which a greedy strategy may fall. By combining random selection with greedy selection, the ε-greedy strategy selects actions efficiently and accurately.
Further, in step S3, action a_t is selected using the ε-greedy selection strategy, and the corresponding throughput R_t and fairness coefficient F_t are calculated from action a_t, i.e. the state s_t = {R_t, F_t} corresponding to the current action is confirmed. For state s_t, the throughput R_t is the sum of the throughput of the LAA system and the throughput of the Wi-Fi system, and the throughput of the coexistence system is obtained with reference to a Markov chain model. For state s_t, the fairness coefficient F_t characterizes the fairness of the coexistence system, where R_l and R_w denote the LAA and Wi-Fi throughputs and n_l and n_w denote the numbers of LAA SBSs and Wi-Fi APs, respectively; the closer the fairness coefficient F_t is to 1, the fairer the coexistence system. Therefore, according to throughput and fairness, the states can be divided into four states: low throughput and low fairness, low throughput and high fairness, high throughput and low fairness, and high throughput and high fairness. High throughput and high fairness is the target state of the LAA SBSs; the four states are distinguished by a throughput threshold and a fairness threshold.
Further, in step S3, when the selection of action a_t is completed, the reward r(s_t, a_t) is obtained according to the currently selected action. The reward function grants a reward only when the throughput and fairness coefficients corresponding to action a_t meet prescribed conditions, where F1 and F2 are the prescribed minimum fairness coefficients.
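As an illustration of the state classification and reward just described, the following minimal Python sketch maps a (throughput, fairness) pair to one of the four states and assigns a binary reward; the threshold values R_th, F_th, F_min and the exact reward rule are illustrative assumptions rather than the values prescribed by the invention.

```python
def classify_state(R_t: float, F_t: float, R_th: float = 100.0, F_th: float = 0.8) -> str:
    """Map a (throughput, fairness) pair to one of the four coexistence states.
    R_th and F_th are assumed thresholds separating 'high' from 'low'."""
    r = "high_R" if R_t >= R_th else "low_R"
    f = "high_F" if F_t >= F_th else "low_F"
    return f"{r}_{f}"

def reward(R_t: float, F_t: float, R_th: float = 100.0, F_min: float = 0.8) -> float:
    """Assumed binary reward: the selected action is rewarded only when both the
    throughput and the fairness coefficient meet their prescribed conditions."""
    return 1.0 if (R_t >= R_th and F_t >= F_min) else 0.0

print(classify_state(120.0, 0.95))  # 'high_R_high_F' -> the target state of the LAA SBSs
```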
Further, in step S4, the Q table is updated according to the Q-learning update formula:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r(s_t, a_t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
where α denotes the learning rate with 0 < α < 1, and γ denotes the discount factor with 0 ≤ γ < 1.
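As a concrete illustration of this update, the short Python sketch below performs a single Q-table update with made-up numbers; the state names, threshold-labelled actions, learning rate, discount factor, and reward value are all assumptions chosen only for the arithmetic.

```python
# One Q-table update with illustrative numbers (not values from the patent).
alpha, gamma = 0.5, 0.9          # learning rate and discount factor
Q = {("low_R_low_F", "thr_-72dBm"): 0.2,
     ("high_R_high_F", "thr_-72dBm"): 1.0,
     ("high_R_high_F", "thr_-62dBm"): 0.4}

s_t, a_t, s_next, r = "low_R_low_F", "thr_-72dBm", "high_R_high_F", 1.0
best_next = max(Q[(s_next, a)] for a in ("thr_-72dBm", "thr_-62dBm"))
Q[(s_t, a_t)] += alpha * (r + gamma * best_next - Q[(s_t, a_t)])
print(Q[(s_t, a_t)])   # new Q ≈ 1.05 = 0.2 + 0.5*(1.0 + 0.9*1.0 - 0.2)
```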
Further, in step S5, for the Q learning described herein, one iteration is considered complete only when the current state reaches the target state, i.e., when the current state of the LAA SBSs reaches high throughput and high fairness.
The invention has the following beneficial effects: the energy threshold of LTE-LAA on the unlicensed band is dynamically optimized through the Q-learning algorithm, which maximizes the fairness of the coexistence system while maintaining its throughput, and the method also serves as a reference for the harmonious coexistence of other heterogeneous networks.
Drawings
In order to make the object, technical solution and beneficial effects of the invention clearer, the following drawings are provided for explanation:
FIG. 1 is a Q learning framework diagram;
FIG. 2 is a diagram of a network model for coexistence of LTE and Wi-Fi.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a Q-learning-based dynamic energy threshold optimization method for the coexistence problem of LTE and Wi-Fi on unlicensed bands. In contrast to the fixed reference threshold scheme in the 3GPP standardization documents, the invention dynamically optimizes the energy threshold that the LAA SBSs use for channel detection, based on a Q-learning algorithm, so that the LAA SBSs can adjust the energy threshold according to the real-time network environment. As shown in fig. 1, the LAA SBSs act as the agent: in a given state, an action is selected according to the ε-greedy action selection policy, a reward is obtained from the environment, the Q table is updated according to the Q-learning update formula, and these steps are repeated until convergence.
In the coexistence scenario there are multiple LAA SBSs and multiple Wi-Fi APs; the network model is shown in fig. 2. Only downlink data traffic is considered, so the LAA SBSs and the Wi-Fi APs each perform channel detection. In fig. 2, the solid black line and the dashed line represent the licensed and unlicensed spectrum respectively; only data transmission on the unlicensed spectrum is considered. The dashed red line indicates that the Wi-Fi APs broadcast information such as the throughput of the current access point at each decision time, and the LAA SBSs can analyze the received broadcast information.
In Q learning, the LAA SBSs in the heterogeneous network are treated as the agent. At each decision time t, the agent observes the state of its environment and takes an appropriate action so as to maximize the reward at the next time t + 1. In Q learning, the learned Q value is updated from the immediate reward and the discounted future reward, and is stored in a two-dimensional Q table.
In a heterogeneous network, LAA devices coexist with Wi-Fi users on unlicensed bands. Based on the working principle of Q learning, the action and state sets are defined as A = {a_1, a_2, ..., a_t} and S = {s_1, s_2, ..., s_t}. Each element in the set A represents a different energy threshold used to detect the state of the unlicensed channel, and each element in the set S is a parameter pair consisting of a throughput and a fairness coefficient, i.e. s_t = {R_t, F_t}. For state s_t, the throughput R_t is the sum of the throughput of the LAA system and the throughput of the Wi-Fi system, and the throughput of the coexistence system is obtained with reference to a Markov chain model. For state s_t, the fairness coefficient F_t characterizes the fairness of the coexistence system, where R_l and R_w denote the LAA and Wi-Fi throughputs and n_l and n_w denote the numbers of LAA SBSs and Wi-Fi APs, respectively; the closer the fairness coefficient F_t is to 1, the fairer the coexistence system. Therefore, according to throughput and fairness, the states can be divided into four states: low throughput and low fairness, low throughput and high fairness, high throughput and low fairness, and high throughput and high fairness, where high throughput and high fairness is the target state of the LAA SBSs.
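A minimal sketch of one plausible fairness coefficient follows, assuming a Jain-style index over the per-device throughputs R_l/n_l and R_w/n_w, which approaches 1 when the two systems obtain equal per-device throughput; this particular formula is an assumption for illustration, not necessarily the one defined by the invention.

```python
def fairness_coefficient(R_l: float, R_w: float, n_l: int, n_w: int) -> float:
    """Assumed fairness coefficient: Jain's index over the per-device throughputs
    of the LAA SBSs (R_l / n_l) and the Wi-Fi APs (R_w / n_w); it equals 1 when
    both systems obtain the same per-device throughput."""
    x_l, x_w = R_l / n_l, R_w / n_w
    return (x_l + x_w) ** 2 / (2 * (x_l ** 2 + x_w ** 2))

# Equal per-device shares give perfect fairness; a skewed split lowers the index.
print(fairness_coefficient(60.0, 40.0, 3, 2))   # 1.0  (20 per device on each side)
print(fairness_coefficient(90.0, 10.0, 3, 2))   # ~0.66 (LAA devices get 6x the Wi-Fi share)
```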
For the action selection strategy, the algorithm uses an ε-greedy policy. Unlike a purely random or purely greedy strategy, the ε-greedy policy combines the two and can select actions efficiently and accurately: with probability ε the agent explores by picking a random action from A, and with probability 1 − ε it exploits by picking the action with the largest Q value in the current state, a_t = argmax_a Q(s_t, a).
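A minimal sketch of this ε-greedy selection, assuming the Q table is stored as a NumPy array indexed by state and action (the table dimensions here are illustrative):

```python
import random
import numpy as np

def epsilon_greedy(Q: np.ndarray, state: int, epsilon: float = 0.1) -> int:
    """With probability epsilon explore a random energy-threshold action;
    otherwise exploit the action with the highest Q value in the current state."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # exploration
    return int(np.argmax(Q[state]))           # exploitation

Q = np.zeros((4, 8))   # e.g. 4 coexistence states x 8 candidate energy thresholds
print(epsilon_greedy(Q, state=0, epsilon=0.1))
```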
According to the ε-greedy selection policy, performing action a_t yields the reward r(s_t, a_t). The reward function grants a reward only when the throughput and fairness coefficients corresponding to action a_t meet prescribed conditions, where F1 and F2 are the prescribed minimum fairness coefficients. The Q value is then updated according to the update formula
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r(s_t, a_t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
where 0 < α < 1 and 0 ≤ γ < 1. If α = 1, the previously learned experience is ignored and replaced entirely by the latest estimated reward, and the larger γ is, the more weight the agent places on future rewards.
Finally, for the Q-learning algorithm described herein, an iteration is only completed when the current state reaches the target state, i.e., when the current state of the LAA SBSs reaches high throughput and high fairness.
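Putting the pieces together, the following end-to-end sketch shows the structure of steps S1–S5; the environment model (how a chosen energy threshold maps to the next throughput/fairness state), the binary reward, and all numeric parameters are assumptions made only to show the control flow, not the invention's actual coexistence model.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 4, 8            # four coexistence states, candidate energy thresholds
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
TARGET = 3                            # index of the "high throughput, high fairness" state

def step(action: int) -> tuple[int, float]:
    """Assumed stand-in for the real environment: it should return the next state
    (derived from the measured throughput and fairness) and the reward."""
    next_state = random.randrange(N_STATES)          # placeholder dynamics
    r = 1.0 if next_state == TARGET else 0.0         # assumed binary reward
    return next_state, r

Q = np.zeros((N_STATES, N_ACTIONS))                  # S1: zero Q matrix
for episode in range(500):
    state = random.randrange(N_STATES)               # S1: random initial state
    while state != TARGET:                           # S5: stop at the target state
        if random.random() < EPSILON:                # S2: epsilon-greedy selection
            action = random.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(Q[state]))
        next_state, r = step(action)                 # S3: observe throughput/fairness, reward
        Q[state, action] += ALPHA * (                # S4: Q-table update
            r + GAMMA * np.max(Q[next_state]) - Q[state, action])
        state = next_state
# After convergence, the greedy action per state gives the energy threshold to use.
```

In a real deployment the `step` function would be replaced by measurements of the coexistence system, i.e. the throughput obtained from the Markov chain model and the information broadcast by the Wi-Fi APs.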
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.
Claims (6)
1. A dynamic energy threshold optimization method based on Q learning is characterized in that: the method comprises the following steps:
s1: setting LAA SBSs action set A ═ { a ═ a1,a2...atAnd state set S ═ S1,s2...stInitializing a Q matrix to be a zero-order matrix, and randomly selecting an initial state by LAA SBSs;
s2: the LAA SBSs select an action a according to an epsilon-greedy selection strategyt;
S3: according to action atCalculating the throughput and fairness coefficient of the coexisting system corresponding to the currently selected action, and obtaining the currently selected action atIs given a prize of r(s)t,at);
S4: updating the Q table according to a Q table updating formula of Q learning, and enabling LAA SBSs to enter the next state;
s5: and repeatedly executing the step S2 and the following steps until the Q table convergence completes the training.
2. The Q-learning-based energy threshold dynamic optimization method according to claim 1, wherein: in step S1, in the action set A = {a_1, a_2, ..., a_t}, each action a_t represents a different energy threshold value; in the state set S = {s_1, s_2, ..., s_t}, each state s_t is composed of a throughput and a fairness coefficient, i.e. s_t = {R_t, F_t}.
3. The Q-learning-based energy threshold dynamic optimization method according to claim 2, wherein: in step S2, an action is chosen using an ε-greedy action selection policy, which combines random selection with greedy selection so that actions can be selected efficiently and accurately.
4. The Q-learning-based energy threshold dynamic optimization method according to claim 3, wherein: in step S3, action a_t is selected using the ε-greedy selection strategy, and the corresponding throughput R_t and fairness coefficient F_t are calculated from action a_t, i.e. the state s_t = {R_t, F_t} corresponding to the current action is confirmed; for state s_t, the throughput R_t is the sum of the throughput of the LAA system and the throughput of the Wi-Fi system, and the throughput of the coexistence system is obtained with reference to a Markov chain model; for state s_t, the fairness coefficient F_t characterizes the fairness of the coexistence system, where R_l and R_w denote the LAA and Wi-Fi throughputs and n_l and n_w denote the numbers of LAA SBSs and Wi-Fi APs, respectively, and the closer the fairness coefficient F_t is to 1, the fairer the coexistence system; therefore, according to throughput and fairness, the states can be divided into four states: low throughput and low fairness, low throughput and high fairness, high throughput and low fairness, and high throughput and high fairness, where high throughput and high fairness is the target state of the LAA SBSs; further, when the selection of action a_t is completed, the reward r(s_t, a_t) is obtained according to the currently selected action, and the reward function grants a reward only when the throughput and fairness coefficients corresponding to action a_t meet prescribed conditions, where F1 and F2 are the prescribed minimum fairness coefficients.
6. The Q-learning-based energy threshold dynamic optimization method according to claim 5, wherein: in step S5, an iteration is considered complete only when the current state reaches the target state, i.e. when the current state of the LAA SBSs reaches high throughput and high fairness; step S2 and the following steps are repeated until the Q table converges, completing the training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010021376.6A CN111246502B (en) | 2020-01-09 | 2020-01-09 | Energy threshold dynamic optimization method based on Q learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111246502A (en) | 2020-06-05
CN111246502B (en) | 2022-04-29
Family
ID=70878159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010021376.6A (Active) | Energy threshold dynamic optimization method based on Q learning | 2020-01-09 | 2020-01-09
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111246502B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150358904A1 (en) * | 2014-06-10 | 2015-12-10 | Newracom Inc. | Operation method of station in wireless local area network |
CN106412931A (en) * | 2016-12-16 | 2017-02-15 | 重庆邮电大学 | LTE-U idle channel evaluation method based on multi-slot fusion mechanism |
US20180254838A1 (en) * | 2017-03-01 | 2018-09-06 | Alcatel Lucent | Dynamic interference suppression for massive multiple-input-multiple-output (mimo) in unlicensed frequency bands |
CN108093412A (en) * | 2018-01-18 | 2018-05-29 | 重庆邮电大学 | For the LTE-U based on LAT under multi-operator scenario and WiFi coexistence methods |
CN109951864A (en) * | 2019-03-28 | 2019-06-28 | 重庆邮电大学 | The system performance analysis method coexisted based on the imperfect spectrum detection of LAA and WiFi |
CN110035559A (en) * | 2019-04-25 | 2019-07-19 | 重庆邮电大学 | A kind of contention window size intelligent selecting method based on chaos Q- learning algorithm |
Non-Patent Citations (2)
Title |
---|
""TDoc_List_RAN2#96_110217"", 《3GPP TSG_RAN\WG2_RL2》 * |
李继蕊,李小勇,高雅丽,高云全,方滨兴: "物联网环境下数据转发模型研究", 《软件学报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113316156A (en) * | 2021-05-26 | 2021-08-27 | 重庆邮电大学 | Intelligent coexistence method on unlicensed frequency band |
CN114374977A (en) * | 2022-01-13 | 2022-04-19 | 重庆邮电大学 | Coexistence method based on Q learning under non-cooperation |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |