CN111245541B - Channel multiple access method based on reinforcement learning - Google Patents

Info

Publication number
CN111245541B
CN111245541B · CN202010154072.7A · CN202010154072A
Authority
CN
China
Prior art keywords
action
window
contention
channel
competition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010154072.7A
Other languages
Chinese (zh)
Other versions
CN111245541A (en)
Inventor
雷建军
黎露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010154072.7A priority Critical patent/CN111245541B/en
Publication of CN111245541A publication Critical patent/CN111245541A/en
Application granted granted Critical
Publication of CN111245541B publication Critical patent/CN111245541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W74/00Wireless channel access
    • H04W74/08Non-scheduled access, e.g. ALOHA
    • H04W74/0833Random access procedures, e.g. with 4-step access
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/382Monitoring; Testing of propagation channels for resource allocation, admission control or handover

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Electromagnetism (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a channel multiple access method based on reinforcement learning, which comprises: modeling the adjustment of the contention window in the channel access process as a Markov decision process; selecting a contention window action with an ε-greedy strategy in the current action adjustment period, whereby the AP selects the optimal contention window for the current state from the contention window set; each STA using the optimal contention window broadcast by the AP to generate an OBO backoff value and back off until contention ends; the AP allocating resource units (RUs) to the successfully contending stations through a trigger frame, the stations transmitting data on their respectively allocated RUs; after the action adjustment period ends, calculating a system performance index as the reward; updating the action value function according to the reward obtained from the current performance index; and repeating the above process to continuously optimize the contention window. The invention can improve system throughput and fairness and reduce data transmission delay.

Description

Channel multiple access method based on reinforcement learning
Technical Field
The invention relates to the field of wireless local area networks (WLANs), and in particular to a channel multiple access method based on reinforcement learning, mainly applied to IEEE 802.11ax high-density network environments.
Background
In recent years, with the rapid development of intelligent terminal devices and mobile Internet of Things services, demands on wireless traffic and service quality keep increasing. Wireless local area networks and cellular networks are the main networks carrying wireless services thanks to their high speed, flexible deployment and low cost. In the past, standardization work for WLANs focused primarily on improving link throughput rather than on efficiently utilizing spectrum resources and improving user experience, and the design of MAC algorithms did not improve significantly. However, with the wide deployment of WLANs, some fundamental technical challenges arise, especially in dense network environments. In these environments, the heavy collisions caused by channel contention can severely degrade network performance, failing to provide users with sufficient bandwidth and a good experience. In 2014, the IEEE standards committee approved the establishment of the 802.11ax task group. 802.11ax aims to provide a mode of operation for stations deployed in dense scenarios with at least a four-fold improvement in the average throughput per STA.
In high-density scenarios, conventional MAC protocols suffer from a high collision rate, severe interference and low channel utilization, and cannot support the diverse Quality of Service (QoS) requirements of future wireless services; nor do they provide an efficient contention window backoff mechanism. The multiple access mechanism of IEEE 802.11ax can reduce collisions to a certain extent and improve channel utilization, but several problems remain unsolved: on the one hand, in high-density scenarios, as the number of stations grows sharply, the current multiple access mechanism still cannot effectively avoid collisions and interference, and MAC-layer performance degrades severely; on the other hand, the environment faced during network operation is extremely complex, and current MAC algorithms based on traditional communication theory can neither allocate resources dynamically nor efficiently learn from historical experience.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a channel multiple access method based on reinforcement learning, which models the adjustment of the contention window in channel access as a Markov decision process and improves system performance through a reinforcement learning algorithm.
In order to achieve this purpose, the invention mainly adopts the following processes:

In the channel access process, each station generates an OBO backoff value from the uniform contention window broadcast by the AP and backs off. After backoff ends, the station randomly selects an RU on which to send a Buffer Status Report (BSR) to contend for channel resources; if the contention ending condition is met, the contention ends; otherwise, the remaining stations continue to contend for RUs. After the contention ends, the AP allocates different RUs to the successfully contending stations through a trigger frame, and the stations transmit data on their respectively allocated RUs.

In the learning process, after each action adjustment period, the AP calculates the reward for the current state according to the network performance of the previous period, updates the value function of the reinforcement learning model, and reselects a contention window to broadcast to the STAs. The above process is repeated every action adjustment period to optimize the contention window.
Specifically, the invention provides a channel multiple access method based on reinforcement learning, which is particularly suitable for the 802.11ax standard, and the method comprises the following steps:
step 1) modeling the adjustment of the contention window in the channel access process as a Markov decision process;
step 2) selecting a contention window action with an ε-greedy strategy in the current action adjustment period; the AP selects the optimal contention window for the current state from the contention window set;
step 3) each STA uses the optimal contention window broadcast by the AP to generate an OBO backoff value and backs off until contention ends;
step 4) the AP allocates resource units (RUs) to the successfully contending stations through a trigger frame, and the stations transmit data on their respectively allocated RUs; judging whether the current action adjustment period has ended, and if so, entering step 5); otherwise, returning to step 3);
step 5) after the action adjustment period ends, calculating a system performance index as the reward;
step 6) updating the action value function according to the reward obtained from the current performance index; judging whether the termination condition is met, and if not, entering the next action adjustment period and returning to step 2) to continue optimizing the contention window; otherwise, terminating the flow.
The invention has the beneficial effects that:
the invention provides a channel multiple access method based on reinforcement learning. Based on the standard back-off mechanism of IEEE 802.11ax, a reinforced learning algorithm is used for dynamically adjusting the contention window. The method realizes the further control of the station competition channel, thereby achieving the effects of improving the throughput and fairness of the system and reducing the time delay.
Drawings
FIG. 1 is a block diagram of a channel multiple access architecture based on reinforcement learning according to the present invention;
FIG. 2 is a diagram of a model for reinforcement learning according to the present invention;
FIG. 3 is a flow chart of the channel multiple access method based on reinforcement learning according to the present invention;
FIG. 4 is a flowchart illustrating AP learning according to the present invention;
fig. 5 is a flowchart of channel access by an STA in the present invention;
fig. 6 is a timing diagram of channel access by STAs in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them.
In an embodiment, as shown in fig. 1, this embodiment provides a framework of channel multiple access based on reinforcement learning: the AP adjusts the contention window CW by learning and evaluation, and each STA obtains the contention window CW from an acknowledgement frame sent by the AP and uses it to contend for channel resources; after the STAs have used the contention window to contend for channel resources multiple times, network feedback is formed and sent to the AP.
In an embodiment, as shown in fig. 2, this embodiment provides the model with which the AP performs reinforcement learning. In the reinforcement learning model, the AP is set as the agent, and the environment state S of the AP is the current contention window; the allowed actions A are increasing, decreasing or keeping the current contention window; the reward is an important performance index of the network, such as throughput, delay, fairness, etc.
In one embodiment, as shown in fig. 3, the present embodiment provides a channel multiple access method based on reinforcement learning, the method including:
step 1) modeling the adjustment of the contention window in the channel access process as a Markov decision process;
step 2) selecting a contention window action with an ε-greedy strategy in the current action adjustment period; the AP selects the optimal contention window for the current state from the contention window set;
step 3) each STA uses the optimal contention window broadcast by the AP to generate an OBO backoff value and backs off until contention ends;
step 4) the AP allocates resource units (RUs) to the successfully contending stations through a trigger frame, and the stations transmit data on their respectively allocated RUs; judging whether the current action adjustment period has ended, and if so, entering step 5); otherwise, returning to step 3);
step 5) after the action adjustment period ends, calculating a system performance index as the reward;
step 6) updating the action value function according to the reward obtained from the current performance index; judging whether the termination condition is met, and if not, entering the next action adjustment period and returning to step 2) to continue optimizing the contention window; otherwise, terminating the flow.
In one embodiment, the model of the Markov decision process in step 1) comprises:

S = {s_1, s_2, …, s_n}, s_t ∈ CW
A = {a_1, a_2, …, a_n}, a_t ∈ {−1, 0, 1}
s_t = CW_curr
s_{t+1} = CW_next
CW_next = CW_curr/η if a_t = −1; CW_curr if a_t = 0; η·CW_curr if a_t = 1    (1)

wherein S represents the state space, i.e. the set of all contention windows a station can select; s_t represents the contention window at time t; A represents the action space, i.e. scaling or holding the current contention window; a_t = −1 represents taking the contention-window-decrease action at time t; a_t = 0 represents keeping the contention window unchanged at time t; a_t = 1 represents taking the contention-window-increase action at time t; η represents the adjustment factor; CW_curr represents the contention window of the current state; and CW_next represents the contention window of the next state.
In a preferred embodiment, the next state is determined by the contention window at the next time instant. η may be taken as 2, or another value may be selected according to the actual situation. Through the relation in (1), the contention window can be scaled or kept unchanged, and a given action leads to exactly one next state, so the transition probability is p(s_{t+1} | s_t, a_t) = 1. The minimum and maximum contention windows are CW_min = 15 and CW_max = 1023, respectively, and can be adjusted according to actual conditions.
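A minimal sketch of this deterministic transition, assuming η = 2 and the CW_min = 15 / CW_max = 1023 bounds given above (the function and constant names are illustrative, not taken from the patent):

```python
CW_MIN, CW_MAX = 15, 1023  # bounds from the embodiment; adjustable in practice
ETA = 2                    # adjustment factor eta; other values may be chosen

def next_contention_window(cw_curr: int, action: int) -> int:
    """Apply action a_t in {-1, 0, 1} to the current contention window.

    -1 shrinks the window by eta, 0 keeps it, 1 grows it by eta; the result
    is clamped to [CW_MIN, CW_MAX], so the transition is deterministic:
    p(s_{t+1} | s_t, a_t) = 1.
    """
    if action == -1:
        cw_next = cw_curr // ETA
    elif action == 0:
        cw_next = cw_curr
    elif action == 1:
        cw_next = cw_curr * ETA
    else:
        raise ValueError("action must be -1, 0 or 1")
    return max(CW_MIN, min(CW_MAX, cw_next))
```

The patent does not state whether the scaled window is re-aligned to values of the form 2^k − 1; the sketch uses plain integer scaling with clamping.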
In one embodiment, the system adopts the value function update of the Q-learning algorithm; the action value function is not updated when the system runs for the first time, and the following formulas are used from the second run onwards. The action value function update comprises:
q(s, a) ← q(s, a) + α[U − q(s, a)]    (2)
U ← R + γ max_{a′ ∈ A(s′)} q_π(s′, a′)    (3)
wherein q(s, a) represents the value of taking contention window action a in state s; α is the learning rate and γ is the discount factor; R represents the performance index reward; U is the temporal-difference target, representing the predicted actual return; and q_π(s′, a′) represents the value of taking action a′ in the next state s′ under policy π. Of course, other reinforcement learning algorithms may also be used with the present invention.
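A hedged sketch of the update in formulas (2) and (3), using a plain dictionary as the Q-table (the names `update_q` and `q_table` are illustrative assumptions, and the α and γ values are placeholders):

```python
ALPHA = 0.1   # learning rate alpha (illustrative value)
GAMMA = 0.9   # discount factor gamma (illustrative value)

def update_q(q_table, state, action, reward, next_state, actions=(-1, 0, 1)):
    """One Q-learning step per formulas (2)-(3):
    U = R + gamma * max_{a'} q(s', a'), then q(s, a) += alpha * (U - q(s, a)).
    States are contention windows; actions are -1 (shrink), 0 (keep), 1 (grow)."""
    u = reward + GAMMA * max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (u - old)
```

Unvisited state-action pairs default to 0, matching the initial action value q(s, a) = 0 of claim 3.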
In one embodiment, the system uses an ε-greedy strategy to select an action; the action the AP selects actually refers to a contention window. That is, through reinforcement learning the AP selects the optimal contention window CW for the current state and broadcasts it to the STAs. The system may broadcast the CW piggybacked on a beacon frame the first time, and a non-first broadcast contention window may be carried by adding a CW field to the acknowledgement frame MBA, so that the CW is broadcast while the AP acknowledges the data frame. The ε-greedy action selection formula is as follows:
π(a|s) = 1 − ε + ε/|A(s)|, if a = argmax_{a′} q_π(s, a′); ε/|A(s)|, otherwise    (4)
wherein π(a|s) means that the AP agent selects the action with the currently maximal value with probability 1 − ε, and randomly selects an action from all actions with probability ε; |A(s)| represents the number of selectable actions under the contention window of state s; and q_π(s, a) represents the value function under policy π, i.e. the value of action a selected by policy π in the current state s.
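A minimal sketch of the ε-greedy selection in formula (4); the exploration rate and the tie-breaking rule (ties go to the first maximal action) are illustrative assumptions:

```python
import random

EPSILON = 0.1  # exploration probability epsilon (illustrative value)

def select_action(q_table, state, actions=(-1, 0, 1)):
    """epsilon-greedy: with probability epsilon explore uniformly over all
    actions; otherwise exploit the action with the largest Q-value in `state`."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

Note that `random.choice` may also return the greedy action, which matches the 1 − ε + ε/|A(s)| probability assigned to it in formula (4).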
In an embodiment, after the channel access method of the present invention is used and the system has run one action adjustment period, performance indexes such as system throughput, delay and fairness within that period can be counted; data may be transmitted many times within one action adjustment period. From these performance indexes a reward R can be calculated. The reward calculation comprises:
R = p(t)    (5)

[Equation (6): definition of Throughput_i, the system throughput of the i-th period]
[Equation (7): definition of DelayTime_i, the average delay of the i-th period]
wherein p(t) is an important performance index of the network, including throughput, delay and/or fairness; Throughput_i represents the system throughput of the i-th period; and DelayTime_i represents the average delay of the i-th period.
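The exact expressions behind equations (6) and (7) are not reproduced in this text, so the following is only a plausible sketch of turning per-period statistics into the reward, using throughput as p(t) in line with the preferred embodiment of step S25 below; all field and function names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class PeriodStats:
    """Statistics the AP could collect over one action adjustment period."""
    bits_delivered: float   # total payload bits successfully transmitted
    duration_s: float       # length of the action adjustment period in seconds
    total_delay_s: float    # summed delay of all delivered frames
    delivered_frames: int   # number of frames delivered in the period

def compute_reward(stats: PeriodStats) -> float:
    """Reward R = p(t); here p(t) is taken as the period throughput (bit/s).
    DelayTime_i (average delay) or a fairness index could be used instead."""
    return stats.bits_delivered / stats.duration_s

def average_delay(stats: PeriodStats) -> float:
    """DelayTime_i as a simple mean over delivered frames (assumed definition)."""
    return stats.total_delay_s / max(stats.delivered_frames, 1)
```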
In another embodiment, the reinforcement learning-based channel multiple access method may further include:
In the learning process, after each action adjustment period, the AP calculates the reward for the current state according to the network performance of the previous period, updates the value function of the reinforcement learning model, and reselects a contention window to broadcast to the STAs. The above process is repeated every action adjustment period to optimize the contention window.
In the channel access process, each station generates an OBO backoff value from the uniform contention window broadcast by the AP and backs off. After backoff ends, the station randomly selects one RU on which to send a BSR to contend for channel resources; if the contention ending condition is met, the contention ends; otherwise, the remaining stations continue to contend for RUs. After the contention ends, the AP allocates different RUs to the successfully contending stations through a trigger frame, and the stations transmit data on their respectively allocated RUs.
The learning process may refer to fig. 4, and may also include:
Step S21, initializing parameters and establishing a Markov decision process with the AP as the agent and the current contention window as the environment state;
The environment state S of the AP is the current contention window; the allowed actions are increasing, decreasing or keeping the current contention window unchanged; the reward is an important network performance index, such as throughput, delay, etc.
Step S22, updating the action value function;
The action value function records historical experience and is used to adjust later contention windows.
Step S23, selecting an action with the ε-greedy strategy;
This trades off exploration against exploitation. The action performed by the AP is actually scaling the current contention window or keeping it unchanged.
Step S24, the STA contends for the channel and transmits data;
alternatively, the process of step S24 may be a channel access process, and reference may be made to the above-described embodiment.
Step S25, obtaining reward, and counting some performance indexes of the last action adjusting period as reward;
as a preferred embodiment, the present embodiment prioritizes throughput as a reward.
Step S26, updating the action value function: the system updates the action value function according to the reward obtained in the current action adjustment period. It is then judged whether the system satisfies the termination condition; if so, the flow terminates; otherwise, execution continues from step S21.
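Putting steps S21–S26 together, a hedged end-to-end sketch of the AP learning loop, reusing the helper functions from the sketches above; `run_access_period` is a stand-in for the channel access and statistics collection of step S24 and is not defined by the patent:

```python
def run_access_period(cw: int) -> "PeriodStats":
    """Placeholder for step S24: broadcast `cw`, let STAs back off and contend
    for RUs, carry data for one action adjustment period, and return the
    collected statistics. A real implementation would drive the OFDMA MAC here."""
    raise NotImplementedError("stand-in; see the contention sketch further below")

def ap_learning_loop(num_periods: int, cw_init: int = 15):
    """Sketch of steps S21-S26: select a contention window action, broadcast the
    resulting CW, observe one action adjustment period, update the Q-table."""
    q_table = {}                 # step S21: initialize the action value function
    state = cw_init              # environment state = current contention window
    for _ in range(num_periods):
        action = select_action(q_table, state)            # step S23: epsilon-greedy
        next_state = next_contention_window(state, action)
        stats = run_access_period(next_state)             # step S24: contention + data
        reward = compute_reward(stats)                     # step S25: per-period reward
        update_q(q_table, state, action, reward, next_state)  # steps S26/S22
        state = next_state
```

The fixed number of periods stands in for the termination condition of step S26, which the patent leaves to the implementation.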
The channel access procedure may refer to fig. 5, and may also include:
in step S11, after acquiring the contention window CW from the acknowledgement frame sent by the AP, the STA randomly selects a backoff value from [0, CW ], and records it as an OBO. If the channel is idle, the total RU number is subtracted from the OBO in each backoff process until the OBO is less than or equal to 0.
In step S12, when the OBO of the STA is less than or equal to 0, the STA gets a chance to contend for the channel resource. The STA randomly selects one RU to transmit a BSR. In order to ensure the service quality of the high-priority STA, the high-priority STA is allowed to obtain two continuous chances of competing for channel resources after one backoff is finished; while low priority STAs have only one chance to contend for channel resources.
In step S13, while the STA competes for the RU, the AP counts the number of STAs and the number of contention rounds that successfully competed. If the number of STAs for which the contention succeeds is greater than or equal to the total number of RUs or the number of contention rounds is greater than the maximum number of contention rounds, the contention ends. After the competition is finished, the AP sends a trigger frame, and each STA which successfully competes for RUs is allocated with one RU.
In step S14, the STA transmits data using the RU allocated by the AP, and after receiving the acknowledgement frame, the STA re-executes step S11.
In another embodiment, the channel access procedure may further include:
Step S111, after obtaining the contention window CW from the acknowledgement frame sent by the AP, the STA randomly selects a backoff value from [0, CW] and records it as OBO. If the channel remains idle for one DIFS frame interval, the STA starts to back off; if the channel is busy because BSR frames are being transmitted, backoff starts after waiting one MIFS frame interval. The total number of RUs is subtracted from the OBO in each backoff step until OBO ≤ 0.
In step S112, when the OBO of the STA is less than or equal to 0, the STA obtains an opportunity to contend for channel resources. The STA randomly selects one sub-channel on which to transmit the BSR to contend for the channel; each RU can be regarded as one sub-channel. To guarantee the service quality of high-priority STAs, a high-priority STA is allowed two consecutive opportunities to contend for channel resources after one backoff finishes, and if either of the two contentions succeeds, the STA is considered to have successfully contended for the channel; a low-priority STA has only one opportunity. The system mainly divides the traffic into two types: high-priority video stations and low-priority background stations.
In step S113, while the STAs contend for RUs, the AP counts the number of successfully contending STAs and the number of contention rounds. After the channel has been idle for one DIFS frame interval, each idle slot and each BSR transmission is counted as one contention round.
For example, fig. 6 is a timing diagram of STAs contending for the channel and transmitting data in the system of the present invention; the number of contention rounds in fig. 6 is 5. If the number of successfully contending STAs is greater than or equal to the total number of RUs, or the number of contention rounds exceeds the maximum number of contention rounds, the contention ends. After the contention ends, the AP sends a trigger frame and allocates one RU to each STA that successfully contended for an RU. If the number of successful STAs is larger than the total number of RUs, the excess STAs are randomly marked as having failed, until the number of successfully contending STAs equals the total number of RUs; one RU is then allocated to each of these STAs.
Step S114, after receiving the trigger frame TF, the STAs that won channel resources transmit data using the RUs allocated by the AP; after receiving the acknowledgement frame MBA, the flow returns to step S111, and STAs that have data to transmit can perform channel access again.
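A simplified, hedged sketch of one contention phase as described in steps S111–S113, at slot granularity; the priority handling, round counting and tie resolution are simplified assumptions, and the function name is illustrative:

```python
import random

def contend_for_rus(cw: int, num_stations: int, num_rus: int, max_rounds: int):
    """Every station draws OBO from [0, CW] and decrements it by the number of
    available RUs each backoff step; when OBO <= 0 it sends a BSR on a random RU.
    A station wins an RU only if no other station picked the same RU that round.
    Contention ends when enough stations have succeeded or max_rounds is reached."""
    obo = {sta: random.randint(0, cw) for sta in range(num_stations)}
    winners = set()
    for _ in range(max_rounds):
        if len(winners) >= num_rus:
            break
        choices = {}  # RU index -> stations that transmitted a BSR on it this round
        for sta in list(obo):
            obo[sta] -= num_rus
            if obo[sta] <= 0:
                choices.setdefault(random.randrange(num_rus), []).append(sta)
                del obo[sta]
        for ru, stas in choices.items():
            if len(stas) == 1:        # no collision on this RU: contention success
                winners.add(stas[0])
            else:                      # collision: the involved stations back off again
                for sta in stas:
                    obo[sta] = random.randint(0, cw)
    return winners  # the AP would now allocate one RU to each winner via a trigger frame
```

For example, `contend_for_rus(cw=15, num_stations=20, num_rus=9, max_rounds=5)` returns the set of stations that would be served in the following trigger frame.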
In a preferred embodiment, video traffic has high priority and background traffic has low priority.
In one embodiment, the optimal contention window broadcast by the AP comprises: the AP broadcasts the contention window CW piggybacked on a beacon frame, and a non-first-time broadcast contention window is carried by adding a CW field to the acknowledgement frame MBA, so that the CW is broadcast while the AP acknowledges the data frame; of course, as shown in fig. 6, the first broadcast may also be sent with MBA frames in all sub-channels.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A channel multiple access method based on reinforcement learning, the method comprising:
step 1) modeling the adjustment of the contention window in the channel access process as a Markov decision process;
step 2) selecting a contention window action with an ε-greedy strategy in the current action adjustment period; the AP selects the optimal contention window for the current state from the contention window set;
the formula adopted by the ε-greedy strategy to select the contention window action comprises:
π(a|s) = 1 − ε + ε/|A(s)|, if a = argmax_{a′} q_π(s, a′); ε/|A(s)|, otherwise
wherein π(a|s) means that the AP agent selects the action with the currently maximal value with probability 1 − ε, and randomly selects an action from all actions with probability ε; |A(s)| represents the number of selectable actions under the contention window of state s; q_π(s, a) represents the value of taking action a under policy π;
step 3) each STA uses the optimal contention window broadcast by the AP to generate an OBO backoff value and backs off until contention ends;
the backoff contention process comprises:
step 31) after obtaining the optimal contention window CW from the acknowledgement frame sent by the AP, the STA randomly selects a backoff value from [0, CW] and records it as OBO; if the channel is idle, the total number of RUs is subtracted from the OBO in each backoff step until OBO ≤ 0;
step 32) when the OBO of the STA is less than or equal to 0, the STA obtains an opportunity to contend for channel resources; the STA randomly selects one RU on which to send a BSR frame;
step 33) the AP counts the number of successfully contending STAs and the number of contention rounds; if the number of successfully contending STAs is greater than or equal to the total number of RUs, or the number of contention rounds exceeds the maximum number of contention rounds, the contention ends;
step 4) the AP allocates resource units (RUs) to the successfully contending stations through a trigger frame, and the stations transmit data on their respectively allocated RUs; judging whether the current action adjustment period has ended, and if so, entering step 5); otherwise, returning to step 3);
step 5) after the action adjustment period ends, calculating a system performance index as the reward;
step 6) updating the action value function according to the reward obtained from the current performance index; judging whether the termination condition is met, and if not, entering the next action adjustment period and returning to step 2) to continue optimizing the contention window; otherwise, terminating the flow.
2. The channel multiple access method based on reinforcement learning of claim 1, wherein the model of the Markov decision process in step 1) comprises:
S = {s_1, s_2, …, s_n}, s_t ∈ CW
A = {a_1, a_2, …, a_n}, a_t ∈ {−1, 0, 1}
s_t = CW_curr
s_{t+1} = CW_next
CW_next = CW_curr/η if a_t = −1; CW_curr if a_t = 0; η·CW_curr if a_t = 1
wherein S represents the state space, i.e. the set of all contention windows a station can select; s_t represents the contention window at time t; A represents the action space, i.e. scaling or holding the current contention window; a_t = −1 represents taking the contention-window-decrease action at time t; a_t = 0 represents keeping the contention window unchanged at time t; a_t = 1 represents taking the contention-window-increase action at time t; η represents the adjustment factor; CW_curr represents the contention window of the current state; and CW_next represents the contention window of the next state.
3. The channel multiple access method based on reinforcement learning of claim 1, wherein the initial action value in step 2) is q(s, a) = 0.
4. The channel multiple access method based on reinforcement learning of claim 1, wherein step 32) comprises: in order to guarantee the service quality of high-priority STAs, a high-priority STA is allowed two consecutive opportunities to contend for channel resources after one backoff is completed, while a low-priority STA has only one opportunity to contend for channel resources.
5. The channel multiple access method based on reinforcement learning of claim 1, wherein the optimal contention window broadcast by the AP comprises: the AP broadcasts the contention window CW piggybacked on a beacon frame, and a non-first-time broadcast contention window is carried by adding a CW field to the acknowledgement frame (MBA); the CW is broadcast while the AP acknowledges the data frame.
6. The channel multiple access method based on reinforcement learning of claim 1, wherein the calculation formula of the reward of the performance index in the step 5) comprises:
R = p(t)
[Equation: definition of Throughput_i, the system throughput of the i-th period]
[Equation: definition of DelayTime_i, the average delay of the i-th period]
wherein p(t) is an important performance index of the network, including any one or more of throughput, delay and fairness; Throughput_i represents the system throughput of the i-th period; and DelayTime_i represents the average delay of the i-th period.
7. The channel multiple access method based on reinforcement learning of claim 1, wherein the calculation formula of the action value function in step 6) comprises:
q(s, a) ← q(s, a) + α[U − q(s, a)]
U ← R + γ max_{a′ ∈ A(s′)} q_π(s′, a′)
wherein q(s, a) represents the value of taking contention window action a in state s; α is the learning rate and γ is the discount factor; R represents the performance index reward; U is the temporal-difference target, representing the predicted actual return; and q_π(s′, a′) represents the value of selecting action a′ in the next state s′ under policy π.
CN202010154072.7A 2020-03-07 2020-03-07 Channel multiple access method based on reinforcement learning Active CN111245541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010154072.7A CN111245541B (en) 2020-03-07 2020-03-07 Channel multiple access method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010154072.7A CN111245541B (en) 2020-03-07 2020-03-07 Channel multiple access method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111245541A CN111245541A (en) 2020-06-05
CN111245541B true CN111245541B (en) 2021-11-16

Family

ID=70876879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010154072.7A Active CN111245541B (en) 2020-03-07 2020-03-07 Channel multiple access method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111245541B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492656B (en) * 2020-11-25 2022-08-05 重庆邮电大学 Wireless network access point switching method based on reinforcement learning
CN112584541A (en) * 2020-11-28 2021-03-30 重庆邮电大学 Greedy algorithm based wireless network multichannel multiple access method
CN112566161B (en) * 2020-12-02 2022-07-15 温州职业技术学院 WLAN target wake-up time scheduling method under deterministic channel access condition
CN115315020A (en) * 2022-08-08 2022-11-08 重庆邮电大学 Intelligent CSMA/CA (Carrier sense multiple Access/Carrier aggregation) backoff method based on IEEE (institute of Electrical and electronics Engineers) 802.15.4 protocol of differentiated services

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110336620A (en) * 2019-07-16 2019-10-15 沈阳理工大学 A kind of QL-UACW back-off method based on MAC layer fair exchange protocols

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100285327B1 (en) * 1996-12-02 2001-04-02 박종섭 Method for testing power control guide function of base station transceiver subsystem in mobile telecommunication system
CN102256262B (en) * 2011-07-14 2013-09-25 南京邮电大学 Multi-user dynamic spectrum accessing method based on distributed independent learning
CN109639377B (en) * 2018-12-13 2021-03-23 西安电子科技大学 Spectrum resource management method based on deep reinforcement learning
CN110035559B (en) * 2019-04-25 2023-03-10 重庆邮电大学 Intelligent competition window size selection method based on chaotic Q-learning algorithm

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110336620A (en) * 2019-07-16 2019-10-15 沈阳理工大学 A kind of QL-UACW back-off method based on MAC layer fair exchange protocols

Also Published As

Publication number Publication date
CN111245541A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111245541B (en) Channel multiple access method based on reinforcement learning
US20060215686A1 (en) Communication method for accessing wireless medium under enhanced distributed channel access
US20060062189A1 (en) Wireless transceiver, circuit module, and method for setting channel access time
CN111163491B (en) Fine-grained statistical priority multiple access method with high channel utilization rate
CN104936303A (en) Carrier sensing threshold and competition window combined control method
CN111328052B (en) Channel resource allocation method in high-density wireless network
Uwai et al. Adaptive backoff mechanism for OFDMA random access with finite service period in IEEE802. 11ax
CN109257830B (en) QoS-based vehicle-mounted network self-adaptive back-off method
Syed et al. Delay analysis of IEEE 802.11 e EDCA with enhanced QoS for delay sensitive applications
Zhang et al. Performance analysis of reservation and contention-based hybrid MAC for wireless networks
Huang et al. Detailed analysis for IEEE 802.11 e EDCA in non-saturated conditions-Frame-transmission-cycle approach
CN116489813A (en) Self-adaptive conflict back-off method and system suitable for Lora-Mesh network
Achary et al. Performance enhancement of IEEE 802.1 le WLAN by dynamic adaptive contention window
Gopinath et al. Channel status based contention algorithm for non-safety applications in IEEE802. 11p vehicular network
CN106937326B (en) Method for coordinating transmission among base stations and first base station
CN115022978A (en) Wireless network uplink scheduling method based on self-adaptive grouping and reinforcement learning
KR100853695B1 (en) Wireless lan apparatus based on multiple queues
WO2016155218A1 (en) Method and device for sending wireless frames
CN112584541A (en) Greedy algorithm based wireless network multichannel multiple access method
Xu et al. Time-Triggered Reservation for Cooperative Random Access in Wireless LANs
Ojeda-Guerra et al. Adaptive tuning mechanism for EDCA in IEEE 802.11 e wireless LANs
CN111263463A (en) IEEE802-11ax QoS channel access control method based on service priority
Lv et al. Dynamic polling sequence arrangement for low-latency wireless LAN
Liu et al. DRL-based channel access in NR unlicensed spectrum for downlink URLLC
CN117241409B (en) Multi-type terminal random access competition solving method based on near-end policy optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant