CN113613339B - Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Info

Publication number
CN113613339B
Authority
CN
China
Prior art keywords
channel, experience, network, priority, protocol
Prior art date
Legal status
Active
Application number
CN202110781263.0A
Other languages
Chinese (zh)
Other versions
CN113613339A (en)
Inventor
孙红光
高银洁
张宏鸣
李书琴
徐超
高振宇
Current Assignee
Northwest A&F University
Original Assignee
Northwest A&F University
Priority date
Filing date
Publication date
Application filed by Northwest A&F University filed Critical Northwest A&F University
Priority to CN202110781263.0A
Publication of CN113613339A
Application granted
Publication of CN113613339B
Status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access
    • H04W74/04 Scheduled access
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention belongs to the technical field of wireless communication and discloses a channel access method for multi-priority wireless terminals based on deep reinforcement learning, which comprises the following steps: establishing network scenarios with services of different priorities; designing and defining the system model of the protocol, performing state-space modeling and action-space modeling according to the network scenario of the protocol, and designing reward functions for different scenarios; defining and establishing the neural network model used by the protocol and training the network model through experience tuples; and verifying the performance of the trained model through simulation comparisons over multiple scenarios. By designing the channel access method of the multi-priority-service wireless terminal with deep reinforcement learning, the invention is better suited to wireless networks with services of different priorities, improves system throughput and the utilization of wireless channel resources, reduces the scheduling delay of high-priority services, and increases the channel access opportunities of low-priority services.

Description

Channel access method of multi-priority wireless terminal based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a channel access method of a multi-priority wireless terminal based on deep reinforcement learning.
Background
Currently, with the rapid development of wireless communication technology, the demand of new services such as data transmission and exchange for wireless channels keeps increasing. In a wireless network, when multiple users contend for the right to use a specific resource (for example, a shared channel) at the same time, each user transmits data packets by acquiring the right to use the channel; because information from different users must occupy the channel for transmission, packet collisions may occur and cause communication failures. To improve communication efficiency, a multiple access protocol needs to be introduced to determine the users' access rights to the resource, thereby solving the problem of multiple users sharing the same physical link resource.
Yiding Yu et al. propose a deep-reinforcement-learning-based multiple access protocol (Deep-reinforcement Learning Multiple Access, DLMA) for wireless heterogeneous networks. Their work presents different deep-reinforcement-learning-based multiple access algorithms for different optimization targets and compares them through system simulation. The simulation results show that DLMA can achieve the expected goal without knowing the multiple access protocols adopted by other coexisting networks, thereby improving the proportional fairness and the throughput of the system. However, existing channel access methods based on deep reinforcement learning do not consider the differing quality-of-service requirements of services with different priorities and cannot adequately guarantee the quality of service of high-priority services.
Conventional multiple access protocols such as time division multiple access (Time Division Multiple Access, TDMA), the ALOHA protocol, code division multiple access (Code Division Multiple Access, CDMA) and carrier sense multiple access (Carrier Sense Multiple Access, CSMA) all suffer from low channel utilization. Therefore, the invention mainly addresses these problems of multiple access protocols: it designs a channel access method for multi-priority wireless terminals based on deep reinforcement learning in network scenarios with multi-priority services, and reduces the scheduling delay of high-priority services under the constraint of maintaining good system throughput.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) In a wireless network, when multiple users contend for the right to use a specific resource at the same time, each user transmits data packets by acquiring the right to use the channel; because information from different users must occupy the channel for transmission, packet collisions may occur, resulting in communication failure.
(2) The existing channel access method based on deep reinforcement learning does not consider the service quality requirement difference of the services with different priorities, and cannot well ensure the service quality requirement of the high-priority service.
(3) Traditional multiple access methods for multi-priority services cannot fully utilize wireless channel resources, which causes resource waste.
The difficulty and significance of solving these problems and defects are as follows: the invention provides a channel access method for multi-priority wireless terminals based on deep reinforcement learning, which gives the wireless terminal learning capability and lets it interact with the environment through a reinforcement learning mechanism, thereby further improving the utilization of wireless channel resources. In the design of the reward function, setting different rewards for services of different priorities reduces the scheduling delay of high-priority services while increasing the channel access opportunities of low-priority services.
Disclosure of Invention
Aiming at the problems of existing multiple access protocols, the invention provides a channel access method for multi-priority wireless terminals based on deep reinforcement learning.
The invention is realized in such a way that the channel access method of the multi-priority wireless terminal based on the deep reinforcement learning comprises the following steps:
Step one, establishing network scenarios with services of different priorities. The network scenarios are defined so that a user can modify the system model and the neural network model according to different network scenarios and deploy them accordingly.
Step two, designing and defining the system model of the protocol, carrying out state-space modeling and action-space modeling according to the network scenario of the protocol, and designing reward functions for different scenarios. According to the different network scenarios, the state space and action space are fine-tuned before use, and the form of the reward function is designed and modified for different emphases (the core of the reward-function design being priority), so that the requirements of the different network scenarios are better met and the design better matches actual deployment.
Step three, defining and establishing the neural network model used by the protocol, and training the network model through experience tuples. The convergence rates of different neural network models during training are not the same; by comparing them and selecting a better neural network model, the training process becomes faster and more accurate.
Step four, verifying the performance of the trained model through simulation comparisons over multiple scenarios. The feasibility and superiority of the invention are verified by comparison with a multiple access protocol without service priorities in multiple scenarios.
Further, in the first step, the establishing a network scenario with services of different priorities includes:
establishing a transmission network with k priority services, where k is greater than 0; the network scenario comprises a base station, N DRL-MAC nodes (nodes adopting the invention), M time division multiple access (Time Division Multiple Access, TDMA) nodes and X q-ALOHA nodes, where N ≥ 1 and M + X ≥ 1, i.e., there is at least one DRL-MAC node and at least one node running another protocol.
The base station is used for acquiring data from the wireless channel between the nodes and the base station and transmitting the data. The DRL-MAC node adopts a multiple access technique based on deep reinforcement learning; if the node sends services of different priorities, it obtains the transmission result fed back by the base station and receives different rewards according to the priority of the service; if the node does not send a service, it monitors the channel and learns the transmission state of other nodes in a given time slot from the channel observation result. The time division multiple access node adopts the TDMA protocol and transmits services by regularly and periodically occupying its allocated time slot. The q-ALOHA node adopts q-ALOHA and transmits services in each time slot with a fixed transmission probability q, where q depends on the scenario.
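For illustration, a minimal Python sketch of the two coexisting baseline node types is given below; the class names and the per-slot interface are assumptions made for this sketch, and only the behaviour stated above (a TDMA node occupying its allocated slot once per frame, a q-ALOHA node transmitting with a fixed probability q in every slot) is modelled.

```python
import random

class TDMANode:
    """Transmits regularly and periodically in its allocated slot(s) of each frame."""
    def __init__(self, allocated_slots, frame_length=10):
        self.allocated_slots = set(allocated_slots)
        self.frame_length = frame_length

    def transmits(self, t):
        # Occupies the same slot position(s) in every frame.
        return (t % self.frame_length) in self.allocated_slots

class QAlohaNode:
    """Transmits in every time slot with a fixed transmission probability q."""
    def __init__(self, q):
        self.q = q

    def transmits(self, t):
        return random.random() < self.q
```

A scenario like the ones used later in the embodiments (one TDMA node plus one q-ALOHA node) could then be built as, for example, nodes = [TDMANode([2]), QAlohaNode(0.5)] and queried once per time slot.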
Further, in the second step, each DRL-MAC node is equivalent to one agent in reinforcement learning. In each time slot t, the agent calculates the Q value q(s_t, a; θ) for the current state, where q(s_t, a; θ) is the approximation computed by the deep neural network model; action a is selected from the action set according to a greedy strategy, so as to maximize the overall expected reward and better adapt to the dynamically changing wireless network environment.
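A minimal sketch of this per-slot decision is shown below; it assumes the Q network is a callable returning one Q value per action, and it adds an ε-exploration branch as a common companion of the greedy rule (an assumption of this sketch, not stated in the text).

```python
import numpy as np

def select_action(q_network, state, num_actions, epsilon=0.0):
    """Pick a_t = argmax_a q(s_t, a; theta), with optional epsilon-exploration."""
    if epsilon > 0.0 and np.random.rand() < epsilon:
        return int(np.random.randint(num_actions))   # explore
    q_values = np.asarray(q_network(state))          # q(s_t, a; theta) for every action a
    return int(np.argmax(q_values))                  # exploit the greedy action
```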
In the second step, the system model of the protocol is designed and defined, state space modeling and action space modeling are performed according to the network scene of the protocol, and reward functions are designed according to different scenes, including:
(1) Action space modeling
The system action set is A_t = {a_0, a_1, a_2, ..., a_k}, where k is the number of priority classes of traffic in the network scenario. In time slot t, the DRL-MAC node needs to make a decision a through the deep neural network to determine whether to access a data packet to the channel in the current time slot; a_0 indicates not accessing the channel, and a_1, ..., a_k indicate accessing the channel with the service of the corresponding priority among the k priorities.
the channel observation result after taking action is Z t E { SUCCESS, COLLISION, IDLENESS }, obtaining channel observations for the construction of the experience tuples by listening to the channel; wherein SUCCESS indicates that the node accesses the channel and transmits the data packet successfully; COLLISION means that a plurality of nodes access channels simultaneously for transmission, resulting in COLLISION; IDLENESS indicates that the channel is idle, i.e., no node accesses the channel; the Agent determines a channel observation result according to the acknowledgement signal from the access point and the interception channel;
(2) State space modeling
The state set contains the M history states to be tracked; each history state consists of an action-observation pair, of which there are 2k+3 possible combinations in total.
(3) Reward function
The principle that the reward function always follows for network scenarios with services of different priorities is: the higher the priority of a service, the higher the reward brought by its successful transmission and the larger the penalty brought by its transmission failure. The reward function is set to sum_reward = α × reward + (1 - α) × (delay / T), where α and T are controllable variables; the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of delay is not considered; the parameter T unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value defined for services of different priorities, with r_1, ..., r_k being the rewards for successful transmission of services of the different priorities and r_{-1}, ..., r_{-k} being the penalties for their transmission failure.
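A sketch of this reward rule, with the per-priority reward and penalty tables passed in as parameters, might look as follows; the numeric values shown are placeholders for illustration, not values from the patent.

```python
def sum_reward(priority, success, delay, rewards, penalties, alpha=1.0, T=50):
    """sum_reward = alpha * reward + (1 - alpha) * (delay / T), where 'reward'
    is r_priority on success and r_{-priority} on failure."""
    reward = rewards[priority] if success else penalties[priority]
    return alpha * reward + (1 - alpha) * (delay / T)

# Placeholder tables: a higher priority gets a larger reward and a larger penalty.
rewards = {1: 2.0, 2: 1.0}        # r_1, r_2
penalties = {1: -2.0, 2: -1.0}    # r_{-1}, r_{-2}
```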
further, in the third step, the defining and establishing the neural network model used by the protocol, and training the network model through the experience tuple, includes:
DQN is introduced so that the DRL-MAC node can better learn how other nodes use the channel and make the next decision. The agent is trained with a deep residual network architecture: the deep residual network approximates the Q value, takes the current state s as input and outputs the action policy a; an experience tuple is then formed together with other information and used to train the deep residual network.
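As one possible realisation of the deep residual network described above, a PyTorch sketch is given below; the framework choice, layer widths and number of residual blocks are assumptions of this sketch and are not specified by the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A fully connected residual block: output = ReLU(x + F(x))."""
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return torch.relu(x + self.fc2(h))

class QNetwork(nn.Module):
    """Deep residual network approximating q(s, a; theta): the state s goes in,
    one Q value per action comes out (layer sizes are illustrative)."""
    def __init__(self, state_dim, num_actions, width=64, num_blocks=2):
        super().__init__()
        self.input = nn.Linear(state_dim, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(num_blocks)])
        self.head = nn.Linear(width, num_actions)

    def forward(self, s):
        x = torch.relu(self.input(s))
        x = self.blocks(x)
        return self.head(x)
```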
Further, in the third step, the defining and establishing the neural network model used by the protocol, and training the network model through the experience tuple, further includes:
(1) Initializing an experience pool, setting the capacity of the experience pool and initializing parameters;
(2) Starting from time slot t=0;
(3) The current state s is fed into the neural network to calculate its Q value; an action is selected through the greedy strategy, and the channel observation result z and the total reward sum_reward obtained after taking the action are recorded; the experience tuple (s, a, r, s') formed by the obtained state s, the reward r obtained after taking action a and the next state s' is put into the experience pool;
(4) If the number of generated experience tuples exceeds the experience pool capacity, the experience tuple that entered the pool earliest is discarded and the latest experience tuple is put into the pool; otherwise, the experience tuples enter the experience pool in order;
(5) If the current time slot t is a multiple of 10, N experience tuples are randomly drawn from the experience pool and their y values are calculated in turn as y = r + γ · max_a' q(s', a'; θ⁻), where r is the reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_a' q(s', a'; θ⁻) is the predicted future reward of the action with the maximum Q value; otherwise, go to step (8);
(6) The Q-estimation network parameters θ are updated using a semi-gradient descent algorithm;
(7) If the current time slot t is a multiple of the parameter F, the Q-estimation network parameters θ are copied to the target network parameters θ⁻; otherwise, go to step (8);
(8) If the time slot t is greater than or equal to the set number of training rounds, the training process exits; otherwise, the next time slot t = t + 1 begins and the procedure returns to step (3).
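The following sketch puts steps (1) to (8) together in Python/PyTorch; the environment interface (env.reset() and env.step(action) returning the next state, the reward and the channel observation), the exploration rate and all hyperparameter values are assumptions made for illustration only, and q_net/target_net can be instances of the residual Q network sketched above.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

def train(env, q_net, target_net, optimizer, num_slots, num_actions,
          capacity=500, batch_size=32, gamma=0.9, sample_every=10,
          copy_every=50, epsilon=0.1):
    """Steps (1)-(8): bounded experience pool, periodic sampling, semi-gradient
    update of theta, and periodic copy of theta into the target parameters theta^-."""
    pool = deque(maxlen=capacity)              # (1),(4): oldest tuples drop out automatically
    state = env.reset()
    for t in range(num_slots):                 # (2),(8): iterate over the time slots
        # (3): compute Q values, choose an action (epsilon-)greedily, store the tuple
        with torch.no_grad():
            q_values = q_net(torch.tensor(state, dtype=torch.float32))
        if random.random() < epsilon:
            action = random.randrange(num_actions)
        else:
            action = int(torch.argmax(q_values))
        next_state, reward, _observation = env.step(action)
        pool.append((state, action, reward, next_state))
        state = next_state

        if t % sample_every == 0 and len(pool) >= batch_size:
            # (5): draw N tuples and form y = r + gamma * max_a' q(s', a'; theta^-)
            batch = random.sample(pool, batch_size)
            s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
            a = torch.tensor([b[1] for b in batch], dtype=torch.long)
            r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
            with torch.no_grad():
                y = r + gamma * target_net(s2).max(dim=1).values
            # (6): gradient step on the Q-estimation parameters theta
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_sa, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if t % copy_every == 0:
            # (7): assign theta to the target network parameters theta^-
            target_net.load_state_dict(q_net.state_dict())
```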
Another object of the present invention is to provide a channel access system of a multi-priority service wireless terminal to which the channel access method of a multi-priority service wireless terminal based on deep reinforcement learning is applied, the channel access system of the multi-priority service wireless terminal comprising:
the network scene establishing module is used for establishing network scenes with different priority services;
the system model design module is used for designing and defining a system model of the protocol;
The space modeling module is used for carrying out state space modeling and action space modeling according to the protocol network scene;
the rewarding function design module is used for designing rewarding functions aiming at different scenes;
the neural network model building module is used for defining and building the neural network model used by the protocol;
the network model training module is used for training the network model through the experience tuple;
and the performance verification module is used for verifying the performance of the trained model through simulation comparison of multiple scenes.
It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through an experience tuple; and performing performance verification on the trained model through simulation comparison of multiple scenes.
Another object of the present invention is to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through an experience tuple; and performing performance verification on the trained model through simulation comparison of multiple scenes.
Another object of the present invention is to provide a wireless communication information data processing terminal for implementing a channel access system of the multi-priority service wireless terminal.
Combining all the above technical schemes, the advantages and positive effects of the invention are as follows. The invention provides a channel access method for multi-priority-service wireless terminals based on deep reinforcement learning (Deep Reinforcement Learning-based Medium Access Control, DRL-MAC), which is implemented by establishing network scenarios with services of different priorities; designing and defining the system model of the protocol, performing state-space and action-space modeling according to the network scenario of the protocol, and designing reward functions for different scenarios; defining and establishing the neural network model used by the protocol and training it with experience tuples; and verifying the performance of the trained model through simulation comparisons over multiple scenarios, so as to reduce the scheduling delay of high-priority services under the constraint of guaranteeing system throughput. By designing the channel access method of the multi-priority-service wireless terminal with deep reinforcement learning, the invention is better suited to wireless networks with services of different priorities, improves system throughput and reduces the scheduling delay of high-priority services.
For wireless networks with services of different priorities, the invention provides a channel access method for multi-priority-service wireless terminals based on deep reinforcement learning, which gives the wireless terminal learning capability and lets it interact with the environment through a reinforcement learning mechanism, thereby further improving the utilization of wireless channel resources. In the design of the reward function, setting different rewards for services of different priorities reduces the scheduling delay of high-priority services while increasing the channel access opportunities of low-priority services. Comparison with a multiple access protocol without service priorities shows that the proposed channel access method achieves better system throughput and lower scheduling delay for high-priority services.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a channel access method of a multi-priority service wireless terminal according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a channel access method of a multi-priority service wireless terminal according to an embodiment of the present invention.
Fig. 3 is a block diagram of a channel access system of a multi-priority service wireless terminal according to an embodiment of the present invention;
in the figure: 1. a network scene building module; 2. a system model design module; 3. a spatial modeling module; 4. a reward function design module; 5. the neural network model building module; 6. a network model training module; 7. and a performance verification module.
Fig. 4 is a flowchart of a training deep neural network based on deep reinforcement learning according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network model according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of simulation results provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems existing in the prior art, the invention provides a channel access method of a multi-priority wireless terminal based on deep reinforcement learning, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the channel access method of the multi-priority wireless terminal based on deep reinforcement learning provided by the embodiment of the invention comprises the following steps:
s101, establishing network scenes with different priority services;
s102, designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to a network scene of the protocol, and designing reward functions aiming at different scenes;
s103, defining and establishing a neural network model used by the protocol, and training the network model through an experience tuple;
s104, performing performance verification on the trained model through simulation comparison of multiple scenes.
The schematic diagram of the channel access method of the multi-priority wireless terminal based on deep reinforcement learning provided by the embodiment of the invention is shown in fig. 2.
As shown in fig. 3, a channel access system of a multi-priority service wireless terminal according to an embodiment of the present invention includes:
the network scene establishment module 1 is used for establishing network scenes with different priority services;
a system model design module 2 for designing and specifying a system model of the protocol;
the space modeling module 3 is used for carrying out state space modeling and action space modeling according to the protocol network scene;
A reward function design module 4, configured to design a reward function for different scenes;
the neural network model building module 5 is used for defining and building the neural network model used by the protocol;
a network model training module 6 for training the network model through the experience tuple;
and the performance verification module 7 is used for performing performance verification on the trained model through multi-scene simulation comparison.
The technical scheme of the invention is further described below by combining the embodiments.
As shown in fig. 1, the DRL-MAC protocol provided by the embodiment of the present invention includes the following steps:
(1) Establishing a wireless network scene containing a plurality of different priority services;
establishing a wireless network with two priority services, where the network scenario comprises: a base station, N DRL-MAC nodes (nodes adopting the invention), M time division multiple access (Time Division Multiple Access, TDMA) nodes and X q-ALOHA nodes, where N ≥ 1 and M + X ≥ 1, i.e., there is at least one DRL-MAC node and at least one node running another protocol.
The base station acquires data from the wireless channel between the nodes and the base station and transmits the data; the DRL-MAC node adopts a multiple access technique based on deep reinforcement learning; if the node sends services of different priorities, it obtains the transmission result fed back by the base station and receives different rewards according to the priority of the service; if the node does not send a service, it monitors the channel and learns the transmission state of other nodes in a given time slot from the channel observation result; the time division multiple access node adopts the TDMA protocol and transmits services by regularly and periodically occupying its allocated time slot; the q-ALOHA node adopts q-ALOHA and transmits services in each time slot with a fixed transmission probability q, where q depends on the scenario.
(2) Designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes;
Each DRL-MAC node is equivalent to one agent in reinforcement learning. In each time slot t, the agent calculates the Q value q(s_t, a; θ) for the current state, where q(s_t, a; θ) is the approximation computed by the deep neural network model; action a is selected from the action set according to a greedy strategy, so as to maximize the overall expected reward and better adapt to the dynamically changing wireless network environment.
(2.1) Action space modeling
For a network scenario with two priority services, the system action set is A_t = {a_0, a_1, a_2}; in time slot t, the DRL-MAC node needs to make a decision a through the deep neural network to determine whether to access a data packet to the channel in the current time slot, where a_0 indicates not accessing the channel, a_1 indicates accessing high-priority traffic to the channel, and a_2 indicates accessing low-priority traffic to the channel.
The channel observation result after taking an action is Z_t ∈ {SUCCESS, COLLISION, IDLENESS}; the channel observation is obtained by listening to the channel and is used for forming the experience tuple. SUCCESS indicates that the node accessed the channel and transmitted the data packet successfully; COLLISION means that multiple nodes accessed the channel simultaneously for transmission, resulting in a collision; IDLENESS indicates that the channel is idle, i.e., no node accessed the channel. The Agent determines the channel observation result based on the acknowledgement signal from the access point (if it is transmitting) and by listening to the channel (if it is waiting).
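A small sketch of how the Agent-side observation Z_t can be derived from the ACK feedback (when it transmitted) or from channel sensing (when it waited) is given below; the function name and arguments are illustrative, not from the patent.

```python
def observe(agent_transmitted, ack_received, num_other_transmitters):
    """Map ACK feedback / channel sensing to Z_t in {SUCCESS, COLLISION, IDLENESS}."""
    if agent_transmitted:
        # The access point acknowledges a successfully delivered packet.
        return "SUCCESS" if ack_received else "COLLISION"
    # Waiting: the agent only listens to the channel.
    if num_other_transmitters == 0:
        return "IDLENESS"
    return "SUCCESS" if num_other_transmitters == 1 else "COLLISION"
```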
(2.2) State space modeling
The state set contains the M history states to be tracked; each history state consists of an action-observation pair. For a network scenario with two priority services, there are seven possible action-observation pairs in total.
(2.3) reward function
For a network scenario with two priority services, the principle the reward function always follows is: the higher the priority of a service, the higher the reward brought by its successful transmission and the larger the penalty brought by its transmission failure. The reward function is set to sum_reward = α × reward + (1 - α) × (delay / T), where α and T are controllable variables; the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of delay is not considered; the parameter T mainly unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value defined for the network scenario with two priority services.
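As a quick numerical illustration of the formula (all values here are assumed, not taken from the patent):

```python
# With alpha = 0.8, T = 50, a successful high-priority transmission whose
# per-priority reward is 2.0 and whose scheduling delay was 10 slots:
alpha, T = 0.8, 50
reward, delay = 2.0, 10
print(alpha * reward + (1 - alpha) * (delay / T))   # -> 1.64
```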
(3) Defining and establishing a neural network model used by the protocol, and training the network model through an experience tuple; FIG. 4 is a flow chart of training a deep neural network based on deep reinforcement learning according to the present invention.
DQN is introduced so that the DRL-MAC node can better learn how other nodes use the channel and make the next decision. The agent is trained with a deep residual network architecture: the deep residual network approximates the Q value, takes the current state s as input and outputs the action policy a; an experience tuple is then formed together with other information and used to train the deep residual network.
And (3.1) initializing an experience pool, setting the capacity of the experience pool and initializing parameters.
(3.2) starting from time slot t=0;
(3.3) Feed the current state s into the neural network to calculate its Q value; select an action to execute through the greedy strategy, and record the channel observation result z and the total reward sum_reward obtained after taking the action; put the obtained state s, the reward r obtained after taking action a and the next state s' into the experience pool as an experience tuple (s, a, r, s').
(3.4) if the currently generated experience tuple is greater than the experience pool capacity, discarding the experience tuple that was first entered into the experience pool, and placing the latest experience tuple into the experience pool; otherwise, the experience tuples are sequentially put into the experience pool.
(3.5) If the current time slot t is a multiple of 10, randomly extract N experience tuples from the experience pool and calculate their y values in turn as y = r + γ · max_a' q(s', a'; θ⁻), where r is the reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_a' q(s', a'; θ⁻) is the predicted future reward of the action with the maximum Q value; otherwise, go to step (3.8).
(3.6) Update the Q-estimation network parameters θ using a semi-gradient descent algorithm.
(3.7) If the current time slot t is a multiple of the parameter F, copy the Q-estimation network parameters θ to the target network parameters θ⁻; otherwise, go to step (3.8).
(3.8) if the time slot t is greater than or equal to the set training round, exiting the training process; otherwise, entering the next time slot t=t+1; and then enter step (3.3).
(4) And performing performance verification on the trained model through simulation comparison of multiple scenes so as to reduce the scheduling delay of the high-priority service under the constraint of ensuring the throughput of the system.
Fig. 5 is a network scenario used in the simulation experiment of the present invention, where the network scenario includes a base station, N DRL-MAC nodes (nodes adopting the present invention) (N > 1), M TDMA nodes, and X q-ALOHA nodes (m+x > 1), where data packets are transmitted between the nodes and the base station through a shared wireless channel.
The technical effects of the present invention will be described in detail with reference to simulation experiments.
1. Simulation conditions:
The simulation experiments of the invention were implemented on a Windows platform with the following main configuration: the CPU is an Intel(R) Core(TM) i7-7500U at 2.70 GHz; the memory is 8 GB; the operating system is Windows 10; the simulation software is PyCharm.
2. Simulation content and result analysis:
the simulation experiment is compared with a model-aware node, wherein the model-aware node is aware of multiple access protocol (multiple access control, MAC) mechanisms of other coexisting nodes, and obtains the optimal MAC protocol coexisting with the node by utilizing the mechanism of the known MAC protocol. The simulation experiment results are shown in fig. 6.
Example one: in a transmission network with two priority services, a network scene comprises a base station, a DRL-MAC node and a TDMA node; the DRL-MAC node is always in a state with traffic to be transmitted (saturated traffic scenario).
Fig. 6 (a) is a throughput result when a DRL-MAC node and a TDMA node coexist in a saturated traffic scenario, and the goal is to achieve the optimal throughput of the system.
Fig. 6 (a) shows the throughput results when the number of time slots N occupied by TDMA is varied from 2 to 9 with a frame length of 10. The diagonally filled and solid filled portions of the histogram represent the throughput of the DRL-MAC node and the TDMA node, respectively. The circle-marked dashed line is the simulated total throughput, i.e., the total throughput of the system, when the DRL-MAC node and the TDMA node coexist. The diamond-marked dashed line represents the theoretical optimal system throughput verified by the model-aware node. It can be seen from fig. 6 (a) that the circle-marked dashed line and the diamond-marked dashed line almost coincide. This means that the DRL-MAC node can discover the time slots unused by TDMA through learning, without knowing the protocol used by the other node.
Fig. 6 (b) is a throughput result when the DRL-MAC node and one TDMA node coexist in a saturated traffic scenario, with the goal of achieving the optimal throughput of the system.
Fig. 6 (b) shows access probabilities of high-priority packets in a scenario where the DRL-MAC node and the TDMA node consider and do not consider traffic priorities. The square marked solid line of fig. 6 (b) shows the access probability of the high priority packet in the scenario where traffic priority is considered; the circular marked solid line indicates the probability of access of high priority packets without considering traffic priority scenarios. From fig. 6 (b) it can be seen that the blue line is in most cases above the red line. It can be concluded that the DRL-MAC node can send the high-priority service more timely in the service priority scenario, so as to ensure the priority communication of the high-priority service.
Example two: under the transmission network with two priority services, the network scene comprises a base station, a DRL-MAC node and a q-ALOHA node; the DRL-MAC node is always in a state with traffic to be transmitted (saturated traffic scenario).
Fig. 6 (c) is a throughput result when the DRL-MAC node and one q-ALOHA node coexist in a saturated traffic scenario, with the goal of achieving the optimal throughput of the system.
Fig. 6 (c) shows the throughput results of q-ALOHA when the access probability q is varied from 0.2 to 0.9 while the q-ALOHA node coexists with the DRL-MAC node in the saturated traffic scenario. The diagonally filled and solid filled portions in fig. 6 (c) represent the throughput of the DRL-MAC node and the q-ALOHA node, respectively. The circle-marked dashed line is the simulated total throughput when the DRL-MAC node and the q-ALOHA node coexist, i.e., the total throughput of the system. The diamond-marked dashed line represents the theoretical optimal system throughput verified by the model-aware node. From fig. 6 (c) it can be seen that the circle-marked dashed line and the diamond-marked dashed line almost coincide in most cases. This shows that the DRL-MAC node can obtain the best throughput by learning a policy without knowing that the other node is a q-ALOHA node or its transmission probability q.
Fig. 6 (d) is a throughput result when a DRL-MAC node and a q-ALOHA node coexist in a saturated traffic scenario, and the goal is to achieve fair transmission between nodes.
Fig. 6 (d) is a throughput result of q-ALOHA when the access probability q is changed from 0.2 to 0.6 in the case of implementing the proportional fairness index. The actual throughput of the q-ALOHA node, the DRL-MAC node and the system are respectively represented by a circular marked solid line, a triangular marked solid line and a square marked solid line, and the theoretical optimal throughput of the q-ALOHA node, the DRL-MAC node and the system obtained by the model perception node is respectively represented by a circular marked dotted line, a triangular marked dotted line and a square marked dotted line. As can be seen from fig. 6 (d), in the case of implementing the proportional fairness index, there is some relatively small error between the actual throughput and the theoretical optimal throughput, which may indicate that the DRL-MAC node may implement the proportional fairness index by learning a policy without knowing that another node is a q-ALOHA node and the transmission probability q.
Fig. 6 (e) is a throughput result when the DRL-MAC node and one q-ALOHA node coexist in a saturated traffic scenario, and the goal is to achieve the optimal throughput of the system.
Fig. 6 (e) shows the simulation results for the access probability of high-priority packets when the q-ALOHA access probability q is varied from 0.2 to 0.9. The solid line in the figure shows the proportion of high-priority packet access actions among all channel access actions taken by the Agent under different access probabilities q. As can be seen from fig. 6 (e), in a scenario where DRL-MAC and q-ALOHA coexist (an unstable environment), high-priority traffic can be accessed with a larger probability than low-priority traffic.
Example three: in a transmission network with two priority services, a network scene comprises a base station, a DRL-MAC node and a TDMA node; the DRL-MAC is in an unsaturated traffic scenario.
The unsaturated traffic scenario means that a data packet may arrive in each time slot and the arrival rates of packets with different priorities differ: the arrival probability of high-priority packets is defined as 0.3 and that of low-priority packets as 0.7. Packets of different priorities enter the corresponding priority queues and wait. To avoid the situation where a queue is empty when an access action is taken, a certain number of packets of the corresponding priority are enqueued to each priority queue when the traffic queues are initialized (in the invention, the high- and low-priority queues are each initialized with 5 packets), so that the accessed traffic queue is not empty when training has just started.
The channel access rules in the unsaturated traffic scenario are as follows. When action a_1 is taken, i.e., a low-priority service is accessed to the channel, the data packet is dequeued and discarded if a collision occurs during transmission; when action a_2 is taken, i.e., a high-priority service is accessed to the channel, the data packet is re-queued and placed at the head of the queue if a collision occurs during transmission. Because the reward of high-priority traffic is larger, high-priority traffic is in greater demand to be sent; however, since traffic arrives with fixed probabilities, when the Agent learns to take action a_2, i.e., to access a high-priority service, and there is no service in the high-priority queue, the action a_2 of accessing a high-priority service is still executed: a low-priority service is actually taken out of the low-priority queue and accessed. If the packet is transmitted successfully at this time, the reward corresponding to the low-priority service is given; if the transmission fails, the penalty corresponding to the high-priority packet is given and the packet is re-inserted at the head of the low-priority queue.
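A sketch of these queue rules is given below, following the action labels used in this paragraph (one action accesses the low-priority queue, the other the high-priority queue); the function and variable names are illustrative, and the low-priority queue is assumed non-empty because packets keep arriving.

```python
from collections import deque

# Both priority queues start with 5 packets, as described above.
high_queue = deque(range(5))
low_queue = deque(range(5))

def apply_access_rules(action, collided):
    """Update the queues after a slot, following the rules described above."""
    if action == "ACCESS_LOW":
        packet = low_queue.popleft()
        # On collision the low-priority packet is simply discarded.
        return None if collided else packet
    if action == "ACCESS_HIGH":
        if high_queue:
            packet = high_queue.popleft()
            if collided:
                high_queue.appendleft(packet)    # re-queue at the head of the queue
                return None
            return packet
        # High-priority queue empty: a low-priority packet is accessed instead
        # (success then earns the low-priority reward; failure earns the
        # high-priority penalty and the packet returns to the low-priority head).
        packet = low_queue.popleft()
        if collided:
            low_queue.appendleft(packet)
            return None
        return packet
    return None
```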
Fig. 6 (f) is a result obtained in a simulation scenario corresponding to different parameters α under coexistence of a DRL-MAC node and a TDMA node in an unsaturated service scenario, where each simulation result is divided into three parts in the diagram: system throughput, high priority service access probability, and system delay.
It can be seen from fig. 6 (f) that the transmission delays of the different priority services for the different values of the parameter α, shown in the lower left corner of each simulation graph, remain almost unchanged when α decreases from 1 to 0.8, while the maximum delay of the low-priority traffic steps down from a value close to 80 to a value close to 60. Fig. 6 (f) shows that in this scenario, by adjusting the parameter α, the learned policy still achieves the optimal system throughput; and, on the premise that the access probability of high-priority traffic remains larger than that of low-priority packets, the overall delay of the system is reduced by sacrificing the access probability of some high-priority packets.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the embodiments are implemented in the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (8)

1. The channel access method of the multi-priority wireless terminal based on the deep reinforcement learning is characterized by comprising the following steps of:
step one, establishing network scenes with different priority services;
step two, designing and defining a system model of a protocol, carrying out state space modeling and action space modeling according to a network scene of the protocol, and designing reward functions aiming at different scenes;
step three, a neural network model used by the protocol is definitely and established, and the network model is trained through an experience tuple;
step four, performing performance verification on the trained model through simulation comparison of multiple scenes;
in the second step, designing and defining the system model of the protocol, performing state space modeling and action space modeling according to the network scenario of the protocol, and designing reward functions for different scenarios comprises:
(1) Action space modeling
the system action set is A_t = {a_0, a_1, a_2, ..., a_k}, where k is the number of priority classes of traffic in the network scenario; in time slot t, the DRL-MAC node needs to make a decision a through the deep neural network to determine whether to access a data packet to the channel in the current time slot; a_0 indicates not accessing the channel, and a_1, ..., a_k indicate accessing the channel with the service of the corresponding priority among the k priorities;
the channel observation result after taking action is Z t E { SUCCESS, COLLISION, IDLENESS }, obtaining channel observations for the construction of the experience tuples by listening to the channel; wherein SUCCESS indicates that the node accesses the channel and transmits the data packet successfully; COLLISION means that a plurality of nodes access channels simultaneously for transmission, resulting in COLLISION; IDLENESS indicates that the channel is idle, i.e., no node accesses the channel; the Agent determines a channel observation result according to the acknowledgement signal from the access point and the interception channel;
(2) State space modeling
the state set contains the M history states to be tracked; each history state consists of an action-observation pair, of which there are 2k+3 possible combinations in total;
(3) Reward function
the principle that the reward function always follows for network scenarios with services of different priorities is: the higher the priority of a service, the higher the reward brought by its successful transmission and the larger the penalty brought by its transmission failure; the reward function is set to sum_reward = α × reward + (1 - α) × (delay / T), where α and T are controllable variables; the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of delay is not considered; the parameter T unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value defined for services of different priorities, with r_1, ..., r_k being the rewards for successful transmission of services of the different priorities and r_{-1}, ..., r_{-k} being the penalties for their transmission failure;
in the third step, the neural network model used by the protocol is defined and built, and the neural network model is trained through experience tuples, and the method further comprises the following steps:
(1) Initializing an experience pool, setting the capacity of the experience pool and initializing parameters;
(2) Starting from time slot t=0;
(3) The current state s is fed into the neural network to calculate its Q value; an action is selected through the greedy strategy, and the channel observation result z and the total reward sum_reward obtained after taking the action are recorded; the experience tuple (s, a, r, s') formed by the obtained state s, the reward r obtained after taking action a and the next state s' is put into the experience pool;
(4) If the number of generated experience tuples exceeds the experience pool capacity, the experience tuple that entered the pool earliest is discarded and the latest experience tuple is put into the pool; otherwise, the experience tuples enter the experience pool in order;
(5) If the current time slot t is a multiple of 10, N experience tuples are randomly drawn from the experience pool and their y values are calculated in turn as y = r + γ · max_a' q(s', a'; θ⁻), where r is the reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_a' q(s', a'; θ⁻) is the predicted future reward of the action with the maximum Q value; otherwise, go to step (8);
(6) The Q-estimation network parameters θ are updated using a semi-gradient descent algorithm;
(7) If the current time slot t is a multiple of the parameter F, the Q-estimation network parameters θ are copied to the target network parameters θ⁻; otherwise, go to step (8);
(8) If the time slot t is greater than or equal to the set number of training rounds, the training process exits; otherwise, the next time slot t = t + 1 begins and the procedure returns to step (3).
2. The method for channel access in a multi-priority wireless terminal based on deep reinforcement learning as set forth in claim 1, wherein in the first step, the establishing a network scenario with different priority services includes:
establishing a transmission network with k priority services, where k is greater than 0; the network scenario comprises a base station, N DRL-MAC nodes, M time division multiple access (Time Division Multiple Access, TDMA) nodes and X q-ALOHA nodes, where N ≥ 1 and M + X ≥ 1, i.e., there is at least one DRL-MAC node and at least one node running another protocol;
the base station is used for acquiring data from a wireless channel between the node and the base station and transmitting the data; the DRL-MAC node adopts a multiple access technology based on deep reinforcement learning, and if the node sends services with different priorities, a transmission result fed back by a base station is obtained and different rewards are obtained according to the services with different priorities; if the node does not send the service, the node will monitor the channel and obtain the transmission state of other nodes in a certain time slot through the channel observation result; the time division multiple access node adopts a TDMA protocol and is used for carrying out service transmission according to regularly and periodically occupying the allocated time slot; the q-ALOHA node adopts q-ALOHA for carrying out service transmission in each time slot with fixed transmission probability q according to q values under different scenes.
3. The method for channel access in a multi-priority wireless terminal based on deep reinforcement learning of claim 1, wherein in step two, each DRL-MAC node corresponds to an agent in reinforcement learning; in each time slot t, the agent calculates the Q value q(s_t, a; θ) for the current state, where q(s_t, a; θ) is the approximation computed by the deep neural network model, and action a is selected from the action set according to a greedy strategy, so as to maximize the overall expected reward and better adapt to the dynamically changing wireless network environment.
4. The method for channel access in a multi-priority wireless terminal based on deep reinforcement learning as set forth in claim 1, wherein in step three, the defining and establishing a neural network model used by the protocol and training the network model through an experience tuple comprises:
DQN is introduced so that the DRL-MAC node can better learn how other nodes use the channel and make the next decision; the agent is trained with a deep residual network architecture, the deep residual network approximates the Q value, takes the current state s as input and outputs the action policy a, and an experience tuple is then formed together with other information and used to train the deep residual network.
5. A channel access system of a multi-priority service wireless terminal implementing the channel access method of a multi-priority wireless terminal based on deep reinforcement learning as set forth in any one of claims 1 to 4, wherein the channel access system of a multi-priority service wireless terminal comprises:
the network scene establishing module is used for establishing network scenes with different priority services;
the system model design module is used for designing and defining a system model of the protocol;
the space modeling module is used for carrying out state space modeling and action space modeling according to the protocol network scene;
the rewarding function design module is used for designing rewarding functions aiming at different scenes;
the neural network model building module is used for defining and establishing the neural network model used by the protocol;
the network model training module is used for training the network model through the experience tuple;
the performance verification module is used for verifying the performance of the trained model through simulation comparison of multiple scenes;
the designing and defining of the system model of the protocol, the state space modeling and action space modeling according to the protocol network scenario, and the designing of reward functions for different scenarios comprise the following:
(1) Action space modeling
The system action set is A_t = {a_0, a_1, a_2, ..., a_k}, wherein k is the number of priority service types in the network scenario; in time slot t, the DRL-MAC node makes a decision a through the deep neural network to determine whether to access the channel with a data packet in the current time slot; a_0 indicates that the channel is not accessed, and a_1, ..., a_k indicate that the channel is accessed with the service of the corresponding one of the k priorities;
the channel observation result obtained after taking an action is Z_t ∈ {SUCCESS, COLLISION, IDLENESS}, and the channel observation used to construct the experience tuple is obtained by listening to the channel; SUCCESS indicates that the node accessed the channel and transmitted the data packet successfully; COLLISION indicates that several nodes accessed the channel simultaneously, resulting in a collision; IDLENESS indicates that the channel is idle, i.e., no node accessed the channel; the agent determines the channel observation result according to the acknowledgement signal from the access point and by listening to the channel;
(2) State space modeling
The state set contains the M most recent history states to be tracked; each history state is composed of an action-observation pair, of which there are a total of 2k+3 possible combinations.
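For illustration, one possible one-hot encoding of the M tracked action-observation pairs into a state vector is sketched below; only the dimensionality 2k+3 comes from the text above, while the particular index-to-combination mapping is an assumption made for the example.

```python
import numpy as np

def encode_history(history, k, M):
    """One-hot encode the last M action-observation pairs into the state vector.
    Each pair is mapped to one of the 2k + 3 combinations; the index assignment
    used here is an illustrative assumption:
      0             -> (no access, observed IDLENESS)
      1             -> (no access, observed SUCCESS of another node)
      2             -> (no access, observed COLLISION)
      3 .. k+2      -> (priority-i access, SUCCESS)
      k+3 .. 2k+2   -> (priority-i access, COLLISION)"""
    dim = 2 * k + 3
    state = np.zeros((M, dim), dtype=np.float32)
    for row, combo in enumerate(history[-M:]):
        state[row, combo] = 1.0
    return state.flatten()

# k = 2 priority classes, M = 4 tracked slots; indices are hypothetical combination ids.
print(encode_history([0, 3, 1, 5], k=2, M=4).shape)  # (M * (2k + 3),) = (28,)
```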
(3) Reward function
The principle that the reward function always follows for network scenarios with different priority services is: the higher the priority of a service, the larger the reward brought by its successful transmission and the larger the penalty brought by its failed transmission; the reward function is set to sum_rewards = α × rewards + (1 − α) × delay / T, wherein α and T are controllable variables; the parameter α adjusts the influence of the time delay on the overall reward and is initialized to 1 when the influence of the delay on the overall reward is not considered; the parameter T unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time delay of a service from its generation until it accesses the channel, i.e., the scheduling delay; rewards is the reward value assigned to services of different priorities, with r_1, ..., r_k being the rewards corresponding to successful transmission of the services of different priorities and r_-1, ..., r_-k being the penalties corresponding to their transmission failure.
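A minimal sketch of this reward computation, assuming the reconstructed form sum_rewards = α × rewards + (1 − α) × delay / T stated above; the numerical reward and penalty values are purely illustrative.

```python
def sum_rewards(priority, success, delay, r_success, r_fail, alpha=1.0, T=50):
    """Reward of one transmission attempt; with alpha = 1 the scheduling
    delay has no influence on the overall reward."""
    rewards = r_success[priority] if success else r_fail[priority]
    return alpha * rewards + (1.0 - alpha) * delay / T

# Illustrative values only: a higher priority gets a larger reward and a larger penalty.
r_success = {1: 1.0, 2: 2.0}    # r_1, r_2
r_fail    = {1: -1.0, 2: -2.0}  # r_-1, r_-2
print(sum_rewards(priority=2, success=True, delay=8,
                  r_success=r_success, r_fail=r_fail, alpha=0.8))
```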
the defining and establishing of the neural network model used by the protocol and the training of the network model through experience tuples further comprise:
(1) Initializing an experience pool, setting the capacity of the experience pool and initializing parameters;
(2) Starting from time slot t=0;
(3) The current state s is fed into the neural network to calculate its Q values; the action to be executed is selected through the greedy strategy, and the channel observation result z and the total reward sum_rewards obtained after taking the action are recorded; the experience tuple (s, a, r, s'), composed of the current state s, the reward r obtained after taking action a, and the next state s', is put into the experience pool;
(4) If the number of generated experience tuples exceeds the experience pool capacity, the experience tuple that entered the pool earliest is discarded and the newest experience tuple is put into the pool; otherwise, the experience tuples enter the pool in sequence;
(5) If the current time slot t is a multiple of 10, N experience tuples are randomly sampled from the experience pool and the y value of each tuple is calculated in turn as y = r + γ · max_a' q(s', a', θ⁻), where r is the reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_a' q(s', a', θ⁻) is the predicted future reward of the action with the maximum Q value under the target network; otherwise, go to step (8);
(6) The Q-estimation network parameter θ is updated using the semi-gradient descent algorithm;
(7) If the current time slot t is a multiple of the parameter F, the Q-estimation network parameter θ is assigned to the target network parameter θ⁻; otherwise, go to step (8);
(8) If the time slot t is greater than or equal to the set number of training rounds, the training process is exited; otherwise, the next time slot t = t + 1 is entered and the procedure returns to step (3).
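Steps (1) to (8) above describe a standard DQN training loop with an experience pool and a periodically synchronized target network. The compact sketch below is one possible realization, assuming a hypothetical environment object env with reset()/step() methods, the QNetwork sketch given earlier, and illustrative hyperparameters (pool capacity, batch size N, γ, update period F, learning rate).

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as nnF   # renamed to avoid clashing with the update period F

def train(env, q_net, target_net, num_slots, num_actions,
          pool_capacity=500, batch_size=32, gamma=0.9, F=20, lr=1e-3, eps=0.05):
    """Sketch of steps (1)-(8); env, its reset()/step() interface and all
    hyperparameters are assumptions made for the example."""
    pool = deque(maxlen=pool_capacity)            # (1) experience pool; the deque drops the
                                                  #     oldest tuple automatically, cf. step (4)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    target_net.load_state_dict(q_net.state_dict())
    s = env.reset()
    for t in range(num_slots):                    # (2) start at t = 0; (8) stop after num_slots
        with torch.no_grad():                     # (3) Q values of the current state s
            q = q_net(torch.as_tensor(s).unsqueeze(0)).squeeze(0)
        a = int(torch.argmax(q)) if random.random() > eps else random.randrange(num_actions)
        s_next, r, _z = env.step(a)               # r is assumed to already be sum_rewards
        pool.append((s, a, r, s_next))
        if t % 10 == 0 and len(pool) >= batch_size:            # (5) sample N tuples
            batch = random.sample(pool, batch_size)
            bs  = torch.as_tensor(np.stack([b[0] for b in batch]))
            ba  = torch.as_tensor([b[1] for b in batch])
            br  = torch.as_tensor([b[2] for b in batch], dtype=torch.float32)
            bs2 = torch.as_tensor(np.stack([b[3] for b in batch]))
            with torch.no_grad():                 # y = r + gamma * max_a' q(s', a', theta-)
                y = br + gamma * target_net(bs2).max(dim=1).values
            q_sa = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
            loss = nnF.mse_loss(q_sa, y)          # (6) gradient step on the estimation network
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if t % F == 0:                            # (7) copy theta into the target network
            target_net.load_state_dict(q_net.state_dict())
        s = s_next
```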
6. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the channel access method of a deep reinforcement learning based multi-priority wireless terminal of any of claims 1-4.
7. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the channel access method of a deep reinforcement learning based multi-priority wireless terminal of any of claims 1 to 4.
8. A wireless communication data processing terminal for implementing a channel access system for a multi-priority service wireless terminal according to claim 5.
CN202110781263.0A 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning Active CN113613339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110781263.0A CN113613339B (en) 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110781263.0A CN113613339B (en) 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113613339A CN113613339A (en) 2021-11-05
CN113613339B true CN113613339B (en) 2023-10-17

Family

ID=78304401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110781263.0A Active CN113613339B (en) 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113613339B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114024639B (en) * 2021-11-09 2024-01-05 成都天软信息技术有限公司 Distributed channel allocation method in wireless multi-hop network
CN114375022B (en) * 2022-01-08 2024-03-12 山东大学 Channel preemption method based on multi-agent reinforcement learning in wireless network
CN114826986B (en) * 2022-03-30 2023-11-03 西安电子科技大学 Performance analysis method for ALOHA protocol with priority frameless structure
CN114938530B (en) * 2022-06-10 2023-03-21 电子科技大学 Wireless ad hoc network intelligent networking method based on deep reinforcement learning
CN115134060A (en) * 2022-06-20 2022-09-30 京东科技控股股份有限公司 Data transmission method and device, electronic equipment and computer readable medium
CN115315020A (en) * 2022-08-08 2022-11-08 重庆邮电大学 Intelligent CSMA/CA (Carrier sense multiple Access/Carrier aggregation) backoff method based on IEEE (institute of Electrical and electronics Engineers) 802.15.4 protocol of differentiated services
CN115767785B (en) * 2022-10-22 2024-02-27 西安电子科技大学 MAC protocol switching method based on deep reinforcement learning in self-organizing network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110049018A (en) * 2019-03-25 2019-07-23 上海交通大学 SPMA protocol parameter optimization method, system and medium based on enhancing study
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN111711666A (en) * 2020-05-27 2020-09-25 梁宏斌 Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
He Jiang; Haibo He. Q-Learning for Non-Cooperative Channel Access Game of Cognitive Radio Networks. IEEE, 2018, full text. *
Q-learning optimization algorithm for dynamic spectrum access; Huang Ying; Yan Dingyu; Li Nan; Journal of Xidian University (No. 6); full text *

Also Published As

Publication number Publication date
CN113613339A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113613339B (en) Channel access method of multi-priority wireless terminal based on deep reinforcement learning
Kong et al. Performance analysis of IEEE 802.11 e contention-based channel access
EP3637708B1 (en) Network congestion processing method, device, and system
Ali et al. Performance evaluation of heterogeneous IoT nodes with differentiated QoS in IEEE 802.11 ah RAW mechanism
US7843943B2 (en) Differentiation for bandwidth request contention
CN111294775B (en) Resource allocation method based on H2H dynamic characteristics in large-scale MTC and H2H coexistence scene
Inan et al. Analysis of the 802.11 e enhanced distributed channel access function
CN112543508A (en) Wireless resource allocation method and network architecture for 5G network slice
CN111601398B (en) Ad hoc network medium access control method based on reinforcement learning
CN107835517B (en) Long-distance CSMA/CA method with QoS guarantee
Hu et al. Performance and reliability analysis of prioritized safety messages broadcasting in DSRC with hidden terminals
Mancuso et al. Slicing cell resources: The case of HTC and MTC coexistence
Shoaei et al. Reconfigurable and traffic-aware MAC design for virtualized wireless networks via reinforcement learning
AlQahtani Performance analysis of cognitive‐based radio resource allocation in multi‐channel LTE‐A networks with M2M/H2H coexistence
CN117118855A (en) Data link SPMA access method based on machine learning priority prediction
CN114845338A (en) Random back-off method for user access
Bensaou et al. A measurement-assisted, model-based admission control algorithm for IEEE 802.11 e
Ahmed et al. A QoS-aware scheduling with node grouping for IEEE 802.11 ah
Hui et al. Delay analysis of reservation based random access: A tandem queue model
Mehri et al. RACH Traffic Prediction in Massive Machine Type Communications
Raeis et al. Distributed fair scheduling for information exchange in multi-agent systems
Li et al. Optimal parameter selection for discrete-time throughput-optimal MAC protocols
Conti et al. Design and performance evaluation of a MAC protocol for wireless local area networks
Peng et al. An enhanced reliable access scheme for massive IoT applications in ubiquitous IoT systems
CN114980353A (en) Ordered competition large-scale access learning method for machine type communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant