CN113613339A - Channel access method of multi-priority wireless terminal based on deep reinforcement learning - Google Patents


Info

Publication number
CN113613339A
CN113613339A (application number CN202110781263.0A)
Authority
CN
China
Prior art keywords
network
channel
priority
protocol
reward
Prior art date
Legal status
Granted
Application number
CN202110781263.0A
Other languages
Chinese (zh)
Other versions
CN113613339B (en)
Inventor
孙红光
高银洁
张宏鸣
李书琴
徐超
高振宇
Current Assignee
Northwest A&F University
Original Assignee
Northwest A&F University
Priority date
Filing date
Publication date
Application filed by Northwest A&F University
Priority to CN202110781263.0A
Publication of CN113613339A
Application granted
Publication of CN113613339B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W74/00: Wireless channel access
    • H04W74/04: Scheduled access
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the technical field of wireless communication and discloses a channel access method for multi-priority wireless terminals based on deep reinforcement learning, comprising the following steps: establishing network scenarios with services of different priorities; designing and defining the system model of the protocol, modeling the state space and the action space according to the protocol's network scenario, and designing reward functions for different scenarios; defining and establishing the neural network model used by the protocol and training the network model with experience tuples; and verifying the performance of the trained model through multi-scenario simulation comparison. The invention uses deep reinforcement learning to design a channel access method for wireless terminals carrying multi-priority services. The method is better suited to wireless networks with services of different priorities, improves system throughput and the utilization of wireless channel resources, and increases the channel access opportunities of low-priority services while reducing the scheduling delay of high-priority services.

Description

Channel access method of multi-priority wireless terminal based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a channel access method of a multi-priority wireless terminal based on deep reinforcement learning.
Background
Currently, with the rapid development of wireless communication technology, the demand of emerging services such as data transmission and exchange for wireless channels keeps increasing. In a wireless network, when multiple users contend simultaneously for a specific resource (for example, a shared channel), each user transmits its data packets by acquiring the right to use the channel. Because transmissions from different users must occupy the same channel, data packets may collide and communication may fail. To improve communication efficiency, a multiple access protocol must be introduced to determine each user's right to use the resource and to solve the problem of multiple users sharing the same physical link resource.
Yiding Yu et al. proposed a Deep-reinforcement-Learning-based Multiple Access protocol (DLMA) for heterogeneous wireless networks. Their work presents different deep-reinforcement-learning multiple access algorithms for different optimization objectives and compares the algorithms through system simulation. The simulation results show that DLMA can achieve the intended objectives without knowing the multiple access protocols adopted by the other coexisting networks, thereby improving the proportional fairness and the throughput of the system. However, existing channel access methods based on deep reinforcement learning do not consider the differing quality-of-service requirements of services with different priorities and therefore cannot adequately guarantee the requirements of high-priority services.
Conventional multiple access protocols such as Time Division Multiple Access (TDMA), ALOHA, Code Division Multiple Access (CDMA), and Carrier Sense Multiple Access (CSMA) all suffer from low channel utilization. The invention therefore targets these problems of multiple access protocols and designs a channel access method for multi-priority wireless terminals based on deep reinforcement learning in a network scenario with multi-priority services, so as to reduce the scheduling delay of high-priority services under the constraint of maintaining good system throughput.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) In a wireless network, when multiple users compete simultaneously for the right to use a specific resource, each user sends data packets by acquiring the right to use the channel; since transmissions from different users must occupy the same channel, data packets may collide, causing communication failure.
(2) Existing channel access methods based on deep reinforcement learning do not consider the differing quality-of-service requirements of services with different priorities and cannot adequately guarantee the requirements of high-priority services.
(3) Traditional multiple access methods for multi-priority services cannot fully utilize wireless channel resources, which causes resource waste.
The difficulty and significance of solving these problems and defects are as follows: the invention provides a channel access method for multi-priority wireless terminals based on deep reinforcement learning, which gives the wireless terminal learning capability and uses a reinforcement learning mechanism to interact with the environment, thereby further improving the utilization of wireless channel resources; in the design of the reward function, different rewards are set for services of different priorities, which reduces the scheduling delay of high-priority services while increasing the channel access opportunities of low-priority services.
Disclosure of Invention
Aiming at the problems of existing multiple access protocols, the invention provides a channel access method for multi-priority wireless terminals based on deep reinforcement learning.
The invention is realized in such a way that a channel access method of a multi-priority wireless terminal based on deep reinforcement learning comprises the following steps:
Step one, establishing network scenarios with services of different priority levels; once the network scenario is determined, the user can modify the system model and the neural network model for different network scenarios so as to deploy them in those scenarios.
Step two, designing and defining the system model of the protocol, modeling the state space and the action space according to the protocol's network scenario, and designing reward functions for different scenarios; the state space and the action space are fine-tuned for different network scenarios, and the form of the reward function is designed and modified according to the different emphases (the core of reward function design, e.g., priority), so that the method better adapts to those emphases under different network scenarios and better meets the requirements of actual deployment.
Step three, defining and establishing the neural network model used by the protocol, and training the network model with experience tuples; different neural network models converge at different speeds during training, so a better neural network model is selected by comparison, making the training process faster and more accurate.
Step four, verifying the performance of the trained model through multi-scenario simulation comparison; the feasibility and advantages of the invention are verified and explained by comparing it in multiple scenarios with multiple access protocols that do not distinguish service priorities.
Further, in step one, the establishing a network scenario with services of different priorities includes:
establishing a transmission network with k priority services, where k > 0; the network scenario comprises a base station, N DRL-MAC nodes (nodes adopting the method of the invention), M time division multiple access (TDMA) nodes, and X q-ALOHA nodes, where N ≥ 1 and M + X ≥ 1, i.e., the scenario contains at least one DRL-MAC node and at least one node running another protocol.
The base station acquires data from the wireless channel between the nodes and the base station and transmits data. A DRL-MAC node adopts the multiple access technique based on deep reinforcement learning: if the node sends traffic of some priority, it obtains the transmission result fed back by the base station and receives a different reward depending on the priority of the traffic; if the node sends no traffic, it listens to the channel and learns the transmission state of the other nodes in that time slot from the channel observation. A time division multiple access node adopts the TDMA protocol and transmits traffic in its regularly and periodically allocated time slots. A q-ALOHA node adopts q-ALOHA and transmits traffic in every time slot with a fixed transmission probability q, according to the q value of the scenario.
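To make the coexistence scenario concrete, the following is a minimal sketch of how the behaviour of the coexisting TDMA and q-ALOHA nodes could be simulated slot by slot; the class names, the frame length of 10, and the transmit() interface are illustrative assumptions, not details specified by the invention.

```python
import random

class TDMANode:
    """Transmits in its fixed, periodically assigned slots of each frame."""
    def __init__(self, assigned_slots, frame_length=10):
        self.assigned_slots = set(assigned_slots)
        self.frame_length = frame_length

    def transmit(self, t):
        # Transmit whenever the slot index within the frame is one of the assigned slots.
        return (t % self.frame_length) in self.assigned_slots


class QAlohaNode:
    """Transmits in every slot with a fixed probability q."""
    def __init__(self, q):
        self.q = q

    def transmit(self, t):
        return random.random() < self.q
```

Per time slot, the base station would then observe SUCCESS if exactly one node transmitted, COLLISION if several transmitted, and IDLENESS if none did.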
Further, in the second step, each DRL-MAC node is equivalent to an agent in reinforcement learning; in each time slot, the agent calculates the Q value in the current state by:
q(s_t, a; θ), a ∈ A_t
where q(s_t, a; θ) is the approximation of the action value computed by the deep neural network model, and the action a_t = argmax_{a ∈ A_t} q(s_t, a; θ) is selected from the action set according to a greedy strategy so as to maximize the overall expectation of the reward or to better adapt to a dynamically changing wireless network environment.
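As an illustration of the greedy selection just described, the sketch below picks the action with the largest Q value and optionally explores with a small probability ε; the function signature and the use of a NumPy array of Q values are assumptions, not part of the specification.

```python
import random
import numpy as np

def select_action(q_values, epsilon=0.0):
    """Greedy action selection over q(s_t, a; theta).

    q_values: 1-D array of length k+1 holding one Q value per action a_0 ... a_k.
    With probability epsilon a random action is explored; epsilon = 0 is pure greedy.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # exploratory action
    return int(np.argmax(q_values))              # a_t = argmax_a q(s_t, a; theta)
```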
Further, in the second step, the designing and defining the system model of the protocol, performing state space modeling and action space modeling according to the network scene of the protocol, and designing the reward function for different scenes includes:
(1) Action space modeling
The system action set is A_t = {a_0, a_1, a_2, ..., a_k}, where k is the number of priority levels of traffic in the network scenario. In time slot t, the DRL-MAC node makes a decision a through the deep neural network to determine whether a data packet accesses the channel in the current time slot; a_0 indicates no access to the channel, and a_1, ..., a_k indicate access to the channel by traffic of the corresponding one of the k priorities.
The channel observation obtained after taking an action is Z_t ∈ {SUCCESS, COLLISION, IDLENESS}; channel observations are obtained by monitoring the channel and are used to form experience tuples. SUCCESS indicates that the node accessed the channel and transmitted its data packet successfully; COLLISION indicates that multiple nodes accessed the channel simultaneously, causing a collision; IDLENESS indicates that the channel is idle, i.e., no node accessed the channel. The Agent determines the channel observation from the acknowledgement signal from the access point and from listening to the channel.
(2) state space modeling
The state s_t = [c_{t-M+1}, ..., c_t] contains the M historical states to be tracked, where each historical state is an action-observation pair c_t = (a_t, Z_t). In total there are 2k + 3 possible combinations of action-observation pairs:
{(a_0, SUCCESS), (a_0, COLLISION), (a_0, IDLENESS)} ∪ {(a_i, SUCCESS), (a_i, COLLISION) | i = 1, ..., k}
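For illustration, one way to turn this state definition into a numerical network input is to one-hot encode each of the M most recent action-observation pairs over the 2k + 3 combinations and concatenate the results; this particular encoding is an assumption, since the specification does not fix a numerical representation.

```python
import numpy as np

def encode_state(history, k, M):
    """Encode the last M action-observation pairs as a flat vector.

    history: list of (action, observation) tuples, action in 0..k,
             observation in {"SUCCESS", "COLLISION", "IDLENESS"}.
    Returns a vector of length M * (2k + 3).
    """
    # Enumerate the 2k+3 valid combinations: waiting (a_0) can see all three outcomes,
    # transmitting priority i can only see SUCCESS or COLLISION.
    combos = [(0, "SUCCESS"), (0, "COLLISION"), (0, "IDLENESS")]
    for i in range(1, k + 1):
        combos += [(i, "SUCCESS"), (i, "COLLISION")]
    index = {c: j for j, c in enumerate(combos)}

    state = np.zeros((M, len(combos)), dtype=np.float32)
    for row, pair in enumerate(history[-M:]):
        state[row, index[pair]] = 1.0
    return state.flatten()
```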
(3) reward function
For network scenarios with services of different priorities, the reward function always follows the principle that the higher the priority of a service, the larger the reward brought by its successful transmission and the larger the penalty brought by its transmission failure. The reward function is set as sum_reward = α × reward + (1 − α) × (delay / T), where α and T are controllable variables: the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of the delay is not considered; the parameter T unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value defined for the different priority services, with r_1, ..., r_k being the rewards for successful transmission of services of each priority and r_{-1}, ..., r_{-k} being the penalties for transmission failure of services of each priority:
reward = r_i   if a packet of priority i is transmitted successfully
reward = r_{-i}   if the transmission of a packet of priority i fails
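A minimal sketch of this reward computation follows; the concrete values of r_1, r_2 and r_{-1}, r_{-2} are placeholders for illustration, and the assumption that the no-access action a_0 yields a base reward of zero is mine, not stated in the specification.

```python
def compute_reward(priority, success, delay, alpha=1.0, T=50,
                   success_rewards=(2.0, 1.0),       # hypothetical r_1, r_2 (priority 1 = highest)
                   failure_penalties=(-2.0, -1.0)):  # hypothetical r_-1, r_-2
    """Return sum_reward = alpha * reward + (1 - alpha) * (delay / T)."""
    if priority == 0:            # action a_0: no access (assumed zero base reward)
        reward = 0.0
    elif success:
        reward = success_rewards[priority - 1]
    else:
        reward = failure_penalties[priority - 1]
    # With alpha initialized to 1 the delay term is ignored, matching the description above.
    return alpha * reward + (1 - alpha) * (delay / T)
```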
further, in step three, the defining and establishing the neural network model used by the protocol, and training the network model through the experience tuple include:
the DQN is introduced to enable the DRL-MAC node to better learn the next decision of other nodes on the use condition of a channel, the intelligent agent adopts a deep residual error network architecture for training, the deep residual error network is used for approaching a Q value, the current state s is input, an action strategy a is output, and then an experience tuple is formed by combining other information and used for training the deep residual error network.
Further, in step three, the defining and establishing a neural network model used by the protocol, and training the network model through an experience tuple further includes:
(1) initializing an experience pool, setting the capacity of the experience pool, and initializing parameters;
(2) starting from time slot t = 0;
(3) feeding the current state s into the neural network to calculate the Q values of that state; selecting the action to execute through a greedy strategy, and recording the channel observation z and the total reward sum_reward obtained after the action is taken; forming the experience tuple (s, a, r, s′) from the current state s, the action a taken, the reward r obtained, and the next state s′ reached, and putting it into the experience pool;
(4) if the current generated experience tuple is larger than the capacity of the experience pool, discarding the experience tuple which enters the experience pool earliest, and putting the latest experience tuple into the experience pool; otherwise, sequentially entering the experience tuples into an experience pool according to the sequence;
(5) if the current time slot t is a multiple of 10, randomly extracting N experience tuples from the experience pool, and sequentially calculating the y values of the experience tuples:
y = r + γ · max_{a′} q(s′, a′; θ⁻)
where r denotes the current reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_{a′} q(s′, a′; θ⁻) selects the reward obtained by the action with the maximum Q value for future prediction; otherwise, go to step (8);
(6) updating the Q-estimation network parameter θ using a semi-gradient descent algorithm;
(7) if the current time slot t is a multiple of the parameter F, assigning the Q-estimation network parameter θ to the target network parameter θ⁻; otherwise, go to step (8);
(8) if the time slot t is greater than or equal to the set number of training rounds, the training process is exited; otherwise, entering the next time slot t = t + 1 and returning to step (3).
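Steps (1) through (8) amount to a standard DQN training loop with an experience pool, a Q-estimation network updated every 10 slots from N sampled tuples against the target y = r + γ · max_{a′} q(s′, a′; θ⁻), and a target network synchronized every F slots. The condensed sketch below follows that structure; the environment interface (env.reset/env.step), the Adam optimizer, and all hyperparameter values are assumptions for illustration only.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

def train(env, q_net, target_net, num_slots=50000, capacity=10000,
          batch_size=32, gamma=0.9, F=100, epsilon=0.1, lr=1e-3):
    buffer = deque(maxlen=capacity)                   # (1) experience pool; oldest tuples dropped first, as in (4)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    state = env.reset()
    for t in range(num_slots):                        # (2) start from slot t = 0; (8) stop after the set rounds
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))   # (3) Q values of the current state
        if random.random() < epsilon:
            action = random.randrange(q_values.numel())
        else:
            action = int(q_values.argmax())
        next_state, reward = env.step(action)         # hypothetical env: returns next state and sum_reward
        buffer.append((state, action, reward, next_state))
        state = next_state

        if t % 10 == 0 and len(buffer) >= batch_size: # (5) every 10 slots, sample N tuples from the pool
            batch = random.sample(buffer, batch_size)
            s, a, r, s2 = zip(*batch)
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            with torch.no_grad():                     # y = r + gamma * max_a' q(s', a'; theta-)
                y = r + gamma * target_net(s2).max(dim=1).values
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, y)
            optimizer.zero_grad()
            loss.backward()                           # (6) gradient step on the Q-estimation parameters theta
            optimizer.step()

        if t % F == 0:                                # (7) copy theta into the target network parameters theta-
            target_net.load_state_dict(q_net.state_dict())
```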
Another object of the present invention is to provide a channel access system of a multi-priority service wireless terminal, in which the channel access method of a multi-priority wireless terminal based on deep reinforcement learning is applied, the channel access system of the multi-priority service wireless terminal including:
the network scene establishing module is used for establishing network scenes with different priority level services;
the system model design module is used for designing and defining a system model of the protocol;
the space modeling module is used for carrying out state space modeling and action space modeling according to the protocol network scene;
the reward function design module is used for designing reward functions aiming at different scenes;
the neural network model establishing module is used for determining and establishing a neural network model used by the protocol;
the network model training module is used for training the network model through experience tuples;
and the performance verification module is used for performing performance verification on the trained model through multi-scene simulation comparison.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through experience tuples; and performing performance verification on the trained model through multi-scene simulation comparison.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through experience tuples; and performing performance verification on the trained model through multi-scene simulation comparison.
Another object of the present invention is to provide a wireless communication information data processing terminal for implementing a channel access system of the multi-priority service wireless terminal.
By combining all of the above technical schemes, the invention has the following advantages and positive effects. The channel access method for multi-priority-service wireless terminals based on deep reinforcement learning (DRL-MAC) provided by the invention is realized by establishing network scenarios with services of different priorities; designing and defining the system model of the protocol, modeling the state space and the action space according to the protocol's network scenario, and designing reward functions for different scenarios; defining and establishing the neural network model used by the protocol and training the network model with experience tuples; and verifying the performance of the trained model through multi-scenario simulation comparison, so as to reduce the scheduling delay of high-priority services under the constraint of guaranteeing system throughput. The invention uses deep reinforcement learning to design a channel access method for multi-priority-service wireless terminals; it is better suited to wireless networks with services of different priorities, improves system throughput, and reduces the scheduling delay of high-priority services.
Aiming at wireless networks with different priority services, the invention provides a channel access method of a multi-priority service wireless terminal based on deep reinforcement learning, which endows the wireless terminal with learning capability, utilizes a reinforcement learning mechanism to interact with the environment, and can further improve the utilization rate of wireless channel resources; in the design of the reward function, different rewards are set for different priority services, so that the scheduling time delay of high priority services can be reduced, and the opportunity of accessing the low priority services to a channel can be improved. Compared with a multiple access protocol of a service without priority, the result shows that the channel access method of the multi-priority service wireless terminal based on deep reinforcement learning has better system throughput and scheduling delay of a high-priority service.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a channel access method for a multi-priority service wireless terminal according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a channel access method of a multi-priority service wireless terminal according to an embodiment of the present invention.
Fig. 3 is a block diagram of a channel access system of a multi-priority service wireless terminal according to an embodiment of the present invention;
in the figure: 1. a network scene establishing module; 2. a system model design module; 3. a spatial modeling module; 4. a reward function design module; 5. a neural network model building module; 6. a network model training module; 7. and a performance verification module.
Fig. 4 is a flowchart of deep neural network training based on deep reinforcement learning according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network model according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of simulation results provided by the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a channel access method for a multi-priority wireless terminal based on deep reinforcement learning, and the following describes the present invention in detail with reference to the accompanying drawings.
As shown in fig. 1, a channel access method for a deep reinforcement learning-based multi-priority wireless terminal according to an embodiment of the present invention includes the following steps:
s101, establishing network scenes with different priority services;
s102, designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes;
s103, defining and establishing a neural network model used by the protocol, and training the network model through experience tuples;
and S104, performing performance verification on the trained model through multi-scene simulation comparison.
A schematic diagram of a channel access method for a multi-priority wireless terminal based on deep reinforcement learning according to an embodiment of the present invention is shown in fig. 2.
As shown in fig. 3, a channel access system of a multi-priority service wireless terminal according to an embodiment of the present invention includes:
the network scene establishing module 1 is used for establishing network scenes with different priority level services;
a system model design module 2, which is used for designing and defining the system model of the protocol;
the space modeling module 3 is used for carrying out state space modeling and action space modeling according to the protocol network scene;
the reward function design module 4 is used for designing reward functions aiming at different scenes;
a neural network model establishing module 5, which is used for determining and establishing the neural network model used by the protocol;
the network model training module 6 is used for training the network model through experience tuples;
and the performance verification module 7 is used for performing performance verification on the trained model through multi-scene simulation comparison.
The technical solution of the present invention will be further described with reference to the following examples.
As shown in fig. 1, the DRL-MAC protocol provided in the embodiment of the present invention includes the following steps:
(1) establishing a wireless network scene containing a plurality of services with different priorities;
establishing a wireless network with two priority services, wherein the network scenario comprises a base station, N DRL-MAC nodes (nodes adopting the method of the invention), M time division multiple access (TDMA) nodes, and X q-ALOHA nodes, where N ≥ 1 and M + X ≥ 1, i.e., at least one DRL-MAC node and at least one node running another protocol.
The base station acquires data from a wireless channel between the nodes and the base station and transmits the data; the DRL-MAC node adopts a multiple access technology based on deep reinforcement learning, if the node sends different priority services, a transmission result fed back by a base station is obtained, and different rewards are obtained according to the different priority services; if the node does not send the service, the channel is intercepted, and the transmission state of other nodes in a certain time slot is obtained through the channel observation result; the time division multiple access node adopts a TDMA protocol and carries out service transmission according to the time slot which is regularly and periodically occupied and allocated; the q-ALOHA node shown employs q-ALOHA, which performs traffic transmission at each slot with a fixed transmission probability q according to q values under different scenarios.
(2) Designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes;
each DRL-MAC node is equivalent to an agent in reinforcement learning; in each time slot, the agent calculates the Q value in the current state by:
q(s_t, a; θ), a ∈ A_t
where q(s_t, a; θ) is the approximation of the action value computed by the deep neural network model, and the action a_t = argmax_{a ∈ A_t} q(s_t, a; θ) is selected from the action set according to a greedy strategy so as to maximize the overall expectation of the reward or to better adapt to a dynamically changing wireless network environment.
(2.1) Action space modeling
For a network scenario with two priority services, the system action set is A_t = {a_0, a_1, a_2}. At time slot t, the DRL-MAC node makes a decision a through the deep neural network to determine whether to access the channel with a data packet in the current time slot, where a_0 indicates no access to the channel, a_1 indicates that high-priority traffic accesses the channel, and a_2 indicates that low-priority traffic accesses the channel.
The channel observation obtained after taking an action is Z_t ∈ {SUCCESS, COLLISION, IDLENESS}; channel observations are obtained by monitoring the channel and are used to form experience tuples. SUCCESS indicates that the node accessed the channel and transmitted its data packet successfully; COLLISION indicates that multiple nodes accessed the channel simultaneously, causing a collision; IDLENESS indicates that the channel is idle, i.e., no node accessed the channel. The Agent determines the channel observation from the acknowledgement signal from the access point (if it transmitted) or by listening to the channel (if it waited).
(2.2) State space modeling
The state s_t = [c_{t-M+1}, ..., c_t] contains the M historical states to be tracked, where each historical state is an action-observation pair c_t = (a_t, Z_t). For a network scenario with two types of priority traffic there are seven combinations of action-observation pairs in total:
{(a_0, SUCCESS), (a_0, COLLISION), (a_0, IDLENESS), (a_1, SUCCESS), (a_1, COLLISION), (a_2, SUCCESS), (a_2, COLLISION)}
(2.3) reward function
For a network scenario with two types of priority traffic, the reward function always follows the principle that the higher the priority of a service, the larger the reward brought by its successful transmission and the larger the penalty brought by its transmission failure. The reward function is set as sum_reward = α × reward + (1 − α) × (delay / T), where α and T are controllable variables: the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of the delay is not considered; the parameter T is mainly used to unify the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value proposed for a network scenario with two priority services:
reward = r_1 if a high-priority packet is transmitted successfully, r_{-1} if its transmission fails; reward = r_2 if a low-priority packet is transmitted successfully, r_{-2} if its transmission fails.
(3) Defining and establishing a neural network model used by the protocol, and training the network model through experience tuples; FIG. 4 is a process of training a deep neural network based on deep reinforcement learning according to the present invention.
DQN is introduced so that the DRL-MAC node can better learn the other nodes' next decisions about channel usage. The agent is trained with a deep residual network architecture: the deep residual network approximates the Q value, takes the current state s as input, and outputs an action strategy a, which is then combined with other information to form experience tuples used to train the deep residual network.
And (3.1) initializing the experience pool, setting the capacity of the experience pool and initializing parameters.
(3.2) starting from time slot t = 0;
(3.3) feeding the current state s into the neural network to calculate the Q values of that state; selecting the action to execute through a greedy strategy, and recording the channel observation z and the total reward sum_reward obtained after the action is taken; forming the experience tuple (s, a, r, s′) from the current state s, the action a taken, the reward r obtained, and the next state s′ reached, and putting it into the experience pool.
(3.4) if the current generated experience tuple is larger than the capacity of the experience pool, discarding the experience tuple which enters the experience pool earliest, and putting the latest experience tuple into the experience pool; and conversely, sequentially entering the experience tuples into the experience pool according to the sequence.
(3.5) if the current time slot t is a multiple of 10, randomly extracting N experience tuples from the experience pool, and sequentially calculating the y values of the experience tuples:
y = r + γ · max_{a′} q(s′, a′; θ⁻)
where r represents the current reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_{a′} q(s′, a′; θ⁻) selects the reward obtained by the action with the maximum Q value for future prediction; otherwise, go to step (3.8).
(3.6) updating the Q-estimation network parameter θ using a semi-gradient descent algorithm.
(3.7) if the current time slot t is a multiple of the parameter F, assigning the Q-estimation network parameter θ to the target network parameter θ⁻; otherwise, go to step (3.8).
(3.8) if the time slot t is greater than or equal to the set number of training rounds, exiting the training process; otherwise, entering the next time slot t = t + 1 and returning to step (3.3).
(4) And performing performance verification on the trained model through multi-scene simulation comparison to reduce the scheduling delay of the high-priority service under the constraint of ensuring the system throughput.
Fig. 5 is a network scenario used in the simulation experiment of the present invention, where the network scenario includes a base station, N DRL-MAC nodes (nodes adopting the present invention) (N >1), M TDMA nodes, and X q-ALOHA nodes (M + X >1), and data packets are transmitted between the nodes and the base station through a shared wireless channel.
The technical effects of the present invention will be described in detail with reference to simulation experiments.
1. Simulation conditions are as follows:
The simulation experiments of the invention were run on a Windows platform with the following main configuration: CPU, Intel(R) Core(TM) i7-7500U at 2.70 GHz; memory, 8 GB; operating system, Windows 10; simulation software, PyCharm.
2. Simulation content and result analysis:
The simulation experiments are compared against a model-aware node, i.e., a node that knows the multiple access (MAC) mechanisms of the other coexisting nodes and uses that knowledge to derive the optimal MAC protocol for coexisting with them. The results of the simulation experiments are shown in fig. 6.
Example one: under a transmission network with two types of priority services, a network scene comprises a base station, a DRL-MAC node and a TDMA node; the DRL-MAC node is always in a state with traffic to transmit (saturated traffic scenario).
Fig. 6(a) is a throughput result when a DRL-MAC node coexists with one TDMA node in a saturated traffic scenario, with the goal of achieving system optimal throughput.
The throughput results when the time slot N occupied by TDMA varies from 2 to 9 when the frame length is 10 can be seen from fig. 6 (a). The diagonal filled portions and the solid filled portions in the histogram indicate the throughputs of the DRL-MAC node and the TDMA node, respectively. The circular dotted line is the simulated total throughput, i.e., the total throughput of the system, under coexistence of the DRL-MAC node and the TDMA node. The diamond-marked dashed line represents the value of theoretically optimal system throughput verified by the model-aware node. It can be seen from fig. 6(a) that the circular mark dashed line and the diamond mark dashed line almost coincide. This means that the DRL-MAC node can discover the unused time slots of the TDMA by learning without knowing the protocol used by another node.
Fig. 6(b) shows the access probability of high-priority data packets when a DRL-MAC node coexists with one TDMA node in the saturated traffic scenario, with the goal of achieving the optimal system throughput.
Fig. 6(b) compares the access probability of high-priority data packets achieved by the DRL-MAC node and the TDMA node in the scenario that considers traffic priority and in the scenario that does not. The square-marked solid line in fig. 6(b) represents the access probability of high-priority packets in the scenario that considers traffic priority; the circle-marked solid line represents the access probability of high-priority packets in the scenario that does not. It can be seen from fig. 6(b) that the curve for the priority-aware scenario lies above the other curve in most cases. It can therefore be concluded that the DRL-MAC node transmits high-priority traffic in a more timely manner when traffic priority is considered, thereby ensuring preferential communication of high-priority traffic.
Example two: under a transmission network with two types of priority services, a network scene comprises a base station, a DRL-MAC node and a q-ALOHA node; the DRL-MAC node is always in a state with traffic to transmit (saturated traffic scenario).
Fig. 6(c) is a throughput result when a DRL-MAC node coexists with a q-ALOHA node in a saturated traffic scenario, with the goal of achieving system optimal throughput.
Fig. 6(c) shows the throughput results of q-ALOHA when the access probability q changes from 0.2 to 0.9 while the q-ALOHA node coexists with the DRL-MAC node in the saturated traffic scenario. The diagonally filled and solidly filled portions in fig. 6(c) represent the throughputs of the DRL-MAC node and the q-ALOHA node, respectively. The circle-marked dotted line represents the total throughput of the system, i.e., the total throughput simulated with the DRL-MAC node and the q-ALOHA node coexisting. The diamond-marked dashed line represents the theoretically optimal system throughput verified by the model-aware node. It can be seen from fig. 6(c) that the circle-marked dotted line and the diamond-marked dashed line almost coincide in most cases. This means that the DRL-MAC node can obtain the best throughput by learning a policy without knowing that the other node is a q-ALOHA node or what its transmission probability q is.
Fig. 6(d) is a throughput result when a DRL-MAC node coexists with one q-ALOHA node in a saturated traffic scenario, with the goal of achieving fair transmission between nodes.
Fig. 6(d) shows the throughput results of q-ALOHA when the access probability q changes from 0.2 to 0.6 under the proportional fairness objective. The actual throughputs of the q-ALOHA node, the DRL-MAC node, and the system are represented by the circle-marked, triangle-marked, and square-marked solid lines, respectively, and the theoretically optimal throughputs of the q-ALOHA node, the DRL-MAC node, and the system obtained by the model-aware node are represented by the circle-marked, triangle-marked, and square-marked dotted lines, respectively. As can be seen from fig. 6(d), under the proportional fairness objective there is only a relatively small error between the actual throughput and the theoretically optimal throughput, which indicates that the DRL-MAC node can achieve the proportional fairness objective by learning a policy without knowing that the other node is a q-ALOHA node or what its transmission probability q is.
Fig. 6(e) shows the access probability of high-priority data packets when a DRL-MAC node coexists with a q-ALOHA node in the saturated traffic scenario, with the goal of achieving the optimal system throughput.
Fig. 6(e) simulates the access probability of high-priority data packets as the q-ALOHA access probability q changes from 0.2 to 0.9. The solid line in the figure represents, for different access probabilities q, the proportion of all channel access actions of the Agent in which a high-priority data packet is accessed. As can be seen from fig. 6(e), in the scenario where DRL-MAC and q-ALOHA coexist (i.e., when the environment is unstable), a higher access probability can still be achieved for high-priority traffic than for low-priority traffic.
Example three: under a transmission network with two types of priority services, a network scene comprises a base station, a DRL-MAC node and a TDMA node; the DRL-MAC is in a non-saturated traffic scenario.
The unsaturated traffic scenario means that a data packet arrives in every time slot and the arrival rates of packets of different priorities differ: the arrival probability of a high-priority packet is defined as 0.3 and that of a low-priority packet as 0.7. Packets of different priorities enter the packet queues of their respective priorities and wait. To avoid a queue being empty when an access action is taken, a number of packets of the corresponding priority are placed in each priority queue when the service queues are initialized (in the invention, five packets in each of the high- and low-priority queues), which prevents the accessed service queue from being empty at the very start of training.
The channel access criterion in the unsaturated traffic scenario is as follows. When action a_1 is taken, i.e., a low-priority service accesses the channel, and a collision occurs during transmission, the packet is dequeued and discarded. When action a_2 is taken, i.e., a high-priority service accesses the channel, and a collision occurs during transmission, the packet is re-queued and placed at the head of the queue. Because the reward of high-priority traffic is large, the demand for transmitting high-priority traffic is large; however, since the traffic arrival probabilities are fixed, the Agent learns to take action a_2 (accessing high-priority traffic) even when there is no traffic in the high-priority queue. In that case the action a_2 is still executed, but a low-priority packet is actually taken from the low-priority queue and accessed: if the packet is transmitted successfully, the reward corresponding to low-priority traffic is given; if the transmission fails, the penalty corresponding to a high-priority packet is given and the packet is re-inserted at the head of the low-priority queue. A sketch of this rule follows the next paragraph of figure discussion? No: see the sketch immediately below.
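The sketch below illustrates the queueing rule just described for the unsaturated scenario: low-priority packets that collide are discarded, high-priority packets that collide are re-queued at the head of their queue, and when the high-priority queue is empty the high-priority action falls back to the low-priority queue. The deque-based queues and the transmit() callback are illustrative assumptions.

```python
from collections import deque

def handle_access(action, high_queue, low_queue, transmit):
    """Apply the unsaturated-scenario access rule for one slot.

    action: 1 = access low-priority traffic, 2 = access high-priority traffic (as in this embodiment).
    transmit(packet) returns True on SUCCESS and False on COLLISION.
    """
    if action == 2 and high_queue:            # high-priority access with traffic available
        packet = high_queue.popleft()
        if not transmit(packet):
            high_queue.appendleft(packet)     # collision: re-queue at the head of the queue
    elif action == 2 and low_queue:           # high-priority queue empty: fall back to low priority
        packet = low_queue.popleft()
        if not transmit(packet):
            low_queue.appendleft(packet)      # failure: high-priority penalty, packet back at the head
    elif action == 1 and low_queue:           # low-priority access
        packet = low_queue.popleft()
        transmit(packet)                      # on collision the packet is simply discarded
```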
Fig. 6(f) shows the results obtained in simulation scenarios corresponding to different parameters α under the coexistence of a DRL-MAC node and a TDMA node in an unsaturated service scenario, where the diagram of each simulation result is divided into three parts: system throughput, high priority service access probability, and system latency.
The lower-left part of each simulation diagram in fig. 6(f) shows the transmission delay of traffic of different priorities for different parameters α. When α decreases from 1 to 0.8, the delay of high-priority traffic remains almost constant, while the maximum delay of low-priority traffic decreases gradually from a value close to 80 to a value close to 60. Fig. 6(f) shows that in this scenario, by adjusting the parameter α, the learned strategy is guaranteed to achieve the optimal system throughput; and, on the premise that the access probability of high-priority traffic remains greater than that of low-priority packets, the overall delay of the system is reduced by sacrificing the access probability of some high-priority packets.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form of a computer program product in whole or in part. The computer program product includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A channel access method of a multi-priority wireless terminal based on deep reinforcement learning is characterized by comprising the following steps:
step one, establishing network scenes with different priority level services;
designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes;
step three, defining and establishing a neural network model used by the protocol, and training the network model through experience tuples;
and fourthly, performing performance verification on the trained model through multi-scene simulation comparison.
2. The channel access method for a deep reinforcement learning-based multi-priority wireless terminal as claimed in claim 1, wherein in step one, the establishing of network scenarios with different priority services comprises:
establishing a transmission network with k priority services, wherein k is greater than 0; the network scene comprises a base station, N DRL-MAC nodes (adopting the nodes of the invention), M Time Division Multiple Access (TDMA) nodes and X q-ALOHA nodes, wherein (N > 1; M + X >1) at least comprises one DRL-MAC node and one other protocol node;
the base station is used for acquiring data from a wireless channel between the nodes and the base station and transmitting the data; the DRL-MAC node adopts a multiple access technology based on deep reinforcement learning, if the node sends different priority services, a transmission result fed back by a base station is obtained, and different rewards are obtained according to the different priority services; if the node does not send the service, the channel is intercepted, and the transmission state of other nodes in a certain time slot is obtained through the channel observation result; the time division multiple access node adopts a TDMA protocol and is used for carrying out service transmission according to the time slot which is regularly and periodically occupied and allocated; and the q-ALOHA node adopts q-ALOHA and is used for carrying out service transmission at each time slot with a fixed transmission probability q according to q values under different scenes.
3. The channel access method of multi-priority wireless terminal based on deep reinforcement learning of claim 1, wherein in step two, each DRL-MAC node corresponds to an agent in reinforcement learning; in each time slot, the agent calculates the Q value in the current state by:
q(s_t, a; θ), a ∈ A_t
wherein q(s_t, a; θ) is the approximation of the action value computed by the deep neural network model, and the action a_t = argmax_{a ∈ A_t} q(s_t, a; θ) is selected from the action set according to a greedy strategy so as to maximize the overall expectation of the reward or to better adapt to a dynamically changing wireless network environment.
4. The method for accessing channels of a multi-priority wireless terminal based on deep reinforcement learning of claim 1, wherein in step two, the designing and defining a system model of the protocol, performing state space modeling and action space modeling according to the network scenario of the protocol, and designing a reward function for different scenarios comprises:
(1) Action space modeling
the system action set is A_t = {a_0, a_1, a_2, ..., a_k}, where k is the number of priority levels of traffic in the network scenario; in time slot t, the DRL-MAC node makes a decision a through the deep neural network to determine whether a data packet accesses the channel in the current time slot; a_0 indicates no access to the channel, and a_1, ..., a_k indicate access to the channel by traffic of the corresponding one of the k priorities;
the channel observation obtained after taking an action is Z_t ∈ {SUCCESS, COLLISION, IDLENESS}; channel observations are obtained by monitoring the channel and are used to form experience tuples; SUCCESS indicates that the node accessed the channel and transmitted its data packet successfully; COLLISION indicates that multiple nodes accessed the channel simultaneously, causing a collision; IDLENESS indicates that the channel is idle, i.e., no node accessed the channel; the Agent determines the channel observation from the acknowledgement signal from the access point and from listening to the channel;
(2) state space modeling
the state s_t = [c_{t-M+1}, ..., c_t] contains the M historical states to be tracked, where each historical state is an action-observation pair c_t = (a_t, Z_t); in total there are 2k + 3 possible combinations of action-observation pairs:
{(a_0, SUCCESS), (a_0, COLLISION), (a_0, IDLENESS)} ∪ {(a_i, SUCCESS), (a_i, COLLISION) | i = 1, ..., k};
(3) reward function
for network scenarios with services of different priorities, the reward function always follows the principle that the higher the priority of a service, the larger the reward brought by its successful transmission and the larger the penalty brought by its transmission failure; the reward function is set as sum_reward = α × reward + (1 − α) × (delay / T), where α and T are controllable variables: the parameter α adjusts the influence of the delay on the overall reward and is initialized to 1 when the influence of the delay is not considered; the parameter T unifies the range of the delay's influence on the reward and is initialized to 50; delay is the time from the generation of a service to its access to the channel, i.e., the scheduling delay; reward is the reward value defined for the different priority services, with r_1, ..., r_k being the rewards for successful transmission of services of each priority and r_{-1}, ..., r_{-k} being the penalties for transmission failure of services of each priority:
reward = r_i if a packet of priority i is transmitted successfully; reward = r_{-i} if the transmission of a packet of priority i fails.
5. the method for accessing channels of multi-priority wireless terminals based on deep reinforcement learning of claim 1, wherein in step three, the defining and establishing the neural network model used by the protocol and training the network model through the experience tuples comprise:
the DQN is introduced to enable the DRL-MAC node to better learn the next decision of other nodes on the use condition of a channel, the intelligent agent adopts a deep residual error network architecture for training, the deep residual error network is used for approaching a Q value, the current state s is input, an action strategy a is output, and then an experience tuple is formed by combining other information and used for training the deep residual error network.
6. The method for accessing channels of multi-priority wireless terminals based on deep reinforcement learning of claim 1, wherein in step three, the neural network model used by the protocol is defined and established, and the network model is trained through experience tuples, further comprising:
(1) initializing an experience pool, setting the capacity of the experience pool, and initializing parameters;
(2) starting from time slot t = 0;
(3) transmitting the current state s into a neural network to calculate the Q value of the state; selecting an action to be executed through a greedy strategy, and recording a channel observation result z and a total reward sum _ rewards obtained after the action is taken; putting the acquired state s, the reward r acquired after the action a is taken and the next state s 'which is reached into an experience tuple (s, a, r, s') into an experience pool;
(4) if the current generated experience tuple is larger than the capacity of the experience pool, discarding the experience tuple which enters the experience pool earliest, and putting the latest experience tuple into the experience pool; otherwise, sequentially entering the experience tuples into an experience pool according to the sequence;
(5) if the current time slot t is a multiple of 10, randomly extracting N experience tuples from the experience pool, and sequentially calculating the y values of the experience tuples:
y = r + γ · max_{a′} q(s′, a′; θ⁻)
where r represents the current reward obtained by taking action a in the current state s, γ ∈ (0, 1) is the discount factor, and max_{a′} q(s′, a′; θ⁻) selects the reward obtained by the action with the maximum Q value for future prediction; otherwise, go to step (8);
(6) updating the Q-estimation network parameters θ with a semi-gradient descent algorithm;
(7) if the current time slot t is a multiple of the parameter F, assigning the Q-estimation network parameters θ to the target network parameters θ⁻; otherwise, entering step (8);
(8) if the time slot t is greater than or equal to the set number of training rounds, exiting the training process; otherwise, entering the next time slot t = t + 1 and returning to step (3); an illustrative sketch of this training loop follows.
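A rough Python sketch of steps (1) to (8), assuming an environment object that returns an observation, the total reward sum_reward and a termination flag each slot, and a ResidualQNetwork as sketched earlier. The ε-greedy exploration, batch size, learning rate and optimizer are assumptions; the training interval of 10 slots and the target-network sync interval F follow the claim.

```python
import random
from collections import deque
import torch
import torch.nn as nn

def train_dqn(env, q_net, target_net, num_slots, capacity=10000,
              batch_size=32, gamma=0.9, epsilon=0.1, F=100, lr=1e-3):
    """Experience-replay DQN training loop following steps (1)-(8)."""
    replay = deque(maxlen=capacity)                  # (1) experience pool with fixed capacity
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    target_net.load_state_dict(q_net.state_dict())

    state = env.reset()
    for t in range(num_slots):                       # (2) slots t = 0, 1, 2, ...
        # (3) epsilon-greedy action selection from the Q-estimation network (exploration assumed)
        if random.random() < epsilon:
            action = random.randrange(env.num_actions)
        else:
            with torch.no_grad():
                action = int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done = env.step(action)  # observation z and reward sum_reward
        replay.append((state, action, reward, next_state, done))  # (4) deque evicts the oldest tuple
        state = env.reset() if done else next_state

        if t % 10 == 0 and len(replay) >= batch_size:             # (5) sample N tuples every 10 slots
            batch = random.sample(replay, batch_size)
            s, a, r, s2, d = zip(*batch)
            s = torch.tensor(s, dtype=torch.float32)
            a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.tensor(r, dtype=torch.float32)
            s2 = torch.tensor(s2, dtype=torch.float32)
            d = torch.tensor(d, dtype=torch.float32)
            with torch.no_grad():                                  # y = r + gamma * max_a' Q(s', a'; theta-)
                y = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a).squeeze(1)
            loss = loss_fn(q, y)
            optimizer.zero_grad()                                  # (6) gradient step on theta
            loss.backward()
            optimizer.step()

        if t % F == 0:                                             # (7) sync target network every F slots
            target_net.load_state_dict(q_net.state_dict())
    # (8) training ends once the set number of slots has been processed
```

The environment interface (reset, step, num_actions) is a placeholder; in a slotted MAC simulation the "done" flag may simply stay False and the loop runs for the configured number of training rounds.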
7. A channel access system for multi-priority-service wireless terminals, implementing the channel access method for multi-priority wireless terminals based on deep reinforcement learning according to any one of claims 1 to 6, wherein the channel access system comprises:
the network scene establishing module is used for establishing network scenes with different priority level services;
the system model design module is used for designing and defining a system model of the protocol;
the space modeling module is used for carrying out state space modeling and action space modeling according to the protocol network scene;
the reward function design module is used for designing reward functions aiming at different scenes;
the neural network model establishing module is used for determining and establishing a neural network model used by the protocol;
the network model training module is used for training the network model through experience tuples;
and the performance verification module is used for performing performance verification on the trained model through multi-scene simulation comparison.
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through experience tuples; and performing performance verification on the trained model through multi-scene simulation comparison.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
establishing network scenes with different priority services; designing and defining a system model of the protocol, carrying out state space modeling and action space modeling according to the network scene of the protocol, and designing reward functions aiming at different scenes; defining and establishing a neural network model used by the protocol, and training the network model through experience tuples; and performing performance verification on the trained model through multi-scene simulation comparison.
10. A wireless communication information data processing terminal, characterized in that the wireless communication information data processing terminal is configured to implement the channel access system of the multi-priority service wireless terminal according to claim 7.
CN202110781263.0A 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning Active CN113613339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110781263.0A CN113613339B (en) 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113613339A (en) 2021-11-05
CN113613339B (en) 2023-10-17

Family

ID=78304401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110781263.0A Active CN113613339B (en) 2021-07-10 2021-07-10 Channel access method of multi-priority wireless terminal based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113613339B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114024639A (en) * 2021-11-09 2022-02-08 重庆邮电大学 Distributed channel allocation method in wireless multi-hop network
CN114375022A (en) * 2022-01-08 2022-04-19 山东大学 Leader election method based on multi-agent reinforcement learning in wireless network
CN114826986A (en) * 2022-03-30 2022-07-29 西安电子科技大学 Performance analysis method for ALOHA protocol of priority frameless structure
CN114938530A (en) * 2022-06-10 2022-08-23 电子科技大学 Wireless ad hoc network intelligent networking method based on deep reinforcement learning
CN115134060A (en) * 2022-06-20 2022-09-30 京东科技控股股份有限公司 Data transmission method and device, electronic equipment and computer readable medium
CN115315020A (en) * 2022-08-08 重庆邮电大学 Intelligent CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance) backoff method based on the IEEE 802.15.4 protocol for differentiated services
CN115767785A (en) * 2022-10-22 2023-03-07 西安电子科技大学 MAC protocol switching method based on deep reinforcement learning in self-organizing network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190014488A1 (en) * 2017-07-06 2019-01-10 Futurewei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN110049018A (en) * 2019-03-25 2019-07-23 上海交通大学 SPMA protocol parameter optimization method, system and medium based on enhancing study
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning
CN111711666A (en) * 2020-05-27 2020-09-25 梁宏斌 Internet of vehicles cloud computing resource optimization method based on reinforcement learning
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE JIANG; HAIBO HE: "Q-Learning for Non-Cooperative Channel Access Game of Cognitive Radio Networks", IEEE *
HUANG Ying; YAN Dingyu; LI Nan: "Q-Learning Optimization Algorithm for Dynamic Spectrum Access", Journal of Xidian University, no. 06 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114024639A (en) * 2021-11-09 2022-02-08 重庆邮电大学 Distributed channel allocation method in wireless multi-hop network
CN114024639B (en) * 2021-11-09 2024-01-05 成都天软信息技术有限公司 Distributed channel allocation method in wireless multi-hop network
CN114375022A (en) * 2022-01-08 2022-04-19 山东大学 Leader election method based on multi-agent reinforcement learning in wireless network
CN114375022B (en) * 2022-01-08 2024-03-12 山东大学 Channel preemption method based on multi-agent reinforcement learning in wireless network
CN114826986A (en) * 2022-03-30 2022-07-29 西安电子科技大学 Performance analysis method for ALOHA protocol of priority frameless structure
CN114826986B (en) * 2022-03-30 2023-11-03 西安电子科技大学 Performance analysis method for ALOHA protocol with priority frameless structure
CN114938530A (en) * 2022-06-10 2022-08-23 电子科技大学 Wireless ad hoc network intelligent networking method based on deep reinforcement learning
CN114938530B (en) * 2022-06-10 2023-03-21 电子科技大学 Wireless ad hoc network intelligent networking method based on deep reinforcement learning
CN115134060A (en) * 2022-06-20 2022-09-30 京东科技控股股份有限公司 Data transmission method and device, electronic equipment and computer readable medium
CN115315020A (en) * 2022-08-08 重庆邮电大学 Intelligent CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance) backoff method based on the IEEE 802.15.4 protocol for differentiated services
CN115767785A (en) * 2022-10-22 2023-03-07 西安电子科技大学 MAC protocol switching method based on deep reinforcement learning in self-organizing network
CN115767785B (en) * 2022-10-22 2024-02-27 西安电子科技大学 MAC protocol switching method based on deep reinforcement learning in self-organizing network

Also Published As

Publication number Publication date
CN113613339B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
CN113613339B (en) Channel access method of multi-priority wireless terminal based on deep reinforcement learning
EP3637708B1 (en) Network congestion processing method, device, and system
Nguyen et al. RTEthernet: Real‐time communication for manufacturing cyberphysical systems
Kong et al. Performance analysis of IEEE 802.11 e contention-based channel access
CN107040948B (en) CSMA/CA optimization method based on priority
CN111294775B (en) Resource allocation method based on H2H dynamic characteristics in large-scale MTC and H2H coexistence scene
US20090129404A1 (en) Differentiation for bandwidth request contention
CN111836370B (en) Resource reservation method and equipment based on competition
WO2014094310A1 (en) Resource scheduling method and device
CN114339660A (en) Random access method for unmanned aerial vehicle cluster
CN111601398B (en) Ad hoc network medium access control method based on reinforcement learning
Hu et al. Performance and reliability analysis of prioritized safety messages broadcasting in DSRC with hidden terminals
Shoaei et al. Reconfigurable and traffic-aware MAC design for virtualized wireless networks via reinforcement learning
JP2022553601A (en) Transmission contention resolution, apparatus, terminal and medium
AlQahtani Performance analysis of cognitive‐based radio resource allocation in multi‐channel LTE‐A networks with M2M/H2H coexistence
Bankov et al. Approach to real-time communications in Wi-Fi networks
Wang et al. TCP throughput enhancement for cognitive radio networks through lower-layer configurations
CN114845338A (en) Random back-off method for user access
Wang et al. Optimization on information freshness for multi‐access users with energy harvesting cognitive radio networks
Ahmed et al. A QoS-aware scheduling with node grouping for IEEE 802.11 ah
CN113056010A (en) Reserved time slot distribution method based on LoRa network
Raeis et al. Distributed fair scheduling for information exchange in multi-agent systems
Zhai et al. Large-Scale Micro-Power Sensors Access Scheme Based on Hybrid Mode in IoT Enabled Smart Grid
Liu et al. A novel artificial intelligence based wireless local area network channel access control scheme for low latency e‐health applications
CN117336885B (en) Access method and system of two-dimensional probability P-CSMA improved protocol

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant