CN113923794A - Distributed dynamic spectrum access method based on multi-agent reinforcement learning - Google Patents

Distributed dynamic spectrum access method based on multi-agent reinforcement learning

Info

Publication number
CN113923794A
Authority
CN
China
Prior art keywords
reinforcement learning
access
spectrum access
cognitive user
cognitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111339165.8A
Other languages
Chinese (zh)
Inventor
周力
谭翔
魏急波
赵海涛
熊俊
高文颖
唐麒
张姣
曹阔
刘潇然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202111339165.8A
Publication of CN113923794A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access
    • H04W74/08 Non-scheduled access, e.g. ALOHA
    • H04W74/0833 Random access procedures, e.g. with 4-step access
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/08 Testing, supervising or monitoring using real traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access
    • H04W74/08 Non-scheduled access, e.g. ALOHA
    • H04W74/0833 Random access procedures, e.g. with 4-step access
    • H04W74/0841 Random access procedures, e.g. with 4-step access with collision treatment
    • H04W74/085 Random access procedures, e.g. with 4-step access with collision treatment collision avoidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a distributed dynamic spectrum access method based on multi-agent reinforcement learning. The method models the multi-user distributed dynamic spectrum access problem as a multi-agent Markov cooperative game and constructs a multi-agent reinforcement learning framework with centralized training and distributed execution. The framework comprises an offline training module and an online execution module: the online execution module performs spectrum access for each cognitive user with the learned access policy, and the offline training module dynamically updates the online execution module according to the cognitive users' spectrum access results. The invention provides a multi-user cooperative spectrum access method that adapts to the communication environment and scales with the network size; it reduces access collisions among cognitive users while avoiding interference to authorized users, thereby maximizing the access success rate of the cognitive users and improving spectrum utilization efficiency.

Description

Distributed dynamic spectrum access method based on multi-agent reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication networks, and in particular to a distributed dynamic spectrum access method and system based on multi-agent reinforcement learning.
Background
In a cognitive wireless network, cognitive users access the spectrum holes of authorized users in an overlay mode for data transmission. Distributed multi-user dynamic spectrum access faces two major challenges. The first is avoiding interference from cognitive users to primary users: when a primary user occupies its authorized spectrum for data transmission, no cognitive user may access the corresponding spectrum. The second is avoiding access collisions among cognitive users: if two or more cognitive users access the same spectrum hole, their data transmissions fail. Because the sensing capability of a single cognitive node is limited, it can observe only partial channel state information. Moreover, owing to hidden nodes, obstructions, and other factors, the sensing information of a cognitive user is incomplete and inaccurate.
Disclosure of Invention
The invention provides a distributed dynamic spectrum access method and system based on multi-agent reinforcement learning, to overcome the defects of the prior art in which a cognitive user accessing a spectrum hole of an authorized user for data transmission interferes with the primary user and collides with other cognitive users, resulting in low communication system throughput.
In order to achieve the above object, the present invention provides a distributed dynamic spectrum access method based on multi-agent reinforcement learning, which comprises the following steps:
modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model, and constructing a centralized training and distributed execution multi-agent reinforcement learning framework; the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module;
acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
and monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
In order to achieve the above object, the present invention further provides a distributed dynamic spectrum access system based on multi-agent reinforcement learning, including:
the algorithm building module is used for modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model and building a centralized training and distributed execution multi-agent reinforcement learning framework; the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module;
the spectrum sensing module is used for acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
the spectrum access module is used for performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
and the real-time monitoring module is used for monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
To achieve the above object, the present invention further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
To achieve the above object, the present invention further proposes a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
the distributed dynamic spectrum access method based on multi-agent reinforcement learning provided by the invention models a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperation game model, and constructs a centralized training and distributed execution multi-agent reinforcement learning framework, wherein the multi-agent reinforcement learning framework comprises an offline training module and an online execution module, the online execution module utilizes a learned access strategy to perform spectrum access of cognitive users, and the offline training module dynamically updates the online execution module according to spectrum access results of the cognitive users. The invention provides a communication environment self-adaptive and network scale extensible multi-user cooperative spectrum access method, which reduces access conflicts among cognitive users when interference on authorized users is avoided, thereby maximizing the access success rate of the cognitive users and improving the utilization efficiency of a spectrum.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a distributed dynamic spectrum access method based on multi-agent reinforcement learning according to the present invention;
FIG. 2 is a diagram of a centralized training, distributed execution multi-agent reinforcement learning framework of the present invention;
FIG. 3 is a schematic diagram of time slot division according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, provided the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, such a combination should be deemed not to exist and falls outside the protection scope of the present invention.
The invention provides a distributed dynamic spectrum access method based on multi-agent reinforcement learning, which comprises the following steps as shown in figure 1:
101: modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model, and constructing a centralized training and distributed execution multi-agent reinforcement learning framework (as shown in FIG. 2); the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module.
102: acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
103: performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
104: monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
The invention models the multi-user distributed dynamic spectrum access problem of a cognitive wireless network as a multi-agent Markov game and, based on this multi-agent Markov cooperative game model, constructs a multi-agent reinforcement learning framework with centralized training and distributed execution. The framework comprises an offline training module and an online execution module: the online execution module performs spectrum access for each cognitive user with the learned access policy, and the offline training module dynamically updates the online execution module according to real-time monitoring results. The invention provides a multi-user cooperative spectrum access method that adapts to the communication environment and scales with the network size; it reduces access collisions among cognitive users while avoiding interference to authorized users, thereby maximizing the access success rate of the cognitive users and improving spectrum utilization efficiency.
In one embodiment, for step 101, the offline training module includes a centralized trainer, which is built on a network edge computing server (e.g., a small cell, a wireless access point, or a drone-assisted edge computing server).
The online execution module comprises a policy network, and the policy network is loaded on the cognitive user side.
The multi-agent reinforcement learning framework is a centralized training and distributed execution multi-agent reinforcement learning framework.
In the next embodiment, for step 101, the offline training module collects the interaction information between the cognitive users and the wireless environment through a common channel, uses the collected interaction information to train a mutually cooperative policy network for each cognitive user, and sends the trained policy network parameters to the corresponding cognitive user through the common channel to update the parameters of that user's policy network.
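As a minimal sketch of this exchange (all names, such as EdgeTrainer and load_parameters, are hypothetical and not taken from the patent), the centralized trainer could pool experience arriving over the common channel and push updated parameters back as follows:

```python
# Hypothetical sketch of the centralized-training / distributed-execution
# exchange; class and method names are illustrative assumptions.

import copy

class EdgeTrainer:
    """Centralized trainer hosted on a network edge computing server."""

    def __init__(self, num_users, make_policy):
        # One mutually cooperative policy network per cognitive user.
        self.policies = [make_policy() for _ in range(num_users)]
        self.replay = []  # pooled (observation, action, reward) experience

    def collect(self, experience):
        # Interaction records arrive from cognitive users over the common channel.
        self.replay.append(experience)

    def train_step(self, update_fn):
        # Train each user's policy on the pooled experience of all users.
        for policy in self.policies:
            update_fn(policy, self.replay)

    def broadcast(self, users):
        # Push trained parameters back over the common channel; afterwards each
        # user executes its policy using only local sensing information.
        for user, policy in zip(users, self.policies):
            user.load_parameters(copy.deepcopy(policy.state_dict()))
```

The point of the split is that global information is needed only at training time; at execution time each cognitive user decides from its own local observations, which is what makes the scheme distributed and scalable.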
In another embodiment, for step 104, monitoring the access success rate of the cognitive user in real time comprises:
401: outputting a reward value of the current spectrum access by utilizing a multi-agent reinforcement learning framework according to the spectrum access condition of the cognitive user;
402: and monitoring the access success rate of the cognitive user in real time according to the reward value.
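As a concrete illustration of step 402, a sliding-window average of per-slot success rates could drive the retraining trigger. The window length, threshold, and class name below are illustrative assumptions, not taken from the patent:

```python
# Hypothetical sliding-window monitor; window size and threshold are assumptions.

from collections import deque

class AccessMonitor:
    def __init__(self, num_users, window=1000, threshold=0.9):
        self.num_users = num_users
        self.rates = deque(maxlen=window)  # recent per-slot success rates
        self.threshold = threshold

    def record(self, reward):
        # The reward is the number of successful accesses in the slot, so
        # dividing by the number of cognitive users gives a per-slot rate.
        self.rates.append(reward / self.num_users)

    def needs_retraining(self):
        # Trigger offline retraining when the windowed average rate drops.
        if not self.rates:
            return False
        return sum(self.rates) / len(self.rates) < self.threshold
```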
In the next embodiment, step 401, outputting the reward value of the current spectrum access by using the multi-agent reinforcement learning framework according to the spectrum access conditions of the cognitive users, comprises:
4011: summing the numbers of successful accesses of all cognitive users to form the utility function of each cognitive user;
4012: establishing a reward function in the multi-agent reinforcement learning framework according to the utility function;
4013: outputting the reward value of the current spectrum access by using the reward function according to the spectrum access conditions of the cognitive users.
Local state information of the wireless channels is acquired according to the limited sensing capability of the cognitive user, forming the observation space for reinforcement learning.
Sensing channels are selected according to the sensing capability of the cognitive user, and an available channel is selected for access, forming the action space for reinforcement learning.
In this embodiment, the available spectrum is divided into K orthogonal subchannels of equal bandwidth, where the subchannel bandwidth is smaller than the coherence bandwidth of the channel;
each subchannel is divided into time slots with the same start and stop times, as shown in FIG. 3; the slot length is shorter than the channel coherence time;
the K orthogonal subchannels are randomly occupied by the corresponding K authorized users, and their idle/occupied states form the state space of the cognitive wireless network, whose size is $2^K$.
The cognitive user is modeled as an agent; the agents cooperatively access available spectrum holes for data transmission according to the channel states they sense.
Due to its limited sensing capability, a cognitive user can select only M of the K subchannels for sensing, so the observation space size of a single cognitive user is:

$\binom{K}{M} \cdot 2^M$

(the choice of M subchannels times their $2^M$ possible idle/occupied states).
the joint observation space of all cognitive users is:
Figure BDA0003351805080000072
One channel sensed to be idle is then selected for access according to the sensing results of the selected M subchannels; the action space size of a single cognitive user is:

$\binom{K}{M} \cdot (M+1)$

(the choice of M subchannels to sense, times accessing one of the M sensed channels or refraining from access).
the joint motion space of all cognitive users is:
Figure BDA0003351805080000074
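The sketch below tallies these sizes under the interpretation used above, namely that an observation is the choice of M sensed subchannels together with their binary states, and an action is the choice of M subchannels to sense plus accessing one of them or refraining; these readings are assumptions rather than the patent's own text:

```python
# Tally of the state/observation/action space sizes under the stated assumptions.

from math import comb

def space_sizes(K, M, N):
    obs_single = comb(K, M) * 2 ** M   # choice of M channels x their binary states
    act_single = comb(K, M) * (M + 1)  # sensing choice x (access one of M, or none)
    return {
        "state": 2 ** K,               # idle/occupied over K subchannels
        "obs_single": obs_single,
        "obs_joint": obs_single ** N,  # joint over N cognitive users
        "act_single": act_single,
        "act_joint": act_single ** N,
    }

print(space_sizes(K=8, M=3, N=4))
```

Even for small K, M, and N the joint spaces grow combinatorially, which is why the patent solves the problem with learned policies rather than exhaustive search.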
In one embodiment, the reward function is:

$r_t = u_t(o_1,\dots,o_N, a_1,\dots,a_N) = \sum_{n=1}^{N} c_n^t$

where $u_t$ denotes the utility function of all cognitive users at time t; $c_n^t$ denotes the number of successful accesses of cognitive user n at time t; $o_n$ denotes the observation of cognitive user n at time t; $a_n$ denotes the access action of cognitive user n at time t; and N denotes the total number of cognitive users.
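The following minimal sketch shows how this shared reward could be computed from the joint outcome of one time slot. The encoding of "no access" as None and the explicit collision test are assumptions consistent with the background section, not the patent's own code:

```python
# Hypothetical per-slot reward computation; "None" denotes no access.

def slot_reward(actions, channel_busy):
    """actions[n]: subchannel chosen by cognitive user n (or None);
    channel_busy[k]: True if the authorized user occupies subchannel k."""
    successes = 0
    for n, ch in enumerate(actions):
        if ch is None:
            continue
        collided = any(m != n and a == ch for m, a in enumerate(actions))
        # An access succeeds only if the primary user is absent and no other
        # cognitive user picked the same subchannel.
        if not channel_busy[ch] and not collided:
            successes += 1
    return successes  # shared cooperative reward r_t = sum_n c_n^t
```

Because every agent receives the same team reward, the game is fully cooperative: an agent that causes a collision lowers its own reward as well.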
In a next embodiment, the policy network is a deep recurrent neural network structure.
In this embodiment, during the training phase, the centralized trainer deployed on the edge server trains the cooperative spectrum access policies offline using the sensing-access experience of each cognitive user; during execution, each cognitive user node performs spectrum access by autonomous decision of its policy network, relying only on local spectrum sensing information. The available channels are divided into time slots of equal length with the same start and stop times, the multi-user cooperative spectrum access problem is modeled as a fully cooperative game, and multi-agent reinforcement learning with centralized training and distributed execution is used to solve for the optimal policy that reaches an equilibrium of the decentralized partially observable Markov game.
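The patent states only that the policy network has a deep recurrent structure; the following PyTorch sketch with a GRU cell and illustrative layer sizes is one plausible realization, not the patent's specified network:

```python
# One plausible recurrent policy network; layer sizes and the GRU cell
# are assumptions, not specified by the patent.

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)   # carries history across slots
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, obs, h):
        # obs: 1-D tensor with the local sensing result for this slot.
        x = torch.relu(self.encoder(obs))
        h = self.gru(x, h)
        return self.head(h), h

    @torch.no_grad()
    def act(self, obs, h):
        logits, h = self.forward(obs, h)
        return int(torch.argmax(logits, dim=-1)), h  # greedy action at execution
```

A recurrent cell is a natural fit here because each agent observes only part of the channel state, so the hidden state acts as a memory over past sensing results under partial observability.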
The invention also provides a distributed dynamic spectrum access system based on multi-agent reinforcement learning, which comprises:
the algorithm building module is used for modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model and building a centralized training and distributed execution multi-agent reinforcement learning framework; the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module;
the spectrum sensing module is used for acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
the spectrum access module is used for performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
and the real-time monitoring module is used for monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
The invention further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
The invention also proposes a computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method described above.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All modifications and equivalents made using the contents of the specification and drawings, or applied directly or indirectly in other related technical fields, are likewise included in the protection scope of the present invention.

Claims (10)

1. A distributed dynamic spectrum access method based on multi-agent reinforcement learning is characterized by comprising the following steps:
modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model, and constructing a centralized training and distributed execution multi-agent reinforcement learning framework; the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module;
acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
and monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
2. The multi-agent reinforcement learning-based distributed dynamic spectrum access method of claim 1, wherein the offline training module comprises a centralized trainer, the centralized trainer being constructed by a network edge computing server;
the online execution module comprises a policy network, and the policy network is loaded on the cognitive user side.
3. The multi-agent reinforcement learning-based distributed dynamic spectrum access method as claimed in claim 2, wherein the offline training module collects interaction information between the cognitive users and the wireless environment through a common channel, uses the collected interaction information to train a mutually cooperative policy network for each cognitive user, and sends the trained policy network parameters to the corresponding cognitive user through the common channel to update the parameters of that user's policy network.
4. The distributed dynamic spectrum access method based on multi-agent reinforcement learning as claimed in claim 1, wherein monitoring access success rate of cognitive users in real time comprises:
outputting a reward value of the current spectrum access by utilizing a multi-agent reinforcement learning framework according to the spectrum access condition of the cognitive user;
and monitoring the access success rate of the cognitive user in real time according to the reward value.
5. The multi-agent reinforcement learning-based distributed dynamic spectrum access method as claimed in claim 4, wherein outputting the reward value of the current spectrum access by using the multi-agent reinforcement learning framework according to the spectrum access condition of the cognitive user comprises:
summing the numbers of successful accesses of all cognitive users to form the utility function of each cognitive user;
establishing a reward function in the multi-agent reinforcement learning framework according to the utility function;
and outputting the reward value of the current spectrum access by using the reward function according to the spectrum access condition of the cognitive user.
6. The multi-agent reinforcement learning-based distributed dynamic spectrum access method of claim 5, wherein the reward function is:
$r_t = u_t(o_1,\dots,o_N, a_1,\dots,a_N) = \sum_{n=1}^{N} c_n^t$

where $u_t$ denotes the utility function of all cognitive users at time t; $c_n^t$ denotes the number of successful accesses of cognitive user n at time t; $o_n$ denotes the observation of cognitive user n at time t; $a_n$ denotes the access action of cognitive user n at time t; and N denotes the total number of cognitive users.
7. The multi-agent reinforcement learning-based distributed dynamic spectrum access method of claim 2, wherein the policy network is a deep recurrent neural network structure.
8. A distributed dynamic spectrum access system based on multi-agent reinforcement learning, comprising:
the algorithm building module is used for modeling a multi-user distributed dynamic spectrum access problem into a multi-agent Markov cooperative game model and building a centralized training and distributed execution multi-agent reinforcement learning framework; the multi-agent reinforcement learning framework comprises an off-line training module and an on-line execution module;
the spectrum sensing module is used for acquiring local spectrum occupancy information according to the cognitive user's own narrowband sensing capability;
the spectrum access module is used for performing spectrum access for the cognitive user with the learned access policy, through the trained online execution module, according to the local spectrum occupancy information;
and the real-time monitoring module is used for monitoring the access success rate of the cognitive user in real time; when the success rate falls below a threshold, the offline training module retrains the online execution module so that it adapts to various communication environments.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111339165.8A 2021-11-12 2021-11-12 Distributed dynamic spectrum access method based on multi-agent reinforcement learning Pending CN113923794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111339165.8A CN113923794A (en) 2021-11-12 2021-11-12 Distributed dynamic spectrum access method based on multi-agent reinforcement learning


Publications (1)

Publication Number Publication Date
CN113923794A (en) 2022-01-11

Family

ID=79246180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111339165.8A Pending CN113923794A (en) 2021-11-12 2021-11-12 Distributed dynamic spectrum access method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113923794A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024032228A1 (en) * 2022-08-12 2024-02-15 华为技术有限公司 Reinforcement learning training method and related device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112383922A (en) * 2019-07-07 2021-02-19 东北大学秦皇岛分校 Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN112637965A (en) * 2020-12-30 2021-04-09 上海交通大学 Game-based Q learning competition window adjusting method, system and medium
CN113207127A (en) * 2021-04-27 2021-08-03 重庆邮电大学 Dynamic spectrum access method based on hierarchical deep reinforcement learning in NOMA system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112383922A (en) * 2019-07-07 2021-02-19 东北大学秦皇岛分校 Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN112637965A (en) * 2020-12-30 2021-04-09 上海交通大学 Game-based Q learning competition window adjusting method, system and medium
CN113207127A (en) * 2021-04-27 2021-08-03 重庆邮电大学 Dynamic spectrum access method based on hierarchical deep reinforcement learning in NOMA system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANG TAN et al.: "Cooperative Multi-Agent Reinforcement Learning Based Distributed Dynamic Spectrum Access in Cognitive Radio Networks", EI abstract database, 6 March 2021 (2021-03-06), pages 1-20 *
DONG Chunli; WANG Li: "Research on Spectrum Sensing and Access Algorithms for Cognitive Radio Networks", Wireless Internet Technology, no. 14, 25 July 2016 (2016-07-25) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024032228A1 (en) * 2022-08-12 2024-02-15 华为技术有限公司 Reinforcement learning training method and related device

Similar Documents

Publication Publication Date Title
Zheng et al. Stochastic game-theoretic spectrum access in distributed and dynamic environment
Pasandi et al. Mac protocol design optimization using deep learning
Yu et al. Multiagent learning of coordination in loosely coupled multiagent systems
Bkassiny et al. Distributed Reinforcement Learning based MAC protocols for autonomous cognitive secondary users
CN111262638B (en) Dynamic spectrum access method based on efficient sample learning
KR102206775B1 (en) Method for allocating resource using machine learning in a wireless network and recording medium for performing the method
van Dijk et al. Grounding subgoals in information transitions
CN113923794A (en) Distributed dynamic spectrum access method based on multi-agent reinforcement learning
Wang et al. Decentralized learning for channel allocation in IoT networks over unlicensed bandwidth as a contextual multi-player multi-armed bandit game
CN114615265A (en) Vehicle-mounted task unloading method based on deep reinforcement learning in edge computing environment
CN113613332B (en) Spectrum resource allocation method and system based on cooperative distributed DQN (differential signal quality network) joint simulated annealing algorithm
Luo et al. Evolutionary coalitional games for random access control
Barrachina-Muñoz et al. Multi-armed bandits for spectrum allocation in multi-agent channel bonding WLANs
Zou et al. Multi-agent reinforcement learning enabled link scheduling for next generation internet of things
CN114885422A (en) Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network
CN114173421B (en) LoRa logic channel based on deep reinforcement learning and power distribution method
Bulychev et al. Computing nash equilibrium in wireless ad hoc networks: A simulation-based approach
CN103686755A (en) On-line learning method capable of realizing optimal transmission for cognitive radio
Gafni et al. A distributed stable strategy learning algorithm for multi-user dynamic spectrum access
CN106936611A (en) A kind of method and device for predicting network state
Kovacs et al. Mixed observability Markov decision processes for overall network performance optimization in wireless sensor networks
Cong et al. Double deep recurrent reinforcement learning for centralized dynamic multichannel access
Lu et al. Dynamic channel access via meta-reinforcement learning
Meyer et al. QMA: A Resource-efficient, Q-learning-based Multiple Access Scheme for the IIoT
Cano et al. Wireless optimisation via convex bandits: unlicensed LTE/WiFi coexistence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination