CN113207129A - Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Info

Publication number
CN113207129A
Authority
CN
China
Prior art keywords
sue
channel
access
dynamic spectrum
algorithm
Prior art date
Legal status
Granted
Application number
CN202110506184.9A
Other languages
Chinese (zh)
Other versions
CN113207129B (en)
Inventor
申滨
颜廷秋
方广进
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110506184.9A priority Critical patent/CN113207129B/en
Publication of CN113207129A publication Critical patent/CN113207129A/en
Application granted granted Critical
Publication of CN113207129B publication Critical patent/CN113207129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 - Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/02 - Resource partitioning among network components, e.g. reuse partitioning
    • H04W 16/10 - Dynamic resource partitioning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B 17/00 - Monitoring; Testing
    • H04B 17/30 - Monitoring; Testing of propagation channels
    • H04B 17/373 - Predicting channel quality or other radio frequency [RF] parameters
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B 17/00 - Monitoring; Testing
    • H04B 17/30 - Monitoring; Testing of propagation channels
    • H04B 17/382 - Monitoring; Testing of propagation channels for resource allocation, admission control or handover
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm, and belongs to the field of wireless communication. The method specifically comprises the following steps: S1: constructing a distributed dynamic spectrum access system model; S2: constructing a cumulative expected reward function for the SUE; S3: obtaining an optimal access strategy according to the historical experience and the state-action pairs of the accessed channels, so as to obtain the maximum cumulative expected reward; S4: solving the access strategy by combining the DQN algorithm in deep reinforcement learning with the confidence interval upper bound algorithm, and obtaining the optimal access strategy through continuous iteration. Under the condition that the dynamic variation rule of the channels is unknown, the invention can obtain a dynamic spectrum access strategy approaching the optimal strategy achievable when the channel state transition rule is known.

Description

Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
Technical Field
The invention belongs to the field of wireless communication, and relates to a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm.
Background
In recent years, increasing the available spectrum resources has been regarded as one of the key means for future wireless communication networks to cope with the exponential growth of data traffic. However, radio spectrum is an expensive and scarce resource. The current shortage of radio spectrum makes it difficult for wireless operators to obtain sufficient proprietary licensed bands. On the other hand, experimental tests and investigations from academia and industry indicate that the static spectrum allocation policy results in insufficient utilization of the allocated licensed bands: the utilization of most bands is below 30%, and that of more than half of them is below 20%. These statistics reflect the fact that radio spectrum resources are under-utilized, which has prompted the industry to reconsider the current static spectrum allocation policy and to adopt dynamic spectrum access to promote efficient spectrum utilization.
In order to realize spectrum coexistence between cognitive users and primary users, various spectrum access strategies have been proposed, and they mainly fall into two spectrum access mechanisms. The first is Listen Before Talk (LBT), also known as the interweave scheme, in which a SUE can access a band only if it detects that the band is available. Although this scheme effectively avoids strong interference to the primary user, the opportunities for the SUE to access the shared band are quite limited. This is because, under LBT, spectrum access depends entirely on the current spectrum sensing result. In reality, due to the randomness of the wireless environment, limited or absent cooperation among cognitive users, and other practical factors, the sensing result may contain large errors. This leads to false alarms or missed detections of primary user activity, and hence to incorrect channel access decisions by the cognitive users. The second spectrum access scheme is spectrum sharing, also referred to as the underlay scheme. In this scheme, cognitive users coexist with primary users on a shared frequency band and adjust their transmit power levels such that the cumulative interference experienced at the primary users stays below a tolerable interference threshold. This scheme relies on the strong assumption that the channel state information between the transmitter of the cognitive user and the receiver of the primary user is already known for power control. In reality, however, it is often difficult to obtain such channel state information without a central controller. Even when a central controller exists, exchanging this channel state information may impose heavy control overhead on the underlying network, making the scheme difficult to implement in practice.
In summary, in view of the various defects and shortcomings of the conventional dynamic spectrum access, a new dynamic spectrum access method is needed to solve the above problems.
Disclosure of Invention
In view of the above, the present invention provides a dynamic spectrum access method based on the combination of a confidence interval upper bound algorithm and a Deep Reinforcement Learning (DRL) algorithm, which addresses the various defects and deficiencies of conventional dynamic spectrum access and, under the condition that the dynamic variation rule of the channels is unknown, approximately attains the optimal dynamic spectrum access strategy corresponding to the case where the channel state transition rule is known.
In order to achieve the purpose, the invention provides the following technical scheme:
a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm specifically comprises the following steps:
s1: constructing a distributed dynamic spectrum access system model;
s2: constructing a cumulative expected reward function for the Secondary User Equipment (SUE);
s3: according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t and the state-action pairs of the accessed channels, obtaining an optimal access strategy so as to obtain the maximum cumulative expected reward;
s4: and solving the access strategy by adopting a method of combining a DQN algorithm and a confidence interval upper bound algorithm in deep reinforcement learning, and obtaining the optimal access strategy through continuous iteration.
Further, in step S1, the constructed distributed dynamic spectrum access system model specifically includes: a primary user network consisting of N Primary Users (PUs) and a secondary user network consisting of L SUEs. Assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel to avoid interference between PUs. The operating state of a PU on its channel is either active (labeled 1) or idle (labeled 0), and the PUs communicate in their channels in a TDMA fashion. The state of a channel is determined by the state of its PU: occupied (0) or idle (1). The states of all channels are described by a discrete Markov model with 2^N states, whose state space is represented as: S = {s = (s_1, s_2, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}, where s_n = 0 or s_n = 1 represents the two possible states of each channel: occupied (0) or idle (1).
Further, in step S1, the state transition probability on a single channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j. Assuming that the channel is stationary, the transition matrix P_n is constant and time-independent.
Further, in step S1, assuming that each SUE has data to transmit, each SUE should select at least one channel to access for data transmission; the access action spaces of different SUEs are identical, so the action space of the l-th SUE is used as a general representative. The access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., n, ..., N}
where a_l(t) indicates the channel that the l-th SUE accesses to transmit data in time slot t. Suppose that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel. After the SUE accesses the n-th channel, three situations can occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU. Corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
Further, in step S1, the reward value is set to the value of the feedback signal, and the cumulative discounted reward obtained by the l-th SUE is expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action, and r_l(t) denotes the reward value obtained when the l-th SUE transmits successfully on the channel.
Further, in step S2, the constructed cumulative expected reward function of the SUE is expressed as:
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]
where o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, and L denotes the number of SUEs.
Further, in step S2, according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t, an access action is selected so as to obtain the maximum cumulative expected reward, whereby the optimal access strategy of the SUE is:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
further, in step S3, the method of combining the DQN algorithm and the confidence interval upper bound algorithm in the deep reinforcement learning is used to solve the access policy, which specifically includes: when the SUE takes action, the action is selected as
Figure BDA00030585413700000310
Wherein the content of the first and second substances,
Figure BDA00030585413700000311
indicating action before t time slot
Figure BDA00030585413700000312
The selected times, sigma, represent the uncertainty measure, control the degree of exploration;
Figure BDA00030585413700000313
showing the historical experience given by the ith SUE at the t time slot
Figure BDA00030585413700000314
Acting as a state
Figure BDA00030585413700000315
Is expressed as
Figure BDA00030585413700000316
The invention has the following beneficial effects: the invention can adapt to a dynamically changing cognitive radio environment. Specifically, with deep reinforcement learning, the spectrum access decision depends not only on the current spectrum sensing result but also on what has been learned from past spectrum states. In this way, the negative effect of traditional imperfect access methods on spectrum access performance can be greatly reduced. In addition, deep reinforcement learning enables the cognitive user equipment to obtain more accurate channel states and useful channel state prediction/statistical information, such as the behavior rules of the primary users. Spectrum access based on the invention can therefore also greatly reduce collisions between cognitive user equipment and primary users. Moreover, the exploration strategy based on the confidence interval upper bound accelerates the exploration and convergence of the deep reinforcement learning.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a diagram of a dynamic spectrum access scenario;
FIG. 2 is a state transition model of a channel;
fig. 3 is a flowchart of a dynamic spectrum access method based on a combination of confidence intervals and deep reinforcement learning.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1 to fig. 3, the implementation of the dynamic spectrum access method based on the combination of a confidence interval upper bound and deep reinforcement learning specifically includes the following four initial conditions and six main steps.
Initial condition 1:
the system model is a dynamic multi-channel access problem in a specific cell, and the structure of the system model is shown in fig. 1. In a dynamic multi-channel access scene, a primary user network composed of N PUs and a secondary user network composed of L SUEs are considered. Assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel to avoid interference between PUs; the SUE may find a free channel among the N channels for transmission at any time. Since the channel may not be accessed or a failed transmission may occur while accessing the channel, a feedback signal is needed between transceivers to flag whether the transmission was successful. Specifically, when the SUE receiver successfully receives a packet from a channel, it transmits a feedback signal to its corresponding transmitting end through the common control channel of the SUE system itself at the end of the slot. The operating state of a PU on a channel may be represented as both active (labeled 1) and inactive (labeled 0), and communicates in the channel in a TDMA fashion. The state of the channel is determined by the PU state: assuming that the PU on the channel n is active, the channel is in a busy state, and the state of the channel is 0; conversely, in the time slot t, if the nth channel is in the idle state, it is denoted as 1.
Initial condition 2:
the states of the channels conform to a discrete Markov model, and the state space of the N channels is represented as follows:
S = {s = (s_1, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}    (1)
where s_n = 0 or s_n = 1 represents the two states of each channel: occupied (0) and idle (1).
The state of each channel is described as a markov chain, and the state transition probability on the nth channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j, and the transition matrix P_n is constant and time-independent. Because a SUE can only access one channel at the beginning of each time slot and cannot observe the states of all channels, the dynamic multi-channel access problem considered here belongs to the class of partially observable Markov decision processes (POMDP), and the invention adopts a deep reinforcement learning method to solve it.
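As an illustration of initial condition 2, the following Python sketch simulates N independent two-state Markov channels; the function name, the number of channels, and the transition probabilities p_idle_to_busy and p_busy_to_idle are illustrative assumptions, not values taken from the invention.

```python
import numpy as np

def simulate_channels(num_channels, p_idle_to_busy, p_busy_to_idle, num_slots, rng):
    """Simulate N independent two-state Markov channels (state 0 = occupied by a PU, 1 = idle)."""
    states = rng.integers(0, 2, size=num_channels)           # random initial channel states
    history = np.empty((num_slots, num_channels), dtype=int)
    for t in range(num_slots):
        u = rng.random(num_channels)
        # idle -> occupied with probability p_idle_to_busy, occupied -> idle with probability p_busy_to_idle
        states = np.where(states == 1,
                          np.where(u < p_idle_to_busy, 0, 1),
                          np.where(u < p_busy_to_idle, 1, 0))
        history[t] = states
    return history

# Example usage with illustrative parameters
rng = np.random.default_rng(0)
slots = simulate_channels(num_channels=6, p_idle_to_busy=0.2, p_busy_to_idle=0.3,
                          num_slots=100, rng=rng)
print(slots[:5])  # first five time slots of the simulated channel occupancy
```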
Initial condition 3:
Assuming that each SUE has data to transmit, each SUE should select at least one channel to access for data transmission; the access action spaces of different SUEs are identical and are summarized by the action space of the l-th SUE. The access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., N}    (3)
where a_l(t) denotes the channel that the l-th SUE accesses to transmit data in time slot t. Suppose that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel. After the SUE accesses the n-th channel, three situations can occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU. Corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
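The following is a minimal sketch of the feedback logic of initial condition 3; the numeric feedback values (+1, 0, -1) and the helper name channel_feedback are illustrative assumptions, since the text only states that the three cases map to three distinct feedback values.

```python
def channel_feedback(channel, channel_is_idle, all_actions, my_index):
    """Feedback observed by the l-th SUE after accessing `channel` in the current slot.

    Assumed (illustrative) values: +1 successful transmission, 0 collision among SUEs,
    -1 interference caused to the primary user.
    """
    if not channel_is_idle[channel]:
        return -1.0   # case (3): the channel is occupied by a PU, the SUE interferes with it
    collided = any(a == channel for i, a in enumerate(all_actions) if i != my_index)
    if collided:
        return 0.0    # case (2): several SUEs access the same idle channel and collide
    return 1.0        # case (1): successful transmission

# Example: channels 0 and 2 idle, channel 1 busy, three SUEs choose channels [0, 0, 2]
print(channel_feedback(0, {0: True, 1: False, 2: True}, [0, 0, 2], my_index=0))  # 0.0 (collision)
print(channel_feedback(2, {0: True, 1: False, 2: True}, [0, 0, 2], my_index=2))  # 1.0 (success)
```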
Initial condition 4:
Suppose that, based on the historical experience o_l(t) at time slot t, the l-th SUE adopts strategy π_l(t) and accesses the n-th channel; the SUE transmitting end then receives, through the control channel, the feedback signal sent by the receiving end for the accessed n-th channel. Whether the data of the l-th SUE is transmitted successfully depends on the occupancy state of the channel by the PU and on the access actions of the other SUEs: if the channel is occupied by a PU, or another SUE accesses the same channel to transmit data, the data transmission of the l-th SUE fails. To characterize the transmission quality of the l-th SUE on the n-th channel in general terms, the reward for a successful transmission may be set to the transmission rate on the channel, e.g.
r_l(t) = B · log2(1 + SINR)
where B is the bandwidth of the n-th channel. To simplify the calculation process, in this embodiment the reward value is set to the value of the feedback signal. The cumulative discounted reward obtained by the l-th SUE may be expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action.
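The cumulative discounted reward of initial condition 4 can be computed, for a finite run, as in the short sketch below (purely illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward R_l = sum_t gamma^t * r_l(t) over a finite run."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. feedback sequence: success, collision, success, interference to a PU
print(discounted_return([1.0, 0.0, 1.0, -1.0]))  # 1 + 0 + 0.81 - 0.729 = 1.081
```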
Step 1:
The dynamic spectrum access policy is distributed, and access result information is not shared between SUEs. Each SUE has its own DQN network and makes channel access decisions independently. According to initial condition 4, the goal of each SUE is to find a strategy π suited to the current dynamic spectrum environment that prompts the SUE to take appropriate access actions, so that its own cumulative discounted reward is maximized. The strategy maps the observations of the historical time slots to the action of the next time slot, and the cumulative expected reward function of the l-th SUE may be expressed as
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]    (7)
where γ ∈ (0,1) represents the decay (discount) factor, a_l(t) is the action taken by the l-th SUE in time slot t, s denotes a state in reinforcement learning, and o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, including its access actions and the channel states it observed. The optimal access strategy formula derived from equation (7) is expressed as:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
step 2:
In addition to equation (5), the merit of a strategy can be measured by the state-action value function, i.e., the Q function. Under strategy π, the Q function of the l-th SUE is expressed as:
Q_{π_l}(s, a) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ s, a ]
where s and a represent the state and the action in reinforcement learning, respectively.
The access strategy of the l-th SUE in equation (6) can then be obtained by solving for the Q value:
π_l*(o_l(t)) = argmax_a Q_l(o_l(t), a)
The access strategy for dynamic spectrum access is distributed: because access results and historical experience are not shared among the SUEs, each SUE relies on its own deep reinforcement learning to decide which channel to access, while the strategy-solving procedure is the same for every SUE. It should be noted, however, that different SUEs may access the same channel and thereby interfere with each other. To avoid such collisions on the same channel, in the invention each SUE also has to learn the strategies of the other SUEs, which is done mainly through the differences in the reward values (i.e., the feedback signals).
Step 3:
The access strategy is solved by adopting a method that combines the DQN algorithm with the upper confidence bound (UCB) algorithm in deep reinforcement learning. Firstly, the variables of the learning process are initialized: ① the experience replay pool E is initialized with capacity D; ② the two networks in the DQN of the l-th SUE, namely the current network and the target network, denoted Q(o, a; θ) and Q(o, a; θ⁻) respectively, are initialized, with the weights of the current network set to θ and the weights of the target network set to θ⁻ = θ; ③ the initial learning rate is set to α = 10⁻⁴, the activation function in the neural network is ReLU, and the decay factor is γ = 0.9.
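A minimal PyTorch sketch of the initialization in step 3 might look as follows; the network width, the history length, the number of channels, and the replay-pool capacity are illustrative assumptions.

```python
import torch
import torch.nn as nn
from collections import deque

N_CHANNELS = 6        # illustrative number of channels / actions
HISTORY_LEN = 16      # illustrative length M of the observation history fed to the network

class QNetwork(nn.Module):
    """Maps the flattened historical experience o_l(t) to one Q value per channel."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return self.net(obs)

obs_dim = HISTORY_LEN * 2                       # e.g. one (action, feedback) pair per past slot
current_net = QNetwork(obs_dim, N_CHANNELS)     # weights theta
target_net = QNetwork(obs_dim, N_CHANNELS)      # weights theta-minus
target_net.load_state_dict(current_net.state_dict())   # theta-minus = theta

optimizer = torch.optim.Adam(current_net.parameters(), lr=1e-4)  # learning rate alpha = 1e-4
gamma = 0.9                                     # decay factor from step 3
replay_pool = deque(maxlen=10_000)              # experience replay pool E with capacity D (illustrative)
```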
Step 4:
In time slot t, the l-th SUE feeds its historical experience o_l(t) and the actions it has taken into the neural network as input, and the network outputs the Q values Q_l(o_l(t), a; θ) of all state-action pairs based on this state. When the l-th SUE takes an action according to the upper-confidence-bound policy in time slot t, the optimal action is expressed as:
a_l(t) = argmax_a [ Q_l(o_l(t), a; θ) + σ·sqrt( ln t / N_t(a) ) ]
where σ·sqrt(ln t / N_t(a)) denotes the confidence term, N_t(a) denotes the number of times the l-th SUE has selected action a before time slot t, σ is an uncertainty measure that controls the degree of exploration, and Q_l(o_l(t), a; θ) denotes the Q value of the l-th SUE for taking action a given the historical experience o_l(t) in time slot t.
The Q value is updated by the DQN + UCB method: the learning target of the l-th SUE is augmented with an exploration bonus, so that the Q value update of the l-th SUE takes the form
Q(o_l(t), a_l(t)) ← (1 − α) Q(o_l(t), a_l(t)) + α [ r_l(t) + γ·max_a' Q(o_l(t+1), a') + b_t ]
where b_t = c·sqrt( H³·ι / N_t ) is the exploration bonus that the algorithm assigns to the current state-action pair (o_l(t), a_l(t)), and N_t denotes the number of times this state-action pair has been selected before time slot t. Here c > 0 is a constant and H is the number of iteration steps per round; since in dynamic spectrum access one round consists of a single access (or non-access) decision, H is generally set to 1 (the H³ term only matters when a round contains many actions, for example in a maze scenario where one round runs from the start point to the end point). The most efficient exploration is obtained when ι = log(|S|·|A|·T/p), where p ∈ (0,1), |S| denotes the number of states, |A| the number of actions, and T the algorithm running time.
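A sketch of the UCB-guided action selection and of the exploration bonus described in step 4 is given below; the default values of sigma, c, H, and iota are placeholders for tunable parameters, and the helper names are assumptions.

```python
import math
import numpy as np
import torch

def ucb_action(q_net, obs, action_counts, t, sigma=1.0):
    """Select a channel by maximizing Q_l(o_l(t), a) + sigma * sqrt(ln t / N_t(a))."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    t = max(t, 2)                                              # keep ln t positive in the first slots
    bonus = sigma * np.sqrt(math.log(t) / np.maximum(action_counts, 1))
    return int(np.argmax(q_values + bonus))

def exploration_bonus(pair_count, c=1.0, H=1, iota=1.0):
    """Bonus b_t = c * sqrt(H^3 * iota / N_t) added to the learning target (step 4)."""
    return c * math.sqrt(H ** 3 * iota / max(pair_count, 1))
```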
In the process of interactive learning with environment, the access action is carried out
Figure BDA0003058541370000081
Historical experience as status
Figure BDA0003058541370000082
Awarding of prizes
Figure BDA0003058541370000083
And the new state generated
Figure BDA0003058541370000084
As training samples
Figure BDA0003058541370000085
And storing the training samples into an experience playback pool E, and deleting the old training samples when the number of the training samples in the experience playback pool is more than M. During the subsequent DQN training, samples can be selected randomly from the experience playback pool continuously and input into the neural network for training, so that the correlation among data is broken.
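A minimal sketch of the experience replay described above, reusing the replay_pool from the earlier initialization sketch; the batch size is an illustrative assumption.

```python
import random

def store_transition(replay_pool, obs, action, reward, next_obs):
    """Store one training sample (o_l(t), a_l(t), r_l(t), o_l(t+1)); the deque drops the oldest when full."""
    replay_pool.append((obs, action, reward, next_obs))

def sample_batch(replay_pool, batch_size=32):
    """Draw a random mini-batch to break the correlation between consecutive samples."""
    batch = random.sample(list(replay_pool), batch_size)
    obs, actions, rewards, next_obs = zip(*batch)
    return obs, actions, rewards, next_obs
```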
Step 5:
As can be seen from step 3, there are two neural networks in the DQN: one is the current network Q(o, a; θ), which produces the estimates of the cumulative discounted reward of all actions; the other is the target network Q(o, a; θ⁻), which generates the target values; the two networks have the same structure. In the DQN, the loss function is computed from the temporal-difference error, i.e., the loss function is expressed as:
L(θ) = E[ ( y_t − Q(o_l(t), a_l(t); θ) )² ]
where y_t = r_l(t) + γ·max_a' Q(o_l(t+1), a'; θ⁻) denotes the target value generated by the target network.
The weights θ in the loss function L(θ) are updated with the Adam method. Every T_s time slots, the target network is updated by setting θ⁻ = θ.
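One training step with the temporal-difference loss and the periodic target-network update of step 5 might look as follows; this reuses the networks and optimizer from the earlier sketches and is an illustrative sketch only, not the exact implementation of the invention.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_step(current_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on L(theta) = E[(y_t - Q(o, a; theta))^2] (temporal-difference loss)."""
    obs, actions, rewards, next_obs = batch
    obs = torch.as_tensor(np.array(obs), dtype=torch.float32)
    next_obs = torch.as_tensor(np.array(next_obs), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)

    q = current_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(o_l(t), a_l(t); theta)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_obs).max(dim=1).values   # target from theta-minus
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(current_net, target_net):
    """Every T_s time slots: theta-minus = theta."""
    target_net.load_state_dict(current_net.state_dict())
```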
Step 6: After a period of iterative learning, each SUE gradually obtains its own optimal access strategy π_l*.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (8)

1. A dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm is characterized by comprising the following steps:
s1: constructing a distributed dynamic spectrum access system model;
s2: constructing a cumulative expected reward function for the SUE;
s3: according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t and the state-action pairs of the accessed channels, obtaining an optimal access strategy so as to obtain the maximum cumulative expected reward;
s4: and solving the access strategy by adopting a method of combining a DQN algorithm and a confidence interval upper bound algorithm in deep reinforcement learning, and obtaining the optimal access strategy through continuous iteration.
2. The dynamic spectrum access method according to claim 1, wherein the distributed dynamic spectrum access system model constructed in step S1 specifically includes: a primary user network consisting of N PUs and a secondary user network consisting of L SUEs; assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel; the working state of a PU on its channel is represented as active or idle, marked as '1' and '0' respectively; the states of all channels are described by a discrete Markov model with 2^N states, whose state space is represented as: S = {s = (s_1, s_2, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}, where s_n = 0 or s_n = 1 represents the two states of each channel: occupied or idle.
3. The dynamic spectrum access method of claim 2, wherein in step S1, the state transition probability on a single channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j.
4. The dynamic spectrum access method according to claim 2, wherein in step S1, assuming that each SUE has data to transmit, each SUE accesses a channel, and the access action spaces of different SUEs are identical, represented generally by the action space of the l-th SUE; the access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., n, ..., N}
where a_l(t) indicates the channel that the l-th SUE accesses to transmit data in time slot t; supposing that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel; after the SUE accesses the n-th channel, three situations occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU; corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
5. The dynamic spectrum access method of claim 4, wherein in step S1, the reward value is set to the value of the feedback signal, and the cumulative discounted reward obtained by the l-th SUE is expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action, and r_l(t) denotes the reward value obtained when the l-th SUE transmits successfully on the channel.
6. The dynamic spectrum access method of claim 5, wherein in step S2, the constructed cumulative expected reward function of the SUE is expressed as:
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]
where o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, and L denotes the number of SUEs.
7. The dynamic spectrum access method of claim 6, wherein in step S2, according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t, an access action is selected so as to obtain the maximum cumulative expected reward, whereby the optimal access strategy of the SUE is:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
8. The dynamic spectrum access method according to claim 7, wherein in step S3, a method combining the DQN algorithm and the confidence interval upper bound algorithm in deep reinforcement learning is used to solve the access strategy, which specifically includes: when the SUE takes an action, the action is selected as
a_l(t) = argmax_a [ Q_l(o_l(t), a) + σ·sqrt( ln t / N_t(a) ) ]
where N_t(a) denotes the number of times action a has been selected before time slot t, σ is an uncertainty measure that controls the degree of exploration, and Q_l(o_l(t), a) denotes the Q value of the l-th SUE for taking action a in time slot t given the historical experience o_l(t).
CN202110506184.9A 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm Active CN113207129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110506184.9A CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110506184.9A CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Publications (2)

Publication Number Publication Date
CN113207129A true CN113207129A (en) 2021-08-03
CN113207129B CN113207129B (en) 2022-05-20

Family

ID=77030590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110506184.9A Active CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Country Status (1)

Country Link
CN (1) CN113207129B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102256262A (en) * 2011-07-14 2011-11-23 南京邮电大学 Multi-user dynamic spectrum accessing method based on distributed independent learning
CN103209419A (en) * 2013-04-25 2013-07-17 西安电子科技大学 User demand orientated dynamic spectrum accessing method capable of improving network performance
US20180098330A1 (en) * 2016-09-30 2018-04-05 Drexel University Adaptive Pursuit Learning Method To Mitigate Small-Cell Interference Through Directionality
CN108833040A (en) * 2018-06-22 2018-11-16 电子科技大学 Smart frequency spectrum cooperation perceptive method based on intensified learning
CN111654342A (en) * 2020-06-03 2020-09-11 中国人民解放军国防科技大学 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102256262A (en) * 2011-07-14 2011-11-23 南京邮电大学 Multi-user dynamic spectrum accessing method based on distributed independent learning
CN103209419A (en) * 2013-04-25 2013-07-17 西安电子科技大学 User demand orientated dynamic spectrum accessing method capable of improving network performance
US20180098330A1 (en) * 2016-09-30 2018-04-05 Drexel University Adaptive Pursuit Learning Method To Mitigate Small-Cell Interference Through Directionality
CN108833040A (en) * 2018-06-22 2018-11-16 电子科技大学 Smart frequency spectrum cooperation perceptive method based on intensified learning
CN111654342A (en) * 2020-06-03 2020-09-11 中国人民解放军国防科技大学 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN DAI: "Contextual Multi-Armed Bandit for Cache-Aware Decoupled Multiple Association in UDNs: A Deep Learning Approach", 《IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING》 *
YU ZHANG: "Multi-Agent Deep Reinforcement Learning-Based Cooperative Spectrum Sensing With Upper Confidence Bound Exploration", 《IEEE》 *
宁文丽: "基于强化学习的频谱感知策略研究", 《CNKI硕士期刊》 *
王董礼等: "基于UCB的短波认知信道选择算法", 《铁道学报》 *

Also Published As

Publication number Publication date
CN113207129B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
JP5274140B2 (en) Method for reducing inter-cell interference in a radio frequency division multiplexing network
CN111726811B (en) Slice resource allocation method and system for cognitive wireless network
CN112188503B (en) Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
US11777636B2 (en) Joint link-level and network-level intelligent system and method for dynamic spectrum anti-jamming
CN113423110B (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN110492955B (en) Spectrum prediction switching method based on transfer learning strategy
CN110891276A (en) Multi-user anti-interference channel access system and dynamic spectrum cooperative anti-interference method
CN112153744B (en) Physical layer security resource allocation method in ICV network
EP2566273A1 (en) Method for dynamically determining sensing time in cognitive radio network
CN116744311B (en) User group spectrum access method based on PER-DDQN
CN113207129B (en) Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
CN108449151B (en) Spectrum access method in cognitive radio network based on machine learning
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
KR101073294B1 (en) DYNAMIC FREQUENCY SELECTION SYSTEM AND METHOD BASED ON GENETIC ALGORITHM For COGNITIVE RADIO SYSTEM
CN114126021B (en) Power distribution method of green cognitive radio based on deep reinforcement learning
CN116709567A (en) Joint learning access method based on channel characteristics
Kaytaz et al. Distributed deep reinforcement learning with wideband sensing for dynamic spectrum access
CN113890653B (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115278896A (en) MIMO full duplex power distribution method based on intelligent antenna
CN112367131B (en) Jump type spectrum sensing method based on reinforcement learning
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
CN111866979B (en) Base station and channel dynamic allocation method based on multi-arm slot machine online learning mechanism
CN104660392A (en) Prediction based joint resource allocation method for cognitive OFDM (orthogonal frequency division multiplexing) network
CN113473419B (en) Method for accessing machine type communication device into cellular data network based on reinforcement learning
Chen et al. Dynamic Spectrum Access Scheme of Joint Power Control in Underlay Mode Based on Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant