CN113207129A - Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Info

Publication number
CN113207129A
Authority
CN
China
Prior art keywords
sue
channel
access
dynamic spectrum
algorithm
Prior art date
Legal status
Granted
Application number
CN202110506184.9A
Other languages
Chinese (zh)
Other versions
CN113207129B (en)
Inventor
申滨
颜廷秋
方广进
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110506184.9A priority Critical patent/CN113207129B/en
Publication of CN113207129A publication Critical patent/CN113207129A/en
Application granted granted Critical
Publication of CN113207129B publication Critical patent/CN113207129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 - Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/02 - Resource partitioning among network components, e.g. reuse partitioning
    • H04W 16/10 - Dynamic resource partitioning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B 17/00 - Monitoring; Testing
    • H04B 17/30 - Monitoring; Testing of propagation channels
    • H04B 17/373 - Predicting channel quality or other radio frequency [RF] parameters
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B 17/00 - Monitoring; Testing
    • H04B 17/30 - Monitoring; Testing of propagation channels
    • H04B 17/382 - Monitoring; Testing of propagation channels for resource allocation, admission control or handover
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm, and belongs to the field of wireless communication. The method specifically comprises the following steps: S1: constructing a distributed dynamic spectrum access system model; S2: constructing a cumulative expected reward function for the SUE; S3: obtaining an optimal access strategy according to the historical experience and the state-action pairs of the accessed channels, so as to obtain the maximum cumulative expected reward; S4: solving the access strategy by combining the DQN algorithm in deep reinforcement learning with the confidence interval upper bound algorithm, and obtaining the optimal access strategy through continuous iteration. Under the condition that the dynamic variation rule of the channels is unknown, the invention can obtain a dynamic spectrum access strategy approaching the optimal strategy achievable when the channel state transition rule is known.

Description

Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
Technical Field
The invention belongs to the field of wireless communication, and relates to a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm.
Background
In recent years, increasing the available spectrum resources has been regarded as one of the key means for future wireless communication networks to cope with the exponential growth of data traffic. However, radio spectrum is an expensive and scarce resource. The current shortage of radio spectrum makes it difficult for wireless operators to obtain sufficient proprietary licensed bands. On the other hand, experimental tests and investigations from academia and industry indicate that the static spectrum allocation policy results in insufficient utilization of the allocated licensed bands: the utilization of most bands is below 30%, and that of more than half of them is below 20%. These statistics reflect the fact that radio spectrum resources are under-utilized, which has prompted the industry to reconsider the current static spectrum allocation policy and to adopt dynamic spectrum access to promote efficient spectrum utilization.
In order to realize spectrum coexistence between cognitive users and primary users, various spectrum access strategies have been proposed, and they mainly fall into two spectrum access mechanisms. The first is Listen Before Talk (LBT), also known as the interweave scheme, in which a SUE can access a band only if it detects that the band is available. Although this scheme effectively avoids strong interference to the primary user, the opportunities for the SUE to access the shared band are quite limited. This is because, under LBT, spectrum access depends entirely on the current spectrum sensing result. In reality, due to the randomness of the wireless environment, limited or absent cooperation among cognitive users, and other practical factors, the sensing result may contain large errors. This leads to false alarms or missed detections of primary user activity, and hence to incorrect channel access decisions by the cognitive users. The second spectrum access scheme is spectrum sharing, also referred to as the underlay scheme. In this scheme, cognitive users coexist with primary users on a shared frequency band and adjust their transmit power levels such that the cumulative interference experienced at the primary users stays below a tolerable interference threshold. This scheme relies on the strong assumption that the channel state information between the transmitter of the cognitive user and the receiver of the primary user is already known for power control. In reality, however, it is often difficult to obtain such channel state information without a central controller. Even when a central controller exists, exchanging this channel state information may impose heavy control overhead on the underlying network, making the scheme difficult to implement in practice.
In summary, in view of the various defects and shortcomings of the conventional dynamic spectrum access, a new dynamic spectrum access method is needed to solve the above problems.
Disclosure of Invention
In view of the above, the present invention provides a dynamic spectrum access method based on the combination of a confidence interval upper bound algorithm and a Deep Reinforcement Learning (DRL) algorithm, which addresses the various defects and deficiencies of conventional dynamic spectrum access and, under the condition that the dynamic variation rule of the channels is unknown, approximately attains the optimal dynamic spectrum access strategy corresponding to the case where the channel state transition rule is known.
In order to achieve the purpose, the invention provides the following technical scheme:
a dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm specifically comprises the following steps:
s1: constructing a distributed dynamic spectrum access system model;
s2: constructing a cumulative expected reward function for the Secondary User Equipment (SUE);
s3: according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t and the state-action pairs of the accessed channels, obtaining an optimal access strategy so as to obtain the maximum cumulative expected reward;
s4: and solving the access strategy by adopting a method of combining a DQN algorithm and a confidence interval upper bound algorithm in deep reinforcement learning, and obtaining the optimal access strategy through continuous iteration.
Further, in step S1, the constructed distributed dynamic spectrum access system model specifically includes: a primary user network consisting of N Primary Users (PUs) and a secondary user network consisting of L SUEs. Assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel to avoid interference between PUs. The operating state of a PU on its channel is either active (labeled 1) or idle (labeled 0), and the PUs communicate in their channels in a TDMA fashion. The state of a channel is determined by the state of its PU: occupied (0) or idle (1). The states of all channels are described by a discrete Markov model with 2^N states, whose state space is represented as: S = {s = (s_1, s_2, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}, where s_n = 0 or s_n = 1 represents the two possible states of each channel: occupied (0) or idle (1).
Further, in step S1, the state transition probability on a single channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j. Assuming that the channel is stationary, the transition matrix P_n is constant and time-independent.
Further, in step S1, assuming that each SUE has data to transmit, each SUE should select at least one channel to access for data transmission; the access action spaces of different SUEs are identical, so the action space of the l-th SUE is used as a general representative. The access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., n, ..., N}
where a_l(t) indicates the channel that the l-th SUE accesses to transmit data in time slot t. Suppose that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel. After the SUE accesses the n-th channel, three situations can occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU. Corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
Further, in step S1, the reward value is set to the value of the feedback signal, and the cumulative discounted reward obtained by the l-th SUE is expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action, and r_l(t) denotes the reward value obtained when the l-th SUE transmits successfully on the channel.
Further, in step S2, the constructed cumulative expected reward function of the SUE is expressed as:
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]
where o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, and L denotes the number of SUEs.
Further, in step S2, according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t, an access action is selected so as to obtain the maximum cumulative expected reward, whereby the optimal access strategy of the SUE is:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
further, in step S3, the method of combining the DQN algorithm and the confidence interval upper bound algorithm in the deep reinforcement learning is used to solve the access policy, which specifically includes: when the SUE takes action, the action is selected as
Figure BDA00030585413700000310
Wherein the content of the first and second substances,
Figure BDA00030585413700000311
indicating action before t time slot
Figure BDA00030585413700000312
The selected times, sigma, represent the uncertainty measure, control the degree of exploration;
Figure BDA00030585413700000313
showing the historical experience given by the ith SUE at the t time slot
Figure BDA00030585413700000314
Acting as a state
Figure BDA00030585413700000315
Is expressed as
Figure BDA00030585413700000316
The invention has the following beneficial effects: the invention can adapt to a dynamically changing cognitive radio environment. Specifically, with deep reinforcement learning, the spectrum access decision depends not only on the current spectrum sensing result but also on what has been learned from past spectrum states. In this way, the negative effect of traditional imperfect access methods on spectrum access performance can be greatly reduced. In addition, deep reinforcement learning enables the cognitive user equipment to obtain more accurate channel states and useful channel state prediction/statistical information, such as the behavior rules of the primary users. Spectrum access based on the invention can therefore also greatly reduce collisions between cognitive user equipment and primary users. Moreover, the exploration strategy based on the confidence interval upper bound accelerates the exploration and convergence of the deep reinforcement learning.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a diagram of a dynamic spectrum access scenario;
FIG. 2 is a state transition model of a channel;
fig. 3 is a flowchart of a dynamic spectrum access method based on a combination of confidence intervals and deep reinforcement learning.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1 to fig. 3, the implementation of the dynamic spectrum access method based on the combination of a confidence interval upper bound and deep reinforcement learning specifically includes the following four initial conditions and six main steps.
Initial condition 1:
the system model is a dynamic multi-channel access problem in a specific cell, and the structure of the system model is shown in fig. 1. In a dynamic multi-channel access scene, a primary user network composed of N PUs and a secondary user network composed of L SUEs are considered. Assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel to avoid interference between PUs; the SUE may find a free channel among the N channels for transmission at any time. Since the channel may not be accessed or a failed transmission may occur while accessing the channel, a feedback signal is needed between transceivers to flag whether the transmission was successful. Specifically, when the SUE receiver successfully receives a packet from a channel, it transmits a feedback signal to its corresponding transmitting end through the common control channel of the SUE system itself at the end of the slot. The operating state of a PU on a channel may be represented as both active (labeled 1) and inactive (labeled 0), and communicates in the channel in a TDMA fashion. The state of the channel is determined by the PU state: assuming that the PU on the channel n is active, the channel is in a busy state, and the state of the channel is 0; conversely, in the time slot t, if the nth channel is in the idle state, it is denoted as 1.
Initial condition 2:
the states of the channels conform to a discrete Markov model, and the state space of the N channels is represented as follows:
S = {s = (s_1, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}    (1)
where s_n = 0 or s_n = 1 represents the two states of each channel: occupied (0) and idle (1).
The state of each channel is described as a markov chain, and the state transition probability on the nth channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j, and the transition matrix P_n is constant and time-independent. Because a SUE can only access one channel at the beginning of each time slot and cannot observe the states of all channels, the dynamic multi-channel access problem considered here belongs to the class of partially observable Markov decision processes (POMDP), and the invention adopts a deep reinforcement learning method to solve it.
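As an illustration of initial condition 2, the following Python sketch simulates N independent two-state Markov channels; the function name, the number of channels, and the transition probabilities p_idle_to_busy and p_busy_to_idle are illustrative assumptions, not values taken from the invention.

```python
import numpy as np

def simulate_channels(num_channels, p_idle_to_busy, p_busy_to_idle, num_slots, rng):
    """Simulate N independent two-state Markov channels (state 0 = occupied by a PU, 1 = idle)."""
    states = rng.integers(0, 2, size=num_channels)           # random initial channel states
    history = np.empty((num_slots, num_channels), dtype=int)
    for t in range(num_slots):
        u = rng.random(num_channels)
        # idle -> occupied with probability p_idle_to_busy, occupied -> idle with probability p_busy_to_idle
        states = np.where(states == 1,
                          np.where(u < p_idle_to_busy, 0, 1),
                          np.where(u < p_busy_to_idle, 1, 0))
        history[t] = states
    return history

# Example usage with illustrative parameters
rng = np.random.default_rng(0)
slots = simulate_channels(num_channels=6, p_idle_to_busy=0.2, p_busy_to_idle=0.3,
                          num_slots=100, rng=rng)
print(slots[:5])  # first five time slots of the simulated channel occupancy
```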
Initial condition 3:
Assuming that each SUE has data to transmit, each SUE should select at least one channel to access for data transmission; the access action spaces of different SUEs are identical and are summarized by the action space of the l-th SUE. The access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., N}    (3)
where a_l(t) denotes the channel that the l-th SUE accesses to transmit data in time slot t. Suppose that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel. After the SUE accesses the n-th channel, three situations can occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU. Corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
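The following is a minimal sketch of the feedback logic of initial condition 3; the numeric feedback values (+1, 0, -1) and the helper name channel_feedback are illustrative assumptions, since the text only states that the three cases map to three distinct feedback values.

```python
def channel_feedback(channel, channel_is_idle, all_actions, my_index):
    """Feedback observed by the l-th SUE after accessing `channel` in the current slot.

    Assumed (illustrative) values: +1 successful transmission, 0 collision among SUEs,
    -1 interference caused to the primary user.
    """
    if not channel_is_idle[channel]:
        return -1.0   # case (3): the channel is occupied by a PU, the SUE interferes with it
    collided = any(a == channel for i, a in enumerate(all_actions) if i != my_index)
    if collided:
        return 0.0    # case (2): several SUEs access the same idle channel and collide
    return 1.0        # case (1): successful transmission

# Example: channels 0 and 2 idle, channel 1 busy, three SUEs choose channels [0, 0, 2]
print(channel_feedback(0, {0: True, 1: False, 2: True}, [0, 0, 2], my_index=0))  # 0.0 (collision)
print(channel_feedback(2, {0: True, 1: False, 2: True}, [0, 0, 2], my_index=2))  # 1.0 (success)
```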
Initial condition 4:
Suppose that, based on the historical experience o_l(t) at time slot t, the l-th SUE adopts strategy π_l(t) and accesses the n-th channel; the SUE transmitting end then receives, through the control channel, the feedback signal sent by the receiving end for the accessed n-th channel. Whether the data of the l-th SUE is transmitted successfully depends on the occupancy state of the channel by the PU and on the access actions of the other SUEs: if the channel is occupied by a PU, or another SUE accesses the same channel to transmit data, the data transmission of the l-th SUE fails. To characterize the transmission quality of the l-th SUE on the n-th channel in general terms, the reward for a successful transmission may be set to the transmission rate on the channel, e.g.
r_l(t) = B · log2(1 + SINR)
where B is the bandwidth of the n-th channel. To simplify the calculation process, in this embodiment the reward value is set to the value of the feedback signal. The cumulative discounted reward obtained by the l-th SUE may be expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action.
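The cumulative discounted reward of initial condition 4 can be computed, for a finite run, as in the short sketch below (purely illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward R_l = sum_t gamma^t * r_l(t) over a finite run."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. feedback sequence: success, collision, success, interference to a PU
print(discounted_return([1.0, 0.0, 1.0, -1.0]))  # 1 + 0 + 0.81 - 0.729 = 1.081
```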
Step 1:
The dynamic spectrum access policy is distributed, and access result information is not shared between SUEs. Each SUE has its own DQN network and makes channel access decisions independently. According to initial condition 4, the goal of each SUE is to find a strategy π suited to the current dynamic spectrum environment that prompts the SUE to take appropriate access actions, so that its own cumulative discounted reward is maximized. The strategy maps the observations of the historical time slots to the action of the next time slot, and the cumulative expected reward function of the l-th SUE may be expressed as
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]    (7)
where γ ∈ (0,1) represents the decay (discount) factor, a_l(t) is the action taken by the l-th SUE in time slot t, s denotes a state in reinforcement learning, and o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, including its access actions and the channel states it observed. The optimal access strategy formula derived from equation (7) is expressed as:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
step 2:
In addition to equation (5), the merit of a strategy can be measured by the state-action value function, i.e., the Q function. Under strategy π, the Q function of the l-th SUE is expressed as:
Q_{π_l}(s, a) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ s, a ]
where s and a represent the state and the action in reinforcement learning, respectively.
The access strategy of the l-th SUE in equation (6) can then be obtained by solving for the Q value:
π_l*(o_l(t)) = argmax_a Q_l(o_l(t), a)
The access strategy for dynamic spectrum access is distributed: because access results and historical experience are not shared among the SUEs, each SUE relies on its own deep reinforcement learning to decide which channel to access, while the strategy-solving procedure is the same for every SUE. It should be noted, however, that different SUEs may access the same channel and thereby interfere with each other. To avoid such collisions on the same channel, in the invention each SUE also has to learn the strategies of the other SUEs, which is done mainly through the differences in the reward values (i.e., the feedback signals).
Step 3:
The access strategy is solved by adopting a method that combines the DQN algorithm with the upper confidence bound (UCB) algorithm in deep reinforcement learning. Firstly, the variables of the learning process are initialized: ① the experience replay pool E is initialized with capacity D; ② the two networks in the DQN of the l-th SUE, namely the current network and the target network, denoted Q(o, a; θ) and Q(o, a; θ⁻) respectively, are initialized, with the weights of the current network set to θ and the weights of the target network set to θ⁻ = θ; ③ the initial learning rate is set to α = 10⁻⁴, the activation function in the neural network is ReLU, and the decay factor is γ = 0.9.
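A minimal PyTorch sketch of the initialization in step 3 might look as follows; the network width, the history length, the number of channels, and the replay-pool capacity are illustrative assumptions.

```python
import torch
import torch.nn as nn
from collections import deque

N_CHANNELS = 6        # illustrative number of channels / actions
HISTORY_LEN = 16      # illustrative length M of the observation history fed to the network

class QNetwork(nn.Module):
    """Maps the flattened historical experience o_l(t) to one Q value per channel."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return self.net(obs)

obs_dim = HISTORY_LEN * 2                       # e.g. one (action, feedback) pair per past slot
current_net = QNetwork(obs_dim, N_CHANNELS)     # weights theta
target_net = QNetwork(obs_dim, N_CHANNELS)      # weights theta-minus
target_net.load_state_dict(current_net.state_dict())   # theta-minus = theta

optimizer = torch.optim.Adam(current_net.parameters(), lr=1e-4)  # learning rate alpha = 1e-4
gamma = 0.9                                     # decay factor from step 3
replay_pool = deque(maxlen=10_000)              # experience replay pool E with capacity D (illustrative)
```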
Step 4:
In time slot t, the l-th SUE feeds its historical experience o_l(t) and the actions it has taken into the neural network as input, and the network outputs the Q values Q_l(o_l(t), a; θ) of all state-action pairs based on this state. When the l-th SUE takes an action according to the upper-confidence-bound policy in time slot t, the optimal action is expressed as:
a_l(t) = argmax_a [ Q_l(o_l(t), a; θ) + σ·sqrt( ln t / N_t(a) ) ]
where σ·sqrt(ln t / N_t(a)) denotes the confidence term, N_t(a) denotes the number of times the l-th SUE has selected action a before time slot t, σ is an uncertainty measure that controls the degree of exploration, and Q_l(o_l(t), a; θ) denotes the Q value of the l-th SUE for taking action a given the historical experience o_l(t) in time slot t.
The Q value is updated by the DQN + UCB method: the learning target of the l-th SUE is augmented with an exploration bonus, so that the Q value update of the l-th SUE takes the form
Q(o_l(t), a_l(t)) ← (1 − α) Q(o_l(t), a_l(t)) + α [ r_l(t) + γ·max_a' Q(o_l(t+1), a') + b_t ]
where b_t = c·sqrt( H³·ι / N_t ) is the exploration bonus that the algorithm assigns to the current state-action pair (o_l(t), a_l(t)), and N_t denotes the number of times this state-action pair has been selected before time slot t. Here c > 0 is a constant and H is the number of iteration steps per round; since in dynamic spectrum access one round consists of a single access (or non-access) decision, H is generally set to 1 (the H³ term only matters when a round contains many actions, for example in a maze scenario where one round runs from the start point to the end point). The most efficient exploration is obtained when ι = log(|S|·|A|·T/p), where p ∈ (0,1), |S| denotes the number of states, |A| the number of actions, and T the algorithm running time.
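A sketch of the UCB-guided action selection and of the exploration bonus described in step 4 is given below; the default values of sigma, c, H, and iota are placeholders for tunable parameters, and the helper names are assumptions.

```python
import math
import numpy as np
import torch

def ucb_action(q_net, obs, action_counts, t, sigma=1.0):
    """Select a channel by maximizing Q_l(o_l(t), a) + sigma * sqrt(ln t / N_t(a))."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    t = max(t, 2)                                              # keep ln t positive in the first slots
    bonus = sigma * np.sqrt(math.log(t) / np.maximum(action_counts, 1))
    return int(np.argmax(q_values + bonus))

def exploration_bonus(pair_count, c=1.0, H=1, iota=1.0):
    """Bonus b_t = c * sqrt(H^3 * iota / N_t) added to the learning target (step 4)."""
    return c * math.sqrt(H ** 3 * iota / max(pair_count, 1))
```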
In the process of interactive learning with environment, the access action is carried out
Figure BDA0003058541370000081
Historical experience as status
Figure BDA0003058541370000082
Awarding of prizes
Figure BDA0003058541370000083
And the new state generated
Figure BDA0003058541370000084
As training samples
Figure BDA0003058541370000085
And storing the training samples into an experience playback pool E, and deleting the old training samples when the number of the training samples in the experience playback pool is more than M. During the subsequent DQN training, samples can be selected randomly from the experience playback pool continuously and input into the neural network for training, so that the correlation among data is broken.
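A minimal sketch of the experience replay described above, reusing the replay_pool from the earlier initialization sketch; the batch size is an illustrative assumption.

```python
import random

def store_transition(replay_pool, obs, action, reward, next_obs):
    """Store one training sample (o_l(t), a_l(t), r_l(t), o_l(t+1)); the deque drops the oldest when full."""
    replay_pool.append((obs, action, reward, next_obs))

def sample_batch(replay_pool, batch_size=32):
    """Draw a random mini-batch to break the correlation between consecutive samples."""
    batch = random.sample(list(replay_pool), batch_size)
    obs, actions, rewards, next_obs = zip(*batch)
    return obs, actions, rewards, next_obs
```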
Step 5:
As can be seen from step 3, there are two neural networks in the DQN: one is the current network Q(o, a; θ), which produces the estimates of the cumulative discounted reward of all actions; the other is the target network Q(o, a; θ⁻), which generates the target values; the two networks have the same structure. In the DQN, the loss function is computed from the temporal-difference error, i.e., the loss function is expressed as:
L(θ) = E[ ( y_t − Q(o_l(t), a_l(t); θ) )² ]
where y_t = r_l(t) + γ·max_a' Q(o_l(t+1), a'; θ⁻) denotes the target value generated by the target network.
The weights θ in the loss function L(θ) are updated with the Adam method. Every T_s time slots, the target network is updated by setting θ⁻ = θ.
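One training step with the temporal-difference loss and the periodic target-network update of step 5 might look as follows; this reuses the networks and optimizer from the earlier sketches and is an illustrative sketch only, not the exact implementation of the invention.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_step(current_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on L(theta) = E[(y_t - Q(o, a; theta))^2] (temporal-difference loss)."""
    obs, actions, rewards, next_obs = batch
    obs = torch.as_tensor(np.array(obs), dtype=torch.float32)
    next_obs = torch.as_tensor(np.array(next_obs), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)

    q = current_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(o_l(t), a_l(t); theta)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_obs).max(dim=1).values   # target from theta-minus
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(current_net, target_net):
    """Every T_s time slots: theta-minus = theta."""
    target_net.load_state_dict(current_net.state_dict())
```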
Step 6: After a period of iterative learning, each SUE gradually obtains its own optimal access strategy π_l*.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (8)

1. A dynamic spectrum access method based on a confidence interval upper bound algorithm and a DRL algorithm is characterized by comprising the following steps:
s1: constructing a distributed dynamic spectrum access system model;
s2: constructing a cumulative expected reward function for the SUE;
s3: according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t and the state-action pairs of the accessed channels, obtaining an optimal access strategy so as to obtain the maximum cumulative expected reward;
s4: and solving the access strategy by adopting a method of combining a DQN algorithm and a confidence interval upper bound algorithm in deep reinforcement learning, and obtaining the optimal access strategy through continuous iteration.
2. The dynamic spectrum access method according to claim 1, wherein the distributed dynamic spectrum access system model constructed in step S1 specifically includes: a primary user network consisting of N PUs and a secondary user network consisting of L SUEs; assuming that there are N orthogonal channels, each PU transmits on a unique wireless channel; the working state of a PU on its channel is represented as active or idle, marked as '1' and '0' respectively; the states of all channels are described by a discrete Markov model with 2^N states, whose state space is represented as: S = {s = (s_1, s_2, ..., s_n, ..., s_N) ∣ s_n = 0 or 1, n = 1, 2, ..., N}, where s_n = 0 or s_n = 1 represents the two states of each channel: occupied or idle.
3. The dynamic spectrum access method of claim 2, wherein in step S1, the state transition probability on a single channel is expressed as:
P_n = [ p_00, p_01 ; p_10, p_11 ]
where p_ij represents the probability of transitioning from state i to state j.
4. The dynamic spectrum access method according to claim 2, wherein in step S1, assuming that each SUE has data to transmit, each SUE accesses a channel, and the access action spaces of different SUEs are identical, represented generally by the action space of the l-th SUE; the access action of the l-th SUE in time slot t is represented as:
a_l(t) ∈ {1, 2, ..., n, ..., N}
where a_l(t) indicates the channel that the l-th SUE accesses to transmit data in time slot t; supposing that, after the SUE accesses the n-th channel in time slot t, the SUE transmitting end receives, through the control channel, the feedback sent by the receiving end for the accessed n-th channel; after the SUE accesses the n-th channel, three situations occur: (1) the SUE transmits successfully; (2) several SUEs collide and interfere with each other; (3) the SUE creates interference to a PU; corresponding to these three cases, the feedback signal is set to three distinct values, one per case.
5. The dynamic spectrum access method of claim 4, wherein in step S1, the reward value is set to the value of the feedback signal, and the cumulative discounted reward obtained by the l-th SUE is expressed as:
R_l = Σ_{t=0}^{∞} γ^t · r_l(t)
where 0 ≤ γ ≤ 1 is the discount factor, representing the influence of future rewards on the current action, and r_l(t) denotes the reward value obtained when the l-th SUE transmits successfully on the channel.
6. The dynamic spectrum access method of claim 5, wherein in step S2, the constructed cumulative expected reward function of the SUE is expressed as:
V_{π_l}(o_l(t)) = E_{π_l}[ Σ_{k=0}^{∞} γ^k · r_l(t+k) ∣ o_l(t) ]
where o_l(t) denotes the historical experience of the l-th SUE in the M time slots before time slot t, and L denotes the number of SUEs.
7. The dynamic spectrum access method of claim 6, wherein in step S2, according to the historical experience o_l(t) of the l-th SUE in the M time slots before time slot t, an access action is selected so as to obtain the maximum cumulative expected reward, whereby the optimal access strategy of the SUE is:
π_l* = argmax_{π_l} V_{π_l}(o_l(t))
8. The dynamic spectrum access method according to claim 7, wherein in step S3, a method combining the DQN algorithm and the confidence interval upper bound algorithm in deep reinforcement learning is used to solve the access strategy, which specifically includes: when the SUE takes an action, the action is selected as
a_l(t) = argmax_a [ Q_l(o_l(t), a) + σ·sqrt( ln t / N_t(a) ) ]
where N_t(a) denotes the number of times action a has been selected before time slot t, σ is an uncertainty measure that controls the degree of exploration, and Q_l(o_l(t), a) denotes the Q value of the l-th SUE for taking action a in time slot t given the historical experience o_l(t).
CN202110506184.9A 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm Active CN113207129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110506184.9A CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110506184.9A CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Publications (2)

Publication Number Publication Date
CN113207129A true CN113207129A (en) 2021-08-03
CN113207129B CN113207129B (en) 2022-05-20

Family

ID=77030590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110506184.9A Active CN113207129B (en) 2021-05-10 2021-05-10 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm

Country Status (1)

Country Link
CN (1) CN113207129B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102256262A (en) * 2011-07-14 2011-11-23 南京邮电大学 Multi-user dynamic spectrum accessing method based on distributed independent learning
CN103209419A (en) * 2013-04-25 2013-07-17 西安电子科技大学 User demand orientated dynamic spectrum accessing method capable of improving network performance
US20180098330A1 (en) * 2016-09-30 2018-04-05 Drexel University Adaptive Pursuit Learning Method To Mitigate Small-Cell Interference Through Directionality
CN108833040A (en) * 2018-06-22 2018-11-16 电子科技大学 Smart frequency spectrum cooperation perceptive method based on intensified learning
CN111654342A (en) * 2020-06-03 2020-09-11 中国人民解放军国防科技大学 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102256262A (en) * 2011-07-14 2011-11-23 南京邮电大学 Multi-user dynamic spectrum accessing method based on distributed independent learning
CN103209419A (en) * 2013-04-25 2013-07-17 西安电子科技大学 User demand orientated dynamic spectrum accessing method capable of improving network performance
US20180098330A1 (en) * 2016-09-30 2018-04-05 Drexel University Adaptive Pursuit Learning Method To Mitigate Small-Cell Interference Through Directionality
CN108833040A (en) * 2018-06-22 2018-11-16 电子科技大学 Smart frequency spectrum cooperation perceptive method based on intensified learning
CN111654342A (en) * 2020-06-03 2020-09-11 中国人民解放军国防科技大学 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN DAI: "Contextual Multi-Armed Bandit for Cache-Aware Decoupled Multiple Association in UDNs: A Deep Learning Approach", 《IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING》 *
YU ZHANG: "Multi-Agent Deep Reinforcement Learning-Based Cooperative Spectrum Sensing With Upper Confidence Bound Exploration", 《IEEE》 *
宁文丽: "基于强化学习的频谱感知策略研究", 《CNKI硕士期刊》 *
王董礼等: "基于UCB的短波认知信道选择算法", 《铁道学报》 *

Also Published As

Publication number Publication date
CN113207129B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
JP5274140B2 (en) Method for reducing inter-cell interference in a radio frequency division multiplexing network
CN111726811B (en) Slice resource allocation method and system for cognitive wireless network
CN112188503B (en) Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
US11777636B2 (en) Joint link-level and network-level intelligent system and method for dynamic spectrum anti-jamming
CN113423110B (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN110492955B (en) Spectrum prediction switching method based on transfer learning strategy
CN110891276A (en) Multi-user anti-interference channel access system and dynamic spectrum cooperative anti-interference method
CN112153744B (en) Physical layer security resource allocation method in ICV network
EP2566273A1 (en) Method for dynamically determining sensing time in cognitive radio network
CN116744311B (en) User group spectrum access method based on PER-DDQN
CN113207129B (en) Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
CN108449151B (en) Spectrum access method in cognitive radio network based on machine learning
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
KR101073294B1 (en) DYNAMIC FREQUENCY SELECTION SYSTEM AND METHOD BASED ON GENETIC ALGORITHM For COGNITIVE RADIO SYSTEM
CN114126021B (en) Power distribution method of green cognitive radio based on deep reinforcement learning
CN116709567A (en) Joint learning access method based on channel characteristics
Kaytaz et al. Distributed deep reinforcement learning with wideband sensing for dynamic spectrum access
CN113890653B (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115278896A (en) MIMO full duplex power distribution method based on intelligent antenna
CN112367131B (en) Jump type spectrum sensing method based on reinforcement learning
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
CN111866979B (en) Base station and channel dynamic allocation method based on multi-arm slot machine online learning mechanism
CN104660392A (en) Prediction based joint resource allocation method for cognitive OFDM (orthogonal frequency division multiplexing) network
CN113473419B (en) Method for accessing machine type communication device into cellular data network based on reinforcement learning
Chen et al. Dynamic Spectrum Access Scheme of Joint Power Control in Underlay Mode Based on Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant