CN115190489A - Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning - Google Patents

Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning

Info

Publication number
CN115190489A
CN115190489A (application CN202210796138.1A)
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
spectrum access
dynamic spectrum
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210796138.1A
Other languages
Chinese (zh)
Inventor
刘洋
赵鑫
张秋彤
宋凯鹏
龙旭东
那顺乌力吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202210796138.1A priority Critical patent/CN115190489A/en
Publication of CN115190489A publication Critical patent/CN115190489A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/14Spectrum sharing arrangements between different networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453Resources in frequency domain, e.g. a carrier in FDMA

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application provides a cognitive radio network dynamic spectrum access method based on deep reinforcement learning, which comprises the following steps: modeling and analyzing a dynamic spectrum access problem proposed in advance; pre-constructing a double deep reinforcement learning network model; obtaining, according to the first deep reinforcement learning network model, the estimated Q values of all dynamic spectrum access actions of secondary users based on the dynamic spectrum access strategy in the system model; each secondary user selecting its dynamic spectrum access action in the state with the optimal Q value; selecting the target Q value of the second deep reinforcement learning network model according to the selected dynamic spectrum access action of the secondary user; and calculating a loss function, training the double deep reinforcement learning network model by minimizing the loss function, and updating the weights of the double deep reinforcement learning network model. The method meets the high computational demand of the large state-action space in a multi-user multi-channel cognitive wireless network, predicts the actual state from past observations, accelerates convergence and improves prediction accuracy.

Description

Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning
Technical Field
The application relates to the technical field of computers, in particular to a cognitive wireless network dynamic spectrum access method based on deep reinforcement learning.
Background
With the rapid development of wireless communication, applications are moving toward a world in which everything is interconnected. 4G communications occupy only frequency bands below 10 GHz, while 5G has started to use the millimeter-wave band, and 6G radio is expected to allocate and serve channel bandwidths at least five times larger than 5G in order to support new services such as continuously increasing data rates, higher reliability requirements, sensing and positioning. Meanwhile, on the road to 6G, next-generation radios are expected to target higher bandwidth occupancy, currently in excess of 100 GHz. However, the utilization of spectrum resources by modern communication systems faces two challenges. On the one hand, spectrum resources are scarce: they are limited and non-renewable. On the other hand, the existing radio spectrum suffers from an unbalanced distribution of resources and traffic; a large portion of the radio spectrum is still underutilized, while another portion of the spectrum resources carries excessive traffic and is used under crowded conditions.
In 1999, Mitola and Maguire proposed the new concept of Cognitive Radio (CR) based on software radio. Its main principle is opportunistic spectrum access: an unlicensed user (also referred to as a Secondary User, SU, or cognitive user) first performs spectrum sensing and then opportunistically accesses an idle frequency band that is temporarily rarely used or even unused by the Primary User (PU), i.e., a spectrum hole. Therefore, how to improve the secondary user's utilization of the primary user's frequency band without affecting the primary user's communication has become a key problem of spectrum access.
Spectrum access mainly comprises two modes: static spectrum access and dynamic spectrum access. Static spectrum access means that a communication system can only work on a frequency or frequency band pre-allocated by the frequency management authority; its advantages are standardized management and high operational reliability. However, because wireless spectrum resources are very scarce, the static access mode may lead to spectrum under-utilization. Dynamic Spectrum Access (DSA) technology has been proposed for a wide range of communication systems to improve spectrum efficiency within a limited frequency band, because it can select a transmission channel according to the different service requirements of users and thereby meet the bandwidth and quality-of-service requirements of various special applications. In dynamic spectrum access, once the SU detects that the PU has reclaimed the frequency band, the channel should be vacated quickly. The key problem of dynamic spectrum access is therefore how to guarantee the secondary user's dynamic access to spectrum holes without affecting the primary user's communication. This places higher demands on the speed of spectrum access decisions. Meanwhile, the primary user's communication should remain resistant to interference.
At present, economics-based dynamic spectrum access methods regard the dynamic spectrum access process as a spectrum trading process and complete the SU's access to the PU's spectrum according to a trading strategy. Game theory is currently a popular approach: the competition among SUs for network resources is modeled as a non-cooperative game, and game theory serves as a mathematical framework for scenario-based analysis and modeling of cognitive wireless networks. The anti-jamming channel selection problem in dynamic spectrum access has been formulated as an anti-jamming dynamic game in a cognitive wireless network, in which the set of active users changes with their specific traffic demands. A new incentive architecture based on dynamic radio-frequency charging technology has also been proposed to improve spectral efficiency, with the problem formulated using Stackelberg game theory. In bidding-auction theory, the primary user plays the role of the spectrum seller and the secondary user plays the role of the spectrum buyer. To overcome the shortcomings of conventional mechanisms, a new pricing-strategy-based framework has been proposed for managing a multi-winner auction mechanism. A general framework for unlicensed spectrum resource management based on blockchain technology and smart contracts has also been presented. However, most game-theoretic and auction-based methods tend to rely on the availability of spectrum statistics to formulate policies and cope with dynamic changes of the spectrum. Since such information is not available a priori, the applicability of these methods is limited.
Current graph-theory-based dynamic spectrum access methods represent the competition of primary and secondary users for frequency bands by a conflict graph, in which each node represents an SU and an edge represents the interference between the nodes sharing that edge. Access decisions are made by comprehensively considering factors such as bandwidth, network overhead, preference, connectivity probability and signal-to-noise ratio, using the k-nearest-neighbor method and the analytic hierarchy process. Under greedy forwarding, a common energy-saving transmission range and transmission time limit are sought for all cognitive radio network nodes and network deployments, so as to minimize the receiver's energy consumption. However, when conventional graph-theoretic methods are applied to a modern communication network with many nodes and large amounts of data, the complex relationships and the huge number of nodes make it impossible to compute a decision quickly enough for practical applications, so the method is greatly limited.
Therefore, the above conventional dynamic spectrum access methods cannot be used effectively under today's dense spectrum usage. Furthermore, one limitation of conventional DSA (dynamic spectrum access) techniques is that they require a priori network information (e.g., the accessibility probability of each channel in each time slot), which is often unknown or difficult to obtain in practice. Machine Learning (ML) based methods have been introduced into the DSA domain because of their ability to adapt to dynamic, unknown environments. Specifically, with machine learning, spectrum access is determined not only by the current spectrum sensing result but also by what has been learned from past spectrum states. In this way, the negative effects of imperfect spectrum sensing can be greatly mitigated. In addition, machine learning enables DSA devices to obtain accurate channel states and useful channel state prediction/statistical information, such as the behavior of PUs and the load of other SUs, so machine-learning-based spectrum access can greatly reduce collisions between SUs and PUs. As an important branch of ML, reinforcement learning is characterized by a learning network that interacts with a changing and uncertain environment to acquire knowledge, which gives it excellent performance in dealing with dynamic systems. Q-learning, a commonly used RL (reinforcement learning) method, replaces building a dynamics model of the Markov decision process by directly estimating the Q value of each action in each state; the Q value estimates the expected cumulative discounted reward, and a policy can then be implemented by selecting the action with the highest Q value in each state. A coexistence scenario has been proposed in which multiple Licensed Assisted Access (LAA) and Wi-Fi links compete for spectrum-sharing sub-channel access. A reinforcement-learning-based sub-channel selection technique allows access points and eNBs to select the best sub-channel in a distributed manner, taking into account the medium access control channel access protocol and physical-layer parameters. A DDQN-based multi-objective ant colony algorithm and a greedy-algorithm-based optimization method have also been proposed to reduce the interference between SUs and PUs and thereby improve the network performance of IoT-based cognitive wireless networks. To address the impact of unbalanced resource allocation, a Q-learning algorithm has been proposed that maximizes network utility according to the total parameters of network slice requests under communication and computation constraints.
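As a concrete illustration of the Q-learning scheme described above, the following minimal sketch applies the standard tabular update rule to a toy channel-selection setting; the state/action sizes and the ε-greedy parameters are illustrative assumptions, not values from this application.

```python
import numpy as np

n_states, n_actions = 4, 6            # illustrative: coarse channel states x candidate channels
Q = np.zeros((n_states, n_actions))   # Q table: one value per (state, action)
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def select_action(s: int) -> int:
    # epsilon-greedy: usually pick the action with the highest Q value in state s
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```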
However, since future wireless networks are complex and large-scale, RL (reinforcement learning) cannot effectively handle high-dimensional input state spaces, and deep reinforcement learning has been developed to solve complex spectrum access decision tasks over large state spaces. Deep reinforcement learning combines deep learning with RL, and the agent learns a policy and maximizes its reward while interacting with the environment. A deep reinforcement learning model has been proposed that realizes real-time spectrum allocation through mixed-integer nonlinear programming. A spectrum resource management scheme for the industrial Internet of Things based on a single-agent Deep Q Network (DQN) has also been proposed, aiming to realize spectrum sharing among different types of users. A DRL-based distributed dynamic spectrum access method has been proposed that searches for the optimal solution of the DSA problem under a large state space with local observation information. These distributed learning approaches allow devices to make spectrum access decisions based on their own observations without a central controller, so they have great potential for finding efficient solutions for real-time services. A network-driven deep distributed Q-network has been proposed for allocating radio resources to diverse services in fifth-generation mobile communication (5G) networks. In addition, a DRL-based multiple-access protocol has been proposed that learns an optimal spectrum access strategy while taking service fairness into account. To maximize the quality of experience of edge nodes and prolong node battery life with an adaptive compression scheme, a distributed dynamic network selection framework based on multi-agent deep reinforcement learning has been proposed. To learn a channel access strategy with a low collision rate and high channel utilization, a distributed dynamic spectrum access algorithm based on the Deep Recurrent Q Network (DRQN) has been proposed; DRQN shows impressive empirical performance in spectrum access tasks with partial observability or noisy state observations. However, the above work does not investigate how to solve the large-scale spectrum resource allocation problem. Moreover, most works sacrifice convergence speed to improve access accuracy and do not consider strict reliability and delay constraints in the optimization problem.
Therefore, meeting the high computational demand caused by the large state-action space in a multi-user multi-channel cognitive wireless network, and better exploiting the deep neural network's ability, within a deep reinforcement learning algorithm, to predict the real state from past observations so as to accelerate convergence and improve prediction accuracy, is a technical problem that urgently needs to be solved.
Disclosure of Invention
The application aims to provide a cognitive wireless network dynamic spectrum access method based on deep reinforcement learning that meets the high computational demand of the large state-action space in a multi-user multi-channel cognitive wireless network, can predict the real state from past observations, accelerates convergence and improves prediction accuracy.
In order to achieve the above object, the present application provides a cognitive radio network dynamic spectrum access method based on deep reinforcement learning, including:
in a system model, modeling and analyzing a dynamic spectrum access problem proposed in advance to obtain a dynamic spectrum access strategy;
constructing a double-depth reinforcement learning network model in advance;
initializing parameters of a first deep reinforcement learning network model and a second deep reinforcement learning network model in the double-deep reinforcement learning network model;
according to the first deep reinforcement learning network model, obtaining estimated Q values of all dynamic spectrum access actions of secondary users based on a dynamic spectrum access strategy in the system model;
each secondary user selects the dynamic spectrum access action of the secondary user in the state of the optimal Q value according to the estimated Q value;
selecting a target Q value of a second deep reinforcement learning network model according to the selected dynamic spectrum access action of the secondary user;
and calculating a loss function, training the double-depth reinforcement learning network model by minimizing the loss function, and updating the weight of the double-depth reinforcement learning network model.
The system model is a multi-user multi-channel cognitive wireless network comprising a primary network and a secondary network, wherein the primary network consists of M primary users and the secondary network consists of N secondary users.
The above, wherein the double deep reinforcement learning network model comprises an input layer, an echo state network layer, a priority experience replay deep Q network layer and an output layer.
As above, wherein the echo state network layer comprises an input layer, a reserve pool (reservoir) layer and an output layer connected in sequence; the output vector O(t) of the output layer is represented as:

O(t) = W_out · x(t);

wherein the output vector O(t) is a vector of dimension 2M, each element of O(t) corresponding to an estimated Q value for each secondary user's selection; W_out is the output weight from the reservoir layer to the output layer, and x(t) is the reservoir state vector.
As above, wherein the weight update of the double deep reinforcement learning network model is represented as:

θ_{t+1} = θ_t + α_t · [ r_{t+1} + γ·Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ') − Q(s_t, a_t; θ_t) ] · ∇_{θ_t} Q(s_t, a_t; θ_t);

wherein θ_{t+1} represents the weight of the double deep reinforcement learning network at time t+1; θ_t represents the weight of the double deep reinforcement learning network at time t; s_t represents the state at time t; s_{t+1} represents the state at time t+1; a_t represents the action at time t; γ ∈ [0,1] is the discount factor; θ represents the weight of DQN1; θ' represents the weight of DQN2; argmax_{a'} Q(s_{t+1}, a'; θ) denotes the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; ∇ is the gradient operator; Q(·) denotes the Q value; and α_t is a step-size parameter.
As above, wherein the target Q value of the double deep reinforcement learning network is expressed as:

y_t = r_{t+1} + γ·Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ');

wherein r_{t+1} represents the revenue at time t+1; argmax_{a'} Q(s_{t+1}, a'; θ) denotes the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; s_{t+1} represents the state at time t+1; θ represents the weight of the first deep reinforcement learning network model; and θ' represents the weight of the second deep reinforcement learning network model.
As above, wherein the loss function of the double deep reinforcement learning network is defined as:

L(θ) = E[ ( r_{t+1} + γ·Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ') − Q(s_t, a_t; θ) )² ];

wherein L(θ) represents the loss function; the expectation of the squared error E[(·)²] represents the mean square error; r_{t+1} represents the revenue at time t+1; γ ∈ [0,1] is the discount factor; and Q(·) denotes the Q value.
The above, wherein importance sampling is performed on the weights of the double deep reinforcement learning network; the correction is calculated as:

w_j = ( N·P(j) )^(−β) / max_i w_i;

wherein w_i represents the weight before correction; w_j represents the corrected weight; P(j) represents the probability of sampling experience j; N represents the sample size; and the parameter β represents the correction rate.
As above, wherein the goal of each secondary user is to find a dynamic spectrum access policy σ_i that maximizes its expected cumulative discounted revenue:

σ_i* = argmax_{σ_i} E[ R_i ];

wherein σ_i* denotes the policy σ_i that yields the maximum expected cumulative discounted revenue, and R_i represents the cumulative discounted revenue of the ith secondary user:

R_i = Σ_{t=0}^{T} γ^t · r_i^t;

wherein γ ∈ [0,1] is the discount factor, T is the time range of the entire channel access procedure, and r_i^t represents the revenue function of the ith SU.
The above, wherein the parameters for initializing the first deep reinforcement learning network model and the second deep reinforcement learning network model in the double deep reinforcement learning network model comprise the state s_t, the action a_t, a weight update interval W, an experience replay buffer D of capacity |D|, and the weights.
The beneficial effects realized by the application are as follows:
(1) The application provides a multi-user multi-channel cognitive wireless network dynamic spectrum access algorithm based on PER-DESQN. In classical deep reinforcement learning, the complex structure of the deep neural network reduces the convergence rate, so an ESN (echo state network), which exploits the underlying temporal correlation, is adopted as the Q network to predict the estimated Q value; this greatly reduces the amount of training computation and shortens the convergence time. Meanwhile, to solve the Q-value over-estimation problem of the DQN network, a DDQN is adopted so that Q-value estimation and action selection are trained by two separate networks, avoiding training both with the same Q network and improving prediction accuracy.
(2) To solve the Q-value instability caused by random sampling from the experience replay buffer in the DDQN algorithm, the application proposes a Sum-Tree-based priority experience replay mechanism combined with the importance sampling principle to sample experiences from the experience pool according to their priority, thereby improving the stability and access accuracy of the algorithm. Simulation experiments show that the PER-DESQN-based multi-user multi-channel cognitive wireless network dynamic spectrum access algorithm can make fast and accurate dynamic spectrum access decisions and significantly increases the transmission rate of the system.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to these drawings.
Fig. 1 is a flowchart of a cognitive radio network dynamic spectrum access method based on deep reinforcement learning according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a system model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, the present application provides a cognitive radio network dynamic spectrum access method based on deep reinforcement learning, which includes:
s1, building a system model of the multi-user multi-channel cognitive wireless network.
Specifically, a system model and a channel model of the multi-user multi-channel cognitive wireless network are built.
As shown in fig. 2, the system model of the multi-user multi-channel cognitive wireless network includes a primary network and a secondary network; the primary network is composed of M Primary Users (PUs), the secondary network is composed of N Secondary Users (SUs), (1) in the figure represents a desired signal link, and (2) represents an interfering signal link. Assuming that each PU is assigned a separate radio channel, cross-channel interference is negligible. Interference occurs only when the primary user and one or more secondary users use the same radio channel at the same time.
Denote the position coordinates of the SU_i transmitter, the SU_i receiver, the PU_j transmitter and the PU_j receiver as (x_i^t, y_i^t), (x_i^r, y_i^r), (x_j^t, y_j^t) and (x_j^r, y_j^r), respectively, where i ∈ {1, 2, …, N} and j ∈ {1, 2, …, M}. Thus, the propagation distances of the desired signal link and the interfering signal link are defined as:

d_ii = √( (x_i^t − x_i^r)² + (y_i^t − y_i^r)² );    (1)
d_ji = √( (x_j^t − x_i^r)² + (y_j^t − y_i^r)² );

wherein d_ii represents the propagation distance of the desired signal link and d_ji represents the propagation distance of the interfering signal link.
According to the propagation distance of the desired signal link, the path loss PL(d, f_c) of the desired signal is calculated as a function of the propagation distance d and the carrier frequency f_c (formula (2)); wherein PL(d, f_c) represents the path loss of the desired signal; f_c [GHz] represents the carrier frequency of the wireless channel; A_W represents the path-loss exponent; B_W represents the frequency-dependence value of the path loss; and d [m] represents the propagation distance of the desired signal link. The path loss PL(d_ji, f_c) of the interfering signal is calculated in the same way as formula (2).
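Since formula (2) is given here only through its parameters, the following sketch shows one plausible WINNER-II-style log-distance realization of such a path-loss term; the functional form and the default coefficient values are assumptions for illustration, not the patent's exact formula.

```python
import math

def path_loss_db(d_m: float, fc_ghz: float,
                 A_W: float = 22.7, B_W: float = 41.0, C_W: float = 20.0) -> float:
    """Illustrative log-distance path loss in dB: A_W*log10(d) + B_W + C_W*log10(fc/5).

    A_W plays the role of the path-loss exponent term and the frequency-dependent
    part corresponds to B_W in the text; the coefficient values are placeholders.
    """
    return A_W * math.log10(d_m) + B_W + C_W * math.log10(fc_ghz / 5.0)
```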
Suppose there is a strong Line-of-Sight (LoS) path between the transmitter and the receiver. The channel model is represented as:

h = σ·( √( K/(K+1) )·e^{j2πθ} + √( 1/(K+1) )·g );    (3)

wherein h represents the channel coefficient; σ represents the path loss factor, determined by the path loss; K is the K-factor, representing the ratio of the received signal power of the LoS path to that of the scattered paths; θ is the phase of the arriving signal on the LoS path, with θ ~ U(0,1), i.e., it takes values from a uniform distribution between 0 and 1; g ~ CN(0,1) is a circularly symmetric complex Gaussian random variable; j denotes the imaginary unit; and e denotes the base of the natural logarithm.
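A small sketch of drawing one channel coefficient according to this Rician LoS-plus-scatter model follows; the specific normalization of the CN(0,1) term and the use of NumPy are illustrative assumptions consistent with formula (3).

```python
import numpy as np

def rician_channel(sigma: float, K: float, rng=np.random.default_rng()) -> complex:
    theta = rng.uniform(0.0, 1.0)                                 # LoS phase ~ U(0,1)
    los = np.sqrt(K / (K + 1.0)) * np.exp(1j * 2 * np.pi * theta)
    scatter = np.sqrt(1.0 / (K + 1.0)) * (
        rng.normal(0.0, np.sqrt(0.5)) + 1j * rng.normal(0.0, np.sqrt(0.5)))  # g ~ CN(0,1)
    return sigma * (los + scatter)
```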
A discrete-time model is set, i.e., the behavior of the users and the changes of the wireless environment in the system are restricted to occur in discrete time slots t (t is a natural number). The Signal to Interference plus Noise Ratio (SINR) is used as the quality metric for wireless communication: the higher the SINR value, the better the quality of the wireless connection.
The SINR when secondary user i communicates on channel b in time slot t is expressed as:

SINR_i^b(t) = p_ib·|h_ii(t)|² / ( Σ_j p_jb·|h_ji(t)|² + N_b );    (4)

wherein SINR_i^b(t) represents the signal to interference plus noise ratio of secondary user i when communicating on channel b in time slot t; p_jb represents the transmit power of primary user PU_j on the b-th wireless channel; p_ib represents the transmit power of secondary user SU_i on the b-th wireless channel; |h_ii(t)|² is the channel gain of the desired link of user i in time slot t; |h_ji(t)|² is the channel gain of the interfering link between the transmitter of user j and the receiver of user i in time slot t; and N_b is the background noise power on the b-th wireless channel. The desired link is the link between the transmitter and the receiver of the same user. An interfering link is a link between the transmitter and the receiver of two different users transmitting on the same channel at the same time.
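To make formula (4) concrete, here is a minimal sketch of the per-channel SINR computation: desired received power divided by the sum of interfering received powers plus background noise. The function and variable names are illustrative.

```python
import numpy as np

def sinr(p_desired: float, h_desired: complex,
         p_interferers: np.ndarray, h_interferers: np.ndarray,
         noise_power: float) -> float:
    signal = p_desired * abs(h_desired) ** 2                          # p_ib * |h_ii|^2
    interference = float(np.sum(p_interferers * np.abs(h_interferers) ** 2))
    return signal / (interference + noise_power)                      # formula (4)
```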
And S2, modeling and analyzing the pre-proposed dynamic spectrum access problem to obtain a PER-DESQN-based dynamic spectrum access strategy.
In the system model, a Secondary User (SU) perceives at most one channel in each time slot t and learns its channel state. Since the user's observations of the environment are incomplete in every slot, the Dynamic Spectrum Access (DSA) problem can be expressed as a POMDP model, with the goal of predicting the channel state based on previous decisions and observations.
A basic POMDP model is defined by a 6-tuple (S, A, p, r, Ω, O), wherein S is a finite state set, A is a finite action set, p is the transition probability from state s to state s' after action a is performed, r is the immediate reward obtained after action a is performed, and Ω and O are defined as the observation set and the observation probability set, respectively. In each time slot, the agent is in a state s, selects an action a according to its belief b(s) about the current state s, observes the immediate reward r and the current observation probability o, and then makes the next decision.
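As a purely illustrative data-structure sketch, the 6-tuple above could be held as follows; in the actual problem the transition and observation probabilities are unknown to the secondary users, so this container only serves to fix notation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    states: Sequence              # finite state set S
    actions: Sequence             # finite action set A
    transition: Callable          # p(s' | s, a): transition probability
    reward: Callable              # r(s, a): immediate reward
    observations: Sequence        # observation set Omega
    observation_prob: Callable    # O(o | s', a): observation probability
```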
The state-action space and the observation/revenue function of the POMDP model are set as follows. At the beginning of each time slot, each Secondary User (SU) performs spectrum sensing on all M channels to detect their states. To protect the primary network, the application assumes that when secondary user SU_i transmits on channel b, a warning signal is broadcast whenever the SINR (signal to interference plus noise ratio) at the receiving end falls below a set threshold. There are two possible reasons for a low SINR. First, the wireless connection of the desired link is in a deep fade, i.e., a spectrum hole exists but is not successfully accessed. Second, when one or more secondary users access the same wireless channel at the same time to transmit signals, they cause strong interference to the PU, i.e., a collision, which also means the access fails.
The early-warning signal is taken as the observed channel state information, i.e., the sensing result at time slot t is:

Φ_i(t) = [φ_i1(t), φ_i2(t), …, φ_iM(t)];    (5)

wherein Φ_i(t) is the sensing vector of the ith SU and φ_ib(t) is the perceived state of the ith SU on the b-th channel. However, the activity of a PU includes two states, active and idle: if a PU is transmitting data, it is in the active state; otherwise, it is in the idle state. When the licensed PU of a channel is in the idle state, a spectrum hole appears on that channel, and any SU can transmit on the channel with little interference to the licensed PU. However, in a highly dynamic 5G network it is difficult for the SU to perceive the activity state of the PU in time, and its sensing accuracy is affected by the wireless link between the PU and the SU's transmitter, the background noise, and the PU's transmit power; on the other hand, the degree of interference caused by the SU is determined by the interfering link from the SU to the PU, the desired link of the PU, the transmit powers of the PU and the SU, and the background noise. Furthermore, all of these factors that determine the spectrum access opportunity change over time, so the state information quickly becomes outdated. Since acquiring state information is costly in a 5G mobile wireless network, it is not practical to design a spectrum access policy that assumes all sensed state information is accurate. Therefore, φ_ib(t) may contain errors. Let the sensing error probability of the ith SU on the b-th channel be E_ib, defined as:

E_ib = Pr( φ_ib(t) ≠ s_b(t) );    (6)

wherein s_b(t) is the true state of the b-th channel, i.e., E_ib is the probability that the perceived state of the ith SU on the b-th channel differs from the true state of the b-th channel. The transition probabilities and sensing error probabilities of these channels are unknown. The only information known to the ith SU is its sensing result Φ_i(t) at time slot t, i.e., the channel state observed from the environment, which is the input to the deep reinforcement learning network.
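A minimal sketch of generating such error-prone sensing results follows: each channel's true busy/idle state is flipped with its (in practice unknown) sensing-error probability E_ib. The binary state encoding and NumPy usage are illustrative assumptions.

```python
import numpy as np

def sense_channels(true_state: np.ndarray, error_prob: np.ndarray,
                   rng=np.random.default_rng()) -> np.ndarray:
    """true_state: 0/1 per channel (idle/active); error_prob: E_ib per channel."""
    flips = rng.random(true_state.shape) < error_prob     # where a sensing error occurs
    return np.where(flips, 1 - true_state, true_state)    # flip the perceived state there
```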
After spectrum sensing, each SU decides, according to its sensing result, either to access at most one channel or to remain idle. The decision action of the ith SU is defined as:

a_i(t) ∈ {0, 1, 2, …, M};    (7)

wherein a_i(t) = b (1 ≤ b ≤ M) indicates that the ith SU accesses the b-th channel in time slot t, a_i(t) = 0 indicates that the ith SU does not access any channel in time slot t, and M indicates the total number of channels, which is equal to the number of primary users.
The local observation value is denoted as o_t ∈ {0,1}. The DRL (deep reinforcement learning) agent learns from the history of previous actions and observations [a_{t−1}, o_{t−1}, …, a_{t−M}, o_{t−M}]. When the SU accesses a channel that is not currently used by the PU or other SUs, no interference is generated and the spectrum access succeeds, i.e., o_t = 1; the achievable data rate log2(1 + SINR/Γ) is set as the revenue. When the SU accesses a channel currently occupied by a PU, or two or more SUs access the same channel simultaneously, the SU collides with the PU or with another SU, i.e., o_t = 0; a negative revenue −C (C > 0) is set as the result of receiving the warning signal. Thus, the revenue function r_i^t of the ith SU takes the value log2(1 + SINR_i^b(t)/Γ) when the access succeeds and −C when the warning signal is received (formula (8)).
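The following sketch mirrors that revenue rule: the achievable rate on a successful access, a fixed penalty −C when a warning signal is received, and (as an assumption for the idle case, which formula (8) does not spell out here) zero revenue when the SU does not transmit. Γ is written as gamma_gap.

```python
import math

def revenue(accessed: bool, warning: bool, sinr: float,
            C: float = 1.0, gamma_gap: float = 1.0) -> float:
    if not accessed:
        return 0.0                                # assumed value for the idle case
    if warning:
        return -C                                 # collision / interference to the PU
    return math.log2(1.0 + sinr / gamma_gap)      # achievable data rate on success
```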
because the spectrum access strategy is distributed, the sensing result and the access decision information are not shared between the SUs. Each SU determines channel access independently with its DQN (deep Q network, a kind of deep reinforcement learning), and the only input of each SU's DQN is the sensing result obtained by its sensor. The SU is also unaware of the transition probabilities and perceptual error probabilities of the channel states. The SU makes a next access decision through an SINR (signal to interference and noise ratio) received after the channel access, the access strategy can improve the accumulated discount income of the SU to the maximum extent, and the calculation method of the accumulated discount income comprises the following steps:
Figure BDA0003735919610000102
wherein R is i Represents the ith Secondary User (SU)Accumulated discount revenue of (a); gamma is an element of [0,1 ]]For the discounting factor, T is the time range of the entire channel access procedure;
Figure BDA0003735919610000103
the revenue function for the ith SU is shown. Thus, the goal of each secondary user is to find a dynamic spectrum access policy σ i Maximizing its expected cumulative discounted revenue:
Figure BDA0003735919610000104
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003735919610000105
strategy σ representing the acquisition of the maximum cumulative discount yield i ,R i Represents the cumulative discount revenue of the ith secondary user, again because the revenue function is set to log 2 (1 + SINR/gamma), thus maximizing the accumulated discount gain while maximizing the channel capacity, and further improving the data transmission rate.
And S3, constructing a double deep reinforcement learning network model in advance. The double deep reinforcement learning network model comprises a first deep reinforcement learning network model and a second deep reinforcement learning network model.
The first deep reinforcement learning network (DQN1) model and the second deep reinforcement learning network (DQN2) model each comprise an input layer, an echo state network layer, a priority experience replay deep Q network layer and an output layer.
(1) An input layer:
the input of the setting input layer is the sensing result at one time slot t: an NxM matrix containing N most recently occurring actions and observed historical state information
Figure BDA0003735919610000106
The matrix inputs each row vector to the input layer of DQN in turn, which has M nodes in total. If a node selects a channel for transmission in the last time slot, the first row of the input matrix isA vector of size M, where the ith element is 1 or-1 and the remaining elements are 0.
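A sketch of how such an input matrix could be assembled from the N most recent (action, observation) pairs is shown below; the exact encoding (+1 for a successful access, −1 for a failed one, 0 otherwise) is an assumption consistent with the description above.

```python
import numpy as np

def history_matrix(history, n_rows: int, n_channels: int) -> np.ndarray:
    """history: list of (channel index or None, observation in {0, 1}), newest first."""
    H = np.zeros((n_rows, n_channels))
    for row, (channel, obs) in enumerate(history[:n_rows]):
        if channel is not None:                       # the SU transmitted in that slot
            H[row, channel] = 1.0 if obs == 1 else -1.0
    return H
```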
(2) Echo State Network (ESN) layer:
wherein, the Echo State Network (ESN) layer comprises an input layer, a reserve pool layer and an output layer which are connected in sequence, and the Echo State Network (ESN) layer is used for training the output weight (W) from the reserve pool layer to the output layer out ) The process of (2). An ESN (echo state network) is used as a deep neural network in a deep reinforcement learning framework to quickly adapt to the environment. The ESN (echo state network) only trains the output weight by keeping the input weight and the recursion weight fixed, thereby greatly simplifying the training process of the deep neural network.
Taking into account historical state information
Figure BDA0003735919610000111
The output O (t) is predicted over the reservoir network x (t). And the x (t) reservoir network is formed by 64 sparse random neurons with the average degree d
Figure BDA00037359196100001110
And (4) establishing. To satisfy W rec The spectrum radius (the radius of the spectrum of the internal connection weight of the reserve pool layer) of (A) is a given value rho, W rec Is scaled and scaled from [ -1,1 ] to]Independently and uniformly selected. The change in the reservoir state vector over time is described as:
Figure BDA0003735919610000112
wherein x (t + Δ t) represents the reservoir state vector at the current moment; w in Representing the connection weight of the input layer to the reserve pool layer; w rec Representing a reserve pool layer connection weight to a next reserve pool layer; tanh represents a hyperbolic tangent function; alpha represents a non-zero element proportion parameter; x (t) represents the reservoir state vector at the last time instant.
Wherein the output vector O(t) of the output layer is represented as:

O(t) = W_out·x(t);    (12)

where the output vector O(t) is a vector of dimension 2M, each element of O(t) corresponding to an estimated Q value selected by each SU; W_out is the output weight from the reservoir layer to the output layer.
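The following sketch puts formulas (11) and (12) together: fixed random input and reservoir weights, a leaky-integration state update, and a trainable linear readout. The sizes, leak rate and spectral radius used here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 6, 64, 12        # e.g. M sensed channels in, 64 reservoir units, 2M outputs

W_in = rng.uniform(-1, 1, (n_res, n_in))
W_rec = rng.uniform(-1, 1, (n_res, n_res))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))   # scale spectral radius to rho = 0.9
W_out = rng.normal(0.0, 0.1, (n_out, n_res))               # only these weights are trained

def esn_step(x: np.ndarray, u: np.ndarray, leak: float = 0.3) -> np.ndarray:
    # formula (11): leaky-integration reservoir update with fixed W_in and W_rec
    return (1 - leak) * x + leak * np.tanh(W_in @ u + W_rec @ x)

def readout(x: np.ndarray) -> np.ndarray:
    # formula (12): O(t) = W_out x(t), one estimated Q value per action
    return W_out @ x
```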
(3) Priority experience replay deep Q network (PER-DQN) layer
In standard Q-learning and in the training of DQN, an action is selected and evaluated based on the same Q value, which causes the larger Q values to be chosen continually during learning and leads to the Q-value over-estimation problem. The selection and evaluation processes are therefore separated, i.e., another neural network is introduced to reduce the influence of this error, and two neural networks, DQN1 and DQN2, are used: DQN1, with Q value Q(s, a; θ), is used to select the action, and DQN2, with Q value Q(s, a; θ'), is used to estimate the Q value associated with the selected action.
The weight update of the double deep reinforcement learning network (DQN1 and DQN2) can be represented as:

θ_{t+1} = θ_t + α_t · [ r_{t+1} + γ·Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ') − Q(s_t, a_t; θ_t) ] · ∇_{θ_t} Q(s_t, a_t; θ_t);    (13)

wherein θ_{t+1} represents the weight of the double deep reinforcement learning network at time t+1; θ_t represents the weight at time t; s_t represents the state at time t; s_{t+1} represents the state at time t+1; a_t represents the action at time t; γ ∈ [0,1] is the discount factor; θ represents the weight of DQN1; θ' represents the weight of DQN2; argmax_{a'} Q(s_{t+1}, a'; θ) denotes the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; ∇ is the gradient operator; Q(·) denotes the Q value; and α_t is a step-size parameter.
The target Q value of the double deep Q network (DDQN) can be expressed as:

y_t = r_{t+1} + γ·Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ');    (14)

wherein r_{t+1} represents the revenue at time t+1; argmax_{a'} Q(s_{t+1}, a'; θ) denotes the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; s_{t+1} represents the state at time t+1; θ represents the weight of the first deep reinforcement learning network model DQN1; and θ' represents the weight of the second deep reinforcement learning network model DQN2.
thus, a target Q value may be selected by an action selected by the estimated Q network, and a Mean Square Error (MSE) loss function may be calculated from the target Q value and the estimated Q value, the loss function of the dual-depth reinforcement learning network being defined as:
Figure BDA0003735919610000124
wherein L (θ) represents a loss function; e [ 2 ]] 2 Representing the mean square error; finally, in each time slot, the weight of the double deep reinforcement learning network (DDQN) is updated with the goal of minimizing the loss function.
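A small sketch of the double-DQN target of formula (14) and the MSE loss of formula (15): the estimation network (θ) selects the action, the target network (θ') evaluates it. q_estimate_next and q_target_next stand for the two networks' Q vectors at s_{t+1}; the names are illustrative.

```python
import numpy as np

def double_dqn_target(r_next: float, q_estimate_next: np.ndarray,
                      q_target_next: np.ndarray, gamma: float = 0.9) -> float:
    a_star = int(np.argmax(q_estimate_next))       # action selected with theta (DQN1)
    return r_next + gamma * q_target_next[a_star]  # value evaluated with theta' (DQN2)

def mse_loss(targets: np.ndarray, estimates: np.ndarray) -> float:
    return float(np.mean((targets - estimates) ** 2))   # formula (15)
```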
The deep reinforcement learning part adopts two deep Q networks with the same structure, where the weight θ' of the target network DQN2 is a delayed copy of the weight θ of DQN1. Both neural networks contain two hidden layers, with 128 and 256 hidden units respectively; both receive the observation results from the ESN (echo state network) and output the DDQN action parameters as the decision, which is passed back to the ESN network. Finally, the training network stores all collected observation-action pairs in the experience replay buffer.
In the classical DDQN algorithm, uniform random sampling is typically used when collecting samples from the experience replay buffer. During the interaction of the training network with the environment, experience samples are continuously stored in the replay buffer for training the model, and these experiences, whether traces of successful or failed attempts, can all be kept in the replay unit. By frequently replaying these experiences, the agent learns the different outcomes that follow correct or incorrect behavior and thus continually corrects its behavior. However, different experience samples have different importance. Since the samples in the replay buffer are continuously updated, if a small number of samples are drawn from it by uniform random sampling as model input, some experience samples of higher importance cannot be fully utilized, or may even be overwritten directly, which reduces training efficiency. To improve the training efficiency of the model, samples are collected from the replay buffer by priority experience replay, which increases the probability that samples of higher importance are collected. The application adopts a Sum-Tree-based proportional priority experience replay mechanism. The priority of an experience is first defined as:

p_t = |δ_t| + c;    (16)

wherein p_t represents the priority of the experience; δ_t represents the TD error (the difference between the current estimate and the estimated target); and c is a small positive constant that ensures that samples whose TD error is close to 0 still have a probability of being sampled.
The Sum-Tree data structure used in this application is theoretically similar to a binary-array representation. In this data structure, each leaf node stores the priority p_t of one sample, each branch node has two children, and the value of a branch node is the sum of its children; the root node therefore contains the sum p_total of all priorities. Such a data structure provides an efficient way to compute cumulative sums of priorities. Specifically, to draw a sample batch of size k, the range [0, p_total] is first divided equally into k segments; a value is then drawn uniformly from within each segment; finally, the region of the tree corresponding to each of these values is retrieved.
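A compact Sum-Tree sketch matching this description is given below: leaves hold per-sample priorities, internal nodes hold the sum of their children, and sampling splits [0, p_total] into k equal segments and draws one value from each. Restricting the capacity to a power of two and the array layout are simplifying assumptions.

```python
import numpy as np

class SumTree:
    def __init__(self, capacity: int):
        self.capacity = capacity                 # number of leaves (power of two assumed)
        self.tree = np.zeros(2 * capacity)       # index 0 unused; leaves at capacity..2*capacity-1
        self.write = 0

    def add(self, priority: float) -> None:
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, index: int, priority: float) -> None:
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:                          # propagate the new sum up to the root
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def get(self, value: float) -> int:
        node = 1
        while node < self.capacity:               # descend from the root to a leaf
            left = 2 * node
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity                # leaf (sample) index

    def sample(self, k: int, rng=np.random.default_rng()):
        bounds = np.linspace(0.0, self.tree[1], k + 1)    # split [0, p_total] into k segments
        return [self.get(rng.uniform(bounds[i], bounds[i + 1])) for i in range(k)]
```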
However, this method frequently replays experiences with a high TD error and visits some states too often, so the experiences lack diversity and the network training easily overfits. The weights are therefore corrected by the importance sampling method, and the corrected weight is expressed as:

w_i = ( N·P(i) )^(−β);    (17)

wherein w_i represents the corrected (importance sampling) weight; P(i) represents the probability of sampling experience i; N represents the sample size; and β represents the correction rate. To increase sampling stability, the application normalizes the weights so that updates are only scaled downward. The normalized weight is expressed as:

w_j = ( N·P(j) )^(−β) / max_i w_i;    (18)

wherein w_j represents the normalized weight; P(j) represents the probability of sampling experience j; and max_i w_i denotes the maximum of w_i.
After the training network has stored all collected observation-action pairs in the experience replay buffer, the pairs are sorted by priority, the sampling probabilities are determined according to the Sum-Tree model of the priorities, and finally the experience samples of higher priority are selected by the importance sampling method and fed into the DQN for parameter training.
(4) Output layer
The output layer outputs a vector of size M, in which the ith element corresponds to the estimated Q value of selecting the ith channel in the given state, 1 ≤ i ≤ M.
And S4, initializing parameters of a first deep reinforcement learning network model and a second deep reinforcement learning network model in the double-deep reinforcement learning network model.
The parameters include the state s_t, the action a_t, a weight update interval W, an experience replay buffer D of capacity |D|, and the weights. Specifically, the state s_t, the action a_t, the weight update interval W and the experience replay buffer D of capacity |D| are initialized, and the estimated Q network Q(s, a; θ) and the target Q network Q(s, a; θ') are initialized with random weights θ and θ', respectively.
And S5, acquiring estimated Q values of all dynamic spectrum access actions of secondary users based on a dynamic spectrum access strategy in the system model according to the first deep reinforcement learning network model.
At the beginning of each step, the secondary user first observes the initial state s_t; in each time slot of the system, this state is used as the input of the echo state network layer.

The reservoir state x(t + Δt) is then updated according to formula (11), and O(t) is output according to formula (12) as the estimated Q values of all dynamic spectrum access actions.

After all estimated Q values are obtained, the dynamic spectrum access action a_t at time t is determined from the estimated Q values and an ε-greedy strategy.

The next state s_{t+1} is then obtained, and the revenue r_{t+1} is obtained from formula (8).
And S6, each secondary user selects the dynamic spectrum access action of the secondary user under the state of the optimal Q value according to the estimated Q value.
The transition (s_t, a_t, r_{t+1}, s_{t+1}) is stored in the experience replay buffer D. During network training, the priority p_t of each sample is determined from its TD error according to formula (16), the sampling probability P(i) of each sample is set according to the Sum-Tree data structure, and the data are finally collected by the importance sampling method, which breaks the correlation among the data and ensures the effectiveness of training.
And S7, selecting a target Q value of the second deep reinforcement learning network model according to the selected dynamic spectrum access action of the secondary user.
And S8, calculating a loss function, training the first deep reinforcement learning network model and the second deep reinforcement learning network model by minimizing the loss function, and updating the weights of the first deep reinforcement learning network model and the second deep reinforcement learning network model.
The target Q value is selected by the action chosen by the estimated Q network according to formula (14), and the loss function is calculated by formula (15). Finally, every W time slots, the weights of the DDQN are updated according to formula (13).
Steps S5-S8 are repeated.
As a specific embodiment of the present invention: the dynamic spectrum access algorithm of the multi-user multi-channel cognitive wireless network based on PER-DESQN is as follows:
and inputting historical action sequences of all users.
And outputting the optimal access action sequence of all the users.
1: initializing hyper-parameters of PER-DDQN and ESN (echo state network) networks;
2: initializing a state, an action, an experience playback zone D and a weight update interval W;
3:For t=1,2,...,T do;
4: after receiving the preprocessed channel state information, the echo state network updates the reservoir network and outputs an estimated Q table Q containing all access strategies of the next time slot t (s,a);
5: each user is in state s t Selecting an action a using an epsilon-greedy strategy t
6: update state s t ←s t+1
7: calculating a loss function and obtaining a profit r t+1
8: will(s) t ,a t ,r t+1 ,s t+1 ) Storing into an empirical playback zone D, where the maximum priority p t =max i<t p i
9: if t is within the playback period;
10:For j=1,2,...,minibatchdo;
setting the priority according to the formula (16);
12: calculating an importance sampling weight according to formula (18);
13: calculating TD-error
Figure BDA0003735919610000151
14: updating the priority;
15:End for
16:End if
17: calculating a target Q value by equation (14);
calculating a loss function by formula (15) and training an estimation network by minimizing the loss function;
updating the weight theta by formula (13);
20:End For。
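To make the listing above easier to follow end to end, here is a compact, self-contained Python sketch of a PER-DESQN-style training loop: an ESN with a trainable linear readout as the Q network, double-DQN target computation, and proportional prioritized replay with importance-sampling weights (a plain list stands in for the Sum-Tree). The toy single-user environment, dimensions and hyper-parameters are illustrative assumptions, not the patent's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_RES = 6, 64                       # channels, reservoir units
GAMMA, LR, EPS, BETA = 0.9, 0.01, 0.1, 0.4
ACTIONS = M + 1                        # access one of M channels, or index M = stay idle

# Echo state network: fixed W_in / W_rec, trainable readout (formulas (11)-(12))
W_in = rng.uniform(-1, 1, (N_RES, M))
W_rec = rng.uniform(-1, 1, (N_RES, N_RES))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))      # spectral radius rho = 0.9

def reservoir_step(x, u, leak=0.3):
    return (1 - leak) * x + leak * np.tanh(W_in @ u + W_rec @ x)

W_out = rng.normal(0.0, 0.1, (ACTIONS, N_RES))                # estimation network theta
W_tgt = W_out.copy()                                          # target network theta'

def q_values(x, w):
    return w @ x                                              # O(t) = W_out x(t)

def step_env(action):                                         # toy single-SU environment
    occupied = rng.random(M) < 0.5                            # PU activity this slot
    if action == M:                                           # SU stays idle
        return 0.0, occupied.astype(float)
    reward = -1.0 if occupied[action] else np.log2(1 + 10.0)  # penalty or achievable rate
    return reward, occupied.astype(float)

memory, priorities = [], []                                   # list stand-in for the Sum-Tree
x, obs = np.zeros(N_RES), np.zeros(M)

for t in range(2000):
    x = reservoir_step(x, obs)                                # formula (11)
    q = q_values(x, W_out)
    a = int(rng.integers(ACTIONS)) if rng.random() < EPS else int(np.argmax(q))
    r, obs = step_env(a)
    memory.append((x.copy(), a, r, obs.copy()))               # (s_t, a_t, r_{t+1}, o_{t+1})
    priorities.append(max(priorities, default=1.0))           # new sample gets max priority

    if len(memory) >= 32:
        p = np.array(priorities) ** 0.6                       # proportional prioritization
        probs = p / p.sum()
        idx = rng.choice(len(memory), 32, p=probs)
        is_w = (len(memory) * probs[idx]) ** (-BETA)
        is_w /= is_w.max()                                    # normalized IS weights, formula (18)
        for k, j in enumerate(idx):
            xs, aj, rj, next_obs = memory[j]
            x_next = reservoir_step(xs, next_obs)
            a_star = int(np.argmax(q_values(x_next, W_out)))  # select action with theta
            y = rj + GAMMA * q_values(x_next, W_tgt)[a_star]  # evaluate with theta', formula (14)
            td = y - q_values(xs, W_out)[aj]
            priorities[j] = abs(td) + 1e-3                    # formula (16)
            W_out[aj] += LR * is_w[k] * td * xs               # SGD step on the linear readout

    if t % 50 == 0:
        W_tgt = W_out.copy()                                  # periodic target-network sync
```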
as a specific embodiment of the present invention, a dynamic spectrum access algorithm of the present application is simulated, and the specific method is as follows:
the application firstly sets the positions of SU and PU to be randomly selected in a space of 150 m × 150 m. The method adopts a WINNER II model and a Rician model to respectively calculate the path loss and the channel model. The bandwidth is 5MHz, the noise power density is-147 dBm/Hz, the transmitting power of SU is 20mW, the transmitting power of PU is 40mW. All system parameters used in the system model are shown in table 1. The application uses Tensorflow to build an algorithm model design simulation experiment in Python to evaluate the performance of the proposed dynamic spectrum access algorithm. In simulation, the PER-DESQN algorithm is compared with the DQN-LSTM algorithm, the DQNRC method, the DQNMLP method, the Myopic method and the Q learning method. The specific hyper-parameter settings are shown in table 2.
Table 1. System parameter settings.

Table 2. Hyper-parameter settings.
The access performance of the algorithm is compared as follows:
the simulation of the application is based on the performance of the algorithm under 2 SU and 6 channel cognitive wireless networks. Under the system model, the access performance of the algorithm is evaluated mainly through four aspects of average access success rate, average income, average primary user interference probability and average secondary user interference probability of the algorithm, compared with other deep reinforcement learning algorithms of the PER-DESQN algorithm, the convergence rate of the PER-DESQN algorithm is obviously increased, because an ESN (echo state network) network is used for replacing a traditional deep network, namely a method of fixed weight is used for replacing a gradient descent method to update the weight, and the convergence time is greatly shortened. In addition, compared with other algorithms, the PER-DESQN algorithm has higher access success rate and can obtain higher benefit, namely transmission rate. And the method has higher access accuracy and lower user collision rate, thereby realizing the protection of the communication quality of the user. This is because compared with other algorithms, the PER-DESQN algorithm adopts a priority empirical playback sampling mechanism, so that samples with high TD errors have more sampling opportunities, and the effectiveness of network training is increased.
As the number of channels increases, the access accuracy of all algorithms also increases, because more channels give the SU more access options and the probability of access collisions therefore decreases. The leading advantage of the PER-DESQN algorithm becomes more obvious as the number of channels grows, because the algorithm combines the high-precision prediction brought by the efficient sampling of PER-DDQN with the simplification of the network training process brought by the ESN (echo state network); the PER-DDQN-based dynamic spectrum access algorithm can therefore better cope with the growth of data volume caused by a large state space, i.e., it is better suited to complex multi-channel, multi-user cognitive networks, which improves the applicability of the algorithm in practice. With the number of users unchanged, the access accuracy of the algorithm increases and the collision probability decreases as the number of channels increases, because more channels give the secondary users more access opportunities.
On the premise that the number of channels is not changed, the access accuracy of the algorithm is reduced along with the increase of the number of users, and the collision probability is increased along with the increase of the number of users. This is because as the number of users increases, more opportunities for collisions between secondary users occur. The PER-DESQN algorithm can realize convergence in a multi-user multi-channel cognitive wireless network, has strong adaptability and stability, and can be widely used for wireless communication systems which are oriented to a large number of users and transmit mass data.
In order to explore the influence of different discount factors on convergence stability, the algorithm is also run with discount factors of 0.5 and 0.95 and compared with the simulation results above, which use a discount factor of 0.9. The convergence stability of the PER-DESQN algorithm increases with the discount factor. This is because the larger the discount factor is, the greater the influence of future revenue on the current expected revenue, and the higher the proportion of predicted future revenue when calculating the expected revenue, which is more favorable for learning the environment and shortens the training time. Therefore, when the environment has a strong temporal correlation, the discount factor should be set large.
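As a small worked example of this effect, the snippet below (with an arbitrary constant reward sequence of our choosing, not simulation data) evaluates the cumulative discounted revenue for the three discount factors discussed above; the larger γ is, the more the later slots contribute to the total.

```python
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # arbitrary illustrative per-slot revenue

def discounted_return(rewards, gamma):
    """Cumulative discounted revenue: sum over t = 1..T of gamma**(t-1) * r_t."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

for gamma in (0.5, 0.9, 0.95):
    print(gamma, round(discounted_return(rewards, gamma), 3))
# 0.5  -> 1.938  (future slots contribute little to the expected revenue)
# 0.9  -> 4.095
# 0.95 -> 4.524  (future slots dominate the expected revenue)
```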
The beneficial effects realized by the application are as follows:
(1) The application provides a PER-DESQN-based multi-user multi-channel cognitive wireless network dynamic spectrum access algorithm. In classical deep reinforcement learning, the complex structure of the deep neural network reduces the convergence rate, so the ESN (echo state network) is adopted as the Q network to predict and estimate the Q value by exploiting the underlying temporal correlation, which greatly reduces the amount of training computation and shortens the convergence time. Meanwhile, to solve the Q-value over-estimation problem of the DQN network, the DDQN network is adopted so that Q-value estimation and action selection are handled by two separate networks, which avoids training the Q value with the same network and improves the prediction precision.
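To make the decoupling in (1) concrete, the comparison below contrasts the single-DQN target with the double-DQN target. It is a generic illustration with our own function names; `q_net_1` and `q_net_2` stand for any two Q-value approximators, for example the ESN readouts used here.

```python
import numpy as np

def dqn_target(r_next, s_next, q_net, gamma=0.9):
    """Single network both selects and evaluates the action, which tends to over-estimate Q values."""
    return r_next + gamma * np.max(q_net(s_next))

def ddqn_target(r_next, s_next, q_net_1, q_net_2, gamma=0.9):
    """Double network: network 1 selects the greedy action a', network 2 evaluates it."""
    a_star = int(np.argmax(q_net_1(s_next)))
    return r_next + gamma * q_net_2(s_next)[a_star]
```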
(2) In order to solve the problem of unstable Q values caused by random sampling from the experience replay buffer in the DDQN algorithm, the application proposes a prioritized experience replay mechanism based on a Sum Tree, combined with importance sampling, to sample experiences from the experience pool according to priority, thereby improving the stability and access accuracy of the algorithm. Simulation experiments show that the PER-DESQN-based multi-user multi-channel cognitive wireless network dynamic spectrum access algorithm makes fast and accurate dynamic spectrum access decisions and significantly increases the transmission rate of the system.
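The sampling rule in (2) can be sketched as follows. For brevity the example keeps priorities in a flat array rather than in a Sum Tree (the tree only makes the same proportional sampling O(log N)), and the α, β and ε values are assumptions; it illustrates sampling in proportion to priority and the importance-sampling correction w_j = (N·P(j))^(-β) / max_i w_i, not the application's data structure.

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay with importance-sampling weights (flat-array sketch)."""
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        p = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) >= self.capacity:      # overwrite the oldest experience
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                               # P(j) proportional to priority
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        n = len(self.data)
        w = (n * probs[idx]) ** (-self.beta)              # importance-sampling weights
        w /= w.max()                                      # normalize by the largest weight in the batch
        return [self.data[i] for i in idx], idx, w

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```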
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A cognitive radio network dynamic spectrum access method based on deep reinforcement learning is characterized by comprising the following steps:
in a system model, modeling and analyzing a dynamic spectrum access problem proposed in advance to obtain a dynamic spectrum access strategy;
constructing a double-depth reinforcement learning network model in advance;
initializing parameters of a first deep reinforcement learning network model and a second deep reinforcement learning network model in the double-deep reinforcement learning network model;
according to the first deep reinforcement learning network model, obtaining Q values of all dynamic spectrum access actions of secondary users based on a dynamic spectrum access strategy in the system model;
each secondary user selects, according to the Q values, the dynamic spectrum access action having the optimal Q value in its current state;
selecting a target Q value of a second deep reinforcement learning network model according to the selected dynamic spectrum access action of the secondary user;
and calculating a loss function, training the double-depth reinforcement learning network model by minimizing the loss function, and updating the weight of the double-depth reinforcement learning network model.
2. The deep reinforcement learning-based cognitive wireless network dynamic spectrum access method as claimed in claim 1, wherein the system model is a multi-user multi-channel cognitive wireless network, the system model comprises a primary network and a secondary network, the primary network is composed of M primary users, and the secondary network is composed of N secondary users.
3. The deep reinforcement learning-based cognitive wireless network dynamic spectrum access method as claimed in claim 1, wherein the dual deep reinforcement learning network model comprises an input layer, an echo state network layer, a priority experience playback deep Q network layer and an output layer.
4. The cognitive wireless network dynamic spectrum access method based on deep reinforcement learning of claim 3, wherein the echo state network layer comprises an input layer, a reservoir layer and an output layer which are connected in sequence; the output vector O(t) of the output layer is represented as:
O(t) = W_out x(t);
wherein the output vector O(t) is a vector of dimension 2M, each element of O(t) corresponding to a Q value for a selection available to the secondary user; x(t) is the reservoir state; and W_out is the output weight matrix from the reservoir layer to the output layer.
5. The deep reinforcement learning-based dynamic spectrum access method for the cognitive wireless network according to claim 3, wherein the weight update of the double deep reinforcement learning network model is represented as:
θ_{t+1} = θ_t + α_t [r_{t+1} + γ Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ') - Q(s_t, a_t; θ)] ∇_θ Q(s_t, a_t; θ);
wherein θ_{t+1} represents the weight of the double deep reinforcement learning network at time t+1; θ_t represents the weight of the double deep reinforcement learning network at time t; s_t represents the state at time t; s_{t+1} represents the state at time t+1; a_t represents the action at time t; r_{t+1} represents the revenue at time t+1; γ ∈ [0,1] is the discount factor; θ represents the weight of DQN 1; θ' represents the weight of DQN 2; argmax_{a'} Q(s_{t+1}, a'; θ) selects the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; ∇ is the gradient operator; Q(·) denotes the Q value; and α_t is the learning-rate parameter.
6. The deep reinforcement learning-based dynamic spectrum access method for the cognitive wireless network according to claim 5, wherein the target Q value of the double deep reinforcement learning network is:
r_{t+1} + γ Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ');
wherein r_{t+1} represents the revenue at time t+1; argmax_{a'} Q(s_{t+1}, a'; θ) selects the action a' for which Q(s_{t+1}, a'; θ) takes its maximum value; s_{t+1} represents the state at time t+1; θ represents the weight of the first deep reinforcement learning network model; and θ' represents the weight of the second deep reinforcement learning network model.
7. The deep reinforcement learning-based dynamic spectrum access method for the cognitive wireless network according to claim 6, wherein the loss function of the double deep reinforcement learning network is defined as:
L(θ) = E[(r_{t+1} + γ Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ') - Q(s_t, a_t; θ))²];
wherein L(θ) represents the loss function; E[(·)²] denotes the mean square error; r_{t+1} represents the revenue at time t+1; γ ∈ [0,1] is the discount factor; and Q(·) denotes the Q value.
8. The deep reinforcement learning based dynamic spectrum access method for the cognitive wireless network according to claim 6, wherein the weights of the double deep reinforcement learning network are corrected by importance sampling; the correction is calculated as:
w_j = (N·P(j))^(-β) / max_i w_i;
wherein w_i represents the weight before correction; w_j represents the corrected weight; P(j) represents the probability of sampling experience j; N represents the sample size; and the parameter β represents the correction rate.
9. The deep reinforcement learning-based dynamic spectrum access method for cognitive wireless network according to claim 1, wherein the goal of each secondary user is to find a dynamic spectrum access strategy σ_i that maximizes its expected cumulative discounted revenue:
σ_i* = argmax_{σ_i} E[R_i];
wherein σ_i* represents the strategy achieving the maximum cumulative discounted revenue, and R_i represents the cumulative discounted revenue of the i-th secondary user:
R_i = Σ_{t=1}^{T} γ^(t-1) r_i^t;
wherein γ ∈ [0,1] is the discount factor, T is the time range of the entire channel access procedure, and r_i^t represents the revenue function of the i-th SU at time t.
10. The deep reinforcement learning-based dynamic spectrum access method for cognitive wireless network according to claim 6, wherein the parameters for initializing the first deep reinforcement learning network model and the second deep reinforcement learning network model in the double deep reinforcement learning network model comprise the states, the actions, a weight update interval W, an experience replay buffer D of capacity |D|, and the weights.
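Read together, claims 1 and 5 to 8 describe one training loop. The sketch below is our own illustration of that loop and not code from the filing: `q_net_1` and `q_net_2` stand for the two deep reinforcement learning network models (exposing a `q_values` method as in the ESN sketch earlier), `replay` is a prioritized buffer like the one above, and `apply_update` is a hypothetical helper that applies the weight update of claim 5.

```python
import numpy as np

def train_step(q_net_1, q_net_2, replay, batch_size=32, gamma=0.9, lr=0.01):
    """One double-network update: network 1 selects the action, network 2 evaluates it,
    and the importance-weighted TD error drives the weight update."""
    batch, idx, is_weights = replay.sample(batch_size)
    td_errors = []
    for (s, a, r, s_next), w in zip(batch, is_weights):
        a_star = int(np.argmax(q_net_1.q_values(s_next)))        # action selection by network 1
        target = r + gamma * q_net_2.q_values(s_next)[a_star]    # target Q value from network 2
        td = target - q_net_1.q_values(s)[a]                     # term inside the loss L(theta)
        td_errors.append(td)
        q_net_1.apply_update(s, a, w * td, lr)                   # hypothetical helper for the weight update
    replay.update_priorities(idx, td_errors)                     # refresh priorities from the new TD errors
    # Network 2 would be synchronized to network 1 every W steps (the weight update interval).
```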
CN202210796138.1A 2022-07-07 2022-07-07 Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning Pending CN115190489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210796138.1A CN115190489A (en) 2022-07-07 2022-07-07 Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210796138.1A CN115190489A (en) 2022-07-07 2022-07-07 Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115190489A true CN115190489A (en) 2022-10-14

Family

ID=83517550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210796138.1A Pending CN115190489A (en) 2022-07-07 2022-07-07 Cognitive wireless network dynamic spectrum access method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115190489A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115802401A (en) * 2022-10-19 2023-03-14 苏州大学 Wireless network channel state prediction method, device, equipment and storage medium
CN115802401B (en) * 2022-10-19 2024-02-09 苏州大学 Wireless network channel state prediction method, device, equipment and storage medium
CN116744311A (en) * 2023-05-24 2023-09-12 中国人民解放军国防科技大学 User group spectrum access method based on PER-DDQN
CN116744311B (en) * 2023-05-24 2024-03-22 中国人民解放军国防科技大学 User group spectrum access method based on PER-DDQN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination