CN117896027A - Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning - Google Patents


Info

Publication number
CN117896027A
Authority
CN
China
Prior art keywords
channel
state
network
dynamic spectrum
spectrum allocation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410066885.9A
Other languages
Chinese (zh)
Inventor
王树彬
刘艳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University
Priority to CN202410066885.9A
Publication of CN117896027A
Legal status: Pending

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 — Reducing energy consumption in communication networks
    • Y02D 30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the field of communications, and particularly relates to a distributed dynamic spectrum allocation method and device based on deep reinforcement learning. The method constructs a multi-user, multi-channel CWSN environment, models the multi-user, multi-channel access problem as a Markov decision process, and proposes a deep Q-network (DQN) model for predicting the channel occupancy state of the primary users. The invention adds a residual network structure to the DQN to mitigate the performance degradation that deep neural networks suffer as network depth increases. For the constructed DQN model, each SU feeds its channel observations into the DQN for training according to its sensing results, so as to learn an optimal spectrum access strategy. Finally, the DQN model outputs a prediction of the channel occupancy state, and the access request of each SU is answered accordingly. The invention solves the problem that multi-user dynamic spectrum access is difficult to manage with a centralized DSA method.

Description

Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning
Technical Field
The invention belongs to the field of communication, and particularly relates to a distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning.
Background
The cognitive wireless sensor network (Cognitive Wireless Sensor Network, CWSN) combines cognitive radio technology with the wireless sensor network (Wireless Sensor Network, WSN) to relieve the scarcity of WSN spectrum resources by allowing a large number of sensor nodes, acting as Secondary Users (SUs), to opportunistically access the idle spectrum of Primary Users (PUs). Dynamic spectrum access (Dynamic Spectrum Access, DSA) is one of the key technologies of CWSN; its task is to make decisions based on the spectrum sensing data of the cognitive sensor nodes and to access an idle portion of the spectrum licensed to a PU. Two problems must be solved when using this technique: how to minimize interference to the PU while opportunistically accessing and using licensed spectrum, and how to avoid collisions between SUs when multiple SUs attempt to access the same spectrum.
Traditional approaches such as game theory, particle swarm optimization and genetic algorithms offer solutions to the DSA problem and achieve spectrum reuse, but their models are complex to design, easily fall into local optima, and have poor flexibility and adaptability. In contrast, reinforcement learning can adaptively learn optimal strategies in uncertain, dynamic and complex environments without prior information.
Combining DSA with deep reinforcement learning (Deep Reinforcement Learning, DRL) is an inevitable trend for achieving coexistence of PUs and SUs in the future. However, the DSA methods proposed by current studies mainly focus on scenarios with only one SU in the network environment. In multi-user scenarios, centralized DSA methods are mostly adopted to avoid interference among SUs, while existing distributed DSA solutions suffer from problems such as complex algorithms and slow convergence.
Disclosure of Invention
In order to solve the problem that multi-user dynamic spectrum access is difficult to manage with a centralized DSA method, the invention provides a distributed dynamic spectrum allocation method and device based on deep reinforcement learning.
The invention is realized by adopting the following technical scheme:
the distributed dynamic spectrum allocation method based on deep reinforcement learning is used for managing access requests of secondary users SU to a main user PU in a cognitive wireless sensor network, and comprises the following steps of:
s1: and characterizing the channel occupation state between a secondary user and a main user in the cognitive wireless sensor network through a two-state Markov chain, and constructing an environment model for generating the two-state Markov chain.
S2: modeling the multi-user multi-channel spectrum access problem as a partially observable Markov decision process; a state space, an action space, a reward function, and a policy function of the decision process are determined.
S3: and building a dynamic spectrum allocation model based on a deep reinforcement learning framework (DQN) by combining an environment model and a deep learning algorithm (DNN).
The dynamic spectrum allocation model comprises a target network, an estimation network, an environment model and an experience pool; the environmental model is used to supplement the experience pool with experiences for training the target network and estimating the network. The parameters of the target network and the estimated network are updated by back propagation according to the calculated loss function in accordance with the gradient descent strategy.
S4: training a dynamic spectrum allocation model, wherein the trained dynamic spectrum allocation model is used for predicting the occupancy state of a channel when a secondary user accesses a primary user in a communication network.
In the training process, an epsilon-greedy strategy is used for selecting actions, and then a plurality of groups of experience value vectors containing channel observation values, actions and rewards are generated through a target network, an estimation network and an environment model and stored in an experience pool. Then, extracting experiences in the experience pool, and respectively inputting the channel observation values into an estimation network and a target network to obtain action values; and updating parameters of the network model by minimizing mean square error through loss function calculation.
S5: and predicting the occupancy state of the channel when the secondary user accesses the primary user by using the trained dynamic spectrum allocation model, and responding to the access request of the secondary user according to the occupancy state.
As a further improvement of the invention, in the environment model of step S1, the number of primary users PU in the cognitive wireless sensor network is defined as N and the number of secondary users SU as M; the signal y_i^n received by the i-th secondary user SU_i on channel n is then expressed as follows:

y_i^n = g_{i,i}^n x_i^n + g_{PU_n,i}^n x_{PU_n}^n + Σ_{j≠i} g_{j,i}^n x_j^n + w_i^n

In the above, x_i^n is the desired signal of SU_i on channel n; x_{PU_n}^n and x_j^n respectively represent the interference signals from PU_n and SU_j; g_{i,i}^n, g_{PU_n,i}^n and g_{j,i}^n respectively represent the channel gains from the transmitters of SU_i, PU_n and SU_j to SU_i; w_i^n represents the received additive white Gaussian noise.

The spectrum holes of a licensed channel are divided into time slots, and the channel in each slot has two states, occupied and idle; when accessing a channel occupied by the PU, the SU receives a warning signal of that channel. The two-state Markov transition probability p_n of the n-th channel is expressed as:

p_n = [[p_00^n, p_01^n], [p_10^n, p_11^n]]

wherein each state parameter p_xy^n of the Markov chain satisfies the following formula:

p_00^n + p_01^n = 1,  p_10^n + p_11^n = 1,  n ∈ N
as a further improvement of the present invention, in step S2, the expression of the state space O of the constructed markov decision process is as follows:
in the above formula, N represents the number of channels; s is S i A real state space representing the state of the channel in each time slot;indicating that the channel is busy by the PU and in a busy state, and (2)>Indicating that channel N is in an idle state, N e N; o (O) i Representing SU i A state space of the observation channel; p (P) r (O i ) Representing channel true state value S i To SU i Final observed channel state O i Is a process of (1); />Representing SU in channel n i Is determined by the perceptual error probability of (1); />Representing SU i The state of channel n is observed, with values 1 and 0 representing free and busy, respectively.
As a further improvement of the present invention, in step S2, the action space A of the constructed Markov decision process is expressed as follows:

A = {a_1, a_2, …, a_M},  a_i ∈ {0, 1, …, N}

In the above, a_i represents the action of SU_i, i.e. the channel on which it transmits; a_i = 0 means that SU_i does not access any channel; a_i = n (n ∈ N) means that SU_i selects channel n for information transmission.
As a further improvement of the present invention, in step S2, the reward function R of the constructed Markov decision process is expressed as follows:

R = Σ_t γ^(t-1) r_i(t)

In the above, t represents the time slot; γ^(t-1) is the discount applied to slot t, with discount factor γ ∈ [0, 1]; r_i represents the action reward of SU_i, satisfying the following equation:

r_i = -C when SU_i collides with a PU (C > 0 is a preset constant); r_i = 0 when SU_i does not transmit data; r_i = log2(1 + SINR_i) otherwise,

where SINR_i represents the signal-to-interference-plus-noise ratio when the i-th secondary user accesses the channel.
As a further improvement of the present invention, in step S2, the policy function of the constructed Markov decision process is expressed as follows:

π* = argmax_{a_i} Q*(o_i, a_i)

In the above, π* represents the optimal strategy; Q*(o_i, a_i) represents the optimal Q value corresponding to the optimal strategy.
As a further improvement of the present invention, in step S3, the target network and the estimation network in the constructed dynamic spectrum allocation model adopt a ResNet structure with four hidden layers; each hidden layer has 64 neurons, and the activation function is ReLU.
As a further improvement of the present invention, in step S4, the dynamic spectrum allocation model is trained as follows:
(1) Initialize all parameters of the estimation network and the target network.
(2) The secondary user senses the channel states to obtain an observation and inputs it into the target network.
(3) Take a random action with probability epsilon, or select the action with the maximum Q value with probability 1-epsilon, and generate the reward according to the observation and the action.
(4) Generate the channel state observation of the next time slot, and store the quadruple consisting of the current-slot channel state observation, the action, the reward and the next-slot channel state observation into the experience pool.
(5) Repeat steps (2)-(4) until the amount of data in the experience pool meets the requirement.
(6) Randomly extract experiences from the experience pool, and input the current-slot channel state observation and the next-slot channel state observation into the estimation network and the target network, respectively, to obtain the action values.
(7) Update the estimation network by minimizing the mean square error computed by the loss function; update the parameters iteratively, and copy the parameters of the estimation network to the target network.
As a further improvement of the present invention, the loss function employed by the dynamic spectrum allocation model in the training phase is as follows:

L(θ) = E[(r_i + γ max_{a'_i} Q(o'_i, a'_i; θ') - Q(o_i, a_i; θ))^2]

In the above, Q(o'_i, a'_i; θ') represents the Q value output by the target network; Q(o_i, a_i; θ) represents the Q value output by the estimation network; γ represents the discount factor; o'_i, a'_i, θ' and o_i, a_i, θ represent the state, action and network parameters of the target network and the estimation network, respectively.
The invention also includes a distributed dynamic spectrum allocation apparatus based on deep reinforcement learning, comprising a memory, a processor, and a computer program stored on the memory and running on the processor. When the processor executes the computer program, the steps of the distributed dynamic spectrum allocation method based on deep reinforcement learning are realized, the occupation state of the channel when the secondary user accesses the main user is predicted by using the trained dynamic spectrum allocation model, and the access request of the secondary user is responded according to the occupation state.
The technical scheme provided by the invention has the following beneficial effects:
the distributed dynamic spectrum allocation method based on deep reinforcement learning is described as a multi-user multi-channel spectrum access problem, and is modeled as a partially observable Markov decision process; and then, a dynamic spectrum allocation model capable of predicting the channel occupation state of the CWSN environment is designed by combining an MLP4, resNet and Q learning algorithm aiming at the modeled CWSN environment. And the predicted result of the trained dynamic spectrum allocation model is used as the basis for responding to the user access request.
Compared with the prior art, the method first considers a more realistic scenario, namely that the spectrum sensing of secondary users is subject to sensing errors, and takes this into account in spectrum access; second, it introduces a residual network structure to improve training accuracy and mitigate the performance degradation caused by network depth in deep neural networks. Compared with other methods, it enables secondary users to learn the optimal channel access strategy faster, obtain better spectrum access opportunities, improve the SU access success rate, and effectively reduce collisions among users.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
fig. 1 is a flowchart of the steps of a distributed dynamic spectrum allocation method based on deep reinforcement learning provided in embodiment 1 of the present invention.
Fig. 2 is a system model of a cognitive wireless sensor network environment with a graph structure constructed in embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of a two-state markov chain featuring channel occupancy.
Fig. 4 is a model architecture diagram of the dynamic spectrum allocation model created in embodiment 1 of the present invention.
Fig. 5 is a network architecture diagram of the DNN with a four-hidden-layer ResNet structure employed in the dynamic spectrum allocation model.
FIG. 6 is a plot of average rewards for different schemes in a simulation as a function of iteration.
Fig. 7 is a plot of successful access rate versus iteration process for different schemes in the simulation.
Fig. 8 is a plot of average collision rate with SU as a function of iterative process in different schemes in the simulation.
FIG. 9 is a graph showing the average collision rate with PU in different schemes in simulation as a function of the iterative process.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
This embodiment provides a distributed dynamic spectrum allocation method based on deep reinforcement learning, which is used to manage access requests of secondary users (SUs) to primary users (PUs) in a cognitive wireless sensor network. The main idea of the scheme of this embodiment is as follows: first, a multi-user, multi-channel CWSN environment is constructed. Then, a deep Q-network (Deep Q-Network, DQN) model for predicting the occupancy state of the primary users is presented. In order to improve training accuracy and mitigate the performance degradation caused by network depth in deep neural networks, this embodiment further adds a residual network (Residual Network, ResNet) structure to the DQN. Next, each SU inputs its channel observations into the DQN for training according to its sensing results, so as to learn the optimal spectrum access strategy. Finally, each SU intelligently accesses a suitable channel according to the output of the DQN model.
In detail, as shown in fig. 1, the distributed dynamic spectrum allocation method provided in this embodiment includes the following steps:
s1: and characterizing the channel occupation state between a secondary user and a main user in the cognitive wireless sensor network through a two-state Markov chain, and constructing an environment model for generating the two-state Markov chain.
In this embodiment, a cognitive wireless sensor network comprising multiple users and multiple channels is first modelled; then, combined with the graph structure of the cognitive wireless sensor network, an environment model is created that can quantify the channel occupancy state of the primary users in the network and generate the corresponding two-state Markov chains.
Specifically, it is assumed that a typical cognitive wireless sensor network consists of N primary users (PUs) and M secondary users (SUs). In the CWSN environment, each primary user exclusively occupies one licensed channel of the cognitive wireless sensor network. Cross-channel interference between primary users is ignored, and each user corresponds to a transmitter-receiver pair, both using the same wireless channel. The association of the different signal links between users in the CWSN can be represented by graph-structured data. Specifically, fig. 2 shows the complex association of desired links and interfering links when PU1, SU1 and SU2 operate on the same channel in a simplified CWSN that contains only PU1, SU1 and SU2.
In the CWSN environment of fig. 2, when an SU uses the licensed channel, it may interfere with the PU and may also be interfered with by other SUs. The signal y_i^n received by the i-th secondary user SU_i on channel n is therefore expressed as:

y_i^n = g_{i,i}^n x_i^n + g_{PU_n,i}^n x_{PU_n}^n + Σ_{j≠i} g_{j,i}^n x_j^n + w_i^n

In the above, x_i^n is the desired signal of SU_i on channel n; x_{PU_n}^n and x_j^n respectively represent the interference signals from PU_n and SU_j; g_{i,i}^n, g_{PU_n,i}^n and g_{j,i}^n respectively represent the channel gains from the transmitters of SU_i, PU_n and SU_j to SU_i; w_i^n represents the received additive white Gaussian noise (AWGN).
The corresponding signal-to-interference-plus-noise ratio (Signal to Interference plus Noise Ratio, SINR) SINR_i^n is given by:

SINR_i^n = g_{i,i}^n P_i^n / ( g_{PU_n,i}^n P_{PU_n}^n + Σ_{j≠i} g_{j,i}^n P_j^n + B·N_0 )

where P_i^n, P_{PU_n}^n and P_j^n represent the transmit powers of SU_i, PU_n and SU_j on channel n, respectively; B and N_0 are the channel bandwidth and the noise power density, respectively.
The transmission rate C_i received by the receiver of SU_i is then:

C_i = log2(1 + SINR_i)
as can be seen from a combination of the above two equations, when a plurality of SUs transmit on the same channel, transmission collisions occur between each other, so that the SINR of the receiver decreases, resulting in a decrease in the transmission rate or even failure. Therefore, only one SU works best when transmitting on an idle channel.
Based on this, this embodiment divides the spectrum holes of a licensed channel into time slots, and the channel in each slot has two states, occupied and idle. When accessing a channel occupied by the PU, the SU receives a warning signal for that channel. The occupancy of the channel can then be described by the two-state Markov chain shown in fig. 3, where a state value of 0 indicates that the channel is occupied and 1 indicates that the channel is idle. The two-state Markov transition probability p_n of the n-th channel is expressed as:

p_n = [[p_00^n, p_01^n], [p_10^n, p_11^n]]

where each state parameter p_xy^n of the Markov chain satisfies the following formula:

p_00^n + p_01^n = 1,  p_10^n + p_11^n = 1,  n ∈ N
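For illustration, the channel dynamics described above could be simulated as in the following sketch; the class and variable names are hypothetical, and the transition probabilities follow the ranges used later in the simulation section.

```python
import numpy as np

class TwoStateMarkovChannels:
    """Generates per-slot channel states from two-state Markov chains.

    State 1 = idle, state 0 = occupied by the PU, matching the description above.
    transition_matrices[n] is the 2x2 matrix [[p00, p01], [p10, p11]] of channel n.
    """

    def __init__(self, transition_matrices, rng=None):
        self.p = np.asarray(transition_matrices)       # shape (N, 2, 2)
        self.rng = rng or np.random.default_rng()
        self.state = self.rng.integers(0, 2, size=self.p.shape[0])   # random initial states

    def step(self):
        """Advance every channel by one time slot according to its Markov chain."""
        for n in range(self.p.shape[0]):
            probs = self.p[n, self.state[n]]           # transition row of the current state
            self.state[n] = self.rng.choice(2, p=probs)
        return self.state.copy()

# Example: 6 channels whose idle state is "sticky" (p11 in [0.7, 1), p00 in (0, 0.3]).
rng = np.random.default_rng(0)
p11 = rng.uniform(0.7, 1.0, size=6)
p00 = rng.uniform(0.0, 0.3, size=6)
mats = np.stack([np.array([[q00, 1 - q00], [1 - q11, q11]]) for q00, q11 in zip(p00, p11)])
env = TwoStateMarkovChannels(mats, rng)
print(env.step())   # one slot of 0/1 channel states
```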
s2: the spectrum access problem of multi-user multi-channel is modeled as a partly observable Markov decision process, and the state space, action space, rewarding function and strategy function of the decision process are determined.
The Markov decision process (Markov Decision Process, MDP) is a mathematical model of sequential decision making, used to simulate the randomized strategies and rewards an agent can achieve in an environment whose system state has the Markov property. The learning process of an MDP is as follows: the agent senses the initial environment and performs an action according to its policy; the environment, affected by the action, enters a new state and feeds a reward back to the agent. The agent then continues to interact with the environment on the basis of the new state and policy. During learning, the rewards in the MDP are usually designed specifically for the corresponding reinforcement learning problem.
In this embodiment, the multi-user, multi-channel spectrum access problem is solved within a reinforcement learning framework, so it is modelled as a partially observable Markov decision process. In the modelled MDP, the state, action, reward and policy functions are as follows:
1. status of
At the beginning of each time slot, SU_i obtains channel state information by performing spectrum sensing on the N channels. The true channel state in time slot t is expressed as:

S(t) = {s^1(t), s^2(t), …, s^N(t)}

where the occupancy state s^n(t) of any channel n equals 1 or 0, indicating that the channel is idle or occupied, respectively.
Due to imperfections of the spectrum detector, the result of sensing the channel state may contain errors. Defining the sensing error probability of SU_i on channel n as e_i^n, the probability P_r that SU_i observes the true channel state is:

P_r(o_i^n(t) = s^n(t)) = 1 - e_i^n,  P_r(o_i^n(t) = 1 - s^n(t)) = e_i^n

Since an SU does not know whether a sensing error has occurred, its observation is treated as a correct reflection of the channel state and is mainly used here as historical channel state data. The sensing result of SU_i, which may contain sensing errors, is expressed as:

O_i(t) = {o_i^1(t), o_i^2(t), …, o_i^N(t)}

This is the state space of the Markov decision process.
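A short sketch of how such an error-prone observation can be generated from the true channel state in a simulation (the function name and the error-probability values are assumptions):

```python
import numpy as np

def observe(true_state, error_prob, rng=np.random.default_rng()):
    """Flip each channel's true state with its sensing-error probability.

    true_state: array of 0/1 channel states in the current slot.
    error_prob: per-channel sensing error probability of this SU (known here only
    to the simulator, not to the SU itself).
    """
    flip = rng.random(true_state.shape) < error_prob
    return np.where(flip, 1 - true_state, true_state)

obs = observe(np.array([1, 0, 1, 1, 0, 0]), error_prob=np.full(6, 0.1))
print(obs)  # observation vector o_i fed to the DQN
```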
2. Action
After spectrum sensing, an SU decides which channel to access, or not to access any channel, according to the sensing result. The action of SU_i is expressed as a_i ∈ {0, 1, …, N}, where a_i = n (n > 0) means that in time slot t SU_i selects channel n for information transmission, and a_i = 0 means that SU_i decides not to access any channel. The joint action of all SUs is expressed as:

A = {a_1, a_2, …, a_M}
this is the action space of the Markov decision process.
3. Rewards
After the SU performs an action, it obtains a reward according to the outcome of the action. The principle of SU channel access is to reduce collisions with other SUs without interfering with the PU, thereby maximizing its own transmission rate. The action reward is therefore set as:

r_i = -C, if SU_i collides with a PU;
r_i = 0, if SU_i does not transmit data;
r_i = C_i, if SU_i transmits successfully

That is, when an SU collides with a PU, the reward is set to -C, where C is a preset constant and C > 0; when the SU does not transmit data, the reward is 0; otherwise the reward of the SU is the transmission rate C_i of its receiver. After the SU obtains the reward, the state of each channel changes according to the Markov chain; the SU then senses the new channel state and chooses whether to access a channel in the next time slot.
4. Policy function
Because the designed spectrum access strategy is distributed, sensing results and access decisions are not shared among SUs. Each SU has its own DQN with which it makes channel access decisions independently, and the only input to the DQN is the sensing result obtained by its own sensor. The SU knows neither the transition probabilities of the channel states nor the sensing error probability; it can only learn how to access channels from the rewards obtained after accessing them, and formulates an access policy that maximizes its cumulative discounted reward, i.e. the return:

R_i = Σ_t γ^(t-1) r_i(t)

where γ ∈ [0, 1] is the discount factor, representing how important future rewards are to the current state. Since the reward in this embodiment is set to the data transmission rate, maximizing the cumulative discounted reward also maximizes the channel capacity, thereby improving the data transmission rate.
The final objective of the DSA problem described above thus translates into maximizing the reward in the reward function: finding the optimal strategy π* that takes the optimal action in any state so as to maximize the reward. This embodiment therefore finds π* by computing the optimal Q value with the following formula:

Q*(o_i, a_i) = E[ r_i + γ max_{a'_i} Q*(o'_i, a'_i) ],  π*(o_i) = argmax_{a_i} Q*(o_i, a_i)
This is the policy function of the markov decision process.
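Once the optimal Q values are available, applying this policy function amounts to an argmax over the Q values of the current observation; a tiny illustration with made-up numbers:

```python
import numpy as np

# Made-up Q values of one observed state for actions {0 = no access, 1..6 = access channel n}.
q_values = np.array([0.0, 1.2, -0.4, 2.7, 0.3, -1.1, 0.9])
best_action = int(np.argmax(q_values))   # pi*(o) = argmax_a Q*(o, a)
print(best_action)                       # -> 3, i.e. access channel 3
```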
S3: the efficiency of the Q learning algorithm considered to solve the markov decision process may be degraded with the increase of the state space, and in this embodiment, a dynamic spectrum allocation model based on a deep reinforcement learning framework DQN is built by combining the environment model and the deep learning algorithm DNN.
Specifically, as shown in fig. 4, the network architecture of the dynamic spectrum allocation model constructed in this embodiment comprises a target network, an estimation network, an environment model and an experience pool. The environment model is used to generate the two-state Markov chains characterizing the channel occupancy state between SUs and PUs in the cognitive wireless sensor network and to store the resulting experiences in the experience pool, and the data in the experience pool are used to train the target network and the estimation network. The parameters of the target network and the estimation network are updated by back-propagating the computed loss function according to a gradient descent strategy.
In the dynamic spectrum allocation model constructed in this embodiment, in order to improve training accuracy and mitigate the performance degradation of deep neural networks caused by network depth, the DNN structure in the DQN is finally determined to be a ResNet structure with four hidden layers, as shown in fig. 5. Each hidden layer has 64 neurons, and the activation function is the rectified linear unit (ReLU).
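As an illustration only, such a network could be written in Keras roughly as below: four hidden layers of 64 ReLU units with skip connections between them. The exact placement of the residual connections in fig. 5 is not reproduced here, so this layout, like the function and variable names, is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_resnet_dqn(num_channels, hidden_units=64):
    """Sketch of a DQN with four 64-unit ReLU hidden layers and residual connections.

    Input: the SU's N-channel observation vector; output: Q values for the
    N + 1 actions (access channel 1..N, or do not transmit).
    """
    obs = layers.Input(shape=(num_channels,), name="observation")
    x = layers.Dense(hidden_units, activation="relu")(obs)      # hidden layer 1
    for _ in range(3):                                          # hidden layers 2-4
        h = layers.Dense(hidden_units, activation="relu")(x)
        x = layers.Add()([x, h])                                # residual (skip) connection
    q_values = layers.Dense(num_channels + 1, activation=None, name="q_values")(x)
    return tf.keras.Model(obs, q_values)

estimation_net = build_resnet_dqn(num_channels=6)
target_net = build_resnet_dqn(num_channels=6)
target_net.set_weights(estimation_net.get_weights())            # start with identical parameters
estimation_net.summary()
```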
S4: training a dynamic spectrum allocation model, wherein the trained dynamic spectrum allocation model is used for predicting the occupancy state of a channel when a secondary user accesses a primary user in a communication network.
In the training phase of the DQN, each SU_i acts as an agent and takes its observation in each time slot as the input of the DQN estimation network. The estimation network selects an action using an epsilon-greedy policy, i.e. it takes a random action with probability epsilon or greedily selects the best action of the current DQN with probability 1-epsilon. After taking action a_i, the agent obtains a reward r_i from the environment, observes the channel state o'_i of the next time slot, and inputs it into the target network to obtain the next-slot action a'_i and the target Q value Q(o'_i, a'_i; θ'). The tuple (o_i, a_i, r_i, o'_i) represents one experience; experiences are collected by the epsilon-greedy policy and deposited into the experience pool before training begins. During DQN training, the loss value is calculated from the experiences accumulated in the experience pool:

L(θ) = E[(r_i + γ max_{a'_i} Q(o'_i, a'_i; θ') - Q(o_i, a_i; θ))^2]

In the above, Q(o'_i, a'_i; θ') represents the Q value output by the target network; Q(o_i, a_i; θ) represents the Q value output by the estimation network; γ represents the discount factor; o'_i, a'_i, θ' and o_i, a_i, θ represent the state, action and network parameters of the target network and the estimation network, respectively. The computed loss value is back-propagated to update the parameters θ of the estimation network, and at regular intervals the parameters of the estimation network are copied to the target network to update its parameters θ'.
Specifically, in the dynamic spectrum allocation model of this embodiment, the detailed steps of the DQN-based training process of secondary user SU_i (i ∈ M) are as follows:
(1) Initialize all parameters of the estimation network and the target network.
(2) Secondary user SU_i senses the channel states to obtain the observation o_i and inputs it into the target network.
(3) Take a random action with probability epsilon, or select the action a_i with the maximum Q value with probability 1-epsilon, and generate the reward r_i according to the observation and action (o_i, a_i).
(4) Generate the channel state observation o'_i of the next time slot, and store the quadruple (o_i, a_i, r_i, o'_i), consisting of the current-slot channel state observation, the action, the reward and the next-slot channel state observation, into the experience pool.
(5) Repeat steps (2)-(4) until the amount of data in the experience pool meets the requirement.
(6) Randomly extract experiences (o_i, a_i, r_i, o'_i) from the experience pool, and input the current-slot channel state observation o_i and the next-slot channel state observation o'_i into the estimation network and the target network, respectively, to obtain the action values Q(o_i, a_i; θ) and Q(o'_i, a'_i; θ').
(7) Update the estimation network by minimizing the mean square error computed by the loss function:

L(θ) = E[(r_i + γ max_{a'_i} Q(o'_i, a'_i; θ') - Q(o_i, a_i; θ))^2]

At regular intervals, copy the parameters θ of the estimation network to the target network as its parameters θ'.
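Condensing steps (1)-(7), a single-SU training-loop sketch could look like the following. It reuses the hypothetical helpers from the earlier sketches (env, observe, action_reward, build_resnet_dqn), uses a fixed placeholder SINR, ignores collisions between SUs, and takes its hyperparameters from Table 2 of the simulation section; it illustrates the procedure rather than reproducing the patented implementation.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

# Assumed helpers from the earlier sketches: env (TwoStateMarkovChannels instance),
# observe(), action_reward() and build_resnet_dqn(); all names are illustrative.

GAMMA, LR, POOL_SIZE, BATCH, TARGET_EVERY = 0.9, 0.01, 2000, 32, 300
N_CHANNELS = 6
ERR = np.full(N_CHANNELS, 0.1)                               # assumed sensing-error probabilities

est_net = build_resnet_dqn(N_CHANNELS)                       # step (1): initialize both networks
tgt_net = build_resnet_dqn(N_CHANNELS)
tgt_net.set_weights(est_net.get_weights())
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
pool = deque(maxlen=POOL_SIZE)                               # experience pool
epsilon = 1.0

def select_action(obs, eps):
    """Step (3): random action with probability eps, otherwise argmax Q."""
    if random.random() < eps:
        return random.randrange(N_CHANNELS + 1)
    return int(tf.argmax(est_net(obs[None, :])[0]))

def train_step(batch):
    """Steps (6)-(7): minimise the mean-squared DQN loss on one sampled batch."""
    obs, act, rew, nxt = (np.asarray(x) for x in zip(*batch))
    target_q = rew.astype(np.float32) + GAMMA * tf.reduce_max(tgt_net(nxt), axis=1)
    with tf.GradientTape() as tape:
        q_taken = tf.gather(est_net(obs), act, axis=1, batch_dims=1)
        loss = tf.reduce_mean(tf.square(target_q - q_taken))
    grads = tape.gradient(loss, est_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, est_net.trainable_variables))

state = env.step()                                           # current true channel states
obs = observe(state, ERR).astype(np.float32)                 # step (2): first observation
for slot in range(10000):
    a = select_action(obs, epsilon)
    collided = a > 0 and state[a - 1] == 0                   # accessed a PU-occupied channel
    r = action_reward(a, collided, sinr=15.0)                # placeholder SINR for brevity
    state = env.step()                                       # channels evolve per their Markov chains
    nxt_obs = observe(state, ERR).astype(np.float32)
    pool.append((obs, a, r, nxt_obs))                        # step (4): store the experience
    obs = nxt_obs
    if len(pool) >= BATCH:                                   # step (5): enough data collected
        train_step(random.sample(pool, BATCH))
    if slot % TARGET_EVERY == 0:
        tgt_net.set_weights(est_net.get_weights())           # copy θ to the target network (θ')
    epsilon = max(0.995 * epsilon, 0.005)                    # decaying epsilon-greedy (Table 2)
```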
S5: and predicting the occupancy state of the channel when the secondary user accesses the primary user by using the trained dynamic spectrum allocation model, and responding to the access request of the secondary user according to the occupancy state.
Example 2
On the basis of the scheme of the embodiment 1, the embodiment further provides a distributed dynamic spectrum allocation device based on deep reinforcement learning, which comprises a memory, a processor and a computer program stored on the memory and running on the processor. When the processor executes the computer program, the distributed dynamic spectrum allocation method based on deep reinforcement learning as in embodiment 1 is implemented, and the channel occupancy state of the secondary user when accessing the primary user is predicted by using the trained dynamic spectrum allocation model, and the access request of the secondary user is responded according to the occupancy state.
The distributed dynamic spectrum allocation device based on deep reinforcement learning is essentially a computer device for data processing and instruction generation. The computer device provided in this embodiment may be an embedded device capable of executing a computer program, or an intelligent terminal capable of executing a program, such as a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server or a tower server (including an independent server or a server cluster formed by multiple servers). The computer device of this embodiment at least includes, but is not limited to, a memory and a processor, which may be communicatively coupled to each other via a system bus.
In this embodiment, the memory (i.e. a readable storage medium) includes flash memory, a hard disk, a multimedia card, card-type memory (e.g. SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device.

In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device. Of course, the memory may also include both an internal storage unit of the computer device and an external storage device. In this embodiment, the memory is typically used to store the operating system and the various application software installed on the computer device; in addition, it can be used to temporarily store various types of data that have been output or are to be output.
The processor may be a central processing unit (Central Processing Unit, CPU), an image processor GPU (Graphics Processing Unit), a controller, a microcontroller, a microprocessor, or other data processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or process the data.
Simulation experiment
In order to verify the effectiveness of the distributed dynamic spectrum allocation method based on deep reinforcement learning, simulation experiments were carried out on the TensorFlow framework in a CWSN environment containing 2 SUs and 6 PUs.
(I) Experimental conditions
During the experiments, the positions of the PUs and SUs are set randomly within a 150 m x 150 m area, and a distance of 20 m-40 m is maintained between SUs. The simulation uses the WINNER II model and the Rician model to calculate the path loss and the channel model, respectively. Because the utilization of most licensed bands is low, i.e. the probability of a channel being in the idle state is high, p_11 takes a higher value and p_00 a lower value; the implementation draws them from uniform distributions over [0.7, 1] and [0, 0.3], respectively, and then computes p_10 = 1 - p_11 and p_01 = 1 - p_00. The system model parameters are shown in Table 1:
table 1: parameter table of CWSN environment in simulation experiment:
parameters (parameters) Value
PU quantity (N) 6
SU number (M) 2
Noise power density (N) 0 ) -174dBm/Hz
Transmit power of PU 40mW
Transmit power of SU 20mW
To avoid being trapped in a sub-optimal decision strategy before enough learning experience has been obtained, a decaying epsilon-greedy algorithm is used: the initial value of epsilon is set to 1 and decays every slot according to ε ← max{0.995ε, 0.005}. The hyperparameters are listed in Table 2:
table 2: super-parameters of DQN structure in simulation experiment
Super parameter Value
Attenuation Rate ε 1.0→0.005
Learning rate alpha 0.01
Discount factor gamma 0.9
Activation function ReLu
Experience pool size 2000
Optimizer Adam
Target network update frequency 300
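For reference, with the decay rule ε ← max{0.995ε, 0.005} of Table 2, epsilon reaches its 0.005 floor after roughly a thousand slots, as the short check below shows; this is a plain consequence of the rule, not an additional statement from the patent.

```python
eps, slots = 1.0, 0
while eps > 0.005:
    eps = max(0.995 * eps, 0.005)   # the decay rule used in the simulation
    slots += 1
print(slots)  # roughly a thousand slots until epsilon settles at its 0.005 floor
```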
(II) Control group
To compare the performance of the proposed scheme with existing schemes more intuitively, simulations were performed in Python and TensorFlow. The simulation experiment takes the proposed DQN+MLP4+ResNet algorithm as the experimental group, and takes the myopic algorithm (Myopic), DQN+RC, Q-learning and a DQN with four fully connected layers (DQN+MLP4) as the control groups. The experimental and control schemes are compared in terms of cumulative reward, success rate, and collisions with PUs and SUs, respectively.
(III) Experimental data and analysis
In the simulation experiment, the curves of the average reward of the experimental group and the control groups over the iterative process are shown in fig. 6. Analysis of the data in fig. 6 shows that the proposed scheme achieves the highest average reward among all five schemes.
The curves of the successful access rate of the experimental group and the control groups over the iterative process are shown in fig. 7. Analysis of the data in fig. 7 shows that the proposed scheme is far ahead of the other methods in terms of successful channel access rate, reaching about 95%.
The average collision rate with other SUs for the experimental group and the control groups is shown in fig. 8. Analysis of the data in fig. 8 shows that, in terms of collisions with other SUs, the collision rates of all methods eventually drop to 0, except for the Myopic policy. This means that the learning-based methods can learn the access policies of other SUs by interacting with the environment, whereas the Myopic policy, although it accesses the channel with the maximum expected reward on the premise of knowing the system channel information (such as the state transition probabilities of the channels), cannot learn the access policies of other SUs. In addition, in order to avoid collisions with the PU, the collision penalty -C in the reward function is set to -2 in the experiments.
The average collision rate with the PU for the experimental group and the control groups is shown in fig. 9. Analysis of the data in fig. 9 shows that the proposed scheme has the lowest collision rate with the PU, even lower than that of the Myopic policy. This means that, with the method provided by the invention, the secondary users can accurately learn the optimal access strategy from a random environment and thus avoid interfering with the PU.
In summary, compared with various existing dynamic spectrum access methods such as Myopic, Q-learning, DQN+RC and DQN+MLP4, the method provided by the invention not only performs better in reducing collisions among users and improving the SU access success rate, but also converges faster.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. The distributed dynamic spectrum allocation method based on deep reinforcement learning is characterized by being used for managing access requests of secondary users SU to primary users PU in a cognitive wireless sensor network, and comprises the following steps of:
S1: characterizing the channel occupancy state between a secondary user and a primary user in a cognitive wireless sensor network through a two-state Markov chain, and constructing an environment model for generating the two-state Markov chain;
s2: modeling the multi-user multi-channel spectrum access problem as a partially observable Markov decision process, and determining a state space, an action space, a reward function and a strategy function of the decision process;
s3: building a dynamic spectrum allocation model based on a deep reinforcement learning framework by combining the environment model and a deep learning algorithm;
the dynamic spectrum allocation model comprises a target network, an estimation network, an environment model and an experience pool; the environmental model is used for supplementing experience pools with experiences for training the target network and the estimation network; the parameters of the target network and the estimated network are updated by back propagation according to the calculated loss function and the gradient descent strategy;
s4: training the dynamic spectrum allocation model, wherein the trained dynamic spectrum allocation model is used for predicting the occupation state of a channel when a secondary user accesses a primary user in a communication network;
in the training process, firstly, selecting actions by using an epsilon-greedy strategy, generating a plurality of groups of experience value vectors containing channel observation values, actions and rewards through a target network, an estimation network and an environment model, and storing the experience value vectors into an experience pool; then, extracting experiences in the experience pool, and respectively inputting the channel observation values into an estimation network and a target network to obtain action values; and updating parameters of the network model by minimizing mean square error through loss function calculation;
s5: and predicting the occupancy state of the channel when the secondary user accesses the primary user by using the trained dynamic spectrum allocation model, and responding to the access request of the secondary user according to the occupancy state.
2. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 1, wherein: in the environment model of step S1, the number of primary users PU in the cognitive wireless sensor network is defined as N and the number of secondary users SU as M; the signal y_i^n received by the i-th secondary user SU_i on channel n is then expressed as follows:

y_i^n = g_{i,i}^n x_i^n + g_{PU_n,i}^n x_{PU_n}^n + Σ_{j≠i} g_{j,i}^n x_j^n + w_i^n

in the above, x_i^n is the desired signal of SU_i on channel n; x_{PU_n}^n and x_j^n respectively represent the interference signals from PU_n and SU_j; g_{i,i}^n, g_{PU_n,i}^n and g_{j,i}^n respectively represent the channel gains from the transmitters of SU_i, PU_n and SU_j to SU_i; w_i^n represents the received additive white Gaussian noise;

the spectrum holes of a licensed channel are divided into time slots, and the channel in each slot has two states, occupied and idle; when accessing a channel occupied by the PU, the SU receives a warning signal of that channel; the two-state Markov transition probability p_n of the n-th channel is expressed as:

p_n = [[p_00^n, p_01^n], [p_10^n, p_11^n]]

wherein each state parameter p_xy^n of the Markov chain satisfies the following formula:

p_00^n + p_01^n = 1,  p_10^n + p_11^n = 1,  n ∈ N
3. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 2, wherein: in step S2, the state space O of the constructed Markov decision process is expressed as follows:

S_i = {s_i^1, s_i^2, …, s_i^N},  O_i = {o_i^1, o_i^2, …, o_i^N},
P_r(o_i^n = s_i^n) = 1 - e_i^n,  P_r(o_i^n = 1 - s_i^n) = e_i^n

in the above, N represents the number of channels; S_i represents the true state space of the channels in each time slot; s_i^n = 0 indicates that channel n is occupied by a PU and in the busy state, while s_i^n = 1 indicates that channel n is in the idle state, n ∈ N; O_i represents the state space of the channels observed by SU_i; P_r(O_i) represents the process by which the true channel state value S_i becomes the channel state O_i finally observed by SU_i; e_i^n represents the sensing error probability of SU_i on channel n; o_i^n represents the state of channel n observed by SU_i, with values 1 and 0 representing idle and busy, respectively.
4. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 3, wherein: in step S2, the action space A of the constructed Markov decision process is expressed as follows:

A = {a_1, a_2, …, a_M},  a_i ∈ {0, 1, …, N}

in the above, a_i represents the action of SU_i, i.e. the channel on which it transmits; a_i = 0 means that SU_i does not access any channel; a_i = n (n ∈ N) means that SU_i selects channel n for information transmission.
5. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 4, wherein: in step S2, the reward function R of the constructed Markov decision process is expressed as follows:

R = Σ_t γ^(t-1) r_i(t)

in the above, t represents the time slot; γ^(t-1) is the discount applied to slot t, with discount factor γ ∈ [0, 1]; r_i represents the action reward of SU_i, satisfying the following equation:

r_i = -C when SU_i collides with a PU (C > 0 is a preset constant); r_i = 0 when SU_i does not transmit data; r_i = log2(1 + SINR_i) otherwise,

where SINR_i represents the signal-to-interference-plus-noise ratio when the i-th secondary user accesses the channel.
6. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 5, wherein: in step S2, the policy function of the constructed Markov decision process is expressed as follows:

π* = argmax_{a_i} Q*(o_i, a_i)

in the above, π* represents the optimal strategy; Q*(o_i, a_i) represents the optimal Q value corresponding to the optimal strategy.
7. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 1, wherein: in step S3, a target network and an estimation network in the constructed dynamic spectrum allocation model adopt a ResNet structure with four hidden layers; there are 64 neurons in each hidden layer, and the activation function is ReLU.
8. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 1, wherein: in step S4, the training steps of the dynamic spectrum allocation model are as follows:
(1) Initializing all parameters of an estimated network and a target network;
(2) The secondary user perceives the channel state to obtain an observation value, and inputs the observation value into a target network;
(3) Randomly taking action with probability epsilon or selecting the action with the maximum Q value with probability 1-epsilon, and generating rewards according to the observed value and the action;
(4) Generating a channel state observation value of the next time slot, and storing a quadruple comprising the channel state observation value, action, rewards of the current time slot and the channel state observation value of the next time slot into an experience pool;
(5) Repeating steps (2) - (4) until the amount of data in the experience pool meets the requirements;
(6) Randomly extracting experiences in an experience pool, and respectively inputting the channel state observed value of the time slot and the channel state observed value of the next time slot into an estimation network and a target network to obtain action values;
(7) Updating the estimation network by minimizing the mean square error by the loss function calculation; iteratively updating the parameters and copying the parameters of the estimated network to the target network.
9. The distributed dynamic spectrum allocation method based on deep reinforcement learning as claimed in claim 6, wherein: the loss function adopted by the dynamic spectrum allocation model in the training stage is as follows:

L(θ) = E[(r_i + γ max_{a'_i} Q(o'_i, a'_i; θ') - Q(o_i, a_i; θ))^2]

in the above, Q(o'_i, a'_i; θ') represents the Q value output by the target network; Q(o_i, a_i; θ) represents the Q value output by the estimation network; γ represents the discount factor; o'_i, a'_i, θ' and o_i, a_i, θ represent the state, action and network parameters of the target network and the estimation network, respectively.
10. A distributed dynamic spectrum allocation apparatus based on deep reinforcement learning, comprising a memory, a processor, and a computer program stored on the memory and running on the processor, characterized in that: when the processor executes the computer program, the steps of implementing the distributed dynamic spectrum allocation method based on deep reinforcement learning according to any one of claims 1-9 are performed, the occupation state of the channel when the secondary user accesses the primary user is predicted by using the trained dynamic spectrum allocation model, and the access request of the secondary user is responded according to the occupation state.
CN202410066885.9A 2024-01-17 2024-01-17 Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning Pending CN117896027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410066885.9A CN117896027A (en) 2024-01-17 2024-01-17 Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410066885.9A CN117896027A (en) 2024-01-17 2024-01-17 Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117896027A true CN117896027A (en) 2024-04-16

Family

ID=90641032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410066885.9A Pending CN117896027A (en) 2024-01-17 2024-01-17 Distributed dynamic spectrum allocation method and equipment based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117896027A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination