CN112188503B - Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network - Google Patents

Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Info

Publication number
CN112188503B
Authority
CN
China
Prior art keywords
channel
value
neural network
time slot
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011055360.3A
Other languages
Chinese (zh)
Other versions
CN112188503A (en)
Inventor
徐友云
李大鹏
蒋锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Nanjing Ai Er Win Technology Co ltd
Original Assignee
Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Nanjing Ai Er Win Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Nanyou Communication Network Industry Research Institute Co ltd, Nanjing Ai Er Win Technology Co ltd filed Critical Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Priority to CN202011055360.3A priority Critical patent/CN112188503B/en
Publication of CN112188503A publication Critical patent/CN112188503A/en
Application granted granted Critical
Publication of CN112188503B publication Critical patent/CN112188503B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a dynamic multichannel access method based on deep reinforcement learning applied to a cellular network. The technical scheme provides a channel allocation system and a plurality of user terminals, the channel allocation system being in communication connection with the user terminals. A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; according to the channel states of the current time slot, the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm, and the optimal strategy algorithm is trained and optimized through a deep reinforcement learning method. The method avoids a huge exponential amount of computation through deep reinforcement learning, enables the user terminal to quickly access the optimal channel on the premise of guaranteeing the communication quality of the user terminal, and improves spectrum utilization.

Description

Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
Technical Field
The invention relates to the technical field of communication, in particular to a dynamic multichannel access method based on deep reinforcement learning, which is applied to a cellular network.
Background
Radio spectrum is a limited and precious natural resource in wireless communication. Existing wireless communication allocates spectrum on an authorization basis: the radio spectrum is divided into a number of spectrum segments of fixed width, which are allocated by government administration departments to user terminals for exclusive use. However, with the rapid development of wireless communication technology and the continuous growth of new services, compounded by the inefficiency of spectrum utilization, spectrum resources are becoming increasingly scarce, and the increasingly scarce spectrum cannot meet the growing demand of wireless communication. This has prompted the development of efficient dynamic spectrum access schemes to support emerging wireless network technologies. Cognitive radio has become a key technology for improving spectrum utilization; its main idea is to detect which spectrum segments are idle and then intelligently select and access the idle spectrum, which can greatly improve spectrum utilization.
Dynamic spectrum access, one of the key technologies of cognitive radio, is under active research. The existing method is mainly Markov modeling, that is, the dynamic spectrum access process of a user terminal is modeled as a Markov model, and the access procedure is described accurately by a two-dimensional or multi-dimensional Markov chain. Markov modeling can improve spectrum utilization, but it places high requirements on the environment, the system undergoes no learning process, and the convergence speed is low.
The vigorous development of reinforcement learning has opened new research directions for dynamic spectrum access. Reinforcement learning learns a mapping from environment states to actions, and focuses on how a system learns an optimal behavior strategy when the state transition probability function is unknown. Reinforcement learning requires little knowledge of the environment, adapts well to dynamically changing environments, and is highly compatible with wireless networks; these characteristics give it broad prospects in the cognitive radio field. However, when the number of user terminals increases sharply, the number of states generated by reinforcement learning also grows exponentially, the algorithm complexity becomes very large, and this exponential amount of computation makes reinforcement learning difficult to use in practice.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a dynamic multichannel access method based on deep reinforcement learning, which is applied to a cellular network, and can avoid huge exponential calculation, so that a user terminal can be quickly accessed to an optimal channel on the premise of ensuring the communication quality of the user terminal, and the spectrum utilization rate is improved.
In order to achieve the above purpose, the invention provides the following technical scheme: a dynamic multichannel access method based on deep reinforcement learning applied to a cellular network provides a channel allocation system and a plurality of user terminals, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state represents whether data is successfully transmitted on the channel, and the optimal strategy algorithm is optimized through a deep reinforcement learning method, which comprises the following steps:
S10, an experience pool, a main neural network and a target neural network are configured in the channel allocation system; the experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state, the execution action and the weights of the neural networks, wherein the channel state is s, the execution action is a, the execution action a represents the allocation of a channel, the weight of the main neural network is w, the weight of the target neural network is w^-, and initially the weight of the target neural network is equal to that of the main neural network; the process then enters S20;
S20, the channel allocation system obtains the execution action a of the next time slot through the preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and the process enters S30;
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then goes to S50;
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each set of data to obtain an estimated Q value, the target neural network calculates an actual Q value through a preset actual Q value algorithm, and the process goes to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and entering S80;
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weight of the target neural network is set to the weight of the main neural network; the process then proceeds to S90;
S90, comparing the error value with a preset convergence critical value, returning to step S30 when the error value is larger than the convergence critical value, and ending the process when the error value is not larger than the convergence critical value, wherein the convergence critical value represents the maximum error value of the main neural network in the convergence state.
The dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the following constraints:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel;
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots;
C3 is the update rule for each possible state in the confidence vector, where I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation of that channel; an observation of 1 indicates that the channel state is good, 0.5 indicates that the channel state is uncertain, and 0 indicates that the channel state is poor;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained at time slot t+1 after action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, obtained when the accumulated reward value is maximized.
As a further refinement of the invention, the allocation algorithm is configured to:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
wherein argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value.
As a further refinement of the present invention, the reward algorithm is configured to:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
as a further improvement of the present invention, the actual Q value algorithm is configured to:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
wherein y_t is the actual Q value.
As a further refinement of the invention, the error algorithm is configured to:
L(w) = (y_t - Q(s_t, a_t; w))^2
wherein L(w) is the error value.
The invention has the beneficial effects that: a dynamic multi-channel model is configured in the channel allocation system to calculate the optimal channel allocation mode, and the optimal strategy algorithm is continuously optimized through deep reinforcement learning. The dynamic multi-channel access method reduces the requirements on the environment, so that through learning the channel allocation system can quickly allocate each channel to each user terminal in an optimized manner, and solving the dynamic multi-channel model by the deep reinforcement learning method avoids a huge exponential amount of computation. Therefore, the dynamic multi-channel access method avoids a huge exponential amount of computation, and on the premise of guaranteeing the communication quality of the user terminal it enables the user terminal to quickly access the optimal channel and improves spectrum utilization.
Drawings
FIG. 1 is a flow chart of a deep reinforcement learning method;
fig. 2 is a diagram of a wireless network dynamic multi-channel access scenario;
FIG. 3 is a block diagram of a deep reinforcement learning method;
FIG. 4 is a graph comparing convergence of error algorithms at different learning rates;
FIG. 5 is a graph of the convergence of the error algorithm at a learning rate of 0.1;
FIG. 6 is a comparison of the dynamic multi-channel model using a deep reinforcement learning method with the ideal state and random selection for normalized reward;
FIG. 7 is a comparison of error values for a dynamic multi-channel model using a deep reinforcement learning method versus an ideal state and a random selection.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, 2 and 3, a dynamic multi-channel access method based on deep reinforcement learning applied to a cellular network according to the present embodiment provides a channel allocation system and a plurality of user terminals, where the channel allocation system is in communication connection with the user terminals.
A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system, and the dynamic multi-channel model is used for calculating the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot. The configuration principle of the dynamic multi-channel model is as follows:
referring to fig. 2, it is assumed that a certain range is covered by one base station and M user terminals, and each user terminal needs to select one transmission data packet from N channels. And assume that the user always has data to transmit and that the N channels are orthogonal to each other. In each time slot, the user terminal needs to dynamically sense the state of the channel and select one channel to transmit data, the states of the channels are three, namely a good channel state, an uncertain channel state and a bad channel state, the good channel state indicates that the data of the user terminal can be successfully transmitted, the uncertain channel state indicates that the data of the user terminal cannot be successfully transmitted, and the bad channel state indicates that the data of the user terminal cannot be successfully transmitted. The channel state is expressed by data words by S, and the expression rule is as follows:
s_j = 1, if the channel state is good
s_j = 0.5, if the channel state is uncertain
s_j = 0, if the channel state is bad
the user terminal obtains corresponding reward according to the actual channel state of the distributed channel, if the user terminal selects that the channel state is good, a positive reward value (+1) can be obtained; if the user terminal selects the channel state to be poor, a negative reward value (-1) is obtained; if the selected state of the user terminal is the uncertain channel state, a negative reward value (-0.1) is obtained, using rtIndicating a prize value.
A 3^N-state Markov chain is used to model the correlation between channels; the state space of the partially observable Markov chain is
S = {s_1, s_2, ..., s_{3^N}}
Each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, s_{i2}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel: good (1), bad (0), or uncertain (0.5). Each channel can be characterized by a 3 × 3 state transition matrix, which is specified as follows:
P_i = [P_i(x|y)], x, y ∈ {0, 0.5, 1}
where P_i(x|y), x, y ∈ {0, 0.5, 1}, is defined as the state transition probability of the channel from state x to state y. The state transition matrix of the entire Markov chain is defined as P. Since the user terminal can only sense one channel and observe its state at the beginning of each time slot, it cannot observe the channel states of all channels. However, the channel allocation system can observe and predict the distribution of channel conditions in the system. Thus, the dynamic multi-channel access problem is modeled in the general framework of a partially observable Markov decision process and follows the constraints below:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} indicates the channel state of the j-th channel.
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots.
C3 is the update rule for each possible state in the confidence vector, and I(·) is an indicator function. In each time slot the channel allocation system needs to assign an access policy to the user terminal; a(t) is the channel accessed by the user terminal in time slot t, i.e., the execution action of the user terminal, which is represented numerically as:
a_t ∈ {0, 1, 2, ..., N}
where a_t = 0 denotes that the user terminal does not transmit data in time slot t, and a_t = n (1 ≤ n ≤ N) denotes that the user terminal accesses the n-th channel to transmit data in time slot t.
o(t) is the channel state observation of the channel accessed by the user terminal in time slot t: an observation of 1 indicates that the channel state is good, 0.5 indicates that it is uncertain, and 0 indicates that it is poor.
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain.
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained in time slot t+1, according to the reward algorithm, after action a is performed in the channel state s of time slot t; note that the user takes action a_t in state s_t at time slot t and receives the corresponding reward in time slot t+1. The reward algorithm is configured as:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
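The confidence-vector handling of constraints C2-C4 can be pictured with the following Python sketch. It is an illustration only, not the patented implementation: the per-channel transition probabilities are made up, and the transition matrix P of the full chain is built here under the simplifying assumption that channels evolve independently (a Kronecker product), whereas the 3^N-state chain of the model may also capture correlation between channels.

```python
import numpy as np

# Hypothetical per-channel 3x3 transition matrix over states (0, 0.5, 1);
# rows are "from" states, columns are "to" states (placeholder values).
STATES = np.array([0.0, 0.5, 1.0])
P_CH = np.array([[0.6, 0.2, 0.2],
                 [0.3, 0.4, 0.3],
                 [0.1, 0.2, 0.7]])

def enumerate_states(n_channels):
    """All 3^N joint channel states, each a length-N vector over {0, 0.5, 1}."""
    grids = np.meshgrid(*([STATES] * n_channels), indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=1)

def joint_transition_matrix(n_channels):
    """Transition matrix of the full chain, assuming independent channels."""
    P = P_CH
    for _ in range(n_channels - 1):
        P = np.kron(P, P_CH)
    return P

def belief_update(belief, states, accessed_channel, observation, P):
    """C3/C4: condition the belief on the observed channel, then propagate by P."""
    indicator = (states[:, accessed_channel] == observation).astype(float)  # I(s_{i,a(t)} = o(t))
    conditioned = belief * indicator
    conditioned /= conditioned.sum()          # C3: normalized conditioning
    return conditioned @ P                    # C4: Omega(t+1) = Omega'(t) P

if __name__ == "__main__":
    N = 2
    states = enumerate_states(N)
    P = joint_transition_matrix(N)
    belief = np.full(len(states), 1.0 / len(states))   # uniform prior
    belief = belief_update(belief, states, accessed_channel=0, observation=1.0, P=P)
    print(belief.round(3))
```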
in the dynamic multi-channel model, the channel allocation system needs to maximize a long-term accumulated discount reward value, the accumulated discount reward value represents an accumulated value of a reward value obtained after a period of time slot execution action is predicted according to a current channel state, and a calculation algorithm of the accumulated discount reward value is configured as follows:
R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
the discount factor gamma (gamma is more than or equal to 0 and less than or equal to 1), and the absolute value of the obtained reward value is relatively smaller when the predicted time slot is longer than the current time slot through the algorithm, so that the influence on the accumulated discount reward value is smaller when the predicted time slot is longer than the current time slot.
C6 is to find the optimal channel allocation strategy by Bellman equation
Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a]
Q-learning is the most common way of solving this in reinforcement learning, using the iterative update
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]
However, the Q-learning process becomes complicated when the action space is large. Deep reinforcement learning combines traditional reinforcement learning with a deep neural network to remedy this defect. A deep neural network can find the mathematical relation between input data and output data, so a main neural network with weight w is used to approximate the optimal strategy algorithm, i.e., Q(s, a; w) ≈ Q^π(s, a), while a target neural network Q(s', a'; w^-) is used to generate the target values required for training the main neural network. The two networks share the same architecture and differ only in their weights; this setting breaks up correlations, with the main neural network estimating the Q value using the latest parameters while the target neural network keeps parameters from some time ago. Another feature is experience replay, which learns from previous experience. These two features make the deep reinforcement learning method superior to traditional reinforcement learning. Referring to fig. 1 and 3, the deep reinforcement learning method includes the following steps:
S10, an experience pool, a main neural network Q(s, a; w) and a target neural network Q(s', a'; w^-) are configured in the channel allocation system. The experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store. The main neural network Q(s, a; w) and the target neural network Q(s', a'; w^-) are weighted approximations of the optimal strategy algorithm, where s is the channel state, a is the execution action (the execution action a represents the allocation of a channel), w is the weight of the neural network, and initially w^- = w. The channel allocation system receives an operator instruction assigning values to the learning rate α, the capacity threshold D, the discount factor γ, the allocation probability value ε, the number of channels N and the update interval step number C, and the process enters S20.
S20, the channel allocation system obtains the execution action a of the next time slot through a preset allocation algorithm, according to the channel state s of the channel allocated to the user terminal in the current time slot. The allocation algorithm is configured as:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
where argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value; the process then enters S30.
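A minimal sketch of this ε-greedy allocation rule is given below; it is an illustration under assumed inputs (the Q values would come from the main neural network), not the patented implementation:

```python
import random

def select_action(q_values, epsilon=0.9):
    """epsilon-greedy allocation: exploit the max-Q access action with probability
    epsilon, otherwise choose randomly among all possible access actions."""
    if random.random() < epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])  # argmax_a Q(s, a; w)
    return random.randrange(len(q_values))                           # a_random

if __name__ == "__main__":
    q = [0.1, 0.7, -0.3, 0.2]   # made-up Q estimates for actions 0..3
    print(select_action(q, epsilon=0.9))
```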
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, and stores it. The reward algorithm is configured as:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
That is, the channel allocation system allocates a channel to the user terminal, and the user terminal transmits data on this channel based on the channel state observation o_t. When the data is transmitted successfully, the reward is r_{t+1} = 1; when the data transmission fails, the reward is r_{t+1} = -1; when no data is transmitted on this channel, the reward is r_{t+1} = -0.1. The process then proceeds to S40.
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then proceeds to S50.
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, the process proceeds to step S60.
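Steps S40 and S50 can be pictured with the following sketch of an experience pool with capacity threshold D and random sampling; the class and method names are assumptions made for illustration, not the patent's implementation:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (s_t, a_t, r_t, s_{t+1}) data sets up to a capacity threshold D."""

    def __init__(self, capacity_d: int):
        self.capacity_d = capacity_d
        self.buffer = deque(maxlen=capacity_d)

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def is_full(self) -> bool:
        # S50: training only starts once the pool has reached its capacity threshold D
        return len(self.buffer) >= self.capacity_d

    def sample(self, batch_size: int):
        # S60: several sets of data obtained by random sampling
        return random.sample(list(self.buffer), batch_size)

if __name__ == "__main__":
    pool = ExperiencePool(capacity_d=4)
    for t in range(6):
        pool.add(s_t=[1.0, 0.5], a_t=1, r_t=1.0, s_next=[0.0, 1.0])
    print(pool.is_full(), len(pool.sample(2)))
```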
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network Q(s, a; w) is trained on each set of data to obtain an estimated Q value, and the target neural network Q(s', a'; w^-) calculates an actual Q value through a preset actual Q value algorithm. The actual Q value algorithm is configured as:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
where y_t is the actual Q value; the process then proceeds to S70.
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, wherein the error algorithm is configured as follows:
L(w) = (y_t - Q(s_t, a_t; w))^2
The weight w of the main neural network Q(s, a; w) is then updated by gradient descent, specifically:
w ← w - α·∂L(w)/∂w
where α is a preset learning rate; the process then proceeds to S80.
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weights of the target neural network Q(s', a'; w^-) are replaced by the weights of the main neural network Q(s, a; w). The process then proceeds to S90.
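One possible realization of steps S60-S80 is sketched below in PyTorch: the main network gives the estimated Q value Q(s_t, a_t; w), the target network gives the actual Q value y_t, the squared error is minimized by gradient descent, and every C steps the target weights are replaced by the main weights. The fully-connected architecture with three hidden layers of 50 neurons follows the embodiment described below; the optimizer choice, tensor shapes and the random stand-in batch are assumptions made purely for illustration.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully-connected Q network with three hidden layers of 50 neurons."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def train_step(main_net, target_net, optimizer, batch, gamma=0.9):
    """One update on a sampled batch of (s_t, a_t, r_t, s_{t+1}) tuples (S60-S70)."""
    states, actions, rewards, next_states = batch
    q_estimated = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; w)
    with torch.no_grad():
        q_next_max = target_net(next_states).max(dim=1).values                  # max_a' Q(s_{t+1}, a'; w-)
        y_t = rewards + gamma * q_next_max                                      # actual Q value
    loss = ((y_t - q_estimated) ** 2).mean()                                    # L(w) = (y_t - Q)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                            # gradient descent on w
    return loss.item()

if __name__ == "__main__":
    state_dim, n_actions, step_c = 4, 5, 300
    main_net = QNetwork(state_dim, n_actions)
    target_net = copy.deepcopy(main_net)                  # initially w- = w (S10)
    optimizer = torch.optim.Adam(main_net.parameters(), lr=0.01)

    for step in range(1, 601):
        # Made-up random batch standing in for samples drawn from the experience pool.
        batch = (torch.randn(32, state_dim),
                 torch.randint(0, n_actions, (32,)),
                 torch.randn(32),
                 torch.randn(32, state_dim))
        loss = train_step(main_net, target_net, optimizer, batch)
        if step % step_c == 0:                            # S80: every C steps, let w- = w
            target_net.load_state_dict(main_net.state_dict())
    print("final loss:", loss)
```

The Adam optimizer and the learning rate 0.01 here mirror the settings reported for the embodiment; any other gradient-descent optimizer could be substituted without changing the structure of the update.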
S90, the error value is compared with a preset convergence critical value; when the error value is larger than the convergence critical value, the process returns to step S30, and when the error value is not larger than the convergence critical value, the process ends. The convergence critical value represents the maximum error value of the main neural network Q(s, a; w) in the converged state.
The main neural network Q(s, a; w) and the target neural network Q(s', a'; w^-) are fully-connected neural networks with three hidden layers (50 neurons each); the Adam optimizer is adopted for optimization, and the main parameter settings of the networks are shown in Table 1.
Table 1 Main parameter settings
Learning rate α: 0.01
Capacity threshold D: 10000
Discount factor γ: 0.9
Allocation probability value ε: 0.9
Number of channels N: 32
Update interval step number C: 300
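For convenience, the parameter settings of Table 1 can be gathered into a single configuration mapping, as in the following illustrative sketch (the key names are assumptions; the values are taken from Table 1):

```python
# Main parameter settings from Table 1 (illustrative key names, values from the table).
DQN_CONFIG = {
    "learning_rate_alpha": 0.01,
    "capacity_threshold_d": 10000,
    "discount_factor_gamma": 0.9,
    "allocation_probability_epsilon": 0.9,
    "num_channels_n": 32,
    "update_interval_steps_c": 300,
}

if __name__ == "__main__":
    for name, value in DQN_CONFIG.items():
        print(f"{name}: {value}")
```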
Referring to fig. 4 and 5, the magnitude of the learning rate directly affects the convergence performance of the error algorithm: if the learning rate is too small, convergence is slow; if it is too large, the optimum may be skipped and oscillations may even occur, so the setting of the learning rate is very important. Referring to fig. 4, as the number of training iterations increases, all three curves tend to converge, and in particular a learning rate of 0.01 converges with a small number of iterations; referring to fig. 5, when the learning rate is set to 0.1, sudden increases in the error value occur and the performance is poor.
Fig. 6 and 7 compare the performance of the dynamic multi-channel model using the deep reinforcement learning method with the ideal state and with random selection. In the ideal state, the channel allocation system evaluates all possible choices and selects the access strategy that maximizes the Q value in each round. In random selection, the channel allocation system randomly selects an access strategy in each round. Referring to fig. 6 and 7, the normalized reward obtained by the deep reinforcement learning method is far better than that of random selection, although random selection has the lowest error value. When ε is set to 0.99, the normalized reward obtained with the deep reinforcement learning method is 12.45 percent lower than the ideal state, and when ε is set to 0.9, the normalized reward is very close to the ideal state, which demonstrates that the dynamic multi-channel access method can obtain a near-optimal channel allocation in the dynamic multi-channel model through the deep reinforcement learning method.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (5)

1. A dynamic multichannel access method based on deep reinforcement learning applied to a cellular network, characterized in that: a channel allocation system and a plurality of user terminals are provided, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model is configured in the channel allocation system, the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state represents whether data are successfully sent on the channel, the optimal strategy algorithm is optimized through a deep reinforcement learning method, and the deep reinforcement learning method comprises the following steps;
S10, an experience pool, a main neural network and a target neural network are configured in the channel allocation system; the experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state, the execution action and the weights of the neural networks, wherein the channel state is s, the execution action is a, the execution action a represents the allocation of a channel, the weight of the main neural network is w, the weight of the target neural network is w^-, and initially the weight of the target neural network is equal to that of the main neural network; the process then enters S20;
S20, the channel allocation system obtains the execution action a of the next time slot through the preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and the process enters S30;
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then goes to S50;
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each set of data to obtain an estimated Q value, the target neural network calculates an actual Q value through a preset actual Q value algorithm, and the process goes to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and entering S80;
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weight of the target neural network is set to the weight of the main neural network; the process then proceeds to S90;
S90, comparing the error value with a preset convergence critical value, returning to step S30 when the error value is larger than the convergence critical value, and ending the process when the error value is not larger than the convergence critical value, wherein the convergence critical value represents the maximum error value of the main neural network in the convergence state.
The dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the following constraints:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel;
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots;
C3 is the update rule for each possible state in the confidence vector, where I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation of that channel; an observation of 1 indicates that the channel state is good, 0.5 indicates that the channel state is uncertain, and 0 indicates that the channel state is poor;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained at time slot t+1 after action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, obtained when the accumulated reward value is maximized.
2. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the allocation algorithm is configured to:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
wherein argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value.
3. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the reward algorithm is configured to:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
4. the dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the actual Q value algorithm is configured to:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
wherein y_t is the actual Q value.
5. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 4, characterized in that: the error algorithm is configured to:
L(w) = (y_t - Q(s_t, a_t; w))^2
wherein L(w) is the error value.
CN202011055360.3A 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network Active CN112188503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011055360.3A CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011055360.3A CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Publications (2)

Publication Number Publication Date
CN112188503A CN112188503A (en) 2021-01-05
CN112188503B true CN112188503B (en) 2021-06-22

Family

ID=73946065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011055360.3A Active CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Country Status (1)

Country Link
CN (1) CN112188503B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925319B (en) * 2021-01-25 2022-06-07 哈尔滨工程大学 Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN112954814B (en) * 2021-01-27 2022-05-20 哈尔滨工程大学 Channel quality access method in cognitive radio
CN113784359A (en) * 2021-09-08 2021-12-10 昆明理工大学 Dynamic channel access method based on improved BP neural network algorithm
CN115811801A (en) * 2021-09-15 2023-03-17 华为技术有限公司 Communication method and related device
CN115103372A (en) * 2022-06-17 2022-09-23 东南大学 Multi-user MIMO system user scheduling method based on deep reinforcement learning
CN115811788B (en) * 2022-11-23 2023-07-18 齐齐哈尔大学 D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108966352B (en) * 2018-07-06 2019-09-27 北京邮电大学 Dynamic beam dispatching method based on depth enhancing study
CN110856268B (en) * 2019-10-30 2021-09-07 西安交通大学 Dynamic multichannel access method for wireless network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning for Dynamic; Shangxing Wang et al.; IEEE Transactions on Cognitive Communications and Networking; 2018-06-30; full text *
Deep Reinforcement Learning for Dynamic; Y. Xu et al.; Milcom 2018 Track 5 - Big Data and Machine Learning; 2020-01-01; full text *
Dynamic Multi-channel Access in Wireless System; 李凡 et al.; 12th International Conference on Advanced Computational Intelligence; 2020-08-16; full text *

Also Published As

Publication number Publication date
CN112188503A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112188503B (en) Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN112105062B (en) Mobile edge computing network energy consumption minimization strategy method under time-sensitive condition
CN113038616B (en) Frequency spectrum resource management and allocation method based on federal learning
CN111182637A (en) Wireless network resource allocation method based on generation countermeasure reinforcement learning
CN110856268B (en) Dynamic multichannel access method for wireless network
CN110233755B (en) Computing resource and frequency spectrum resource allocation method for fog computing in Internet of things
CN111556572A (en) Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN113596785B (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN109831808B (en) Resource allocation method of hybrid power supply C-RAN based on machine learning
CN111262638B (en) Dynamic spectrum access method based on efficient sample learning
CN110519849B (en) Communication and computing resource joint allocation method for mobile edge computing
CN112202847B (en) Server resource allocation method based on mobile edge calculation
Lei et al. Learning-based resource allocation: Efficient content delivery enabled by convolutional neural network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN113810910B (en) Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
CN114126021A (en) Green cognitive radio power distribution method based on deep reinforcement learning
CN117119486B (en) Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network
CN103618674A (en) A united packet scheduling and channel allocation routing method based on an adaptive service model
CN110392377B (en) 5G ultra-dense networking resource allocation method and device
CN111917529A (en) Underwater sound OFDM resource allocation method based on improved EXP3 algorithm
CN114615705B (en) Single-user resource allocation strategy method based on 5G network
Eskandari et al. Smart Interference Management xApp using Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant