CN112188503A - Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network - Google Patents
Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
- Publication number
- CN112188503A (application number CN202011055360.3A)
- Authority
- CN
- China
- Prior art keywords
- channel
- value
- time slot
- neural network
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 50
- 230000002787 reinforcement Effects 0.000 title claims abstract description 43
- 230000001413 cellular effect Effects 0.000 title claims abstract description 13
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 47
- 238000004891 communication Methods 0.000 claims abstract description 15
- 238000013528 artificial neural network Methods 0.000 claims description 54
- 230000009471 action Effects 0.000 claims description 34
- 230000008569 process Effects 0.000 claims description 9
- 230000007704 transition Effects 0.000 claims description 7
- 239000011159 matrix material Substances 0.000 claims description 5
- 230000008859 change Effects 0.000 claims description 4
- 230000006870 function Effects 0.000 claims description 4
- 238000011478 gradient descent method Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000001228 spectrum Methods 0.000 abstract description 21
- 238000004364 calculation method Methods 0.000 abstract description 5
- 238000012549 training Methods 0.000 abstract description 5
- 238000005457 optimization Methods 0.000 abstract description 3
- 238000005516 engineering process Methods 0.000 description 9
- 230000001149 cognitive effect Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 230000007547 defect Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000013475 authorization Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 230000010355 oscillation Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/02—Resource partitioning among network components, e.g. reuse partitioning
- H04W16/10—Dynamic resource partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a dynamic multichannel access method based on deep reinforcement learning, applied to a cellular network. The technical scheme provides a channel allocation system and a plurality of user terminals, the channel allocation system being in communication connection with the user terminals. A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; according to the channel states of the current time slot, the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm, and the optimal strategy algorithm is trained and optimized by a deep reinforcement learning method. The method avoids a huge exponential amount of computation through deep reinforcement learning, enables the user terminal to quickly access the optimal channel while guaranteeing its communication quality, and improves spectrum utilization.
Description
Technical Field
The invention relates to the technical field of communication, in particular to a dynamic multichannel access method based on deep reinforcement learning, which is applied to a cellular network.
Background
Radio spectrum is a limited and precious natural resource in wireless communication. Existing wireless communication allocates spectrum through a license-based approach: the radio spectrum is divided into a number of segments of fixed width, which are assigned by government regulatory departments to users for exclusive use. However, with the rapid development of wireless communication technology, the continuous growth of new services, and the inefficiency of spectrum utilization, spectrum resources are becoming increasingly scarce, and the increasingly scarce spectrum can no longer meet the growing demand of wireless communication. This has prompted the development of efficient dynamic spectrum access schemes to serve emerging wireless network technologies. Cognitive radio has become a key technology for improving spectrum utilization; its main idea is to detect which spectrum segments are idle and then intelligently select and access the idle spectrum, which can greatly improve spectrum utilization.
Dynamic spectrum access, one of the key technologies of cognitive radio, is an active area of research. The existing approach is mainly Markov modeling, i.e., the dynamic spectrum access process of a user terminal is modeled as a Markov model, and the access procedure is described accurately by a two-dimensional or multi-dimensional Markov chain. Markov modeling can improve spectrum utilization, but it places high requirements on knowledge of the environment, the system undergoes no learning process, and convergence is slow.
The vigorous development of reinforcement learning has brought new directions to dynamic spectrum access. Reinforcement learning learns a mapping from environment states to actions, and focuses on how a system learns an optimal behavior policy when the state transition probability function is unknown. Reinforcement learning requires little environmental knowledge, adapts well to dynamically changing environments, and is highly compatible with wireless networks; these characteristics give it broad prospects in the cognitive radio field. However, when the number of user terminals increases sharply, the number of states grows exponentially, the algorithm complexity becomes very large, and this exponential amount of computation makes reinforcement learning difficult to use in practice.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a dynamic multichannel access method based on deep reinforcement learning, applied to a cellular network, which avoids a huge exponential amount of computation, enables the user terminal to quickly access the optimal channel while guaranteeing its communication quality, and improves spectrum utilization.
In order to achieve the above purpose, the invention provides the following technical scheme: a dynamic multichannel access method based on deep reinforcement learning applied to a cellular network, which provides a channel allocation system and a plurality of user terminals, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state representing whether data is successfully transmitted on the channel; the optimal strategy algorithm is optimized through a deep reinforcement learning method, which comprises the following steps:
S10, configuring an experience pool, a main neural network and a target neural network in the channel allocation system, wherein the experience pool is used for storing data sets and has a capacity threshold D, the capacity threshold D representing the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state s, the execution action a, which represents the channel allocation mode, and the neural network weight w, the weight w⁻ of the target neural network being initialized equal to the weight of the main neural network; proceed to S20;
S20, the channel allocation system obtains the execution action a of the next time slot through a preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and proceeds to S30;
S30, the channel allocation system allocates a channel to the user terminal according to the execution action a, calculates the reward value r_{t+1} through a preset reward algorithm with whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a data set, where r_t is the reward value obtained in time slot t after the action a_{t-1} was performed in the channel state s_{t-1} of time slot t-1; proceed to S50;
S50, judging whether the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several data sets (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each data set to obtain an estimated Q value, the target neural network computes the actual Q value through a preset actual Q value algorithm, and the process proceeds to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and proceeding to S80;
S80, every preset update interval of C steps, let w⁻ = w, the update interval step number C representing the number of steps between updates in which the weight of the main neural network is copied to the target neural network; proceed to S90;
S90, comparing the error value with a preset convergence critical value; when the error value is larger than the convergence critical value, return to step S30, and when the error value is not larger than the convergence critical value, end the process, the convergence critical value representing the maximum error value of the main neural network in the converged state.
As a further improvement of the invention, the dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the constraints:
C4: Ω(t+1) = Ω'(t)P
wherein: C1 is the state space of the partially observable Markov chain, each state s_i (i ∈ {1, 2, ..., 3^N}) being a vector of length N, [s_i1, ..., s_ij, ..., s_iN], where s_ij represents the channel state of the j-th channel;
C2 is the confidence vector, giving the conditional probability that the channel allocation system is in state s_i, given the execution actions of past time slots and the observed channel states, from which the channel state of each channel in the next time slot can be predicted;
C3 is the update rule for each possible state in the confidence vector, I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation value of the channel accessed by the user terminal in time slot t, where an observation value of 1 indicates a good channel state, 0.5 indicates an uncertain channel state, and 0 indicates a bad channel state;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, γ is the preset discount factor, and r_{t+1} is the reward value obtained in time slot t+1 after the action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, which is attained when the accumulated reward value is maximum.
As a further refinement of the invention, the allocation algorithm is configured to:
wherein the greedy choice denotes the access action with the maximum estimated Q value of the current main neural network, a_random denotes an access scheme selected at random from all possible access schemes, and ε is the preset allocation probability value.
As a further refinement of the present invention, the reward algorithm is configured to:
as a further improvement of the present invention, the actual Q value algorithm is configured to:
wherein y_t is the actual Q value.
As a further refinement of the invention, the error algorithm is configured to:
L(w) = (y_t − Q(s_t, a_t; w))²
wherein L(w) is the error value.
The invention has the following beneficial effects: a dynamic multi-channel model is configured in the channel allocation system to calculate the optimal channel allocation mode, and the optimal strategy algorithm is continuously optimized through deep reinforcement learning. The dynamic multi-channel access method reduces the requirements on the environment, so that through learning the channel allocation system can quickly allocate each channel to each user terminal in an optimized way, and solving the dynamic multi-channel model by the deep reinforcement learning method avoids a huge exponential amount of computation. Therefore, the dynamic multi-channel access method avoids huge exponential computation and, while guaranteeing the communication quality of the user terminal, enables the user terminal to quickly access the optimal channel and improves spectrum utilization.
Drawings
FIG. 1 is a flow chart of a deep reinforcement learning method;
fig. 2 is a diagram of a wireless network dynamic multi-channel access scenario;
FIG. 3 is a block diagram of a deep reinforcement learning method;
FIG. 4 is a graph comparing convergence of error algorithms at different learning rates;
FIG. 5 is a graph of the convergence of the error algorithm at a learning rate of 0.1;
FIG. 6 compares the normalized reward of the dynamic multi-channel model using the deep reinforcement learning method with the ideal state and with random selection;
FIG. 7 compares the error values of the dynamic multi-channel model using the deep reinforcement learning method with the ideal state and with random selection.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, 2 and 3, a dynamic multi-channel access method based on deep reinforcement learning applied to a cellular network according to the present embodiment provides a channel allocation system and a plurality of user terminals, where the channel allocation system is in communication connection with the user terminals.
A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system, and the dynamic multi-channel model is used for calculating the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot. The configuration principle of the dynamic multi-channel model is as follows:
referring to fig. 2, it is assumed that a certain range is covered by one base station and M user terminals, and each user terminal needs to select one transmission data packet from N channels. And assume that the user always has data to transmit and that the N channels are orthogonal to each other. In each time slot, the user terminal needs to dynamically sense the state of the channel and select one channel to transmit data, the states of the channels are three, namely a good channel state, an uncertain channel state and a bad channel state, the good channel state indicates that the data of the user terminal can be successfully transmitted, the uncertain channel state indicates that the data of the user terminal cannot be successfully transmitted, and the bad channel state indicates that the data of the user terminal cannot be successfully transmitted. The channel state is expressed by data words by S, and the expression rule is as follows:
the user terminal obtains corresponding reward according to the actual channel state of the distributed channel, if the user terminal selects that the channel state is good, a positive reward value (+1) can be obtained; if the user terminal selects the channel state to be poor, a negative reward value (-1) is obtained; if the selected state of the user terminal is the uncertain channel state, a negative reward value (-0.1) is obtained, using rtIndicating a prize value.
A 3^N-state Markov chain is used to model the correlation between channels. Each state s_i (i ∈ {1, 2, ..., 3^N}) of the state space of the partially observable Markov chain is a vector of length N, [s_i1, s_i2, ..., s_iN], where s_ij represents the channel state of the j-th channel: good (1), bad (0) or uncertain (0.5). Each channel can be characterized by a 3 × 3 state transition matrix, which is specified as follows:
wherein P_i(x | y), x, y ∈ {0, 0.5, 1}, is defined as the state transition probability of the channel from state x to state y. The state transition matrix of the entire Markov chain is defined as P. Since the user terminal can only sense one channel and observe its state at the beginning of each slot, it cannot observe the channel states of all channels. However, the channel allocation system can observe and predict the distribution of channel conditions in the system. The dynamic multi-channel access problem is therefore modeled within the general framework of a partially observable Markov decision process, subject to the following constraints:
C4: Ω(t+1) = Ω'(t)P
wherein: C1 is the state space of the partially observable Markov chain, each state s_i (i ∈ {1, 2, ..., 3^N}) being a vector of length N, [s_i1, ..., s_ij, ..., s_iN], where s_ij indicates the channel state of the j-th channel.
C2 is the confidence vector, giving the conditional probability that the channel allocation system is in state s_i, given the execution actions of past time slots and the observed channel states; from it the channel state of each channel in the next time slot can be predicted.
C3 is the update rule for each possible state in the confidence vector, and I(·) is an indicator function. In each time slot, the channel allocation system needs to assign an access policy to the user terminal; a(t) denotes the channel accessed by the user terminal in time slot t, i.e. the execution action of the user terminal, which is represented numerically as:
a_t ∈ {0, 1, 2, ..., N}
wherein a_t = 0 denotes that the user terminal does not transmit data in time slot t, and a_t = n (1 ≤ n ≤ N) denotes that the user terminal selects the n-th channel for transmitting data in time slot t.
And o (t) is a channel state observation value of a channel accessed by the user terminal in the t time slot, wherein the observation value is 1, the representation channel state is good, the observation value is 0.5, the representation channel state is uncertain, and the observation value is 0, the representation channel state is poor.
C4 is the updated formula for the confidence vector and P is the transition matrix for the partially observable markov chain.
C5 is the optimal strategy algorithm, gamma is the preset discount factor, rt+1The reward value obtained at t +1 time slot according to the reward algorithm after performing action a for channel state s of t time slot, it should be noted that the user is in state s at t time slottTaking action atFollowed by a prize in the t +1 slot. The reward algorithm is configured to:
in the dynamic multi-channel model, the channel allocation system needs to maximize a long-term accumulated discount reward value, the accumulated discount reward value represents an accumulated value of a reward value obtained after a period of time slot execution action is predicted according to a current channel state, and a calculation algorithm of the accumulated discount reward value is configured as follows:
the discount factor gamma (gamma is more than or equal to 0 and less than or equal to 1), and the absolute value of the obtained reward value is relatively smaller when the predicted time slot is longer than the current time slot through the algorithm, so that the influence on the accumulated discount reward value is smaller when the predicted time slot is longer than the current time slot.
C6 finds the optimal channel allocation strategy π* through the Bellman equation.
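Constraints C4 and C5 can be illustrated with a short numerical sketch. The sketch below is illustrative only: it assumes a toy chain over N = 2 channels built from a single made-up 3 × 3 per-channel transition matrix via a Kronecker product (i.e. it treats the channels as independent, which the patent does not require), and all numbers are placeholders.

```python
import numpy as np

# Toy joint transition matrix for N = 2 channels (states ordered bad, uncertain, good).
P_single = np.array([[0.7, 0.2, 0.1],
                     [0.2, 0.6, 0.2],
                     [0.1, 0.2, 0.7]])
P = np.kron(P_single, P_single)                 # 9 x 9 joint chain over 3^N states

# C4: propagate the confidence (belief) vector, Omega(t+1) = Omega'(t) P
belief = np.full(P.shape[0], 1.0 / P.shape[0])  # uniform initial belief
belief = belief @ P
assert np.isclose(belief.sum(), 1.0)            # still a probability distribution

# C5: accumulated discounted reward for a sequence of per-slot rewards r_{t+1}
gamma = 0.9
rewards = [1.0, -0.1, 1.0, -1.0]
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
print(discounted_return)                        # later rewards are weighted less
```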
Q-learning is the most common way in reinforcement learning to solve for π*; however, the Q-learning process becomes cumbersome when the action space is large. Deep reinforcement learning combines traditional reinforcement learning with deep neural networks to overcome this drawback. A deep neural network can discover the mathematical relationship between input data and output data, so a main neural network with weight w is used to approximate the optimal strategy algorithm, i.e. Q(s, a; w) ≈ Q_π(s, a), while a target neural network Q(s', a'; w⁻) is used to generate the target values required for training the main neural network. The two neural networks have the same architecture and differ only in their weights; this arrangement breaks the correlation between estimates and targets: the main neural network, which estimates the Q value, carries the latest parameters, while the target neural network carries parameters from some time earlier. Another feature is experience replay, i.e. learning from previous experience. These two features make the deep reinforcement learning method superior to traditional reinforcement learning. Referring to fig. 1 and 3, the deep reinforcement learning method includes the following steps:
S10, an experience pool, a main neural network Q(s, a; w) and a target neural network Q(s', a'; w⁻) are configured in the channel allocation system. The experience pool is used to store data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store. The main neural network Q(s, a; w) and the target neural network Q(s', a'; w⁻) are both constructed from the optimal strategy algorithm, where s is the channel state, a is the execution action (representing the channel allocation mode), w is the weight of the neural network, and initially w⁻ = w. The channel allocation system receives an operator instruction assigning values to the learning rate α, the capacity threshold D, the discount factor γ, the allocation probability value ε, the number of channels N and the update interval step number C, and proceeds to S20.
S20, the channel allocation system obtains the execution action a of the next time slot through a preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot. The allocation algorithm is configured to:
wherein the greedy choice denotes the access action with the maximum estimated Q value of the current main neural network, a_random denotes an access scheme selected at random from all possible access schemes, and ε is the preset allocation probability value; proceed to S30.
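A minimal sketch of such an ε-greedy allocation rule is given below. It assumes, as a reading of the description and of the ε = 0.9 setting in Table 1, that the greedy action is taken with probability ε and a random access scheme otherwise; the function name and the toy Q values are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(q_values: np.ndarray, epsilon: float) -> int:
    """Epsilon-greedy pick over N + 1 actions (0 = no transmission, 1..N = channel n)."""
    if rng.random() < epsilon:
        return int(np.argmax(q_values))        # exploit: max estimated Q of the main network
    return int(rng.integers(len(q_values)))    # explore: random access scheme a_random

# Example with N = 4 channels, i.e. 5 possible actions
print(choose_action(np.array([0.1, 0.4, -0.2, 0.9, 0.0]), epsilon=0.9))
```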
S30, the channel allocation system allocates a channel to the user terminal according to the execution action a, and calculates and stores the reward value r_{t+1} through a preset reward algorithm, with whether the user terminal successfully sends data through the channel as the variable. The reward algorithm is configured to:
That is, the channel allocation system allocates a channel to the user terminal, which transmits data over this channel based on the channel state observation o_t. When the data is sent successfully, the reward is r_{t+1} = 1; when the data transmission fails, the reward is r_{t+1} = -1; when no data is transmitted on this channel, the reward is r_{t+1} = -0.1; proceed to S40.
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a data set, where r_t is the reward value obtained in time slot t after the action a_{t-1} was performed in the channel state s_{t-1} of time slot t-1; proceed to S50.
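The experience pool of steps S10, S40 and S60 can be sketched as a simple replay buffer. This is an illustrative stand-in only: the names are ours, and a deque with a maximum length drops the oldest transitions once full, whereas the patent simply fills the pool up to the capacity threshold D before training.

```python
import random
from collections import deque

capacity_threshold_D = 10_000                     # value taken from Table 1
experience_pool = deque(maxlen=capacity_threshold_D)

def store(s_t, a_t, r_t, s_next):
    """Save one transition (s_t, a_t, r_t, s_{t+1}) as in step S40."""
    experience_pool.append((s_t, a_t, r_t, s_next))

def sample_minibatch(batch_size: int):
    """Random sampling of several data sets from the pool, as in step S60."""
    return random.sample(experience_pool, batch_size)
```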
S50, judging whether the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, proceed to step S60.
S60, the channel allocation system obtains several data sets (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network Q(s, a; w) is trained on each data set to obtain an estimated Q value, and the target neural network Q(s', a'; w⁻) computes the actual Q value through a preset actual Q value algorithm; the actual Q value algorithm is configured to:
wherein y_t is the actual Q value; proceed to S70.
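The formula for the actual Q value is given in the figure and is not reproduced here; the sketch below therefore assumes the standard DQN target y_t = r_{t+1} + γ · max_a' Q(s_{t+1}, a'; w⁻), which is consistent with the surrounding description but should be read as an assumption.

```python
import numpy as np

def dqn_target(reward: float, next_q_values: np.ndarray, gamma: float = 0.9) -> float:
    """Assumed target: immediate reward plus discounted best Q of the target network."""
    return reward + gamma * float(np.max(next_q_values))

print(dqn_target(1.0, np.array([0.2, 0.8, -0.1])))  # 1.0 + 0.9 * 0.8 = 1.72
```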
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, wherein the error algorithm is configured as follows:
L(w) = (y_t − Q(s_t, a_t; w))²
The weight w of the main neural network Q(s, a; w) is then updated by gradient descent, in the following way:
where α is the preset learning rate; proceed to S80.
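The gradient descent step itself is shown in a figure; the following sketch illustrates the squared-error loss L(w) and one update w ← w − α·∂L/∂w using a simple linear Q-function, so that the gradient can be written out explicitly. The linear stand-in and all numbers are illustrative; the patent uses a fully connected network.

```python
import numpy as np

alpha = 0.01                                   # learning rate (Table 1)
w = np.zeros(4)                                # weights of a toy linear Q approximator
features = np.array([1.0, 0.5, -0.2, 0.3])     # illustrative features of (s_t, a_t)
y_t = 1.72                                     # target (actual) Q value from the target network

q_est = w @ features                           # estimated Q(s_t, a_t; w)
loss = (y_t - q_est) ** 2                      # L(w) = (y_t - Q(s_t, a_t; w))^2
grad = -2.0 * (y_t - q_est) * features         # dL/dw for the linear model
w -= alpha * grad                              # one gradient descent step on the main network
print(loss, w)
```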
S80, every preset update interval of C steps, let w⁻ = w; the update interval step number C represents the number of steps between updates in which the weight of the main neural network Q(s, a; w) is copied to the target neural network Q(s', a'; w⁻); proceed to S90.
S90, the error value is compared with a preset convergence critical value; when the error value is larger than the convergence critical value, return to step S30, and when the error value is not larger than the convergence critical value, end the process, the convergence critical value representing the maximum error value of the main neural network Q(s, a; w) in the converged state.
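Steps S10–S90 can be tied together in a compact end-to-end sketch. The sketch below uses PyTorch purely as an illustration vehicle (the patent does not name a framework), replaces the Markov channel model with a toy random environment, scales the capacity threshold and update interval down so it runs quickly, and starts training once a minibatch is available rather than waiting for the pool to fill to D; every name and number here is illustrative.

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_CHANNELS, N_ACTIONS = 4, 5                # actions: 0 = no transmission, 1..N = access channel n
GAMMA, ALPHA, EPSILON = 0.9, 0.01, 0.9      # discount factor, learning rate, allocation probability (Table 1)
CAPACITY_D, UPDATE_C, BATCH = 500, 50, 32   # capacity threshold and update interval, scaled down

def make_net():
    """Fully connected network with three hidden layers of 50 neurons, as in the description."""
    return nn.Sequential(nn.Linear(N_CHANNELS, 50), nn.ReLU(),
                         nn.Linear(50, 50), nn.ReLU(),
                         nn.Linear(50, 50), nn.ReLU(),
                         nn.Linear(50, N_ACTIONS))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())              # S10: initialize w- = w
optimizer = torch.optim.Adam(main_net.parameters(), lr=ALPHA)  # Adam optimizer as in the description
pool = deque(maxlen=CAPACITY_D)                                # experience pool

def step_env(action):
    """Toy stand-in environment: random next channel states plus the patent's reward values."""
    next_state = torch.tensor(random.choices([0.0, 0.5, 1.0], k=N_CHANNELS))
    if action == 0:
        reward = -0.1                                          # no data transmitted
    else:
        obs = next_state[action - 1].item()
        reward = 1.0 if obs == 1.0 else (-1.0 if obs == 0.0 else -0.1)
    return reward, next_state

state = torch.full((N_CHANNELS,), 0.5)
for t in range(2000):
    # S20: epsilon-greedy action from the main network
    if random.random() < EPSILON:
        action = int(main_net(state).argmax())
    else:
        action = random.randrange(N_ACTIONS)
    # S30/S40: act, observe reward and next state, store the transition
    reward, next_state = step_env(action)
    pool.append((state, action, reward, next_state))
    state = next_state
    if len(pool) < BATCH:                                      # simplified stand-in for S50
        continue
    # S60: random minibatch; estimated Q from the main network, target Q from the target network
    s, a, r, s2 = map(list, zip(*random.sample(pool, BATCH)))
    s, s2 = torch.stack(s), torch.stack(s2)
    q_est = main_net(s).gather(1, torch.tensor(a).unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = torch.tensor(r) + GAMMA * target_net(s2).max(1).values
    # S70: squared-error loss and gradient descent on the main network weights
    loss = ((y - q_est) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # S80: copy the main network weights to the target network every C steps
    if t % UPDATE_C == 0:
        target_net.load_state_dict(main_net.state_dict())
```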
The main neural network Q(s, a; w) and the target neural network Q(s', a'; w⁻) are fully connected neural networks with three hidden layers of 50 neurons each, the Adam optimizer is used for optimization, and the main parameter settings of the networks are shown in Table 1.
Table 1 main parameter settings
Parameter | Value
---|---
Learning rate α | 0.01
Capacity threshold D | 10000
Discount factor γ | 0.9
Allocation probability value ε | 0.9
Number of channels N | 32
Update interval step number C | 300
Referring to fig. 4 and 5, the magnitude of the learning rate directly affects the convergence performance of the error algorithm. If the learning rate is too small, convergence is slow; if it is too large, the optimum may be skipped and oscillations may even occur, so the setting of the learning rate is very important. Referring to fig. 4, as the number of training iterations increases, all three curves tend to converge, and in particular a learning rate of 0.01 converges with a small number of training iterations; referring to fig. 5, when the learning rate is set to 0.1, sudden increases in the error value occur and the performance is poor.
Fig. 6 and 7 show the performance of the dynamic multi-channel model with deep reinforcement learning, compared with the ideal state and with random selection. In the ideal state, the channel allocation system evaluates all possible choices and selects the access strategy that maximizes the Q value in each round. In random selection, the channel allocation system randomly selects an access strategy in each round. Referring to fig. 6 and 7, the normalized reward obtained by the deep reinforcement learning method is far better than that of random selection, although random selection has the lowest error value. When the normalized reward is set to 0.99, the normalized reward obtained with the deep reinforcement learning method is 12.45 percent lower than the ideal state, and when the normalized reward is set to 0.9, it is nearly equal to the ideal state, which demonstrates that the dynamic multi-channel access method can obtain a near-optimal channel allocation in the dynamic multi-channel model through the deep reinforcement learning method.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (6)
1. A dynamic multichannel access method based on deep reinforcement learning applied to a cellular network, characterized in that: a channel allocation system and a plurality of user terminals are provided, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model is configured in the channel allocation system; the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state representing whether data is successfully sent on the channel; the optimal strategy algorithm is optimized through a deep reinforcement learning method, which comprises the following steps:
S10, configuring an experience pool, a main neural network and a target neural network in the channel allocation system, wherein the experience pool is used for storing data sets and has a capacity threshold D, the capacity threshold D representing the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state s, the execution action a, which represents the channel allocation mode, and the neural network weight w, the weight w⁻ of the target neural network being initialized equal to the weight of the main neural network; proceed to S20;
S20, the channel allocation system obtains the execution action a of the next time slot through a preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and proceeds to S30;
S30, the channel allocation system allocates a channel to the user terminal according to the execution action a, calculates the reward value r_{t+1} through a preset reward algorithm with whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a data set, where r_t is the reward value obtained in time slot t after the action a_{t-1} was performed in the channel state s_{t-1} of time slot t-1; proceed to S50;
S50, judging whether the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several data sets (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each data set to obtain an estimated Q value, the target neural network computes the actual Q value through a preset actual Q value algorithm, and the process proceeds to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and proceeding to S80;
S80, every preset update interval of C steps, let w⁻ = w, the update interval step number C representing the number of steps between updates in which the weight of the main neural network is copied to the target neural network; proceed to S90;
S90, comparing the error value with a preset convergence critical value; when the error value is larger than the convergence critical value, return to step S30, and when the error value is not larger than the convergence critical value, end the process, the convergence critical value representing the maximum error value of the main neural network in the converged state.
2. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the following constraints:
C4: Ω(t+1) = Ω'(t)P
wherein: C1 is the state space of the partially observable Markov chain, each state s_i (i ∈ {1, 2, ..., 3^N}) being a vector of length N, [s_i1, ..., s_ij, ..., s_iN], where s_ij represents the channel state of the j-th channel;
C2 is the confidence vector, giving the conditional probability that the channel allocation system is in state s_i, given the execution actions of past time slots and the observed channel states, from which the channel state of each channel in the next time slot can be predicted;
C3 is the update rule for each possible state in the confidence vector, I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation value of the channel accessed by the user terminal in time slot t, where an observation value of 1 indicates a good channel state, 0.5 indicates an uncertain channel state, and 0 indicates a bad channel state;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, γ is the preset discount factor, and r_{t+1} is the reward value obtained in time slot t+1 after the action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, which is attained when the accumulated reward value is maximum.
3. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 2, characterized in that: the allocation algorithm is configured to:
6. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network of claim 5, characterized in that: the error algorithm is configured to:
L(w) = (y_t − Q(s_t, a_t; w))²
wherein L(w) is the error value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011055360.3A CN112188503B (en) | 2020-09-30 | 2020-09-30 | Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011055360.3A CN112188503B (en) | 2020-09-30 | 2020-09-30 | Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112188503A true CN112188503A (en) | 2021-01-05 |
CN112188503B CN112188503B (en) | 2021-06-22 |
Family
ID=73946065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011055360.3A Active CN112188503B (en) | 2020-09-30 | 2020-09-30 | Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112188503B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112925319A (en) * | 2021-01-25 | 2021-06-08 | 哈尔滨工程大学 | Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning |
CN112954814A (en) * | 2021-01-27 | 2021-06-11 | 哈尔滨工程大学 | Channel quality access method in cognitive radio |
CN113784359A (en) * | 2021-09-08 | 2021-12-10 | 昆明理工大学 | Dynamic channel access method based on improved BP neural network algorithm |
CN115103372A (en) * | 2022-06-17 | 2022-09-23 | 东南大学 | Multi-user MIMO system user scheduling method based on deep reinforcement learning |
CN115811788A (en) * | 2022-11-23 | 2023-03-17 | 齐齐哈尔大学 | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning |
WO2023040812A1 (en) * | 2021-09-15 | 2023-03-23 | 华为技术有限公司 | Communication method and related apparatus |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108966352A (en) * | 2018-07-06 | 2018-12-07 | 北京邮电大学 | Dynamic beam dispatching method based on depth enhancing study |
CN110035478A (en) * | 2019-04-18 | 2019-07-19 | 北京邮电大学 | A kind of dynamic multi-channel cut-in method under high-speed mobile scene |
CN110691422A (en) * | 2019-10-06 | 2020-01-14 | 湖北工业大学 | Multi-channel intelligent access method based on deep reinforcement learning |
CN110856268A (en) * | 2019-10-30 | 2020-02-28 | 西安交通大学 | Dynamic multichannel access method for wireless network |
CN111628855A (en) * | 2020-05-09 | 2020-09-04 | 中国科学院沈阳自动化研究所 | Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning |
-
2020
- 2020-09-30 CN CN202011055360.3A patent/CN112188503B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108966352A (en) * | 2018-07-06 | 2018-12-07 | 北京邮电大学 | Dynamic beam dispatching method based on depth enhancing study |
CN110035478A (en) * | 2019-04-18 | 2019-07-19 | 北京邮电大学 | A kind of dynamic multi-channel cut-in method under high-speed mobile scene |
CN110691422A (en) * | 2019-10-06 | 2020-01-14 | 湖北工业大学 | Multi-channel intelligent access method based on deep reinforcement learning |
CN110856268A (en) * | 2019-10-30 | 2020-02-28 | 西安交通大学 | Dynamic multichannel access method for wireless network |
CN111628855A (en) * | 2020-05-09 | 2020-09-04 | 中国科学院沈阳自动化研究所 | Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning |
Non-Patent Citations (3)
Title |
---|
SHANGXING WANG et al.: "Deep Reinforcement Learning for Dynamic", IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING *
Y. XU et al.: "Deep Reinforcement Learning for Dynamic", MILCOM 2018 TRACK 5 - BIG DATA AND MACHINE LEARNING *
LI FAN et al.: "Dynamic Multi-channel Access in Wireless System", 12TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTATIONAL INTELLIGENCE *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112925319A (en) * | 2021-01-25 | 2021-06-08 | 哈尔滨工程大学 | Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning |
CN112954814A (en) * | 2021-01-27 | 2021-06-11 | 哈尔滨工程大学 | Channel quality access method in cognitive radio |
CN113784359A (en) * | 2021-09-08 | 2021-12-10 | 昆明理工大学 | Dynamic channel access method based on improved BP neural network algorithm |
WO2023040812A1 (en) * | 2021-09-15 | 2023-03-23 | 华为技术有限公司 | Communication method and related apparatus |
CN115103372A (en) * | 2022-06-17 | 2022-09-23 | 东南大学 | Multi-user MIMO system user scheduling method based on deep reinforcement learning |
CN115811788A (en) * | 2022-11-23 | 2023-03-17 | 齐齐哈尔大学 | D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning |
Also Published As
Publication number | Publication date |
---|---|
CN112188503B (en) | 2021-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112188503B (en) | Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network | |
CN109729528B (en) | D2D resource allocation method based on multi-agent deep reinforcement learning | |
CN109947545B (en) | Task unloading and migration decision method based on user mobility | |
CN110809306B (en) | Terminal access selection method based on deep reinforcement learning | |
CN111182637B (en) | Wireless network resource allocation method based on generation countermeasure reinforcement learning | |
CN111918339B (en) | AR task unloading and resource allocation method based on reinforcement learning in mobile edge network | |
CN109474980A (en) | A kind of wireless network resource distribution method based on depth enhancing study | |
CN112105062B (en) | Mobile edge computing network energy consumption minimization strategy method under time-sensitive condition | |
CN113038616B (en) | Frequency spectrum resource management and allocation method based on federal learning | |
CN107708152B (en) | Task unloading method of heterogeneous cellular network | |
CN110856268B (en) | Dynamic multichannel access method for wireless network | |
CN111556572A (en) | Spectrum resource and computing resource joint allocation method based on reinforcement learning | |
CN109831808B (en) | Resource allocation method of hybrid power supply C-RAN based on machine learning | |
CN110519849B (en) | Communication and computing resource joint allocation method for mobile edge computing | |
CN111262638B (en) | Dynamic spectrum access method based on efficient sample learning | |
CN112202847B (en) | Server resource allocation method based on mobile edge calculation | |
CN110233755A (en) | The computing resource and frequency spectrum resource allocation method that mist calculates in a kind of Internet of Things | |
Lei et al. | Learning-based resource allocation: Efficient content delivery enabled by convolutional neural network | |
CN114501667A (en) | Multi-channel access modeling and distributed implementation method considering service priority | |
CN117119486B (en) | Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network | |
CN110392377B (en) | 5G ultra-dense networking resource allocation method and device | |
CN111917529A (en) | Underwater sound OFDM resource allocation method based on improved EXP3 algorithm | |
CN114615705B (en) | Single-user resource allocation strategy method based on 5G network | |
CN115633402A (en) | Resource scheduling method for mixed service throughput optimization | |
Eskandari et al. | Smart interference management xApp using deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |