CN112188503B - Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network - Google Patents

Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Info

Publication number
CN112188503B
Authority
CN
China
Prior art keywords
channel
value
neural network
time slot
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011055360.3A
Other languages
Chinese (zh)
Other versions
CN112188503A (en)
Inventor
徐友云
李大鹏
蒋锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Nanjing Ai Er Win Technology Co ltd
Original Assignee
Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Nanjing Ai Er Win Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Nanyou Communication Network Industry Research Institute Co ltd, Nanjing Ai Er Win Technology Co ltd filed Critical Nanjing Nanyou Communication Network Industry Research Institute Co ltd
Priority to CN202011055360.3A priority Critical patent/CN112188503B/en
Publication of CN112188503A publication Critical patent/CN112188503A/en
Application granted granted Critical
Publication of CN112188503B publication Critical patent/CN112188503B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a dynamic multichannel access method based on deep reinforcement learning applied to a cellular network. The technical scheme provides a channel allocation system and a plurality of user terminals, the channel allocation system being in communication connection with the user terminals. A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; according to the channel states of the current time slot, the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm, and the optimal strategy algorithm is trained and optimized through a deep reinforcement learning method. The method avoids a huge exponential amount of computation through deep reinforcement learning, enables the user terminal to quickly access the optimal channel on the premise of guaranteeing the communication quality of the user terminal, and improves spectrum utilization.

Description

Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
Technical Field
The invention relates to the technical field of communication, in particular to a dynamic multichannel access method based on deep reinforcement learning, which is applied to a cellular network.
Background
Radio spectrum is a limited and precious natural resource in wireless communication. Existing wireless communication allocates spectrum on an authorization basis: the radio spectrum is divided into a number of spectrum segments of fixed width, which are allocated by government administration departments to user terminals for exclusive use. However, with the rapid development of wireless communication technology and the continuous growth of new services, compounded by the inefficiency of spectrum utilization, spectrum resources are becoming increasingly scarce, and the increasingly scarce spectrum cannot meet the growing demand of wireless communication. This has prompted the development of efficient dynamic spectrum access schemes to support emerging wireless network technologies. Cognitive radio has become a key technology for improving spectrum utilization; its main idea is to detect which spectrum segments are idle and then intelligently select and access the idle spectrum, which can greatly improve spectrum utilization.
Dynamic spectrum access, one of the key technologies of cognitive radio, is under active research. The existing method is mainly Markov modeling, that is, the dynamic spectrum access process of a user terminal is modeled as a Markov model, and the access procedure is described accurately by a two-dimensional or multi-dimensional Markov chain. Markov modeling can improve spectrum utilization, but it places high requirements on the environment, the system undergoes no learning process, and the convergence speed is low.
The vigorous development of reinforcement learning has opened new research directions for dynamic spectrum access. Reinforcement learning learns a mapping from environment states to actions, and focuses on how a system learns an optimal behavior strategy when the state transition probability function is unknown. Reinforcement learning requires little knowledge of the environment, adapts well to dynamically changing environments, and is highly compatible with wireless networks; these characteristics give it broad prospects in the cognitive radio field. However, when the number of user terminals increases sharply, the number of states generated by reinforcement learning also grows exponentially, the algorithm complexity becomes very large, and this exponential amount of computation makes reinforcement learning difficult to use in practice.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a dynamic multichannel access method based on deep reinforcement learning, which is applied to a cellular network, and can avoid huge exponential calculation, so that a user terminal can be quickly accessed to an optimal channel on the premise of ensuring the communication quality of the user terminal, and the spectrum utilization rate is improved.
In order to achieve the above purpose, the invention provides the following technical scheme: a dynamic multichannel access method based on deep reinforcement learning applied to a cellular network provides a channel allocation system and a plurality of user terminals, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system; the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state represents whether data is successfully transmitted on the channel, and the optimal strategy algorithm is optimized through a deep reinforcement learning method, which comprises the following steps:
S10, an experience pool, a main neural network and a target neural network are configured in the channel allocation system; the experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state, the execution action and the weights of the neural networks, wherein the channel state is s, the execution action is a, the execution action a represents the allocation of a channel, the weight of the main neural network is w, the weight of the target neural network is w^-, and initially the weight of the target neural network is equal to that of the main neural network; the process then enters S20;
S20, the channel allocation system obtains the execution action a of the next time slot through the preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and the process enters S30;
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then goes to S50;
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each set of data to obtain an estimated Q value, the target neural network calculates an actual Q value through a preset actual Q value algorithm, and the process goes to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and entering S80;
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weight of the target neural network is set to the weight of the main neural network; the process then proceeds to S90;
S90, comparing the error value with a preset convergence critical value, returning to step S30 when the error value is larger than the convergence critical value, and ending the process when the error value is not larger than the convergence critical value, wherein the convergence critical value represents the maximum error value of the main neural network in the convergence state.
The dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the following constraints:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel;
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots;
C3 is the update rule for each possible state in the confidence vector, where I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation of that channel; an observation of 1 indicates that the channel state is good, 0.5 indicates that the channel state is uncertain, and 0 indicates that the channel state is poor;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained at time slot t+1 after action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, obtained when the accumulated reward value is maximized.
As a further refinement of the invention, the allocation algorithm is configured to:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
wherein argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value.
As a further refinement of the present invention, the reward algorithm is configured to:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
as a further improvement of the present invention, the actual Q value algorithm is configured to:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
wherein y_t is the actual Q value.
As a further refinement of the invention, the error algorithm is configured to:
L(w) = (y_t - Q(s_t, a_t; w))^2
wherein L(w) is the error value.
The invention has the beneficial effects that: a dynamic multi-channel model is configured in the channel allocation system to calculate the optimal channel allocation mode, and the optimal strategy algorithm is continuously optimized through deep reinforcement learning. The dynamic multi-channel access method reduces the requirements on the environment, so that through learning the channel allocation system can quickly allocate each channel to each user terminal in an optimized manner, and solving the dynamic multi-channel model by the deep reinforcement learning method avoids a huge exponential amount of computation. Therefore, the dynamic multi-channel access method avoids a huge exponential amount of computation, and on the premise of guaranteeing the communication quality of the user terminal it enables the user terminal to quickly access the optimal channel and improves spectrum utilization.
Drawings
FIG. 1 is a flow chart of a deep reinforcement learning method;
fig. 2 is a diagram of a wireless network dynamic multi-channel access scenario;
FIG. 3 is a block diagram of a deep reinforcement learning method;
FIG. 4 is a graph comparing convergence of error algorithms at different learning rates;
FIG. 5 is a graph of the convergence of the error algorithm at a learning rate of 0.1;
FIG. 6 is a comparison of the dynamic multi-channel model using a deep reinforcement learning method with the ideal state and random selection for normalized reward;
FIG. 7 is a comparison of error values for a dynamic multi-channel model using a deep reinforcement learning method versus an ideal state and a random selection.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Referring to fig. 1, 2 and 3, a dynamic multi-channel access method based on deep reinforcement learning applied to a cellular network according to the present embodiment provides a channel allocation system and a plurality of user terminals, where the channel allocation system is in communication connection with the user terminals.
A dynamic multi-channel model following a partially observable Markov chain is configured in the channel allocation system, and the dynamic multi-channel model is used for calculating the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot. The configuration principle of the dynamic multi-channel model is as follows:
referring to fig. 2, it is assumed that a certain range is covered by one base station and M user terminals, and each user terminal needs to select one transmission data packet from N channels. And assume that the user always has data to transmit and that the N channels are orthogonal to each other. In each time slot, the user terminal needs to dynamically sense the state of the channel and select one channel to transmit data, the states of the channels are three, namely a good channel state, an uncertain channel state and a bad channel state, the good channel state indicates that the data of the user terminal can be successfully transmitted, the uncertain channel state indicates that the data of the user terminal cannot be successfully transmitted, and the bad channel state indicates that the data of the user terminal cannot be successfully transmitted. The channel state is expressed by data words by S, and the expression rule is as follows:
s_j = 1, if the channel state is good
s_j = 0.5, if the channel state is uncertain
s_j = 0, if the channel state is bad
the user terminal obtains corresponding reward according to the actual channel state of the distributed channel, if the user terminal selects that the channel state is good, a positive reward value (+1) can be obtained; if the user terminal selects the channel state to be poor, a negative reward value (-1) is obtained; if the selected state of the user terminal is the uncertain channel state, a negative reward value (-0.1) is obtained, using rtIndicating a prize value.
A 3^N-state Markov chain is used to model the correlation between channels; the state space of the partially observable Markov chain is
S = {s_1, s_2, ..., s_{3^N}}
Each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, s_{i2}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel: good (1), bad (0), or uncertain (0.5). Each channel can be characterized by a 3 × 3 state transition matrix, which is specified as follows:
P_i = [P_i(x|y)], x, y ∈ {0, 0.5, 1}
where P_i(x|y), x, y ∈ {0, 0.5, 1}, is defined as the state transition probability of the channel from state x to state y. The state transition matrix of the entire Markov chain is defined as P. Since the user terminal can only sense one channel and observe its state at the beginning of each time slot, it cannot observe the channel states of all channels. However, the channel allocation system can observe and predict the distribution of channel conditions in the system. Thus, the dynamic multi-channel access problem is modeled in the general framework of a partially observable Markov decision process and follows the constraints below:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} indicates the channel state of the j-th channel.
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots.
C3 is the update rule for each possible state in the confidence vector, and I(·) is an indicator function. In each time slot the channel allocation system needs to assign an access policy to the user terminal; a(t) is the channel accessed by the user terminal in time slot t, i.e., the execution action of the user terminal, which is represented numerically as:
a_t ∈ {0, 1, 2, ..., N}
where a_t = 0 denotes that the user terminal does not transmit data in time slot t, and a_t = n (1 ≤ n ≤ N) denotes that the user terminal accesses the n-th channel to transmit data in time slot t.
o(t) is the channel state observation of the channel accessed by the user terminal in time slot t: an observation of 1 indicates that the channel state is good, 0.5 indicates that it is uncertain, and 0 indicates that it is poor.
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain.
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained in time slot t+1, according to the reward algorithm, after action a is performed in the channel state s of time slot t; note that the user takes action a_t in state s_t at time slot t and receives the corresponding reward in time slot t+1. The reward algorithm is configured as:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
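The confidence-vector handling of constraints C2-C4 can be pictured with the following Python sketch. It is an illustration only, not the patented implementation: the per-channel transition probabilities are made up, and the transition matrix P of the full chain is built here under the simplifying assumption that channels evolve independently (a Kronecker product), whereas the 3^N-state chain of the model may also capture correlation between channels.

```python
import numpy as np

# Hypothetical per-channel 3x3 transition matrix over states (0, 0.5, 1);
# rows are "from" states, columns are "to" states (placeholder values).
STATES = np.array([0.0, 0.5, 1.0])
P_CH = np.array([[0.6, 0.2, 0.2],
                 [0.3, 0.4, 0.3],
                 [0.1, 0.2, 0.7]])

def enumerate_states(n_channels):
    """All 3^N joint channel states, each a length-N vector over {0, 0.5, 1}."""
    grids = np.meshgrid(*([STATES] * n_channels), indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=1)

def joint_transition_matrix(n_channels):
    """Transition matrix of the full chain, assuming independent channels."""
    P = P_CH
    for _ in range(n_channels - 1):
        P = np.kron(P, P_CH)
    return P

def belief_update(belief, states, accessed_channel, observation, P):
    """C3/C4: condition the belief on the observed channel, then propagate by P."""
    indicator = (states[:, accessed_channel] == observation).astype(float)  # I(s_{i,a(t)} = o(t))
    conditioned = belief * indicator
    conditioned /= conditioned.sum()          # C3: normalized conditioning
    return conditioned @ P                    # C4: Omega(t+1) = Omega'(t) P

if __name__ == "__main__":
    N = 2
    states = enumerate_states(N)
    P = joint_transition_matrix(N)
    belief = np.full(len(states), 1.0 / len(states))   # uniform prior
    belief = belief_update(belief, states, accessed_channel=0, observation=1.0, P=P)
    print(belief.round(3))
```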
in the dynamic multi-channel model, the channel allocation system needs to maximize a long-term accumulated discount reward value, the accumulated discount reward value represents an accumulated value of a reward value obtained after a period of time slot execution action is predicted according to a current channel state, and a calculation algorithm of the accumulated discount reward value is configured as follows:
R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
the discount factor gamma (gamma is more than or equal to 0 and less than or equal to 1), and the absolute value of the obtained reward value is relatively smaller when the predicted time slot is longer than the current time slot through the algorithm, so that the influence on the accumulated discount reward value is smaller when the predicted time slot is longer than the current time slot.
C6 is to find the optimal channel allocation strategy by Bellman equation
Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a]
Q-learning is the most common way of solving this in reinforcement learning, using the iterative update
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]
However, the Q-learning process becomes complicated when the action space is large. Deep reinforcement learning combines traditional reinforcement learning with a deep neural network to remedy this defect. A deep neural network can find the mathematical relation between input data and output data, so a main neural network with weight w is used to approximate the optimal strategy algorithm, i.e., Q(s, a; w) ≈ Q^π(s, a), while a target neural network Q(s', a'; w^-) is used to generate the target values required for training the main neural network. The two networks share the same architecture and differ only in their weights; this setting breaks up correlations, with the main neural network estimating the Q value using the latest parameters while the target neural network keeps parameters from some time ago. Another feature is experience replay, which learns from previous experience. These two features make the deep reinforcement learning method superior to traditional reinforcement learning. Referring to fig. 1 and 3, the deep reinforcement learning method includes the following steps:
S10, an experience pool, a main neural network Q(s, a; w) and a target neural network Q(s', a'; w^-) are configured in the channel allocation system. The experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store. The main neural network Q(s, a; w) and the target neural network Q(s', a'; w^-) are weighted approximations of the optimal strategy algorithm, where s is the channel state, a is the execution action (the execution action a represents the allocation of a channel), w is the weight of the neural network, and initially w^- = w. The channel allocation system receives an operator instruction assigning values to the learning rate α, the capacity threshold D, the discount factor γ, the allocation probability value ε, the number of channels N and the update interval step number C, and the process enters S20.
S20, the channel allocation system obtains the execution action a of the next time slot through a preset allocation algorithm, according to the channel state s of the channel allocated to the user terminal in the current time slot. The allocation algorithm is configured as:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
where argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value; the process then enters S30.
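A minimal sketch of this ε-greedy allocation rule is given below; it is an illustration under assumed inputs (the Q values would come from the main neural network), not the patented implementation:

```python
import random

def select_action(q_values, epsilon=0.9):
    """epsilon-greedy allocation: exploit the max-Q access action with probability
    epsilon, otherwise choose randomly among all possible access actions."""
    if random.random() < epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])  # argmax_a Q(s, a; w)
    return random.randrange(len(q_values))                           # a_random

if __name__ == "__main__":
    q = [0.1, 0.7, -0.3, 0.2]   # made-up Q estimates for actions 0..3
    print(select_action(q, epsilon=0.9))
```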
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, and stores it. The reward algorithm is configured as:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
That is, the channel allocation system allocates a channel to the user terminal, and the user terminal transmits data on this channel based on the channel state observation o_t. When the data is transmitted successfully, the reward is r_{t+1} = 1; when the data transmission fails, the reward is r_{t+1} = -1; when no data is transmitted on this channel, the reward is r_{t+1} = -0.1. The process then proceeds to S40.
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then proceeds to S50.
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, the process proceeds to step S60.
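Steps S40 and S50 can be pictured with the following sketch of an experience pool with capacity threshold D and random sampling; the class and method names are assumptions made for illustration, not the patent's implementation:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (s_t, a_t, r_t, s_{t+1}) data sets up to a capacity threshold D."""

    def __init__(self, capacity_d: int):
        self.capacity_d = capacity_d
        self.buffer = deque(maxlen=capacity_d)

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def is_full(self) -> bool:
        # S50: training only starts once the pool has reached its capacity threshold D
        return len(self.buffer) >= self.capacity_d

    def sample(self, batch_size: int):
        # S60: several sets of data obtained by random sampling
        return random.sample(list(self.buffer), batch_size)

if __name__ == "__main__":
    pool = ExperiencePool(capacity_d=4)
    for t in range(6):
        pool.add(s_t=[1.0, 0.5], a_t=1, r_t=1.0, s_next=[0.0, 1.0])
    print(pool.is_full(), len(pool.sample(2)))
```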
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network Q(s, a; w) is trained on each set of data to obtain an estimated Q value, and the target neural network Q(s', a'; w^-) calculates an actual Q value through a preset actual Q value algorithm. The actual Q value algorithm is configured as:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
where y_t is the actual Q value; the process then proceeds to S70.
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, wherein the error algorithm is configured as follows:
L(w) = (y_t - Q(s_t, a_t; w))^2
The weight w of the main neural network Q(s, a; w) is then updated by gradient descent, specifically:
w ← w - α·∂L(w)/∂w
where α is a preset learning rate; the process then proceeds to S80.
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weights of the target neural network Q(s', a'; w^-) are replaced by the weights of the main neural network Q(s, a; w). The process then proceeds to S90.
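One possible realization of steps S60-S80 is sketched below in PyTorch: the main network gives the estimated Q value Q(s_t, a_t; w), the target network gives the actual Q value y_t, the squared error is minimized by gradient descent, and every C steps the target weights are replaced by the main weights. The fully-connected architecture with three hidden layers of 50 neurons follows the embodiment described below; the optimizer choice, tensor shapes and the random stand-in batch are assumptions made purely for illustration.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully-connected Q network with three hidden layers of 50 neurons."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def train_step(main_net, target_net, optimizer, batch, gamma=0.9):
    """One update on a sampled batch of (s_t, a_t, r_t, s_{t+1}) tuples (S60-S70)."""
    states, actions, rewards, next_states = batch
    q_estimated = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; w)
    with torch.no_grad():
        q_next_max = target_net(next_states).max(dim=1).values                  # max_a' Q(s_{t+1}, a'; w-)
        y_t = rewards + gamma * q_next_max                                      # actual Q value
    loss = ((y_t - q_estimated) ** 2).mean()                                    # L(w) = (y_t - Q)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                            # gradient descent on w
    return loss.item()

if __name__ == "__main__":
    state_dim, n_actions, step_c = 4, 5, 300
    main_net = QNetwork(state_dim, n_actions)
    target_net = copy.deepcopy(main_net)                  # initially w- = w (S10)
    optimizer = torch.optim.Adam(main_net.parameters(), lr=0.01)

    for step in range(1, 601):
        # Made-up random batch standing in for samples drawn from the experience pool.
        batch = (torch.randn(32, state_dim),
                 torch.randint(0, n_actions, (32,)),
                 torch.randn(32),
                 torch.randn(32, state_dim))
        loss = train_step(main_net, target_net, optimizer, batch)
        if step % step_c == 0:                            # S80: every C steps, let w- = w
            target_net.load_state_dict(main_net.state_dict())
    print("final loss:", loss)
```

The Adam optimizer and the learning rate 0.01 here mirror the settings reported for the embodiment; any other gradient-descent optimizer could be substituted without changing the structure of the update.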
S90, the error value is compared with a preset convergence critical value; when the error value is larger than the convergence critical value, the process returns to step S30, and when the error value is not larger than the convergence critical value, the process ends. The convergence critical value represents the maximum error value of the main neural network Q(s, a; w) in the converged state.
The main neural network Q(s, a; w) and the target neural network Q(s', a'; w^-) are fully-connected neural networks with three hidden layers (50 neurons each); the Adam optimizer is adopted for optimization, and the main parameter settings of the networks are shown in Table 1.
Table 1 Main parameter settings
Learning rate α: 0.01
Capacity threshold D: 10000
Discount factor γ: 0.9
Allocation probability value ε: 0.9
Number of channels N: 32
Update interval step number C: 300
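For convenience, the parameter settings of Table 1 can be gathered into a single configuration mapping, as in the following illustrative sketch (the key names are assumptions; the values are taken from Table 1):

```python
# Main parameter settings from Table 1 (illustrative key names, values from the table).
DQN_CONFIG = {
    "learning_rate_alpha": 0.01,
    "capacity_threshold_d": 10000,
    "discount_factor_gamma": 0.9,
    "allocation_probability_epsilon": 0.9,
    "num_channels_n": 32,
    "update_interval_steps_c": 300,
}

if __name__ == "__main__":
    for name, value in DQN_CONFIG.items():
        print(f"{name}: {value}")
```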
Referring to fig. 4 and 5, the magnitude of the learning rate directly affects the convergence performance of the error algorithm: if the learning rate is too small, convergence is slow; if it is too large, the optimum may be skipped and oscillations may even occur, so the setting of the learning rate is very important. Referring to fig. 4, as the number of training iterations increases, all three curves tend to converge, and in particular a learning rate of 0.01 converges with a small number of iterations; referring to fig. 5, when the learning rate is set to 0.1, sudden increases in the error value occur and the performance is poor.
Fig. 6 and 7 compare the performance of the dynamic multi-channel model using the deep reinforcement learning method with the ideal state and with random selection. In the ideal state, the channel allocation system evaluates all possible choices and selects the access strategy that maximizes the Q value in each round. In random selection, the channel allocation system randomly selects an access strategy in each round. Referring to fig. 6 and 7, the normalized reward obtained by the deep reinforcement learning method is far better than that of random selection, although random selection has the lowest error value. When ε is set to 0.99, the normalized reward obtained with the deep reinforcement learning method is 12.45 percent lower than the ideal state, and when ε is set to 0.9, the normalized reward is very close to the ideal state, which demonstrates that the dynamic multi-channel access method can obtain a near-optimal channel allocation in the dynamic multi-channel model through the deep reinforcement learning method.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (5)

1. A dynamic multichannel access method based on deep reinforcement learning applied to a cellular network, characterized in that: a channel allocation system and a plurality of user terminals are provided, wherein the channel allocation system is in communication connection with the user terminals;
a dynamic multi-channel model is configured in the channel allocation system, the dynamic multi-channel model calculates the optimal channel allocation mode of the next time slot through an optimal strategy algorithm according to each channel state of the current time slot, the channel state represents whether data are successfully sent on the channel, the optimal strategy algorithm is optimized through a deep reinforcement learning method, and the deep reinforcement learning method comprises the following steps;
S10, an experience pool, a main neural network and a target neural network are configured in the channel allocation system; the experience pool is used for storing data sets and has a capacity threshold D, which represents the maximum number of data sets the experience pool can store; the main neural network and the target neural network are constructed through the optimal strategy algorithm, and their parameters comprise the channel state, the execution action and the weights of the neural networks, wherein the channel state is s, the execution action is a, the execution action a represents the allocation of a channel, the weight of the main neural network is w, the weight of the target neural network is w^-, and initially the weight of the target neural network is equal to that of the main neural network; the process then enters S20;
S20, the channel allocation system obtains the execution action a of the next time slot through the preset allocation algorithm according to the channel state s of the channel allocated to the user terminal in the current time slot, and the process enters S30;
S30, the channel allocation system allocates the channel to the user terminal according to the execution action a; the channel allocation system calculates the reward value r_{t+1} through the preset reward algorithm, taking whether the user terminal successfully sends data through the channel as the variable, saves it, and proceeds to S40;
S40, the channel allocation system obtains the channel state s_{t+1} of the next time slot from the channel state s_t of the current time slot and the execution action a_t of the current time slot, and saves (s_t, a_t, r_t, s_{t+1}) to the experience pool as a set of data, where r_t is the reward value obtained in time slot t after the execution action a_{t-1} is taken in the channel state s_{t-1} of time slot t-1; the process then goes to S50;
S50, it is judged whether the capacity of the experience pool has reached the capacity threshold D; if not, let s_t = s_{t+1} and return to step S20; otherwise, go to step S60;
S60, the channel allocation system obtains several sets of data (s_t, a_t, r_t, s_{t+1}) from the experience pool by random sampling; the main neural network is trained on each set of data to obtain an estimated Q value, the target neural network calculates an actual Q value through a preset actual Q value algorithm, and the process goes to S70;
S70, calculating an error value between the estimated Q value and the actual Q value through a preset error algorithm, updating the weight w of the main neural network according to a gradient descent method, and entering S80;
S80, every preset update interval step number C, let w^- = w; the update interval step number C represents the number of steps after which the weight of the target neural network is set to the weight of the main neural network; the process then proceeds to S90;
S90, comparing the error value with a preset convergence critical value, returning to step S30 when the error value is larger than the convergence critical value, and ending the process when the error value is not larger than the convergence critical value, wherein the convergence critical value represents the maximum error value of the main neural network in the convergence state.
The dynamic multi-channel model is a dynamic multi-channel model following a partially observable Markov chain, and the dynamic multi-channel model follows the following constraints:
C1: S = {s_1, s_2, ..., s_{3^N}}
C2: Ω(t) = [ω_{s_1}(t), ω_{s_2}(t), ..., ω_{s_{3^N}}(t)]
C3: ω'_{s_i}(t) = ω_{s_i}(t)·I(s_{i,a(t)} = o(t)) / Σ_{s_j∈S} ω_{s_j}(t)·I(s_{j,a(t)} = o(t))
C4: Ω(t+1) = Ω'(t)P
C5: Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]
C6: π* = argmax_π Q^π(s, a)
wherein: C1 is the state space of the partially observable Markov chain; each state s_i (i ∈ {1, 2, ..., 3^N}) is a vector of length N, [s_{i1}, ..., s_{ij}, ..., s_{iN}], where s_{ij} represents the channel state of the j-th channel;
C2 is the confidence vector, where ω_{s_i}(t) is the conditional probability that the channel allocation system is in state s_i, given the execution actions and channel state observations of past time slots;
C3 is the update rule for each possible state in the confidence vector, where I(·) is an indicator function, a(t) is the channel accessed by the user terminal in time slot t, and o(t) is the channel state observation of that channel; an observation of 1 indicates that the channel state is good, 0.5 indicates that the channel state is uncertain, and 0 indicates that the channel state is poor;
C4 is the update formula of the confidence vector, and P is the transition matrix of the partially observable Markov chain;
C5 is the optimal strategy algorithm, where γ is the preset discount factor and r_{t+1} is the reward value obtained at time slot t+1 after action a is executed in the channel state s of time slot t;
C6 is the optimal channel allocation strategy, obtained when the accumulated reward value is maximized.
2. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the allocation algorithm is configured to:
a = argmax_{a'} Q(s, a'; w), with probability ε
a = a_random, with probability 1 - ε
wherein argmax_{a'} Q(s, a'; w) represents the access action with the maximum estimated Q value of the current main neural network, a_random indicates that one access scheme is randomly selected from all possible access schemes, and ε is a preset allocation probability value.
3. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the reward algorithm is configured to:
r_{t+1} = +1, if the data is transmitted successfully
r_{t+1} = -1, if the data transmission fails
r_{t+1} = -0.1, if no data is transmitted on the channel
4. the dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 1, characterized in that: the actual Q value algorithm is configured to:
y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; w^-)
wherein y_t is the actual Q value.
5. The dynamic multichannel access method based on deep reinforcement learning applied to the cellular network according to claim 4, characterized in that: the error algorithm is configured to:
L(w) = (y_t - Q(s_t, a_t; w))^2
wherein L(w) is the error value.
CN202011055360.3A 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network Active CN112188503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011055360.3A CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011055360.3A CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Publications (2)

Publication Number Publication Date
CN112188503A CN112188503A (en) 2021-01-05
CN112188503B true CN112188503B (en) 2021-06-22

Family

ID=73946065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011055360.3A Active CN112188503B (en) 2020-09-30 2020-09-30 Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network

Country Status (1)

Country Link
CN (1) CN112188503B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925319B (en) * 2021-01-25 2022-06-07 哈尔滨工程大学 Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN112954814B (en) * 2021-01-27 2022-05-20 哈尔滨工程大学 Channel quality access method in cognitive radio
CN113784359A (en) * 2021-09-08 2021-12-10 昆明理工大学 Dynamic channel access method based on improved BP neural network algorithm
CN115811801A (en) * 2021-09-15 2023-03-17 华为技术有限公司 Communication method and related device
CN115103372A (en) * 2022-06-17 2022-09-23 东南大学 Multi-user MIMO system user scheduling method based on deep reinforcement learning
CN115811788B (en) * 2022-11-23 2023-07-18 齐齐哈尔大学 D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108966352B (en) * 2018-07-06 2019-09-27 北京邮电大学 Dynamic beam dispatching method based on depth enhancing study
CN110856268B (en) * 2019-10-30 2021-09-07 西安交通大学 Dynamic multichannel access method for wireless network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning
CN111628855A (en) * 2020-05-09 2020-09-04 中国科学院沈阳自动化研究所 Industrial 5G dynamic multi-priority multi-access method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning for Dynamic; Shangxing Wang et al.; IEEE Transactions on Cognitive Communications and Networking; 2018-06-30; full text *
Deep Reinforcement Learning for Dynamic; Y. Xu et al.; Milcom 2018 Track 5 - Big Data and Machine Learning; 2020-01-01; full text *
Dynamic Multi-channel Access in Wireless System; 李凡 et al.; 12th International Conference on Advanced Computational Intelligence; 2020-08-16; full text *

Also Published As

Publication number Publication date
CN112188503A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112188503B (en) Dynamic multichannel access method based on deep reinforcement learning and applied to cellular network
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN112105062B (en) Mobile edge computing network energy consumption minimization strategy method under time-sensitive condition
CN113038616B (en) Frequency spectrum resource management and allocation method based on federal learning
CN111182637A (en) Wireless network resource allocation method based on generation countermeasure reinforcement learning
CN110856268B (en) Dynamic multichannel access method for wireless network
CN110233755B (en) Computing resource and frequency spectrum resource allocation method for fog computing in Internet of things
CN111556572A (en) Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN113596785B (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN109831808B (en) Resource allocation method of hybrid power supply C-RAN based on machine learning
CN111262638B (en) Dynamic spectrum access method based on efficient sample learning
CN110519849B (en) Communication and computing resource joint allocation method for mobile edge computing
CN112202847B (en) Server resource allocation method based on mobile edge calculation
Lei et al. Learning-based resource allocation: Efficient content delivery enabled by convolutional neural network
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN113810910B (en) Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
CN114126021A (en) Green cognitive radio power distribution method based on deep reinforcement learning
CN117119486B (en) Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network
CN103618674A (en) A united packet scheduling and channel allocation routing method based on an adaptive service model
CN110392377B (en) 5G ultra-dense networking resource allocation method and device
CN111917529A (en) Underwater sound OFDM resource allocation method based on improved EXP3 algorithm
CN114615705B (en) Single-user resource allocation strategy method based on 5G network
Eskandari et al. Smart Interference Management xApp using Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant