CN111654342A - Dynamic spectrum access method based on reinforcement learning with priori knowledge - Google Patents
- Legal status: Granted
Classifications
- H04B17/382—Monitoring; Testing of propagation channels for resource allocation, admission control or handover
- H04B17/336—Signal-to-interference ratio [SIR] or carrier-to-interference ratio [CIR]
- H04W16/14—Spectrum sharing arrangements between different networks
- H04W74/0833—Non-scheduled or contention based access, e.g. random access, ALOHA, CSMA [Carrier Sense Multiple Access], using a random access procedure
Abstract
The invention discloses a dynamic spectrum access method based on reinforcement learning with prior knowledge, belonging to the technical field of electromagnetic spectrum. First, a secondary user acquires spectrum-access environment information. Next, the spectrum-access evaluation model of the network is determined, with the mean opinion score (MOS) model adopted as the access evaluation model. Prior knowledge is then constructed: the environment knowledge of the secondary users already in the system is acquired and used to build the prior knowledge. Q-learning is then performed starting from this prior knowledge to obtain the secondary user's Q-table information, and finally dynamic spectrum access is performed according to the learned Q-table information. In addition, the invention optimizes the action-selection process of Q-learning with an ε-greedy strategy, avoiding entrapment in local optima during Q-learning. By constructing and exploiting prior knowledge, the invention effectively improves the learning efficiency and the dynamic spectrum access performance of the system.
Description
Technical Field
The invention belongs to the technical field of electromagnetic spectrum, and particularly relates to a dynamic spectrum access method based on prior knowledge reinforcement learning.
Background
With the continuous expansion of wireless applications, on the one hand the communication systems' demand for wireless resources has made spectrum resources scarce; on the other hand, conventional static spectrum resource management keeps spectrum utilization low. In recent years the Dynamic Spectrum Access (DSA) mode of spectrum use has attracted scholars' attention, and Cognitive Radio (CR) is one of the popular research directions. Its main idea is that a Secondary User (SU) with spectrum-sensing capability actively senses spectrum usage and communicates by "opportunistically" accessing idle channels, on the premise of causing no harmful interference to the spectrum-licensed authorized user (PU).
Dynamic spectrum access techniques have achieved considerable success in academia. For example, the literature (AKBARZADEH N, MAHAJAN A. Dynamic spectrum access under partial observability: A restless bandit approach [C] // Canadian Workshop on Information Theory. 2019: 1-6.) models the dynamic spectrum access problem under multichannel transmission conditions as a partially observable Markov decision process (POMDP) and uses a Whittle-index method to assist decision-making; simulation results show that when the model is indexable, transmitting on the channel with the smallest Whittle index is the optimal strategy. The literature (YANG H, CHEN H. Energy-efficient channel access with data priority in cognitive radio sensor networks [C] // International Conference on Software, Telecommunications and Computer Networks. 2019: 1-5.) proposes, for a Cognitive Radio Sensor Network (CRSN), a dynamic channel-access scheme based on data priority and energy-consumption minimization, which allocates power according to the data priority of each node and then reasonably allocates transmission time to each node so as to minimize energy consumption. To accommodate the signal-to-noise-ratio and throughput requirements of authorized users, the literature (GURAJAPU S, RAJ S, CHOUSHAN S. Spectrum Allocation and Power Management using Markov Chains and Beamforming in Underlay Cognitive Radios [C]) models the dynamic spectrum allocation process of a cognitive radio system with a Markov chain and combines the Markov-chain model with beamforming; dynamic spectrum allocation based on this scheme is markedly improved in terms of authorized-user throughput.
The literature (PASTIRCAK J, GAZDA J, KOCUR D. A survey on the spectrum trading in dynamic spectrum access networks [C] // International Symposium Electronics in Marine. 2014: 135-138. // ZHAO Q, SADLER B M. A Survey of Dynamic Spectrum Access [J]. IEEE Signal Processing Magazine, 2007, 24(3): 79-89. // SHARMILA A, DANANJAYAN P. Spectrum Sharing Techniques in Cognitive Radio Networks - A Survey [C] // International Conference on System, Computation, Automation and Networking. 2019: 1-4.) introduces three general dynamic spectrum access models: the open sharing model, the licensed (shared) use model, and the dynamic exclusive use model. The open sharing model lets all users use spectrum resources equally, but easily causes interference problems; the licensed sharing model reduces interference to authorized users but limits the transmit power of secondary users; the dynamic exclusive model avoids additional harmful interference but allows dynamic spectrum allocation only among authorized users.
Most existing work is built on the premise that the spectrum-access environment knowledge and dynamics model are known, but in practical applications complete environment knowledge, such as the authorized users' operating conditions and frequency-usage characteristics, is usually difficult to obtain. To solve the problem of efficient spectrum access without prior knowledge, reinforcement-learning methods have received increasing attention in recent years; their basic principle is to optimize system performance by continuously learning through interaction with the spectrum environment when no prior knowledge of that environment is available. Their limitation is that, without prior knowledge, the convergence speed of learning is limited, which restricts the achievable spectrum utilization. In practical dynamic spectrum access, the secondary user usually sits between these two extremes: some prior environment knowledge can be obtained, but it is not sufficient. For example, before a secondary user accesses the spectrum dynamically, a spectrum database may already provide prior knowledge such as the locations and interference constraints of the authorized users in the area, but not their specific spectrum usage. In this scenario, closer to practical application, how to acquire sufficient environment knowledge through reinforcement learning and thereby realize efficient spectrum utilization is an urgent problem for making dynamic spectrum access practical.
Disclosure of Invention
The technical problem is as follows: the invention provides a dynamic spectrum access method based on reinforcement learning with prior knowledge, which, under conditions where the spectrum-access environment knowledge and the dynamics model are partially unknown, improves spectrum access efficiency and realizes globally optimized resource allocation, thereby maximizing the transmission quality of the cognitive network.
The technical scheme is as follows: the invention relates to a dynamic spectrum access method based on prior knowledge reinforcement learning, which comprises the following steps:
s1: a secondary user acquires the spectrum-access environment information, the information comprising the authorized users, the signal-to-interference-plus-noise ratio (SINR) constraint conditions, the SU's own position, and the positions of nearby secondary users;
s2: determining the spectrum-access evaluation model of the network, with the mean opinion score (MOS) model adopted as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the prior knowledge to obtain the Q-table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
Further, in step S1, the signal-to-interference-plus-noise ratio (SINR) constraint condition is:

β_k^(p) ≥ β_0 for all k ∈ {1, …, M};  β_i^(s) ≥ β_i for all i ∈ {1, …, N}

where β_k^(p) represents the minimum SINR at the receiving end of authorized user PU_k; β_i^(s) represents the minimum SINR at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized users' base station and is a constant; β_i represents the SINR threshold of the secondary base station for SU_i; M represents the number of authorized users, N the number of secondary users, and k and i are indices.
Further, in step S2, the MOS model includes an MOS model for data and an MOS model for video, where the MOS model for data is:

Q_D = g·log10(b·r_i^(s)·(1 − p_e2e))

where Q_D represents the data-stream MOS, r_i^(s) the bit rate of secondary user SU_i, p_e2e the end-to-end packet loss rate, and g and b parameters obtained by fitting the data quality perceived by the end user;
the MOS model of the video is as follows:
in the formula, QVRepresenting MOS of the video, PSNR representing peak signal-to-noise ratio, c, d and f representing parameters of a logic function;
the MOS model is as follows:
in the formula, QμThe average MOS is represented, U represents the number of the secondary users with the service flow as data, N-U represents the number of the secondary users with the service flow as video, and m represents the serial number.
Further, in step S3, the specific method of acquiring the environment knowledge of the secondary users existing in the system and constructing the prior knowledge from it is: initialize the Q-table of the newly added secondary user with the Q-tables of the secondary users already in the system, according to:

Q_new = (1/K)·Σ_{i=1}^{K} Q_i

where Q_new represents the Q-table of the newly added secondary user, Q_i the Q-table of the i-th existing secondary user, K the number of Q-tables of existing secondary users in the system, and i the index.
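As an illustrative sketch (not part of the claimed method), the averaging initialization above can be written as follows; the list-of-lists Q-table layout and the function name `init_q_table` are assumptions for illustration:

```python
def init_q_table(prior_tables):
    """Initialize a newly joined SU's Q-table as the element-wise
    average of the K Q-tables already learned by existing SUs.
    Each table is a states x actions list of lists."""
    if not prior_tables:
        raise ValueError("need at least one prior Q-table")
    k = len(prior_tables)
    n_s, n_a = len(prior_tables[0]), len(prior_tables[0][0])
    return [[sum(t[s][a] for t in prior_tables) / k
             for a in range(n_a)] for s in range(n_s)]
```

For example, averaging two 2x2 Q-tables yields the element-wise mean of their entries.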
Further, in step S4, the step of performing Q learning using a priori knowledge includes:
s41: initializing the Q-tables of all secondary users (SUs) in the system, where a newly added SU initializes its Q-table using the prior knowledge;
s42: selecting an action: the action is selected according to a_t^i = argmax_{a ∈ A_i} Q(s_t^i, a), where s_t^i denotes the state of secondary user SU_i at time t and a_t^i denotes the action selected by SU_i at time t when the state is s_t^i;
s43: updating the state: the state set is defined as s_t = (φ^(p), φ^(s)), where φ^(p) and φ^(s) respectively represent whether the SINR limits of the authorized users and of the secondary users are satisfied, and the state is updated as:

φ^(p) = 1 if β_k^(p) ≥ β_0 for all k ∈ {1, …, M}, and 0 otherwise;  φ^(s) = 1 if β_i^(s) ≥ β_i for all i ∈ {1, …, N}, and 0 otherwise

where N represents the number of secondary users, i the index, and ψ_i and α_i are the intermediate process parameters through which the SINRs are recomputed;
s44: obtaining the return: the return value is updated using

V*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ·Σ_{s'} p(s'|s, a)·V*(s') ]

where V*(s) represents the optimal return, R_i(s, a) is the mathematical expectation of the MOS-based reward r_t^i(s_t, a_t^i), i.e. the reward obtained by user SU_i when taking action a_t^i in state s_t, p(s'|s, a) is the transition probability that state s reaches state s' under the action of action a, and π* is the optimal strategy;
s45: updating the Q value using

Q_{t+1}(s_t^i, a_t^i) = (1 − α_t)·Q_t(s_t^i, a_t^i) + α_t·[ r_t^i + γ·max_{a' ∈ A_i} Q_t(s', a') ]

where α_t denotes the learning rate, Q_t(s_t^i, a_t^i) denotes SU_i's Q value at the current state-action pair, max_{a' ∈ A_i} Q_t(s', a') denotes the largest Q value in the new state s' reached after taking action a_t^i, and A_i denotes the action set of secondary user SU_i;
S46: repeating steps S42 to S45 until convergence.
Further, in step S42, the action is selected according to

a* = a random action from A_i, if rand(0,1) > ε;  a* = argmax_a Q(s, a), otherwise

where ε denotes the exploration parameter, a* the selected action, rand() a random function, and Q(s, a) the Q value when the state is s and the action is a. If the random number rand(0,1) is greater than ε, the agent directly selects one of all actions at random and executes "exploration"; otherwise it executes "exploitation" and selects the action with the highest Q value.
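The action-selection rule above can be sketched as follows; this is an illustrative implementation under the text's convention (exploration when rand(0,1) exceeds ε), with the function name and Q-table layout assumed for illustration:

```python
import random

def select_action(q_row, epsilon, rng=None):
    """Select an action for the current state. q_row holds the Q values
    of that state, indexed by action. Following the text's convention,
    a draw of rand(0,1) greater than epsilon triggers 'exploration'
    (uniform random action); otherwise the agent 'exploits' the action
    with the highest Q value."""
    rng = rng or random
    if rng.random() > epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit
```

With ε = 1.0 the draw can never exceed ε, so the agent always exploits the current argmax.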
Further, the MOS-based reward is calculated as:

r_t^i(s_t, a_t^i) = r_suc, if the SINR constraints are still satisfied after the action;  r_t^i(s_t, a_t^i) = T, otherwise

where r_t^i(s_t, a_t^i) represents the reward obtained by user SU_i when taking action a_t^i in state s_t; a_t^i represents the action taken by SU_i at time t; A_i denotes the action set of secondary user SU_i; r_suc denotes the reward of a successful attempt; and T, a constant smaller than r_suc, denotes the reward of an unsuccessful attempt.
Advantageous effects: compared with the prior art, the invention has the following advantages:
(1) The disclosed dynamic spectrum access method based on reinforcement learning with prior knowledge can realize dynamic spectrum access when the spectrum-access environment and the dynamics model are partially unknown. On the premise of respecting the signal-to-interference-plus-noise ratio (SINR) limits of the primary network (PN) and the secondary network (SN), it uses the relationship between a secondary user's transmission rate and its SINR to acquire and update environment information through Q-learning, which effectively improves dynamic spectrum access efficiency and greatly improves system performance.
(2) By using the MOS as the evaluation model, the method can comprehensively evaluate both the data streams and the video streams of the system, and thus assess the system's dynamic spectrum access condition with good accuracy.
(3) The Q-learning action-selection process is optimized with an ε-greedy strategy, which solves the problem that the dynamic spectrum access algorithm easily falls into local optima and effectively reduces the number of iterations required for convergence, thereby improving dynamic spectrum access efficiency and system performance.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic block diagram of a dynamic spectrum access using the method of the present invention;
FIG. 3 is a diagram illustrating the relationship between the number of secondary users and the average MOS in the system according to the embodiment of the present invention;
FIG. 4 is a diagram illustrating a relationship between a number of secondary users and a collision probability in the system according to the embodiment of the present invention;
fig. 5 is a diagram illustrating a relationship between the number of secondary users and the average iteration number in the system according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
Note: the meanings of the English abbreviations appearing in this embodiment are as follows:
PN: primary network; SN: secondary network; PBS: primary base station; SBS: secondary base station; PU: authorized (primary) user; SU: secondary user; SINR: signal-to-interference-plus-noise ratio; MOS: mean opinion score; DSA: dynamic spectrum access.
Referring to fig. 1, the dynamic spectrum access method based on prior knowledge reinforcement learning of the present invention includes the following steps:
s1: a secondary user acquires the spectrum-access environment information, the information comprising the authorized users, the SINR constraint conditions, the SU's own position, and the positions of nearby secondary users;
s2: determining the access evaluation model of the network, with the mean opinion score (MOS) model adopted as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the prior knowledge to obtain the Q-table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
In general, the authorized users' conditions, the secondary user's own position, and the positions of nearby secondary users are directly available, so the key point of step S1 is determining the SINR constraints, which are described in detail below.
Assume that there are two wireless networks in the system, one the Primary Network (PN) and the other the Secondary Network (SN). The PN consists of one primary base station (PBS) and M authorized users (PUs); the SN consists of one Secondary Base Station (SBS) and N Secondary Users (SUs), the SUs being randomly distributed around the SBS.
Under a typical dynamic spectrum access model, a PN containing a single primary link shares a single channel with a SN at a given time. All secondary and primary links transmit using an Adaptive Modulation and Coding (AMC) scheme. Each SU can adjust the transmission parameters thereof to meet the interference requirements of the PU; the transmission parameters of SBS and PBS may not be adjusted.
The traffic flow carried on the SN link is video and data, and the transmission channel is assumed to be an additive white gaussian noise quasi-static channel. The PU employs AMC technology while assuming the transmit power to be constant, under which assumption the SU can infer Channel State Information (CSI) through active learning and then estimate the channel gain.
Because the SN communicates only by "opportunistically" accessing the channel when the PN does not occupy it, in the embodiment of the invention the SINRs of both networks are considered over the period in which the channel is accessed. The SINR received at authorized user PU_k (k ∈ M) in the primary network from authorized user PU_l (l ∈ M) is denoted β_{k,l}^(p), with k and l indices, 0 ≤ k, l ≤ M, and k or l equal to 0 when the transmitting or receiving station is the PBS; the SINR received at secondary user SU_i (i ∈ N) in the secondary network from secondary user SU_j (j ∈ N) is denoted β_{i,j}^(s), with i and j indices, 0 ≤ i, j ≤ N, and i or j equal to 0 when the transmitting or receiving station is the SBS.
Although only one pair of PN network devices (PBS and PU) and one pair of SN network devices (SBS and SU) communicate at a time, to avoid loss of generality the transmissions of all stations are considered here, giving:

β_{k,l}^(p) = G_{l,k}^(p)·P_l^(p) / ( Σ_{h ∈ M, h ≠ l} G_{h,k}^(p)·P_h^(p) + Σ_{i ∈ N} G_{i,k}^(s)·P_i^(s) + σ² )   (1)

Equation (1) expresses the SINR received at authorized user PU_k (k ∈ M) from authorized user PU_l (l ∈ M). In the denominator, the first sum, Σ_{h ∈ M, h ≠ l} G_{h,k}^(p)·P_h^(p), represents the total interference power generated inside the PN network, where the superscript (p) denotes the PN network, the subscript h ∈ M, h ≠ l denotes the interfering authorized user PU_h, G_{h,k}^(p) the channel gain from PU_h to PU_k, and P_h^(p) the transmit power of PU_h. The second sum, Σ_{i ∈ N} G_{i,k}^(s)·P_i^(s), represents the total interference power generated by the SN network on the PN, where the superscript (s) denotes the SN network, G_{i,k}^(s) the channel gain from secondary user SU_i (i ∈ N) to authorized user PU_k, and P_i^(s) the transmit power of SU_i. σ² in the denominator represents the noise power. In the numerator, G_{l,k}^(p) represents the channel gain from PU_l to PU_k and P_l^(p) (l ∈ M) the transmit power of PU_l, with P_l^(p) constant.

β_{i,j}^(s) = G_{j,i}^(s)·P_j^(s) / ( Σ_{f ∈ N, f ≠ j} G_{f,i}^(s)·P_f^(s) + Σ_{k ∈ M} G_{k,i}^(p)·P_k^(p) + σ² )   (2)

Equation (2) expresses the SINR received at secondary user SU_i (i ∈ N) from secondary user SU_j (j ∈ N). In the denominator, the first sum represents the total interference power generated inside the SN network, where f ∈ N, f ≠ j indexes the interfering secondary user SU_f, G_{f,i}^(s) is the channel gain from SU_f to SU_i, and P_f^(s) the transmit power of SU_f; the second sum represents the total interference power generated by the PN network on the SN, with G_{k,i}^(p) the channel gain from authorized user PU_k to secondary user SU_i and P_k^(p) the transmit power of PU_k. In the numerator, G_{j,i}^(s) is the channel gain from SU_j to SU_i and P_j^(s) the transmit power of SU_j. To meet the interference requirements under the DSA model, the invention imposes the following limits on β_{k,l}^(p) and β_{i,j}^(s):
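As a numerical sketch of the SINR of equation (1) (illustrative only; the gain matrices, powers, and function name are assumptions, not values from the patent):

```python
def sinr_pu(k, l, G_pp, P_p, G_sp, P_s, noise):
    """SINR at authorized receiver PU_k for the transmission from PU_l:
    desired power divided by (intra-PN interference + SN-to-PN
    interference + noise power sigma^2). G_pp[h][k] is the channel gain
    PU_h -> PU_k, G_sp[i][k] the gain SU_i -> PU_k; P_p and P_s are the
    transmit powers of the PUs and SUs."""
    desired = G_pp[l][k] * P_p[l]
    intra = sum(G_pp[h][k] * P_p[h] for h in range(len(P_p)) if h != l)
    cross = sum(G_sp[i][k] * P_s[i] for i in range(len(P_s)))
    return desired / (intra + cross + noise)
```

The SINR of equation (2) at a secondary receiver has the same shape, with the roles of the two networks exchanged.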
β_{k,l}^(p) ≥ β_0 and β_{i,j}^(s) ≥ β_i   (3)

In equation (3), β_0 and β_i respectively represent the SINR thresholds of the authorized base station PBS and of the secondary base station SBS, and β_0 is a constant. For convenience of explanation, the invention defines the minimum SINR at the receiving end of authorized user PU_k as β_k^(p) and the minimum SINR at the receiving end of secondary user SU_i as β_i^(s), as shown in equations (4) and (5):

β_k^(p) = min_{l ∈ M} β_{k,l}^(p)   (4)

β_i^(s) = min_{j ∈ N} β_{i,j}^(s)   (5)
Then equation (3) can be rewritten as:

β_k^(p) ≥ β_0 for all k ∈ M;  β_i^(s) ≥ β_i for all i ∈ N   (6)
Under the SINR condition of the SN, the transmit power P_i^(s) allocated to each SU is given by equation (7):

P_i^(s) = β_i·( Ĝ_i^(p)·P^(p) + σ² ) / Ĝ_i^(s)   (7)

In equation (7), Ĝ_i^(p) represents the average channel gain from the primary-network devices to secondary user SU_i, written Ĝ_i^(p) = (1/M)·Σ_{k ∈ M} G_{k,i}^(p), where G_{k,i}^(p) is the channel gain from authorized user PU_k to secondary user SU_i; Ĝ_i^(s) represents the average channel gain from secondary user SU_i to the secondary-network devices, written Ĝ_i^(s) = (1/N)·Σ_{j ∈ N} G_{i,j}^(s), where G_{i,j}^(s) is the channel gain from secondary user SU_i to secondary user SU_j. To obtain an effective power allocation, the corresponding SINR conditions must be met, σ² denoting the noise power. Combining equation (7) with equations (1) and (6), the SINR β_k^(p) received by authorized user PU_k can be rewritten in terms of the intermediate process parameters α_i and Ψ_i, as in equations (8) and (9). Since β_0 is assumed constant and G_{k,0}^(p) denotes the channel gain from authorized user PU_k to the SBS (i = 0), β_i needs to be adjusted on each SU so that equations (7) and (8) are satisfied; concretely, the adjustment can be made by adjusting the bit rate.
From the literature (QIU X, CHAWLA K. On the performance of adaptive modulation in cellular systems [J]. IEEE Transactions on Communications, 1999, 47(6): 884-895.), the relationship between the bit rate r_i^(s) of SU_i and β_i can be written as:

r_i^(s) = W·log2(1 + q·β_i)   (10)

In equation (10), 1 + q·β_i determines the constellation of the modulated signal, so that log2(1 + q·β_i) typically takes an integer number of bits; q = 1.5 / (−ln(5·BER)) is a constant related to the maximum transmission Bit Error Rate (BER), and W represents a parameter.
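Equation (10) can be sketched as follows (the default BER value and the function name are illustrative assumptions):

```python
import math

def amc_bit_rate(W, beta_i, ber=1e-3):
    """Bit rate r_i = W * log2(1 + q * beta_i) of eq. (10), where
    q = 1.5 / (-ln(5 * BER)) is tied to the maximum bit error rate."""
    q = 1.5 / (-math.log(5.0 * ber))
    return W * math.log2(1.0 + q * beta_i)
```

A higher SINR threshold β_i therefore always maps to a higher achievable bit rate, which is what lets each SU trade its threshold against rate.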
For the proposed dynamic spectrum access scheme, each SU_i selects its SINR threshold β_i and determines the corresponding r_i^(s), so that all PUs and SUs satisfy the SINR limit of equation (6).
To evaluate the effect of spectrum access, the invention uses the mean opinion score (MOS) as the evaluation model for dynamic spectrum access. Since communication may involve the transmission of both data and video, the MOS comprises MOS models for data and for video.
(1) MOS model of data
The MOS model of the data stream can be calculated by equation (11):
Q_D = g·log10(b·r_i^(s)·(1 − p_e2e))   (11)

In equation (11), Q_D represents the data-stream MOS, p_e2e the end-to-end packet loss rate, and g and b parameters obtained by fitting the data quality perceived by the end user. The perceived data quality is defined as follows: if the user's transmission rate is J and the receiving rate is also J, the packet loss rate is 0 and the end user's perceived data quality is maximal, namely 5; if the transmission rate is J and the receiving rate is 0, the packet loss rate is 1 and the perceived data quality is lowest, namely 0.
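A sketch of equation (11); the default values of g and b below are placeholders, since the patent only states that they are fitted from the perceived data quality:

```python
import math

def mos_data(rate_bps, p_e2e, g=1.0, b=1e-3):
    """Data-stream MOS: Q_D = g * log10(b * r * (1 - p_e2e)).
    g and b are illustrative placeholder constants, not patent values."""
    return g * math.log10(b * rate_bps * (1.0 - p_e2e))
```

The score grows with the delivered rate r·(1 − p_e2e) and falls as packet loss rises.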
(2) MOS model of video
The MOS model of the video can be represented by equation (12):
Q_V = c / (1 + exp(−d·(PSNR − f)))   (12)

In equation (12), Q_V represents the MOS of the video, PSNR the peak signal-to-noise ratio, and c, d, and f the parameters of the logistic function.
In the present invention, a logistic function is selected to evaluate the quality of the video traffic. The relationship between PSNR and r_i^(s) can be determined by the function PSNR = λ·log(r_i^(s)) + β, where λ and β are both constants.
In the case where both data streams and video streams are present, the average MOS is calculated as:

Q_μ = (1/N)·( Σ_{m=1}^{U} Q_{D,m} + Σ_{m=U+1}^{N} Q_{V,m} )   (13)

In equation (13), Q_μ denotes the average MOS, U the number of SUs whose traffic is data, N − U the number of SUs whose traffic is video, and m the index. Each SU adjusts β_i and the corresponding bit rate r_i^(s) so as to maximize the network quality (MOS) while satisfying the SINR limit of equation (6).
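The averaging of equation (13) can be sketched directly (function name assumed for illustration):

```python
def mos_average(data_scores, video_scores):
    """Average MOS of eq. (13): the U data-stream scores and the N-U
    video-stream scores are summed and divided by the total number of
    secondary users N."""
    n = len(data_scores) + len(video_scores)
    if n == 0:
        raise ValueError("no secondary users")
    return (sum(data_scores) + sum(video_scores)) / n
```

For two data users scoring 4.0 and 2.0 and one video user scoring 3.0, the system average is 3.0.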
After the access evaluation model is determined, the prior knowledge is constructed; this has a large influence on spectrum access efficiency and network transmission quality. In the Q-learning algorithm, the Q-table stores the reward of each action. Each SU first learns its surroundings, then selects the action associated with the largest reward, obtains the reward of the selected action by executing the Q-learning algorithm, and finally updates its Q-table based on the received instantaneous reward. The Q-table therefore reflects the impact of these actions on the wireless environment. Since parts of the radio environment are correlated with the interference each SU may cause to other parts of the system, the Q-table also reflects each SU's radio environment and the interrelationship of the system's components. When an SU joins a system that has already learned, it only perturbs the system's wireless environment, so re-running the whole cognitive cycle while ignoring the environment knowledge acquired by the other SUs in the system is inefficient. A newly added SU can instead learn from the existing SUs' environment knowledge to reduce learning time and improve learning performance. This mechanism is defined as "Prior-Knowledge-Based Learning Cognitive Radio" (PKBL-CR). The focus of CR is observation (sensing and analysis), while the focus of PKBL-CR is learning from prior knowledge: PKBL-CR contains more "experienced" nodes that can "teach" their learning experience to new nodes, thereby reducing learning time and improving learning performance.
Therefore, in step S3, when the prior knowledge is constructed, the environmental knowledge of the secondary user existing in the system is first acquired, and then the prior knowledge is constructed by using the acquired environmental knowledge.
After the prior knowledge is constructed, Q learning is performed using it; the Q-learning process is described in detail with reference to fig. 2.
Define the state set of the Q-learning algorithm as S_t, the action set as A, and the reward obtained after selecting action a_t ∈ A in the current state as R_t.
The agent in Q-learning perceives the current state s ∈ S_t, takes the action a = π(s) ∈ A accordingly under the given strategy π, and obtains the instantaneous reward R_t.
The key to the Q-learning algorithm is how, given the discount factor γ (0 < γ < 1), to adopt an appropriate strategy that maximizes the cumulative return V. That is, secondary user SU_i selects actions from its corresponding action set A_i, adjusts its transmit power and other transmission parameters, senses the change in network conditions, and obtains the instantaneous reward. Each PU and SU selects actions that maximize its MOS while satisfying the SINR constraint of equation (6).
The state set is defined as

s_t = (φ^(p), φ^(s))   (14)

to reflect the interference generated in the network, where φ^(p) and φ^(s) respectively represent whether the SINR constraints of the PUs and of the SUs are met, i.e. φ^(p) = 1 if β_k^(p) ≥ β_0 for all k and 0 otherwise, and φ^(s) = 1 if β_i^(s) ≥ β_i for all i and 0 otherwise.
If SU_i, after executing an action, obtains better communication quality while the SINR constraint of equation (6) is still met, i.e. the MOS score improves, it obtains the reward of a successful attempt, denoted r_suc; if a PU or SU violates the SINR constraint of equation (6) after the action is performed, the reward is the "reward of a failed attempt", denoted T. The MOS-based reward function can therefore be expressed as equation (15):

r_t^i(s_t, a_t^i) = r_suc, if the SINR constraints of equation (6) are satisfied;  r_t^i(s_t, a_t^i) = T, otherwise   (15)

In equation (15), r_t^i(s_t, a_t^i) represents the reward obtained by user SU_i when taking action a_t^i in state s_t; a_t^i represents the action SU_i takes at time t, and A_i denotes the action set of secondary user SU_i; r_suc denotes the reward of a successful attempt; T, a constant smaller than r_suc, denotes the reward of an unsuccessful attempt.
In the invention, it is assumed that all SUs select their "action" according to their own judgment, without considering the total income, so as to maximize their own long-term expected cumulative return. For user SU_i, the long-term return value is expressed as:

V_i^π(s) = E[ Σ_{t=0}^{∞} γ^t·r_t^i(s_t, a_t^i) | s_0 = s ]   (17)

where π is the "policy" currently taken by secondary user SU_i; γ (0 < γ < 1) is a constant, the time-discount factor, reflecting the importance of future returns; s_0 = s represents the initial state; r_t^i(s_t, a_t^i) represents the reward of secondary user SU_i at time t; and E(·) is the expectation operator, taking the mathematical expectation of the parenthesized quantity.
According to the Bellman optimality criterion, the maximum of equation (17) is:

V_i*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ·Σ_{s'} p(s'|s, a)·V_i*(s') ]   (18)

where V_i*(s) represents the optimal return and R_i(s, a) is the mathematical expectation of r_t^i(s_t, a_t^i); p(s'|s, a) is the transition probability that state s reaches state s' under the action of action a; and π* is the optimal strategy.
The idea of Q learning is that, without estimating an environment model, i.e. with R(s, a) and p(s'|s, a) unknown, the optimal strategy π* satisfying equation (18) can be found by a simple iteration of Q values. The update formula of the Q-value iteration is (19):

Q_{t+1}(s, a) = (1 − α_t) · Q_t(s, a) + α_t · [ r_t + γ · max_{a'} Q_t(s', a') ]
In the formula, α_t (0&lt;α_t&lt;1) denotes the learning rate; Q_t(s, a) represents the Q value of SU_i at the current "state-action" pair; and max_{a'} Q_t(s', a') denotes the largest Q value available after taking action a and reaching the new state s'.
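A minimal Python sketch of the update in equation (19), with the Q table stored as a dict of dicts (a representation assumed here for illustration):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.4):
    """One iteration of equation (19):
    Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q[s_next].values())          # max_a' Q(s', a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]
```

With α = 0.1 and γ = 0.4 (the values used in the simulations), an initial Q(s, a) = 0, reward 1.0, and max Q(s', ·) = 2.0, the update yields 0.1 · (1.0 + 0.4 · 2.0) = 0.18.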
In order to fully explain the process of Q learning, the cases without and with prior knowledge are described separately.
Without prior knowledge, each SU performs Q learning independently, with the Q tables of all SUs initialized to zero at the beginning of learning, i.e. Q_0(s, a) = 0 for all state-action pairs. The algorithm starts with an "exploration" strategy; after every entry of the Q table has been visited once, it switches to the "exploitation" strategy, following the action with the largest Q value. The specific process of Q learning is shown in Table 1:
TABLE 1 procedure during single user Q learning
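The single-user procedure of Table 1 can be sketched as follows; the environment interface `step(s, a) -> (reward, next_state)` and the explore-then-exploit switch are illustrative assumptions of this sketch, not the patent's exact pseudocode:

```python
import random

def single_user_q_learning(states, actions, step, episodes=500,
                           alpha=0.1, gamma=0.4):
    """Table 1, sketched: zero-initialized Q table, random 'exploration'
    until every (state, action) pair has been visited once, then
    'exploitation' of the action with the largest Q value."""
    Q = {s: {a: 0.0 for a in actions} for s in states}
    visited = set()
    s = random.choice(states)
    for _ in range(episodes):
        if len(visited) < len(states) * len(actions):
            a = random.choice(actions)               # exploration phase
        else:
            a = max(Q[s], key=Q[s].get)              # exploitation phase
        r, s_next = step(s, a)
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next].values()))
        visited.add((s, a))
        s = s_next
    return Q
```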
When prior knowledge is considered, Q learning proceeds as follows: when a new SU joins, it initializes its Q table from the Q tables already existing in the system, as in (20):

Q_new = (1/K) · sum_{i=1..K} Q_i
In equation (20), Q_new represents the initialized Q table of the newly added SU, Q_i represents the Q table of the secondary user SU_i already present in the system, and K represents the number of secondary users already in the system. The specific process of Q learning when prior knowledge is considered is shown in Table 2:
TABLE 2 prior knowledge based Q learning
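The initialization of equation (20) amounts to an element-wise combination of the K existing Q tables; the sketch below averages them (our reading of the K-term formula, stated as an assumption), using the same dict-of-dicts representation:

```python
def init_from_prior(existing_q_tables):
    """Equation (20), sketched: a newly joining SU initializes its
    Q table from the K tables already in the system, here taken as
    their element-wise average (an assumption about the formula)."""
    K = len(existing_q_tables)
    first = existing_q_tables[0]
    return {s: {a: sum(q[s][a] for q in existing_q_tables) / K
                for a in first[s]}
            for s in first}
```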
When prior knowledge is used for dynamic spectrum access and the agent executes the action with the maximum Q(s, a) in the current state at each learning step, the strategy has a learning weakness: if the agent finds an action with a higher Q value early in the iterative learning, it will stick to that action and never discover better actions that may still exist but have not yet been explored, thus falling into a local optimum.
Meanwhile, a precondition for Q learning to converge to a stable state is that each state-action pair is visited infinitely often, which the above strategy cannot satisfy. In order to meet the convergence requirement of Q learning and to trade off "exploration" (trying untried actions in the hope of a greater return) against "exploitation" (taking the learned action with the higher Q value), the invention optimizes the action-selection process based on a greedy algorithm. It adopts a Boltzmann-style exploration strategy that assigns every possible action a certain selection probability according to its Q value: actions with high Q values receive a high selection probability, but the probability of selecting any action is never zero, i.e. (21):

a* = argmax_{a∈A} Q(s, a),  if rand(0,1) ≤ ε;
a* = rand(A),               otherwise.
In equation (21), ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a. If the random number rand(0,1) is greater than ε, the agent directly selects one action at random from all actions (rand(A)) and performs "exploration"; otherwise it performs "exploitation" and selects the action with the highest Q value. The specific process is shown in Table 3:
TABLE 3 greedy algorithm based Q learning with a priori knowledge
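The selection rule of equation (21) can be sketched in Python; `Q_s` maps each action to its Q value in the current state (a representation assumed for illustration):

```python
import random

def select_action(Q_s, epsilon=0.8):
    """Equation (21), sketched: 'exploit' the action with the largest
    Q value with probability epsilon, otherwise 'explore' by drawing a
    uniformly random action (so no action's probability is zero)."""
    if random.random() <= epsilon:
        return max(Q_s, key=Q_s.get)              # exploitation
    return random.choice(list(Q_s.keys()))        # exploration
```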
In order to illustrate the beneficial effects of the method of the invention, a simulation analysis is performed. Assume that the primary network consists of one PU that accesses a single channel with a bandwidth of 10 MHz. The target SINR of the PU is set to 10 dB, and the Gaussian noise power and transmit power of the PU are set to 1 nW and 10 mW, respectively. The SUs are randomly distributed in a circle of radius 200 m and the PUs in a circle of radius 1000 m, both centered on the SBS base station. The channel gain follows a log-distance path-loss model with a path-loss exponent of 2.8. For a single SU, its target SINR (in dB) can be selected from the set {-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15}; for all SUs, the same learning rate α = 0.1 and discount factor γ = 0.4 are assumed.
Four algorithms are compared in the simulation: Random Access, single-user Q learning (Individual Q), Q learning based on prior knowledge (PKBL), and greedy-algorithm-based Q learning with prior knowledge (ε-based PKBL). The relationship between the number of secondary users SU and the average MOS is shown in Fig. 3, from which it can be seen that the average MOS decreases as the number of SUs increases. The cause of this phenomenon is that, as the number of users grows, each SU tends to converge to a smaller SINR value that still satisfies the interference constraint, which lowers the overall average MOS. The results also show that the DSA algorithm proposed by the invention achieves higher MOS values, remaining above the acceptable MOS level (MOS &gt; 3.5) even with 25 SUs in the network. Moreover, with the ε-based PKBL algorithm the average MOS of the system is higher than with the PKBL algorithm, because the agent does not fall into a local optimum by abandoning "exploration" early in learning; the network performance index is improved by up to 35% compared with random access.
Fig. 4 shows the relationship between the collision probability and the number of SUs in the network; in the invention, the collision probability is defined as the probability of violating the SINR constraint in equation (6) after a secondary user takes an "action". As the number of SUs in the network increases, the collision probability also increases. Of the four algorithms, the ε-based PKBL algorithm performs best in reducing the collision probability, because it "explores" more possibilities and thus obtains a better solution.
Fig. 5 shows the efficiency gain obtained when SUs with "experience" (knowledge of the surrounding environment) impart that experience to an SU newly joining the network. It can be seen that, compared with single-user learning, the number of iterations required for the PKBL algorithm to converge is reduced by up to 65%. Compared with the PKBL algorithm, the ε-based PKBL algorithm adds an "exploration" process, so its iteration count lies between those of the Individual Q and PKBL algorithms.
The invention provides a dynamic spectrum access method based on reinforcement learning with prior knowledge, which can adaptively adjust the transmit power and the corresponding transmit rate of the SUs, maximizing the average QoE (Quality of Experience) of the network while satisfying the transmission interference constraints of the SN and the PN. The MOS is used as the measure of subjective QoE; it not only meets the quality-evaluation requirement of a 5G network centered on the end user, but also provides a single universal metric for different types of traffic. In addition, in order to shorten the convergence time of the Q-learning algorithm, the invention lets a new SU learn the "knowledge" (environment information) of the SUs already existing in the system, thereby improving the learning process and reducing the number of iterations. Simulation results show that the new method improves the network performance on the basis of guaranteeing the MOS.
The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.
Claims (7)
1. A dynamic spectrum access method based on reinforcement learning with prior knowledge is characterized by comprising the following steps:
s1: the method comprises the steps that a secondary user obtains spectrum access environment information, wherein the spectrum access environment information comprises an authorized user and a signal-to-interference-and-noise ratio constraint condition, a self position and a position close to the secondary user;
s2: determining a frequency spectrum access evaluation model of the network, and adopting an MOS (metal oxide semiconductor) model as an access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the priori knowledge to obtain Q table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
2. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 1, wherein in step S1 the signal-to-interference-plus-noise ratio constraints are:

Γ_PUk ≥ β_0, k = 1, …, M;   Γ_SUi ≥ β_i, i = 1, …, N

where Γ_PUk represents the minimum signal-to-interference-plus-noise ratio at the receiving end of authorized user PU_k; Γ_SUi represents the minimum signal-to-interference-plus-noise ratio at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized users' base station, and β_0 is a constant; β_i represents the SINR threshold of the secondary-user base station; M represents the number of authorized users, N represents the number of secondary users, and k and i represent serial numbers.
3. The method according to claim 1, wherein in step S2 the MOS models include a data MOS model and a video MOS model, and the data MOS model is:

Q_D = g · log( b · r_i · (1 − p_e2e) )

where Q_D represents the MOS of the data flow; r_i represents the bit rate of secondary user SU_i; p_e2e represents the end-to-end packet loss rate; g and b represent parameters, obtained by fitting the data quality perceived by the end user;
the MOS model of the video is as follows:
in the formula, QVRepresenting MOS of the video, PSNR representing peak signal-to-noise ratio, c, d and f representing parameters of a logic function;
the MOS model is as follows:
in the formula, QμThe average MOS is represented, U represents the number of the secondary users with the service flow as data, N-U represents the number of the secondary users with the service flow as video, and m represents the serial number.
4. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 1, wherein in step S3 the specific method of acquiring the environment knowledge of the secondary users already in the system and constructing the prior knowledge from the acquired knowledge comprises: initializing the Q table of a newly added secondary user with the Q tables of the secondary users already in the system, according to the formula:

Q_new = (1/K) · sum_{i=1..K} Q_i

where Q_new represents the Q table of the newly added secondary user, Q_i represents the Q table of a secondary user already in the system, K represents the number of Q tables of secondary users already in the system, and i represents a serial number.
5. The dynamic spectrum access method based on a priori knowledge reinforcement learning of claim 1, wherein in step S4, the step of Q learning using a priori knowledge comprises:
s41: initializing Q tables of all secondary users in the system, wherein the newly added secondary users initialize the Q tables by using priori knowledge;
s42: selecting an action according to a formulaMaking action selection, whereinIndicating a secondary user SUiIn the state at the time t of the instant t,indicating a secondary user SUiAt time t, the state isAn action of the time selection;
s43: updating state, defining state set Andrepresenting signal-to-interference-and-noise ratio limits for authorized and secondary users, respectively, e.g. byThe following formula updates the state:
in the formula, N represents the number of secondary users, i represents the serial number, psii、αiAre all intermediate process parameters;
s44: obtaining the return: using a formulaUpdating the return value, wherein,the optimal return is represented in the form of,is composed ofThe mathematical expectation of (a) is that,for MOS-based reporting, the user SU is indicatediIn a state stIn the following, the actions are adoptedReporting time; p (s '| s, a) is the transition probability that state s reaches state s' under the action of action a; pi*Is an optimal strategy;
s45: updating the Q value: using formulasUpdating the Q value, wherein αtIt is indicated that the learning rate is,represents SUiThe Q value at the current "state-action", indicating an action to takeQ value corresponding to the new state s', Ai representing the secondary user SUiThe set of actions of (1);
s46: and repeating the steps S41-S45 until convergence.
6. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 5, wherein in step S42 the action selection is made according to the formula

a* = argmax_{a∈A} Q(s, a),  if rand(0,1) ≤ ε;   a* = rand(A), otherwise

where ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a; if the random number rand(0,1) is greater than the exploration probability ε, the agent directly selects one action at random from all actions and performs "exploration"; otherwise it performs "exploitation" and selects the action with the highest Q value.
7. The method according to any one of claims 5-6, wherein the MOS-based return is calculated by the following formula:

r_t^i(s_t, a_t^i) = r_MOS,  if the SINR constraints are satisfied and the MOS improves;
r_t^i(s_t, a_t^i) = T,      otherwise

where r_t^i(s_t, a_t^i) represents the return of user SU_i when taking action a_t^i in state s_t; a_t^i ∈ A_i represents the action taken by SU_i in state s_t at time t, and A_i represents the action set of secondary user SU_i; r_MOS represents the reward of a successful attempt; T, a constant smaller than the reward of any successful attempt, represents the reward of an unsuccessful attempt.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010495810.4A CN111654342B (en) | 2020-06-03 | 2020-06-03 | Dynamic spectrum access method based on reinforcement learning with priori knowledge |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111654342A true CN111654342A (en) | 2020-09-11 |
CN111654342B CN111654342B (en) | 2021-02-12 |
Family
ID=72345008
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010495810.4A Active CN111654342B (en) | 2020-06-03 | 2020-06-03 | Dynamic spectrum access method based on reinforcement learning with priori knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111654342B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112672359A (en) * | 2020-12-18 | 2021-04-16 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672426A (en) * | 2021-03-17 | 2021-04-16 | 南京航空航天大学 | Anti-interference frequency point allocation method based on online learning |
CN113207129A (en) * | 2021-05-10 | 2021-08-03 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113423110A (en) * | 2021-06-22 | 2021-09-21 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113747447A (en) * | 2021-09-07 | 2021-12-03 | 中国人民解放军国防科技大学 | Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge |
CN113939040A (en) * | 2021-10-08 | 2022-01-14 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN114630333A (en) * | 2022-03-16 | 2022-06-14 | 军事科学院系统工程研究院网络信息研究所 | Multi-parameter statistical learning decision-making method in cognitive satellite communication |
CN115086965A (en) * | 2022-06-14 | 2022-09-20 | 中国人民解放军国防科技大学 | Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization |
CN115412105A (en) * | 2022-05-06 | 2022-11-29 | 南京邮电大学 | Reinforcement learning communication interference method based on USRP RIO |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102238555A (en) * | 2011-07-18 | 2011-11-09 | 南京邮电大学 | Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio |
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101466111B (en) * | 2009-01-13 | 2010-11-17 | 中国人民解放军理工大学通信工程学院 | Dynamic spectrum access method based on policy planning constrain Q study |
US10505616B1 (en) * | 2018-06-01 | 2019-12-10 | Samsung Electronics Co., Ltd. | Method and apparatus for machine learning based wide beam optimization in cellular network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102238555A (en) * | 2011-07-18 | 2011-11-09 | 南京邮电大学 | Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio |
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112672359A (en) * | 2020-12-18 | 2021-04-16 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672359B (en) * | 2020-12-18 | 2022-06-21 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672426A (en) * | 2021-03-17 | 2021-04-16 | 南京航空航天大学 | Anti-interference frequency point allocation method based on online learning |
CN113207129B (en) * | 2021-05-10 | 2022-05-20 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113207129A (en) * | 2021-05-10 | 2021-08-03 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113423110A (en) * | 2021-06-22 | 2021-09-21 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113423110B (en) * | 2021-06-22 | 2022-04-12 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113747447A (en) * | 2021-09-07 | 2021-12-03 | 中国人民解放军国防科技大学 | Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge |
CN113939040A (en) * | 2021-10-08 | 2022-01-14 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN113939040B (en) * | 2021-10-08 | 2023-04-28 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN114630333A (en) * | 2022-03-16 | 2022-06-14 | 军事科学院系统工程研究院网络信息研究所 | Multi-parameter statistical learning decision-making method in cognitive satellite communication |
CN115412105A (en) * | 2022-05-06 | 2022-11-29 | 南京邮电大学 | Reinforcement learning communication interference method based on USRP RIO |
CN115412105B (en) * | 2022-05-06 | 2024-03-12 | 南京邮电大学 | Reinforced learning communication interference method based on USRP RIO |
CN115086965A (en) * | 2022-06-14 | 2022-09-20 | 中国人民解放军国防科技大学 | Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization |
Also Published As
Publication number | Publication date |
---|---|
CN111654342B (en) | 2021-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111654342B (en) | Dynamic spectrum access method based on reinforcement learning with priori knowledge | |
Wang et al. | A survey on applications of model-free strategy learning in cognitive wireless networks | |
CN111556572B (en) | Spectrum resource and computing resource joint allocation method based on reinforcement learning | |
CN110492955B (en) | Spectrum prediction switching method based on transfer learning strategy | |
Hou et al. | Joint allocation of wireless resource and computing capability in MEC-enabled vehicular network | |
Amichi et al. | Spreading factor allocation strategy for LoRa networks under imperfect orthogonality | |
Li et al. | A delay-aware caching algorithm for wireless D2D caching networks | |
CN114698128B (en) | Anti-interference channel selection method and system for cognitive satellite-ground network | |
CN113423110B (en) | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning | |
Mohammadi et al. | QoE-driven integrated heterogeneous traffic resource allocation based on cooperative learning for 5G cognitive radio networks | |
Kwasinski et al. | Reinforcement learning for resource allocation in cognitive radio networks | |
Shashi Raj et al. | Interference resilient stochastic prediction based dynamic resource allocation model for cognitive MANETs | |
Chandra et al. | Joint resource allocation and power allocation scheme for MIMO assisted NOMA system | |
St Jean et al. | Bayesian game-theoretic modeling of transmit power determination in a self-organizing CDMA wireless network | |
Lv et al. | A dynamic spectrum access method based on Q-learning | |
CN116828534A (en) | Intensive network large-scale terminal access and resource allocation method based on reinforcement learning | |
Mehta | Recursive quadratic programming for constrained nonlinear optimization of session throughput in multiple‐flow network topologies | |
Ren et al. | Joint spectrum allocation and power control in vehicular communications based on dueling double DQN | |
Sen et al. | Rate adaptation techniques using contextual bandit approach for mobile wireless lan users | |
Chuan et al. | Machine learning based popularity regeneration in caching-enabled wireless networks | |
Dogra et al. | Reinforcement Learning (RL) for optimal power allocation in 6G Network | |
Ekwe et al. | QoE-aware Q-learning resource allocation for spectrum reuse in 5G communications network | |
Wang et al. | Experience cooperative sharing in cross-layer cognitive radio for real-time multimedia communication | |
Mishra et al. | DDPG with Transfer Learning and Meta Learning Framework for Resource Allocation in Underlay Cognitive Radio Network | |
CN114828193B (en) | Uplink and downlink multi-service concurrent power distribution method for wireless network and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |