CN111654342A - Dynamic spectrum access method based on reinforcement learning with priori knowledge - Google Patents
- Legal status: Granted
Classifications
- H04B17/382—Monitoring; Testing of propagation channels for resource allocation, admission control or handover
- H04B17/336—Signal-to-interference ratio [SIR] or carrier-to-interference ratio [CIR]
- H04W16/14—Spectrum sharing arrangements between different networks
- H04W74/0833—Non-scheduled or contention based access, e.g. random access, ALOHA, CSMA [Carrier Sense Multiple Access], using a random access procedure
Abstract
The invention discloses a dynamic spectrum access method based on reinforcement learning with prior knowledge, belonging to the technical field of electromagnetic spectrum. First, a secondary user acquires spectrum-access environment information. Next, the spectrum-access evaluation model of the network is determined, with the mean opinion score (MOS) model adopted as the access evaluation model. Prior knowledge is then constructed: the environment knowledge of the secondary users already in the system is acquired and used to build the prior knowledge. Q-learning is then performed starting from this prior knowledge to obtain the secondary user's Q-table information, and finally dynamic spectrum access is performed according to the learned Q-table information. In addition, the invention optimizes the action-selection process of Q-learning with an ε-greedy strategy, avoiding entrapment in local optima during Q-learning. By constructing and exploiting prior knowledge, the invention effectively improves the learning efficiency and the dynamic spectrum access performance of the system.
Description
Technical Field
The invention belongs to the technical field of electromagnetic spectrum, and particularly relates to a dynamic spectrum access method based on prior knowledge reinforcement learning.
Background
With the continuous expansion of wireless applications, on the one hand the communication systems' demand for wireless resources has made spectrum resources scarce; on the other hand, conventional static spectrum resource management keeps spectrum utilization low. In recent years the Dynamic Spectrum Access (DSA) mode of spectrum use has attracted scholars' attention, and Cognitive Radio (CR) is one of the popular research directions. Its main idea is that a Secondary User (SU) with spectrum-sensing capability actively senses spectrum usage and communicates by "opportunistically" accessing idle channels, on the premise of causing no harmful interference to the spectrum-licensed authorized user (PU).
Dynamic spectrum access techniques have achieved considerable success in academia. For example, the literature (AKBARZADEH N, MAHAJAN A. Dynamic spectrum access under partial observability: A restless bandit approach [C] // Canadian Workshop on Information Theory. 2019: 1-6.) models the dynamic spectrum access problem under multichannel transmission conditions as a partially observable Markov decision process (POMDP) and uses a Whittle-index method to assist decision-making; simulation results show that when the model is indexable, transmitting on the channel with the smallest Whittle index is the optimal strategy. The literature (YANG H, CHEN H. Energy-efficient channel access with data priority in cognitive radio sensor networks [C] // International Conference on Software, Telecommunications and Computer Networks. 2019: 1-5.) proposes, for a Cognitive Radio Sensor Network (CRSN), a dynamic channel-access scheme based on data priority and energy-consumption minimization, which allocates power according to the data priority of each node and then reasonably allocates transmission time to each node so as to minimize energy consumption. To accommodate the signal-to-noise-ratio and throughput requirements of authorized users, the literature (GURAJAPU S, RAJ S, CHOUSHAN S. Spectrum Allocation and Power Management using Markov Chains and Beamforming in Underlay Cognitive Radios [C]) models the dynamic spectrum allocation process of a cognitive radio system with a Markov chain and combines the Markov-chain model with beamforming; dynamic spectrum allocation based on this scheme is markedly improved in terms of authorized-user throughput.
The literature (PASTIRCAK J, GAZDA J, KOCUR D. A survey on the spectrum trading in dynamic spectrum access networks [C] // International Symposium Electronics in Marine. 2014: 135-138. // ZHAO Q, SADLER B M. A Survey of Dynamic Spectrum Access [J]. IEEE Signal Processing Magazine, 2007, 24(3): 79-89. // SHARMILA A, DANANJAYAN P. Spectrum Sharing Techniques in Cognitive Radio Networks - A Survey [C] // International Conference on System, Computation, Automation and Networking. 2019: 1-4.) introduces three general dynamic spectrum access models: the open sharing model, the licensed (shared) use model, and the dynamic exclusive use model. The open sharing model lets all users use spectrum resources equally, but easily causes interference problems; the licensed sharing model reduces interference to authorized users but limits the transmit power of secondary users; the dynamic exclusive model avoids additional harmful interference but allows dynamic spectrum allocation only among authorized users.
Most existing work is built on the premise that the spectrum-access environment knowledge and dynamics model are known, but in practical applications complete environment knowledge, such as the authorized users' operating conditions and frequency-usage characteristics, is usually difficult to obtain. To solve the problem of efficient spectrum access without prior knowledge, reinforcement-learning methods have received increasing attention in recent years; their basic principle is to optimize system performance by continuously learning through interaction with the spectrum environment when no prior knowledge of that environment is available. Their limitation is that, without prior knowledge, the convergence speed of learning is limited, which restricts the achievable spectrum utilization. In practical dynamic spectrum access, the secondary user usually sits between these two extremes: some prior environment knowledge can be obtained, but it is not sufficient. For example, before a secondary user accesses the spectrum dynamically, a spectrum database may already provide prior knowledge such as the locations and interference constraints of the authorized users in the area, but not their specific spectrum usage. In this scenario, closer to practical application, how to acquire sufficient environment knowledge through reinforcement learning and thereby realize efficient spectrum utilization is an urgent problem for making dynamic spectrum access practical.
Disclosure of Invention
The technical problem is as follows: the invention provides a dynamic spectrum access method based on reinforcement learning with prior knowledge, which, under conditions where the spectrum-access environment knowledge and the dynamics model are partially unknown, improves spectrum access efficiency and realizes globally optimized resource allocation, thereby maximizing the transmission quality of the cognitive network.
The technical scheme is as follows: the invention relates to a dynamic spectrum access method based on prior knowledge reinforcement learning, which comprises the following steps:
s1: a secondary user acquires the spectrum-access environment information, the information comprising the authorized users, the signal-to-interference-plus-noise ratio (SINR) constraint conditions, the SU's own position, and the positions of nearby secondary users;
s2: determining the spectrum-access evaluation model of the network, with the mean opinion score (MOS) model adopted as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the prior knowledge to obtain the Q-table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
Further, in step S1, the signal-to-interference-plus-noise ratio (SINR) constraint condition is:

β_k^(p) ≥ β_0 for all k ∈ {1, …, M};  β_i^(s) ≥ β_i for all i ∈ {1, …, N}

where β_k^(p) represents the minimum SINR at the receiving end of authorized user PU_k; β_i^(s) represents the minimum SINR at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized users' base station and is a constant; β_i represents the SINR threshold of the secondary base station for SU_i; M represents the number of authorized users, N the number of secondary users, and k and i are indices.
Further, in step S2, the MOS model includes an MOS model for data and an MOS model for video, where the MOS model for data is:

Q_D = g·log10(b·r_i^(s)·(1 − p_e2e))

where Q_D represents the data-stream MOS, r_i^(s) the bit rate of secondary user SU_i, p_e2e the end-to-end packet loss rate, and g and b parameters obtained by fitting the data quality perceived by the end user;
the MOS model of the video is as follows:
in the formula, QVRepresenting MOS of the video, PSNR representing peak signal-to-noise ratio, c, d and f representing parameters of a logic function;
the MOS model is as follows:
in the formula, QμThe average MOS is represented, U represents the number of the secondary users with the service flow as data, N-U represents the number of the secondary users with the service flow as video, and m represents the serial number.
Further, in step S3, the specific method of acquiring the environment knowledge of the secondary users existing in the system and constructing the prior knowledge from it is: initialize the Q-table of the newly added secondary user with the Q-tables of the secondary users already in the system, according to:

Q_new = (1/K)·Σ_{i=1}^{K} Q_i

where Q_new represents the Q-table of the newly added secondary user, Q_i the Q-table of the i-th existing secondary user, K the number of Q-tables of existing secondary users in the system, and i the index.
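As an illustrative sketch (not part of the claimed method), the averaging initialization above can be written as follows; the list-of-lists Q-table layout and the function name `init_q_table` are assumptions for illustration:

```python
def init_q_table(prior_tables):
    """Initialize a newly joined SU's Q-table as the element-wise
    average of the K Q-tables already learned by existing SUs.
    Each table is a states x actions list of lists."""
    if not prior_tables:
        raise ValueError("need at least one prior Q-table")
    k = len(prior_tables)
    n_s, n_a = len(prior_tables[0]), len(prior_tables[0][0])
    return [[sum(t[s][a] for t in prior_tables) / k
             for a in range(n_a)] for s in range(n_s)]
```

For example, averaging two 2x2 Q-tables yields the element-wise mean of their entries.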
Further, in step S4, the step of performing Q learning using a priori knowledge includes:
s41: initializing the Q-tables of all secondary users (SUs) in the system, where a newly added SU initializes its Q-table using the prior knowledge;
s42: selecting an action: the action is selected according to a_t^i = argmax_{a ∈ A_i} Q(s_t^i, a), where s_t^i denotes the state of secondary user SU_i at time t and a_t^i denotes the action selected by SU_i at time t when the state is s_t^i;
s43: updating the state: the state set is defined as s_t = (φ^(p), φ^(s)), where φ^(p) and φ^(s) respectively represent whether the SINR limits of the authorized users and of the secondary users are satisfied, and the state is updated as:

φ^(p) = 1 if β_k^(p) ≥ β_0 for all k ∈ {1, …, M}, and 0 otherwise;  φ^(s) = 1 if β_i^(s) ≥ β_i for all i ∈ {1, …, N}, and 0 otherwise

where N represents the number of secondary users, i the index, and ψ_i and α_i are the intermediate process parameters through which the SINRs are recomputed;
s44: obtaining the return: the return value is updated using

V*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ·Σ_{s'} p(s'|s, a)·V*(s') ]

where V*(s) represents the optimal return, R_i(s, a) is the mathematical expectation of the MOS-based reward r_t^i(s_t, a_t^i), i.e. the reward obtained by user SU_i when taking action a_t^i in state s_t, p(s'|s, a) is the transition probability that state s reaches state s' under the action of action a, and π* is the optimal strategy;
s45: updating the Q value using

Q_{t+1}(s_t^i, a_t^i) = (1 − α_t)·Q_t(s_t^i, a_t^i) + α_t·[ r_t^i + γ·max_{a' ∈ A_i} Q_t(s', a') ]

where α_t denotes the learning rate, Q_t(s_t^i, a_t^i) denotes SU_i's Q value at the current state-action pair, max_{a' ∈ A_i} Q_t(s', a') denotes the largest Q value in the new state s' reached after taking action a_t^i, and A_i denotes the action set of secondary user SU_i;
S46: repeating steps S42 to S45 until convergence.
Further, in step S42, the action is selected according to

a* = a random action from A_i, if rand(0,1) > ε;  a* = argmax_a Q(s, a), otherwise

where ε denotes the exploration parameter, a* the selected action, rand() a random function, and Q(s, a) the Q value when the state is s and the action is a. If the random number rand(0,1) is greater than ε, the agent directly selects one of all actions at random and executes "exploration"; otherwise it executes "exploitation" and selects the action with the highest Q value.
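The action-selection rule above can be sketched as follows; this is an illustrative implementation under the text's convention (exploration when rand(0,1) exceeds ε), with the function name and Q-table layout assumed for illustration:

```python
import random

def select_action(q_row, epsilon, rng=None):
    """Select an action for the current state. q_row holds the Q values
    of that state, indexed by action. Following the text's convention,
    a draw of rand(0,1) greater than epsilon triggers 'exploration'
    (uniform random action); otherwise the agent 'exploits' the action
    with the highest Q value."""
    rng = rng or random
    if rng.random() > epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit
```

With ε = 1.0 the draw can never exceed ε, so the agent always exploits the current argmax.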
Further, the MOS-based reward is calculated as:

r_t^i(s_t, a_t^i) = r_suc, if the SINR constraints are still satisfied after the action;  r_t^i(s_t, a_t^i) = T, otherwise

where r_t^i(s_t, a_t^i) represents the reward obtained by user SU_i when taking action a_t^i in state s_t; a_t^i represents the action taken by SU_i at time t; A_i denotes the action set of secondary user SU_i; r_suc denotes the reward of a successful attempt; and T, a constant smaller than r_suc, denotes the reward of an unsuccessful attempt.
Advantageous effects: compared with the prior art, the invention has the following advantages:
(1) The disclosed dynamic spectrum access method based on reinforcement learning with prior knowledge can realize dynamic spectrum access when the spectrum-access environment and the dynamics model are partially unknown. On the premise of respecting the signal-to-interference-plus-noise ratio (SINR) limits of the primary network (PN) and the secondary network (SN), it uses the relationship between a secondary user's transmission rate and its SINR to acquire and update environment information through Q-learning, which effectively improves dynamic spectrum access efficiency and greatly improves system performance.
(2) By using the MOS as the evaluation model, the method can comprehensively evaluate both the data streams and the video streams of the system, and thus assess the system's dynamic spectrum access condition with good accuracy.
(3) The Q-learning action-selection process is optimized with an ε-greedy strategy, which solves the problem that the dynamic spectrum access algorithm easily falls into local optima and effectively reduces the number of iterations required for convergence, thereby improving dynamic spectrum access efficiency and system performance.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic block diagram of a dynamic spectrum access using the method of the present invention;
FIG. 3 is a diagram illustrating the relationship between the number of secondary users and the average MOS in the system according to the embodiment of the present invention;
FIG. 4 is a diagram illustrating a relationship between a number of secondary users and a collision probability in the system according to the embodiment of the present invention;
fig. 5 is a diagram illustrating a relationship between the number of secondary users and the average iteration number in the system according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
Note: the meanings of the English abbreviations appearing in this embodiment are as follows:
PN: primary network; SN: secondary network; PBS: primary base station; SBS: secondary base station; PU: authorized (primary) user; SU: secondary user; SINR: signal-to-interference-plus-noise ratio; MOS: mean opinion score; DSA: dynamic spectrum access.
Referring to fig. 1, the dynamic spectrum access method based on prior knowledge reinforcement learning of the present invention includes the following steps:
s1: a secondary user acquires the spectrum-access environment information, the information comprising the authorized users, the SINR constraint conditions, the SU's own position, and the positions of nearby secondary users;
s2: determining the access evaluation model of the network, with the mean opinion score (MOS) model adopted as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the prior knowledge to obtain the Q-table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
In general, the authorized users' conditions, the secondary user's own position, and the positions of nearby secondary users are directly available, so the key point of step S1 is determining the SINR constraints, which are described in detail below.
Assume that there are two wireless networks in the system, one the Primary Network (PN) and the other the Secondary Network (SN). The PN consists of one primary base station (PBS) and M authorized users (PUs); the SN consists of one Secondary Base Station (SBS) and N Secondary Users (SUs), the SUs being randomly distributed around the SBS.
Under a typical dynamic spectrum access model, a PN containing a single primary link shares a single channel with a SN at a given time. All secondary and primary links transmit using an Adaptive Modulation and Coding (AMC) scheme. Each SU can adjust the transmission parameters thereof to meet the interference requirements of the PU; the transmission parameters of SBS and PBS may not be adjusted.
The traffic flow carried on the SN link is video and data, and the transmission channel is assumed to be an additive white gaussian noise quasi-static channel. The PU employs AMC technology while assuming the transmit power to be constant, under which assumption the SU can infer Channel State Information (CSI) through active learning and then estimate the channel gain.
Because the SN communicates only by "opportunistically" accessing the channel when the PN does not occupy it, in the embodiment of the invention the SINRs of both networks are considered over the period in which the channel is accessed. The SINR received at authorized user PU_k (k ∈ M) in the primary network from authorized user PU_l (l ∈ M) is denoted β_{k,l}^(p), with k and l indices, 0 ≤ k, l ≤ M, and k or l equal to 0 when the transmitting or receiving station is the PBS; the SINR received at secondary user SU_i (i ∈ N) in the secondary network from secondary user SU_j (j ∈ N) is denoted β_{i,j}^(s), with i and j indices, 0 ≤ i, j ≤ N, and i or j equal to 0 when the transmitting or receiving station is the SBS.
Although only one pair of PN network devices (PBS and PU) and one pair of SN network devices (SBS and SU) communicate at a time, to avoid loss of generality the transmissions of all stations are considered here, giving:

β_{k,l}^(p) = G_{l,k}^(p)·P_l^(p) / ( Σ_{h ∈ M, h ≠ l} G_{h,k}^(p)·P_h^(p) + Σ_{i ∈ N} G_{i,k}^(s)·P_i^(s) + σ² )   (1)

Equation (1) expresses the SINR received at authorized user PU_k (k ∈ M) from authorized user PU_l (l ∈ M). In the denominator, the first sum, Σ_{h ∈ M, h ≠ l} G_{h,k}^(p)·P_h^(p), represents the total interference power generated inside the PN network, where the superscript (p) denotes the PN network, the subscript h ∈ M, h ≠ l denotes the interfering authorized user PU_h, G_{h,k}^(p) the channel gain from PU_h to PU_k, and P_h^(p) the transmit power of PU_h. The second sum, Σ_{i ∈ N} G_{i,k}^(s)·P_i^(s), represents the total interference power generated by the SN network on the PN, where the superscript (s) denotes the SN network, G_{i,k}^(s) the channel gain from secondary user SU_i (i ∈ N) to authorized user PU_k, and P_i^(s) the transmit power of SU_i. σ² in the denominator represents the noise power. In the numerator, G_{l,k}^(p) represents the channel gain from PU_l to PU_k and P_l^(p) (l ∈ M) the transmit power of PU_l, with P_l^(p) constant.

β_{i,j}^(s) = G_{j,i}^(s)·P_j^(s) / ( Σ_{f ∈ N, f ≠ j} G_{f,i}^(s)·P_f^(s) + Σ_{k ∈ M} G_{k,i}^(p)·P_k^(p) + σ² )   (2)

Equation (2) expresses the SINR received at secondary user SU_i (i ∈ N) from secondary user SU_j (j ∈ N). In the denominator, the first sum represents the total interference power generated inside the SN network, where f ∈ N, f ≠ j indexes the interfering secondary user SU_f, G_{f,i}^(s) is the channel gain from SU_f to SU_i, and P_f^(s) the transmit power of SU_f; the second sum represents the total interference power generated by the PN network on the SN, with G_{k,i}^(p) the channel gain from authorized user PU_k to secondary user SU_i and P_k^(p) the transmit power of PU_k. In the numerator, G_{j,i}^(s) is the channel gain from SU_j to SU_i and P_j^(s) the transmit power of SU_j. To meet the interference requirements under the DSA model, the invention imposes the following limits on β_{k,l}^(p) and β_{i,j}^(s):
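As a numerical sketch of the SINR of equation (1) (illustrative only; the gain matrices, powers, and function name are assumptions, not values from the patent):

```python
def sinr_pu(k, l, G_pp, P_p, G_sp, P_s, noise):
    """SINR at authorized receiver PU_k for the transmission from PU_l:
    desired power divided by (intra-PN interference + SN-to-PN
    interference + noise power sigma^2). G_pp[h][k] is the channel gain
    PU_h -> PU_k, G_sp[i][k] the gain SU_i -> PU_k; P_p and P_s are the
    transmit powers of the PUs and SUs."""
    desired = G_pp[l][k] * P_p[l]
    intra = sum(G_pp[h][k] * P_p[h] for h in range(len(P_p)) if h != l)
    cross = sum(G_sp[i][k] * P_s[i] for i in range(len(P_s)))
    return desired / (intra + cross + noise)
```

The SINR of equation (2) at a secondary receiver has the same shape, with the roles of the two networks exchanged.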
β_{k,l}^(p) ≥ β_0 and β_{i,j}^(s) ≥ β_i   (3)

In equation (3), β_0 and β_i respectively represent the SINR thresholds of the authorized base station PBS and of the secondary base station SBS, and β_0 is a constant. For convenience of explanation, the invention defines the minimum SINR at the receiving end of authorized user PU_k as β_k^(p) and the minimum SINR at the receiving end of secondary user SU_i as β_i^(s), as shown in equations (4) and (5):

β_k^(p) = min_{l ∈ M} β_{k,l}^(p)   (4)

β_i^(s) = min_{j ∈ N} β_{i,j}^(s)   (5)
Then equation (3) can be rewritten as:

β_k^(p) ≥ β_0 for all k ∈ M;  β_i^(s) ≥ β_i for all i ∈ N   (6)
Under the SINR condition of the SN, the transmit power P_i^(s) allocated to each SU is given by equation (7):

P_i^(s) = β_i·( Ĝ_i^(p)·P^(p) + σ² ) / Ĝ_i^(s)   (7)

In equation (7), Ĝ_i^(p) represents the average channel gain from the primary-network devices to secondary user SU_i, written Ĝ_i^(p) = (1/M)·Σ_{k ∈ M} G_{k,i}^(p), where G_{k,i}^(p) is the channel gain from authorized user PU_k to secondary user SU_i; Ĝ_i^(s) represents the average channel gain from secondary user SU_i to the secondary-network devices, written Ĝ_i^(s) = (1/N)·Σ_{j ∈ N} G_{i,j}^(s), where G_{i,j}^(s) is the channel gain from secondary user SU_i to secondary user SU_j. To obtain an effective power allocation, the corresponding SINR conditions must be met, σ² denoting the noise power. Combining equation (7) with equations (1) and (6), the SINR β_k^(p) received by authorized user PU_k can be rewritten in terms of the intermediate process parameters α_i and Ψ_i, as in equations (8) and (9). Since β_0 is assumed constant and G_{k,0}^(p) denotes the channel gain from authorized user PU_k to the SBS (i = 0), β_i needs to be adjusted on each SU so that equations (7) and (8) are satisfied; concretely, the adjustment can be made by adjusting the bit rate.
From the literature (QIU X, CHAWLA K. On the performance of adaptive modulation in cellular systems [J]. IEEE Transactions on Communications, 1999, 47(6): 884-895.), the relationship between the bit rate r_i^(s) of SU_i and β_i can be written as:

r_i^(s) = W·log2(1 + q·β_i)   (10)

In equation (10), 1 + q·β_i determines the constellation of the modulated signal, so that log2(1 + q·β_i) typically takes an integer number of bits; q = 1.5 / (−ln(5·BER)) is a constant related to the maximum transmission Bit Error Rate (BER), and W represents a parameter.
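Equation (10) can be sketched as follows (the default BER value and the function name are illustrative assumptions):

```python
import math

def amc_bit_rate(W, beta_i, ber=1e-3):
    """Bit rate r_i = W * log2(1 + q * beta_i) of eq. (10), where
    q = 1.5 / (-ln(5 * BER)) is tied to the maximum bit error rate."""
    q = 1.5 / (-math.log(5.0 * ber))
    return W * math.log2(1.0 + q * beta_i)
```

A higher SINR threshold β_i therefore always maps to a higher achievable bit rate, which is what lets each SU trade its threshold against rate.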
For the proposed dynamic spectrum access scheme, each SU_i selects its SINR threshold β_i and determines the corresponding r_i^(s), so that all PUs and SUs satisfy the SINR limit of equation (6).
To evaluate the effect of spectrum access, the invention uses the mean opinion score (MOS) as the evaluation model for dynamic spectrum access. Since communication may involve the transmission of both data and video, the MOS comprises MOS models for data and for video.
(1) MOS model of data
The MOS model of the data stream can be calculated by equation (11):
Q_D = g·log10(b·r_i^(s)·(1 − p_e2e))   (11)

In equation (11), Q_D represents the data-stream MOS, p_e2e the end-to-end packet loss rate, and g and b parameters obtained by fitting the data quality perceived by the end user. The perceived data quality is defined as follows: if the user's transmission rate is J and the receiving rate is also J, the packet loss rate is 0 and the end user's perceived data quality is maximal, namely 5; if the transmission rate is J and the receiving rate is 0, the packet loss rate is 1 and the perceived data quality is lowest, namely 0.
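A sketch of equation (11); the default values of g and b below are placeholders, since the patent only states that they are fitted from the perceived data quality:

```python
import math

def mos_data(rate_bps, p_e2e, g=1.0, b=1e-3):
    """Data-stream MOS: Q_D = g * log10(b * r * (1 - p_e2e)).
    g and b are illustrative placeholder constants, not patent values."""
    return g * math.log10(b * rate_bps * (1.0 - p_e2e))
```

The score grows with the delivered rate r·(1 − p_e2e) and falls as packet loss rises.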
(2) MOS model of video
The MOS model of the video can be represented by equation (12):
Q_V = c / (1 + exp(−d·(PSNR − f)))   (12)

In equation (12), Q_V represents the MOS of the video, PSNR the peak signal-to-noise ratio, and c, d, and f the parameters of the logistic function.
In the present invention, a logistic function is selected to evaluate the quality of the video traffic. The relationship between PSNR and r_i^(s) can be determined by the function PSNR = λ·log(r_i^(s)) + β, where λ and β are both constants.
In the case where both data streams and video streams are present, the average MOS is calculated as:

Q_μ = (1/N)·( Σ_{m=1}^{U} Q_{D,m} + Σ_{m=U+1}^{N} Q_{V,m} )   (13)

In equation (13), Q_μ denotes the average MOS, U the number of SUs whose traffic is data, N − U the number of SUs whose traffic is video, and m the index. Each SU adjusts β_i and the corresponding bit rate r_i^(s) so as to maximize the network quality (MOS) while satisfying the SINR limit of equation (6).
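The averaging of equation (13) can be sketched directly (function name assumed for illustration):

```python
def mos_average(data_scores, video_scores):
    """Average MOS of eq. (13): the U data-stream scores and the N-U
    video-stream scores are summed and divided by the total number of
    secondary users N."""
    n = len(data_scores) + len(video_scores)
    if n == 0:
        raise ValueError("no secondary users")
    return (sum(data_scores) + sum(video_scores)) / n
```

For two data users scoring 4.0 and 2.0 and one video user scoring 3.0, the system average is 3.0.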
After the access evaluation model is determined, the prior knowledge is constructed; this has a large influence on spectrum access efficiency and network transmission quality. In the Q-learning algorithm, the Q-table stores the reward of each action. Each SU first learns its surroundings, then selects the action associated with the largest reward, obtains the reward of the selected action by executing the Q-learning algorithm, and finally updates its Q-table based on the received instantaneous reward. The Q-table therefore reflects the impact of these actions on the wireless environment. Since parts of the radio environment are correlated with the interference each SU may cause to other parts of the system, the Q-table also reflects each SU's radio environment and the interrelationship of the system's components. When an SU joins a system that has already learned, it only perturbs the system's wireless environment, so re-running the whole cognitive cycle while ignoring the environment knowledge acquired by the other SUs in the system is inefficient. A newly added SU can instead learn from the existing SUs' environment knowledge to reduce learning time and improve learning performance. This mechanism is defined as "Prior-Knowledge-Based Learning Cognitive Radio" (PKBL-CR). The focus of CR is observation (sensing and analysis), while the focus of PKBL-CR is learning from prior knowledge: PKBL-CR contains more "experienced" nodes that can "teach" their learning experience to new nodes, thereby reducing learning time and improving learning performance.
Therefore, in step S3, when the prior knowledge is constructed, the environmental knowledge of the secondary user existing in the system is first acquired, and then the prior knowledge is constructed by using the acquired environmental knowledge.
After the prior knowledge is constructed, Q learning is performed using it; the Q-learning process is described in detail with reference to fig. 2.
Define the state set of the Q-learning algorithm as S_t, the action set as A, and the reward obtained after selecting action a_t ∈ A in the current state as R_t.
The agent in Q-learning perceives the current state s ∈ S_t, takes the action a = π(s) ∈ A accordingly under the given strategy π, and obtains the instantaneous reward R_t.
The key to the Q-learning algorithm is how, given the discount factor γ (0 < γ < 1), to adopt an appropriate strategy that maximizes the cumulative return V. That is, secondary user SU_i selects actions from its corresponding action set A_i, adjusts its transmit power and other transmission parameters, senses the change in network conditions, and obtains the instantaneous reward. Each PU and SU selects actions that maximize its MOS while satisfying the SINR constraint of equation (6).
The state set is defined as

s_t = (φ^(p), φ^(s))   (14)

to reflect the interference generated in the network, where φ^(p) and φ^(s) respectively represent whether the SINR constraints of the PUs and of the SUs are met, i.e. φ^(p) = 1 if β_k^(p) ≥ β_0 for all k and 0 otherwise, and φ^(s) = 1 if β_i^(s) ≥ β_i for all i and 0 otherwise.
If SU_i, after executing an action, obtains better communication quality while the SINR constraint of equation (6) is still met, i.e. the MOS score improves, it obtains the reward of a successful attempt, denoted r_suc; if a PU or SU violates the SINR constraint of equation (6) after the action is performed, the reward is the "reward of a failed attempt", denoted T. The MOS-based reward function can therefore be expressed as equation (15):

r_t^i(s_t, a_t^i) = r_suc, if the SINR constraints of equation (6) are satisfied;  r_t^i(s_t, a_t^i) = T, otherwise   (15)

In equation (15), r_t^i(s_t, a_t^i) represents the reward obtained by user SU_i when taking action a_t^i in state s_t; a_t^i represents the action SU_i takes at time t, and A_i denotes the action set of secondary user SU_i; r_suc denotes the reward of a successful attempt; T, a constant smaller than r_suc, denotes the reward of an unsuccessful attempt.
In the invention, it is assumed that all SUs select their "action" according to their own judgment, without considering the total income, so as to maximize their own long-term expected cumulative return. For user SU_i, the long-term return value is expressed as:

V_i^π(s) = E[ Σ_{t=0}^{∞} γ^t·r_t^i(s_t, a_t^i) | s_0 = s ]   (17)

where π is the "policy" currently taken by secondary user SU_i; γ (0 < γ < 1) is a constant, the time-discount factor, reflecting the importance of future returns; s_0 = s represents the initial state; r_t^i(s_t, a_t^i) represents the reward of secondary user SU_i at time t; and E(·) is the expectation operator, taking the mathematical expectation of the parenthesized quantity.
According to the Bellman optimality criterion, the maximum of equation (17) is:

V_i*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ·Σ_{s'} p(s'|s, a)·V_i*(s') ]   (18)

where V_i*(s) represents the optimal return and R_i(s, a) is the mathematical expectation of r_t^i(s_t, a_t^i); p(s'|s, a) is the transition probability that state s reaches state s' under the action of action a; and π* is the optimal strategy.
The idea of Q learning is that, without estimating an environment model, i.e. with R(s, a) and p(s'|s, a) unknown, the optimal strategy π* satisfying equation (18) can be found by a simple iteration of Q values. The update formula of the Q-value iteration is (19):

Q_{t+1}(s, a) = (1 − α_t) · Q_t(s, a) + α_t · [ r_t + γ · max_{a'} Q_t(s', a') ]
In the formula, α_t (0&lt;α_t&lt;1) denotes the learning rate; Q_t(s, a) represents the Q value of SU_i at the current "state-action" pair; and max_{a'} Q_t(s', a') denotes the largest Q value available after taking action a and reaching the new state s'.
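A minimal Python sketch of the update in equation (19), with the Q table stored as a dict of dicts (a representation assumed here for illustration):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.4):
    """One iteration of equation (19):
    Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q[s_next].values())          # max_a' Q(s', a')
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]
```

With α = 0.1 and γ = 0.4 (the values used in the simulations), an initial Q(s, a) = 0, reward 1.0, and max Q(s', ·) = 2.0, the update yields 0.1 · (1.0 + 0.4 · 2.0) = 0.18.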
In order to fully explain the process of Q learning, the cases without and with prior knowledge are described separately.
Without prior knowledge, each SU performs Q learning independently, with the Q tables of all SUs initialized to zero at the beginning of learning, i.e. Q_0(s, a) = 0 for all state-action pairs. The algorithm starts with an "exploration" strategy; after every entry of the Q table has been visited once, it switches to the "exploitation" strategy, following the action with the largest Q value. The specific process of Q learning is shown in Table 1:
TABLE 1 procedure during single user Q learning
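The single-user procedure of Table 1 can be sketched as follows; the environment interface `step(s, a) -> (reward, next_state)` and the explore-then-exploit switch are illustrative assumptions of this sketch, not the patent's exact pseudocode:

```python
import random

def single_user_q_learning(states, actions, step, episodes=500,
                           alpha=0.1, gamma=0.4):
    """Table 1, sketched: zero-initialized Q table, random 'exploration'
    until every (state, action) pair has been visited once, then
    'exploitation' of the action with the largest Q value."""
    Q = {s: {a: 0.0 for a in actions} for s in states}
    visited = set()
    s = random.choice(states)
    for _ in range(episodes):
        if len(visited) < len(states) * len(actions):
            a = random.choice(actions)               # exploration phase
        else:
            a = max(Q[s], key=Q[s].get)              # exploitation phase
        r, s_next = step(s, a)
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next].values()))
        visited.add((s, a))
        s = s_next
    return Q
```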
When prior knowledge is considered, Q learning proceeds as follows: when a new SU joins, it initializes its Q table from the Q tables already existing in the system, as in (20):

Q_new = (1/K) · sum_{i=1..K} Q_i
In equation (20), Q_new represents the initialized Q table of the newly added SU, Q_i represents the Q table of the secondary user SU_i already present in the system, and K represents the number of secondary users already in the system. The specific process of Q learning when prior knowledge is considered is shown in Table 2:
TABLE 2 prior knowledge based Q learning
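The initialization of equation (20) amounts to an element-wise combination of the K existing Q tables; the sketch below averages them (our reading of the K-term formula, stated as an assumption), using the same dict-of-dicts representation:

```python
def init_from_prior(existing_q_tables):
    """Equation (20), sketched: a newly joining SU initializes its
    Q table from the K tables already in the system, here taken as
    their element-wise average (an assumption about the formula)."""
    K = len(existing_q_tables)
    first = existing_q_tables[0]
    return {s: {a: sum(q[s][a] for q in existing_q_tables) / K
                for a in first[s]}
            for s in first}
```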
When prior knowledge is used for dynamic spectrum access and the agent executes the action with the maximum Q(s, a) in the current state at each learning step, the strategy has a learning weakness: if the agent finds an action with a higher Q value early in the iterative learning, it will stick to that action and never discover better actions that may still exist but have not yet been explored, thus falling into a local optimum.
Meanwhile, a precondition for Q learning to converge to a stable state is that each state-action pair is visited infinitely often, which the above strategy cannot satisfy. In order to meet the convergence requirement of Q learning and to trade off "exploration" (trying untried actions in the hope of a greater return) against "exploitation" (taking the learned action with the higher Q value), the invention optimizes the action-selection process based on a greedy algorithm. It adopts a Boltzmann-style exploration strategy that assigns every possible action a certain selection probability according to its Q value: actions with high Q values receive a high selection probability, but the probability of selecting any action is never zero, i.e. (21):

a* = argmax_{a∈A} Q(s, a),  if rand(0,1) ≤ ε;
a* = rand(A),               otherwise.
In equation (21), ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a. If the random number rand(0,1) is greater than ε, the agent directly selects one action at random from all actions (rand(A)) and performs "exploration"; otherwise it performs "exploitation" and selects the action with the highest Q value. The specific process is shown in Table 3:
TABLE 3 greedy algorithm based Q learning with a priori knowledge
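The selection rule of equation (21) can be sketched in Python; `Q_s` maps each action to its Q value in the current state (a representation assumed for illustration):

```python
import random

def select_action(Q_s, epsilon=0.8):
    """Equation (21), sketched: 'exploit' the action with the largest
    Q value with probability epsilon, otherwise 'explore' by drawing a
    uniformly random action (so no action's probability is zero)."""
    if random.random() <= epsilon:
        return max(Q_s, key=Q_s.get)              # exploitation
    return random.choice(list(Q_s.keys()))        # exploration
```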
In order to illustrate the beneficial effects of the method of the invention, a simulation analysis is performed. Assume that the primary network consists of one PU that accesses a single channel with a bandwidth of 10 MHz. The target SINR of the PU is set to 10 dB, and the Gaussian noise power and transmit power of the PU are set to 1 nW and 10 mW, respectively. The SUs are randomly distributed in a circle of radius 200 m and the PUs in a circle of radius 1000 m, both centered on the SBS base station. The channel gain follows a log-distance path-loss model with a path-loss exponent of 2.8. For a single SU, its target SINR (in dB) can be selected from the set {-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15}; for all SUs, the same learning rate α = 0.1 and discount factor γ = 0.4 are assumed.
Four algorithms are compared in the simulation: Random Access, single-user Q learning (Individual Q), Q learning based on prior knowledge (PKBL), and greedy-algorithm-based Q learning with prior knowledge (ε-based PKBL). The relationship between the number of secondary users SU and the average MOS is shown in Fig. 3, from which it can be seen that the average MOS decreases as the number of SUs increases. The cause of this phenomenon is that, as the number of users grows, each SU tends to converge to a smaller SINR value that still satisfies the interference constraint, which lowers the overall average MOS. The results also show that the DSA algorithm proposed by the invention achieves higher MOS values, remaining above the acceptable MOS level (MOS &gt; 3.5) even with 25 SUs in the network. Moreover, with the ε-based PKBL algorithm the average MOS of the system is higher than with the PKBL algorithm, because the agent does not fall into a local optimum by abandoning "exploration" early in learning; the network performance index is improved by up to 35% compared with random access.
Fig. 4 shows the relationship between the collision probability and the number of SUs in the network; in the invention, the collision probability is defined as the probability of violating the SINR constraint in equation (6) after a secondary user takes an "action". As the number of SUs in the network increases, the collision probability also increases. Of the four algorithms, the ε-based PKBL algorithm performs best in reducing the collision probability, because it "explores" more possibilities and thus obtains a better solution.
Fig. 5 shows the efficiency gain obtained when SUs with "experience" (knowledge of the surrounding environment) impart that experience to an SU newly joining the network. It can be seen that, compared with single-user learning, the number of iterations required for the PKBL algorithm to converge is reduced by up to 65%. Compared with the PKBL algorithm, the ε-based PKBL algorithm adds an "exploration" process, so its iteration count lies between those of the Individual Q and PKBL algorithms.
The invention provides a dynamic spectrum access method based on reinforcement learning with prior knowledge, which can adaptively adjust the transmit power and the corresponding transmit rate of the SUs, maximizing the average QoE (Quality of Experience) of the network while satisfying the transmission interference constraints of the SN and the PN. The MOS is used as the measure of subjective QoE; it not only meets the quality-evaluation requirement of a 5G network centered on the end user, but also provides a single universal metric for different types of traffic. In addition, in order to shorten the convergence time of the Q-learning algorithm, the invention lets a new SU learn the "knowledge" (environment information) of the SUs already existing in the system, thereby improving the learning process and reducing the number of iterations. Simulation results show that the new method improves the network performance on the basis of guaranteeing the MOS.
The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.
Claims (7)
1. A dynamic spectrum access method based on reinforcement learning with prior knowledge is characterized by comprising the following steps:
s1: the method comprises the steps that a secondary user obtains spectrum access environment information, wherein the spectrum access environment information comprises an authorized user and a signal-to-interference-and-noise ratio constraint condition, a self position and a position close to the secondary user;
s2: determining a frequency spectrum access evaluation model of the network, and adopting an MOS (metal oxide semiconductor) model as an access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the priori knowledge to obtain Q table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
2. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 1, wherein in step S1 the signal-to-interference-plus-noise ratio constraints are:

Γ_PUk ≥ β_0, k = 1, …, M;   Γ_SUi ≥ β_i, i = 1, …, N

where Γ_PUk represents the minimum signal-to-interference-plus-noise ratio at the receiving end of authorized user PU_k; Γ_SUi represents the minimum signal-to-interference-plus-noise ratio at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized users' base station, and β_0 is a constant; β_i represents the SINR threshold of the secondary-user base station; M represents the number of authorized users, N represents the number of secondary users, and k and i represent serial numbers.
3. The method according to claim 1, wherein in step S2 the MOS models include a data MOS model and a video MOS model, and the data MOS model is:

Q_D = g · log( b · r_i · (1 − p_e2e) )

where Q_D represents the MOS of the data flow; r_i represents the bit rate of secondary user SU_i; p_e2e represents the end-to-end packet loss rate; g and b represent parameters, obtained by fitting the data quality perceived by the end user;
the MOS model of the video is as follows:
in the formula, QVRepresenting MOS of the video, PSNR representing peak signal-to-noise ratio, c, d and f representing parameters of a logic function;
the MOS model is as follows:
in the formula, QμThe average MOS is represented, U represents the number of the secondary users with the service flow as data, N-U represents the number of the secondary users with the service flow as video, and m represents the serial number.
4. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 1, wherein in step S3 the specific method of acquiring the environment knowledge of the secondary users already in the system and constructing the prior knowledge from the acquired knowledge comprises: initializing the Q table of a newly added secondary user with the Q tables of the secondary users already in the system, according to the formula:

Q_new = (1/K) · sum_{i=1..K} Q_i

where Q_new represents the Q table of the newly added secondary user, Q_i represents the Q table of a secondary user already in the system, K represents the number of Q tables of secondary users already in the system, and i represents a serial number.
5. The dynamic spectrum access method based on a priori knowledge reinforcement learning of claim 1, wherein in step S4, the step of Q learning using a priori knowledge comprises:
s41: initializing Q tables of all secondary users in the system, wherein the newly added secondary users initialize the Q tables by using priori knowledge;
s42: selecting an action according to a formulaMaking action selection, whereinIndicating a secondary user SUiIn the state at the time t of the instant t,indicating a secondary user SUiAt time t, the state isAn action of the time selection;
s43: updating state, defining state set Andrepresenting signal-to-interference-and-noise ratio limits for authorized and secondary users, respectively, e.g. byThe following formula updates the state:
in the formula, N represents the number of secondary users, i represents the serial number, psii、αiAre all intermediate process parameters;
s44: obtaining the return: using a formulaUpdating the return value, wherein,the optimal return is represented in the form of,is composed ofThe mathematical expectation of (a) is that,for MOS-based reporting, the user SU is indicatediIn a state stIn the following, the actions are adoptedReporting time; p (s '| s, a) is the transition probability that state s reaches state s' under the action of action a; pi*Is an optimal strategy;
s45: updating the Q value: using formulasUpdating the Q value, wherein αtIt is indicated that the learning rate is,represents SUiThe Q value at the current "state-action", indicating an action to takeQ value corresponding to the new state s', Ai representing the secondary user SUiThe set of actions of (1);
s46: and repeating the steps S41-S45 until convergence.
6. The dynamic spectrum access method based on reinforcement learning with prior knowledge according to claim 5, wherein in step S42 the action selection is made according to the formula

a* = argmax_{a∈A} Q(s, a),  if rand(0,1) ≤ ε;   a* = rand(A), otherwise

where ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a; if the random number rand(0,1) is greater than the exploration probability ε, the agent directly selects one action at random from all actions and performs "exploration"; otherwise it performs "exploitation" and selects the action with the highest Q value.
7. The method according to any one of claims 5-6, wherein the MOS-based return is calculated by the following formula:

r_t^i(s_t, a_t^i) = r_MOS,  if the SINR constraints are satisfied and the MOS improves;
r_t^i(s_t, a_t^i) = T,      otherwise

where r_t^i(s_t, a_t^i) represents the return of user SU_i when taking action a_t^i in state s_t; a_t^i ∈ A_i represents the action taken by SU_i in state s_t at time t, and A_i represents the action set of secondary user SU_i; r_MOS represents the reward of a successful attempt; T, a constant smaller than the reward of any successful attempt, represents the reward of an unsuccessful attempt.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010495810.4A CN111654342B (en) | 2020-06-03 | 2020-06-03 | Dynamic spectrum access method based on reinforcement learning with priori knowledge |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111654342A true CN111654342A (en) | 2020-09-11 |
CN111654342B CN111654342B (en) | 2021-02-12 |
Family
ID=72345008
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010495810.4A Active CN111654342B (en) | 2020-06-03 | 2020-06-03 | Dynamic spectrum access method based on reinforcement learning with priori knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111654342B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112672359A (en) * | 2020-12-18 | 2021-04-16 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672426A (en) * | 2021-03-17 | 2021-04-16 | 南京航空航天大学 | Anti-interference frequency point allocation method based on online learning |
CN113207129A (en) * | 2021-05-10 | 2021-08-03 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113423110A (en) * | 2021-06-22 | 2021-09-21 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113747447A (en) * | 2021-09-07 | 2021-12-03 | 中国人民解放军国防科技大学 | Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge |
CN113939040A (en) * | 2021-10-08 | 2022-01-14 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN114630333A (en) * | 2022-03-16 | 2022-06-14 | 军事科学院系统工程研究院网络信息研究所 | Multi-parameter statistical learning decision-making method in cognitive satellite communication |
CN115086965A (en) * | 2022-06-14 | 2022-09-20 | 中国人民解放军国防科技大学 | Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization |
CN115412105A (en) * | 2022-05-06 | 2022-11-29 | 南京邮电大学 | Reinforcement learning communication interference method based on USRP RIO |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102238555A (en) * | 2011-07-18 | 2011-11-09 | 南京邮电大学 | Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio |
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101466111B (en) * | 2009-01-13 | 2010-11-17 | 中国人民解放军理工大学通信工程学院 | Dynamic spectrum access method based on policy planning constrain Q study |
US10505616B1 (en) * | 2018-06-01 | 2019-12-10 | Samsung Electronics Co., Ltd. | Method and apparatus for machine learning based wide beam optimization in cellular network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102238555A (en) * | 2011-07-18 | 2011-11-09 | 南京邮电大学 | Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio |
CN109586820A (en) * | 2018-12-28 | 2019-04-05 | 中国人民解放军陆军工程大学 | The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112672359A (en) * | 2020-12-18 | 2021-04-16 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672359B (en) * | 2020-12-18 | 2022-06-21 | 哈尔滨工业大学 | Dynamic spectrum access method based on bidirectional long-and-short-term memory network |
CN112672426A (en) * | 2021-03-17 | 2021-04-16 | 南京航空航天大学 | Anti-interference frequency point allocation method based on online learning |
CN113207129B (en) * | 2021-05-10 | 2022-05-20 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113207129A (en) * | 2021-05-10 | 2021-08-03 | 重庆邮电大学 | Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm |
CN113423110A (en) * | 2021-06-22 | 2021-09-21 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113423110B (en) * | 2021-06-22 | 2022-04-12 | 东南大学 | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning |
CN113747447A (en) * | 2021-09-07 | 2021-12-03 | 中国人民解放军国防科技大学 | Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge |
CN113939040A (en) * | 2021-10-08 | 2022-01-14 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN113939040B (en) * | 2021-10-08 | 2023-04-28 | 中国人民解放军陆军工程大学 | State updating method based on state prediction in cognitive Internet of things |
CN114630333A (en) * | 2022-03-16 | 2022-06-14 | 军事科学院系统工程研究院网络信息研究所 | Multi-parameter statistical learning decision-making method in cognitive satellite communication |
CN115412105A (en) * | 2022-05-06 | 2022-11-29 | 南京邮电大学 | Reinforcement learning communication interference method based on USRP RIO |
CN115412105B (en) * | 2022-05-06 | 2024-03-12 | 南京邮电大学 | Reinforced learning communication interference method based on USRP RIO |
CN115086965A (en) * | 2022-06-14 | 2022-09-20 | 中国人民解放军国防科技大学 | Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization |
Also Published As
Publication number | Publication date |
---|---|
CN111654342B (en) | 2021-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111654342B (en) | Dynamic spectrum access method based on reinforcement learning with priori knowledge | |
Wang et al. | A survey on applications of model-free strategy learning in cognitive wireless networks | |
CN111556572B (en) | Spectrum resource and computing resource joint allocation method based on reinforcement learning | |
CN110492955B (en) | Spectrum prediction switching method based on transfer learning strategy | |
Hou et al. | Joint allocation of wireless resource and computing capability in MEC-enabled vehicular network | |
Amichi et al. | Spreading factor allocation strategy for LoRa networks under imperfect orthogonality | |
Li et al. | A delay-aware caching algorithm for wireless D2D caching networks | |
CN114698128B (en) | Anti-interference channel selection method and system for cognitive satellite-ground network | |
CN113423110B (en) | Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning | |
Mohammadi et al. | QoE-driven integrated heterogeneous traffic resource allocation based on cooperative learning for 5G cognitive radio networks | |
Kwasinski et al. | Reinforcement learning for resource allocation in cognitive radio networks | |
Shashi Raj et al. | Interference resilient stochastic prediction based dynamic resource allocation model for cognitive MANETs | |
Chandra et al. | Joint resource allocation and power allocation scheme for MIMO assisted NOMA system | |
St Jean et al. | Bayesian game-theoretic modeling of transmit power determination in a self-organizing CDMA wireless network | |
Lv et al. | A dynamic spectrum access method based on Q-learning | |
CN116828534A (en) | Intensive network large-scale terminal access and resource allocation method based on reinforcement learning | |
Mehta | Recursive quadratic programming for constrained nonlinear optimization of session throughput in multiple‐flow network topologies | |
Ren et al. | Joint spectrum allocation and power control in vehicular communications based on dueling double DQN | |
Sen et al. | Rate adaptation techniques using contextual bandit approach for mobile wireless lan users | |
Chuan et al. | Machine learning based popularity regeneration in caching-enabled wireless networks | |
Dogra et al. | Reinforcement Learning (RL) for optimal power allocation in 6G Network | |
Ekwe et al. | QoE-aware Q-learning resource allocation for spectrum reuse in 5G communications network | |
Wang et al. | Experience cooperative sharing in cross-layer cognitive radio for real-time multimedia communication | |
Mishra et al. | DDPG with Transfer Learning and Meta Learning Framework for Resource Allocation in Underlay Cognitive Radio Network | |
CN114828193B (en) | Uplink and downlink multi-service concurrent power distribution method for wireless network and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |