CN111654342A - Dynamic spectrum access method based on reinforcement learning with priori knowledge - Google Patents


Info

Publication number
CN111654342A
Authority
CN
China
Prior art keywords
representing
learning
mos
knowledge
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010495810.4A
Other languages
Chinese (zh)
Other versions
CN111654342B (en)
Inventor
张建照
柳永祥
钱璟
刘斌
吕培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010495810.4A priority Critical patent/CN111654342B/en
Publication of CN111654342A publication Critical patent/CN111654342A/en
Application granted granted Critical
Publication of CN111654342B publication Critical patent/CN111654342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B 17/00 Monitoring; Testing
    • H04B 17/30 Monitoring; Testing of propagation channels
    • H04B 17/382 Monitoring; Testing of propagation channels for resource allocation, admission control or handover
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B 17/00 Monitoring; Testing
    • H04B 17/30 Monitoring; Testing of propagation channels
    • H04B 17/309 Measuring or estimating channel quality parameters
    • H04B 17/336 Signal-to-interference ratio [SIR] or carrier-to-interference ratio [CIR]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/14 Spectrum sharing arrangements between different networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 74/00 Wireless channel access, e.g. scheduled or random access
    • H04W 74/08 Non-scheduled or contention based access, e.g. random access, ALOHA, CSMA [Carrier Sense Multiple Access]
    • H04W 74/0833 Non-scheduled or contention based access, e.g. random access, ALOHA, CSMA [Carrier Sense Multiple Access] using a random access procedure

Abstract

The invention discloses a dynamic spectrum access method based on reinforcement learning with prior knowledge, belonging to the technical field of electromagnetic spectrum. First, a secondary user acquires spectrum access environment information; next, the spectrum access evaluation model of the network is determined, and an MOS (mean opinion score) model is adopted as the access evaluation model; then prior knowledge is constructed: the environment knowledge of the secondary users already present in the system is acquired and used to build the prior knowledge; Q learning is then performed on the basis of this prior knowledge to obtain the Q-table information of the secondary user; finally, dynamic spectrum access is performed according to the learned Q-table information. In addition, the invention optimizes the action-selection process during Q learning with a greedy algorithm, which prevents Q learning from becoming trapped in local optima. By constructing and exploiting prior knowledge, the invention effectively improves the learning efficiency and the dynamic spectrum access performance of the system.

Description

Dynamic spectrum access method based on reinforcement learning with priori knowledge
Technical Field
The invention belongs to the technical field of electromagnetic spectrum, and particularly relates to a dynamic spectrum access method based on prior knowledge reinforcement learning.
Background
With the continuous expansion of wireless applications, on the one hand the demand of communication systems for wireless resources causes a shortage of spectrum resources; on the other hand, conventional static spectrum resource management leads to low spectrum utilization. In recent years, the spectrum-use paradigm of Dynamic Spectrum Access (DSA) has attracted the attention of researchers, and Cognitive Radio (CR) is one of its most active research directions. Its main idea is that a Secondary User (SU) with spectrum-sensing capability actively senses the spectrum usage and communicates by "opportunistically" accessing idle channels, on the premise of not causing harmful interference to the authorized users (PUs) that hold the spectrum license.
Dynamic spectrum access techniques have attracted considerable academic attention. For example, the literature (AKBARZADEH N, MAHAJAN A. Dynamic spectrum access under partial observations: a restless bandit approach [C]. Canadian Workshop on Information Theory, 2019: 1-6.) models the dynamic spectrum access problem under multichannel transmission conditions as a partially observable Markov decision process (POMDP) and uses a Whittle-index heuristic to assist the decision; the simulation results show that, when the model is indexable, transmitting on the channel with the smallest Whittle index is the optimal strategy for the user. The literature (YANG H, CHEN H. Energy-efficient channel access with data priority in cognitive radio sensor networks [C]. International Conference on Software, Telecommunications and Computer Networks, 2019: 1-5.) proposes, for a Cognitive Radio Sensor Network (CRSN), a dynamic channel access scheme based on data priority and energy-consumption minimization, which allocates power according to the data priority of each node and then reasonably allocates transmission time to each node so as to minimize energy consumption. To accommodate the signal-to-noise-ratio and throughput requirements of authorized users, the literature (GURAJAPU S, RAJ S, CHOUSHAN S. Spectrum allocation and power management using Markov chains and beamforming in underlay cognitive radios [C].) uses a Markov chain to model the dynamic spectrum allocation process of a cognitive radio system and combines the Markov chain model with beamforming; dynamic spectrum allocation based on this scheme is significantly improved in terms of authorized-user throughput. The literature (PASTIRCAK J, GAZDA J, KOCUR D. A survey on the spectrum trading in dynamic spectrum access networks [C]. International Symposium Electronics in Marine (ELMAR), 2014: 135-138. // ZHAO Q, SADLER B M. A survey of dynamic spectrum access [J]. IEEE Signal Processing Magazine, 2007, 24(3): 79-89. // SHARMILA A, DANANJAYAN P. Spectrum sharing techniques in cognitive radio networks - a survey [C]. International Conference on System, Computation, Automation and Networking, 2019: 1-4.) introduces three general dynamic spectrum access models: the open sharing model, the licensed shared-use model, and the dynamic exclusive use model. The open sharing model lets all users use the spectrum resources equally, but easily causes interference problems; the licensed spectrum sharing model reduces interference to authorized users but limits the transmit power of secondary users; the dynamic exclusive model avoids additional harmful interference but only allows dynamic spectrum allocation among authorized users.
Most existing work is built on the premise that the spectrum access environment knowledge and its dynamic model are known; in practical applications, however, complete environment knowledge, such as the operating conditions and frequency-usage characteristics of authorized users, is usually difficult to obtain. To solve the problem of efficient spectrum access without prior knowledge, reinforcement learning methods have received increasing attention in recent years; their basic principle is to optimize system performance by continuously learning through interaction with the spectrum environment when no prior knowledge of that environment is available. The limitation of such methods is that, without prior knowledge, the convergence rate of learning is limited, which in turn restricts the improvement of spectrum utilization. In practical dynamic spectrum access applications, the secondary user is usually between these two situations: some prior environment knowledge can be obtained, but it is not sufficient. For example, before a secondary user dynamically accesses the spectrum, the spectrum database may already provide prior knowledge such as the locations and interference constraints of the authorized users in the area, but not their specific spectrum usage. In this scenario, which is closer to practical applications, how to obtain sufficient environment knowledge through reinforcement learning so as to achieve efficient spectrum utilization is a problem that must be solved urgently for dynamic spectrum access to become practical.
Disclosure of Invention
The technical problem is as follows: the invention provides a dynamic spectrum access method based on prior knowledge reinforcement learning, which improves spectrum access efficiency and realizes global optimization of resource allocation by utilizing the prior knowledge reinforcement learning method under the condition that spectrum access environment knowledge and a dynamic model are partially unknown, thereby maximizing the transmission quality of a cognitive network.
The technical scheme is as follows: the invention relates to a dynamic spectrum access method based on prior knowledge reinforcement learning, which comprises the following steps:
s1: a secondary user acquires spectrum access environment information, wherein the spectrum access environment information comprises the authorized-user situation and the signal-to-interference-plus-noise-ratio constraints, the secondary user's own position, and the positions of nearby secondary users;
s2: determining the spectrum access evaluation model of the network, adopting an MOS (mean opinion score) model as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the priori knowledge to obtain Q table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
Further, in step S1, the signal-to-interference-plus-noise-ratio constraints are:

γ_k,min^(p) ≥ β_0 for every k ∈ {1, …, M}, and γ_i,min^(s) ≥ β_i for every i ∈ {1, …, N}

where γ_k,min^(p) represents the minimum signal-to-interference-plus-noise ratio at the receiving end of authorized user PU_k; γ_i,min^(s) represents the minimum signal-to-interference-plus-noise ratio at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized-user base station, and β_0 is a constant; β_i represents the SINR threshold of the secondary-user base station; M represents the number of authorized users, N represents the number of secondary users, and k and i are indices.
Further, in step S2, the MOS model comprises an MOS model for data and an MOS model for video, where the MOS model for data is:

Q_D = g log10( b r_i^(s) (1 - p_e2e) )

where Q_D represents the data-stream MOS, r_i^(s) represents the bit rate of secondary user SU_i, p_e2e represents the end-to-end packet loss rate, g and b are parameters, and g and b are obtained from the data quality perceived by the end user;

the MOS model for video is a logistic function of the peak signal-to-noise ratio, where Q_V represents the MOS of the video, PSNR represents the peak signal-to-noise ratio, and c, d and f represent the parameters of the logistic function;

the overall MOS model is:

Q_μ = (1/N) ( Σ_{m=1}^{U} Q_D,m + Σ_{m=U+1}^{N} Q_V,m )

where Q_μ represents the average MOS, U represents the number of secondary users whose traffic is data, N - U represents the number of secondary users whose traffic is video, and m is an index.
Further, in step S3, the specific method of acquiring the environment knowledge of the secondary users already present in the system and constructing the prior knowledge from it comprises: initializing the Q table of the newly added secondary user with the Q tables of the secondary users already present in the system, according to the formula:

Q_new = (1/K) Σ_{i=1}^{K} Q_i

where Q_new represents the Q table of the newly added secondary user, Q_i represents the Q table of the i-th secondary user already present in the system, K represents the number of Q tables of the secondary users already present in the system, and i is an index.
Further, in step S4, the step of performing Q learning using the prior knowledge comprises:

s41: initializing the Q tables of all secondary users SU in the system, wherein a newly added secondary user initializes its Q table using the prior knowledge;

s42: selecting an action: the action is selected according to the formula

a_t^i = argmax_{a ∈ A_i} Q_i(s_t^i, a)

where s_t^i denotes the state of secondary user SU_i at time t and a_t^i denotes the action selected by secondary user SU_i at time t when its state is s_t^i;

s43: updating the state: the state set S_t = {s_t^(p), s_t^(s)} is defined, where s_t^(p) and s_t^(s) respectively indicate whether the signal-to-interference-plus-noise-ratio limits of the authorized users and of the secondary users are satisfied, and the state is updated using the rewritten minimum-SINR expressions, in which N represents the number of secondary users, i is an index, and ψ_i and α_i are intermediate process parameters;

s44: obtaining the return: the return value is updated using the formula

R_i*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ Σ_{s'} p(s'|s, a) R_i*(s') ]

where R_i* represents the optimal return, R_i(s, a) is the mathematical expectation of r_i^MOS(s_t, a_t^i), r_i^MOS is the MOS-based reward, namely the reward obtained by user SU_i when it takes action a_t^i in state s_t, γ is the discount factor, p(s'|s, a) is the transition probability of reaching state s' from state s under action a, and π* is the optimal strategy;

s45: updating the Q value: the Q value is updated using the formula

Q_{t+1}^i(s, a_t^i) = (1 - α_t) Q_t^i(s, a_t^i) + α_t [ r_i^MOS(s_t, a_t^i) + γ max_{a' ∈ A_i} Q_t^i(s', a') ]

where α_t denotes the learning rate, Q_t^i(s, a_t^i) represents the Q value of SU_i at the current "state-action" pair, Q_t^i(s', a') denotes the Q value corresponding to the new state s' after taking action a_t^i, and A_i denotes the action set of secondary user SU_i;
S46: and repeating the steps S41-S45 until convergence.
Further, in step S42, the action is selected using the formula

a* = rand(A) if rand(0,1) > ε; otherwise a* = argmax_{a ∈ A} Q(s, a)

where ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a; if the random number rand(0,1) is greater than the exploration probability, the agent directly selects one of all the actions at random and performs "exploration"; otherwise it performs "exploitation" and selects the action with the high Q value.
Further, the MOS-based reward is calculated as:

r_i^MOS(s_t, a_t^i) = (reward of a successful attempt) if the attempt succeeds; r_i^MOS(s_t, a_t^i) = T if the attempt fails

where r_i^MOS(s_t, a_t^i) represents the reward obtained by user SU_i when it takes action a_t^i in state s_t; a_t^i represents the action taken by SU_i in state s_t at time t, and A_i denotes the action set of secondary user SU_i; the first branch denotes the reward of a successful attempt; T is a constant smaller than the reward of a successful attempt and represents the reward of an unsuccessful attempt.
Beneficial effects: compared with the prior art, the invention has the following advantages:
(1) The invention discloses a dynamic spectrum access method based on reinforcement learning with prior knowledge. By using a reinforcement learning method equipped with prior knowledge, dynamic spectrum access can be realized even when the spectrum access environment and its dynamic model are partially unknown; on the premise of respecting the signal-to-interference-plus-noise ratio (SINR) limits of the primary network PN and the secondary network SN, the relationship between the transmission rate of a secondary user SU and its SINR is used to acquire and update environment information through Q learning, which effectively improves the dynamic spectrum access efficiency and greatly improves system performance.
(2) According to the method, the MOS is used as an evaluation model, the data stream and the video stream of the system can be comprehensively evaluated, so that the dynamic spectrum access condition of the system can be evaluated, and the accuracy is good.
(3) The Q learning action selection process is optimized based on the greedy algorithm, the problem that the dynamic spectrum access algorithm is easy to fall into local optimization is solved, iteration times required by algorithm convergence are effectively reduced, and therefore the efficiency of dynamic spectrum access and the performance of a system are improved.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic block diagram of a dynamic spectrum access using the method of the present invention;
FIG. 3 is a diagram illustrating the relationship between the number of sub-users and the average MOS in the system according to the embodiment of the present invention;
FIG. 4 is a diagram illustrating a relationship between a number of secondary users and a collision probability in the system according to the embodiment of the present invention;
fig. 5 is a diagram illustrating a relationship between the number of secondary users and the average iteration number in the system according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
Explanation: the meanings of the English abbreviations appearing in this embodiment are as follows:
PN: primary network; SN: secondary network; PBS: primary base station; SBS: secondary base station; PU: authorized (primary) user; SU: secondary user; SINR: signal-to-interference-plus-noise ratio; MOS: mean opinion score; DSA: dynamic spectrum access.
Referring to fig. 1, the dynamic spectrum access method based on prior knowledge reinforcement learning of the present invention includes the following steps:
s1: a secondary user acquires spectrum access environment information, wherein the spectrum access environment information comprises the authorized-user situation and the signal-to-interference-plus-noise-ratio constraints, the secondary user's own position, and the positions of nearby secondary users;
s2: determining the access evaluation model of the network, adopting an MOS (mean opinion score) model as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the priori knowledge to obtain Q table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
In general, the authorized-user situation, the secondary user's own position, and the positions of nearby secondary users can be obtained directly, so the key point of step S1 is to determine the signal-to-interference-plus-noise-ratio constraints, which are described in detail below.
Assume that there are two wireless networks in the system, one of which is the Primary Network (PN) and the other of which is the Secondary Network (SN). The PN consists of 1 main base station (PBS) and M authorized users (PU); the SN consists of 1 Secondary Base Station (SBS) and N Secondary Users (SUs), and the Secondary Users (SUs) are randomly distributed around the SBS.
Under a typical dynamic spectrum access model, a PN containing a single primary link shares a single channel with a SN at a given time. All secondary and primary links transmit using an Adaptive Modulation and Coding (AMC) scheme. Each SU can adjust the transmission parameters thereof to meet the interference requirements of the PU; the transmission parameters of SBS and PBS may not be adjusted.
The traffic flow carried on the SN link is video and data, and the transmission channel is assumed to be an additive white gaussian noise quasi-static channel. The PU employs AMC technology while assuming the transmit power to be constant, under which assumption the SU can infer Channel State Information (CSI) through active learning and then estimate the channel gain.
Because the SN communicates only by "opportunistically" accessing the channel when the PN is not occupying it, the embodiment of the invention only considers the SINR of both networks during the period in which the PN accesses the channel and the SN has not yet exited in time. In the primary network, the SINR received by authorized user PU_k, k ∈ M, from authorized user PU_l, l ∈ M, is denoted γ_{l,k}^(p), where k and l are indices, k, l ∈ M, 0 ≤ k, l ≤ M, and k or l equals 0 when the transmitting or receiving station is the PBS. In the secondary network, the SINR received by secondary user SU_i, i ∈ N, from secondary user SU_j, j ∈ N, is denoted γ_{j,i}^(s), where i and j are indices, i, j ∈ N, 0 ≤ i, j ≤ N, and i or j equals 0 when the transmitting or receiving station is the SBS.
Although only one pair of PN network devices (PBS and PU) and one pair of SN network devices (SBS and SU) are communicating at a time, to avoid loss of generality the transmissions of all stations are considered here, giving:

γ_{l,k}^(p) = h_{l,k}^(p) P_l^(p) / ( Σ_{h∈M, h≠l} h_{h,k}^(p) P_h^(p) + Σ_{i∈N} h_{i,k}^(s) P_i^(s) + σ² )    (1)

γ_{j,i}^(s) = h_{j,i}^(s) P_j^(s) / ( Σ_{f∈N, f≠j} h_{f,i}^(s) P_f^(s) + Σ_{k∈M} h_{k,i}^(p) P_k^(p) + σ² )    (2)

Equation (1) gives the SINR received by authorized user PU_k, k ∈ M, from authorized user PU_l, l ∈ M. The first sum in the denominator, Σ_{h∈M, h≠l} h_{h,k}^(p) P_h^(p), represents the total interference power generated inside the PN, where the superscript (p) denotes the PN, the subscript h ∈ M, h ≠ l denotes the interfering authorized user PU_h, h_{h,k}^(p) represents the channel gain from PU_h to PU_k, and P_h^(p) represents the transmit power of PU_h. The second sum in the denominator, Σ_{i∈N} h_{i,k}^(s) P_i^(s), represents the total interference power generated by the SN to the PN, where the superscript (s) denotes the SN, h_{i,k}^(s) denotes the channel gain from secondary user SU_i, i ∈ N, to authorized user PU_k, and P_i^(s) denotes the transmit power of SU_i. σ² in the denominator represents the noise power. In the numerator, h_{l,k}^(p) represents the channel gain from PU_l to PU_k, and P_l^(p), l ∈ M, denotes the transmit power of PU_l, which is a constant.

Equation (2) gives the SINR received by secondary user SU_i, i ∈ N, from secondary user SU_j, j ∈ N. The first sum in the denominator, Σ_{f∈N, f≠j} h_{f,i}^(s) P_f^(s), represents the total interference power generated inside the SN, where the subscript f ∈ N, f ≠ j denotes the interfering secondary user SU_f, h_{f,i}^(s) denotes the channel gain from SU_f to SU_i, and P_f^(s) denotes the transmit power of SU_f. The second sum in the denominator, Σ_{k∈M} h_{k,i}^(p) P_k^(p), represents the total interference power generated by the PN to the SN, where h_{k,i}^(p) denotes the channel gain from authorized user PU_k to secondary user SU_i, and P_k^(p) denotes the transmit power of PU_k. In the numerator, h_{j,i}^(s) denotes the channel gain from SU_j to SU_i, and P_j^(s) denotes the transmit power of SU_j. In order to achieve the interference targets under the DSA model, the invention places the following limits on γ_{l,k}^(p) and γ_{j,i}^(s):

γ_{l,k}^(p) ≥ β_0 for all k, l ∈ M, and γ_{j,i}^(s) ≥ β_i for all i, j ∈ N    (3)

In equation (3), β_0 and β_i respectively represent the SINR thresholds of the authorized-user base station PBS and the secondary-user base station SBS, and β_0 is a constant. For convenience of explanation, the invention defines the minimum SINR at the receiving end of authorized user PU_k as γ_{k,min}^(p) and the minimum SINR at the receiving end of secondary user SU_i as γ_{i,min}^(s), as shown in equations (4) and (5):

γ_{k,min}^(p) = min_{l∈M} γ_{l,k}^(p)    (4)

γ_{i,min}^(s) = min_{j∈N} γ_{j,i}^(s)    (5)

Then equation (3) can be rewritten as:

γ_{k,min}^(p) ≥ β_0 for all k ∈ M, and γ_{i,min}^(s) ≥ β_i for all i ∈ N    (6)
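For illustration, the following Python sketch evaluates equations (1) and (2) for a hypothetical snapshot of the two networks and checks the minimum-SINR constraints of equation (6); the array names, function name and all numerical values are assumptions introduced here for the example, not values taken from the patent.

```python
import numpy as np

def min_sinr_constraints_ok(h_pp, h_sp, h_ss, h_ps, p_p, p_s, noise, beta0, beta):
    """Check the minimum-SINR constraints of equation (6) for one snapshot.

    h_pp[l, k]: channel gain PU_l -> PU_k        (PN internal)
    h_sp[i, k]: channel gain SU_i -> PU_k        (SN -> PN interference)
    h_ss[j, i]: channel gain SU_j -> SU_i        (SN internal)
    h_ps[k, i]: channel gain PU_k -> SU_i        (PN -> SN interference)
    p_p[l], p_s[i]: transmit powers of PU_l and SU_i
    noise: noise power sigma^2; beta0: PN threshold; beta[i]: threshold of SU_i
    """
    M, N = len(p_p), len(p_s)

    # Equation (1): SINR at PU_k for the signal sent by PU_l.
    def gamma_p(l, k):
        intra = sum(h_pp[h, k] * p_p[h] for h in range(M) if h != l)   # PN-internal interference
        cross = sum(h_sp[i, k] * p_s[i] for i in range(N))             # SN -> PN interference
        return h_pp[l, k] * p_p[l] / (intra + cross + noise)

    # Equation (2): SINR at SU_i for the signal sent by SU_j.
    def gamma_s(j, i):
        intra = sum(h_ss[f, i] * p_s[f] for f in range(N) if f != j)   # SN-internal interference
        cross = sum(h_ps[k, i] * p_p[k] for k in range(M))             # PN -> SN interference
        return h_ss[j, i] * p_s[j] / (intra + cross + noise)

    # Equations (4)-(6): the worst-case (minimum) SINR at every receiver must
    # stay above the corresponding threshold.
    pu_ok = all(min(gamma_p(l, k) for l in range(M)) >= beta0 for k in range(M))
    su_ok = all(min(gamma_s(j, i) for j in range(N)) >= beta[i] for i in range(N))
    return pu_ok and su_ok

# Toy example with made-up gains and powers (2 PUs, 3 SUs).
rng = np.random.default_rng(0)
M, N = 2, 3
ok = min_sinr_constraints_ok(
    h_pp=rng.uniform(0.5, 1.0, (M, M)), h_sp=rng.uniform(0.01, 0.05, (N, M)),
    h_ss=rng.uniform(0.5, 1.0, (N, N)), h_ps=rng.uniform(0.01, 0.05, (M, N)),
    p_p=np.full(M, 10e-3), p_s=np.full(N, 1e-3),
    noise=1e-9, beta0=10.0, beta=np.full(N, 2.0))
print("SINR constraints satisfied:", ok)
```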
Under the SINR condition of the SN, the transmit power P_i^(s) allocated to each SU is given by equation (7). In equation (7), the average channel gain from the primary-network devices to secondary user SU_i is obtained from the channel gains h_{k,i}^(p), where h_{k,i}^(p) represents the channel gain from authorized user PU_k to secondary user SU_i; the average channel gain from secondary user SU_i to the secondary-network devices is obtained from the channel gains h_{i,j}^(s), where h_{i,j}^(s) represents the channel gain from secondary user SU_i to secondary user SU_j. To obtain an efficient power allocation, a condition must be met in which σ² denotes the error. Combining equation (7) with equations (1) and (6), the minimum SINR γ_{k,min}^(p) received by authorized user PU_k can be rewritten as equation (8), in which α_i and Ψ_i are intermediate process parameters defined by equation (9).

Since β_0 is assumed to be a constant and h_{k,0}^(p) represents the channel gain from authorized user PU_k to secondary user SU_0 (i = 0), β_i needs to be adjusted on each SU so that equations (7) and (8) are satisfied; this adjustment can be made specifically by adjusting the bit rate.
From the literature (QIU X, CHAWLA K. On the performance of adaptive modulation in cellular systems [J]. IEEE Transactions on Communications, 1999, 47(6): 884-895.), the relationship between the bit rate r_i^(s) of secondary user SU_i and β_i can be written as:

r_i^(s) = W log2(1 + qβ_i)    (10)

In equation (10), 1 + qβ_i determines the number of bits of the modulated signal, which is typically an integer value; q = 1.5 / (-ln(5·BER)) is a constant related to the maximum transmission bit error rate (BER); and W represents a parameter.
For the proposed dynamic spectrum access scheme, each SU_i selects its SINR threshold β_i and determines the corresponding r_i^(s), so that all PUs and SUs satisfy the SINR limits of equation (6).
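A small sketch of equation (10), relating the selected SINR threshold β_i to the achievable bit rate; the 10 MHz bandwidth and the 1e-3 target BER used below are assumed example values, and the dB-to-linear conversion is added only for the illustration.

```python
import math

def bit_rate(beta_i, bandwidth_hz=10e6, ber=1e-3):
    """Equation (10): r = W * log2(1 + q * beta_i), with q = 1.5 / (-ln(5*BER))
    tied to the maximum tolerated bit error rate. W, BER values are assumptions."""
    q = 1.5 / (-math.log(5 * ber))
    return bandwidth_hz * math.log2(1 + q * beta_i)

# A higher SINR threshold allows a denser modulation and hence a higher rate.
for beta_db in (-5, 1, 7, 13):
    beta_lin = 10 ** (beta_db / 10)
    print(f"beta = {beta_db:>3} dB -> r = {bit_rate(beta_lin) / 1e6:.2f} Mbit/s")
```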
In order to evaluate the effect of spectrum access, in the invention, MOS (mean opinion score) is used as an evaluation model to evaluate the effect of dynamic spectrum access. During communication, transmission of data and video may be involved, and thus, MOS includes MOS models of data and video.
(1) MOS model of data
The MOS of a data stream can be calculated by equation (11):

Q_D = g log10( b r_i^(s) (1 - p_e2e) )    (11)

In equation (11), Q_D represents the data-stream MOS, p_e2e represents the end-to-end packet loss rate, and g and b are parameters obtained from the data quality perceived by the end user. The perceived data quality is defined as follows: if the user's transmission rate is J and the receiving rate is also J, the packet loss rate is 0 and the end user's perceived data quality is the maximum, namely 5; if the user's transmission rate is J and the receiving rate is 0, the packet loss rate is 1 and the end user's perceived data quality is the lowest, namely 0.
(2) MOS model of video
The MOS of the video is given by equation (12), a logistic function of the peak signal-to-noise ratio PSNR with parameters c, d and f, where Q_V represents the MOS of the video. In the invention, a logistic function is selected to evaluate the quality of the video traffic. The relationship between PSNR and r_i^(s) can be determined by the function PSNR = λ log r_i^(s) + β, where λ and β are both constants.
When both data streams and video streams are present, the average MOS is calculated as:

Q_μ = (1/N) ( Σ_{m=1}^{U} Q_D,m + Σ_{m=U+1}^{N} Q_V,m )    (13)

In equation (13), Q_μ denotes the average MOS, U denotes the number of SUs whose traffic is data, N - U denotes the number of SUs whose traffic is video, and m is an index. Each SU adjusts β_i and the corresponding bit rate r_i^(s) so as to maximize the network quality (MOS) while satisfying the SINR limits of equation (6).
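The MOS evaluation of equations (11)-(13) can be sketched as follows; the concrete parameter values (g, b, c, d, f, λ, β) and the exact logistic form used for the video MOS are illustrative assumptions, since the patent states only that Q_V is a logistic function of the PSNR with parameters c, d and f.

```python
import math

def mos_data(rate_bps, loss, g=4.0, b=3e-6):
    """Equation (11): data-stream MOS from bit rate and end-to-end loss rate.
    g and b would normally be calibrated against user-perceived quality."""
    return g * math.log10(b * rate_bps * (1.0 - loss))

def mos_video(rate_bps, lam=2.0, beta=5.0, c=5.0, d=0.3, f=35.0):
    """Assumed logistic mapping PSNR -> MOS for equation (12), with
    PSNR = lam * log(rate) + beta as stated in the description."""
    psnr = lam * math.log(rate_bps) + beta
    return c / (1.0 + math.exp(-d * (psnr - f)))

def average_mos(data_rates, video_rates, loss=0.01):
    """Equation (13): average MOS over U data users and N-U video users."""
    scores = [mos_data(r, loss) for r in data_rates] + [mos_video(r) for r in video_rates]
    return sum(scores) / len(scores)

# Example: two data SUs and two video SUs with assumed bit rates.
print(round(average_mos(data_rates=[2e6, 5e6], video_rates=[8e6, 12e6]), 2))
```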
After the access evaluation model is determined, the prior knowledge is constructed; it has a great influence on the spectrum access efficiency and on the transmission quality of the network. For the Q learning algorithm, the Q table stores the reward of each action. Each SU first learns its surroundings, then selects the action associated with the largest reward, obtains the reward of the selected action by executing the Q learning algorithm, and finally updates its Q table based on the received instantaneous reward. The Q table thus reflects the impact of these actions on the wireless environment. Since parts of the radio environment are coupled through the interference that each SU may cause to other parts of the system, the Q table also reflects each SU's radio environment and the interrelationship of the system components. When an SU joins a system that has already learned, it only has a limited impact on the wireless environment of the system, so it is inefficient for it to rerun the whole cognitive cycle while ignoring the environment knowledge already acquired by the other SUs in the system. Therefore, a newly added SU can learn the environment knowledge of the existing SUs in the system to reduce learning time and improve learning performance. This mechanism is defined here as "Prior-Knowledge-Based Learning Cognitive Radio (PKBL-CR)". The focus of CR is observation (sensing and analysis), while the focus of PKBL-CR is learning from prior knowledge: PKBL-CR has more "experienced" nodes that can "teach" their learning experience to new nodes, thereby reducing learning time and improving learning performance.
Therefore, in step S3, when the prior knowledge is constructed, the environmental knowledge of the secondary user existing in the system is first acquired, and then the prior knowledge is constructed by using the acquired environmental knowledge.
After the prior knowledge is constructed, Q learning is performed using it; to describe the Q learning process in detail, it is explained with reference to fig. 2.

The state set of the Q learning algorithm is defined as S_t, the action set as A, and the reward obtained after selecting action a_t ∈ A in the current state as R_t. The agent in Q learning perceives the current state s ∈ S_t, takes the action a = π(s) ∈ A accordingly under the given strategy π, and obtains the instantaneous reward R_t.

The key of the Q learning algorithm is how, given the discount factor γ (0 < γ < 1), to take an appropriate strategy to maximize the cumulative reward V. That is, the secondary user SU_i selects actions from its corresponding action set A_i, adjusts its transmit power and other transmission parameters, senses the change in network state, and obtains the instantaneous reward r_i^MOS. Each PU and SU makes its selection so as to maximize its MOS while satisfying the SINR constraint of equation (6).
The state set is defined as S_t = {s_t^(p), s_t^(s)}; S_t is used to reflect the interference generated by the network, where s_t^(p) and s_t^(s) indicate whether the SINR constraints of the PUs and of the SUs are satisfied, respectively, as expressed by equations (14) and (15).

If SU_i obtains better communication quality after performing its action while the SINR constraints of equation (6) are still satisfied, i.e. the MOS score is improved, it obtains the reward of a successful attempt; if a PU or an SU violates the SINR constraint of equation (6) after the action is performed, the reward at this moment is the "reward of a failed attempt", denoted by T. Therefore, the MOS-based reward function r_i^MOS can be expressed as equation (16):

r_i^MOS(s_t, a_t^i) = (reward of a successful attempt) if the attempt succeeds; r_i^MOS(s_t, a_t^i) = T if the attempt fails    (16)

In equation (16), r_i^MOS(s_t, a_t^i) represents the reward obtained by user SU_i when it takes action a_t^i in state s_t; a_t^i represents the action taken by SU_i in state s_t at time t, and A_i denotes the action set of secondary user SU_i; the first branch is the reward of a successful attempt; T is a constant smaller than the reward of a successful attempt and represents the reward of an unsuccessful attempt.
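As a hedged illustration of equation (16), the sketch below assumes the success reward is the achieved average MOS; the patent specifies only that it is "the reward of a successful attempt" and that T is a smaller constant, so both the function name and the choice of success reward are assumptions.

```python
def mos_reward(sinr_ok: bool, mos_improved: bool, avg_mos: float, T: float = -1.0) -> float:
    """Equation (16), sketched: a 'successful attempt' reward when the SINR
    constraints of equation (6) hold and the MOS improved; the constant T
    (smaller than any success reward) otherwise."""
    if sinr_ok and mos_improved:
        return avg_mos      # assumed form of the success reward
    return T                # reward of an unsuccessful attempt

print(mos_reward(sinr_ok=True, mos_improved=True, avg_mos=3.8))   # -> 3.8
print(mos_reward(sinr_ok=False, mos_improved=True, avg_mos=3.8))  # -> -1.0
```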
In the invention it is assumed that every SU selects its "action" according to its own judgment, without considering the total income, so as to maximize its own long-term expected cumulative return. For user SU_i the long-term return is expressed as:

R_i^π(s) = E( Σ_{t=0}^{∞} γ^t r_i^MOS(s_t, a_t^i) | s_0 = s )    (17)

where π is the "strategy" (action policy) currently taken by secondary user SU_i; γ (0 < γ < 1) is a constant, the time-discount factor, which reflects the importance of future returns; s_0 = s represents the initial state; a_t^i denotes the action of secondary user SU_i at time t; and E(·) is the expectation function, taking the mathematical expectation of the parameter in parentheses.

According to the Bellman optimality criterion, the maximum of equation (17) is:

R_i*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ Σ_{s'} p(s'|s, a) R_i*(s') ]    (18)

where R_i* represents the optimal return, R_i(s, a) is the mathematical expectation of r_i^MOS(s, a), p(s'|s, a) is the transition probability of reaching state s' from state s under action a, and π* is the optimal strategy.

The idea of Q learning is that, without estimating an environment model and with R(s, a) and p(s'|s, a) unknown, the optimal strategy π* satisfying equation (18) can be found by simple iteration of the Q values. The update formula of the Q-value iteration is equation (19):

Q_{t+1}^i(s, a_t^i) = (1 - α_t) Q_t^i(s, a_t^i) + α_t [ r_i^MOS(s_t, a_t^i) + γ max_{a' ∈ A_i} Q_t^i(s', a') ]    (19)

where α_t (0 < α_t < 1) denotes the learning rate, Q_t^i(s, a_t^i) represents the Q value of SU_i at the current "state-action" pair, and Q_t^i(s', a') denotes the Q value corresponding to the new state s' reached after taking action a_t^i.
In order to fully explain the process of Q learning, the process of Q learning without considering prior knowledge and with considering prior knowledge is explained separately.
Without prior knowledge, each SU performs Q learning separately, and the Q tables of all SUs are initialized to zero at the beginning of learning, i.e. Q_0^i(s, a) = 0 for every state-action pair. The algorithm starts with an "exploration" strategy and, after every entry of the Q table has been visited once, switches to the "exploitation" strategy of following the "action" with the largest Q value. The specific procedure of Q learning is shown in Table 1:
TABLE 1 procedure during single user Q learning
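A minimal Python sketch of the single-user Q-learning procedure of Table 1, reconstructed from the surrounding description: the Q table starts at zero, unvisited actions are tried first ("exploration"), then the max-Q action is followed ("exploitation"), with the update of equation (19). The DummySpectrumEnv class and its reward behaviour are illustrative assumptions, not the patented environment model.

```python
import numpy as np

class DummySpectrumEnv:
    """Toy stand-in for the spectrum environment (an assumption for illustration)."""
    def __init__(self, n_states=4, n_actions=11, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_states, self.n_actions = n_states, n_actions
        self.state = 0
    def get_state(self):
        return self.state
    def apply_action(self, action):
        # Move to a random new state; reward mimics equation (16):
        # a MOS-like value on success, a smaller constant on failure.
        self.state = int(self.rng.integers(self.n_states))
        reward = self.rng.uniform(1.0, 5.0) if self.rng.random() > 0.2 else -1.0
        return self.state, reward

def single_user_q_learning(env, n_states, n_actions, steps=500, alpha=0.1, gamma=0.4):
    """Sketch of the Table 1 procedure: explore every entry once, then exploit."""
    Q = np.zeros((n_states, n_actions))
    visited = np.zeros((n_states, n_actions), dtype=bool)
    s = env.get_state()
    for _ in range(steps):
        unexplored = np.flatnonzero(~visited[s])
        a = int(unexplored[0]) if unexplored.size else int(np.argmax(Q[s]))
        visited[s, a] = True
        s_next, r = env.apply_action(a)
        # Equation (19): Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
        s = s_next
    return Q

Q_table = single_user_q_learning(DummySpectrumEnv(), n_states=4, n_actions=11)
print(Q_table.round(2))
```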
When prior knowledge is considered, Q learning proceeds as follows: when a new SU joins, the newly added SU initializes its Q table from the Q tables already present in the system, according to equation (20):

Q_new = (1/K) Σ_{i=1}^{K} Q_i    (20)

In equation (20), Q_new represents the initialized Q table of the newly added SU, Q_i represents the Q table of the secondary user SU_i already present in the system, and K represents the number of secondary users already present in the system. The specific procedure of Q learning when prior knowledge is considered is shown in Table 2:
TABLE 2 prior knowledge based Q learning
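A minimal sketch of the prior-knowledge step that Table 2 adds, per equation (20): a newly joining SU initializes its Q table as the average of the Q tables already learned by the K existing SUs, and then continues with the same learning loop as in the Table 1 sketch above; the example Q tables are random placeholders.

```python
import numpy as np

def init_q_from_prior(existing_q_tables):
    """Equation (20): Q_new = (1/K) * sum_i Q_i over the K existing SUs' Q tables."""
    return np.mean(np.stack(existing_q_tables), axis=0)

# Example: two "experienced" SUs teach a newcomer their environment knowledge.
q1 = np.random.default_rng(1).uniform(0, 5, (4, 11))
q2 = np.random.default_rng(2).uniform(0, 5, (4, 11))
q_new = init_q_from_prior([q1, q2])
print(q_new.shape)   # the newcomer then runs the Table 1 loop starting from q_new
```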
When prior knowledge is used for dynamic spectrum access, at each learning step the agent executes the action with the maximum Q(s, a) in the current state. This strategy, however, has a learning vulnerability: if the agent finds a high-Q action during the initial iterations, it will settle on it without finding a possibly better action that has not yet been explored, and thus fall into a local optimum.

Meanwhile, a precondition for Q learning to converge to a stable state is that every state-action pair is visited infinitely often, which the above strategy cannot guarantee. In order to meet the convergence requirement of Q learning and to trade off "exploration" (trying actions that have not yet been tried, in the hope of a greater return) against "exploitation" (taking the learned action with the higher Q value), the invention optimizes the action-selection process based on a greedy algorithm and adopts a Boltzmann-style trial strategy: every possible action is assigned a certain probability according to its Q value, actions with high Q values receive a high selection probability, but the probability of selecting any action is never zero, i.e.:

a* = rand(A) if rand(0,1) > ε; otherwise a* = argmax_{a ∈ A} Q(s, a)    (21)

In equation (21), ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a. If the random number rand(0,1) is greater than the exploration probability, the agent directly selects one of all the actions at random (rand(A)) and performs "exploration"; otherwise it performs "exploitation" and selects the action with the high Q value. The specific procedure is shown in Table 3:
TABLE 3 greedy algorithm based Q learning with a priori knowledge
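A sketch of the action-selection rule of equation (21) used inside the Table 3 procedure; following the text, a random number larger than the exploration probability ε triggers "exploration", otherwise the highest-Q action is "exploited". The function name and example values are assumptions.

```python
import numpy as np

def select_action(q_row, epsilon, rng):
    """Equation (21): pick a random action ('exploration') when rand(0,1) > epsilon,
    otherwise the action with the largest Q value ('exploitation')."""
    if rng.random() > epsilon:
        return int(rng.integers(len(q_row)))   # exploration
    return int(np.argmax(q_row))               # exploitation

rng = np.random.default_rng(0)
q_row = np.array([0.2, 1.5, 0.3, 0.9])
print(select_action(q_row, epsilon=0.8, rng=rng))
```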
In order to illustrate the beneficial effects of the method of the invention, a simulation analysis is performed. Assume that the primary network consists of one PU that accesses a single channel with a bandwidth of 10 MHz. The target SINR of the PU is set to 10 dB, and the Gaussian noise power and the transmit power of the PU are set to 1 nW and 10 mW, respectively. The SUs are randomly distributed within a circle of radius 200 m and the PUs within a circle of radius 1000 m, both centered on the SBS base station. The channel gain follows a log-distance path-loss model with a path-loss exponent equal to 2.8. For a single SU, its SINR threshold can be selected from the set {-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15}; for all SUs, the same learning rate α = 0.1 and discount factor γ = 0.4 are assumed.
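For reference, the simulation parameters listed above can be collected in a small snippet together with a log-distance path-loss gain; the reference distance and reference gain used below are assumptions, since the patent does not specify them.

```python
SIM = dict(bandwidth_hz=10e6, pu_target_sinr_db=10.0, noise_w=1e-9,
           pu_tx_power_w=10e-3, su_radius_m=200.0, pu_radius_m=1000.0,
           path_loss_exponent=2.8,
           sinr_set_db=[-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15],
           learning_rate=0.1, discount=0.4)

def channel_gain(distance_m, exponent=SIM["path_loss_exponent"],
                 ref_distance_m=1.0, ref_gain=1e-3):
    """Log-distance path-loss model: gain falls off as distance^(-exponent).
    ref_distance_m and ref_gain are illustrative assumptions."""
    return ref_gain * (ref_distance_m / max(distance_m, ref_distance_m)) ** exponent

print(channel_gain(150.0))
```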
The performance of four algorithms is compared in the simulation: random access, single-user Q learning (Individual Q), Q learning based on prior knowledge (PKBL), and Q learning with prior knowledge based on the greedy algorithm (ε-based PKBL). The relationship between the number of secondary users SU and the average MOS is shown in fig. 3. It can be seen from fig. 3 that the average MOS decreases as the number of SUs increases; the reason is that, as the number of users grows, each SU tends to converge to a smaller SINR value that still satisfies the interference constraint, which results in a smaller average MOS overall. The results also show that the DSA algorithm proposed by the invention achieves higher MOS values (always above the acceptable MOS level (MOS > 3.5), even with 25 SUs in the network). Meanwhile, it can be seen that with the ε-based PKBL algorithm the average MOS of the system is higher than with the PKBL algorithm, because in the initial learning the agent does not abandon "exploration" and therefore does not get trapped in a local optimum; the network performance index is improved by up to 35% compared with random access.
Fig. 4 shows the relationship between the collision probability and the number of SUs in the network; in the invention, the collision probability is defined as the probability that the SINR constraint of equation (6) is violated after a secondary user takes an "action". As the number of SUs in the network increases, the collision probability also increases. Among the four algorithms, the ε-based PKBL algorithm performs best in reducing the collision probability, because it "explores" more possibilities and therefore obtains a better solution.
Fig. 5 shows the efficiency gain obtained when SUs with "experience" (knowledge of the surrounding environment) impart that "experience" to an SU newly joining the network. It can be seen that, compared with single-user learning, the number of iterations required for the PKBL algorithm to converge is reduced by up to 65%. Compared with the PKBL algorithm, the ε-based PKBL algorithm adds an "exploration" process, so its iteration count lies between those of the Individual Q and PKBL algorithms.
The invention provides a dynamic spectrum access method based on reinforcement learning with prior knowledge, which can adaptively adjust the transmit power and the corresponding transmission rate of each SU and maximize the average QoE (Quality of Experience) of the network while satisfying the transmission interference constraints of the SN and the PN. The MOS is used as the measure of subjective QoE; it not only meets the end-user-centric quality evaluation requirement of 5G networks but also provides a single universal metric for different types of traffic. In addition, in order to shorten the convergence time of the Q learning algorithm, the invention, based on the idea of Q learning, lets a new SU learn the "knowledge" (environment information) of the SUs already present in the system, thereby improving the learning process and reducing the number of iterations. Simulation results show that the new method improves network performance while guaranteeing the MOS.
The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.

Claims (7)

1. A dynamic spectrum access method based on reinforcement learning with prior knowledge is characterized by comprising the following steps:
s1: a secondary user acquires spectrum access environment information, wherein the spectrum access environment information comprises the authorized-user situation and the signal-to-interference-plus-noise-ratio constraints, the secondary user's own position, and the positions of nearby secondary users;
s2: determining the spectrum access evaluation model of the network, adopting an MOS (mean opinion score) model as the access evaluation model;
s3: establishing prior knowledge, acquiring the environmental knowledge of the existing secondary users in the system, and establishing the prior knowledge by using the acquired environmental knowledge;
s4: performing Q learning according to the priori knowledge to obtain Q table information of the secondary user;
s5: and performing dynamic spectrum access according to the learned Q table information.
2. The dynamic spectrum access method based on reinforcement learning with prior knowledge of claim 1, wherein in step S1 the signal-to-interference-plus-noise-ratio constraints are:

γ_k,min^(p) ≥ β_0 for every k ∈ {1, …, M}, and γ_i,min^(s) ≥ β_i for every i ∈ {1, …, N}

wherein γ_k,min^(p) represents the minimum signal-to-interference-plus-noise ratio at the receiving end of authorized user PU_k; γ_i,min^(s) represents the minimum signal-to-interference-plus-noise ratio at the receiving end of secondary user SU_i; β_0 represents the SINR threshold of the authorized-user base station, and β_0 is a constant; β_i represents the SINR threshold of the secondary-user base station; M represents the number of authorized users, N represents the number of secondary users, and k and i are indices.
3. The method according to claim 1, wherein in step S2 the MOS model comprises a data MOS model and a video MOS model, and the data MOS model is:

Q_D = g log10( b r_i^(s) (1 - p_e2e) )

wherein Q_D represents the data-stream MOS, r_i^(s) represents the bit rate of secondary user SU_i, p_e2e represents the end-to-end packet loss rate, g and b are parameters, and g and b are obtained from the data quality perceived by the end user;

the video MOS model is a logistic function of the peak signal-to-noise ratio, wherein Q_V represents the MOS of the video, PSNR represents the peak signal-to-noise ratio, and c, d and f represent the parameters of the logistic function;

the MOS model is:

Q_μ = (1/N) ( Σ_{m=1}^{U} Q_D,m + Σ_{m=U+1}^{N} Q_V,m )

wherein Q_μ represents the average MOS, U represents the number of secondary users whose traffic is data, N - U represents the number of secondary users whose traffic is video, and m is an index.
4. The dynamic spectrum access method based on reinforcement learning with prior knowledge as claimed in claim 1, wherein in step S3 the specific method of acquiring the environment knowledge of the secondary users already present in the system and constructing the prior knowledge from it comprises: initializing the Q table of the newly added secondary user with the Q tables of the secondary users already present in the system, according to the formula:

Q_new = (1/K) Σ_{i=1}^{K} Q_i

wherein Q_new represents the Q table of the newly added secondary user, Q_i represents the Q table of the i-th secondary user already present in the system, K represents the number of Q tables of the secondary users already present in the system, and i is an index.
5. The dynamic spectrum access method based on reinforcement learning with prior knowledge of claim 1, wherein in step S4 the step of Q learning using the prior knowledge comprises:

s41: initializing the Q tables of all secondary users in the system, wherein a newly added secondary user initializes its Q table using the prior knowledge;

s42: selecting an action: the action is selected according to the formula

a_t^i = argmax_{a ∈ A_i} Q_i(s_t^i, a)

wherein s_t^i denotes the state of secondary user SU_i at time t and a_t^i denotes the action selected by secondary user SU_i at time t when its state is s_t^i;

s43: updating the state: the state set S_t = {s_t^(p), s_t^(s)} is defined, wherein s_t^(p) and s_t^(s) respectively indicate whether the signal-to-interference-plus-noise-ratio limits of the authorized users and of the secondary users are satisfied, and the state is updated using the rewritten minimum-SINR expressions, in which N represents the number of secondary users, i is an index, and ψ_i and α_i are intermediate process parameters;

s44: obtaining the return: the return value is updated using the formula

R_i*(s) = max_{a ∈ A_i} [ R_i(s, a) + γ Σ_{s'} p(s'|s, a) R_i*(s') ]

wherein R_i* represents the optimal return, R_i(s, a) is the mathematical expectation of r_i^MOS(s_t, a_t^i), r_i^MOS is the MOS-based reward, namely the reward obtained by user SU_i when it takes action a_t^i in state s_t, γ is the discount factor, p(s'|s, a) is the transition probability of reaching state s' from state s under action a, and π* is the optimal strategy;

s45: updating the Q value: the Q value is updated using the formula

Q_{t+1}^i(s, a_t^i) = (1 - α_t) Q_t^i(s, a_t^i) + α_t [ r_i^MOS(s_t, a_t^i) + γ max_{a' ∈ A_i} Q_t^i(s', a') ]

wherein α_t denotes the learning rate, Q_t^i(s, a_t^i) represents the Q value of SU_i at the current "state-action" pair, Q_t^i(s', a') denotes the Q value corresponding to the new state s' after taking action a_t^i, and A_i denotes the action set of secondary user SU_i;
s46: and repeating the steps S41-S45 until convergence.
6. The dynamic spectrum access method based on reinforcement learning with prior knowledge as claimed in claim 5, wherein in step S42 the action is selected using the formula

a* = rand(A) if rand(0,1) > ε; otherwise a* = argmax_{a ∈ A} Q(s, a)

wherein ε represents the exploration probability, a* represents the selected action, rand() represents a random function, and Q(s, a) represents the Q value when the state is s and the action is a; if the random number rand(0,1) is greater than the exploration probability, the agent directly selects one of all the actions at random and performs "exploration"; otherwise it performs "exploitation" and selects the action with the high Q value.
7. The method according to any one of claims 5-6, wherein the MOS-based reward is calculated as:

r_i^MOS(s_t, a_t^i) = (reward of a successful attempt) if the attempt succeeds; r_i^MOS(s_t, a_t^i) = T if the attempt fails

wherein r_i^MOS(s_t, a_t^i) represents the reward obtained by user SU_i when it takes action a_t^i in state s_t; a_t^i represents the action taken by SU_i in state s_t at time t, and A_i denotes the action set of secondary user SU_i; the first branch denotes the reward of a successful attempt; T is a constant smaller than the reward of a successful attempt and represents the reward of an unsuccessful attempt.
CN202010495810.4A 2020-06-03 2020-06-03 Dynamic spectrum access method based on reinforcement learning with priori knowledge Active CN111654342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010495810.4A CN111654342B (en) 2020-06-03 2020-06-03 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010495810.4A CN111654342B (en) 2020-06-03 2020-06-03 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Publications (2)

Publication Number Publication Date
CN111654342A true CN111654342A (en) 2020-09-11
CN111654342B CN111654342B (en) 2021-02-12

Family

ID=72345008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010495810.4A Active CN111654342B (en) 2020-06-03 2020-06-03 Dynamic spectrum access method based on reinforcement learning with priori knowledge

Country Status (1)

Country Link
CN (1) CN111654342B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112672359A (en) * 2020-12-18 2021-04-16 哈尔滨工业大学 Dynamic spectrum access method based on bidirectional long-and-short-term memory network
CN112672426A (en) * 2021-03-17 2021-04-16 南京航空航天大学 Anti-interference frequency point allocation method based on online learning
CN113207129A (en) * 2021-05-10 2021-08-03 重庆邮电大学 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
CN113423110A (en) * 2021-06-22 2021-09-21 东南大学 Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN113747447A (en) * 2021-09-07 2021-12-03 中国人民解放军国防科技大学 Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge
CN113939040A (en) * 2021-10-08 2022-01-14 中国人民解放军陆军工程大学 State updating method based on state prediction in cognitive Internet of things
CN114630333A (en) * 2022-03-16 2022-06-14 军事科学院系统工程研究院网络信息研究所 Multi-parameter statistical learning decision-making method in cognitive satellite communication
CN115086965A (en) * 2022-06-14 2022-09-20 中国人民解放军国防科技大学 Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization
CN115412105A (en) * 2022-05-06 2022-11-29 南京邮电大学 Reinforcement learning communication interference method based on USRP RIO

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238555A (en) * 2011-07-18 2011-11-09 南京邮电大学 Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio
CN109586820A (en) * 2018-12-28 2019-04-05 中国人民解放军陆军工程大学 The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101466111B (en) * 2009-01-13 2010-11-17 中国人民解放军理工大学通信工程学院 Dynamic spectrum access method based on policy planning constrain Q study
US10505616B1 (en) * 2018-06-01 2019-12-10 Samsung Electronics Co., Ltd. Method and apparatus for machine learning based wide beam optimization in cellular network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238555A (en) * 2011-07-18 2011-11-09 南京邮电大学 Collaborative learning based method for multi-user dynamic spectrum access in cognitive radio
CN109586820A (en) * 2018-12-28 2019-04-05 中国人民解放军陆军工程大学 The anti-interference model of dynamic spectrum and intensified learning Anti-interference algorithm in fading environment

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112672359A (en) * 2020-12-18 2021-04-16 哈尔滨工业大学 Dynamic spectrum access method based on bidirectional long-and-short-term memory network
CN112672359B (en) * 2020-12-18 2022-06-21 哈尔滨工业大学 Dynamic spectrum access method based on bidirectional long-and-short-term memory network
CN112672426A (en) * 2021-03-17 2021-04-16 南京航空航天大学 Anti-interference frequency point allocation method based on online learning
CN113207129B (en) * 2021-05-10 2022-05-20 重庆邮电大学 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
CN113207129A (en) * 2021-05-10 2021-08-03 重庆邮电大学 Dynamic spectrum access method based on confidence interval upper bound algorithm and DRL algorithm
CN113423110A (en) * 2021-06-22 2021-09-21 东南大学 Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN113423110B (en) * 2021-06-22 2022-04-12 东南大学 Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
CN113747447A (en) * 2021-09-07 2021-12-03 中国人民解放军国防科技大学 Double-action reinforcement learning frequency spectrum access method and system based on priori knowledge
CN113939040A (en) * 2021-10-08 2022-01-14 中国人民解放军陆军工程大学 State updating method based on state prediction in cognitive Internet of things
CN113939040B (en) * 2021-10-08 2023-04-28 中国人民解放军陆军工程大学 State updating method based on state prediction in cognitive Internet of things
CN114630333A (en) * 2022-03-16 2022-06-14 军事科学院系统工程研究院网络信息研究所 Multi-parameter statistical learning decision-making method in cognitive satellite communication
CN115412105A (en) * 2022-05-06 2022-11-29 南京邮电大学 Reinforcement learning communication interference method based on USRP RIO
CN115412105B (en) * 2022-05-06 2024-03-12 南京邮电大学 Reinforced learning communication interference method based on USRP RIO
CN115086965A (en) * 2022-06-14 2022-09-20 中国人民解放军国防科技大学 Dynamic spectrum allocation method and system based on element reduction processing and joint iteration optimization

Also Published As

Publication number Publication date
CN111654342B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN111654342B (en) Dynamic spectrum access method based on reinforcement learning with priori knowledge
Wang et al. A survey on applications of model-free strategy learning in cognitive wireless networks
CN111556572B (en) Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN110492955B (en) Spectrum prediction switching method based on transfer learning strategy
Hou et al. Joint allocation of wireless resource and computing capability in MEC-enabled vehicular network
Amichi et al. Spreading factor allocation strategy for LoRa networks under imperfect orthogonality
Li et al. A delay-aware caching algorithm for wireless D2D caching networks
CN114698128B (en) Anti-interference channel selection method and system for cognitive satellite-ground network
CN113423110B (en) Multi-user multi-channel dynamic spectrum access method based on deep reinforcement learning
Mohammadi et al. QoE-driven integrated heterogeneous traffic resource allocation based on cooperative learning for 5G cognitive radio networks
Kwasinski et al. Reinforcement learning for resource allocation in cognitive radio networks
Shashi Raj et al. Interference resilient stochastic prediction based dynamic resource allocation model for cognitive MANETs
Chandra et al. Joint resource allocation and power allocation scheme for MIMO assisted NOMA system
St Jean et al. Bayesian game-theoretic modeling of transmit power determination in a self-organizing CDMA wireless network
Lv et al. A dynamic spectrum access method based on Q-learning
CN116828534A (en) Intensive network large-scale terminal access and resource allocation method based on reinforcement learning
Mehta Recursive quadratic programming for constrained nonlinear optimization of session throughput in multiple‐flow network topologies
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
Sen et al. Rate adaptation techniques using contextual bandit approach for mobile wireless lan users
Chuan et al. Machine learning based popularity regeneration in caching-enabled wireless networks
Dogra et al. Reinforcement Learning (RL) for optimal power allocation in 6G Network
Ekwe et al. QoE-aware Q-learning resource allocation for spectrum reuse in 5G communications network
Wang et al. Experience cooperative sharing in cross-layer cognitive radio for real-time multimedia communication
Mishra et al. DDPG with Transfer Learning and Meta Learning Framework for Resource Allocation in Underlay Cognitive Radio Network
CN114828193B (en) Uplink and downlink multi-service concurrent power distribution method for wireless network and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant