CN113438723A - Competitive depth Q network power control method with high reward punishment - Google Patents
Competitive depth Q network power control method with high reward punishment
- Publication number
- CN113438723A (application CN202110701419.XA)
- Authority
- CN
- China
- Prior art keywords
- user
- signal
- secondary user
- noise ratio
- primary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/06—TPC algorithms
- H04W52/14—Separate analysis of uplink or downlink
- H04W52/146—Uplink power control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/18—TPC being performed according to specific parameters
- H04W52/24—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
- H04W52/241—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/18—TPC being performed according to specific parameters
- H04W52/24—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
- H04W52/242—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account path loss
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/18—TPC being performed according to specific parameters
- H04W52/24—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
- H04W52/245—TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account received signal strength
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/30—TPC using constraints in the total amount of available transmission power
- H04W52/34—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
- H04W52/346—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention provides a competitive deep Q-network (Dueling DQN) power control method with high reward and punishment. The reward function of the deep reinforcement learning process is improved: the spectrum-access outcomes of the secondary users are divided into grades, and different actions receive different reward values. The most correct action, a complete access success, is given a high reward, while the most erroneous action, a complete access failure, is given a high punishment, so the system explores a successful access strategy more quickly. Combining the competitive deep Q-network with this graded high-reward/high-punishment reward function and applying it to dynamic spectrum power control effectively improves system stability, increases the total throughput of the secondary users, reduces power loss and saves energy.
Description
Technical Field
The invention relates to the field of cognitive radio control methods, and in particular to a competitive deep Q-network power control method with high reward and punishment.
Background
With the rapid development and widespread use of wireless communication technology, the demand for spectrum resources keeps growing, while wireless spectrum resources are gradually being exhausted; this has become a major obstacle to the further development of wireless communication technology. Most spectrum resources are still allocated in the conventional, fixed manner: a specific frequency band is assigned to a specific user, and other users must be authorized before they can use it. Extensive research in academia and industry shows that, on the one hand, a large proportion of licensed spectrum is not actually used by its authorized users, many licensed bands sit idle, and the utilization of those bands is low; on the other hand, the spectrum resources in the public bands are over-contended and congested. How to resolve these contradictions in spectrum resource allocation and improve spectrum utilization is therefore of great importance.
The concept of Cognitive Radio (CR) was proposed to alleviate the problems of spectrum scarcity and low spectrum utilization. The cognitive cycle of cognitive radio consists of six steps: orientation (Orient), observation (Observe), learning (Learn), decision (Decide), planning (Plan) and action (Act). By observing and learning from the external environment, a cognitive radio intelligently adjusts its decisions and orientation, carries out the corresponding plans and actions, and thereby adapts to the external environment. For spectrum sharing, the core idea of cognitive radio is that, on the premise of causing no interference to the primary users (PU) that hold the spectrum-usage rights, secondary users (SU) sense the surrounding radio environment and access the spectrum opportunistically to improve spectrum utilization; through dynamic spectrum allocation this technology enables access to multiple frequency bands and makes full use of idle spectrum.
On the basis of Reinforcement Learning (RL), deep reinforcement learning algorithms, developed by combining RL with deep learning, have reached human-level performance in many artificial-intelligence domains such as Go, Dota 2 and StarCraft II. Specifically, the deep Q-network (DQN) combines the RL process with a deep neural network that approximates the action-value (Q) function; the neural network compensates for the limitations of Q-learning in generalization and function approximation. The competitive deep Q-network (Dueling DQN) is an algorithmic improvement on the ordinary DQN: the value of the state and the advantage value of each action in that state are summed and used as the re-estimated Q value.
In recent research, some researchers have applied the DQN algorithm to spectrum allocation, and simulation results show faster convergence and a lower packet-loss rate. To cope with unknown and dynamic industrial Internet-of-Things environments, an improved deep Q-learning network has been proposed for spectrum resource management in the industrial Internet of Things. The dueling deep reinforcement learning algorithm has also been applied, with good results, to problems such as predicting the heavy-metal content of soil. However, these deep reinforcement learning methods either do not consider the value of a state and the advantage value of the actions in that state at the same time, or do not grade the reward function according to how successful spectrum access is when the reward function is designed.
Disclosure of Invention
The invention provides a competitive deep Q-network power control method with high reward and punishment, which considers the values of states and actions simultaneously, sums them to re-estimate the Q value, and can effectively improve system stability.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
A competitive deep Q-network power control method with high reward and punishment comprises the following steps:
S1: the assisting base stations collect the communication information of the primary users and the secondary users and forward the obtained information to the secondary users;
S2: the transmit power selected by each secondary user in each time slot is set as an action value, and the action space is constructed;
S3: a graded reward function with high reward and punishment is constructed;
S4: a power control strategy is constructed.
Further, the specific process of step S1 is:
The primary users and the secondary users are in a non-cooperative relationship: the secondary users access the primary users' channels in underlay mode, and neither side knows the other's power transmission strategy. The assisting base stations play an important role in the signal transmission process; they are responsible for collecting the communication information of the primary and secondary users and forwarding the obtained information to the secondary users. Assuming that there are X assisting base stations in the environment, the state values are:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
The signal strength received by the k-th assisting base station is defined as:
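The exact expression for s_k(t) appears as a figure in the original publication; under the assumption of a standard log-distance path-loss model built from the variables defined below, one plausible form is:

$$s_k(t)=\sum_{i}P_i(t)\left(\frac{l_{ik}(t)}{l_0(t)}\right)^{-\tau}+\sum_{j}P_j(t)\left(\frac{l_{jk}(t)}{l_0(t)}\right)^{-\tau}+\sigma(t)$$

where the first sum collects the received power from the primary users and the second from the secondary users; this is a sketch of one consistent form, not necessarily the patent's exact expression.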
where l_ik(t) and l_jk(t) denote the distances at time t between the assisting base station and the i-th primary user and the j-th secondary user respectively, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system. At time t, secondary user k selects an action in state s_k(t) and then enters the next state.
Further, in step S2, the transmission power selected by the sub-users in each time slot is set as the action value, the transmission power of each sub-user is discretized, and each sub-user selects H different transmission values, so H is sharednA selectable action space, the action space defined as:
A(t) = [P_1(t), P_2(t), ..., P_N(t)].
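As an illustrative sketch of the discretized action space (the power levels and the number of secondary users are assumed values, not taken from the patent), the H^N joint actions can be enumerated as follows:

```python
from itertools import product

power_levels = [0.5, 1.0, 1.5, 2.0, 2.5]   # H = 5 assumed transmit-power levels (mW)
num_secondary_users = 3                     # N = 3 assumed secondary users

# Every combination of one transmit power per secondary user is one joint action
# A(t) = [P_1(t), ..., P_N(t)], giving H**N candidate actions in total.
action_space = list(product(power_levels, repeat=num_secondary_users))
assert len(action_space) == len(power_levels) ** num_secondary_users   # 5**3 = 125
```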
further, in step S3, four indexes are designed to evaluate the success level of the secondary user spectrum access, where the indexes are defined as follows:
where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user and secondary-user transmit powers on any access channel;
In step S3, whether the signal-to-noise ratio of every primary user is greater than its preset threshold is defined as the primary prerequisite for judging whether power control succeeds. If the signal-to-noise ratio of any primary user is not greater than its preset threshold, the spectrum access is directly judged a complete access failure (CF). If the signal-to-noise ratio of every primary user is greater than its threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure (SF). If the signal-to-noise ratio of every primary user is greater than its threshold, the signal-to-noise ratio of every secondary user is also greater than its threshold, and on every access channel the primary-user transmit power is greater than the sum of the secondary-user transmit powers, the access is called a complete access success (CS). If, with the other complete-success conditions unchanged, only part of the secondary users have signal-to-noise ratios above their thresholds, the access is called a secondary access success (SS). The specific formulas are expressed as follows:
according to the above grading conditions, the reward function is defined as:
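The reward expression itself appears as a figure in the original publication; a hedged reconstruction consistent with the four access grades and with the constraints a_1 > 10a_2 and a_3 > 10a_4 stated below is:

$$r(t)=\begin{cases}a_1, & \text{complete access success (CS)}\\ a_2, & \text{secondary access success (SS)}\\ -a_4, & \text{secondary access failure (SF)}\\ -a_3, & \text{complete access failure (CF)}\end{cases}$$

so that complete success earns the high reward and complete failure the high punishment; the exact constants and sign placement in the patent figure may differ.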
in the above formula, a1>10a2,a3>10a4And the reward function is graded according to the successful spectrum access condition, the secondary user is successfully accessed to give a high reward, and the secondary user is completely failed to be accessed to give a high punishment, so that the system can explore a successful access strategy more quickly.
Further, in step S4, the primary user is defined to transmit power according to the following strategy, where the power control strategy is as follows:
Under this strategy, at each time t the primary user controls its transmit power by gradual updating;
when the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio of primary user i at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
The secondary users access the primary users' channels in underlay mode; in order not to affect the primary users' normal communication, strict constraints are imposed on the secondary users' transmit power. To avoid affecting the primary users' normal communication, the secondary users must continuously learn from the data collected by the assisting base stations and then complete their communication transmission tasks with appropriate transmit power. The signal-to-noise ratio is an important index of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
The signal-to-noise ratio of the j-th secondary user is defined as:
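Both signal-to-noise-ratio expressions appear as figures in the original publication; assuming the standard underlay interference model implied by the channel gains defined below, one consistent reconstruction is:

$$\gamma_i(t)=\frac{h_{ii}(t)P_i(t)}{\sum_{j}h_{ji}(t)P_j(t)+N_i(t)},\qquad \gamma_j(t)=\frac{h_{jj}(t)P_j(t)}{h_{ij}(t)P_i(t)+\sum_{k\neq j}h_{kj}(t)P_k(t)+N_j(t)}$$

where, for the primary user, the interference comes from the secondary users and, for secondary user j, from the primary user and the other secondary users; this is a sketch, not necessarily the patent's exact expression.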
where h_ii(t) and h_jj(t) denote the channel gains of the i-th primary user and the j-th secondary user respectively, P_i(t) and P_j(t) denote the transmit powers of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user. The channel gains and the transmit powers change dynamically, and according to Shannon's theorem the relationship between the throughput of the j-th secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth. In the dynamically changing system, the signal-to-noise ratio of every primary user must stay above its preset threshold while the secondary users adjust their own transmit power through continuous learning, so that the best power-allocation effect is obtained and the total throughput of the secondary users in the whole system is maximized.
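The control objective described above can be summarized, as an interpretation rather than a formula taken from the patent, as:

$$\max_{\{P_j(t)\}}\ \sum_{j}T_j(t)=\sum_{j}W\log_2\bigl(1+\gamma_j(t)\bigr)\quad\text{s.t.}\quad\gamma_i(t)\ge\mu_i\ \ \forall i$$

i.e. maximize the total secondary-user throughput while every primary user's signal-to-noise ratio stays above its preset threshold.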
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves the reward function in the deep reinforcement learning process, carries out grade division according to the spectrum access condition of the secondary user, and gives different actions with different reward values. Giving high reward to the most successful action of the most correct access and giving high punishment to the most wrong action of the most failed access, thus leading the system to more quickly explore the strategy of successful access; the competitive depth Q network is combined with the graded reward function with high reward punishment, and the method is applied to the dynamic power control of the frequency spectrum, so that the stability of the system can be effectively improved, the total throughput of secondary users can be improved, the power loss is reduced, and the effect of saving energy is achieved.
Drawings
FIG. 1 is a diagram of the application system model in which the method of the present invention is implemented;
FIG. 2 is a diagram of the ordinary DQN network architecture;
FIG. 3 is a diagram of the Dueling DQN network architecture;
FIG. 4 compares the loss functions of three different deep reinforcement learning algorithms;
FIG. 5 shows the cumulative rewards of three different deep reinforcement learning algorithms over 40000 training iterations;
FIG. 6 shows the total secondary-user throughput of three different deep reinforcement learning algorithms over 40000 training iterations;
FIG. 7 shows the average secondary-user transmit power of the three different deep reinforcement learning algorithms.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, in an area centered on a primary base station (PBS), assume that the cognitive wireless network contains M primary users (PU) and N secondary users (SU) (N > M), 1 primary base station and several assisting base stations (ABS), all randomly distributed in the network environment. The primary base station guarantees the normal communication of the primary users, and the assisting base stations collect the received-signal-strength information of the primary and secondary users and forward the collected data to the secondary users.
In this model, the secondary users access the primary users' channels in underlay mode; in order not to affect the primary users' normal communication, strict constraints are imposed on the secondary users' transmit power. To avoid affecting the primary users' normal communication, the secondary users must continuously learn from the data collected by the assisting base stations and then complete their communication transmission tasks with appropriate transmit power.
The signal-to-noise ratio is an important index of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
The signal-to-noise ratio of the j-th secondary user is defined as:
where h_ii(t) and h_jj(t) denote the channel gains of the i-th primary user and the j-th secondary user respectively, P_i(t) and P_j(t) denote the transmit powers of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
The channel gains, transmit powers and other quantities of the model change dynamically. According to Shannon's theorem, the relationship between the throughput of the j-th secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W log_2(1 + γ_j(t)) (3)
where W denotes the signal bandwidth. In the dynamically changing system, the signal-to-noise ratio of every primary user must stay above its preset threshold while the secondary users adjust their own transmit power through continuous learning, so that the best power-allocation effect is obtained and the total throughput of the secondary users in the whole system is maximized.
The invention adopts the Dueling DQN with an improved reward function to perform dynamic spectrum power control: the secondary users adaptively adjust their own transmit power according to the information obtained from the assisting base stations, thereby completing the dynamic power control of the cognitive wireless network.
The Dueling DQN algorithm has the same network structure as the ordinary DQN algorithm: an environment, a replay memory unit, two neural networks with identical structure but different parameters, and an error function. Handling the spectrum power-control problem with a deep-reinforcement-learning method is essentially a Markov decision process. The ordinary DQN approximates the optimal control strategy with an action-value function Q(s, a):
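The expression referenced as formula (4) appears as a figure in the original publication; the standard Q-learning update and DQN target-network loss that it most plausibly corresponds to are (with γ here denoting the discount factor, not a signal-to-noise ratio):

$$Q(s,a)\leftarrow Q(s,a)+\alpha\Bigl[r+\gamma\max_{a'}Q(s',a')-Q(s,a)\Bigr],\qquad L(\theta)=\mathbb{E}\Bigl[\bigl(r+\gamma\max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)\bigr)^{2}\Bigr]$$

where θ^- denotes the parameters of the periodically updated target network; this is a reconstruction, not a verbatim copy of the patent's formula.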
The core feature that distinguishes the competitive (dueling) deep Q-network from the ordinary deep Q-network is that the value of the state and the advantage value of the action in that state are summed and used as the re-estimated Q value, expressed as:
Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α) (5)
Comparing the network structures of DQN and Dueling DQN shown in FIG. 2 and FIG. 3, Dueling DQN has two data streams before the output layer: one stream outputs the state value and the other outputs the advantage value of each action.
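A minimal PyTorch sketch of the dueling architecture of equation (5) and FIG. 3; the layer sizes, hidden width and example dimensions are illustrative assumptions, and the mean-subtraction of the advantage stream is the standard identifiability fix from the Dueling DQN literature rather than something written explicitly in equation (5):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # state value V(s)
        self.advantage_stream = nn.Linear(hidden, num_actions)   # advantages A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = self.feature(state)
        value = self.value_stream(x)                   # shape (batch, 1)
        advantage = self.advantage_stream(x)           # shape (batch, num_actions)
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Usage sketch: the state is the vector of signal strengths reported by the X assisting
# base stations, and the actions are the H^N joint transmit-power choices (assumed sizes).
net = DuelingDQN(state_dim=4, num_actions=125)         # e.g. X = 4, H = 5, N = 3
q_values = net(torch.randn(1, 4))                      # one Q value per joint action
```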
1) State
The primary users and the secondary users of the system model are in a non-cooperative relationship: the secondary users access the primary users' channels in underlay mode, and neither side knows the other's power transmission strategy. In the signal transmission process, the assisting base stations play an important role: they are responsible for collecting the communication information of the primary and secondary users and forwarding the obtained information to the secondary users. Assuming that there are X assisting base stations in the environment, the state values are:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)] (6)
The signal strength received by the k-th assisting base station is defined as:
where l_ik(t) and l_jk(t) denote the distances at time t between the assisting base station and the i-th primary user and the j-th secondary user respectively, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.
At time t, secondary user k selects an action in state s_k(t) and then enters the next state.
2) Action
The transmit power selected by each secondary user in each time slot is set as the action value; each secondary user's transmit power takes discretized values, and each secondary user can choose among H different transmit-power values, so the system model has H^N selectable joint actions in total. The action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_N(t)] (8)
3) Graded reward function with high reward and punishment
One of the key issues in enabling the secondary users to adaptively select appropriate transmit power and achieve spectrum sharing is the design of an effective reward function. From a practical perspective, four indexes are designed to judge the success level of secondary-user spectrum access. The indexes are defined as follows:
where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user and secondary-user transmit powers on any access channel;
Whether the signal-to-noise ratio of every primary user exceeds its preset threshold is defined as the primary prerequisite for judging whether power control succeeds: if the signal-to-noise ratio of any primary user is not greater than its preset threshold, the spectrum access is directly judged a Complete Failure (CF). If the signal-to-noise ratio of every primary user is greater than its threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a Secondary Failure (SF). If the signal-to-noise ratio of every primary user is greater than its threshold, the signal-to-noise ratio of every secondary user is also greater than its threshold, and on every access channel the primary-user transmit power is greater than the sum of the secondary-user transmit powers, the access is called a Complete Success (CS). If, with the other CS conditions unchanged, only part of the secondary users have signal-to-noise ratios above their thresholds, the access is called a Secondary Success (SS). The specific formulas are expressed as follows:
according to the above grading conditions, the reward function is defined as:
In the above formula, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the spectrum-access outcome: a fully successful access by the secondary users is given a high reward and a complete access failure is given a high punishment, so that the system can explore a successful access strategy more quickly.
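A minimal Python sketch of the four-level grading and the corresponding graded reward; the concrete reward constants are illustrative assumptions that merely satisfy a_1 > 10a_2 and a_3 > 10a_4, and the power condition is simplified to a single comparison per slot:

```python
def graded_reward(pu_snr, su_snr, mu_pu, mu_su, pu_power_sum, su_power_sum,
                  a1=20.0, a2=1.0, a3=20.0, a4=1.0):
    """Classify one time slot as CS / SS / SF / CF and return (level, reward).

    pu_snr, su_snr : lists of primary- and secondary-user signal-to-noise ratios
    mu_pu, mu_su   : the corresponding preset thresholds
    """
    if not all(g >= m for g, m in zip(pu_snr, mu_pu)):
        return "CF", -a3                       # complete access failure: high punishment
    su_above = [g >= m for g, m in zip(su_snr, mu_su)]
    if not any(su_above):
        return "SF", -a4                       # secondary access failure
    if all(su_above) and pu_power_sum > su_power_sum:
        return "CS", a1                        # complete access success: high reward
    return "SS", a2                            # secondary (partial) access success

# Example: all users above threshold and primary power dominating -> complete success.
print(graded_reward([12.0, 11.0], [6.0, 7.0], [10.0, 10.0], [5.0, 5.0], 3.0, 2.0))
```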
4) Policy
The primary users are defined to transmit power according to the following strategy; the power control strategy is as follows:
Under this strategy, at each time t the primary user controls its transmit power by gradual updating. When the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio of primary user i at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
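The prediction formula for γ'_i(t) appears as a figure in the original publication and is not reconstructed here; a minimal sketch of the step-update policy itself, which takes the predicted value as an input and uses an assumed step size and power range, is:

```python
def update_primary_power(p_current, snr_now, snr_predicted, mu,
                         step=0.1, p_min=0.1, p_max=5.0):
    """One gradual power update for a primary user at time t (step and limits are assumed)."""
    if snr_now <= mu and snr_predicted >= mu:
        p_next = p_current + step    # below threshold now, predicted to satisfy it: raise power
    elif snr_now >= mu and snr_predicted >= mu:
        p_next = p_current - step    # above threshold and predicted to stay above: save power
    else:
        p_next = p_current           # otherwise keep the current transmit power
    return min(max(p_next, p_min), p_max)
```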
the competition depth Q network power control method based on the high reward punishment is provided, experimental simulation is carried out on a Python platform, and the method is a delay DQN algorithm for improving a reward function, so the method is referred to as the delay DQN algorithm hereinafter and in experiments. And comparing the performance of the native DQN algorithm, the double DQN algorithm and the blanking DQN algorithm under the same simulation environment. Each algorithm iterates 40000 times, and the performance results of each index are displayed once every 1000 times. FIG. 4 is a graph comparing the loss functions of three different depth reinforcement learning algorithms, and it can be seen that all three eventually converge. However, the native DQN algorithm and the double DQN algorithm are unstable, the loss fluctuation is large, and the convergence speed is slow. The blanking DQN algorithm proposed herein can converge at a relatively fast speed and the loss values are kept in a very small range.
FIG. 5 and FIG. 6 show the cumulative reward and the total secondary-user throughput over 40000 training iterations for the three deep reinforcement learning algorithms. Comparing the three algorithms shows that, relative to the standard DQN and Double DQN algorithms, the proposed Dueling DQN algorithm discovers successful secondary-user access actions from about the 5th round, obtains positive rewards and keeps increasing its cumulative reward, so it learns the correct actions quickly and has an obvious advantage. In addition, on the total secondary-user throughput index, the proposed algorithm achieves the largest total throughput and the best performance.
FIG. 7 shows the average secondary-user transmit power of the three algorithms. Overall, the standard DQN algorithm has the highest average transmit power, and the Double DQN algorithm's average transmit power is almost always above 2.0 mW, while the proposed Dueling DQN algorithm's average transmit power is the lowest, mostly between 1.5 mW and 2.0 mW, with only a few values above 2.0 mW. The simulation results show that, considering all the indexes together, the proposed Dueling DQN algorithm achieves the largest total secondary-user throughput in dynamic power control and the lowest average transmit power while guaranteeing successful secondary-user spectrum access, so it can effectively reduce power loss and save energy.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A competitive deep Q-network power control method with high reward and punishment, characterized by comprising the following steps:
S1: the assisting base stations collect the communication information of the primary users and the secondary users and forward the obtained information to the secondary users;
S2: the transmit power selected by each secondary user in each time slot is set as an action value, and the action space is constructed;
S3: a graded reward function with high reward and punishment is constructed;
S4: a power control strategy is constructed.
2. The competitive deep Q-network power control method with high reward and punishment according to claim 1, wherein the specific process of step S1 is:
the primary users and the secondary users are in a non-cooperative relationship, the secondary users access the primary users' channels in underlay mode, and neither side knows the other's power transmission strategy; the assisting base stations play an important role in the signal transmission process and are responsible for collecting the communication information of the primary and secondary users and forwarding the obtained information to the secondary users; assuming that there are X assisting base stations in the environment, the state values are:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the k-th assisting base station is defined as:
where l_ik(t) and l_jk(t) denote the distances at time t between the assisting base station and the i-th primary user and the j-th secondary user respectively, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.
3. The competitive deep Q-network power control method with high reward and punishment according to claim 2, wherein in step S1, at time t, secondary user k selects an action in state s_k(t) and then enters the next state.
4. The method according to claim 3, wherein in step S2, the transmit power selected by each secondary user in each time slot is set as the action value, each secondary user's transmit power is discretized, and each secondary user selects among H different transmit-power values, so there are H^N selectable joint actions in total, the action space being defined as:
A(t) = [P_1(t), P_2(t), ..., P_N(t)].
5. The method according to claim 4, wherein in step S3, four indexes are designed to judge the success level of secondary-user spectrum access, the indexes being defined as follows:
where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user and secondary-user transmit powers on any access channel.
6. The method according to claim 5, wherein in step S3, whether the signal-to-noise ratio of every primary user is greater than its preset threshold is defined as the primary prerequisite for judging whether power control succeeds; if the signal-to-noise ratio of any primary user is not greater than its preset threshold, the spectrum access is directly judged a complete access failure CF; if the signal-to-noise ratio of every primary user is greater than its threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure SF; if the signal-to-noise ratio of every primary user is greater than its threshold, the signal-to-noise ratio of every secondary user is also greater than its threshold, and on every access channel the primary-user transmit power is greater than the sum of the secondary-user transmit powers, the access is called a complete access success CS; if, with the other complete-success conditions unchanged, only part of the secondary users have signal-to-noise ratios above their thresholds, the access is called a secondary access success SS, the specific formulas being expressed as follows:
according to the above grading conditions, the reward function is defined as:
in the above formula, a_1 > 10a_2 and a_3 > 10a_4; the reward function is graded according to the spectrum-access outcome, a fully successful access by the secondary users being given a high reward and a complete access failure being given a high punishment, so that the system can explore a successful access strategy more quickly.
7. The method according to claim 6, wherein in step S4, the primary users are defined to transmit power according to the following strategy, the power control strategy being:
under this strategy, at each time t the primary user controls its transmit power by gradual updating.
8. The competitive deep Q-network power control method with high reward and punishment according to claim 7, wherein, when the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio of primary user i at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged; the predicted signal-to-noise ratio of primary user i at time t+1 being:
9. The method according to claim 8, wherein the secondary users access the primary users' channels in underlay mode and, in order not to affect the primary users' normal communication, strict constraints are imposed on the secondary users' transmit power; to avoid affecting the primary users' normal communication, the secondary users continuously learn from the data collected by the assisting base stations and then complete their communication transmission tasks with appropriate transmit power; the signal-to-noise ratio is an important index of link quality, the signal-to-noise ratio of the i-th primary user being defined as:
the signal-to-noise ratio of the j-th secondary user being defined as:
where h_ii(t) and h_jj(t) denote the channel gains of the i-th primary user and the j-th secondary user respectively, P_i(t) and P_j(t) denote the transmit powers of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
10. The method according to claim 9, wherein the channel gains and the transmit powers change dynamically and, according to Shannon's theorem, the relationship between the throughput of the j-th secondary user and its signal-to-noise ratio is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth; in the dynamically changing system, the signal-to-noise ratio of every primary user is kept above its preset threshold while the secondary users adjust their own transmit power through continuous learning, so that the best power-allocation effect is guaranteed and the total throughput of the secondary users in the whole system is maximized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110701419.XA CN113438723B (en) | 2021-06-23 | 2021-06-23 | Competition depth Q network power control method with high rewarding punishment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110701419.XA CN113438723B (en) | 2021-06-23 | 2021-06-23 | Competition depth Q network power control method with high rewarding punishment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113438723A true CN113438723A (en) | 2021-09-24 |
CN113438723B CN113438723B (en) | 2023-04-28 |
Family
ID=77753705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110701419.XA Active CN113438723B (en) | 2021-06-23 | 2021-06-23 | Competition depth Q network power control method with high rewarding punishment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113438723B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116321390A (en) * | 2023-05-23 | 2023-06-23 | 北京星河亮点技术股份有限公司 | Power control method, device and equipment |
CN117545094A (en) * | 2024-01-09 | 2024-02-09 | 大连海事大学 | Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013000169A1 (en) * | 2011-06-29 | 2013-01-03 | 中国人民解放军理工大学 | Resource allocation method for maximizing throughput in cooperative cognitive simo network |
WO2013000167A1 (en) * | 2011-06-29 | 2013-01-03 | 中国人民解放军理工大学 | Cognitive single-input multi-output network access method base on cooperative relay |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN111262638A (en) * | 2020-01-17 | 2020-06-09 | 合肥工业大学 | Dynamic spectrum access method based on efficient sample learning |
WO2020134507A1 (en) * | 2018-12-28 | 2020-07-02 | 北京邮电大学 | Routing construction method for unmanned aerial vehicle network, unmanned aerial vehicle, and storage medium |
CN111726811A (en) * | 2020-05-26 | 2020-09-29 | 国网浙江省电力有限公司嘉兴供电公司 | Slice resource allocation method and system for cognitive wireless network |
WO2020244906A1 (en) * | 2019-06-03 | 2020-12-10 | Nokia Solutions And Networks Oy | Uplink power control using deep q-learning |
CN112367132A (en) * | 2020-10-27 | 2021-02-12 | 西北工业大学 | Power distribution algorithm in cognitive radio based on reinforcement learning solution |
- 2021-06-23: CN application CN202110701419.XA; patent CN113438723B (en), status: active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013000169A1 (en) * | 2011-06-29 | 2013-01-03 | 中国人民解放军理工大学 | Resource allocation method for maximizing throughput in cooperative cognitive simo network |
WO2013000167A1 (en) * | 2011-06-29 | 2013-01-03 | 中国人民解放军理工大学 | Cognitive single-input multi-output network access method base on cooperative relay |
WO2020134507A1 (en) * | 2018-12-28 | 2020-07-02 | 北京邮电大学 | Routing construction method for unmanned aerial vehicle network, unmanned aerial vehicle, and storage medium |
WO2020244906A1 (en) * | 2019-06-03 | 2020-12-10 | Nokia Solutions And Networks Oy | Uplink power control using deep q-learning |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
CN111262638A (en) * | 2020-01-17 | 2020-06-09 | 合肥工业大学 | Dynamic spectrum access method based on efficient sample learning |
CN111726811A (en) * | 2020-05-26 | 2020-09-29 | 国网浙江省电力有限公司嘉兴供电公司 | Slice resource allocation method and system for cognitive wireless network |
CN112367132A (en) * | 2020-10-27 | 2021-02-12 | 西北工业大学 | Power distribution algorithm in cognitive radio based on reinforcement learning solution |
Non-Patent Citations (2)
Title |
---|
ZIFENG YE; YONGHUA WANG; PIN WAN: "Joint Channel Allocation and Power Control Based on Long", COMPLEXITY *
JIANG Taotao; ZHU Jiang: "Joint channel selection and power control based on multi-user Q-learning in CNR" *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116321390A (en) * | 2023-05-23 | 2023-06-23 | 北京星河亮点技术股份有限公司 | Power control method, device and equipment |
CN117545094A (en) * | 2024-01-09 | 2024-02-09 | 大连海事大学 | Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network |
CN117545094B (en) * | 2024-01-09 | 2024-03-26 | 大连海事大学 | Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network |
Also Published As
Publication number | Publication date |
---|---|
CN113438723B (en) | 2023-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Joint interference alignment and power control for dense networks via deep reinforcement learning | |
Liao et al. | A model-driven deep reinforcement learning heuristic algorithm for resource allocation in ultra-dense cellular networks | |
CN113438723A (en) | Competitive depth Q network power control method with high reward punishment | |
Alpcan et al. | Power control for multicell CDMA wireless networks: A team optimization approach | |
CN110267274B (en) | Spectrum sharing method for selecting sensing users according to social credibility among users | |
Ren et al. | DDPG based computation offloading and resource allocation for MEC systems with energy harvesting | |
CN113795050B (en) | Sum Tree sampling-based deep double-Q network dynamic power control method | |
Ye et al. | Learning-based computing task offloading for autonomous driving: A load balancing perspective | |
Trrad et al. | Application of fuzzy logic to cognitive wireless communications | |
Ma et al. | On-demand resource management for 6G wireless networks using knowledge-assisted dynamic neural networks | |
Sanusi et al. | Development of handover decision algorithms in hybrid Li-Fi and Wi-Fi networks | |
Liu et al. | Deep reinforcement learning-based MEC offloading and resource allocation in uplink NOMA heterogeneous network | |
Joshi et al. | Optimized fuzzy power control over fading channels in spectrum sharing cognitive radio using ANFIS | |
Yan et al. | QoE-based semantic-aware resource allocation for multi-task networks | |
Tashman et al. | Performance optimization of energy-harvesting underlay cognitive radio networks using reinforcement learning | |
CN114219074A (en) | Wireless communication network resource allocation algorithm dynamically adjusted according to requirements | |
Mendoza et al. | Deep reinforcement learning for dynamic access point activation in cell-free MIMO networks | |
CN113115355A (en) | Power distribution method based on deep reinforcement learning in D2D system | |
Alajmi et al. | An efficient actor critic drl framework for resource allocation in multi-cell downlink noma | |
CN110149608B (en) | DAI-based resource allocation method for optical wireless sensor network | |
CN116470598A (en) | Wireless textile body area network energy neutral operation method based on deep reinforcement learning | |
CN113890653A (en) | Multi-agent reinforcement learning power distribution method for multi-user benefits | |
CN115633402A (en) | Resource scheduling method for mixed service throughput optimization | |
Chang et al. | Fuzzy/neural congestion control for integrated voice and data DS-CDMA/FRMA cellular networks | |
Sabitha et al. | Design and analysis of fuzzy logic and neural network based transmission power control techniques for energy efficient wireless sensor networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2023-12-12
Address after: 518021, A807, Jihao Building, No. 1086 Shennan East Road, Fenghuang Community, Huangbei Street, Luohu District, Shenzhen City, Guangdong Province
Patentee after: Shenzhen Tuo Ai Wei Information Technology Co., Ltd.
Address before: 510090, 729 Dongfeng East Road, Yuexiu District, Guangzhou City, Guangdong Province
Patentee before: GUANGDONG University OF TECHNOLOGY
TR01 | Transfer of patent right |