CN113438723A - Dueling deep Q-network power control method with high reward and punishment - Google Patents

Dueling deep Q-network power control method with high reward and punishment Download PDF

Info

Publication number
CN113438723A
CN113438723A · CN202110701419.XA · CN202110701419A
Authority
CN
China
Prior art keywords
user
signal
secondary user
noise ratio
primary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110701419.XA
Other languages
Chinese (zh)
Other versions
CN113438723B (en)
Inventor
刘骏
刘德荣
王永华
林得有
王宇慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tuo Ai Wei Information Technology Co., Ltd.
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110701419.XA priority Critical patent/CN113438723B/en
Publication of CN113438723A publication Critical patent/CN113438723A/en
Application granted granted Critical
Publication of CN113438723B publication Critical patent/CN113438723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/146Uplink power control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/242TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account path loss
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/245TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/30TPC using constraints in the total amount of available transmission power
    • H04W52/34TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W52/346TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a dueling deep Q-network power control method with high reward and punishment. It improves the reward function used in the deep reinforcement learning process, grading it according to the spectrum-access outcome of the secondary users and assigning different reward values to different actions: the completely successful access action receives a high reward, and the completely failed access action receives a high punishment, so the system explores a successful access strategy more quickly. Combining the dueling deep Q-network with this graded high-reward-and-punishment reward function and applying it to dynamic spectrum power control effectively improves system stability, increases the total throughput of the secondary users, reduces power loss, and saves energy.

Description

Dueling deep Q-network power control method with high reward and punishment
Technical Field
The invention relates to the field of cognitive radio control methods, and in particular to a dueling deep Q-network power control method with high reward and punishment.
Background
With the rapid development and wide deployment of wireless communication technology, the demand for spectrum resources keeps growing, while the available wireless spectrum is gradually being exhausted; this has become a major obstacle to the further development of wireless communication. Most spectrum resources are still allocated in the conventional, fixed manner: a specific frequency band is assigned to a specific licensed user, and other users must be authorized before they may use it. Extensive research in academia and industry shows that, on the one hand, a large share of licensed spectrum is not actually used by its authorized users and many licensed bands sit idle, so their utilization is low; on the other hand, the public bands are excessively contended for and congested. How to resolve these contradictions in spectrum allocation and improve spectrum utilization is therefore of great importance.
Cognitive Radio (CR) technology was proposed to alleviate the shortage of spectrum resources and the low spectrum utilization. The cognitive cycle of cognitive radio consists of six steps: Orient, Observe, Learn, Decide, Plan and Act. By observing and learning from the external environment, a cognitive radio intelligently adjusts its decisions and orientation, carries out the corresponding plans and actions, and thereby adapts to the environment. For spectrum sharing, the core idea of cognitive radio is that, on the premise of causing no interference to the primary users (PU) that hold the spectrum licenses, secondary users (SU) sense the surrounding radio environment and access the spectrum opportunistically to improve spectrum utilization. Through dynamic spectrum allocation, this technology enables access to multiple frequency bands and makes full use of idle spectrum.
Building on reinforcement learning (RL), deep reinforcement learning algorithms that combine RL with deep learning have reached human-level performance in many artificial-intelligence domains, such as Go, Dota and StarCraft II. Specifically, the deep Q-network (DQN) combines the RL process with a deep neural network to approximate the action-value function Q, and the neural network compensates for the limited generalization and function-approximation ability of Q-learning. The dueling deep Q-network (Dueling DQN) improves on the ordinary DQN by re-estimating the Q value as the sum of the value of a state and the advantage of the action taken in that state.
In recent research, the DQN algorithm has been applied to spectrum allocation, and simulation results show faster convergence and a lower packet-loss rate. To cope with unknown, dynamic industrial Internet-of-Things environments, an improved deep Q-learning network has been proposed for spectrum resource management in the industrial Internet of Things. A dueling deep reinforcement learning algorithm has also been applied to predicting the heavy-metal content of soil with good results. However, these deep reinforcement learning methods either do not consider the value of a state and the value of the action taken in that state at the same time, or do not grade the reward function according to how successful the spectrum access is.
Disclosure of Invention
The invention provides a dueling deep Q-network power control method with high reward and punishment, which considers the values of states and actions simultaneously, sums them for re-evaluation, and can effectively improve system stability.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a competition depth Q network power control method with high reward punishment comprises the following steps:
s1: the auxiliary base station collects communication information of a primary user and a secondary user and transmits the obtained information to the secondary user;
s2: setting the transmitting power selected by the secondary user in each time slot as an action value, and constructing an action space;
s3: constructing a graded reward function with high reward punishment;
s4: and constructing a power control strategy.
Further, the specific process of step S1 is:
the primary users and the secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy. The auxiliary base stations play an important role in signal transmission; they collect the communication information of the primary and secondary users and transmit the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the k-th auxiliary base station is defined as:

[equation image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system. At time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
Further, in step S2, the transmission power selected by the sub-users in each time slot is set as the action value, the transmission power of each sub-user is discretized, and each sub-user selects H different transmission values, so H is sharednA selectable action space, the action space defined as:
A(t)=[P1(t),P2(t),…,Pn(t)]。
Further, in step S3, four indexes are designed to evaluate the success level of secondary-user spectrum access, defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]

where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel;
in step S3, the prerequisite for judging whether power control succeeds is that the signal-to-noise ratio of every primary user exceeds its preset threshold; if this does not hold, the spectrum access is directly judged a complete failure (CF). If every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure (SF). If every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a complete access success (CS). If, under the complete-access-success conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a secondary access success (SS). The formal definition is:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
In the above formula, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the outcome of spectrum access: a completely successful access by the secondary users receives a high reward and a completely failed access receives a high punishment, so the system explores a successful access strategy more quickly.
Further, in step S4, the primary user is defined to transmit power according to the following strategy, where the power control strategy is as follows:
[equation images: the primary user's step-wise power update rule]
Under this strategy, the primary user updates its transmit power step by step at each time t:
when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
A secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power. The secondary user must therefore continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power. The signal-to-noise ratio is an important measure of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
[equation image: γ_i(t), the signal-to-noise ratio of the i-th primary user], i = 1, 2, ..., M
defining the signal-to-noise ratio of the jth secondary user as:
[equation image: γ_j(t), the signal-to-noise ratio of the j-th secondary user], j = 1, 2, ..., N
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user. The channel gains and transmit powers change dynamically, and according to Shannon's theorem the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth. In this dynamically changing system, the goal is to achieve the best power allocation: the signal-to-noise ratio of every primary user stays above its preset threshold, while the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves the reward function in the deep reinforcement learning process, carries out grade division according to the spectrum access condition of the secondary user, and gives different actions with different reward values. Giving high reward to the most successful action of the most correct access and giving high punishment to the most wrong action of the most failed access, thus leading the system to more quickly explore the strategy of successful access; the competitive depth Q network is combined with the graded reward function with high reward punishment, and the method is applied to the dynamic power control of the frequency spectrum, so that the stability of the system can be effectively improved, the total throughput of secondary users can be improved, the power loss is reduced, and the effect of saving energy is achieved.
Drawings
FIG. 1 is a diagram of a model of an application system in which the method of the present invention is implemented;
fig. 2 is a diagram of a general DQN network architecture;
FIG. 3 is a diagram of the Dueling DQN network architecture;
FIG. 4 is a comparison of the loss functions of three different deep reinforcement learning algorithms;
FIG. 5 is a graph of the cumulative reward over 40000 training episodes for the three deep reinforcement learning algorithms;
FIG. 6 is a graph of the total secondary-user throughput over 40000 training episodes for the three deep reinforcement learning algorithms;
FIG. 7 is a graph of the average secondary-user transmit power for the three deep reinforcement learning algorithms.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, consider an area centered on a primary base station (PBS). Assume the cognitive wireless network contains M primary users (PU) and N secondary users (SU) (N > M), one primary base station and several auxiliary base stations (ABS), all randomly distributed in the network environment. The primary base station guarantees the normal communication of the primary users, while the auxiliary base stations collect the received-signal-strength information of the primary and secondary users and forward the collected data to the secondary users.
In this model a secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power. The secondary user must therefore continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power.
The signal-to-noise ratio is an important measure of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
[equation (1), given as an image: γ_i(t), the signal-to-noise ratio of the i-th primary user]
defining the signal-to-noise ratio of the jth secondary user as:
[equation (2), given as an image: γ_j(t), the signal-to-noise ratio of the j-th secondary user]
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
The channel gains, transmit powers and other quantities of the model change dynamically. According to Shannon's theorem, the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))    (3)
where W denotes the signal bandwidth. In this dynamically changing system, the goal is to achieve the best power allocation: the signal-to-noise ratio of every primary user stays above its preset threshold, while the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
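As an illustration, the following minimal Python sketch computes the primary- and secondary-user signal-to-noise ratios and the Shannon throughput of eq. (3). The interference-plus-noise form of eqs. (1)-(2) is an assumption inferred from the variable definitions above (the patent gives those equations only as images), and all function and parameter names are illustrative.

```python
import numpy as np

def primary_sinr(h_pp, P_p, h_sp, P_s, noise_p):
    # h_pp: (M,) direct-link gains of the primary users; P_p: (M,) primary transmit powers
    # h_sp: (N, M) gains from secondary user j to primary receiver i; P_s: (N,) secondary powers
    # noise_p: (M,) ambient noise at each primary receiver
    interference = h_sp.T @ P_s                          # secondary-user interference at each primary receiver
    return (h_pp * P_p) / (interference + noise_p)

def secondary_sinr(h_ss, P_s, h_ps, P_p, h_cross, noise_s):
    # h_ss: (N,) direct-link gains of the secondary users
    # h_ps: (M, N) gains from primary user i to secondary receiver j
    # h_cross: (N, N) gains between secondary users (own diagonal excluded below)
    from_primary = h_ps.T @ P_p                          # primary-user interference at each secondary receiver
    cross = h_cross.T @ P_s - np.diag(h_cross) * P_s     # other secondary users' interference
    return (h_ss * P_s) / (from_primary + cross + noise_s)

def throughput(gamma, W=1e6):
    """Shannon throughput of eq. (3): T_j = W * log2(1 + gamma_j)."""
    return W * np.log2(1.0 + gamma)
```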
The aim of the invention is to use the Dueling DQN with an improved reward function for dynamic spectrum power control: the secondary users adaptively adjust their own transmit power according to the information obtained from the auxiliary base stations, thereby accomplishing dynamic power control of the cognitive wireless network.
Like the ordinary DQN algorithm, the Dueling DQN algorithm uses the same overall structure: an environment, a replay memory unit, two neural networks with identical structure but different parameters, and an error function. Handling the spectrum power control problem with deep reinforcement learning is essentially a Markov decision process. The ordinary DQN approximates the optimal control strategy with an action-value function Q(s, a):
[equation (4), given as an image: the action-value function Q(s, a)]
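To illustrate the structure just described (replay memory, two same-structure networks and an error function), the following PyTorch sketch performs one temporal-difference update. The tuple layout of the replay memory, the batch size and the discount factor are assumptions for illustration, not values taken from the patent.

```python
import random
import torch
import torch.nn as nn

def dqn_update(policy_net, target_net, memory, optimizer, batch_size=32, gamma=0.9):
    """One TD update drawn from the replay memory of (state, action, reward, next_state) tuples."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch]).unsqueeze(1)
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])

    q_sa = policy_net(states).gather(1, actions).squeeze(1)    # Q(s, a) from the online network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values     # max_a' Q(s', a') from the target network
    target = rewards + gamma * q_next

    loss = nn.functional.mse_loss(q_sa, target)                # the error function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```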
the value of the state and the action advantage value in the state are summed to be used as a Q value for reevaluation, and the core content of the competitive deep Q network different from the common deep Q network is represented as follows:
Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)    (5)
comparing the network structures of DQN and dulling DQN as shown in fig. 2 and fig. 3, it can be known that dulling DQN has two data streams before the output layer, one data stream outputs the Q value of the state, and the other data stream outputs the advantaged value of the action.
1) State
In the system model the primary and secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy. The auxiliary base stations play an important role in signal transmission; they collect the communication information of the primary and secondary users and transmit the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]    (6)
The signal strength received by the k-th auxiliary base station is defined as:

[equation (7), given as an image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.

At time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
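A sketch of assembling the state vector S(t) from the X auxiliary base stations' observations is given below. Because eq. (7) is only available as an image, the log-distance path-loss form P·(l/l_0)^(−τ) is an assumed model suggested by the listed variables (l_ik, l_jk, l_0, τ, σ); all names are illustrative.

```python
import numpy as np

def received_strength(P_pu, d_pu, P_su, d_su, l0, tau, sigma):
    """Observation of one auxiliary base station: aggregate received power from all
    primary and secondary users plus average noise (assumed path-loss model)."""
    pu_part = np.sum(P_pu * (d_pu / l0) ** (-tau))
    su_part = np.sum(P_su * (d_su / l0) ** (-tau))
    return pu_part + su_part + sigma

def build_state(P_pu, D_pu, P_su, D_su, l0, tau, sigma):
    """Stack the X base stations' observations into S(t), eq. (6).
    D_pu: (M, X) distances primary->BS, D_su: (N, X) distances secondary->BS."""
    X = D_pu.shape[1]
    return np.array([received_strength(P_pu, D_pu[:, k], P_su, D_su[:, k],
                                       l0, tau, sigma) for k in range(X)])
```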
2) Action
The transmit power selected by a secondary user in each time slot is taken as the action value. The transmit power of each secondary user is a discretized value, and each secondary user can select from H different transmit-power levels, so the system model has H^N selectable joint actions in total. The action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_N(t)]    (8)
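The following small sketch enumerates this discretized joint action space: each of the N secondary users picks one of H power levels, giving H^N joint actions. The power levels used in the example are illustrative values, not taken from the patent.

```python
from itertools import product

def build_action_space(power_levels, n_secondary):
    """All joint actions: each secondary user picks one of the H discrete power levels."""
    return list(product(power_levels, repeat=n_secondary))

# Example: H = 4 power levels (in mW) for N = 3 secondary users -> 4**3 = 64 joint actions
actions = build_action_space([0.5, 1.0, 1.5, 2.0], n_secondary=3)
print(len(actions))  # 64
```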
3) Graded reward function with high reward and punishment
A key issue in letting the secondary users adaptively select an appropriate transmit power to achieve spectrum sharing is the design of an effective reward function. To reflect realistic conditions, four indexes are designed to judge the success level of secondary-user spectrum access. The indexes are defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]
where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel.
Whether the signal-to-noise ratio of every primary user exceeds its preset threshold is defined as the prerequisite for judging whether power control succeeds; if this does not hold, the spectrum access is directly judged a Complete Failure (CF). If every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a Secondary Failure (SF). If every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a Complete Success (CS). If, under the CS conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a Secondary Success (SS). The formal definition is:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
In the above formula, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the outcome of spectrum access: a completely successful access by the secondary users receives a high reward and a completely failed access receives a high punishment, so the system explores a successful access strategy more quickly.
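A sketch of this four-level reward follows. Since the patent's reward formula is given only as an image, the numeric constants are merely illustrative choices satisfying a_1 > 10a_2 and a_3 > 10a_4, the sign convention (positive for success, negative for failure) is assumed from the description, and a single access channel is assumed for the power comparison.

```python
def graded_reward(gamma_pu, gamma_su, mu_pu, mu_su, P_pu_sum, P_su_sum,
                  a1=20.0, a2=1.0, a3=20.0, a4=1.0):
    """Graded reward with high reward/punishment over the CF/SF/CS/SS outcomes."""
    pu_ok = all(g > m for g, m in zip(gamma_pu, mu_pu))    # every primary user above threshold
    su_ok = [g > m for g, m in zip(gamma_su, mu_su)]       # per-secondary-user threshold test
    power_ok = P_pu_sum > P_su_sum                          # primary power exceeds total secondary power

    if not pu_ok:
        return -a3          # Complete Failure (CF): high punishment
    if not any(su_ok):
        return -a4          # Secondary Failure (SF)
    if all(su_ok) and power_ok:
        return a1           # Complete Success (CS): high reward
    return a2               # Secondary Success (SS)
```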
4) Policy
The primary users are defined to transmit power according to the following power control strategy:
[equation images: the primary user's step-wise power update rule]
Under this strategy, the primary user updates its transmit power step by step at each time t. When the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
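A sketch of this step-wise primary-user power update is given below. The step size, the power bounds and the way the t+1 prediction γ'_i(t) is obtained are assumptions for illustration, since the patent gives the strategy and prediction formulas only as images.

```python
def primary_power_update(P_i, gamma_i, gamma_i_pred, mu_i, delta=0.1,
                         P_min=0.0, P_max=5.0):
    """Step-wise primary-user power control: raise power when the current SNR is
    below the threshold but the predicted SNR meets it, lower power when both
    current and predicted SNR meet it, otherwise keep the current power."""
    if gamma_i <= mu_i and gamma_i_pred >= mu_i:
        return min(P_i + delta, P_max)   # increase transmit power
    if gamma_i >= mu_i and gamma_i_pred >= mu_i:
        return max(P_i - delta, P_min)   # decrease transmit power
    return P_i                            # otherwise unchanged
```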
the competition depth Q network power control method based on the high reward punishment is provided, experimental simulation is carried out on a Python platform, and the method is a delay DQN algorithm for improving a reward function, so the method is referred to as the delay DQN algorithm hereinafter and in experiments. And comparing the performance of the native DQN algorithm, the double DQN algorithm and the blanking DQN algorithm under the same simulation environment. Each algorithm iterates 40000 times, and the performance results of each index are displayed once every 1000 times. FIG. 4 is a graph comparing the loss functions of three different depth reinforcement learning algorithms, and it can be seen that all three eventually converge. However, the native DQN algorithm and the double DQN algorithm are unstable, the loss fluctuation is large, and the convergence speed is slow. The blanking DQN algorithm proposed herein can converge at a relatively fast speed and the loss values are kept in a very small range.
FIGS. 5 and 6 show the cumulative reward and the total secondary-user throughput over 40000 training episodes for the three deep reinforcement learning algorithms. Comparing the three algorithms shows that, unlike the original DQN and Double DQN algorithms, the proposed Dueling DQN algorithm discovers actions that give the secondary users successful access from about the 5th round, obtains positive rewards, and keeps increasing its cumulative reward, so it learns correct actions quickly and has a clear advantage. In addition, on the total secondary-user throughput index its throughput is the largest and its performance the best.
FIG. 7 shows the average secondary-user transmit power of the three algorithms. Overall, the original DQN algorithm has the highest average transmit power, and the Double DQN algorithm's average transmit power is almost always above 2.0 mW, while the Dueling DQN algorithm's average transmit power is the lowest, mostly at 1.5 mW and 2.0 mW with only a few values above 2.0 mW. Combining these indexes, the simulation results show that the proposed Dueling DQN algorithm achieves the largest total secondary-user throughput during dynamic power control and the smallest average transmit power while guaranteeing successful spectrum access for the secondary users, so it effectively reduces power loss and saves energy.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A dueling deep Q-network power control method with high reward and punishment, characterized by comprising the following steps:
s1: the auxiliary base station collects communication information of a primary user and a secondary user and transmits the obtained information to the secondary user;
s2: setting the transmitting power selected by the secondary user in each time slot as an action value, and constructing an action space;
s3: constructing a graded reward function with high reward punishment;
s4: and constructing a power control strategy.
2. The dueling deep Q-network power control method with high reward and punishment according to claim 1, wherein the specific process of step S1 is:
the primary users and the secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy; the auxiliary base stations play an important role in signal transmission, collecting the communication information of the primary and secondary users and transmitting the obtained information to the secondary users; assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the k-th auxiliary base station is defined as:

[equation image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.
3. The dueling deep Q-network power control method with high reward and punishment according to claim 2, wherein in step S1, at time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
4. The method according to claim 3, wherein in step S2, the transmit power selected by the secondary users in each time slot is taken as the action value, the transmit power of each secondary user is discretized, and each secondary user selects from H different transmit-power levels, so there are H^N selectable joint actions in total, the action space being defined as:

A(t) = [P_1(t), P_2(t), ..., P_N(t)].
5. The method according to claim 4, wherein in step S3, four indexes are designed to judge the success level of secondary-user spectrum access, the indexes being defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]

where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel.
6. The method according to claim 5, wherein in step S3, whether the signal-to-noise ratio of every primary user exceeds its preset threshold is defined as the prerequisite for judging whether power control succeeds; if this does not hold, the spectrum access is directly judged a complete failure CF; if every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure SF; if every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a complete access success CS; if, under the complete-access-success conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a secondary access success SS, the formal definition being:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
in the above formula, a_1 > 10a_2 and a_3 > 10a_4; the reward function is graded according to the outcome of spectrum access, a completely successful access by the secondary users receiving a high reward and a completely failed access receiving a high punishment, so the system explores a successful access strategy more quickly.
7. The method according to claim 6, wherein in step S4, the primary users are defined to transmit power according to the following power control strategy:
[equation images: the primary user's step-wise power update rule]
under this strategy, the primary user updates its transmit power step by step at each time t.
8. The dueling deep Q-network power control method with high reward and punishment according to claim 7, characterized in that when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged; the predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
9. The method according to claim 8, wherein a secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power; the secondary user must continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power; the signal-to-noise ratio is an important measure of link quality, the signal-to-noise ratio of the i-th primary user being defined as:
[equation image: γ_i(t), the signal-to-noise ratio of the i-th primary user]
defining the signal-to-noise ratio of the jth secondary user as:
[equation image: γ_j(t), the signal-to-noise ratio of the j-th secondary user]
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
10. The method according to claim 9, wherein the channel gains and transmit powers change dynamically, and according to Shannon's theorem the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth; in the dynamically changing system, the best power allocation is ensured when the signal-to-noise ratio of every primary user stays above its preset threshold and the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
CN202110701419.XA 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment Active CN113438723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Publications (2)

Publication Number Publication Date
CN113438723A true CN113438723A (en) 2021-09-24
CN113438723B CN113438723B (en) 2023-04-28

Family

ID=77753705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110701419.XA Active CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Country Status (1)

Country Link
CN (1) CN113438723B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013000169A1 (en) * 2011-06-29 2013-01-03 中国人民解放军理工大学 Resource allocation method for maximizing throughput in cooperative cognitive simo network
WO2013000167A1 (en) * 2011-06-29 2013-01-03 中国人民解放军理工大学 Cognitive single-input multi-output network access method base on cooperative relay
WO2020134507A1 (en) * 2018-12-28 2020-07-02 北京邮电大学 Routing construction method for unmanned aerial vehicle network, unmanned aerial vehicle, and storage medium
WO2020244906A1 (en) * 2019-06-03 2020-12-10 Nokia Solutions And Networks Oy Uplink power control using deep q-learning
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN111262638A (en) * 2020-01-17 2020-06-09 合肥工业大学 Dynamic spectrum access method based on efficient sample learning
CN111726811A (en) * 2020-05-26 2020-09-29 国网浙江省电力有限公司嘉兴供电公司 Slice resource allocation method and system for cognitive wireless network
CN112367132A (en) * 2020-10-27 2021-02-12 西北工业大学 Power distribution algorithm in cognitive radio based on reinforcement learning solution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIFENG YE,YONGHUA WANG,PIN WAN: "Joint Channel Allocation and Power Control Based on Long", 《COMPLEXITY》 *
JIANG Taotao; ZHU Jiang: "Joint channel selection and power control based on multi-user Q-learning in CNR" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116321390A (en) * 2023-05-23 2023-06-23 北京星河亮点技术股份有限公司 Power control method, device and equipment
CN117545094A (en) * 2024-01-09 2024-02-09 大连海事大学 Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network
CN117545094B (en) * 2024-01-09 2024-03-26 大连海事大学 Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network

Also Published As

Publication number Publication date
CN113438723B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
Liao et al. A model-driven deep reinforcement learning heuristic algorithm for resource allocation in ultra-dense cellular networks
CN113438723A (en) Competitive depth Q network power control method with high reward punishment
Alpcan et al. Power control for multicell CDMA wireless networks: A team optimization approach
CN110267274B (en) Spectrum sharing method for selecting sensing users according to social credibility among users
Ren et al. DDPG based computation offloading and resource allocation for MEC systems with energy harvesting
CN113795050B (en) Sum Tree sampling-based deep double-Q network dynamic power control method
Ye et al. Learning-based computing task offloading for autonomous driving: A load balancing perspective
Trrad et al. Application of fuzzy logic to cognitive wireless communications
Ma et al. On-demand resource management for 6G wireless networks using knowledge-assisted dynamic neural networks
Sanusi et al. Development of handover decision algorithms in hybrid Li-Fi and Wi-Fi networks
Liu et al. Deep reinforcement learning-based MEC offloading and resource allocation in uplink NOMA heterogeneous network
Joshi et al. Optimized fuzzy power control over fading channels in spectrum sharing cognitive radio using ANFIS
Yan et al. QoE-based semantic-aware resource allocation for multi-task networks
Tashman et al. Performance optimization of energy-harvesting underlay cognitive radio networks using reinforcement learning
CN114219074A (en) Wireless communication network resource allocation algorithm dynamically adjusted according to requirements
Mendoza et al. Deep reinforcement learning for dynamic access point activation in cell-free MIMO networks
CN113115355A (en) Power distribution method based on deep reinforcement learning in D2D system
Alajmi et al. An efficient actor critic drl framework for resource allocation in multi-cell downlink noma
CN110149608B (en) DAI-based resource allocation method for optical wireless sensor network
CN116470598A (en) Wireless textile body area network energy neutral operation method based on deep reinforcement learning
CN113890653A (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115633402A (en) Resource scheduling method for mixed service throughput optimization
Chang et al. Fuzzy/neural congestion control for integrated voice and data DS-CDMA/FRMA cellular networks
Sabitha et al. Design and analysis of fuzzy logic and neural network based transmission power control techniques for energy efficient wireless sensor networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231212

Address after: 518021 A807, Jihao Building, No. 1086 Shennan East Road, Fenghuang Community, Huangbei Street, Luohu District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Tuo Ai Wei Information Technology Co.,Ltd.

Address before: 510090 Dongfeng East Road 729, Yuexiu District, Guangzhou City, Guangdong Province

Patentee before: GUANGDONG University OF TECHNOLOGY

TR01 Transfer of patent right