CN113438723A - Dueling deep Q-network power control method with high reward and punishment - Google Patents

Dueling deep Q-network power control method with high reward and punishment Download PDF

Info

Publication number
CN113438723A
CN113438723A · CN202110701419.XA · CN202110701419A
Authority
CN
China
Prior art keywords
user
signal
secondary user
noise ratio
primary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110701419.XA
Other languages
Chinese (zh)
Other versions
CN113438723B (en)
Inventor
刘骏
刘德荣
王永华
林得有
王宇慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tuo Ai Wei Information Technology Co., Ltd.
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110701419.XA priority Critical patent/CN113438723B/en
Publication of CN113438723A publication Critical patent/CN113438723A/en
Application granted granted Critical
Publication of CN113438723B publication Critical patent/CN113438723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/146Uplink power control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/242TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account path loss
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/245TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account received signal strength
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/30TPC using constraints in the total amount of available transmission power
    • H04W52/34TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W52/346TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a dueling deep Q-network power control method with high reward and punishment. It improves the reward function used in the deep reinforcement learning process, grading it according to the spectrum-access outcome of the secondary users and assigning different reward values to different actions: the completely successful access action receives a high reward, and the completely failed access action receives a high punishment, so the system explores a successful access strategy more quickly. Combining the dueling deep Q-network with this graded high-reward-and-punishment reward function and applying it to dynamic spectrum power control effectively improves system stability, increases the total throughput of the secondary users, reduces power loss, and saves energy.

Description

Dueling deep Q-network power control method with high reward and punishment
Technical Field
The invention relates to the field of cognitive radio control methods, and in particular to a dueling deep Q-network power control method with high reward and punishment.
Background
With the rapid development and wide deployment of wireless communication technology, the demand for spectrum resources keeps growing, while the available wireless spectrum is gradually being exhausted; this has become a major obstacle to the further development of wireless communication. Most spectrum resources are still allocated in the conventional, fixed manner: a specific frequency band is assigned to a specific licensed user, and other users must be authorized before they may use it. Extensive research in academia and industry shows that, on the one hand, a large share of licensed spectrum is not actually used by its authorized users and many licensed bands sit idle, so their utilization is low; on the other hand, the public bands are excessively contended for and congested. How to resolve these contradictions in spectrum allocation and improve spectrum utilization is therefore of great importance.
Cognitive Radio (CR) technology was proposed to alleviate the shortage of spectrum resources and the low spectrum utilization. The cognitive cycle of cognitive radio consists of six steps: Orient, Observe, Learn, Decide, Plan and Act. By observing and learning from the external environment, a cognitive radio intelligently adjusts its decisions and orientation, carries out the corresponding plans and actions, and thereby adapts to the environment. For spectrum sharing, the core idea of cognitive radio is that, on the premise of causing no interference to the primary users (PU) that hold the spectrum licenses, secondary users (SU) sense the surrounding radio environment and access the spectrum opportunistically to improve spectrum utilization. Through dynamic spectrum allocation, this technology enables access to multiple frequency bands and makes full use of idle spectrum.
Building on reinforcement learning (RL), deep reinforcement learning algorithms that combine RL with deep learning have reached human-level performance in many artificial-intelligence domains, such as Go, Dota and StarCraft II. Specifically, the deep Q-network (DQN) combines the RL process with a deep neural network to approximate the action-value function Q, and the neural network compensates for the limited generalization and function-approximation ability of Q-learning. The dueling deep Q-network (Dueling DQN) improves on the ordinary DQN by re-estimating the Q value as the sum of the value of a state and the advantage of the action taken in that state.
In recent research, the DQN algorithm has been applied to spectrum allocation, and simulation results show faster convergence and a lower packet-loss rate. To cope with unknown, dynamic industrial Internet-of-Things environments, an improved deep Q-learning network has been proposed for spectrum resource management in the industrial Internet of Things. A dueling deep reinforcement learning algorithm has also been applied to predicting the heavy-metal content of soil with good results. However, these deep reinforcement learning methods either do not consider the value of a state and the value of the action taken in that state at the same time, or do not grade the reward function according to how successful the spectrum access is.
Disclosure of Invention
The invention provides a dueling deep Q-network power control method with high reward and punishment, which considers the values of states and actions simultaneously, sums them for re-evaluation, and can effectively improve system stability.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a competition depth Q network power control method with high reward punishment comprises the following steps:
s1: the auxiliary base station collects communication information of a primary user and a secondary user and transmits the obtained information to the secondary user;
s2: setting the transmitting power selected by the secondary user in each time slot as an action value, and constructing an action space;
s3: constructing a graded reward function with high reward punishment;
s4: and constructing a power control strategy.
Further, the specific process of step S1 is:
the primary users and the secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy. The auxiliary base stations play an important role in signal transmission; they collect the communication information of the primary and secondary users and transmit the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the k-th auxiliary base station is defined as:

[equation image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system. At time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
Further, in step S2, the transmission power selected by the sub-users in each time slot is set as the action value, the transmission power of each sub-user is discretized, and each sub-user selects H different transmission values, so H is sharednA selectable action space, the action space defined as:
A(t)=[P1(t),P2(t),…,Pn(t)]。
Further, in step S3, four indexes are designed to evaluate the success level of secondary-user spectrum access, defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]

where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel;
in step S3, the prerequisite for judging whether power control succeeds is that the signal-to-noise ratio of every primary user exceeds its preset threshold; if this does not hold, the spectrum access is directly judged a complete failure (CF). If every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure (SF). If every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a complete access success (CS). If, under the complete-access-success conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a secondary access success (SS). The formal definition is:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
In the above formula, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the outcome of spectrum access: a completely successful access by the secondary users receives a high reward and a completely failed access receives a high punishment, so the system explores a successful access strategy more quickly.
Further, in step S4, the primary user is defined to transmit power according to the following strategy, where the power control strategy is as follows:
[equation images: the primary user's step-wise power update rule]
Under this strategy, the primary user updates its transmit power step by step at each time t:
when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
A secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power. The secondary user must therefore continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power. The signal-to-noise ratio is an important measure of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
[equation image: γ_i(t), the signal-to-noise ratio of the i-th primary user], i = 1, 2, ..., M
defining the signal-to-noise ratio of the jth secondary user as:
[equation image: γ_j(t), the signal-to-noise ratio of the j-th secondary user], j = 1, 2, ..., N
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user. The channel gains and transmit powers change dynamically, and according to Shannon's theorem the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth. In this dynamically changing system, the goal is to achieve the best power allocation: the signal-to-noise ratio of every primary user stays above its preset threshold, while the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves the reward function in the deep reinforcement learning process, carries out grade division according to the spectrum access condition of the secondary user, and gives different actions with different reward values. Giving high reward to the most successful action of the most correct access and giving high punishment to the most wrong action of the most failed access, thus leading the system to more quickly explore the strategy of successful access; the competitive depth Q network is combined with the graded reward function with high reward punishment, and the method is applied to the dynamic power control of the frequency spectrum, so that the stability of the system can be effectively improved, the total throughput of secondary users can be improved, the power loss is reduced, and the effect of saving energy is achieved.
Drawings
FIG. 1 is a diagram of a model of an application system in which the method of the present invention is implemented;
fig. 2 is a diagram of a general DQN network architecture;
FIG. 3 is a diagram of the Dueling DQN network architecture;
FIG. 4 is a comparison of the loss functions of three different deep reinforcement learning algorithms;
FIG. 5 is a graph of the cumulative reward over 40000 training episodes for the three deep reinforcement learning algorithms;
FIG. 6 is a graph of the total secondary-user throughput over 40000 training episodes for the three deep reinforcement learning algorithms;
FIG. 7 is a graph of the average secondary-user transmit power for the three deep reinforcement learning algorithms.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, consider an area centered on a primary base station (PBS). Assume the cognitive wireless network contains M primary users (PU) and N secondary users (SU) (N > M), one primary base station and several auxiliary base stations (ABS), all randomly distributed in the network environment. The primary base station guarantees the normal communication of the primary users, while the auxiliary base stations collect the received-signal-strength information of the primary and secondary users and forward the collected data to the secondary users.
In this model a secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power. The secondary user must therefore continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power.
The signal-to-noise ratio is an important measure of link quality. The signal-to-noise ratio of the i-th primary user is defined as:
[equation (1), given as an image: γ_i(t), the signal-to-noise ratio of the i-th primary user]
defining the signal-to-noise ratio of the jth secondary user as:
[equation (2), given as an image: γ_j(t), the signal-to-noise ratio of the j-th secondary user]
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
The channel gains, transmit powers and other quantities of the model change dynamically. According to Shannon's theorem, the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))    (3)
where W denotes the signal bandwidth. In this dynamically changing system, the goal is to achieve the best power allocation: the signal-to-noise ratio of every primary user stays above its preset threshold, while the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
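As an illustration, the following minimal Python sketch computes the primary- and secondary-user signal-to-noise ratios and the Shannon throughput of eq. (3). The interference-plus-noise form of eqs. (1)-(2) is an assumption inferred from the variable definitions above (the patent gives those equations only as images), and all function and parameter names are illustrative.

```python
import numpy as np

def primary_sinr(h_pp, P_p, h_sp, P_s, noise_p):
    # h_pp: (M,) direct-link gains of the primary users; P_p: (M,) primary transmit powers
    # h_sp: (N, M) gains from secondary user j to primary receiver i; P_s: (N,) secondary powers
    # noise_p: (M,) ambient noise at each primary receiver
    interference = h_sp.T @ P_s                          # secondary-user interference at each primary receiver
    return (h_pp * P_p) / (interference + noise_p)

def secondary_sinr(h_ss, P_s, h_ps, P_p, h_cross, noise_s):
    # h_ss: (N,) direct-link gains of the secondary users
    # h_ps: (M, N) gains from primary user i to secondary receiver j
    # h_cross: (N, N) gains between secondary users (own diagonal excluded below)
    from_primary = h_ps.T @ P_p                          # primary-user interference at each secondary receiver
    cross = h_cross.T @ P_s - np.diag(h_cross) * P_s     # other secondary users' interference
    return (h_ss * P_s) / (from_primary + cross + noise_s)

def throughput(gamma, W=1e6):
    """Shannon throughput of eq. (3): T_j = W * log2(1 + gamma_j)."""
    return W * np.log2(1.0 + gamma)
```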
The aim of the invention is to use the Dueling DQN with an improved reward function for dynamic spectrum power control: the secondary users adaptively adjust their own transmit power according to the information obtained from the auxiliary base stations, thereby accomplishing dynamic power control of the cognitive wireless network.
Like the ordinary DQN algorithm, the Dueling DQN algorithm uses the same overall structure: an environment, a replay memory unit, two neural networks with identical structure but different parameters, and an error function. Handling the spectrum power control problem with deep reinforcement learning is essentially a Markov decision process. The ordinary DQN approximates the optimal control strategy with an action-value function Q(s, a):
[equation (4), given as an image: the action-value function Q(s, a)]
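To illustrate the structure just described (replay memory, two same-structure networks and an error function), the following PyTorch sketch performs one temporal-difference update. The tuple layout of the replay memory, the batch size and the discount factor are assumptions for illustration, not values taken from the patent.

```python
import random
import torch
import torch.nn as nn

def dqn_update(policy_net, target_net, memory, optimizer, batch_size=32, gamma=0.9):
    """One TD update drawn from the replay memory of (state, action, reward, next_state) tuples."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch]).unsqueeze(1)
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])

    q_sa = policy_net(states).gather(1, actions).squeeze(1)    # Q(s, a) from the online network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values     # max_a' Q(s', a') from the target network
    target = rewards + gamma * q_next

    loss = nn.functional.mse_loss(q_sa, target)                # the error function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```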
the value of the state and the action advantage value in the state are summed to be used as a Q value for reevaluation, and the core content of the competitive deep Q network different from the common deep Q network is represented as follows:
Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)    (5)
comparing the network structures of DQN and dulling DQN as shown in fig. 2 and fig. 3, it can be known that dulling DQN has two data streams before the output layer, one data stream outputs the Q value of the state, and the other data stream outputs the advantaged value of the action.
1) State
In the system model the primary and secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy. The auxiliary base stations play an important role in signal transmission; they collect the communication information of the primary and secondary users and transmit the obtained information to the secondary users. Assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]    (6)
The signal strength received by the k-th auxiliary base station is defined as:

[equation (7), given as an image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.

At time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
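A sketch of assembling the state vector S(t) from the X auxiliary base stations' observations is given below. Because eq. (7) is only available as an image, the log-distance path-loss form P·(l/l_0)^(−τ) is an assumed model suggested by the listed variables (l_ik, l_jk, l_0, τ, σ); all names are illustrative.

```python
import numpy as np

def received_strength(P_pu, d_pu, P_su, d_su, l0, tau, sigma):
    """Observation of one auxiliary base station: aggregate received power from all
    primary and secondary users plus average noise (assumed path-loss model)."""
    pu_part = np.sum(P_pu * (d_pu / l0) ** (-tau))
    su_part = np.sum(P_su * (d_su / l0) ** (-tau))
    return pu_part + su_part + sigma

def build_state(P_pu, D_pu, P_su, D_su, l0, tau, sigma):
    """Stack the X base stations' observations into S(t), eq. (6).
    D_pu: (M, X) distances primary->BS, D_su: (N, X) distances secondary->BS."""
    X = D_pu.shape[1]
    return np.array([received_strength(P_pu, D_pu[:, k], P_su, D_su[:, k],
                                       l0, tau, sigma) for k in range(X)])
```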
2) Action
The transmit power selected by a secondary user in each time slot is taken as the action value. The transmit power of each secondary user is a discretized value, and each secondary user can select from H different transmit-power levels, so the system model has H^N selectable joint actions in total. The action space is defined as:
A(t) = [P_1(t), P_2(t), ..., P_N(t)]    (8)
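The following small sketch enumerates this discretized joint action space: each of the N secondary users picks one of H power levels, giving H^N joint actions. The power levels used in the example are illustrative values, not taken from the patent.

```python
from itertools import product

def build_action_space(power_levels, n_secondary):
    """All joint actions: each secondary user picks one of the H discrete power levels."""
    return list(product(power_levels, repeat=n_secondary))

# Example: H = 4 power levels (in mW) for N = 3 secondary users -> 4**3 = 64 joint actions
actions = build_action_space([0.5, 1.0, 1.5, 2.0], n_secondary=3)
print(len(actions))  # 64
```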
3) Graded reward function with high reward and punishment
A key issue in letting the secondary users adaptively select an appropriate transmit power to achieve spectrum sharing is the design of an effective reward function. To reflect realistic conditions, four indexes are designed to judge the success level of secondary-user spectrum access. The indexes are defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]
where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel.
Whether the signal-to-noise ratio of every primary user exceeds its preset threshold is defined as the prerequisite for judging whether power control succeeds; if this does not hold, the spectrum access is directly judged a Complete Failure (CF). If every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a Secondary Failure (SF). If every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a Complete Success (CS). If, under the CS conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a Secondary Success (SS). The formal definition is:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
In the above formula, a_1 > 10a_2 and a_3 > 10a_4. The reward function is graded according to the outcome of spectrum access: a completely successful access by the secondary users receives a high reward and a completely failed access receives a high punishment, so the system explores a successful access strategy more quickly.
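A sketch of this four-level reward follows. Since the patent's reward formula is given only as an image, the numeric constants are merely illustrative choices satisfying a_1 > 10a_2 and a_3 > 10a_4, the sign convention (positive for success, negative for failure) is assumed from the description, and a single access channel is assumed for the power comparison.

```python
def graded_reward(gamma_pu, gamma_su, mu_pu, mu_su, P_pu_sum, P_su_sum,
                  a1=20.0, a2=1.0, a3=20.0, a4=1.0):
    """Graded reward with high reward/punishment over the CF/SF/CS/SS outcomes."""
    pu_ok = all(g > m for g, m in zip(gamma_pu, mu_pu))    # every primary user above threshold
    su_ok = [g > m for g, m in zip(gamma_su, mu_su)]       # per-secondary-user threshold test
    power_ok = P_pu_sum > P_su_sum                          # primary power exceeds total secondary power

    if not pu_ok:
        return -a3          # Complete Failure (CF): high punishment
    if not any(su_ok):
        return -a4          # Secondary Failure (SF)
    if all(su_ok) and power_ok:
        return a1           # Complete Success (CS): high reward
    return a2               # Secondary Success (SS)
```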
4) Policy
The primary users are defined to transmit power according to the following power control strategy:
[equation images: the primary user's step-wise power update rule]
Under this strategy, the primary user updates its transmit power step by step at each time t. When the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged. The predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
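A sketch of this step-wise primary-user power update is given below. The step size, the power bounds and the way the t+1 prediction γ'_i(t) is obtained are assumptions for illustration, since the patent gives the strategy and prediction formulas only as images.

```python
def primary_power_update(P_i, gamma_i, gamma_i_pred, mu_i, delta=0.1,
                         P_min=0.0, P_max=5.0):
    """Step-wise primary-user power control: raise power when the current SNR is
    below the threshold but the predicted SNR meets it, lower power when both
    current and predicted SNR meet it, otherwise keep the current power."""
    if gamma_i <= mu_i and gamma_i_pred >= mu_i:
        return min(P_i + delta, P_max)   # increase transmit power
    if gamma_i >= mu_i and gamma_i_pred >= mu_i:
        return max(P_i - delta, P_min)   # decrease transmit power
    return P_i                            # otherwise unchanged
```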
the competition depth Q network power control method based on the high reward punishment is provided, experimental simulation is carried out on a Python platform, and the method is a delay DQN algorithm for improving a reward function, so the method is referred to as the delay DQN algorithm hereinafter and in experiments. And comparing the performance of the native DQN algorithm, the double DQN algorithm and the blanking DQN algorithm under the same simulation environment. Each algorithm iterates 40000 times, and the performance results of each index are displayed once every 1000 times. FIG. 4 is a graph comparing the loss functions of three different depth reinforcement learning algorithms, and it can be seen that all three eventually converge. However, the native DQN algorithm and the double DQN algorithm are unstable, the loss fluctuation is large, and the convergence speed is slow. The blanking DQN algorithm proposed herein can converge at a relatively fast speed and the loss values are kept in a very small range.
FIGS. 5 and 6 show the cumulative reward and the total secondary-user throughput over 40000 training episodes for the three deep reinforcement learning algorithms. Comparing the three algorithms shows that, unlike the original DQN and Double DQN algorithms, the proposed Dueling DQN algorithm discovers actions that give the secondary users successful access from about the 5th round, obtains positive rewards, and keeps increasing its cumulative reward, so it learns correct actions quickly and has a clear advantage. In addition, on the total secondary-user throughput index its throughput is the largest and its performance the best.
FIG. 7 shows the average secondary-user transmit power of the three algorithms. Overall, the original DQN algorithm has the highest average transmit power, and the Double DQN algorithm's average transmit power is almost always above 2.0 mW, while the Dueling DQN algorithm's average transmit power is the lowest, mostly at 1.5 mW and 2.0 mW with only a few values above 2.0 mW. Combining these indexes, the simulation results show that the proposed Dueling DQN algorithm achieves the largest total secondary-user throughput during dynamic power control and the smallest average transmit power while guaranteeing successful spectrum access for the secondary users, so it effectively reduces power loss and saves energy.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A dueling deep Q-network power control method with high reward and punishment, characterized by comprising the following steps:
s1: the auxiliary base station collects communication information of a primary user and a secondary user and transmits the obtained information to the secondary user;
s2: setting the transmitting power selected by the secondary user in each time slot as an action value, and constructing an action space;
s3: constructing a graded reward function with high reward punishment;
s4: and constructing a power control strategy.
2. The dueling deep Q-network power control method with high reward and punishment according to claim 1, wherein the specific process of step S1 is:
the primary users and the secondary users are in a non-cooperative relationship: a secondary user accesses a primary user's channel in underlay mode, and neither side knows the other's power transmission strategy; the auxiliary base stations play an important role in signal transmission, collecting the communication information of the primary and secondary users and transmitting the obtained information to the secondary users; assuming there are X auxiliary base stations in the environment, the state value is:
S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)]
the signal strength received by the k-th auxiliary base station is defined as:

[equation image: s_k(t), the aggregate signal strength received by the k-th auxiliary base station]

where l_ik(t) and l_jk(t) denote the distances from the k-th auxiliary base station to the i-th primary user and the j-th secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path-loss exponent, and σ(t) denotes the average noise power of the system.
3. The dueling deep Q-network power control method with high reward and punishment according to claim 2, wherein in step S1, at time t, secondary user k selects an action in state s_k(t) and then transitions to the next state.
4. The method according to claim 3, wherein in step S2, the transmit power selected by the secondary users in each time slot is taken as the action value, the transmit power of each secondary user is discretized, and each secondary user selects from H different transmit-power levels, so there are H^N selectable joint actions in total, the action space being defined as:

A(t) = [P_1(t), P_2(t), ..., P_N(t)].
5. The method according to claim 4, wherein in step S3, four indexes are designed to judge the success level of secondary-user spectrum access, the indexes being defined as follows:
[equation image: the four grading indexes for secondary-user spectrum access]

where γ_i and γ_j denote the signal-to-noise ratios of any primary user and any secondary user respectively, μ_i and μ_j denote the preset thresholds of the primary and secondary users, and ΣP_i and ΣP_j denote the sums of the primary-user transmit power and the secondary-user transmit power on any access channel.
6. The method according to claim 5, wherein in step S3, whether the signal-to-noise ratio of every primary user exceeds its preset threshold is defined as the prerequisite for judging whether power control succeeds; if this does not hold, the spectrum access is directly judged a complete failure CF; if every primary user's signal-to-noise ratio exceeds the threshold but no secondary user's signal-to-noise ratio exceeds its threshold, the case is called a secondary access failure SF; if every primary user's and every secondary user's signal-to-noise ratio exceeds the respective threshold, and on every access channel the primary users' transmit power exceeds the sum of the secondary users' transmit power, the access is called a complete access success CS; if, under the complete-access-success conditions, only part of the secondary users' signal-to-noise ratios exceed the threshold while the other conditions are unchanged, the access is called a secondary access success SS, the formal definition being:
[equation image: formal definition of the CF, SF, CS and SS grading conditions]
according to the above grading conditions, the reward function is defined as:
[equation image: the graded reward function, which assigns reward values based on a_1, a_2 to successful access and penalties based on a_3, a_4 to failed access]
in the above formula, a_1 > 10a_2 and a_3 > 10a_4; the reward function is graded according to the outcome of spectrum access, a completely successful access by the secondary users receiving a high reward and a completely failed access receiving a high punishment, so the system explores a successful access strategy more quickly.
7. The method according to claim 6, wherein in step S4, the primary users are defined to transmit power according to the following power control strategy:
[equation images: the primary user's step-wise power update rule]
under this strategy, the primary user updates its transmit power step by step at each time t.
8. The dueling deep Q-network power control method with high reward and punishment according to claim 7, characterized in that when the signal-to-noise ratio of primary user i at time t satisfies γ_i(t) ≤ μ_i and the predicted signal-to-noise ratio at time t+1 satisfies γ'_i(t) ≥ μ_i, the primary user increases its transmit power; when γ_i(t) ≥ μ_i and the predicted γ'_i(t) ≥ μ_i, the primary user decreases its transmit power; otherwise the current transmit power is kept unchanged; the predicted signal-to-noise ratio of primary user i at time t+1 is:
[equation image: the predicted signal-to-noise ratio γ'_i(t) of primary user i at time t+1]
9. The method according to claim 8, wherein a secondary user accesses the primary user's channel in underlay mode, so to avoid affecting the primary user's normal communication it faces strict limits on its transmit power; the secondary user must continuously learn from the data collected by the auxiliary base stations and then complete its transmission with an appropriate transmit power; the signal-to-noise ratio is an important measure of link quality, the signal-to-noise ratio of the i-th primary user being defined as:
[equation image: γ_i(t), the signal-to-noise ratio of the i-th primary user]
defining the signal-to-noise ratio of the jth secondary user as:
[equation image: γ_j(t), the signal-to-noise ratio of the j-th secondary user]
where h_ii and h_jj denote the channel gains of the i-th primary user's and the j-th secondary user's own links, P_i(t) and P_j(t) denote the transmit power of the i-th primary user and the j-th secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) denote the channel gains between the i-th primary user and the j-th secondary user, between the j-th secondary user and the i-th primary user, and between the k-th secondary user and the j-th secondary user respectively, and N_i(t) and N_j(t) denote the ambient noise received by the i-th primary user and the j-th secondary user.
10. The method according to claim 9, wherein the channel gains and transmit powers change dynamically, and according to Shannon's theorem the relationship between the throughput and the signal-to-noise ratio of the j-th secondary user is defined as:
T_j(t) = W log_2(1 + γ_j(t))
where W denotes the signal bandwidth; in the dynamically changing system, the best power allocation is ensured when the signal-to-noise ratio of every primary user stays above its preset threshold and the secondary users adjust their own transmit power through continuous learning so that the total throughput of the secondary users in the whole system is maximized.
CN202110701419.XA 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment Active CN113438723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110701419.XA CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Publications (2)

Publication Number Publication Date
CN113438723A true CN113438723A (en) 2021-09-24
CN113438723B CN113438723B (en) 2023-04-28

Family

ID=77753705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110701419.XA Active CN113438723B (en) 2021-06-23 2021-06-23 Competition depth Q network power control method with high rewarding punishment

Country Status (1)

Country Link
CN (1) CN113438723B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013000169A1 (en) * 2011-06-29 2013-01-03 中国人民解放军理工大学 Resource allocation method for maximizing throughput in cooperative cognitive simo network
WO2013000167A1 (en) * 2011-06-29 2013-01-03 中国人民解放军理工大学 Cognitive single-input multi-output network access method base on cooperative relay
WO2020134507A1 (en) * 2018-12-28 2020-07-02 北京邮电大学 Routing construction method for unmanned aerial vehicle network, unmanned aerial vehicle, and storage medium
WO2020244906A1 (en) * 2019-06-03 2020-12-10 Nokia Solutions And Networks Oy Uplink power control using deep q-learning
CN110267338A (en) * 2019-07-08 2019-09-20 西安电子科技大学 Federated resource distribution and Poewr control method in a kind of D2D communication
CN111262638A (en) * 2020-01-17 2020-06-09 合肥工业大学 Dynamic spectrum access method based on efficient sample learning
CN111726811A (en) * 2020-05-26 2020-09-29 国网浙江省电力有限公司嘉兴供电公司 Slice resource allocation method and system for cognitive wireless network
CN112367132A (en) * 2020-10-27 2021-02-12 西北工业大学 Power distribution algorithm in cognitive radio based on reinforcement learning solution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIFENG YE,YONGHUA WANG,PIN WAN: "Joint Channel Allocation and Power Control Based on Long", 《COMPLEXITY》 *
JIANG Taotao; ZHU Jiang: "Joint channel selection and power control based on multi-user Q-learning in CNR" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116321390A (en) * 2023-05-23 2023-06-23 北京星河亮点技术股份有限公司 Power control method, device and equipment
CN117545094A (en) * 2024-01-09 2024-02-09 大连海事大学 Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network
CN117545094B (en) * 2024-01-09 2024-03-26 大连海事大学 Dynamic spectrum resource allocation method for hierarchical heterogeneous cognitive wireless sensor network

Also Published As

Publication number Publication date
CN113438723B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
Liao et al. A model-driven deep reinforcement learning heuristic algorithm for resource allocation in ultra-dense cellular networks
CN113438723A (en) Competitive depth Q network power control method with high reward punishment
Alpcan et al. Power control for multicell CDMA wireless networks: A team optimization approach
CN110267274B (en) Spectrum sharing method for selecting sensing users according to social credibility among users
Ren et al. DDPG based computation offloading and resource allocation for MEC systems with energy harvesting
CN113795050B (en) Sum Tree sampling-based deep double-Q network dynamic power control method
Ye et al. Learning-based computing task offloading for autonomous driving: A load balancing perspective
Trrad et al. Application of fuzzy logic to cognitive wireless communications
Ma et al. On-demand resource management for 6G wireless networks using knowledge-assisted dynamic neural networks
Sanusi et al. Development of handover decision algorithms in hybrid Li-Fi and Wi-Fi networks
Liu et al. Deep reinforcement learning-based MEC offloading and resource allocation in uplink NOMA heterogeneous network
Joshi et al. Optimized fuzzy power control over fading channels in spectrum sharing cognitive radio using ANFIS
Yan et al. QoE-based semantic-aware resource allocation for multi-task networks
Tashman et al. Performance optimization of energy-harvesting underlay cognitive radio networks using reinforcement learning
CN114219074A (en) Wireless communication network resource allocation algorithm dynamically adjusted according to requirements
Mendoza et al. Deep reinforcement learning for dynamic access point activation in cell-free MIMO networks
CN113115355A (en) Power distribution method based on deep reinforcement learning in D2D system
Alajmi et al. An efficient actor critic drl framework for resource allocation in multi-cell downlink noma
CN110149608B (en) DAI-based resource allocation method for optical wireless sensor network
CN116470598A (en) Wireless textile body area network energy neutral operation method based on deep reinforcement learning
CN113890653A (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115633402A (en) Resource scheduling method for mixed service throughput optimization
Chang et al. Fuzzy/neural congestion control for integrated voice and data DS-CDMA/FRMA cellular networks
Sabitha et al. Design and analysis of fuzzy logic and neural network based transmission power control techniques for energy efficient wireless sensor networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231212

Address after: 518021 A807, Jihao Building, No. 1086 Shennan East Road, Fenghuang Community, Huangbei Street, Luohu District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Tuo Ai Wei Information Technology Co.,Ltd.

Address before: 510090 Dongfeng East Road 729, Yuexiu District, Guangzhou City, Guangdong Province

Patentee before: GUANGDONG University OF TECHNOLOGY

TR01 Transfer of patent right