CN111491358B - Adaptive modulation and power control system based on energy acquisition and optimization method - Google Patents

Adaptive modulation and power control system based on energy acquisition and optimization method Download PDF

Info

Publication number
CN111491358B
Authority
CN
China
Prior art keywords
transmitter
action
power
average
receiver
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010325108.3A
Other languages
Chinese (zh)
Other versions
CN111491358A (en
Inventor
杨佳雨
胡杰
杨鲲
冷甦鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010325108.3A priority Critical patent/CN111491358B/en
Publication of CN111491358A publication Critical patent/CN111491358A/en
Application granted granted Critical
Publication of CN111491358B publication Critical patent/CN111491358B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/26TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service]
    • H04W52/262TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service] taking into account adaptive modulation and coding [AMC] scheme
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an adaptive modulation and power control system and optimization method based on energy harvesting, applied to the technical field of wireless communication networks. The system comprises a transmitter, a receiver, a Rayleigh fading channel, and a channel estimation module. According to the feedback information from the channel estimation module, the transmitter adaptively adjusts its transmit power and modulation mode under the constraints of an average power limit, an average bit error rate limit, and an average energy harvesting limit, while the receiver adaptively adjusts the power splitting factor. The receiver comprises a rechargeable battery; it stores a part of the received energy in the battery by power splitting, and the remaining energy is used to transmit data to the transmitter through the Rayleigh fading channel. The invention effectively addresses the energy-supply problem of low-power receivers in the future Internet of Things and realizes the vision of a green network.

Description

Adaptive modulation and power control system based on energy acquisition and optimization method
Technical Field
The invention belongs to the technical field of wireless communication networks, and particularly relates to a self-adaptive link technology applied to an SWIPT system.
Background
In recent years, Simultaneous Wireless Information and Power Transfer (SWIPT) has received considerable attention as a means of extending the lifetime of energy-limited nodes. In a SWIPT application scenario, a transmitter sends information and energy to a receiver over a wireless channel. In a conventional transmission scheme, the modulation mode and transmit power are fixed, which is called a non-adaptive scheme. Such a scheme does not take full advantage of a time-varying fading channel: to guarantee reliable transmission in all states of the time-varying channel, a non-adaptive system is designed for the worst-case channel state, and this design principle results in inefficient use of the channel capacity. In order to obtain maximum throughput under different channel conditions, it is necessary to introduce adaptive link techniques (including adaptive modulation, adaptive power control, and adaptive energy transfer control) into the SWIPT system.
In addition, artificial intelligence techniques have matured rapidly. Because they allow machines and devices to perceive their surroundings more intelligently, much as humans do, and to react to the environment, they are now applied in many fields. In the field of communications, artificial intelligence techniques are also applied at the various communication layers. For example, the physical layer may perform intelligent modulation and coding by deep learning, the MAC layer may perform resource allocation according to reinforcement learning, and the network layer may intelligently help each device find an optimal route. The combination of communication and machine learning is making networks more intelligent.
Unlike conventional adaptive link techniques, in a SWIPT system the receiver operates using only the energy collected from the energy signal received over the wireless channel, so there is a trade-off between the amount of information transferred and the amount of energy transferred; the adaptive link control scheme must therefore be designed to optimize the collected energy simultaneously with the throughput, thereby ensuring the performance and stability of the system. Although conventional optimization methods consider a time-varying channel, the related research assumes that the channel transition probabilities are known to the system, which is not reasonable because channel transition probabilities are difficult to estimate accurately in the real world.
Disclosure of Invention
In order to solve the technical problems, the invention provides an adaptive modulation and power control scheme and an optimization method based on energy collection and deep reinforcement learning.
The technical scheme adopted by the invention is as follows: an adaptive modulation link control system, comprising a transmitter, a receiver, a Rayleigh fading channel, and a channel estimation module;
the transmitter adaptively adjusts the transmitting power and the modulation mode of the transmitter under the constraints of average power limit, average bit error rate limit and average energy harvesting limit according to the feedback information of the channel estimation module;
the receiver comprises a rechargeable battery; the receiver stores a part of the received energy in the battery by power splitting, and the remaining energy is used to transmit data to the transmitter through the Rayleigh fading channel.
The transmitter maintains two deep neural networks, denoted respectively the target network and the evaluation network. The target network is used to select the action strategy and to output the expected return value r_t + α·max Q(s_{t+1}, a_{t+1}) corresponding to the selected action strategy, where r_t denotes the reward function; the evaluation network is used to estimate the current value function Q(s_t, a_t). The action strategy refers to the modulation mode.
The second technical scheme adopted by the invention is as follows: a deep neural network optimization method based on deep reinforcement learning comprises the following steps:
B1, randomly initializing the weight parameters θ of the evaluation network and the weights θ⁻ of the target network;
B2, the transmitter obtains from the current evaluation network the action a_t that maximizes Q(s_t, a_t); the action a_t is executed with probability 1−ε, and with probability ε an action is randomly selected from the action candidate set for exploration;
B3, each action corresponds to a reward function value, and the state of the transmitter transitions from s_t to s_{t+1};
B4, a sliding window is used to control the samples (s_t, a_t, r_t, s_{t+1}) stored in the experience pool, where s_t denotes the state of the transmitter at time t, s_{t+1} denotes the state of the transmitter at time t+1, and r_t denotes the reward function value at time t;
B5, the evaluation network and the target network take samples from the experience pool and update the network parameters by a back-propagation algorithm based on gradient descent;
B6, the weight parameters of the evaluation network are assigned to the target network so that θ⁻ = θ.
The action candidate set in step B2 is specifically formed as follows: the transmitter obtains from the evaluation network the action strategy a_t corresponding to the maximum value function Q(s_t, a_t), and selects the action strategies whose modulation order is the same as or adjacent to that of a_t to form the action candidate set.
The reward function of step B3 is set according to constraints including average power limit, average bit error rate limit, and average energy harvesting limit.
When the constraint is satisfied, the reward function takes the value of the spectrum utilization rate corresponding to the action;
when the constraint is not satisfied, the reward function takes a negative value equal to the degree to which the constraint is not satisfied.
Step B4 further includes initially setting the sliding window to 2.
The invention has the following beneficial effects: by combining energy harvesting with wireless communication, it effectively addresses the energy-supply problem of future low-power Internet of Things receivers and realizes the vision of a green network. Meanwhile, based on deep reinforcement learning, intelligent decisions are made for the intelligent nodes in the network, and a Prioritized Experience Generation (PEG) technique is used to improve the convergence of the deep reinforcement learning algorithm, so that the algorithm can converge and learn a higher-performance strategy in the data-and-energy integrated transmission scenario. Applying this strategy to data-and-energy integrated cooperative transmission makes the wireless network more intelligent.
Drawings
Fig. 1 is a flowchart of an adaptive link control design and optimization method based on energy collection and deep reinforcement learning according to the present invention.
Fig. 2 is a system diagram of adaptive modulation, adaptive power control and adaptive energy control according to the present invention.
FIG. 3 illustrates where PEG takes effect in the reinforcement learning algorithm of the present invention, compared with conventional Prioritized Experience Replay (PER);
wherein Fig. 3(a) shows that the prioritized experience replay mechanism is effective when good experience and bad experience are balanced; Fig. 3(b) shows a case where PER fails: when the experience pool contains only bad experience, effective learning cannot be achieved even with PER; Fig. 3(c) illustrates the principle of the Prioritized Experience Generation (PEG) proposed by the present invention.
Fig. 4 is a deep reinforcement learning DQN algorithm framework of an embodiment of the invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the following further explains the technical contents of the present invention with reference to fig. 1 to 4.
For the understanding of the present invention, the following terms will be explained first:
WPT: wireless power transfer.
WIT: wireless information transfer.
h: the channel gain.
(s_t, a_t, r_t, s_{t+1}): transition format of the reinforcement learning algorithm.
ε-greedy strategy: exploration strategy of the original DQN.
γ: instantaneous signal-to-noise ratio.
B̄ER: average bit error rate performance.
P̄_EH: average energy harvesting performance.
P̄_t: average transmit power.
ρ_PS: power splitting factor.
Mod: modulation mode.
P_0: target value of the average harvested energy constraint.
P_t0: target value of the average transmit power constraint.
BER_0: target value of the average bit error rate constraint.
In the case where the channel transition probability is unknown, a reinforcement learning based approach may be very effective. In the reinforcement learning method, the optimal control strategy is learned by repeatedly interacting with the environment (i.e., the channel) without assuming prior information of the channel transition probability.
The invention provides a self-adaptive link control design and optimization method based on energy collection and deep reinforcement learning, which comprises the following steps as shown in figure 1:
s1, constructing a self-adaptive link control system (including self-adaptive power control, self-adaptive modulation and self-adaptive energy control) based on an energy acquisition technology;
in this embodiment, a point-to-point SWIPT system consisting of one intelligent transmitter and one receiver is considered. They all have only one antenna. The receiver is assumed to have a rechargeable battery. The receiver uses power splitting to store a portion of the received energy in a battery, and the remaining energy is used to transmit data to the transmitter. The transmitter adaptively adjusts the transmitting power and the modulation mode of the transmitter under the constraints of average power limit, average bit error rate limit and average energy harvesting limit according to the feedback information of channel estimation, and the receiver adaptively adjusts the modulation rate division factor. For example, the transmitter makes adjustments to the transmit power based on various performance indicators of the current system operation, selects an appropriate modulation scheme based on a policy, and stabilizes the average performance indicator within the performance constraints by dividing the power, some for energy transmission, some for data transmission, and adaptively adjusting the ratio during division (adaptive power division factor control).
We consider that the transmitter has complete Channel State Information (CSI). Assuming the wireless channel experiences quasi-static, Rayleigh flat fading, the downlink channel gain from the transmitter to the receiver can be expressed as
g = |h|²·α
where α denotes the large-scale fading component, including path loss and log-normal shadow fading, which remains unchanged over multiple time slots. Based on a first-order Gauss-Markov process, the invention considers a correlated time-varying fading channel in which the small-scale Rayleigh fading component h varies as
h_t = ρ·h_{t−1} + e_t
where h ~ CN(0, 1) is a circularly symmetric complex Gaussian (CSCG) random variable with unit variance, the channel innovation process e_1, e_2, … consists of independent and identically distributed CSCG random variables satisfying CN(0, 1−ρ²), and the correlation coefficient is ρ = J_0(2π·f_d·T), where J_0(·) is the zeroth-order Bessel function of the first kind and f_d is the maximum Doppler frequency.
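This correlated fading model can be simulated directly. The following Python sketch is an illustration only; the Doppler frequency f_d, slot duration T, and number of slots are assumed values not specified in the text, and scipy.special.j0 supplies the zeroth-order Bessel function of the first kind.

import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def simulate_rayleigh(num_slots, f_d=10.0, T=1e-3, seed=0):
    # h_t = rho * h_{t-1} + e_t, with e_t ~ CN(0, 1 - rho^2) and rho = J0(2*pi*f_d*T)
    rng = np.random.default_rng(seed)
    rho = j0(2 * np.pi * f_d * T)
    h = np.zeros(num_slots, dtype=complex)
    h[0] = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # h_0 ~ CN(0, 1)
    for t in range(1, num_slots):
        e = (rng.standard_normal() + 1j * rng.standard_normal()) * np.sqrt((1 - rho ** 2) / 2)
        h[t] = rho * h[t - 1] + e
    return h

h = simulate_rayleigh(1000)
g_small = np.abs(h) ** 2  # multiply by the large-scale component alpha to obtain g = |h|^2 * alpha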
The correlation profile of the channel (i.e., the probability distribution of channel transitions) is assumed to be unknown to the system. The distance between the transmitter and the receiver is d and the path loss exponent is λ. Suppose the transmit power of the transmitter is P_t, the average received power at the energy harvesting branch is P_r, the power splitting factor is ρ_PS, and the noise power is σ²; the received signal-to-noise ratio γ can then be expressed as
γ = (1 − ρ_PS)·P_t·g / (d^λ·σ²)
For energy harvesting, we use a general linear model and assume the EH circuit conversion efficiency is a constant η. For ease of analysis, the symbol period is set to 1. Considering the EH power threshold, the EH circuit output power P_EH is determined by the following formula, where P_th denotes the received power threshold and (a)⁺ denotes max(a, 0):
P_EH = η·(P_r − P_th)⁺
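As an illustration of the received-power split and the thresholded linear EH model, a small Python helper is sketched below. It assumes, as in the reconstruction above, that ρ_PS is the fraction of received power routed to the energy harvesting branch and that the remaining fraction feeds information reception; the numeric defaults (η, P_th, σ²) are placeholders, not values taken from the patent.

def receiver_split(P_t, g, d, lam, rho_ps, sigma2=1e-9, eta=0.6, P_th=1e-6):
    # Total received power after distance-dependent path loss
    P_rx = P_t * g / (d ** lam)
    # Power entering the EH circuit and its output P_EH = eta * (P_r - P_th)^+
    P_r = rho_ps * P_rx
    P_eh = eta * max(P_r - P_th, 0.0)
    # SNR of the information branch: gamma = (1 - rho_PS) * P_t * g / (d^lambda * sigma^2)
    snr = (1.0 - rho_ps) * P_rx / sigma2
    return snr, P_eh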
S2, designing a reinforcement learning improvement, PEG, suited to the adaptive modulation scenario, in view of the shortcomings of conventional deep reinforcement learning. Prioritized experience generation consists of two parts: first, enabling the intelligent transmitter to explore efficiently during training; second, controlling the state s_{t+1} of the next experience.
Enabling the intelligent transmitter to explore efficiently during training: when performing an action, only those actions that may become part of the optimal strategy are selected (actions that do not cause the average performance to fluctuate too much), while clearly inferior actions (actions that cause the average performance indicators to drift significantly) are not considered, so no computing power is wasted on trial-and-error learning of bad action strategies. Compared with the original DQN algorithm, the ε-greedy strategy is therefore no longer used for exploration. Combining this with the rules of the adaptive modulation scenario: once an action strategy that roughly satisfies the current performance constraints has been found, a higher-performance strategy can only be obtained by exploring modulation modes whose order is the same as or adjacent to that of the current action strategy. According to this characteristic of the adaptive modulation scenario, a new exploration strategy is designed: each time the current best strategy a_t is obtained, exploration is performed with probability ε (ε may take a larger value, e.g., 0.4, at the beginning of exploration to enhance searching; it then gradually decreases with training and finally decays to 0.05). The action candidate set is updated according to the above rule (i.e., modulation modes with the same or adjacent order as the current action strategy are selected), and an action strategy is then randomly selected from this reduced candidate set for training. After the algorithm has learned a reasonably good decision, this exploration strategy reduces the action space to be searched and accelerates the process of finding the optimal decision scheme. The action candidate set here is a subset of the action space.
The selectable modulation modes include BPSK, QPSK, 8QAM, 16QAM, 64QAM, 256QAM and other constellations. For example, if the modulation mode corresponding to the current a_t is 64QAM, the action candidate set in this embodiment consists of 16QAM, 64QAM, and 256QAM; QPSK (4QAM) is not selected because the difference in order is too large.
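A minimal Python sketch of this candidate-set rule follows; the ordered modulation list matches the example in the preceding paragraph, and treating the list index as the "order" is an implementation assumption.

MODULATIONS = ["BPSK", "QPSK", "8QAM", "16QAM", "64QAM", "256QAM"]  # lowest to highest order

def candidate_modulations(current_mod):
    # Keep only the current modulation and its immediate neighbours in order
    i = MODULATIONS.index(current_mod)
    return MODULATIONS[max(0, i - 1): i + 2]

print(candidate_modulations("64QAM"))  # ['16QAM', '64QAM', '256QAM'], as in the embodiment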
Controlling the state s_{t+1} of the next experience and reducing transitions that deviate too much from the optimal strategy: in the scenario of this embodiment, it is desirable that the state sequences experienced during training all satisfy the performance constraints. Learning from experience near the performance targets makes it easier to learn a useful strategy, converge quickly, and obtain better performance. However, during algorithm training, since the intelligent transmitter has not yet fully learned an appropriate strategy, the strategies obtained by "trial-and-error" exploration produce transitions that violate the constraints, and the experience generated in such cases has no learning value for subsequent exploration. To this end, a "forgetting mechanism" (implemented with a sliding window) is introduced: after the intelligent transmitter makes a mistake, the influence of that mistake is forgotten after a short time. For example, after a single erroneous action is performed, the average bit error rate performance deviates from the constraint range; after several state transitions, that action slides out of the window and the average bit error rate performance returns to a state that better satisfies the constraint. (In the initial stage of exploration, because the algorithm has not yet learned a good strategy and generates a large amount of bad experience, the sliding window can take a small value, e.g., storing only 2 steps of information; as training progresses, the window is gradually enlarged so that the algorithm takes longer-term performance into account.) In this way the algorithm can automatically escape from bad states within a short time, and more state transitions that satisfy the performance constraints appear in the experience pool.
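The "forgetting mechanism" can be realized with a fixed-length buffer whose window is enlarged as training progresses. A small Python sketch follows; the class name and interface are illustrative, not prescribed by the text.

from collections import deque

class SlidingAverage:
    # Average a performance metric over a sliding window so that a single bad
    # action stops influencing the averaged state after a few slots
    def __init__(self, window=2):        # the window starts small (2) early in training
        self.buf = deque(maxlen=window)

    def update(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

    def grow(self, new_window):          # enlarge the window as training progresses
        self.buf = deque(self.buf, maxlen=new_window)

avg_ber = SlidingAverage(window=2)
for ber in (1e-3, 5e-2, 1e-3, 1e-3):     # the single bad slot is forgotten after two updates
    print(avg_ber.update(ber))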
Different experiences influence the algorithm differently: experience that satisfies the current constraints allows the intelligent transmitter to better learn a high-performance strategy, whereas experience that deviates from the constraints is of little help and can even be harmful. Experience that meets the constraints is hereinafter referred to as good experience, and experience that deviates from the constraints as bad experience. As shown in Fig. 3(a), when the ratio of good to bad experience is balanced, the Prioritized Experience Replay (PER) mechanism can be effective, i.e., the learning effect is improved by sampling good experience more frequently; Fig. 3(b) shows a case where PER fails: when the experience pool contains only bad experience, effective learning cannot be achieved even with PER; Fig. 3(c) shows the working principle of the Prioritized Experience Generation (PEG) proposed by the present invention, which uses simple Experience Replay (ER) and modifies the agent's exploration of the environment (① in Fig. 3(c)) and the process of generating samples from the environment and putting them into the experience pool (② in Fig. 3(c)), so that more experience satisfying the current constraints is placed in the experience pool.
S3, making optimization decisions based on deep reinforcement learning for the intelligent transmitter in the system, comprising the following steps:
s31, determining the error rate performance and the energy harvesting performance of the receiver;
[Formulas for the receiver's bit error rate performance, energy harvesting performance, and their running averages]
s32, determining the state value and the state space of the deep reinforcement learning of the transmitter;
As can be seen from the optimization problem of this embodiment, the optimization objective (i.e., the average channel capacity) is closely related to the current signal-to-noise ratio γ, the current channel quality h, and the power splitting factor ρ_PS. In addition, because of the constraints on the average power, average bit error rate, and so on, the state must reflect the changes in these environment and performance quantities. The state is defined as follows:
s_t = (h_{t−1}, h_t, P_t^{(t−1)}, Mod^{(t−1)}, γ^{(t−1)}, P̄_EH^{(t−1)}, B̄ER^{(t−1)}, P̄_t^{(t−1)})
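For concreteness, the state tuple can be packed into a real-valued vector before being fed to the neural networks. The sketch below is an assumption about the encoding (complex channel gains represented by their magnitudes, the modulation mode by an integer index), which the text does not specify.

import numpy as np

def build_state(h_prev, h_curr, p_prev, mod_index_prev, snr_prev, avg_eh, avg_ber, avg_pt):
    # Flatten s_t = (h_{t-1}, h_t, P_t^{(t-1)}, Mod^{(t-1)}, gamma^{(t-1)},
    #               avg P_EH, avg BER, avg P_t) into a vector for the DQN input layer
    return np.array([abs(h_prev), abs(h_curr), p_prev, mod_index_prev,
                     snr_prev, avg_eh, avg_ber, avg_pt], dtype=np.float32)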
s33, determining the action value and the action space of the deep reinforcement learning of the transmitter;
In time slot t, the transmitter determines the modulation mode Mod^{(t)} of the signal at time t, the transmit power P_t^{(t)}, and the power splitting factor ρ_PS^{(t)}. The action of the transmitter in time slot t is therefore
a_t = (Mod^{(t)}, P_t^{(t)}, ρ_PS^{(t)})
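Because DQN requires a finite action set, the triple (Mod, P_t, ρ_PS) is typically discretized. The grids in the sketch below are purely illustrative; the text does not specify the power or splitting-factor levels.

from itertools import product

MODS = ["BPSK", "QPSK", "8QAM", "16QAM", "64QAM", "256QAM"]
POWER_LEVELS = [0.1, 0.5, 1.0, 2.0]      # example transmit power levels (W)
RHO_PS_LEVELS = [0.2, 0.4, 0.6, 0.8]     # example power splitting factors

ACTIONS = list(product(MODS, POWER_LEVELS, RHO_PS_LEVELS))
print(len(ACTIONS))                      # 96 discrete actions, one per DQN output neuron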
S34, determining a return function of deep reinforcement learning of the transmitter;
When the constraints are satisfied, the reward is the spectrum utilization rate achieved by the executed action, R(s, a) = C.
When the constraints are not satisfied, the reward is the negative of the degree to which the constraints are violated; the specific formulas are as follows.
R(s, a) = R_PEH + R_BER + R_PT
R_PEH = −(P_0 − P̄_EH)⁺
R_BER = −(B̄ER − BER_0)⁺
R_PT = −(P̄_t − P_t0)⁺
where (·)⁺ denotes max(·, 0), so each penalty term is non-zero only when the corresponding constraint is violated.
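Under the penalty form reconstructed above (one-sided deviations from the three targets), the reward can be computed as in the following sketch. The function signature and the rule of returning C only when every penalty term vanishes are assumptions consistent with the text, not a verbatim formula from the patent.

def reward(C, avg_eh, avg_ber, avg_pt, P0, BER0, Pt0):
    # R_PEH, R_BER, R_PT: negative one-sided constraint violations
    r_peh = -max(P0 - avg_eh, 0.0)       # harvested energy below its target P_0
    r_ber = -max(avg_ber - BER0, 0.0)    # average BER above its target BER_0
    r_pt = -max(avg_pt - Pt0, 0.0)       # average transmit power above its target P_t0
    penalty = r_peh + r_ber + r_pt
    # Spectrum utilization C when all constraints hold, the (negative) penalty otherwise
    return C if penalty == 0.0 else penalty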
S35, the energy-harvesting-based adaptive link control transmitter performs deep reinforcement learning and decision making.
The intelligent transmitter maintains two deep neural networks, a target network and an evaluation network; the evaluation network is responsible for estimating the system return, and the target network is responsible for selecting the action value. At the beginning of time t, the intelligent transmitter first feeds its current state s_t into the target network, the target network outputs the expected return value of each action, and the intelligent transmitter selects the action a_t with the maximum expected return. The intelligent transmitter then computes the current average bit error rate, average transmit power, and average energy harvesting power to obtain the next state value s_{t+1}. The current state-action-reward-next-state tuple (s_t, a_t, r_t, s_{t+1}) is then stored in a memory buffer; the buffer size can be chosen as 1000, i.e., 1000 state-transition samples are stored. A mini-batch of data, e.g., 64 samples, is then selected from the memory buffer, and the weight parameters θ of the neural network are updated by back-propagation using mini-batch gradient descent. The neural network is a fully connected network with 3 hidden layers, and the activation function is the hyperbolic tangent function tanh. The deep reinforcement learning process of the intelligent transmitter is shown in Fig. 4.
The deep reinforcement learning process is specifically as follows. First, the weight parameters θ of the evaluation network and the weights θ⁻ of the target network are randomly initialized. Then the following cycle is carried out: the agent obtains from the current evaluation network the action a_t that maximizes Q(s_t, a_t); the action a_t is executed with probability 1−ε, and with probability ε an action is randomly selected from the action candidate set for exploration. Each action yields a reward value r_t and causes the state of the intelligent transmitter to transition from s_t to s_{t+1}; in forming the state s_{t+1}, the PEG technique is used (i.e., the sliding window controls how the average performance parameters mentioned above are updated). Once the complete tuple (s_t, a_t, r_t, s_{t+1}) is obtained, it is stored in the experience pool. The evaluation network and the target network take samples from the experience pool and update the network parameters by a back-propagation algorithm based on gradient descent. For example, after every 100 exploration steps, the parameters of the evaluation network are assigned to the target network so that θ⁻ = θ.
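A condensed PyTorch sketch of this training loop is given below. The hyper-parameters taken from the text are the experience pool of 1000 transitions, the mini-batch of 64 samples, the fully connected network with 3 tanh hidden layers, the ε schedule decaying from 0.4 towards 0.05, and the target-network synchronization every 100 steps; the state dimension, action count, hidden width, learning rate, discount factor, and the stand-in environment transition are assumptions made only so that the sketch is self-contained.

import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 8, 96           # 8 state entries; 96 discretized actions (see above)

def make_net():
    # Fully connected network with 3 hidden layers and tanh activations, as in the text
    return nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, NUM_ACTIONS))

eval_net, target_net = make_net(), make_net()
target_net.load_state_dict(eval_net.state_dict())          # B1: theta_minus = theta
optimizer = torch.optim.SGD(eval_net.parameters(), lr=1e-3)
memory = deque(maxlen=1000)                                 # experience pool of 1000 transitions
gamma, eps, batch_size = 0.9, 0.4, 64                       # eps decays from 0.4 towards 0.05

def select_action(state, candidates):
    # B2: greedy w.r.t. the evaluation network with prob. 1-eps, otherwise a random
    # action from the (restricted) candidate set -- the PEG exploration rule
    if random.random() < eps:
        return random.choice(candidates)
    with torch.no_grad():
        q = eval_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q.argmax())

def learn():
    # B5: mini-batch gradient step on the temporal-difference error
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    s, a, r, s2 = map(np.array, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(r, dtype=torch.float32)
    q = eval_net(s).gather(1, a).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(1).values   # r_t + gamma * max Q(s_{t+1}, .)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Stand-in environment: in the real system, the chosen (modulation, power, rho_PS)
# action would be applied and the next state and reward computed as described above.
state = np.zeros(STATE_DIM, dtype=np.float32)
for step in range(1000):
    action = select_action(state, candidates=list(range(NUM_ACTIONS)))  # PEG would restrict this set
    next_state = np.random.randn(STATE_DIM).astype(np.float32)          # stand-in transition
    r = 0.0                                                             # stand-in reward
    memory.append((state, action, r, next_state))                       # B4 (window control omitted)
    learn()
    if step % 100 == 0:                                                 # B6: periodic target sync
        target_net.load_state_dict(eval_net.state_dict())
    state = next_state
    eps = max(0.05, eps * 0.995)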
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and the scope of protection is not limited to the specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (3)

1. An adaptive modulation link control system, comprising a transmitter, a receiver, a Rayleigh fading channel, and a channel estimation module;
the transmitter adaptively adjusts the transmitting power and the modulation mode of the transmitter under the constraints of average power limit, average bit error rate limit and average energy harvesting limit according to the feedback information of the channel estimation module;
the receiver comprises a rechargeable battery; the receiver stores a part of the received energy in the battery by power splitting, and the remaining energy is used for transmitting data to the transmitter through the Rayleigh fading channel;
the process of the adaptive modulation link control based on the adaptive modulation link control system is as follows:
a1, determining the error rate performance and the energy harvesting performance of a receiver;
a2, determining the state value and the state space of the deep reinforcement learning of the transmitter; the state value is recorded as s_t:
s_t = (h_{t−1}, h_t, P_t^{(t−1)}, Mod^{(t−1)}, γ^{(t−1)}, P̄_EH^{(t−1)}, B̄ER^{(t−1)}, P̄_t^{(t−1)})
wherein h_{t−1} denotes the channel quality corresponding to time slot t−1, h_t denotes the channel quality corresponding to time slot t, P_t^{(t−1)} denotes the transmit power corresponding to time slot t−1, Mod^{(t−1)} denotes the modulation mode corresponding to time slot t−1, γ^{(t−1)} denotes the signal-to-noise ratio corresponding to time slot t−1, P̄_EH^{(t−1)} denotes the average energy harvesting performance corresponding to time slot t−1, B̄ER^{(t−1)} denotes the average bit error rate performance corresponding to time slot t−1, and P̄_t^{(t−1)} denotes the average transmit power corresponding to time slot t−1;
a3, determining the action value and the action space of the deep reinforcement learning of the transmitter;
in time slot t, the transmitter determines the modulation mode Mod^{(t)} of the signal at time t, the transmit power P_t^{(t)}, and the power splitting factor ρ_PS^{(t)};
the action space of the transmitter in time slot t is:
a_t = (Mod^{(t)}, P_t^{(t)}, ρ_PS^{(t)});
a4, determining the reward value r_t of the deep reinforcement learning of the transmitter;
when the constraint is satisfied, the reward value r_t is the spectrum utilization rate R(s_t, a_t) = C_t corresponding to the executed action;
when the constraint is not satisfied, the reward value r_t is a negative value equal to the degree to which the constraint is not satisfied;
a5, carrying out deep reinforcement learning and decision making based on an improved prioritized experience generation method; the transmitter maintains two deep neural networks, denoted respectively the target network and the evaluation network, wherein the target network is used for selecting the action strategy and outputting the expected reward value r_t + α·max Q(s_{t+1}, a_{t+1}) corresponding to the selected action strategy, and the evaluation network is used for estimating the current value function Q(s_t, a_t);
step a5 specifically includes the following steps:
B1, randomly initializing the weight parameters θ of the evaluation network and the weights θ⁻ of the target network;
B2, the transmitter obtains from the current evaluation network the action a_t that maximizes Q(s_t, a_t); the action a_t is executed with probability 1−ε, and with probability ε an action is randomly selected from the action candidate set for exploration; the action candidate set in step B2 is specifically formed as follows: the transmitter obtains from the evaluation network the action strategy a_t corresponding to the maximum value function Q(s_t, a_t), and selects the action strategies whose order is the same as or adjacent to that of a_t to form the action candidate set;
B3, each action corresponds to a reward value r_t and causes the state of the intelligent transmitter to transition from s_t to s_{t+1};
B4, a sliding window is used to control the samples (s_t, a_t, r_t, s_{t+1}) stored in the experience pool, where s_t denotes the state of the transmitter at time t and s_{t+1} denotes the state of the transmitter at time t+1;
B5, the evaluation network and the target network take samples from the experience pool, and update the network parameters by a back-propagation algorithm based on gradient descent;
B6, the parameters of the evaluation network are assigned to the target network so that θ⁻ = θ.
2. The adaptive modulation link control system of claim 1, wherein the reward value r_t of step B3 is set according to constraints including an average power limit, an average bit error rate limit, and an average energy harvesting limit.
3. The adaptive modulation link control system of claim 2, wherein the step B4 further comprises initially setting the sliding window to 2.
CN202010325108.3A 2020-04-23 2020-04-23 Adaptive modulation and power control system based on energy acquisition and optimization method Active CN111491358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325108.3A CN111491358B (en) 2020-04-23 2020-04-23 Adaptive modulation and power control system based on energy acquisition and optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010325108.3A CN111491358B (en) 2020-04-23 2020-04-23 Adaptive modulation and power control system based on energy acquisition and optimization method

Publications (2)

Publication Number Publication Date
CN111491358A CN111491358A (en) 2020-08-04
CN111491358B (en) 2021-10-26

Family

ID=71813667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325108.3A Active CN111491358B (en) 2020-04-23 2020-04-23 Adaptive modulation and power control system based on energy acquisition and optimization method

Country Status (1)

Country Link
CN (1) CN111491358B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102278037B1 (en) * 2019-10-22 2021-07-15 성균관대학교산학협력단 Method for controlling receiver by transmitter for simultaneous wireless information and power transfer operating in dual mode, adaptive mode switching method based on machine learning, and apparatus for performing the same
CN112508172A (en) * 2020-11-23 2021-03-16 北京邮电大学 Space flight measurement and control adaptive modulation method based on Q learning and SRNN model
CN114126021B (en) * 2021-11-26 2024-04-09 福州大学 Power distribution method of green cognitive radio based on deep reinforcement learning
CN114533321A (en) * 2022-04-18 2022-05-27 深圳市宏丰科技有限公司 Control circuit and method for tooth washing device
CN114980293B (en) * 2022-05-07 2023-08-11 电子科技大学长三角研究院(湖州) Intelligent self-adaptive power control method for large-scale OFDM system
CN117579136B (en) * 2024-01-17 2024-04-02 南京控维通信科技有限公司 AUPC and ACM control method for reverse burst by network control system in TDMA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101340592A (en) * 2008-08-14 2009-01-07 上海交通大学 Energy control system for video transmission under hybrid radio environment
KR101710012B1 (en) * 2015-11-10 2017-02-24 성균관대학교산학협력단 Energy harvesting method and apparatus in a receiver and a receiver using said method, and blind modulation manner detecting method and apparatus for the energy harvesting
CN108449803A (en) * 2018-04-02 2018-08-24 太原理工大学 Predictable energy management in rechargeable wireless sensor network and mission planning algorithm
CN110691422A (en) * 2019-10-06 2020-01-14 湖北工业大学 Multi-channel intelligent access method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Joint Interleaver and Modulation Design For Multi-User SWIPT-NOMA; Yizhe Zhao et al.; IEEE Transactions on Communications; 2019-10-31; Vol. 67, No. 10; full text *
Optimal Power Splitting for Simultaneous Wireless Information and Power Transfer in Amplify-and-Forward Multiple-Relay Systems; Derek Kwaku Pobi Asiedu et al.; IEEE; 2018-01-30; full text
A robust time-slot resource allocation and multi-user selection algorithm for wireless data-and-energy integrated networks; Yang Jiayu et al.; Chinese Journal on Internet of Things; 2019-09-30; Vol. 3, No. 3; full text *

Also Published As

Publication number Publication date
CN111491358A (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN111491358B (en) Adaptive modulation and power control system based on energy acquisition and optimization method
Ortiz et al. Reinforcement learning for energy harvesting point-to-point communications
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Ortiz et al. Reinforcement learning for energy harvesting decode-and-forward two-hop communications
CN112383922B (en) Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN111666149A (en) Ultra-dense edge computing network mobility management method based on deep reinforcement learning
Chen et al. Genetic algorithm-based optimization for cognitive radio networks
CN108075975B (en) Method and system for determining route transmission path in Internet of things environment
CN114513855B (en) Edge computing unloading decision and resource allocation method based on wireless energy-carrying communication
CN105519030A (en) Computer program product and apparatus for fast link adaptation in a communication system
CN110267274A (en) A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user
Mashhadi et al. Deep reinforcement learning based adaptive modulation with outdated CSI
Ji et al. Reconfigurable intelligent surface enhanced device-to-device communications
CN115065678A (en) Multi-intelligent-device task unloading decision method based on deep reinforcement learning
CN115065728A (en) Multi-strategy reinforcement learning-based multi-target content storage method
CN112788629B (en) Online combined control method for power and modulation mode of energy collection communication system
CN112738849B (en) Load balancing regulation and control method applied to multi-hop environment backscatter wireless network
Zhang et al. Deep Deterministic Policy Gradient for End-to-End Communication Systems without Prior Channel Knowledge
CN111556511B (en) Partial opportunistic interference alignment method based on intelligent edge cache
CN109951239B (en) Adaptive modulation method of energy collection relay system based on Bayesian classifier
Masadeh et al. Look-ahead and learning approaches for energy harvesting communications systems
Huang et al. Joint AMC and resource allocation for mobile wireless networks based on distributed MARL
CN115665763A (en) Intelligent information scheduling method and system for wireless sensor network
Cui et al. Hierarchical learning approach for age-of-information minimization in wireless sensor networks
Alajmi et al. An efficient actor critic drl framework for resource allocation in multi-cell downlink noma

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant