CN115242271B

CN115242271B - Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method

Info

Publication number: CN115242271B
Application number: CN202210892708.7A
Authority: CN
Inventors: 马青竹; 杨书宁; 周浩然; 张顺外
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-07-27
Filing date: 2022-07-27
Publication date: 2023-06-16
Anticipated expiration: 2042-07-27
Also published as: CN115242271A

Abstract

A reinforcement Learning-assisted large-scale MIMO speed-BP detection method adopts a Q-Learning algorithm in reinforcement Learning to find the optimal damping factor in a damping belief propagation speed-BP algorithm, so as to improve the performance of the speed-BP detection algorithm. And setting the damping factor in the speed-BP algorithm as a state in the Q-Learning algorithm, and setting the action to increase or decrease the damping factor to form a Q-Table. Determining whether to feed back the system in a positive direction or a negative direction according to the bit error rate obtained by the BP algorithm each time, and feeding back the system in a positive direction if the bit error rate is small; and if the error rate is large, giving a negative return. Therefore, the state with the largest return under a certain action is obtained by reasonably setting the Learning rate, the discount factor and the training times in the Q-Learning algorithm, and the damping factor corresponding to the state is the optimal damping factor, so that the optimal damping factor searching is completed, the performance of the weighted-BP detection algorithm is improved, the detection performance of large-scale MIMO is further improved, and the high-reliability low-delay requirement of actual communication can be better met.

Description

Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method

Technical Field

The invention belongs to the technical field of signal detection, and particularly relates to a reinforcement learning-assisted large-scale MIMO (multiple input multiple output) -BP detection method.

Background

Among the many key technologies of 5G communication, massive multiple-input multiple-output (MIMO) systems (MultipleInputMultiple Output) have been the focus of widespread attention in the industry and academia due to outstanding advantages in spectral efficiency, energy efficiency, link reliability, and in MIMO systems, signal detection is a very critical part. When the phenomenon of 'channel hardening' occurs in a large-scale MIMO system, the performance of the signal detection algorithm can be close to the optimal detection along with the increase of the number of antennas, and meanwhile, the complexity is low. The most typical algorithm with this property is the belief propagation (BeliefPropagation, BP) algorithm.

The Damped belief propagation (sampled-BP) detection algorithm is a BP algorithm that introduces the idea of dynamic excess relaxation correction. In contrast to the MIMO-BP detection algorithm, at each iteration of the algorithm, the update to the message is considered as a weighted average between the old and new estimates. In the t-th iteration, the message is damped by taking a weighted average of the messages calculated in the t-th iteration and in the (t-1) th iteration. Wherein the damping factor delta epsilon [0, 1).

In the BP detection algorithm, as the message is updated and passed back and forth, the edge probability of the correct symbol will gradually increase, while the edge probabilities of other symbols will become smaller and smaller. The memory introduced by the sampled-BP detection affects the convergence of the algorithm to some extent. Too large or too small a damping factor can seriously affect the performance of the sampled-BP detection algorithm, so it is important to find the optimal damping factor delta.

Disclosure of Invention

Aiming at the defects existing in the prior art, the method for detecting the speed-BP of the large-scale MIMO assisted by reinforcement Learning is provided, and the speed-BP algorithm is combined with the Q-Learning algorithm in reinforcement Learning, so that the searching of an optimal damping factor is realized, the performance of the speed-BP algorithm is improved, and the detection performance of the large-scale MIMO is improved.

A reinforcement learning-assisted large-scale MIMO speed-BP detection method comprises the following steps:

step A, for a large-scale MIMO system, initializing a Q value: establishing a Q-table, wherein the vertical axis is action value action, and the horizontal axis is state value;

step B, selecting an action, namely an action value action: selecting an action a in a state s based on the current Q value estimation;

step C, evaluating: implementing action a, observing the result state s' and the prize r;

step D, lifelong learning: and (5) repeating the step B, C until the maximum value of the learning times epicode is reached, or manually stopping training, obtaining the optimal damping factor after the training is finished, and substituting the optimal damping factor into a speed-BP algorithm to obtain a detection result.

Further, in step a, the action value action is set to increase or decrease the damping factor δ, and the damping factor is set to be at state and the initial value is 0.

Further, in step B, if each Q value is equal to 0, an epsilon greedy strategy is adopted.

Further, in the step B, the following sub-steps are included:

1) Designating an exploration rate epsilon, and setting from 1 as a step length randomly adopted; initially, the rate is at a maximum, and more exploration is made by randomly selecting actions;

2) Generating a random number; if the number is greater than the epsilon value, starting to utilize the expllocation to select the best action of each step by utilizing the known information; otherwise, exploration is started.

Further, in step C, specifically: updating the function Q (s, a); implementing the selected action a to obtain a new state s' and a reward r; determining whether to provide a forward or a negative reward for the system according to the bit error rate obtained by the BP algorithm each time, and providing a forward reward if the bit error rate is small; if the error rate is large, a negative forward is given; q (s, a) is an action value function, is an expected value of the obtained reward after taking action a in the state s, and is used for measuring how effective the action a is taken in the current state s, and is defined as follows:

wherein i is the number of steps, r _i Representing the rewards of step i, s ₀ Representing an initial state, a ₀ Representing the initial action, E [ # ]]The representation takes the expected value;

then, Q (s, a) is updated using the bellman equation.

NewsQ(s,a)＝Q(s,a)+α[R(s,a)+γmaxQ'(s',a')-Q(s,a)]

Wherein alpha is learning rate, determining the degree of learning in the error; gamma is the attenuation value of future reorder; the closer γ is to 1, the more sensitive to future reward; a 'represents the action taken in the new state s'.

Further, R (s, a) is a reward function defined as:

R(s,a)＝E[r _i |s _i ＝s,a _i ＝a]。

where i is the number of steps, s _i Indicating the status of step i, a _i Represent the firstAnd (3) performing the action of the step i.

Compared with the prior art, the invention has the beneficial effects that: according to the method, the Q-Learning algorithm in reinforcement Learning is adopted to obtain the optimal damping factor corresponding to the weighted-BP, so that the convergence speed of the weighted-BP detection algorithm can be remarkably increased, the detection performance of large-scale MIMO is improved, and the requirements of high reliability and low time delay of actual communication can be better met.

Drawings

FIG. 1 is a method model diagram in an embodiment of the invention.

FIG. 2 is a graph showing the simulated performance of the optimal damping factor trained by the method according to the embodiment of the invention compared with other damping factors.

Detailed Description

The technical scheme of the invention is further described in detail below with reference to the attached drawings.

In the message passing algorithm, the passing of damping messages is a known solution for increasing the convergence speed of iterative algorithms. The weighted-BP algorithm is a known improved BP algorithm for improving BP convergence performance, which includes the following:

in the real number domain model, the symbol x is transmitted _i e.OMEGA.and OMEGA= { e ₁ ,e ₂ ,...,e _K In which Ω is the in-phase or quadrature component e in the original complex constellation _k K is determined by the modulation order. Information is updated back and forth between the observation node and the variable node in each iteration of the algorithm.

A. Updating rule of observation node message

The observation node calculates posterior information according to the channel state information and prior information of the adjacent nodes, and then transmits the posterior information to the observation node. The posterior Log-Likelihood Ratio (LLR) of an observation node is defined as:

wherein the method comprises the steps of

β _j ( _i l)(e _k ) Representing that in the first iteration, the jth observation node transmits the ith symbol node's information about e _k Is a posterior information of (1). P is p ^(l) (x _i ＝e _k |y _i H) represents that the j-th observation node receives information y in the first iteration in the case where the channel matrix is H _j When the transmitting end actually transmits the symbol x _i E is _k Is a posterior probability of (c).

B. Update rules for variable node messages

The variable nodes calculate a priori messages from posterior messages from other variable nodes and pass them to all observation nodes. The a priori LLR of a variable node is defined as:

wherein the method comprises the steps of

In the whole iterative process, the prior message is calculated as follows:

since the sum of all prior probabilities in each iteration is 1, the prior probability for each symbol can be expressed as:

where k=1, 2,..k, and K

In this iteration, the updated information is not immediately transferred to the next iteration, but is weighted-averaged with the old information of the previous iteration, so as to be transferred to the updated information of the next iteration. Introducing memorability to the prior probability in BP iteration, taking a weighted average value to the new prior probability and the old prior probability, and then the updating criterion of the prior probability can be expressed as a linear combination of the prior probability corresponding to the last iteration and the current prior probability, as follows:

wherein the damping factor delta epsilon [0,1].

In the BP detection algorithm, as the message is updated and passed back and forth, the edge probability of the correct symbol will gradually increase, while the edge probabilities of other symbols will become smaller and smaller. From equation (7), it can be seen that the memory introduced by the sampled-BP detection affects the convergence of the algorithm to some extent. The faster the sampled-BP algorithm converges, the fewer the number of iterations needed to achieve the expected detection performance, and the smaller the corresponding complexity and system delay, so the design of the damping factor should be based on improving the convergence speed of the BP algorithm as much as possible.

Too large or too small a damping factor will seriously affect the performance of the sampled-BP detection algorithm, so it is important to find the optimal damping factor delta, while the Q-Learning algorithm can just achieve this, so it is possible to combine the Q-Learning algorithm with the sampled-BP algorithm to find the optimal damping factor. The magnitude of the damping factor is used as a state in a Q-Learning algorithm, the action is set to increase or decrease the damping factor, Q-Table is formed, whether the system is fed back positively or negatively is determined according to the bit error rate obtained by the BP algorithm each time, and if the bit error rate is small, the system is fed back positively; and if the error rate is large, giving a negative return. In this way, by reasonably setting the Learning rate, the discount factor and the training times in the Q-Learning algorithm, the state with the largest return under a certain action can be obtained, and the damping factor corresponding to the state is the optimal damping factor, so that the searching of the optimal damping factor is realized, and the method comprises the following steps:

A. initializing a Q value: a Q-table is established, and the behavior state (state value) is listed as action (action value). The action is set to increase or decrease the damping factor delta, and the magnitude of the damping factor is taken as the state. The initial value is 0, as shown in table 1.

TABLE 1Q-Table

B. Select an action: an action a is selected in the estimated state s based on the current Q value, i.e. the Q (s, a) value. If each Q value is equal to 0, an epsilon greedy strategy is initially adopted.

1) The exploration rate epsilon is specified, starting from a 1 setting, as a step taken randomly. Initially, this rate should be at a maximum because none of the values in the Q-table are known. This means that more exploration is made by randomly selecting actions.

2) A random number is generated. If the number is greater than the epsilon value, then the utilization (expllocation) is started, which means that the best action for each step is selected using known information. Otherwise, exploration (execution) is started.

C. Evaluation (evaluation): action a is performed, the result state s' and the prize r are observed. Now, the function Q (s, a) is updated. The selected action a is performed, resulting in a new state s' and prize r. Determining whether to provide a forward or a negative reward for the system according to the bit error rate obtained by the BP algorithm each time, and providing a forward reward if the bit error rate is small; and if the error rate is large, a negative forward is given. Q (s, a) is an action value function (action value function), also called Q function (Q-function), defined as follows:

wherein i is the number of steps, r _i Represents the firsti step of rewarding, s ₀ Representing an initial state, a ₀ Representing the initial action, E [ # ]]Representing taking the expected value.

Then, Q (s, a) is updated using the bellman equation.

NewsQ(s,a)＝Q(s,a)+α[R(s,a)+γmaxQ'(s',a')-Q(s,a)] (9)

Where α is the learning rate, determining how much of this error is to be learned. Gamma is the attenuation value of future reorder, i.e. the aforementioned discount factor. The closer γ is to 1, the more sensitive the machine is to future reward; a 'represents the action taken in the new state s'. R (s, a) is a reward function defined as:

R(s,a)＝E[r _i |s _i ＝s,a _i ＝a] (10)

where i is the number of steps, s _i Indicating the status of step i, a _i Indicating the action of step i.

D. Life-long learning: step B, C is repeated until the maximum number of learning times epicode is reached (specified by the user), or training is stopped manually.

The convergence speed of the BP algorithm can be accelerated by selecting a proper damping factor, the detection performance of the BP algorithm is improved, and the optimal damping factor can be found through training according to the proposed reinforcement learning-assisted large-scale MIMO BP detection method. In the learning process, the learning rate α=0.6, the discount factor γ=0.7, and the search rate epsilon=0.3. The parameters of the BP algorithm are set as follows: BPSK modulation, number of transmit antennas N _t And the number of receiving antennas N _r Are all 16, i.e. N _t ＝N _r =16, the amount of transmitted data is 16000 bits, the signal-to-noise ratio snr=2, and the number of iterations is 5.

The training process is divided into two steps:

step S1: training with an accuracy of 0.1 was performed. The damping factor is set as a state in a Q-Learning algorithm, the initial state is selected to be 0.3, the action is increased or decreased by 0.1, the training frequency is 100, the Q-Table is an 11 multiplied by 2 matrix, and the initial optimal damping factor with the precision of 0.1 is obtained after 100 times of training.

Step S2: training with an accuracy of 0.01 was performed. Setting the initial optimal damping factor obtained in the step S1 as an initial state, setting a state space as the initial optimal damping factor to float up and down by 0.1 precision, increasing or decreasing by 0.01, training 100 times, and obtaining the optimal damping factor after 100 times of training, wherein Q-Table is a matrix of 21×2.

The best damping factor is obtained through reinforcement learning, and comparison with simulation performance of randomly selected damping factors is shown in fig. 2, it can be seen that BP detection performance is best in the case of the damping factor δ=0.23 obtained through reinforcement learning, which verifies that the method is effective.

The above description is merely of preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the present disclosure will be within the scope of the claims.

Claims

1. A reinforcement learning-assisted large-scale MIMO (multiple input multiple output) -BP detection method is characterized by comprising the following steps of: the method comprises the following steps:

in the step A, the action value action is set to increase or decrease the damping factor delta, the size of the damping factor is used as state, and the initial value is 0;

the step B comprises the following sub-steps:

2) Generating a random number; if the number is greater than the epsilon value, starting to utilize the expllocation to select the best action of each step by utilizing the known information; otherwise, beginning to explore the expression;

in the step C, the specific steps are as follows: updating the function Q (s, a); implementing the selected action a to obtain a new state s' and a reward r; determining whether to provide a forward or a negative reward for the system according to the bit error rate obtained by the BP algorithm each time, and providing a forward reward if the bit error rate is small; if the error rate is large, a negative forward is given; q (s, a) is an action value function, is an expected value of the obtained reward after taking action a in the state s, and is used for measuring the effect of taking action a in the current state s, and is defined as follows:

then, Q (s, a) is updated using the bellman equation;

NewsQ(s,a)＝Q(s,a)+α[R(s,a)+γmaxQ'(s',a')-Q(s,a)]

wherein alpha is learning rate, determining the degree of learning in the error; gamma is the attenuation value of future reorder; the closer γ is to 1, the more sensitive to future reward; a 'represents an action taken in the new state s';

step D, lifelong learning: repeating the step B, C until the maximum value of the learning times epicode is reached, or manually stopping training, obtaining the optimal damping factor after the training is finished, and substituting the optimal damping factor into a speed-BP algorithm to obtain a detection result;

r (s, a) is a reward function defined as:

R(s,a)＝E[r _i |s _i ＝s,a _i ＝a]

2. The reinforcement learning assisted massive MIMO sampled-BP detection method of claim 1, wherein: in step B, if each Q value is equal to 0, an epsilon greedy strategy is adopted.