CN115242271B - Reinforcement learning-assisted large-scale MIMO (multiple-input multiple-output) damped-BP (belief propagation) detection method - Google Patents

Reinforcement learning-assisted large-scale MIMO (multiple-input multiple-output) damped-BP (belief propagation) detection method

Info

Publication number
CN115242271B
CN115242271B · CN202210892708.7A · CN202210892708A
Authority
CN
China
Prior art keywords
action
value
algorithm
state
learning
Prior art date
Legal status
Active
Application number
CN202210892708.7A
Other languages
Chinese (zh)
Other versions
CN115242271A (en)
Inventor
马青竹
杨书宁
周浩然
张顺外
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210892708.7A priority Critical patent/CN115242271B/en
Publication of CN115242271A publication Critical patent/CN115242271A/en
Application granted granted Critical
Publication of CN115242271B publication Critical patent/CN115242271B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B7/00 - Radio transmission systems, i.e. using radiation field
    • H04B7/02 - Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04 - Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/0413 - MIMO systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04B - TRANSMISSION
    • H04B17/00 - Monitoring; Testing
    • H04B17/30 - Monitoring; Testing of propagation channels
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Radio Transmission System (AREA)

Abstract

A reinforcement-learning-assisted large-scale MIMO damped-BP detection method uses the Q-Learning algorithm from reinforcement learning to find the optimal damping factor of the damped belief propagation (damped-BP) algorithm, thereby improving the performance of damped-BP detection. The damping factor of the damped-BP algorithm is taken as the state in the Q-Learning algorithm, the action is set to increase or decrease the damping factor, and a Q-Table is formed. Whether the system receives positive or negative feedback is decided by the bit error rate obtained from each run of the BP algorithm: a small bit error rate yields a positive return, and a large bit error rate yields a negative return. By reasonably setting the learning rate, discount factor and number of training episodes in the Q-Learning algorithm, the state with the largest return under a given action is obtained, and the damping factor corresponding to that state is the optimal damping factor. This completes the search for the optimal damping factor, improves the performance of the damped-BP detection algorithm and hence the detection performance of large-scale MIMO, and better satisfies the high-reliability, low-latency requirements of practical communication.

Description

Reinforcement learning-assisted large-scale MIMO (multiple-input multiple-output) damped-BP (belief propagation) detection method
Technical Field
The invention belongs to the technical field of signal detection, and particularly relates to a reinforcement learning-assisted large-scale MIMO (multiple-input multiple-output) damped-BP detection method.
Background
Among the many key technologies of 5G communication, massive multiple-input multiple-output (MIMO) systems have attracted wide attention in industry and academia because of their outstanding advantages in spectral efficiency, energy efficiency and link reliability. In a MIMO system, signal detection is a very critical part. When the "channel hardening" phenomenon occurs in a large-scale MIMO system, the performance of the signal detection algorithm approaches that of optimal detection as the number of antennas increases, while the complexity remains low. The most typical algorithm with this property is the belief propagation (BP) algorithm.
The damped belief propagation (damped-BP) detection algorithm is a BP algorithm that introduces the idea of dynamic over-relaxation correction. In contrast to the MIMO-BP detection algorithm, at each iteration the update of a message is treated as a weighted average between the old and the new estimates: in the t-th iteration, the message is damped by taking a weighted average of the messages computed in the t-th and the (t-1)-th iterations, where the damping factor δ ∈ [0, 1).
In the BP detection algorithm, as messages are updated and passed back and forth, the marginal probability of the correct symbol gradually increases while the marginal probabilities of the other symbols become smaller and smaller. The memory introduced by damped-BP detection affects the convergence of the algorithm to some extent. A damping factor that is too large or too small seriously degrades the performance of the damped-BP detection algorithm, so finding the optimal damping factor δ is important.
Disclosure of Invention
Aiming at the defects of the prior art, a reinforcement-learning-assisted large-scale MIMO damped-BP detection method is provided. The damped-BP algorithm is combined with the Q-Learning algorithm from reinforcement learning to find the optimal damping factor, improving the performance of the damped-BP algorithm and hence the detection performance of large-scale MIMO.
A reinforcement-learning-assisted large-scale MIMO damped-BP detection method comprises the following steps:
step A, for a large-scale MIMO system, initializing the Q values: establish a Q-table, wherein the vertical axis is the action value (action) and the horizontal axis is the state value (state), as in the sketch that follows step D;
step B, selecting an action, namely an action value (action): select an action a in state s based on the current Q-value estimate;
step C, evaluating: perform action a and observe the resulting state s' and reward r;
step D, lifelong learning: repeat steps B and C until the maximum number of learning episodes is reached, or training is stopped manually; after training, the optimal damping factor is obtained and substituted into the damped-BP algorithm to obtain the detection result.
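As an illustration of step A, the following is a minimal Python sketch of such a Q-table for the damping-factor search, assuming the damping factor is discretized in steps of 0.1; the variable names (states, actions, q_table) are illustrative and not taken from the patent.

import numpy as np

# States: candidate damping factors delta, discretized with 0.1 precision.
states = np.round(np.arange(0.0, 1.01, 0.1), 1)          # 11 candidate states
# Actions: increase delta by one step, or decrease it by one step.
actions = ("increase", "decrease")
# Q-table indexed by (state, action); every entry starts at 0, as in Table 1.
q_table = np.zeros((len(states), len(actions)))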
Further, in step A, the action value (action) is set to increase or decrease the damping factor δ, the magnitude of the damping factor is taken as the state, and all initial values in the Q-table are 0.
Further, in step B, if every Q value is equal to 0, an ε-greedy strategy is adopted.
Further, step B comprises the following sub-steps (a code sketch follows the list):
1) Specify an exploration rate ε, initialized to 1: this is the probability of taking a random step; initially the rate is at its maximum, and more exploration is done by selecting actions at random;
2) Generate a random number; if it is greater than the ε value, exploitation begins, i.e. the known information is used to select the best action at each step; otherwise, exploration continues.
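A minimal sketch of this ε-greedy selection, assuming a NumPy Q-table such as the one above; the function name select_action and the use of Python's random module are illustrative choices.

import random
import numpy as np

def select_action(q_table: np.ndarray, state_idx: int, epsilon: float) -> int:
    # Generate a random number; greater than epsilon means exploitation.
    if random.random() > epsilon:
        return int(q_table[state_idx].argmax())   # best known action in this state
    # Otherwise keep exploring: pick one of the actions at random.
    return random.randrange(q_table.shape[1])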
Further, step C specifically consists of updating the function Q(s, a): perform the selected action a to obtain a new state s' and a reward r. Whether the system is given a positive or a negative reward is decided by the bit error rate obtained from each run of the BP algorithm: if the bit error rate is small, a positive reward is given; if the bit error rate is large, a negative reward is given. Q(s, a) is the action value function, the expected value of the reward obtained after taking action a in state s, used to measure how good it is to take action a in the current state s; it is defined as follows:
Q(s, a) = E[ Σ_{i=0}^{∞} γ^i·r_i | s_0 = s, a_0 = a ]
where i is the step index, r_i is the reward at step i, s_0 is the initial state, a_0 is the initial action, and E[·] denotes the expectation;
then Q(s, a) is updated using the Bellman equation:
New Q(s, a) = Q(s, a) + α[R(s, a) + γ·max_{a'} Q(s', a') - Q(s, a)]
wherein α is the learning rate, determining how much of the error is learned; γ is the discount factor for future rewards; the closer γ is to 1, the more sensitive the update is to future rewards; a' represents the action taken in the new state s'.
Further, R(s, a) is a reward function defined as:
R(s, a) = E[r_i | s_i = s, a_i = a],
where i is the step index, s_i is the state at step i, and a_i is the action at step i.
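A minimal sketch of this Bellman update on the Q-table, using the reward just defined; the default values 0.6 and 0.7 follow the embodiment described later, and the function name update_q is illustrative.

import numpy as np

def update_q(q_table: np.ndarray, s: int, a: int, reward: float, s_next: int,
             alpha: float = 0.6, gamma: float = 0.7) -> None:
    # Q(s,a) <- Q(s,a) + alpha * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = q_table[s_next].max()
    q_table[s, a] += alpha * (reward + gamma * best_next - q_table[s, a])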
Compared with the prior art, the invention has the following beneficial effects: by using the Q-Learning algorithm from reinforcement learning to obtain the optimal damping factor for damped-BP, the method significantly accelerates the convergence of the damped-BP detection algorithm, improves the detection performance of large-scale MIMO, and better satisfies the high-reliability, low-latency requirements of practical communication.
Drawings
FIG. 1 is a method model diagram in an embodiment of the invention.
FIG. 2 is a graph comparing the simulated performance of the optimal damping factor trained by the method of the embodiment with that of other damping factors.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the attached drawings.
In message passing algorithms, damped message passing is a known means of increasing the convergence speed of an iterative algorithm. The damped-BP algorithm is a known improved BP algorithm for enhancing BP convergence performance, which includes the following.
In the real-valued model, the transmitted symbol x_i ∈ Ω, with Ω = {e_1, e_2, ..., e_K}, where Ω is the set of in-phase or quadrature components e_k of the original complex constellation and K is determined by the modulation order. Messages are updated back and forth between the observation nodes and the variable nodes in each iteration of the algorithm.
A. Updating rule of observation node message
The observation nodes calculate posterior messages from the channel state information and the prior messages of the adjacent nodes, and then pass these posterior messages to the variable nodes. The posterior log-likelihood ratio (LLR) of an observation node is defined as (taking e_1 as the reference symbol):
β_{j→i}^{(l)}(e_k) = ln[ p^{(l)}(x_i = e_k | y_j, H) / p^{(l)}(x_i = e_1 | y_j, H) ]
where β_{j→i}^{(l)}(e_k) is the message about e_k that the j-th observation node passes to the i-th symbol node in the l-th iteration, and p^{(l)}(x_i = e_k | y_j, H) is the posterior probability that the transmitting end actually sent the symbol x_i = e_k, given that the j-th observation node receives y_j in the l-th iteration and the channel matrix is H.
B. Update rules for variable node messages
The variable nodes calculate prior messages from the posterior messages received from the observation nodes and pass them to all observation nodes. The prior LLR of a variable node is defined analogously to the posterior LLR above. Over the whole iterative process, the prior message sent to observation node j is accumulated from the posterior messages of all other observation nodes:
α_{i→j}^{(l)}(e_k) = Σ_{m≠j} β_{m→i}^{(l-1)}(e_k)
Since the prior probabilities in each iteration sum to 1, the prior probability of each symbol can be expressed as:
p_{i→j}^{(l)}(x_i = e_k) = exp(α_{i→j}^{(l)}(e_k)) / Σ_{k'=1}^{K} exp(α_{i→j}^{(l)}(e_{k'}))
where k = 1, 2, ..., K.
In each iteration, the updated message is not passed on to the next iteration immediately; instead it is weighted-averaged with the old message of the previous iteration, and this weighted average is passed on as the updated message of the next iteration. Introducing memory into the prior probabilities of the BP iteration and taking a weighted average of the new and old prior probabilities, the update criterion of the prior probability can be expressed as a linear combination of the prior probability of the previous iteration and the current prior probability, as follows:
p^{(l)}(x_i = e_k) ← (1 - δ)·p^{(l)}(x_i = e_k) + δ·p^{(l-1)}(x_i = e_k)        (7)
wherein the damping factor δ ∈ [0, 1], the right-hand p^{(l)} is the prior probability newly computed in the l-th iteration, and p^{(l-1)} is the prior probability of the previous iteration.
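A minimal sketch of this damped update of the prior probabilities, assuming the newly computed priors and those of the previous iteration are held in NumPy arrays; weighting the previous iteration by δ is an assumption consistent with the weighted-average description above, and the names are illustrative.

import numpy as np

def damp_priors(p_new: np.ndarray, p_old: np.ndarray, delta: float) -> np.ndarray:
    # Linear combination of the current priors (p_new) and the priors of the
    # previous iteration (p_old); delta is the damping factor in [0, 1].
    return (1.0 - delta) * p_new + delta * p_old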
In the BP detection algorithm, as messages are updated and passed back and forth, the marginal probability of the correct symbol gradually increases while the marginal probabilities of the other symbols become smaller and smaller. From equation (7) it can be seen that the memory introduced by damped-BP detection affects the convergence of the algorithm to some extent. The faster the damped-BP algorithm converges, the fewer iterations are needed to reach the expected detection performance and the lower the corresponding complexity and system delay, so the damping factor should be designed to improve the convergence speed of the BP algorithm as much as possible.
A damping factor that is too large or too small seriously degrades the performance of the damped-BP detection algorithm, so finding the optimal damping factor δ is important, and the Q-Learning algorithm can achieve exactly this; the Q-Learning algorithm is therefore combined with the damped-BP algorithm to find the optimal damping factor. The magnitude of the damping factor is taken as the state in the Q-Learning algorithm, the action is set to increase or decrease the damping factor, and a Q-Table is formed. Whether the system is fed back positively or negatively is decided by the bit error rate obtained from each run of the BP algorithm: if the bit error rate is small, a positive return is given; if the bit error rate is large, a negative return is given. In this way, by reasonably setting the learning rate, discount factor and number of training episodes in the Q-Learning algorithm, the state with the largest return under a given action can be obtained; the damping factor corresponding to that state is the optimal damping factor, which completes the search for the optimal damping factor. The method comprises the following steps:
A. Initialize the Q values: establish a Q-table indexed by state (state value) and action (action value). The action is set to increase or decrease the damping factor δ, and the magnitude of the damping factor is taken as the state. All entries are initialized to 0, as shown in Table 1.
TABLE 1 Q-Table (the states are the candidate damping factors, the actions are "increase δ" and "decrease δ", and all entries are initialized to 0)
B. Select an action: an action a is selected in state s based on the current Q-value estimate Q(s, a). If every Q value is initially equal to 0, an ε-greedy strategy is adopted.
1) Specify an exploration rate ε, initialized to 1: this is the probability of taking a random step. Initially this rate should be at its maximum, because none of the values in the Q-table are known yet, which means more exploration is done by selecting actions at random.
2) Generate a random number. If it is greater than the ε value, exploitation begins, which means the known information is used to select the best action at each step. Otherwise, exploration begins.
C. Evaluation: perform action a and observe the resulting state s' and reward r; then the function Q(s, a) is updated. Performing the selected action a yields the new state s' and the reward r. Whether the system is given a positive or a negative reward is decided by the bit error rate obtained from each run of the BP algorithm: a small bit error rate gives a positive reward, and a large bit error rate gives a negative reward. Q(s, a) is the action value function (also called the Q-function), defined as follows:
Q(s, a) = E[ Σ_{i=0}^{∞} γ^i·r_i | s_0 = s, a_0 = a ]        (8)
where i is the step index, r_i is the reward at step i, s_0 is the initial state, a_0 is the initial action, and E[·] denotes the expectation.
Then Q(s, a) is updated using the Bellman equation:
New Q(s, a) = Q(s, a) + α[R(s, a) + γ·max_{a'} Q(s', a') - Q(s, a)]        (9)
where α is the learning rate, which determines how much of this error is learned; γ is the discount factor for future rewards (the aforementioned discount factor); the closer γ is to 1, the more sensitive the agent is to future rewards; a' denotes the action taken in the new state s'. R(s, a) is the reward function, defined as:
R(s, a) = E[r_i | s_i = s, a_i = a]        (10)
where i is the step index, s_i is the state at step i, and a_i is the action at step i.
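One possible way to turn the bit error rate of a damped-BP run into the reward r used above, shown as a sketch; the threshold and the reward magnitudes are illustrative choices, not values specified in this form by the patent.

def ber_reward(ber: float, ber_threshold: float = 1e-2) -> float:
    # Small bit error rate -> positive reward; large bit error rate -> negative reward.
    return 1.0 if ber < ber_threshold else -1.0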
D. Lifelong learning: steps B and C are repeated until the maximum number of learning episodes (specified by the user) is reached, or training is stopped manually.
Selecting a suitable damping factor accelerates the convergence of the BP algorithm and improves its detection performance, and the optimal damping factor can be found through training according to the proposed reinforcement-learning-assisted large-scale MIMO BP detection method. In the learning process, the learning rate α = 0.6, the discount factor γ = 0.7, and the exploration rate ε = 0.3. The parameters of the BP algorithm are set as follows: BPSK modulation; the number of transmit antennas N_t and the number of receive antennas N_r are both 16, i.e. N_t = N_r = 16; the amount of transmitted data is 16000 bits; the signal-to-noise ratio SNR = 2; and the number of iterations is 5.
The training process is divided into two steps:
step S1: training with an accuracy of 0.1 was performed. The damping factor is set as a state in a Q-Learning algorithm, the initial state is selected to be 0.3, the action is increased or decreased by 0.1, the training frequency is 100, the Q-Table is an 11 multiplied by 2 matrix, and the initial optimal damping factor with the precision of 0.1 is obtained after 100 times of training.
Step S2: training with a precision of 0.01. The initial optimal damping factor obtained in step S1 is taken as the initial state, the state space spans ±0.1 around it, the action increases or decreases the damping factor by 0.01, and the number of training episodes is 100; the Q-Table is a 21 × 2 matrix. After 100 episodes, the optimal damping factor is obtained.
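The two training stages can be driven by one small loop that reuses the pieces sketched above. The following is a minimal Python sketch under the stated parameters (α = 0.6, γ = 0.7, ε = 0.3, 100 episodes per stage); the function run_damped_bp(delta), which would run one damped-BP detection and return its bit error rate, and the other names are illustrative placeholders rather than part of the patent.

import random
import numpy as np

def train_stage(states, run_damped_bp, start_idx=0,
                episodes=100, alpha=0.6, gamma=0.7, epsilon=0.3):
    # Q-table over the given damping-factor states and the two actions
    # (0 = increase delta by one step, 1 = decrease delta by one step).
    q_table = np.zeros((len(states), 2))
    s = start_idx
    for _ in range(episodes):
        # epsilon-greedy action selection.
        a = int(q_table[s].argmax()) if random.random() > epsilon else random.randrange(2)
        s_next = min(s + 1, len(states) - 1) if a == 0 else max(s - 1, 0)
        ber = run_damped_bp(states[s_next])          # one damped-BP run with this delta
        r = 1.0 if ber < 1e-2 else -1.0              # BER-based reward (illustrative)
        q_table[s, a] += alpha * (r + gamma * q_table[s_next].max() - q_table[s, a])
        s = s_next
    # Return the damping factor of the state with the largest learned return.
    return float(states[int(q_table.max(axis=1).argmax())])

# Stage S1: precision 0.1, states 0.0 ... 1.0, initial state 0.3, 11 x 2 Q-table.
coarse_states = np.round(np.arange(0.0, 1.01, 0.1), 2)
# delta_s1 = train_stage(coarse_states, run_damped_bp, start_idx=3)

# Stage S2: precision 0.01 in a +/- 0.1 window around delta_s1, 21 x 2 Q-table.
# fine_states = np.round(np.arange(delta_s1 - 0.1, delta_s1 + 0.1 + 1e-9, 0.01), 2)
# delta_opt = train_stage(fine_states, run_damped_bp, start_idx=10)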
The optimal damping factor obtained through reinforcement learning is compared with randomly selected damping factors in terms of simulated performance in FIG. 2. It can be seen that BP detection performance is best with the damping factor δ = 0.23 obtained through reinforcement learning, which verifies that the method is effective.
The above description covers only preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments; all equivalent modifications or variations made according to the present disclosure fall within the scope of the claims.

Claims (2)

1. A reinforcement learning-assisted large-scale MIMO (multiple-input multiple-output) damped-BP detection method, characterized in that it comprises the following steps:
step A, for a large-scale MIMO system, initializing a Q value: establishing a Q-table, wherein the vertical axis is action value action, and the horizontal axis is state value;
in the step A, the action value action is set to increase or decrease the damping factor delta, the size of the damping factor is used as state, and the initial value is 0;
step B, selecting an action, namely an action value action: selecting an action a in a state s based on the current Q value estimation;
the step B comprises the following sub-steps:
1) Specify an exploration rate ε, initialized to 1: this is the probability of taking a random step; initially the rate is at its maximum, and more exploration is done by selecting actions at random;
2) Generate a random number; if it is greater than the ε value, exploitation begins, i.e. the known information is used to select the best action at each step; otherwise, exploration begins;
step C, evaluating: perform action a and observe the resulting state s' and reward r;
step C specifically consists of updating the function Q(s, a): perform the selected action a to obtain a new state s' and a reward r; whether the system is given a positive or a negative reward is decided by the bit error rate obtained from each run of the BP algorithm: if the bit error rate is small, a positive reward is given; if the bit error rate is large, a negative reward is given; Q(s, a) is the action value function, the expected value of the reward obtained after taking action a in state s, used to measure the effect of taking action a in the current state s, and is defined as follows:
Q(s, a) = E[ Σ_{i=0}^{∞} γ^i·r_i | s_0 = s, a_0 = a ]
where i is the step index, r_i is the reward at step i, s_0 is the initial state, a_0 is the initial action, and E[·] denotes the expectation;
then Q(s, a) is updated using the Bellman equation:
New Q(s, a) = Q(s, a) + α[R(s, a) + γ·max_{a'} Q(s', a') - Q(s, a)]
wherein α is the learning rate, determining how much of the error is learned; γ is the discount factor for future rewards; the closer γ is to 1, the more sensitive the update is to future rewards; a' represents the action taken in the new state s';
step D, lifelong learning: repeat step B and step C until the maximum number of learning episodes is reached, or training is stopped manually; after training, the optimal damping factor is obtained and substituted into the damped-BP algorithm to obtain the detection result;
R(s, a) is a reward function defined as:
R(s, a) = E[r_i | s_i = s, a_i = a]
where i is the step index, s_i is the state at step i, and a_i is the action at step i.
2. The reinforcement learning-assisted large-scale MIMO damped-BP detection method of claim 1, wherein in step B, if every Q value is equal to 0, an ε-greedy strategy is adopted.
CN202210892708.7A 2022-07-27 2022-07-27 Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method Active CN115242271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210892708.7A CN115242271B (en) 2022-07-27 2022-07-27 Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method


Publications (2)

Publication Number Publication Date
CN115242271A CN115242271A (en) 2022-10-25
CN115242271B true CN115242271B (en) 2023-06-16

Family

ID=83677910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210892708.7A Active CN115242271B (en) 2022-07-27 2022-07-27 Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method

Country Status (1)

Country Link
CN (1) CN115242271B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107911299A (en) * 2017-10-24 2018-04-13 浙江工商大学 A kind of route planning method based on depth Q study
CN108390705A (en) * 2018-03-29 2018-08-10 东南大学 The extensive mimo system detection method of deep neural network based on BP algorithm structure
WO2021203243A1 (en) * 2020-04-07 2021-10-14 东莞理工学院 Artificial intelligence-based mimo multi-antenna signal transmission and detection technique
CN114721397A (en) * 2022-04-19 2022-07-08 北方工业大学 Maze robot path planning method based on reinforcement learning and curiosity


Also Published As

Publication number Publication date
CN115242271A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN111464465B (en) Channel estimation method based on integrated neural network model
US11424963B2 (en) Channel prediction method and related device
CN108390705A (en) The extensive mimo system detection method of deep neural network based on BP algorithm structure
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN108494412A (en) A kind of multiple-factor amendment LDPC code interpretation method and device based on parameter Estimation
Tan et al. Improving massive MIMO belief propagation detector with deep neural network
CN107124379B (en) Orthogonal wavelet normal-modulus blind equalization method based on improved wolf pack optimization
KR20030095144A (en) Apparatus and method for correcting of forward error in high data transmission system
CN110535475B (en) Hierarchical adaptive normalized minimum sum decoding algorithm
CN113541747B (en) Large-scale MIMO detection method, device and storage medium
CN112564712B (en) Intelligent network coding method and equipment based on deep reinforcement learning
US20230254187A1 (en) Method for designing complex-valued channel equalizer
CN112686383B (en) Method, system and device for reducing distributed random gradient of communication parallelism
CN114499601A (en) Large-scale MIMO signal detection method based on deep learning
CN111917474A (en) Implicit triple neural network and optical fiber nonlinear damage balancing method
Ronca et al. Efficient PAC reinforcement learning in regular decision processes
CN115242271B (en) Reinforced learning-assisted large-scale MIMO (multiple input multiple output) -BP (back propagation) detection method
Kim et al. Semi-data-aided channel estimation for MIMO systems via reinforcement learning
CN114070331A (en) Self-adaptive serial offset list flip decoding method and system
CN109150237A (en) A kind of robust multi-user detector design method
CN113067666A (en) User activity and multi-user joint detection method of NOMA system
CN116938732A (en) Communication topology optimization method based on reinforcement learning algorithm
CN107018104B (en) Wavelet weighted multi-mode blind equalization method based on mixed monkey swarm optimization
CN114338300B (en) Pilot frequency optimization method and system based on compressed sensing
CN112488309B (en) Training method and system of deep neural network based on critical damping momentum

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant