CN114884547A

CN114884547A - Active monitoring method based on deep reinforcement learning

Info

Publication number: CN114884547A
Application number: CN202210312148.3A
Authority: CN
Inventors: 唐岚; 陈家乐
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-08-09

Abstract

The invention discloses an active monitoring method based on deep reinforcement learning and belongs to the field of communications. In massive MIMO-OFDM systems, conventional passive and active monitoring schemes become inefficient or even ineffective when the listener E and the suspicious receiver D are not covered by the same communication beam. To enable legitimate monitoring of a massive MIMO-OFDM system, the listener acts as a pseudo relay that performs beam induction and data monitoring. While the transmitter S performs beam scanning, the listener E optimizes its relay precoding matrix to induce the transmitter to select a beam that is favorable for monitoring. In the data monitoring phase, the listener E increases the monitoring rate by optimizing the relay power allocation factor and the power gain factor. Because the channel state information of the suspicious communication link is unknown, the optimal precoding matrix and power allocation factor are searched with the deep reinforcement learning algorithm MADDPG. Computer simulation verifies the validity of the proposed design.

Description

Active monitoring method based on deep reinforcement learning

Technical field

The invention belongs to the field of communications and relates to an active monitoring method based on deep reinforcement learning, and more particularly to an active monitoring method based on deep reinforcement learning in a massive MIMO-OFDM (Multiple Input Multiple Output-Orthogonal Frequency Division Multiplexing) system.

Background

MIMO-OFDM is considered a key technology for fifth-generation (5G) mobile networks. However, when 5G base stations adopt advanced beamforming, directional narrow beams make traditional monitoring methods inefficient or even ineffective. To achieve legitimate monitoring of suspicious links, it is therefore crucial to study monitoring schemes for narrow-beam scenarios.

The existing literature on monitoring falls into three categories: passive monitoring, jamming-based active monitoring, and spoofing-relay active monitoring. In passive monitoring, the listener stays silent and only listens to the data sent by the transmitter. This works only when the monitoring channel is better than the suspicious channel. To overcome this shortcoming, jamming-based active monitoring was introduced: the listener sends a jamming signal toward the suspicious receiver, forcing the transmitter to lower its rate so that the listener can decode the information. To make active monitoring more flexible, a method called spoofing relay was proposed. When the monitoring channel is better than the suspicious channel, this method maximizes the monitoring rate by disguising the listener as a relay. However, when the transmitter uses a directional beam to send information to the suspicious user, none of the above schemes allows a listener outside the beam coverage to monitor successfully.

Summary of the invention

Purpose of the invention: In view of the above defects of the prior art, the present invention studies how a listener outside the beam coverage in a MIMO-OFDM system can successfully monitor a suspicious communication link, and proposes an active monitoring method for massive MIMO-OFDM systems based on deep reinforcement learning, so that the communication data can still be monitored successfully even if the transmitter communicates with the suspicious receiver over a narrow beam.

Technical solution: an active monitoring method based on deep reinforcement learning, comprising the following steps:

(1) The transmitter S performs analog beam scanning in a time-division manner according to a predefined beam codebook;

(2) During the beam scanning stage of the transmitter S, the listener E determines the optimal beam index j^* that is favorable to itself according to its own beam quality report and the beam report fed back to the transmitter S by the receiver D;

(3) The listener E induces the transmitter S to select the optimal beam index j^* by optimizing the forwarding precoding matrix;

(4) In the communication stage after beam j^* has been determined, the listener E acts as a pseudo relay that forwards data, maintains the communication beam, and improves the data monitoring rate.

Further, step (2) includes the following steps:

1) The receiver D and the listener E each receive the beam quality measurement reference signal sent by the transmitter S and calculate the beam quality from the received signal; the receiver D forms a beam quality report and feeds it back to the transmitter S as a reference for beam selection;

2) The listener E takes into account its own beam quality report, the beam quality report that the receiver D feeds back to the transmitter S (obtained by monitoring the feedback channel), and power consumption, and finally determines the optimal beam index j^* according to a formula that trades off beam induction success rate against power consumption.

Step (3) includes the following steps:

㈠ The listener formulates an optimization problem: minimize the listener's total transmit power under the constraint of successful beam induction. The form of the optimal precoding matrix is derived from this problem, and the optimal precoding matrix is found to depend on the channel state information of the transmitter S and the receiver D;

㈡ The listener E uses the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) algorithm to train a first fitting network that determines the transmission parameters of a first forwarding matrix. It then uses the first forwarding matrix determined by these parameters to forward the beam quality measurement reference signal to the receiver D, inducing the receiver D to send an erroneous beam measurement report so that the transmitter S selects a beam favorable to the listener E.

Step (4) includes the following steps:

i The listener E receives the data transmitted by the transmitter S and formulates an optimization problem: maximize the data monitoring rate subject to successful monitoring and a transmit power below the forwarding power limit;

ii The listener uses the MADDPG algorithm to train a second fitting network that determines the power allocation factor and the power gain factor, so that part of the power is used for decoding and part for forwarding the signal. It then uses a second forwarding matrix to forward the communication data to the receiver D, maintaining the communication beam and improving the data monitoring rate.

Further, step ㈡ includes the following steps:

① Model the beam induction problem as a first multi-agent cooperative MDP (Markov Decision Process) problem;

② Based on the form of the optimal precoding matrix, convert the search for the optimal precoding matrix into a search for a pair of constants, which speeds up training; at a given time, the action on a single subcarrier consists of the angle and amplitude of the precoding matrix, and the action of all subcarriers is the set of the single-subcarrier actions;

③ At a given time, the state on a single subcarrier consists of the beam report information obtained by monitoring and analyzing the feedback channel, plus the known channel information; the global state is the union of the non-overlapping information of all subcarrier states;

④ The reward function at a given time is designed to encourage successful beam induction while penalizing excessive energy consumption.

Further, step ii includes the following steps:

I Model the data monitoring problem as a second multi-agent cooperative MDP problem;

II Based on the form of the optimal precoding matrix, convert the search for the optimal precoding matrix into a search for a pair of constants, which speeds up training; at a given time, the action of a single subcarrier consists of the power gain factor and the power allocation factor, and the action of all subcarriers is the set of the single-subcarrier actions;

III At a given time, the state on a single subcarrier consists of the signal-to-interference-plus-noise ratio obtained by monitoring and from the feedback channel information, plus the known channel information; the global state is the union of the non-overlapping information of all subcarrier states;

IV The reward at a given time is designed to encourage the subcarriers to maximize the monitoring rate under the constraints of successful monitoring and limited power.

Beneficial effects: The invention is suitable for monitoring suspicious communication links in narrow-beam massive MIMO-OFDM systems. While the transmitter determines its beam by beam scanning, the listener induces the beam selection by optimizing its precoding matrix. While the transmitter transmits data, the monitoring rate is maximized by optimizing the power allocation factor and the power gain factor. Because it is difficult for the listener to obtain the channel information between the suspicious nodes, the invention proposes a MADDPG-based learning scheme to assist beam induction and data monitoring. The proposed active monitoring method for massive MIMO-OFDM systems based on deep reinforcement learning not only effectively induces the transmitter S to select a beam favorable to the listener E, laying the foundation for the subsequent data monitoring, but also lets the listener E readjust the power splitting factor and the power gain factor to maintain the communication link and improve the data monitoring rate.

Description of drawings

Fig. 1 is a diagram of the active monitoring model in the massive MIMO-OFDM system of the present invention;

Fig. 2 shows the role of the listener E in the different transmission stages of the transmitter S (BS and DT stand for the beam scanning and data transmission stages);

Fig. 3 is a structural diagram of the transceiver of the listener E;

Fig. 4 shows the relationship between beam induction success rate and transmission power under different N_te configurations;

Fig. 5 shows the relationship between the transmit power of the listener E and the transmission power of the transmitter S under different N_te configurations;

Fig. 6 shows the average monitoring rate under different P_S and N_ts conditions;

Fig. 7 shows the average monitoring rate of various monitoring methods.

Detailed description

Building on traditional pseudo-relay monitoring, the present invention proposes a monitoring method for massive MIMO-OFDM systems in which a legitimate full-duplex relay performs beam induction and data monitoring. The invention assumes that the suspicious transmitter uses analog beamforming and selects the optimal beam vector by beam scanning. Beam induction is carried out during the beam scanning stage; its purpose is to induce the suspicious receiver to choose a beam that favors the listener. To this end, the listener acts as a relay that amplifies the measurement reference signal of the desired beam and forwards it to the suspicious receiver. At this stage, the goal is to minimize the listener's total transmit power under the constraint of successful beam induction by optimizing the listener's precoding matrix. A closed-form expression of the optimal precoding matrix is derived mathematically; it depends on the CSI (Channel State Information) of the suspicious communication pair. When the listener does not know the CSI of the suspicious pair, the invention uses a DRL (Deep Reinforcement Learning) algorithm, MADDPG (Multi-Agent Deep Deterministic Policy Gradient), to determine the transmission parameters of all subcarriers. Once beam induction is achieved, the listener can carry out data monitoring and raise the monitoring rate by continuing to act as a pseudo relay. At this stage, the power splitting factor and the power gain factor are optimized to maximize the monitoring rate. Likewise, because the listener does not know the CSI of the suspicious communication pair, the invention again uses MADDPG to optimize the listener's relay parameters.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings:

The application scenario of the present invention is shown in Fig. 1. The invention considers a legitimate monitoring system consisting of a pair of suspicious communication nodes (transmitter S and receiver D) and a legitimate listener E. The transmitter S and the receiver D are equipped with N_ts transmit antennas and N_rd receive antennas, respectively. The invention assumes that both the transmitter S and the receiver D use massive MIMO-OFDM arrays with analog beamforming to transmit and receive information. The analog beam is selected from a predefined discrete codebook, and the codebook of the transmitter S is denoted as [formula].

As a full-duplex pseudo relay, the listener E receives the signal from the transmitter S through N_re antennas and simultaneously forwards the signal to the receiver D through N_te antennas. To improve the monitoring quality, the listener E applies digital beamforming on each subcarrier. The invention assumes that all channels in the system remain constant within each RB (Resource Block) but may vary between RBs according to a Markov model.

In the scheme of the present invention, as shown in Fig. 2, for the transmitter S the whole process of each transmission block is divided into two stages: a BS (Beam Sweeping) stage and a DT (Data Transmission) stage. For the listener E, the monitoring process comprises three stages: beam selection, beam induction, and deceptive data forwarding. In the beam selection stage, the listener obtains beam quality information by monitoring the feedback channel. Specifically, when the transmitter S transmits with the beamforming vector [formula], the signals received by the receiver D and the listener E on the k-th subcarrier can be expressed as

[formula]

and

[formula]

where s_k is the signal sent by the transmitter and [formula] denotes expectation; f_j is the beamforming vector with |f_j(n)| = 1, n = 1, ..., N_ts; j is the beam index; [formula] and [formula] are, respectively, the transmit power on the k-th subcarrier, the channel matrix between the transmitter S and the receiver D, and the channel matrix between the transmitter S and the listener E; and [formula] denotes the matrix dimension. [formula] and [formula] are zero-mean additive white Gaussian noise with covariance matrix σ²I. At the receiver D, an analog beamformer [formula] processes the received signal, with ‖v_D‖² = N_rd, where ‖·‖ denotes the norm of a vector or the Frobenius norm of a matrix. At the listener E, the received signal on the k-th subcarrier is processed by a digital beamformer [formula]. During the BS stage, the SNR (Signal to Noise Ratio) at the k-th subcarrier of the receiver D and of the listener E is

[formula]

and

[formula]

The receiver D calculates the average SNR over all subcarriers, [formula], where K is the number of subcarriers, selects from [formula] the J beams with the largest [formula] as candidate beams, and feeds [formula] and the corresponding indices back to S, where J is the maximum number of beams that can be fed back. The invention assumes that the listener E can obtain this feedback by monitoring the feedback channel between the transmitter S and the receiver D. When the beam selected by the transmitter S results in a low [formula], it is difficult for the listener E to monitor the information transmitted by the transmitter S; the listener E therefore acts as a pseudo relay to induce the beam selection of the transmitter S. For the listener E, the ideal beam should provide a high SNR for both the listener E and the receiver D, because a beam with a low [formula] consumes more of the listener E's forwarding power. The listener E therefore determines the desired optimal beam index according to

[formula]

where δ is a trade-off factor that balances the listener's monitoring success rate and power consumption.
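The beam report formed at the receiver D can be illustrated with a short sketch. It assumes that the per-subcarrier SNRs are available on a linear scale as an array of shape (number of beams, K); the function name `beam_report` and the toy values are hypothetical and are not taken from the patent. The trade-off rule involving δ is not reproduced here because its formula is not available in the text.

```python
import numpy as np

def beam_report(snr, num_report):
    """Return the indices and average SNRs of the num_report best beams.

    snr: array of shape (num_beams, K), per-subcarrier SNR of each beam
         (linear scale, hypothetical input layout).
    """
    avg = snr.mean(axis=1)                      # average over the K subcarriers
    order = np.argsort(avg)[::-1][:num_report]  # beams with the largest averages first
    return order, avg[order]

# Toy example: 4 beams, K = 2 subcarriers, J = 2 reported beams.
snr = np.array([[1.0, 3.0],
                [4.0, 6.0],
                [0.5, 0.5],
                [2.0, 3.0]])
idx, vals = beam_report(snr, num_report=2)
print(idx, vals)
```

In the scenario described above, the listener E would read such a report from the feedback channel rather than compute it itself.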

After determining the desired optimal beam index j^*, the listener E induces the transmitter S to select j^* during the following BS stage. In the DT stage, the transmitter S transmits communication data to the receiver D, and the listener E acts as an AF (Amplify and Forward) spoofing relay that monitors the data while forwarding it.

As shown in Fig. 3, [formula], α_k and g_k denote, respectively, the receive beamforming vector, the transmit beamforming vector, the power allocation factor and the power gain factor of the listener E on subcarrier k. In the beam induction stage, the received signal is amplified and forwarded with α_k = 1, i.e., no information needs to be decoded. In the data forwarding stage, the received signal power is split into two parts, one for decoding and one for forwarding. The following analyzes how to optimize [formula] to achieve the maximum monitoring rate.

During the beam scan of the transmitter S, the listener E amplifies and forwards the pilot signal that the transmitter S sends to measure the beam quality between the transmitter S and the receiver D. The invention assumes that the delay of the AF relay used by the listener is much smaller than the symbol duration and can therefore be ignored. Because the listener E operates in full duplex, its received signal on subcarrier k is

[formula (6)]

where [formula] is the self-interference channel, [formula] is the precoding matrix of the listener E, and [formula] is the signal received by the listener at the previous time instant. From (6) it can be seen that if W_k lies in the null space of [formula], then [formula], so that in (6) [formula]. Let [formula] be the right singular matrix corresponding to the zero singular values of [formula]; the precoding matrix can then be written as

[formula (7)]

where [formula] is the new matrix to be optimized. To guarantee r₀ > 0, the invention requires N_te > N_re, i.e., the listener E needs more transmit antennas than receive antennas to suppress self-interference. After the self-interference is eliminated, the transmitted signal [formula] can be expressed as [formula], the transmit power [formula] of the listener E is [formula], and the signal received by the receiver D after receive beamforming can be expressed as

[formula (8)]

where [formula] is the channel matrix between the listener E and the receiver D on the k-th subcarrier, and [formula] and [formula] are newly constructed equivalent channels. The received SNR of the receiver D on the k-th subcarrier can then be expressed as

[formula (9)]
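The null-space construction of the precoding matrix described above can be checked numerically. The sketch below is only an illustration under assumed notation: `h_ee` stands for the self-interference channel (N_re × N_te), `v0` for the right singular vectors belonging to its zero singular values, and `w_tilde` for the reduced matrix that remains to be optimized. These names, the dimensions and the random test channel are hypothetical.

```python
import numpy as np

def null_space_basis(h_ee, tol=1e-10):
    """Right singular vectors of h_ee that belong to (numerically) zero singular values."""
    _, s, vh = np.linalg.svd(h_ee, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))   # numerical rank of the self-interference channel
    return vh[rank:].conj().T               # shape (N_te, N_te - rank)

rng = np.random.default_rng(0)
n_re, n_te = 2, 4                            # more transmit than receive antennas
h_ee = rng.standard_normal((n_re, n_te)) + 1j * rng.standard_normal((n_re, n_te))

v0 = null_space_basis(h_ee)                  # basis of the null space of h_ee
w_tilde = rng.standard_normal((v0.shape[1], n_re)) + 1j * rng.standard_normal((v0.shape[1], n_re))
w = v0 @ w_tilde                             # precoder confined to the null space
residual = np.linalg.norm(h_ee @ w)          # self-interference left after precoding
print(v0.shape, residual)
```

With N_te = N_re and a full-rank channel the basis would be empty, which matches the requirement N_te > N_re stated in the text.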

To induce the transmitter S to select the optimal beam index j^* with minimum transmit power, the problem can be formulated as

[formula (10)]

where [formula] and [formula]. To obtain a closed-form solution for [formula], the invention decomposes problem (10) into K independent subproblems by letting [formula], where [formula]. The K subproblems can therefore be formulated as

[formula (11)]

To solve (11), the invention first proves a lemma: the optimal solution of problem (11) can be expressed as

[formula (12)]

where [formula].

The lemma is proved as follows:

For brevity, the subcarrier index k is omitted in the following proof. To prove the lemma, assume that a feasible solution of (11) is [formula], where [formula] and w′ is the amplitude parameter of the feasible solution; the power consumption P(W′) corresponding to this feasible solution is then

[formula]

Next, construct the matrix [formula], where [formula], in which the inequality follows from the fact that ‖AB‖ ≤ ‖A‖‖B‖ for arbitrary matrices (vectors) A and B. It is shown below that the new matrix W″ is not only feasible for problem (11) but also attains an objective value smaller than P(W′). Let [formula]. Substituting W″ into the numerator and denominator of β_k in (9) gives

[formulas (13) and (14)]

where (13) and (14) follow from the triangle inequality. From (13) and (14) it follows that β(W″) ≥ β(W′) ≥ β_D, which shows that W″ is feasible for problem (11). Substituting W″ into the objective function of (11) gives

[formula (15)]

where (15) follows from the Cauchy-Schwarz inequality [formula]. In summary, for any solution W′ of problem (11), another [formula] with a smaller objective value can always be constructed, which proves the lemma.

Substituting (12) into the objective function of (11) shows that [formula] is an increasing function of w_k. The only unknown variable w_k in (12) can therefore be found by gradually increasing w_k from a small value until the constraint in (11) is satisfied. Because the lemma holds for any given [formula], the optimal solution of (10) has the same form as (12). In theory, the optimal solution of (10) can be obtained by considering all possible combinations [formula]. For a given [formula], the solution of (11) provides an upper bound on the solution of (10). The listener E can adopt the precoding matrix in (12) only if it knows all the channels [formula]. The invention assumes that the listener E can obtain the equivalent channel vector [formula] by monitoring the pilot signal. However, because the transmitter S and the listener E do not cooperate, it is difficult to obtain [formula]. The invention therefore adjusts [formula] through DRL, according to the feedback β and β_D, so as to minimize P_E. With the MADDPG learning framework, [formula] is determined in real time to induce the transmitter S to select the beam desired by the listener E. Finally, W_k in (7) can be expressed as the product of a column vector [formula] and a row vector [formula], as shown in Fig. 3.
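The search for w_k described above (increase w_k from a small value until the constraint is met) can be sketched as follows. The callable `snr_at_d` is a hypothetical stand-in for the SNR expression of the receiver D, which is not reproduced in the text; the toy model, the step size `growth` and the bounds are assumptions chosen only so that the example runs.

```python
def smallest_feasible_gain(snr_at_d, target, w_start=1e-3, growth=1.05, w_max=1e3):
    """Increase w geometrically from w_start until snr_at_d(w) >= target.

    Returns the first tested w that meets the target, or None if w_max is exceeded.
    Because the transmit power grows with w, the first feasible w on the grid is
    also the cheapest one on that grid.
    """
    w = w_start
    while w <= w_max:
        if snr_at_d(w) >= target:
            return w
        w *= growth
    return None

# Hypothetical SNR model: direct path plus relayed path over noise that grows with w.
def toy_snr(w):
    return (0.5 + 2.0 * w) ** 2 / (1.0 + 0.1 * w ** 2)

w_found = smallest_feasible_gain(toy_snr, target=4.0)
print(w_found, toy_snr(w_found))
```

A bisection search would reach the same point with fewer evaluations whenever the SNR is monotone in w_k; the linear sweep is shown because it mirrors the wording of the text.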

Successful beam induction does not guarantee successful monitoring. During the data transmission stage, if the listener E does not forward the data to the receiver D, the bit error rate at the receiver D may exceed the threshold and trigger the beam recovery procedure, which switches the beam. Therefore, to let the listener E both relay and monitor data under AF relay operation, the received signal [formula] is split into two parts: one is forwarded to increase the SNR of the receiver D, and the other is decoded to monitor the message sent by the transmitter S. Because α_k is introduced, the power gain factor w_k in [formula] needs to be re-optimized. Defining [formula] as the normalized beamforming vector, the transmitted signal [formula] of the listener E is expressed as

[formula]

where g_k is the power gain factor, which controls the transmit power in the data monitoring stage, and α_k is the power allocation factor. Note that [formula] and [formula] are the same as in beam induction, because in both stages the aim is to improve the SNR of the receiver D. Similar to (8), the received signal of the receiver D in the data transmission stage can be written as

[formula]

For given [formula] and [formula], the received SNRs of the receiver D and the listener E can be calculated as [formula] and [formula]. The goal of the listener E is then to optimize [formula] so that the monitoring rate is maximized under the transmit power constraint. The optimization problem can therefore be expressed as

[formula (18)]

where [formula] and P_M are the total transmit power and the power constraint of the listener E, respectively. The invention assumes that the listener E can monitor only when R_E ≥ R_D, and that the corresponding monitoring rate is R_D. If the listener E knew the global CSI, the solution of (18) could be derived with the Lagrange multiplier method. However, when [formula] is unknown, the optimal [formula] cannot be obtained. A more reasonable assumption than knowing [formula] is that [formula] can be obtained by monitoring the uplink control channel between the transmitter S and the receiver D. DRL is therefore used to determine [formula], taking [formula] as the observed state and interacting with the system. Training a neural network with MADDPG yields [formula] in real time, which improves the monitoring rate of the listener E under a controllable transmit power.
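The monitoring condition stated above (monitoring succeeds only when R_E ≥ R_D, with monitoring rate R_D) can be written as a small helper. Two points are assumptions rather than statements of the patent: the rates are computed with the Shannon formula log2(1 + SNR), and a failed attempt is counted as rate 0. The function name is hypothetical.

```python
import math

def monitoring_rate(snr_e, snr_d):
    """Monitoring rate for one subcarrier, given linear-scale SNRs at E and D."""
    r_e = math.log2(1.0 + snr_e)   # rate the listener E can decode (assumed Shannon rate)
    r_d = math.log2(1.0 + snr_d)   # rate of the suspicious link S -> D
    # E can follow the link only if it can decode at least as fast as D.
    return r_d if r_e >= r_d else 0.0

print(monitoring_rate(15.0, 3.0))  # E decodes faster than D, so the rate equals R_D
print(monitoring_rate(1.0, 3.0))   # E is too weak, monitoring fails
```

This makes the coupling visible: raising the forwarded power raises R_D, but the gain only counts while R_E stays at or above R_D.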

Based on the above analysis, when the CSI between the transmitter S and the receiver D is unknown, the beam induction and data monitoring problems are formulated as MDP (Markov Decision Process) problems. The most intuitive deep learning solution is to treat all subcarriers as one agent and obtain the policy through the Actor-Critic network of a single DDPG. In practice, however, training one policy with a large action space is usually harder than training several policies with small action spaces. In both stages, the invention therefore treats each subcarrier as a separate agent, and the agents cooperate to achieve a common goal. Accordingly, the invention adopts the MADDPG learning architecture, which comprises K Actors (policies) and one centralized Critic (value function). During training, the Actors and the Critic are updated with global data, including the global state, the shared reward and all actions, which are defined below.
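The data flow of this architecture (K Actors acting on local observations, one Critic scoring the global state together with all actions) can be sketched without any learning code. The following is a structural illustration only: the linear maps stand in for neural networks, the global state is formed by plain concatenation as a simplification, and replay buffers, target networks and gradient updates are omitted. All class names and dimensions are hypothetical.

```python
import numpy as np

class LinearActor:
    """One policy per subcarrier: maps a local observation to a bounded action."""
    def __init__(self, obs_dim, act_dim, rng):
        self.weights = 0.1 * rng.standard_normal((act_dim, obs_dim))

    def act(self, obs):
        return np.tanh(self.weights @ obs)   # actions bounded in [-1, 1]

class CentralCritic:
    """Single value function that sees the global state and the joint action."""
    def __init__(self, state_dim, joint_act_dim, rng):
        self.weights = 0.1 * rng.standard_normal(state_dim + joint_act_dim)

    def value(self, global_state, joint_action):
        return float(self.weights @ np.concatenate([global_state, joint_action]))

rng = np.random.default_rng(0)
K, obs_dim, act_dim = 4, 3, 2                # 4 subcarriers, 2 action values each
actors = [LinearActor(obs_dim, act_dim, rng) for _ in range(K)]
critic = CentralCritic(K * obs_dim, K * act_dim, rng)

local_obs = rng.standard_normal((K, obs_dim))
joint_action = np.concatenate([a.act(o) for a, o in zip(actors, local_obs)])
q_value = critic.value(local_obs.reshape(-1), joint_action)
print(joint_action.shape, q_value)
```

Each Actor only needs its own observation at execution time, while the Critic is used during training, which is the usual reason for choosing centralized training with decentralized execution.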

The beam induction problem is modeled as a first multi-agent cooperative MDP problem. Based on the form of the optimal precoding matrix, the problem of finding the optimal precoding matrix is transformed into the problem of finding a pair of constants (w_k, θ_k), which speeds up the training process. Here θ_k is the MADDPG algorithm's estimate of

At time t, the action of the k-th subcarrier is denoted by

Therefore, the actions of all subcarriers are

At time t, the state on each subcarrier k is

β and β_D are obtained by listening to and analyzing the beam reports on the feedback channel. The global state s_t is the union of the non-overlapping information in all subcarrier states

that is,

The reward r_t at time t is defined as r_t = -a₁P_E - a₂(β - β_D - B)² + a₃I(β, β_D), where

are positive coefficients that balance the listener's induction success rate against its power consumption, B is a constant used to increase the probability of selecting the best beam index j^*, and I(x, y) is a Boolean function with I(x, y) = 1 when x ≥ y and I(x, y) = 0 otherwise. The reward function encourages successful beam induction while penalizing actions that consume too much energy.
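The beam induction reward defined above can be written out directly. The following is a small sketch; the function and argument names are chosen here for illustration, and the numeric values in the usage note are arbitrary, not taken from the simulations.

```python
def indicator(x: float, y: float) -> int:
    """Boolean function I(x, y): 1 when x >= y, otherwise 0."""
    return 1 if x >= y else 0


def beam_induction_reward(p_e: float, beta: float, beta_d: float,
                          a1: float, a2: float, a3: float, b: float) -> float:
    """r_t = -a1*P_E - a2*(beta - beta_D - B)^2 + a3*I(beta, beta_D).

    a1, a2, a3 are positive weighting coefficients and b is the constant B.
    """
    power_penalty = a1 * p_e
    margin_penalty = a2 * (beta - beta_d - b) ** 2
    success_bonus = a3 * indicator(beta, beta_d)
    return -power_penalty - margin_penalty + success_bonus
```

For example, with a1 = 0.5, a2 = 1.0, a3 = 10.0, B = 1.0 and P_E = 2.0, a successful step with β = 5.0 and β_D = 3.0 yields a reward of 8.0, while a failed step with β = 2.0 and β_D = 3.0 yields -5.0.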

The data monitoring problem is modeled as a second multi-agent cooperative MDP problem. Based on the form of the optimal precoding matrix, the problem of finding the optimal precoding matrix is transformed into the problem of finding a pair of constants (g_k, α_k), where g_k and α_k denote the listener's power gain factor and power allocation ratio on subcarrier k, respectively, which speeds up the training process. At time t, the action of the k-th subcarrier is

Therefore, the actions of all subcarriers are

At time t, the state on each subcarrier k is

where

is obtained through monitoring and channel feedback information. The global state s_t is the union of the non-overlapping information in all subcarrier states

that is,

The reward r_t at time t is defined as

where

are positive coefficients that balance the relay's listening rate against its power consumption, and C is a constant used to raise the listening rate. The reward function encourages the subcarriers to maximize R_D under the constraints R_E > R_D and P_E ≤ P_M.

Figure 4 shows the beam induction success rate versus the transmit power P_S for different N_te in the beam induction stage. The present invention allocates the same transmit power to each subcarrier, namely

The induction rate is obtained by counting the number of trials in which β ≥ β_D over 10⁵ Monte Carlo simulations. In the passive method, listener E remains silent while transmitter S performs beam scanning. In this case, when listener E and receiver D are far apart, receiver D selects the best beam index j^* only with low probability. The results show that the success rate of the proposed MADDPG-based method is close to 100%. These results verify the effectiveness of the method under different system configurations.
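The counting procedure behind the induction rate can be sketched as below. The estimator follows the description above (the fraction of trials with β ≥ β_D); the `toy_trial` sampler is a purely hypothetical stand-in for the channel simulation, which is not specified here.

```python
import random


def induction_success_rate(sample_trial, num_trials: int, seed: int = 0) -> float:
    """Fraction of Monte Carlo trials in which beta >= beta_D.

    sample_trial(rng) must return one (beta, beta_D) pair per call.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_trials):
        beta, beta_d = sample_trial(rng)
        if beta >= beta_d:
            successes += 1
    return successes / num_trials


def toy_trial(rng: random.Random):
    # Hypothetical sampler: two noisy beam-quality readings, with the
    # induced beam given a fixed 1.0 advantage. Not the patent's channel model.
    return rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)
```

Calling `induction_success_rate(toy_trial, 100000)` mirrors the 10⁵ trials mentioned in the text; the resulting value depends entirely on the assumed sampler.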

Figure 5 shows the relationship between P_E and P_S under different N_te configurations in the beam induction stage. In the optimal scheme, P_E is the optimal objective value of (10), computed with known

In the MADDPG-based scheme, P_E is computed using the parameters learned by MADDPG. It can be seen that although P_E increases with P_S, a larger N_te effectively reduces P_E. Combining Figures 3 and 4, it can be seen that even when

is unknown, the present invention can still achieve beam induction with the beam induction policy learned by MADDPG, at a transmit power only slightly higher than the theoretical minimum.

As shown in Figure 6, in the data monitoring stage, the optimal solution is obtained by solving (18). The passive method with SBM (Successful Beam Misleading) means that the agent achieves beam induction in the BS phase but remains silent in the DT phase. Figure 6 shows that, after successful induction, the listening rate increases with P_S or N_ts, and the present invention can guarantee R_E ≥ R_D by adjusting the transmission parameters. In addition, the listening rate of the proposed method is close to the optimal solution and clearly better than that of the passive monitoring method with SBM.

Figure 7 compares several monitoring schemes under different power constraints P_M, with the conventional active jamming scheme plotted for comparison. The results show that the listening rate achieved by the proposed MADDPG scheme is close to the optimal solution and increases with P_M. When P_M > 55 dBm,

, the listening rate is close to the maximum R_E. The performance of passive monitoring without SBM is independent of the transmit power of listener E, and passive monitoring with SBM outperforms the method without SBM. The average eavesdropping rate of the jamming scheme is limited by the power constraint of listener E, because it cannot guarantee R_E ≥ R_D when the power limit P_M is relatively low.

The simulations show that the proposed active monitoring method for massive MIMO-OFDM systems based on deep reinforcement learning can effectively induce transmitter S to select a beam that is favorable to listener E, which lays the foundation for the subsequent data monitoring process. It also allows listener E to readjust the power allocation factor and the power gain factor, effectively maintaining the communication link and improving the data monitoring rate. Together, the two stages make it possible to monitor a massive MIMO-OFDM system that uses narrow-beam communication.

Claims

1. An active monitoring method based on deep reinforcement learning, comprising the following steps:

(1) The transmitter S performs analog beam scanning in a time-division manner according to the beam precodebook;

(2) While transmitter S performs beam scanning, listener E determines the optimal beam index j^* that is favorable to itself from its own beam quality report and the beam report that receiver D feeds back to transmitter S;

(3) Listener E induces transmitter S to select the optimal beam index j^* by optimizing the forwarding precoding matrix;

(4) In the communication stage after the optimal beam index j^* is determined, listener E acts as a pseudo relay that forwards data, maintaining the communication beam and improving the data monitoring rate.

2. The active monitoring method based on deep reinforcement learning according to claim 1, characterized in that step (2) comprises the following steps:

1) Receiver D and listener E each receive the beam quality measurement reference signal sent by transmitter S and calculate the beam quality from the received signal; receiver D forms a beam quality report and feeds it back to transmitter S as a reference for beam selection;

2) Listener E, based on its own beam quality report and the beam quality report that receiver D feeds back to transmitter S, which it obtains by monitoring, and taking power consumption into account, finally determines the optimal beam index j^* according to a trade-off formula between beam induction success rate and power consumption.

3. The active monitoring method based on deep reinforcement learning according to claim 1, characterized in that step (3) comprises the following steps:

(1) Listener E formulates an optimization problem: minimize the total transmit power of the listener under the constraint of successful beam induction; the form of the optimal precoding matrix is derived from this optimization problem, and the optimal precoding matrix and the channel state information of transmitter S and receiver D are obtained;

(2) Listener E uses the MADDPG algorithm to train a first fitting network that determines the transmission parameters of the first forwarding matrix, and then uses the first forwarding matrix determined by these transmission parameters to forward the beam quality measurement reference signal to receiver D, inducing receiver D to send an erroneous beam measurement report so that transmitter S selects a beam that is favorable to listener E.

4. The active monitoring method based on deep reinforcement learning according to claim 1, characterized in that step (4) comprises the following steps:

i. Listener E receives the data sent by transmitter S and formulates an optimization problem: maximize the data monitoring rate subject to successful monitoring and to the transmit power being below the upper limit of the forwarding power;

ii. Listener E uses the MADDPG algorithm to train a second fitting network that determines the power allocation factor and the power gain factor, so that part of the power is used for decoding and part for forwarding the signal, and then uses the second forwarding matrix to forward the communication data to receiver D, in order to maintain the communication beam and improve the data monitoring rate.

5. The active monitoring method based on deep reinforcement learning according to claim 3, characterized in that step (ii) comprises the following steps:

① Model the beam induction problem as a first multi-agent cooperative MDP problem;

② Based on the form of the optimal precoding matrix, transform the problem of finding the optimal precoding matrix into the problem of finding a pair of constants, thereby speeding up the training process; at a given time, the action on a single subcarrier is the angle and amplitude of the precoding matrix, so the actions of all subcarriers are the set of the single-subcarrier actions;

③ At a given time, the state on a single subcarrier is the beam report information obtained by monitoring and analyzing the feedback channel, plus the known channel information, and the global state is the union of the non-overlapping information of all subcarrier states;

④ The reward function at a given time is designed to encourage successful beam induction while penalizing actions that consume too much energy.

6. The active monitoring method based on deep reinforcement learning according to claim 4, characterized in that step ii comprises the following steps:

I. Model the data monitoring problem as a second multi-agent cooperative MDP problem;

II. Based on the form of the optimal precoding matrix, transform the problem of finding the optimal precoding matrix into the problem of finding a pair of constants, thereby speeding up the training process; at a given time, the action of a single subcarrier is the power gain factor and the power allocation factor, so the actions of all subcarriers are the set of the single-subcarrier actions;

III. At a given time, the state on a single subcarrier is the signal-to-interference-plus-noise ratio obtained through monitoring and channel feedback information, plus the known channel information, and the global state is the union of the non-overlapping information of all subcarrier states;

IV. The reward at a given time is designed to encourage the subcarriers to maximize the listening rate subject to successful monitoring and the power constraint.