CN116755046B

CN116755046B - Multifunctional radar interference decision-making method based on imperfect expert strategy

Info

Publication number: CN116755046B
Application number: CN202311029543.1A
Authority: CN
Inventors: 周峰; 田甜; 李建鑫; 刘磊; 樊伟伟
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2023-08-16
Filing date: 2023-08-16
Publication date: 2023-11-14
Anticipated expiration: 2043-08-16
Also published as: CN116755046A

Abstract

The invention relates to a multifunctional radar interference decision-making method based on imperfect expert strategy, which includes the steps of: obtaining radar status; when judging that the radar status is inconsistent with the radar target status, using an expert intervention discriminant function module to judge whether the radar status belongs to the expert decision-making error status set ; When it is judged that the radar status belongs to the expert decision-making error state set, the main decision-making network in the interference decision-making network is used to select the interference pattern; when the radar status is judged not to belong to the expert decision-making error state set, the expert interference exploration function module is used to judge based on the radar status. Whether the expert strategy participates in the interference decision-making; when it is judged that the expert strategy participates in the interference decision-making, the expert decision-making network module is used to select the interference style; when it is judged that the expert strategy does not participate in the interference decision-making, the main decision-making network in the interference decision-making network is used to select the interference style. This method effectively improves the learning efficiency and decision-making accuracy of the interference decision-making algorithm, and reduces the trial-and-error cost of adversarial games.

Description

A multifunctional radar interference decision-making method based on imperfect expert strategy

技术领域Technical field

本发明属于雷达技术领域，具体涉及一种不完美专家策略的多功能雷达干扰决策方法。The invention belongs to the field of radar technology, and specifically relates to a multifunctional radar interference decision-making method based on imperfect expert strategies.

背景技术Background technique

随着数字化技术和人工智能技术在雷达领域的广泛应用，具备多任务、多波形的多功能雷达已成为现代微波探测的重要手段。然而，对于被探测的一方而言，雷达智能化高、灵活性强的优势给被探测的目标和区域则带来更加严重的威胁和挑战。干扰机难以灵活快速的采取有效的干扰措施，从而处于不利地位。干扰决策作为雷达博弈对抗中的重要一环，是其能否实施有效干扰的关键。因此，研究具备自适应特性的、智能化的干扰决策方法来应对具备认知能力的雷达，对于提升干扰效能具有重大意义。目前，结合强化学习进行智能干扰决策是雷达领域的研究热点之一，也被认为是解决多功能雷达干扰难题的有效手段之一。近年来，随着深度学习领域技术的不断发展，结合深度学习和强化学习优势的深度强化学习在智能控制决策领域表现出色，广泛应用于机器人控制、自动驾驶、游戏AI、自然语言处理等领域。因此，许多学者将强化学习引入雷达干扰决策领域，提出了各种基于强化学习的干扰决策方法。With the widespread application of digital technology and artificial intelligence technology in the field of radar, multi-functional radar with multi-tasks and multi-waveforms has become an important means of modern microwave detection. However, for the party being detected, the advantages of radar's high intelligence and flexibility bring more serious threats and challenges to the targets and areas being detected. It is difficult for jammers to take effective jamming measures flexibly and quickly, so they are at a disadvantage. As an important part of radar game confrontation, interference decision-making is the key to whether it can implement effective interference. Therefore, studying adaptive and intelligent interference decision-making methods to deal with radars with cognitive capabilities is of great significance for improving interference effectiveness. At present, intelligent interference decision-making combined with reinforcement learning is one of the research hotspots in the radar field, and is also considered to be one of the effective means to solve the problem of multi-functional radar interference. In recent years, with the continuous development of technology in the field of deep learning, deep reinforcement learning, which combines the advantages of deep learning and reinforcement learning, has performed well in the field of intelligent control decision-making and is widely used in fields such as robot control, autonomous driving, game AI, and natural language processing. Therefore, many scholars have introduced reinforcement learning into the field of radar interference decision-making and proposed various interference decision-making methods based on reinforcement learning.

李云杰等人首次将强化学习与雷达对抗过程相结合，为雷达干扰决策方法提供了新思路。张伯开等人针对雷达状态增多导致基于Q-learning的干扰决策效率降低的问题，提出基于DQN(Deep Q Network，深度Q网络)的干扰决策方法，实现了多种雷达状态的干扰决策。李慧琴等人利用模拟退火算法和带热重启的随机梯度下降思想改进Q-learning算法，提高了干扰策略的探索和利用率，获得了更快的收敛效果。邹玮琦等人引入A3C(Asynchronous Advantage Actor-Critic，异步优势动作评价算法)到干扰决策领域，提高了决策时间效率。朱霸坤等人针对Q-learning算法收敛速度过慢的问题，从引入先验知识和结合Dyna架构两个角度改进，提升了算法的收敛速度。刘洪迪等人提出了两级干扰决策算法，实现了干扰样式与干扰参数的策略优化问题。李永锋通过分析样本抽样方法，提出基于监督抽样的DDQN(Double DQN，双深度Q网络)干扰决策方法，提高了决策的效率与稳定。刘松涛等人进一步分析DQN算法，提出基于D3QN(Dueling Double Deep Q Network，对决双深度Q网络)的雷达干扰决策算法，改善了干扰决策的效率和精度。Li Yunjie and others combined reinforcement learning with the radar countermeasures process for the first time, providing new ideas for radar interference decision-making methods. Zhang Bokai and others proposed an interference decision-making method based on DQN (Deep Q Network) to solve the problem of reduced efficiency of interference decision-making based on Q-learning due to the increase in radar states, which realizes interference decision-making for multiple radar states. Li Huiqin and others used the simulated annealing algorithm and the idea of stochastic gradient descent with hot restart to improve the Q-learning algorithm, improve the exploration and utilization of interference strategies, and achieve faster convergence effects. Zou Weiqi and others introduced A3C (Asynchronous Advantage Actor-Critic, asynchronous advantage action evaluation algorithm) into the field of interference decision-making, improving decision-making time efficiency. Zhu Bakun and others aimed at the problem of the slow convergence speed of the Q-learning algorithm and improved it from the two perspectives of introducing prior knowledge and combining the Dyna architecture to improve the convergence speed of the algorithm. Liu Hongdi and others proposed a two-level interference decision-making algorithm to realize the strategy optimization problem of interference patterns and interference parameters. By analyzing the sample sampling method, Li Yongfeng proposed the DDQN (Double DQN, Double Deep Q Network) interference decision-making method based on supervised sampling, which improved the efficiency and stability of decision-making. Liu Songtao and others further analyzed the DQN algorithm and proposed a radar interference decision-making algorithm based on D3QN (Dueling Double Deep Q Network), which improved the efficiency and accuracy of interference decision-making.

然而，随着雷达状态和干扰动作空间的增大，导致传统Q-learning算法的计算与存储都存在巨大压力，算法复杂度呈指数型增长，这将严重影响干扰机进行决策的时效性，导致干扰机决策的效率较低；另外，现有的基于强化学习的智能干扰机决策学习方法，仍然存在学习效率低、对抗博弈试错成本高的问题。However, as the radar status and jamming action space increase, there is huge pressure on the calculation and storage of the traditional Q-learning algorithm, and the algorithm complexity increases exponentially. This will seriously affect the timeliness of the jammer's decision-making, resulting in The efficiency of jammer decision-making is low; in addition, the existing intelligent jammer decision-making learning method based on reinforcement learning still has the problems of low learning efficiency and high trial-and-error cost in adversarial games.

发明内容Contents of the invention

为了解决现有技术中存在的上述问题，本发明提供了一种不完美专家策略的多功能雷达干扰决策方法。本发明要解决的技术问题通过以下技术方案实现：In order to solve the above-mentioned problems existing in the prior art, the present invention provides a multifunctional radar interference decision-making method based on imperfect expert strategy. The technical problems to be solved by the present invention are achieved through the following technical solutions:

本发明实施例提供了一种不完美专家策略的多功能雷达干扰决策方法，所述方法基于多功能雷达干扰决策模型，利用训练好的决策网络根据雷达状态选择干扰样式以对雷达实施干扰，使雷达产生新的雷达状态；所述训练好的决策网络包括专家决策网络模块、干扰决策网络、专家干预判别函数模块和专家干扰探索函数模块；所述方法包括步骤：The embodiment of the present invention provides a multifunctional radar interference decision-making method based on imperfect expert strategy. The method is based on a multifunctional radar interference decision-making model and utilizes a trained decision-making network to select an interference pattern according to the radar status to interfere with the radar. The radar generates a new radar state; the trained decision-making network includes an expert decision-making network module, an interference decision-making network, an expert intervention discriminant function module and an expert interference exploration function module; the method includes the steps:

获取雷达状态；Get radar status;

当判断所述雷达状态与雷达目标状态不一致，利用所述专家干预判别函数模块判断所述雷达状态是否属于专家决策失误状态集；When it is determined that the radar state is inconsistent with the radar target state, the expert intervention discriminant function module is used to determine whether the radar state belongs to the expert decision-making error state set;

当判断所述雷达状态属于专家决策失误状态集，则利用所述干扰决策网络中的主决策网络选择干扰样式；当判断雷达状态不属于专家决策失误状态集，则根据所述雷达状态，利用所述专家干扰探索函数模块判断专家策略是否参与干扰决策；When it is judged that the radar state belongs to the expert decision-making error state set, then the main decision-making network in the interference decision-making network is used to select the interference pattern; when it is judged that the radar state does not belong to the expert decision-making error state set, then according to the radar state, the interference pattern is selected The expert interference exploration function module determines whether the expert strategy participates in interference decision-making;

当判断专家策略参与干扰决策，则利用所述专家决策网络模块选择干扰样式；当判断专家策略不参与干扰决策，则利用所述干扰决策网络中的主决策网络选择干扰样式。When it is judged that the expert strategy participates in the interference decision-making, the expert decision-making network module is used to select the interference pattern; when it is judged that the expert strategy does not participate in the interference decision-making, the main decision-making network in the interference decision-making network is used to select the interference pattern.

在本发明的一个实施例中，所述多功能雷达干扰决策模型为：In one embodiment of the present invention, the multifunctional radar interference decision model is:

<S,J,P,R,T>；<S,J,P,R,T>;

其中，S为雷达状态空间，S＝{s₁,s₂,…,s_N}，s_n,n＝1,2，...，N为S中的雷达状态，N为雷达状态个数；J为干扰样式空间，J＝{j₁,j₂,…,j_M}，j_m,m＝1,2，...，M为J中干扰机的干扰样式，M为干扰样式个数；P:S×J×S→[0,1]为状态转移概率，T为结束信号，R为奖励函数；Among them, S is the radar state space, S={s ₁ , s ₂ ,..., s _N }, s _n , n=1, 2,..., N is the radar state in S, and N is the number of radar states. ; J is the interference pattern space, J={j ₁ ,j ₂ ,…,j _M }, j _m ,m=1,2,…, M is the interference pattern of the jammer in J, M is the interference pattern individual number; P:S×J×S→[0,1] is the state transition probability, T is the end signal, and R is the reward function;

奖励函数R定义为：The reward function R is defined as:

其中，r_t为t时刻的干扰收益，t为实施干扰前的时刻，t+1为实施干扰后再次侦收雷达信号的时刻，为干扰前雷达威胁等级，/>为干扰后雷达威胁等级，s_target为雷达目标状态，s_t为实施干扰时雷达的状态，s_t+1为实施干扰后雷达的状态。Among them, r _t is the interference gain at time t, t is the time before interference is implemented, t+1 is the time when radar signals are detected again after interference is implemented, is the radar threat level before jamming,/> is the radar threat level after interference, s _target is the radar target state, s _t is the state of the radar when interference is implemented, and s _t+1 is the state of the radar after interference is implemented.

在本发明的一个实施例中，所述专家决策网络模块的构建方法包括步骤：In one embodiment of the present invention, the method for constructing the expert decision-making network module includes the steps:

将专家数据作为训练样本集；其中，通过行为克隆方法，以所述专家数据中当前多功能雷达状态作为初始专家决策网络的输入，以当前多功能雷达状态下采取的干扰样式的概率分布作为初始专家决策网络的输出，并以所述专家数据中多功能雷达状态对应的干扰样式为标签；Expert data is used as a training sample set; wherein, through the behavioral cloning method, the current multi-functional radar state in the expert data is used as the input of the initial expert decision-making network, and the probability distribution of the interference pattern adopted in the current multi-functional radar state is used as the initial The output of the expert decision-making network is labeled with the interference pattern corresponding to the multi-function radar state in the expert data;

构建由深度神经网络表示的初始专家决策网络；Construct an initial expert decision-making network represented by a deep neural network;

利用所述训练样本集对所述初始专家决策网络进行训练，并将交叉熵损失函数反向传播以更新所述初始专家决策网络的参数，得到所述专家决策网络模块。The initial expert decision-making network is trained using the training sample set, and the cross-entropy loss function is back-propagated to update the parameters of the initial expert decision-making network to obtain the expert decision-making network module.

在本发明的一个实施例中，所述专家数据以轨迹形式表示为：In one embodiment of the present invention, the expert data is expressed in the form of a trajectory:

Γ＝{(s_i,j_i)|i＝1,2,...,N_e}；Γ={(s _i ,j _i )|i=1,2,...,N _e };

其中，s_i为多功能雷达状态，j_i为干扰机的干扰样式，j_i∈J，J为干扰样式空间，(s_i,j_i)为第i个专家知识，N_e表示专家知识个数；Among them, s _i is the multifunctional radar state, j _i is the jamming pattern of the jammer, j _i ∈J, J is the jamming pattern space, (s _i , j _i ) is the i-th expert knowledge, and N _e represents the individual expert knowledge. number;

所述交叉熵损失函数为：The cross-entropy loss function is:

其中，M为干扰样式个数，j_k为第k个干扰样式，为当前多功能雷达状态下采取的干扰样式的概率分布，p_e(·|s)为专家数据中多功能雷达状态对应的干扰样式的概率分布。Among them, M is the number of interference patterns, j _k is the kth interference pattern, is the probability distribution of the interference pattern adopted in the current multifunctional radar state, and p _e (·|s) is the probability distribution of the interference pattern corresponding to the multifunctional radar state in the expert data.

在本发明的一个实施例中，所述干扰决策网络包括主决策网络和目标决策网络，所述主决策网络和所述目标决策网络的结构相同，均包括：第三全连接层、第二激活层、第四全连接层、第三激活层、第五全连接层、第六全连接层和相加模块，其中，In one embodiment of the present invention, the interference decision-making network includes a main decision-making network and a target decision-making network. The main decision-making network and the target decision-making network have the same structure, and both include: a third fully connected layer, a second activation layer layer, the fourth fully connected layer, the third activation layer, the fifth fully connected layer, the sixth fully connected layer and the addition module, where,

所述第三全连接层、第二激活层、第四全连接层、第三激活层依次连接；The third fully connected layer, the second activation layer, the fourth fully connected layer, and the third activation layer are connected in sequence;

所述第五全连接层的输入端、所述第六全连接层的输入端均连接所述第三全连接层的输出端；The input terminal of the fifth fully connected layer and the input terminal of the sixth fully connected layer are both connected to the output terminal of the third fully connected layer;

所述第五全连接层的输出端、所述第六全连接层的输出端均连接所述相加模块的输入端；The output terminal of the fifth fully connected layer and the output terminal of the sixth fully connected layer are both connected to the input terminal of the adding module;

所述相加模块的输出端作为所述主决策网络或所述目标决策网络的输出端。The output terminal of the adding module serves as the output terminal of the main decision-making network or the target decision-making network.

在本发明的一个实施例中，所述训练好的决策网络为对决策网络进行训练得到，所述决策网络的训练方法包括步骤：In one embodiment of the present invention, the trained decision-making network is obtained by training a decision-making network. The training method of the decision-making network includes the steps:

S2041、设置训练博弈总次数、单轮最大博弈次数和探索因子，初始化所述干扰决策网络的网络参数、学习率、经验池和采样批次大小；S2041. Set the total number of training games, the maximum number of games in a single round and the exploration factor, and initialize the network parameters, learning rate, experience pool and sampling batch size of the interference decision-making network;

S2042、获取当前雷达状态，当判断所述当前雷达状态与所述雷达目标状态不一致，则将所述当前雷达状态输入所述专家干预判别函数模块；当判断所述当前雷达状态与雷达目标状态一致，开始新一轮训练，直至训练次数达到所述训练博弈总次数；S2042. Obtain the current radar state. When it is determined that the current radar state is inconsistent with the radar target state, input the current radar state into the expert intervention discriminant function module; when it is determined that the current radar state is consistent with the radar target state , start a new round of training until the number of training times reaches the total number of training games;

S2043、当专家干预判别函数模块判断所述当前雷达状态属于专家决策失误状态集，则利用所述干扰决策网络中的主决策网络选择当前干扰样式以实施干扰；当专家干预判别函数模块判断所述当前雷达状态不属于专家决策失误状态集，则根据所述当前雷达状态，利用所述专家干扰探索函数模块判断专家策略进行干扰决策的参与度；当判断专家策略参与干扰决策，则利用所述专家决策网络模块选择当前干扰样式以实施干扰；当判断专家策略不参与干扰决策，则利用所述干扰决策网络中的主决策网络选择当前干扰样式以实施干扰；S2043. When the expert intervention discriminant function module determines that the current radar state belongs to the expert decision error state set, the main decision network in the interference decision network is used to select the current interference pattern to implement interference; when the expert intervention discriminant function module determines that the If the current radar state does not belong to the expert decision-making error state set, then according to the current radar state, the expert interference exploration function module is used to determine the participation degree of the expert strategy in interference decision-making; when it is judged that the expert strategy participates in the interference decision-making, the expert is used The decision-making network module selects the current interference pattern to implement interference; when it is judged that the expert strategy does not participate in interference decision-making, the main decision-making network in the interference decision-making network is used to select the current interference pattern to implement interference;

S2044、获取干扰后雷达状态，并结合所述当前雷达状态评估当前干扰收益，将所述当前雷达状态、所述当前干扰样式、所述当前干扰收益、所述干扰后雷达状态和当前结束信号作为组合存储到经验池；S2044. Obtain the post-interference radar state, and evaluate the current interference benefit in combination with the current radar state. Use the current radar state, the current interference pattern, the current interference benefit, the post-interference radar state and the current end signal as The combination is stored in the experience pool;

S2045、采用优先经验回放的抽样方式，根据所述采样批次大小从所述经验池采样经验样本以训练更新所述主决策网络，并利用目标损失函数反向传播以更新主决策网络参数；S2045. Adopt the sampling method of priority experience replay, sample experience samples from the experience pool according to the sampling batch size to train and update the main decision-making network, and use the target loss function to back propagate to update the main decision-making network parameters;

S2046、结合超参数和所述主决策网络参数更新目标决策网络参数；S2046. Update the target decision network parameters by combining the hyperparameters and the main decision network parameters;

S2047、将专家策略失误信息存储至所述专家决策失误状态集；S2047. Store expert strategy error information into the expert decision-making error state set;

S2048、当判断单轮博弈次数小于单轮最大博弈次数，返回步骤S2042；当判断单轮博弈次数大于或等于单轮最大博弈次数，返回步骤S2041直至训练次数达到训练博弈总次数，得到训练好的决策网络。S2048. When it is judged that the number of games in a single round is less than the maximum number of games in a single round, return to step S2042; when it is judged that the number of games in a single round is greater than or equal to the maximum number of games in a single round, return to step S2041 until the number of training times reaches the total number of training games, and the trained decision network.

在本发明的一个实施例中，所述目标损失函数为：In one embodiment of the present invention, the target loss function is:

其中，D3QN为D3QN算法，θ为主决策网络参数，Q(s_t,j_t；θ)为主决策网络输出的在s_t下采取j_t的干扰价值，s_t为当前雷达状态，j_t为当前干扰样式，r_t为当前干扰收益，Q_T为目标决策网络，s_t+1为干扰后雷达状态，J为干扰样式空间，j为干扰机的干扰样式，Q(s_t+1,j；θ)为干扰后主决策网络输出的在s_t+1下采取j的干扰价值，θ^-为目标决策网络参数；Among them, D3QN is the D3QN algorithm, θ is the main decision-making network parameter, Q (s _t ,j _t ; θ) is the interference value of j _t output by the main decision-making network under s _t , s _t is the current radar state, j _t is the current interference pattern, r _t is the current interference profit, Q _T is the target decision-making network, s _t+1 is the radar state after interference, J is the interference pattern space, j is the interference pattern of the jammer, Q(s _t+1 , j; θ) is the interference value output by the main decision-making network after interference, taking j under s _t+1 , θ ^- is the target decision-making network parameter;

所述主决策网络参数的更新公式为：The update formula of the main decision network parameters is:

其中，θ_new为更新后的主决策网络参数，θ_old为更新前的主决策网络参数，l为学习率，为目标损失函数关于θ_old的梯度；Among them, θ _new is the main decision-making network parameter after the update, θ _old is the main decision-making network parameter before the update, l is the learning rate, is the gradient of the target loss function with respect to θ _old ;

所述目标决策网络参数的更新公式为：The update formula of the target decision network parameters is:

其中，为更新后的目标决策网络参数，/>为更新前的目标决策网络参数，τ为超参数。in, Determine network parameters for the updated target,/> are the target decision network parameters before updating, and τ is the hyperparameter.

在本发明的一个实施例中，所述训练好的决策网络的输出策略为：In one embodiment of the present invention, the output strategy of the trained decision-making network is:

其中，为专家干预判别函数模块，/>为专家干扰探索函数模块，π_e为专家决策网络模块中的专家策略，π_q为干扰决策网络中的干扰策略，s为雷达状态。in, For the expert intervention discriminant function module,/> is the expert interference exploration function module, π _e is the expert strategy in the expert decision-making network module, π _q is the interference strategy in the interference decision-making network, and s is the radar state.

在本发明的一个实施例中，所述专家干预判别函数模块定义为：In one embodiment of the present invention, the expert intervention discriminant function module is defined as:

其中，s为雷达状态，为专家决策失误状态集，/>表示利用专家策略π_e进行决策，/>表示利用干扰策略π_q进行决策。Among them, s is the radar state, is the expert decision-making error state set,/> Indicates the use of expert strategy π _e to make decisions,/> Indicates the use of interference strategy π _q to make decisions.

在本发明的一个实施例中，所述专家干扰探索函数模块定义为：In one embodiment of the present invention, the expert interference exploration function module is defined as:

其中，s为雷达状态，为专家决策失误状态集，ξ为0～1的随机数，ε_e为探索因子，表示专家策略π_e参与决策，/>表示利用干扰策略π_q进行雷达状态s下的干扰样式探索。Among them, s is the radar state, is the expert decision-making error state set, ξ is a random number from 0 to 1, ε _e is the exploration factor, Indicates that expert strategy π _e participates in decision-making,/> Indicates the use of interference strategy π _q to explore interference patterns in radar state s.

与现有技术相比，本发明的有益效果：Compared with the existing technology, the beneficial effects of the present invention are:

1、本发明提出了不完美专家策略的多功能雷达干扰决策方法，在传统深度强化学习干扰决策算法的基础上，将雷达对抗领域积累的专家知识融入干扰决策方法中，结合专家决策网络模块和干扰决策网络共同进行干扰决策，有效提升了干扰决策算法学习效率与决策精度，降低了对抗博弈试错成本。1. The present invention proposes a multifunctional radar interference decision-making method with imperfect expert strategy. Based on the traditional deep reinforcement learning interference decision-making algorithm, the expert knowledge accumulated in the field of radar countermeasures is integrated into the interference decision-making method, and combined with the expert decision-making network module and The interference decision-making network jointly performs interference decisions, effectively improves the learning efficiency and decision-making accuracy of the interference decision-making algorithm, and reduces the trial and error cost of adversarial games.

2、本发明的多功能雷达干扰决策方法考虑了专家策略最优条件的苛刻性，在保证专家策略为决策网络提供安全保障和探索指导的前提下，引入专家知识判别处理模块和专家干扰探索函数模块，对不同质量的专家知识，利用决策网络进行学习修正，提高了决策准确度，并且提高决策的时效性。2. The multifunctional radar interference decision-making method of the present invention takes into account the harshness of the optimal conditions of the expert strategy. On the premise of ensuring that the expert strategy provides security and exploration guidance for the decision-making network, the expert knowledge discrimination processing module and the expert interference exploration function are introduced. The module uses the decision-making network to learn and modify expert knowledge of different qualities, which improves the accuracy of decision-making and improves the timeliness of decision-making.

附图说明Description of the drawings

图1为本发明实施例提供的一种不完美专家策略的多功能雷达干扰决策方法的流程示意图；Figure 1 is a schematic flowchart of a multifunctional radar interference decision-making method based on imperfect expert strategies provided by an embodiment of the present invention;

图2为本发明实施例提供的一种训练好的决策网络的获取方法示意图；Figure 2 is a schematic diagram of a method for obtaining a trained decision-making network provided by an embodiment of the present invention;

图3为本发明实施例提供的一种初始专家决策网络的结构示意图；Figure 3 is a schematic structural diagram of an initial expert decision-making network provided by an embodiment of the present invention;

图4为本发明实施例提供的干扰决策网络的结构示意图；Figure 4 is a schematic structural diagram of an interference decision-making network provided by an embodiment of the present invention;

图5为本发明实施例提供的决策网络的训练实现框架示意图；Figure 5 is a schematic diagram of the training implementation framework of the decision-making network provided by the embodiment of the present invention;

图6为本发明实施例提供的雷达状态转换关系图；Figure 6 is a radar state transition relationship diagram provided by an embodiment of the present invention;

图7为本发明实施例提供的一种决策精度曲线图；Figure 7 is a decision accuracy curve chart provided by an embodiment of the present invention;

图8为本发明实施例提供的另一种决策精度曲线图。Figure 8 is another decision-making accuracy curve provided by an embodiment of the present invention.

具体实施方式Detailed ways

下面结合具体实施例对本发明做进一步详细的描述，但本发明的实施方式不限于此。The present invention will be described in further detail below with reference to specific examples, but the implementation of the present invention is not limited thereto.

实施例一Embodiment 1

请参见图1，图1为本发明实施例提供的一种不完美专家策略的多功能雷达干扰决策方法的流程示意图。该方法基于构建的多功能雷达干扰决策模型，利用训练好的决策网络根据雷达状态选择干扰样式以对雷达实施干扰，使雷达产生新的雷达状态。训练好的决策网络包括专家决策网络模块、干扰决策网络、专家干预判别函数模块和专家干扰探索函数模块。该方法包括步骤：Please refer to Figure 1. Figure 1 is a schematic flow chart of a multifunctional radar interference decision-making method based on imperfect expert strategies provided by an embodiment of the present invention. This method is based on the constructed multifunctional radar interference decision-making model, and uses the trained decision-making network to select the interference pattern according to the radar state to interfere with the radar, so that the radar generates a new radar state. The trained decision-making network includes expert decision-making network module, interference decision-making network, expert intervention discriminant function module and expert interference exploration function module. The method includes steps:

S101、获取雷达状态。S101. Obtain radar status.

S102、当判断雷达状态与雷达目标状态不一致，利用专家干预判别函数模块判断雷达状态是否属于专家决策失误状态集。当判断雷达状态与雷达目标状态一致，此时不需要进行干扰决策。S102. When it is determined that the radar state is inconsistent with the radar target state, use the expert intervention discriminant function module to determine whether the radar state belongs to the expert decision-making error state set. When the radar status is judged to be consistent with the radar target status, no interference decision is required at this time.

S103、当判断雷达状态属于专家决策失误状态集，则利用干扰决策网络中的主决策网络选择干扰样式；当判断雷达状态不属于专家决策失误状态集，则根据雷达状态，利用专家干扰探索函数模块判断专家策略进行干扰决策的参与度，即判断专家策略是否参与决策。S103. When it is judged that the radar state belongs to the expert decision-making error state set, use the main decision-making network in the interference decision-making network to select the interference pattern; when it is judged that the radar state does not belong to the expert decision-making error state set, use the expert interference exploration function module according to the radar state Determine the degree of participation of expert strategies in interfering with decision-making, that is, determine whether expert strategies participate in decision-making.

S104、当判断专家策略参与干扰决策，则利用专家决策网络模块选择干扰样式；当判断专家策略不参与干扰决策，则利用干扰决策网络中的主决策网络选择干扰样式。S104. When it is judged that the expert strategy participates in the interference decision-making, the expert decision-making network module is used to select the interference pattern; when it is judged that the expert strategy does not participate in the interference decision-making, the main decision-making network in the interference decision-making network is used to select the interference pattern.

请参见图2，图2为本发明实施例提供的一种训练好的决策网络的获取方法示意图。该方法包括步骤：Please refer to Figure 2. Figure 2 is a schematic diagram of a method for obtaining a trained decision-making network provided by an embodiment of the present invention. The method includes steps:

S201、构建多功能雷达干扰决策模型。S201. Construct a multifunctional radar interference decision-making model.

具体的，在复杂电磁环境下，非合作的雷达对抗双方的博弈过程具有不确定性、动态性，雷达状态序列满足马尔可夫性，即对抗系统只依赖于当前雷达状态和干扰机的干扰策略。因此，将多功能雷达干扰决策问题构建为马尔可夫决策过程(Markov DecisionProcess，MDPs)。MDPs是用于具有马尔可夫性的随机动态系统的一种形式化序列决策问题的框架，用一个五元组<S,J,P,R,T>进行描述，其中，S为状态集合，可用于表示雷达状态空间，J为动作集合，可用于表示干扰样式空间，P:S×J×S→[0,1]为状态转移概率，即智能体在当前状态s_t采取动作j_t后，环境状态转移到s_t+1的概率p(s_t+1|s_t,a_t)，R为奖励函数，T为结束信号。在MDPs框架下，其核心目标是寻找最优策略，即状态空间到动作空间的最佳映射π:S→A，获得最大期望奖励：Specifically, in a complex electromagnetic environment, the game process between non-cooperative radar countermeasures is uncertain and dynamic. The radar state sequence satisfies Markov property, that is, the countermeasures system only relies on the current radar status and the interference strategy of the jammer. . Therefore, the multifunctional radar interference decision-making problem is constructed as Markov Decision Process (Markov DecisionProcess, MDPs). MDPs is a framework for formal sequence decision-making problems for stochastic dynamic systems with Markov properties. It is described by a five-tuple <S, J, P, R, T>, where S is the state set, It can be used to represent the radar state space. J is the action set, which can be used to represent the interference pattern space. P:S×J×S→[0,1] is the state transition probability, that is, after the agent takes action j _t in the current state s _t , the probability p(s _t ₊₁ |s _t ,a _t ) of the environment state transitioning to s t+1, R is the reward function, and T is the end signal. Under the framework of MDPs, its core goal is to find the optimal strategy, that is, the best mapping π:S→A from state space to action space, to obtain the maximum expected reward:

其中，π^*表示最优策略，π表示策略，γ为折扣因子，t表示时间步，T_end表示一轮交互结束时的步长，π_t表示t时刻采取的策略。Among them, π ^* represents the optimal strategy, π represents the strategy, γ is the discount factor, t represents the time step, T _end represents the step size at the end of a round of interaction, and π _t represents the strategy adopted at time t.

在多功能雷达干扰决策模型中，雷达状态空间定义为S＝{s₁,s₂,…,s_N}，表征雷达威胁信息即雷达状态；雷达目标状态记为s_target，表示达到干扰目的；N表示雷达状态个数。干扰样式空间定义为J＝{j₁,j₂,…,j_M}，表示干扰机可采取的干扰样式集合，如噪声压制干扰、密集假目标干扰、灵巧噪声干扰、梳状谱干扰等，其中M表示干扰样式个数。奖励函数R表示采取特定干扰后的干扰效能评估，R根据干扰前后雷达威胁等级变化定义：In the multifunctional radar interference decision-making model, the radar state space is defined as S = {s ₁ , s ₂ ,..., s _N }, which represents the radar threat information, that is, the radar state; the radar target state is recorded as s _target , indicating that the interference purpose has been achieved; N represents the number of radar states. The interference pattern space is defined as J={j ₁ ,j ₂ ,…,j _M }, which represents the set of interference patterns that the jammer can adopt, such as noise suppression interference, dense false target interference, smart noise interference, comb spectrum interference, etc. Among them, M represents the number of interference patterns. The reward function R represents the interference effectiveness evaluation after specific interference is adopted. R is defined based on the change in radar threat level before and after interference:

(1)威胁等级降至最低，即达到目标雷达状态s_target，奖励设置为r＝100；(1) The threat level is reduced to the minimum, that is, the target radar state s _target is reached, and the reward is set to r=100;

(2)威胁等级降低，但未降至最低，奖励设置为r＝-1；(2) The threat level is reduced, but not to the minimum, and the reward is set to r=-1;

(3)威胁等级不变或升高，奖励设置为r＝1。(3) The threat level remains unchanged or increases, and the reward is set to r=1.

基于上述定义，基于马尔可夫决策过程的多功能雷达干扰决策模型为：Based on the above definition, the multifunctional radar interference decision model based on Markov decision process is:

<S,J,P,R,T>；<S,J,P,R,T>;

其中，S为雷达状态空间，S＝{s₁,s₂,…,s_N}，s_n,n＝1,2，...，N为S中的雷达状态，N为雷达状态个数；J为干扰样式空间，J＝{j₁,j₂,…,j_M}，j_m,m＝1,2，...，M为J中干扰机的干扰样式，M为干扰样式个数；P:S×J×S→[0,1]为状态转移概率，T为结束信号，R为奖励函数；奖励函数R定义为：Among them, S is the radar state space, S={s ₁ , s ₂ ,..., s _N }, s _n , n=1, 2,..., N is the radar state in S, and N is the number of radar states. ; J is the interference pattern space, J={j ₁ ,j ₂ ,…,j _M }, j _m ,m=1,2,…, M is the interference pattern of the jammer in J, M is the interference pattern individual number; P:S×J×S→[0,1] is the state transition probability, T is the end signal, and R is the reward function; the reward function R is defined as:

基于上述多功能雷达干扰决策模型，整个雷达对抗过程可描述为：干扰机通过侦收信号确定雷达状态信息，然后根据干扰策略采取某种干扰样式，并分析雷达威胁等级变化获得奖励，受干扰后雷达采取抗干扰手段，雷达状态转移到新的状态。在连续对抗过程中，干扰机通过奖励收益不断更新干扰策略以获得最大干扰收益，实现对多功能雷达的自适应干扰。Based on the above-mentioned multifunctional radar jamming decision-making model, the entire radar countermeasures process can be described as follows: the jammer determines the radar status information by detecting signals, then adopts a certain jamming style according to the jamming strategy, and analyzes the changes in the radar threat level to obtain rewards. After being jammed, The radar takes anti-interference measures and the radar state is transferred to a new state. During the continuous confrontation process, the jammer continuously updates the jamming strategy through reward benefits to obtain the maximum jamming benefit and achieve adaptive jamming of multi-functional radars.

S202、将专家知识信息参数化为专家决策网络模型。S202. Parameterize the expert knowledge information into an expert decision-making network model.

通过行为克隆将雷达对抗领域专家知识信息参数化，将参数化的雷达对抗领域专家知识信息转化为由深度神经网络表示的专家决策网络模型。具体包括步骤：The expert knowledge information in the field of radar countermeasures is parameterized through behavioral cloning, and the parameterized expert knowledge information in the field of radar countermeasures is transformed into an expert decision-making network model represented by a deep neural network. Specific steps include:

S2021、将专家数据作为训练样本集；其中，通过行为克隆方法，以专家数据中当前多功能雷达状态作为初始专家决策网络的输入，以当前多功能雷达状态下采取的干扰样式的概率分布作为初始专家决策网络的输出，并以专家数据中多功能雷达状态对应的干扰样式为标签。S2021. Use expert data as a training sample set; among which, through the behavioral cloning method, the current multi-function radar state in the expert data is used as the input of the initial expert decision-making network, and the probability distribution of the interference pattern adopted in the current multi-function radar state is used as the initial The output of the expert decision-making network is labeled with the interference pattern corresponding to the multifunctional radar state in the expert data.

具体的，行为克隆是三种主要模仿学习方法之一，其学习任务为：从专家数据中模仿解决问题的策略，即专家策略，其本质为监督学习。假设雷达对抗领域专家知识信息(即专家数据)为Γ，将专家数据以轨迹形式表示为：Specifically, behavioral cloning is one of the three main imitation learning methods. Its learning task is to imitate problem-solving strategies from expert data, that is, expert strategies, and its essence is supervised learning. Assuming that expert knowledge information (ie expert data) in the field of radar countermeasures is Γ, the expert data is expressed in the form of a trajectory as:

Γ＝{(s_i,j_i)|i＝1,2,...,N_e}；Γ={(s _i ,j _i )|i=1,2,...,N _e };

其中，s_i为多功能雷达状态，j_i为干扰机的干扰样式，j_i∈J，J为干扰样式空间，(s_i,j_i)为第i个专家知识，即在状态s_i下采取干扰行动j_i，N_e表示专家知识个数。Among them, s _i is the multifunctional radar state, j _i is the interference pattern of the jammer, j _i ∈J, J is the interference pattern space, (s _i , j _i ) is the i-th expert knowledge, that is, under the state s _i Take interference action j _i , N _e represents the number of expert knowledge.

行为克隆以专家数据Γ作为训练样本集，以专家数据Γ中当前多功能雷达状态s作为初始专家决策网络模型的输入，以该状态s下采取的干扰样式的概率分布为输出，以专家数据Γ中多功能雷达状态对应的干扰样式为标签，其中标签为One-hot编码形式，也可记为一种概率分布p_e(·|s)。Behavioral cloning uses expert data Γ as a training sample set, the current multifunctional radar state s in expert data Γ as the input of the initial expert decision-making network model, and the probability distribution of the interference pattern adopted in this state s is the output, and the interference pattern corresponding to the multifunctional radar state in the expert data Γ is used as the label, where the label is in the form of One-hot encoding, which can also be recorded as a probability distribution p _e (·|s).

S2022、构建由深度神经网络表示的初始专家决策网络。S2022. Construct an initial expert decision-making network represented by a deep neural network.

初始专家决策网络由深度神经网络表示，包括全连接层、激活层、归一化层，全连接层、激活层、归一化层的层数和层级结构可以根据实际情况进行设置。The initial expert decision-making network is represented by a deep neural network, including a fully connected layer, an activation layer, and a normalization layer. The number and hierarchical structure of the fully connected layer, activation layer, and normalization layer can be set according to the actual situation.

请参见图3，图3为本发明实施例提供的一种初始专家决策网络的结构示意图，图3中初始专家决策网络包括依次连接的第一全连接层、第一激活层、第二全连接层和归一化层。具体的，第一激活层包括ReLU激活层，归一化层包括softmax层。Please refer to Figure 3. Figure 3 is a schematic structural diagram of an initial expert decision-making network provided by an embodiment of the present invention. The initial expert decision-making network in Figure 3 includes a first fully connected layer, a first activation layer, and a second fully connected layer connected in sequence. layer and normalization layer. Specifically, the first activation layer includes a ReLU activation layer, and the normalization layer includes a softmax layer.

S2023、利用训练样本集对初始专家决策网络进行训练，并将交叉熵损失函数反向传播以更新初始专家决策网络的参数，得到专家决策网络模块。具体训练步骤包括：S2023. Use the training sample set to train the initial expert decision-making network, and back-propagate the cross-entropy loss function to update the parameters of the initial expert decision-making network to obtain the expert decision-making network module. Specific training steps include:

2023a)随机抽取专家数据样本送入初始专家决策网络，得到多功能雷达状态s下采取的干扰样式的概率分布 2023a) Randomly select expert data samples and send them to the initial expert decision-making network to obtain the probability distribution of the interference pattern adopted by the multifunctional radar in state s

2023b)计算交叉熵损失函数：2023b) Calculate the cross-entropy loss function:

2023c)将交叉熵损失函数Loss反向传播以更新初始专家决策网络的参数，若损失函数收敛，则得到专家决策网络模块。2023c) Back-propagate the cross-entropy loss function Loss to update the parameters of the initial expert decision-making network. If the loss function converges, the expert decision-making network module is obtained.

S203、构建基于D3QN的干扰决策网络。S203. Construct an interference decision-making network based on D3QN.

干扰决策网络基于D3QN算法构建，包括主决策网络Q(s,j；θ)和目标决策网络Q_T(s,j；θ^-)，主决策网络Q(s,j；θ)用于干扰决策，目标决策网络Q_T(s,j；θ^-)用于更新主决策网络参数。主决策网络Q(s,j；θ)和目标决策网络Q_T(s,j；θ^-)网络结构相同，均包含全连接层和激活层，全连接层、激活层的层数和层级结构可以根据实际情况进行设置。The interference decision-making network is constructed based on the D3QN algorithm, including the main decision-making network Q(s,j;θ) and the target decision-making network _QT (s,j; ^θ- ). The main decision-making network Q(s,j;θ) is used for interference decision-making. , the target decision network Q _T (s, j; θ ^- ) is used to update the main decision network parameters. The main decision-making network Q(s,j;θ) and the target decision-making network _QT (s,j;θ- ⁾ have the same network structure, both including fully connected layers and activation layers, the number and hierarchical structure of fully connected layers and activation layers It can be set according to the actual situation.

请参见图4，图4为本发明实施例提供的干扰决策网络的结构示意图。如图4所示，主决策网络和目标决策网络均包括：第三全连接层、第二激活层、第四全连接层、第三激活层、第五全连接层、第六全连接层和相加模块。其中，第三全连接层、第二激活层、第四全连接层、第三激活层依次连接；第五全连接层的输入端、第六全连接层的输入端均连接第三全连接层的输出端；第五全连接层的输出端、第六全连接层的输出端均连接相加模块的输入端，利用加法运算对第五全连接层的输出和第六全连接层的输出进行相加；相加模块的输出端作为主决策网络或目标决策网络的输出端。具体的，第二激活层、第三激活层均包括ReLU激活层；第五全连接层具有1个节点，第六全连接层具有M个节点，M表示干扰样式个数。Please refer to Figure 4, which is a schematic structural diagram of an interference decision-making network provided by an embodiment of the present invention. As shown in Figure 4, both the main decision-making network and the target decision-making network include: the third fully connected layer, the second activation layer, the fourth fully connected layer, the third activation layer, the fifth fully connected layer, the sixth fully connected layer and Addition module. Among them, the third fully connected layer, the second activation layer, the fourth fully connected layer, and the third activation layer are connected in sequence; the input end of the fifth fully connected layer and the input end of the sixth fully connected layer are both connected to the third fully connected layer The output terminal; the output terminal of the fifth fully connected layer and the output terminal of the sixth fully connected layer are both connected to the input terminal of the addition module, and the output of the fifth fully connected layer and the output of the sixth fully connected layer are performed using addition operations. Addition; the output terminal of the addition module serves as the output terminal of the main decision-making network or the target decision-making network. Specifically, the second activation layer and the third activation layer both include ReLU activation layers; the fifth fully connected layer has 1 node, and the sixth fully connected layer has M nodes, where M represents the number of interference patterns.

S204、构建决策网络，并对决策网络进行训练，得到训练好的决策网络。S204. Construct a decision-making network and train the decision-making network to obtain a trained decision-making network.

本实施例构建的决策网络包括：(1)步骤S202构建的专家决策网络模型，(2)步骤S203构建的干扰决策网络，(3)专家干预判别函数模块(4)专家干扰探索函数模块其中，专家决策网络模块表征专家策略π_e；干扰决策网络表征在线学习的干扰策略π_q；专家干预判别函数模块/>依据当前干扰机侦收到雷达状态s决定是否由专家决策网络模块决策此次干扰行动；专家干扰探索函数模块/>依据当前干扰机侦收到雷达状态s控制专家策略的参与度。The decision-making network constructed in this embodiment includes: (1) the expert decision-making network model constructed in step S202, (2) the interference decision-making network constructed in step S203, (3) expert intervention discriminant function module (4) Expert interference exploration function module Among them, the expert decision-making network module represents the expert strategy π _e ; the interference decision-making network represents the interference strategy π _q of online learning; the expert intervention discriminant function module/> Based on the radar status s detected by the current jammer, it is decided whether the expert decision-making network module decides on the jamming action; the expert jamming exploration function module/> The participation degree of the expert strategy is controlled based on the current radar status s detected by the jammer.

本实施例中，专家决策网络模块和干扰决策网络共同决策形成一种混合的干扰决策机制，其目的有二：一是对错误的专家知识进行重新探索与学习，提高干扰决策的效率；二是参与专家决策，依概率探索专家策略以外的可能性，避免专家策略为次优解。决策网络的最终输出策略π(即训练好的决策网络的输出策略)表示为：In this embodiment, the expert decision-making network module and the interference decision-making network jointly make decisions to form a hybrid interference decision-making mechanism, which has two purposes: first, to re-explore and learn erroneous expert knowledge to improve the efficiency of interference decision-making; second, to Participate in expert decision-making, explore possibilities other than expert strategies based on probability, and avoid expert strategies as suboptimal solutions. The final output strategy π of the decision-making network (that is, the output strategy of the trained decision-making network) is expressed as:

其中，π_e为专家决策网络模块中的专家策略，π_q为干扰决策网络中的干扰策略，s为雷达状态；为专家干预判别函数模块，/>利用双方博弈对抗的过程中建立的知识库，记录专家策略的失误信息，将专家决策失误状态集表示为/>则专家干预判别函数模块定义为/> 表示利用专家策略π_e进行决策，相反表示利用干扰策略π_q进行决策；/>为专家干扰探索函数模块，/>定义为/>其中ξ表示0～1的随机数，ε_e表示探索因子，表示专家策略π_e参与决策，相反/>表示利用干扰策略π_q进行状态s下的干扰样式探索。Among them, π _e is the expert strategy in the expert decision-making network module, π _q is the interference strategy in the interference decision-making network, and s is the radar state; For the expert intervention discriminant function module,/> The knowledge base established during the game confrontation between the two parties is used to record the error information of the expert strategy, and the expert decision-making error state set is expressed as/> Then the expert intervention discriminant function module is defined as/> Represents the use of expert strategy π _e to make decisions, on the contrary Indicates the use of interference strategy π _q to make decisions;/> Exploring function modules for expert interference, /> Defined as/> Where ξ represents a random number from 0 to 1, ε _e represents the exploration factor, Indicates that expert strategy π _e participates in decision-making, on the contrary/> Indicates the use of interference strategy π _q to explore interference patterns in state s.

需要说明的是，在混合的干扰决策机制中，干扰决策网络是用最终策略π交互收集的数据进行训练，而其中在某些状态下有一部分是由专家策略π_e产生的，在这些状态下专家策略π_e与干扰策略π_q存在策略差异这种策略差异影响混合策略与干扰策略下的状态分布D_π(s)和/>的差异，当该差距过大时，将直接导致决策网络训练的不稳定，决策性能下降。而混合策略与干扰策略间的状态分布差异受其策略差异的约束，继而可得到最终输出策略与干扰策略状态分布的差异情况如下：It should be noted that in the hybrid interference decision-making mechanism, the interference decision-making network is trained using the data collected interactively by the final policy π, and part of it is generated by the expert policy π _e in some states. In these states There are strategic differences between the expert strategy π _e and the interference strategy π _q This strategy difference affects the state distribution D _π (s) and/> under the mixed strategy and interference strategy When the difference is too large, it will directly lead to the instability of decision-making network training and the decline of decision-making performance. The difference in state distribution between the mixed strategy and the interference strategy is constrained by the difference in their strategies. Then the difference in state distribution between the final output strategy and the interference strategy can be obtained as follows:

其中，为期望，s～π为s服从策略π分布，γ为折扣因子，称为干预因子。in, is the expectation, s～π means that s obeys the strategy π distribution, γ is the discount factor, called intervening factors.

由上式可知，状态分布差异的上限受策略差异和干预因子所约束，策略差异是不可控的，但可通过减小专家策略的参与度，从而降低状态分布差异，因此本实施例将专家策略参与决策比重逐步降低，提升决策网络训练的稳定性，其具体体现在探索因子ε_e逐步减小。It can be seen from the above formula that the upper limit of the state distribution difference is constrained by the strategy difference and intervention factors. The strategy difference is uncontrollable, but the state distribution difference can be reduced by reducing the participation of the expert strategy. Therefore, in this embodiment, the expert strategy The proportion of participation in decision-making is gradually reduced, which improves the stability of decision-making network training, which is specifically reflected in the gradual reduction of the exploration factor ε _e .

请参见图5，图5为本发明实施例提供的决策网络的训练实现框架示意图，决策网络的训练方法包括步骤：Please refer to Figure 5. Figure 5 is a schematic diagram of the training implementation framework of the decision-making network provided by an embodiment of the present invention. The training method of the decision-making network includes the steps:

S2041、设置训练博弈总次数N_round、单轮最大博弈次数N_max和探索因子ε_e，初始化干扰决策网络的网络参数、学习率l、经验池H和采样批次大小N_b。S2041. Set the total number of training games N _round , the maximum number of games in a single round N _max and the exploration factor ε _e , and initialize the network parameters of the interference decision-making network, the learning rate l, the experience pool H and the sampling batch size N _b .

S2042、获取当前雷达状态s_t，当判断当前雷达状态s_t与雷达目标状态s_target不一致，即s_t≠s_target，则将当前雷达状态输入专家干预判别函数模块，转至步骤S2043；当判断当前雷达状态s_t与雷达目标状态s_target一致，开始新一轮训练，直至训练次数达到训练博弈总次数N_round。S2042. Obtain the current radar state s _t . When it is determined that the current radar state s _t is inconsistent with the radar target state s _target , that is, s _t ≠s _target , then input the current radar state into the expert intervention discriminant function module and go to step S2043; when it is determined The current radar state s _t is consistent with the radar target state s _target , and a new round of training starts until the number of training times reaches the total number of training games N _round .

S2043、若则通过干扰决策网络选择当前干扰样式j_t并实施干扰；否则，由干扰探索函数控制专家策略参与决策。S2043, if Then the current interference pattern j _t is selected through the interference decision-making network and interference is implemented; otherwise, the interference exploration function controls the expert strategy to participate in decision-making.

具体的，当专家干预判别函数模块判断当前雷达状态s_t属于专家决策失误状态集则利用干扰决策网络中的主决策网络选择当前干扰样式j_t以实施干扰；当专家干预判别函数模块判断当前雷达状态s_t不属于专家决策失误状态集/>则根据当前雷达状态s_t，利用专家干扰探索函数模块判断专家策略进行干扰决策的参与度；当判断专家策略参与干扰决策，则利用专家决策网络模块选择当前干扰样式j_t以实施干扰；当判断专家策略不参与干扰决策，则利用干扰决策网络中的主决策网络选择当前干扰样式j_t以实施干扰。Specifically, when the expert intervention discriminant function module determines that the current radar state s _t belongs to the expert decision-making error state set Then use the main decision-making network in the interference decision-making network to select the current interference pattern j _t to implement interference; when the expert intervention discriminant function module determines that the current radar state s _t does not belong to the expert decision-making error state set/> Then according to the current radar state s _t , the expert interference exploration function module is used to judge the participation degree of the expert strategy in interference decision-making; when it is judged that the expert strategy is involved in the interference decision-making, the expert decision-making network module is used to select the current interference style j _t to implement interference; when it is judged If the expert strategy does not participate in interference decision-making, the main decision-making network in the interference decision-making network is used to select the current interference pattern j _t to implement interference.

S2044、获取干扰后雷达状态s_t+1，并结合当前雷达状态评估当前干扰收益r_t，将当前雷达状态s_t、当前干扰样式j_t、当前干扰收益r_t、干扰后雷达状态s_t+1和当前结束信号done作为组合即<s_t,j_t,r_t,s_t+1,done>存储到经验池H。S2044. Obtain the post-interference radar state s _t+1 and evaluate the current interference gain r _t based on the current radar state. Combine the current radar state s _t , current interference pattern j _t , current interference gain r _t , and post-interference radar state s _{t+ 1} and the current end signal done are stored in the experience pool H as a combination, that is, <s _t ,j _t ,r _t ,s _t+1 ,done>.

S2045、采用优先经验回放的抽样方式，根据采样批次大小从经验池H采样经验样本，经验样本的数量与采样批次大小N_b相同，以训练更新主决策网络Q(s,j；θ)，并利用目标损失函数反向传播以更新主决策网络参数θ。其中，目标损失函数的计算公式为：S2045. Adopt a sampling method that prioritizes experience playback, and sample experience samples from the experience pool H according to the sampling batch size. The number of experience samples is the same as the sampling batch size N _b to train and update the main decision-making network Q (s, j; θ) , and use the target loss function to back propagate to update the main decision network parameters θ. Among them, the calculation formula of the target loss function is:

其中，D3QN为D3QN算法，θ为主决策网络参数，Q(s_t,j_t；θ)为主决策网络输出的在s_t下采取j_t的干扰价值，s_t为当前雷达状态，j_t为当前干扰样式，r_t为当前干扰收益，Q_T为目标决策网络，s_t+1为干扰后雷达状态，J为干扰样式空间，j为干扰机的干扰样式，Q(s_t+1,j；θ)为主决策网络输出的在s_t+1下采取j的干扰价值，θ^-为目标决策网络参数。Among them, D3QN is the D3QN algorithm, θ is the main decision-making network parameter, Q (s _t ,j _t ; θ) is the interference value of j _t output by the main decision-making network under s _t , s _t is the current radar state, j _t is the current interference pattern, r _t is the current interference profit, Q _T is the target decision-making network, s _t+1 is the radar state after interference, J is the interference pattern space, j is the interference pattern of the jammer, Q(s _t+1 , j; θ) is the interference value of j taken under s _t+1 output by the main decision-making network, and θ ^- is the target decision-making network parameter.

主决策网络参数的更新公式为：The update formula of the main decision network parameters is:

其中，θ_new为更新后的主决策网络参数，θ_old为更新前的主决策网络参数，l为学习率，为目标损失函数关于θ_old的梯度。Among them, θ _new is the main decision-making network parameter after the update, θ _old is the main decision-making network parameter before the update, l is the learning rate, is the gradient of the target loss function with respect to θ _old .

S2046、结合超参数和主决策网络参数更新目标决策网络参数。S2046. Update the target decision network parameters by combining the hyperparameters and the main decision network parameters.

例如，设置超参数τ，将超参数τ作为权值与主决策网络参数θ进行加权平均以更新目标决策网络参数θ^-，更新公式如下：For example, set the hyperparameter τ, use the hyperparameter τ as a weight and perform a weighted average with the main decision network parameter θ to update the target decision network parameter θ ^- . The update formula is as follows:

S2047、记录专家策略失误信息，将专家策略失误信息存储至专家决策失误状态集 S2047. Record the expert strategy error information and store the expert strategy error information into the expert decision-making error status set.

S2048、当判断单轮博弈次数小于单轮最大博弈次数N_max，返回步骤S2042；当判断单轮博弈次数大于单轮最大博弈次数，返回步骤S2041直至训练次数达到训练博弈总次数N_round，得到训练好的决策网络。其中，当返回步骤S2041开始新一轮训练时，探索因子ε_e小于上一轮训练时的探索因子ε_e，即探索因子ε_e随着训练次数的增加而逐步减小。S2048. When it is judged that the number of games in a single round is less than the maximum number of games in a single round N _max , return to step S2042 ; when it is judged that the number of games in a single round is greater than the maximum number of games in a single round , return to step S2041 until the number of training times reaches the total number of training games N _round , and training is obtained Good decision-making network. Among them, when returning to step S2041 to start a new round of training, the exploration factor ε _e is smaller than the exploration factor ε _e in the previous round of training, that is, the exploration factor ε _e gradually decreases as the number of training times increases.

进一步的，利用训练好的决策网络按照图1所示的方法进行多功能雷达干扰决策。Further, the trained decision-making network is used to make multi-functional radar interference decisions according to the method shown in Figure 1.

本实施例提出了不完美专家策略的多功能雷达干扰决策方法，在传统深度强化学习干扰决策算法的基础上，将雷达对抗领域积累的专家知识融入干扰决策方法中，结合专家决策网络模块和干扰决策网络共同进行干扰决策，有效提升了干扰决策算法学习效率与决策精度，降低了对抗博弈试错成本。This embodiment proposes a multifunctional radar interference decision-making method based on imperfect expert strategies. Based on the traditional deep reinforcement learning interference decision-making algorithm, the expert knowledge accumulated in the field of radar countermeasures is integrated into the interference decision-making method, combining the expert decision-making network module and interference decision-making method. The decision-making network jointly performs interference decisions, effectively improving the learning efficiency and decision-making accuracy of the interference decision-making algorithm, and reducing the trial-and-error cost of adversarial games.

本实施例的多功能雷达干扰决策方法考虑了专家策略最优条件的苛刻性，在保证专家策略为决策网络提供安全保障和探索指导的前提下，引入专家知识判别处理模块和专家干扰探索函数模块，对不同质量的专家知识，利用决策网络进行学习修正，提高了决策准确度，并且提高决策的时效性。The multifunctional radar interference decision-making method of this embodiment takes into account the harshness of the optimal conditions of the expert strategy. On the premise of ensuring that the expert strategy provides security and exploration guidance for the decision-making network, an expert knowledge discrimination processing module and an expert interference exploration function module are introduced. , using the decision-making network for learning and correction of expert knowledge of different qualities, which improves the accuracy of decision-making and improves the timeliness of decision-making.

本实施例的效果可以通过以下仿真实验做进一步的说明。The effect of this embodiment can be further explained through the following simulation experiments.

为了验证复杂情况下本实施例所提方法的性能，假设干扰机可发射10种不同的干扰样式J＝{j₁,j₂,…,j₁₀}，多功能雷达状态空间包含16种状态S＝{s₁,s₂,…,s₁₆}，对应16个威胁等级，其中s₁威胁等级最高，s₁₆为目标雷达状态s_target，其威胁等级最低，各雷达状态转换关系参考图6，图6为本发明实施例提供的雷达状态转换关系图。干扰机的干扰任务是使多功能雷达的任意状态转移至雷达目标状态s_target，奖励函数表示为：In order to verify the performance of the method proposed in this embodiment under complex conditions, it is assumed that the jammer can emit 10 different interference patterns J={j ₁ , j ₂ ,..., j ₁₀ }, and the multifunctional radar state space contains 16 states S = {s ₁ , s ₂ ,…, s ₁₆ }, corresponding to 16 threat levels, of which s ₁ has the highest threat level, s ₁₆ is the target radar state s _target , and has the lowest threat level. Please refer to Figure 6 for the conversion relationship of each radar state. Figure 6 is a radar state transition relationship diagram provided by an embodiment of the present invention. The jamming task of the jammer is to transfer any state of the multi-function radar to the radar target state s _target . The reward function is expressed as:

其中，和/>表示干扰前后雷达威胁等级。in, and/> Indicates the radar threat level before and after interference.

为了对比本实施例所提方法(简称ED3QN_PER)相对于目前基于深度强化学习的干扰决策算法的性能提升，分别从单局博弈次数和决策准确度两个维度与基于深度Q学习(DQN)、优先经验回放的深度Q学习(DQN_PER)、优先经验回放的深度双Q学习(DDQN_PER)和优先经验回放的深度对决双Q学习(D3QN_PER)的决策算法进行比较分析，决策结果参考图7，图7为本发明实施例提供的一种决策精度曲线图。从决策准确度曲线来看，本实施例所提方法由于引入专家策略对干扰决策进行指导，在整个训练期间，其干扰决策准度基本上均优于其他算法，对博弈过程中的安全性有所改善。In order to compare the performance improvement of the method proposed in this embodiment (abbreviated as ED3QN_PER) relative to the current interference decision-making algorithm based on deep reinforcement learning, from the two dimensions of the number of single games and decision accuracy, it was compared with deep Q learning (DQN) and priority based on deep Q learning (DQN). The decision-making algorithms of deep Q learning with experience replay (DQN_PER), deep double Q learning with priority experience replay (DDQN_PER), and deep duel double Q learning with priority experience replay (D3QN_PER) are compared and analyzed. The decision results are shown in Figure 7. A decision-making accuracy curve provided by an embodiment of the present invention. From the perspective of the decision accuracy curve, the method proposed in this embodiment introduces expert strategies to guide the interference decision-making. During the entire training period, its interference decision-making accuracy is basically better than other algorithms, which is beneficial to the security of the game process. improved.

为了分析本实施例所提方法在不同质量的专家策略下的性能，以高、中、低、差四个等级的专家策略为参考，分析不同质量的专家策略对算法性能的影响，决策结果请参考图8，图8为本发明实施例提供的另一种决策精度曲线图。从决策准确度曲线来看，在不同质量专家策略的影响下，其决策的收敛速度和准确度均表现优异，且随着专家策略质量的提高，决策收敛速度以及精度也随之提高。In order to analyze the performance of the method proposed in this embodiment under expert strategies of different qualities, the four levels of expert strategies of high, medium, low, and poor are used as a reference to analyze the impact of expert strategies of different qualities on algorithm performance. Please refer to the decision-making results. Refer to Figure 8, which is another decision-making accuracy curve provided by an embodiment of the present invention. Judging from the decision-making accuracy curve, under the influence of expert strategies of different qualities, the convergence speed and accuracy of decision-making are excellent, and as the quality of expert strategies improves, the convergence speed and accuracy of decision-making also increase.

综上，本实施例针对现有基于强化学习的干扰决策算法存在训练周期长、试错成本高的问题，引入专家知识，实现了干扰决策算法学习效率的提升，降低了试错风险，并针对复杂电磁环境下雷达与干扰机非合作博弈对抗中最优专家干扰策略获取困难的问题，在不完美专家知识辅助决策条件下，实现了干扰决策效率与精度的提升。In summary, this embodiment aims at the problems of long training cycle and high trial and error cost of the existing interference decision-making algorithm based on reinforcement learning. It introduces expert knowledge to improve the learning efficiency of the interference decision-making algorithm, reduces the risk of trial and error, and targets In the non-cooperative game confrontation between radar and jammers in a complex electromagnetic environment, it is difficult to obtain the optimal expert jamming strategy. Under the conditions of imperfect expert knowledge-assisted decision-making, the efficiency and accuracy of jamming decision-making are improved.

以上内容是结合具体的优选实施方式对本发明所作的进一步详细说明，不能认定本发明的具体实施只局限于这些说明。对于本发明所属技术领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干简单推演或替换，都应当视为属于本发明的保护范围。The above content is a further detailed description of the present invention in combination with specific preferred embodiments, and it cannot be concluded that the specific implementation of the present invention is limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of them should be regarded as belonging to the protection scope of the present invention.

Claims

1. A multifunctional radar interference decision method of imperfect expert strategy is characterized in that the method is based on a multifunctional radar interference decision model, and a trained decision network is utilized to select an interference pattern according to radar states so as to implement interference on the radar, so that the radar generates new radar states; the trained decision network comprises an expert decision network module, an interference decision network, an expert interference discrimination function module and an expert interference exploration function module; the method comprises the steps of:

acquiring a radar state;

when the radar state is inconsistent with the radar target state, judging whether the radar state belongs to an expert decision error state set or not by utilizing the expert intervention discriminant function module;

when the radar state is judged to belong to an expert decision error state set, selecting an interference pattern by using a main decision network in the interference decision network; judging whether an expert strategy participates in interference decision or not by utilizing the expert interference exploration function module according to the radar state when the radar state is judged not to belong to the expert decision error state set;

when judging that the expert strategy participates in the interference decision, selecting an interference pattern by using the expert decision network module; and when judging that the expert strategy does not participate in the interference decision, selecting an interference pattern by using a main decision network in the interference decision network.

2. The imperfect expert policy multi-functional radar interference decision method according to claim 1, wherein the multi-functional radar interference decision model is:

<S,J,P,R,T>；

wherein S is a radar state space, s= { S ₁ ,s ₂ ,…,s _N }，s _n N=1, 2,..n is the radar status in S, N is the number of radar states; j is the interference pattern space, j= { J ₁ ,j ₂ ,…,j _M }，j _m M=1, 2,., M is the interference pattern of the interfering machine in J, M is the number of interference patterns; p is SxJxS → [0,1 ]]The state transition probability is represented by T, the ending signal is represented by T, and the reward function is represented by R;

the bonus function R is defined as:

wherein r is _t For the interference gain at time t, t is the time before interference is implemented, t+1 is the time after interference is implemented and radar signals are again detected,for pre-interference radar threat level, +.>For post-interference radar threat level s _target Is radar target state s _t To implement the radar state at interference s _t+1 To implement the radar state after the disturbance.

3. The method for multi-functional radar interference decision making with imperfect expert strategy according to claim 1, wherein the method for constructing the expert decision network module comprises the steps of:

taking expert data as a training sample set; the method comprises the steps of using a current multifunctional radar state in expert data as input of an initial expert decision network, using probability distribution of interference patterns adopted in the current multifunctional radar state as output of the initial expert decision network, and using interference patterns corresponding to the multifunctional radar state in the expert data as labels through a behavior cloning method;

constructing an initial expert decision network represented by a deep neural network;

and training the initial expert decision network by using the training sample set, and back-propagating a cross entropy loss function to update parameters of the initial expert decision network to obtain the expert decision network module.

4. A multifunctional radar interference decision method according to claim 3, characterized in that the expert data is represented in trajectory form as:

Γ＝{(s _i ,j _i )|i＝1,2,...,N _e }；

wherein s is _i Is the state of multifunctional radar, j _i For jammer pattern j _i E J, J is the interference pattern space,(s) _i ,j _i ) For the ith expert knowledge, N _e Representing the number of expert knowledge;

the cross entropy loss function is:

wherein M is the number of interference patterns, j _k For the kth interference pattern,probability distribution, p, of interference patterns assumed in the current multifunctional radar state _e (·|s) is the probability distribution of the interference pattern corresponding to the multi-functional radar state in the expert data.

5. The imperfect expert policy multifunction radar interference decision making method according to claim 1, wherein the interference decision making network comprises a main decision making network and a target decision making network, the main decision making network and the target decision making network are identical in structure, each comprising: a third fully-connected layer, a second active layer, a fourth fully-connected layer, a third active layer, a fifth fully-connected layer, a sixth fully-connected layer, and an addition module, wherein,

the third full-connection layer, the second activation layer, the fourth full-connection layer and the third activation layer are sequentially connected;

the input end of the fifth full-connection layer and the input end of the sixth full-connection layer are connected with the output end of the third full-connection layer;

the output end of the fifth full-connection layer and the output end of the sixth full-connection layer are both connected with the input end of the adding module;

the output end of the adding module is used as the output end of the main decision network or the target decision network.

6. The method for multi-functional radar interference decision making with imperfect expert strategy according to claim 5, wherein the trained decision network is obtained by training the decision network, and the training method of the decision network comprises the steps of:

s2041, setting the total number of training games, the maximum number of single-round games and an exploration factor, and initializing network parameters, learning rate, experience pool and sampling batch size of the interference decision network;

s2042, acquiring a current radar state, and inputting the current radar state into the expert intervention discriminant function module when the current radar state is inconsistent with the radar target state; when the current radar state is judged to be consistent with the radar target state, starting a new training round until the training times reach the total training game times;

s2043, when the expert intervention discriminant function module judges that the current radar state belongs to an expert decision error state set, selecting a current interference pattern by using a main decision network in the interference decision network to implement interference; when the expert interference judging function module judges that the current radar state does not belong to an expert decision error state set, judging the participation degree of an expert strategy for interference decision by utilizing the expert interference exploring function module according to the current radar state; when judging that the expert strategy participates in interference decision, selecting a current interference pattern by utilizing the expert decision network module to implement interference; when judging that the expert strategy does not participate in the interference decision, selecting a current interference pattern by using a main decision network in the interference decision network to implement interference;

s2044, acquiring a radar state after interference, evaluating current interference benefits by combining the current radar state, and storing the current radar state, the current interference pattern, the current interference benefits, the radar state after interference and a current ending signal as a combination into an experience pool;

s2045, sampling experience samples from the experience pool according to the size of the sampling batch by adopting a sampling mode of preferential experience playback so as to train and update the main decision network, and back-propagating by utilizing a target loss function so as to update main decision network parameters;

s2046, updating target decision network parameters by combining the super parameters and the main decision network parameters;

s2047, storing expert strategy error information into the expert decision error state set;

s2048, when the number of single-round games is smaller than the maximum number of single-round games, returning to the step S2042; and when the number of single-round games is equal to the maximum number of single-round games, returning to the step S2041 until the training number reaches the total number of training games, and obtaining the trained decision network.

7. The imperfect expert policy multi-functional radar interference decision making method according to claim 6, wherein the objective loss function is:

wherein D3QN is D3QN algorithm, θ is main decision network parameter, Q(s) _t ,j _t The method comprises the steps of carrying out a first treatment on the surface of the θ) is the primary decision network output at s _t Take j below _t Interference value s of (c) _t J is the current radar state _t R for the current interference pattern _t For current interference gain, Q _T For the target decision network s _t+1 For post-interference radar conditions, J is the interference pattern space, J is the interference pattern of the jammer, Q (s _t+1 J; θ) is the primary decision network output at s _t+1 The interference value of j, theta is adopted ^- Deciding network parameters for a target;

the update formula of the main decision network parameter is as follows:

wherein θ _new For updated primary decision network parameters, θ _old For the main decision network parameters before updating, l is the learning rate,regarding θ for the target loss function _old Is a gradient of (2);

the updating formula of the target decision network parameter is as follows:

wherein,decision network parameters for updated targets, +.>For the target decision network parameters before updating, τ is the superparameter.

8. The imperfect expert policy multi-functional radar interference decision making method according to claim 1, wherein the trained decision network output policy is:

wherein,for expert intervention discriminant function module->Exploring a function module for expert interference, pi _e For expert policy in expert decision network module, pi _q S is radar state, which is an interference strategy in the interference decision network.

9. The method of claim 1, wherein the expert intervention discriminant function module defines:

wherein s is the radar state,for expert decision error state set, < >>Representing the utilization of expert policy pi _e Decision making->Representing the utilization of interference policy pi _q And making a decision.

10. The method of claim 1, wherein the expert interference exploration function module defines:

wherein s is the radar state,for expert decision error state set, xi is random number of 0-1, epsilon _e In order to explore the factors,representing expert policy pi _e Participation in decision making, including->Representing the utilization of interference policy pi _q And (5) searching an interference pattern in the radar state s.