CN105388461A - Radar adaptive behavior Q learning method

Info

Publication number: CN105388461A
Application number: CN201510729398.7A
Authority: CN
Inventors: 彭晓燕; 杨金金; 袁晓垒; 张花国
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2015-10-31
Filing date: 2015-10-31
Publication date: 2016-03-09
Anticipated expiration: 2035-10-31
Also published as: CN105388461B

Abstract

The invention belongs to the field of radar signal processing and relates in particular to learning and identifying radar adaptive behavior with a Q-learning method based on Bayesian table updating. The invention provides a radar adaptive behavior Q-learning method. An improved Q-learning algorithm learns the time-domain waveform selection behavior (minimum mutual information criterion). This goes a substantial step beyond conventional jamming, which relies only on the direct information obtained at the receiving end: the proposed machine learning algorithm identifies the radar's time-domain adaptive behavior and produces a learning result. The method applies a Q-learning algorithm based on Bayesian table updating to the problem of radar behavior learning and identification for the first time and, compared with the prior art, achieves a better learning effect under time-domain waveform selection (minimum mutual information criterion).

Description

A Radar Adaptive Behavior Q Learning Method

Technical Field

The invention belongs to the field of radar signal processing, and in particular relates to learning and identifying radar adaptive behavior with a Q-learning method based on Bayesian table updating.

Background Art

With the advent of adaptive systems and adaptive signal processing in the 1960s, adaptive radar systems were born. Their adaptive capability has grown steadily, progressing from adaptive processing in the radar receiver to coordinated receiver-transmitter processing. At present, radar adaptive behavior shows mainly in the time-, frequency-, and space-domain operating parameters, in signal processing, and in operating modes; adaptive time-domain waveform selection is one example. Waveform selection is an important means of radar waveform adaptation: the target radar builds a waveform library and selects its transmit waveform from the library according to some criterion in order to improve radar performance. The waveform selection criterion is closely tied to the radar's operating mode (or radar task). According to the existing literature, when the radar task is target recognition, the waveform selection criteria include the maximum mutual information criterion (the next transmitted signal should best match the target), the minimum mutual information criterion (the next transmitted signal should acquire as much new information as possible, so that the information shared with the previous echo is minimal), and the maximum Kullback-Leibler information criterion. The present invention takes the minimum mutual information criterion as its object.

In the current field of radar signal processing, the jamming side generally identifies radar targets with fixed behavior. Intelligence, however, is a trend of future development, and both sides will gradually acquire cognitive capabilities. Against a target with adaptive behavior, a more intelligent algorithm is needed to learn that behavior, and only then can the learning result be used to attack efficiently and in real time.

Q-learning is a reinforcement learning algorithm first proposed by C. Watkins in his 1989 doctoral dissertation "Learning from Delayed Rewards". It combines the theory of dynamic programming with the psychology of animal learning and aims to solve sequential decision problems with delayed rewards. Q-learning iteratively computes the action-value function of a Markov decision process by temporal differences, using the update $Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t+1} + \gamma \max_{a \in A_{s}} Q(s_{t+1}, a) - Q(s_{t}, a_{t}) \right],$ where the parameter α is the learning rate (or learning step size) and γ is the discount rate. Q(s_t, a_t) is the value function of the state-action pair: it represents the return obtained by executing action a_t in state s_t and thereafter selecting actions according to the policy π. The target of the Q-learning update is greedy at every step.
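
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the invention: the dictionary-based table, the function name `q_update`, the state and action labels, and the values of α and γ are all assumptions chosen for the example.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update to the table Q in place."""
    # Greedy target: best value reachable from the next state.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)   # unseen state-action pairs start at 0 (assumption)
actions = [0, 1]         # hypothetical action labels
first = q_update(Q, s="A", a=0, r=1.0, s_next="B", actions=actions)
second = q_update(Q, s="A", a=0, r=1.0, s_next="B", actions=actions)
```

With all values initialised to zero, the first call moves Q("A", 0) a fraction α of the way toward the reward, and the second call moves it the same fraction of the remaining distance.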

A Bayesian network represents knowledge graphically. It is a directed acyclic graph in which nodes represent the variables of the domain, directed arcs represent relationships between variables, and conditional probabilities express the strength of the influence between variables, so that a Bayesian network clearly reflects the dependencies among the variables of a practical application. A Bayesian network, also called a belief network, is a graphical model that represents the joint probability distribution of a set of variables. It consists of a structural model and an associated set of conditional probability distributions.

When the Bayesian network has K variables (data features), their joint distribution p(x_1, ..., x_K) can be written in the following form and simplified using the conditional independence properties of the network: $\begin{aligned} p(x_{1}, \ldots, x_{K}) &= p(x_{K} \mid x_{1}, \ldots, x_{K-1}) \cdots p(x_{2} \mid x_{1})\, p(x_{1}) \\ &= \prod_{k=1}^{K} p(x_{k} \mid \mathrm{pa}_{k}) \end{aligned},$ where pa_k denotes the set of parent nodes of node x_k. When the Bayesian network is sparse, the form of the joint probability density is therefore greatly simplified.
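
The simplification can be seen on a small example. The sketch below assumes a hypothetical three-node network in which x1 is the only parent of both x2 and x3; the probability values are invented for illustration.

```python
# Hypothetical network: x1 -> x2 and x1 -> x3, all variables binary.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # indexed [x1][x2]
p_x3_given_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # indexed [x1][x3]

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1[x1][x3]

# The factorised joint still sums to one over all eight configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Because x3 depends only on x1, the factor p(x3 | x1, x2) of the full chain rule collapses to p(x3 | x1), and the network needs 5 free parameters instead of the 7 of an unrestricted joint distribution over three binary variables.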

Q-learning is a machine learning algorithm that needs no supervision. Through learning, the learner gradually adapts to the environment to be learned, which here means adapting to the adaptive behavior of the target radar. Bayesian learning treats unknown information as random variables from the viewpoint of probability (degree of belief) and has good adaptability and extensibility. Applying Bayesian learning theory to Q-learning and adding a heuristic strategy gives better results when learning the adaptive behavior of the target radar.

Summary of the Invention

The purpose of the invention is to address the deficiencies of the prior art by providing a radar adaptive behavior Q-learning method. An improved Q-learning algorithm learns the time-domain waveform selection behavior (minimum mutual information criterion). This goes a substantial step beyond conventional jamming, which relies only on the direct information obtained at the receiving end: the proposed machine learning algorithm identifies the radar's time-domain adaptive behavior and produces a learning result.

The technical solution of the invention builds on the Q-learning algorithm and takes waveform selection under the time-domain minimum mutual information criterion as the learning object. First, before the waveform adaptive behavior of the target radar can be learned, information about its waveform library must be obtained. Second, the adaptive behavior object is modeled, and the learner interacts with the modeled object to obtain the waveform transitions under different jamming signals, i.e. the laboratory training data. Then the training data are used for Bayesian network parameter learning, yielding the Bayesian posterior probability table and the Bayesian record table. Finally, with the Bayesian posterior probability table as prior knowledge, the proposed algorithm performs iterative learning and outputs the learning result.

A radar adaptive behavior Q-learning method comprises the following steps:

S1. The learner repeatedly transmits probing jamming signals to force the target radar to change its transmitted signal. The learner's receiver captures the target radar's next transmitted signal, which is used to complete the learner's dynamic waveform library as follows: the waveform information obtained at the learner's receiver is compared with the known waveforms; if the waveform is not in the dynamic waveform library, it is stored there; probing jamming signals then continue to be transmitted until every target radar waveform obtained in m interactions can be found in the dynamic waveform library, where m is an empirical value;

S2. Waveform selection under the time-domain minimum mutual information criterion is taken as the learning object and modeled, and the learner interacts with the modeled object to obtain the waveform transitions under different jamming signals, i.e. the laboratory training data;

S3. The training data of S2 are used for Bayesian network parameter learning with the Bayesian toolbox in the Matlab environment and a Dirichlet prior distribution, yielding the maximum a posteriori probability table of the new radar waveform, i.e. the Bayesian record table, where the Bayesian record table holds the number of the new waveform that has the maximum posterior probability given the current waveform and the current jamming signal;

S4. On the basis of the original Q-learning algorithm, with the Bayesian record table of S3 as prior knowledge, iterative learning is performed according to the Bayesian table update algorithm, and the learning result is output.

Further, the modeling in S2 is carried out as follows:

S21. When the waveform selection criterion of the target radar is the minimum mutual information criterion, the radar echo signal is modeled as b = a + w = Sα + w, where S is the waveform convolution matrix containing the waveform parameters, α is the scattering coefficient vector, and w is the receiver noise vector;

S22. Waveform selection is performed so that the next transmitted waveform acquires as much new information as possible, i.e. so that the mutual information between two successive radar echo signals is minimal: $\begin{aligned} \mathrm{MI} &= \min_{s_{i}(m)} \mathrm{MI}(b_{1}, b_{i}) \\ \mathrm{MI}(b_{1}, b_{i}) &= H(b_{1}) - H(b_{1} \mid b_{i}) \\ &= H(b_{i}) - H(b_{i} \mid b_{1}) \end{aligned},$ Under the assumption that w is white Gaussian noise, the mutual information between waveform 1 and waveform i is given by an expression (not reproduced in this text) in {d_k | k = 1, 2, ..., K}, the singular values of the cross-correlation matrix R_xz (whose definition is likewise not reproduced). The singular values satisfy 1 ≥ d_1 ≥ d_2 ≥ ... ≥ d_K ≥ 0. The correlation matrices R_11, R_i1, and R_ii are defined as $\begin{aligned} E(b_{1} b_{1}^{H}) &= R_{11} = S_{1} R_{\alpha\alpha} S_{1}^{H} + R_{ww} \\ E(b_{i} b_{1}^{H}) &= R_{i1} = S_{i} R_{\alpha\alpha} S_{1}^{H} \\ E(b_{i} b_{i}^{H}) &= R_{ii} = S_{i} R_{\alpha\alpha} S_{i}^{H} + R_{ww} \end{aligned},$ from which the mutual information between different waveforms is obtained;

S23. The waveform selection object of S22 is modeled: parameters such as signal waveform type and bandwidth characterize the different radar waveform states, and the waveform in the library that has the minimum mutual information with the previously transmitted waveform is selected as the new radar waveform state;

S24. Different jamming signals are set to influence the waveform selection of the target radar; repeating this interaction yields the waveform transitions under different jamming signals, i.e. the laboratory training data.

Further, the Bayesian network parameter learning in S3 is as follows:

S31. The training data of S2 are used to obtain the conditional probabilities in the Bayesian network, to which Bayes' theorem is applied;

S32. From the conditional probabilities of S31 and Bayes' theorem, the posterior probability of the output node, i.e. the root node, is obtained: $P(s_{k+1} \mid s_{k}, r_{k}) = \frac{P(r_{k} \mid s_{k+1}, s_{k})\, P(s_{k+1} \mid s_{k})}{P(r_{k} \mid s_{k})},$ where s_k is the radar state at time k, r_k is the attack taken by the learner at time k, and s_{k+1} is the new radar state at time k+1. The left-hand side is the probability that the radar, in state s_k at time k and subjected to attack r_k by the learner, changes to the new state s_{k+1} at time k+1; it is the posterior probability estimate of the new radar state. In the numerator on the right-hand side, P(s_{k+1}|s_k) is the state transition probability of the radar, which is also the prior probability of the state at time k+1, and P(r_k|s_{k+1},s_k) is the conditional probability that the learner takes action r_k given that the radar is in state s_k at time k and in state s_{k+1} at time k+1. In other words, with the radar in state s_k and a desired state s_{k+1} set, it is the probability with which the learner selects each attack to drive the radar from s_k to s_{k+1}. The denominator P(r_k|s_k) is the integral or sum of the numerator over the new state s_{k+1}: the probability that the learner selects attack r_k, still conditioned on the current state s_k.

Further, the Bayesian table update algorithm in S4 is as follows:

S41. After the waveform selection object under the minimum mutual information criterion has been modeled and the waveform parameters of the radar waveform library and the parameters of the jamming signals have been set, different jamming signals are transmitted to the target radar to obtain the waveform transitions, i.e. the laboratory training data. Concretely, waveform selection starts from waveform 1, each attack is chosen at random from the 4 jamming signal numbers, the resulting new waveform becomes the current waveform, and the loop repeats until 100 training samples have been obtained;

S42. A Bayesian network is constructed, a prior distribution is added, and the maximum a posteriori solution is computed. The Bayesian toolbox in Matlab is used to solve for the conditional probabilities in the Bayesian network, and the posterior probability of the root node is finally obtained. The prior is a Dirichlet distribution with equal probabilities. The Bayesian record table tabulates the waveform transitions under different jamming signals on the basis of the maximum a posteriori solution: for the current waveform and a given jamming signal, the new waveform with the maximum posterior probability is selected as the output and recorded in the table, i.e. $S_{t+1}^{\max} = \arg\max_{S_{t+1}} P(S_{t}, r_{t}, S_{t+1});$

S43. Once the Bayesian posterior probability table has been obtained, the learner interacts with the waveform selection object under the minimum mutual information criterion according to the flowchart of the Bayesian table update algorithm, and the algorithm then iterates and learns.
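
Steps S41 and S42 can be sketched in Python as follows. This is a simplified stand-in rather than the Matlab toolbox procedure of the invention: `toy_radar` is a hypothetical deterministic transition rule that replaces the minimum mutual information selection, the posterior is the posterior mean under a symmetric Dirichlet prior with pseudo-count `alpha`, and all function names are assumptions.

```python
import random
from collections import Counter, defaultdict

N_WAVEFORMS, N_JAMMERS, N_SAMPLES = 32, 4, 100

def toy_radar(waveform, jammer):
    """Hypothetical stand-in for the radar's waveform-selection rule."""
    return (waveform * 5 + jammer * 7) % N_WAVEFORMS + 1  # waveforms are 1..32

def collect(seed=0):
    """S41: start from waveform 1, jam at random, record 100 transitions."""
    rng = random.Random(seed)
    s, data = 1, []
    for _ in range(N_SAMPLES):
        r = rng.randint(1, N_JAMMERS)   # one of the 4 jamming signal numbers
        s_next = toy_radar(s, r)
        data.append((s, r, s_next))
        s = s_next                      # the new waveform becomes the current one
    return data

def record_table(data, alpha=1.0):
    """S42: most probable next waveform for each observed (waveform, jammer)."""
    counts = defaultdict(Counter)
    for s, r, s_next in data:
        counts[(s, r)][s_next] += 1
    table = {}
    for key, c in counts.items():
        total = sum(c.values()) + alpha * N_WAVEFORMS
        post = {w: (c[w] + alpha) / total for w in range(1, N_WAVEFORMS + 1)}
        table[key] = max(post, key=post.get)
    return table

data = collect()
table = record_table(data)
```

With a symmetric prior the arg max is decided by the observed counts alone, so the prior matters mainly for (waveform, jammer) pairs with little or no data; such unobserved pairs are simply absent from `table` in this sketch.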

The beneficial effects of the invention are:

The method applies a Q-learning algorithm based on Bayesian table updating to the problem of radar behavior learning and identification for the first time and, compared with the prior art, achieves a better learning effect under time-domain waveform selection (minimum mutual information criterion).

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the method for obtaining the target radar waveform library.

Fig. 2 is a schematic diagram of waveform adaptive behavior modeling.

Fig. 3 is the Bayesian network structure for the waveform selection object.

Fig. 4 is the flowchart of the Q-learning algorithm.

Fig. 5 is the flowchart of the Bayesian table update algorithm.

Fig. 6 is the convergence curve of the Q-learning algorithm.

Fig. 7 is the convergence curve of the Bayesian table update algorithm.

Fig. 8 is the state transition diagram obtained when verifying the jamming strategy after learning.

Fig. 9 is the verification of the learning effect of the algorithm for different initial waveforms.

Detailed Description

The technical solution of the invention is described in detail below with reference to the embodiments and the drawings.

S1. The learner repeatedly transmits probing jamming signals to force the target radar to change its transmitted signal; the learner's receiver captures the target radar's next transmitted signal, and the learner's dynamic waveform library is progressively completed. The waveform information obtained at the learner's receiver is first compared with the known waveforms. If the waveform is not in the dynamic waveform library, it is stored there. Probing jamming then continues until every target radar waveform obtained in m interactions can be found in the learner's waveform library; the value of m is adjustable.

As shown in Fig. 1, with the minimum mutual information criterion for waveform selection as the object, jamming is transmitted continuously so as to traverse the target radar's waveform library and build the learner's waveform library. All subsequent learning and simulation are carried out on the condition that the target radar's waveform library has been traversed completely and correctly.
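
The stopping rule of S1 can be sketched as a short loop. The sketch reads "m interactions" as m consecutive interactions that return only known waveforms, which is an interpretation; the `probe` callback, the `max_probes` safety limit, and the toy target are hypothetical.

```python
import random

def build_library(probe, m=20, seed=0, max_probes=10_000):
    """Probe until m consecutive responses are already in the library."""
    rng = random.Random(seed)
    library, known_streak = set(), 0
    for _ in range(max_probes):
        waveform = probe(rng)          # waveform observed after one probe
        if waveform in library:
            known_streak += 1
            if known_streak >= m:
                break
        else:
            library.add(waveform)      # new waveform: store it, restart the count
            known_streak = 0
    return library

def toy_target(rng):
    """Hypothetical target that answers each probe with one of 32 waveform IDs."""
    return rng.randint(1, 32)

library = build_library(toy_target, m=200)
```

A larger m lowers the chance of stopping before a rarely used waveform has been seen, at the cost of more probing transmissions, which is consistent with treating m as an adjustable empirical value.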

S2. Waveform selection under the time-domain minimum mutual information criterion is taken as the learning object and modeled, and the learner interacts with the modeled object to obtain the waveform transitions under different jamming signals, i.e. the laboratory training data, as follows:

S21. The characteristics of the waveform adaptive behavior are analyzed and modeled.

When the waveform selection criterion of the target radar is the minimum mutual information criterion, the radar echo signal is modeled as b = a + w = Sα + w, where S is the waveform convolution matrix containing the waveform parameters, α is the scattering coefficient vector, and w is the receiver noise vector, generally assumed to be white Gaussian noise.
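
The echo model b = Sα + w can be illustrated with a small real-valued example. The sketch assumes that S is the usual Toeplitz convolution matrix built from the waveform samples, so that Sα is the discrete convolution of the waveform with the scattering coefficients; the function names and sample values are hypothetical, and a real radar signal would be complex-valued.

```python
import random

def convolution_matrix(waveform, n_scatterers):
    """Build S with S[i][j] = waveform[i - j] (zero outside the waveform)."""
    n_rows = len(waveform) + n_scatterers - 1
    return [[waveform[i - j] if 0 <= i - j < len(waveform) else 0.0
             for j in range(n_scatterers)]
            for i in range(n_rows)]

def echo(waveform, alpha, noise_std=0.0, seed=0):
    """Return b = S * alpha + w with white Gaussian noise w."""
    rng = random.Random(seed)
    S = convolution_matrix(waveform, len(alpha))
    a = [sum(s_ij * alpha_j for s_ij, alpha_j in zip(row, alpha)) for row in S]
    return [a_i + rng.gauss(0.0, noise_std) for a_i in a]

# Noise-free echo of a 3-sample waveform from two scatterers.
b = echo([1.0, 2.0, 3.0], alpha=[1.0, 0.5])
```

Each column of S is a delayed copy of the waveform, so each scattering coefficient contributes one delayed, scaled replica of the transmitted signal to the echo.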

To describe the region of interest accurately, the radar collects information in a more effective way. The waveform selection criterion is therefore to ensure that the next transmitted waveform acquires as much new information as possible, i.e. that the mutual information between two successive radar echo signals is minimal: $\begin{aligned} \mathrm{MI} &= \min_{s_{i}(m)} \mathrm{MI}(b_{1}, b_{i}) \\ \mathrm{MI}(b_{1}, b_{i}) &= H(b_{1}) - H(b_{1} \mid b_{i}) \\ &= H(b_{i}) - H(b_{i} \mid b_{1}) \end{aligned}.$

Under the assumption that w is white Gaussian noise, the mutual information between waveform 1 and waveform i is given by an expression (not reproduced in this text) in {d_k | k = 1, 2, ..., K}, the singular values of the cross-correlation matrix R_xz (whose definition is likewise not reproduced). The singular values satisfy 1 ≥ d_1 ≥ d_2 ≥ ... ≥ d_K ≥ 0. The correlation matrices R_11, R_i1, and R_ii are defined as: $\begin{aligned} E(b_{1} b_{1}^{H}) &= R_{11} = S_{1} R_{\alpha\alpha} S_{1}^{H} + R_{ww} \\ E(b_{i} b_{1}^{H}) &= R_{i1} = S_{i} R_{\alpha\alpha} S_{1}^{H} \\ E(b_{i} b_{i}^{H}) &= R_{ii} = S_{i} R_{\alpha\alpha} S_{i}^{H} + R_{ww} \end{aligned},$ from which the mutual information between different waveforms is obtained.
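
Since the closed-form expression is not reproduced here, the following NumPy sketch relies on an assumption: it uses the standard result for jointly circular complex Gaussian vectors, MI = -Σ log(1 - d_k²), with d_k taken as the singular values of the coherence matrix R_ii^(-1/2) R_i1 R_11^(-1/2). Whether this matches the patent's R_xz exactly cannot be confirmed from the text, and the function names are hypothetical.

```python
import numpy as np

def inv_sqrt(R):
    """Inverse square root of a Hermitian positive-definite matrix."""
    vals, vecs = np.linalg.eigh(R)
    return (vecs / np.sqrt(vals)) @ vecs.conj().T

def gaussian_mutual_information(S1, Si, R_alpha, R_w):
    """Mutual information (in nats) between the echoes of waveforms 1 and i."""
    R11 = S1 @ R_alpha @ S1.conj().T + R_w
    Rii = Si @ R_alpha @ Si.conj().T + R_w
    Ri1 = Si @ R_alpha @ S1.conj().T
    coherence = inv_sqrt(Rii) @ Ri1 @ inv_sqrt(R11)   # assumed form of R_xz
    d = np.linalg.svd(coherence, compute_uv=False)
    d = np.clip(d, 0.0, 1.0 - 1e-12)                  # guard against d = 1
    return float(-np.sum(np.log(1.0 - d**2)))

def select_next(S_prev, candidates, R_alpha, R_w):
    """Index of the candidate with minimum mutual information to S_prev."""
    scores = [gaussian_mutual_information(S_prev, S, R_alpha, R_w)
              for S in candidates]
    return int(np.argmin(scores))

one = np.array([[1.0]])
mi = gaussian_mutual_information(one, one, one, one)
```

In the scalar example both echoes have variance 2 and cross-correlation 1, so d = 0.5 and the result equals log(4/3), which agrees with the direct determinant formula for two jointly Gaussian variables.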

Next, the waveform selection object under the minimum mutual information criterion is modeled. Parameters such as signal waveform type and bandwidth characterize the different radar waveform states, and under this criterion the waveform in the library that has the minimum mutual information with the previously transmitted waveform is selected as the new radar waveform state. The state transitions under a given criterion therefore constitute the radar's waveform adaptive behavior.

The model is shown in Fig. 2: a radar waveform state is characterized by parameters such as signal waveform type, signal bandwidth, and signal pulse width. In the simulation of the invention the target radar waveform parameters are set as follows. The waveform library contains 32 waveforms of 8 types: linear FM with positive slope, linear FM with negative slope, upward-concave quadratic FM with positive slope, upward-concave quadratic FM with negative slope, downward-convex quadratic FM with positive slope, downward-convex quadratic FM with negative slope, logarithmic FM with positive slope, and logarithmic FM with negative slope. Each type has 4 bandwidths: 10 MHz, 15 MHz, 20 MHz, and 25 MHz.

After the waveform selection object under the minimum mutual information criterion has been modeled, the parameters of the learner's jamming signals must also be set. There are 4 jamming signals: single-frequency signals with jamming-to-signal ratios of 30 dB and 50 dB, and linear FM signals with a bandwidth of 30 MHz and jamming-to-signal ratios of 30 dB and 55 dB.
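
The simulation settings above can be written down as plain data. In the sketch below the numbering order of the waveforms (type first, then bandwidth) and of the jamming signals is an assumption, as are the dictionary keys; only the types, bandwidths, and jamming-to-signal ratios come from the description.

```python
from itertools import product

WAVEFORM_TYPES = [
    "linear FM, positive slope",
    "linear FM, negative slope",
    "upward-concave quadratic FM, positive slope",
    "upward-concave quadratic FM, negative slope",
    "downward-convex quadratic FM, positive slope",
    "downward-convex quadratic FM, negative slope",
    "logarithmic FM, positive slope",
    "logarithmic FM, negative slope",
]
BANDWIDTHS_MHZ = [10, 15, 20, 25]

# Waveform number -> (type, bandwidth); the numbering order is an assumption.
WAVEFORM_LIBRARY = dict(
    enumerate(product(WAVEFORM_TYPES, BANDWIDTHS_MHZ), start=1)
)

# Jamming signal number -> parameters; the numbering order is an assumption.
JAMMERS = {
    1: {"kind": "single-frequency", "jsr_db": 30},
    2: {"kind": "single-frequency", "jsr_db": 50},
    3: {"kind": "linear FM", "jsr_db": 30, "bandwidth_mhz": 30},
    4: {"kind": "linear FM", "jsr_db": 55, "bandwidth_mhz": 30},
}
```

With 32 waveform states and 4 jamming signals, a tabular learner has 32 × 4 = 128 state-action pairs to evaluate.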

With the learning object and the jamming signals in place, the simulation must compute the mutual information between waveforms under different jamming signals in order to determine the newly selected waveform. This is done under the same assumption that w is white Gaussian noise, using the singular values {d_k | k = 1, 2, ..., K} of the cross-correlation matrix R_xz and the correlation matrices R_11, R_i1, and R_ii defined above.

S22. After the minimum mutual information waveform selection object has been modeled, different jamming signals are set to influence the waveform selection of the target radar. Repeating this interaction yields the waveform transitions under different jamming signals, i.e. the laboratory training data. The radar waveform parameters and jamming signal parameters used in the interaction are described in detail in this embodiment.

S3. The training data are used for Bayesian network parameter learning with the Bayesian toolbox in the Matlab environment and a Dirichlet prior distribution, yielding the maximum a posteriori probability table of the new radar waveform. The Bayesian record table holds the number of the new waveform that has the maximum posterior probability given the current waveform and the current jamming signal.

The structure of the Bayesian network is shown in Fig. 3. The training data give the conditional probabilities in the Bayesian network, and Bayes' theorem then gives the posterior probability of the output node, i.e. the root node: $P(s_{k+1} \mid s_{k}, r_{k}) = \frac{P(r_{k} \mid s_{k+1}, s_{k})\, P(s_{k+1} \mid s_{k})}{P(r_{k} \mid s_{k})},$ where s_k is the radar state at time k, r_k is the attack taken by the learner at time k, and s_{k+1} is the new radar state at time k+1. The left-hand side is the probability that the radar, in state s_k at time k and subjected to attack r_k by the learner, changes to the new state s_{k+1} at time k+1; it is the posterior probability estimate of the new radar state. In the numerator on the right-hand side, P(s_{k+1}|s_k) is the state transition probability of the radar, which is also the prior probability of the state at time k+1, and P(r_k|s_{k+1},s_k) is the conditional probability that the learner takes action r_k given that the radar is in state s_k at time k and in state s_{k+1} at time k+1. In other words, with the radar in state s_k and a desired state s_{k+1} set, it is the probability with which the learner selects each attack to drive the radar from s_k to s_{k+1}. The denominator P(r_k|s_k) is the integral or sum of the numerator over the new state s_{k+1}: the probability that the learner selects attack r_k, still conditioned on the current state s_k.
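
A small numeric case shows how the posterior of the new state follows from the transition prior and the attack likelihood. All probabilities, state numbers, and jammer labels below are invented for illustration, and the current state s_k is held fixed.

```python
def posterior_next_state(prior, likelihood, r):
    """Return P(s' | s, r) from P(s' | s) and P(r | s', s) for a fixed s."""
    joint = {s_next: prior[s_next] * likelihood[s_next][r] for s_next in prior}
    evidence = sum(joint.values())   # P(r | s): the numerator summed over s'
    return {s_next: p / evidence for s_next, p in joint.items()}

prior = {1: 0.5, 2: 0.3, 3: 0.2}                 # hypothetical P(s' | s)
likelihood = {1: {"j1": 0.1, "j2": 0.9},         # hypothetical P(r | s', s)
              2: {"j1": 0.6, "j2": 0.4},
              3: {"j1": 0.5, "j2": 0.5}}

post = posterior_next_state(prior, likelihood, "j1")
best = max(post, key=post.get)   # the entry a record table would store
```

Although state 1 has the largest prior, attack "j1" is far more likely when the radar moves to state 2, so state 2 ends up with the largest posterior and would be the entry recorded for this waveform and jamming signal.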

S4、在原来Q学习算法的基础上，以贝叶斯后验概率表为先验知识，根据图5的贝叶斯表更新算法流程图进行迭代学习，并给出学习结果。S4. On the basis of the original Q-learning algorithm, using the Bayesian posterior probability table as prior knowledge, perform iterative learning according to the flow chart of the Bayesian table updating algorithm in Figure 5, and give the learning results.

Figures 4 and 5 are the flowcharts of the Q-learning algorithm and of the Q-learning algorithm based on Bayesian table updating, respectively. The main difference between the two is that the Bayesian table update algorithm first uses laboratory training data to obtain a Bayesian posterior probability table, takes this table as prior knowledge and as guidance toward the target waveform, and only then learns and iterates while interacting with the object.

The specific implementation of the Q-learning algorithm includes the following steps:

Step 1. Once the waveform selection object under the minimum mutual information criterion has been modeled, and the waveform parameters of the radar waveform library and the parameters of the interference signals have been set as described above, the Q-learning algorithm can interact with the target object by means of the interference signals.

Step 2. Following the flowchart of the Q-learning algorithm in Figure 4, the algorithm iterates and learns through its interactions with the target object; Figure 6 shows its convergence curve. The abscissa, the episode number, counts how many times the target state has been reached. The ordinate, the number of iterations per episode, is the number of attacks needed to reach the target state in that episode, that is, the number of interactions the learning party needs to pull the target radar to the target state. The figure shows that in the early episodes many iterations are needed, sometimes up to the iteration limit. As the episodes proceed, the algorithm builds on the knowledge gained in earlier episodes, so the number of iterations needed to reach the target waveform gradually decreases and finally stabilizes.
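The two steps above can be sketched as a standard tabular Q-learning loop that records the number of attacks per episode. This is only a sketch under stated assumptions: the 6-state, 4-attack transition table `NEXT_STATE`, the reward of -1 per attack, and the hyperparameters are hypothetical stand-ins for the modeled radar, which this section does not specify.

```python
import numpy as np

# Hypothetical toy radar: 6 waveform states, 4 jamming attacks (0-based).
# NEXT_STATE[s, r] is the waveform the radar switches to; columns are
# stay, reset to waveform 0, step back, step forward. Illustrative only.
N_STATES, N_ATTACKS, TARGET = 6, 4, 5
NEXT_STATE = np.array([[s, 0, max(s - 1, 0), min(s + 1, N_STATES - 1)]
                       for s in range(N_STATES)])

def q_learning(episodes=500, max_steps=200, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ATTACKS))
    steps_per_episode = []
    for _ in range(episodes):
        s, steps = 0, 0
        while s != TARGET and steps < max_steps:
            # epsilon-greedy choice of the next attack
            if rng.random() < epsilon:
                r = int(rng.integers(N_ATTACKS))
            else:
                r = int(np.argmax(q[s]))
            s_next = int(NEXT_STATE[s, r])
            reward = -1.0  # assumed: every attack costs one interaction
            future = 0.0 if s_next == TARGET else gamma * q[s_next].max()
            q[s, r] += alpha * (reward + future - q[s, r])
            s, steps = s_next, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode

def greedy_path_length(q, start=0, limit=50):
    """Number of attacks the learned greedy policy needs to reach TARGET."""
    s, n = start, 0
    while s != TARGET and n < limit:
        s, n = int(NEXT_STATE[s, int(np.argmax(q[s]))]), n + 1
    return n

q_table, curve = q_learning()
```

Plotting `curve` against the episode index would give a curve analogous to the convergence curve described for Figure 6, and `greedy_path_length(q_table)` reports how many attacks the learned policy needs from the initial waveform.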

The specific implementation of the Bayesian table update algorithm includes the following steps:

Step 1. Once the waveform selection object under the minimum mutual information criterion has been modeled, and the waveform parameters of the radar waveform library and the parameters of the interference signals have been set as described above, waveform transitions are obtained by transmitting different interference signals to the target radar; these transitions are the laboratory training data. Specifically, waveform selection starts from waveform 1, each attack is chosen at random from the 4 interference numbers, the resulting new waveform becomes the current waveform, and the loop repeats until 100 training samples are obtained.

Step 2. Construct the Bayesian network shown in Figure 3, add the prior distribution, and solve for the maximum a posteriori solution. The prior is a Dirichlet distribution with equal probabilities. The Bayesian toolbox in Matlab is used to solve for the conditional probabilities in the Bayesian network according to the following formula, finally giving the posterior probability of the root node.

The Bayesian record table tabulates the waveform transitions under different interference, based on the maximum a posteriori solution obtained above: for each current waveform and interference, the new waveform with the largest posterior probability is taken as the output and recorded in the table, i.e. $S_{t + 1}^{\max} = \underset{S_{t + 1}}{\arg \max} {P (S_{t}, r_{t}, S_{t + 1})} .$

Step 3. Once the Bayesian posterior probability table has been obtained, the algorithm follows the flowchart of the Bayesian table update algorithm in Figure 5: it interacts with the waveform selection object under the minimum mutual information criterion, and then iterates and learns. Figure 7 shows the convergence curve of the algorithm, and Figures 8 and 9 show the learning results of the Bayesian table update algorithm.
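The data collection and record table of the first two steps can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's Matlab toolbox procedure: `radar_response` is a hypothetical stand-in for the modeled radar, the library size of 6 waveforms is assumed, and equal pseudo-counts stand in for the equal-probability Dirichlet prior. The normalized counts estimate the conditional probability of the new waveform; for a fixed current waveform and interference, its arg max over the new waveform coincides with that of the joint probability used in the record table formula.

```python
import random
import numpy as np

N_WAVEFORMS = 6  # assumed library size (not given in this section)
N_JAMMERS = 4    # four interference numbers, as in the text

def radar_response(waveform, jammer):
    """Hypothetical stand-in for the modeled radar; returns a waveform in 1..6."""
    return (2 * waveform + jammer) % N_WAVEFORMS + 1

def collect_training_data(n_samples=100, start_waveform=1, seed=0):
    rng = random.Random(seed)
    data, waveform = [], start_waveform
    for _ in range(n_samples):
        jammer = rng.randint(1, N_JAMMERS)        # random attack
        new_waveform = radar_response(waveform, jammer)
        data.append((waveform, jammer, new_waveform))
        waveform = new_waveform                   # update and loop
    return data

def build_record_table(data, pseudo_count=1.0):
    # counts[s, r, s_next], initialized with equal Dirichlet pseudo-counts
    counts = np.full((N_WAVEFORMS, N_JAMMERS, N_WAVEFORMS), pseudo_count)
    for s, r, s_next in data:
        counts[s - 1, r - 1, s_next - 1] += 1.0
    posterior = counts / counts.sum(axis=2, keepdims=True)
    table = posterior.argmax(axis=2) + 1          # most probable new waveform
    return posterior, table

samples = collect_training_data()
posterior, record_table = build_record_table(samples)
```

`record_table[s - 1, r - 1]` is then the predicted new waveform for current waveform `s` under interference `r`; pairs never observed in the 100 samples keep a uniform posterior and default to waveform 1.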

The convergence of the Q-learning algorithm and of the Bayesian table update algorithm is shown in Figures 6 and 7. The Q-learning algorithm based on Bayesian table updating proposed by the invention converges better, that is, it learns the radar waveform adaptive behavior more effectively. The iteration count curve in Figure 7 shows that, as the Bayesian table update algorithm learns, the number of iterations gradually decreases until it stabilizes; reaching stability means the algorithm has finished learning the target object. The battlefield verification stage follows: using the Bayesian posterior probability table obtained iteratively in the laboratory learning stage together with the algorithm's interference selection strategy, and with the same initial and target states, the interference pattern selection and state transitions on the battlefield are as shown in Figure 8. When the initial waveform is not waveform 1, the posterior probability table learned by the Bayesian table update algorithm gives the average number of iterations needed to reach the target waveform from different initial waveforms, as shown in Figure 9, where the abscissa is the number of the initial waveform used to initialize each run and the ordinate is the number of algorithm iterations needed to reach the target waveform from that initial waveform. After iterative learning, the number of attacks needed to reach the target waveform from each initial waveform is greatly reduced, to within 10 in every case.
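As a rough illustration of the quantity plotted in Figure 9, the sketch below counts the fewest attacks needed to reach a target waveform from every initial waveform, assuming a record table that predicts the next waveform for each (waveform, interference) pair. It uses a plain breadth-first search, which is a stand-in and not the patent's learning algorithm, and the small 4-waveform table is made up for the example.

```python
from collections import deque

def attacks_to_target(table, target):
    """table[s][r] is the predicted next waveform (0-based) for waveform s
    under attack r. Returns {start: fewest attacks to reach target}, with
    None where the target cannot be reached."""
    result = {}
    for start in range(len(table)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            s = queue.popleft()
            if s == target:
                break
            for nxt in table[s]:
                if nxt not in dist:
                    dist[nxt] = dist[s] + 1
                    queue.append(nxt)
        result[start] = dist.get(target)
    return result

# Made-up record table: 4 waveforms, 2 attacks, target waveform 3.
toy_table = [[0, 1],
             [2, 0],
             [2, 3],
             [3, 3]]
fewest = attacks_to_target(toy_table, target=3)
```

If the table were exact, these counts would be a lower bound on the number of attacks any interference selection strategy needs from each initial waveform.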

Claims

1. A radar adaptive behavior Q learning method, characterized in that it comprises the following steps:

S1. The learning party forces the target radar to change its transmitted signal by continuously transmitting probing interference signals, and the receiving end of the learning party acquires the next transmitted signal of the target radar, which is used to complete the dynamic waveform library of the learning party. Specifically, the waveform information obtained by the receiving end of the learning party is compared with the known waveforms; if the waveform is not in the dynamic waveform library, it is stored in the library, and probing interference signals continue to be transmitted until the target radar transmit waveforms obtained in m interactions can all be found in the dynamic waveform library, where m is an empirical value;

S2. Take time-domain waveform selection under the minimum mutual information criterion as the learning object, model it, and let the modeled object interact with the learning party to obtain the waveform transitions under different interference, that is, the laboratory training data;

S3. Use the training data described in S2 to learn the Bayesian network parameters: using the Bayesian toolbox in the Matlab environment and adding a Dirichlet prior distribution, obtain the maximum posterior probability table of the new radar waveform, that is, the Bayesian record table, wherein the Bayesian record table gives the number of the new waveform with the maximum posterior probability under the current waveform and the current interference signal;

S4. On the basis of the original Q-learning algorithm, take the Bayesian record table described in S3 as prior knowledge, perform iterative learning according to the Bayesian table updating algorithm, and give the learning result.

2. The radar adaptive behavior Q learning method according to claim 1, characterized in that the specific method of modeling described in S2 is:

S21. When the waveform selection criterion of the target radar is the minimum mutual information criterion, the radar echo signal is modeled as b=a+w=Sα+w, where S is a waveform convolution matrix containing the waveform parameters, α is the scattering coefficient vector, and w is the receiver noise vector;

S22. Perform waveform selection, specifically: ensure that the waveform to be sent next obtains more new information, that is, that the mutual information between the two successive radar echo signals is minimized, i.e.

\begin{matrix} MI = \min_{s_{i} (n)} {MI (b_{1}, b_{i})} \\ MI (b_{1}, b_{i}) = h (b_{1}) - h (b_{1} | b_{i}) \\ = h (b_{i}) - h (b_{i} | b_{1}) \end{matrix},

Under the assumption that w is Gaussian white noise, the mutual information between waveform 1 and waveform i is expressed through {d_k | k=1,2,...,K}, the singular values of the cross-correlation matrix R_xz, which satisfy 1 ≥ d_1 ≥ d_2 ≥ ... ≥ d_K ≥ 0; the correlation matrices R_11, R_i1, and R_ii are defined as

\begin{matrix} E [b_{1} b_{1}^{H}] = R_{11} = S_{1} R_{\alpha \alpha} S_{1}^{H} + R_{w w} \\ E [b_{i} b_{1}^{H}] = R_{i 1} = S_{i} R_{\alpha \alpha} S_{1}^{H} \\ E [b_{i} b_{i}^{H}] = R_{i i} = S_{i} R_{\alpha \alpha} S_{i}^{H} + R_{w w} \end{matrix},

which gives the mutual information between different waveforms;

S23. Model the waveform selection object described in S22: characterize the different radar waveform states by parameters such as the signal waveform and bandwidth, and select, from the waveform library, the waveform with the minimum mutual information with the previously transmitted waveform as the new radar waveform state;

S24. Set different interference signals to affect the waveform selection of the target radar, and continuously interact with each other to obtain the waveform transitions under different interferences, that is, laboratory training data.

3. The radar adaptive behavior Q learning method according to claim 1, characterized in that the Bayesian network parameter learning described in S3 is specifically:

S31. Use the training data described in S2 to obtain the conditional probabilities in the Bayesian network, for use with Bayes' theorem;

S32. Obtain the posterior probability of the output node, i.e. the root node, from the conditional probabilities and Bayes' theorem described in S31:

P (s_{k + 1} | s_{k}, r_{k}) = \frac{P (r_{k} | s_{k + 1}, s_{k}) P (s_{k + 1} | s_{k})}{P (r_{k} | s_{k})},

where s_{k} is the state of the radar at time k, r_{k} is the attack taken by the learning party at time k, and s_{k+1} is the new state of the radar at time k+1. The left side of the formula is the probability that the radar changes to the new state s_{k+1} at time k+1, given that it is in state s_{k} at time k and the learning party takes attack r_{k}; it is the posterior probability estimate of the new radar state. In the numerator on the right side, P(s_{k+1} | s_{k}) is the state transition probability of the radar, which is also the prior probability of the state at time k+1, and P(r_{k} | s_{k+1}, s_{k}) is the conditional probability that the learning party takes action r_{k} given that the radar is in state s_{k} at time k and in state s_{k+1} at time k+1; that is, with the radar in state s_{k} and a desired state s_{k+1} set, it is the probability with which the learning party selects each attack in order to drive the radar from s_{k} to s_{k+1}. The denominator P(r_{k} | s_{k}) is the integral or sum of the numerator over the new state s_{k+1}; it is the probability that the learning party selects attack r_{k}, still conditioned on the current state s_{k}.

4. The radar adaptive behavior Q learning method according to claim 1, characterized in that the Bayesian table update algorithm described in S4 is specifically as follows:

S41. After the waveform selection object under the minimum mutual information criterion has been modeled and the waveform parameters of the radar waveform library and the interference signal parameters have been set, the waveform transitions are obtained by transmitting different interference signals to the target radar, that is, the laboratory training data are obtained. Specifically, waveform selection starts from waveform 1, each attack is chosen at random from the 4 interference numbers, the resulting new waveform becomes the current waveform, and the loop repeats until 100 training samples are obtained;

S42. Construct a Bayesian network, add the prior distribution, and solve for the maximum a posteriori solution, using the Bayesian toolbox in Matlab to solve for the conditional probabilities in the Bayesian network and finally obtain the posterior probability of the root node, where the prior is a Dirichlet distribution with equal probabilities; the Bayesian record table tabulates the waveform transitions under different interference on the basis of the maximum a posteriori solution: for each current waveform and interference, the new waveform with the largest posterior probability is selected as the output and recorded in the table, that is,

S_{t + 1}^{\max} = \underset{S_{t + 1}}{\arg \max} {P (S_{t}, r_{t}, S_{t + 1})};

S43. After the Bayesian posterior probability table is obtained, follow the flowchart of the Bayesian table update algorithm to interact with the waveform selection object under the minimum mutual information criterion, and then iterate and learn.