CN115454141A - A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information - Google Patents

A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information

Info

Publication number
CN115454141A
Authority
CN
China
Prior art keywords
cluster
time slot
value
unmanned aerial
aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211261459.8A
Other languages
Chinese (zh)
Other versions
CN115454141B (en)
Inventor
刘梦泽
单雯
卢其然
林艳
张一晋
邹骏
吴志娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202211261459.8A priority Critical patent/CN115454141B/en
Publication of CN115454141A publication Critical patent/CN115454141A/en
Application granted granted Critical
Publication of CN115454141B publication Critical patent/CN115454141B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information. The method uses the partial environment observations of each agent, retains historical experience data through a long short-term memory network, and feeds them into each agent's deep recurrent Q-network to fit the action-value function. The ε-greedy algorithm selects the channel and transmit power corresponding to the maximum output Q value, and the deep recurrent Q-network of each agent is then trained independently and continuously to update the Q-value distribution, eventually learning the optimal channel and transmit power decisions that minimize communication transmission energy consumption under unknown jamming scenarios. For UAV swarm networks under sweep jamming and Markov jamming, the invention uses historical experience built from partially observable information to achieve effective multi-agent anti-jamming communication in both the spectrum and power domains. Compared with a baseline scheme based on multi-agent deep Q-learning, the proposed scheme reduces the long-term communication transmission energy consumption of the UAV swarm network more efficiently when the environment information is only partially observable.

Figure 202211261459

Description

A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information

Technical Field

The invention belongs to the field of wireless communication technology, and in particular relates to a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information.

Background

In recent years, with the rapid development of radio technology, the advantages of UAV communication systems have become increasingly prominent, and UAVs are widely used in emergency networks to relieve terminal demand on communication systems. Anti-jamming technology for UAV swarm networks is essential for protecting UAV communication from jamming threats. Among such techniques, frequency-hopping anti-jamming is one of the most common. Because traditional frequency-hopping anti-jamming cannot cope with unknown, highly dynamic, and complex jamming environments, frequency-hopping anti-jamming based on reinforcement learning has become a research hotspot for UAV communication networks in recent years.

Most previous studies adopt the Q-learning (QL) algorithm, which is only suitable for low-dimensional, discrete action spaces; when the action space is large, it suffers from the curse of dimensionality. To address this problem, Shangxing Wang et al. proposed a channel selection algorithm based on online learning with a Deep Q-Network (DQN), which effectively improves the anti-jamming performance of UAV communication networks in complex environments. Fuqiang Yao and Luliang Jia used the Markov Game Framework to establish a multi-agent Markov Decision Process (MDP) model for UAV swarm communication systems, reducing the communication overhead when the method is applied to real communication environments. However, these anti-jamming communication techniques do not account for the partial observability of the communication environment.

Summary of the Invention

The invention aims to provide a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information. Using the Deep Recurrent Q-Network (DRQN) algorithm, each cluster-head UAV, after establishing a Dec-POMDP model, trains its DRQN with a long short-term memory network that retains historical information data, so that the learned model approaches the true environment model.

The technical solution for achieving the purpose of the invention is a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information, with the following specific steps:

Step 1: Initialize the algorithm parameters;

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot;

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value;

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool;

Step 6: When the experience pool contains enough samples, each cluster-head UAV randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent;

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.
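For readers who prefer to see the control flow of steps 1 to 9 in one place, the following Python sketch summarizes it. It is only an illustration: the environment interface (env.reset, env.step), the agent methods, and all parameter names are assumed placeholders standing in for the components described above, not the patent's implementation.

def train(env, agents, num_episodes, slots_per_episode=100, target_update_every=20):
    # Step 1: agents are assumed to be created with initialized algorithm parameters.
    for episode in range(num_episodes):                    # Step 9: repeat until rewards converge
        obs = env.reset()                                   # Step 2: last-slot channels/powers observed
        for agent in agents:
            agent.reset_hidden_state()                      # LSTM hidden state zeroed at episode start
        for t in range(slots_per_episode):                  # Step 8: 100 data transmissions per episode
            actions = [a.select_action(o) for a, o in zip(agents, obs)]   # Step 3: epsilon-greedy choice
            next_obs, rewards = env.step(actions)           # Step 4: energy cost -> environment reward
            for a, o, act, r, o2 in zip(agents, obs, actions, rewards, next_obs):
                a.store(o, act, r, o2)                      # Step 5: experience pool
                a.learn()                                   # Step 6: sample sequences, gradient descent
            if t % target_update_every == 0:
                for a in agents:
                    a.sync_target_network()                 # Step 7: copy value net into target net
            obs = next_obs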

Compared with the prior art, the invention has the following significant advantages: (1) a multi-agent, multi-domain anti-jamming framework suitable for partially observable environments is proposed: with the goal of minimizing the long-term communication transmission energy consumption of the UAV swarm network, the multi-domain anti-jamming decision process is modeled as a multi-agent partially observable Markov decision process, and the observation, action, and reward of each cluster-head UAV in the current time slot, together with the observation of the next time slot, are used as historical experience to help each UAV cluster agent complete its own channel selection and transmit power allocation; (2) a multi-domain anti-jamming algorithm based on multi-agent deep recurrent Q-networks is proposed: a long short-term memory network retains historical information data, which is fed into each agent's deep recurrent Q-network to fit the action-value function, and the parameters of each deep recurrent Q-network are updated, finally yielding optimal channel and transmit power decisions that minimize communication transmission energy consumption under unknown jamming scenarios.

Brief Description of the Drawings

Fig. 1 is a flow chart of the multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information according to the invention.

Fig. 2 shows the learning convergence of different algorithms under the sweep jamming mode.

Fig. 3 shows the learning convergence of different algorithms under the Markov jamming mode.

Fig. 4 shows the converged environment reward of different algorithms versus the number of channels under the sweep jamming mode.

Fig. 5 shows the converged environment reward of different algorithms versus the number of channels under the Markov jamming mode.

Fig. 6 shows the converged environment reward of different algorithms versus the number of jammers under the sweep jamming mode.

Fig. 7 shows the converged environment reward of different algorithms versus the number of jammers under the Markov jamming mode.

Detailed Description

The multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information of the invention comprises the following specific steps:

Step 1: Initialize the algorithm parameters;

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot;

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value;

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool;

Step 6: When the experience pool contains enough samples, each cluster-head UAV randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent;

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.

Further, the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, decay factor θ, value network parameters w, and target network parameters w'.

Further, in step 2, each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot, as follows:

The communication environment in the invention imitates the real environment as closely as possible. In most real environments, owing to noise and interference, an agent cannot observe all of the state information. Therefore, the UAV anti-jamming decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

The system model is modeled as a Dec-POMDP <D, S, A, O, R>, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. Define the current observation of cluster-head UAV n in time slot t+1 as o_n^{t+1} = {f_{n,i}^{t+1}, p_{n,i}^{t+1}} and the joint observation set as O^{t+1} = {o_1^{t+1}, …, o_N^{t+1}}, where f_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t+1. Define the action of cluster-head UAV n in time slot t as a_n^t = {f_{n,i}^t, p_{n,i}^t} and the joint action set as A^t = {a_1^t, …, a_N^t}, where f_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t. Define the joint state set S as the full environment state information and the joint observation set O as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. Define r_n^t as the reward value of cluster-head UAV n in time slot t.
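Purely for illustration, the Dec-POMDP quantities defined above can be carried in code as simple records; the field names and the use of Python dataclasses below are assumptions made for this sketch, not part of the model definition.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Observation:
    """Partial observation o_n^t of one cluster-head UAV: per-member channel and transmit power."""
    channels: Tuple[int, ...]    # f_{n,i}^t for each cluster member i
    powers: Tuple[float, ...]    # p_{n,i}^t for each cluster member i, in dBm

@dataclass
class Experience:
    """One transition (o_n^t, a_n^t, r_n^t, o_n^{t+1}) as stored in a cluster head's experience pool."""
    obs: Observation
    action: Tuple[int, ...]      # chosen (channel, power-level) index per cluster member
    reward: float                # r_n^t returned by the environment
    next_obs: Observation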

Further, in step 3, each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members, as follows:

Step 3-1: The observation of each cluster-head UAV serves as the input of its value network, and the Q value corresponding to each action serves as the output of the value network. The Q value of cluster-head UAV n taking action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative future reward starting from time slot t+1:

Q(o_n^t, a_n^t) = E[ Σ_{k=1}^{∞} γ^{k-1} r_n^{t+k} | o_n^t, a_n^t ]

where s_t is the environment state information in time slot t, and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t.

Step 3-2: Select the action according to the ε-greedy algorithm, as follows:

a_n^t = argmax_a Q(o_n^t, a, h_n^t; w),  if p > ε

a_n^t = a random action from the action space,  if p ≤ ε

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w denotes the value network parameters. In this network the output depends not only on the input but also on the hidden-layer state h_n^t of time slot t; h_n^t stores the past network state of cluster-head UAV n and thus contains historical information. The hidden-layer state is 0 at the beginning of an episode, i.e., it contains no historical information. As the episode proceeds it is updated iteratively: the hidden state produced by the network in time slot t serves as the hidden-layer state of time slot t+1 and thereby affects the output of the value network in time slot t+1, step by step.

This strategy randomly selects an action from the action space with probability ε, which avoids getting trapped in a local optimum. ε is the exploration probability and 1 − ε is the exploitation probability (the probability of selecting the current optimal policy). The larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.

Further, in step 4, each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value, as follows:

Let p_{n,i}^t and p_j^t denote the transmit power of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m in time slot t (when m = n, k ≠ i). G_U and G_J are the antenna gains of the UAVs and of the jammer, d_{n,i}^t (respectively d_{n,j}^t) is the Euclidean distance in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), ρ is the UAV noise figure, σ² is the mean-square value of the environmental noise, h_{n,i}^t (respectively h_{n,j}^t) is the fast fading in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over the additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part. The energy cost E_n^t of cluster-head UAV n in time slot t is then defined by the following expressions:

[Formulas of the original filing: the received signal-to-interference-plus-noise ratio of cluster member i, the maximum average information rate C_{n,i}^t, and the per-slot energy cost E_n^t of cluster-head UAV n]

Here, β = 1 when cluster member i of cluster-head UAV n and jammer j occupy the same channel, and β = 0 otherwise; α = 1 when cluster member i of cluster-head UAV n and cluster member k of cluster-head UAV m occupy the same channel, and α = 0 otherwise. The total environment reward value in time slot t is the sum of the rewards of all cluster-head UAVs, R^t = Σ_{n=1}^{N} r_n^t.

The physical meaning of the energy cost is the energy consumed by one data transmission between cluster-head UAV n and all of its cluster member UAVs.
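The exact SINR, rate, and energy-cost expressions are given as formula images in the original filing and are not reproduced here. The sketch below only shows one conventional way such quantities are often computed (Shannon rate C = B·log2(1 + SINR) and energy = transmit power × transmission time s/C); it is an assumption made for illustration, not the patent's formulas.

import math

def received_sinr(p_tx_w, gain, dist_m, fading, noise_w, interference_w):
    """Illustrative SINR with a squared-distance path loss; an assumed form, not the filed formula."""
    signal_w = p_tx_w * gain * abs(fading) ** 2 / dist_m ** 2
    return signal_w / (noise_w + interference_w)

def transmission_energy(p_tx_w, s_bits, bandwidth_hz, sinr_linear):
    """Illustrative energy to deliver s bits at rate B*log2(1+SINR); an assumed form."""
    rate_bps = bandwidth_hz * math.log2(1.0 + sinr_linear)   # maximum average information rate
    return p_tx_w * (s_bits / rate_bps)                      # power x time, in joules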

Further, in step 5, the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, are stored in its own experience pool, as follows:

When cluster-head UAV n selects the frequency-hopping channels and transmit powers of its cluster member UAVs in time slot t according to a_n^t, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is computed with the reward formula, and the observation o_n^{t+1} is obtained. The historical experience (o_n^t, a_n^t, r_n^t, o_n^{t+1}) produced in the current time slot t is saved in the experience pool.

Further, in step 6, when the experience pool contains enough samples, each agent randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent, as follows:

The input of the neural network of cluster-head UAV n is the observation o_n^t of time slot t, and the output is the Q value corresponding to each action in time slot t. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value network parameters and w' the target network parameters, and in step 7 the target network parameters w' are updated every fixed number of episodes.

Step 6-1: When training its value network, each agent first randomly selects a batch of historical experience data from the experience pool to form several time sequences, each of which is a complete communication episode; it then randomly selects a time slot within each sequence and takes several consecutive steps as training samples. At time slot t of a sample, the value network computes the action value Q(o_n^t, a_n^t, h_n^t; w) of cluster-head UAV n as the estimated Q value, and the target network computes the action value of cluster-head UAV n at time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action Q-value function is computed as

y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, a, h_n^{t+1}; w')

Step 6-2: Substituting the true Q value and the estimated Q value into the following loss and minimizing it updates the value network parameters w and gradually reduces the difference between the two:

L(w) = E[(y_n^t − Q(o_n^t, a_n^t, h_n^t; w))²]

Gradient descent makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden-layer state must be reset to zero; the hidden-layer states of the subsequent steps are then produced iteratively by the network.
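A compact sketch of the update described in steps 6-1 and 6-2 is given below, assuming PyTorch as the framework; the network call signature, tensor shapes, and sequence handling are illustrative assumptions rather than the patent's implementation.

import torch
import torch.nn.functional as F

def drqn_update(value_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on sampled observation sequences (assumed shapes: [batch, seq_len, ...])."""
    obs, actions, rewards, next_obs = batch               # tensors built from the sampled experience
    q_all, _ = value_net(obs, None)                        # hidden state reset to zero before training
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # Q(o_t, a_t, h_t; w)
    with torch.no_grad():                                  # target network provides the "true" Q value
        q_next, _ = target_net(next_obs, None)
        target = rewards + gamma * q_next.max(dim=-1).values        # r + gamma * max_a Q(o_{t+1}, a; w')
    loss = F.mse_loss(q_taken, target)                     # (y - Q)^2, reduced by gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()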

Further, in step 7, every fixed number of time slots the parameters of the value network are copied to form a new target network, i.e., w' ← w.

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

Embodiment

In this embodiment, one episode consists of the cluster-head UAVs and the jammer completing 100 movements together with the corresponding channel and transmit power selections, i.e., completing one communication task; within an episode, each movement, channel selection, and transmit power selection made by the cluster-head UAVs and the jammer is called a time slot.

With reference to Fig. 1, the multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information of this embodiment comprises the following specific steps:

Step 1: Initialize the algorithm parameters.

The algorithm parameters include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, decay factor θ, value network parameters w, and target network parameters w'.

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot. The specific steps are as follows:

The communication environment in the invention imitates the real environment as closely as possible. In most real environments, owing to noise and interference, an agent cannot observe all of the state information. Therefore, the UAV anti-jamming decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

The system model is modeled as a Dec-POMDP <D, S, A, O, R>, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. Define the current observation of cluster-head UAV n in time slot t+1 as o_n^{t+1} = {f_{n,i}^{t+1}, p_{n,i}^{t+1}} and the joint observation set as O^{t+1} = {o_1^{t+1}, …, o_N^{t+1}}, where f_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t+1. Define the action of cluster-head UAV n in time slot t as a_n^t = {f_{n,i}^t, p_{n,i}^t} and the joint action set as A^t = {a_1^t, …, a_N^t}, where f_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t. Define the joint state set S as the full environment state information and the joint observation set O as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. Define r_n^t as the reward value of cluster-head UAV n in time slot t.

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members. The specific steps are as follows:

Step 3-1: The observation of each cluster-head UAV serves as the input of its value network, and the Q value corresponding to each action serves as the output of the value network. The Q value of cluster-head UAV n taking action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative future reward starting from time slot t+1:

Q(o_n^t, a_n^t) = E[ Σ_{k=1}^{∞} γ^{k-1} r_n^{t+k} | o_n^t, a_n^t ]

where s_t is the environment state information in time slot t, and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t.

Step 3-2: Select the action according to the ε-greedy algorithm, as follows:

a_n^t = argmax_a Q(o_n^t, a, h_n^t; w),  if p > ε

a_n^t = a random action from the action space,  if p ≤ ε

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w denotes the value network parameters. In this network the output depends not only on the input but also on the hidden-layer state h_n^t of time slot t; h_n^t stores the past network state of cluster-head UAV n and thus contains historical information. The hidden-layer state is 0 at the beginning of an episode, i.e., it contains no historical information. As the episode proceeds it is updated iteratively: the hidden state produced by the network in time slot t serves as the hidden-layer state of time slot t+1 and thereby affects the output of the value network in time slot t+1, step by step.

This strategy randomly selects an action from the action space with probability ε, which avoids getting trapped in a local optimum. ε is the exploration probability and 1 − ε is the exploitation probability (the probability of selecting the current optimal policy). The larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly. The probability ε is updated as

ε = max{0.01, θ^x}

where x is the index of the current episode.
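The selection rule of step 3-2 together with the decay schedule ε = max{0.01, θ^x} can be sketched as follows; the recurrent Q-network call q_net(obs, hidden) is an assumed interface used only for illustration.

import random

def exploration_probability(theta, episode_index, floor=0.01):
    """Decaying exploration probability: eps = max(0.01, theta ** x)."""
    return max(floor, theta ** episode_index)

def select_action(q_net, obs, hidden, eps, num_actions):
    """Epsilon-greedy over the recurrent Q-network outputs; the hidden state is rolled forward every slot."""
    q_values, hidden = q_net(obs, hidden)
    if random.random() <= eps:
        action = random.randrange(num_actions)        # explore: random (channel, power) pair
    else:
        action = int(q_values.argmax())               # exploit: argmax_a Q(o, a, h; w)
    return action, hidden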

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value. The specific steps are as follows:

Let p_{n,i}^t and p_j^t denote the transmit power of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m in time slot t (when m = n, k ≠ i). G_U and G_J are the antenna gains of the UAVs and of the jammer, d_{n,i}^t (respectively d_{n,j}^t) is the Euclidean distance in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), ρ is the UAV noise figure, σ² is the mean-square value of the environmental noise, h_{n,i}^t (respectively h_{n,j}^t) is the fast fading in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over the additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part. The energy cost E_n^t of cluster-head UAV n in time slot t is then defined by the following expressions:

[Formulas of the original filing: the received signal-to-interference-plus-noise ratio of cluster member i, the maximum average information rate C_{n,i}^t, and the per-slot energy cost E_n^t of cluster-head UAV n]

Here, β = 1 when cluster member i of cluster-head UAV n and jammer j occupy the same channel, and β = 0 otherwise; α = 1 when cluster member i of cluster-head UAV n and cluster member k of cluster-head UAV m occupy the same channel, and α = 0 otherwise. The total environment reward value in time slot t is the sum of the rewards of all cluster-head UAVs, R^t = Σ_{n=1}^{N} r_n^t.

The physical meaning of the energy cost is the energy consumed by one data transmission between cluster-head UAV n and all of its cluster member UAVs.

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool. The specific steps are as follows:

When cluster-head UAV n selects the frequency-hopping channels and transmit powers of its cluster member UAVs in time slot t according to a_n^t, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is computed with the reward formula, and the observation o_n^{t+1} is obtained. The historical experience (o_n^t, a_n^t, r_n^t, o_n^{t+1}) produced in the current time slot t is saved in the experience pool.
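One simple way to organize the experience pool so that whole-episode sequences can later be sampled, as step 6-1 requires, is sketched below; the class and method names are assumptions made for this illustration.

import random
from collections import deque

class SequenceReplayBuffer:
    """Stores (o_t, a_t, r_t, o_{t+1}) tuples grouped by episode so that consecutive
    steps can later be sampled as training sequences."""

    def __init__(self, capacity_episodes=200):            # experience pool size mu
        self.episodes = deque(maxlen=capacity_episodes)
        self.current = []

    def store(self, obs, action, reward, next_obs):
        self.current.append((obs, action, reward, next_obs))

    def end_episode(self):
        if self.current:
            self.episodes.append(self.current)
            self.current = []

    def sample(self, num_sequences, seq_len):
        """Pick random episodes, then a random window of seq_len consecutive steps in each."""
        chosen = random.sample(list(self.episodes), num_sequences)
        batch = []
        for episode in chosen:
            start = random.randrange(max(1, len(episode) - seq_len + 1))
            batch.append(episode[start:start + seq_len])
        return batch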

Step 6: When the experience pool contains enough samples, each agent randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent. The specific steps are as follows:

The neural network of cluster-head UAV n consists of three neural units, the first of which is a Long Short-Term Memory (LSTM) unit. The LSTM structure is a special recurrent neural network structure that can use historical information to predict and process sequence data. An LSTM consists of a forget gate, an input gate, and an output gate: the control parameters of the forget gate determine which historical information is discarded, the input gate determines which new information is added, and the output gate determines the data passed from this LSTM unit to the next.

Forget gate:

f_t = sigmoid(W_f · [h_{t−1}, x_t] + b_f)

Input gate:

i_t = sigmoid(W_i · [h_{t−1}, x_t] + b_i)

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

Output gate:

o_t = sigmoid(W_o · [h_{t−1}, x_t] + b_o),  h_t = o_t ⊙ tanh(c_t)

where W_{i,f,c,o} and b_{i,f,c,o} are the input weights and biases of the gates, and x_t is the input of the LSTM unit in time slot t.

The LSTM structure uses the three gates to decide how much of the input data sequence is retained, so that the future can be predicted from historical information. In the anti-jamming scenario of the invention, the agents exchange only reward information, so the action information of the other agents cannot be determined. The LSTM structure uses the experience contained in historical information to help each agent estimate the actions of the other agents, which yields a better anti-jamming strategy for UAV swarm network communication.

The input of the neural network of cluster-head UAV n is the observation o_n^t of time slot t, and the output is the Q value corresponding to each action in time slot t. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value network parameters and w' the target network parameters, and in step 7 the target network parameters w' are updated every fixed number of episodes.
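A value network matching this description (an LSTM unit followed by fully connected layers that output one Q value per channel/power action) can be sketched as follows, assuming PyTorch; the layer sizes are arbitrary illustrative choices, not the patent's configuration.

import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent value network: the LSTM retains history, the linear layers map to per-action Q values."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)   # forget/input/output gates inside
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_actions)                 # one Q value per (channel, power)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: [batch, seq_len, obs_dim]; hidden=None means a zero hidden state (episode start)
        out, hidden = self.lstm(obs_seq, hidden)
        q_values = self.fc2(torch.relu(self.fc1(out)))
        return q_values, hidden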

Step 6-1: When training its value network, each agent first randomly selects a batch of historical experience data from the experience pool to form several time sequences, each of which is a complete communication episode; it then randomly selects a time slot within each sequence and takes several consecutive steps as training samples. At time slot t of a sample, the value network computes the action value Q(o_n^t, a_n^t, h_n^t; w) of cluster-head UAV n as the estimated Q value, and the target network computes the action value of cluster-head UAV n at time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action Q-value function is computed as

y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, a, h_n^{t+1}; w')

Step 6-2: Substituting the true Q value and the estimated Q value into the following loss and minimizing it updates the value network parameters w and gradually reduces the difference between the two:

L(w) = E[(y_n^t − Q(o_n^t, a_n^t, h_n^t; w))²]

Gradient descent makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden-layer state must be reset to zero; the hidden-layer states of the subsequent steps are then produced iteratively by the network.

The gradient descent process uses Adaptive Moment Estimation (ADAM). During the update of the value network parameters, each iteration samples only one batch of historical experience data for training; since different data sets give different loss functions, using ADAM reduces the probability of converging to a local optimum. ADAM dynamically adjusts the learning rate of each parameter according to first-moment and second-moment estimates of the gradient of the loss function with respect to that parameter. ADAM is based on gradient descent, but the learning step of each parameter in every iteration lies within a bounded range, so a large gradient does not lead to an overly large learning step and the parameter values remain relatively stable. The ADAM algorithm is implemented as follows:

Assume that at time slot t the first derivative of the objective function with respect to the parameters is g_t. First compute the exponential moving averages:

m_t = λ1 · m_{t−1} + (1 − λ1) · g_t

ω_t = λ2 · ω_{t−1} + (1 − λ2) · g_t²

Then compute the two bias-correction terms:

m̂_t = m_t / (1 − λ1^t)

ω̂_t = ω_t / (1 − λ2^t)

The final gradient update is:

w_t = w_{t−1} − η · m̂_t / (√ω̂_t + τ)

which returns the result parameters related to the error function. In the algorithm, m_t is the exponential moving average of the gradient and ω_t is that of the squared gradient, the parameters λ1 and λ2 control the exponential decay rates of these moving averages, η is the learning step, and τ is a constant, typically 10^-8.
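A direct transcription of these ADAM update equations into Python is given below; the decay rates λ1 = 0.9 and λ2 = 0.999 are common defaults assumed here for illustration, as the text does not specify them, and t is the iteration index starting at 1.

import numpy as np

def adam_step(w, grad, m, omega, t, eta=0.002, lam1=0.9, lam2=0.999, tau=1e-8):
    """One ADAM update: exponential moving averages, bias correction, then the parameter step."""
    m = lam1 * m + (1.0 - lam1) * grad                # m_t = lam1 * m_{t-1} + (1 - lam1) * g_t
    omega = lam2 * omega + (1.0 - lam2) * grad ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - lam1 ** t)                     # bias-corrected first moment
    omega_hat = omega / (1.0 - lam2 ** t)             # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(omega_hat) + tau)  # parameter step bounded by the learning rate eta
    return w, m, omega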

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.

The method is implemented in Python. The total number of channels is set to C = 4, the number of UAVs to N = 9, and the number of jammers to J = 1. In time slot t, the transmit power selected by cluster-head UAV n for its cluster member i can be 27, 30, 33, or 36 dBm, and the transmit power selected by jammer j is typically 33 dBm. The UAV swarm is designed so that completing one communication task requires cluster-head UAV n to complete q communications with cluster member UAV i, and the anti-jamming system is simulated under the two different jamming modes. The learning rate is δ = 0.002, the discount factor γ = 0.99, the decay factor θ = 0.998, the experience pool size μ = 200, and the mean-square value of the environmental noise is σ² = −114 dBm. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part.

The learning convergence results under the sweep jamming mode and the Markov jamming mode are shown in Fig. 2 and Fig. 3, respectively. The sweep jammer is set to jam one channel at a time, with a sweep step of 1 MHz. In the Markov jamming mode, four jamming states are set in total; each jamming pattern is generated randomly at the beginning of the simulation, and the transition of the jamming pattern in any time slot follows the state transition matrix below:

[State transition matrix of the Markov jamming mode, given as a figure in the original filing]

Fig. 2 and Fig. 3 show, for the sweep jamming and Markov jamming modes respectively, the convergence of the reward values of the random channel-and-power selection scheme, the DQN-based scheme, and the DRQN-based channel-and-power selection scheme. As the figures show, the DRQN-based scheme reaches a higher converged reward value than the DQN-based scheme, and its convergence is more stable. This is because DRQN contains a long short-term memory network, so each agent can extract hidden information such as the action patterns of the other agents and the dynamics of the environment from historical experience, and the network output is not determined by its own observation alone; the output of DQN, by contrast, is determined entirely by its own observation, so once the environment or the decision patterns of the other agents change, the whole network fluctuates. Comparing Fig. 2 and Fig. 3, the reward convergence of the three channel-and-power selection schemes is broadly similar. Under Markov jamming, the DRQN-based scheme improves performance by 34.6% over the DQN-based scheme and by 54.5% over the random scheme; under sweep jamming, the DRQN-based scheme improves performance by 38.4% over the DQN-based scheme and by 56% over the random scheme.

Fig. 4 and Fig. 5 show, for the sweep jamming and Markov jamming modes respectively, the converged average reward of the three channel-and-power selection schemes versus the number of channels. As the number of channels increases, the converged average reward of all three schemes improves, because more channels reduce the occurrence of co-channel interference and thus lower the energy cost of UAV communication. The converged average reward of the DRQN-based scheme changes less than that of the other schemes, showing that it is not sensitive to this environmental condition.

Fig. 6 and Fig. 7 show, for the sweep jamming and Markov jamming modes respectively, the converged reward of the three channel-and-power selection schemes versus the number of jammers. As the figures show, in both jamming modes the converged average reward of all three schemes tends to decrease as the number of jammers increases and the environment deteriorates, but the converged average reward of the DRQN-based scheme is more stable, decreasing by no more than 10%, so the DRQN algorithm is more robust.

Claims (8)

1. An unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, characterized by comprising the following specific steps:
Step 1: initialize the algorithm parameters;
Step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member unmanned aerial vehicles in its cluster;
Step 3: each cluster head unmanned aerial vehicle adopts the ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster;
Step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with the members in its cluster and obtains the corresponding environment reward value;
Step 5: the observation, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observation of the next time slot, are stored in its experience pool;
Step 6: when the experience pool holds enough sample data, each cluster head unmanned aerial vehicle randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent;
Step 7: every certain number of time slots, the parameters of the value network are copied to form a new target network;
Step 8: repeat step 2 to step 7 until data transmission has been completed 100 times;
Step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, completing the local training.
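For orientation, the following is an editor's minimal Python sketch of the control flow described by steps 1 to 9 of claim 1. It is not part of the patent text: the environment, the action selection and the learning update are toy stand-ins, and all names, constants and the convergence test are illustrative assumptions.

# Editor's sketch of the overall control flow of claim 1 (steps 1-9), standard library only.
# The environment, action selection and learning steps are toy stand-ins.
import random
from collections import deque

SLOTS_PER_ROUND = 100          # step 8: one round = 100 data transmissions
COPY_PERIOD = 20               # step 7: assumed target-copy interval (slots)
MAX_ROUNDS = 200               # safety cap for this toy example

def observe(env_state):        # step 2 (toy): last-slot channel/power of cluster members
    return tuple(random.randrange(8) for _ in range(3))

def epsilon_greedy(obs, eps):  # step 3 (toy): the real version queries the value network
    return tuple(random.randrange(8) for _ in range(3))

def energy_reward(obs, act):   # step 4 (toy): the real version computes the energy overhead
    return -random.random()

value_w, target_w = {"w": 0.0}, {"w": 0.0}       # step 1: initialise parameters
pool = deque(maxlen=10_000)                      # step 5: experience pool
round_totals = deque(maxlen=20)

for _ in range(MAX_ROUNDS):                      # step 9: repeat rounds until convergence
    obs, total = observe(None), 0.0
    for t in range(SLOTS_PER_ROUND):             # step 8: steps 2-7 repeated each slot
        act = epsilon_greedy(obs, eps=0.9)       # step 3
        reward = energy_reward(obs, act)         # step 4
        next_obs = observe(None)                 # step 2 (observation of the next slot)
        pool.append((obs, act, reward, next_obs))            # step 5
        if len(pool) >= 1_000:
            pass                                 # step 6: sample sequences, gradient step
        if t % COPY_PERIOD == 0:
            target_w = dict(value_w)             # step 7: copy value-network parameters
        obs, total = next_obs, total + reward
    round_totals.append(total)
    if len(round_totals) == round_totals.maxlen and \
       max(round_totals) - min(round_totals) < 1e-3:
        break                                    # step 9: total reward has converged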
2. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value network parameters w and target network parameters w'.
3. The unmanned aerial vehicle cluster multi-agent multi-domain anti-jamming method based on partially observable information of claim 1, wherein in step 2, each cluster head unmanned aerial vehicle obtains, by interacting with the environment, the channel and transmit power selected in the previous time slot by the member unmanned aerial vehicles in its cluster, specifically as follows:
The communication environment in the invention imitates the real environment as closely as possible; in most real environments, an agent cannot observe all of the state information because of noise and interference. The UAV anti-jamming decision problem is therefore modeled as a decentralized partially observable Markov decision process (Dec-POMDP).
The system model is formulated as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. The observation of cluster head unmanned aerial vehicle n in time slot t+1 is defined as o_{t+1}^n = {(f_{t+1}^{n,i}, p_{t+1}^{n,i})}_i, and the joint observation is o_{t+1} = (o_{t+1}^1, …, o_{t+1}^N), where f_{t+1}^{n,i} is the channel of cluster member i of cluster head unmanned aerial vehicle n in time slot t+1, and p_{t+1}^{n,i} is the transmit power selected by cluster head unmanned aerial vehicle n for cluster member i in time slot t+1. The action of cluster head unmanned aerial vehicle n in time slot t is defined as a_t^n = {(f_t^{n,i}, p_t^{n,i})}_i, and the joint action is a_t = (a_t^1, …, a_t^N), where f_t^{n,i} is the channel to which cluster member i of cluster head unmanned aerial vehicle n frequency-hops in time slot t, and p_t^{n,i} is the transmit power selected by cluster head unmanned aerial vehicle n for cluster member i in time slot t. The joint state set S is defined as all environment state information, and the joint observation set O is defined as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. r_t^n is defined as the reward value of cluster head unmanned aerial vehicle n in time slot t.
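As a concrete illustration of the Dec-POMDP elements in claim 3, the short Python sketch below encodes one cluster head's observation o_t^n and action a_t^n as per-member (channel, power) pairs. The class and field names are editorial assumptions for illustration only, not part of the patent.

# Editor's sketch of the Dec-POMDP elements in claim 3; names are assumptions.
from dataclasses import dataclass
from typing import Dict, Tuple

ChannelPower = Tuple[int, float]   # (f^{n,i}, p^{n,i}): channel index, transmit power

@dataclass(frozen=True)
class ClusterHeadObservation:
    """o_t^n: per-member channel and power seen by cluster head n in slot t."""
    slot: int
    member_choices: Dict[int, ChannelPower]   # member id -> (channel, power)

@dataclass(frozen=True)
class ClusterHeadAction:
    """a_t^n: per-member channel and power chosen by cluster head n for slot t."""
    slot: int
    member_choices: Dict[int, ChannelPower]

# The joint observation and joint action are simply tuples over the N agents:
# o_t = (o_t^1, ..., o_t^N), a_t = (a_t^1, ..., a_t^N); the reward of agent n
# in slot t is a scalar r_t^n.
obs = ClusterHeadObservation(slot=5, member_choices={0: (3, 0.5), 1: (7, 1.0)})
act = ClusterHeadAction(slot=5, member_choices={0: (2, 0.5), 1: (4, 0.1)})
print(obs, act)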
4. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 3, each cluster head unmanned aerial vehicle adopts the ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster, specifically as follows:
Step 3-1: the observation of each cluster head unmanned aerial vehicle is the input of the value network, and the Q value of each action is the output of the value network, where the Q value Q(o_t^n, a_t^n) of cluster head unmanned aerial vehicle n executing action a_t^n under observation o_t^n in time slot t is the expectation of the cumulative future reward value from the beginning of time slot t+1:

Q(o_t^n, a_t^n) = E[ Σ_{k=1}^{∞} γ^{k−1} r_{t+k}^n | o_t^n, a_t^n ],

where s_t is the environment state information of time slot t, and P(s_{t+1} | s_t, a_t^n) is the probability that the environment state transitions from s_t to s_{t+1} when cluster head unmanned aerial vehicle n takes action a_t^n in time slot t.
Step 3-2: the action is selected according to the ε-greedy algorithm in the following way:

a_t^n = argmax_a Q(o_t^n, a, h_t^n; w) when p > ε, and a_t^n is selected uniformly at random from the action space when p ≤ ε,

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_t^n is the hidden layer state of the neural network of cluster head unmanned aerial vehicle n in time slot t, and w is the value network parameter. In this network the output is related not only to the input but also to the hidden layer state h_t^n of time slot t; h_t^n stores the past network state of cluster head unmanned aerial vehicle n and therefore contains historical information. The hidden layer state is 0 at the beginning of a round, i.e. it contains no historical information. As the round proceeds it is updated iteratively: the hidden state produced by the network in time slot t is used as the hidden layer state of time slot t+1, thereby influencing the output of the value network in time slot t+1, and so on step by step.
The strategy selects an action at random from the action space with probability ε, which avoids falling into a local optimum. ε is the exploration probability and 1−ε is the exploitation probability (the probability of selecting the current best strategy); the larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should be large; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should be increased accordingly.
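The ε-greedy selection of step 3-2, driven by a recurrent Q-network whose output also depends on the hidden layer state, might look roughly like the following Python (PyTorch) sketch. The network architecture, dimensions and all names are assumptions made for illustration; the patent does not specify an implementation.

# Editor's sketch of epsilon-greedy selection with a recurrent Q-network (step 3-2).
import random
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); hidden: LSTM state (h, c) or None (zeros)
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden            # Q values per step, new hidden state

def select_action(net, obs, hidden, epsilon):
    """epsilon-greedy: explore with probability epsilon, otherwise pick argmax_a Q."""
    obs_seq = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
    with torch.no_grad():
        q_values, hidden = net(obs_seq, hidden)  # output depends on input AND h_t^n
    if random.random() <= epsilon:               # p <= epsilon: random exploration
        action = random.randrange(q_values.shape[-1])
    else:                                        # p > epsilon: exploit current best action
        action = int(q_values[0, -1].argmax())
    return action, hidden

net = RecurrentQNet(obs_dim=4, num_actions=6)
hidden = None                                    # zero hidden state at the start of a round
a, hidden = select_action(net, [0.1, 0.3, 0.0, 1.0], hidden, epsilon=0.9)
print("chosen action:", a)

Decaying ε over training makes exploration dominate early and exploitation dominate later, matching the schedule described in the claim.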
5. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 4, each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with the members in its cluster and obtains the corresponding environment reward value, specifically as follows:
Denote by p_t^{n,i} and p_t^j the transmit power in time slot t of cluster member i of cluster head unmanned aerial vehicle n and of jammer j, respectively, and by p_t^{m,k} the transmit power of cluster member k of cluster head unmanned aerial vehicle m in time slot t (with k ≠ i when m = n). G_U and G_J are the antenna gains of the unmanned aerial vehicles and of the jammer, respectively; d_t^{n,i} (d_t^{n,j}) is the Euclidean distance in time slot t between cluster head unmanned aerial vehicle n and its cluster member i (jammer j); ρ is the noise figure of the unmanned aerial vehicle; σ² is the mean-square value of the ambient noise; g_t^{n,i} (g_t^{n,j}) is the fast fading in time slot t between cluster head unmanned aerial vehicle n and its cluster member i (jammer j); B is the channel bandwidth; t is the time required for a single communication transmission; s is the data size of a single communication transmission; and C_t^{n,i} is the maximum average information rate at which cluster head unmanned aerial vehicle n and its cluster member i can transmit without error over an additive white Gaussian noise channel. The Rician fading channel gain is modeled with its real part and its imaginary part as independent and identically distributed Gaussian random processes with mean 0 and variance ξ², so the fast fading of the channel is written as g = a + bj, where a is the real part and b the imaginary part. The energy overhead of cluster head unmanned aerial vehicle n in time slot t is denoted E_t^n [the explicit energy-overhead expressions are given as formula images in the original filing]. When cluster member i of cluster head unmanned aerial vehicle n is in the same channel as jammer j, β = 1, otherwise β = 0; when cluster member i of cluster head unmanned aerial vehicle n is in the same channel as cluster member k of cluster head unmanned aerial vehicle m, α = 1, otherwise α = 0. The total environment reward value of time slot t is defined from these energy overheads [formula image in the original filing].
The practical physical meaning of the energy overhead is the energy consumed by cluster head unmanned aerial vehicle n and all of its cluster member unmanned aerial vehicles to complete one data transmission.
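Since the energy-overhead expressions of claim 5 appear only as formula images in the filing, the Python sketch below uses a standard SINR and Shannon-capacity model as a stand-in to show the kind of computation involved: the received signal is attenuated by distance and fast fading, co-channel and jammer power raise the interference term, the achievable rate fixes the transmission time, and energy is power times time. All constants and the path-loss form are assumptions, not the patent's formulas.

# Editor's sketch of an energy-overhead computation in the spirit of claim 5.
# The SINR, rate and energy expressions are standard textbook forms assumed for
# illustration, not the patent's exact formulas.
import math

def transmission_energy(p_tx,            # transmit power of cluster member i (W)
                        g_fading,        # |fast fading|^2 between head n and member i
                        d,               # distance between head n and member i (m)
                        interference,    # received co-channel + jammer power (W)
                        G_u=1.0,         # UAV antenna gain
                        rho=1.0,         # noise figure
                        sigma2=1e-9,     # ambient noise mean-square value (W)
                        B=1e6,           # channel bandwidth (Hz)
                        s_bits=1e5,      # data size of one transmission (bits)
                        path_loss_exp=2.0):
    """Energy spent by one cluster member to deliver s_bits to its cluster head."""
    signal = p_tx * G_u * g_fading / (d ** path_loss_exp)
    sinr = signal / (rho * sigma2 + interference)
    rate = B * math.log2(1.0 + sinr)     # maximum error-free rate over the AWGN channel
    t_needed = s_bits / rate             # time needed to send the packet
    return p_tx * t_needed               # energy overhead of this link

# Interference switched on by the indicators beta (jammer) and alpha (other clusters):
jammer_rx = 1.0 * 2.0 * 0.8 / (120.0 ** 2)     # beta = 1: jammer on the same channel
energy = transmission_energy(p_tx=0.5, g_fading=0.9, d=80.0, interference=jammer_rx)
print(f"energy overhead of one link: {energy:.3e} J")

A cluster head's total overhead would sum such link energies over its members, and a reward that decreases as this energy grows (for example its negative) is consistent with the stated goal of minimizing the energy consumed per data transmission.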
6. The unmanned aerial vehicle cluster multi-agent multi-domain anti-jamming method based on partially observable information as claimed in claim 1, wherein in step 5, the observation, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observation of the next time slot, are stored in its experience pool, specifically as follows:
After cluster head unmanned aerial vehicle n selects in time slot t, according to a_t^n, the frequency-hopping channel and transmit power of its cluster member unmanned aerial vehicles, the environment state jumps from s_t to s_{t+1}; the reward r_t^n obtained by selecting action a_t^n in state s_t is calculated by the reward value formula, and the observation o_{t+1}^n is obtained. The tuple (o_t^n, a_t^n, r_t^n, o_{t+1}^n) generated in the current time slot t is stored as historical experience data in the experience pool.
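A minimal sketch of the per-agent experience pool of claim 6 is given below: each slot appends a tuple (o_t, a_t, r_t, o_{t+1}), and training later draws short consecutive sequences as required by the recurrent value network. The capacity, sequence length and names are assumptions for illustration.

# Editor's sketch of the experience pool in claim 6; names and sizes are assumptions.
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def store(self, obs, action, reward, next_obs):
        self.data.append((obs, action, reward, next_obs))

    def sample_sequence(self, seq_len=8):
        """Pick a random start index and return seq_len consecutive transitions."""
        start = random.randrange(len(self.data) - seq_len + 1)
        return [self.data[start + k] for k in range(seq_len)]

pool = ExperiencePool()
for t in range(100):                       # fill with toy transitions
    pool.store(obs=t, action=t % 4, reward=-0.1 * t, next_obs=t + 1)
batch = [pool.sample_sequence(seq_len=8) for _ in range(4)]   # 4 sampled sequences
print(len(batch), len(batch[0]))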
7. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information according to claim 1, wherein in step 6, when the sample data in the experience pools is sufficient, each agent randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent, specifically as follows:
The observation o_t^n of time slot t is the input of the neural network of cluster head unmanned aerial vehicle n, and the Q values corresponding to each action in time slot t are the output. To enhance the stability of the algorithm, the invention adopts a double-network structure: w denotes the value network parameters and w' the target network parameters, and the target network parameters w' are updated once every certain number of rounds in step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences, each of which is a complete communication round; a time slot is randomly selected from each sequence and several consecutive steps are taken as training samples. At sample time slot t, the value network computes the action-value function Q(o_t^n, a_t^n, h_t^n; w) of cluster head unmanned aerial vehicle n in time slot t as the estimated Q value, and the target network computes the action-value function max_a Q(o_{t+1}^n, a, h_{t+1}^n; w') of cluster head unmanned aerial vehicle n in time slot t+1, where o_t^n, a_t^n and h_t^n are the observation, action and hidden layer state of cluster head unmanned aerial vehicle n in time slot t. The true value of the action-value function is calculated as:

y_t^n = r_t^n + γ max_a Q(o_{t+1}^n, a, h_{t+1}^n; w').
Step 6-2: the true Q value and the estimated Q value are substituted into the loss below, and the value network parameter w is updated by gradient descent with learning rate δ so as to gradually reduce

L(w) = E[ (y_t^n − Q(o_t^n, a_t^n, h_t^n; w))² ],

i.e. the Q value calculated by the value network is brought closer to the true Q value by the gradient descent method. Before each agent trains the neural network, the hidden layer state must be set to zero; the hidden layer states h_t^n of the subsequent consecutive steps are generated by the network iterations.
8. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method according to claim 1, wherein in step 7, every certain number of time slots, the parameters of the value network are copied to form a new target network, i.e. w' ← w.
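Finally, the double-network update of claims 7 and 8 (steps 6 and 7) could be sketched as follows in Python (PyTorch): the value network produces the estimated Q values for a sampled sequence, the target network produces the bootstrapped targets, the squared error is reduced by gradient descent, and the value-network parameters are periodically copied into the target network. Network sizes, the optimizer and all names are illustrative assumptions, not the patent's implementation.

# Editor's sketch of the DRQN double-network update (claims 7 and 8).
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim=4, num_actions=6, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):      # hidden=None -> zeroed state at round start
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden

value_net, target_net = RecurrentQNet(), RecurrentQNet()
target_net.load_state_dict(value_net.state_dict())           # claim 8: w' <- w
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.9

# One sampled training batch of sequences (batch=2, seq_len=8) with toy data.
obs      = torch.randn(2, 8, 4)
actions  = torch.randint(0, 6, (2, 8))
rewards  = torch.randn(2, 8)
next_obs = torch.randn(2, 8, 4)

q_all, _ = value_net(obs)                                     # estimated Q values
q_taken  = q_all.gather(dim=2, index=actions.unsqueeze(-1)).squeeze(-1)
with torch.no_grad():
    q_next_all, _ = target_net(next_obs)
    y = rewards + gamma * q_next_all.max(dim=2).values        # "true" Q target (step 6-1)

loss = nn.functional.mse_loss(q_taken, y)                     # step 6-2 loss
optimizer.zero_grad()
loss.backward()                                               # gradient descent on w
optimizer.step()

# claim 8: every fixed number of slots, copy value-network parameters into the target
target_net.load_state_dict(value_net.state_dict())
print("loss:", float(loss))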
CN202211261459.8A 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information Active CN115454141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211261459.8A CN115454141B (en) 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information

Publications (2)

Publication Number Publication Date
CN115454141A true CN115454141A (en) 2022-12-09
CN115454141B CN115454141B (en) 2025-04-04

Family

ID=84311660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211261459.8A Active CN115454141B (en) 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information

Country Status (1)

Country Link
CN (1) CN115454141B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116131963A (en) * 2023-02-02 2023-05-16 广东工业大学 Noise Equalization Method Based on LSTM Neural Network for Optical Fiber Link Multipath Interference
CN116432690A (en) * 2023-06-15 2023-07-14 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN118870445A (en) * 2024-07-05 2024-10-29 长春理工大学 A joint optimization method for relay communication rate of clustered UAVs under interference conditions
CN119336049A (en) * 2024-12-23 2025-01-21 成都航空职业技术学院 A method for suppressing dynamic wind disturbance of unmanned aerial vehicles
CN119536353A (en) * 2025-01-20 2025-02-28 西安电子科技大学 A real-time decision-making method for intelligent agent paths facing dynamic threats and local perception

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179777A (en) * 2017-06-03 2017-09-19 复旦大学 Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system
US20190011531A1 (en) * 2016-03-11 2019-01-10 Goertek Inc. Following method and device for unmanned aerial vehicle and wearable device
CN113382381A (en) * 2021-05-30 2021-09-10 南京理工大学 Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning
US20210373552A1 (en) * 2018-11-06 2021-12-02 Battelle Energy Alliance, Llc Systems, devices, and methods for millimeter wave communication for unmanned aerial vehicles
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Multi-UAV distributed intelligent task assignment method for dynamic environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李孜恒; 孟超: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (基于深度强化学习的无线网络资源分配算法), 通信技术 (Communications Technology), no. 08, 10 August 2020 (2020-08-10) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116131963A (en) * 2023-02-02 2023-05-16 广东工业大学 Noise Equalization Method Based on LSTM Neural Network for Optical Fiber Link Multipath Interference
CN116432690A (en) * 2023-06-15 2023-07-14 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN116432690B (en) * 2023-06-15 2023-08-18 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN117675054B (en) * 2024-02-02 2024-04-23 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN118870445A (en) * 2024-07-05 2024-10-29 长春理工大学 A joint optimization method for relay communication rate of clustered UAVs under interference conditions
CN119336049A (en) * 2024-12-23 2025-01-21 成都航空职业技术学院 A method for suppressing dynamic wind disturbance of unmanned aerial vehicles
CN119536353A (en) * 2025-01-20 2025-02-28 西安电子科技大学 A real-time decision-making method for intelligent agent paths facing dynamic threats and local perception

Also Published As

Publication number Publication date
CN115454141B (en) 2025-04-04

Similar Documents

Publication Publication Date Title
CN115454141A (en) A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN110488861A (en) Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN111491358B (en) Adaptive modulation and power control system based on energy acquisition and optimization method
CN114281103B (en) A zero-interaction communication collaborative search method for aircraft clusters
CN116720674A (en) Deep reinforcement learning short-term stochastic optimization scheduling method for wind-solar-cascade reservoirs
CN112491818A (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN116866895A (en) An intelligent confrontation method based on neural virtual self-game
CN114298166A (en) A method and system for predicting spectrum availability based on wireless communication network
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN116956998A (en) Radar interference decision and parameter optimization method and device based on hierarchical reinforcement learning
CN106953801B (en) Random shortest path realization method based on hierarchical learning automaton
CN118945686A (en) A multi-user energy harvesting resource allocation method based on UAV interference assistance
CN119439709A (en) Parallel control method and device for power transmission line construction equipment based on Bi-LSTM and DDPG algorithm
CN116755046B (en) Multifunctional radar interference decision-making method based on imperfect expert strategy
CN118536380A (en) A method and system for predicting air conditioning energy consumption based on error compensation of multiple prediction models
Tan et al. A hybrid architecture of cognitive decision engine based on particle swarm optimization algorithms and case database
Mealing et al. Opponent modelling by sequence prediction and lookahead in two-player games
CN116073856B (en) Intelligent frequency hopping anti-interference decision method based on depth deterministic strategy
WO2024134260A1 (en) Penetration testing method for cyber-physical systems
Janiar et al. A transfer learning approach based on integrated feature extractor for anti-jamming in wireless networks
Xiong et al. Few-shot learning in wireless networks: a meta-learning model-enabled scheme
Li et al. Anti-Jamming System Based on Deep Reinforcement Learning in Dynamic Environment
CN119918572B (en) Operation optimizing algorithm development system under intelligent game

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant