CN115454141A - A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information - Google Patents

A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information

Info

Publication number
CN115454141A
Authority
CN
China
Prior art keywords
cluster
time slot
value
unmanned aerial
aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211261459.8A
Other languages
Chinese (zh)
Other versions
CN115454141B (en)
Inventor
刘梦泽
单雯
卢其然
林艳
张一晋
邹骏
吴志娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202211261459.8A priority Critical patent/CN115454141B/en
Publication of CN115454141A publication Critical patent/CN115454141A/en
Application granted granted Critical
Publication of CN115454141B publication Critical patent/CN115454141B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information. The method uses the partial environment observations of each agent, retains historical experience data through a long short-term memory network, and feeds them into each agent's deep recurrent Q-network to fit the action-value function. The ε-greedy algorithm selects the channel and transmit power corresponding to the maximum output Q value, and the deep recurrent Q-network of each agent is then trained independently and continuously to update the Q-value distribution, eventually learning the optimal channel and transmit power decisions that minimize communication transmission energy consumption under unknown jamming scenarios. For UAV swarm networks under sweep jamming and Markov jamming, the invention uses historical experience built from partially observable information to achieve effective multi-agent anti-jamming communication in both the spectrum and power domains. Compared with a baseline scheme based on multi-agent deep Q-learning, the proposed scheme reduces the long-term communication transmission energy consumption of the UAV swarm network more efficiently when the environment information is only partially observable.

Figure 202211261459

Description

A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information

Technical Field

The invention belongs to the field of wireless communication technology, and in particular relates to a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information.

Background

In recent years, with the rapid development of radio technology, the advantages of UAV communication systems have become increasingly prominent, and UAVs are widely used in emergency networks to relieve terminal demand on communication systems. Anti-jamming technology for UAV swarm networks is essential for protecting UAV communication from jamming threats. Among such techniques, frequency-hopping anti-jamming is one of the most common. Because traditional frequency-hopping anti-jamming cannot cope with unknown, highly dynamic, and complex jamming environments, frequency-hopping anti-jamming based on reinforcement learning has become a research hotspot for UAV communication networks in recent years.

Most previous studies adopt the Q-learning (QL) algorithm, which is only suitable for low-dimensional, discrete action spaces; when the action space is large, it suffers from the curse of dimensionality. To address this problem, Shangxing Wang et al. proposed a channel selection algorithm based on online learning with a Deep Q-Network (DQN), which effectively improves the anti-jamming performance of UAV communication networks in complex environments. Fuqiang Yao and Luliang Jia used the Markov Game Framework to establish a multi-agent Markov Decision Process (MDP) model for UAV swarm communication systems, reducing the communication overhead when the method is applied to real communication environments. However, these anti-jamming communication techniques do not account for the partial observability of the communication environment.

Summary of the Invention

The invention aims to provide a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information. Using the Deep Recurrent Q-Network (DRQN) algorithm, each cluster-head UAV, after establishing a Dec-POMDP model, trains its DRQN with a long short-term memory network that retains historical information data, so that the learned model approaches the true environment model.

The technical solution for achieving the purpose of the invention is a multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information, with the following specific steps:

Step 1: Initialize the algorithm parameters;

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot;

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value;

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool;

Step 6: When the experience pool contains enough samples, each cluster-head UAV randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent;

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.
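For readers who prefer to see the control flow of steps 1 to 9 in one place, the following Python sketch summarizes it. It is only an illustration: the environment interface (env.reset, env.step), the agent methods, and all parameter names are assumed placeholders standing in for the components described above, not the patent's implementation.

def train(env, agents, num_episodes, slots_per_episode=100, target_update_every=20):
    # Step 1: agents are assumed to be created with initialized algorithm parameters.
    for episode in range(num_episodes):                    # Step 9: repeat until rewards converge
        obs = env.reset()                                   # Step 2: last-slot channels/powers observed
        for agent in agents:
            agent.reset_hidden_state()                      # LSTM hidden state zeroed at episode start
        for t in range(slots_per_episode):                  # Step 8: 100 data transmissions per episode
            actions = [a.select_action(o) for a, o in zip(agents, obs)]   # Step 3: epsilon-greedy choice
            next_obs, rewards = env.step(actions)           # Step 4: energy cost -> environment reward
            for a, o, act, r, o2 in zip(agents, obs, actions, rewards, next_obs):
                a.store(o, act, r, o2)                      # Step 5: experience pool
                a.learn()                                   # Step 6: sample sequences, gradient descent
            if t % target_update_every == 0:
                for a in agents:
                    a.sync_target_network()                 # Step 7: copy value net into target net
            obs = next_obs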

Compared with the prior art, the invention has the following significant advantages: (1) a multi-agent, multi-domain anti-jamming framework suitable for partially observable environments is proposed: with the goal of minimizing the long-term communication transmission energy consumption of the UAV swarm network, the multi-domain anti-jamming decision process is modeled as a multi-agent partially observable Markov decision process, and the observation, action, and reward of each cluster-head UAV in the current time slot, together with the observation of the next time slot, are used as historical experience to help each UAV cluster agent complete its own channel selection and transmit power allocation; (2) a multi-domain anti-jamming algorithm based on multi-agent deep recurrent Q-networks is proposed: a long short-term memory network retains historical information data, which is fed into each agent's deep recurrent Q-network to fit the action-value function, and the parameters of each deep recurrent Q-network are updated, finally yielding optimal channel and transmit power decisions that minimize communication transmission energy consumption under unknown jamming scenarios.

Brief Description of the Drawings

Fig. 1 is a flow chart of the multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information according to the invention.

Fig. 2 shows the learning convergence of different algorithms under the sweep jamming mode.

Fig. 3 shows the learning convergence of different algorithms under the Markov jamming mode.

Fig. 4 shows the converged environment reward of different algorithms versus the number of channels under the sweep jamming mode.

Fig. 5 shows the converged environment reward of different algorithms versus the number of channels under the Markov jamming mode.

Fig. 6 shows the converged environment reward of different algorithms versus the number of jammers under the sweep jamming mode.

Fig. 7 shows the converged environment reward of different algorithms versus the number of jammers under the Markov jamming mode.

Detailed Description

The multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information of the invention comprises the following specific steps:

Step 1: Initialize the algorithm parameters;

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot;

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members;

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value;

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool;

Step 6: When the experience pool contains enough samples, each cluster-head UAV randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent;

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.

Further, the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, decay factor θ, value network parameters w, and target network parameters w'.

Further, in step 2, each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot, as follows:

The communication environment in the invention imitates the real environment as closely as possible. In most real environments, owing to noise and interference, an agent cannot observe all of the state information. Therefore, the UAV anti-jamming decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

The system model is modeled as a Dec-POMDP <D, S, A, O, R>, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. Define the current observation of cluster-head UAV n in time slot t+1 as o_n^{t+1} = {f_{n,i}^{t+1}, p_{n,i}^{t+1}} and the joint observation set as O^{t+1} = {o_1^{t+1}, …, o_N^{t+1}}, where f_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t+1. Define the action of cluster-head UAV n in time slot t as a_n^t = {f_{n,i}^t, p_{n,i}^t} and the joint action set as A^t = {a_1^t, …, a_N^t}, where f_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t. Define the joint state set S as the full environment state information and the joint observation set O as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. Define r_n^t as the reward value of cluster-head UAV n in time slot t.
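Purely for illustration, the Dec-POMDP quantities defined above can be carried in code as simple records; the field names and the use of Python dataclasses below are assumptions made for this sketch, not part of the model definition.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Observation:
    """Partial observation o_n^t of one cluster-head UAV: per-member channel and transmit power."""
    channels: Tuple[int, ...]    # f_{n,i}^t for each cluster member i
    powers: Tuple[float, ...]    # p_{n,i}^t for each cluster member i, in dBm

@dataclass
class Experience:
    """One transition (o_n^t, a_n^t, r_n^t, o_n^{t+1}) as stored in a cluster head's experience pool."""
    obs: Observation
    action: Tuple[int, ...]      # chosen (channel, power-level) index per cluster member
    reward: float                # r_n^t returned by the environment
    next_obs: Observation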

Further, in step 3, each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members, as follows:

Step 3-1: The observation of each cluster-head UAV serves as the input of its value network, and the Q value corresponding to each action serves as the output of the value network. The Q value of cluster-head UAV n taking action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative future reward starting from time slot t+1:

Q(o_n^t, a_n^t) = E[ Σ_{k=1}^{∞} γ^{k-1} r_n^{t+k} | o_n^t, a_n^t ]

where s_t is the environment state information in time slot t, and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t.

Step 3-2: Select the action according to the ε-greedy algorithm, as follows:

a_n^t = argmax_a Q(o_n^t, a, h_n^t; w),  if p > ε

a_n^t = a random action from the action space,  if p ≤ ε

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w denotes the value network parameters. In this network the output depends not only on the input but also on the hidden-layer state h_n^t of time slot t; h_n^t stores the past network state of cluster-head UAV n and thus contains historical information. The hidden-layer state is 0 at the beginning of an episode, i.e., it contains no historical information. As the episode proceeds it is updated iteratively: the hidden state produced by the network in time slot t serves as the hidden-layer state of time slot t+1 and thereby affects the output of the value network in time slot t+1, step by step.

This strategy randomly selects an action from the action space with probability ε, which avoids getting trapped in a local optimum. ε is the exploration probability and 1 − ε is the exploitation probability (the probability of selecting the current optimal policy). The larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly.

Further, in step 4, each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value, as follows:

Let p_{n,i}^t and p_j^t denote the transmit power of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m in time slot t (when m = n, k ≠ i). G_U and G_J are the antenna gains of the UAVs and of the jammer, d_{n,i}^t (respectively d_{n,j}^t) is the Euclidean distance in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), ρ is the UAV noise figure, σ² is the mean-square value of the environmental noise, h_{n,i}^t (respectively h_{n,j}^t) is the fast fading in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over the additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part. The energy cost E_n^t of cluster-head UAV n in time slot t is then defined by the following expressions:

[Formulas of the original filing: the received signal-to-interference-plus-noise ratio of cluster member i, the maximum average information rate C_{n,i}^t, and the per-slot energy cost E_n^t of cluster-head UAV n]

Here, β = 1 when cluster member i of cluster-head UAV n and jammer j occupy the same channel, and β = 0 otherwise; α = 1 when cluster member i of cluster-head UAV n and cluster member k of cluster-head UAV m occupy the same channel, and α = 0 otherwise. The total environment reward value in time slot t is the sum of the rewards of all cluster-head UAVs, R^t = Σ_{n=1}^{N} r_n^t.

The physical meaning of the energy cost is the energy consumed by one data transmission between cluster-head UAV n and all of its cluster member UAVs.
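The exact SINR, rate, and energy-cost expressions are given as formula images in the original filing and are not reproduced here. The sketch below only shows one conventional way such quantities are often computed (Shannon rate C = B·log2(1 + SINR) and energy = transmit power × transmission time s/C); it is an assumption made for illustration, not the patent's formulas.

import math

def received_sinr(p_tx_w, gain, dist_m, fading, noise_w, interference_w):
    """Illustrative SINR with a squared-distance path loss; an assumed form, not the filed formula."""
    signal_w = p_tx_w * gain * abs(fading) ** 2 / dist_m ** 2
    return signal_w / (noise_w + interference_w)

def transmission_energy(p_tx_w, s_bits, bandwidth_hz, sinr_linear):
    """Illustrative energy to deliver s bits at rate B*log2(1+SINR); an assumed form."""
    rate_bps = bandwidth_hz * math.log2(1.0 + sinr_linear)   # maximum average information rate
    return p_tx_w * (s_bits / rate_bps)                      # power x time, in joules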

Further, in step 5, the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, are stored in its own experience pool, as follows:

When cluster-head UAV n selects the frequency-hopping channels and transmit powers of its cluster member UAVs in time slot t according to a_n^t, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is computed with the reward formula, and the observation o_n^{t+1} is obtained. The historical experience (o_n^t, a_n^t, r_n^t, o_n^{t+1}) produced in the current time slot t is saved in the experience pool.

Further, in step 6, when the experience pool contains enough samples, each agent randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent, as follows:

The input of the neural network of cluster-head UAV n is the observation o_n^t of time slot t, and the output is the Q value corresponding to each action in time slot t. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value network parameters and w' the target network parameters, and in step 7 the target network parameters w' are updated every fixed number of episodes.

Step 6-1: When training its value network, each agent first randomly selects a batch of historical experience data from the experience pool to form several time sequences, each of which is a complete communication episode; it then randomly selects a time slot within each sequence and takes several consecutive steps as training samples. At time slot t of a sample, the value network computes the action value Q(o_n^t, a_n^t, h_n^t; w) of cluster-head UAV n as the estimated Q value, and the target network computes the action value of cluster-head UAV n at time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action Q-value function is computed as

y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, a, h_n^{t+1}; w')

Step 6-2: Substituting the true Q value and the estimated Q value into the following loss and minimizing it updates the value network parameters w and gradually reduces the difference between the two:

L(w) = E[(y_n^t − Q(o_n^t, a_n^t, h_n^t; w))²]

Gradient descent makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden-layer state must be reset to zero; the hidden-layer states of the subsequent steps are then produced iteratively by the network.
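A compact sketch of the update described in steps 6-1 and 6-2 is given below, assuming PyTorch as the framework; the network call signature, tensor shapes, and sequence handling are illustrative assumptions rather than the patent's implementation.

import torch
import torch.nn.functional as F

def drqn_update(value_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on sampled observation sequences (assumed shapes: [batch, seq_len, ...])."""
    obs, actions, rewards, next_obs = batch               # tensors built from the sampled experience
    q_all, _ = value_net(obs, None)                        # hidden state reset to zero before training
    q_taken = q_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # Q(o_t, a_t, h_t; w)
    with torch.no_grad():                                  # target network provides the "true" Q value
        q_next, _ = target_net(next_obs, None)
        target = rewards + gamma * q_next.max(dim=-1).values        # r + gamma * max_a Q(o_{t+1}, a; w')
    loss = F.mse_loss(q_taken, target)                     # (y - Q)^2, reduced by gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()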

Further, in step 7, every fixed number of time slots the parameters of the value network are copied to form a new target network, i.e., w' ← w.

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

Embodiment

In this embodiment, one episode consists of the cluster-head UAVs and the jammer completing 100 movements together with the corresponding channel and transmit power selections, i.e., completing one communication task; within an episode, each movement, channel selection, and transmit power selection made by the cluster-head UAVs and the jammer is called a time slot.

With reference to Fig. 1, the multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information of this embodiment comprises the following specific steps:

Step 1: Initialize the algorithm parameters.

The algorithm parameters include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, decay factor θ, value network parameters w, and target network parameters w'.

Step 2: Each cluster-head UAV obtains, by interacting with the environment, the channel and transmit power selected by its cluster member UAVs in the previous time slot. The specific steps are as follows:

The communication environment in the invention imitates the real environment as closely as possible. In most real environments, owing to noise and interference, an agent cannot observe all of the state information. Therefore, the UAV anti-jamming decision problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

The system model is modeled as a Dec-POMDP <D, S, A, O, R>, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. Define the current observation of cluster-head UAV n in time slot t+1 as o_n^{t+1} = {f_{n,i}^{t+1}, p_{n,i}^{t+1}} and the joint observation set as O^{t+1} = {o_1^{t+1}, …, o_N^{t+1}}, where f_{n,i}^{t+1} is the channel of cluster member i of cluster-head UAV n in time slot t+1 and p_{n,i}^{t+1} is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t+1. Define the action of cluster-head UAV n in time slot t as a_n^t = {f_{n,i}^t, p_{n,i}^t} and the joint action set as A^t = {a_1^t, …, a_N^t}, where f_{n,i}^t is the channel to which cluster member i of cluster-head UAV n hops in time slot t and p_{n,i}^t is the transmit power selected by cluster-head UAV n for its cluster member i in time slot t. Define the joint state set S as the full environment state information and the joint observation set O as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. Define r_n^t as the reward value of cluster-head UAV n in time slot t.

Step 3: Each cluster-head UAV uses the ε-greedy algorithm to select the channel and transmit power of the current time slot for its cluster members. The specific steps are as follows:

Step 3-1: The observation of each cluster-head UAV serves as the input of its value network, and the Q value corresponding to each action serves as the output of the value network. The Q value of cluster-head UAV n taking action a_n^t under observation o_n^t in time slot t is the expectation of the cumulative future reward starting from time slot t+1:

Q(o_n^t, a_n^t) = E[ Σ_{k=1}^{∞} γ^{k-1} r_n^{t+k} | o_n^t, a_n^t ]

where s_t is the environment state information in time slot t, and P(s_{t+1} | s_t, a_n^t) is the probability that the environment state transitions from s_t to s_{t+1} when cluster-head UAV n takes action a_n^t in time slot t.

Step 3-2: Select the action according to the ε-greedy algorithm, as follows:

a_n^t = argmax_a Q(o_n^t, a, h_n^t; w),  if p > ε

a_n^t = a random action from the action space,  if p ≤ ε

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_n^t is the hidden-layer state of the neural network of cluster-head UAV n in time slot t, and w denotes the value network parameters. In this network the output depends not only on the input but also on the hidden-layer state h_n^t of time slot t; h_n^t stores the past network state of cluster-head UAV n and thus contains historical information. The hidden-layer state is 0 at the beginning of an episode, i.e., it contains no historical information. As the episode proceeds it is updated iteratively: the hidden state produced by the network in time slot t serves as the hidden-layer state of time slot t+1 and thereby affects the output of the value network in time slot t+1, step by step.

This strategy randomly selects an action from the action space with probability ε, which avoids getting trapped in a local optimum. ε is the exploration probability and 1 − ε is the exploitation probability (the probability of selecting the current optimal policy). The larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should take a large value; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should increase accordingly. The probability ε is updated as

ε = max{0.01, θ^x}

where x is the index of the current episode.
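The selection rule of step 3-2 together with the decay schedule ε = max{0.01, θ^x} can be sketched as follows; the recurrent Q-network call q_net(obs, hidden) is an assumed interface used only for illustration.

import random

def exploration_probability(theta, episode_index, floor=0.01):
    """Decaying exploration probability: eps = max(0.01, theta ** x)."""
    return max(floor, theta ** episode_index)

def select_action(q_net, obs, hidden, eps, num_actions):
    """Epsilon-greedy over the recurrent Q-network outputs; the hidden state is rolled forward every slot."""
    q_values, hidden = q_net(obs, hidden)
    if random.random() <= eps:
        action = random.randrange(num_actions)        # explore: random (channel, power) pair
    else:
        action = int(q_values.argmax())               # exploit: argmax_a Q(o, a, h; w)
    return action, hidden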

Step 4: Each cluster-head UAV calculates the total energy cost required for communicating with its cluster members and obtains the corresponding environment reward value. The specific steps are as follows:

Let p_{n,i}^t and p_j^t denote the transmit power of cluster member i of cluster-head UAV n and of jammer j in time slot t, respectively, and let p_{m,k}^t denote the transmit power of cluster member k of cluster-head UAV m in time slot t (when m = n, k ≠ i). G_U and G_J are the antenna gains of the UAVs and of the jammer, d_{n,i}^t (respectively d_{n,j}^t) is the Euclidean distance in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), ρ is the UAV noise figure, σ² is the mean-square value of the environmental noise, h_{n,i}^t (respectively h_{n,j}^t) is the fast fading in time slot t between cluster-head UAV n and its cluster member i (respectively jammer j), B is the channel bandwidth, T is the time required for a single communication transmission, s is the data size of a single communication transmission, and C_{n,i}^t is the maximum average information rate for error-free transmission between cluster-head UAV n and its cluster member i over the additive white Gaussian noise channel in time slot t. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part. The energy cost E_n^t of cluster-head UAV n in time slot t is then defined by the following expressions:

[Formulas of the original filing: the received signal-to-interference-plus-noise ratio of cluster member i, the maximum average information rate C_{n,i}^t, and the per-slot energy cost E_n^t of cluster-head UAV n]

Here, β = 1 when cluster member i of cluster-head UAV n and jammer j occupy the same channel, and β = 0 otherwise; α = 1 when cluster member i of cluster-head UAV n and cluster member k of cluster-head UAV m occupy the same channel, and α = 0 otherwise. The total environment reward value in time slot t is the sum of the rewards of all cluster-head UAVs, R^t = Σ_{n=1}^{N} r_n^t.

The physical meaning of the energy cost is the energy consumed by one data transmission between cluster-head UAV n and all of its cluster member UAVs.

Step 5: Store the observation, action, and reward of the current time slot of each cluster-head UAV, together with the observation of the next time slot, in its own experience pool. The specific steps are as follows:

When cluster-head UAV n selects the frequency-hopping channels and transmit powers of its cluster member UAVs in time slot t according to a_n^t, the environment state jumps from s_t to s_{t+1}; the reward r_n^t obtained by selecting action a_n^t under s_t is computed with the reward formula, and the observation o_n^{t+1} is obtained. The historical experience (o_n^t, a_n^t, r_n^t, o_n^{t+1}) produced in the current time slot t is saved in the experience pool.
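One simple way to organize the experience pool so that whole-episode sequences can later be sampled, as step 6-1 requires, is sketched below; the class and method names are assumptions made for this illustration.

import random
from collections import deque

class SequenceReplayBuffer:
    """Stores (o_t, a_t, r_t, o_{t+1}) tuples grouped by episode so that consecutive
    steps can later be sampled as training sequences."""

    def __init__(self, capacity_episodes=200):            # experience pool size mu
        self.episodes = deque(maxlen=capacity_episodes)
        self.current = []

    def store(self, obs, action, reward, next_obs):
        self.current.append((obs, action, reward, next_obs))

    def end_episode(self):
        if self.current:
            self.episodes.append(self.current)
            self.current = []

    def sample(self, num_sequences, seq_len):
        """Pick random episodes, then a random window of seq_len consecutive steps in each."""
        chosen = random.sample(list(self.episodes), num_sequences)
        batch = []
        for episode in chosen:
            start = random.randrange(max(1, len(episode) - seq_len + 1))
            batch.append(episode[start:start + seq_len])
        return batch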

Step 6: When the experience pool contains enough samples, each agent randomly samples several batches of historical data from its own experience pool to form time sequences, feeds the sequences into its value network, and updates the value network parameters by gradient descent. The specific steps are as follows:

The neural network of cluster-head UAV n consists of three neural units, the first of which is a Long Short-Term Memory (LSTM) unit. The LSTM structure is a special recurrent neural network structure that can use historical information to predict and process sequence data. An LSTM consists of a forget gate, an input gate, and an output gate: the control parameters of the forget gate determine which historical information is discarded, the input gate determines which new information is added, and the output gate determines the data passed from this LSTM unit to the next.

Forget gate:

f_t = sigmoid(W_f · [h_{t−1}, x_t] + b_f)

Input gate:

i_t = sigmoid(W_i · [h_{t−1}, x_t] + b_i)

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

Output gate:

o_t = sigmoid(W_o · [h_{t−1}, x_t] + b_o),  h_t = o_t ⊙ tanh(c_t)

where W_{i,f,c,o} and b_{i,f,c,o} are the input weights and biases of the gates, and x_t is the input of the LSTM unit in time slot t.

The LSTM structure uses the three gates to decide how much of the input data sequence is retained, so that the future can be predicted from historical information. In the anti-jamming scenario of the invention, the agents exchange only reward information, so the action information of the other agents cannot be determined. The LSTM structure uses the experience contained in historical information to help each agent estimate the actions of the other agents, which yields a better anti-jamming strategy for UAV swarm network communication.

The input of the neural network of cluster-head UAV n is the observation o_n^t of time slot t, and the output is the Q value corresponding to each action in time slot t. To enhance the stability of the algorithm, the invention adopts a dual-network structure: w denotes the value network parameters and w' the target network parameters, and in step 7 the target network parameters w' are updated every fixed number of episodes.
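A value network matching this description (an LSTM unit followed by fully connected layers that output one Q value per channel/power action) can be sketched as follows, assuming PyTorch; the layer sizes are arbitrary illustrative choices, not the patent's configuration.

import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent value network: the LSTM retains history, the linear layers map to per-action Q values."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)   # forget/input/output gates inside
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_actions)                 # one Q value per (channel, power)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: [batch, seq_len, obs_dim]; hidden=None means a zero hidden state (episode start)
        out, hidden = self.lstm(obs_seq, hidden)
        q_values = self.fc2(torch.relu(self.fc1(out)))
        return q_values, hidden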

Step 6-1: When training its value network, each agent first randomly selects a batch of historical experience data from the experience pool to form several time sequences, each of which is a complete communication episode; it then randomly selects a time slot within each sequence and takes several consecutive steps as training samples. At time slot t of a sample, the value network computes the action value Q(o_n^t, a_n^t, h_n^t; w) of cluster-head UAV n as the estimated Q value, and the target network computes the action value of cluster-head UAV n at time slot t+1, where o_n^t, a_n^t, and h_n^t are the observation, action, and hidden-layer state of cluster-head UAV n in time slot t. The true value of the action Q-value function is computed as

y_n^t = r_n^t + γ max_a Q(o_n^{t+1}, a, h_n^{t+1}; w')

Step 6-2: Substituting the true Q value and the estimated Q value into the following loss and minimizing it updates the value network parameters w and gradually reduces the difference between the two:

L(w) = E[(y_n^t − Q(o_n^t, a_n^t, h_n^t; w))²]

Gradient descent makes the Q value computed by the value network approach the true Q value. Before each agent trains its neural network, the hidden-layer state must be reset to zero; the hidden-layer states of the subsequent steps are then produced iteratively by the network.

The gradient descent process uses Adaptive Moment Estimation (ADAM). During the update of the value network parameters, each iteration samples only one batch of historical experience data for training; since different data sets give different loss functions, using ADAM reduces the probability of converging to a local optimum. ADAM dynamically adjusts the learning rate of each parameter according to first-moment and second-moment estimates of the gradient of the loss function with respect to that parameter. ADAM is based on gradient descent, but the learning step of each parameter in every iteration lies within a bounded range, so a large gradient does not lead to an overly large learning step and the parameter values remain relatively stable. The ADAM algorithm is implemented as follows:

Assume that at time slot t the first derivative of the objective function with respect to the parameters is g_t. First compute the exponential moving averages:

m_t = λ1 · m_{t−1} + (1 − λ1) · g_t

ω_t = λ2 · ω_{t−1} + (1 − λ2) · g_t²

Then compute the two bias-correction terms:

m̂_t = m_t / (1 − λ1^t)

ω̂_t = ω_t / (1 − λ2^t)

The final gradient update is:

w_t = w_{t−1} − η · m̂_t / (√ω̂_t + τ)

which returns the result parameters related to the error function. In the algorithm, m_t is the exponential moving average of the gradient and ω_t is that of the squared gradient, the parameters λ1 and λ2 control the exponential decay rates of these moving averages, η is the learning step, and τ is a constant, typically 10^-8.
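A direct transcription of these ADAM update equations into Python is given below; the decay rates λ1 = 0.9 and λ2 = 0.999 are common defaults assumed here for illustration, as the text does not specify them, and t is the iteration index starting at 1.

import numpy as np

def adam_step(w, grad, m, omega, t, eta=0.002, lam1=0.9, lam2=0.999, tau=1e-8):
    """One ADAM update: exponential moving averages, bias correction, then the parameter step."""
    m = lam1 * m + (1.0 - lam1) * grad                # m_t = lam1 * m_{t-1} + (1 - lam1) * g_t
    omega = lam2 * omega + (1.0 - lam2) * grad ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - lam1 ** t)                     # bias-corrected first moment
    omega_hat = omega / (1.0 - lam2 ** t)             # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(omega_hat) + tau)  # parameter step bounded by the learning rate eta
    return w, m, omega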

Step 7: Every fixed number of time slots, copy the parameters of the value network to form a new target network;

Step 8: Repeat steps 2 to 7 until 100 data transmissions are completed;

Step 9: Repeat step 8 until the total reward value of the UAV swarm network converges, completing the local training.

The method is implemented in Python. The total number of channels is set to C = 4, the number of UAVs to N = 9, and the number of jammers to J = 1. In time slot t, the transmit power selected by cluster-head UAV n for its cluster member i can be 27, 30, 33, or 36 dBm, and the transmit power selected by jammer j is typically 33 dBm. The UAV swarm is designed so that completing one communication task requires cluster-head UAV n to complete q communications with cluster member UAV i, and the anti-jamming system is simulated under the two different jamming modes. The learning rate is δ = 0.002, the discount factor γ = 0.99, the decay factor θ = 0.998, the experience pool size μ = 200, and the mean-square value of the environmental noise is σ² = −114 dBm. The Rician fading channel gain is modeled with real and imaginary parts as independent, identically distributed Gaussian random processes with mean 0 and variance ξ², so the channel fast fading is written as h = a + bi, where a is the real part and b is the imaginary part.

The learning convergence results under the sweep jamming mode and the Markov jamming mode are shown in Fig. 2 and Fig. 3, respectively. The sweep jammer is set to jam one channel at a time, with a sweep step of 1 MHz. In the Markov jamming mode, four jamming states are set in total; each jamming pattern is generated randomly at the beginning of the simulation, and the transition of the jamming pattern in any time slot follows the state transition matrix below:

[State transition matrix of the Markov jamming mode, given as a figure in the original filing]

Fig. 2 and Fig. 3 show, for the sweep jamming and Markov jamming modes respectively, the convergence of the reward values of the random channel-and-power selection scheme, the DQN-based scheme, and the DRQN-based channel-and-power selection scheme. As the figures show, the DRQN-based scheme reaches a higher converged reward value than the DQN-based scheme, and its convergence is more stable. This is because DRQN contains a long short-term memory network, so each agent can extract hidden information such as the action patterns of the other agents and the dynamics of the environment from historical experience, and the network output is not determined by its own observation alone; the output of DQN, by contrast, is determined entirely by its own observation, so once the environment or the decision patterns of the other agents change, the whole network fluctuates. Comparing Fig. 2 and Fig. 3, the reward convergence of the three channel-and-power selection schemes is broadly similar. Under Markov jamming, the DRQN-based scheme improves performance by 34.6% over the DQN-based scheme and by 54.5% over the random scheme; under sweep jamming, the DRQN-based scheme improves performance by 38.4% over the DQN-based scheme and by 56% over the random scheme.

Fig. 4 and Fig. 5 show, for the sweep jamming and Markov jamming modes respectively, the converged average reward of the three channel-and-power selection schemes versus the number of channels. As the number of channels increases, the converged average reward of all three schemes improves, because more channels reduce the occurrence of co-channel interference and thus lower the energy cost of UAV communication. The converged average reward of the DRQN-based scheme changes less than that of the other schemes, showing that it is not sensitive to this environmental condition.

Fig. 6 and Fig. 7 show, for the sweep jamming and Markov jamming modes respectively, the converged reward of the three channel-and-power selection schemes versus the number of jammers. As the figures show, in both jamming modes the converged average reward of all three schemes tends to decrease as the number of jammers increases and the environment deteriorates, but the converged average reward of the DRQN-based scheme is more stable, decreasing by no more than 10%, so the DRQN algorithm is more robust.

Claims (8)

1. An unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information, characterized by comprising the following specific steps:
Step 1: initialize the algorithm parameters;
Step 2: each cluster head unmanned aerial vehicle obtains, through interaction with the environment, the channel and transmit power selected in the previous time slot by the member unmanned aerial vehicles in its cluster;
Step 3: each cluster head unmanned aerial vehicle adopts the ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster;
Step 4: each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with the members in its cluster and obtains the corresponding environment reward value;
Step 5: the observation, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observation of the next time slot, are stored in its experience pool;
Step 6: when the experience pool holds enough sample data, each cluster head unmanned aerial vehicle randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent;
Step 7: every certain number of time slots, the parameters of the value network are copied to form a new target network;
Step 8: repeat step 2 to step 7 until data transmission has been completed 100 times;
Step 9: repeat step 8 until the total reward value of the unmanned aerial vehicle cluster network converges, completing the local training.
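For orientation, the following is an editor's minimal Python sketch of the control flow described by steps 1 to 9 of claim 1. It is not part of the patent text: the environment, the action selection and the learning update are toy stand-ins, and all names, constants and the convergence test are illustrative assumptions.

# Editor's sketch of the overall control flow of claim 1 (steps 1-9), standard library only.
# The environment, action selection and learning steps are toy stand-ins.
import random
from collections import deque

SLOTS_PER_ROUND = 100          # step 8: one round = 100 data transmissions
COPY_PERIOD = 20               # step 7: assumed target-copy interval (slots)
MAX_ROUNDS = 200               # safety cap for this toy example

def observe(env_state):        # step 2 (toy): last-slot channel/power of cluster members
    return tuple(random.randrange(8) for _ in range(3))

def epsilon_greedy(obs, eps):  # step 3 (toy): the real version queries the value network
    return tuple(random.randrange(8) for _ in range(3))

def energy_reward(obs, act):   # step 4 (toy): the real version computes the energy overhead
    return -random.random()

value_w, target_w = {"w": 0.0}, {"w": 0.0}       # step 1: initialise parameters
pool = deque(maxlen=10_000)                      # step 5: experience pool
round_totals = deque(maxlen=20)

for _ in range(MAX_ROUNDS):                      # step 9: repeat rounds until convergence
    obs, total = observe(None), 0.0
    for t in range(SLOTS_PER_ROUND):             # step 8: steps 2-7 repeated each slot
        act = epsilon_greedy(obs, eps=0.9)       # step 3
        reward = energy_reward(obs, act)         # step 4
        next_obs = observe(None)                 # step 2 (observation of the next slot)
        pool.append((obs, act, reward, next_obs))            # step 5
        if len(pool) >= 1_000:
            pass                                 # step 6: sample sequences, gradient step
        if t % COPY_PERIOD == 0:
            target_w = dict(value_w)             # step 7: copy value-network parameters
        obs, total = next_obs, total + reward
    round_totals.append(total)
    if len(round_totals) == round_totals.maxlen and \
       max(round_totals) - min(round_totals) < 1e-3:
        break                                    # step 9: total reward has converged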
2. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein the algorithm parameters in step 1 include the learning rate δ, greedy factor ε, discount factor γ, experience pool size μ, attenuation factor θ, value network parameters w and target network parameters w'.
3. The unmanned aerial vehicle cluster multi-agent multi-domain anti-jamming method based on partially observable information of claim 1, wherein in step 2, each cluster head unmanned aerial vehicle obtains, by interacting with the environment, the channel and transmit power selected in the previous time slot by the member unmanned aerial vehicles in its cluster, specifically as follows:
The communication environment in the invention imitates the real environment as closely as possible; in most real environments, an agent cannot observe all of the state information because of noise and interference. The UAV anti-jamming decision problem is therefore modeled as a decentralized partially observable Markov decision process (Dec-POMDP).
The system model is formulated as a Dec-POMDP ⟨D, S, A, O, R⟩, where D is the set of agents, S is the joint state set, A is the joint action set, O is the joint observation set, and R is the reward function. Define D = {1, …, N} as the set of N agents. The observation of cluster head unmanned aerial vehicle n in time slot t+1 is defined as o_{t+1}^n = {(f_{t+1}^{n,i}, p_{t+1}^{n,i})}_i, and the joint observation is o_{t+1} = (o_{t+1}^1, …, o_{t+1}^N), where f_{t+1}^{n,i} is the channel of cluster member i of cluster head unmanned aerial vehicle n in time slot t+1, and p_{t+1}^{n,i} is the transmit power selected by cluster head unmanned aerial vehicle n for cluster member i in time slot t+1. The action of cluster head unmanned aerial vehicle n in time slot t is defined as a_t^n = {(f_t^{n,i}, p_t^{n,i})}_i, and the joint action is a_t = (a_t^1, …, a_t^N), where f_t^{n,i} is the channel to which cluster member i of cluster head unmanned aerial vehicle n frequency-hops in time slot t, and p_t^{n,i} is the transmit power selected by cluster head unmanned aerial vehicle n for cluster member i in time slot t. The joint state set S is defined as all environment state information, and the joint observation set O is defined as the partial information that the N agents can observe, so the joint observation set O can be regarded as a subset of the joint state set S. r_t^n is defined as the reward value of cluster head unmanned aerial vehicle n in time slot t.
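As a concrete illustration of the Dec-POMDP elements in claim 3, the short Python sketch below encodes one cluster head's observation o_t^n and action a_t^n as per-member (channel, power) pairs. The class and field names are editorial assumptions for illustration only, not part of the patent.

# Editor's sketch of the Dec-POMDP elements in claim 3; names are assumptions.
from dataclasses import dataclass
from typing import Dict, Tuple

ChannelPower = Tuple[int, float]   # (f^{n,i}, p^{n,i}): channel index, transmit power

@dataclass(frozen=True)
class ClusterHeadObservation:
    """o_t^n: per-member channel and power seen by cluster head n in slot t."""
    slot: int
    member_choices: Dict[int, ChannelPower]   # member id -> (channel, power)

@dataclass(frozen=True)
class ClusterHeadAction:
    """a_t^n: per-member channel and power chosen by cluster head n for slot t."""
    slot: int
    member_choices: Dict[int, ChannelPower]

# The joint observation and joint action are simply tuples over the N agents:
# o_t = (o_t^1, ..., o_t^N), a_t = (a_t^1, ..., a_t^N); the reward of agent n
# in slot t is a scalar r_t^n.
obs = ClusterHeadObservation(slot=5, member_choices={0: (3, 0.5), 1: (7, 1.0)})
act = ClusterHeadAction(slot=5, member_choices={0: (2, 0.5), 1: (4, 0.1)})
print(obs, act)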
4. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 3, each cluster head unmanned aerial vehicle adopts the ε-greedy algorithm to select the channel and transmit power of the current time slot for the members in its cluster, specifically as follows:
Step 3-1: the observation of each cluster head unmanned aerial vehicle is the input of the value network, and the Q value of each action is the output of the value network, where the Q value Q(o_t^n, a_t^n) of cluster head unmanned aerial vehicle n executing action a_t^n under observation o_t^n in time slot t is the expectation of the cumulative future reward value from the beginning of time slot t+1:

Q(o_t^n, a_t^n) = E[ Σ_{k=1}^{∞} γ^{k−1} r_{t+k}^n | o_t^n, a_t^n ],

where s_t is the environment state information of time slot t, and P(s_{t+1} | s_t, a_t^n) is the probability that the environment state transitions from s_t to s_{t+1} when cluster head unmanned aerial vehicle n takes action a_t^n in time slot t.
Step 3-2: the action is selected according to the ε-greedy algorithm in the following way:

a_t^n = argmax_a Q(o_t^n, a, h_t^n; w) when p > ε, and a_t^n is selected uniformly at random from the action space when p ≤ ε,

where p is a random number between 0 and 1, ε (0 < ε < 1) is the exploration probability, h_t^n is the hidden layer state of the neural network of cluster head unmanned aerial vehicle n in time slot t, and w is the value network parameter. In this network the output is related not only to the input but also to the hidden layer state h_t^n of time slot t; h_t^n stores the past network state of cluster head unmanned aerial vehicle n and therefore contains historical information. The hidden layer state is 0 at the beginning of a round, i.e. it contains no historical information. As the round proceeds it is updated iteratively: the hidden state produced by the network in time slot t is used as the hidden layer state of time slot t+1, thereby influencing the output of the value network in time slot t+1, and so on step by step.
The strategy selects an action at random from the action space with probability ε, which avoids falling into a local optimum. ε is the exploration probability and 1−ε is the exploitation probability (the probability of selecting the current best strategy); the larger ε is, the smaller the exploitation probability. In the initial stage of the algorithm, because the state-action space is large, the exploration probability should be large; as the number of iterations increases and the policy gradually approaches the optimum, the exploitation probability should be increased accordingly.
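The ε-greedy selection of step 3-2, driven by a recurrent Q-network whose output also depends on the hidden layer state, might look roughly like the following Python (PyTorch) sketch. The network architecture, dimensions and all names are assumptions made for illustration; the patent does not specify an implementation.

# Editor's sketch of epsilon-greedy selection with a recurrent Q-network (step 3-2).
import random
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); hidden: LSTM state (h, c) or None (zeros)
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden            # Q values per step, new hidden state

def select_action(net, obs, hidden, epsilon):
    """epsilon-greedy: explore with probability epsilon, otherwise pick argmax_a Q."""
    obs_seq = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
    with torch.no_grad():
        q_values, hidden = net(obs_seq, hidden)  # output depends on input AND h_t^n
    if random.random() <= epsilon:               # p <= epsilon: random exploration
        action = random.randrange(q_values.shape[-1])
    else:                                        # p > epsilon: exploit current best action
        action = int(q_values[0, -1].argmax())
    return action, hidden

net = RecurrentQNet(obs_dim=4, num_actions=6)
hidden = None                                    # zero hidden state at the start of a round
a, hidden = select_action(net, [0.1, 0.3, 0.0, 1.0], hidden, epsilon=0.9)
print("chosen action:", a)

Decaying ε over training makes exploration dominate early and exploitation dominate later, matching the schedule described in the claim.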
5. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information as claimed in claim 1, wherein in step 4, each cluster head unmanned aerial vehicle calculates the total energy overhead required for the communication process with the members in its cluster and obtains the corresponding environment reward value, specifically as follows:
Denote by p_t^{n,i} and p_t^j the transmit power in time slot t of cluster member i of cluster head unmanned aerial vehicle n and of jammer j, respectively, and by p_t^{m,k} the transmit power of cluster member k of cluster head unmanned aerial vehicle m in time slot t (with k ≠ i when m = n). G_U and G_J are the antenna gains of the unmanned aerial vehicles and of the jammer, respectively; d_t^{n,i} (d_t^{n,j}) is the Euclidean distance in time slot t between cluster head unmanned aerial vehicle n and its cluster member i (jammer j); ρ is the noise figure of the unmanned aerial vehicle; σ² is the mean-square value of the ambient noise; g_t^{n,i} (g_t^{n,j}) is the fast fading in time slot t between cluster head unmanned aerial vehicle n and its cluster member i (jammer j); B is the channel bandwidth; t is the time required for a single communication transmission; s is the data size of a single communication transmission; and C_t^{n,i} is the maximum average information rate at which cluster head unmanned aerial vehicle n and its cluster member i can transmit without error over an additive white Gaussian noise channel. The Rician fading channel gain is modeled with its real part and its imaginary part as independent and identically distributed Gaussian random processes with mean 0 and variance ξ², so the fast fading of the channel is written as g = a + bj, where a is the real part and b the imaginary part. The energy overhead of cluster head unmanned aerial vehicle n in time slot t is denoted E_t^n [the explicit energy-overhead expressions are given as formula images in the original filing]. When cluster member i of cluster head unmanned aerial vehicle n is in the same channel as jammer j, β = 1, otherwise β = 0; when cluster member i of cluster head unmanned aerial vehicle n is in the same channel as cluster member k of cluster head unmanned aerial vehicle m, α = 1, otherwise α = 0. The total environment reward value of time slot t is defined from these energy overheads [formula image in the original filing].
The practical physical meaning of the energy overhead is the energy consumed by cluster head unmanned aerial vehicle n and all of its cluster member unmanned aerial vehicles to complete one data transmission.
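Since the energy-overhead expressions of claim 5 appear only as formula images in the filing, the Python sketch below uses a standard SINR and Shannon-capacity model as a stand-in to show the kind of computation involved: the received signal is attenuated by distance and fast fading, co-channel and jammer power raise the interference term, the achievable rate fixes the transmission time, and energy is power times time. All constants and the path-loss form are assumptions, not the patent's formulas.

# Editor's sketch of an energy-overhead computation in the spirit of claim 5.
# The SINR, rate and energy expressions are standard textbook forms assumed for
# illustration, not the patent's exact formulas.
import math

def transmission_energy(p_tx,            # transmit power of cluster member i (W)
                        g_fading,        # |fast fading|^2 between head n and member i
                        d,               # distance between head n and member i (m)
                        interference,    # received co-channel + jammer power (W)
                        G_u=1.0,         # UAV antenna gain
                        rho=1.0,         # noise figure
                        sigma2=1e-9,     # ambient noise mean-square value (W)
                        B=1e6,           # channel bandwidth (Hz)
                        s_bits=1e5,      # data size of one transmission (bits)
                        path_loss_exp=2.0):
    """Energy spent by one cluster member to deliver s_bits to its cluster head."""
    signal = p_tx * G_u * g_fading / (d ** path_loss_exp)
    sinr = signal / (rho * sigma2 + interference)
    rate = B * math.log2(1.0 + sinr)     # maximum error-free rate over the AWGN channel
    t_needed = s_bits / rate             # time needed to send the packet
    return p_tx * t_needed               # energy overhead of this link

# Interference switched on by the indicators beta (jammer) and alpha (other clusters):
jammer_rx = 1.0 * 2.0 * 0.8 / (120.0 ** 2)     # beta = 1: jammer on the same channel
energy = transmission_energy(p_tx=0.5, g_fading=0.9, d=80.0, interference=jammer_rx)
print(f"energy overhead of one link: {energy:.3e} J")

A cluster head's total overhead would sum such link energies over its members, and a reward that decreases as this energy grows (for example its negative) is consistent with the stated goal of minimizing the energy consumed per data transmission.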
6. The unmanned aerial vehicle cluster multi-agent multi-domain anti-jamming method based on partially observable information as claimed in claim 1, wherein in step 5, the observation, action and reward of the current time slot of each cluster head unmanned aerial vehicle, together with the observation of the next time slot, are stored in its experience pool, specifically as follows:
After cluster head unmanned aerial vehicle n selects in time slot t, according to a_t^n, the frequency-hopping channel and transmit power of its cluster member unmanned aerial vehicles, the environment state jumps from s_t to s_{t+1}; the reward r_t^n obtained by selecting action a_t^n in state s_t is calculated by the reward value formula, and the observation o_{t+1}^n is obtained. The tuple (o_t^n, a_t^n, r_t^n, o_{t+1}^n) generated in the current time slot t is stored as historical experience data in the experience pool.
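A minimal sketch of the per-agent experience pool of claim 6 is given below: each slot appends a tuple (o_t, a_t, r_t, o_{t+1}), and training later draws short consecutive sequences as required by the recurrent value network. The capacity, sequence length and names are assumptions for illustration.

# Editor's sketch of the experience pool in claim 6; names and sizes are assumptions.
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def store(self, obs, action, reward, next_obs):
        self.data.append((obs, action, reward, next_obs))

    def sample_sequence(self, seq_len=8):
        """Pick a random start index and return seq_len consecutive transitions."""
        start = random.randrange(len(self.data) - seq_len + 1)
        return [self.data[start + k] for k in range(seq_len)]

pool = ExperiencePool()
for t in range(100):                       # fill with toy transitions
    pool.store(obs=t, action=t % 4, reward=-0.1 * t, next_obs=t + 1)
batch = [pool.sample_sequence(seq_len=8) for _ in range(4)]   # 4 sampled sequences
print(len(batch), len(batch[0]))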
7. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method based on partially observable information according to claim 1, wherein in step 6, when the sample data in the experience pools is sufficient, each agent randomly samples several batches of historical information data from its experience pool to form time sequences, inputs them into its value network, and updates the value network parameters by gradient descent, specifically as follows:
The observation o_t^n of time slot t is the input of the neural network of cluster head unmanned aerial vehicle n, and the Q values corresponding to each action in time slot t are the output. To enhance the stability of the algorithm, the invention adopts a double-network structure: w denotes the value network parameters and w' the target network parameters, and the target network parameters w' are updated once every certain number of rounds in step 7.
Step 6-1: when each agent trains its value network, a batch of historical experience data is randomly selected from the experience pool to form several time sequences, each of which is a complete communication round; a time slot is randomly selected from each sequence and several consecutive steps are taken as training samples. At sample time slot t, the value network computes the action-value function Q(o_t^n, a_t^n, h_t^n; w) of cluster head unmanned aerial vehicle n in time slot t as the estimated Q value, and the target network computes the action-value function max_a Q(o_{t+1}^n, a, h_{t+1}^n; w') of cluster head unmanned aerial vehicle n in time slot t+1, where o_t^n, a_t^n and h_t^n are the observation, action and hidden layer state of cluster head unmanned aerial vehicle n in time slot t. The true value of the action-value function is calculated as:

y_t^n = r_t^n + γ max_a Q(o_{t+1}^n, a, h_{t+1}^n; w').
Step 6-2: the true Q value and the estimated Q value are substituted into the loss below, and the value network parameter w is updated by gradient descent with learning rate δ so as to gradually reduce

L(w) = E[ (y_t^n − Q(o_t^n, a_t^n, h_t^n; w))² ],

i.e. the Q value calculated by the value network is brought closer to the true Q value by the gradient descent method. Before each agent trains the neural network, the hidden layer state must be set to zero; the hidden layer states h_t^n of the subsequent consecutive steps are generated by the network iterations.
8. The unmanned aerial vehicle cluster multi-agent multi-domain anti-interference method according to claim 1, wherein in step 7, every certain number of time slots, the parameters of the value network are copied to form a new target network, i.e. w' ← w.
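Finally, the double-network update of claims 7 and 8 (steps 6 and 7) could be sketched as follows in Python (PyTorch): the value network produces the estimated Q values for a sampled sequence, the target network produces the bootstrapped targets, the squared error is reduced by gradient descent, and the value-network parameters are periodically copied into the target network. Network sizes, the optimizer and all names are illustrative assumptions, not the patent's implementation.

# Editor's sketch of the DRQN double-network update (claims 7 and 8).
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim=4, num_actions=6, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs_seq, hidden=None):      # hidden=None -> zeroed state at round start
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden

value_net, target_net = RecurrentQNet(), RecurrentQNet()
target_net.load_state_dict(value_net.state_dict())           # claim 8: w' <- w
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.9

# One sampled training batch of sequences (batch=2, seq_len=8) with toy data.
obs      = torch.randn(2, 8, 4)
actions  = torch.randint(0, 6, (2, 8))
rewards  = torch.randn(2, 8)
next_obs = torch.randn(2, 8, 4)

q_all, _ = value_net(obs)                                     # estimated Q values
q_taken  = q_all.gather(dim=2, index=actions.unsqueeze(-1)).squeeze(-1)
with torch.no_grad():
    q_next_all, _ = target_net(next_obs)
    y = rewards + gamma * q_next_all.max(dim=2).values        # "true" Q target (step 6-1)

loss = nn.functional.mse_loss(q_taken, y)                     # step 6-2 loss
optimizer.zero_grad()
loss.backward()                                               # gradient descent on w
optimizer.step()

# claim 8: every fixed number of slots, copy value-network parameters into the target
target_net.load_state_dict(value_net.state_dict())
print("loss:", float(loss))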
CN202211261459.8A 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information Active CN115454141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211261459.8A CN115454141B (en) 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information

Publications (2)

Publication Number Publication Date
CN115454141A true CN115454141A (en) 2022-12-09
CN115454141B CN115454141B (en) 2025-04-04

Family

ID=84311660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211261459.8A Active CN115454141B (en) 2022-10-14 2022-10-14 Multi-agent and multi-domain anti-interference method for UAV swarm based on partially observable information

Country Status (1)

Country Link
CN (1) CN115454141B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116131963A (en) * 2023-02-02 2023-05-16 广东工业大学 Noise Equalization Method Based on LSTM Neural Network for Optical Fiber Link Multipath Interference
CN116432690A (en) * 2023-06-15 2023-07-14 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN118870445A (en) * 2024-07-05 2024-10-29 长春理工大学 A joint optimization method for relay communication rate of clustered UAVs under interference conditions
CN119336049A (en) * 2024-12-23 2025-01-21 成都航空职业技术学院 A method for suppressing dynamic wind disturbance of unmanned aerial vehicles
CN119536353A (en) * 2025-01-20 2025-02-28 西安电子科技大学 A real-time decision-making method for intelligent agent paths facing dynamic threats and local perception

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179777A (en) * 2017-06-03 2017-09-19 复旦大学 Multiple agent cluster Synergistic method and multiple no-manned plane cluster cooperative system
US20190011531A1 (en) * 2016-03-11 2019-01-10 Goertek Inc. Following method and device for unmanned aerial vehicle and wearable device
CN113382381A (en) * 2021-05-30 2021-09-10 南京理工大学 Unmanned aerial vehicle cluster network intelligent frequency hopping method based on Bayesian Q learning
US20210373552A1 (en) * 2018-11-06 2021-12-02 Battelle Energy Alliance, Llc Systems, devices, and methods for millimeter wave communication for unmanned aerial vehicles
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Multi-UAV distributed intelligent task assignment method for dynamic environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李孜恒; 孟超: "Wireless Network Resource Allocation Algorithm Based on Deep Reinforcement Learning" (基于深度强化学习的无线网络资源分配算法), 通信技术 (Communications Technology), no. 08, 10 August 2020 (2020-08-10) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116131963A (en) * 2023-02-02 2023-05-16 广东工业大学 Noise Equalization Method Based on LSTM Neural Network for Optical Fiber Link Multipath Interference
CN116432690A (en) * 2023-06-15 2023-07-14 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN116432690B (en) * 2023-06-15 2023-08-18 中国人民解放军国防科技大学 Markov-based intelligent decision method, device, equipment and storage medium
CN117675054A (en) * 2024-02-02 2024-03-08 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN117675054B (en) * 2024-02-02 2024-04-23 中国电子科技集团公司第十研究所 Multi-domain combined anti-interference intelligent decision method and system
CN118870445A (en) * 2024-07-05 2024-10-29 长春理工大学 A joint optimization method for relay communication rate of clustered UAVs under interference conditions
CN119336049A (en) * 2024-12-23 2025-01-21 成都航空职业技术学院 A method for suppressing dynamic wind disturbance of unmanned aerial vehicles
CN119536353A (en) * 2025-01-20 2025-02-28 西安电子科技大学 A real-time decision-making method for intelligent agent paths facing dynamic threats and local perception

Also Published As

Publication number Publication date
CN115454141B (en) 2025-04-04

Similar Documents

Publication Publication Date Title
CN115454141A (en) A multi-agent and multi-domain anti-jamming method for UAV swarms based on partially observable information
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN110488861A (en) Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN111491358B (en) Adaptive modulation and power control system based on energy acquisition and optimization method
CN114281103B (en) A zero-interaction communication collaborative search method for aircraft clusters
CN116720674A (en) Deep reinforcement learning short-term stochastic optimization scheduling method for wind-solar-cascade reservoirs
CN112491818A (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN116866895A (en) An intelligent confrontation method based on neural virtual self-game
CN114298166A (en) A method and system for predicting spectrum availability based on wireless communication network
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN116956998A (en) Radar interference decision and parameter optimization method and device based on hierarchical reinforcement learning
CN106953801B (en) Random shortest path realization method based on hierarchical learning automaton
CN118945686A (en) A multi-user energy harvesting resource allocation method based on UAV interference assistance
CN119439709A (en) Parallel control method and device for power transmission line construction equipment based on Bi-LSTM and DDPG algorithm
CN116755046B (en) Multifunctional radar interference decision-making method based on imperfect expert strategy
CN118536380A (en) A method and system for predicting air conditioning energy consumption based on error compensation of multiple prediction models
Tan et al. A hybrid architecture of cognitive decision engine based on particle swarm optimization algorithms and case database
Mealing et al. Opponent modelling by sequence prediction and lookahead in two-player games
CN116073856B (en) Intelligent frequency hopping anti-interference decision method based on depth deterministic strategy
WO2024134260A1 (en) Penetration testing method for cyber-physical systems
Janiar et al. A transfer learning approach based on integrated feature extractor for anti-jamming in wireless networks
Xiong et al. Few-shot learning in wireless networks: a meta-learning model-enabled scheme
Li et al. Anti-Jamming System Based on Deep Reinforcement Learning in Dynamic Environment
CN119918572B (en) Operation optimizing algorithm development system under intelligent game

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant