CN116896777A

CN116896777A - Unmanned aerial vehicle group general sense one-body energy optimization method based on reinforcement learning

Info

Publication number: CN116896777A
Application number: CN202310843486.4A
Authority: CN
Inventors: 刘荣科; 祝倩; 刘启瑞
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2023-10-17

Abstract

The invention provides an unmanned aerial vehicle group general sense integrated energy consumption optimization method based on reinforcement learning, which comprises the following steps: firstly, utilizing the positioning perception performance of the unmanned aerial vehicle cluster to endow an initial value to a Q value function network reflecting the relation between the state and the action; judging the current state; selecting a certain working point as an initial state; thirdly, selecting a current action based on an epsilon-greedy strategy according to the current state; introducing the perception performance of the unmanned aerial vehicle and the energy consumption of the communication task into the design of a Reward rewarding function, and acquiring an actual environment rewarding value of the previous action and the next state; fifthly, updating the Q value function network by using the actual environment rewarding value of the previous step; and sixthly, setting the new state as the current state, and repeating the steps of three to six until the value in the Q value function network reaches convergence. The invention can solve the problem of larger energy consumption in the ISAC network based on the unmanned aerial vehicle group in the prior art, simultaneously ensure the real-time service performance of communication and positioning of the terminal ground user, and effectively improve the service life of the network.

Description

Synaesthesia-integrated energy consumption optimization method for UAV swarm based on reinforcement learning

技术领域Technical field

本发明属于无人机通信感知一体化技术领域，具体涉及一种基于强化学习的无人机群通感一体能耗优化方法。The invention belongs to the technical field of integrated communication and perception of unmanned aerial vehicles, and specifically relates to a synaesthesia integrated energy consumption optimization method for a group of unmanned aerial vehicles based on reinforcement learning.

背景技术Background technique

下一代无线网络(B5G/6G)推动着无线技术的不断创新与发展，同时为许多新兴应用提供关键助力，如互联智能、互联汽车和智慧城市等，这些应用需要高质量的无线通信连接和高精度的感知能力。因此可以预见，在B5G/6G网络中需要同时具备通信和感知能力，从而提升频谱资源利用率。其中，通信感知一体化(ISAC)技术被广泛认为是实现这一目标的有效方案之一。为满足用户对于通感服务全域协同覆盖的现实需求，未来无线网络需要更多地部署于密集城市及山区等复杂地形与电磁环境中。然而，在上述复杂环境中，以蜂窝网络和全球导航卫星系统(GNSS)为代表的现有网络技术均存在一定缺陷，使其网络性能无法满足用户高质量通信和感知需求。Next-generation wireless networks (B5G/6G) promote the continuous innovation and development of wireless technology and provide key support for many emerging applications, such as connected intelligence, connected cars, and smart cities. These applications require high-quality wireless communication connections and high-speed Perception of precision. Therefore, it is foreseeable that B5G/6G networks need to have both communication and sensing capabilities to improve spectrum resource utilization. Among them, Integrated Awareness Communication (ISAC) technology is widely considered to be one of the effective solutions to achieve this goal. In order to meet users' realistic needs for full-area collaborative coverage of synesthesia services, future wireless networks need to be deployed more in complex terrain and electromagnetic environments such as dense cities and mountainous areas. However, in the above-mentioned complex environment, existing network technologies represented by cellular networks and Global Navigation Satellite Systems (GNSS) have certain flaws, making their network performance unable to meet users' high-quality communication and perception needs.

尤其在恶劣的环境下，例如高楼林立或偏远山区，传统的卫星定位服务可能会由于网络边缘化等问题导致地面用户通信和感知的服务质量不佳。在这种情况下，无人机群凭借其高机动性和部署灵活等优势，通过结合相应通信和感知技术，有望弥补这一不足。因此，近年来，无人机集群辅助技术受到学术界和工业界的广泛关注。然而，目前关于无人机的供电方式通常为基于电池充电，即使少数型号无人机可以通过太阳能或其他方案进行能量补充，其能量供应也是相对有限的，因此无人机本身存在资源受限的不足。这种情况下，如何合理的进行功率、服务用户等资源的分配，从而降低空基平台的能量消耗显得愈发重要。因此，如何在完成通信和感知任务的基础上，通过优化无人机的功率分配和服务策略来最小化无人机的能量消耗是本发明的核心研究问题。该问题的研究有望在保障系统通信和感知性能的前提下，有效降低空基集群系统能耗，从而提升无人机集群通感一体网络的服务寿命。Especially in harsh environments, such as high-rise buildings or remote mountainous areas, traditional satellite positioning services may cause poor communication and perceived service quality for ground users due to problems such as network marginalization. In this case, the drone swarm, with its advantages of high mobility and flexible deployment, is expected to make up for this shortcoming by combining corresponding communication and sensing technologies. Therefore, UAV swarm-assisted technology has received widespread attention from academia and industry in recent years. However, the current power supply method for drones is usually based on battery charging. Even if a few models of drones can be replenished with energy through solar energy or other solutions, their energy supply is relatively limited, so the drones themselves are resource-constrained. insufficient. In this case, how to reasonably allocate resources such as power and service users to reduce the energy consumption of space-based platforms becomes increasingly important. Therefore, how to minimize the energy consumption of UAVs by optimizing the power allocation and service strategies of UAVs on the basis of completing communication and sensing tasks is the core research issue of this invention. Research on this issue is expected to effectively reduce the energy consumption of space-based cluster systems while ensuring system communication and perception performance, thereby improving the service life of the synesthetic integrated network of UAV clusters.

发明内容Contents of the invention

本发明目的在于提供一种基于强化学习的无人机群通感一体能耗优化方法，以解决现有技术在基于无人机群的ISAC网络中能耗较大的问题，同时保障终端地面用户的通信和定位的实时服务性能，有效改善网络服务寿命。The purpose of the present invention is to provide an integrated energy consumption optimization method for UAV swarms based on reinforcement learning to solve the existing problem of large energy consumption in the ISAC network based on UAV swarms, while ensuring the communication of terminal ground users. and positioning real-time service performance, effectively improving network service life.

本发明所提供的一种基于强化学习的无人机群通感一体能耗优化方法，所建立的系统环境如下：在一个单基站覆盖蜂窝通感网络中，基站坐标为(x₀,y₀,z₀)，无人机的集合为地面用户集合为/>无人机作为智能体进行一次完整的决策过程称为决策周期，在本发明中，无人机完成定位感知和通信任务卸载的一次完整过程即为一个决策周期，假设每个时刻t为一个决策周期，所有决策周期的状态集合称为状态空间集S，表示为：S＝{s₁,s₂,......,s_t,......}。所有决策周期的动作的集合称为动作空间集为A，表示为：A＝{a₁,a₂,......,a_t,......}。在时刻t，无人机-m和待定位用户-l的位置坐标分别为u_m(t)＝(x_m(t),y_m(t))^T和v_l＝(x_l,y_l)^T，且假设无人机平台-m的高度为一固定值H_m。集群中的所有无人机均在预先设定好的符合一定阈值范围的飞行轨迹上飞行，同时为地面用户提供通信和位置感知服务，且各无人机均可被地面待服务用户终端所识别。The invention provides a method for optimizing the energy consumption of synaesthesia for UAV groups based on reinforcement learning. The established system environment is as follows: In a single base station covering cellular synaesthesia network, the base station coordinates are (x ₀ , y ₀ , z ₀ ), the set of UAVs is The set of ground users is/> A complete decision-making process for a drone as an intelligent agent is called a decision-making cycle. In the present invention, a complete process for the drone to complete positioning sensing and communication task offloading is a decision-making cycle. It is assumed that each time t is a decision-making cycle. cycle, the state set of all decision-making cycles is called the state space set S, expressed as: S={s ₁ , s ₂ ,..., s _t ,...}. The set of actions in all decision-making cycles is called the action space set A, expressed as: A={a ₁ , a ₂ ,..., a _t ,...}. At time t, the position coordinates of drone-m and user-l to be located are u _m (t) = (x _m (t), y _m (t)) ^T and v _l = (x _l , y _l respectively) ) ^T , and assume that the height of the UAV platform-m is a fixed value H _m . All drones in the cluster fly on preset flight trajectories that meet a certain threshold range, while providing communication and location awareness services to ground users, and each drone can be identified by the ground user terminal to be served. .

第一步，利用无人机集群的定位感知性能，为反映无人机状态与动作关系的Q值函数网络赋初值。The first step is to use the positioning perception performance of the UAV cluster to assign an initial value to the Q-value function network that reflects the relationship between UAV status and action.

第二步，从环境中判断无人机的当前状态。可选定某一工作点作为无人机的初始状态。The second step is to determine the current status of the drone from the environment. A certain working point can be selected as the initial state of the drone.

第三步，根据无人机当前状态，基于ε-greedy策略来选择当前动作。The third step is to select the current action based on the ε-greedy strategy based on the current status of the drone.

第四步，将无人机感知性能和通信任务能量消耗引入Reward奖励函数的设计中，并获取上一步动作的实际环境奖励值，以及无人机下一个状态。The fourth step is to introduce the drone's perception performance and communication task energy consumption into the design of the Reward reward function, and obtain the actual environmental reward value of the previous action and the next state of the drone.

第五步，利用上一步动作的实际环境奖励值，更新Q值函数网络。The fifth step is to update the Q-value function network using the actual environmental reward value of the previous action.

第六步，将新状态设置为当前状态，重复第三步～第六步，直到Q值函数网络中的值达到收敛。Step 6: Set the new state to the current state and repeat steps 3 to 6 until the values in the Q-value function network converge.

上述步骤中，主要涉及以下关键技术要点：The above steps mainly involve the following key technical points:

(1)利用无人机集群定位感知性能初始化Q值函数网络(1) Initialize the Q-value function network using the positioning sensing performance of the UAV cluster

无人机和地面待服务用户之间的几何构型是影响其定位感知性能的前提，可用位置(三维)精度因子PDOP来表征该性能。无人机在不同的状态下，选择不同的动作，从而无人机与用户间所呈现出的几何构型不同，其为待服务用户提供感知服务的能力则不同。假设在时刻t，为用户-l提供定位感知服务的无人机基站子集为S_k(t)，且假设集合中无人机个数为M₀个。则计算该无人机子集的位置(三维)精度因子值，可表示为：The geometric configuration between the UAV and the user on the ground to be served is the premise that affects its positioning perception performance. The position (three-dimensional) precision factor PDOP can be used to characterize this performance. The drone selects different actions in different states, so the geometric configuration between the drone and the user is different, and its ability to provide perceptual services to the user to be served is different. Assume that at time t, the subset of UAV base stations that provide positioning awareness services for user-l is S _k (t), and it is assumed that the number of UAVs in the set is M ₀ . Then calculate the position (three-dimensional) accuracy factor of the UAV subset value, which can be expressed as:

上式中，是无人机基站子集S_k(t)的定位感知观测方程的雅可比矩阵，则其可进一步表示为下式：In the above formula, is the Jacobian matrix of the positioning sensing observation equation of the UAV base station subset S _k (t), then it can be further expressed as the following formula:

上式中，u₁(t)＝(x₁(t),y₁(t))^T,(1∈s_k(t))、分别表示无人机基站子集S_k(t)中的无人机-1、无人机-m₀、无人机-M₀的坐标；同理，H₁、/>分别为无人机基站子集S_k(t)中的无人机-1、无人机-m₀、无人机-M₀的固定高度值，v_l＝(x_l,y_l)^T则为待定位用户-l的位置坐标。In the above formula, u ₁ (t) = (x ₁ (t), y ₁ (t)) ^T , (1∈s _k (t)), Respectively represent the coordinates of UAV-1, UAV-m ₀ and UAV-M ₀ in the UAV base station subset S _k (t); similarly, H ₁ , /> are respectively the fixed height values of UAV-1, UAV-m ₀ and UAV-M ₀ in the UAV base station subset S _k (t), v _l = (x _l ,y _l ) ^T Then it is the location coordinate of user-l to be located.

利用无人机集群定位感知性能初始化Q值函数网络方法为：当无人机为正常工作状态时，在时刻t选择无人机基站子集S_k(t)为用户提供通感服务时，其对应的Q-table网格中的值为其余网格位置赋值为零。利用无人机集群的定位感知性能，为Q值函数网络赋初值，从而为无人机智能体的强化学习网络提供先验信息，进一步利于智能体的学习。The method of initializing the Q-value function network using the positioning perception performance of the UAV cluster is as follows: when the UAV is in the normal working state, when a subset of UAV base stations S _k (t) is selected to provide synaesthesia services to users at time t, its The value in the corresponding Q-table grid is The remaining grid positions are assigned zero values. The positioning perception performance of the UAV cluster is used to assign an initial value to the Q-value function network, thereby providing prior information for the reinforcement learning network of the UAV agent, which further facilitates the learning of the agent.

(2)通信任务上传的能量消耗(2) Energy consumption of communication task upload

在时刻t，第m架无人机到第l个地面用户的LoS信道功率增益可表示为：At time t, the LoS channel power gain from the m-th drone to the l-th ground user can be expressed as:

其中，α为信道的路损系数，β₀为单位(每米)信道增益，d_m,l(t)为第m架无人机到第l个地面用户的距离，且有进一步则有，该链路在该时刻的信噪比SNR_m,l(t)可表示为：Among them, α is the path loss coefficient of the channel, β ₀ is the unit (per meter) channel gain, d _m,l (t) is the distance from the m-th drone to the l-th ground user, and there is Furthermore, the signal-to-noise ratio SNR _m,l (t) of the link at that moment can be expressed as:

其中，P为空基平台的恒定发射功率，σ²为噪声功率，表示空基平台是否为地面用户提供通感服务，具体含义为：当/>时不提供通感服务，当/>时提供通感服务。/>为其他无人机平台的干扰，即同频信道干扰，其中，P_u(t)为其他无人机平台在时刻t的发射功率；g_u,l(t)为其他无人机平台在时刻t的LoS信道功率增益；/>表示无人机的集合。则第m架无人机到第l个地面用户这条链路的数据传输速率R_m,l(t)可表示为：Among them, P is the constant transmit power of the space-based platform, σ ² is the noise power, Indicates whether the space-based platform provides synaesthesia services for ground users. The specific meaning is: When/> Synaesthesia services are not provided when/> Synaesthesia services are provided at all times. /> is the interference from other UAV platforms, that is, co-channel channel interference, where P _u (t) is the transmit power of other UAV platforms at time t; g _u,l (t) is the transmission power of other UAV platforms at time t LoS channel power gain at t;/> Represents a collection of drones. Then the data transmission rate R _m,l (t) of the link from the m-th drone to the l-th ground user can be expressed as:

R_m,l(t)＝B·log₂(1+SNR_m,l(t)) (5)R _m,l (t)=B·log ₂ (1+SNR _m,l (t)) (5)

其中，B为信号带宽。进一步，该链路的能量消耗E_m,l(t)可表示为：Among them, B is the signal bandwidth. Furthermore, the energy consumption _Em,l (t) of the link can be expressed as:

其中，P为空基平台的恒定发射功率，表示空基平台是否为地面用户提供通感服务，具体含义为：当/>时不提供通感服务，当/>时提供通感服务。P_m(t)∈[0,1,,5]表示一架无人机服务用户的个数。我们假设，每个地面用户的数据上传任务最多只能发给一架无人机。Among them, P is the constant transmit power of the space-based platform, Indicates whether the space-based platform provides synaesthesia services for ground users. The specific meaning is: When/> Synaesthesia services are not provided when/> Synaesthesia services are provided at all times. P _m (t)∈[0,1,,5] represents the number of users served by a drone. We assume that each ground user's data upload task can only be sent to at most one drone.

(3)动作选择策略(3) Action selection strategy

Q-learning算法作为基于值的强化学习算法，利用Q值函数的迭代更新来寻找无人机(智能体)的最优策略π^*。在算法运行过程中，智能体根据ε-greedy策略来选择动作执行，即，智能体随机进行动作选择的概率为ε，选择Q值函数网络中值的最大值对应的动作的概率为1-ε。As a value-based reinforcement learning algorithm, the Q-learning algorithm uses iterative updates of the Q-value function to find the optimal strategy π ^* for the UAV (agent). During the running of the algorithm, the agent selects actions to execute according to the ε-greedy strategy, that is, the probability that the agent randomly selects an action is ε, and the probability of selecting the action corresponding to the maximum value in the Q-value function network is 1-ε .

当算法开始执行时，首先利用上述(1)中方法进行Q-table的初始化，之后选择当前状态s_t，对于该状态下的每一个动作a，其都存在一个对应的“状态-动作”值，并表示为Q(s_t,a)。在此情况下，依据ε-greedy策略来选择该状态下的动作，即选择Q值函数网络中值的最大值对应的动作，如下式所示：When the algorithm starts to be executed, the Q-table is first initialized using the method in (1) above, and then the current state s _t is selected. For each action a in this state, there is a corresponding "state-action" value. , and expressed as Q(s _t ,a). In this case, the action in this state is selected according to the ε-greedy strategy, that is, the action corresponding to the maximum value in the Q-value function network is selected, as shown in the following formula:

选定动作之后，智能体开始执行此动作，之后进入下一个状态s_t+1并得到当前动作选择决策周期内的奖励值r(t)，与此同时更新相应Q网络中相应位置的值：After selecting the action, the agent starts to perform this action, and then enters the next state s _t+1 and obtains the reward value r(t) within the current action selection decision period. At the same time, the value of the corresponding position in the corresponding Q network is updated:

其中，γ为折扣因子，且γ∈[0,1]。关于奖励值函数r(t)，具体描述见(4)。Among them, γ is the discount factor, and γ∈[0,1]. Regarding the reward value function r(t), see (4) for detailed description.

(4)奖励值函数r(t)设计(4)Design of reward value function r(t)

为了综合保障无人机群网络的通信和感知性能，设计在时刻t，奖励函数r(t)表示为下式：In order to comprehensively ensure the communication and perception performance of the UAV swarm network, the reward function r(t) is designed to be expressed as the following formula at time t:

其中，SNR_thr为提前设定的已知系统参数信噪比阈值，目的在于保障智能体的通信性能；PDOP_thr为已知参数三维精度因子阈值。目的在于保障智能体的感知性能。基于此函数，保障了无人机群网络在能耗优化动作选择过程中的通感特性。Among them, SNR _thr is the signal-to-noise ratio threshold of known system parameters set in advance to ensure the communication performance of the agent; PDOP _thr is the three-dimensional accuracy factor threshold of known parameters. The purpose is to ensure the perception performance of the agent. Based on this function, the synaesthesia characteristics of the UAV swarm network in the energy consumption optimization action selection process are guaranteed.

本发明基于强化学习的无人机群通感一体能耗优化方法，和现有方法相比，其优势和有益效果在于：Compared with existing methods, the present invention's synaesthesia-integrated energy consumption optimization method for UAV groups based on reinforcement learning has the following advantages and beneficial effects:

(a)本发明综合考虑无人机集群的通信感知性能和网络能量消耗问题，提出了一种基于无人机集群通感一体网络的能效最优策略，能够在保障无人机系统通信和感知性能的前提下，有效降低无人机网络能量消耗，从而提升本身资源受限的无人机集群的网络服务寿命。(a) The present invention comprehensively considers the communication sensing performance and network energy consumption of UAV clusters, and proposes an energy-efficiency optimal strategy based on the synaesthesia integrated network of UAV clusters, which can ensure UAV system communication and perception. Under the premise of high performance, it can effectively reduce the energy consumption of UAV network, thereby improving the network service life of UAV clusters with limited resources.

(b)本发明设计了基于强化学习的智能决策算法，使得无人机能够根据动态变化的环境自适应地进行服务用户数量选择和功率选择，在保障系统通感性能的前提下，以最小的网络能耗卸载最多的任务上传流量，避免了传统集中式网络控制自带的僵硬模式，克服了环境动态性对策略制定带来的难题。(b) The present invention designs an intelligent decision-making algorithm based on reinforcement learning, so that the drone can adaptively select the number of service users and power selection according to the dynamically changing environment, and on the premise of ensuring the synaesthesia performance of the system, with the minimum The task upload traffic that consumes the most network energy is offloaded, avoiding the rigid model of traditional centralized network control and overcoming the difficulties in policy formulation caused by environmental dynamics.

附图说明Description of the drawings

为了对所提发明的技术原理和具体流程方案进行更为清晰的说明，下面将对实施例中所涉及到的相关附图进行简单说明与介绍。显然，下文所描述的附图1～3仅仅是用于实施例的描述与说明，对于本领域的普通相关技术研究人员而言，此类其他附图还可以在不通过创造性劳动的前提下来获得。In order to provide a clearer explanation of the technical principles and specific process solutions of the proposed invention, the relevant drawings involved in the embodiments will be briefly described and introduced below. Obviously, the accompanying drawings 1 to 3 described below are only for description and illustration of the embodiments. For ordinary related technical researchers in this field, such other drawings can also be obtained without creative work. .

图1为本发明所提出的基于强化学习的无人机群通感一体能耗优化方法的流程示意图。Figure 1 is a schematic flowchart of the synaesthesia-integrated energy consumption optimization method for UAV swarms based on reinforcement learning proposed by the present invention.

图2为本发明所提出的Q值函数网络的学习流程示意图。Figure 2 is a schematic diagram of the learning process of the Q-value function network proposed by the present invention.

图3为本发明所针对的基于无人机群的通感一体化网络场景示意图。Figure 3 is a schematic diagram of a synesthesia integrated network scenario based on a drone group targeted by the present invention.

具体实施方式Detailed ways

以下将结合附图和实施例对本发明的特征和原理作进一步描述，所列举实施例的作用仅限于解释本发明，并非用于限定本发明的应用范围。The features and principles of the present invention will be further described below with reference to the drawings and examples. The examples listed are only used to explain the present invention and are not intended to limit the scope of application of the present invention.

参照图1所示，考虑一个单基站覆盖的蜂窝网络，网络半径500m的场景下，本发明提出一种基于强化学习的无人机群通感一体能耗优化方法。下面将以参数表示对应选择用户是否被服务；P_m(t)∈[0,1,,5]表示一架无人机服务用户的个数；假设设置强化学习网络的探索率ε＝0.8；折扣因子γ＝0.9；l_max＝1000；通信阈值SNR_thr＝2dB；定位感知阈值PDOP_thr＝1.5；用于给每位地面用户提供定位感知服务的无人机个数为4；以及假设每个地面用户的数据上传任务最多只能发给1架无人机为例对所提供的总体发明方法的具体实施方式进行进一步解释和详细介绍。Referring to Figure 1, considering a cellular network covered by a single base station and a network radius of 500m, the present invention proposes a synaesthetic integrated energy consumption optimization method for UAV swarms based on reinforcement learning. The parameters will be used below Indicates whether the corresponding selected user is served; P _m (t)∈[0,1,,5] indicates the number of users served by a drone; Assume that the exploration rate of the reinforcement learning network is set to ε = 0.8; the discount factor γ = 0.9; l _max =1000; communication threshold SNR _thr =2dB; positioning sensing threshold PDOP _thr =1.5; the number of UAVs used to provide positioning sensing services to each ground user is 4; and it is assumed that the data of each ground user The upload task can only be sent to at most one drone as an example to further explain and introduce in detail the specific implementation method of the overall invention method provided.

步骤一：在系统运行过程中，首先建立Q-table网格，接着，按照所提方案利用无人机集群定位感知性能，即三维精度因子来初始化Q值函数网络，具体而言，当无人机为正常工作状态时，在时刻t选择无人机基站子集S_k(t)为用户提供通感服务时，其对应的Q-table网格中的值为-PDOP_sk(t)，其余网格位置赋值为零。Step 1: During the operation of the system, first establish the Q-table grid, and then use the UAV cluster positioning perception performance, that is, the three-dimensional accuracy factor, to initialize the Q-value function network according to the proposed plan. Specifically, when no one When the drone is in normal working state, when a subset of UAV base stations S _k (t) is selected to provide synaesthesia services to users at time t, the value in its corresponding Q-table grid is -PDOP _{sk (t)} , and the rest The grid position is assigned a value of zero.

步骤二：根据当前网络环境选定某一状态作为无人机智能体的初始状态。Step 2: Select a certain state as the initial state of the drone agent based on the current network environment.

步骤三：基于ε-greedy策略选择步骤二所选当前状态s_t下无人机服务状态和服务用户数量选择的动作，即根据公式(7)和ε＝0.8来确定和P_m(t)的值。具体而言，以概率1-ε＝0.2选择Q值函数网络中值的最大值对应的动作，即/>以概率ε＝0.8随机选择动作；Step 3: Based on the ε-greedy strategy, select the action to select the drone service status and the number of service users under the current state s _t selected in step 2, which is determined according to formula (7) and ε = 0.8 and the value of P _m (t). Specifically, the action corresponding to the maximum value in the Q-value function network is selected with probability 1-ε=0.2, that is/> Randomly select actions with probability ε = 0.8;

步骤四：动作决策完成后，无人机得到在该动作决策周期内的通信上传任务的能量消耗，即通过公式(6)得到E_m,l(t)，并且将和P_m(t)的值，以及通信阈值SNR_thr＝2dB和定位感知阈值PDOP_thr＝1.5一同代入公式(9)，计算得到奖励值r(t)同时转移到下一状态s_t+1；Step 4: After the action decision is completed, the UAV obtains the energy consumption of the communication upload task within the action decision period, that is, E _m,l (t) is obtained through formula (6), and Substituting the value of P _m (t), the communication threshold SNR _thr =2dB and the positioning sensing threshold PDOP _thr =1.5 into formula (9), the reward value r(t) is calculated and transferred to the next state s _t+1 ;

步骤五：将强化学习网络的探索率ε＝0.8和折扣因子γ＝0.9代入公式(8)，得到该动作下的Q(s_t,a_t)，从而对Q值函数网络的值进行更新；Step 5: Substitute the exploration rate ε = 0.8 and discount factor γ = 0.9 of the reinforcement learning network into formula (8) to obtain Q(s _t , a _t ) under the action, thereby updating the value of the Q-value function network;

步骤六：将新状态s_t+1设置为当前状态，重复步骤三～步骤六，直到Q值函数网络中的值达到收敛。Step 6: Set the new state s _t+1 as the current state, and repeat steps 3 to 6 until the values in the Q-value function network reach convergence.

当Q值函数网络中的值通过不断更新最终达到收敛状态时，便可以利用Q-table指导无人机在对应状态下做出最佳决策，即选择最优数量的服务用户以及对应状态下的发射功率，得到最优的用户通信任务流量卸载策略，即无人机的最优能效。下面给出算法全部流程：When the values in the Q-value function network finally reach the convergence state through continuous updating, the Q-table can be used to guide the drone to make the best decision in the corresponding state, that is, to select the optimal number of service users and the number of service users in the corresponding state. Transmit power to obtain the optimal user communication task traffic offloading strategy, that is, the optimal energy efficiency of the UAV. The entire process of the algorithm is given below:

Q-Learning算法：得到无人机集群通感一体网络的能效最优策略Q-Learning Algorithm: Obtain the optimal energy efficiency strategy for the synaesthesia integrated network of UAV clusters

初始化对于任意s∈S，a∈A(s),Initialization for any s∈S, a∈A(s),

利用关键技术(1)对Q-table赋初值Use key technology (1) to assign initial value to Q-table

初始化t＝1,ε＝0.8，γ＝0.9，Initialize t=1, ε=0.8, γ=0.9,

重复：repeat:

根据当前环境信息初始化状态sInitialize state s according to current environment information

在每个动作决策周期t中重复执行：Repeatedly executed in each action decision cycle t:

根据ε-greedy策略选择状态s下的动作aSelect action a in state s according to ε-greedy strategy

执行动作a，利用公式(9)获得奖励函数值并进入下一状态s’Execute action a, use formula (9) to obtain the reward function value and enter the next state s’

利用公式(8)更新Q-table对应的值Use formula (8) to update the corresponding value of Q-table

令t＝t+1，s’＝sLet t=t+1, s’=s

重复上述步骤，直至达到最大迭代次数l_max＝1000。Repeat the above steps until the maximum number of iterations l _max =1000 is reached.

上述所列举实施例涉及到的本发明的特征和原理的整体详细流程图如图2所示，本发明所针对的相关场景图如图3所示。The overall detailed flow chart of the features and principles of the present invention involved in the above-mentioned embodiments is shown in Figure 2, and the relevant scene diagram targeted by the present invention is shown in Figure 3.

综上所述，本发明提出的一种基于强化学习的无人机群通感一体能耗优化方法，通过利用无人机的感知性能为强化学习网络提供先验信息，使得无人机可以更加快速、高效的达到最佳目标状态。另一方面，在保障无人机集群通信和感知性能的前提下，降低系统任务能耗，从而有效提升无人机网络的服务寿命，这对于本身资源受限的无人机系统具有重要意义。In summary, the present invention proposes a synaesthesia-integrated energy consumption optimization method for UAV groups based on reinforcement learning, which provides prior information for the reinforcement learning network by utilizing the perception performance of UAVs, allowing the UAVs to operate faster , achieve the best target state efficiently. On the other hand, on the premise of ensuring the communication and sensing performance of the UAV cluster, the system task energy consumption can be reduced, thereby effectively extending the service life of the UAV network, which is of great significance for UAV systems with limited resources.

Claims

1. A synaesthesia-integrated energy consumption optimization method for UAV swarms based on reinforcement learning, which is characterized by: including the following steps:

The first step is to use the positioning perception performance of the UAV cluster to assign an initial value to the Q-value function network that reflects the relationship between UAV status and action;

The second step is to determine the current state of the drone from the environment; select a certain working point as the initial state of the drone;

The third step is to select the current action based on the ε-greedy strategy according to the current status of the drone;

The fourth step is to introduce the UAV perception performance and communication task energy consumption into the design of the Reward reward function, and obtain the actual environmental reward value of the previous action and the next state of the UAV;

The fifth step is to update the Q-value function network using the actual environmental reward value of the previous action;

Step 6: Set the new state to the current state and repeat steps 3 to 6 until the values in the Q-value function network converge.

2. A synaesthetic integrated energy consumption optimization method for UAV swarms based on reinforcement learning according to claim 1, characterized in that: the initial value assigned to the Q-value function network is: set at time t, providing positioning for user-l The subset of UAV base stations for sensing service is S _k (t), and the number of UAVs in the set is M ₀ ; then calculate the position accuracy factor of the UAV subset value, expressed as:

In the above formula, is the Jacobian matrix of the positioning sensing observation equation of the UAV base station subset S _k (t).

3. A synaesthesia-integrated energy consumption optimization method for UAV swarms based on reinforcement learning according to claim 2, characterized by: It is further expressed as the following formula:

Among them, u ₁ (t)=(x ₁ (t), y ₁ (t)) ^T , (1∈s _k (t)), (M ₀ ∈s _k (t)) respectively represent the coordinates of UAV-1, UAV-m ₀ and UAV-M ₀ in the UAV base station subset S _k (t); similarly, H ₁ ,/> are respectively the fixed height values of UAV-1, UAV-m ₀ and UAV-M ₀ in the UAV base station subset S _k (t), v _l = (x _l ,y _l ) ^T Then it is the location coordinate of user-l to be located.

4. A synaesthesia-integrated energy consumption optimization method for UAV groups based on reinforcement learning according to claim 1 or 2, characterized in that: when the UAV is in a normal working state, the UAV base station is selected at time t When the subset S _k (t) provides synaesthesia services to users, the value in its corresponding Q-table grid is The rest of the grid positions are assigned zero values; the positioning perception performance of the UAV cluster is used to assign an initial value to the Q-value function network.

5. A method for optimizing the synaesthesia integrated energy consumption of UAV groups based on reinforcement learning according to claim 1, characterized in that: the energy consumption of the communication task is, at time t, the mth UAV to the lth UAV The LoS channel power gain of ground users is expressed as:

Among them, α is the path loss coefficient of the channel, β ₀ is the unit (per meter) channel gain, d _m,l (t) is the distance from the m-th drone to the l-th ground user, and there is

6. A method for optimizing energy consumption of synaesthesia for UAV groups based on reinforcement learning according to claim 5, characterized in that: the signal-to-noise ratio SNR _m,l (t) of the link at that moment is expressed as:

Among them, P is the constant transmit power of the space-based platform, σ ² is the noise power, Indicates whether the space-based platform provides synaesthesia services for ground users. The specific meaning is: When/> Synaesthesia services are not provided when/> Provide synaesthesia services;/> is the interference from other UAV platforms, that is, co-channel channel interference, where P _u (t) is the transmit power of other UAV platforms at time t; g _u,l (t) is the transmission power of other UAV platforms at time t LoS channel power gain at t;/> represents a collection of UAVs; then the data transmission rate R _m,l (t) of the link from the m-th UAV to the l-th ground user is expressed as:

R _m,l (t)=B·log ₂ (1+SNR _m,l (t)) (5)

Among them, B is the signal bandwidth.

7. A synaesthetic integrated energy consumption optimization method for UAV swarms based on reinforcement learning according to claim 6, characterized in that: the energy consumption of the link E _m,l (t) is expressed as:

Among them, P is the constant transmit power of the space-based platform, Indicates whether the space-based platform provides synaesthesia services for ground users. The specific meaning is: When/> Synaesthesia services are not provided when/> Synaesthesia services are provided at any time; P _m (t)∈[0,1,…,5] represents the number of users served by a drone; the data upload task of each ground user can only be sent to one drone at most. machine.

8. A method for optimizing energy consumption of UAV synaesthesia based on reinforcement learning according to claim 2, characterized in that: selecting the current action based on the ε-greedy strategy is: the probability that the agent randomly selects the action is ε, the probability of selecting the action corresponding to the maximum value in the Q-value function network is 1-ε;

When execution starts, first use the method in formula (1) to initialize the Q-table, and then select the current state s _t . For each action a in this state, there is a corresponding "state-action" value, And expressed as Q(s _t ,a); in this case, the action in this state is selected according to the ε-greedy strategy, that is, the action corresponding to the maximum value in the Q-value function network is selected, as shown in the following formula:

9. A synaesthesia integrated energy consumption optimization method for UAV swarms based on reinforcement learning according to claim 8, characterized in that: after selecting an action, the agent starts to perform this action, and then enters the next state s _{t+ 1} and obtain the reward value r(t) within the current action selection decision period, and at the same time update the value of the corresponding position in the corresponding Q network:

Among them, γ is the discount factor, and γ∈[0,1].

10. A method for optimizing the integrated energy consumption of UAV swarm synaesthesia based on reinforcement learning according to claim 9, characterized in that: the reward value r(t) is designed: the design is at time t, and the reward function r(t) represents is the following formula:

Among them, SNR _thr is the known system parameter signal-to-noise ratio threshold set in advance to ensure the communication performance of the agent;

PDOP _thr is the three-dimensional precision factor threshold of known parameters; the purpose is to ensure the perception performance of the agent; based on this function, the synaesthesia characteristics of the UAV swarm network in the energy consumption optimization action selection process are guaranteed.