CN118034355A

CN118034355A - Network training method, unmanned aerial vehicle obstacle avoidance method and device

Info

Publication number: CN118034355A
Application number: CN202410447633.0A
Authority: CN
Inventors: 刘克新; 吴其臻; 吕金虎; 陈磊; 朱国梁
Original assignee: Academy of Mathematics and Systems Science of CAS
Current assignee: Beihang University; Beijing Institute of Technology BIT; Academy of Mathematics and Systems Science of CAS
Priority date: 2024-04-15
Filing date: 2024-04-15
Publication date: 2024-05-14
Anticipated expiration: 2044-04-15
Also published as: CN118034355B

Abstract

The present invention provides a network training method, a method and device for avoiding obstacles of unmanned aerial vehicles, and relates to the technical field of unmanned aerial vehicle control. The method comprises: updating an experience replay pool with sample data of the target moment constructed by the target moment environmental situation of the sample unmanned aerial vehicle, the next moment environmental situation, the optimal heading angle at the target moment, and the target moment reward value; when the number of sample data in the updated experience replay pool reaches a preset number, extracting a plurality of sample data to be processed from the pool for multi-step prediction to obtain the optimal heading angle prediction value, environmental situation prediction value, and reward prediction value at a plurality of future moments; training a target strategy network according to the environmental situation, the optimal heading angle, the reward value, the environmental situation prediction value, the reward prediction value, and the optimal heading angle prediction value in each sample data to be processed, and obtaining an optimized strategy network. The present invention effectively improves the learning efficiency and sample utilization rate in obstacle avoidance of unmanned aerial vehicles.

Description

Network training method, unmanned aerial vehicle obstacle avoidance method and device

技术领域Technical Field

本发明涉及无人机控制技术领域，尤其涉及一种网络训练方法、无人机避障方法及装置。The present invention relates to the technical field of unmanned aerial vehicle control, and in particular to a network training method, a unmanned aerial vehicle obstacle avoidance method and a device.

背景技术Background Art

无人机避障问题可以描述为无人机在一个存在障碍物的空间中导航的任务。任务通常遵循一些优化标准，如工作成本最小、飞行距离最短、飞行时间最短等。常见的传统避障方法包括：动态规划算法、人工势场法、基于采样的方法以及基于图论的方法，但这些方法却需要根据不同的情况建立不同的模型。然而在实际的无人机飞行环境中，工作环境复杂且不可预测，往往需要无人机在未知环境中进行探测并实时决策。The obstacle avoidance problem of drones can be described as the task of a drone navigating in a space with obstacles. The task usually follows some optimization criteria, such as minimum work cost, shortest flight distance, shortest flight time, etc. Common traditional obstacle avoidance methods include: dynamic programming algorithm, artificial potential field method, sampling-based method and graph theory-based method, but these methods require different models to be established according to different situations. However, in the actual drone flight environment, the working environment is complex and unpredictable, and drones are often required to detect in unknown environments and make real-time decisions.

随着人工智能技术的进步，强化学习在游戏、机器人、互联网等领域的应用日益广泛，引起了广泛关注。无模型强化学习是一种常用的解决未知环境决策的方法，已经广泛应用于无人机的避障问题中。但是由于无人机与环境的相互交互作用有限，导致无模型强化学习的样本利用率低和自主学习效率低，进而导致无人机避障性能较差。With the advancement of artificial intelligence technology, reinforcement learning has been increasingly used in games, robots, the Internet and other fields, and has attracted widespread attention. Model-free reinforcement learning is a commonly used method to solve decision-making in unknown environments and has been widely used in the obstacle avoidance problem of drones. However, due to the limited interaction between drones and the environment, the sample utilization rate of model-free reinforcement learning is low and the autonomous learning efficiency is low, which in turn leads to poor obstacle avoidance performance of drones.

发明内容Summary of the invention

本发明提供一种网络训练方法、无人机避障方法及装置，用以解决现有技术中无人机与环境的相互交互作用有限，导致无模型强化学习的样本利用率低和自主学习效率低，进而导致无人机避障性能较差的缺陷，实现提高无人机避障中的自主学习效率和样本利用率，以提高无人机避障性能。The present invention provides a network training method, a UAV obstacle avoidance method and a device, which are used to solve the defects in the prior art that the interaction between the UAV and the environment is limited, resulting in low sample utilization rate and low autonomous learning efficiency of model-free reinforcement learning, and further resulting in poor obstacle avoidance performance of the UAV, so as to improve the autonomous learning efficiency and sample utilization rate in the obstacle avoidance of the UAV, so as to improve the obstacle avoidance performance of the UAV.

本发明提供一种网络训练方法，包括：The present invention provides a network training method, comprising:

根据样本无人机的目标时刻环境态势、目标时刻最优航向角、下一时刻环境态势，以及目标时刻奖励值，构建目标时刻的样本数据；According to the environmental situation of the sample drone at the target moment, the optimal heading angle at the target moment, the environmental situation at the next moment, and the reward value at the target moment, the sample data at the target moment is constructed;

将所述目标时刻的样本数据更新至经验回放池，在更新后的经验回放池中的样本数据的数量达到预设数量的情况下，从所述更新后的经验回放池中抽取出目标预测区间内多个不同时刻的待处理样本数据；The sample data at the target moment is updated to the experience replay pool, and when the number of sample data in the updated experience replay pool reaches a preset number, the sample data to be processed at multiple different moments within the target prediction interval are extracted from the updated experience replay pool;

将各所述待处理样本数据中的环境态势输入至目标策略网络，得到多个不同未来时刻的最优航向角预测值，并将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值；Inputting the environmental situation in each of the sample data to be processed into the target strategy network to obtain the optimal heading angle prediction values at multiple different future moments, and inputting the environmental situation in each of the sample data to be processed and the optimal heading angle prediction values at multiple different future moments into the target prediction network for multi-step prediction to obtain the environmental situation prediction values and reward prediction values at multiple different future moments;

根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，并根据训练结果，获取优化的策略网络；According to the environmental situation, the optimal heading angle and the reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, the target policy network is subjected to reinforcement learning training, and an optimized policy network is obtained according to the training results;

其中，所述优化的策略网络用于基于当前无人机的当前时刻环境态势预测所述当前无人机的当前时刻最优航向角，以供所述当前无人机根据所述当前时刻最优航向角执行避障任务。The optimized strategy network is used to predict the optimal heading angle of the current drone at the current moment based on the current environmental situation of the current drone at the current moment, so that the current drone can perform an obstacle avoidance task according to the optimal heading angle at the current moment.

根据本发明提供的一种网络训练方法，所述目标时刻环境态势和所述目标时刻最优航向角是基于如下步骤获取的：According to a network training method provided by the present invention, the environmental situation at the target moment and the optimal heading angle at the target moment are obtained based on the following steps:

根据所述样本无人机的目标时刻位置、半径和目的地位置，障碍物的目标时刻位置、目标时刻速度和半径，以及所述样本无人机与所述障碍物之间的目标时刻距离，确定所述目标时刻环境态势；Determine the target time environmental situation according to the target time position, radius and destination position of the sample UAV, the target time position, target time speed and radius of the obstacle, and the target time distance between the sample UAV and the obstacle;

将所述目标时刻环境态势输入至所述目标策略网络，得到所述目标时刻最优航向角。The environmental situation at the target moment is input into the target strategy network to obtain the optimal heading angle at the target moment.

根据本发明提供的一种网络训练方法，所述下一时刻环境态势是基于如下步骤获取的：According to a network training method provided by the present invention, the next moment environment situation is obtained based on the following steps:

根据无人机动力学约束模型、运动学约束和扰动流场法，计算得到所述样本无人机的下一时刻位置；The next moment position of the sample UAV is calculated based on the UAV dynamic constraint model, kinematic constraints and disturbance flow field method;

根据所述样本无人机的下一时刻位置、半径和目的地位置，障碍物的下一时刻位置、下一时刻速度和半径，以及所述样本无人机与所述障碍物之间的下一时刻距离，确定所述下一时刻环境态势。The environmental situation at the next moment is determined according to the next moment position, radius and destination position of the sample drone, the next moment position, next moment speed and radius of the obstacle, and the next moment distance between the sample drone and the obstacle.

根据本发明提供的一种网络训练方法，所述目标时刻奖励值是基于如下步骤获取的：According to a network training method provided by the present invention, the target moment reward value is obtained based on the following steps:

在所述样本无人机与障碍物之间的目标时刻距离小于第一距离值的情况下，根据所述样本无人机与所述障碍物之间的目标时刻距离、所述样本无人机的半径、所述障碍物的半径，以及第一奖励值，确定所述目标时刻奖励值；When the target time distance between the sample drone and the obstacle is less than the first distance value, determining the target time reward value according to the target time distance between the sample drone and the obstacle, the radius of the sample drone, the radius of the obstacle, and the first reward value;

在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且所述样本无人机与目的地位置之间的目标时刻距离小于第二距离值的情况下，根据所述样本无人机与所述目的地位置之间的目标时刻距离、所述样本无人机的起点位置与所述目的地位置之间的距离，以及第二奖励值和第三奖励值，确定所述目标时刻奖励值；When the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, and the target time distance between the sample drone and the destination position is less than the second distance value, the target time reward value is determined according to the target time distance between the sample drone and the destination position, the distance between the starting position of the sample drone and the destination position, and the second reward value and the third reward value;

在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且所述样本无人机与所述目的地位置之间的目标时刻距离大于或等于所述第二距离值的情况下，根据所述样本无人机与所述目的地位置之间的目标时刻距离、所述样本无人机的起点位置与所述目的地位置之间的距离，以及所述第三奖励值，确定所述目标时刻奖励值；When the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, and the target time distance between the sample drone and the destination position is greater than or equal to the second distance value, the target time reward value is determined according to the target time distance between the sample drone and the destination position, the distance between the starting position of the sample drone and the destination position, and the third reward value;

其中，所述第一奖励值为常值奖励值；所述第二奖励值为用于限制所述样本无人机远离所述障碍物的威胁奖励；所述第三奖励值为任务完成对应的附加奖励值。Among them, the first reward value is a constant reward value; the second reward value is a threat reward for limiting the sample drone to stay away from the obstacle; and the third reward value is an additional reward value corresponding to task completion.

根据本发明提供的一种网络训练方法，所述第二奖励值是基于如下步骤确定的：According to a network training method provided by the present invention, the second reward value is determined based on the following steps:

在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且小于第三距离值的情况下，基于所述样本无人机与所述障碍物之间的目标时刻距离、所述样本无人机的半径、所述障碍物的半径、预设威胁半径和第四奖励值，确定所述第二奖励值；When the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value and less than the third distance value, the second reward value is determined based on the target time distance between the sample drone and the obstacle, the radius of the sample drone, the radius of the obstacle, the preset threat radius and the fourth reward value;

在所述样本无人机与所述障碍物之间的目标时刻距离小于所述第一距离值，或者大于或等于所述第三距离值的情况下，基于预设常数值，确定所述第二奖励值。When the target time distance between the sample UAV and the obstacle is less than the first distance value, or greater than or equal to the third distance value, the second reward value is determined based on a preset constant value.

根据本发明提供的一种网络训练方法，所述将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值，包括：According to a network training method provided by the present invention, the environmental situation in each of the sample data to be processed and the optimal heading angle prediction values at multiple different future moments are input into the target prediction network for multi-step prediction to obtain environmental situation prediction values and reward prediction values at multiple different future moments, including:

将各所述待处理样本数据中的环境态势和所述最优航向角预测值输入至所述目标预测网络的奖励函数网络，以及将各所述待处理样本数据中的环境态势和所述最优航向角预测值输入至所述目标预测网络的态势转移函数网络，进行多步预测，得到所述奖励函数网络输出的所述奖励预测值和所述态势转移函数网络输出所述环境态势预测值。The environmental situation in each of the sample data to be processed and the predicted value of the optimal heading angle are input into the reward function network of the target prediction network, and the environmental situation in each of the sample data to be processed and the predicted value of the optimal heading angle are input into the situation transfer function network of the target prediction network, and multi-step prediction is performed to obtain the reward prediction value output by the reward function network and the environmental situation prediction value output by the situation transfer function network.

根据本发明提供的一种网络训练方法，所述根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，包括：According to a network training method provided by the present invention, the target strategy network is subjected to reinforcement learning training according to the environmental situation, the optimal heading angle and the reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, including:

根据各所述待处理样本数据中的最优航向角，以及所述最优航向角预测值、所述环境态势预测值和所述奖励预测值，获取值函数代价函数；Obtaining a value function cost function according to the optimal heading angle in each of the sample data to be processed, as well as the optimal heading angle prediction value, the environmental situation prediction value and the reward prediction value;

根据所述最优航向角预测值和所述环境态势预测值，获取策略代价函数；Acquire a strategy cost function according to the optimal heading angle prediction value and the environmental situation prediction value;

根据各所述待处理样本数据中的环境态势和所述环境态势预测值，获取态势转移代价函数；Obtaining a situation transfer cost function according to the environmental situation in each of the sample data to be processed and the environmental situation prediction value;

根据各所述待处理样本数据中的奖励值和所述奖励预测值，获取奖励代价函数；Obtaining a reward cost function according to the reward value in each of the sample data to be processed and the reward prediction value;

根据所述值函数代价函数、所述策略代价函数、所述态势转移代价函数以及所述奖励代价函数，对所述目标策略网络进行强化学习。Reinforcement learning is performed on the target policy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the reward cost function.

本发明还提供一种无人机避障方法，包括：The present invention also provides a method for avoiding obstacles in a drone, comprising:

获取当前无人机的当前时刻环境态势；Get the current environmental situation of the current drone;

将所述当前时刻环境态势输入至优化的策略网络，得到所述当前无人机的当前时刻最优航向角；Inputting the current environmental situation into the optimized strategy network to obtain the optimal heading angle of the current drone at the current moment;

根据所述当前时刻最优航向角，控制所述当前无人机执行避障任务；According to the optimal heading angle at the current moment, controlling the current drone to perform an obstacle avoidance task;

其中，所述优化的策略网络是基于如上述任一项所述网络训练方法训练得到。Wherein, the optimized policy network is trained based on any of the network training methods described above.

本发明还提供一种网络训练装置，包括：The present invention also provides a network training device, comprising:

构建单元，用于根据样本无人机的目标时刻环境态势、目标时刻最优航向角、下一时刻环境态势，以及目标时刻奖励值，构建目标时刻的样本数据；A construction unit, used to construct sample data at the target moment according to the target moment environmental situation of the sample UAV, the optimal heading angle at the target moment, the next moment environmental situation, and the target moment reward value;

抽取单元，用于将所述目标时刻的样本数据更新至经验回放池，在更新后的经验回放池中的样本数据的数量达到预设数量的情况下，从所述更新后的经验回放池中抽取出目标预测区间内多个不同时刻的待处理样本数据；An extraction unit, used to update the sample data at the target moment to the experience replay pool, and when the number of sample data in the updated experience replay pool reaches a preset number, extract the sample data to be processed at multiple different moments within the target prediction interval from the updated experience replay pool;

预测单元，用于将各所述待处理样本数据中的环境态势输入至目标策略网络，得到多个不同未来时刻的最优航向角预测值，并将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值；A prediction unit, used for inputting the environmental situation in each of the sample data to be processed into a target strategy network to obtain optimal heading angle prediction values at multiple different future moments, and inputting the environmental situation in each of the sample data to be processed and the optimal heading angle prediction values at multiple different future moments into a target prediction network for multi-step prediction to obtain environmental situation prediction values and reward prediction values at multiple different future moments;

优化单元，用于根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，并根据训练结果，获取优化的策略网络；An optimization unit, configured to perform reinforcement learning training on the target policy network according to the environmental situation, the optimal heading angle and the reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, and obtain an optimized policy network according to the training results;

本发明还提供一种无人机避障装置，包括：The present invention also provides a drone obstacle avoidance device, comprising:

获取单元，用于获取当前无人机的当前时刻环境态势；The acquisition unit is used to obtain the current environmental situation of the current drone;

决策单元，用于将所述当前时刻环境态势输入至优化的策略网络，得到所述当前无人机的当前时刻最优航向角；A decision-making unit, used for inputting the current environmental situation into the optimized strategy network to obtain the optimal heading angle of the current UAV at the current moment;

避障控制单元，用于根据所述当前时刻最优航向角，控制所述当前无人机执行避障任务；An obstacle avoidance control unit, used to control the current UAV to perform an obstacle avoidance task according to the optimal heading angle at the current moment;

本发明还提供一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现如上述任一种所述网络训练方法，或者如上述任一种所述无人机避障方法。The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the network training method described in any one of the above, or the drone obstacle avoidance method described in any one of the above, is implemented.

本发明还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现如上述任一种所述网络训练方法，或者如上述任一种所述无人机避障方法。The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon. When the computer program is executed by a processor, the network training method described in any one of the above or the drone obstacle avoidance method described in any one of the above is implemented.

本发明还提供一种计算机程序产品，包括计算机程序，所述计算机程序被处理器执行时实现如上述任一种所述网络训练方法，或者如上述任一种所述无人机避障方法。The present invention also provides a computer program product, including a computer program, which, when executed by a processor, implements any of the network training methods described above, or any of the drone obstacle avoidance methods described above.

本发明提供的网络训练方法、无人机避障方法及装置，通过从经验回放池中抽取出多个不同时刻的待处理样本数据，并且基于目标策略网络以及目标预测网络对各待处理样本数据进行预测，得到多个不同未来时刻的最优航向角预测值、环境态势预测值和奖励预测值，以基于各待处理样本数据中的环境态势、最优航向角和奖励值，以及未来时刻的环境态势预测值、奖励预测值和最优航向角预测值，对目标策略网络进行滚动优化，不仅可以实现通过目标预测网络的虚拟环境数据生成来扩充样本数据的数量，以减少与真实环境的交互次数，提高样本利用率，加快训练速度，提高学习效率，进而使得无人机在与环境交互次数更少的情况下目标策略网络可以快速收敛到最优；还可以使得优化的决策模型既具备当前决策经验，又具备未来决策经验，以便做出更加优化的无人机避障决策，由此提高无人机避障性能。The network training method, unmanned aerial vehicle obstacle avoidance method and device provided by the present invention extract a plurality of to-be-processed sample data at different moments from an experience replay pool, and predict each to-be-processed sample data based on a target policy network and a target prediction network, so as to obtain the optimal heading angle prediction values, environmental situation prediction values and reward prediction values at a plurality of different future moments, and perform rolling optimization on the target policy network based on the environmental situation, optimal heading angle and reward value in each to-be-processed sample data, and the environmental situation prediction value, reward prediction value and optimal heading angle prediction value at the future moment, so as to realize not only the expansion of the number of sample data by generating virtual environmental data of the target prediction network, so as to reduce the number of interactions with the real environment, improve the sample utilization rate, accelerate the training speed, and improve the learning efficiency, so as to enable the target policy network to converge to the optimum quickly when the number of interactions with the environment of the unmanned aerial vehicle is less; and also enable the optimized decision model to have both current decision experience and future decision experience, so as to make a more optimized obstacle avoidance decision for the unmanned aerial vehicle, thereby improving the obstacle avoidance performance of the unmanned aerial vehicle.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the present invention or the prior art, the following briefly introduces the drawings required for use in the embodiments or the description of the prior art. Obviously, the drawings described below are some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1是本发明提供的网络训练方法的流程示意图之一；FIG1 is a flow chart of a network training method provided by the present invention;

图2是本发明提供的网络训练方法的流程示意图之二；FIG2 is a second flow chart of the network training method provided by the present invention;

图3是本发明提供的网络训练方法的流程示意图之三；FIG3 is a third flow chart of the network training method provided by the present invention;

图4是本发明提供的无人机避障方法的流程示意图；FIG4 is a schematic diagram of a flow chart of a method for avoiding obstacles in a drone provided by the present invention;

图5是本发明提供的网络训练装置的结构示意图；FIG5 is a schematic diagram of the structure of a network training device provided by the present invention;

图6是本发明提供的无人机避障装置的结构示意图；FIG6 is a schematic structural diagram of a drone obstacle avoidance device provided by the present invention;

图7是本发明提供的电子设备的结构示意图。FIG. 7 is a schematic diagram of the structure of an electronic device provided by the present invention.

具体实施方式DETAILED DESCRIPTION

为使本发明的目的、技术方案和优点更加清楚，下面将结合本发明中的附图，对本发明中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be clearly and completely described below in conjunction with the drawings of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

无人机避障问题可以描述为无人机在一个存在障碍物的空间中导航的任务。任务通常遵循一些优化标准，如工作成本最小、飞行距离最短、飞行时间最短等。常见的传统避障方法包括：动态规划算法、人工势场法、基于采样的方法以及基于图论的方法，但这些方法却需要根据不同的情况建立不同的模型。然而在实际的无人机飞行环境中，工作环境复杂且不可预测，往往需要无人机具有自学习、自适应和鲁棒能力，以在未知环境中进行探测并实时决策。The obstacle avoidance problem of drones can be described as the task of a drone navigating in a space with obstacles. The task usually follows some optimization criteria, such as minimum work cost, shortest flight distance, shortest flight time, etc. Common traditional obstacle avoidance methods include: dynamic programming algorithm, artificial potential field method, sampling-based method and graph theory-based method, but these methods require different models to be established according to different situations. However, in the actual drone flight environment, the working environment is complex and unpredictable, and drones are often required to have self-learning, adaptive and robust capabilities to detect and make real-time decisions in unknown environments.

为了克服这些方法的弱点，相关人员探索了各种解决方案，如强化学习，其可以从环境状态中学习适当的行为，好处在于其具备在线学习的能力，以及可在不同环境中产生相应的奖励或惩罚，是一种常用的解决未知环境决策的方法，已经广泛应用于无人机的避障问题中。强化学习的端到端运动规划方法允许将系统视为一个整体，使其更加具有鲁棒性。但由于无人机与环境所需的交互数量有限，也即无人机与环境的相互作用有限，导致无模型强化学习的样本利用率低和自主学习效率低，无模型强化学习在现实世界中的应用也受到了限制，进而导致无人机避障性能较差。In order to overcome the weaknesses of these methods, relevant personnel have explored various solutions, such as reinforcement learning, which can learn appropriate behaviors from environmental states. The advantage is that it has the ability to learn online and can generate corresponding rewards or penalties in different environments. It is a commonly used method to solve unknown environmental decision-making and has been widely used in the obstacle avoidance problem of drones. The end-to-end motion planning method of reinforcement learning allows the system to be considered as a whole, making it more robust. However, due to the limited number of interactions required between drones and the environment, that is, the limited interaction between drones and the environment, the sample utilization rate of model-free reinforcement learning is low and the autonomous learning efficiency is low. The application of model-free reinforcement learning in the real world is also limited, which leads to poor obstacle avoidance performance of drones.

针对无模型强化学习在无人机避障中存在训练效率低、样本利用率低的问题，本申请实施例提供了一种网络训练方法，该方法通过数据驱动的方法得到的目标预测网络预测模拟环境态势和奖励模型，并采用滚动优化区间进行策略网络的优化训练，使得无人机避障，具有较高的学习效率、更快的策略趋近于最优值的收敛速度以及较少的经验重放缓冲区所需的样本容量空间，进而使得无人机仅通过少量真实环境交互就能学习最佳策略，促进了强化学习方法在避障问题中的应用，提高了无人机避障性能。In response to the problems of low training efficiency and low sample utilization in model-free reinforcement learning in drone obstacle avoidance, an embodiment of the present application provides a network training method. This method obtains a target prediction network prediction simulated environmental situation and reward model through a data-driven method, and uses a rolling optimization interval to perform optimization training of the policy network, so that the drone can avoid obstacles with higher learning efficiency, faster convergence speed of the strategy to the optimal value, and less sample capacity space required for the experience replay buffer, so that the drone can learn the optimal strategy through only a small amount of real environment interaction, which promotes the application of reinforcement learning methods in obstacle avoidance problems and improves the obstacle avoidance performance of drones.

图1为本发明提供的网络训练方法的流程示意图之一，如图1所示，该方法包括：FIG. 1 is a flow chart of a network training method provided by the present invention. As shown in FIG. 1 , the method includes:

步骤110，根据样本无人机的目标时刻环境态势、目标时刻最优航向角、下一时刻环境态势，以及目标时刻奖励值，构建目标时刻的样本数据；Step 110, constructing sample data at the target moment according to the target moment environmental situation of the sample drone, the optimal heading angle at the target moment, the next moment environmental situation, and the target moment reward value;

此处，目标时刻可以是当前时刻，也可是各历史时刻，本实施对此不作具体地限定；下一时刻也即为目标时刻的下一时刻。Here, the target moment may be the current moment or any historical moment, and this implementation does not specifically limit this; the next moment is the moment after the target moment.

目标时刻环境态势为目标时刻下的环境态势，具体可以是基于样本无人机的目标时刻运行参数以及样本无人机所处环境下的障碍物的目标时刻运行参数进行确定的。目标时刻最优航向角为目标时刻下的最优航向角，此处的最优航向角可以将目标时刻环境态势输入至目标策略网络进行决策输出获取的。下一时刻环境态势为目标时刻的下一时刻的环境态势，具体可以是基于样本无人机的下一时刻运行参数以及样本无人机所处环境下的障碍物的下一时刻运行参数进行确定的。目标时刻奖励值为在基于目标时刻最优航向角执行避障操作所形成的奖励值。The target moment environmental situation is the environmental situation at the target moment, which can be determined based on the target moment operating parameters of the sample drone and the target moment operating parameters of the obstacles in the environment where the sample drone is located. The target moment optimal heading angle is the optimal heading angle at the target moment. The optimal heading angle here can be obtained by inputting the target moment environmental situation into the target strategy network for decision output. The next moment environmental situation is the environmental situation at the next moment of the target moment, which can be determined based on the next moment operating parameters of the sample drone and the next moment operating parameters of the obstacles in the environment where the sample drone is located. The target moment reward value is the reward value formed by performing the obstacle avoidance operation based on the target moment optimal heading angle.

在一些实施中，所述目标时刻环境态势和所述目标时刻最优航向角是基于如下步骤获取的：In some implementations, the target moment environmental situation and the target moment optimal heading angle are obtained based on the following steps:

根据所述样本无人机的目标时刻位置、半径和目的地位置，障碍物的目标时刻位置、目标时刻速度和半径，以及所述样本无人机与所述障碍物之间的目标时刻距离，确定所述目标时刻环境态势；将所述目标时刻环境态势输入至所述目标策略网络，得到所述目标时刻最优航向角。The target moment environmental situation is determined according to the target moment position, radius and destination position of the sample UAV, the target moment position, target moment speed and radius of the obstacle, and the target moment distance between the sample UAV and the obstacle; the target moment environmental situation is input into the target strategy network to obtain the optimal heading angle at the target moment.

此处，障碍物可以是圆柱体或者圆球形等，本实施例对此不作具体地限定。Here, the obstacle may be a cylinder or a sphere, etc., which is not specifically limited in this embodiment.

可选地，获取目标时刻环境态势的步骤具体包括：Optionally, the step of obtaining the environmental situation at the target moment specifically includes:

基于障碍物的目标时刻位置与样本无人机的目标时刻位置之间的差值、样本无人机的半径和障碍物的半径之和，以及样本无人机与障碍物之间的目标时刻距离，确定目标时刻环境态势的第一参数值；基于目的地位置与障碍物的目标时刻位置之间的差值，确定目标时刻环境态势的第二参数值；基于障碍物的目标时刻速度，确定目标时刻环境态势的第三参数值，基于第一参数值、第二参数值和第三参数值构建形成目标时刻环境态势，具体计算公式如下：Based on the difference between the target moment position of the obstacle and the target moment position of the sample UAV, the sum of the radius of the sample UAV and the radius of the obstacle, and the target moment distance between the sample UAV and the obstacle, the first parameter value of the target moment environmental situation is determined; based on the difference between the destination position and the target moment position of the obstacle, the second parameter value of the target moment environmental situation is determined; based on the target moment speed of the obstacle, the third parameter value of the target moment environmental situation is determined, and the target moment environmental situation is constructed based on the first parameter value, the second parameter value and the third parameter value. The specific calculation formula is as follows:

； ;

其中，为时刻环境态势；和分别为样本无人机、障碍物的时刻位置、为目的地位置；为样本无人机与障碍物之间的时刻距离；和分别为样本无人机和障碍物的半径；为障碍物的时刻速度。in, for The current environmental situation; and They are sample drones, obstacles, Time location, is the destination location; is the distance between the sample drone and the obstacle. Time distance; and are the radii of the sample drone and obstacles, respectively; for obstacles Time speed.

可选地，获取目标时刻最优航向角的步骤具体包括：Optionally, the step of obtaining the optimal heading angle at the target moment specifically includes:

将目标时刻环境态势输入至目标策略网络，以由目标策略网络根据目标时刻环境态势灵活地做出相应的决策，由此输出目标时刻最优航向角，以便后续基于此优化训练的目标策略网络可以根据环境态势灵活地调整无人机的航向角，以最大程度地及时避开障碍物并到达目的地，并且可以减少依赖于预先建立的地图或静态路径规划，由此可适应各种复杂的环境的避障场景，提高无人机避障的泛化性和鲁棒性。The environmental situation at the target moment is input into the target policy network, so that the target policy network can flexibly make corresponding decisions according to the environmental situation at the target moment, thereby outputting the optimal heading angle at the target moment, so that the target policy network based on the subsequent optimization training can flexibly adjust the heading angle of the UAV according to the environmental situation, so as to avoid obstacles and reach the destination in time to the greatest extent, and can reduce reliance on pre-established maps or static path planning, thereby adapting to obstacle avoidance scenarios in various complex environments and improving the generalization and robustness of UAV obstacle avoidance.

此处，目标策略网络可以是基于目标预测区间的上一预测区间内抽取的多个不同时刻的待处理样本数据中对应时刻的环境态势、最优航向角，以及环境态势预测值和最优航向角预测值进行训练得到的。Here, the target strategy network can be trained based on the environmental situation, optimal heading angle, and environmental situation prediction value and optimal heading angle prediction value at the corresponding time in the to-be-processed sample data at multiple different times extracted in the previous prediction interval of the target prediction interval.

此处，最优航向角包括但不限于最优滚转角、最优俯仰角、最优偏航角，具体计算公式如下：Here, the optimal heading angle includes but is not limited to the optimal roll angle, the optimal pitch angle, and the optimal yaw angle. The specific calculation formula is as follows:

； ;

其中，为样本无人机的时刻最优航向角；分别表示样本无人机的最优滚转角、最优俯仰角和最优偏航角；为所需训练的目标策略网络。在DDPG（Deep Deterministic Policy Gradient，深度强化学习）算法中，目标策略网络中的主目标策略网络是执行器（也称Actor），值函数网络中的主值函数网络是评价器（也称Critic）。Actor 要做的是与环境交互，并在 Critic 价值函数的指导下用策略梯度学习一个更好的策略。Critic 要做的是通过 Actor 与环境交互收集的数据学习一个价值函数，这个价值函数会用于判断在当前状态什么动作是好的，什么动作不是好的，进而帮助 Actor 进行策略更新。对于主值函数网络，通过输入目标时刻的环境态势和最优航向角，网络会输出价值函数，该函数表示着未来累积奖励的大小。对于主目标策略网络，通过输入目标时刻的环境态势，可以计算出最优航向角。另外，在DDPG中，目标策略网络还可以附加策略网络以及值函数网络还可以包含附加值函数网络，用来帮助强化学习训练过程更加稳定。in, For sample drones Optimal heading angle at the moment; They represent the optimal roll angle, optimal pitch angle, and optimal yaw angle of the sample UAV respectively; is the target policy network to be trained. In the DDPG (Deep Deterministic Policy Gradient, deep reinforcement learning) algorithm, the main target policy network in the target policy network It is the executor (also called Actor), the main value function network in the value function network It is the evaluator (also called Critic). What the Actor needs to do is to interact with the environment and learn a better strategy using policy gradient under the guidance of the Critic value function. What the Critic needs to do is to learn a value function through the data collected by the Actor's interaction with the environment. This value function will be used to determine what actions are good and what actions are not good in the current state, and then help the Actor to update its strategy. For the main value function network By inputting the environment situation and optimal heading angle at the target time, the network will output a value function, which represents the size of the future cumulative reward. By inputting the environmental situation at the target time, the optimal heading angle can be calculated. In addition, in DDPG, the target policy network can also be attached with a policy network And the value function network can also contain additional value function network , which is used to help the reinforcement learning training process become more stable.

可选地，在基于上述各实施例获取到样本无人机的目标时刻环境态势以及目标时刻最优航向角之后，即可基于目标时刻环境态势以及目标时刻最优航向角进行环境交互，以得到下一时刻环境态势和目标时刻奖励值，具体可表示为：Optionally, after obtaining the target moment environmental situation and the optimal heading angle of the sample drone based on the above embodiments, environmental interaction can be performed based on the target moment environmental situation and the optimal heading angle to obtain the next moment environmental situation and the target moment reward value, which can be specifically expressed as:

； ;

其中，为时刻环境态势；为时刻奖励值；表示环境交互函数，即可根据无人机动力学约束模型、运动学约束和扰动流场法得到样本无人机的下一时刻位置，并结合障碍物的下一时刻的位置和目的地位置得到下一时刻的环境态势以及当前时刻的奖励。in, for The current environmental situation; for Moment reward value; Representing the environmental interaction function, the next moment position of the sample UAV can be obtained according to the UAV dynamic constraint model, kinematic constraints and perturbation flow field method, and the next moment position of the obstacle and the destination position can be combined to obtain the next moment environmental situation and the current moment reward.

在一些实施例中，下一时刻环境态势是基于如下步骤获取的：In some embodiments, the next moment environmental situation is obtained based on the following steps:

根据无人机动力学约束模型、运动学约束和扰动流场法，计算得到所述样本无人机的下一时刻位置；根据所述样本无人机的下一时刻位置、半径和目的地位置，障碍物的下一时刻位置、下一时刻速度和半径，以及所述样本无人机与所述障碍物之间的下一时刻距离，确定所述下一时刻环境态势。According to the UAV dynamic constraint model, kinematic constraints and disturbance flow field method, the next moment position of the sample UAV is calculated; according to the next moment position, radius and destination position of the sample UAV, the next moment position, next moment speed and radius of the obstacle, and the next moment distance between the sample UAV and the obstacle, the next moment environmental situation is determined.

可选地，获取下一时刻环境态势的步骤具体包括：Optionally, the step of obtaining the environmental situation at the next moment specifically includes:

首先，基于无人机动力学约束模型、运动学约束和扰动流场法，根据样本无人机的目标时刻位置、目标时刻速度和目的地位置、障碍物的目标时刻位置和目标时刻速度，计算获取样本无人机的下一时刻位置。此种计算方式，通过在计算过程中考虑无人机的动力学约束以及运动学约束，并使用扰动流场法对环境因素进行修正，可以实现无人机的精确运动规划，从使得计算出无人机的下一时刻位置更加精准和可靠。First, based on the UAV dynamic constraint model, kinematic constraints and perturbation flow field method, the next moment position of the sample UAV is calculated according to the target moment position, target moment speed and destination position of the sample UAV, the target moment position and target moment speed of the obstacle. This calculation method can achieve accurate motion planning of the UAV by considering the dynamic constraints and kinematic constraints of the UAV in the calculation process and using the perturbation flow field method to correct environmental factors, making the calculation of the next moment position of the UAV more accurate and reliable.

此处，无人机动力学约束模型包括但不限于无人机的偏航角、爬升角、轨迹段长度、飞行高度和轨迹总长度等约束；运动学约束包括无人机运动的限制条件；扰动流场法可以是基于障碍物的下一时刻位置、下一时刻速度和半径，以及样本无人机与所述障碍物之间的下一时刻距离进行运行轨迹规划，并基于无人机动力学约束模型、运动学约束的约束条件进行运行轨迹规划的矫正，由此得到样本无人机的下一时刻位置。Here, the UAV dynamic constraint model includes but is not limited to constraints such as the UAV's yaw angle, climb angle, trajectory segment length, flight altitude and total trajectory length; the kinematic constraints include the restriction conditions for the UAV's movement; the perturbation flow field method can be based on the next moment position, next moment speed and radius of the obstacle, and the next moment distance between the sample UAV and the obstacle for operation trajectory planning, and based on the UAV dynamic constraint model and the kinematic constraint constraints, the operation trajectory planning is corrected to obtain the next moment position of the sample UAV.

接着，根据障碍物的下一时刻位置与样本无人机的下一时刻位置之间的差值、样本无人机的半径和障碍物的半径之和，以及样本无人机与障碍物之间的下一时刻距离，确定下一时刻环境态势的第一参数值；基于目的地位置与障碍物的下一时刻位置之间的差值，确定下一时刻环境态势的第二参数值；基于障碍物的下一时刻速度，确定下一时刻环境态势的第三参数值，基于第一参数值、第二参数值和第三参数值构建形成下一时刻环境态势，具体可参见目标时刻环境态势的获取步骤，此处不再赘述。Next, according to the difference between the next moment position of the obstacle and the next moment position of the sample UAV, the sum of the radius of the sample UAV and the radius of the obstacle, and the next moment distance between the sample UAV and the obstacle, determine the first parameter value of the environmental situation at the next moment; based on the difference between the destination position and the next moment position of the obstacle, determine the second parameter value of the environmental situation at the next moment; based on the next moment speed of the obstacle, determine the third parameter value of the environmental situation at the next moment, and construct the environmental situation at the next moment based on the first parameter value, the second parameter value and the third parameter value. For details, please refer to the steps for obtaining the environmental situation at the target moment, which will not be repeated here.

在一些实施例中，目标时刻奖励值是基于如下步骤获取的：In some embodiments, the target moment reward value is obtained based on the following steps:

在所述样本无人机与障碍物之间的目标时刻距离小于第一距离值的情况下，根据所述样本无人机与所述障碍物之间的目标时刻距离、所述样本无人机的半径、所述障碍物的半径，以及第一奖励值，确定所述目标时刻奖励值；在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且所述样本无人机与目的地位置之间的目标时刻距离小于第二距离值的情况下，根据所述样本无人机与所述目的地位置之间的目标时刻距离、所述样本无人机的起点位置与所述目的地位置之间的距离，以及第二奖励值和第三奖励值，确定所述目标时刻奖励值；在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且所述样本无人机与所述目的地位置之间的目标时刻距离大于或等于所述第二距离值的情况下，根据所述样本无人机与所述目的地位置之间的目标时刻距离、所述样本无人机的起点位置与所述目的地位置之间的距离，以及所述第三奖励值，确定所述目标时刻奖励值；其中，所述第一奖励值为常值奖励值；所述第二奖励值为用于限制所述样本无人机远离所述障碍物的威胁奖励；所述第三奖励值为任务完成对应的附加奖励值。In the case where the target time distance between the sample drone and the obstacle is less than the first distance value, the target time reward value is determined according to the target time distance between the sample drone and the obstacle, the radius of the sample drone, the radius of the obstacle, and the first reward value; in the case where the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, and the target time distance between the sample drone and the destination position is less than the second distance value, the target time reward value is determined according to the target time distance between the sample drone and the destination position, the distance between the starting position of the sample drone and the destination position, and the second reward value and the third reward value; in the case where the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, and the target time distance between the sample drone and the destination position is greater than or equal to the second distance value, the target time reward value is determined according to the target time distance between the sample drone and the destination position, the distance between the starting position of the sample drone and the destination position, and the third reward value; wherein the first reward value is a constant reward value; the second reward value is a threat reward for limiting the sample drone to stay away from the obstacle; and the third reward value is an additional reward value corresponding to task completion.

此处，第一距离值是根据样本无人机的半径以及障碍物的半径之和确定的；第二距离是根据预设距离值进行确定的。Here, the first distance value is determined according to the sum of the radius of the sample UAV and the radius of the obstacle; the second distance is determined according to the preset distance value.

可选地，获取目标时刻奖励值的具体步骤包括：Optionally, the specific steps of obtaining the target moment reward value include:

将样本无人机与障碍物之间的目标时刻距离与第一距离值进行比较，若样本无人机与障碍物之间的目标时刻距离小于第一距离值，则基于样本无人机与所述障碍物之间的目标时刻距离、样本无人机的半径和障碍物的半径之和，以及第一奖励值，计算得到目标时刻奖励值；若样本无人机与障碍物之间的目标时刻距离大于或等于第一距离值，则将样本无人机与目的地位置之间的目标时刻距离与第二距离值进一步进行比较，若在此基础上，样本无人机与目的地位置之间的目标时刻距离小于第二距离值，则根据样本无人机与目的地位置之间的目标时刻距离和样本无人机的起点位置与目的地位置之间的距离之间的比值，以及第二奖励值和第三奖励值，计算得到目标时刻奖励值；若在样本无人机与障碍物之间的目标时刻距离大于或等于第一距离值的基础上，样本无人机与目的地位置之间的目标时刻距离大于或等于第二距离值，则根据样本无人机与目的地位置之间的目标时刻距离和样本无人机的起点位置与目的地位置之间的距离之间的比值，以及第三奖励值，计算得到目标时刻奖励值。The target time distance between the sample drone and the obstacle is compared with the first distance value. If the target time distance between the sample drone and the obstacle is less than the first distance value, the target time reward value is calculated based on the target time distance between the sample drone and the obstacle, the sum of the radius of the sample drone and the radius of the obstacle, and the first reward value; if the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, the target time distance between the sample drone and the destination position is further compared with the second distance value. If on this basis, the target time distance between the sample drone and the destination position is less than the second distance value, the target time reward value is calculated according to the ratio between the target time distance between the sample drone and the destination position and the distance between the starting position of the sample drone and the destination position, as well as the second reward value and the third reward value; if on the basis that the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, the target time reward value is calculated according to the ratio between the target time distance between the sample drone and the destination position and the distance between the starting position of the sample drone and the destination position, as well as the third reward value.

其中，时刻奖励值的具体计算公式如下：in, The specific calculation formula of the moment reward value is as follows:

； ;

其中，为样本无人机与障碍物之间的时刻距离；和分别为样本无人机和障碍物的半径；、和分别为第一奖励值、第二奖励值和第三奖励值；为第二距离值，此距离值是预先设定的；为样本无人机与目的地位置之间的时刻距离；为样本无人机的起点位置与目的地位置之间的距离。in, is the distance between the sample drone and the obstacle. Time distance; and are the radii of the sample drone and obstacles, respectively; , and are the first reward value, the second reward value and the third reward value respectively; is a second distance value, which is preset; is the distance between the sample UAV and the destination location Time distance; is the distance between the starting position and the destination position of the sample UAV.

此处，第一奖励值为预先设定的常值奖励值；第三奖励值为任务完成所设定的对应的附加奖励值；第二奖励值为限制样本无人机远离障碍物所设置的威胁奖励。Here, the first reward value is a pre-set constant reward value; the third reward value is the corresponding additional reward value set for task completion; and the second reward value is a threat reward set to restrict the sample drone from staying away from obstacles.

本实施例提供的方法，通过无人机与障碍物之间的目标时刻距离和无人机与目的地之间的目标时刻距离，动态调整目标时刻奖励值，以便无人机在接近障碍物时，将受到第一奖励值的限制，以便尽量避免与障碍物碰撞。而在远离障碍物、接近目的地时，将受到第二奖励值和第三奖励值的鼓励，以便引导无人机尽可能保持安全距离的同时，鼓励无人机尽快到达目的地完成任务；在远离目的地时，将受到第三奖励值的鼓励，以便鼓励无人机尽快到达目的地完成任务，由此提升无人机的飞行避障的动态适应性、避障能力以及任务完成能力，进而提高无人机避障性能。The method provided in this embodiment dynamically adjusts the target time reward value through the target time distance between the drone and the obstacle and the target time distance between the drone and the destination, so that when the drone approaches the obstacle, it will be restricted by the first reward value to avoid collision with the obstacle as much as possible. When it is far away from the obstacle and close to the destination, it will be encouraged by the second reward value and the third reward value to guide the drone to maintain a safe distance as much as possible while encouraging the drone to reach the destination as soon as possible to complete the task; when it is far away from the destination, it will be encouraged by the third reward value to encourage the drone to reach the destination as soon as possible to complete the task, thereby improving the dynamic adaptability of the drone's flight obstacle avoidance, obstacle avoidance ability and task completion ability, thereby improving the drone's obstacle avoidance performance.

在一些实施例中，所述第二奖励值是基于如下步骤确定的：In some embodiments, the second reward value is determined based on the following steps:

在所述样本无人机与所述障碍物之间的目标时刻距离大于或等于所述第一距离值，且小于第三距离值的情况下，基于所述样本无人机与所述障碍物之间的目标时刻距离、所述样本无人机的半径、所述障碍物的半径、预设威胁半径和第四奖励值，确定所述第二奖励值；在所述样本无人机与所述障碍物之间的目标时刻距离小于所述第一距离值，或者大于或等于所述第三距离值的情况下，基于预设常数值，确定所述第二奖励值。When the target moment distance between the sample UAV and the obstacle is greater than or equal to the first distance value and less than the third distance value, the second reward value is determined based on the target moment distance between the sample UAV and the obstacle, the radius of the sample UAV, the radius of the obstacle, the preset threat radius and the fourth reward value; when the target moment distance between the sample UAV and the obstacle is less than the first distance value, or greater than or equal to the third distance value, the second reward value is determined based on a preset constant value.

此处，第三距离值是基于样本无人机的半径、障碍物的半径和预设威胁半径三者之和进行确定的。Here, the third distance value is determined based on the sum of the radius of the sample UAV, the radius of the obstacle and the preset threat radius.

可选地，第二奖励值的具体获取步骤包括：Optionally, the specific steps of obtaining the second reward value include:

将样本无人机与障碍物之间的目标时刻距离与第一距离值进行比较，在样本无人机与障碍物之间的目标时刻距离大于或等于第一距离值的情况下，进一步将样本无人机与障碍物之间的目标时刻距离与第三距离值进行比较；若据此确定样本无人机与障碍物之间的目标时刻距离大于或等于第一距离值，且小于第三距离值，则基于样本无人机的半径、障碍物的半径和预设威胁半径三者之和、样本无人机与所述障碍物之间的目标时刻距离，以及预设威胁半径，计算得到第二奖励值。若据此确定样本无人机与障碍物之间的目标时刻距离小于第一距离值，或者大于或等于第三距离值，则基于预设常数值，确定第二奖励值。此处的预设常数值可以是0或者无限接近于0的常数值。The target time distance between the sample drone and the obstacle is compared with the first distance value. When the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value, the target time distance between the sample drone and the obstacle is further compared with the third distance value. If it is determined that the target time distance between the sample drone and the obstacle is greater than or equal to the first distance value and less than the third distance value, the second reward value is calculated based on the sum of the radius of the sample drone, the radius of the obstacle and the preset threat radius, the target time distance between the sample drone and the obstacle, and the preset threat radius. If it is determined that the target time distance between the sample drone and the obstacle is less than the first distance value, or greater than or equal to the third distance value, the second reward value is determined based on the preset constant value. The preset constant value here can be 0 or a constant value infinitely close to 0.

其中，第二奖励值的计算公式为：Among them, the second reward value The calculation formula is:

； ;

其中，为预设威胁半径；为第四奖励值；此处，第四奖励值也可以为预先设定的常值奖励值。in, is the preset threat radius; is the fourth reward value; here, the fourth reward value may also be a preset constant reward value.

本实施例提供的方法，通过根据无人机与障碍物之间的目标时刻距离，动态调整第二奖励值，以更加准确地反映无人机与障碍物之间的关系，提供更有效的奖励信号，提高无人机的避障能力和飞行安全性，进而有助于提升无人机的避障性能和应对多样化场景的能力。The method provided in this embodiment dynamically adjusts the second reward value according to the target moment distance between the drone and the obstacle to more accurately reflect the relationship between the drone and the obstacle, provide a more effective reward signal, improve the drone's obstacle avoidance capability and flight safety, and thus help improve the drone's obstacle avoidance performance and ability to cope with diverse scenarios.

可选地，基于上述各实施例获取到样本无人机的目标时刻环境态势、下一时刻环境态势、目标时刻最优航向角以及目标时刻奖励值之后，即可将目标时刻环境态势、下一时刻环境态势、目标时刻最优航向角以及目标时刻奖励值构建形成的4元组，构建形成目标时刻的样本数据。Optionally, after obtaining the target moment environmental situation, next moment environmental situation, target moment optimal heading angle and target moment reward value of the sample drone based on the above embodiments, the target moment environmental situation, next moment environmental situation, target moment optimal heading angle and target moment reward value can be constructed into a 4-tuple to form sample data at the target moment.

步骤120，将所述目标时刻的样本数据更新至经验回放池，在更新后的经验回放池中的样本数据的数量达到预设数量的情况下，从所述更新后的经验回放池中抽取出目标预测区间内多个不同时刻的待处理样本数据；Step 120, updating the sample data at the target moment to the experience replay pool, and when the number of sample data in the updated experience replay pool reaches a preset number, extracting to-be-processed sample data at multiple different moments within the target prediction interval from the updated experience replay pool;

此处，经验回放池是一个存储历史样本数据的缓冲区，用于在训练过程中随机抽取样本进行训练。Here, the experience replay pool is a buffer for storing historical sample data, which is used to randomly extract samples for training during the training process.

可选地，在获取到目标时刻的样本数据之后，可以将目标时刻的样本数据更新至经验回放池；具体表示公式如下：Optionally, after obtaining the sample data at the target time, the sample data at the target time may be updated to the experience replay pool; the specific expression formula is as follows:

； ;

其中，是经验回放池；为时刻的样本数据；和分别为时刻环境态势和时刻环境态势；和分别为时刻最优航向角和时刻奖励值。in, It is the experience replay pool; for Sample data at the moment; and They are The current environmental situation and The current environmental situation; and They are The optimal heading angle at the moment Moment reward value.

接着，将更新后的经验回放池中的样本数据的数量与预设数量进行比较，若更新后的经验回放池中的样本数据的数量未达到预设数量，则继续收集下一时刻的样本数据更新至经验回放池，直到更新后的经验回放池中的样本数据的数量达到预设数量；若更新后的经验回放池中的样本数据的数量达到预设数量，则从中采样目标预测区间内多个不同时刻的待处理样本数据，以便后续基于模型预测控制的多步预测方法来预测每一个样本数据未来的环境态势、无人机的最优航向角以及奖励。实现通过滚动优化方法来最大化每个预测区间的累积回报，由此提高无人机在强化学习中的学习效率和样本利用率，进而提高无人机避障性能。Next, the number of sample data in the updated experience replay pool is compared with the preset number. If the number of sample data in the updated experience replay pool does not reach the preset number, the sample data of the next moment is collected and updated to the experience replay pool until the number of sample data in the updated experience replay pool reaches the preset number. If the number of sample data in the updated experience replay pool reaches the preset number, the sample data to be processed at multiple different moments in the target prediction interval are sampled from it, so that the multi-step prediction method based on model predictive control can be used to predict the future environmental situation of each sample data, the optimal heading angle of the drone, and the reward. The rolling optimization method is used to maximize the cumulative return of each prediction interval, thereby improving the learning efficiency and sample utilization of the drone in reinforcement learning, and thus improving the obstacle avoidance performance of the drone.

其中，从更新后的经验回放池中抽取出多个不同时刻的待处理样本数据的表示公式为：The formula for extracting the sample data to be processed at multiple different times from the updated experience replay pool is:

； ;

其中，为所需抽取的样本数据的总数量。in, The total number of sample data that needs to be extracted.

步骤130，将各所述待处理样本数据中的环境态势输入至目标策略网络，得到多个不同未来时刻的最优航向角预测值，并将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值；Step 130, inputting the environmental situation in each of the sample data to be processed into the target strategy network to obtain the optimal heading angle prediction values at multiple different future moments, and inputting the environmental situation in each of the sample data to be processed and the optimal heading angle prediction values at multiple different future moments into the target prediction network for multi-step prediction to obtain the environmental situation prediction values and reward prediction values at multiple different future moments;

可选地，在抽取到多个不同时刻的待处理样本数据之后，可以是对每一待处理样本数据执行步预测操作，以得到各待处理样本数据对应的多个不同未来时刻的环境态势预测值和奖励预测值。其中，为预先设定的预测步长，N小于。Optionally, after extracting a plurality of sample data to be processed at different times, each sample data to be processed may be subjected to Step prediction operation is performed to obtain the environmental situation prediction values and reward prediction values at multiple different future moments corresponding to each sample data to be processed. is the pre-set prediction step length, N is less than .

此处，对于每一待处理样本数据，迭代执行步预测操作的具体步骤如下：Here, for each sample data to be processed, iterative execution The specific steps of the step prediction operation are as follows:

首先，基于目标策略网络和多个不同时刻的待处理样本数据的不同时刻的环境态势，计算在不同预测步骤下对应时刻（也即多个不同未来时刻）的最优航向角，也即将不同预测步骤下的对应时刻的环境态势输入至目标策略网络，得到不同预测步骤下的对应时刻的最优航向角预测值，具体计算公式如下：First, based on the target strategy network and the environmental situation at different times of the sample data to be processed at different times, the prediction steps are calculated. The optimal heading angle at the corresponding moment (that is, multiple different future moments) under different prediction steps is input into the target strategy network to obtain the optimal heading angle prediction value at the corresponding moment under different prediction steps. The specific calculation formula is as follows:

； ;

其中，为第个样本数据在预测步骤的对应时刻的最优航向角预测值；为主策略网络；为第个样本数据在预测步骤的对应时刻的环境态势。in, For the Sample data in the prediction step The optimal heading angle prediction value at the corresponding moment; As the main strategy network; For the Sample data in the prediction step The environmental situation at the corresponding moment.

接着，与训练的目标预测网络进行交互，得到不同预测步骤的对应时刻的下一时刻的环境态势以及不同预测步骤的对应时刻的奖励值；具体将不同预测步骤下对应时刻的环境态势以及最优航向角预测值输入至目标预测网络，由目标预测网络进行多步预测，预测得到多个不同预测步骤的对应时刻的下一时刻的环境态势预测值和多个不同预测步骤的对应时刻的奖励预测值。Next, it interacts with the trained target prediction network to obtain the environmental situation at the next moment corresponding to different prediction steps and the reward value at the corresponding moment of different prediction steps; specifically, the environmental situation at the corresponding moment under different prediction steps and the optimal heading angle prediction value are input into the target prediction network, and the target prediction network performs multi-step prediction to predict the environmental situation prediction value at the next moment corresponding to multiple different prediction steps and the reward prediction value at the corresponding moment of multiple different prediction steps.

此处的目标预测网络为可以进行环境态势预测以及奖励值预测的模型；其是以目标预测区间的上一预测区间内的样本无人机的各时刻环境态势值和各时刻最优航向角为样本，以各时刻的下一时刻环境态势值和各时刻奖励值为标签，和/或以上一预测区间内的样本无人机的各时刻的环境态势预测值和最优航向角预测值，以及下一时刻环境态势预测值和奖励预测值进行迭代训练得到的目标预测网络，以使得训练的目标预测网络在低维环境中具有很好的逼近效果。The target prediction network here is a model that can perform environmental situation prediction and reward value prediction; it uses the environmental situation values and optimal heading angles of sample drones at each moment in the previous prediction interval of the target prediction interval as samples, the environmental situation values of the next moment and the reward values at each moment as labels, and/or the environmental situation prediction values and optimal heading angle prediction values of sample drones at each moment in the previous prediction interval, as well as the environmental situation prediction values and reward prediction values at the next moment to iteratively train the target prediction network, so that the trained target prediction network has a good approximation effect in a low-dimensional environment.

本实施例提供的方法，通过将从经验回放池中抽取的样本数据同时用于更新目标预测网络和后续的策略模型更新，且目标预测网络又在后续对策略模型的更新具有促进作用，使得无人机在避障问题中具有较高的学习效率，更快的最优策略收敛速度以及较少的经验重放缓冲区所需的样本容量空间。The method provided in this embodiment uses sample data extracted from the experience replay pool to update the target prediction network and the subsequent policy model update, and the target prediction network promotes the subsequent update of the policy model, so that the drone has higher learning efficiency, faster optimal policy convergence speed and less sample capacity space required for the experience replay buffer in obstacle avoidance problems.

步骤140，根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，并根据训练结果，获取优化的策略网络；其中，所述优化的策略网络用于基于当前无人机的当前时刻环境态势预测所述当前无人机的当前时刻最优航向角，以供所述当前无人机根据所述当前时刻最优航向角执行避障任务。Step 140, according to the environmental situation, optimal heading angle and reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, the target policy network is subjected to reinforcement learning training, and an optimized policy network is obtained according to the training results; wherein the optimized policy network is used to predict the optimal heading angle of the current drone at the current moment based on the current environmental situation of the current drone at the current moment, so that the current drone can perform an obstacle avoidance task according to the optimal heading angle at the current moment.

可选地，在获取到各待处理样本数据对应的多个不同未来时刻的环境态势预测值、奖励预测值和最优航向角预测值之后，可以基于多个不同未来时刻的环境态势预测值、奖励预测值和最优航向角预测值，以及待处理样本数据中所包含的环境态势、奖励值和最优航向角，获取至少一个代价函数，以基于代价函数对目标策略网络的网络参数进行梯度更新，由此实现通过与环境交互和利用预测信息，不断更新和优化目标策略网络，以得到可根据无人机的环境态势，精准地决策出无人机的最优航向角的优化的策略网络。Optionally, after obtaining the environmental situation prediction values, reward prediction values and optimal heading angle prediction values at multiple different future moments corresponding to each sample data to be processed, at least one cost function can be obtained based on the environmental situation prediction values, reward prediction values and optimal heading angle prediction values at multiple different future moments, as well as the environmental situation, reward value and optimal heading angle contained in the sample data to be processed, so as to perform gradient update on the network parameters of the target policy network based on the cost function, thereby achieving continuous updating and optimization of the target policy network by interacting with the environment and utilizing prediction information, so as to obtain an optimized policy network that can accurately decide the optimal heading angle of the drone according to the environmental situation of the drone.

随后，在获取到优化的策略网络之后，即可将当前无人机的当前时刻环境态势输入至优化的策略网络中，以由优化的策略网络基于当前时刻环境态势进行决策，以预测得到当前无人机的当前时刻最优航向角，进而便于引导当前无人机根据当前时刻最优航向角执行避障任务，以使其在当前环境下能够获得最大奖励，实现有效的避障行为。Subsequently, after obtaining the optimized strategy network, the current environmental situation of the current UAV can be input into the optimized strategy network, so that the optimized strategy network can make decisions based on the current environmental situation to predict the current optimal heading angle of the current UAV, thereby facilitating guiding the current UAV to perform obstacle avoidance tasks according to the current optimal heading angle, so that it can obtain the maximum reward in the current environment and achieve effective obstacle avoidance behavior.

相比于一些相关技术中通过基于模型的强化学习生成额外的虚拟数据，导致需要长期预测，学习成本变得昂贵，本实施例提供的方法，通过模型预测控制在各预测区间内的有限短范围内优化轨迹，提供当前局部最优解，避免长期规划的高成本，实现了投入成本和收敛收益之间的平衡。此外，相比于一些相关技术中通过概率性建模实现环境态势预测和奖励预测，本实施例提供的方法，通过确定性建模，也即基于数据驱动的方法对环境模型的状态转移和奖励模型进行建模，也就是通过神经网络对环境模型进行逼近，有利于减少神经网络的数量，提高算法的计算速度，进而提高目标策略网络的学习效率。Compared to some related technologies that generate additional virtual data through model-based reinforcement learning, which requires long-term prediction and makes learning expensive, the method provided in this embodiment optimizes the trajectory within a limited short range within each prediction interval through model predictive control, provides the current local optimal solution, avoids the high cost of long-term planning, and achieves a balance between input cost and convergence benefit. In addition, compared to some related technologies that achieve environmental situation prediction and reward prediction through probabilistic modeling, the method provided in this embodiment uses deterministic modeling, that is, a data-driven method to model the state transition and reward model of the environmental model, that is, to approximate the environmental model through a neural network, which is conducive to reducing the number of neural networks, improving the calculation speed of the algorithm, and thus improving the learning efficiency of the target policy network.

本实施例提供的网络训练方法，通过从经验回放池中抽取出多个不同时刻的待处理样本数据，并且基于目标策略网络以及目标预测网络对各待处理样本数据进行预测，得到多个不同未来时刻的最优航向角预测值、环境态势预测值和奖励预测值，以基于各待处理样本数据中的环境态势、最优航向角和奖励值，以及未来时刻的环境态势预测值、奖励预测值和最优航向角预测值，对目标策略网络进行滚动优化，不仅可以实现通过目标预测网络的虚拟环境数据生成来扩充样本数据的数量，以减少与真实环境的交互次数，提高样本利用率，加快训练速度，提高学习效率，进而使得无人机在与环境交互次数更少的情况下目标策略网络可以快速收敛到最优；还可以使得优化的决策模型既具备当前决策经验，又具备未来决策经验，以便做出更加优化的无人机避障决策，由此提高无人机避障性能。The network training method provided in this embodiment extracts sample data to be processed at multiple different moments from the experience replay pool, and predicts each sample data to be processed based on the target policy network and the target prediction network to obtain the optimal heading angle prediction values, environmental situation prediction values and reward prediction values at multiple different future moments, so as to perform rolling optimization on the target policy network based on the environmental situation, optimal heading angle and reward value in each sample data to be processed, as well as the environmental situation prediction value, reward prediction value and optimal heading angle prediction value at the future moment. This can not only realize the expansion of the number of sample data by generating virtual environment data through the target prediction network to reduce the number of interactions with the real environment, improve sample utilization, speed up training, and improve learning efficiency, so that the target policy network can quickly converge to the optimum when the number of interactions between the drone and the environment is less; it can also enable the optimized decision model to have both current decision-making experience and future decision-making experience, so as to make more optimized drone obstacle avoidance decisions, thereby improving the drone obstacle avoidance performance.

在一些实施例中，步骤130中将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值，包括：In some embodiments, in step 130, the environmental situation in each of the sample data to be processed and the optimal heading angle prediction values at multiple different future moments are input into the target prediction network for multi-step prediction to obtain environmental situation prediction values and reward prediction values at multiple different future moments, including:

此处，目标预测网络可以是包括可进行环境态势预测的态势转移函数网络，以及可进行奖励值预测的奖励函数网络。Here, the target prediction network may include a situation transfer function network capable of predicting environmental situations and a reward function network capable of predicting reward values.

可选地，环境态势预测值和奖励预测值的多步预测获取步骤具体包括：Optionally, the multi-step prediction acquisition steps of the environmental situation prediction value and the reward prediction value specifically include:

将各待处理样本数据中的环境态势和该环境态势的对应时刻的最优航向角预测值输入至奖励函数网络，得到对应时刻的奖励预测值；将各待处理样本数据中的环境态势和该环境态势的对应时刻的最优航向角预测值输入至态势转移函数网络，得到对应时刻的下一时刻的环境态势预测值；Input the environmental situation in each sample data to be processed and the optimal heading angle prediction value of the environmental situation at the corresponding time into the reward function network to obtain the reward prediction value at the corresponding time; input the environmental situation in each sample data to be processed and the optimal heading angle prediction value of the environmental situation at the corresponding time into the situation transfer function network to obtain the environmental situation prediction value at the next moment of the corresponding time;

将对应时刻的下一时刻的环境态势预测值以及对应时刻的下一时刻的最优航向角预测值输入至奖励函数网络，得到对应时刻的下一时刻的奖励预测值，将对应时刻的下一时刻的环境态势预测值以及对应时刻的下一时刻的最优航向角预测值输入至态势转移函数网络，得到下一时刻的未来时刻的环境态势预测值，对每一待处理样本数据迭代执行上述预测步骤，直到预测步长达到预先设定的预测步长。The predicted value of the environmental situation at the next moment of the corresponding moment and the predicted value of the optimal heading angle at the next moment of the corresponding moment are input into the reward function network to obtain the predicted value of the reward at the next moment of the corresponding moment. The predicted value of the environmental situation at the next moment of the corresponding moment and the predicted value of the optimal heading angle at the next moment of the corresponding moment are input into the situation transfer function network to obtain the predicted value of the environmental situation at the next future moment. The above prediction steps are iteratively performed for each sample data to be processed until the prediction step length reaches the preset prediction step length. .

其中，环境态势预测值和奖励预测值的多步预测获取公式可以表示为：Among them, the multi-step prediction acquisition formula of the environmental situation prediction value and the reward prediction value can be expressed as:

； ;

其中，是态势转移函数网络，是奖励函数网络，和分别是态势转移函数网络和奖励函数网络的模型参数。和分别表示通过态势转移函数网络预测得到的环境态势预测值和通过奖励函数网络预测得到的奖励预测值。in, is the situation transfer function network, is the reward function network, and are the model parameters of the situation transfer function network and the reward function network respectively. and They respectively represent the environmental situation prediction value obtained by the situation transfer function network and the reward prediction value obtained by the reward function network.

本实施例提供的方法，通过目标预测网络的奖励函数网络和态势转移函数网络对待处理样本数据进行多步预测，以通过确定性模型的方法来近似环境模型，也即使用数据驱动的方法学习环境的环境态势预测值和奖励预测值，提高了无人机在强化学习中的样本利用率，进而提高无人机避障性能。The method provided in this embodiment performs multi-step prediction on the sample data to be processed through the reward function network and the situation transfer function network of the target prediction network, so as to approximate the environmental model through the method of the deterministic model, that is, to use the data-driven method to learn the environmental situation prediction value and the reward prediction value of the environment, thereby improving the sample utilization rate of the UAV in reinforcement learning and further improving the obstacle avoidance performance of the UAV.

在一些实施例中，步骤140中根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，包括：根据各所述待处理样本数据中的最优航向角，以及所述最优航向角预测值、所述环境态势预测值和所述奖励预测值，获取值函数代价函数；根据所述最优航向角预测值和所述环境态势预测值，获取策略代价函数；根据各所述待处理样本数据中的环境态势和所述环境态势预测值，获取态势转移代价函数；根据各所述待处理样本数据中的奖励值和所述奖励预测值，获取奖励代价函数；根据所述值函数代价函数、所述策略代价函数、所述态势转移代价函数以及所述奖励代价函数，对所述目标策略网络进行强化学习。In some embodiments, in step 140, reinforcement learning training is performed on the target policy network according to the environmental situation, optimal heading angle and reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, including: obtaining a value function cost function according to the optimal heading angle in each of the sample data to be processed, as well as the optimal heading angle prediction value, the environmental situation prediction value and the reward prediction value; obtaining a strategy cost function according to the optimal heading angle prediction value and the environmental situation prediction value; obtaining a situation transfer cost function according to the environmental situation in each of the sample data to be processed and the environmental situation prediction value; obtaining a reward cost function according to the reward value in each of the sample data to be processed and the reward prediction value; and performing reinforcement learning on the target policy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the reward cost function.

可选地，目标策略网络的强化学习训练步骤具体包括：Optionally, the reinforcement learning training steps of the target policy network specifically include:

将各预测步骤的对应时刻的最优航向角预测值或者最优航向角，以及环境态势预测值和奖励预测值输入至目标值函数网络的附加值函数网络，根据附加值函数网络的输出以及各预测步骤的对应时刻的奖励预测值，得到各预测步骤的对应时刻的目标值，具体计算公式如下：The optimal heading angle prediction value or optimal heading angle at the corresponding moment of each prediction step, as well as the environmental situation prediction value and the reward prediction value are input into the additional value function network of the target value function network. According to the output of the additional value function network and the reward prediction value at the corresponding moment of each prediction step, the target value at the corresponding moment of each prediction step is obtained. The specific calculation formula is as follows:

； ;

其中，对第个样本数据在预测步骤的对应时刻的目标值；为预先设定的预测步长；和为系数；、和分别为奖励预测值、环境态势预测值和最优航向角预测值；为附加值函数网络。in, For Sample data in the prediction step The target value at the corresponding moment; is the pre-set prediction step size; and is the coefficient; , and They are respectively the predicted value of reward, the predicted value of environmental situation and the predicted value of optimal heading angle; is the value-added function network.

将各预测步骤的对应时刻的最优航向角预测值和环境态势预测值输入至目标值函数网络的主值函数网络，以根据主值函数网络的输出与各预测步骤的对应时刻的目标值之间的差值，计算得到值函数代价函数；其中，值函数代价函数具体计算公式为：The optimal heading angle prediction value and environmental situation prediction value at the corresponding time of each prediction step are input into the main value function network of the target value function network, so as to calculate the value function cost function according to the difference between the output of the main value function network and the target value at the corresponding time of each prediction step; wherein, the value function cost function The specific calculation formula is:

； ;

其中，为各预测区间内所需抽取的样本数据的数量；、和分别为各预测步骤的对应时刻的目标值、环境态势预测值和最优航向角预测值；为主值函数网络；为预先设定的预测步长。in, is the number of sample data required to be extracted in each prediction interval; , and are the target value, environmental situation prediction value and optimal heading angle prediction value at the corresponding moment of each prediction step respectively; is the main value function network; is the preset prediction step size.

将各预测步骤的对应时刻的最优航向角预测值和环境态势预测值输入至目标值函数网络中的主值函数网络，基于主值函数网络的输出，计算得到策略代价函数；其中，策略代价函数具体计算公式为：The optimal heading angle prediction value and environmental situation prediction value at the corresponding moment of each prediction step are input into the principal value function network in the target value function network, and the strategy cost function is calculated based on the output of the principal value function network; wherein, the strategy cost function The specific calculation formula is:

； ;

根据对应时刻的环境态势和环境态势预测值之间的差值，计算得到态势转移代价函数；其中，态势转移代价函数具体计算公式为：According to the difference between the environmental situation at the corresponding moment and the environmental situation prediction value, the situation transfer cost function is calculated; among them, the situation transfer cost function The specific calculation formula is:

； ;

和分别为第个样本数据的对应时刻的环境态势预测值和环境态势。 and Respectively The environmental situation prediction value and environmental situation at the corresponding moment of each sample data.

根据对应时刻的奖励值和奖励预测值之间的差值，计算得到奖励代价函数；其中，奖励代价函数具体计算公式为：The reward cost function is calculated based on the difference between the reward value at the corresponding moment and the predicted reward value; where the reward cost function The specific calculation formula is:

； ;

和分别为第个样本数据的对应时刻的奖励预测值和奖励值。 and Respectively The reward prediction value and reward value at the corresponding moment of the sample data.

随后，在获取到上述代价函数之后，即可基于值函数代价函数对目标值函数网络、目标策略网络以及目标预测网络进行优化，并将此次训练的策略网络作为下一预测区间对应的目标策略网络，将此次训练的值函数网络作为下一预测区间对应的目标值函数网络，将此次训练的预测网络作为下一预测区间对应的目标预测网络，在此基础上，基于下一预测区间内多个不同时刻的待处理样本数据以及预测的最优航向角预测值、环境态势预测值和奖励预测值对下一预测区间对应的目标值函数网络、下一预测区间对应的目标策略网络以及下一预测区间对应的目标预测网络进行迭代优化，直到迭代训练的回合次数达到设定的训练回合，由此得到优化的策略网络。Subsequently, after obtaining the above-mentioned cost function, the target value function network, the target policy network and the target prediction network can be optimized based on the value function cost function, and the policy network trained this time is used as the target policy network corresponding to the next prediction interval, the value function network trained this time is used as the target value function network corresponding to the next prediction interval, and the prediction network trained this time is used as the target prediction network corresponding to the next prediction interval. On this basis, based on the to-be-processed sample data at multiple different times in the next prediction interval and the predicted optimal heading angle prediction value, environmental situation prediction value and reward prediction value, the target value function network corresponding to the next prediction interval, the target policy network corresponding to the next prediction interval and the target prediction network corresponding to the next prediction interval are iteratively optimized until the number of iterative training rounds reaches the set training rounds, thereby obtaining the optimized policy network.

本实施例提供的方法，通过应用滚动优化方法来最大化每个预测区间的累积回报，并融合多种代价函数进行策略网络优化，有助于提高策略的学习效率和样本利用率，由此使得优化的策略网络具有较高的避障性能。The method provided in this embodiment maximizes the cumulative return of each prediction interval by applying a rolling optimization method, and integrates multiple cost functions to optimize the policy network, which helps to improve the learning efficiency and sample utilization of the policy, thereby enabling the optimized policy network to have higher obstacle avoidance performance.

图2为本发明提供的网络训练方法的流程示意图之二；如图2，该方法的完整流程包括：FIG. 2 is a second flow chart of the network training method provided by the present invention; as shown in FIG. 2 , the complete flow of the method includes:

步骤210，基于目标策略网络和目标时刻环境态势计算样本无人机的目标时刻最优航向角；Step 210, calculating the optimal heading angle of the sample UAV at the target time based on the target strategy network and the environmental situation at the target time;

步骤220，根据无人机动力学约束模型、运动学约束和扰动流场法得到样本无人机的下一时刻位置，并结合障碍物的下一时刻位置和目的地位置得到样本无人机的下一时刻环境态势、目标时刻奖励值，并根据目标时刻最优航向角、目标时刻环境态势、下一时刻环境态势、目标时刻奖励值，构建目标时刻的样本数据，再将样本数据存储到经验回放池中；Step 220, the next moment position of the sample UAV is obtained according to the UAV dynamic constraint model, kinematic constraint and disturbance flow field method, and the next moment environmental situation and target moment reward value of the sample UAV are obtained in combination with the next moment position of the obstacle and the destination position, and the sample data at the target moment is constructed according to the optimal heading angle at the target moment, the environmental situation at the target moment, the next moment environmental situation and the target moment reward value, and then the sample data is stored in the experience replay pool;

步骤230，判断步骤220中经验回放池的样本数据的数量是否超过一定值；若超过一定值，则执行步骤240；若未超过一定值，则跳转至步骤210；Step 230, determining whether the number of sample data in the experience playback pool in step 220 exceeds a certain value; if it exceeds the certain value, executing step 240; if it does not exceed the certain value, jumping to step 210;

步骤240，从经验回放池中采样部分待处理样本数据，基于预测网络预测控制的多步预测方法来预测每一个待处理样本数据对应的多个未来时刻的环境态势、无人机的最优航向角以及奖励值；Step 240, sampling part of the sample data to be processed from the experience replay pool, and predicting the environmental situation, the optimal heading angle of the UAV and the reward value at multiple future moments corresponding to each sample data to be processed based on the multi-step prediction method of the prediction network predictive control;

步骤250，通过步骤240中待处理样本数据的对应时刻和未来时刻的环境态势、无人机的最优航向角、奖励值计算至少一种代价函数，再根据代价函数对目标策略网络和目标预测网络进行梯度更新；Step 250, calculating at least one cost function by using the environmental situation at the corresponding moment and future moment of the sample data to be processed in step 240, the optimal heading angle of the drone, and the reward value, and then performing gradient update on the target policy network and the target prediction network according to the cost function;

步骤260，重复步骤210-250，直至达到指定训练回合，得到优化的策略网络。Step 260, repeat steps 210-250 until a specified training round is reached to obtain an optimized policy network.

图3为本发明提供的网络训练方法的流程示意图之三；如图3所示，步骤240具体执行步骤包括：FIG. 3 is a third flow chart of the network training method provided by the present invention; as shown in FIG. 3 , the specific execution steps of step 240 include:

步骤310，从经验回放池中采样多个不同时刻的待处理样本数据；Step 310, sampling a plurality of sample data to be processed at different times from the experience replay pool;

步骤320，基于当前预测步骤的对应时刻的环境态势预测值以及目标策略网络，计算当前预测步骤的对应时刻的最优航向角预测值；需要说明的是，对于当前预测步骤为初始预测步骤的情况下，当前预测步骤的对应时刻的环境态势预测值为各待处理样本数据中的环境态势；Step 320, based on the environmental situation prediction value at the corresponding moment of the current prediction step and the target strategy network, calculate the optimal heading angle prediction value at the corresponding moment of the current prediction step; it should be noted that when the current prediction step is the initial prediction step, the environmental situation prediction value at the corresponding moment of the current prediction step is the environmental situation in each sample data to be processed;

步骤330，与预测网络交互，预测得到下一预测步骤的对应时刻的环境态势预测值和当前预测步骤的对应时刻的奖励预测值；Step 330, interacting with the prediction network to predict the environmental situation prediction value at the corresponding moment of the next prediction step and the reward prediction value at the corresponding moment of the current prediction step;

步骤340，计算目标策略网络的更新的目标值；Step 340, calculating an updated target value of the target policy network;

步骤350，判断当前预测步骤是否大于预先设定的预测步长；若大于预先设定的预测步长，则结束步骤240，跳转至步骤250；若不大于预先设定的预测步长，则跳转至步骤320。Step 350, determine whether the current prediction step is greater than the preset prediction step length; if it is greater than the preset prediction step length, end step 240 and jump to step 250; if it is not greater than the preset prediction step length, jump to step 320.

综上，针对强化学习在无人机避障中存在训练效率低的问题，本实施例提供的避障方法，通过数据驱动的方法学习环境态势和奖励模型，并采用滚动优化区间进行无人机的策略训练。该方法具有较高的学习效率、更快的策略趋近于最优值的收敛速度以及较少的经验重放缓冲区所需的样本容量空间，使得无人机仅通过少量尝试就能学习最佳策略，促进了强化学习方法在避障问题中的应用，提升了无人机避障性能。In summary, in view of the problem of low training efficiency of reinforcement learning in drone obstacle avoidance, the obstacle avoidance method provided in this embodiment learns the environmental situation and reward model through a data-driven method, and uses a rolling optimization interval to train the drone's strategy. This method has higher learning efficiency, faster convergence speed of the strategy to the optimal value, and less sample capacity space required for the experience replay buffer, so that the drone can learn the best strategy with only a small number of attempts, which promotes the application of reinforcement learning methods in obstacle avoidance problems and improves the obstacle avoidance performance of drones.

图4为本发明提供的无人机避障方法的流程示意图；如图4所示，该方法包括：步骤410，获取当前无人机的当前时刻环境态势；步骤420，将所述当前时刻环境态势输入至优化的策略网络，得到所述当前无人机的当前时刻最优航向角；步骤430，根据所述当前时刻最优航向角，控制所述当前无人机执行避障任务；其中，所述优化的策略网络是基于上述各实施例提供的所述网络训练方法训练得到。Figure 4 is a flow chart of the obstacle avoidance method for unmanned aerial vehicles provided by the present invention; as shown in Figure 4, the method includes: step 410, obtaining the current environmental situation of the current unmanned aerial vehicle; step 420, inputting the current environmental situation into the optimized strategy network to obtain the current optimal heading angle of the current unmanned aerial vehicle; step 430, controlling the current unmanned aerial vehicle to perform the obstacle avoidance task according to the current optimal heading angle; wherein the optimized strategy network is trained based on the network training method provided in the above-mentioned embodiments.

可选地，在基于步骤110-140，获取到优化的策略网络之后，将当前无人机的当前时刻环境态势输入至优化的策略网络中，以由优化的策略网络基于当前时刻环境态势进行决策，以预测得到当前无人机的当前时刻最优航向角，进而便于引导当前无人机根据当前时刻最优航向角执行避障任务，以使其在当前环境下能够获得最大奖励，实现有效的避障行为。本方法，实现通过目标预测网络的虚拟环境数据生成来扩充样本数据的数量，以减少与真实环境的交互次数，提高样本利用率，加快训练速度，提高学习效率，进而使得无人机在与环境交互次数更少的情况下目标策略网络可以快速收敛到最优；还可以使得优化的决策模型既具备当前决策经验，又具备未来决策经验，以便做出更加优化的无人机避障决策，由此提高无人机避障性能。Optionally, after obtaining the optimized policy network based on steps 110-140, the current environmental situation of the current drone is input into the optimized policy network, so that the optimized policy network makes a decision based on the current environmental situation to predict the optimal heading angle of the current drone at the current moment, thereby facilitating the current drone to perform the obstacle avoidance task according to the optimal heading angle at the current moment, so that it can obtain the maximum reward in the current environment and achieve effective obstacle avoidance behavior. This method realizes the expansion of the number of sample data by generating virtual environment data of the target prediction network, so as to reduce the number of interactions with the real environment, improve sample utilization, speed up training, and improve learning efficiency, so that the target policy network can quickly converge to the optimal when the drone interacts with the environment less times; it can also make the optimized decision model have both current decision experience and future decision experience, so as to make more optimized drone obstacle avoidance decisions, thereby improving the drone obstacle avoidance performance.

下面对本发明提供的网络训练装置进行描述，下文描述的网络训练装置与上文描述的网络训练方法可相互对应参照。The network training device provided by the present invention is described below. The network training device described below and the network training method described above can be referenced to each other.

图5为本发明提供的网络训练装置的结构示意图；如图5所示，该装置包括：构建单元510用于根据样本无人机的目标时刻环境态势、目标时刻最优航向角、下一时刻环境态势，以及目标时刻奖励值，构建目标时刻的样本数据；抽取单元520用于将所述目标时刻的样本数据更新至经验回放池，在更新后的经验回放池中的样本数据的数量达到预设数量的情况下，从所述更新后的经验回放池中抽取出目标预测区间内多个不同时刻的待处理样本数据；预测单元530用于将各所述待处理样本数据中的环境态势输入至目标策略网络，得到多个不同未来时刻的最优航向角预测值，并将各所述待处理样本数据中的环境态势，以及所述多个不同未来时刻的最优航向角预测值输入至目标预测网络进行多步预测，得到多个不同未来时刻的环境态势预测值和奖励预测值；优化单元540用于根据各所述待处理样本数据中的环境态势、最优航向角和奖励值，以及所述环境态势预测值、所述奖励预测值和所述最优航向角预测值，对所述目标策略网络进行强化学习训练，并根据训练结果，获取优化的策略网络；其中，所述优化的策略网络用于基于当前无人机的当前时刻环境态势预测所述当前无人机的当前时刻最优航向角，以供所述当前无人机根据所述当前时刻最优航向角执行避障任务。本实施例提供的网络训练装置，实现通过目标预测网络的虚拟环境数据生成来扩充样本数据的数量，以减少与真实环境的交互次数，提高样本利用率，加快训练速度，提高学习效率，进而使得无人机在与环境交互次数更少的情况下目标策略网络可以快速收敛到最优；还可以使得优化的决策模型既具备当前决策经验，又具备未来决策经验，以便做出更加优化的无人机避障决策，由此提高无人机避障性能。Figure 5 is a schematic diagram of the structure of the network training device provided by the present invention; as shown in Figure 5, the device includes: a construction unit 510 is used to construct sample data at the target moment according to the environmental situation of the sample drone at the target moment, the optimal heading angle at the target moment, the environmental situation at the next moment, and the reward value at the target moment; an extraction unit 520 is used to update the sample data at the target moment to the experience replay pool, and when the number of sample data in the updated experience replay pool reaches a preset number, extract the sample data to be processed at multiple different moments within the target prediction interval from the updated experience replay pool; a prediction unit 530 is used to input the environmental situation in each of the sample data to be processed into the target strategy network, obtain the optimal heading angle prediction value at multiple different future moments, and each of the The environmental situation in the sample data to be processed and the optimal heading angle prediction values at multiple different future moments are input into the target prediction network for multi-step prediction to obtain environmental situation prediction values and reward prediction values at multiple different future moments; the optimization unit 540 is used to perform reinforcement learning training on the target policy network according to the environmental situation, optimal heading angle and reward value in each of the sample data to be processed, as well as the environmental situation prediction value, the reward prediction value and the optimal heading angle prediction value, and obtain an optimized policy network according to the training results; wherein the optimized policy network is used to predict the optimal heading angle of the current drone at the current moment based on the current environmental situation of the current drone, so that the current drone can perform an obstacle avoidance task according to the optimal heading angle at the current moment. The network training device provided in this embodiment expands the amount of sample data by generating virtual environment data of the target prediction network, so as to reduce the number of interactions with the real environment, improve sample utilization, accelerate training speed, and improve learning efficiency, so that the target strategy network of the drone can converge to the optimal quickly when the number of interactions with the environment is reduced; it can also make the optimized decision model have both current decision-making experience and future decision-making experience, so as to make more optimized drone obstacle avoidance decisions, thereby improving the drone obstacle avoidance performance.

图6为本发明提供的无人机避障装置的结构示意图；如图6所示，该装置包括：获取单元610用于获取当前无人机的当前时刻环境态势；决策单元620用于将所述当前时刻环境态势输入至优化的策略网络，得到所述当前无人机的当前时刻最优航向角；避障控制单元630用于根据所述当前时刻最优航向角，控制所述当前无人机执行避障任务；其中，所述优化的策略网络是基于如上述各实施例提供的所述网络训练方法训练得到。本装置，实现通过目标预测网络的虚拟环境数据生成来扩充样本数据的数量，以减少与真实环境的交互次数，提高样本利用率，加快训练速度，提高学习效率，进而使得无人机在与环境交互次数更少的情况下目标策略网络可以快速收敛到最优；还可以使得优化的决策模型既具备当前决策经验，又具备未来决策经验，以便做出更加优化的无人机避障决策，由此提高无人机避障性能。FIG6 is a schematic diagram of the structure of the obstacle avoidance device for unmanned aerial vehicles provided by the present invention; as shown in FIG6 , the device includes: an acquisition unit 610 for acquiring the current environmental situation of the current unmanned aerial vehicle; a decision unit 620 for inputting the current environmental situation into the optimized strategy network to obtain the optimal heading angle of the current unmanned aerial vehicle at the current moment; an obstacle avoidance control unit 630 for controlling the current unmanned aerial vehicle to perform the obstacle avoidance task according to the optimal heading angle at the current moment; wherein the optimized strategy network is obtained by training based on the network training method provided in the above embodiments. This device realizes the expansion of the number of sample data by generating virtual environment data of the target prediction network, so as to reduce the number of interactions with the real environment, improve sample utilization, speed up training speed, and improve learning efficiency, so that the target strategy network can quickly converge to the optimal when the number of interactions with the environment is less; it can also make the optimized decision model have both current decision experience and future decision experience, so as to make a more optimized obstacle avoidance decision for the unmanned aerial vehicle, thereby improving the obstacle avoidance performance of the unmanned aerial vehicle.

图7示例了一种电子设备的实体结构示意图，如图7所示，该电子设备可以包括：处理器(processor)710、通信接口(Communications Interface)720、存储器(memory)730和通信总线740，其中，处理器710，通信接口720，存储器730通过通信总线740完成相互间的通信。处理器710可以调用存储器730中的逻辑指令，以执行上述各方法所提供的网络训练方法或者无人机避障方法。FIG7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG7 , the electronic device may include: a processor 710, a communications interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with each other through the communication bus 740. The processor 710 may call the logic instructions in the memory 730 to execute the network training method or the drone obstacle avoidance method provided by the above methods.

此外，上述的存储器730中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the logic instructions in the above-mentioned memory 730 can be implemented in the form of a software functional unit and can be stored in a computer-readable storage medium when it is sold or used as an independent product. Based on this understanding, the technical solution of the present invention is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk and other media that can store program codes.

另一方面，本发明还提供一种计算机程序产品，所述计算机程序产品包括计算机程序，计算机程序可存储在非暂态计算机可读存储介质上，所述计算机程序被处理器执行时，计算机能够执行上述各方法所提供的网络训练方法或者无人机避障方法。On the other hand, the present invention also provides a computer program product, which includes a computer program. The computer program can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the network training method or drone obstacle avoidance method provided by the above methods.

又一方面，本发明还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现以执行上述各方法提供的网络训练方法或者无人机避障方法。On the other hand, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, is implemented to execute the network training method or drone obstacle avoidance method provided by the above methods.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art may understand and implement it without creative work.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiments.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of network training, comprising:

According to the environmental situation at the target moment, the optimal course angle at the target moment, the environmental situation at the next moment and the rewarding value at the target moment of the sample unmanned aerial vehicle, constructing sample data at the target moment;

Updating the sample data at the target moment to an experience playback pool, and extracting a plurality of sample data to be processed at different moments in a target prediction interval from the updated experience playback pool under the condition that the number of the sample data in the updated experience playback pool reaches a preset number;

Inputting the environmental situation in each sample data to be processed into a target strategy network to obtain a plurality of optimal heading angle predicted values at different future times, and inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted values at the different future times into the target strategy network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at the different future times;

Performing reinforcement learning training on the target strategy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and acquiring an optimized strategy network according to a training result;

the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.

2. The network training method according to claim 1, wherein the target time environment situation and the target time optimal heading angle are obtained based on the steps of:

Determining the target moment environment situation according to the target moment position, the radius and the destination position of the sample unmanned aerial vehicle, the target moment position, the target moment speed and the radius of the obstacle and the target moment distance between the sample unmanned aerial vehicle and the obstacle;

And inputting the environment situation at the target moment into the target strategy network to obtain the optimal course angle at the target moment.

3. The network training method of claim 1, wherein the next time environmental situation is obtained based on the steps of:

calculating to obtain the next moment position of the sample unmanned aerial vehicle according to an unmanned aerial vehicle dynamics constraint model, a kinematic constraint and a disturbance flow field method;

And determining the environmental situation of the next moment according to the position, the radius and the destination position of the sample unmanned aerial vehicle at the next moment, the position, the speed and the radius of the obstacle at the next moment and the distance between the sample unmanned aerial vehicle and the obstacle at the next moment.

4. A network training method as claimed in any one of claims 1 to 3, wherein the target time prize value is obtained on the basis of:

Determining a target moment rewarding value according to the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle and a first rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than a first distance value;

Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, the second rewarding value and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is smaller than the second distance value;

Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is larger than or equal to the second distance value;

wherein the first prize value is a constant prize value; the second prize value is a threat prize for limiting the sample drone from moving away from the obstacle; and the third prize value is an additional prize value corresponding to the completion of the task.

5. The network training method of claim 4, wherein the second prize value is determined based on the steps of:

Determining the second rewarding value based on the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle, a preset threat radius and a fourth rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and smaller than a third distance value;

And determining the second rewarding value based on a preset constant value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value or larger than or equal to the third distance value.

6. A network training method according to any one of claims 1 to 3, wherein the inputting the environmental situation in each of the sample data to be processed and the optimal heading angle predicted values at the different future times into the target prediction network to perform multi-step prediction, to obtain environmental situation predicted values and rewards predicted values at the different future times, includes:

Inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a reward function network of the target prediction network, and inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a situation transfer function network of the target prediction network to perform multi-step prediction, so as to obtain the reward predicted value and the environmental situation predicted value output by the reward function network.

7. A network training method according to any one of claims 1 to 3, wherein said reinforcement learning training of said target strategy network according to the environmental situation, the optimal course angle and the prize value in each of said sample data to be processed, and said environmental situation predicted value, said prize predicted value and said optimal course angle predicted value comprises:

Obtaining a value function cost function according to the optimal course angle in each piece of sample data to be processed, the optimal course angle predicted value, the environment situation predicted value and the rewarding predicted value;

acquiring a strategy cost function according to the optimal course angle predicted value and the environment situation predicted value;

acquiring a situation transfer cost function according to the environmental situation and the environmental situation predicted value in each sample data to be processed;

Obtaining a reward cost function according to the reward value and the reward predicted value in each sample data to be processed;

And performing reinforcement learning on the target strategy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the rewarding cost function.

8. An unmanned aerial vehicle obstacle avoidance method, comprising:

Acquiring the current moment environmental situation of the current unmanned aerial vehicle;

inputting the current environmental situation to an optimized strategy network to obtain the current optimal course angle of the current unmanned aerial vehicle;

according to the optimal course angle at the current moment, controlling the current unmanned aerial vehicle to execute an obstacle avoidance task;

Wherein the optimized policy network is trained based on the network training method according to any one of claims 1 to 7.

9. A network training device, comprising:

The construction unit is used for constructing sample data of the target moment according to the environmental situation of the target moment, the optimal course angle of the target moment, the environmental situation of the next moment and the rewarding value of the target moment of the sample unmanned plane;

the extraction unit is used for updating the sample data at the target moment to the experience playback pool, and extracting the sample data to be processed at a plurality of different moments in the target prediction interval from the updated experience playback pool under the condition that the quantity of the sample data in the updated experience playback pool reaches the preset quantity;

The prediction unit is used for inputting the environmental situation in each sample data to be processed into a target strategy network to obtain a plurality of optimal heading angle predicted values at different future moments, inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted values at the different future moments into the target prediction network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at the different future moments;

The optimizing unit is used for performing reinforcement learning training on the target strategy network according to the environment situation, the optimal course angle and the rewarding value in each sample data to be processed, the environment situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and obtaining an optimized strategy network according to training results;

10. An unmanned aerial vehicle keeps away barrier device, characterized in that includes:

The acquisition unit is used for acquiring the current environmental situation of the current unmanned aerial vehicle at the current moment;

The decision unit is used for inputting the current time environment situation into an optimized strategy network to obtain the current time optimal course angle of the current unmanned aerial vehicle;

The obstacle avoidance control unit is used for controlling the current unmanned aerial vehicle to execute an obstacle avoidance task according to the current time optimal course angle;