CN118034355A - Network training method, unmanned aerial vehicle obstacle avoidance method and device

Network training method, unmanned aerial vehicle obstacle avoidance method and device

Info

Publication number
CN118034355A
CN118034355A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
value
target
sample
Prior art date
Legal status
Granted
Application number
CN202410447633.0A
Other languages
Chinese (zh)
Other versions
CN118034355B (en)
Inventor
刘克新
吴其臻
吕金虎
陈磊
朱国梁
Current Assignee
Beihang University
Beijing Institute of Technology BIT
Academy of Mathematics and Systems Science of CAS
Original Assignee
Academy of Mathematics and Systems Science of CAS
Priority date
Filing date
Publication date
Application filed by Academy of Mathematics and Systems Science of CAS
Priority to CN202410447633.0A
Publication of CN118034355A
Application granted
Publication of CN118034355B
Legal status: Active


Abstract

The invention provides a network training method, an unmanned aerial vehicle obstacle avoidance method and a device, relating to the technical field of unmanned aerial vehicle control. The method comprises the following steps: updating an experience playback pool with sample data at a target moment, constructed from the sample unmanned aerial vehicle's environment situation at the target moment, environment situation at the next moment, optimal course angle at the target moment and reward value at the target moment; when the number of sample data in the updated experience playback pool reaches a preset number, extracting a plurality of sample data to be processed from the updated experience playback pool and performing multi-step prediction to obtain optimal course angle predicted values, environment situation predicted values and reward predicted values at a plurality of future moments; and training the target strategy network according to the environment situation, optimal course angle and reward value in each sample data to be processed, together with the environment situation predicted values, reward predicted values and optimal course angle predicted values, to obtain an optimized strategy network. The invention effectively improves the learning efficiency and the sample utilization rate in unmanned aerial vehicle obstacle avoidance.

Description

Network training method, unmanned aerial vehicle obstacle avoidance method and device
Technical Field
The invention relates to the technical field of unmanned aerial vehicle control, in particular to a network training method, an unmanned aerial vehicle obstacle avoidance method and an unmanned aerial vehicle obstacle avoidance device.
Background
The unmanned aerial vehicle obstacle avoidance problem can be described as the task of navigating an unmanned aerial vehicle through a space containing obstacles. The task typically follows optimization criteria such as minimum cost of work, minimum flight distance or minimum flight time. Common traditional obstacle avoidance methods include dynamic programming algorithms, artificial potential field methods, sampling-based methods and graph-theory-based methods, but these methods require different models to be built for different situations. In an actual unmanned aerial vehicle flight environment, however, the working environment is complex and unpredictable, and the unmanned aerial vehicle is often required to detect and make real-time decisions in an unknown environment.
With the progress of artificial intelligence technology, reinforcement learning is being applied ever more widely in fields such as games, robotics and the internet, and is attracting a great deal of attention. Model-free reinforcement learning is a common method for solving the unmanned aerial vehicle obstacle avoidance problem and is widely applied to decision-making in unknown environments. However, because interaction between the unmanned aerial vehicle and the environment is limited, the sample utilization rate and autonomous learning efficiency of model-free reinforcement learning are low, and the obstacle avoidance performance of the unmanned aerial vehicle is consequently poor.
Disclosure of Invention
The invention provides a network training method, an unmanned aerial vehicle obstacle avoidance method and a device, which are used for solving the defects that in the prior art, the interaction between an unmanned aerial vehicle and the environment is limited, so that the sample utilization rate of model-free reinforcement learning is low, the autonomous learning efficiency is low, and the obstacle avoidance performance of the unmanned aerial vehicle is poor.
The invention provides a network training method, which comprises the following steps:
According to the environmental situation at the target moment, the optimal course angle at the target moment, the environmental situation at the next moment and the rewarding value at the target moment of the sample unmanned aerial vehicle, constructing sample data at the target moment;
Updating the sample data at the target moment to an experience playback pool, and extracting a plurality of sample data to be processed at different moments in a target prediction interval from the updated experience playback pool under the condition that the number of the sample data in the updated experience playback pool reaches a preset number;
Inputting the environmental situation in each sample data to be processed into a target strategy network to obtain optimal course angle predicted values at a plurality of different future moments, and inputting the environmental situation in each sample data to be processed and the optimal course angle predicted values at the plurality of different future moments into a target prediction network to perform multi-step prediction, so as to obtain environmental situation predicted values and reward predicted values at a plurality of different future moments;
Performing reinforcement learning training on the target strategy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and acquiring an optimized strategy network according to a training result;
the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.
According to the network training method provided by the invention, the target moment environment situation and the target moment optimal course angle are obtained based on the following steps:
Determining the target moment environment situation according to the target moment position, the radius and the destination position of the sample unmanned aerial vehicle, the target moment position, the target moment speed and the radius of the obstacle and the target moment distance between the sample unmanned aerial vehicle and the obstacle;
And inputting the environment situation at the target moment into the target strategy network to obtain the optimal course angle at the target moment.
According to the network training method provided by the invention, the environmental situation at the next moment is obtained based on the following steps:
calculating to obtain the next moment position of the sample unmanned aerial vehicle according to an unmanned aerial vehicle dynamics constraint model, a kinematic constraint and a disturbance flow field method;
And determining the environmental situation of the next moment according to the position, the radius and the destination position of the sample unmanned aerial vehicle at the next moment, the position, the speed and the radius of the obstacle at the next moment and the distance between the sample unmanned aerial vehicle and the obstacle at the next moment.
According to the network training method provided by the invention, the target time rewarding value is acquired based on the following steps:
Determining a target moment rewarding value according to the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle and a first rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than a first distance value;
Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, the second rewarding value and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is smaller than the second distance value;
Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is larger than or equal to the second distance value;
wherein the first reward value is a constant reward value; the second reward value is a threat reward used to constrain the sample unmanned aerial vehicle to keep away from the obstacle; and the third reward value is an additional reward value corresponding to completion of the task.
According to the network training method provided by the invention, the second reward value is determined based on the following steps:
Determining the second rewarding value based on the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle, a preset threat radius and a fourth rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and smaller than a third distance value;
And determining the second rewarding value based on a preset constant value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value or larger than or equal to the third distance value.
According to the network training method provided by the invention, the environmental situation in each sample data to be processed and the optimal course angle predicted values at a plurality of different future moments are input into the target prediction network to perform multi-step prediction, so as to obtain environmental situation predicted values and reward predicted values at a plurality of different future moments, which comprises the following steps:
Inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a reward function network of the target prediction network, and inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a situation transfer function network of the target prediction network to perform multi-step prediction, so as to obtain the reward predicted value output by the reward function network and the environmental situation predicted value output by the situation transfer function network.
According to the network training method provided by the invention, the reinforcement learning training is performed on the target strategy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and the method comprises the following steps:
Obtaining a value function cost function according to the optimal course angle in each piece of sample data to be processed, the optimal course angle predicted value, the environment situation predicted value and the rewarding predicted value;
acquiring a strategy cost function according to the optimal course angle predicted value and the environment situation predicted value;
acquiring a situation transfer cost function according to the environmental situation and the environmental situation predicted value in each sample data to be processed;
Obtaining a reward cost function according to the reward value and the reward predicted value in each sample data to be processed;
And performing reinforcement learning on the target strategy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the rewarding cost function.
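The section above names four cost functions but does not reproduce their formulas. The following is a minimal PyTorch-style sketch of how such cost functions could be formed, assuming standard mean-squared-error and deterministic-policy-gradient shapes; all function names, arguments and loss forms are illustrative assumptions rather than the patent's exact expressions.

```python
import torch.nn.functional as F

def training_losses(q_net, policy_net, s, a, r, s_pred, r_pred, q_target):
    """Illustrative cost functions; the exact forms are assumptions, not the patent's losses.

    s, a, r          : environment situation, optimal course angle and reward from replay samples
    s_pred, r_pred   : multi-step predictions from the target prediction network
    q_target         : bootstrapped return built from the predicted rewards and situations
    """
    # Value-function cost: squared error between Q(s, a) and the rolled-out target.
    value_loss = F.mse_loss(q_net(s, a), q_target)

    # Policy cost: deterministic policy gradient, maximise Q under the predicted situations.
    policy_loss = -q_net(s_pred, policy_net(s_pred)).mean()

    # Situation-transfer cost: predicted situation vs. the situation stored in the sample.
    transition_loss = F.mse_loss(s_pred, s)

    # Reward cost: predicted reward vs. the reward stored in the sample.
    reward_loss = F.mse_loss(r_pred, r)

    return value_loss, policy_loss, transition_loss, reward_loss
```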
The invention also provides an obstacle avoidance method of the unmanned aerial vehicle, which comprises the following steps:
Acquiring the current moment environmental situation of the current unmanned aerial vehicle;
inputting the current environmental situation to an optimized strategy network to obtain the current optimal course angle of the current unmanned aerial vehicle;
according to the optimal course angle at the current moment, controlling the current unmanned aerial vehicle to execute an obstacle avoidance task;
wherein the optimized policy network is trained based on the network training method as described in any one of the above.
The invention also provides a network training device, comprising:
The construction unit is used for constructing sample data of the target moment according to the environmental situation of the target moment, the optimal course angle of the target moment, the environmental situation of the next moment and the rewarding value of the target moment of the sample unmanned plane;
the extraction unit is used for updating the sample data at the target moment to the experience playback pool, and extracting the sample data to be processed at a plurality of different moments in the target prediction interval from the updated experience playback pool under the condition that the quantity of the sample data in the updated experience playback pool reaches the preset quantity;
The prediction unit is used for inputting the environmental situation in each sample data to be processed into a target strategy network to obtain a plurality of optimal heading angle predicted values at different future moments, inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted values at the different future moments into the target prediction network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at the different future moments;
The optimizing unit is used for performing reinforcement learning training on the target strategy network according to the environment situation, the optimal course angle and the rewarding value in each sample data to be processed, the environment situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and obtaining an optimized strategy network according to training results;
the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.
The invention also provides an unmanned aerial vehicle obstacle avoidance device, which comprises:
The acquisition unit is used for acquiring the current environmental situation of the current unmanned aerial vehicle at the current moment;
The decision unit is used for inputting the current time environment situation into an optimized strategy network to obtain the current time optimal course angle of the current unmanned aerial vehicle;
The obstacle avoidance control unit is used for controlling the current unmanned aerial vehicle to execute an obstacle avoidance task according to the current time optimal course angle;
wherein the optimized policy network is trained based on the network training method as described in any one of the above.
The invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the network training method according to any one of the above or the unmanned aerial vehicle obstacle avoidance method according to any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a network training method as described in any one of the above, or a drone obstacle avoidance method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a network training method as described in any one of the above, or a drone obstacle avoidance method as described in any one of the above.
According to the network training method, the unmanned aerial vehicle obstacle avoidance method and the device, sample data to be processed at a plurality of different moments are extracted from the experience playback pool and predicted based on the target strategy network and the target prediction network, so that optimal course angle predicted values, environment situation predicted values and reward predicted values at a plurality of different future moments are obtained. The target strategy network is then subjected to rolling optimization based on the environment situation, optimal course angle and reward value in the sample data to be processed and on the environment situation predicted values, reward predicted values and optimal course angle predicted values at the future moments. Because the target prediction network generates virtual environment data, the amount of sample data can be expanded and the number of interactions with the real environment reduced, which improves the sample utilization rate, accelerates training, improves learning efficiency and allows the target strategy network to converge rapidly to the optimum even when the unmanned aerial vehicle interacts with the environment only a few times. The optimized decision model possesses not only current decision experience but also future decision experience, so it can make better unmanned aerial vehicle obstacle avoidance decisions and thereby improve the obstacle avoidance performance of the unmanned aerial vehicle.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first schematic flow chart of the network training method provided by the present invention;
FIG. 2 is a second schematic flow chart of the network training method provided by the present invention;
FIG. 3 is a third schematic flow chart of the network training method provided by the present invention;
FIG. 4 is a schematic flow chart of the unmanned aerial vehicle obstacle avoidance method provided by the present invention;
FIG. 5 is a schematic structural diagram of the network training device provided by the present invention;
FIG. 6 is a schematic structural diagram of the unmanned aerial vehicle obstacle avoidance device provided by the present invention;
FIG. 7 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The unmanned aerial vehicle obstacle avoidance problem can be described as the task of navigating an unmanned aerial vehicle through a space containing obstacles. The task typically follows optimization criteria such as minimum cost of work, minimum flight distance or minimum flight time. Common traditional obstacle avoidance methods include dynamic programming algorithms, artificial potential field methods, sampling-based methods and graph-theory-based methods, but these methods require different models to be built for different situations. In an actual unmanned aerial vehicle flight environment, however, the working environment is complex and unpredictable, and the unmanned aerial vehicle often needs self-learning, adaptive and robust capabilities in order to detect and make real-time decisions in an unknown environment.
To overcome the weaknesses of these methods, researchers have explored various solutions, such as reinforcement learning. Reinforcement learning can learn appropriate behaviors from environmental conditions, offers online learning capability and generates corresponding rewards or penalties in different environments; it is a common method for decision-making in unknown environments and has been widely applied to the unmanned aerial vehicle obstacle avoidance problem. End-to-end reinforcement learning motion planning allows the system to be considered as a whole, making it more robust. However, because the amount of interaction between the unmanned aerial vehicle and the environment is limited, the sample utilization rate and autonomous learning efficiency of model-free reinforcement learning are low, which limits its application in the real world and results in poor obstacle avoidance performance of the unmanned aerial vehicle.
Aiming at the problems of low training efficiency and low sample utilization rate when model-free reinforcement learning is used for unmanned aerial vehicle obstacle avoidance, the embodiments of the application provide a network training method. A target prediction network obtained by a data-driven method is used to simulate the environment situation and the reward model, and a rolling optimization interval is adopted for the optimization training of the strategy network. As a result, obstacle avoidance learning is more efficient, the strategy converges to the optimum faster, and the experience replay buffer requires less sample capacity, so that the unmanned aerial vehicle can learn the optimal strategy with only a small amount of real-environment interaction. This promotes the application of reinforcement learning to the obstacle avoidance problem and improves the obstacle avoidance performance of the unmanned aerial vehicle.
Fig. 1 is a schematic flow chart of a network training method provided in the present invention, as shown in fig. 1, the method includes:
Step 110, constructing sample data of a target moment according to the environmental situation of the target moment, the optimal course angle of the target moment, the environmental situation of the next moment and the rewarding value of the target moment of the sample unmanned plane;
Here, the target time may be the current time or each history time, which is not specifically limited in this embodiment; the next time is the next time of the target time.
The target moment environment situation characterizes the environment at the target moment and may be determined based on the target moment operating parameters of the sample unmanned aerial vehicle and of the obstacles in its environment. The target moment optimal course angle is the optimal course angle decided at the target moment, and can be obtained by inputting the target moment environment situation into the target strategy network. The next moment environment situation is the environment situation at the moment following the target moment, and may be determined based on the next moment operating parameters of the sample unmanned aerial vehicle and of the obstacles in its environment. The target moment reward value is the reward produced by performing the obstacle avoidance operation with the target moment optimal course angle.
In some implementations, the target time environmental situation and the target time optimal heading angle are obtained based on the steps of:
Determining the target moment environment situation according to the target moment position, the radius and the destination position of the sample unmanned aerial vehicle, the target moment position, the target moment speed and the radius of the obstacle and the target moment distance between the sample unmanned aerial vehicle and the obstacle; and inputting the environment situation at the target moment into the target strategy network to obtain the optimal course angle at the target moment.
Here, the obstacle may be a cylinder, a sphere, or the like, which is not particularly limited in this embodiment.
Optionally, the step of obtaining the environmental situation at the target moment specifically includes:
Determining a first parameter value of the target moment environment situation based on the difference between the target moment position of the obstacle and the target moment position of the sample unmanned aerial vehicle, the sum of the radius of the sample unmanned aerial vehicle and the radius of the obstacle, and the target moment distance between the sample unmanned aerial vehicle and the obstacle; determining a second parameter value of the target moment environment situation based on the difference between the destination position and the target moment position of the obstacle; determining a third parameter value of the target moment environment situation based on the target moment speed of the obstacle; and constructing the target moment environment situation from the first parameter value, the second parameter value and the third parameter value. In the corresponding formula, $s_t$ denotes the environment situation at time $t$; $p_u^t$ and $p_o^t$ are the positions of the sample unmanned aerial vehicle and the obstacle at time $t$; $p_d$ is the destination position; $d_t$ is the distance between the sample unmanned aerial vehicle and the obstacle at time $t$; $R_u$ and $R_o$ are the radii of the sample unmanned aerial vehicle and the obstacle; and $v_o^t$ is the speed of the obstacle at time $t$.
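A minimal NumPy sketch of how such a situation vector could be assembled from the quantities defined above; the exact way the patent combines the relative position, the radius sum and the distance is not reproduced in this text, so the concatenation below and all names are illustrative assumptions.

```python
import numpy as np

def build_situation(p_uav, p_obs, p_dest, v_obs, r_uav, r_obs):
    """Assemble an environment-situation vector s_t (illustrative sketch)."""
    d = np.linalg.norm(p_obs - p_uav)      # UAV-obstacle distance at time t

    rel_obs = p_obs - p_uav                # relative obstacle position
    # First group of components: relative position together with the radius sum and the distance
    # (the patent combines these in a single expression that is not reproduced here).
    geom = np.array([r_uav + r_obs, d])

    rel_dest = p_dest - p_obs              # destination position relative to the obstacle

    # Third group of components: obstacle speed at time t.
    return np.concatenate([rel_obs, geom, rel_dest, v_obs])
```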
Optionally, the step of obtaining the optimal heading angle at the target moment specifically includes:
The target moment environment situation is input into the target strategy network, and the target strategy network flexibly makes a corresponding decision according to this situation and outputs the target moment optimal course angle. After the subsequent optimization training, the target strategy network can therefore adjust the course angle of the unmanned aerial vehicle flexibly according to the environment situation, avoiding obstacles in time and reaching the destination while reducing the dependence on pre-built maps or static path planning, which allows it to adapt to obstacle avoidance scenes in various complex environments and improves the generalization and robustness of unmanned aerial vehicle obstacle avoidance.
Here, the target policy network may be obtained by training based on the environmental situation and the optimal heading angle at the corresponding time in the to-be-processed sample data extracted in the previous prediction interval of the target prediction interval, and the environmental situation predicted value and the optimal heading angle predicted value.
Here, the optimal course angle includes, but is not limited to, an optimal roll angle, an optimal pitch angle and an optimal yaw angle, and can be written as $a_t = \mu(s_t)$, where $a_t = (\phi_t, \theta_t, \psi_t)$ is the optimal course angle of the sample unmanned aerial vehicle at time $t$; $\phi_t$, $\theta_t$ and $\psi_t$ are the optimal roll angle, optimal pitch angle and optimal yaw angle of the sample unmanned aerial vehicle; and $\mu$ is the target strategy network to be trained. In the DDPG (Deep Deterministic Policy Gradient) algorithm, the main target policy network $\mu$ among the target policy networks is the actuator (also called the Actor), and the main value function network $Q$ in the value function network is the evaluator (also called the Critic). The Actor interacts with the environment and learns a better strategy with a policy gradient under the guidance of the Critic's cost function. The Critic learns a value function from the data collected by the Actor's interaction with the environment; this value function can be used to judge which actions are good and which are not in the current state, thereby helping the Actor to update its policy. For the main value function network $Q$, inputting the target moment environment situation and the optimal course angle yields a value that represents the magnitude of the future cumulative reward. For the main target policy network $\mu$, inputting the target moment environment situation yields the optimal course angle. In addition, in DDPG the target policy network may further include an auxiliary policy network, and the value function network may further include an auxiliary value function network, both of which help to make the reinforcement learning training process more stable.
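A minimal PyTorch sketch of the Actor (main policy network) and Critic (main value function network) roles described above; layer sizes, activations and the three-component angle output are assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Main target policy network: environment situation -> optimal course angle (roll, pitch, yaw)."""
    def __init__(self, situation_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(situation_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),   # three bounded angle components
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Main value function network: (situation, course angle) -> estimate of the future cumulative reward."""
    def __init__(self, situation_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(situation_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```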
Optionally, after the target time environmental situation and the target time optimal course angle of the sample unmanned plane are obtained based on the above embodiments, environmental interaction may be performed based on the target time environmental situation and the target time optimal course angle, so as to obtain a next time environmental situation and a target time rewarding value, which may be specifically expressed as:
$(s_{t+1}, r_t) = f_{\mathrm{env}}(s_t, a_t)$, where $s_{t+1}$ is the environment situation at time $t+1$; $r_t$ is the reward value at time $t$; and $f_{\mathrm{env}}$ denotes the environment interaction function, i.e., the next moment position of the sample unmanned aerial vehicle is obtained according to the unmanned aerial vehicle dynamics constraint model, the kinematic constraints and the disturbance flow field method, and the next moment environment situation and the current moment reward are obtained by combining the next moment position of the obstacle and the destination position.
In some embodiments, the next time environmental situation is obtained based on the following steps:
Calculating to obtain the next moment position of the sample unmanned aerial vehicle according to an unmanned aerial vehicle dynamics constraint model, a kinematic constraint and a disturbance flow field method; and determining the environmental situation of the next moment according to the position, the radius and the destination position of the sample unmanned aerial vehicle at the next moment, the position, the speed and the radius of the obstacle at the next moment and the distance between the sample unmanned aerial vehicle and the obstacle at the next moment.
Optionally, the step of obtaining the environmental situation at the next moment specifically includes:
Firstly, calculating and acquiring the next time position of the sample unmanned aerial vehicle according to the target time position, the target time speed and the destination position of the sample unmanned aerial vehicle, the target time position and the target time speed of the obstacle based on an unmanned aerial vehicle dynamics constraint model, a kinematic constraint and a disturbance flow field method. According to the calculation mode, the dynamic constraint and the kinematic constraint of the unmanned aerial vehicle are considered in the calculation process, and the environmental factors are corrected by using the disturbance flow field method, so that the accurate movement planning of the unmanned aerial vehicle can be realized, and the position of the unmanned aerial vehicle at the next moment is calculated more accurately and reliably.
Here, the unmanned aerial vehicle dynamic constraint model includes, but is not limited to, constraints such as yaw angle, climb angle, track segment length, flying height, and track total length of the unmanned aerial vehicle; the kinematic constraint comprises a limiting condition of the unmanned aerial vehicle; the disturbance flow field method can be used for carrying out running track planning based on the next time position, the next time speed and the radius of the obstacle and the next time distance between the sample unmanned aerial vehicle and the obstacle, and carrying out the correction of the running track planning based on the unmanned aerial vehicle dynamics constraint model and the kinematic constraint condition, so that the next time position of the sample unmanned aerial vehicle is obtained.
Then, determining a first parameter value of an environmental situation at the next moment according to a difference value between the position of the obstacle at the next moment and the position of the sample unmanned aerial vehicle at the next moment, a sum of the radius of the sample unmanned aerial vehicle and the radius of the obstacle, and a distance between the sample unmanned aerial vehicle and the obstacle at the next moment; determining a second parameter value for the environmental situation at the next time based on a difference between the destination location and the next time location of the obstacle; based on the next time speed of the obstacle, determining a third parameter value of the environment situation of the next time, and constructing and forming the environment situation of the next time based on the first parameter value, the second parameter value and the third parameter value, wherein the step of acquiring the environment situation of the target time can be specifically referred to, and will not be described herein.
In some embodiments, the target time prize value is obtained based on the steps of:
Determining a target moment rewarding value according to the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle and a first rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than a first distance value; determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, the second rewarding value and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is smaller than the second distance value; determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is larger than or equal to the second distance value; wherein the first prize value is a constant prize value; the second prize value is a threat prize for limiting the sample drone from moving away from the obstacle; and the third prize value is an additional prize value corresponding to the completion of the task.
Here, the first distance value is determined from a sum of a radius of the sample drone and a radius of the obstacle; the second distance is determined based on a preset distance value.
Optionally, the specific step of obtaining the target time prize value includes:
Comparing the target moment distance between the sample unmanned aerial vehicle and the obstacle with a first distance value, and if the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value, calculating to obtain a target moment rewarding value based on the sum of the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle and the radius of the obstacle and the first rewarding value; if the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to a first distance value, further comparing the target moment distance between the sample unmanned aerial vehicle and the destination position with a second distance value, and if the target moment distance between the sample unmanned aerial vehicle and the destination position is smaller than the second distance value on the basis, calculating to obtain a target moment rewarding value according to the ratio of the target moment distance between the sample unmanned aerial vehicle and the destination position to the distance between the starting point position and the destination position of the sample unmanned aerial vehicle and the second rewarding value and the third rewarding value; if the target time distance between the sample unmanned aerial vehicle and the destination location is greater than or equal to the second distance value on the basis that the target time distance between the sample unmanned aerial vehicle and the obstacle is greater than or equal to the first distance value, calculating to obtain a target time rewarding value according to the ratio of the target time distance between the sample unmanned aerial vehicle and the destination location to the distance between the starting point location and the destination location of the sample unmanned aerial vehicle and the third rewarding value.
The reward value at time $t$, denoted $r_t$, is computed piecewise over the three cases above, where $d_t$ is the distance between the sample unmanned aerial vehicle and the obstacle at time $t$; $R_u$ and $R_o$ are the radii of the sample unmanned aerial vehicle and the obstacle; $c_1$, $c_2$ and $c_3$ are the first, second and third reward values; $D_2$ is the preset second distance value; $d_t^{\mathrm{dest}}$ is the distance between the sample unmanned aerial vehicle and the destination position at time $t$; and $d_{\mathrm{start}}$ is the distance between the starting point position and the destination position of the sample unmanned aerial vehicle.
Here, the first reward value is a preset constant reward value; the third reward value is an additional reward value set for completing the task; and the second reward value is a threat reward, set with respect to the distance between the sample unmanned aerial vehicle and the obstacle, used to keep the unmanned aerial vehicle away from the obstacle.
According to the method provided by the embodiment, the target time rewarding value is dynamically adjusted through the target time distance between the unmanned aerial vehicle and the obstacle and the target time distance between the unmanned aerial vehicle and the destination, so that the unmanned aerial vehicle is limited by the first rewarding value when approaching the obstacle, and collision with the obstacle is avoided as much as possible. When the unmanned aerial vehicle is far away from the obstacle and approaches the destination, the unmanned aerial vehicle is encouraged by the second and third rewards values so as to guide the unmanned aerial vehicle to keep a safe distance as far as possible, and meanwhile, the unmanned aerial vehicle is encouraged to arrive at the destination as soon as possible to finish the task; when the unmanned aerial vehicle is far away from the destination, the unmanned aerial vehicle is encouraged by a third rewarding value so as to encourage the unmanned aerial vehicle to arrive at the destination as soon as possible to finish the task, thereby improving the dynamic adaptability, the obstacle avoidance capacity and the task completion capacity of the unmanned aerial vehicle in flying obstacle avoidance, and further improving the obstacle avoidance performance of the unmanned aerial vehicle.
In some embodiments, the second prize value is determined based on the steps of:
Determining the second rewarding value based on the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle, a preset threat radius and a fourth rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and smaller than a third distance value; and determining the second rewarding value based on a preset constant value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value or larger than or equal to the third distance value.
Here, the third distance value is determined based on a sum of a radius of the sample drone, a radius of the obstacle, and a preset threat radius.
Optionally, the specific obtaining step of the second prize value includes:
Comparing the target moment distance between the sample unmanned aerial vehicle and the obstacle with the first distance value, and, when this distance is larger than or equal to the first distance value, further comparing it with the third distance value. If the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and smaller than the third distance value, the second reward value is calculated based on the sum of the radius of the sample unmanned aerial vehicle, the radius of the obstacle and the preset threat radius, the target moment distance between the sample unmanned aerial vehicle and the obstacle, the preset threat radius and the fourth reward value. If the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value or larger than or equal to the third distance value, the second reward value is determined based on a preset constant value. The preset constant value here may be 0 or a constant value infinitely close to 0.
The second reward value $c_2$ is computed piecewise as described above, where $\rho$ is the preset threat radius and $c_4$ is the fourth reward value; the fourth reward value may be a preset constant reward value.
According to the method provided by the embodiment, the second reward value is dynamically adjusted according to the target moment distance between the unmanned aerial vehicle and the obstacle, so that the relation between the unmanned aerial vehicle and the obstacle is reflected more accurately, more effective reward signals are provided, the obstacle avoidance capability and the flight safety of the unmanned aerial vehicle are improved, and further the obstacle avoidance performance of the unmanned aerial vehicle and the capability of coping with diversified scenes are improved.
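A minimal Python sketch of the piecewise reward described in the two preceding embodiments; the exact functional form inside each case is not given in this section, so the combinations used below (and all names) are illustrative assumptions.

```python
def threat_reward(d_obs, r_uav, r_obs, threat_radius, r4, const=0.0):
    """Second (threat) reward value; the ramp inside the threat band is an assumed form."""
    first_distance = r_uav + r_obs                   # first distance value
    third_distance = r_uav + r_obs + threat_radius   # third distance value
    if first_distance <= d_obs < third_distance:
        # Inside the threat band: scale the fourth reward value by how deep the UAV sits in the band.
        return r4 * (third_distance - d_obs) / threat_radius
    return const                                     # preset constant value otherwise


def step_reward(d_obs, d_dest, d_start_dest, r_uav, r_obs,
                threat_radius, r1, r3, r4, second_distance):
    """Target-moment reward value covering the three cases described above (illustrative forms)."""
    first_distance = r_uav + r_obs
    if d_obs < first_distance:
        # Case 1: too close to the obstacle -> penalty built from the distance terms and r1.
        return r1 + (d_obs - first_distance)
    progress = d_dest / d_start_dest                 # remaining distance, normalised by the start-to-destination distance
    r2 = threat_reward(d_obs, r_uav, r_obs, threat_radius, r4)
    if d_dest < second_distance:
        # Case 2: clear of the obstacle and near the destination -> threat reward plus completion reward.
        return -progress + r2 + r3
    # Case 3: clear of the obstacle but still far from the destination -> completion incentive only.
    return -progress + r3
```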
Optionally, after the target moment environment situation, the next moment environment situation, the target moment optimal course angle and the target moment reward value of the sample unmanned aerial vehicle are obtained according to the above embodiments, a 4-tuple consisting of the target moment environment situation, the target moment optimal course angle, the target moment reward value and the next moment environment situation can be constructed to form the sample data at the target moment.
Step 120, updating the sample data at the target moment to an experience playback pool, and extracting to-be-processed sample data at a plurality of different moments in a target prediction interval from the updated experience playback pool under the condition that the number of the sample data in the updated experience playback pool reaches a preset number;
here, the experience playback pool is a buffer for storing historical sample data for randomly extracting samples for training during the training process.
Optionally, after obtaining the sample data at the target time, updating the sample data at the target time to the experience playback pool; the specific expression formula is as follows:
$D \leftarrow D \cup \{(s_t, a_t, r_t, s_{t+1})\}$, where $D$ is the experience playback pool; $(s_t, a_t, r_t, s_{t+1})$ is the sample data at time $t$; $s_t$ and $s_{t+1}$ are the environment situations at time $t$ and time $t+1$; and $a_t$ and $r_t$ are the optimal course angle and the reward value at time $t$.
Then comparing the number of the sample data in the updated experience playback pool with a preset number, if the number of the sample data in the updated experience playback pool does not reach the preset number, continuing to collect the sample data at the next moment and updating the sample data to the experience playback pool until the number of the sample data in the updated experience playback pool reaches the preset number; and if the number of the sample data in the updated experience playback pool reaches the preset number, sampling the sample data to be processed at a plurality of different moments in the target prediction interval, so that the future environmental situation of each sample data, the optimal course angle of the unmanned aerial vehicle and rewards can be predicted by a subsequent multi-step prediction method based on model prediction control. The method is used for maximizing the accumulated return of each prediction interval through a rolling optimization method, so that the learning efficiency and the sample utilization rate of the unmanned aerial vehicle in reinforcement learning are improved, and the obstacle avoidance performance of the unmanned aerial vehicle is further improved.
The expression formula for extracting the sample data to be processed at different moments from the updated experience playback pool is as follows:
$\{(s_{t_i}, a_{t_i}, r_{t_i}, s_{t_i+1})\}_{i=1}^{M} \subset D$, where $M$ is the total number of sample data to be extracted.
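A minimal Python sketch of the experience playback pool behaviour described above (updating with 4-tuples, checking the preset number and extracting to-be-processed samples); names and the capacity value are illustrative.

```python
import random
from collections import deque

class ReplayPool:
    """Experience playback pool storing 4-tuples (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def update(self, s, a, r, s_next):
        # Update the pool with the sample data constructed at the target moment.
        self.buffer.append((s, a, r, s_next))

    def ready(self, preset_number):
        # Training starts only once the pool holds the preset number of samples.
        return len(self.buffer) >= preset_number

    def extract(self, m):
        # Extract m to-be-processed samples at different moments from the updated pool.
        return random.sample(list(self.buffer), m)
```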
Step 130, inputting the environmental situation in each sample data to be processed into the target strategy network to obtain optimal course angle predicted values at a plurality of different future moments, and inputting the environmental situation in each sample data to be processed and the optimal course angle predicted values at the plurality of different future moments into the target prediction network to perform multi-step prediction, so as to obtain environmental situation predicted values and reward predicted values at a plurality of different future moments;
Optionally, after sample data to be processed at a plurality of different moments have been extracted, an $N$-step prediction operation may be executed for each sample data to be processed, so as to obtain environmental situation predicted values and reward predicted values at a plurality of different future moments corresponding to that sample data, where $N$ is a preset prediction step length smaller than the total number of extracted to-be-processed samples.
Here, for each sample data to be processed, the prediction operation is executed iteratively step by step, as follows:

Firstly, the optimal course angles at the corresponding moments of the different prediction steps (i.e., at a plurality of different future moments) are computed with the target strategy network from the environment situations of the sample data to be processed: the environment situation at the corresponding moment of each prediction step is input into the target strategy network, which outputs the optimal course angle predicted value at that moment, i.e., $\hat{a}_{i,k} = \mu(\hat{s}_{i,k})$, where $\hat{a}_{i,k}$ is the optimal course angle predicted value of the $i$-th sample data at the corresponding moment of prediction step $k$; $\mu$ is the main policy network; and $\hat{s}_{i,k}$ is the environment situation of the $i$-th sample data at the corresponding moment of prediction step $k$.
Then, the trained target prediction network is used to obtain the environment situation at the moment following the corresponding moment of each prediction step and the reward value at the corresponding moment of each prediction step: the environment situation predicted values and the optimal course angle predicted values at the corresponding moments of the different prediction steps are input into the target prediction network, which performs multi-step prediction and outputs the environment situation predicted values at the next moments of the corresponding moments of the different prediction steps and the reward predicted values at the corresponding moments of the different prediction steps.
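A minimal Python sketch of the multi-step prediction loop described above, in which the main policy network proposes a course angle at each prediction step and the target prediction network returns the predicted reward and the next environment situation; all names are illustrative.

```python
def multi_step_predict(s0, policy_net, transition_net, reward_net, horizon):
    """Roll one to-be-processed sample forward for `horizon` prediction steps.

    Returns the predicted course angles, environment situations and rewards for future moments.
    """
    angles, situations, rewards = [], [], []
    s = s0
    for _ in range(horizon):
        a = policy_net(s)          # optimal course angle predicted by the main policy network
        r = reward_net(s, a)       # reward predicted value at this prediction step
        s = transition_net(s, a)   # environment situation predicted value at the next moment
        angles.append(a)
        rewards.append(r)
        situations.append(s)
    return angles, situations, rewards
```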
The target prediction network is a model capable of predicting environment situations and reward values. It is obtained by iterative training in which the environment situation values and optimal course angles of the sample unmanned aerial vehicle at the moments of the previous prediction interval of the target prediction interval serve as training samples, and the corresponding next-moment environment situation values and reward values serve as labels; environment situation predicted values and optimal course angle predicted values from the previous prediction interval, with the corresponding next-moment environment situation predicted values and reward predicted values as labels, may additionally be used. The trained target prediction network therefore has a good approximation effect in a low-dimensional environment.
According to the method provided by the embodiment, the sample data extracted from the experience playback pool is simultaneously used for updating the target prediction network and the subsequent strategy model, and the target prediction network has a promotion effect on the subsequent strategy model updating, so that the unmanned aerial vehicle has higher learning efficiency in obstacle avoidance, faster optimal strategy convergence speed and less sample capacity space required by the experience playback buffer zone.
Step 140, performing reinforcement learning training on the target strategy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and acquiring an optimized strategy network according to a training result; the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.
Optionally, after the environmental situation predicted values, the rewarding predicted values and the optimal course angle predicted values at different future times corresponding to each sample data to be processed are obtained, at least one cost function may be obtained based on the environmental situation predicted values, the rewarding predicted values and the optimal course angle predicted values at different future times, and the environmental situation, the rewarding value and the optimal course angle contained in the sample data to be processed, so as to perform gradient update on network parameters of the target policy network based on the cost function, thereby continuously updating and optimizing the target policy network by interacting with the environment and using the predicted information, so as to obtain an optimized policy network capable of accurately deciding the optimal course angle of the unmanned aerial vehicle according to the environmental situation of the unmanned aerial vehicle.
And then, after the optimized strategy network is obtained, the current time environment situation of the current unmanned aerial vehicle can be input into the optimized strategy network, so that the optimized strategy network makes a decision based on the current time environment situation to predict and obtain the current time optimal course angle of the current unmanned aerial vehicle, and the current unmanned aerial vehicle is conveniently guided to execute the obstacle avoidance task according to the current time optimal course angle, so that the current unmanned aerial vehicle can obtain the maximum rewards under the current environment, and the effective obstacle avoidance behavior is realized.
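A minimal sketch of how the optimized strategy network could be queried at run time; `execute_heading` is a hypothetical flight-control callback and the situation vector is assumed to be built as in the earlier sketch, so all names are illustrative.

```python
import numpy as np
import torch

def obstacle_avoidance_step(policy_net, situation, execute_heading):
    """One obstacle-avoidance decision: current-moment environment situation -> optimal course angle -> control command."""
    s = torch.as_tensor(np.asarray(situation), dtype=torch.float32)
    with torch.no_grad():
        angle = policy_net(s)        # current-moment optimal course angle from the optimized strategy network
    execute_heading(angle.numpy())   # hand the course angle to the flight controller for the obstacle avoidance task
    return angle
```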
In some related technologies, additional virtual data are generated through model-based reinforcement learning that requires long-term prediction, which makes the learning cost expensive. In contrast, the method provided by this embodiment uses model predictive control to optimize the trajectory only over a limited short horizon within each prediction interval, providing a current locally optimal solution, avoiding the high cost of long-term planning and achieving a balance between input cost and convergence benefit. In addition, compared with related technologies that realize environment situation prediction and reward prediction through probabilistic modeling, the method provided by this embodiment models the state transition and reward model of the environment deterministically, i.e., with a data-driven method that approximates the environment model by neural networks, which helps to reduce the number of neural networks, increases the computation speed of the algorithm and further improves the learning efficiency of the target strategy network.
According to the network training method provided by this embodiment, a plurality of sample data to be processed at different moments are extracted from the experience playback pool and predicted based on the target strategy network and the target prediction network, so as to obtain optimal course angle predicted values, environmental situation predicted values and reward predicted values at a plurality of different future moments; the target strategy network is then subjected to rolling optimization based on the environmental situation, the optimal course angle and the reward value in the sample data to be processed, together with the environmental situation predicted values, the reward predicted values and the optimal course angle predicted values at the future moments. In this way, the number of sample data is expanded through the virtual environment data generated by the target prediction network, which reduces the number of interactions with the real environment, improves the sample utilization rate, accelerates training and improves learning efficiency, so that the target strategy network can quickly converge to the optimal state with fewer interactions between the unmanned aerial vehicle and the environment. The optimized decision model incorporates not only current decision experience but also future decision experience, so as to make better unmanned aerial vehicle obstacle avoidance decisions and thereby improve the obstacle avoidance performance of the unmanned aerial vehicle.
In some embodiments, in step 130, inputting the environmental situation in each of the sample data to be processed and the optimal heading angle predicted values of the multiple different future times to the target prediction network for multi-step prediction, to obtain environmental situation predicted values and rewarding predicted values of the multiple different future times, including:
Inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a reward function network of the target prediction network, and inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a situation transfer function network of the target prediction network to perform multi-step prediction, so as to obtain the reward predicted value output by the reward function network and the environmental situation predicted value output by the situation transfer function network.
Here, the target prediction network may include a situation transfer function network capable of predicting the environmental situation and a reward function network capable of predicting the reward value.
Optionally, the step of obtaining the environmental situation predicted value and the rewards predicted value includes:
Inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted value of the corresponding moment of the environmental situation into a reward function network to obtain a reward predicted value of the corresponding moment; inputting the environment situation in each sample data to be processed and the optimal heading angle predicted value of the environment situation at the corresponding moment to a situation transfer function network to obtain the environment situation predicted value of the next moment of the corresponding moment;
Inputting the environmental situation predicted value of the next moment of the corresponding moment and the optimal heading angle predicted value of that next moment into the reward function network to obtain the reward predicted value of that next moment; inputting the same environmental situation predicted value and optimal heading angle predicted value into the situation transfer function network to obtain the environmental situation predicted value of the moment after that next moment; and iteratively executing the above prediction steps on each sample data to be processed until the prediction step reaches the preset prediction step length.
The multi-step prediction acquisition formula of the environmental situation predicted value and the rewards predicted value can be expressed as follows:
$$\hat s^{\,k+1}=f_{\phi}\!\left(\hat s^{\,k},\hat a^{\,k}\right),\qquad \hat r^{\,k}=g_{\psi}\!\left(\hat s^{\,k},\hat a^{\,k}\right)$$
Wherein, $f_{\phi}$ is the situation transfer function network, $g_{\psi}$ is the reward function network, and $\phi$ and $\psi$ are the model parameters of the situation transfer function network and the reward function network, respectively. $\hat s^{\,k+1}$ and $\hat r^{\,k}$ respectively denote the environmental situation predicted value predicted by the situation transfer function network and the reward predicted value predicted by the reward function network.
According to the method provided by the embodiment, the sample data to be processed is subjected to multi-step prediction through the reward function network and the situation transfer function network of the target prediction network, so that the environment model is approximated through the deterministic model method, namely, the environment situation predicted value and the reward predicted value of the environment are learned through the data-driven method, the sample utilization rate of the unmanned aerial vehicle in reinforcement learning is improved, and the obstacle avoidance performance of the unmanned aerial vehicle is further improved.
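For illustration only, the multi-step rollout described above can be sketched as follows. This is a minimal PyTorch sketch under assumed settings: the network shapes, hidden sizes, dimensions (STATE_DIM, ACTION_DIM) and horizon H are hypothetical and are not specified by this embodiment.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- the embodiment does not fix the dimensions or the horizon.
STATE_DIM, ACTION_DIM, H = 12, 1, 5   # environmental situation dim, heading angle dim, prediction steps

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

policy_net = mlp(STATE_DIM, ACTION_DIM)                   # target strategy network (assumed MLP)
situation_net = mlp(STATE_DIM + ACTION_DIM, STATE_DIM)    # situation transfer function network f
reward_net = mlp(STATE_DIM + ACTION_DIM, 1)               # reward function network g

def multi_step_predict(s0):
    """Roll the deterministic environment model forward for H prediction steps."""
    s = s0
    situations, headings, rewards = [], [], []
    for _ in range(H):
        a = policy_net(s)                                  # optimal heading angle predicted value
        sa = torch.cat([s, a], dim=-1)
        rewards.append(reward_net(sa))                     # reward predicted value at this step
        situations.append(s)
        headings.append(a)
        s = situation_net(sa)                              # environmental situation of the next step
    return situations, headings, rewards

# Usage: a batch of environmental situations sampled from the experience playback pool.
batch = torch.randn(32, STATE_DIM)
situations, headings, rewards = multi_step_predict(batch)
```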
In some embodiments, in step 140, performing reinforcement learning training on the target policy network according to the environmental situation, the optimal course angle and the rewards value in each of the sample data to be processed, and the environmental situation predicted value, the rewards predicted value and the optimal course angle predicted value, including: obtaining a value function cost function according to the optimal course angle in each piece of sample data to be processed, the optimal course angle predicted value, the environment situation predicted value and the rewarding predicted value; acquiring a strategy cost function according to the optimal course angle predicted value and the environment situation predicted value; acquiring a situation transfer cost function according to the environmental situation and the environmental situation predicted value in each sample data to be processed; obtaining a reward cost function according to the reward value and the reward predicted value in each sample data to be processed; and performing reinforcement learning on the target strategy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the rewarding cost function.
Optionally, the reinforcement learning training step of the target strategy network specifically includes:
inputting the optimal course angle predicted value or the optimal course angle at the corresponding time of each prediction step, the environment situation predicted value and the rewarding predicted value into the additional value function network of the target value function network, and obtaining the target value at the corresponding time of each prediction step according to the output of the additional value function network and the rewarding predicted value at the corresponding time of each prediction step, wherein the specific calculation formula is as follows:
$$y_i^{\,k}=\alpha\,\hat r_i^{\,k}+\gamma\,Q'\!\left(\hat s_i^{\,k+1},\hat a_i^{\,k+1}\right)$$
Wherein, $y_i^{\,k}$ is the target value at the time corresponding to the $k$-th prediction step for the $i$-th sample data; $H$ is the preset prediction step length ($k=1,\dots,H$); $\alpha$ and $\gamma$ are coefficients; $\hat r_i^{\,k}$, $\hat s_i^{\,k+1}$ and $\hat a_i^{\,k+1}$ are respectively the reward predicted value, the environmental situation predicted value and the optimal course angle predicted value; $Q'$ is the additional value function network.
Inputting the optimal course angle predicted value and the environmental situation predicted value of the corresponding time of each prediction step into a main value function network of the target value function network, and calculating the value function cost function according to the difference between the output of the main value function network and the target value of the corresponding time of each prediction step; wherein the value function cost function $L_{Q}$ can be expressed as:
$$L_{Q}=\frac{1}{NH}\sum_{i=1}^{N}\sum_{k=1}^{H}\left(y_i^{\,k}-Q\!\left(\hat s_i^{\,k},\hat a_i^{\,k}\right)\right)^{2}$$
Wherein, $N$ is the number of sample data extracted in each prediction interval; $y_i^{\,k}$, $\hat s_i^{\,k}$ and $\hat a_i^{\,k}$ are respectively the target value, the environmental situation predicted value and the optimal course angle predicted value at the time corresponding to each prediction step; $Q$ is the main value function network; $H$ is the preset prediction step length.
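The per-step target value and the value function cost function can be illustrated with the following hedged sketch; the coefficients alpha and gamma, the network sizes and the batch shapes are assumptions for illustration rather than the notation of this embodiment.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 1                      # assumed dimensions

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_main = mlp(STATE_DIM + ACTION_DIM, 1)            # main value function network Q
q_extra = mlp(STATE_DIM + ACTION_DIM, 1)           # additional value function network Q'

alpha, gamma = 1.0, 0.99                           # assumed coefficients on reward and bootstrap terms

def value_function_cost(s_k, a_k, r_k, s_k1, a_k1):
    """MSE between Q(s_k, a_k) and a per-step target built from r_k and Q'(s_k1, a_k1)."""
    with torch.no_grad():                          # targets are held fixed for the value update
        target = alpha * r_k + gamma * q_extra(torch.cat([s_k1, a_k1], dim=-1))
    q_pred = q_main(torch.cat([s_k, a_k], dim=-1))
    return nn.functional.mse_loss(q_pred, target)

# Usage with a batch of N samples at one prediction step (random placeholders).
N = 32
loss_q = value_function_cost(torch.randn(N, STATE_DIM), torch.randn(N, ACTION_DIM),
                             torch.randn(N, 1), torch.randn(N, STATE_DIM),
                             torch.randn(N, ACTION_DIM))
```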
Inputting the optimal course angle predicted value and the environmental situation predicted value of the corresponding time of each prediction step into the main value function network in the target value function network, and calculating the strategy cost function based on the output of the main value function network; wherein the policy cost function $L_{\pi}$ can be expressed as:
$$L_{\pi}=-\frac{1}{NH}\sum_{i=1}^{N}\sum_{k=1}^{H}Q\!\left(\hat s_i^{\,k},\hat a_i^{\,k}\right)$$
Calculating the situation transfer cost function according to the difference between the environmental situation and the environmental situation predicted value at the corresponding moment; wherein the situation transfer cost function $L_{f}$ can be expressed as:
$$L_{f}=\frac{1}{N}\sum_{i=1}^{N}\left\|\hat s_i-s_i\right\|^{2}$$
Wherein, $\hat s_i$ and $s_i$ are respectively the environmental situation predicted value and the environmental situation at the corresponding moment of the $i$-th sample data.
Calculating the reward cost function according to the difference between the reward value and the reward predicted value at the corresponding moment; wherein the reward cost function $L_{g}$ can be expressed as:
$$L_{g}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat r_i-r_i\right)^{2}$$
Wherein, $\hat r_i$ and $r_i$ are respectively the reward predicted value and the reward value at the corresponding moment of the $i$-th sample data.
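A compact sketch of the remaining three cost functions (policy, situation transfer and reward) might look as follows; the mean-squared-error forms and the network definitions are illustrative assumptions consistent with the descriptions above, not a definitive implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 1                      # assumed dimensions

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_main = mlp(STATE_DIM + ACTION_DIM, 1)            # main value function network
policy_net = mlp(STATE_DIM, ACTION_DIM)            # target strategy network

def policy_cost(s_pred):
    """Strategy cost: maximise the main value network output, i.e. minimise its negative mean."""
    a = policy_net(s_pred)
    return -q_main(torch.cat([s_pred, a], dim=-1)).mean()

def situation_transfer_cost(s_pred, s_true):
    """Situation transfer cost: squared difference between predicted and observed situations."""
    return nn.functional.mse_loss(s_pred, s_true)

def reward_cost(r_pred, r_true):
    """Reward cost: squared difference between predicted and observed reward values."""
    return nn.functional.mse_loss(r_pred, r_true)

# Usage with placeholder batches.
loss_pi = policy_cost(torch.randn(32, STATE_DIM))
loss_f = situation_transfer_cost(torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM))
loss_g = reward_cost(torch.randn(32, 1), torch.randn(32, 1))
```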
Then, after the cost functions are obtained, the target value function network, the target strategy network and the target prediction network are optimized based on the cost functions described above (the value function cost function, the strategy cost function, the situation transfer cost function and the reward cost function). The trained strategy network, value function network and prediction network are respectively taken as the target strategy network, the target value function network and the target prediction network corresponding to the next prediction interval, and these networks are iteratively optimized based on the sample data to be processed at a plurality of different moments in the next prediction interval, together with the predicted optimal heading angle predicted values, environmental situation predicted values and reward predicted values, until the number of rounds of iterative training reaches the set number of training rounds, thereby obtaining the optimized strategy network.
According to the method provided by the embodiment, the accumulated return of each prediction interval is maximized by applying the rolling optimization method, and the strategy network optimization is performed by fusing a plurality of cost functions, so that the learning efficiency and the sample utilization rate of the strategy are improved, and the optimized strategy network has higher obstacle avoidance performance.
FIG. 2 is a second flow chart of the network training method according to the present invention; as shown in fig. 2, the complete flow of the method includes:
step 210, calculating a target time optimal course angle of the sample unmanned aerial vehicle based on the target strategy network and the target time environment situation;
Step 220, obtaining the next moment position of the sample unmanned aerial vehicle according to the unmanned aerial vehicle dynamics constraint model, the kinematic constraint and the disturbance flow field method, obtaining the next moment environmental situation and the target moment rewarding value of the sample unmanned aerial vehicle by combining the next moment position and the destination position of the obstacle, constructing sample data of the target moment according to the optimal course angle of the target moment, the environment situation of the next moment and the target moment rewarding value, and storing the sample data into the experience playback pool;
Step 230, judging whether the number of sample data in the experience playback pool in step 220 exceeds a preset value; if it exceeds the preset value, go to step 240; if not, go to step 210;
Step 240, sampling part of sample data to be processed from the experience playback pool, and predicting environmental situation, optimal course angle and rewarding value of the unmanned aerial vehicle at a plurality of future moments corresponding to each sample data to be processed based on a multi-step prediction method of prediction control of a prediction network;
Step 250, calculating at least one cost function according to the environmental situation of the sample data to be processed at the corresponding time and the future time, the optimal course angle and the rewarding value of the unmanned aerial vehicle in step 240, and then carrying out gradient update on the target strategy network and the target prediction network according to the cost function;
Step 260, repeating steps 210-250 until the specified training round is reached, resulting in an optimized policy network.
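The overall flow of steps 210-260 can be outlined with the rough training skeleton below. It is only a structural sketch: the environment step is a placeholder for the dynamics/kinematics constraints and disturbed-flow-field simulation, the combined loss keeps only a few illustrative terms (the full method uses the four cost functions described above), and all sizes, thresholds and round counts are assumptions.

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, H = 12, 1, 5                      # assumed dimensions and prediction step length

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

policy_net = mlp(STATE_DIM, ACTION_DIM)                  # target strategy network
situation_net = mlp(STATE_DIM + ACTION_DIM, STATE_DIM)   # situation transfer function network
reward_net = mlp(STATE_DIM + ACTION_DIM, 1)              # reward function network
q_main = mlp(STATE_DIM + ACTION_DIM, 1)                  # main value function network

optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(situation_net.parameters())
    + list(reward_net.parameters()) + list(q_main.parameters()), lr=1e-3)

replay_pool = deque(maxlen=10_000)                       # experience playback pool

def env_step(s, a):
    """Placeholder for step 220: dynamics/kinematics constraints and disturbed flow field."""
    return torch.randn(STATE_DIM), torch.randn(1)        # dummy next situation and reward

s = torch.randn(STATE_DIM)                               # initial environmental situation
for round_idx in range(1000):                            # step 260: repeat until the set training rounds
    a = policy_net(s).detach()                           # step 210: optimal heading angle at target moment
    s_next, r = env_step(s, a)                           # step 220: build and store the sample
    replay_pool.append((s, a, r, s_next))
    s = s_next
    if len(replay_pool) < 256:                           # step 230: wait until enough samples exist
        continue

    batch = random.sample(list(replay_pool), 32)         # step 240: sample data to be processed
    loss = torch.zeros(())
    for s0, a0, r0, s1 in batch:
        sa0 = torch.cat([s0, a0])
        loss = loss + (situation_net(sa0) - s1).pow(2).mean()   # situation transfer cost term
        loss = loss + (reward_net(sa0) - r0).pow(2).mean()      # reward cost term
        sk = s1
        for _ in range(H):                                # multi-step rollout with the learned model
            ak = policy_net(sk)
            loss = loss - q_main(torch.cat([sk, ak])).mean()    # illustrative policy term only
            sk = situation_net(torch.cat([sk, ak]))

    optimizer.zero_grad()                                 # step 250: gradient update of the networks
    loss.backward()
    optimizer.step()
```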
FIG. 3 is a third flow chart of the network training method according to the present invention; as shown in fig. 3, the step 240 specifically includes:
Step 310, sampling a plurality of sample data to be processed at different moments from an experience playback pool;
Step 320, calculating an optimal heading angle predicted value at the corresponding moment of the current prediction step based on the environmental situation predicted value at the corresponding moment of the current prediction step and the target policy network; it should be noted that, in the case where the current prediction step is the initial prediction step, the environmental situation predicted value at the corresponding time of the current prediction step is the environmental situation in each sample data to be processed;
step 330, interacting with a prediction network, and predicting to obtain an environmental situation predicted value at the corresponding time of the next prediction step and a reward predicted value at the corresponding time of the current prediction step;
step 340, calculating an updated target value of the target policy network;
Step 350, judging whether the current prediction step is larger than a preset prediction step length; if the predicted step size is larger than the preset predicted step size, ending the step 240 and jumping to the step 250; if not, the process proceeds to step 320.
In summary, aiming at the problem of low training efficiency in reinforcement-learning-based unmanned aerial vehicle obstacle avoidance, the obstacle avoidance method provided by this embodiment learns the environmental situation and reward models through a data-driven method, and adopts rolling optimization over prediction intervals to carry out strategy training of the unmanned aerial vehicle. The method has higher learning efficiency, faster convergence of the strategy to the optimal value, and a smaller sample capacity required for the experience replay buffer, so that the unmanned aerial vehicle can learn the optimal strategy with only a small number of attempts, which promotes the application of reinforcement learning methods in obstacle avoidance and improves the obstacle avoidance performance of the unmanned aerial vehicle.
Fig. 4 is a schematic flow chart of an obstacle avoidance method of an unmanned aerial vehicle provided by the invention; as shown in fig. 4, the method includes: step 410, acquiring the current moment environmental situation of the current unmanned aerial vehicle; step 420, inputting the environmental situation at the current moment into an optimized strategy network to obtain an optimal heading angle at the current moment of the current unmanned aerial vehicle; step 430, controlling the current unmanned aerial vehicle to execute an obstacle avoidance task according to the optimal course angle at the current moment; the optimized strategy network is obtained by training based on the network training method provided by each embodiment.
Optionally, after the optimized policy network is obtained based on steps 110-140, the current time environmental situation of the current unmanned aerial vehicle is input into the optimized policy network, so that the optimized policy network makes a decision based on the current time environmental situation to predict and obtain the current time optimal course angle of the current unmanned aerial vehicle, and then the current unmanned aerial vehicle is guided to execute the obstacle avoidance task according to the current time optimal course angle, so that the current unmanned aerial vehicle can obtain the maximum rewards under the current environment, and effective obstacle avoidance behavior is realized. The method is used for expanding the number of sample data through virtual environment data generation of the target prediction network, so that the interaction times with a real environment are reduced, the sample utilization rate is improved, the training speed is accelerated, the learning efficiency is improved, and the target strategy network can be quickly converged to be optimal under the condition that the interaction times with the environment of the unmanned aerial vehicle are fewer; the optimized decision model not only has the current decision experience, but also has the future decision experience so as to make a more optimized unmanned aerial vehicle obstacle avoidance decision, thereby improving the unmanned aerial vehicle obstacle avoidance performance.
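At deployment time, steps 410-430 reduce to a single forward pass through the optimized strategy network, as in the hedged sketch below; the network shape, dimensions and file name are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 1                      # assumed dimensions
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))
# In practice policy_net would be the optimized strategy network loaded from training, e.g.
# policy_net.load_state_dict(torch.load("optimized_policy.pt"))   # hypothetical file name

def decide_heading(current_situation):
    """Steps 410-420: map the current environmental situation to the optimal heading angle."""
    with torch.no_grad():
        return policy_net(torch.as_tensor(current_situation, dtype=torch.float32))

# Step 430 would pass this heading angle to the flight controller for obstacle avoidance.
heading = decide_heading([0.0] * STATE_DIM)
```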
The network training device provided by the invention is described below, and the network training device described below and the network training method described above can be referred to correspondingly.
Fig. 5 is a schematic structural diagram of a network training device provided by the present invention; as shown in fig. 5, the apparatus includes: the construction unit 510 is configured to construct sample data at a target time according to a target time environment situation, a target time optimal heading angle, a next time environment situation, and a target time rewarding value of the sample unmanned aerial vehicle; the extracting unit 520 is configured to update the sample data at the target time to an experience playback pool, and extract, from the updated experience playback pool, sample data to be processed at a plurality of different times within a target prediction interval when the number of the sample data in the updated experience playback pool reaches a preset number; the prediction unit 530 is configured to input the environmental situation in each of the sample data to be processed to a target policy network to obtain a plurality of optimal heading angle predicted values at different future times, and input the environmental situation in each of the sample data to be processed and the plurality of optimal heading angle predicted values at different future times to the target prediction network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at different future times; the optimizing unit 540 is configured to perform reinforcement learning training on the target policy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and obtain an optimized policy network according to a training result; the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle. The network training device provided by the embodiment expands the number of sample data through the generation of virtual environment data by the target prediction network, so that the interaction times with the real environment are reduced, the sample utilization rate is improved, the training speed is accelerated, the learning efficiency is improved, and the target strategy network can quickly converge to the optimal state with fewer interactions between the unmanned aerial vehicle and the environment; the optimized decision model not only has the current decision experience, but also has the future decision experience so as to make a more optimized unmanned aerial vehicle obstacle avoidance decision, thereby improving the unmanned aerial vehicle obstacle avoidance performance.
Fig. 6 is a schematic structural diagram of an obstacle avoidance device of an unmanned aerial vehicle according to the present invention; as shown in fig. 6, the apparatus includes: the acquiring unit 610 is configured to acquire a current time environmental situation of the current unmanned aerial vehicle; the decision unit 620 is configured to input the current time environmental situation to an optimized policy network, so as to obtain a current time optimal heading angle of the current unmanned aerial vehicle; the obstacle avoidance control unit 630 is configured to control the current unmanned aerial vehicle to perform an obstacle avoidance task according to the current optimal heading angle; the optimized policy network is trained based on the network training method provided by the above embodiments. The device expands the number of sample data through the generation of virtual environment data of the target prediction network, so as to reduce the interaction times with the real environment, improve the sample utilization rate, accelerate the training speed and improve the learning efficiency, and further enable the target strategy network to quickly converge to the optimal state under the condition that the interaction times with the environment of the unmanned aerial vehicle are less; the optimized decision model not only has the current decision experience, but also has the future decision experience so as to make a more optimized unmanned aerial vehicle obstacle avoidance decision, thereby improving the unmanned aerial vehicle obstacle avoidance performance.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform the network training method or the drone obstacle avoidance method provided by the methods described above.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially, or in a part contributing to the prior art, or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute the network training method or the unmanned aerial vehicle obstacle avoidance method provided by the above methods.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the network training method or the unmanned aerial vehicle obstacle avoidance method provided by the above methods.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of network training, comprising:
According to the environmental situation at the target moment, the optimal course angle at the target moment, the environmental situation at the next moment and the rewarding value at the target moment of the sample unmanned aerial vehicle, constructing sample data at the target moment;
Updating the sample data at the target moment to an experience playback pool, and extracting a plurality of sample data to be processed at different moments in a target prediction interval from the updated experience playback pool under the condition that the number of the sample data in the updated experience playback pool reaches a preset number;
Inputting the environmental situation in each sample data to be processed into a target strategy network to obtain a plurality of optimal heading angle predicted values at different future times, and inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted values at the different future times into a target prediction network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at the different future times;
Performing reinforcement learning training on the target strategy network according to the environmental situation, the optimal course angle and the rewarding value in each sample data to be processed, and the environmental situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and acquiring an optimized strategy network according to a training result;
the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.
2. The network training method according to claim 1, wherein the target time environment situation and the target time optimal heading angle are obtained based on the steps of:
Determining the target moment environment situation according to the target moment position, the radius and the destination position of the sample unmanned aerial vehicle, the target moment position, the target moment speed and the radius of the obstacle and the target moment distance between the sample unmanned aerial vehicle and the obstacle;
And inputting the environment situation at the target moment into the target strategy network to obtain the optimal course angle at the target moment.
3. The network training method of claim 1, wherein the next time environmental situation is obtained based on the steps of:
calculating to obtain the next moment position of the sample unmanned aerial vehicle according to an unmanned aerial vehicle dynamics constraint model, a kinematic constraint and a disturbance flow field method;
And determining the environmental situation of the next moment according to the position, the radius and the destination position of the sample unmanned aerial vehicle at the next moment, the position, the speed and the radius of the obstacle at the next moment and the distance between the sample unmanned aerial vehicle and the obstacle at the next moment.
4. A network training method as claimed in any one of claims 1 to 3, wherein the target time prize value is obtained on the basis of:
Determining a target moment rewarding value according to the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle and a first rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than a first distance value;
Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, the second rewarding value and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is smaller than the second distance value;
Determining a target time rewarding value according to the target time distance between the sample unmanned aerial vehicle and the destination position, the distance between the starting point position of the sample unmanned aerial vehicle and the destination position, and the third rewarding value when the target time distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and the target time distance between the sample unmanned aerial vehicle and the destination position is larger than or equal to the second distance value;
wherein the first prize value is a constant prize value; the second prize value is a threat prize for limiting the sample drone from moving away from the obstacle; and the third prize value is an additional prize value corresponding to the completion of the task.
5. The network training method of claim 4, wherein the second prize value is determined based on the steps of:
Determining the second rewarding value based on the target moment distance between the sample unmanned aerial vehicle and the obstacle, the radius of the sample unmanned aerial vehicle, the radius of the obstacle, a preset threat radius and a fourth rewarding value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is larger than or equal to the first distance value and smaller than a third distance value;
And determining the second rewarding value based on a preset constant value under the condition that the target moment distance between the sample unmanned aerial vehicle and the obstacle is smaller than the first distance value or larger than or equal to the third distance value.
6. A network training method according to any one of claims 1 to 3, wherein the inputting the environmental situation in each of the sample data to be processed and the optimal heading angle predicted values at the different future times into the target prediction network to perform multi-step prediction, to obtain environmental situation predicted values and rewards predicted values at the different future times, includes:
Inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a reward function network of the target prediction network, and inputting the environmental situation and the optimal course angle predicted value in each sample data to be processed into a situation transfer function network of the target prediction network to perform multi-step prediction, so as to obtain the reward predicted value output by the reward function network and the environmental situation predicted value output by the situation transfer function network.
7. A network training method according to any one of claims 1 to 3, wherein said reinforcement learning training of said target strategy network according to the environmental situation, the optimal course angle and the prize value in each of said sample data to be processed, and said environmental situation predicted value, said prize predicted value and said optimal course angle predicted value comprises:
Obtaining a value function cost function according to the optimal course angle in each piece of sample data to be processed, the optimal course angle predicted value, the environment situation predicted value and the rewarding predicted value;
acquiring a strategy cost function according to the optimal course angle predicted value and the environment situation predicted value;
acquiring a situation transfer cost function according to the environmental situation and the environmental situation predicted value in each sample data to be processed;
Obtaining a reward cost function according to the reward value and the reward predicted value in each sample data to be processed;
And performing reinforcement learning on the target strategy network according to the value function cost function, the strategy cost function, the situation transfer cost function and the rewarding cost function.
8. An unmanned aerial vehicle obstacle avoidance method, comprising:
Acquiring the current moment environmental situation of the current unmanned aerial vehicle;
inputting the current environmental situation to an optimized strategy network to obtain the current optimal course angle of the current unmanned aerial vehicle;
according to the optimal course angle at the current moment, controlling the current unmanned aerial vehicle to execute an obstacle avoidance task;
Wherein the optimized policy network is trained based on the network training method according to any one of claims 1 to 7.
9. A network training device, comprising:
The construction unit is used for constructing sample data of the target moment according to the environmental situation of the target moment, the optimal course angle of the target moment, the environmental situation of the next moment and the rewarding value of the target moment of the sample unmanned plane;
the extraction unit is used for updating the sample data at the target moment to the experience playback pool, and extracting the sample data to be processed at a plurality of different moments in the target prediction interval from the updated experience playback pool under the condition that the quantity of the sample data in the updated experience playback pool reaches the preset quantity;
The prediction unit is used for inputting the environmental situation in each sample data to be processed into a target strategy network to obtain a plurality of optimal heading angle predicted values at different future moments, inputting the environmental situation in each sample data to be processed and the optimal heading angle predicted values at the different future moments into the target prediction network to perform multi-step prediction to obtain a plurality of environmental situation predicted values and rewarding predicted values at the different future moments;
The optimizing unit is used for performing reinforcement learning training on the target strategy network according to the environment situation, the optimal course angle and the rewarding value in each sample data to be processed, the environment situation predicted value, the rewarding predicted value and the optimal course angle predicted value, and obtaining an optimized strategy network according to training results;
the optimized strategy network is used for predicting the current time optimal course angle of the current unmanned aerial vehicle based on the current time environment situation of the current unmanned aerial vehicle so that the current unmanned aerial vehicle can execute obstacle avoidance tasks according to the current time optimal course angle.
10. An unmanned aerial vehicle keeps away barrier device, characterized in that includes:
The acquisition unit is used for acquiring the current environmental situation of the current unmanned aerial vehicle at the current moment;
The decision unit is used for inputting the current time environment situation into an optimized strategy network to obtain the current time optimal course angle of the current unmanned aerial vehicle;
The obstacle avoidance control unit is used for controlling the current unmanned aerial vehicle to execute an obstacle avoidance task according to the current time optimal course angle;
Wherein the optimized policy network is trained based on the network training method according to any one of claims 1 to 7.
CN202410447633.0A 2024-04-15 2024-04-15 Network training method, unmanned aerial vehicle obstacle avoidance method and device Active CN118034355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410447633.0A CN118034355B (en) 2024-04-15 2024-04-15 Network training method, unmanned aerial vehicle obstacle avoidance method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410447633.0A CN118034355B (en) 2024-04-15 2024-04-15 Network training method, unmanned aerial vehicle obstacle avoidance method and device

Publications (2)

Publication Number Publication Date
CN118034355A true CN118034355A (en) 2024-05-14
CN118034355B CN118034355B (en) 2024-08-13

Family

ID=90993552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410447633.0A Active CN118034355B (en) 2024-04-15 2024-04-15 Network training method, unmanned aerial vehicle obstacle avoidance method and device

Country Status (1)

Country Link
CN (1) CN118034355B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210165405A1 (en) * 2019-12-03 2021-06-03 University-Industry Cooperation Group Of Kyung Hee University Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same
CN113543068A (en) * 2021-06-07 2021-10-22 北京邮电大学 Forest area unmanned aerial vehicle network deployment method and system based on hierarchical clustering
US20220189312A1 (en) * 2019-10-30 2022-06-16 Wuhan University Of Technology Intelligent collision avoidance method for a swarm of unmanned surface vehicles based on deep reinforcement learning
WO2022242468A1 (en) * 2021-05-18 2022-11-24 北京航空航天大学杭州创新研究院 Task offloading method and apparatus, scheduling optimization method and apparatus, electronic device, and storage medium
CN116974299A (en) * 2023-08-10 2023-10-31 北京理工大学 Reinforced learning unmanned aerial vehicle track planning method based on delayed experience priority playback mechanism
CN117707207A (en) * 2024-02-06 2024-03-15 中国民用航空飞行学院 Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning
CN117705113A (en) * 2023-11-22 2024-03-15 南京邮电大学 Unmanned aerial vehicle vision obstacle avoidance and autonomous navigation method for improving PPO

Also Published As

Publication number Publication date
CN118034355B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Kahn et al. Uncertainty-aware reinforcement learning for collision avoidance
CN110989576B (en) Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN110520868B (en) Method, program product and storage medium for distributed reinforcement learning
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN112362066A (en) Path planning method based on improved deep reinforcement learning
JP7448683B2 (en) Learning options for action selection using meta-gradient in multi-task reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
CN111830822A (en) System for configuring interaction with environment
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Mohanty et al. Application of deep Q-learning for wheel mobile robot navigation
Mustafa Towards continuous control for mobile robot navigation: A reinforcement learning and slam based approach
CN113743603A (en) Control method, control device, storage medium and electronic equipment
CN118034355B (en) Network training method, unmanned aerial vehicle obstacle avoidance method and device
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN113589810B (en) Dynamic autonomous obstacle avoidance movement method and device for intelligent body, server and storage medium
CN115081612A (en) Apparatus and method to improve robot strategy learning
KR20230010746A (en) Training an action selection system using relative entropy Q-learning
CN115293334B (en) Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning
Cao et al. A New Deep Reinforcement Learning Based Robot Path Planning Algorithm without Target Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240828

Address after: 100190 No. 55 East Zhongguancun Road, Beijing, Haidian District

Patentee after: ACADEMY OF MATHEMATICS AND SYSTEM SCIENCE, CHINESE ACADEMY OF SCIENCES

Country or region after: China

Patentee after: BEIJING INSTITUTE OF TECHNOLOGY

Patentee after: BEIHANG University

Address before: 100190 No. 55 East Zhongguancun Road, Beijing, Haidian District

Patentee before: ACADEMY OF MATHEMATICS AND SYSTEM SCIENCE, CHINESE ACADEMY OF SCIENCES

Country or region before: China