CN115903880A - Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning - Google Patents

Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning

Info

Publication number: CN115903880A
Authority: CN (China)
Prior art keywords: experience, unmanned aerial vehicle, obstacle, obs
Application number: CN202211002222.8A
Other languages: Chinese (zh)
Inventors: 祝小平, 王飞, 祝宁华
Current Assignee: Xian Aisheng Technology Group Co Ltd
Original Assignee: Xian Aisheng Technology Group Co Ltd
Application filed by Xian Aisheng Technology Group Co Ltd
Priority to CN202211002222.8A
Publication of CN115903880A
Legal status: Pending

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning, and provides an autonomous image navigation and obstacle avoidance method based on images and an experience-pool storage mechanism, namely the FRDDM-DQN method. In the invention, an agent meeting the requirements is trained with the FRDDM-DQN method; when a task is executed, the trained agent controls the unmanned aerial vehicle to realize autonomous image navigation and obstacle avoidance. Advantages: by introducing the Faster R-CNN model into the DQN algorithm and converting the recognition results of the Faster R-CNN model, the unmanned aerial vehicle obtains autonomous image navigation and obstacle avoidance capability in complex environments; by introducing the proposed experience-pool storage mechanism into the DQN algorithm, this capability is further improved; and by training in separate stages, the retraining time when the application scenario changes is reduced.

Description

Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning
Technical Field
The invention belongs to the field of unmanned aerial vehicle applications, and relates to an unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning.
Background
An unmanned aerial vehicle is an autonomous or semi-autonomous aircraft with high maneuverability, good concealment, and good adaptability, and can replace humans in some dangerous military tasks. For example, drones may perform land and sea search, rescue, and reconnaissance tasks instead of humans. In some military scenarios, communication between the drone and the ground station may be jammed, so the drone cannot execute tasks under manual control and must have autonomous navigation and obstacle avoidance capability. In addition, to cope with military scenarios in which airborne radar cannot be used or fails, the drone should be able to avoid obstacles using various other sensors. Airborne optoelectronic equipment (such as an onboard camera) is small and light and is therefore widely used on drones, in particular reconnaissance drones and reconnaissance-strike drones. The drone should therefore be able to navigate and avoid obstacles autonomously using images captured by the onboard camera.
Currently, some studies implement image-based navigation and obstacle avoidance. In "An unmanned aerial vehicle visual image algorithm, an obstacle avoidance step and an information fusion processing system thereof" (patent, publication number CN112286230A, publication date 2020.11.13), navigation and obstacle avoidance of a drone are realized through images acquired by an onboard camera. When an obstacle is encountered, the algorithm handles the obstacle avoidance task and the navigation task separately: the obstacle avoidance algorithm computes how to avoid the obstacle, and the navigation task returns to the original route to continue the mission after the obstacle has been avoided. As a result, the action selected by the algorithm may avoid the obstacle but is not necessarily optimal for the navigation task, i.e., the selected action is not globally optimal. In "Deep Learning-Based Monocular Obstacle Avoidance for Unmanned Aerial Vehicle Navigation in Tree Plantations: Faster Region-Based Convolutional Neural Network Approach" (Journal of Intelligent & Robotic Systems (2021) 101:5), a Faster R-CNN model is used to extract obstacles from images and realize obstacle avoidance for unmanned aerial vehicles. However, the obstacle avoidance strategy in that algorithm is built on human experience, and a limited strategy built from human experience is not necessarily optimal in all cases. Therefore, current image-based navigation and obstacle avoidance algorithms suffer from action selection strategies that are not globally optimal.
In addition, the autonomous navigation and obstacle avoidance problem can also be solved with reinforcement-learning-based methods. Such methods need no prior knowledge, and the agent can gradually find a globally optimal strategy suited to the current rules (reward function) during training. In "The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method" (Knowledge-Based Systems (2020) 196), autonomous navigation and obstacle avoidance of an unmanned surface vehicle are realized in a simple simulation environment with a reinforcement learning algorithm, namely a DQN method with a convolutional neural network. In "A robot obstacle avoidance method based on a Double DQN network and deep reinforcement learning" (patent, publication number CN109407676A, publication date 2019.03.01), obstacle avoidance of a robot is realized with a reinforcement learning algorithm, namely the Double DQN algorithm. Because the agent is optimized with experience randomly sampled from the experience pool, when the size of the scene to which these two algorithms are applied changes, for example when the scene in which the unmanned aerial vehicle executes its task is large while the navigation speed of the unmanned aerial vehicle is low relative to the scene size, the proportions of the various types of experience in the experience pool change. This limits further improvement of the agent's training effect in the later stage of training and may even cause the training of the agent to fail.
Therefore, it is of great significance to design a method that can accomplish the autonomous image navigation and obstacle avoidance task of the unmanned aerial vehicle in complex scenes and has a good application effect.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides an unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning. It is mainly intended for unmanned aerial vehicles executing low-altitude flight tasks such as reconnaissance and search and rescue, enabling them to avoid obstacles while executing the task; here, obstacle avoidance mainly means preventing the unmanned aerial vehicle from entering the no-fly zones generated by ground obstacles. Currently, image-based autonomous navigation and obstacle avoidance algorithms have the problem that the output action is not globally optimal, and reinforcement-learning-based autonomous navigation and obstacle avoidance algorithms have the problem of poor training performance in the scenario of the invention. Therefore, the invention provides a method based on improved reinforcement learning; with this method an agent with stronger autonomous image navigation and obstacle avoidance capability can be obtained, and this agent controls the unmanned aerial vehicle to execute the task.
Technical scheme
An unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning is characterized by comprising the following steps:
step 1: unmanned aerial vehicle autonomous image navigation and obstacle avoidance problem modeling;
1. setting a kinematic model of the unmanned aerial vehicle;
$$\dot{x}_u(t) = V\cos\gamma(t)\cos\chi(t),\quad \dot{y}_u(t) = V\cos\gamma(t)\sin\chi(t),\quad \dot{z}_u(t) = V\sin\gamma(t),\quad \dot{\gamma}(t) = u_\gamma,\quad \dot{\chi}(t) = u_\chi$$
wherein P_u = [x_u(t), y_u(t), z_u(t)] is the position of the unmanned aerial vehicle, V is its speed, χ(t) and γ(t) are its heading angle and climb angle respectively, and [u_γ, u_χ] is the control quantity of the unmanned aerial vehicle;
2. arrival definition;
the location of the destination is P_g = [x_g(t), y_g(t), z_g(t)]^T, and the radius of the destination influence area is R_g; the distance D_g between the unmanned aerial vehicle and the destination is defined as
$$D_g = \sqrt{(x_u(t)-x_g(t))^2+(y_u(t)-y_g(t))^2+(z_u(t)-z_g(t))^2}$$
when D_g ≤ R_g, the unmanned aerial vehicle has arrived at the destination;
3. defining collision;
position of the obstacle is P obs =[x obs (t)y obs (t)z obs (t)] T The radius of a no-fly zone generated by the barrier is R obs (ii) a Distance D between unmanned aerial vehicle and obstacle obs Is defined as
Figure BDA0003807897280000032
When D is present obs <R obs When the unmanned aerial vehicle enters a no-fly zone generated by the barrier, the unmanned aerial vehicle collides with the barrier;
4. out-of-bound definition;
when the unmanned aerial vehicle executes the task, the flying range is
P_range = {(x, y, z) | X_min ≤ x ≤ X_max, Y_min ≤ y ≤ Y_max, H_min ≤ z ≤ H_max}
when
$$(x_u(t), y_u(t), z_u(t)) \notin P_{range}$$
the unmanned aerial vehicle is out of bounds;
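The following Python sketch illustrates these definitions: the kinematic update and the arrival, collision, and out-of-bounds checks. The 3-DOF kinematic form, the Euler integration step dt, and all helper names are assumptions made for illustration, not part of the patent text.

```python
import numpy as np

V = 42.0    # speed in m/s (value taken from the embodiment below)
DT = 1.0    # assumed integration step in seconds

def step_kinematics(state, u_gamma, u_chi, dt=DT):
    """One Euler step of the kinematic model; state = (x, y, z, gamma, chi)."""
    x, y, z, gamma, chi = state
    x += V * np.cos(gamma) * np.cos(chi) * dt
    y += V * np.cos(gamma) * np.sin(chi) * dt
    z += V * np.sin(gamma) * dt
    gamma += u_gamma * dt
    chi += u_chi * dt
    return (x, y, z, gamma, chi)

def arrived(p_u, p_g, r_g=2000.0):
    """Arrival: distance to the destination within its influence radius R_g."""
    return np.linalg.norm(np.asarray(p_u) - np.asarray(p_g)) <= r_g

def collided(p_u, p_obs, r_obs=1500.0):
    """Collision: the UAV has entered the no-fly zone of radius R_obs."""
    return np.linalg.norm(np.asarray(p_u) - np.asarray(p_obs)) < r_obs

def out_of_bounds(p_u, x_lim, y_lim, h_lim):
    """Out of bounds: the UAV position leaves the flight range P_range."""
    x, y, z = p_u
    return not (x_lim[0] <= x <= x_lim[1]
                and y_lim[0] <= y <= y_lim[1]
                and h_lim[0] <= z <= h_lim[1])
```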
step 2: image s acquired from onboard camera o Extracting position information of the obstacle;
1. identification of images s by the Faster R-CNN model o The obstacle in (1);
Figure BDA0003807897280000034
wherein obs posImage Is the recognition result of the Faster R-CNN model, and the subscript i represents the ith obstacle recognized by the Faster R-CNN model; x is the number of i,1 ,y i,1 And x i,2 ,y i,2 Respectively representing the coordinates of the upper left corner and the lower right corner of the obstacle;
2. processing the recognition result of the Faster R-CNN model;
obs'_pos = (x'_o, y'_o) = (τ_1 × x_oInImage, -1 × τ_1 × y_oInImage)
U'_pos = (x'_U, y'_U) = (τ_1 × x_image/2, -(τ_1 × y_image + d_c))
wherein x_image and y_image are the dimensions of the image, τ_1 is the scale of the image, and d_c is the distance between the unmanned aerial vehicle and the field-of-view frame;
3. the position information of the obstacle is
$$\theta'_o = \chi' - \arctan\frac{y'_o - y'_U}{x'_o - x'_U}$$
$$D'_{OtoU} = \sqrt{(x'_o - x'_U)^2 + (y'_o - y'_U)^2}$$
wherein θ'_o is the obstacle-unmanned aerial vehicle forward angle, D'_OtoU is the distance between the unmanned aerial vehicle and the obstacle, and χ' is the relative heading angle of the unmanned aerial vehicle in the field-of-view frame;
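As an illustration of this conversion, the Python sketch below maps one Faster R-CNN bounding box to the relative distance and forward angle; the sign convention for θ'_o and the function names are assumptions.

```python
import math

def obstacle_from_bbox(bbox, x_image, y_image, tau1, d_c, chi_prime=math.pi / 2):
    """Convert a detection box (x1, y1, x2, y2) in image pixels into the
    (distance, forward angle) of the obstacle relative to the UAV (step 2)."""
    x1, y1, x2, y2 = bbox
    # centre of the bounding box = relative position of the obstacle in the image
    x_in_img, y_in_img = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # scaled obstacle and UAV positions in the field-of-view frame
    xo, yo = tau1 * x_in_img, -tau1 * y_in_img
    xu, yu = tau1 * x_image / 2.0, -(tau1 * y_image + d_c)
    d_oto_u = math.hypot(xo - xu, yo - yu)
    theta_o = chi_prime - math.atan2(yo - yu, xo - xu)   # assumed sign convention
    return d_oto_u, theta_o

# example with the embodiment's tau1 = 2.5 and d_c = 624 (image size assumed 640x480):
# obstacle_from_bbox((300, 200, 340, 260), 640, 480, 2.5, 624)
```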
and step 3: experience logging mechanism in formulating training agents
1. The agent
The structure of the agent decision network is 29 × 512 × 128 × 6, where 29 is the number of input nodes and 6 is the number of output nodes;
2. agent input s'(t)
Suppose the position of the drone is P_u = [x_u(t), y_u(t), z_u(t)] and the pre-specified destination position is P_g = [x_g(t), y_g(t), z_g(t)]^T; the distance D_g between the drone and the destination and the lead angle θ_g_XOY of the destination relative to the drone in the XOY plane are defined as:
$$D_g = \sqrt{(x_u(t)-x_g(t))^2+(y_u(t)-y_g(t))^2+(z_u(t)-z_g(t))^2}$$
$$\theta_{g\_XOY} = \arctan\frac{y_g(t)-y_u(t)}{x_g(t)-x_u(t)}$$
the input to the agent is: s'(t) = [z_u(t), H_UtoG(t), D_g(t), θ_g_XOY(t), χ(t), s'_o(t)]
wherein H_UtoG(t) is the altitude difference between the unmanned aerial vehicle and the destination, and s'_o = [D'_OtoU, θ'_o];
3. defining the reward function r_U
the arrival reward of the unmanned aerial vehicle is defined as
$$r_{arrived} = \begin{cases} +1, & D_g \le R_g \\ 0, & \text{otherwise} \end{cases}$$
the collision reward of the unmanned aerial vehicle is defined as
$$r_{collision} = \begin{cases} -1, & D_{obs} < R_{obs} \\ 0, & \text{otherwise} \end{cases}$$
the out-of-bounds reward of the unmanned aerial vehicle is defined as
$$r_{out} = \begin{cases} -1, & (x_u, y_u, z_u) \notin P_{range} \\ 0, & \text{otherwise} \end{cases}$$
Thus, the reward function r_U is r_U(s(t+1), a_U) = r_arrived + r_collision + r_out
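A minimal sketch of this reward, reusing the check helpers from the kinematics sketch above (helper names are assumptions):

```python
def reward(p_u_next, p_g, obstacles, x_lim, y_lim, h_lim):
    """r_U(s(t+1), a_U) = r_arrived + r_collision + r_out, evaluated on the new state."""
    r_arrived = 1.0 if arrived(p_u_next, p_g) else 0.0
    r_collision = -1.0 if any(collided(p_u_next, p) for p in obstacles) else 0.0
    r_out = -1.0 if out_of_bounds(p_u_next, x_lim, y_lim, h_lim) else 0.0
    return r_arrived + r_collision + r_out
```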
4. Classification of experiences
In the training process of the agent, the experience RM stored in the experience pool is
RM = {RM(i) | RM(i) = (s_i(a-), a_U, r_U, s_i(a+)), i < RM_Capacity}
wherein i is the index of the current experience in the experience pool; s_i(a-) and s_i(a+) represent the state before and after executing action a_U, respectively; RM_Capacity is the capacity of the experience pool;
in a single experience, the task condition represented by the state s_i(t) is defined as the tuple
[e_o, e_c, e_out, e_g]
obtained from the specified state, where e_o, e_c, e_out, and e_g are parameters used to describe the task condition: e_o describes whether the agent has detected an obstacle; e_c describes whether the agent has collided; e_out describes whether the agent is out of bounds; e_g describes whether the agent has reached the destination;
during the training of the unmanned aerial vehicle, the state s_i(t) of the agent can be divided into the following categories: the state s_safe in which the drone has not detected an obstacle, has not collided, is not out of bounds, and has not reached the destination; the state s_obs in which the drone has detected an obstacle but has not collided, is not out of bounds, and has not reached the destination; the state s_collision in which the drone has collided with an obstacle; the state s_out in which the drone is out of bounds; and the state s_arrival in which the drone has reached the destination; namely:
s_i(t) ∈ {s_safe, s_obs, s_collision, s_out, s_arrival}
$$s_{safe} = \{s_i(t) \mid e_o = 0,\ e_c = 0,\ e_{out} = 0,\ e_g = 0\}$$
$$s_{obs} = \{s_i(t) \mid e_o = 1,\ e_c = 0,\ e_{out} = 0,\ e_g = 0\}$$
$$s_{collision} = \{s_i(t) \mid e_c = 1\}$$
$$s_{out} = \{s_i(t) \mid e_{out} = 1\}$$
$$s_{arrival} = \{s_i(t) \mid e_g = 1\}$$
Thus, any experience RM(i) = (s_i(a-), a_U, r_U, s_i(a+)) is divided into the following categories:
(1) Result experience RE: divided into the arrival experience RE_arrival, the collision experience RE_collision, and the out-of-bounds experience RE_out, namely:
RE = {RE_arrival, RE_collision, RE_out}, RE ∈ RM
RE_arrival = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_arrival} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_arrival}}
RE_collision = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_collision} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_collision}}
RE_out = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_out} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_out}}
(2) Danger experience DE: indicates that the agent has detected an obstacle, namely:
DE = {RM(i) | {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_safe} ∪ {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_obs} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_obs}}
(3) Safety experience SE: intermediate states in which the unmanned aerial vehicle is far from obstacles while navigating to the destination, namely:
SE = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_safe}}
5. treatment of experiences
Set storage ratios p_RE, p_DE, and p_SE for the RE-type, DE-type, and SE-type experiences, respectively;
during training, the generated experiences are classified according to the experience-type definitions and randomly screened according to the storage ratio of their type: part of the experiences are stored in the experience pool and the rest are discarded; after adjustment by the experience-pool storage mechanism, the numbers of the various types of experience in the experience pool RM' satisfy:
|RM'| = p_RE × |RE| + p_DE × |DE| + p_SE × |SE|
wherein |·| is the number of experiences of the specified type in the experience pool;
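The sketch below shows one way to implement this classification and ratio-based storage in Python. The storage ratios are the values given in the embodiment; the state labels, function names, and data layout are assumptions.

```python
import random
from collections import deque

P_STORE = {"SE": 0.16, "DE": 0.05, "RE": 1.0}   # storage ratios from the embodiment

def classify(label_before: str, label_after: str) -> str:
    """Labels are 'safe', 'obs', 'collision', 'out' or 'arrival' (see step 3)."""
    if label_after in ("arrival", "collision", "out"):
        return "RE"   # result experience: last step of an episode
    if "obs" in (label_before, label_after):
        return "DE"   # danger experience: an obstacle is detected
    return "SE"       # safety experience: safe -> safe

def maybe_store(exp, label_before, label_after, pool: deque) -> bool:
    """Store exp with the probability assigned to its type; return True if stored."""
    if random.random() < P_STORE[classify(label_before, label_after)]:
        pool.append(exp)
        return True
    return False

# pool = deque(maxlen=300_000)   # RM_Capacity from the embodiment
```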
and 4, step 4: training agent according to FRDDM-DQN algorithm
1. train a Faster R-CNN model to identify the specified obstacles;
initialize the Faster R-CNN model with the pre-trained VGG16 model;
set the initial learning rate, decay coefficient, and weight decay of the Faster R-CNN model;
acquire images containing obstacles with the unmanned aerial vehicle, and label the positions and types of the obstacles in the images;
train the Faster R-CNN model with the obstacle images and the corresponding labels;
after training, a Faster R-CNN model that identifies the obstacles is obtained;
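For orientation only, this is a minimal fine-tuning sketch using torchvision's Faster R-CNN implementation. It uses the ResNet-50-FPN backbone provided by torchvision rather than the VGG16 initialization described here, and the data loader is a placeholder.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# two classes: background + "obstacle"
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

det_optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """loader yields (images, targets); targets = [{'boxes': Tensor[N,4], 'labels': Tensor[N]}]."""
    model.train().to(device)
    for images, targets in loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss = sum(model(images, targets).values())   # sum of the detection losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```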
2. train the agent based on the output of the Faster R-CNN model;
Step 2.1: initializing relevant parameters
set the reward function r_U, the experience-pool storage mechanism, and the experience storage ratio p = p_SE : p_DE : p_RE;
initialize the experience pool capacity RM_Capacity, the discount factor γ, the maximum number of steps per episode T_e, the maximum number of effective training steps T_t, and the network update frequency C;
initialize the initial exploration rate ε, the minimum exploration rate ε_min, the exploration-rate reset period N, and the exploration-rate reset value ε_reset;
initialize the initial learning rate α, the segmented learning rates [α_1, α_2, α_3, α_4], and the boundaries of the segmented learning rate (n_1, n_2, n_3, n_4);
the decision network of the agent is divided into a prediction network and a target network; initialize the prediction network Q and the target network Q̂ with parameters θ and θ⁻;
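A sketch of the 29 × 512 × 128 × 6 decision network and its target copy in PyTorch; the layer sizes come from the text above, while the activation function and variable names are assumptions.

```python
import copy
import torch.nn as nn

class DecisionNet(nn.Module):
    """29 x 512 x 128 x 6 fully connected Q-network (ReLU activations assumed)."""
    def __init__(self, n_in=29, n_out=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_out),
        )

    def forward(self, s):
        return self.net(s)   # one Q-value per discrete action

q_net = DecisionNet()                 # prediction network Q with parameters theta
target_net = copy.deepcopy(q_net)     # target network with parameters theta^-
```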
Step 2.2: initializing a training scene;
initializing the starting point and the destination position of the unmanned aerial vehicle, and initializing the position of the obstacle;
reset the number of steps executed in the episode t_e and the number of effective training steps t_t to 0;
acquiring an initial state s' (t);
Step 2.3: select action a_U according to the state s'(t);
draw a random number p ∈ [0, 1]; if p > ε, select the action according to the prediction network Q; otherwise, select a random action;
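A one-function sketch of this ε-greedy selection (PyTorch; names reused from the network sketch above):

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions=6):
    """Select a random action with probability epsilon, otherwise the greedy action."""
    if random.random() > epsilon:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax())
    return random.randrange(n_actions)
```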
step 2.4: performing action a U Then, the prize r is acquired U And the new state s '(t + 1) and obtaining the experience RM = (s' (t), a) generated at the current time step U ,r U ,s′(t+1));
Step 2.5: processing the experience RM;
storing the experience RM into an experience pool or discarding the experience RM according to an experience pool storing mechanism;
step 2.6: updating the learning rate:
the learning rate adopts a segmented fixed learning rate, and the learning rate is updated according to an adjustment strategy;
step 2.7: updating the exploration rate;
updating the exploration rate according to the exploration rate updating strategy;
step 2.8: performing network optimization;
if the network optimization is not executed, the step 2.9 is carried out; otherwise, executing network optimization:
randomly sampling m groups of experiences from an experience pool;
if the experience is an end experience, the predicted Q value y = r of the target network U (ii) a If the experience is not over, the predicted Q value of the target network is set as
Figure BDA0003807897280000071
Calculation of loss L (θ) = E (y-Q (s (t), a) U (t),θ));
Optimizing a parameter theta of the prediction network according to the loss value L (theta) through a gradient descent algorithm;
covering the target network with the parameters of the prediction network every C significant steps, i.e. theta - =θ;
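A sketch of this optimization step in PyTorch, following the target and loss formulas above and reusing q_net/target_net from the network sketch; the optimizer choice (Adam) and the batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.95                                                   # discount factor from the embodiment
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)      # optimizer choice assumed

def optimize(batch):
    """batch: list of (s, a, r, s_next, done) tuples sampled from the experience pool."""
    s, a, r, s_next, done = map(torch.as_tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s(t), a_U(t); theta)
    with torch.no_grad():
        # for end experiences done = 1, so y = r_U
        y = r.float() + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done.float())
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Called every C effective steps: theta^- <- theta."""
    target_net.load_state_dict(q_net.state_dict())
```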
Step 2.9: update the state s'(t) ← s'(t+1) and the step counter t_e ← t_e + 1;
Step 2.10: judging the training state;
if the number of effective steps t_t ≥ T_t, end the training and save the agent obtained at this moment; otherwise, judge whether at the current time step the unmanned aerial vehicle has reached the destination, has collided, is out of bounds, or has reached the maximum number of steps per episode T_e; if so, end the current episode and go to step 2.2; otherwise, go to step 2.3;
Step 5: control the unmanned aerial vehicle to perform autonomous image navigation and obstacle avoidance through the agent saved in step 4.
In step 2: during the execution of a task, x_image, y_image, and d_c are fixed values, and χ' is constant.
The values and meanings of e_g in step 3, which describes whether the agent has reached the destination, are as follows:
$$e_g = \begin{cases} 1, & \text{the agent has reached the destination} \\ 0, & \text{otherwise} \end{cases}$$
The experience-pool storage mechanism in step 3 is described as follows:
(Algorithm table, shown as an image in the original: each newly generated experience is classified as RE, DE, or SE and stored with probability p_RE, p_DE, or p_SE respectively, otherwise it is discarded.)
The adjustment strategy for updating the learning rate in step 3 is as follows:
(Algorithm table, shown as an image in the original: the learning rate α is switched among the segment values [α_1, α_2, α_3, α_4] according to which interval of the boundaries (n_1, n_2, n_3, n_4) contains the current number of effective training steps.)
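A sketch of a segmented fixed learning-rate schedule of this kind, using the segment values and boundaries listed in the embodiment; the exact switching rule is an assumption, since the original presents it only as an image.

```python
LR_VALUES = [0.001, 0.0005, 0.0001, 0.00005]           # alpha_1 .. alpha_4 (embodiment)
LR_BOUNDS = [0, 100_000, 200_000, 300_000, 350_000]    # n boundaries (embodiment)

def learning_rate(t_effective: int) -> float:
    """Return the fixed learning rate of the segment containing t_effective."""
    for alpha, lo, hi in zip(LR_VALUES, LR_BOUNDS[:-1], LR_BOUNDS[1:]):
        if lo <= t_effective < hi:
            return alpha
    return LR_VALUES[-1]
```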
The exploration-rate update strategy in step 3 is as follows:
(Algorithm table, shown as an image in the original: the exploration rate ε is decayed from its initial value towards the minimum ε_min as the number of effective training steps increases, and is reset to ε_reset every N effective steps.)
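A possible implementation of such an exploration-rate schedule; the constants come from the embodiment, while the multiplicative decay form is an assumption.

```python
EPS_INIT, EPS_MIN, EPS_RESET = 1.0, 0.001, 0.5
RESET_PERIOD = 100_000      # reset period N, in effective steps
DECAY = 0.99995             # assumed decay factor per effective step

def update_epsilon(epsilon: float, t_effective: int) -> float:
    """Decay epsilon on every effective step and reset it every RESET_PERIOD effective steps."""
    if t_effective > 0 and t_effective % RESET_PERIOD == 0:
        return EPS_RESET
    return max(EPS_MIN, epsilon * DECAY)
```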
advantageous effects
The invention provides an unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning, namely the FRDDM-DQN method, which combines image-based perception with an experience-pool storage mechanism. An agent meeting the requirements is trained with the FRDDM-DQN method; during task execution, the trained agent controls the unmanned aerial vehicle to realize autonomous image navigation and obstacle avoidance.
In the FRDDM-DQN method, the images acquired by the onboard camera are first processed with a Faster R-CNN model: obstacle information is extracted from the images and converted according to the kinematic characteristics of the unmanned aerial vehicle. The obstacle information is then added to the input state and training of the agent begins. During training it was found that the numbers of the different types of experience in the experience pool differ widely, which affects the convergence of the agent. Based on the way experience is generated when the unmanned aerial vehicle executes its task, an experience-pool storage mechanism is proposed. During training, the mechanism sets a storage ratio for each type of experience, i.e., only part of each type of experience is stored and the rest is discarded. Small batches of experience are then randomly sampled from the optimized experience pool to optimize the network. Finally, an agent with strong autonomous image navigation and obstacle avoidance capability in complex, unknown scenes is trained.
Advantages:
1. The autonomous image navigation and obstacle avoidance capability of the unmanned aerial vehicle in complex environments is obtained by introducing the Faster R-CNN model into the DQN algorithm and converting the recognition results of the Faster R-CNN model.
In the invention, a Faster R-CNN model is added to the DQN algorithm. Because the Faster R-CNN model has strong image recognition capability, the DQN algorithm combined with it can preliminarily realize image-based autonomous navigation and obstacle avoidance for the unmanned aerial vehicle. In addition, the output of the Faster R-CNN model is converted: the obstacle coordinate information output by the model is converted into angle and distance information according to the kinematic characteristics of the unmanned aerial vehicle, so that the agent trained by the algorithm can control the unmanned aerial vehicle better.
2. The autonomous image navigation and obstacle avoidance capability of the unmanned aerial vehicle in complex environments is improved by introducing the proposed experience-pool storage mechanism into the DQN algorithm.
Because the proposed experience-pool storage mechanism, which classifies the experiences generated during training and assigns a corresponding storage ratio to each experience type, is added to the DQN algorithm, the proposed method trains an agent with stronger autonomous image navigation and obstacle avoidance capability than the plain DQN algorithm.
3. The retraining time when the application scenario changes is reduced by the two-stage training method.
The training process is executed in two separate parts: training the Faster R-CNN model for image recognition, and training FRDDM-DQN based on the output of the Faster R-CNN model. This reduces the retraining time when the application scenario changes: only the Faster R-CNN model needs to be retrained to recognize the specified obstacles, and FRDDM-DQN does not need to be retrained.
Drawings
FIG. 1: three-dimensional task scene of the unmanned aerial vehicle.
FIG. 2: two-dimensional task scene of the unmanned aerial vehicle.
FIG. 3: relative positional relationship of the unmanned aerial vehicle, the obstacle, and the onboard camera field-of-view frame.
FIG. 4: flow chart of the FRDDM-DQN method provided by the invention.
FIG. 5: obstacle recognition results of the Faster R-CNN model.
FIG. 6: arrival-rate curves of the proposed FRDDM-DQN method and the FR-DQN method during training.
FIG. 7: test results, at different training stages, of agents trained with the proposed FRDDM-DQN method and the FR-DQN method.
FIG. 8: trajectories of agents trained with the FRDDM-DQN method and the FR-DQN method in an environment containing multiple static obstacles.
FIG. 9: trajectories of agents trained with the FRDDM-DQN method and the FR-DQN method in an environment containing multiple dynamic obstacles.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
step 1: modeling autonomous image navigation and obstacle avoidance problems of the unmanned aerial vehicle;
in order to realize the autonomous image navigation and obstacle avoidance functions of the unmanned aerial vehicle, the problem is defined first. Because the method provided by the invention is a reinforcement learning algorithm, the core elements of reinforcement learning (state, action, and reward function) are also defined;
step 1-1: the unmanned aerial vehicle autonomous image navigation and obstacle avoidance problem is defined, and the scene is shown in fig. 1 and fig. 2;
in the invention, the task executed by the unmanned aerial vehicle is to navigate from a starting point to a specified destination as quickly as possible while avoiding the no-fly zones generated by ground obstacles during the task;
the position of the unmanned aerial vehicle is P_u = [x_u(t), y_u(t), z_u(t)]^T, its speed is fixed at V, and its heading angle and climb angle are χ(t) and γ(t), respectively. To complete the task well, the unmanned aerial vehicle should fly within the altitude range (H_min, H_max), where H_min and H_max are the minimum and maximum flight altitudes of the unmanned aerial vehicle;
the location of the destination is P_g = [x_g(t), y_g(t), z_g(t)]^T; the distance D_g between the unmanned aerial vehicle and the destination and the lead angle θ_g_XOY of the destination relative to the drone in the XOY plane are defined as:
$$D_g = \sqrt{(x_u(t)-x_g(t))^2+(y_u(t)-y_g(t))^2+(z_u(t)-z_g(t))^2}$$
$$\theta_{g\_XOY} = \arctan\frac{y_g(t)-y_u(t)}{x_g(t)-x_u(t)}$$
the influence area of the destination is defined by the radius R_g; when D_g ≤ R_g, the unmanned aerial vehicle is considered to have arrived at the destination;
the position of the obstacle is P_obs = [x_obs(t), y_obs(t), z_obs(t)]^T, and the radius of the no-fly zone generated by the obstacle is R_obs. The distance D_obs between the unmanned aerial vehicle and the obstacle is defined as
$$D_{obs} = \sqrt{(x_u(t)-x_{obs}(t))^2+(y_u(t)-y_{obs}(t))^2+(z_u(t)-z_{obs}(t))^2}$$
when D_obs < R_obs, the unmanned aerial vehicle has entered the no-fly zone generated by the obstacle and is considered to have collided with the obstacle;
step 1-2: setting a kinematics model of the unmanned aerial vehicle;
$$\dot{x}_u(t) = V\cos\gamma(t)\cos\chi(t),\quad \dot{y}_u(t) = V\cos\gamma(t)\sin\chi(t),\quad \dot{z}_u(t) = V\sin\gamma(t),\quad \dot{\gamma}(t) = u_\gamma,\quad \dot{\chi}(t) = u_\chi$$
where a_U = [u_γ, u_χ] is the control quantity of the agent;
Step 1-3: setting the state s (t) of the agent;
the available state information of the unmanned aerial vehicle is: the state of the drone s_U, the destination information s_g, and the image information s_o. The state of the drone is acquired by GPS and a gyroscope, s_U = [x_u(t), y_u(t), z_u(t), V, γ(t), χ(t)]; the destination information s_g is pre-specified before the task is executed, s_g = [x_g(t), y_g(t), z_g(t)]; the image information s_o is acquired by the onboard camera and is mainly used to guide the unmanned aerial vehicle to avoid obstacles. Considering the kinematics of the drone, the input state s(t) is defined as
s(t) = [z_u(t), H_UtoG(t), D_g(t), θ_g_XOY(t), χ(t), s_o(t)]
wherein H_UtoG(t) is the altitude difference between the drone and the destination;
step 1-4: setting the action a_U(t) of the agent;
according to the input state s(t) defined in the previous step, the action is defined as a_U(t) = [u_γ, u_χ], where u_χ and u_γ control the heading angle and the climb angle, respectively;
step 1-5: setting the reward function r_U of the agent;
during training, the agent selects action a_U according to the current state s(t); after executing a_U, it reaches a new state s(t+1) and obtains a reward value from the environment. The reward value is the output of the reward function r_U.
In order for the agent to know that it should navigate to the destination, the reward concerning whether the drone has arrived at the destination is set to
$$r_{arrived} = \begin{cases} +1, & D_g \le R_g \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone arrives at the destination, the reward value is set to +1, otherwise 0. During navigation, in order for the agent to know that a collision should be avoided when approaching an obstacle, the reward concerning collisions is set to
$$r_{collision} = \begin{cases} -1, & D_{obs} < R_{obs} \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone collides with an obstacle, the reward is set to -1, otherwise 0. In addition, to guarantee the mission effect, the unmanned aerial vehicle should fly within the altitude range (H_min, H_max); to improve the training efficiency of the agent and to prevent it from detouring around the outermost side of the obstacle area, a horizontal boundary is set in the XOY plane. In order for the drone to know that it should navigate within the boundary, the reward for going out of bounds is set to
$$r_{out} = \begin{cases} -1, & (x_u, y_u, z_u) \notin P_{range} \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone is out of bounds, the reward value is set to -1, otherwise 0. The method provided by the invention is a reinforcement learning algorithm, whose essence is to let the agent find the optimal strategy in the process of interacting with the environment. Therefore, no other reward is set for intermediate states. In summary, the reward function is set to
r_U(s(t+1), a_U) = r_arrived + r_collision + r_out
Step 2: extracting obstacle information from image information acquired by a airborne camera;
in the state s (t) defined in steps 1-3, s o Is the image information collected by the onboard camera. To get s o The method is used for guiding the intelligent body to avoid the obstacle, and obstacle information in the image needs to be extracted through a Faster R-CNN model. For more efficient training, further processing of the output information of the Faster R-CNN model is required.
Step 2-1: identifying an obstacle in the image through a Faster R-CNN model;
in the input state s(t) of the agent, the only information about obstacles is the image information s_o collected by the onboard camera of the unmanned aerial vehicle. The obstacles possibly present in the image are identified with a Faster R-CNN model, whose output is
$$\{(b_i,\ obs_{posImage,i})\},\quad b_i = (x_{i,1}, y_{i,1}, x_{i,2}, y_{i,2}),\quad obs_{posImage,i} = \left(\frac{x_{i,1}+x_{i,2}}{2},\ \frac{y_{i,1}+y_{i,2}}{2}\right)$$
wherein the subscript i denotes the i-th anchor identified by the Faster R-CNN model; b_i is the bounding box (rectangular box) of the i-th anchor, and x_{i,1}, y_{i,1} and x_{i,2}, y_{i,2} are the coordinates of the upper-left and lower-right corners of the anchor bounding box, as shown in the figure; obs_posImage is the coordinate of the centre of the anchor bounding box, i.e., the relative coordinates of the centre of the obstacle in the image, and is also the recognition result of the Faster R-CNN model.
Step 2-2: processing the recognition result of the Faster R-CNN model, as shown in FIG. 3;
Because the recognition result obs_posImage of the Faster R-CNN model gives the relative coordinates of the obstacle in the image, the agent cannot directly use this information to avoid the obstacle, so the information is processed further. obs_posImage is converted to the relative coordinates obs'_pos between the drone and the obstacle, i.e.
obs'_pos = (x'_o, y'_o) = (τ_1 × x_oInImage, -1 × τ_1 × y_oInImage)
wherein x_image and y_image are the dimensions of the image and τ_1 is the scale of the image; after the onboard camera is mounted and fixed to the drone, these can be regarded as fixed values during the execution of the mission. In addition, the positional relationship between the drone and the onboard camera field-of-view frame is fixed. Therefore, in the relative positional relationship between the drone and the field-of-view frame, the position of the drone can be regarded as fixed, with coordinates
U'_pos = (x'_U, y'_U) = (τ_1 × x_image/2, -(τ_1 × y_image + d_c))
wherein d_c is the distance between the drone and the field-of-view frame. After the camera is fixed to the drone, x_image, y_image, and d_c can be regarded as fixed values during the execution of the mission. Because the agent controls the unmanned aerial vehicle in the current task scenario, inputting coordinate information directly is not conducive to the convergence of the agent. Therefore, the positional relationship between the drone and the obstacle in the image is converted into the obstacle-drone forward angle θ'_o and the distance D'_OtoU between the drone and the obstacle:
$$\theta'_o = \chi' - \arctan\frac{y'_o - y'_U}{x'_o - x'_U}$$
$$D'_{OtoU} = \sqrt{(x'_o - x'_U)^2 + (y'_o - y'_U)^2}$$
where χ' is the relative heading angle of the drone in the field-of-view frame; this value is constant after the onboard camera is fixed to the drone.
Finally, the new state obtained after this processing is
s'(t) = [z_u(t), H_UtoG(t), D_g(t), θ_g_XOY(t), χ(t), s'_o(t)]
wherein s'_o = [D'_OtoU, θ'_o].
Step 3: screening the experience stored in the experience pool through the experience-pool storage mechanism;
the agent is trained with the new state s'(t) containing the obstacle information obtained in step 2. During training, although the number of experiences in the experience pool is large, the various types of experience are not evenly distributed. The experience-pool storage mechanism provided by the invention analyses each type of experience in the experience pool and sets an appropriate storage ratio for it; only part of each type of experience is stored in the experience pool and the rest is discarded. In this way, the number of each type of experience in the experience pool is rebalanced, which helps improve the training effect of the agent.
Step 3-1: defining a single experience;
in the training process of the agent, the experience RM stored in the experience pool is
RM = {RM(i) | RM(i) = (s_i(a-), a_U, r_U, s_i(a+)), i < RM_Capacity}
wherein i is the index of the current experience in the experience pool; s_i(a-) and s_i(a+) represent the state before and after executing action a_U, respectively; RM_Capacity is the capacity of the experience pool. The state s_i(a+) of the i-th experience is the state s_{i+1}(a-) of the (i+1)-th experience.
In a single experience, the task condition represented by the state s_i(t) can be defined as the tuple
[e_o, e_c, e_out, e_g]
obtained from the specified state, where e_o, e_c, e_out, and e_g are parameters used to describe the task condition: e_o describes whether the agent has detected an obstacle; e_c describes whether the agent has collided; e_out describes whether the agent is out of bounds; e_g describes whether the agent has reached the destination. Each parameter takes the value 1 when the corresponding condition holds and 0 otherwise.
during the training of the unmanned aerial vehicle, the state s_i(t) of the agent can be divided into the following categories: the state s_safe in which the drone has not detected an obstacle, has not collided, is not out of bounds, and has not reached the destination; the state s_obs in which the drone has detected an obstacle but has not collided, is not out of bounds, and has not reached the destination; the state s_collision in which the drone has collided with an obstacle; the state s_out in which the drone is out of bounds; and the state s_arrival in which the drone has reached the destination. Namely:
s_i(t) ∈ {s_safe, s_obs, s_collision, s_out, s_arrival}
$$s_{safe} = \{s_i(t) \mid e_o = 0,\ e_c = 0,\ e_{out} = 0,\ e_g = 0\}$$
$$s_{obs} = \{s_i(t) \mid e_o = 1,\ e_c = 0,\ e_{out} = 0,\ e_g = 0\}$$
$$s_{collision} = \{s_i(t) \mid e_c = 1\}$$
$$s_{out} = \{s_i(t) \mid e_{out} = 1\}$$
$$s_{arrival} = \{s_i(t) \mid e_g = 1\}$$
step 3-2: classifying the experience RM (i) and setting an experience storing ratio;
Any experience RM(i) = (s_i(a-), a_U, r_U, s_i(a+)) can be classified into the following categories:
(1) The result experience RE: this type of experience is generated in the last step of each episode. It can be further divided into the arrival experience RE_arrival, the collision experience RE_collision, and the out-of-bounds experience RE_out, namely:
RE = {RE_arrival, RE_collision, RE_out}, RE ∈ RM
RE_arrival = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_arrival} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_arrival}}
RE_collision = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_collision} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_collision}}
RE_out = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_out} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_out}}
All three end experiences in the result experience RE have a non-zero reward value, through which the agent can learn an effective policy: through the positive reward the agent learns that it should navigate to the destination, and through the negative rewards it learns that collisions and going out of bounds should be avoided. In addition, since end experiences are generated only at the end of each episode, the number of result experiences RE in the experience pool is small. Therefore, a higher storage ratio p_RE should be set for RE-type experiences.
(2) The danger experience DE: this type of experience indicates that the agent has detected an obstacle, i.e., the unmanned aerial vehicle is approaching the obstacle. At this moment the agent must decide, according to the current state, whether to avoid the obstacle and, if so, which obstacle avoidance action to execute after taking the destination information into account. The danger experience DE is defined as
DE = {RM(i) | {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_safe} ∪ {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_obs} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_obs}}
where the experience {RM(i) | s_i(a-) ∈ s_obs, s_i(a+) ∈ s_safe} indicates that an obstacle was detected before the action and not after it; the experience {RM(i) | s_i(a-) ∈ s_safe, s_i(a+) ∈ s_obs} indicates that no obstacle was detected before the action and an obstacle was detected after it; and the experience {RM(i) | s_i(a-) ∈ s_obs, s_i(a+) ∈ s_obs} indicates that obstacles were detected both before and after the action;
Through RE-type experience the agent learns that collisions with obstacles should be avoided when approaching them; through DE-type experience the agent learns how to avoid collisions when approaching an obstacle. This type of experience is generated whenever the drone is near an obstacle during training. As training progresses and the agent understands the rules better, it navigates between obstacles more often, so the number of this type of experience in the final experience pool is also large. Therefore, a lower storage ratio p_DE is set for DE-type experiences.
(3) The safety experience SE: this type of experience is an intermediate state in which the drone is far from any obstacle during its journey to the destination, i.e., SE = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_safe}}
This type of experience helps the agent learn how to navigate to the destination so that it eventually reaches it (and obtains a positive reward). Because the scene in which the drone performs the task is large, this type of experience is the most numerous. Therefore, a lower storage ratio p_SE should be set for the safety experience SE.
step 3-3: processing the experience generated in the training process with the set storage ratios to obtain the optimized experience pool RM';
based on the above analysis, different storage ratios are set for the various types of experience. During training, part of each type of experience is stored in the experience pool according to its storage ratio and the rest is discarded. After adjustment by the experience-pool storage mechanism, the numbers of the various types of experience in the experience pool satisfy:
|RM'| = p_RE × |RE| + p_DE × |DE| + p_SE × |SE|
where | is the number of specified experiences in the experience pool.
Step 4: training the agent that realizes autonomous image navigation and obstacle avoidance of the unmanned aerial vehicle with the FRDDM-DQN method;
in step 2, the image s_o used for obstacle avoidance in the agent's input state was processed and a new state s'(t) that can help the agent avoid obstacles was obtained; in step 3, the experience-pool storage mechanism was proposed and the storage rules for the experiences generated in training were specified. In this step, the agent is trained with the optimized DQN algorithm, namely the FRDDM-DQN method, whose structure is shown in FIG. 4.
The training of the FRDDM-DQN method is divided into two parts: training a Faster R-CNN model to identify the specified obstacles, and training the FRDDM-DQN method based on the output of the Faster R-CNN model;
step 4-1: training the Faster R-CNN model to identify the specified obstacles;
initialize the Faster R-CNN model with the pre-trained VGG16 model;
set the initial learning rate, decay coefficient, and weight decay of the Faster R-CNN model;
acquire images containing obstacles with the unmanned aerial vehicle, and label the positions and types of the obstacles in the acquired images;
train the Faster R-CNN model with the obstacle images and the corresponding labels;
after training, a Faster R-CNN model capable of identifying the obstacles is obtained; the recognition result is shown in FIG. 5;
step 4-2: training FRDDM-DQN based on the output of the Faster R-CNN model;
step 4-2-1: establishing the decision network of the agent in the FRDDM-DQN method and initializing the relevant parameters;
build the decision network of the agent, set the reward function r_U, the experience-pool storage mechanism, and the experience storage ratio p = p_SE : p_DE : p_RE;
initialize the experience pool capacity RM_Capacity, the learning rate α, the discount factor γ, the exploration rate ε, the minimum exploration rate ε_min, the exploration-rate reset period N, the exploration-rate reset value ε_reset, the maximum number of steps per episode T_e, the maximum number of effective training steps T_t, and the network update frequency C;
in the training of the FRDDM-DQN method, the decision network of the agent is divided into a prediction network and a target network; initialize the prediction network Q and the target network Q̂ with parameters θ and θ⁻;
Step 4-2-2: initializing a training scene;
initializing the starting point and the destination position of the unmanned aerial vehicle, and initializing the position of an obstacle;
reset the number of steps executed in the episode t_e and the number of effective training steps t_t to 0;
acquiring an initial state s' (t);
step 4-2-3: select action a_U according to the action selection strategy and the state s'(t);
step 4-2-4: after executing action a_U, obtain the reward r_U and the new state s'(t+1), and form the experience RM = (s'(t), a_U, r_U, s'(t+1)) generated at the current time step;
step 4-2-5: screen the experience according to the experience storage mechanism proposed in step 3;
if the experience is stored in the experience pool, the time step is an effective step, t_t ← t_t + 1; if the experience is not stored in the experience pool, it is discarded, the time step is not an effective step, and t_t ← t_t;
Step 4-2-6: changing the learning rate alpha according to a learning rate adjusting strategy;
step 4-2-7: changing the exploration rate epsilon according to an exploration rate adjusting strategy;
step 4-2-8: judging whether the current time step executes network optimization or not;
if the network optimization is not executed, the step 4-2-10 is carried out; if network optimization is performed:
randomly sample m groups of experiences from the experience pool RM' and compute on them:
if the experience is an end experience, the target Q value is y = r_U; if the experience is not an end experience, the target Q value is
$$y = r_U + \gamma \max_{a'} \hat{Q}(s'(t+1), a', \theta^-)$$
calculate the loss L(θ) = E[(y - Q(s(t), a_U(t), θ))²];
optimize the parameters θ of the prediction network by gradient descent according to the loss value L(θ);
every C effective steps, overwrite the target network with the parameters of the prediction network, i.e. θ⁻ ← θ;
Step 4-2-9: update the state and the step counter: s'(t) ← s'(t+1), t_e ← t_e + 1;
Step 4-2-10: judging the training state;
if the number of effective steps t_t ≥ T_t, go to step 4-2-11; otherwise, judge whether at the current time step the unmanned aerial vehicle has reached the destination, has collided, or is out of bounds; if so, end the current episode and go to step 4-2-2; otherwise, judge whether the number of steps executed in the episode t_e has reached the maximum number of steps per episode T_e; if t_e ≥ T_e, go to step 4-2-2; otherwise, go to step 4-2-3 and continue training;
Step 4-2-11: after training is finished, save the currently trained network; this network is the decision network of the trained agent; go to step 5;
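Putting the pieces together, a skeleton of this training loop might look as follows; it reuses the helpers sketched earlier (select_action, maybe_store, learning_rate, update_epsilon, optimize, sync_target, q_net, optimizer) and assumes an environment interface env.reset()/env.step() that also returns the state label used for experience classification.

```python
import random

def set_lr(opt, lr):
    for group in opt.param_groups:
        group["lr"] = lr

def train(env, pool, T_t=350_000, T_e=600, C=3000, m=64):
    epsilon, t_t = EPS_INIT, 0
    while t_t < T_t:
        state, label = env.reset()                          # step 4-2-2: new random scene
        for _ in range(T_e):
            a = select_action(q_net, state, epsilon)        # step 4-2-3
            next_state, r, done, next_label = env.step(a)   # step 4-2-4
            exp = (state, a, r, next_state, done)
            if maybe_store(exp, label, next_label, pool):   # step 4-2-5: effective step
                t_t += 1
                set_lr(optimizer, learning_rate(t_t))       # step 4-2-6
                epsilon = update_epsilon(epsilon, t_t)      # step 4-2-7
                if len(pool) >= m:
                    optimize(random.sample(list(pool), m))  # step 4-2-8
                if t_t % C == 0:
                    sync_target()
            state, label = next_state, next_label           # step 4-2-9
            if done or t_t >= T_t:
                break
    return q_net
```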
Step 5: the agent trained and saved in step 4-2-11 controls the unmanned aerial vehicle to perform autonomous image navigation and obstacle avoidance;
the trained agent is tested in different scenarios: the starting point and destination of the unmanned aerial vehicle are generated randomly, the initial positions of the obstacles are generated randomly, and the motion directions of the dynamic obstacles are generated randomly;
the specific embodiment is as follows:
step 1: unmanned aerial vehicle autonomous image navigation and obstacle avoidance problem modeling;
step 1-1: defining the unmanned aerial vehicle autonomous image navigation and obstacle avoidance problem;
the position of the unmanned aerial vehicle is P_u = [x_u(t), y_u(t), z_u(t)]^T, its heading angle and climb angle are χ(t) and γ(t), respectively, and its speed is fixed at V = 42 m/s;
the location of the destination is P_g = [x_g(t), y_g(t), z_g(t)]^T, the radius of the destination influence area is R_g = 2000 m, the lead angle of the destination relative to the drone in the XOY plane is θ_g_XOY, and the distance between the unmanned aerial vehicle and the destination is D_g; when D_g ≤ R_g, the unmanned aerial vehicle is considered to have arrived at the destination;
the position of the obstacle is P_obs = [x_obs(t), y_obs(t), z_obs(t)]^T, and the radius of the no-fly zone generated by the obstacle is R_obs = 1500 m; the distance between the unmanned aerial vehicle and the obstacle is D_obs; when D_obs < R_obs, the unmanned aerial vehicle has entered the no-fly zone generated by the obstacle and is considered to have collided with the obstacle;
step 1-2: setting a kinematic model of the unmanned aerial vehicle;
$$\dot{x}_u(t) = V\cos\gamma(t)\cos\chi(t),\quad \dot{y}_u(t) = V\cos\gamma(t)\sin\chi(t),\quad \dot{z}_u(t) = V\sin\gamma(t),\quad \dot{\gamma}(t) = u_\gamma,\quad \dot{\chi}(t) = u_\chi$$
where a_U = [u_γ, u_χ] is the control quantity of the agent;
Step 1-3: setting the state s (t) of the intelligent agent;
the available state information of the unmanned aerial vehicle is as follows: state s of the drone U Destination information s g Image information s o . Wherein the state of the drone is acquired by a GPS and a gyroscope, s U =[x u (t),y u (t),z u (t),V,γ(t),χ(t)](ii) a Destination information s g Is pre-specified before execution of the task, s g =[x g (t),y g (t),z g (t)](ii) a Image information s o The image is acquired by an airborne camera and is mainly used for guiding the unmanned aerial vehicle to avoid obstacles;
considering the kinematics of the drone, the input state s (t) is defined as
s(t)=[z u (t),H UtoG (t),D g (t),θ g_XOY (t),χ(t),s o (t)]
Wherein H UtoG (t) is the altitude difference of the drone and the destination;
step 1-4: setting the action a_U(t) of the agent;
according to the input state s(t) defined in the previous step, the action is defined as a_U(t) = [u_γ, u_χ], where u_χ and u_γ control the heading angle and the climb angle, respectively;
step 1-5: setting the reward function r_U of the agent;
in order for the agent to know that it should navigate to the destination, the reward concerning whether the drone has arrived at the destination is set to
$$r_{arrived} = \begin{cases} +1, & D_g \le R_g \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone arrives at the destination, the reward value is set to +1, otherwise 0.
During navigation, in order for the agent to know that a collision should be avoided when approaching an obstacle, the reward concerning collisions is set to
$$r_{collision} = \begin{cases} -1, & D_{obs} < R_{obs} \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone collides with an obstacle, the reward is set to -1, otherwise 0.
In order for the drone to know that it should navigate within the boundary, the reward for going out of bounds is set to
$$r_{out} = \begin{cases} -1, & (x_u, y_u, z_u) \notin P_{range} \\ 0, & \text{otherwise} \end{cases}$$
That is, when the drone is out of bounds, the reward value is set to -1, otherwise 0.
The method provided by the invention is a reinforcement learning algorithm, whose essence is to let the agent independently find the optimal strategy while interacting with the environment. Therefore, no other reward is set for intermediate states. In summary, the reward function is set to
r_U(s(t+1), a_U) = r_arrived + r_collision + r_out
Step 2: training an intelligent body for realizing unmanned aerial vehicle autonomous image navigation and obstacle avoidance through an FRDDM-DQN method;
step 2-1: for image information s in input state s (t) o Carrying out treatment;
step 2-1-1: training fast-R CNN model recognition image information s o The obstacle specified in (1);
initializing a Faster-R CNN model through a pre-training model VGG 16;
setting the initial learning rate of the Faster-R CNN model to 0.001, the delay coefficient to 0.1 and the delay weight to 0.0005;
acquiring an image containing a barrier by an unmanned aerial vehicle, and marking the position of the barrier and the type of the barrier in the acquired image;
training a Faster-R CNN model through an image containing an obstacle and corresponding labeling information;
obtaining a fast-R CNN model capable of identifying the obstacle after the training is finished, wherein the output of the model is obs posImage . As a result of recognition, as shown in FIG. 5, the gray dots are the obstacles recognized by the Faster-R CNN model, the letters and numbers above the gray dots mean that the type of the currently recognized object is an obstacle, and the object is an obstacleThe probability of an object;
step 2-1-2: converting the output obs_posImage of the Faster R-CNN model;
obs_posImage is converted to the relative coordinates obs'_pos between the drone and the obstacle, i.e.
obs'_pos = (x'_o, y'_o) = (τ_1 × x_oInImage, -1 × τ_1 × y_oInImage)
wherein τ_1 is the scale of the image; in this embodiment τ_1 = 2.5.
The relative position U'_pos of the unmanned aerial vehicle is calculated from the positional relationship between the unmanned aerial vehicle and the onboard camera field-of-view frame:
U'_pos = (x'_U, y'_U) = (τ_1 × x_image/2, -(τ_1 × y_image + d_c))
wherein d_c is the distance between the unmanned aerial vehicle and the field-of-view frame; in this embodiment d_c = 624.
The positional relationship between the unmanned aerial vehicle and the obstacle in the image is converted into the obstacle-drone forward angle θ'_o and the distance D'_OtoU between the drone and the obstacle:
$$\theta'_o = \chi' - \arctan\frac{y'_o - y'_U}{x'_o - x'_U}$$
$$D'_{OtoU} = \sqrt{(x'_o - x'_U)^2 + (y'_o - y'_U)^2}$$
wherein χ' is the relative heading angle of the unmanned aerial vehicle in the field-of-view frame; in this embodiment χ' = 90°, i.e., the onboard camera is mounted facing the front of the unmanned aerial vehicle;
after this processing, s'_o = [D'_OtoU, θ'_o] and the new state is
s'(t) = [z_u(t), H_UtoG(t), D_g(t), θ_g_XOY(t), χ(t), s'_o(t)]
Step 2-2: training FRDDM-DQN based on the output value of the Faster-R CNN model;
step 2-2-1: establishing a decision network of an agent in an FRDDM-DQN method and initializing related parameters;
in this embodiment, the decision network structure of the agent is 29 × 512 × 128 × 6, where 29 is the number of input nodes and 6 is the number of output nodes; the reward function is set to r_U(s(t+1), a_U) = r_arrived + r_collision + r_out; the relevant parameters are set as follows:
hyper-parameter Value of
Number of samples m: 64
Experience pool capacity RM_Capacity: 300,000
Attenuation coefficient γ: 0.95
Segmented learning rate α = (α_1, α_2, α_3, α_4): 0.001, 0.0005, 0.0001, 0.00005
Boundaries of the segmented learning rate [n_1, n_2, n_3, n_4]: [0, 100000, 200000, 300000, 350000]
Initial exploration rate ε_0: 1.0
Minimum exploration rate ε_min: 0.001
Exploration rate reset period N: 100000
Exploration rate reset value ε_reset: 0.5
Maximum number of steps per episode T_e: 600
Maximum number of effective steps T_t: 350,000
Target network update frequency C: 3000
Data storage ratio p = p_SE : p_DE : p_RE: [0.16 : 0.05 : 1]
In the training of the FRDDM-DQN method, the decision network of the agent is divided into a prediction network and a target network. Initialize the prediction network Q with parameters θ and the target network Q̂ with parameters θ⁻.
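A minimal sketch of these two networks follows, assuming a PyTorch implementation (the framework is not specified here) and ReLU activations (an assumption); the layer sizes follow the 29 × 512 × 128 × 6 structure given above.

```python
import copy
import torch.nn as nn

class DecisionNet(nn.Module):
    """29 x 512 x 128 x 6 fully connected decision network."""
    def __init__(self, n_in: int = 29, n_out: int = 6):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_out),          # one Q value per discrete action
        )

    def forward(self, s):
        return self.layers(s)

prediction_net = DecisionNet()               # parameters theta
target_net = copy.deepcopy(prediction_net)   # parameters theta^-, initially equal to theta
```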
Step 2-2-2: initializing a training scene;
initialize the start point and destination position of the unmanned aerial vehicle, and initialize the positions of the obstacles;
reset the number of executed steps of the current episode t_e and the number of effective training steps t_t to 0;
acquire the initial state s′(t);
Step 2-2-3: select an action a_U according to the action selection policy and the state s′(t); in this embodiment, the ε-greedy algorithm is adopted as the action selection strategy;
Step 2-2-4: execute the action a_U, acquire the reward r_U and the new state s′(t+1), and obtain the experience RM = (s′(t), a_U, r_U, s′(t+1)) generated at the current time step;
Step 2-2-5: screen the experience with the experience-storing mechanism provided by the invention; the screening process is as follows:
[Algorithm image: the experience-storing mechanism classifies the newly generated experience as a safety (SE), danger (DE), or result (RE) experience and keeps it with probability equal to the corresponding storage ratio p_SE, p_DE, or p_RE, returning the flag F_storage; not reproduced]
If the experience is stored in the experience pool (i.e. F_storage = True), the time step is an effective step and t_t ← t_t + 1; if the experience is not stored in the experience pool (i.e. F_storage = False), it is discarded, the time step is not an effective step, and t_t ← t_t.
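The following is a minimal, illustrative sketch of this experience-storing mechanism (the patent gives no code). The storage ratios come from the parameter table; the state flags used by classify (obstacle_seen, collided, out, arrived) are hypothetical stand-ins for the state categories s_safe, s_obs, s_collision, s_out and s_arrival defined later in the claims.

```python
import random
from collections import deque

STORAGE_RATIO = {"SE": 0.16, "DE": 0.05, "RE": 1.0}   # p_SE : p_DE : p_RE from the table
replay_memory = deque(maxlen=300_000)                 # RM_Capacity

def classify(experience) -> str:
    """Map an experience to SE / DE / RE using simplified state flags."""
    s_before, _, _, s_after = experience[:4]
    if s_after.get("collided") or s_after.get("out") or s_after.get("arrived"):
        return "RE"                                   # result experience
    if s_before.get("obstacle_seen") or s_after.get("obstacle_seen"):
        return "DE"                                   # danger experience
    return "SE"                                       # safety experience

def maybe_store(experience) -> bool:
    """Return F_storage: True if the experience was kept (an effective step)."""
    if random.random() < STORAGE_RATIO[classify(experience)]:
        replay_memory.append(experience)
        return True
    return False
```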
Step 2-2-6: changing the learning rate alpha according to a learning rate adjusting strategy;
in this embodiment, the learning rate is a fixed learning rate in segments, and the adjustment strategy is as follows:
during training, the learning rate is taken as α_1 while t_t ≤ n_1, as α_2 while n_1 < t_t ≤ n_2, as α_3 while n_2 < t_t ≤ n_3, and as α_4 while n_3 < t_t ≤ n_4.
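A minimal sketch of this segmented fixed learning rate, using the values from the parameter table; reading the five listed boundary values as a lower bound 0 followed by n_1..n_4 is an assumption.

```python
ALPHAS = [0.001, 0.0005, 0.0001, 0.00005]          # alpha_1 .. alpha_4
BOUNDARIES = [100_000, 200_000, 300_000, 350_000]  # n_1 .. n_4 (assumed reading of the table)

def learning_rate(t_t: int, is_training: bool, alpha: float) -> float:
    """Piecewise-constant learning rate as a function of effective steps t_t."""
    if is_training:
        for a, n in zip(ALPHAS, BOUNDARIES):
            if t_t <= n:
                return a
    return alpha   # unchanged outside training or beyond the last boundary
```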
step 2-2-7: changing the exploration rate epsilon according to an exploration rate adjusting strategy;
because the invention provides the concept of effective steps, the search rate epsilon changing mode in the epsilon-greedy algorithm is optimized as follows:
[Algorithm image: the exploration rate ε decays as the number of effective steps t_t increases, is reset to ε_reset every N effective steps, and is never allowed to fall below ε_min; not reproduced]
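A sketch of the exploration-rate adjustment under stated assumptions: the exact schedule is given above only as an algorithm image, so a linear decay over effective steps is assumed here; the reset period, reset value, and minimum come from the parameter table.

```python
EPS_0, EPS_MIN, EPS_RESET = 1.0, 0.001, 0.5   # initial, minimum, and reset values
RESET_PERIOD_N = 100_000                      # reset period in effective steps

def exploration_rate(t_t: int) -> float:
    """Epsilon as a function of effective steps t_t (linear decay assumed)."""
    if t_t > 0 and t_t % RESET_PERIOD_N == 0:
        return EPS_RESET                                      # periodic reset
    start = EPS_0 if t_t < RESET_PERIOD_N else EPS_RESET      # restart point after a reset
    frac = (t_t % RESET_PERIOD_N) / RESET_PERIOD_N
    return max(start - (start - EPS_MIN) * frac, EPS_MIN)     # never below eps_min
```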
step 2-2-8: judging whether the current time step executes network optimization or not;
if network optimization is not executed, go to step 2-2-10; if network optimization is executed:
randomly sample m groups of experiences from the experience pool and compute, for each of the m experiences:
if the experience is a terminal experience, the target Q value of the target network is y = r_U; if the experience is not a terminal experience, the target Q value of the target network is
y = r_U + γ · max_a′ Q̂(s′(t+1), a′; θ⁻)
compute the loss L(θ) = E[(y - Q(s(t), a_U(t); θ))²];
optimize the parameters θ of the prediction network with a gradient descent algorithm according to the loss value L(θ);
every C effective steps, overwrite the target network with the parameters of the prediction network, i.e. θ⁻ = θ;
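A sketch of one such network-optimization step follows, assuming PyTorch, reusing prediction_net and target_net from the network sketch above and the replay buffer from the experience-storing sketch, with the additional assumption that experiences are stored as (s, a, r, s_next, done) tuples of tensors (the done flag marking terminal experiences is an illustrative addition).

```python
import random
import torch
import torch.nn.functional as F

GAMMA, BATCH_SIZE, TARGET_SYNC_C = 0.95, 64, 3000
optimizer = torch.optim.SGD(prediction_net.parameters(), lr=0.001)  # plain gradient descent on theta

def optimize(replay_memory, effective_step: int) -> None:
    """One optimization step of the prediction network, with periodic target sync."""
    if len(replay_memory) < BATCH_SIZE:
        return
    batch = random.sample(list(replay_memory), BATCH_SIZE)
    s, a, r, s_next, done = (torch.stack(x) for x in zip(*batch))

    # Target Q value: y = r for terminal experiences, otherwise
    # y = r + gamma * max_a' Q_target(s', a'; theta^-)
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)

    q = prediction_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)                  # squared TD error as the loss L(theta)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if effective_step % TARGET_SYNC_C == 0:  # theta^- <- theta every C effective steps
        target_net.load_state_dict(prediction_net.state_dict())
```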
Step 2-2-9: update the state and the step count: s′(t) ← s′(t+1), t_e ← t_e + 1;
Step 2-2-10: judging a training state;
if the number of effective steps t_t ≥ T_t, training is finished, the prediction network (or the target network) is the decision network of the trained agent, and go to step 2-2-11; otherwise, judge whether the drone has reached the destination, collided, or gone out of bounds at the current time step; if so, the current episode ends, and go to step 2-2-2; otherwise, compare the number of executed steps of the episode t_e with the maximum number of steps per episode T_e; if t_e ≥ T_e, go to step 2-2-2; otherwise, go to step 2-2-3 and continue training;
step 2-2-11: and after the training is finished, saving the currently trained network. The network at this time is the decision network of the trained agent.
To demonstrate the superiority of the proposed FRDDM-DQN method during training, its training curve is shown in FIG. 6. For comparison, the training curve of the FR-DQN method in the same training environment is also shown, where the FR-DQN method is the DQN method combined with the Faster R-CNN model. In FIG. 6, during the first 5000 episodes the performance of the agent trained by the FRDDM-DQN method does not improve significantly, while the performance of the FR-DQN agent improves gradually; after 5000 episodes, the performance of the agent trained by the FRDDM-DQN method begins to improve steadily and its final arrival rate rises to 83%, whereas the performance of the agent trained by the FR-DQN method shows no obvious further improvement and its arrival rate keeps fluctuating around 75%. The agent trained by the FRDDM-DQN method therefore reaches a higher final arrival rate. In addition, because the ε-greedy strategy is used for action selection during training, part of the agent's actions are chosen at random, so the training curve cannot fully reflect the training effect of the agent. Therefore, the agent's decision network is saved every 5000 episodes during training. Each saved agent is then tested for 500 episodes, and the test results are shown in FIG. 7; the same comparative experiment is performed for the FR-DQN method. As shown in FIG. 7, during the first 15000 episodes of training the FR-DQN method, which performs network optimization at every step, obtains a better training effect, while the training effect of the FRDDM-DQN method improves more slowly; after 15000 episodes, the arrival rate of the agent trained by the FR-DQN method fluctuates around 70%, whereas the arrival rate of the agent trained by the FRDDM-DQN method keeps increasing and finally rises to 93%. Combining FIG. 6 and FIG. 7, it can be concluded that the FRDDM-DQN method provided by the invention performs better during training.
After training, go to step 3 to test the trained agent;
Step 3: test the agent saved in step 2-2-11 in a three-dimensional environment containing multiple obstacles;
To verify the performance of the agent trained by the proposed FRDDM-DQN method, it is tested in the three-dimensional scenes containing multiple static obstacles and dynamic obstacles shown in FIG. 8 and FIG. 9. For comparison, the agent trained by the FR-DQN method is also tested. In FIG. 8 and FIG. 9, sub-figures (a)-(f) show the scene at different times. The small spheres represent the no-fly zones generated by the obstacles, and the black dot at the center of each small sphere is the position of the obstacle; the large sphere represents the destination of the drone; the black lines represent the navigation path of the agent trained by the FRDDM-DQN method, and the gray lines represent the navigation path of the agent trained by the FR-DQN method; the black lines on the obstacles in FIG. 9 represent the trajectories travelled by the obstacles.
In FIG. 8, during the initial stage of the test (FIG. 8 (a) and (b)), the agents trained by the FRDDM-DQN method and the FR-DQN method perform roughly the same. From FIG. 8 (c) onwards, once an obstacle is detected, the two exhibit different obstacle avoidance strategies. After detecting an obstacle, the agent trained by the FR-DQN method does not jointly consider how to avoid the obstacle and reach the destination quickly: it selects an avoidance strategy that avoids the obstacle but moves away from the destination (a detour), and finally reaches the destination in 672 seconds (FIG. 8 (f)). After detecting an obstacle, the agent trained by the FRDDM-DQN method jointly considers obstacle avoidance and rapid arrival, selects a strategy that satisfies both, and reaches the destination in 366 seconds (FIG. 8 (e)), far less time than the agent trained by the FR-DQN method.
In FIG. 9, during the initial stage of the test (FIG. 9 (a) and (b)), the agents trained by the FRDDM-DQN method and the FR-DQN method already show different strategies while navigating towards the destination. The navigation direction of the agent trained by the FRDDM-DQN method points directly at the destination, which is the fastest way to reach it when no obstacle information is available; the agent trained by the FR-DQN method selects a flight direction that is close to, but not the fastest towards, the destination. When the no-fly zones generated by two obstacles come close to each other, the agent trained by the FRDDM-DQN method accurately steers the drone between the no-fly zones (FIG. 9 (c)-(e)) and safely reaches the destination in 321 seconds (FIG. 9 (f)); the obstacle avoidance performance of the agent trained by the FR-DQN method is worse, and it collides with an obstacle at the 213th second (FIG. 9 (e)). Therefore, in scenes containing multiple static and dynamic obstacles, the agent trained by the proposed FRDDM-DQN method performs better.
The FRDDM-DQN method provided by the invention realizes autonomous image navigation and obstacle avoidance of the drone in complex environments. In the FRDDM-DQN method, obstacles in the images acquired by the onboard camera are first identified by the Faster R-CNN model, and the recognition results are converted according to the kinematic characteristics of the drone; second, during training, the numbers of the various experience types in the experience pool are adjusted by the proposed experience-pool storing mechanism, which alleviates the imbalance among experience types in the DQN algorithm. During training, the agent trained by the FRDDM-DQN method finds a better strategy; in the tests, the agent trained by the FRDDM-DQN method performs better. In summary, compared with the FR-DQN method, the FRDDM-DQN method provided by the invention improves the autonomous image navigation and obstacle avoidance capability of the drone in complex environments.

Claims (6)

1. An unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning is characterized by comprising the following steps:
step 1: unmanned aerial vehicle autonomous image navigation and obstacle avoidance problem modeling;
1. setting a kinematics model of the unmanned aerial vehicle;
[Equation image: the kinematic model of the unmanned aerial vehicle; not reproduced]
wherein P_u = [x_u(t), y_u(t), z_u(t)] is the position of the unmanned aerial vehicle, V is the speed of the unmanned aerial vehicle, χ(t) and γ(t) are respectively the heading angle and the climb angle of the unmanned aerial vehicle, and [u_γ, u_χ] is the control quantity of the unmanned aerial vehicle;
2. arrival definition;
the location of the destination is P g =[x g (t) y g (t) z g (t)] T The radius of the destination area of influence is R g (ii) a Distance D between unmanned aerial vehicle and destination g Is defined as
[Equation image: definition of the distance D_g between the unmanned aerial vehicle and the destination; not reproduced]
when D_g ≤ R_g, the unmanned aerial vehicle has arrived at the destination;
3. defining collision;
the position of the obstacle is P_obs = [x_obs(t), y_obs(t), z_obs(t)]^T, and the radius of the no-fly zone generated by the obstacle is R_obs; the distance D_obs between the unmanned aerial vehicle and the obstacle is defined as
[Equation image: definition of the distance D_obs between the unmanned aerial vehicle and the obstacle; not reproduced]
when D_obs < R_obs, the unmanned aerial vehicle enters the no-fly zone generated by the obstacle and collides with the obstacle;
4. out-of-bound definition;
when the unmanned aerial vehicle executes the task, the flight range is
P_range = {(x, y, z) | X_min ≤ x ≤ X_max, Y_min ≤ y ≤ Y_max, H_min ≤ z ≤ H_max}
when the position of the unmanned aerial vehicle lies outside P_range, the unmanned aerial vehicle is out of bounds;
step 2: extract the position information of the obstacles from the image s_o acquired by the onboard camera;
1. identify the obstacles in the image s_o with the Faster R-CNN model;
[Equation image: definition of the recognition result obs_posImage; not reproduced]
wherein obs_posImage is the recognition result of the Faster R-CNN model, and the subscript i represents the i-th obstacle recognized by the Faster R-CNN model; x_i,1, y_i,1 and x_i,2, y_i,2 respectively represent the coordinates of the upper-left corner and the lower-right corner of the obstacle;
2. process the recognition result of the Faster R-CNN model;
obs′_pos = (x′_o, y′_o) = (τ_1 × x_oInImage, (-1 × τ_1 × y_oInImage))
U′_pos = (x′_U, y′_U) = (τ_1 × x_image / 2, (-(τ_1 × y_image + d_c)))
wherein x_image and y_image are the dimensions of the image, τ_1 is the scale of the image, and d_c is the distance between the unmanned aerial vehicle and the field-of-view frame;
3. the position information of the obstacle is
[Equation images: definitions of θ′_o and D′_OtoU; not reproduced]
wherein θ′_o is the obstacle-unmanned aerial vehicle lead angle, D′_OtoU is the distance between the unmanned aerial vehicle and the obstacle, and χ′ is the relative heading angle of the unmanned aerial vehicle in the field-of-view frame;
step 3: formulate the experience-storing mechanism used when training the agent;
1. the agent:
the structure of the agent decision network is 29 × 512 × 128 × 6, where 29 is the number of input nodes and 6 is the number of output nodes;
2. the agent input s′(t):
suppose the position of the unmanned aerial vehicle is P_u = [x_u(t), y_u(t), z_u(t)] and the pre-designated destination position is P_g = [x_g(t), y_g(t), z_g(t)]^T; the distance D_g between the unmanned aerial vehicle and the destination and the lead angle θ_g_XOY of the unmanned aerial vehicle in the XOY plane are defined as:
[Equation images: definitions of D_g and θ_g_XOY; not reproduced]
the input of the agent is
s′(t) = [z_u(t), H_UtoG(t), D_g(t), θ_g_XOY(t), χ(t), s′_o(t)]
wherein H_UtoG(t) is the altitude difference between the unmanned aerial vehicle and the destination, and s′_o = [D′_OtoU, θ′_o];
3. define the reward function r_U;
define the arrival reward of the unmanned aerial vehicle as
[Equation image: definition of the arrival reward r_arrived; not reproduced]
define the collision reward of the unmanned aerial vehicle as
r_collision = -1 when the unmanned aerial vehicle collides with an obstacle, and r_collision = 0 otherwise;
define the out-of-bounds reward of the unmanned aerial vehicle as
r_out = -1 when the unmanned aerial vehicle goes out of bounds, and r_out = 0 otherwise;
thus, the reward function r_U is
r_U(s(t+1), a_U) = r_arrived + r_collision + r_out
4. classification of experiences;
during the training of the agent, the experiences RM stored in the experience pool are
RM = {RM(i) | RM(i) = (s_i(a-), a_U, r_U, s_i(a+)), i < RM_Capacity}
wherein the index i represents the number of the current experience in the experience pool; s_i(a-) and s_i(a+) respectively represent the state before executing action a_U and the state after executing action a_U, and RM_Capacity is the capacity of the experience pool;
in a single experience, the task condition of the state s_i(t) is defined as
[[s_i(t)]] = (e_o, e_c, e_out, e_g)
wherein [[·]] acquires the task condition represented by the specified state; e_o, e_c, e_out, e_g are parameters used to describe the task status: e_o describes whether the agent has detected an obstacle; e_c describes whether the agent has collided; e_out describes whether the agent is out of bounds; e_g describes whether the agent has reached the destination;
during the training of the unmanned aerial vehicle, the state s_i(t) of the agent can be divided into the following categories: the state s_safe in which the unmanned aerial vehicle has not detected an obstacle, has not collided, is not out of bounds, and has not reached the destination; the state s_obs in which the unmanned aerial vehicle has detected an obstacle, has not collided, is not out of bounds, and has not reached the destination; the state s_collision in which the unmanned aerial vehicle collides with an obstacle; the state s_out in which the unmanned aerial vehicle is out of bounds; and the state s_arrival in which the unmanned aerial vehicle reaches the destination; namely:
s_i(t) ∈ {s_safe, s_obs, s_collision, s_out, s_arrival}
s_safe = {s_i(t) | [[s_i(t)]] = (e_o = 0, e_c = 0, e_out = 0, e_g = 0)}
s_obs = {s_i(t) | [[s_i(t)]] = (e_o = 1, e_c = 0, e_out = 0, e_g = 0)}
s_collision = {s_i(t) | [[s_i(t)]] = (e_o ∈ {0,1}, e_c = 1, e_out = 0, e_g = 0)}
s_out = {s_i(t) | [[s_i(t)]] = (e_o ∈ {0,1}, e_c = 0, e_out = 1, e_g = 0)}
s_arrival = {s_i(t) | [[s_i(t)]] = (e_o ∈ {0,1}, e_c = 0, e_out = 0, e_g = 1)}
thus, any experience RM(i) = (s_i(a-), a_U, r_U, s_i(a+)) is divided into the following categories:
(1) result experience RE, divided into arrival experience RE_arrival, collision experience RE_collision, and out-of-bounds experience RE_out, namely:
RE = {RE_arrival, RE_collision, RE_out}, RE ∈ RM
RE_arrival = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_arrival} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_arrival}}
RE_collision = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_collision} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_collision}}
RE_out = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_out} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_out}}
(2) danger experience DE, representing that the agent has detected an obstacle, namely:
DE = {RM(i) | {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_safe} ∪ {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_obs} ∪ {s_i(a-) ∈ s_obs, s_i(a+) ∈ s_obs}}
(3) safety experience SE, the intermediate states in which the unmanned aerial vehicle navigates towards the destination away from obstacles, namely:
SE = {RM(i) | {s_i(a-) ∈ s_safe, s_i(a+) ∈ s_safe}}
5. processing of experiences;
set the storage ratios p_RE, p_DE, p_SE for the RE-type, DE-type, and SE-type experiences;
during training, classify each generated experience according to the definitions of the experience types, randomly screen the experiences according to the storage ratio of their type, store part of the experiences in the experience pool, and discard the rest; after adjustment by the experience-pool storing mechanism, the numbers of the various experience types in the experience pool RM′ satisfy
|RM′| = p_RE × |RE| + p_DE × |DE| + p_SE × |SE|
wherein |·| is the number of the specified experiences in the experience pool;
step 4: train the agent according to the FRDDM-DQN algorithm;
1. train the Faster R-CNN model to identify the specified obstacles;
initialize the Faster R-CNN model with the pre-trained VGG16 model;
set the initial learning rate, the decay coefficient, and the weight decay of the Faster R-CNN model;
acquire images containing obstacles with the unmanned aerial vehicle, and annotate the obstacle positions and obstacle types in the images;
train the Faster R-CNN model with the images containing obstacles and the corresponding annotation information;
after training, a Faster R-CNN model that recognizes the obstacles is obtained;
2. train the agent based on the output of the Faster R-CNN model;
step 2.1: initialize the relevant parameters;
set the reward function r_U, the experience-pool storing mechanism, and the experience storage ratio p = p_SE : p_DE : p_RE;
initialize the experience pool capacity RM_Capacity, the attenuation coefficient γ, the maximum number of steps per episode T_e, the maximum number of effective training steps T_t, and the target network update frequency C;
initialize the initial exploration rate ε, the minimum exploration rate ε_min, the exploration rate reset period N, and the exploration rate reset value ε_reset;
initialize the initial learning rate α, the segmented learning rates [α_1, α_2, α_3, α_4], and the boundaries of the segmented learning rate (n_1, n_2, n_3, n_4);
the decision network of the agent is divided into a prediction network and a target network; initialize the prediction network Q with parameters θ and the target network Q̂ with parameters θ⁻;
Step 2.2: initializing a training scene;
initializing the starting point and the destination position of the unmanned aerial vehicle, and initializing the position of an obstacle;
number of steps t executed in a single screen e Effective training step number t t Reset to 0;
acquiring an initial state s' (t);
step 2.3: select an action a_U according to the state s′(t);
draw a random number p from [0,1]; if p is larger than ε, select the action according to the prediction network Q; otherwise, select a random action;
step 2.4: execute the action a_U, acquire the reward r_U and the new state s′(t+1), and obtain the experience RM = (s′(t), a_U, r_U, s′(t+1)) generated at the current time step;
Step 2.5: processing the experience RM;
storing the experience RM into an experience pool or discarding the experience RM according to an experience pool storing mechanism;
step 2.6: updating the learning rate:
the learning rate adopts a segmented fixed learning rate, and the learning rate is updated according to an adjustment strategy;
step 2.7: updating the exploration rate;
updating the exploration rate according to the exploration rate updating strategy;
step 2.8: performing network optimization;
if the network optimization is not executed, the step 2.9 is carried out; otherwise, executing network optimization:
randomly sampling m groups of experiences from an experience pool;
if the experience is a terminal experience, the target Q value of the target network is y = r_U; if the experience is not a terminal experience, the target Q value of the target network is
y = r_U + γ · max_a′ Q̂(s′(t+1), a′; θ⁻)
compute the loss L(θ) = E[(y - Q(s(t), a_U(t); θ))²];
optimize the parameters θ of the prediction network according to the loss value L(θ) with a gradient descent algorithm;
every C effective steps, overwrite the target network with the parameters of the prediction network, i.e. θ⁻ = θ;
Step 2.9: update status s '(t) ← s' (t + 1), t e ←t e +1;
Step 2.10: judging a training state;
if the number of effective steps t_t ≥ T_t, end the training and save the agent at this moment; otherwise, continue to judge whether the unmanned aerial vehicle has reached the destination, collided, gone out of bounds, or reached the maximum number of steps per episode T_e at the current time step; if so, end the current episode and go to step 2.2; otherwise, go to step 2.3;
step 5: control the unmanned aerial vehicle to perform autonomous image navigation and obstacle avoidance with the agent saved in step 4.
2. The unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning of claim 1, wherein: in step 2, during the execution of a task, x_image, y_image, and d_c are fixed values, and χ′ is a constant.
3. The unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning of claim 1, wherein: the parameter e_g in step 3, which describes whether the agent reaches the destination, takes the value e_g = 0 when the unmanned aerial vehicle has not reached the destination and e_g = 1 when the unmanned aerial vehicle has reached the destination.
4. The unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning of claim 1, wherein the experience-pool storing mechanism is described as follows:
[Algorithm image: each newly generated experience is classified as an SE, DE, or RE experience and is stored in the experience pool with probability equal to the corresponding storage ratio p_SE, p_DE, or p_RE, otherwise it is discarded, returning the flag F_storage; not reproduced]
5. The unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning of claim 1, wherein the adjustment strategy for updating the learning rate is as follows:
Input: current total number of effective steps t_t; training flag isTraining
Output: learning rate α
Initialization: segmented learning rates [α_1, α_2, α_3, α_4]; boundaries of the segmented learning rate (n_1, n_2, n_3, n_4)
if isTraining and t_t ≤ n_1 do α ← α_1
else if isTraining and n_1 < t_t ≤ n_2 do α ← α_2
else if isTraining and n_2 < t_t ≤ n_3 do α ← α_3
else if isTraining and n_3 < t_t ≤ n_4 do α ← α_4
end if
Return α
6. The unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning of claim 1, wherein the exploration rate update strategy is as follows:
[Algorithm image: the exploration rate ε decays as the number of effective steps t_t increases, is reset to ε_reset every N effective steps, and is bounded below by ε_min; not reproduced]
CN202211002222.8A 2022-08-21 2022-08-21 Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning Pending CN115903880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002222.8A CN115903880A (en) 2022-08-21 2022-08-21 Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211002222.8A CN115903880A (en) 2022-08-21 2022-08-21 Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning

Publications (1)

Publication Number Publication Date
CN115903880A true CN115903880A (en) 2023-04-04

Family

ID=86479292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002222.8A Pending CN115903880A (en) 2022-08-21 2022-08-21 Unmanned aerial vehicle autonomous image navigation and obstacle avoidance method based on improved reinforcement learning

Country Status (1)

Country Link
CN (1) CN115903880A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116412831A (en) * 2023-06-12 2023-07-11 中国电子科技集团公司信息科学研究院 Multi-unmanned aerial vehicle dynamic obstacle avoidance route planning method for recall and anti-dive
CN116412831B (en) * 2023-06-12 2023-09-19 中国电子科技集团公司信息科学研究院 Multi-unmanned aerial vehicle dynamic obstacle avoidance route planning method for recall and anti-dive
CN116449874A (en) * 2023-06-13 2023-07-18 北京瀚科智翔科技发展有限公司 Modularized unmanned control refitting kit of piloted plane and construction method
CN116449874B (en) * 2023-06-13 2023-08-18 北京瀚科智翔科技发展有限公司 Modularized unmanned control refitting kit of piloted plane and construction method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination