CN114326821A - Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning - Google Patents

Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning

Info

Publication number
CN114326821A
CN114326821A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
obstacle avoidance
reinforcement learning
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210195266.0A
Other languages
Chinese (zh)
Other versions
CN114326821B (en)
Inventor
王钦辉
陈志龙
魏军儒
何昌其
王云宪
焦萍
闫茜茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARMY COMMAND INST CPLA
Original Assignee
ARMY COMMAND INST CPLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARMY COMMAND INST CPLA filed Critical ARMY COMMAND INST CPLA
Priority to CN202210195266.0A priority Critical patent/CN114326821B/en
Publication of CN114326821A publication Critical patent/CN114326821A/en
Application granted granted Critical
Publication of CN114326821B publication Critical patent/CN114326821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses an unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning. A novel system architecture separates training from decision-making, which greatly reduces training time consumption and improves the decision timeliness of the aircraft. The autonomous obstacle avoidance method adopts a deep reinforcement learning model based on policy iteration, taking the raw RGB images shot by the unmanned aerial vehicle's monocular camera as training data without requiring other 3D information such as complex point clouds; the raw RGB images are processed by a fully convolutional neural network to obtain depth image information, the images are then analyzed and predicted by a policy-iteration-based reinforcement learning method, and the flight action of the unmanned aerial vehicle at the next moment is pre-judged in advance to realize autonomous obstacle avoidance. Compared with the existing typical value-iteration-based methods, the obstacle avoidance method provided by the invention trains more efficiently, consumes less time, and avoids obstacles flexibly and autonomously, making it suitable for demanding autonomous obstacle avoidance scenarios such as automatic substation inspection and unmanned aerial vehicle cruising.

Description

Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning
Technical Field
The invention relates to an unmanned aerial vehicle obstacle avoidance system and method, in particular to an unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning, belonging to the technical field of unmanned aerial vehicle flight control.
Background
Obstacle avoidance is one of the core problems for unmanned aerial vehicles. Its aim is to enable the unmanned aerial vehicle to autonomously explore an unknown environment while avoiding collision with other objects, so as to obtain a flight path that avoids threats and safely reaches the target. Traditional obstacle avoidance techniques plan a path by detecting traversable space and obstacles, using data captured by RGB-D cameras, light detection and ranging (LIDAR) sensors, or even sonar. These traditional techniques work well for the autonomous obstacle avoidance of ground robots, but are difficult to apply to aerial vehicles such as unmanned aerial vehicles: ranging sensors capture only limited information, and for unmanned aerial vehicles they are too heavy, consume too much power, and are expensive. In contrast, a monocular camera captures rich information about the environment, is low cost and lightweight, and is suitable for a variety of platforms. However, when distance perception must be recovered from a monocular camera (i.e., from RGB images), the 3-D world is flattened into a 2-D image, eliminating the direct correspondence between pixels and distances, and the obstacle avoidance problem becomes extremely difficult.
With the wide application of deep learning in robotics and computer vision, applying deep learning to obstacle avoidance path planning is becoming more and more popular. The prior art uses convolutional neural network (CNN) training methods to enable aircraft to cruise in complex forest environments. Some techniques label trajectory types by training a convolutional neural network on 3D point cloud data. These methods can be divided into two categories, supervised learning and semi-supervised learning; the former requires considerable manpower for type labeling, while the learning strategy of the latter is limited to some extent by the label generation strategy.
Deep reinforcement learning (DRL) methods have recently been shown to achieve superhuman performance in games while making full use of raw images. In recent years, therefore, much attention has been paid to using DRL to realize vision-based autonomous obstacle avoidance; a common point of this work is that the data used for model training are not raw images. Some approaches use laser scanners and depth image data for network training, and some propose training the network entirely in a 3D CAD model simulator to predict collisions. While these efforts can extend the trained network to the real world, significant computing resources are still required to generate and train a large data set. For the above reasons, it is necessary to provide a more practical and convenient unmanned aerial vehicle autonomous obstacle avoidance technology.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning, which can realize flexible and efficient autonomous obstacle avoidance from raw RGB images acquired by a monocular camera.
In order to achieve the above object, the present invention adopts the following technical solutions:
the invention firstly discloses an unmanned aerial vehicle autonomous obstacle avoidance system based on deep reinforcement learning, which comprises:
the server is used for finishing data training and calculation;
the base station is connected with the server;
the aircraft communicates with the base station, receives the training result of the server fed back by the base station, and makes flight decisions;
the server comprises a local server and a cloud server which are connected through the Internet.
Preferably, the aforesaid aircraft is a drone, equipped with a monocular camera for taking raw RGB images.
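For illustration only, the following is a minimal Python sketch of the training/decision separation described above; the class and method names (GroundServer, OnboardAgent, set_velocity) are hypothetical placeholders rather than elements of the disclosed system.

```python
# Minimal sketch of the train-on-server / decide-on-aircraft split.
# All names here are hypothetical, not taken from the patent.

class GroundServer:
    """Local/cloud server side: all training and heavy computation happens here."""
    def __init__(self, policy):
        self.policy = policy                          # DRL policy trained on the server

    def decide(self, rgb_image):
        """Run the trained policy on an image relayed by the base station."""
        return self.policy.select_action(rgb_image)   # -> (linear velocity, angular velocity)

class OnboardAgent:
    """Aircraft side: only senses and executes; no training runs on board."""
    def __init__(self, server):
        self.server = server                          # reached through the base-station link

    def step(self, rgb_image, flight_controller):
        v, w = self.server.decide(rgb_image)          # training result fed back via base station
        flight_controller.set_velocity(v, w)          # execute the flight decision
```

Keeping only the lightweight inference call on the aircraft is what allows the training time to stay on the server while the aircraft retains decision timeliness.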
The invention also discloses an obstacle avoidance method of the unmanned aerial vehicle autonomous obstacle avoidance system based on the deep reinforcement learning, which comprises the following steps:
S1, acquiring the original RGB images captured by the unmanned aerial vehicle's monocular camera;
S2, training a fully convolutional neural network on the original RGB images to obtain depth information;
S3, based on the preset discrete unmanned aerial vehicle flight actions (described by linear velocity and angular velocity), training on the depth images with a policy-iteration-based reinforcement learning method;
S4, the server obtains the flight action to be pre-taken by the unmanned aerial vehicle, namely the linear velocity and the angular velocity, and feeds it back to the unmanned aerial vehicle; the unmanned aerial vehicle selects its flight action based on the linear velocity and angular velocity to realize autonomous obstacle avoidance.
Preferably, the specific process of the foregoing step S2 is: the weighted sum of pixel values in an observation region is acquired, and after the convolution operation a feature value is output through a nonlinear activation function, preferably the sigmoid function:

σ(x) = 1 / (1 + e^(-x)).

Specifically, a fully convolutional neural network (FCNN) learning scheme is adopted for depth information perception: the network accepts an input image of arbitrary size, and a deconvolution layer up-samples the feature map of the last convolution layer to restore it to the same size as the input image, so that a prediction is generated for each pixel and depth image information is obtained.
More preferably, the operation of each phase of the aforementioned FCNN comprises the following three steps: convolution, nonlinear activation, pooling.
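For illustration, a minimal sketch of such a fully convolutional depth-perception network in PyTorch is shown below. The layer widths, the two-stage encoder, and the use of sigmoid activations are assumptions consistent with the description above (convolution, nonlinear activation, pooling, then deconvolution back to the input resolution); they are not the patent's exact architecture.

```python
# Minimal FCNN sketch: per-pixel depth prediction from a raw RGB image.
# Layer widths and depth are illustrative assumptions, not the claimed architecture.
import torch
import torch.nn as nn

class DepthFCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Each stage: convolution -> nonlinear activation (sigmoid) -> pooling
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.Sigmoid(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.Sigmoid(), nn.MaxPool2d(2),
        )
        # Deconvolution layers up-sample the last feature map back to the input size,
        # producing one depth prediction per pixel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.Sigmoid(),
            nn.ConvTranspose2d(32, 1, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))   # (N, 1, H, W) depth map

# Example: an 84x84 RGB image (the size used in Example 1) yields an 84x84 depth map.
depth = DepthFCNN()(torch.randn(1, 3, 84, 84))
print(depth.shape)   # torch.Size([1, 1, 84, 84])
```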
Still preferably, the policy-based reinforcement learning in step S3 iterates directly on the policy, using a function π_θ(a|s) to approximately represent the policy, where s represents the state of the unmanned aerial vehicle (the state description can be represented by a multi-dimensional vector including the flight status, flight position, environment information (environment image) and the like of the unmanned aerial vehicle); a represents the action of the unmanned aerial vehicle, including the flight angular velocity and flight speed; θ denotes the adjustable parameters, and π_θ is the policy approximated with the parameters θ; π_θ(a|s) denotes the probability of taking action a in state s. The goal of the algorithm is to maximize the expected return of the policy,

J(θ) = E_{π_θ}[ Σ_t r(s_t, a_t) ],

where r(s_t, a_t) denotes the reward obtained by performing action a_t in the current state s_t.
Still preferably, in the foregoing step, the update of the parameters θ derived from the expected return is computed as:

θ ← θ + α ∇_θ J(θ),

where ∇ is the gradient (differential) operator and α is the learning rate. Based on this idea, the Actor-Critic method adds a value function to evaluate the selected action on top of the direct iteration of the policy: Actor denotes the policy structure in the algorithm, which is used for action selection, and Critic denotes a value function, which evaluates the action selected by the Actor.
More preferably, in the foregoing step S3, a clipped surrogate (proxy) method is adopted when updating the Actor network, so as to maximize

L(θ) = E_t[ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ] - β · E_t[ KL(π_θ_old(·|s_t) || π_θ(·|s_t)) ],

where θ is the parameter of the Actor function, and π_θ_old and π_θ represent the old policy and the new policy respectively. The first half of the formula is the gradient update: starting from the old policy, the Actor modifies the new policy according to the potential Â_t; if the potential is larger, the modification amplitude is larger, so that the new policy becomes more likely. The second half of the formula contains a penalty term, namely the KL divergence, with the parameter β representing the influence factor of the divergence term: if the difference between the old policy and the new policy is large, the KL divergence is also large, which is unfavorable for convergence.
Further preferably, the clipped proxy method is as follows: denote the probability ratio

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t),

and denote the proxy objective as r_t(θ) · Â_t; clipping the proxy objective limits the amplitude of the proxy's change. The final optimization objective becomes:

L_CLIP(θ) = E_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1-ε, 1+ε) · Â_t ) ],

where clip(·) represents the clipping function and ε represents a tuning parameter. The Critic update minimizes

L(φ) = E_t[ ( Σ_{t'=t}^{T} γ^(t'-t) r_{t'} - V_φ(s_t) )² ],

where φ represents the parameters of the Critic function; the update of the Critic network is no different from the general Actor-Critic framework and minimizes the error of the advantage function; V_φ represents the state value function with parameters φ.
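As a hedged illustration of the update rules above, the following PyTorch sketch computes the clipped surrogate loss for the Actor and the squared advantage error for the Critic; the tensor shapes, the clip range of 0.2, and the 0.5 critic-loss weight are assumptions for the example, not values taken from the patent.

```python
# Sketch of one PPO-style update step (clipped surrogate + Critic regression).
# Hyperparameter values below are illustrative assumptions.
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, returns, values, clip_eps=0.2):
    """new_log_probs, old_log_probs: log pi(a_t|s_t) under new/old policy, shape (T,)
       advantages: estimated potential (advantage) A_t, shape (T,)
       returns:    discounted returns sum_{t'>=t} gamma^(t'-t) r_t', shape (T,)
       values:     Critic estimates V_phi(s_t), shape (T,)"""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()          # maximize L_CLIP
    critic_loss = ((returns - values) ** 2).mean()               # minimize advantage error
    return actor_loss, critic_loss

# Usage: the two losses are combined and back-propagated through Actor and Critic.
T = 8
new_lp = torch.randn(T, requires_grad=True)   # stands in for the Actor network output
values = torch.randn(T, requires_grad=True)   # stands in for the Critic network output
actor_loss, critic_loss = ppo_losses(new_lp, torch.randn(T), torch.randn(T),
                                     torch.randn(T), values)
(actor_loss + 0.5 * critic_loss).backward()
```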
The invention has the advantages that:
(1) In the obstacle avoidance system based on deep reinforcement learning, training and decision-making are separated through a novel system architecture, which greatly reduces training time consumption and improves the decision timeliness of the aircraft;
(2) The unmanned aerial vehicle autonomous obstacle avoidance method is based on a policy-iteration deep reinforcement learning technique (DPPO): RGB images shot by the unmanned aerial vehicle's monocular camera are used as the raw training data to obtain depth image information, without requiring other 3D information such as complex point clouds; the images are analyzed and predicted by the deep reinforcement learning method, and the flight speed and flight angle to be adopted at the next moment are pre-judged in advance, realizing autonomous obstacle avoidance. Experimental comparison of several methods shows that the policy-iteration-based method trains more efficiently and consumes less time than the typical value-iteration-based method DQN;
(3) The obstacle avoidance learning model based on deep reinforcement learning differs from simple flight-control actions: the aircraft actions output by the model are more flexible, the discrete angular velocities and linear velocities offered to the aircraft can be set arbitrarily, and efficient and flexible obstacle avoidance can be realized, making the method suitable for demanding autonomous obstacle avoidance scenarios such as automatic substation inspection and unmanned aerial vehicle cruising.
Drawings
FIG. 1 is a block flow diagram of an obstacle avoidance method of the present invention;
FIG. 2 is a graph comparing the performance of the method of the invention (DPPO) with a value-based iterative model (DQN) of the prior art;
fig. 3 is a graph comparing the training time consumption of the method of the present invention and the A3C, DQN method of the prior art.
Detailed Description
The autonomous obstacle avoidance system addresses the obstacle avoidance problem based on monocular vision: the unmanned aerial vehicle observes the state of the environment through a monocular camera; the state is fed back through the base station to the server, where data training and calculation are completed; the result is then sent to the unmanned aerial vehicle, which selects the corresponding action, linear velocity and angular velocity according to the result to complete autonomous obstacle avoidance.
For a better understanding and appreciation of the invention, reference will now be made in detail to the following description taken in conjunction with the accompanying drawings and specific examples.
Example 1
In this embodiment, the discount coefficient is set to 0.95, the learning rate is set to 0.0001, and the original image size is set to 84 × 84. The instantaneous reward function is defined by an expression (given as an image in the original filing) in which the time of each training cycle is set to 0.5 seconds; the reward is intended to make the robot run as fast as possible and penalizes simple in-place rotation. If a collision is detected, the training episode is immediately terminated with a penalty of -10. Otherwise, the episode continues up to the maximum number of steps, which is set to 500, with no penalty. In this embodiment, 10000 original pictures are learned.
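Purely to illustrate how these settings fit together, the sketch below wires the stated hyperparameters (discount 0.95, learning rate 0.0001, 84 × 84 images, 0.5 s training cycle, collision penalty of -10, 500-step episode cap) into a generic episode loop; the environment interface and the forward_reward placeholder are assumptions, not the patent's simulator or reward expression.

```python
# Illustrative episode loop using the hyperparameters stated in Example 1.
# The Environment interface and forward_reward() are hypothetical placeholders.
GAMMA = 0.95            # discount coefficient
LEARNING_RATE = 1e-4    # learning rate
IMAGE_SIZE = (84, 84)   # original image size
CYCLE_TIME = 0.5        # seconds per training cycle
MAX_STEPS = 500         # maximum steps per episode
COLLISION_PENALTY = -10.0

def run_episode(env, policy):
    state = env.reset()                        # raw 84x84 RGB observation
    transitions, total_reward = [], 0.0
    for step in range(MAX_STEPS):
        action = policy.select_action(state)   # discrete (linear, angular) velocity pair
        next_state, collided = env.step(action, CYCLE_TIME)
        if collided:
            # Collision terminates the episode immediately with the -10 penalty.
            transitions.append((state, action, COLLISION_PENALTY, next_state, True))
            break
        reward = forward_reward(action, CYCLE_TIME)   # fast forward motion rewarded,
        transitions.append((state, action, reward, next_state, False))  # rotation penalized
        total_reward += reward
        state = next_state
    return transitions, total_reward
```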
This embodiment is an unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning; its obstacle avoidance process is shown in Fig. 1 and specifically comprises the following steps:
S1, acquiring the original RGB images captured by the unmanned aerial vehicle's monocular camera.
S2, training a fully convolutional neural network on the original RGB images to obtain depth information.
The specific process of this step is: the weighted sum of pixel values in an observation region is acquired, and after the convolution operation a feature value is output through a nonlinear activation function. Specifically, a fully convolutional neural network (FCNN) learning scheme is adopted for depth information perception, and the operation of each stage of the FCNN comprises three steps: convolution, nonlinear activation, and pooling. The network accepts an input image of arbitrary size, and a deconvolution layer up-samples the feature map of the last convolution layer to restore it to the same size as the input image, thereby generating a prediction for each pixel and obtaining depth image information.
S3, based on the preset discrete unmanned aerial vehicle flight actions, linear velocities and angular velocities, training on the depth images with a policy-iteration-based reinforcement learning method.
In this step, policy-based reinforcement learning is adopted to iterate directly on the policy, using a function π_θ(a|s) to approximately represent the policy, where s represents the state of the unmanned aerial vehicle (the state description can be represented by a multi-dimensional vector including the flight status, flight position, environment information (environment image) and the like of the unmanned aerial vehicle); a represents the action of the unmanned aerial vehicle, including the flight angular velocity and flight speed; θ denotes the adjustable parameters, and π_θ is the policy approximated with the parameters θ; π_θ(a|s) denotes the probability of taking action a in state s. The goal of the algorithm is to maximize the expected return of the policy,

J(θ) = E_{π_θ}[ Σ_t r(s_t, a_t) ],

where r(s_t, a_t) denotes the reward obtained by performing action a_t in the current state s_t.
The update of the parameters θ derived from the expected return is computed as:

θ ← θ + α ∇_θ J(θ),

where ∇ is the gradient (differential) operator and α is the learning rate. Based on this idea, the Actor-Critic method adds a value function to evaluate the selected action on top of the direct iteration of the policy: Actor denotes the policy structure in the algorithm, which is used for action selection, and Critic denotes a value function, which evaluates the action selected by the Actor.
When updating the Actor network, the clipped surrogate (proxy) method is adopted, so as to maximize

L(θ) = E_t[ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ] - β · E_t[ KL(π_θ_old(·|s_t) || π_θ(·|s_t)) ],

where θ is the parameter of the Actor function, and π_θ_old and π_θ represent the old policy and the new policy respectively. The first half of the formula is the gradient update: starting from the old policy, the Actor modifies the new policy according to the potential Â_t; if the potential is larger, the modification amplitude is larger, so that the new policy becomes more likely. The second half of the formula contains a penalty term, namely the KL divergence, and the parameter β represents the influence factor of the divergence term: if the difference between the old policy and the new policy is large, the KL divergence is also large, which is unfavorable for convergence.
Specifically, the clipped proxy method is as follows: denote the probability ratio

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t),

and denote the proxy objective as r_t(θ) · Â_t; clipping the proxy objective limits the amplitude of the proxy's change. The final optimization objective becomes:

L_CLIP(θ) = E_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1-ε, 1+ε) · Â_t ) ],

where clip(·) represents the clipping function and ε represents a tuning parameter. The Critic update minimizes

L(φ) = E_t[ ( Σ_{t'=t}^{T} γ^(t'-t) r_{t'} - V_φ(s_t) )² ],

where φ represents the parameters of the Critic function; the update of the Critic network is no different from the general Actor-Critic framework and minimizes the error of the advantage function; V_φ represents the state value function with parameters φ.
S4, the server obtains the flight action to be pre-taken by the unmanned aerial vehicle, namely the linear velocity and the angular velocity, and feeds it back to the unmanned aerial vehicle; the unmanned aerial vehicle selects its flight action based on the linear velocity and angular velocity to realize autonomous obstacle avoidance.
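As a small illustration of this decision step, the sketch below defines a possible discrete action set of (linear velocity, angular velocity) pairs and selects one according to the policy's action probabilities; the specific velocity values and probabilities are assumptions chosen only for the example, not values from the patent.

```python
# Sketch of discrete action selection on the UAV side.
# The velocity values in ACTIONS are illustrative assumptions.
import numpy as np

ACTIONS = [
    (0.6,  0.0),   # fly straight (linear velocity m/s, angular velocity rad/s)
    (0.4,  0.5),   # gentle left turn
    (0.4, -0.5),   # gentle right turn
    (0.2,  1.0),   # sharp left turn
    (0.2, -1.0),   # sharp right turn
]

def select_flight_action(action_probs, rng=np.random.default_rng()):
    """Sample an action index from the policy distribution pi_theta(a|s)
    and return the corresponding (linear, angular) velocity command."""
    idx = rng.choice(len(ACTIONS), p=action_probs)
    return ACTIONS[idx]

# Example: probabilities produced by the trained policy for the current image.
probs = np.array([0.5, 0.2, 0.2, 0.05, 0.05])
v, w = select_flight_action(probs)
print(f"command: linear {v} m/s, angular {w} rad/s")
```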
The training efficiency and performance of the DPPO model of the present application were compared with the value-iteration-based method (DQN) of the prior art, with the results shown in Fig. 2, where the abscissa is the learned episode (the maximum number of steps per episode is 500) and the ordinate is the average reward obtained by the robot. As can be seen from the figure, the method of the present invention exhibits good performance, whereas the DQN model of the prior art does not perform well. The applicant's analysis suggests a possible reason: for the obstacle avoidance problem, overestimation of the Q value is not a problem that can be alleviated by more exploration, and the longer the training, the more serious it may become, further hindering DQN from reaching high performance. Fig. 3 compares the training time consumption of the method of the present invention with that of two prior-art methods, with the ordinate in unit time; it can be seen that the proposed DPPO-based method trains more efficiently than the existing A3C method and the typical value-iteration-based method DQN. Therefore, the unmanned aerial vehicle autonomous obstacle avoidance method based on DPPO deep learning obtains depth information by training on the collected RGB images, analyzes and predicts the images with the reinforcement learning method, and pre-judges the flight speed and flight angle to be taken at the next moment in advance, thereby realizing autonomous obstacle avoidance; it has good application prospects.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (9)

1. An unmanned aerial vehicle autonomous obstacle avoidance system based on deep reinforcement learning, characterized by comprising:
the server is used for finishing data training and calculation;
the base station is in communication connection with the server;
the aircraft communicates with the base station, receives the training result of the server fed back by the base station, and makes flight decisions;
the server comprises a local server and a cloud server which are connected through the Internet.
2. The unmanned aerial vehicle autonomous obstacle avoidance system based on depth reinforcement learning of claim 1, wherein the aircraft is an unmanned aerial vehicle equipped with a monocular camera for capturing original RGB images.
3. The obstacle avoidance method of the unmanned aerial vehicle autonomous obstacle avoidance system based on the deep reinforcement learning of claim 1 is characterized by comprising the following steps:
S1, acquiring the original RGB images captured by the unmanned aerial vehicle's monocular camera;
S2, training a fully convolutional neural network on the original RGB images to obtain depth information;
S3, based on the preset discrete unmanned aerial vehicle flight actions, training on the depth images with a reinforcement learning method based on policy iteration to obtain the optimal flight action to be taken by the unmanned aerial vehicle at the next moment;
S4, the server obtains the flight action to be pre-taken by the unmanned aerial vehicle, namely the linear velocity and the angular velocity, and feeds it back to the unmanned aerial vehicle; the unmanned aerial vehicle selects its flight action based on the linear velocity and angular velocity to realize autonomous obstacle avoidance.
4. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning of claim 3, wherein the specific process of step S2 is as follows: the weighted sum of pixel values in an observation region is acquired, and after the convolution operation a feature value is output through a nonlinear activation function; specifically, a fully convolutional neural network (FCNN) learning scheme is adopted for depth information perception, the network accepts an input image of arbitrary size, and a deconvolution layer up-samples the feature map of the last convolution layer to restore it to the same size as the input image, thereby generating a prediction for each pixel and obtaining depth image information.
5. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning according to claim 4, wherein the operation of each phase of the FCNN comprises the following three steps: convolution, nonlinear activation, pooling.
6. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning of claim 3, wherein the reinforcement learning in step S3 iterates directly on the policy, using a function π_θ(a|s) to approximately represent the policy, where s represents the state of the unmanned aerial vehicle, the state description being represented by a multi-dimensional vector comprising the flight status, flight position and environment information of the unmanned aerial vehicle; a represents the action of the unmanned aerial vehicle, including the flight angular velocity and flight linear velocity; θ denotes the adjustable parameters, and π_θ is the policy approximated with the parameters θ; π_θ(a|s) denotes the probability of taking action a in state s; the goal of the algorithm is to maximize the expected return of the policy,

J(θ) = E_{π_θ}[ Σ_t r(s_t, a_t) ],

where r(s_t, a_t) denotes the reward obtained by performing action a_t in the current state s_t.
7. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning of claim 6, wherein the update of the parameters θ derived from the expected return is computed as:

θ ← θ + α ∇_θ J(θ),

where ∇ is the gradient (differential) operator and α is the learning rate.
8. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning of claim 6, wherein in step S3 a clipped surrogate (proxy) method is adopted when the Actor network is updated, so as to maximize

L(θ) = E_t[ (π_θ(a_t|s_t) / π_θ_old(a_t|s_t)) · Â_t ] - β · E_t[ KL(π_θ_old(·|s_t) || π_θ(·|s_t)) ],

where θ is the parameter of the Actor function, and π_θ_old and π_θ represent the old policy and the new policy respectively; the first half of the formula is the gradient update: starting from the old policy, the Actor modifies the new policy according to the potential Â_t, and if the potential is larger, the modification amplitude is larger, so that the new policy becomes more likely; the second half of the formula contains a penalty term, namely the KL divergence, with the parameter β representing the influence factor of the divergence term; if the difference between the old policy and the new policy is large, the KL divergence is also large, which is unfavorable for convergence.
9. The unmanned aerial vehicle autonomous obstacle avoidance method based on deep reinforcement learning of claim 6, wherein the clipped proxy method is as follows: denote the probability ratio

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t),

and denote the proxy objective as r_t(θ) · Â_t; clipping the proxy objective limits the amplitude of the proxy's change; the final optimization objective becomes:

L_CLIP(θ) = E_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1-ε, 1+ε) · Â_t ) ],

where clip(·) represents the clipping function, ε represents a tuning parameter, and Â_t represents the potential;
the Critic update minimizes

L(φ) = E_t[ ( Σ_{t'=t}^{T} γ^(t'-t) r_{t'} - V_φ(s_t) )² ],

where φ represents the parameters of the Critic function, V_φ represents the state value function with parameters, T represents the time period, t indicates the current time, and t', the time index starting from the current time, is a variable parameter.
CN202210195266.0A 2022-03-02 2022-03-02 Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning Active CN114326821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210195266.0A CN114326821B (en) 2022-03-02 2022-03-02 Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210195266.0A CN114326821B (en) 2022-03-02 2022-03-02 Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114326821A true CN114326821A (en) 2022-04-12
CN114326821B CN114326821B (en) 2022-06-03

Family

ID=81031485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210195266.0A Active CN114326821B (en) 2022-03-02 2022-03-02 Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114326821B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707204A (en) * 2024-01-30 2024-03-15 清华大学 Unmanned aerial vehicle high-speed obstacle avoidance system and method based on photoelectric end-to-end network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107553490A (en) * 2017-09-08 2018-01-09 深圳市唯特视科技有限公司 A kind of monocular vision barrier-avoiding method based on deep learning
CN112034887A (en) * 2020-09-10 2020-12-04 南京大学 Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
CN112766499A (en) * 2021-02-02 2021-05-07 电子科技大学 Method for realizing autonomous flight of unmanned aerial vehicle through reinforcement learning technology

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107553490A (en) * 2017-09-08 2018-01-09 深圳市唯特视科技有限公司 A kind of monocular vision barrier-avoiding method based on deep learning
CN112034887A (en) * 2020-09-10 2020-12-04 南京大学 Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
CN112766499A (en) * 2021-02-02 2021-05-07 电子科技大学 Method for realizing autonomous flight of unmanned aerial vehicle through reinforcement learning technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG XIANGZHU et al.: "Monocular vision obstacle avoidance algorithm for UAVs based on deep learning", Journal of South China University of Technology (Natural Science Edition) *
JIA JUNLIANG et al.: "Research on autonomous obstacle avoidance methods for UAVs based on deep learning", Development & Innovation of Machinery & Electrical Products *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707204A (en) * 2024-01-30 2024-03-15 清华大学 Unmanned aerial vehicle high-speed obstacle avoidance system and method based on photoelectric end-to-end network

Also Published As

Publication number Publication date
CN114326821B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
US20230043931A1 (en) Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
US10845815B2 (en) Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
EP3405845B1 (en) Object-focused active three-dimensional reconstruction
CN110007675B (en) Vehicle automatic driving decision-making system based on driving situation map and training set preparation method based on unmanned aerial vehicle
US20200033869A1 (en) Systems, methods and controllers that implement autonomous driver agents and a policy server for serving policies to autonomous driver agents for controlling an autonomous vehicle
CN111428765B (en) Target detection method based on global convolution and local depth convolution fusion
WO2022100107A1 (en) Methods and systems for predicting dynamic object behavior
CN111340868B (en) Unmanned underwater vehicle autonomous decision control method based on visual depth estimation
CN107481292A (en) The attitude error method of estimation and device of vehicle-mounted camera
CN111176309B (en) Multi-unmanned aerial vehicle self-group mutual inductance understanding method based on spherical imaging
CN111967373B (en) Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
CN114326821B (en) Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning
CN110222822B (en) Construction method of black box prediction model internal characteristic causal graph
Sun et al. RobNet: real-time road-object 3D point cloud segmentation based on SqueezeNet and cyclic CRF
CN111611869B (en) End-to-end monocular vision obstacle avoidance method based on serial deep neural network
WO2023155903A1 (en) Systems and methods for generating road surface semantic segmentation map from sequence of point clouds
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
Rahmania et al. Exploration of the impact of kernel size for yolov5-based object detection on quadcopter
CN114326826A (en) Multi-unmanned aerial vehicle formation transformation method and system
Zheng et al. Policy-based monocular vision autonomous quadrotor obstacle avoidance method
Wen et al. A Hybrid Technique for Active SLAM Based on RPPO Model with Transfer Learning
KR102399047B1 (en) Method and system for visual properties estimation in autonomous driving
CN116863430B (en) Point cloud fusion method for automatic driving
CN111695403B (en) Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
CN116902003B (en) Unmanned method based on laser radar and camera mixed mode

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant