CN111693047B - Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene - Google Patents

Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene Download PDF

Info

Publication number
CN111693047B
CN111693047B (application CN202010381182.7A)
Authority
CN
China
Prior art keywords
image
frame
map
optimization
error
Prior art date
Legal status
Active
Application number
CN202010381182.7A
Other languages
Chinese (zh)
Other versions
CN111693047A (en
Inventor
张磊
刘硕
文鹏程
Current Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC filed Critical Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN202010381182.7A priority Critical patent/CN111693047B/en
Publication of CN111693047A publication Critical patent/CN111693047A/en
Application granted granted Critical
Publication of CN111693047B publication Critical patent/CN111693047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods


Abstract

The invention provides a visual navigation method for a micro unmanned aerial vehicle in a high-dynamic scene, comprising the following steps: 1. photometric preprocessing of the input image through exposure control and compensation, to overcome random illumination changes in high-dynamic scenes; 2. lightweight image semantic segmentation to accurately separate highly dynamic people or objects from the static scene, overcoming the interfering factors of high-dynamic scenes; 3. joint construction of an objective function from the visual reprojection error, the image photometric error and the pixel semantic error, with the optimal solution computed by nonlinear optimization, improving the accuracy of local map optimization; 4. support for visual-inertial integrated navigation, improving navigation robustness and the data update rate; 5. a deep-learning-based loop detection algorithm that improves loop detection accuracy; 6. global pose-graph optimization that further improves navigation accuracy. The method offers high-precision, real-time, robust and extensible navigation capability.

Description

Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene
Technical Field
The invention belongs to the field of computer vision navigation, and particularly relates to a visual navigation method for a micro unmanned aerial vehicle in a high dynamic scene.
Background
Micro unmanned aerial vehicles (UAVs) combine low cost, light weight, flexibility, maneuverability, ease of concealment and ease of penetration, and have become a research hotspot in recent years.
A micro UAV has very limited payload, computing and energy capacity and therefore cannot carry a large sensor suite to improve positioning and navigation accuracy. Current UAV positioning and navigation relies mainly on inertial navigation, global satellite navigation, radio navigation with dead reckoning, and visual navigation. Inertial navigation is accurate, but high-accuracy inertial units are too large and heavy for a micro UAV; global satellite navigation depends on satellite signals, has poor autonomy and real-time performance, and easily fails inside buildings and in similar scenes; radio navigation requires ground stations for accurate positioning and is likewise limited to large, open scenes. Visual navigation, by contrast, is autonomous, accurate, intuitive and intelligent, depends little on external sensing sources, and generalizes well to all kinds of complex scenes, so it is favored in the UAV navigation field. With the continuing progress of visual sensor technology, machine vision and artificial intelligence, visual navigation has rapidly become a research hotspot in this field.
Visual navigation can be divided into passive and active visual navigation according to the sensor type. Active visual navigation senses the environment with active detection devices such as lidar, millimeter-wave radar and sonar; although these offer high detection accuracy and direct scene depth information, they are bulky, heavy and expensive.
A monocular camera, in contrast, is small, light, inexpensive and has a long detection range, which makes it better suited to micro UAV applications. Moreover, most micro UAVs already carry video equipment as a mission payload; with monocular visual navigation, the same video stream can also be used for autonomous navigation and positioning, so one device serves two purposes, further reducing the size, power consumption and payload weight of the micro UAV while achieving passive visual navigation.
Micro UAVs are mainly used in complex environments such as building interiors, urban blocks and jungles. These scenes generally feature complex three-dimensional structure and obstacles of many kinds and shapes, and they place high demands on the accuracy and real-time performance of positioning and navigation, which makes self-positioning and safe navigation by visual means a major challenge. Moreover, the problems that visual navigation must solve differ with the application scene of the micro UAV. In narrow indoor environments the illumination is stable, but image features are hard to extract from white, textureless wall regions, and indoor structures are so self-similar that semantic constraints contribute little; outdoors, the flight range grows substantially and the accumulated error of the visual odometer increases, while map points multiply rapidly, raising the cost of global map optimization and making real-time operation hard to guarantee; in urban blocks, fast-moving vehicles and pedestrians cause image matching errors, so camera motion and the map cannot be estimated correctly; and in the field, illumination and seasonal changes prevent both feature-based and direct methods from accurately recovering the camera motion and the three-dimensional scene map from image sequences.
Therefore, facing the limitations and constraints of complex scenes, a new monocular visual navigation method must be developed for the safe-navigation needs of micro UAVs, providing technical support and design guidance for safe navigation of micro UAVs in complex environments.
Disclosure of Invention
In order to solve the above problems, the invention provides a visual navigation method for a micro unmanned aerial vehicle in a high-dynamic scene, which adopts the following technical scheme:
The visual navigation method for a micro unmanned aerial vehicle in a high-dynamic scene comprises the following steps:
[1] Image preprocessing
First feature extraction, semantic segmentation and deep neural network feature extraction are performed on the input image. Preferably, the semantic segmentation of targets in the image uses the DFANet or Mask R-CNN semantic segmentation algorithm, and the deep neural network feature extraction uses a convolutional neural network; of course, other algorithms or networks may be substituted according to the real-time computation requirements, provided the data requirements are met;
[2] Visual odometry
Inter-frame image matching and local map optimization are performed according to the first features of the image and the semantic segmentation result, thereby updating the map and the state;
preferably, the first feature extraction uses the FAST, Corner, ORB, SURF or SIFT algorithm;
[3] Loop detection
The deep neural network features extracted from the input image are first reduced in dimensionality to simplify them and are recorded as the current frame; historical keyframes are then searched in the global map and the similarity between the current frame and the historical frames is computed. If a loop exists, a similarity transformation is applied to the current frame and the corrected keyframe pose is output; if no loop exists, processing continues with the deep neural network features extracted from new input images;
[4] Global optimization and map creation
Local optimization is performed according to the locally updated map and state obtained in step 2 and the corrected keyframe pose obtained in step 3; all map points are then optimized over a tree or graph through a filtering theory or an optimization theory, finally yielding the optimal global pose estimate and map.
Preferably, in step 1, before the first feature extraction, semantic segmentation and deep neural network feature extraction, photometric correction is applied to each input frame to eliminate the influence of illumination changes on the image. Other algorithms or correction methods may be used to adjust the image, provided they satisfy at least the accuracy and other requirements that the subsequent steps place on the relevant parameters.
Preferably, the photometric correction of each input frame is specifically:
the imaging model of each input frame is
I_i(x) = G(t_i V(x) B_i(x))    (1)
where G: R → [0,255] is the nonlinear response function, V: Ω → [0,1] is the lens attenuation, B_i and I_i are respectively the irradiance and the observed intensity of image frame i, and t_i is the exposure time;
the photometric correction of each input frame is
I'_i(x) = t_i B_i(x) = G⁻¹(I_i(x)) / V(x)    (2)
The correction result obtained by applying equation (2) to the model of equation (1) is used for photometric parameter estimation during the local map optimization of step 2.
This correction method has a comparatively small computational load and a high computation speed, and is therefore particularly suitable for on-board computation on a micro unmanned aerial vehicle.
Preferably, in step 2, the inter-frame image matching specifically judges through frame tracking whether the current frame is a keyframe. If it is a keyframe, map points and semantics are updated so as to complete the local map optimization and update the map; if not, triangle matching is carried out and the semantic and depth information is updated so as to update the state.
There are many keyframe judgment methods, but considering the computing capability of the micro unmanned aerial vehicle and the maximum bearable on-board weight, a method with a smaller computational load and higher accuracy should be chosen wherever possible, for example:
under the geometric, photometric and semantic constraints between two image frames, a reprojection error constraint term within a sliding window, a pixel photometric error constraint term within a neighborhood, and a semantic error constraint term are established;
wherein (a) reprojection error
the reprojection error is the positional deviation in the image between an observed point (x_j, y_j)^T and the back-projection of its corresponding map point, and is defined by equation (3) (given as an image in the original),
where T_i is the pose of image frame i in the world coordinate system, T_j is the pose of image frame j in the world coordinate system, K is the camera intrinsic calibration matrix, π denotes the conversion from homogeneous to Cartesian coordinates, and d is the estimated depth of the key point;
(b) pixel photometric error in the neighborhood
for any point p ∈ Ω_i on image frame i, the photometric error on another frame j is defined as
E_photo = ∑_{p∈N_p} ω_p ‖ (I_j[p'] − b_j) − (t_j e^{a_j})/(t_i e^{a_i}) · (I_i[p] − b_i) ‖_γ    (4)
where N_p is the set of pixels in the neighborhood around point p, I_i and I_j are the two adjacent image frames, t_i and t_j are their exposure times, a_i, b_i, a_j and b_j are the affine illumination correction coefficients, ‖·‖_γ denotes the Huber norm, ω_p is a gradient-dependent weight, and p' is the projection of p onto image frame I_j;
(c) semantic error
the semantic error measures, for a map point labelled with semantic class c, the probability that the corresponding pixel in the image also carries the semantic label c. When a semantic image S_k is observed with camera pose T_k, this probability is taken to be inversely related to the distance between the two-dimensional pixel coordinates π(T_k, p_i), obtained by re-projecting the associated three-dimensional point p_i into S_k, and the nearest image region labelled c. The semantic error of the map point is defined by equations (5)-(7) (given as images in the original), in which DT denotes a distance-transform function on the binary image of class c, σ is the uncertainty of the semantic image segmentation, and the probability vector of the map point p_i is computed from all of its observations: if p_i is observed by camera T_i and F is the set of keyframes to be optimized, the vector accumulates the observation likelihoods over F;
defining the sum of the reprojection error constraint term of the first feature points within the sliding window, the pixel photometric error constraint term within the neighborhood, and the semantic error constraint term as the visual error sum function gives
E_vis = η_R · E_reproj + η_P · E_photo + η_S · E_sem    (8)
where η_R, η_P and η_S are the weighting coefficients of the reprojection error constraint term, the neighborhood pixel photometric error constraint term and the semantic error constraint term respectively, determined by analyzing the image semantic information of the current scene;
visual and inertial measurements are combined for integrated navigation: assuming that the relative pose of the camera and the inertial measurement unit has been accurately calibrated, local map optimization is achieved by simultaneously solving, within the sliding window, for the minimum of the visual error sum function and the inertial error sum function. Estimating the camera motion from the information of several keyframes in the sliding window is equivalent to finding the solution at which equation (8) attains its minimum, so the estimation of the camera motion state becomes a nonlinear optimization problem, as in equation (9):
X* = argmin_X E_vis(X)    (9)
When the Gauss-Newton iteration is used to solve this problem, the normal equations
J^T W J · Δx = −J^T W r
must be computed, where W is the weight matrix, r is the residual vector, and J is the Jacobian matrix.
Preferably, the dimensionality-reduction processing of the extracted deep neural network features in the loop detection of step 3, which simplifies the features, is specifically:
on the premise that the high-dimensional data and the dimensionality-reduced low-dimensional data share a similar or identical probability distribution, distances measured in the high-dimensional space remain similar after the data are mapped into the low-dimensional space. Using an energy-based model, the probability distributions of the samples in the high-dimensional and the low-dimensional space can be defined as
p_{j|i} = exp(−‖x_i − x_j‖²) / ∑_{k≠i} exp(−‖x_i − x_k‖²),  q_{j|i} = exp(−‖y_i − y_j‖²) / ∑_{k≠i} exp(−‖y_i − y_k‖²)    (10)
where y is the vector to which x is mapped in the low-dimensional space.
To keep the distributions of the feature vectors consistent between the high-dimensional and the low-dimensional space, the distance between the two distributions is measured with the KL divergence:
H = ∑_i KL(P_i ‖ Q_i) = ∑_i ∑_j p_{j|i} (log p_{j|i} − log q_{j|i})    (11)
where P_i and Q_i are the probability distributions of sample i in the high-dimensional and the low-dimensional space respectively; by minimizing the KL divergence measure, a corresponding dimension-reduction mapping is learned, i.e. the features are simplified without losing data information.
Preferably, the computation of the similarity between the current frame and the historical frames is specifically:
the current frame set is recorded as P = {p1, p2, …, pn};
the historical frame set is recorded as F = {f1, f2, …, fn};
the dimensionality-reduced feature vectors of keyframe 1 and keyframe 2 are recorded respectively as
V1 = (v1,1, v1,2, …, v1,m),  V2 = (v2,1, v2,2, …, v2,m)    (12)
where m is the feature dimension after the dimensionality-reduction operation;
each pair of reduced feature values is compared through a Gaussian function, giving
s_k = exp(−(v1,k − v2,k)² / (2σ²)),  k = 1, …, m    (13)
and the similarity score between the two keyframes is then the average of the sum over all features,
Score = (1/m) ∑_{k=1}^{m} s_k    (14)
where m is the feature dimension after dimensionality reduction.
Whether the similarity score between the two keyframes reaches a similarity threshold is then judged: if the similarity score is greater than the threshold, a loop is detected, a similarity transformation is applied to the current keyframe image, and the corrected keyframe pose is output; if the similarity score is less than the threshold, no loop is detected.
Preferably, the optimization of all map points over a tree or graph through a filtering theory or an optimization theory in step 4 is specifically: a pose optimization method based on nonlinear optimization that realizes local pose optimization and global pose optimization at the same time. That is, for monocular camera image input, the objective function is determined by the state variables and the observation equation:
(x, y)* = argmin_{x,y} ∑_{(k,j)∈C} ‖ e(x_k, y_i, z_{k,j}) ‖²    (15)
where C is the optimal consistent sample set; x_k is the motion pose of the unmanned aerial vehicle at time k, called a pose node in the graph optimization; y_i is a landmark observable by the robot's visual sensor at time k, called a point node in the graph optimization; and z_{k,j} represents the constraint between a pose node and a point node. When the optimization step is loop detection, z_{k,j} represents the constraint between two pose nodes. e(x_k, y_i, z_{k,j}) is a vector error function reflecting the degree to which the two nodes satisfy the constraint; it is represented in the graph optimization as the connecting line between the two nodes, a smaller value indicating that the matching constraint is better satisfied, with the value 0 meaning that the two nodes satisfy the matching constraint exactly;
assuming that the error function follows a Gaussian distribution, the Bayesian formulation is: solve for the (x, y) that makes the system most likely to have produced the current observations Z, i.e. the maximum likelihood estimate
(x, y)* = argmax_{x,y} P(Z | x, y)    (16)
so that minimizing e(x_k, y_i, z_{k,j}) is in essence solving for the maximum likelihood.
The beneficial effects of the invention are:
the visual navigation method of the micro unmanned aerial vehicle in the high dynamic scene provided by the invention is an improvement on the SLAM method aiming at the characteristics of a complex application environment on the basis of comprehensively analyzing the technical advantages and disadvantages of various typical SLAM methods, can meet the safety navigation requirement of the micro unmanned aerial vehicle in the low altitude high dynamic scene and without depending on a global navigation satellite system and prior geographic information, and has the navigation capabilities of high precision, strong real-time performance, robustness, autonomy, safety and expandability.
Drawings
FIG. 1 is a flow chart of a visual navigation method of a micro unmanned aerial vehicle in a high dynamic scene;
FIG. 2 is a block flow diagram of the visual odometer of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
A typical visual SLAM system consists essentially of four modules, namely, visual odometry, loopback detection, back-end non-linear optimization, and mapping.
The visual odometer establishes data correspondences between two observed image frames and solves for the relative transformation between them, so that observations made in different coordinate systems can be placed in a common coordinate system. Because the two frames are each expressed in their own sensor coordinate system, the resulting relative transformation is the relative pose of the sensor.
Loop detection means detecting whether the unmanned aerial vehicle has entered the same place again, i.e. judging through a loop whether it has returned to a previously visited location. Because errors accumulate continuously, the estimated trajectory of the unmanned aerial vehicle differs from the true one by the time a loop is detected. Once a loop is detected, global optimization is started to optimize the pose along the whole loop according to the error between the estimated and true values, improving the accuracy of the unmanned aerial vehicle's pose estimation and map construction.
The back-end nonlinear optimization receives the camera poses measured by the visual odometer at different times together with the loop detection information, optimizes them, and obtains a globally consistent trajectory and map.
Map construction builds a map matching the task requirements from the estimated trajectory.
The invention provides a visual navigation method for a micro unmanned aerial vehicle in a high-dynamic scene, aimed at the need for autonomous navigation of micro unmanned aerial vehicles in low-altitude, high-dynamic scenes.
First, to handle random illumination changes in low-altitude, high-dynamic scenes, an exposure control and compensation algorithm preprocesses the input image; to handle the weak resistance of the first features to viewpoint transformation in image-frame matching, a fast, viewpoint-transformation-resistant image matching of the first features is used, making the matching faster and more robust; and to handle the interference of highly dynamic people or objects with the accuracy of camera motion estimation and map construction, a lightweight image semantic segmentation algorithm accurately separates the highly dynamic people or objects from the static scene.
Second, to address the limited accuracy of local optimization, an objective function is constructed jointly from the visual reprojection error, the image photometric error and the pixel semantic error, a joint constraint relation is established, and local optimization is realized by a nonlinear optimization algorithm. To satisfy the multi-sensor fusion needs of the micro unmanned aerial vehicle, the method supports integrated navigation from visual and inertial measurements during local map optimization.
Third, to address the low accuracy and positioning precision of existing loop detection algorithms, a deep-learning-based loop detection algorithm is adopted, raising the image matching and positioning accuracy to sub-pixel level.
Finally, the global pose graph is optimized and the global map is constructed, further improving the accuracy of camera motion estimation and map construction.
The invention provides a visual navigation method of a micro unmanned aerial vehicle in a high dynamic scene, which is shown in a flow chart in figure 1 and mainly comprises the following four steps:
step 1, image preprocessing
The method comprises the steps of obtaining an input image, and carrying out first feature extraction, semantic segmentation and deep neural network feature extraction.
The first feature extraction may be performed by using FAST, Corner, ORB, SURF, or SIFT algorithms, and the ORB feature extraction is described in this embodiment.
Before the ORB feature extraction, semantic segmentation and deep neural network feature extraction, camera photometric calibration can be used to eliminate the influence of illumination changes; the photometric parameters comprise the lens attenuation, the gamma correction and the exposure time. The photometric camera model maps the true energy (irradiance) received by a pixel on the imaging sensor to the corresponding intensity value. The imaging model is
I_i(x) = G(t_i V(x) B_i(x))    (1)
where G: R → [0,255] is the nonlinear response function, V: Ω → [0,1] is the lens attenuation (vignetting), B_i and I_i are respectively the irradiance and the observed intensity of image frame i, and t_i is the exposure time. This model is used as the first step of the photometric correction of each video frame, computed as
I'_i(x) = t_i B_i(x) = G⁻¹(I_i(x)) / V(x)    (2)
The correction result obtained by applying equation (2) to the model of equation (1) is used to estimate the photometric parameters during the local map optimization.
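By way of illustration only, the following minimal sketch shows how the per-frame photometric correction of equations (1)-(2) could be applied. It assumes that an offline photometric calibration has provided the inverse response curve G⁻¹ as a 256-entry lookup table and the vignette (lens attenuation) map V; the function and variable names are illustrative and not taken from the patent.

```python
import numpy as np

def photometric_correction(frame_u8: np.ndarray,
                           g_inv: np.ndarray,       # shape (256,): G^-1 as a lookup table
                           vignette: np.ndarray     # shape (H, W): V(x), values in (0, 1]
                           ) -> np.ndarray:
    """Return t_i * B_i(x) = G^-1(I_i(x)) / V(x) for an 8-bit grayscale frame."""
    irradiance = g_inv[frame_u8.astype(np.uint8)]          # undo the response function G
    return irradiance / np.clip(vignette, 1e-6, None)      # undo the lens attenuation V(x)
```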
To address the problems that the ORB image matching algorithm has a high rate of erroneous points, that the pixel contrast is easily affected by noise, and that its resistance to viewpoint transformation is weak, a fast image matching algorithm based on improved viewpoint-transformation-resistant ORB can be adopted.
Semantic features further generalize and organize the local features up to the classification level understood by humans. Although changes of viewpoint, illumination and scale affect the low-level feature representation of an object, its semantic representation does not change.
After comparison with the current mainstream image segmentation algorithms, the DFANet or Mask R-CNN algorithm is adopted as the real-time semantic segmentation scheme (see the sketch after this list), which has two advantages:
1) the high-level feature information of the network is fully utilized through a deep feature aggregation structure;
2) lightweight encoding is achieved through the DFA lightweight feature aggregation structure.
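As an illustrative sketch only, the fragment below uses the off-the-shelf Mask R-CNN of torchvision to obtain per-pixel masks of potentially dynamic objects (people, vehicles) so that features falling on them can be discarded; the preferred lightweight network in this embodiment is DFANet, and the COCO class list, score threshold and torchvision ≥ 0.13 weights argument below are assumptions, not values from the patent.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

DYNAMIC_COCO_IDS = {1, 2, 3, 4, 6, 8}   # person, bicycle, car, motorcycle, bus, truck

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def dynamic_mask(image_rgb, score_thr=0.5):
    """Return a boolean (H, W) mask that is True on likely dynamic objects."""
    out = model([to_tensor(image_rgb)])[0]
    mask = torch.zeros(image_rgb.shape[:2], dtype=torch.bool)
    for m, label, score in zip(out["masks"], out["labels"], out["scores"]):
        if score > score_thr and int(label) in DYNAMIC_COCO_IDS:
            mask |= (m[0] > 0.5)                 # each m has shape (1, H, W)
    return mask
```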
Step 2, visual odometry
Inter-frame image matching and local map optimization are performed according to the ORB features of the image and the semantic segmentation result, thereby updating the map and the state.
As shown in fig. 2, the method specifically comprises the following steps:
step 2.1, inter-frame image matching
Whether the current frame is a keyframe is judged through frame tracking. If it is a keyframe, map points and semantics are updated so as to complete the local map optimization and update the map; if not, triangle matching is carried out and the semantic and depth information is updated so as to update the state.
On the first flight, the image is matched against the pose of a pre-stored base map.
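The branching logic of step 2.1 can be summarised by the following skeleton; the tracker, keyframe test and map-update routines are deliberately left abstract, since the patent does not specify them at code level, and only the decision flow of the text is shown.

```python
class VisualOdometry:
    def __init__(self, local_map, sliding_window):
        self.local_map = local_map
        self.window = sliding_window

    def process_frame(self, frame):
        pose, matches = self.track(frame)            # frame tracking against the last keyframe
        if self.is_keyframe(frame, matches):
            self.update_map_points(frame, pose)      # insert new map points and their semantics
            self.local_map.optimize(self.window)     # local map optimization (steps 2.2-2.3)
        else:
            self.triangulate(frame, pose)            # triangle matching; refresh semantics/depth
        return pose

    # Placeholders for the concrete algorithms used in the embodiment.
    def track(self, frame): raise NotImplementedError
    def is_keyframe(self, frame, matches): raise NotImplementedError
    def update_map_points(self, frame, pose): raise NotImplementedError
    def triangulate(self, frame, pose): raise NotImplementedError
```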
Step 2.2: under the geometric, photometric and semantic constraints between the two image frames, a reprojection error constraint term within a sliding window, a pixel photometric error constraint term within a neighborhood, and a semantic error constraint term are established.
(a) Reprojection error
The reprojection error is the positional deviation in the image between an observed point (x_j, y_j)^T and the back-projection of its corresponding map point, and is defined by equation (3) (given as an image in the original),
where T_i is the pose of image frame i in the world coordinate system, T_j is the pose of image frame j in the world coordinate system, K is the camera intrinsic calibration matrix, π denotes the conversion from homogeneous to Cartesian coordinates, and d is the estimated depth of the key point;
(b) pixel photometric error in the neighborhood
For any point p ∈ Ω_i on image frame i, the photometric error on another frame j is defined as
E_photo = ∑_{p∈N_p} ω_p ‖ (I_j[p'] − b_j) − (t_j e^{a_j})/(t_i e^{a_i}) · (I_i[p] − b_i) ‖_γ    (4)
where N_p is the set of pixels in the neighborhood around point p, I_i and I_j are the two adjacent image frames, t_i and t_j are their exposure times, a_i, b_i, a_j and b_j are the affine illumination correction coefficients, ‖·‖_γ denotes the Huber norm, ω_p is a gradient-dependent weight, and p' is the projection of p onto image frame I_j;
(c) semantic error
The semantic error measures, for a map point labelled with semantic class c, the probability that the corresponding pixel in the image also carries the semantic label c. When a semantic image S_k is observed with camera pose T_k, this probability is taken to be inversely related to the distance between the two-dimensional pixel coordinates π(T_k, p_i), obtained by re-projecting the associated three-dimensional point p_i into S_k, and the nearest image region labelled c. The semantic error of the map point is defined by equations (5)-(7) (given as images in the original), in which DT denotes a distance-transform function on the binary image of class c, σ is the uncertainty of the semantic image segmentation, and the probability vector of the map point p_i is computed from all of its observations: if p_i is observed by camera T_i and F is the set of keyframes to be optimized, the vector accumulates the observation likelihoods over F.
Defining the sum of the reprojection error constraint term of the first feature points within the sliding window, the pixel photometric error constraint term within the neighborhood, and the semantic error constraint term as the visual error sum function gives
E_vis = η_R · E_reproj + η_P · E_photo + η_S · E_sem    (8)
where η_R, η_P and η_S are the weighting coefficients of the reprojection error constraint term, the neighborhood pixel photometric error constraint term and the semantic error constraint term respectively; these coefficients are determined by analyzing the image semantic information of the current scene.
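A minimal sketch of the joint visual objective of equation (8) follows; the three per-term callables stand in for evaluations of equations (3), (4) and (5)-(7), and the default weight values are assumptions, since the text only states that the weights are chosen from the semantic content of the current scene.

```python
def visual_error(frame_pairs, reproj_term, photo_term, semantic_term,
                 eta_r=1.0, eta_p=0.5, eta_s=0.25):
    """frame_pairs: iterable of (frame_i, frame_j) drawn from the sliding window;
    each *_term callable returns the summed error of one constraint term."""
    total = 0.0
    for fi, fj in frame_pairs:
        total += eta_r * reproj_term(fi, fj)      # reprojection error of matched first features
        total += eta_p * photo_term(fi, fj)       # photometric error of neighbourhood pixels
        total += eta_s * semantic_term(fi, fj)    # semantic error of labelled map points
    return total
```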
Step 2.3: visual and inertial measurements are combined for integrated navigation. Assuming that the relative pose of the camera and the inertial measurement unit has been accurately calibrated, local map optimization is achieved by simultaneously solving, within the sliding window, for the minimum of the visual error sum function and the inertial error sum function.
Estimating the camera motion from the information of several keyframes in the sliding window is equivalent to finding the solution at which equation (8) attains its minimum, so the estimation of the camera motion state becomes a nonlinear optimization problem, as in equation (9):
X* = argmin_X E_vis(X)    (9)
When the Gauss-Newton iteration is used to solve this problem, the normal equations
J^T W J · Δx = −J^T W r
must be computed, where W is the weight matrix, r is the residual vector, and J is the Jacobian matrix.
Step 3, loop detection
Loop detection mainly addresses the pose drift that accumulates in the visual odometer over time while the unmanned aerial vehicle platform moves during positioning and mapping; in essence it is an image matching problem.
The method adopts a deep-learning-based loop detection approach, making full use of the excellent feature-learning ability of deep networks to obtain a good feature representation of the images acquired by the unmanned aerial vehicle platform and to improve the effectiveness of loop detection. After the deep neural network features of the image are obtained with a convolutional neural network, the learned features are reduced in dimensionality and loop detection is realized through similarity computation.
In the method, a pre-trained and widely used VGG16 network serves as the backbone; different convolutional and fully connected layers are added to detect targets in each image frame while acquiring higher-level semantic information, as in the classic target detection network SSD (Single Shot MultiBox Detector). In the SSD, a convolutional network and auxiliary network structure are added after the VGG16 network so that, after layer-by-layer convolution and pooling, the extracted feature maps keep shrinking as the network deepens.
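As an illustrative sketch, a fixed-length keyframe descriptor can be taken from the convolutional part of a pre-trained VGG16 (torchvision), as suggested above; the global average pooling used to obtain a 512-dimensional vector is an assumed choice, ImageNet normalisation is omitted for brevity, and torchvision ≥ 0.13 is assumed for the weights argument.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

backbone = torchvision.models.vgg16(weights="DEFAULT").features.eval()

@torch.no_grad()
def keyframe_descriptor(image_rgb):
    """Return a (1, 512) global descriptor for one keyframe image."""
    feat = backbone(to_tensor(image_rgb).unsqueeze(0))           # (1, 512, h, w) conv features
    return torch.flatten(torch.nn.functional.adaptive_avg_pool2d(feat, 1), 1)
```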
Generally, the highest layer of a convolutional neural network is a fully connected layer, and the dimensionality of the image features it produces is usually high, which imposes a heavy computational burden on training the corresponding classifier and regression method. To suit the low-power unmanned aerial vehicle platform, an affine transformation is used to reduce the dimensionality of the features obtained by the convolutional neural network, which is necessary for the subsequent similarity computation and positioning. On the premise that the high-dimensional data and the dimensionality-reduced low-dimensional data share a similar or identical probability distribution, distances measured in the high-dimensional space remain similar after the data are mapped into the low-dimensional space. Using an energy-based model, the probability distributions of the samples in the high-dimensional and the low-dimensional space can be defined as
p_{j|i} = exp(−‖x_i − x_j‖²) / ∑_{k≠i} exp(−‖x_i − x_k‖²),  q_{j|i} = exp(−‖y_i − y_j‖²) / ∑_{k≠i} exp(−‖y_i − y_k‖²)    (10)
where y is the vector to which x is mapped in the low-dimensional space.
Since the distributions of the feature vectors in the high-dimensional and the low-dimensional space are to be kept consistent, the distance between the two distributions can be measured with the KL divergence:
H = ∑_i KL(P_i ‖ Q_i) = ∑_i ∑_j p_{j|i} (log p_{j|i} − log q_{j|i})    (11)
where P_i and Q_i are the probability distributions of sample i in the high-dimensional and the low-dimensional space respectively. By minimizing the KL divergence measure, a corresponding dimension-reduction mapping is learned, i.e. the features are simplified without losing data information, speeding up the processing of the image data.
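The dimensionality reduction described by equations (10)-(11) minimises the KL divergence between the high- and low-dimensional distributions; as a stand-in sketch only, t-SNE (scikit-learn) performs this kind of KL minimisation and can be used offline to verify the idea, although it does not by itself provide the parametric (affine) mapping mentioned above.

```python
import numpy as np
from sklearn.manifold import TSNE

def reduce_descriptors(descriptors: np.ndarray, dim: int = 3) -> np.ndarray:
    """descriptors: (n_keyframes, D) CNN features -> (n_keyframes, dim);
    the perplexity must stay below the number of keyframes."""
    return TSNE(n_components=dim, init="pca", perplexity=10).fit_transform(descriptors)
```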
In visual SLAM, the unmanned aerial vehicle acquires keyframes as it flies continuously. Each time a keyframe is acquired, the required features are extracted with the deep convolutional network and then vectorized into the corresponding feature vector, and loop detection is performed by computing the similarity between the current frame and the historical frames. The set of keyframes acquired by the unmanned aerial vehicle platform is recorded as P = {p1, p2, …, pn}, and loop detection is computed for each keyframe image; that is, for every keyframe image the convolutional neural network yields a corresponding vectorized feature, the set of which is expressed as F = {f1, f2, …, fn}. After the feature vectors have been reduced in dimensionality, their similarity is computed. In the similarity computation, the dimensionality-reduced feature vectors of keyframe 1 and keyframe 2 are recorded respectively as
V1 = (v1,1, v1,2, …, v1,m),  V2 = (v2,1, v2,2, …, v2,m)    (12)
where m is the feature dimension after the dimensionality-reduction operation.
In the method, each pair of reduced feature values is compared through a Gaussian function, giving
s_k = exp(−(v1,k − v2,k)² / (2σ²)),  k = 1, …, m    (13)
and the similarity score between the two keyframes is then the average of the sum over all features,
Score = (1/m) ∑_{k=1}^{m} s_k    (14)
where m is the feature dimension after dimensionality reduction.
Loop detection judges whether the similarity score between the two keyframes reaches a similarity threshold. If the similarity score is greater than the threshold, a loop is detected, a similarity transformation is applied to the current keyframe image, and the corrected keyframe pose is output; if the similarity score is less than the threshold, no loop is detected.
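The similarity test of equations (13)-(14) reduces to a few lines; the Gaussian bandwidth sigma and the similarity threshold are assumed tuning parameters, not values given in the patent.

```python
import numpy as np

def similarity(v1: np.ndarray, v2: np.ndarray, sigma: float = 1.0) -> float:
    s = np.exp(-((v1 - v2) ** 2) / (2.0 * sigma ** 2))   # Gaussian comparison per feature
    return float(np.mean(s))                             # average over the m dimensions

def detect_loop(current: np.ndarray, history, threshold: float = 0.85):
    """Return the index of the best-matching historical keyframe, or None."""
    scores = [similarity(current, h) for h in history]
    if not scores:
        return None
    best = int(np.argmax(scores))
    return best if scores[best] > threshold else None
```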
Step 4, global optimization and map creation
The global map is optimized, and a flight path is planned for navigation, according to the map and state updated in step 2 and the keyframe poses corrected in step 3.
As the visual mileage keeps increasing, map points become more and more numerous; to guarantee real-time operation of the visual odometer, local map optimization processes only the map points near the current position. All map points are optimized over a tree or graph through a filtering theory (EKF, PF, etc.) or an optimization theory to obtain the optimal pose estimate and map. Given the repetitive nature of unmanned aerial vehicle missions, the constructed map is fully reused, and semantic information is added to the map to support inter-frame image matching, local map optimization and loop detection based on semantic constraints.
On the basis of a comprehensive analysis of the technical strengths and weaknesses of the typical SLAM methods, the method improves the SLAM method for the characteristics of complex application environments. To overcome the drawbacks that the point-cloud maps constructed by mainstream visual SLAM algorithms are sparse and that point features struggle to express environmental structure accurately, straight lines are used as features to build the map, and a graph optimization method is used to improve the positioning accuracy and the map construction accuracy.
To overcome the linearization error, low update efficiency and other problems of the traditional filtering-based pose estimation, the invention designs a pose optimization method based on nonlinear optimization, following the graph-optimization idea in SLAM research, which realizes local pose optimization and global pose optimization at the same time. The classical SLAM mathematical model consists of three parts: the state variables, the motion equation and the observation equation. For monocular camera input, the objective function of the method uses only the state variables and the observation equation, as shown in equation (15):
(x, y)* = argmin_{x,y} ∑_{(k,j)∈C} ‖ e(x_k, y_i, z_{k,j}) ‖²    (15)
where C is the optimal consistent sample set; x_k is the motion pose of the unmanned aerial vehicle at time k, called a pose node in the graph optimization; y_i is a landmark observable by the robot's visual sensor at time k, called a point node in the graph optimization; and z_{k,j} represents the constraint between a pose node and a point node. When the optimization step is loop detection, z_{k,j} represents the constraint between two pose nodes. e(x_k, y_i, z_{k,j}) is a vector error function reflecting the degree to which the two nodes satisfy the constraint; it is represented in the graph optimization as the connecting line between the two nodes, also called an edge. The smaller its value, the better the two nodes satisfy the matching constraint, and a value of 0 indicates that the matching constraint is satisfied exactly.
Assuming that the error function follows a Gaussian distribution, the Bayesian formulation of the SLAM problem is: solve for the (x, y) that makes the system most likely to have produced the current observed data Z, i.e. the maximum likelihood estimate shown in equation (16)
(x, y)* = argmax_{x,y} P(Z | x, y)    (16)
Minimizing the quadratic form of e(x_k, y_i, z_{k,j}) after solving for the maximum likelihood, i.e. after assuming Gaussian noise, is essentially equivalent to a least-squares problem. This nonlinear least-squares problem can be solved with the graph optimization tool g2o.
The SLAM problem is thus fully abstracted into nodes and edges: nodes are the optimization variables, comprising pose nodes and point nodes, and edges are the error constraints, comprising pose-point constraints and pose-pose constraints. Pose-point constraints come from the camera observations, and pose-pose constraints come from the loop detection in the system. Combining the two effectively suppresses the system's accumulated error and yields a map with consistent information.
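The node-and-edge formulation of equations (15)-(16) is illustrated below on a deliberately tiny 2D toy problem: pose nodes and point nodes are the variables, pose-point edges come from camera observations, and pose-pose edges come from loop closure. In practice the least-squares problem is solved with a graph optimization tool such as g2o; scipy is used here only to show the structure, and all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, n_poses, obs_edges, loop_edges):
    poses = params[:2 * n_poses].reshape(-1, 2)        # x_k: 2D pose nodes (toy example)
    points = params[2 * n_poses:].reshape(-1, 2)       # y_i: 2D point nodes (landmarks)
    res = []
    for k, i, z in obs_edges:                          # pose-point constraints z_{k,i}
        res.append(points[i] - poses[k] - z)
    for k, j, z in loop_edges:                         # pose-pose constraints from loop closure
        res.append(poses[j] - poses[k] - z)
    return np.concatenate(res)

# toy usage: two poses, one landmark, one loop-closure edge
obs = [(0, 0, np.array([1.0, 0.0])), (1, 0, np.array([0.0, 0.0]))]
loops = [(0, 1, np.array([1.0, 0.0]))]
x0 = np.zeros(2 * 2 + 2)                               # initial guess for poses and landmark
sol = least_squares(residuals, x0, args=(2, obs, loops))
```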
The method uses an NVIDIA Jetson TX2 embedded computing module as its computing platform; this is an AI supercomputer-on-a-module based on the NVIDIA Pascal architecture, with the technical parameters detailed in Table 1.
TABLE 1 NVIDIA Jetson TX2 computing module hardware resources
[Table 1 is reproduced as an image in the original publication.]
Algorithm testing and verification are performed on a ground experimental platform using four widely used public data sets and one self-collected data set; the ground experimental platform is an NVIDIA Jetson TX2 developer kit. Test data sets are selected to match typical flight environments of the micro unmanned aerial vehicle, and the verification items include navigation accuracy, computational complexity, robustness to illumination change, robustness to scene change, and so on, as detailed in Table 2.
TABLE 2 Test items corresponding to each data set
Data set | Test environment | Test characteristics
KITTI data set | Indoors and city blocks | Dynamic scenes
EuRoC data set | Indoor rooms, industrial areas | Dynamic scenes
TUM data set | Indoor corridors and jungle | Indoor loop paths
Oxford data set | City blocks (buildings and roads) | Outdoor loop paths
ICL-NUIM data set | Indoor rooms | Indoor loop paths
Self-collected data set | Roads | Homogeneous scenes

Claims (6)

1. A visual navigation method for a microminiature unmanned aerial vehicle in a high dynamic scene is characterized by comprising the following steps:
[1] Image preprocessing
Photometric correction is first applied to each input frame to eliminate the influence of illumination changes on the image, after which first feature extraction, semantic segmentation and deep neural network feature extraction are performed on the input image;
the photometric correction of each input frame is specifically:
the imaging model of each input frame is
I_i(x) = G(t_i V(x) B_i(x))    (1)
where G: R → [0,255] is the nonlinear response function, V: Ω → [0,1] is the lens attenuation, B_i and I_i are respectively the irradiance and the observed intensity of image frame i, and t_i is the exposure time;
the photometric correction of each input frame is
I'_i(x) = t_i B_i(x) = G⁻¹(I_i(x)) / V(x)    (2)
the correction result obtained by applying equation (2) to the model of equation (1) is used for photometric parameter estimation during the local map optimization of step 2;
[2] Visual odometry
Inter-frame image matching and local map optimization are performed according to the first features of the image and the semantic segmentation result, thereby updating the map and the state;
the inter-frame image matching specifically comprises: judging through frame tracking whether the current frame is a keyframe; if it is a keyframe, updating map points and semantics so as to complete the local map optimization and update the map; if it is not a keyframe, performing triangle matching and updating the semantic and depth information to update the state;
the specific steps of judging whether the current frame is a keyframe and, for a keyframe, updating map points and semantics so as to complete the local map optimization are:
under the geometric, photometric and semantic constraints between two image frames, establishing a reprojection error constraint term within a sliding window, a pixel photometric error constraint term within a neighborhood, and a semantic error constraint term;
wherein (a) reprojection error
the reprojection error is the positional deviation in the image between an observed point (x_j, y_j)^T and the back-projection of its corresponding map point, and is defined by equation (3) (given as an image in the original),
where T_i is the pose of image frame i in the world coordinate system, T_j is the pose of image frame j in the world coordinate system, K is the camera intrinsic calibration matrix, π denotes the conversion from homogeneous to Cartesian coordinates, and d is the estimated depth of the key point;
(b) pixel photometric error in the neighborhood
for any point p ∈ Ω_i on image frame i, the photometric error on another frame j is defined as
E_photo = ∑_{p∈N_p} ω_p ‖ (I_j[p'] − b_j) − (t_j e^{a_j})/(t_i e^{a_i}) · (I_i[p] − b_i) ‖_γ    (4)
where N_p is the set of pixels in the neighborhood around point p, I_i and I_j are the two adjacent image frames, t_i and t_j are their exposure times, a_i, b_i, a_j and b_j are the affine illumination correction coefficients, ‖·‖_γ denotes the Huber norm, ω_p is a gradient-dependent weight, and p' is the projection of p onto image frame I_j;
(c) semantic error
the semantic error measures, for a map point labelled with semantic class c, the probability that the corresponding pixel in the image also carries the semantic label c; when a semantic image S_k is observed with camera pose T_k, this probability is inversely related to the distance between the two-dimensional pixel coordinates π(T_k, p_i), obtained by re-projecting the associated three-dimensional point p_i into S_k, and the nearest image region labelled c; the semantic error of the map point is defined by equations (5)-(7) (given as images in the original), in which DT denotes a distance-transform function on the binary image of class c, σ is the uncertainty of the semantic image segmentation, and the probability vector of the map point p_i is computed from all of its observations: if p_i is observed by camera T_i and F is the set of keyframes to be optimized, the vector accumulates the observation likelihoods over F;
defining the sum of the reprojection error constraint term of the first feature points within the sliding window, the pixel photometric error constraint term within the neighborhood, and the semantic error constraint term as the visual error sum function gives
E_vis = η_R · E_reproj + η_P · E_photo + η_S · E_sem    (8)
where η_R, η_P and η_S are the weighting coefficients of the reprojection error constraint term, the neighborhood pixel photometric error constraint term and the semantic error constraint term respectively, determined by analyzing the image semantic information of the current scene;
visual and inertial measurements are combined for integrated navigation: assuming that the relative pose of the camera and the inertial measurement unit has been accurately calibrated, local map optimization is achieved by simultaneously solving, within the sliding window, for the minimum of the visual error sum function and the inertial error sum function; estimating the camera motion from the information of several keyframes in the sliding window is equivalent to finding the solution at which equation (8) attains its minimum, so the estimation of the camera motion state becomes a nonlinear optimization problem, as in equation (9):
X* = argmin_X E_vis(X)    (9)
when the Gauss-Newton iteration is used to solve this problem, the normal equations
J^T W J · Δx = −J^T W r
must be computed, where W is the weight matrix, r is the residual vector, and J is the Jacobian matrix;
[3] Loop detection
The deep neural network features extracted from the input image are first reduced in dimensionality to simplify them and are recorded as the current frame; historical keyframes are then searched in the global map and the similarity between the current frame and the historical frames is computed; if a loop exists, a similarity transformation is applied to the current frame and the corrected keyframe pose is output; if no loop exists, processing continues with the deep neural network features extracted from new input images;
[4] Global optimization and map creation
Local optimization is performed according to the locally updated map and state obtained in step 2 and the corrected keyframe pose obtained in step 3; all map points are optimized over a tree or graph through a filtering theory or an optimization theory, finally yielding the optimal global pose estimate and map.
2. The visual navigation method for the micro unmanned aerial vehicle in the high dynamic scene as claimed in claim 1, wherein:
the first feature extraction in step 1 is performed by using FAST, conner, ORB, SURF, or SIFT algorithm.
3. The visual navigation method for the micro unmanned aerial vehicle in the high dynamic scene as claimed in claim 2, wherein:
the dimensionality-reduction processing of the extracted deep neural network features in the loop detection of step 3, which simplifies the features, is specifically:
on the premise that the high-dimensional data and the dimensionality-reduced low-dimensional data share a similar or identical probability distribution, distances measured in the high-dimensional space remain similar after the data are mapped into the low-dimensional space;
using an energy-based model, the probability distributions of the samples in the high-dimensional and the low-dimensional space can be defined as
p_{j|i} = exp(−‖x_i − x_j‖²) / ∑_{k≠i} exp(−‖x_i − x_k‖²),  q_{j|i} = exp(−‖y_i − y_j‖²) / ∑_{k≠i} exp(−‖y_i − y_k‖²)    (10)
where y is the vector to which x is mapped in the low-dimensional space;
to keep the distributions of the feature vectors consistent between the high-dimensional and the low-dimensional space, the distance between the two distributions is measured with the KL divergence:
H = ∑_i KL(P_i ‖ Q_i) = ∑_i ∑_j p_{j|i} (log p_{j|i} − log q_{j|i})    (11)
where P_i and Q_i are the probability distributions of sample i in the high-dimensional and the low-dimensional space respectively; by minimizing the KL divergence measure, a corresponding dimension-reduction mapping is learned, i.e. the features are simplified without losing data information.
4. The visual navigation method for the micro unmanned aerial vehicle in the high dynamic scene as claimed in claim 3, wherein:
the calculating the similarity between the current frame and the historical frame specifically comprises:
the current frame set is recorded as P ═ { P1, P2, …, pn };
the historical frame set is recorded as F ═ { F1, F2, …, fn };
respectively recording the feature vectors of the key frame 1 and the key frame 2 after dimensionality reduction as
Figure FDA0003568487030000052
Wherein m is the characteristic dimension after the dimension reduction operation;
calculating each characteristic value after dimension reduction through a Gaussian function to obtain
Figure FDA0003568487030000053
Then, the similarity score between two key frames is the average of the sum of all features
(formula image FDA0003568487030000054 in the original, giving the similarity score as the mean over all m features)
where m is the feature dimension after dimensionality reduction;
judging whether the similarity score between the two key frames reaches a similarity threshold: if the score exceeds the threshold, a loop is detected, a similarity transformation is applied to the current key frame image, and the corrected key frame pose is output; if the score is below the threshold, no loop is detected.
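As an illustrative sketch only, the similarity scoring and threshold test described in this claim might look as follows in numpy; the Gaussian bandwidth sigma and the threshold value are assumptions, since the text leaves them unspecified.

```python
import numpy as np

def keyframe_similarity(v1, v2, sigma=1.0):
    """Per-feature Gaussian similarity, averaged over the m dimension-reduced features."""
    g = np.exp(-np.square(np.asarray(v1) - np.asarray(v2)) / (2.0 * sigma ** 2))
    return float(g.mean())

def detect_loop(current, history, threshold=0.85):
    """Return (index, score) of the best-matching historical key frame, or None if below threshold."""
    if len(history) == 0:
        return None
    scores = [keyframe_similarity(current, h) for h in history]
    best = int(np.argmax(scores))
    return (best, scores[best]) if scores[best] > threshold else None
```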
5. The visual navigation method for the micro unmanned aerial vehicle in the high dynamic scene as claimed in claim 2, wherein:
the optimization of the tree or the map over all map points through filtering theory or optimization theory in step 4 specifically comprises:
the pose optimization method based on nonlinear optimization realizes local pose optimization and global pose optimization simultaneously; that is, when the monocular camera inputs images, the objective function is determined by the state variables and the observation equation:
(formula image FDA0003568487030000061 in the original, giving the objective function over the consistent set C)
where C represents the optimal consistent set of samples; x_k represents the pose of the unmanned aerial vehicle at the k-th moment and is called a pose node in the graph optimization; y_i represents a landmark observable by the robot vision sensor at the k-th moment and is called a point node in the graph optimization; z_{k,j} represents the constraint between a pose node and a point node; when the optimization step is loop detection, z_{k,j} represents the constraint between two pose nodes; e(x_k, y_i, z_{k,j}) is a vector error function reflecting the degree to which the two nodes satisfy the constraint, represented in the graph as the edge connecting them: the smaller its value, the better the two nodes satisfy the matching constraint, and a value of 0 indicates that the constraint is fully satisfied;
assuming that the error function follows a Gaussian distribution, the Bayesian formulation is: solve for the (x, y) that makes the system most likely to produce the current observation z, i.e. the maximum likelihood estimate is as follows:
(formula image FDA0003568487030000062 in the original, giving the maximum likelihood estimate)
minimizing e(x_k, y_i, z_{k,j}) is therefore essentially solving for the maximum likelihood.
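For illustration, under the Gaussian assumption the maximum-likelihood view of the error terms e(x_k, y_i, z_{k,j}) reduces to a nonlinear least-squares problem. The toy sketch below shows the residual-stacking formulation on a 2-D position-and-landmark example using scipy's least_squares; the scenario, variable names, and noise level are all illustrative, not taken from the patent.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy scene: two unknown 2-D camera positions x_k and two known landmarks y_j.
landmarks = np.array([[2.0, 0.0], [0.0, 2.0]])
true_poses = np.array([[0.0, 0.0], [1.0, 0.0]])
rng = np.random.default_rng(0)
# z_{k,j}: landmark position observed relative to pose k, with a little noise.
measurements = [(k, j, landmarks[j] - true_poses[k] + 0.01 * rng.standard_normal(2))
                for k in range(2) for j in range(2)]

def residuals(flat_poses):
    """Stack e(x_k, y_j, z_kj) = (y_j - x_k) - z_kj over all constraints."""
    poses = flat_poses.reshape(-1, 2)
    return np.concatenate([(landmarks[j] - poses[k]) - z for k, j, z in measurements])

result = least_squares(residuals, x0=np.zeros(4))   # Gauss-Newton style minimization
estimated_poses = result.x.reshape(-1, 2)           # close to true_poses
```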
6. The visual navigation method for the micro unmanned aerial vehicle in the high dynamic scene as claimed in any one of claims 1 to 5, wherein:
the semantic segmentation of targets in the image is performed using the DFANet or Mask R-CNN segmentation algorithm, and the deep neural network feature extraction is performed using a convolutional neural network.
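As an illustrative sketch (not the patent's implementation), segmentation of potentially dynamic targets with Mask R-CNN could use torchvision's pretrained detector, assuming a recent torchvision version; the score threshold is an assumption.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN instance segmentation model (COCO classes).
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def target_masks(image_chw, score_thresh=0.5):
    """Binary masks of detected objects in a float CHW tensor scaled to [0, 1]."""
    output = model([image_chw])[0]                 # dict with boxes, labels, scores, masks
    keep = output["scores"] > score_thresh
    return output["masks"][keep, 0] > 0.5          # (N, H, W) boolean masks
```

Such masks would typically be used to exclude image features that fall on the segmented targets before pose estimation.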
CN202010381182.7A 2020-05-08 2020-05-08 Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene Active CN111693047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010381182.7A CN111693047B (en) 2020-05-08 2020-05-08 Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010381182.7A CN111693047B (en) 2020-05-08 2020-05-08 Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene

Publications (2)

Publication Number Publication Date
CN111693047A CN111693047A (en) 2020-09-22
CN111693047B true CN111693047B (en) 2022-07-05

Family

ID=72477337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010381182.7A Active CN111693047B (en) 2020-05-08 2020-05-08 Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene

Country Status (1)

Country Link
CN (1) CN111693047B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112461228B (en) * 2020-11-03 2023-05-09 南昌航空大学 IMU and vision-based secondary loop detection positioning method in similar environment
CN112396596A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Closed loop detection method based on semantic segmentation and image feature description
CN112697158A (en) * 2020-12-03 2021-04-23 南京工业大学 Man-made loop-back instant positioning and picture building method and system for indoor and outdoor scenes
CN114723779A (en) * 2021-01-06 2022-07-08 广州汽车集团股份有限公司 Vehicle positioning method and device and computer readable storage medium
CN112767481B (en) * 2021-01-21 2022-08-16 山东大学 High-precision positioning and mapping method based on visual edge features
CN113052297B (en) * 2021-03-04 2022-11-22 吉林大学 Towing cable attitude calculation method and system based on convolution neural network fusion EKF
CN113108771B (en) * 2021-03-05 2022-08-16 华南理工大学 Movement pose estimation method based on closed-loop direct sparse visual odometer
CN113156942B (en) * 2021-03-26 2023-07-21 北京师范大学 Wide area environment coding method and system based on spatial memory neural mechanism
CN113239936B (en) * 2021-04-26 2024-05-28 大连理工大学 Unmanned aerial vehicle visual navigation method based on deep learning and feature point extraction
CN113188557B (en) * 2021-04-28 2023-10-20 江苏方天电力技术有限公司 Visual inertial integrated navigation method integrating semantic features
CN113587933B (en) * 2021-07-29 2024-02-02 山东山速机器人科技有限公司 Indoor mobile robot positioning method based on branch-and-bound algorithm
CN113673462B (en) * 2021-08-27 2023-12-19 西安电子科技大学广州研究院 Logistics AGV positioning method based on lane lines
CN114199250A (en) * 2021-12-03 2022-03-18 清华大学 Scene matching navigation method and device based on convolutional neural network
CN114241050B (en) * 2021-12-20 2024-05-07 东南大学 Camera pose optimization method based on Manhattan world assumption and factor graph
CN114460943B (en) * 2022-02-10 2023-07-28 山东大学 Self-adaptive target navigation method and system for service robot
CN116823949B (en) * 2023-06-13 2023-12-01 武汉天进科技有限公司 Miniaturized unmanned aerial vehicle airborne real-time image processing device
CN117824664B (en) * 2024-03-05 2024-05-28 河海大学 Active SLAM method of autonomous unmanned system based on multi-beam sounding sonar

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109307508A (en) * 2018-08-29 2019-02-05 中国科学院合肥物质科学研究院 A kind of panorama inertial navigation SLAM method based on more key frames
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semanteme SLAM method, processor and robot based on object example match
CN110060277A (en) * 2019-04-30 2019-07-26 哈尔滨理工大学 A kind of vision SLAM method of multiple features fusion
CN110853085A (en) * 2018-08-21 2020-02-28 深圳地平线机器人科技有限公司 Semantic SLAM-based mapping method and device and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080166A1 (en) * 2017-09-13 2019-03-14 TuSimple Data acquistion and input of neural network method for deep odometry assisted by static scene optical flow
CN110345944A (en) * 2019-05-27 2019-10-18 浙江工业大学 Merge the robot localization method of visual signature and IMU information
CN110827415B (en) * 2019-11-11 2022-08-23 吉林大学 All-weather unknown environment unmanned autonomous working platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853085A (en) * 2018-08-21 2020-02-28 深圳地平线机器人科技有限公司 Semantic SLAM-based mapping method and device and electronic equipment
CN109307508A (en) * 2018-08-29 2019-02-05 中国科学院合肥物质科学研究院 A kind of panorama inertial navigation SLAM method based on more key frames
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semanteme SLAM method, processor and robot based on object example match
CN110060277A (en) * 2019-04-30 2019-07-26 哈尔滨理工大学 A kind of vision SLAM method of multiple features fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Visual SLAM Method Based on Semantic Segmentation in Dynamic Environments; Zhang Jianbo; China Masters' Theses Full-text Database (Information Science and Technology); 2020-02-15 (No. 2); main text chapters 2-5, figures 2-1 to 5-12 *
Loop Closure Detection Algorithm Based on Convolutional Neural Network; Luo Shunxin, Zhang Sunjie; Computer & Digital Engineering; 2019-05-31; Vol. 47 (No. 5); main text sections 1-5, figures 1-14 *
A Survey of Monocular V-SLAM for Multi-rotor Unmanned Aerial Vehicles; Zhao Liangyu, Zhu Yeqing, Jin Rui; Aero Weaponry; 2020-04-30; Vol. 27 (No. 2); main text sections 1-5, figures 1-4 *

Also Published As

Publication number Publication date
CN111693047A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111693047B (en) Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene
Wang et al. F-loam: Fast lidar odometry and mapping
Huang et al. Visual odometry and mapping for autonomous flight using an RGB-D camera
Li et al. Deep sensor fusion between 2D laser scanner and IMU for mobile robot localization
Kümmerle et al. Large scale graph-based SLAM using aerial images as prior information
Konolige et al. FrameSLAM: From bundle adjustment to real-time visual mapping
CN102426019B (en) Unmanned aerial vehicle scene matching auxiliary navigation method and system
CN110717927A (en) Indoor robot motion estimation method based on deep learning and visual inertial fusion
He et al. Vision-based UAV flight control and obstacle avoidance
CN113776519B (en) AGV vehicle mapping and autonomous navigation obstacle avoidance method under lightless dynamic open environment
Li et al. Indoor multi-sensor fusion positioning based on federated filtering
Graf et al. Optimization-based terrain analysis and path planning in unstructured environments
He et al. Online semantic-assisted topological map building with LiDAR in large-scale outdoor environments: Toward robust place recognition
CN111739066B (en) Visual positioning method, system and storage medium based on Gaussian process
Wang et al. A submap joining based RGB-D SLAM algorithm using planes as features
CN117367427A (en) Multi-mode slam method applicable to vision-assisted laser fusion IMU in indoor environment
CN116563341A (en) Visual positioning and mapping method for processing dynamic object in complex environment
CN115950414A (en) Adaptive multi-fusion SLAM method for different sensor data
CN114202701A (en) Unmanned aerial vehicle vision repositioning method based on object semantics
CN113947636B (en) Laser SLAM positioning system and method based on deep learning
CN115962773A (en) Method, device and equipment for synchronous positioning and map construction of mobile robot
Li et al. Laser-based SLAM with efficient occupancy likelihood map learning for dynamic indoor scenes
Yoshisada et al. Indoor map generation from multiple LiDAR point clouds
CN115187614A (en) Real-time simultaneous positioning and mapping method based on STDC semantic segmentation network
Hernández-García et al. 3d city models: Mapping approach using lidar technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant