CN111369608A - Visual odometry method based on image depth estimation - Google Patents
Visual odometry method based on image depth estimation
- Publication number: CN111369608A
- Application number: CN202010478460.0A
- Authority: CN (China)
- Prior art keywords: image, depth, estimation, loss, pose
- Prior art date: 2020-05-29
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T7/50 — Image analysis: depth or shape recovery
- G06N3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks: combinations of networks
- G06N3/084 — Learning methods: backpropagation, e.g. using gradient descent
- G06T7/11 — Segmentation: region-based segmentation
- G06T7/20 — Image analysis: analysis of motion
Abstract
The invention discloses a visual odometry method based on image depth estimation, addressing the scale-ambiguity problem common to monocular visual odometry with an algorithm that combines depth images and monocular images to enforce a scale-consistency constraint. In the network design, long short-term memory units are fused with convolutional units; a depth estimation network is trained on monocular images, with a photometric consistency loss introduced into the loss function and a smoothness loss added so that more image features are captured and more accurate depth images are produced. The estimated depth image is then combined with the original monocular image to realize the scale-consistency constraint, and the pose estimation network is trained. Experiments and result analysis on the depth estimation network and the pose estimation network show that visual odometry combined with depth image estimation can, to a certain extent, solve the scale-ambiguity problem of monocular visual odometry.
Description
Technical Field
The invention relates to the technical field of visual odometry, in particular to a visual odometry method based on image depth estimation.
Background
The input to monocular visual odometry (i.e., estimating the ego-motion of a vehicle or robot from an image sequence taken from a single viewpoint) is RGB images; however, in computer vision and robotics, depth image (depth map) information also provides vital cues for applications such as autonomous driving, virtual reality (VR), and augmented reality (AR). A depth image, also known as a range image, is an image or image channel that encodes the distance from the viewpoint to object surfaces in the scene; each pixel value represents the actual distance between the sensor and the object. Compared with binocular (stereo) visual odometry, traditional monocular visual odometry suffers from an obvious shortcoming in pose estimation: scale ambiguity. Scale ambiguity means that a monocular visual odometer cannot determine the absolute length of translational motion, i.e., the scale factor, from correlations between features alone. Most existing remedies fuse the image measurements with information from other sensors, such as an inertial navigation system (INS) or GNSS. Although additional sensors can resolve the scale ambiguity, they sacrifice one of the greatest advantages of the monocular configuration: small size and low cost.
Moreover, most methods based on convolutional neural networks (CNNs) treat depth estimation as a single-view task, neglecting the important temporal information in monocular or binocular video. Single-view depth estimation is motivated by the human ability to perceive depth from a single image, but it ignores motion, which is even more important to humans when inferring distance; in addition, moving objects violate the static-scene assumption made during geometric image reconstruction in monocular visual odometry, degrading its performance.
Disclosure of Invention
1. Network architecture design
The network architecture comprises neural networks operating on the monocular depth image and the monocular RGB image; the depth image estimation network and the ego-motion estimation network are trained on a monocular image frame sequence to realize the scale-consistency constraint. The whole process comprises the following steps (a code sketch of this loop follows the list):
S1. For two given consecutive image frames I_t and I_{t+1}, estimate depth separately with the depth estimation network to obtain the corresponding depth images D_t and D_{t+1};
S2. Feed the original image I_t together with its estimated depth image D_t jointly as input to the ego-motion estimation network, which outputs the camera pose prediction at time t;
S3. Convert the estimated pose into a 4×4 pose transformation matrix T, warp the depth image D_t according to T into the next frame, and compute the inconsistency between the warped depth and D_{t+1} as a consistency loss for model training, improving the scale consistency of pose prediction, as shown in Fig. 1.
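The following is a minimal PyTorch-style sketch of this three-step training loop. The depth_net/pose_net modules, the Euler-angle pose parameterization, and the loss-function interface are illustrative assumptions, not components specified by the patent:

```python
import torch

def pose_vec_to_mat(p):
    # Hypothetical helper: 6-DoF vector (tx, ty, tz, rx, ry, rz) -> 4x4
    # transform, with rotation R = Rz @ Ry @ Rx from Euler angles.
    t, r = p[:, :3], p[:, 3:]
    cx, cy, cz = torch.cos(r[:, 0]), torch.cos(r[:, 1]), torch.cos(r[:, 2])
    sx, sy, sz = torch.sin(r[:, 0]), torch.sin(r[:, 1]), torch.sin(r[:, 2])
    R = torch.stack([
        cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx,
        sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx,
        -sy,     cy * sx,                cy * cx,
    ], dim=1).reshape(-1, 3, 3)
    T = torch.eye(4, device=p.device).repeat(p.shape[0], 1, 1)
    T[:, :3, :3], T[:, :3, 3] = R, t
    return T

def training_step(depth_net, pose_net, loss_fn, I_t, I_t1, K):
    # S1: estimate depth for each of the two consecutive frames.
    D_t = depth_net(I_t)      # (B, 1, H, W)
    D_t1 = depth_net(I_t1)    # (B, 1, H, W)
    # S2: the pose network sees the RGB frame jointly with its depth estimate.
    pose6d = pose_net(torch.cat([I_t, D_t], dim=1))   # (B, 6)
    # S3: build the 4x4 transform; the loss warps D_t with T (and intrinsics K)
    # and penalizes its inconsistency with D_t1 (scale-consistency constraint).
    T = pose_vec_to_mat(pose6d)                       # (B, 4, 4)
    return loss_fn(I_t, I_t1, D_t, D_t1, T, K)
```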
2. Depth estimation network
The depth estimation network adopts a self-encoding/decoding U-shaped architecture, as shown in Fig. 2. The invention fuses a recurrent neural unit with the encoder units to form a convolutional long short-term memory unit serving as the self-encoding part of the network, so that spatial and temporal information are exploited simultaneously. The spatio-temporal features computed by the encoder are then fed into the decoder network for accurate depth image estimation and reconstruction; the decoder fuses low-level feature representations from different levels of the encoder through skip connections. Fig. 3 shows the specific parameter settings of the depth estimation network architecture.
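As a rough illustration of how a recurrent unit can be fused into the encoder of such a U-shaped network, the following is a toy two-level PyTorch sketch; the real layer counts and channel widths are those of Fig. 3, not the ones shown here:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Recurrent unit fused with a convolutional encoder stage.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class DepthUNet(nn.Module):
    # Toy two-level U-shaped network with a ConvLSTM bottleneck.
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.lstm = ConvLSTMCell(64, 64)   # spatio-temporal bottleneck
        self.up = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # 64 = 32 decoder + 32 skip

    def forward(self, x, state=None):
        e1 = self.enc1(x)                  # full-resolution low-level features
        e2 = self.enc2(e1)                 # downsampled encoder features
        if state is None:                  # zero state before the first frame
            z = torch.zeros_like(e2)
            state = (z, z)
        h, state = self.lstm(e2, state)    # fuse temporal context across frames
        d = torch.relu(self.up(h))
        d = torch.cat([d, e1], dim=1)      # skip connection from the encoder
        return nn.functional.softplus(self.head(d)), state  # positive depth
```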
3. Pose estimation network
The neural network for pose estimation uses a VGG16 convolutional neural network architecture fused with a recurrent neural unit. The visual odometry network has the following characteristics: 1) in this scheme, the input to the visual odometer includes the depth image information of the current frame, which guarantees scale consistency of the scene between depth and pose; 2) the input to the visual odometer is the joint representation of the image frame and the corresponding depth image at a single time point, while information from previous frames is stored in the hidden state; 3) the visual odometry network maintains the same scene scale when run over the entire image sequence.
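A compact PyTorch sketch of a pose network with these three properties is given below; the VGG16 stack is abbreviated to a few strided convolutions, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    # VGG-style conv stack over the joint RGB+depth input, followed by an
    # LSTM whose hidden state carries the previous frames' information.
    def __init__(self, hid=256):
        super().__init__()
        chans = [4, 64, 128, 256, 256]     # 4 input channels = 3 RGB + 1 depth
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1))
        self.rnn = nn.LSTM(input_size=256, hidden_size=hid, batch_first=True)
        self.fc = nn.Linear(hid, 6)        # 3-D position + 3-D attitude

    def forward(self, rgb, depth, state=None):
        x = torch.cat([rgb, depth], dim=1)       # joint representation at time t
        f = self.cnn(x).flatten(1).unsqueeze(1)  # (B, 1, 256): one time step
        out, state = self.rnn(f, state)          # hidden state spans the sequence
        return self.fc(out[:, -1]), state        # 6-DoF pose prediction
```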
4. Loss function
A photometric consistency loss between the predicted depth image and the known depth image data is computed to train the depth estimation neural network in a supervised manner. Because the photometric term provides little information in low-texture environments, a smoothness loss is also added to the depth estimation. In the visual odometry part, the pose estimation loss is computed from the pose estimated by the network and the ground-truth values provided in the dataset, realizing supervised training of the pose estimation network. A geometric consistency loss is further introduced: the depth image estimated for the previous frame is warped according to the pose transformation matrix, and its difference from the depth image estimated for the next frame is computed. The overall objective loss function is calculated as follows:

  L = α·L_pc + β·L_s + γ·L_pose + μ·L_gc

where L_pc and L_s denote the photometric consistency loss and the smoothness loss respectively, L_pose denotes the pose estimation loss, and L_gc denotes the geometric consistency loss. A corresponding weight parameter is added for each loss category to balance the scale of each loss calculation result, and a parameter is also added to control the degree of smoothing of the depth image.
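In code, the weighted combination might look as follows (a sketch; the weight values are illustrative assumptions, not values disclosed by the patent):

```python
def total_loss(l_pc, l_s, l_pose, l_gc, alpha=1.0, beta=0.1, gamma=1.0, mu=0.5):
    # Weighted sum of the four loss terms; beta also controls how strongly
    # the depth map is smoothed.
    return alpha * l_pc + beta * l_s + gamma * l_pose + mu * l_gc
```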
4.1 Photometric consistency loss and smoothness loss
The brightness-constancy and spatial-smoothness priors used in dense correspondence algorithms are adopted: the photometric difference between the estimated depth image and the actually acquired depth image information is computed and used as the loss function for network training. The photometric consistency loss function is expressed as:

  L_pc = (1/|V|) Σ_{p∈V} d( D̂(p), D(p) )

where |V| denotes the number of pixels in the image and V denotes the set of all pixels in the image. The L1 norm is selected for the distance d in this loss function. The L1 norm loss, also called least absolute deviation or least absolute error, is computed as the sum of the absolute values of the differences between the estimated values and the target values, which is then minimized. Compared with the L2 norm loss, which sums the squared differences, the L1 loss is more robust to outliers. The L1 term in the photometric consistency difference is computed as:

  d( D̂(p), D(p) ) = | D̂(p) − D(p) |
the luminance loss is less in the information quantity provided when the scene is uniformly distributed and the texture is less, more information is generated by calculating multiple differences, and the calculation of smoothness loss is introduced, so that the network can more sensitively sense the edge information in the image, and the accuracy of the output result in a low-texture environment is ensured; the smoothness loss is calculated as follows:
whereinRepresenting the first derivative along the spatial direction by which it is ensured that the smoothness is guided by edges in the image.
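A PyTorch sketch of both terms, assuming depth maps of shape (B, 1, H, W) and images of shape (B, 3, H, W):

```python
import torch

def photometric_l1_loss(d_pred, d_gt):
    # L1 (least absolute deviation) between estimated and acquired depth,
    # averaged over the |V| pixels of the image.
    return (d_pred - d_gt).abs().mean()

def smoothness_loss(depth, image):
    # Edge-aware smoothness: squared first derivatives of the depth map,
    # down-weighted where the image gradient (an edge) is strong.
    dD_dx = depth[..., :, 1:] - depth[..., :, :-1]
    dD_dy = depth[..., 1:, :] - depth[..., :-1, :]
    dI_dx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dI_dy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return ((torch.exp(-dI_dx) * dD_dx) ** 2).mean() + \
           ((torch.exp(-dI_dy) * dD_dy) ** 2).mean()
```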
4.2 Pose estimation loss
The pose estimation loss represents the estimated absolute pose as a six-dimensional vector composed of a three-dimensional vector for position and a three-dimensional vector for attitude. The provided ground-truth pose vector (x, φ) is fitted against the estimated pose vector (x̂, φ̂), and the error between the two is used as the loss function of pose estimation:

  L_pose = ‖ x̂ − x ‖₂ + β · ‖ φ̂ − φ ‖₂

where the parameter β is a scale factor that balances the displacement error against the rotation error.
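A sketch of this loss for batched six-dimensional pose tensors (the value of beta is an illustrative assumption):

```python
def pose_loss(pose_pred, pose_gt, beta=10.0):
    # Six-dimensional pose: entries 0-2 are position, 3-5 are attitude;
    # beta balances displacement error against rotation error.
    t_err = (pose_pred[:, :3] - pose_gt[:, :3]).norm(dim=1)
    r_err = (pose_pred[:, 3:] - pose_gt[:, 3:]).norm(dim=1)
    return (t_err + beta * r_err).mean()
```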
4.3 Geometric consistency loss
The geometric consistency loss enhances the geometric consistency of the predicted results: the depth images D_t and D_{t+1} of two adjacent frames are required to conform to the same scene structure, and the difference between them is minimized. This improves geometric consistency between sample images of the same training batch, and through transitivity realizes geometric consistency over the whole image sequence. For example, if the depth images of I_t and I_{t+1} are kept consistent in one training batch, and those of I_{t+1} and I_{t+2} are consistent in another batch, then the consistency of the depth images of I_t and I_{t+2} can be ensured even though they need not belong to the same batch; in this way the consistency of the depth images of the whole image sequence is realized. In the training process, the pose estimation network and the depth estimation network are thus naturally coupled and can generate scale-consistent predictions over the whole image sequence. Following this constraint, the inconsistency between depth images of adjacent frames is computed: for any pixel p in the depth image, the adjacent-frame depth difference D_diff(p) is defined as:

  D_diff(p) = | D^a_{t+1}(p) − D_{t+1}(p) | / ( D^a_{t+1}(p) + D_{t+1}(p) )

where D_{t+1} is the depth image computed by the depth estimation neural network for the image frame at time t+1, and D^a_{t+1} is the depth image obtained by warping the depth estimate D_t of the frame at time t according to the pose transformation matrix T output by the ego-motion estimation network for the motion from the current time to the next, i.e., D_t transformed into frame t+1.
Because the camera is in continuous motion, the captured scene changes continuously; the validity of the pixels entering the inconsistency computation is ensured by cropping the depth image, and the per-pixel values D_diff(p) are summed to normalize the depth-image difference. During optimization, points at different absolute depths are treated equally, which is more intuitive than working with absolute distances; the function is symmetric and its value range lies between 0 and 1, which ensures stable training. From this inconsistency map, the proposed geometric consistency loss is defined as:

  L_gc = (1/|V|) Σ_{p∈V} D_diff(p)

where V denotes all pixels remaining after the matrix transformation and cropping of the depth image, and |V| denotes the number of pixels in V. This formulation guarantees scale consistency between adjacent image pairs by minimizing the geometric distance of the predicted depths, and propagates that consistency to the whole image sequence through training. The ego-motion estimation network and the depth estimation network are thereby closely linked, and the ego-motion network can ultimately predict globally scale-consistent trajectories.
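A sketch of the loss given an already-warped depth map; the warping itself (projecting D_t into frame t+1 with the intrinsics and the 4×4 transform) is assumed to be computed elsewhere:

```python
def geometric_consistency_loss(d_warp, d_next, valid):
    # d_warp: D_t warped into frame t+1 by the predicted pose; d_next: the
    # depth predicted directly at t+1; valid: boolean mask of pixels kept
    # after cropping. The per-pixel term is symmetric and lies in [0, 1).
    d_diff = (d_warp - d_next).abs() / (d_warp + d_next)
    return d_diff[valid].mean()
```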
Advantageous effects
The invention discloses a visual odometry method based on image depth estimation, addressing the scale-ambiguity problem common to monocular visual odometry with an algorithm that combines depth images and monocular images to enforce a scale-consistency constraint. In the network design, long short-term memory units are fused with convolutional units; a depth estimation network is trained on monocular images, with a photometric consistency loss introduced into the loss function and a smoothness loss added so that more image features are captured and more accurate depth images are produced. The estimated depth image is then combined with the original monocular image to realize the scale-consistency constraint, and the pose estimation network is trained. Experiments and result analysis on the depth estimation network and the pose estimation network show that visual odometry combined with depth image estimation can, to a certain extent, solve the scale-ambiguity problem of monocular visual odometry.
Drawings
FIG. 1 is a diagram of a visual odometry network architecture incorporating depth image estimation.
Fig. 2 is a structural design diagram of a depth estimation network.
Fig. 3 is a parameter setting diagram of the depth estimation network.
Fig. 4 is a pose estimation network architecture diagram.
Fig. 5 is a parameter setting diagram of the pose estimation network.
FIG. 6 is a graph of test results under the Eigen split data set.
Fig. 7 is a graph of test results under the KITTI Odometry data set.
Fig. 8 is the trajectory reconstruction result of the pose estimation network model on sequence 01.
Fig. 9 is the trajectory reconstruction result of the pose estimation network model on sequence 05.
Fig. 10 is the trajectory reconstruction result of the pose estimation network model on sequence 09.
Detailed Description
1. Introduction to data set
Experimental results are presented to analyze and evaluate the performance of the proposed framework and to compare it with prior work on depth estimation and visual odometry pose estimation. Training is performed mainly on the KITTI raw dataset, which was captured at 10 Hz and comprises raw (unsynchronized and unrectified) as well as synchronized and rectified binocular color and grayscale image sequences, 3D point cloud information (about 100,000 points per frame, stored as binary floating-point matrices), 3D GPS/IMU data (txt files containing position, velocity, acceleration, and meta information), the related camera calibration data, and 3D object annotations. The full dataset contains 61 video sequences. For the monocular depth image estimation experiments, both the Eigen split and the KITTI Odometry split are used. For evaluating the visual odometry part, the experiments are based on the KITTI Odometry dataset, and the network model is trained on the dataset's images combined with the depth images generated by the depth estimation network. Note that the Eigen and Odometry splits overlap; the two splits are described below.
2. Dataset split methods
In the Eigen split, a total of 697 images selected from 28 scene sequences form the test set for monocular depth estimation. The remaining 33 scene sequences provide 23,488 binocular image pairs as the training set, with the two stereo images treated as images acquired by two separate monocular cameras. Since image reprojection loss depends on the parallax induced by motion, all static frames whose motion is less than 0.3 meters are discarded during the data preparation stage.
The Odometry dataset contains 11 image sequences with ground-truth camera poses. When evaluating the pose estimation network, sequences 00-08 (excluding 03) are used as the training set and sequences 09-10 as the test set.
3. Results and analysis of the experiments
Fig. 6 shows the depth estimation results output for different input images; the model produces accurate depth estimates across different scenes. To demonstrate the robustness of the model, Fig. 7 shows two images of the same object in the same scene under different illumination conditions (the circled vehicle is under direct sunlight in one image and in tree shadow in the other); the corresponding outputs show that the model still detects the object in the image accurately under different illumination conditions.
When evaluating the performance of the depth estimation neural network against existing methods, both test-set splits, the Eigen split and the KITTI split, are used. The comparison metrics fall into an error part and an accuracy part: the error metrics comprise the absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE), and root mean square logarithmic error (RMSE log), where smaller values indicate better performance; the accuracy metrics comprise the threshold accuracies δ < 1.25, δ < 1.25², and δ < 1.25³, where larger values indicate better performance. In the KITTI split, the test set contains a total of 200 images acquired from 28 different scenes, each with corresponding ground-truth data. Figs. 6 and 7 show the test results of the depth estimation network under the Eigen split and the KITTI Odometry split respectively, compared with existing methods.
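For reference, the standard metrics above can be computed as follows (a NumPy sketch over matched predicted/ground-truth arrays of valid pixels):

```python
import numpy as np

def depth_metrics(pred, gt):
    # Error metrics (lower is better) and threshold accuracies (higher is
    # better) over positive-valued, matched depth arrays.
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc
```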
After the depth estimation network has been trained, the pose estimation network, i.e., the visual odometry part, is trained by combining the output depth images with the original RGB images. Figs. 8-10 show the final trajectory reconstruction results, which indicate that visual odometry combined with depth image estimation can, to a certain extent, solve the scale-ambiguity problem of monocular visual odometry.
It is to be noted that, in the present invention, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Likewise, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (7)
1. A visual odometry method based on image depth estimation, characterized in that: the network architecture comprises neural networks operating on the monocular depth image and the monocular RGB image; the depth image estimation network and the ego-motion estimation network are trained on a monocular image frame sequence to realize the scale-consistency constraint; and the whole process comprises the following steps:
S1. for two given consecutive image frames I_t and I_{t+1}, estimating depth separately with the depth estimation network to obtain the corresponding depth images D_t and D_{t+1};
S2. feeding the original image I_t together with its estimated depth image D_t jointly as input to the ego-motion estimation network, which outputs the camera pose prediction at time t;
S3. converting the predicted pose into a 4×4 pose transformation matrix T, warping the depth image D_t according to T into the next frame, and computing the inconsistency between the warped depth and D_{t+1} as a consistency loss for model training, thereby improving the scale consistency of pose prediction.
2. The visual odometry method based on image depth estimation according to claim 1, characterized in that: the depth estimation network uses a self-encoding/decoding U-shaped architecture; a recurrent neural unit is fused with the encoder units to form a convolutional long short-term memory unit serving as the self-encoding part of the network, so that spatial and temporal information are utilized simultaneously; the spatio-temporal features computed by the encoder are input into the decoder network for accurate depth image estimation and reconstruction, and the decoder fuses low-level feature representations from different levels of the encoder through skip connections.
3. The visual odometry method based on image depth estimation according to claim 1, characterized in that: the neural network for pose estimation uses a VGG16 convolutional neural network architecture fused with a recurrent neural unit, and the visual odometry network has the following characteristics: 1) the input of the visual odometer includes the depth image information of the current frame, which guarantees the scale consistency of the scene between depth and pose; 2) the input of the visual odometer is the joint representation of the image frame and the corresponding depth image at a single time point, while information from previous frames is stored in the hidden state; 3) the visual odometry network maintains the same scene scale when run over the entire image sequence.
4. The visual odometry method based on image depth estimation according to claim 1, characterized in that: a photometric consistency loss between the predicted depth image and the known depth image data is computed to train the depth estimation neural network in a supervised manner; since the photometric term provides little information in low-texture environments, a smoothness loss is also added to the depth estimation; in the visual odometry part, the pose estimation loss is computed from the pose estimated by the network and the ground-truth values provided in the dataset, realizing supervised training of the pose estimation network; a geometric consistency loss is introduced, in which the depth image estimated for the previous frame is warped according to the pose transformation matrix and its difference from the depth image estimated for the next frame is computed; the overall objective loss function is calculated as follows:

  L = α·L_pc + β·L_s + γ·L_pose + μ·L_gc

where L_pc and L_s denote the photometric consistency loss and the smoothness loss respectively, L_pose denotes the pose estimation loss, and L_gc denotes the geometric consistency loss; a corresponding weight parameter is added for each loss category to balance the scale of each loss calculation result, and a parameter is also added to control the degree of smoothing of the depth image.
5. The visual odometry method based on image depth estimation according to claim 4, characterized in that: for the photometric consistency loss and the smoothness loss, the brightness-constancy and spatial-smoothness priors used in dense correspondence algorithms are adopted; the photometric difference between the estimated depth image and the actually acquired depth image information is computed and used as the loss function for network training, and the photometric consistency loss function is expressed as:

  L_pc = (1/|V|) Σ_{p∈V} d( D̂(p), D(p) )

where |V| denotes the number of pixels in the image and V denotes the set of all pixels in the image; the L1 norm is selected for the distance d in this loss function; the L1 norm loss, also called least absolute deviation or least absolute error, is computed as the sum of the absolute values of the differences between the estimated values and the target values, which is then minimized; compared with the L2 norm loss, which sums the squared differences, the L1 loss is more robust to outliers, and the L1 term in the photometric consistency difference is computed as:

  d( D̂(p), D(p) ) = | D̂(p) − D(p) |

the photometric loss provides little information when the scene is evenly distributed and weakly textured; to obtain more information from multiple differences, a smoothness loss is introduced, which makes the network more sensitive to edge information in the image and ensures accurate output in low-texture environments; the smoothness loss is calculated as:

  L_s = Σ_{p∈V} ( e^{−∇I(p)} · ∇D(p) )²
6. The visual odometry method based on image depth estimation according to claim 4, characterized in that: the pose estimation loss represents the estimated absolute pose as a six-dimensional vector composed of a three-dimensional vector for position and a three-dimensional vector for attitude; the provided ground-truth pose vector (x, φ) is fitted against the estimated pose vector (x̂, φ̂), and the error between the two is used as the loss function of pose estimation:

  L_pose = ‖ x̂ − x ‖₂ + β · ‖ φ̂ − φ ‖₂
7. The visual odometry method based on image depth estimation according to claim 4, characterized in that: the geometric consistency loss enhances the geometric consistency of the predicted results, requiring the depth images D_t and D_{t+1} of two adjacent frames to conform to the same scene structure and minimizing the difference between the two; geometric consistency between sample images of the same training batch is thereby improved, and through transitivity geometric consistency of the whole image sequence is realized; for example, if the depth images of I_t and I_{t+1} are kept consistent in one training batch and the depth images of I_{t+1} and I_{t+2} are consistent in another batch, then the consistency of the depth images of I_t and I_{t+2} can be ensured even though they need not belong to the same batch, realizing the consistency of the depth images of the whole image sequence; in the training process, the pose estimation network and the depth estimation network are naturally coupled and can generate scale-consistent predictions over the whole image sequence; following this constraint, the inconsistency between depth images of adjacent frames is computed: for any pixel p in the depth image, the adjacent-frame depth difference D_diff(p) is defined as:

  D_diff(p) = | D^a_{t+1}(p) − D_{t+1}(p) | / ( D^a_{t+1}(p) + D_{t+1}(p) )

where D_{t+1} is the depth image computed by the depth estimation neural network for the image frame at time t+1, and D^a_{t+1} is the depth image obtained by warping the depth estimate D_t of the frame at time t according to the pose transformation matrix T output by the ego-motion estimation network for the motion from the current time to the next, i.e., D_t transformed into frame t+1;
because the camera is in continuous motion, the captured scene changes continuously; the validity of the pixels entering the inconsistency computation is ensured by cropping the depth image, and the per-pixel values D_diff(p) are summed to normalize the depth-image difference; during optimization, points at different absolute depths are treated equally, which is more intuitive than working with absolute distances; the function is symmetric and its value range lies between 0 and 1, ensuring stable training; from this inconsistency map, the proposed geometric consistency loss is defined as:

  L_gc = (1/|V|) Σ_{p∈V} D_diff(p)

where V denotes all pixels remaining after the matrix transformation and cropping of the depth image, and |V| denotes the number of pixels in V; this formulation guarantees scale consistency between adjacent image pairs by minimizing the geometric distance of the predicted depths, and propagates that consistency to the whole image sequence through training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010478460.0A CN111369608A (en) | 2020-05-29 | 2020-05-29 | Visual odometry method based on image depth estimation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010478460.0A CN111369608A (en) | 2020-05-29 | 2020-05-29 | Visual odometry method based on image depth estimation
Publications (1)
Publication Number | Publication Date |
---|---|
CN111369608A (en) | 2020-07-03
Family
ID=71211134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010478460.0A Pending CN111369608A (en) | 2020-05-29 | 2020-05-29 | Visual odometer method based on image depth estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111369608A (en) |
- 2020-05-29: Application CN202010478460.0A filed; published as CN111369608A (status: Pending)
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899280A (en) * | 2020-07-13 | 2020-11-06 | 哈尔滨工程大学 | Monocular vision odometer method adopting deep learning and mixed pose estimation |
CN111899280B (en) * | 2020-07-13 | 2023-07-25 | 哈尔滨工程大学 | Monocular vision odometer method adopting deep learning and mixed pose estimation |
CN112052626A (en) * | 2020-08-14 | 2020-12-08 | 杭州未名信科科技有限公司 | Automatic neural network design system and method |
CN112052626B (en) * | 2020-08-14 | 2024-01-19 | 杭州未名信科科技有限公司 | Automatic design system and method for neural network |
CN112102399A (en) * | 2020-09-11 | 2020-12-18 | 成都理工大学 | Visual mileage calculation method based on generative antagonistic network |
CN112102399B (en) * | 2020-09-11 | 2022-07-19 | 成都理工大学 | Visual mileage calculation method based on generative antagonistic network |
CN112150531A (en) * | 2020-09-29 | 2020-12-29 | 西北工业大学 | Robust self-supervised learning single-frame image depth estimation method |
CN112308918B (en) * | 2020-10-26 | 2024-03-29 | 杭州电子科技大学 | Non-supervision monocular vision odometer method based on pose decoupling estimation |
CN112308918A (en) * | 2020-10-26 | 2021-02-02 | 杭州电子科技大学 | Unsupervised monocular vision odometer method based on pose decoupling estimation |
CN112348843A (en) * | 2020-10-29 | 2021-02-09 | 北京嘀嘀无限科技发展有限公司 | Method and device for adjusting depth image prediction model and electronic equipment |
CN112184611A (en) * | 2020-11-03 | 2021-01-05 | 支付宝(杭州)信息技术有限公司 | Image generation model training method and device |
CN112561978A (en) * | 2020-12-18 | 2021-03-26 | 北京百度网讯科技有限公司 | Training method of depth estimation network, depth estimation method of image and equipment |
CN112561978B (en) * | 2020-12-18 | 2023-11-17 | 北京百度网讯科技有限公司 | Training method of depth estimation network, depth estimation method of image and equipment |
CN112819853A (en) * | 2021-02-01 | 2021-05-18 | 太原理工大学 | Semantic prior-based visual odometer method |
CN112819853B (en) * | 2021-02-01 | 2023-07-25 | 太原理工大学 | Visual odometer method based on semantic priori |
CN113012191B (en) * | 2021-03-11 | 2022-09-02 | 中国科学技术大学 | Laser mileage calculation method based on point cloud multi-view projection graph |
CN113012191A (en) * | 2021-03-11 | 2021-06-22 | 中国科学技术大学 | Laser mileage calculation method based on point cloud multi-view projection graph |
CN113160294A (en) * | 2021-03-31 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
CN113538335A (en) * | 2021-06-09 | 2021-10-22 | 香港中文大学深圳研究院 | In-vivo relative positioning method and device of wireless capsule endoscope |
CN113570658A (en) * | 2021-06-10 | 2021-10-29 | 西安电子科技大学 | Monocular video depth estimation method based on depth convolutional network |
WO2023109221A1 (en) * | 2021-12-14 | 2023-06-22 | 北京地平线信息技术有限公司 | Method and apparatus for determining homography matrix, medium, device, and program product |
CN114526728A (en) * | 2022-01-14 | 2022-05-24 | 浙江大学 | Monocular vision inertial navigation positioning method based on self-supervision deep learning |
CN114526728B (en) * | 2022-01-14 | 2023-12-05 | 浙江大学 | Monocular vision inertial navigation positioning method based on self-supervision deep learning |
CN114463420A (en) * | 2022-01-29 | 2022-05-10 | 北京工业大学 | Visual mileage calculation method based on attention convolution neural network |
WO2023165093A1 (en) * | 2022-03-01 | 2023-09-07 | 上海商汤智能科技有限公司 | Training method for visual inertial odometer model, posture estimation method and apparatuses, electronic device, computer-readable storage medium, and program product |
CN114663509A (en) * | 2022-03-23 | 2022-06-24 | 北京科技大学 | Self-supervision monocular vision odometer method guided by key point thermodynamic diagram |
CN114998411A (en) * | 2022-04-29 | 2022-09-02 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss |
CN114998411B (en) * | 2022-04-29 | 2024-01-09 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss |
WO2024012405A1 (en) * | 2022-07-11 | 2024-01-18 | 华为技术有限公司 | Calibration method and apparatus |
CN117197229A (en) * | 2023-09-22 | 2023-12-08 | 北京科技大学顺德创新学院 | Multi-stage estimation monocular vision odometer method based on brightness alignment |
CN117197229B (en) * | 2023-09-22 | 2024-04-19 | 北京科技大学顺德创新学院 | Multi-stage estimation monocular vision odometer method based on brightness alignment |
CN117456531A (en) * | 2023-12-25 | 2024-01-26 | 乐山职业技术学院 | Multi-view pure rotation anomaly identification and automatic mark training method, equipment and medium |
CN117456531B (en) * | 2023-12-25 | 2024-03-19 | 乐山职业技术学院 | Multi-view pure rotation anomaly identification and automatic mark training method, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111369608A (en) | Visual odometry method based on image depth estimation | |
Shamwell et al. | Unsupervised deep visual-inertial odometry with online error correction for RGB-D imagery | |
CN111311666B (en) | Monocular vision odometer method integrating edge features and deep learning | |
CN108961327B (en) | Monocular depth estimation method and device, equipment and storage medium thereof | |
CN107945204B (en) | Pixel-level image matting method based on generation countermeasure network | |
US20210142095A1 (en) | Image disparity estimation | |
US10636151B2 (en) | Method for estimating the speed of movement of a camera | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
Shamwell et al. | Vision-aided absolute trajectory estimation using an unsupervised deep network with online error correction | |
CN113272713B (en) | System and method for performing self-improved visual odometry | |
US11082633B2 (en) | Method of estimating the speed of displacement of a camera | |
CN110675418A (en) | Target track optimization method based on DS evidence theory | |
CN112233179B (en) | Visual odometer measuring method | |
CN110009674A (en) | Monocular image depth of field real-time computing technique based on unsupervised deep learning | |
CN113963240A (en) | Comprehensive detection method for multi-source remote sensing image fusion target | |
Chen et al. | A stereo visual-inertial SLAM approach for indoor mobile robots in unknown environments without occlusions | |
CN111998862A (en) | Dense binocular SLAM method based on BNN | |
CN108986150A (en) | A kind of image light stream estimation method and system based on non-rigid dense matching | |
CN117274515A (en) | Visual SLAM method and system based on ORB and NeRF mapping | |
CN112184767A (en) | Method, device, equipment and storage medium for tracking moving object track | |
Liu et al. | Real-time dense construction with deep multi-view stereo using camera and IMU sensors | |
CN117367427A (en) | Multi-mode slam method applicable to vision-assisted laser fusion IMU in indoor environment | |
Yuan et al. | RGB-D DSO: Direct sparse odometry with RGB-D cameras for indoor scenes | |
Pirvu et al. | Depth distillation: unsupervised metric depth estimation for UAVs by finding consensus between kinematics, optical flow and deep learning | |
CN113673313B (en) | Gesture recognition method based on hierarchical convolutional neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200703 |