CN115953460A - Visual odometer method based on self-supervision deep learning - Google Patents
Abstract
The invention discloses a visual odometry method based on self-supervised deep learning, comprising the following steps: first, calibrating the binocular camera before acquiring picture data with the binocular camera hardware; second, acquiring video image data through the binocular camera; third, preprocessing the acquired video image data; fourth, building a depth estimation model; fifth, building a pose estimation model; and sixth, building a binocular vision SLAM system framework. The method is robust to illumination changes, image noise, and image motion blur, and is suitable for multiple scenes.
Description
Technical Field
The invention relates to a visual odometry method based on self-supervised deep learning.
Background
Since the beginning of the 21st century, artificial intelligence technology has been widely applied in daily life, for example in advanced driver assistance systems, autonomous driving, intelligent vehicles, and robots. Perceiving the 3D structure of a scene and analyzing scene geometry helps a robot understand the real-world environment, which is crucial for a wide range of artificial intelligence applications. In many AI engineering applications, perceiving and analyzing the 3D scene structure requires computer vision techniques for tasks such as detection, recognition, path planning, and target localization. These computer vision tasks are greatly simplified by 3D scene information: once the structure of a scene is known, object boundaries are easier to distinguish, objects with distinguishable boundaries are easier to detect and recognize, and detection and recognition in turn form the basis of other computer vision tasks.
The front-end visual odometer is the most fundamental problem in visual SLAM: only with a good front-end initial value can back-end optimization, loop closure detection, and final map building proceed smoothly. In solving the visual odometry problem, an image carries only two-dimensional information and the scene depth is lost, so recovering depth is essential. How can scene depth be obtained in visual SLAM? With a monocular camera, depth estimation suffers from scale ambiguity, which hinders practical application. With a monocular camera combined with a laser, the actual scene depth can be measured and the scale problem solved, but only sparse depth can be collected, while dense depth maps are often required. An RGB-D camera can directly measure a dense, metrically scaled depth map, but reliable depth annotation is only possible indoors, high quality is hard to achieve outdoors, and the camera is expensive compared with an ordinary one. Stereo matching, i.e. binocular depth estimation, yields dense depth maps, solves the scale problem, works both indoors and outdoors, and is inexpensive.
Estimating scene depth from camera images (possibly combined with low-cost depth sensors) using computer vision methods is of great academic interest. The present method realizes scene depth estimation and camera pose estimation with a binocular camera combined with deep learning. It is robust to illumination changes, image noise, and image motion blur, and is suitable for multiple scenes. The study of the SLAM front-end visual odometry part therefore has important theoretical and research significance for advanced driver assistance systems, autonomous driving, intelligent vehicles, robots, and the like.
Disclosure of Invention
To solve these problems, the invention provides a visual odometry method based on self-supervised deep learning that is robust to illumination changes, image noise, and image motion blur, and is suitable for multiple scenes.
The visual odometry method based on self-supervised deep learning comprises the following steps:
firstly, calibrating a binocular camera before acquiring picture data by using a binocular camera hardware device;
secondly, acquiring video image data through a binocular camera;
thirdly, preprocessing the acquired video image data;
fourthly, building a depth estimation model;
fifthly, building a pose estimation model;
and sixthly, building a binocular vision SLAM system framework.
Further, in the third step, the data is preprocessed as follows: image noise reduction uses Gaussian smoothing filtering; image enhancement uses scale transformation, random cropping, and color adjustment; the RGB image data is normalized to [0, 1].
Further, in the fourth step, a function g is defined such that (D_l, D_r) = g(I_l, I_r), where I_l and I_r are the left and right images and D_l and D_r are the left and right disparity maps pixel-aligned with I_l and I_r.
Further, in the fifth step, with N the number of pixel points, the image reconstruction loss function is defined as:

L_rec = (1/N) Σ_{i,j} |I^l(i,j) − Î^l(i,j)|

The photometric error of the reconstructed image and the original image is computed jointly with the image similarity index SSIM:

L_p = (1/N) Σ_{i,j} [ α(1 − SSIM(I^l(i,j), Î^l(i,j)))/2 + (1 − α)|I^l(i,j) − Î^l(i,j)| ]

where α is the weight between the basic reconstruction error and the similarity error, and α is 0.85.
The invention has the beneficial effects that:
the invention realizes the depth estimation of the scene and the pose estimation of the camera by using a binocular camera mode and combining a deep learning technology. The method has strong robustness to solve the problems caused by illumination change, image noise and image motion blur, and is suitable for multiple scenes. The SLAM front-end visual odometer part of the invention has important theoretical and research significance for advanced driver assistance systems, automatic driving, intelligent vehicles, robots and the like.
Drawings
FIG. 1 is a flow chart of the system of the present invention;
Detailed Description
The invention will be described in detail below with reference to FIG. 1.
The visual odometry method based on self-supervised deep learning comprises the following specific steps:
1) Camera calibration: the lens in front of the camera distorts the projection of light onto the imaging plane. Distortion is classified as radial or tangential. To eliminate the influence of distortion on the captured image and to determine the transformations between the image coordinate system, the camera coordinate system, and the world coordinate system, the binocular camera must be calibrated before the binocular camera hardware is used to collect image data.
Barrel distortion arises because the image magnification decreases with distance from the optical axis, and pincushion distortion is the opposite. In both, a straight line that passes through the center of the image and intersects the optical axis keeps its shape. Besides the radial distortion introduced by the shape of the lens, tangential distortion is introduced during camera assembly when the lens and the imaging plane are not strictly parallel.
Radial distortion, whether barrel or pincushion, increases with distance from the center. The coordinate change before and after distortion can be described by a polynomial function of the distance r from the center; such distortion is corrected using the quadratic and higher-order polynomials of equations 3.1 and 3.2:
x_corrected = x(1 + k1·r² + k2·r⁴ + k3·r⁶)   (3.1)
y_corrected = y(1 + k1·r² + k2·r⁴ + k3·r⁶)   (3.2)
where [x, y]^T are the coordinates of the uncorrected point and [x_corrected, y_corrected]^T those of the corrected point; note that both are points on the normalized plane, not on the pixel plane.
For tangential distortion, on the other hand, two further parameters p1 and p2 are used for correction, equations 3.3 and 3.4:
x_corrected = x + 2p1·xy + p2(r² + 2x²)   (3.3)
y_corrected = y + p1(r² + 2y²) + 2p2·xy   (3.4)
Combining equations 3.1-3.2 and 3.3-3.4, for a point P = [X, Y, Z]^T in the camera coordinate system, its correct position on the pixel plane can be found through the five distortion coefficients, equations 3.5 and 3.6:
1. Project the three-dimensional spatial point onto the normalized image plane; let its normalized coordinates be [x, y]^T.
2. Apply radial and tangential distortion correction to the point on the normalized plane:
x_corrected = x(1 + k1·r² + k2·r⁴ + k3·r⁶) + 2p1·xy + p2(r² + 2x²)   (3.5)
y_corrected = y(1 + k1·r² + k2·r⁴ + k3·r⁶) + p1(r² + 2y²) + 2p2·xy   (3.6)
3. Project the corrected point through the intrinsic matrix onto the pixel plane to obtain its correct position on the image, equations 3.7 and 3.8:
u = f_x·x_corrected + c_x   (3.7)
v = f_y·y_corrected + c_y   (3.8)
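The three steps above (normalize, distort-correct, project through the intrinsics) can be sketched as follows; the function name and parameter values are illustrative, not taken from the patent:

```python
import numpy as np

def project_with_distortion(P, fx, fy, cx, cy,
                            k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Project a 3D point P = [X, Y, Z] in camera coordinates to pixel
    coordinates using the five-coefficient distortion model of
    equations 3.5-3.8: radial (k1, k2, k3) and tangential (p1, p2)."""
    X, Y, Z = P
    # 1. project onto the normalized image plane
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    # 2. radial + tangential correction (eqs 3.5, 3.6)
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_c = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_c = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    # 3. map through the intrinsics to the pixel plane (eqs 3.7, 3.8)
    u = fx * x_c + cx
    v = fy * y_c + cy
    return u, v
```

With all distortion coefficients zero this reduces to the plain pinhole projection, which gives a quick sanity check of a calibration result.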
Coordinate transformation
The process by which a camera maps coordinate points (in meters) in the three-dimensional world onto the two-dimensional image plane (in pixels) can be described by a geometric model. Among the many such models, the simplest is the pinhole model, a very common and effective model that describes how a bundle of rays passing through a pinhole projects an image behind it.
Consider a simple geometric model of the pinhole camera. Let O-x-y-z be the camera coordinate system, with the z-axis conventionally pointing to the front of the camera, x to the right, and y downward, and let O, the optical center, be the pinhole of the pinhole model. A real-world spatial point P, projected through the pinhole O, falls on the physical imaging plane at the imaging point P'. Let P = [X, Y, Z]^T and P' = [X', Y', Z']^T, and let f (the focal length) be the distance from the physical imaging plane to the pinhole. By similar triangles, equation 3.9:

Z/f = −X/X' = −Y/Y'   (3.9)
The negative sign indicates that the image is inverted. To simplify the model, the imaging plane can be placed symmetrically in front of the camera, on the same side of the camera coordinate system as the three-dimensional point; this removes the minus sign in equation 3.9 and makes the formula more compact. Rearranging then gives equations 3.10 and 3.11:

X' = f·X/Z   (3.10)
Y' = f·Y/Z   (3.11)
equations 3.10,3.11 describe the spatial relationship between point P and its image. A pixel plane o-u-v is fixed in the physical imaging plane. We get the pixel coordinates of P' in the pixel plane: [ u, v ]] T . The pixel coordinate system is usually defined as follows: the origin o is located at the upper left corner of the image, and the axial right is parallel to the x-axisAnd the v-axis is downward and parallel to the y-axis. The difference between the pixel coordinate system and the imaging plane is a zoom and a translation of the origin. Let us assume that the pixel coordinates are scaled by a times on the u-axis and by β times on v. At the same time, the origin is shifted by [ c ] x ,c y ] T . Then, the coordinates of P' are associated with the pixel coordinates [ u, v ]] T The relationship of (d) is as follows, equation 3.12:
Substituting equations 3.10 and 3.11 into 3.12 and merging α·f into f_x and β·f into f_y gives equation 3.13:

u = f_x·X/Z + c_x,  v = f_y·Y/Z + c_y   (3.13)

where f is in meters, α and β are in pixels per meter, and f_x, f_y are in pixels. Writing this in matrix form is more compact, with homogeneous coordinates on the left side, equation 3.14:

[u, v, 1]^T = (1/Z) · [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]] · [X, Y, Z]^T   (3.14)

Moving Z to the left side gives:

Z · [u, v, 1]^T = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]] · [X, Y, Z]^T = K·P   (3.15)
In equation 3.15, the matrix of intermediate quantities is called the camera intrinsic matrix (camera intrinsics) K. The intrinsic matrix and the image distortion correction can be determined through camera calibration, in preparation for the next steps of image acquisition and dataset production.
In the actual calibration process, since the camera parameters change when the binocular baseline changes, baselines of different lengths are preset and calibration is carried out separately for each baseline length.
2) Data acquisition
After binocular calibration is completed, video image data is acquired through the binocular camera.
3) Data pre-processing
The acquired video image data cannot be used directly to train the network model; it must first be preprocessed, and different preprocessing methods target different problems. Here, only image denoising, image enhancement, and normalization are required to meet the data requirements. Image denoising uses Gaussian smoothing filtering, which effectively reduces salt-and-pepper noise in the image. For image enhancement, considering the influence of rigid transformations of objects on pose estimation, image translation and random rotation cannot be used; only scale transformation, random cropping, and color adjustment are applied. Finally, the RGB image data is normalized to [0, 1], which effectively prevents gradient explosion or gradient vanishing in the subsequent optimization and speeds up convergence of the algorithm.
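The preprocessing chain can be sketched in plain numpy as below; this is a minimal illustration assuming grayscale input, with the Gaussian filter written as a separable convolution, and scale transformation and color adjustment omitted for brevity. All function names are illustrative:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x**2) / (2 * sigma**2))
    return k / k.sum()

def preprocess(img, crop_hw, rng):
    """Denoise with a separable Gaussian filter, take a random crop,
    and normalize intensities from [0, 255] to [0, 1]."""
    k = gaussian_kernel1d(1.0, 2)
    f = img.astype(np.float64)
    # separable Gaussian smoothing: filter columns, then rows
    f = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, f)
    f = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, f)
    # random crop (the augmentation allowed by the method, unlike rotation)
    h, w = crop_hw
    top = rng.integers(0, f.shape[0] - h + 1)
    left = rng.integers(0, f.shape[1] - w + 1)
    patch = f[top:top + h, left:left + w]
    # normalize to [0, 1]
    return patch / 255.0
```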
(2) Building a depth estimation model
The structure of the disparity estimation network model is as follows. Let g be a function such that (D_l, D_r) = g(I_l, I_r), where I_l and I_r are the left and right images and D_l and D_r are the pixel-aligned left and right disparity maps (the disparity of each pixel in the image). Constructing an accurate analytic expression for g by hand is very difficult. A deep neural network, however, has a very strong learning capacity and can approximate any high-order, nonlinear function given sufficient training samples, so a DNN is used here as an approximation of g. If the DNN can predict D_l and D_r from I_l and I_r, then a new left image Î_l can be reconstructed according to D_l by sampling from I_r (performed by the image sampler S), and correspondingly the right image Î_r can be reconstructed by sampling. The more accurate the DNN's predicted disparity maps, the closer the reconstructed images are to the originals; therefore, training the DNN to gradually reduce the error between I_l, I_r and Î_l, Î_r drives the predicted disparity maps toward the true values. The whole training process needs only binocular image pairs as samples, with no depth data as labels, so it is self-supervised learning; this makes online learning and lifelong learning possible and lets the model adapt to complex, changing working scenes.
Note: i is l 、I r Respectively representing left and right eye images; d l ,D r Respectively showing a left visual difference chart and a right visual difference chart;respectively representing a left eye reconstructed image and a right eye reconstructed image; s denotes an image sampler.
(3) Pose estimation model
The visual odometer is concerned with the relative motion of the camera between adjacent images; the simplest case considers the relative pose change of the camera between two adjacent frames. As usual, a rotation matrix R and a translation vector t describe the relative pose transformation. By implementation method, visual odometry divides into the feature-point method, which requires feature extraction, and the direct method. The self-supervised pose model here is built with the idea of the direct method.
By definition, what must be solved is the relative camera pose of the second frame with respect to the first, i.e. the rotation R and translation t. Taking the first frame as the reference frame and K as the camera intrinsic matrix, the camera model gives equations 3.16 and 3.17:

p1 = (1/Z1) · K · P   (3.16)
p2 = (1/Z2) · K · (R·P + t)   (3.17)

where Z1 is the depth of the spatial point P, and Z2 is the third coordinate of R·P + t, i.e. the depth of P in the second camera's coordinate system. The basic assumption of the direct method is that the pixel gray value of the same spatial point is fixed and invariant across images. By equation 3.17, given the current pose estimate, the pixel position p2 corresponding to p1 can be found; under this assumption the gray values at p1 and p2 are equal, so the pose is found by minimizing the photometric error, i.e. the brightness difference of the two pixel positions, equation 3.18:

e = I1(p1) − I2(p2)   (3.18)
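The warp behind equations 3.16 and 3.17 (back-project a pixel with its depth, transform by (R, t), project again) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def reproject(p1, Z1, K, R, t):
    """Warp pixel p1 = (u, v) with depth Z1 from frame 1 into frame 2:
    P = Z1 * K^-1 * p1_h  (eq 3.16 inverted), then p2_h = K (R P + t)
    (eq 3.17), dehomogenized by the new depth Z2."""
    p1_h = np.array([p1[0], p1[1], 1.0])
    P = Z1 * np.linalg.inv(K) @ p1_h      # back-project to 3D in frame 1
    P2 = R @ P + t                        # transform into frame 2
    p2_h = K @ P2
    return p2_h[:2] / p2_h[2]             # dehomogenize -> pixel in frame 2
```

The photometric error of equation 3.18 is then simply the image difference sampled at p1 in frame 1 and at the reprojected position in frame 2.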
The direct method takes the minimized photometric error, built from geometric constraints, as its objective function. The present method uses a convolutional network to extract high-level image features and combines it with the direct-method idea to establish an end-to-end pose estimation network structure; depth estimation information is passed to the pose estimation network, which resolves the scale ambiguity. Only geometric features are considered, and a smoothness loss is introduced into the error function.
In the training phase, the depth estimation network and the pose estimation network are coupled and jointly trained using the geometric constraints between consecutive binocular images. Both left and right images are used during training, while only the monocular image is used at test time. When training the depth estimation network, the right image serves as the supervision signal, so the absolute scale can be recovered after training. Because the depth and pose networks are coupled, the absolute scale information is shared with the pose estimation network. At test time, the system performs dense depth reconstruction and camera pose estimation from monocular images.
The loss is similar to that of an auto-encoder, and the most natural choice is to build the loss function from image reconstruction. Let the original left image (the reference image) be I^l(i, j), with (i, j) the pixel coordinates. From the predicted disparity d and the original right image I^r, a reconstructed left image Î^l is obtained by a remapping operation: for each pixel of the left image, the corresponding pixel of the right image is looked up according to its disparity value and interpolated. With N the number of pixel points, the simplest image reconstruction loss is equation 3.19:

L_rec = (1/N) Σ_{i,j} |I^l(i,j) − Î^l(i,j)|   (3.19)
the reconstructed image has great distortion, and only by adopting the comparison between the reconstructed image and the original image is insufficient, the image similarity index SSIM is introduced to comprehensively calculate the photometric errors of the reconstructed image and the original image, as shown in the following formula 3.20.
where α is the weight between the basic reconstruction error and the similarity error. α is generally taken as 0.85, so the similarity error carries the larger share; the value can be adjusted according to experimental results. Since depth discontinuities usually coincide with image gradients, an edge-aware depth smoothness loss L_smooth weighted by the image gradients is introduced, equation 3.21:

L_smooth = (1/N) Σ_{i,j} ( |∂_x d(i,j)| e^{−|∂_x I(i,j)|} + |∂_y d(i,j)| e^{−|∂_y I(i,j)|} )   (3.21)
In summary, the final loss function of the entire network is:

L_final = L_p + λ·L_smooth   (3.22)

where λ is the weight of the depth smoothness loss.
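Equations 3.19-3.22 can be sketched numerically as below. Note this is a simplified sketch: the SSIM here is a global variant computed over the whole image (real implementations use a local windowed SSIM), and the function names and the λ default are illustrative:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM over the whole image (standard implementations
    use a local sliding window)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def photometric_loss(I, I_hat, alpha=0.85):
    """Eq 3.20: alpha * (1 - SSIM)/2 + (1 - alpha) * mean |I - I_hat|."""
    l1 = np.abs(I - I_hat).mean()
    return alpha * (1 - ssim_global(I, I_hat)) / 2 + (1 - alpha) * l1

def smoothness_loss(d, I):
    """Eq 3.21: edge-aware disparity smoothness weighted by image gradients."""
    dx_d = np.abs(np.diff(d, axis=1)); dy_d = np.abs(np.diff(d, axis=0))
    dx_I = np.abs(np.diff(I, axis=1)); dy_I = np.abs(np.diff(I, axis=0))
    return (dx_d * np.exp(-dx_I)).mean() + (dy_d * np.exp(-dy_I)).mean()

def total_loss(I, I_hat, d, lam=0.1):
    """Eq 3.22: L_final = L_p + lambda * L_smooth."""
    return photometric_loss(I, I_hat) + lam * smoothness_loss(d, I)
```

A perfect reconstruction with a constant disparity map yields zero loss, which is a useful unit test when wiring the losses into training.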
(4) Building binocular vision SLAM system framework
The system runs on a laptop with Ubuntu 18.04. Third-party open-source libraries such as OpenCV 3.4.1, PCL, and g2o are set up in the Linux environment; the algorithms of all modules are integrated in C++ according to the SLAM system framework, and the result is then transplanted to a development board to form the complete visual SLAM system project. To evaluate accuracy and real-time performance, the system is assessed both on the KITTI benchmark dataset and on data collected in real time with the binocular camera to simulate the motion of a mobile robot. The binocular vision SLAM framework is then applied to an intelligent mobile robot in the ROS environment and field-tested in a real scene. After the experiment, since pure vision may accumulate relatively large errors, the system can be tested with the camera combined with other sensors (such as an IMU or laser) to improve accuracy.
Claims (4)
1. A visual odometry method based on self-supervised deep learning, characterized by comprising the following steps:
firstly, calibrating a binocular camera before acquiring picture data by using a binocular camera hardware device;
secondly, acquiring video image data through the binocular camera;
thirdly, preprocessing the acquired video image data;
fourthly, building a depth estimation model;
fifthly, building a pose estimation model;
and sixthly, building a binocular vision SLAM system framework.
2. The visual odometry method based on self-supervised deep learning of claim 1, wherein in the third step the data is preprocessed as follows: image noise reduction uses Gaussian smoothing filtering; image enhancement uses scale transformation, random cropping, and color adjustment; the RGB image data is normalized to [0, 1].
4. The visual odometry method based on self-supervised deep learning of claim 1, wherein in the fifth step, with N the number of pixel points, the image reconstruction loss function is defined as:

L_rec = (1/N) Σ_{i,j} |I^l(i,j) − Î^l(i,j)|

The photometric error of the reconstructed image and the original image is computed jointly with the image similarity index SSIM:

L_p = (1/N) Σ_{i,j} [ α(1 − SSIM(I^l(i,j), Î^l(i,j)))/2 + (1 − α)|I^l(i,j) − Î^l(i,j)| ]

where α is the weight between the basic reconstruction error and the similarity error, and α is 0.85.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210949902.4A | 2022-08-09 | 2022-08-09 | Visual odometer method based on self-supervision deep learning |
Publications (1)

| Publication Number | Publication Date | Status |
|---|---|---|
| CN115953460A | 2023-04-11 | Pending |
Family ID: 87289774
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117291804A | 2023-09-28 | 2023-12-26 | 武汉星巡智能科技有限公司 | Binocular image real-time splicing method, device and equipment based on weighted fusion strategy |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |