CN115953460A - Visual odometer method based on self-supervision deep learning - Google Patents


Info

Publication number
CN115953460A
Authority
CN
China
Prior art keywords
image
self
deep learning
method based
building
Prior art date
Legal status
Pending
Application number
CN202210949902.4A
Other languages
Chinese (zh)
Inventor
吴锦洲
冯小渝
吕文琪
向毅
何龙
刘子樊
蒋鸿伟
傅普杰
简夜明
Current Assignee
Chongqing University of Science and Technology
Original Assignee
Chongqing University of Science and Technology
Application filed by Chongqing University of Science and Technology filed Critical Chongqing University of Science and Technology
Priority to CN202210949902.4A
Publication of CN115953460A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a visual odometry method based on self-supervised deep learning, which comprises the following steps: first, calibrating the binocular camera before acquiring image data with the binocular camera hardware; second, acquiring video image data through the binocular camera; third, preprocessing the acquired video image data; fourth, building a depth estimation model; fifth, building a pose estimation model; and sixth, building a binocular vision SLAM system framework. The method is robust to the problems caused by illumination change, image noise and image motion blur, and is suitable for multiple scenes.

Description

Visual odometer method based on self-supervision deep learning
Technical Field
The invention relates to a visual odometry method based on self-supervised deep learning.
Background
Since the beginning of the 21st century, artificial intelligence technology has been widely applied in many aspects of daily life, such as advanced driver assistance systems, automated driving, intelligent vehicles and robots. Perceiving the 3D structure of a scene and analyzing its geometry helps a robot understand the real-world environment, which is crucial for a wide range of artificial intelligence applications. In many artificial intelligence engineering applications, perceiving and analyzing the 3D structure of a scene requires computer vision techniques for tasks such as detection, recognition, path planning and target positioning. These computer vision tasks are greatly simplified with the help of 3D scene information: once the structure of the scene is known, object boundaries are easier to distinguish, objects become easier to detect and recognize, and object detection and recognition are in turn the basis of other computer vision tasks.
The front-end visual odometry is the most basic problem in the whole visual SLAM pipeline: only with a good initial value from the front end can the back-end optimization, loop closure detection and final map construction proceed smoothly. In solving the visual odometry problem, an image contains only two-dimensional information and the scene depth is lost, so recovering depth information is essential. How, then, can the scene depth be obtained in visual SLAM? With a monocular camera, depth estimation suffers from scale ambiguity, which is unfavorable for practical applications. With a monocular camera combined with a laser, the actual scene depth can be measured and the scale problem is solved, but only sparse depth information can be collected, while dense depth maps are sometimes needed. An RGB-D camera can directly measure a dense depth map of the scene and solves the scale problem, but depth labeling is only reliable in indoor scenes, high labeling quality is difficult to achieve outdoors, and such cameras are expensive compared with ordinary cameras. Stereo matching, i.e. binocular depth estimation, yields dense depth maps, solves the scale problem, works both indoors and outdoors, and is inexpensive.
Estimating scene depth from camera images (possibly combined with low-cost depth sensors) using computer vision methods is therefore of great academic interest. The present method realizes scene depth estimation and camera pose estimation with a binocular camera combined with deep learning. It is robust to the problems caused by illumination change, image noise and image motion blur, and is suitable for multiple scenes. The study of the SLAM front-end visual odometry therefore has important theoretical and research significance for advanced driver assistance systems, automated driving, intelligent vehicles, robots and the like.
Disclosure of Invention
In order to solve the above problems, the invention provides a visual odometry method based on self-supervised deep learning, which is robust to the problems caused by illumination change, image noise and image motion blur, and is suitable for multiple scenes.
The visual odometry method based on self-supervised deep learning comprises the following steps:
firstly, calibrating a binocular camera before acquiring picture data by using a binocular camera hardware device;
secondly, acquiring video image data through a binocular camera;
thirdly, preprocessing the acquired video image data;
fourthly, building a depth estimation model;
fifthly, building a pose estimation model;
and sixthly, building a binocular vision SLAM system framework.
Further, in the third step, the data are preprocessed as follows: Gaussian smoothing filtering is used for image noise reduction; scale transformation, random cropping and color adjustment are used for image enhancement; and the RGB image data are normalized to between 0 and 1.
Further, in the fourth step, a function g is set to implement the mapping

g: (I_l, I_r) → (D_l, D_r)

where I_l, I_r are the left and right images respectively, and D_l, D_r are the left and right disparity maps pixel-aligned with I_l, I_r.
Further, in the fifth step, with N as the number of pixel points, the image reconstruction loss function is defined as:

L_rec = (1/N) Σ_{i,j} |I_l^{ij} - Î_l^{ij}|

The photometric error between the reconstructed image and the original image is computed comprehensively through the image similarity index SSIM:

L_p = (1/N) Σ_{i,j} [ α (1 - SSIM(I_l^{ij}, Î_l^{ij})) / 2 + (1 - α) |I_l^{ij} - Î_l^{ij}| ]

where α is the weight between the basic reconstruction error and the similarity error, and α is 0.85.
The invention has the following beneficial effects:
The invention realizes scene depth estimation and camera pose estimation with a binocular camera combined with deep learning technology. The method is robust to the problems caused by illumination change, image noise and image motion blur, and is suitable for multiple scenes. The SLAM front-end visual odometry of the invention therefore has important theoretical and research significance for advanced driver assistance systems, automated driving, intelligent vehicles, robots and the like.
Drawings
FIG. 1 is a flow chart of the system of the present invention;
Detailed Description
The invention will be described in detail below with reference to FIG. 1.
The visual odometry method based on self-supervised deep learning comprises the following specific steps:
1) Camera calibration: the projection of light onto the imaging plane is distorted by the lens of the camera. Distortion is divided into radial distortion and tangential distortion. In order to eliminate the influence of distortion on the images captured by an ordinary camera and to determine the transformations between the image coordinate system, the camera coordinate system and the world coordinate system, the binocular camera must be calibrated before image data are collected with the binocular camera hardware.
Barrel distortion arises because the image magnification decreases with distance from the optical axis, whereas pincushion distortion is the opposite. In both types of distortion, a straight line passing through the image center and intersecting the optical axis keeps its shape. In addition to the radial distortion introduced by the shape of the lens, tangential distortion is introduced during camera assembly when the lens and the imaging plane are not strictly parallel.
Radial distortion, whether barrel or pincushion, increases with the distance from the image center. The coordinate change before and after distortion can be described by a polynomial function of the distance r from the center; such distortion is corrected with quadratic and higher-order polynomial terms, as in equations 3.1 and 3.2:

x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) (3.1)

y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) (3.2)
where [x, y]^T are the coordinates of the uncorrected point and [x_corrected, y_corrected]^T are the coordinates of the corrected point; note that both are points on the normalized plane, not on the pixel plane.
On the other hand, tangential distortion is corrected with two additional parameters p_1, p_2, as in equations 3.3 and 3.4:

x_corrected = x + 2p_1 xy + p_2(r^2 + 2x^2) (3.3)

y_corrected = y + p_1(r^2 + 2y^2) + 2p_2 xy (3.4)
Combining equations 3.1, 3.2 with 3.3, 3.4, for a point P = [X, Y, Z]^T in the camera coordinate system, its correct position on the pixel plane is found through the five distortion coefficients as follows:
1. Project the three-dimensional space point onto the normalized image plane; let its normalized coordinates be [x, y]^T.
2. Apply radial and tangential distortion correction to the point on the normalized plane, equations 3.5 and 3.6:

x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2p_1 xy + p_2(r^2 + 2x^2) (3.5)

y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2y^2) + 2p_2 xy (3.6)

3. Project the corrected point onto the pixel plane through the intrinsic parameter matrix to obtain its correct position on the image, equations 3.7 and 3.8:

u = f_x x_corrected + c_x (3.7)

v = f_y y_corrected + c_y (3.8)
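The three numbered steps above translate directly into code. The following Python sketch (NumPy only) applies equations 3.5-3.8 to normalized coordinates; the coefficient and intrinsic values in the example call are illustrative placeholders, not calibration results of any actual device.

```python
import numpy as np

def distort_and_project(xy_norm, k1, k2, k3, p1, p2, fx, fy, cx, cy):
    """Apply radial/tangential distortion (eqs. 3.5-3.6) to normalized
    coordinates and project them to pixel coordinates (eqs. 3.7-3.8)."""
    x, y = xy_norm[..., 0], xy_norm[..., 1]
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_c = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_c = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    u = fx * x_c + cx
    v = fy * y_c + cy
    return np.stack([u, v], axis=-1)

# Example with placeholder coefficients (replace with calibration output).
pts = np.array([[0.10, -0.05], [0.0, 0.0]])
print(distort_and_project(pts, k1=0.1, k2=-0.02, k3=0.0, p1=1e-3, p2=1e-3,
                          fx=720.0, fy=720.0, cx=640.0, cy=360.0))
```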
Coordinate transformation
The process by which a camera maps coordinate points (in meters) in the three-dimensional world onto the two-dimensional image plane (in pixels) can be described by a geometric model. There are many such models, the simplest of which is the pinhole model. The pinhole model is a very common and effective model; it describes how a beam of light passing through a pinhole projects an image behind the pinhole.
Consider a simple geometric model of the pinhole camera. Let O-x-y-z be the camera coordinate system; by convention, the z-axis points in front of the camera, x to the right and y downward. O is the optical center of the camera, i.e. the pinhole of the pinhole model. A space point P of the real world is projected through the pinhole O and falls on the physical imaging plane, where the imaging point is P'. Let the coordinates of P be [X, Y, Z]^T and those of P' be [X', Y', Z']^T, and let the distance from the physical imaging plane to the pinhole be f (the focal length). Then, by the similar-triangle relationship, equation 3.9 holds:

Z/f = -X/X' = -Y/Y' (3.9)
The negative sign indicates that the image is inverted. To simplify the model, the imaging plane is placed symmetrically in front of the camera, on the same side of the camera coordinate system as the three-dimensional space point. This removes the negative sign in equation 3.9 and makes the formula more compact. Rearranging equation 3.9 gives equations 3.10 and 3.11:

X' = f X / Z (3.10)

Y' = f Y / Z (3.11)
equations 3.10,3.11 describe the spatial relationship between point P and its image. A pixel plane o-u-v is fixed in the physical imaging plane. We get the pixel coordinates of P' in the pixel plane: [ u, v ]] T . The pixel coordinate system is usually defined as follows: the origin o is located at the upper left corner of the image, and the axial right is parallel to the x-axisAnd the v-axis is downward and parallel to the y-axis. The difference between the pixel coordinate system and the imaging plane is a zoom and a translation of the origin. Let us assume that the pixel coordinates are scaled by a times on the u-axis and by β times on v. At the same time, the origin is shifted by [ c ] x ,c y ] T . Then, the coordinates of P' are associated with the pixel coordinates [ u, v ]] T The relationship of (d) is as follows, equation 3.12:
Figure SMS_8
substituting into equations 3.10 and 3.11 and converting alpha f Are combined into f x Beta. A f Are combined into f y Obtaining:
Figure SMS_9
wherein, the unit of f is meter, the unit of alpha and beta is pixel per meter, f x ,f y The unit is a pixel. Writing this formula into a matrix form will be more compact, but the left side needs to use homogeneous coordinates:
Figure SMS_10
moving Z to the left side, and finishing to obtain:
Figure SMS_11
in equation 3.15, the matrix composed of the intermediate quantities is referred to as the Camera intrinsic parameter matrix (Camera intraprinsics) K. The parameter matrix in the camera and the image distortion correction can be determined through camera calibration, and the method is prepared for the estimation of a data set for the next step of image acquisition and image production.
In the actual calibration process, because the camera parameters change once the binocular baseline is changed, baselines of different lengths are preset and camera calibration is carried out separately for each baseline length.
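In practice, K, the distortion coefficients and the stereo extrinsics are usually estimated with a calibration toolbox rather than by hand. The sketch below uses OpenCV chessboard calibration as one possible implementation; the board geometry, file paths and flag choices are assumptions for illustration and would be repeated for each preset baseline.

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)      # inner chessboard corners (assumed board)
SQUARE = 0.025        # square size in meters (assumed)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("calib/left/*.png")),
                  sorted(glob.glob("calib/right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gl, PATTERN)
    ok_r, corners_r = cv2.findChessboardCorners(gr, PATTERN)
    if ok_l and ok_r:
        obj_pts.append(objp)
        left_pts.append(corners_l)
        right_pts.append(corners_r)

# Per-camera intrinsics K and distortion coefficients (k1, k2, p1, p2, k3).
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, gl.shape[::-1], None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, gr.shape[::-1], None, None)

# Stereo extrinsics: rotation R and translation T from left to right camera.
_, K1, D1, K2, D2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
print("K1 =\n", K1, "\nbaseline |T| =", float(np.linalg.norm(T)), "m")
```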
2) Data acquisition
After the binocular calibration is completed, video image data are acquired through the binocular camera.
3) Data pre-processing
The acquired video image data cannot be used directly to train the network model; they must first be preprocessed, and different preprocessing methods address different problems. For this work, only image denoising, image enhancement and normalization are required to meet the data requirements. Gaussian smoothing filtering is used for image denoising, which effectively reduces salt-and-pepper noise in the image. For image enhancement, considering the influence of rigid transformations of objects on pose estimation, image translation and random rotation cannot be used; only scale transformation, random cropping and color adjustment are applied. Finally, the RGB image data are normalized to between 0 and 1, which effectively prevents gradient explosion or gradient vanishing in the subsequent optimization and accelerates the convergence of the algorithm.
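A minimal preprocessing sketch along these lines is given below (OpenCV and NumPy). The output size, jitter ranges and blur kernel are illustrative assumptions; the same random scale, crop and color adjustment are applied to both views so that the stereo geometry of the pair stays consistent.

```python
import cv2
import numpy as np

def preprocess_pair(img_l, img_r, out_hw=(256, 512), rng=np.random):
    """Jointly preprocess a rectified stereo pair: Gaussian denoising, the
    same random scale and crop for both views, a shared brightness/contrast
    adjustment, and normalization of RGB values to [0, 1]."""
    # Gaussian smoothing to suppress image noise
    img_l = cv2.GaussianBlur(img_l, (3, 3), 0)
    img_r = cv2.GaussianBlur(img_r, (3, 3), 0)

    # Random scale followed by a random crop, identical for both views
    # (no translation or rotation, which would break the rigid geometry).
    scale = rng.uniform(1.0, 1.2)
    h, w = img_l.shape[:2]
    new_wh = (int(w * scale), int(h * scale))
    img_l = cv2.resize(img_l, new_wh)
    img_r = cv2.resize(img_r, new_wh)
    ch, cw = out_hw
    y0 = rng.randint(0, img_l.shape[0] - ch + 1)
    x0 = rng.randint(0, img_l.shape[1] - cw + 1)
    img_l = img_l[y0:y0 + ch, x0:x0 + cw]
    img_r = img_r[y0:y0 + ch, x0:x0 + cw]

    # Simple color adjustment (contrast alpha, brightness beta), shared.
    alpha = rng.uniform(0.9, 1.1)
    beta = rng.uniform(-10, 10)
    img_l = cv2.convertScaleAbs(img_l, alpha=alpha, beta=beta)
    img_r = cv2.convertScaleAbs(img_r, alpha=alpha, beta=beta)

    # Normalize to [0, 1]
    return img_l.astype(np.float32) / 255.0, img_r.astype(np.float32) / 255.0
```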
4) Building a depth estimation model
The structure of the disparity estimation network model is as follows. Let the function g implement the mapping

g: (I_l, I_r) → (D_l, D_r)

where I_l, I_r are the left and right images respectively, and D_l, D_r are the left and right disparity maps pixel-aligned with I_l, I_r (giving the disparity corresponding to each pixel in the image). It is very difficult to construct an accurate analytic expression for the function g by hand. Deep neural networks have a very strong learning ability and can approximate any high-order, nonlinear function when trained on a large number of samples, so a DNN is used here as an approximation of g. If the DNN can predict D_l and D_r from I_l, I_r, then a new left image Î_l can be reconstructed by sampling from I_r according to D_l (performed by the image sampler S), i.e. Î_l = S(I_r, D_l); correspondingly, the right image can be reconstructed by sampling as Î_r = S(I_l, D_r). The more accurate the disparity maps predicted by the DNN, the closer the reconstructed images Î_l, Î_r are to the originals. Therefore, when training the DNN, gradually reducing the error between I_l, I_r and Î_l, Î_r makes the predicted disparity maps approach the true values. The whole training process of the network only needs binocular images as samples and does not need depth data as labels; it is therefore self-supervised learning, which makes online learning and lifelong learning possible and allows the model to adapt to complex and changing working scenes.
Note: i is l 、I r Respectively representing left and right eye images; d l ,D r Respectively showing a left visual difference chart and a right visual difference chart;
Figure SMS_18
respectively representing a left eye reconstructed image and a right eye reconstructed image; s denotes an image sampler.
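A common way to realize the image sampler S is bilinear sampling with the predicted disparity, as in the PyTorch sketch below. It reconstructs the left image from the right image and D_l; the sign convention (a left-image pixel at column x corresponds to column x - d in the right image for rectified pairs) is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def sample_left_from_right(img_r, disp_l):
    """Image sampler S: reconstruct the left view by sampling the right view
    with the left disparity map. img_r: (B, 3, H, W) in [0, 1];
    disp_l: (B, 1, H, W), disparity in pixels."""
    b, _, h, w = img_r.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img_r.device),
                            torch.arange(w, device=img_r.device),
                            indexing="ij")
    xs = xs.float().unsqueeze(0).expand(b, -1, -1)
    ys = ys.float().unsqueeze(0).expand(b, -1, -1)
    xs_src = xs - disp_l.squeeze(1)          # matching column in right image
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2.0 * xs_src / (w - 1) - 1.0,
                        2.0 * ys / (h - 1) - 1.0], dim=-1)   # (B, H, W, 2)
    return F.grid_sample(img_r, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```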
5) Building a pose estimation model
The visual odometry is concerned with the relative motion of the camera between adjacent images; the simplest case is the relative pose change of the camera between two adjacent images. As before, a rotation matrix R and a translation vector t describe the relative pose transformation of the camera. According to the implementation, visual odometry methods can be divided into feature-point methods, which require extracting feature points, and direct methods. Here a self-supervised pose model is established using the idea of the direct method.
By definition, what needs to be solved is the relative camera pose of the second frame with respect to the first frame, i.e. the rotation R and the translation t. Taking the first frame as the reference frame and letting the camera intrinsic matrix be K, the camera model gives equations 3.16 and 3.17:

p_1 = (1/Z_1) K P (3.16)

p_2 = (1/Z_2) K (R P + t) (3.17)
where Z_1 is the depth of the space point P, and Z_2 is the third coordinate of R P + t, i.e. the depth of the space point P in the second camera coordinate system. The basic assumption of the direct method is that, for the same space point, the pixel gray value is fixed and invariant in every image. According to equation 3.17, with the current pose known, the pixel position p_2 corresponding to p_1 can be found. By the basic assumption, the pixel gray values at the positions p_1 and p_2 are equal, so the pose can be found by minimizing the photometric error, i.e. the brightness error of the two pixel positions, as shown in equation 3.18:
e = I_1(p_1) - I_2(p_2) (3.18)
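Equations 3.16-3.18 amount to back-projecting p_1 with its depth, transforming the point by (R, t) and re-projecting it into the second view, where the photometric error is evaluated. A minimal PyTorch sketch of this warping, for a single pixel and with all inputs assumed given, is:

```python
import torch

def warp_pixel(p1, Z1, K, R, t):
    """Map pixel p1 = (u, v) of frame 1 with depth Z1 to its position p2 in
    frame 2 (equations 3.16-3.17). K: 3x3 intrinsics, R: 3x3, t: (3,)."""
    p1_h = torch.tensor([p1[0], p1[1], 1.0], dtype=K.dtype)
    P = Z1 * (torch.inverse(K) @ p1_h)   # back-projection: P = Z1 * K^-1 * p1
    P2 = R @ P + t                       # point in the second camera frame
    p2_h = K @ P2                        # projection of equation 3.17, up to Z2
    return p2_h[:2] / p2_h[2]            # perspective division yields (u2, v2)

# The photometric error of equation 3.18, e = I1(p1) - I2(p2), is then obtained
# by sampling the two images at p1 and p2 (bilinear interpolation in practice).
```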
the direct method establishes the minimum photometric error as an objective function through geometric constraint. The method utilizes the convolution network to improve the advanced features of the image, combines the convolution network with the thought of a direct method, establishes an end-to-end pose estimation network structure, and transmits depth estimation information to the pose estimation network, thereby solving the problem of scale uncertainty, only considering the geometric features and introducing smoothness loss in an error function.
In the training phase, the depth estimation network and the pose estimation network are coupled together and jointly trained using the geometric constraints between consecutive binocular images. Both the left and right images are used during training, while only monocular images are used during testing. When the depth estimation network is trained, the right image serves as the supervision information, so the absolute scale can be obtained after training. Because the depth estimation network and the pose estimation network are coupled, the absolute scale information is shared with the pose estimation network. In the testing phase, the system can perform dense depth reconstruction and camera pose estimation using monocular images.
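The coupling described above can be sketched as a single training step. The names depth_net, pose_net and self_supervised_loss below are placeholders for networks and a loss that the text does not specify in detail; the sketch only shows how depth and pose share the same self-supervised objective during training, while at test time the depth network alone can be run on a monocular image.

```python
import torch

def train_step(depth_net, pose_net, batch, optimizer, self_supervised_loss):
    """One joint training step for the coupled depth and pose networks."""
    I_l_t, I_r_t, I_l_t1 = batch                 # stereo pair at t, left at t+1
    disp_l, disp_r = depth_net(I_l_t, I_r_t)     # predicted disparity maps
    R, t = pose_net(I_l_t, I_l_t1)               # relative camera motion
    loss = self_supervised_loss(I_l_t, I_r_t, I_l_t1, disp_l, disp_r, R, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time only monocular input is needed (hypothetical interface):
#   with torch.no_grad():
#       disp = depth_net(I_l, None)      # dense depth with the learned scale
#       R, t = pose_net(I_l_prev, I_l)
```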
The loss is similar to that of an auto-encoder, and the most natural choice is to construct the loss function from image reconstruction. Let the original left image (the reference image) be I_l^{ij} (i, j denote the pixel position coordinates). From the predicted disparity d and the original right image I_r^{ij}, a reconstructed left image Î_l^{ij} can be obtained by a remapping operation: for each pixel of the left image, the corresponding pixel in the right image is looked up according to its disparity value and the value is obtained by interpolation. With N as the number of pixel points, the simplest image reconstruction loss function is defined as equation 3.19:

L_rec = (1/N) Σ_{i,j} |I_l^{ij} - Î_l^{ij}| (3.19)
the reconstructed image has great distortion, and only by adopting the comparison between the reconstructed image and the original image is insufficient, the image similarity index SSIM is introduced to comprehensively calculate the photometric errors of the reconstructed image and the original image, as shown in the following formula 3.20.
Figure SMS_25
where α is the weight between the basic reconstruction error and the similarity error. α is generally taken as 0.85, so that the similarity error accounts for a larger proportion; this value can be adjusted appropriately according to the experimental results. Since depth discontinuities usually occur at image gradients, an edge-aware depth smoothness loss L_smooth weighted by the image gradients is introduced, equation 3.21:

L_smooth = (1/N) Σ_{i,j} ( |∂_x d_{ij}| e^(-|∂_x I_l^{ij}|) + |∂_y d_{ij}| e^(-|∂_y I_l^{ij}|) ) (3.21)
In summary, the final loss function L_final of the entire network is:

L_final = L_p + λ L_smooth (3.22)

where L_p is the photometric loss of equation 3.20 and λ is the weight of the depth smoothness loss.
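A compact PyTorch sketch of the loss terms 3.19-3.22 is given below. The 3x3 mean-pooled SSIM and the value λ = 0.1 are illustrative choices of this sketch, not values fixed by the text (only α = 0.85 is specified above).

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods (images in [0,1], shape B,C,H,W)."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return num / den

def photometric_loss(I, I_hat, alpha=0.85):
    """Equation 3.20: SSIM term weighted by alpha plus L1 reconstruction term."""
    ssim_term = (1.0 - ssim(I, I_hat)).clamp(0, 2) / 2.0
    l1_term = (I - I_hat).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()

def smoothness_loss(disp, img):
    """Equation 3.21: edge-aware smoothness of the disparity map."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(I, I_hat, disp, lam=0.1):
    """Equation 3.22: L_final = L_p + lambda * L_smooth (lambda illustrative)."""
    return photometric_loss(I, I_hat) + lam * smoothness_loss(disp, I)
```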
6) Building the binocular vision SLAM system framework
The system runs on a notebook computer with Ubuntu 18.04. Third-party open-source library environments such as OpenCV 3.4.1, PCL and g2o are built under Linux, the algorithms of the individual modules are integrated in C++ according to the SLAM system framework, and the integrated project is then ported to a development board to form the complete visual SLAM system. To evaluate the accuracy and real-time performance of the system, the KITTI benchmark dataset and data collected in real time with the binocular camera are used to simulate the motion of a mobile robot and evaluate the system. The binocular vision SLAM system framework is built and applied to an intelligent mobile robot in the ROS environment, and field tests are carried out in real scenes. After the experiments, since pure vision may produce relatively large errors, the system can also be tested with the camera combined with other sensors (such as an IMU or a laser) to improve accuracy.
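For the KITTI evaluation, trajectory accuracy can be summarized by the absolute trajectory error (ATE) after rigid alignment, as in the NumPy sketch below; the random data in the example merely exercises the function, and real ground-truth and estimated positions would be read from the KITTI pose files.

```python
import numpy as np

def absolute_trajectory_error(gt, est):
    """Translational RMSE after rigid (Kabsch) alignment of the estimated
    trajectory to ground truth; scale is assumed already resolved by the
    stereo baseline. gt, est: (N, 3) arrays of camera positions."""
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    G, E = gt - mu_g, est - mu_e
    U, _, Vt = np.linalg.svd(E.T @ G)        # H = sum_i e_i g_i^T
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # rotation aligning est to gt
    aligned = E @ R.T + mu_g
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

# Example: gt and est would normally come from KITTI pose files (each row a
# flattened 3x4 pose matrix); random data here only exercises the function.
gt = np.cumsum(np.random.randn(100, 3) * 0.1, axis=0)
est = gt + np.random.randn(100, 3) * 0.05
print("ATE RMSE [m]:", absolute_trajectory_error(gt, est))
```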

Claims (4)

1. A visual odometry method based on self-supervised deep learning, characterized by comprising the following steps:
firstly, calibrating a binocular camera before acquiring picture data by using a binocular camera hardware device;
secondly, acquiring video image data through the binocular camera;
thirdly, preprocessing the acquired video image data;
fourthly, building a depth estimation model;
fifthly, building a pose estimation model;
and sixthly, building a binocular vision SLAM system framework.
2. The visual odometry method based on the self-supervised deep learning of claim 1, wherein in the third step, the data are preprocessed as follows: the image noise reduction adopts Gaussian smoothing filtering; the image enhancement uses the methods of scale transformation, random clipping and color adjustment; the RGB image data is normalized to between 0 and 1.
3. The visual odometry method based on self-supervised deep learning of claim 1, wherein in the fourth step, a function g is set to implement the mapping

g: (I_l, I_r) → (D_l, D_r)

where I_l, I_r are the left and right images respectively, and D_l, D_r are the left and right disparity maps pixel-aligned with I_l, I_r.
4. The visual odometry method based on the self-supervised deep learning of claim 1, wherein in the fifth step, with N as the number of pixel points, the image reconstruction loss function is defined as:

L_rec = (1/N) Σ_{i,j} |I_l^{ij} - Î_l^{ij}|

The photometric error between the reconstructed image and the original image is computed comprehensively through the image similarity index SSIM:

L_p = (1/N) Σ_{i,j} [ α (1 - SSIM(I_l^{ij}, Î_l^{ij})) / 2 + (1 - α) |I_l^{ij} - Î_l^{ij}| ]

where α is the weight between the basic reconstruction error and the similarity error, and α is 0.85.
CN202210949902.4A 2022-08-09 2022-08-09 Visual odometer method based on self-supervision deep learning Pending CN115953460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210949902.4A CN115953460A (en) 2022-08-09 2022-08-09 Visual odometer method based on self-supervision deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210949902.4A CN115953460A (en) 2022-08-09 2022-08-09 Visual odometer method based on self-supervision deep learning

Publications (1)

Publication Number Publication Date
CN115953460A (en) 2023-04-11

Family

ID=87289774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210949902.4A Pending CN115953460A (en) 2022-08-09 2022-08-09 Visual odometer method based on self-supervision deep learning

Country Status (1)

Country Link
CN (1) CN115953460A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291804A (en) * 2023-09-28 2023-12-26 武汉星巡智能科技有限公司 Binocular image real-time splicing method, device and equipment based on weighted fusion strategy

Legal Events

Date Code Title Description
PB01 Publication