CN114693720A - Design method of monocular vision odometer based on unsupervised deep learning - Google Patents

Design method of monocular vision odometer based on unsupervised deep learning

Info

Publication number
CN114693720A
CN114693720A (application number CN202210195358.9A)
Authority
CN
China
Prior art keywords
network
optical flow
image
depth
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210195358.9A
Other languages
Chinese (zh)
Inventor
李鹏
蔡成林
周彦
盘宏斌
陈洋卓
窦杰
孟步敏
蔡晓雯
张莹
黄鹏
李锡敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Xiangbo Intelligent Technology Co ltd
Original Assignee
Suzhou Xiangbo Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Xiangbo Intelligent Technology Co ltd filed Critical Suzhou Xiangbo Intelligent Technology Co ltd
Priority to CN202210195358.9A priority Critical patent/CN114693720A/en
Publication of CN114693720A publication Critical patent/CN114693720A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C22/00 Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a design method of a monocular visual odometer based on unsupervised deep learning. The method improves the performance of the visual odometer by jointly training depth, relative pose and optical flow: a depth network and an optical flow network provide depth information and dense optical flow that are consistent over long sequences; accurate sparse optical flow samples are drawn using the forward-backward consistency error; the optimal tracking mode is selected by a model score and aligned with the depth information to obtain a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and end-to-end deep learning methods on multiple error metrics, and experiments show that it effectively reduces scale inconsistency and scale drift.

Description

Design method of monocular vision odometer based on unsupervised deep learning
Technical Field
The invention relates to a design method of a monocular vision odometer based on unsupervised deep learning.
Background
In recent years, mobile robots and autonomous driving have developed rapidly, and the accuracy requirements for autonomous positioning and navigation keep rising. Indoors or in environments with weak satellite navigation signals, vision-based Simultaneous Localization And Mapping (SLAM) plays a crucial role, and visual odometry (VO), as a key link of visual SLAM, is attracting more and more attention and research.
Visual odometry can be divided into feature-based methods and direct methods. Feature-based methods detect feature points and extract local descriptors as an intermediate representation, perform feature matching between images, and optimize the camera pose with the reprojection error. Direct methods model the image formation process and optimize a photometric error function under the assumption of gray-scale invariance.
Deep learning has swept the field of computer vision in recent years, and deep-learning-based SLAM research has also made significant progress. Current work mainly focuses on sub-problems of the standard SLAM pipeline, such as feature extraction, feature matching, outlier rejection and Bundle Adjustment (BA). End-to-end visual odometry frameworks propose to regress the relative camera pose or positioning information directly from a CNN. CNN-SLAM replaces the depth estimation and image matching of LSD-SLAM with CNN-based methods, but its accuracy is seriously insufficient outdoors. GEN-SLAM uses a monocular RGB camera and trains the network on the pose and depth results of traditional geometric SLAM. SfM-Learner trains the pose and depth networks simultaneously and obtains results competitive with ORB-SLAM, while Depth-VO-Feat and D3VO train with a binocular camera and can recover trajectories at true scale when run with a monocular camera. However, because multi-view geometric constraints are missing, end-to-end deep learning methods often face severe scale drift, and it is difficult for them to compete with conventional VO. In all of the above research, the monocular visual odometer still suffers from scale drift and scale inconsistency.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a design method of a monocular visual odometer based on unsupervised deep learning, which effectively reduces scale inconsistency and scale drift.
The technical scheme for solving the technical problems is as follows: a design method of a monocular visual odometer based on unsupervised deep learning comprises the following steps:
step one: combine the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly train it with the RAFT optical flow network to obtain a more robust optical flow;
step two: according to the forward-backward consistency error, perform sparse sampling in the dense optical flow to obtain correspondences;
step three: select the optimal tracking mode according to the correspondences, and perform depth alignment with the depth network to obtain a scale-consistent visual odometer.
In the above design method of the monocular visual odometer based on unsupervised deep learning, in step one, the framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network. The depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
In the above design method, in step one, during training the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training;
consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image; the relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix; the input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$; since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
Unsupervised training assumes that the appearance of the same object surface is the same across frames; on top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha=0.85$, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames;
in low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively;
for dynamic objects, masking is applied in combination with the image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera, and the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively; the synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction;
during training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, yielding consistent depth estimates;
in summary, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step two specifically comprises:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and compute the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$;
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step three specifically comprises:
(3-1) model selection: compute the essential matrix and the homography matrix, and then compute the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, select the 2D-2D tracking mode; otherwise, select the 3D-3D tracking mode;
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion $[R\,|\,\hat{t}]$ is obtained from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor $s$ is recovered, giving $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, ICP is used to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
In the above design method of the monocular visual odometer based on unsupervised deep learning, in step (3-1), inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered. ORB-SLAM uses the model score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode. First, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$;
when the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
The invention has the following beneficial effects: the invention improves the performance of the visual odometer by jointly training depth, relative pose and optical flow; a depth network and an optical flow network provide depth information and dense optical flow that are consistent over long sequences; accurate sparse optical flow samples are drawn using the forward-backward consistency error; the optimal tracking mode is selected by a model score and aligned with the depth information to obtain a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and end-to-end deep learning methods on multiple error metrics, and experiments prove that it effectively reduces scale inconsistency and scale drift.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is an architecture diagram of an unsupervised deep learning network according to the present invention.
Fig. 3 is a schematic diagram of depth estimation.
FIG. 4 is a schematic diagram of extracting sparse matching relationships from forward and backward optical flows.
Fig. 5 is a trajectory diagram of the KITTI 09 sequence.
Fig. 6 is a trajectory diagram of the KITTI 10 sequence.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in fig. 1, a design method of a monocular visual odometer based on unsupervised deep learning includes the following steps:
Step one: combine the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly train it with the RAFT optical flow network to obtain a more robust optical flow.
The key to unsupervised learning is to use the estimated depth, pose and optical flow together with the source image to synthesize a view, and to measure the difference between the synthesized image and the target image with an image reconstruction loss. The single-view depth network, the optical flow network and the relative camera pose network are three separate tasks, but they are linked by an image similarity constraint; on top of the existing unsupervised signal from view synthesis, a spatial consistency loss and an image similarity loss are combined to couple the training of the three networks.
The framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network. The depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
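For illustration only, the following minimal PyTorch-style skeleton (not part of the original disclosure; the layer choices are placeholders and the tensor shapes are assumptions) shows the input/output interfaces of the three networks:

import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Single RGB image (B,3,H,W) -> inverse depth map (B,1,H,W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
    def forward(self, img):
        return self.net(img)

class PoseNet(nn.Module):
    """Two frames -> 6-DoF relative pose (B,6): axis-angle rotation + translation."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(6, 16, 7, stride=2, padding=3)
        self.fc = nn.Linear(16, 6)
    def forward(self, img_a, img_b):
        x = torch.relu(self.conv(torch.cat([img_a, img_b], dim=1)))
        return self.fc(x.mean(dim=[2, 3]))

class FlowNet(nn.Module):
    """Two frames -> 2-channel optical flow (B,2,H,W); RAFT would be used in practice."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, 3, padding=1)
    def forward(self, img_a, img_b):
        return self.net(torch.cat([img_a, img_b], dim=1))

In the actual method the U-Net depth decoder, the RAFT flow estimator and the ResNet18 pose encoder described later would replace these placeholder bodies.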
The unsupervised deep learning network architecture is shown in FIG. 2. During training, the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow. Compared with training the networks independently, combining the multi-task consistency constraints strengthens the relation between the networks and yields more accurate and robust depth, pose and optical flow estimates.
Lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal. The depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training.
Consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image. The relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix. The input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$. Since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
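A minimal sketch of the warping of equation (1) and the synthetic flow of equation (2), assuming PyTorch tensors and grid_sample for the differentiable bilinear interpolation (the helper name and shape conventions are assumptions, not the patent's code):

import torch
import torch.nn.functional as F

def warp_and_synthetic_flow(img_next, depth_k, T_k_to_next, K):
    """img_next: (B,3,H,W); depth_k: (B,1,H,W); T_k_to_next: (B,4,4); K: (B,3,3).
    Returns the synthesized view of frame k and the synthetic flow F_syn (eq. 1-2)."""
    B, _, H, W = depth_k.shape
    device = depth_k.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)   # homogeneous p_k
    cam = torch.linalg.inv(K) @ pix * depth_k.view(B, 1, -1)                    # back-project: D_k * K^-1 * p_k
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_next = (T_k_to_next @ cam_h)[:, :3]                                     # rigid motion T_{k->k+1}
    proj = K @ cam_next
    p_next = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                         # projected pixel coords p_{k+1}
    # differentiable bilinear sampling of the next frame at p_{k+1}
    grid = torch.stack([2.0 * p_next[:, 0] / (W - 1) - 1.0,
                        2.0 * p_next[:, 1] / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    img_k_syn = F.grid_sample(img_next, grid, mode="bilinear",
                              padding_mode="border", align_corners=True)
    flow_syn = (p_next - pix[:, :2]).view(B, 2, H, W)                           # F_syn = p_{k+1} - p_k
    return img_k_syn, flow_syn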
Unsupervised training assumes that the appearance of the same object surface is the same across frames. On top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha=0.85$, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames.
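An illustrative implementation of the reconstruction loss of equation (3); the 3 × 3 SSIM below uses common constants c1, c2 that are assumptions, as is the explicit valid-region mask argument:

import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean-pooled SSIM over 3x3 windows; x, y: (B,C,H,W) in [0,1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(img_k, img_k_syn, valid_mask, alpha=0.85):
    """Eq. (3): alpha*(1-SSIM)/2 + (1-alpha)*L1, averaged over the valid co-visible region V."""
    l1 = (img_k - img_k_syn).abs()
    dssim = (1.0 - ssim(img_k, img_k_syn)) / 2.0
    per_pixel = (alpha * dssim + (1.0 - alpha) * l1).mean(dim=1, keepdim=True)  # average over channels
    return (per_pixel * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)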
In low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes. To obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively.
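A short sketch of the edge-aware smoothness term of equation (4), assuming the disparity and image are PyTorch tensors (whether the disparity is mean-normalized beforehand is left open here):

import torch

def edge_aware_smoothness(disp, img):
    """Eq. (4): first-order disparity gradients weighted by exp(-|image gradient|).
    disp: (B,1,H,W) inverse depth; img: (B,3,H,W)."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()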
For dynamic objects, masking is applied in combination with the image segmentation network. Following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera; the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively.
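The Monodepth2-style auto-mask of equation (5) can be sketched as follows, assuming a function that returns a per-pixel photometric error map is available (the name loss_fn is an assumption):

import torch

def auto_mask(img_k, img_k_syn, img_next, loss_fn):
    """Eq. (5): keep pixels where the reconstruction error of the synthesized view is lower
    than the error against the unwarped next frame. loss_fn returns a (B,1,H,W) error map."""
    with torch.no_grad():
        mask = (loss_fn(img_k, img_k_syn) < loss_fn(img_k, img_next)).float()
    return mask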
The RAFT network, which is fast, accurate and generalizes well, is selected as the optical flow backbone. Compared with coarse-to-fine pyramid iterative networks, RAFT maintains and updates a single optical flow field at high resolution and shares weights across iterations, overcoming the two difficulties of coarse-to-fine optimization, namely that errors made at coarse resolution are hard to correct and that fast motion of small objects is hard to detect. The synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction.
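A minimal sketch of the joint fine-tuning loss of equation (6), assuming the RAFT prediction, the synthetic flow and the valid-region mask are given as tensors:

import torch

def flow_finetune_loss(flow_raft, flow_syn, valid_mask):
    """Eq. (6): L1 difference between the RAFT prediction F_R and the synthetic flow F_syn
    over the common valid pixel region. flow_*: (B,2,H,W); valid_mask: (B,1,H,W)."""
    diff = (flow_raft - flow_syn).abs().sum(dim=1, keepdim=True)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)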
During training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, so that a consistent depth estimation is obtained.
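An illustrative sketch of the depth consistency loss of equation (7); here $D_{k+1}$ is simply warped to frame k with the optical flow before comparison, which is a simplification of the alignment described above:

import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_k, depth_next, flow_k_to_next, valid_mask):
    """Warp D_{k+1} to frame k with the optical flow to obtain D_k^flow, then penalize the
    normalized depth difference on the common region S (form assumed, after SC-SfMLearner)."""
    B, _, H, W = depth_k.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=depth_k.device, dtype=torch.float32),
                            torch.arange(W, device=depth_k.device, dtype=torch.float32),
                            indexing="ij")
    grid_x = (xs + flow_k_to_next[:, 0]) * 2.0 / (W - 1) - 1.0
    grid_y = (ys + flow_k_to_next[:, 1]) * 2.0 / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)
    depth_k_flow = F.grid_sample(depth_next, grid, mode="bilinear",
                                 padding_mode="border", align_corners=True)
    diff = (depth_k - depth_k_flow).abs() / (depth_k + depth_k_flow).clamp(min=1e-6)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)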
In summary, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
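The total loss of equation (8) then combines the four terms; the default weights shown below are the values given in the experimental section and are otherwise assumptions:

def total_loss(L_p, L_s, L_f, L_dc, lam_s=0.4, lam_f=0.4, lam_dc=0.1):
    """Eq. (8): weighted sum of the photometric, smoothness, flow fine-tuning and depth consistency losses."""
    return L_p + lam_s * L_s + lam_f * L_f + lam_dc * L_dc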
Step two: perform sparse sampling in the dense optical flow according to the forward-backward consistency error to obtain correspondences.
The method comprises the following specific steps:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and compute the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$.
Scale-consistent depth information is obtained with the depth network, and the triangulation alignment process is carried out independently, so that the scale drift problem is reduced as much as possible.
To extract sparse matches from the optical flow network, the forward and backward optical flows are used simultaneously, and accurate sparse correspondences are obtained by filtering with the bidirectional consistency error. Because matching is performed locally and does not depend on a motion model, the large number of feature mismatches caused by sudden changes of motion direction is avoided.
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
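A NumPy sketch of steps (2-1) and (2-2); the threshold value delta and the handling of empty grid cells are assumptions:

import numpy as np

def sample_sparse_matches(flow_fwd, flow_bwd, grid=10, top_k=20, delta=1.0):
    """Forward-backward consistency d_F = |F_f(p) + F_b(p + F_f(p))|^2, then keep, per cell of
    a grid x grid partition, the top_k matches with d_F < delta.
    flow_fwd, flow_bwd: (H,W,2); returns (N,2) source points and (N,2) target points."""
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    tx = np.clip(xs + flow_fwd[..., 0], 0, W - 1)
    ty = np.clip(ys + flow_fwd[..., 1], 0, H - 1)
    fb = flow_bwd[ty.round().astype(int), tx.round().astype(int)]   # F_b sampled at p + F_f(p)
    d_f = np.sum((flow_fwd + fb) ** 2, axis=-1)                     # squared consistency error
    src, dst = [], []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * H // grid, (gy + 1) * H // grid
            x0, x1 = gx * W // grid, (gx + 1) * W // grid
            cell = d_f[y0:y1, x0:x1]
            idx = np.argwhere(cell < delta)
            if idx.size == 0:
                continue
            order = np.argsort(cell[idx[:, 0], idx[:, 1]])[:top_k]  # best top_k in this cell
            for r, c in idx[order]:
                src.append((x0 + c, y0 + r))
                dst.append((tx[y0 + r, x0 + c], ty[y0 + r, x0 + c]))
    return np.float32(src), np.float32(dst)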
Step three: select the optimal tracking mode according to the correspondences, and perform depth alignment with the depth network to obtain a scale-consistent visual odometer.
The method comprises the following specific steps:
(3-1) model selection: compute the essential matrix and the homography matrix, and then compute the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, select the 2D-2D tracking mode; otherwise, select the 3D-3D tracking mode.
Inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered. ORB-SLAM uses the model score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode. First, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$.
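For illustration, the model scoring of equations (9)-(10) can be sketched with OpenCV as follows; the fundamental matrix is fitted here for scoring (with calibrated points the essential matrix could be used equivalently), the RANSAC thresholds are assumptions, and error handling for failed fits is omitted:

import numpy as np
import cv2

def model_score(pts_c, pts_r, T_H=5.99, T_F=3.84):
    """Fit H and F, score both with the truncated symmetric transfer error of eq. (10),
    and return R_F = S_F / (S_F + S_H). pts_c, pts_r: (N,2) current / reference points."""
    H, _ = cv2.findHomography(pts_r, pts_c, cv2.RANSAC, 3.0)
    F, _ = cv2.findFundamentalMat(pts_r, pts_c, cv2.FM_RANSAC, 1.0, 0.99)
    gamma = T_H                                                   # invalid-data rejection threshold

    def rho(d2, T_M):
        return np.where(d2 < T_M, gamma - d2, 0.0)

    def homo_d2(Hm, a, b):                                        # ||a - H b||^2 in pixels
        bh = np.hstack([b, np.ones((len(b), 1))]) @ Hm.T
        return np.sum((a - bh[:, :2] / bh[:, 2:3]) ** 2, axis=1)

    def epi_d2(Fm, a, b):                                         # squared point-to-epipolar-line distance of a
        bh = np.hstack([b, np.ones((len(b), 1))])
        l = bh @ Fm.T                                             # epipolar lines in the image of a
        num = (np.hstack([a, np.ones((len(a), 1))]) * l).sum(axis=1) ** 2
        return num / (l[:, 0] ** 2 + l[:, 1] ** 2)

    S_H = np.sum(rho(homo_d2(H, pts_c, pts_r), T_H) + rho(homo_d2(np.linalg.inv(H), pts_r, pts_c), T_H))
    S_F = np.sum(rho(epi_d2(F, pts_c, pts_r), T_F) + rho(epi_d2(F.T, pts_r, pts_c), T_F))
    R_F = S_F / (S_F + S_H)
    return R_F, ("2D-2D" if R_F > 0.5 else "3D-3D")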
When the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
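A minimal closed-form SVD (Kabsch) solution of the 3D-3D objective in equation (11), assuming the matched 3D points are given as NumPy arrays:

import numpy as np

def solve_rt_svd(P_k, P_next):
    """Closed-form [R|t] between two matched 3D point sets, minimizing
    (1/n) * sum ||P_k - (R P_{k+1} + t)||^2. P_k, P_next: (n,3)."""
    mu_k, mu_n = P_k.mean(axis=0), P_next.mean(axis=0)
    X, Y = P_next - mu_n, P_k - mu_k
    U, _, Vt = np.linalg.svd(X.T @ Y)                 # 3x3 cross-covariance
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                                # rotation mapping P_{k+1} onto P_k
    t = mu_k - R @ mu_n
    return R, t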
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion $[R\,|\,\hat{t}]$ is obtained from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor $s$ is recovered, giving $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery. For the 3D-3D tracking mode, ICP is used to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
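An OpenCV-based sketch of the 2D-2D branch of (3-2); the median-ratio alignment between triangulated depths and the depth-network prediction is one reasonable way to recover the scale factor s and is an assumption, as are the RANSAC threshold and the transform-direction convention:

import numpy as np
import cv2

def pose_2d2d_with_scale(pts_k, pts_next, K, depth_k):
    """Recover normalized [R|t_hat] from the essential matrix, triangulate the matches, and
    recover the scale factor s against the depth-network prediction of frame k.
    pts_k, pts_next: (N,2); K: (3,3); depth_k: (H,W)."""
    E, inliers = cv2.findEssentialMat(pts_k, pts_next, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t_hat, mask = cv2.recoverPose(E, pts_k, pts_next, K, mask=inliers)
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t_hat])
    X = cv2.triangulatePoints(P0, P1, pts_k.T.astype(np.float64), pts_next.T.astype(np.float64))
    z_tri = X[2] / X[3]                                          # triangulated depths in frame k (up to scale)
    z_net = np.array([depth_k[int(round(v)), int(round(u))] for u, v in pts_k])
    good = (mask.ravel() > 0) & (z_tri > 0) & (z_net > 0)
    s = np.median(z_net[good] / z_tri[good])                     # scale factor aligning the two depth sources
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, (s * t_hat).ravel()                 # T_{k->k+1} = [R | s * t_hat]
    return T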
Experimental validation and result analysis
The experiments use a Ubuntu 20.04 system with an i5-10300H CPU, an NVIDIA GeForce GTX 1660 Ti GPU with 6 GB of graphics memory, and 16 GB of system memory. A visual odometry experiment is performed on the KITTI dataset and compared with traditional methods and end-to-end deep-learning-based methods to verify the effectiveness of the method.
Network architecture and parameter setting:
the depth estimation network is based on a universal U-net network architecture, namely an encoder-decoder structure, and ResNet18 is used as an encoder network; the decoder uses a hopping chaining architecture, with hopping connections between network layers enabling it to fuse both shallow geometry information and high-level abstract features. Because the motion between adjacent frames is very small, multi-scale output is neither accurate nor necessary, and only a single-scale depth prediction result is output, so that the computing resource is greatly saved. The optical flow network uses RAFT network as backbone network; the relative pose network is a pose estimation network with the structure of ResNet18, using axis angles to represent three-dimensional rotations.
The network model is implemented with the PyTorch framework and trained in two stages with the Adam optimizer. The first stage trains for 20 epochs with the learning rate set to $10^{-4}$ and a batch size of 8. The second stage trains for 100 epochs with the learning rate set to $10^{-5}$ and a batch size of 4. In training, $\lambda_s=0.4$, $\lambda_f=0.4$ and $\lambda_{dc}=0.1$, and the images are resized to 480 × 640.
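The two-stage schedule can be expressed as a small configuration helper; only the optimizer choice, learning rates, epoch counts and batch sizes come from the text, everything else is an assumption:

import torch

def build_stage(params, stage):
    """Two-stage Adam schedule described above: stage 1 trains 20 epochs at lr 1e-4 with
    batch size 8; stage 2 trains 100 epochs at lr 1e-5 with batch size 4."""
    if stage == 1:
        return torch.optim.Adam(params, lr=1e-4), {"epochs": 20, "batch_size": 8}
    return torch.optim.Adam(params, lr=1e-5), {"epochs": 100, "batch_size": 4}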
Visual odometry:
the KITTI data set provides 22 sets of sequence data, where 0-10 provides the true trajectory, the experiment was trained on the 0-8 sequence, the 9-10 sequence was evaluated and studied against ORB-SLAM2 and end-to-end deep learning methods. Since the monocular visual odometer cannot obtain the scale in the real world, the result is uniformly aligned with the real trajectory scale for fair comparison. Qualitative trajectory results As shown in FIGS. 5 and 6, the amount of trajectory translation drift for our method is significantly reduced compared to ORB-SLAM2, SfMLearner, SC-SfMLerner, and Depth-VO-Feat, which benefits from our scale-consistent Depth estimation. Although Depth-VO-Feat trained using binocular cameras can achieve results consistent with real world dimensions, the problem of scale drift is the most severe. On the contrary, since the exact matching relationship is extracted, after the scale alignment, the method is more consistent with the real track.
Table 1 Comparison on KITTI sequences 09 & 10 (the table is provided as an image in the original publication and is not reproduced here).
The results are analyzed in more detail using the translation error ($t_{err}$) and rotation error ($r_{err}$) over sub-sequences of different lengths (100 m, 200 m, ..., 800 m), the relative pose error RPE (m/°), and the absolute trajectory error (ATE); bold font marks the best result in each evaluation. As can be seen from Table 1, most indicators of our method are better than those of the traditional method and the pure deep learning methods. ORB-SLAM2 performs better on rotation error, because the vehicle mostly travels at a constant speed and its motion model also assumes constant motion between two frames, which leads to a smaller error with only a very small gap to our method. SC-SfMLearner also uses a depth consistency constraint during training; our absolute trajectory error on sequence 09 is only slightly better than SC-SfMLearner's while differing considerably from the other methods, but SC-SfMLearner is inferior on most other indicators because it does not apply multi-view geometric constraints in its pose estimation. By extracting sparse matches with rich structural features and exploiting epipolar geometric constraints, our method shows clear advantages in the other evaluations and better overall performance.
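For reference, the scale-aligned absolute trajectory error used in such evaluations can be sketched as follows (Umeyama similarity alignment; this is a generic evaluation utility, not code from the patent):

import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """ATE: align the estimated trajectory to ground truth with a similarity transform
    (scale, rotation, translation via Umeyama), then return the RMSE of the residual positions.
    est_xyz, gt_xyz: (N,3) camera positions."""
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(E))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est_xyz.T)).T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))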

Claims (6)

1. A design method of a monocular vision odometer based on unsupervised deep learning is characterized by comprising the following steps:
step one: combining the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly training it with the RAFT optical flow network to obtain a more robust optical flow;
step two: according to the forward-backward consistency error, performing sparse sampling in the dense optical flow to obtain correspondences;
step three: selecting the optimal tracking mode according to the correspondences, and performing depth alignment with the depth network to obtain a scale-consistent visual odometer.
2. The design method of the monocular visual odometer based on unsupervised deep learning of claim 1, wherein in step one the framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network; the depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
3. The design method of the monocular visual odometer based on unsupervised deep learning of claim 2, wherein in step one, during training the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training;
consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image; the relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix; the input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$; since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
unsupervised training assumes that the appearance of the same object surface is the same across frames; on top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha$ is the weighting coefficient between the SSIM and L1 terms, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames;
in low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively;
for dynamic objects, masking is applied in combination with the image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera, and the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively; the synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction;
during training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, so as to obtain consistent depth estimation;
to sum up, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
4. The design method of the monocular visual odometer based on unsupervised deep learning of claim 3, wherein step two specifically comprises:
(2-1) forward-backward optical flow consistency: obtaining the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and computing the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$;
(2-2) sparse point sampling: dividing the image into 10 × 10 grid regions and, in each region, taking the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
5. The design method of the monocular visual odometer based on unsupervised deep learning of claim 4, wherein step three specifically comprises:
(3-1) model selection: computing the essential matrix and the homography matrix, and then computing the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, selecting the 2D-2D tracking mode; otherwise, selecting the 3D-3D tracking mode;
(3-2) scale recovery: for the 2D-2D tracking mode, obtaining the normalized camera motion $[R\,|\,\hat{t}]$ from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; then performing scale alignment by triangulation and recovering the scale factor $s$ to obtain $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, using ICP to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
6. The design method of the monocular visual odometer based on unsupervised deep learning of claim 5, wherein in step (3-1), inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered; ORB-SLAM uses the model score method only for initial model selection and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode; first, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$;

when the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
CN202210195358.9A 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning Pending CN114693720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210195358.9A CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210195358.9A CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Publications (1)

Publication Number Publication Date
CN114693720A true CN114693720A (en) 2022-07-01

Family

ID=82137606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210195358.9A Pending CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Country Status (1)

Country Link
CN (1) CN114693720A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115290084A (en) * 2022-08-04 2022-11-04 中国人民解放军国防科技大学 Visual inertia combined positioning method and device based on weak scale supervision
CN115290084B (en) * 2022-08-04 2024-04-19 中国人民解放军国防科技大学 Visual inertial combined positioning method and device based on weak scale supervision
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
WO2024051184A1 (en) * 2022-09-07 2024-03-14 南京逸智网络空间技术创新研究院有限公司 Optical flow mask-based unsupervised monocular depth estimation method
CN116309036A (en) * 2022-10-27 2023-06-23 杭州图谱光电科技有限公司 Microscopic image real-time stitching method based on template matching and optical flow method
CN116309036B (en) * 2022-10-27 2023-12-29 杭州图谱光电科技有限公司 Microscopic image real-time stitching method based on template matching and optical flow method
CN117392228A (en) * 2023-12-12 2024-01-12 华润数字科技有限公司 Visual mileage calculation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
US20210142095A1 (en) Image disparity estimation
CN108986037B (en) Monocular vision odometer positioning method and positioning system based on semi-direct method
CN114693720A (en) Design method of monocular vision odometer based on unsupervised deep learning
CN107564061B (en) Binocular vision mileage calculation method based on image gradient joint optimization
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
US9613420B2 (en) Method for locating a camera and for 3D reconstruction in a partially known environment
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN108010081B (en) RGB-D visual odometer method based on Census transformation and local graph optimization
Liu et al. Direct visual odometry for a fisheye-stereo camera
CN104537709A (en) Real-time three-dimensional reconstruction key frame determination method based on position and orientation changes
CN105869120A (en) Image stitching real-time performance optimization method
CN112750198B (en) Dense correspondence prediction method based on non-rigid point cloud
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN112556719B (en) Visual inertial odometer implementation method based on CNN-EKF
CN107808391B (en) Video dynamic target extraction method based on feature selection and smooth representation clustering
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN112037282B (en) Aircraft attitude estimation method and system based on key points and skeleton
CN112419411B (en) Realization method of vision odometer based on convolutional neural network and optical flow characteristics
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN104156933A (en) Image registering method based on optical flow field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination