CN114693720A - Design method of monocular vision odometer based on unsupervised deep learning - Google Patents
Design method of monocular vision odometer based on unsupervised deep learning
- Publication number
- CN114693720A (application number CN202210195358.9)
- Authority
- CN
- China
- Prior art keywords
- network
- optical flow
- image
- depth
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/20 — Image analysis; analysis of motion
- G01C22/00 — Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers or pedometers
- G06N3/088 — Neural networks; non-supervised learning, e.g. competitive learning
- G06T7/50 — Image analysis; depth or shape recovery
- G06T7/70 — Image analysis; determining position or orientation of objects or cameras
- G06T2207/20081 — Special algorithmic details; training, learning
- G06T2207/20084 — Special algorithmic details; artificial neural networks [ANN]
- G06T2207/30241 — Subject of image; trajectory
Abstract
The invention discloses a design method for a monocular visual odometer based on unsupervised deep learning. The method improves visual odometry performance by jointly training depth, relative pose and optical flow: a depth network and an optical flow network produce long-sequence-consistent depth information and dense optical flow; accurate sparse optical flow samples are selected by the forward-backward consistency error; the optimal tracking mode is chosen by model score; and alignment with the depth information yields a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and to end-to-end deep learning methods on multiple error metrics, and experiments show that it effectively reduces scale inconsistency and scale drift.
Description
Technical Field
The invention relates to a design method of a monocular vision odometer based on unsupervised deep learning.
Background
In recent years, mobile robots and autonomous driving have developed rapidly, and the demands on the precision of autonomous localization and navigation keep rising. Indoors, or in environments with weak satellite navigation, vision-based Simultaneous Localization and Mapping (SLAM) plays a crucial role, and the visual odometer (VO), a key link of visual SLAM, is attracting more and more attention and research.
Visual odometry methods can be divided into feature-based methods and direct methods. Feature-based methods detect feature points and extract local descriptors as an intermediate representation, perform feature matching between images, and optimize the camera pose using the reprojection error. Direct methods model the image-formation process and, assuming grayscale invariance, optimize a photometric error function.
Deep learning has swept the field of computer vision in recent years, and deep-learning-based SLAM research has also made significant progress. Current work mainly focuses on sub-problems of the standard SLAM pipeline, such as feature extraction, feature matching, outlier rejection and Bundle Adjustment (BA). End-to-end visual odometer frameworks regress the relative camera pose or positioning information directly from a CNN. CNN-SLAM replaces the depth estimation and image matching of LSD-SLAM with CNN-based methods, but its accuracy outdoors is seriously insufficient. GEN-SLAM uses a monocular RGB camera and trains the network for pose and depth estimation with the results of traditional geometric SLAM. SfM-Learner trains pose and depth networks simultaneously and obtains results competitive with ORB-SLAM; Depth-VO-Feat and D3VO train with a binocular camera, so a trajectory at true scale can be obtained directly when running on a monocular camera. However, lacking multi-view geometric constraints, end-to-end deep learning methods often suffer severe scale drift, and it is difficult for them to compete with traditional VO. In the research above, the monocular visual odometer still suffers from scale drift and scale inconsistency.
Disclosure of Invention
In order to solve the technical problems, the invention provides a design method of a monocular vision odometer based on unsupervised deep learning, which can effectively reduce the problems of inconsistent scales and scale drift.
The technical scheme for solving the technical problems is as follows: a design method of a monocular vision odometer based on unsupervised deep learning comprises the following steps:
step one: combining the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly training it with the RAFT optical flow network to obtain a more robust optical flow;
step two: according to the forward-backward consistency error, sparse sampling is performed in the dense optical flow to obtain correspondences;
step three: selecting the optimal tracking mode according to the correspondences, and performing depth alignment with the depth network to obtain a scale-consistent visual odometer.
In the above design method of the monocular visual odometer based on unsupervised deep learning, in the first step, the framework of the unsupervised deep learning network includes three parts: the depth network receives a single RGB image as input and outputs an inverse depth map, the relative posture network and the optical flow network both receive two frames of images as input, the relative posture network outputs six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs two-channel optical flow between the two frames.
In the first step, during training, the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are input into the pose network and the optical flow network, a synthesized image is obtained by combining the relative pose estimate and the depth estimate, the depth information and camera pose are optimized with a photometric consistency loss function and an image smoothness loss function, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow information, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training;
consider two adjacent images I_k and I_{k+1}, where I_k denotes the k-th image. The relative pose network and the depth network give the relative motion T_{k→k+1} between adjacent frames and the single-view depths D_k and D_{k+1}, where D_k denotes the k-th single-view depth, according to the equation

p_{k+1} = K T_{k→k+1} D_k(p_k) K^{-1} p_k (1)

where p_{k+1} is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinate of the k-th image; T_{k→k+1} is the predicted relative camera motion from frame k to frame k+1; D_k(p_k) is the predicted depth of the k-th image pixel; p is the image pixel coordinate, p_k is the pixel coordinate of the k-th image, and K is the camera intrinsic matrix. The input image I_k is warped by this transformation to obtain the synthesized image Î_k. Since an image is defined only at discrete pixel locations, differentiable bilinear interpolation is used to sample at continuous pixel coordinates, and the synthetic optical flow F_syn is obtained from the warp:

F_syn(p_k) = p_{k+1} − p_k (2)
Unsupervised training assumes that the surface appearance of the same object is consistent between frames. On top of a simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of L1 and SSIM losses is used as the reconstructed-image loss:

L_p = Σ_{p∈V} [ α (1 − SSIM(I_k, Î_k)) / 2 + (1 − α) |I_k(p) − Î_k(p)| ] (3)

where L_p is the loss function result; SSIM(·) is the structural similarity function; α = 0.85; SSIM is computed with a 3 × 3 window; and V is the valid co-visible region of adjacent frames;
in low-texture scenes or homogeneous regions, the assumed photometric invariance causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

L_s = Σ_{i,j} ( |∂_x d_{i,j}| e^{−|∂_x I_{i,j}|} + |∂_y d_{i,j}| e^{−|∂_y I_{i,j}|} ) (4)

where L_s is the smoothness loss result; d_{i,j} is the image disparity at pixel (i, j); the exponential image-gradient terms act as an edge map that down-weights the penalty at edges; the subscripts i, j are the pixel coordinates; x, y denote the pixel directions; I_{i,j} is the image pixel at location (i, j); and ∂_x, ∂_y are the first derivatives in the x and y directions;
for dynamic objects, masking is performed in conjunction with an image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move in sync with the camera, and the mask is computed automatically in the forward pass of the network:

ω = [ L_p(I_k, Î_k) < L_p(I_k, I_{k+1}) ] (5)

where ω is the binary mask, [·] is the Iverson bracket, L_p(·,·) is the image reconstruction loss function of equation (3), and I_k, Î_k and I_{k+1} denote the k-th image, the synthesized k-th image and the (k+1)-th image respectively. The synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on their error:

L_f = Σ_{p∈V_flow} | F_R(p) − F_syn(p) | (6)

where L_f is the joint fine-tuning loss result; V_flow is the common valid pixel region of the image optical flow fields; F_R(p) is the RAFT network's optical flow prediction; and F_syn(p) is the synthetic optical flow prediction;
in training, for depth-structure consistency, D_{k+1} is aligned with D_k through the optical flow network predictions, and a depth consistency loss is computed:

L_dc = (1 / |S|) Σ_{p∈S} |D_k(p) − D′_k(p)| / (D_k(p) + D′_k(p)) (7)

where L_dc is the depth consistency loss result; D_k(p) is the depth of the k-th image; D′_k(p) is the depth of the k-th image computed by matching through the optical flow network; and S is the common valid region of optical flow and depth, so that consistent depth estimates are obtained;
in summary, the network total loss function L is:
L = L_p + λ_s L_s + λ_f L_f + λ_dc L_dc (8)

where λ_s, λ_f and λ_dc are the weights of the respective loss terms; all losses are applied jointly to the depth network and the optical flow network.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step two comprises the following specific steps:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow F_f and the backward optical flow F_b from the optical flow network, and compute the forward-backward consistency error d_F = |F_f(p) + F_b(p + F_f(p))|²;
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 groups of sparse matching points whose d_F is below the threshold δ.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step three comprises the following specific steps:
(3-1) model selection: compute the essential matrix and the homography matrix, then compute the model score R_F = S_F / (S_F + S_H), where S_F and S_H are the scores of the F model and the H model respectively; if R_F > 0.5, select the 2D-2D tracking mode; otherwise select the 3D-3D tracking mode;
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion [R | t̂] is obtained by decomposing the essential matrix, where R is the rotation matrix and t̂ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor s is recovered, giving T_{k→k+1} = [R | s t̂], where T_{k→k+1} is the camera motion and s t̂ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, ICP is used to solve T_{k→k+1} = [R | t], where [R | t] is the camera motion and t is the camera displacement vector.
In step (3-1), inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered; ORB-SLAM uses the model-score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, the model score R_F alone is used to select the tracking mode. First, the homography matrix H_cr and the essential matrix F_cr are solved:

p_c = H_cr p_r, p_c^T F_cr p_r = 0 (9)

where p_c is the matching point in the current frame of two adjacent frames and p_r is the corresponding matching point in the reference frame; then the scores S_H and S_F are computed for the H model and the F model respectively:

S_M = Σ_{i′} [ ρ_M(d²_cr(p_c^{i′}, p_r^{i′})) + ρ_M(d²_rc(p_c^{i′}, p_r^{i′})) ], with ρ_M(d²) = Γ − d² if d² < T_M, and 0 otherwise (10)

where M is H or F; ρ_M is the intermediate result of the model score S; d² is the symmetric transfer error; Γ, equal to T_H, is the invalid-data rejection threshold; p_c^{i′} is the i′-th matching point of the current frame; p_r^{i′} is the i′-th matching point of the reference frame; d²_cr and d²_rc are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; and T_M is the distance threshold; following ORB-SLAM, T_H = 5.99, T_F = 3.84, and Γ has the same definition as T_H;
when the three-dimensional point-cloud structure degenerates, scale-consistent depth information is obtained from the depth network, decomposition of the homography matrix is avoided, and [R | t] is solved by the SVD method as the least-squares 3D-3D alignment

min_{R,t} Σ_{i′=1}^{n} || P_k^{i′} − (R P_{k+1}^{i′} + t) ||² (11)

where n is the number of feature matching points between the two adjacent images; i′ is the index of a matching point; R is the camera rotation matrix; and P_k^{i′} is the 3D point at the i′-th matching point of the k-th image.
The invention has the beneficial effects that: the invention improves visual odometry performance by jointly training depth, relative pose and optical flow; a depth network and an optical flow network produce long-sequence-consistent depth information and dense optical flow; accurate sparse optical flow samples are selected by the forward-backward consistency error; the optimal tracking mode is chosen by model score; and alignment with the depth information yields a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and to end-to-end deep learning methods on multiple error metrics, and experiments prove that it effectively reduces scale inconsistency and scale drift.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is an architecture diagram of an unsupervised deep learning network according to the present invention.
Fig. 3 is a schematic diagram of depth estimation.
FIG. 4 is a schematic diagram of extracting sparse matching relationships from forward and backward optical flows.
Fig. 5 is a KITTI 09 sequence trace diagram.
Fig. 6 is a KITTI 10 sequence trace diagram.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in fig. 1, a design method of a monocular visual odometer based on unsupervised deep learning includes the following steps:
step one: combining the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly training it with the RAFT optical flow network to obtain a more robust optical flow.
The key to unsupervised learning is the image reconstruction loss: using the estimated depth, pose, optical flow and the source image, a synthetic image is generated and its difference from the target image is computed. Single-view depth, optical flow and relative camera pose are three separate tasks, but they are mutually related through image-similarity constraints; on top of the existing unsupervised signal from view synthesis, a spatial consistency loss and an image similarity loss are combined to couple the training of the three networks.
The framework of the unsupervised deep learning network comprises three parts: the depth network receives a single RGB image as input and outputs an inverse depth map, the relative posture network and the optical flow network both receive two frames of images as input, the relative posture network outputs six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs two-channel optical flow between the two frames.
The unsupervised deep learning network architecture is shown in Fig. 2. During training, the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are input into the pose network and the optical flow network, a synthesized image is obtained by combining the relative pose estimate and the depth estimate, the depth information and camera pose are optimized with a photometric consistency loss function and an image smoothness loss function, and the RAFT network is jointly optimized through the synthetic optical flow. Compared with training the networks independently, combining the multi-task consistency constraints strengthens the relations between the networks and yields more accurate and robust depth, pose and optical flow estimates.
Lacking ground-truth depth and optical flow information, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training.
Consider two adjacent images I_k and I_{k+1}, where I_k denotes the k-th image. The relative pose network and the depth network give the relative motion T_{k→k+1} between adjacent frames and the single-view depths D_k and D_{k+1}, where D_k denotes the k-th single-view depth, according to the equation

p_{k+1} = K T_{k→k+1} D_k(p_k) K^{-1} p_k (1)

where p_{k+1} is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinate of the k-th image; T_{k→k+1} is the predicted relative camera motion from frame k to frame k+1; D_k(p_k) is the predicted depth of the k-th image pixel; p is the image pixel coordinate, p_k is the pixel coordinate of the k-th image, and K is the camera intrinsic matrix. The input image I_k is warped by this transformation to obtain the synthesized image Î_k. Since an image is defined only at discrete pixel locations, differentiable bilinear interpolation is used to sample at continuous pixel coordinates, and the synthetic optical flow F_syn is obtained from the warp:

F_syn(p_k) = p_{k+1} − p_k (2)
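The view-synthesis warp just described (back-project each pixel with the predicted depth, transform by the relative pose, re-project with the intrinsics, and take the pixel displacement as the synthetic flow) can be sketched in NumPy. This is an illustrative re-implementation, not the patent's code; `warp_coords` and its array layout are choices made here:

```python
import numpy as np

def warp_coords(depth, K, T):
    """Warp pixel coordinates of frame k into frame k+1.

    depth : (H, W) predicted depth D_k
    K     : (3, 3) camera intrinsic matrix
    T     : (4, 4) relative pose T_{k->k+1}
    Returns the warped coordinates p_{k+1} as an (H, W, 2) array and the
    synthetic optical flow F_syn = p_{k+1} - p_k.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # homogeneous (3, N)
    # Back-project: X = D_k(p) * K^{-1} p
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Transform into frame k+1 and project: p' ~ K (T X)
    proj = K @ (T @ cam_h)[:3]
    p_next = (proj[:2] / proj[2]).T.reshape(H, W, 2)
    p_cur = np.stack([u, v], axis=-1).astype(float)
    return p_next, p_next - p_cur
```

In a full pipeline the synthesized image Î_k would then be sampled from I_{k+1} at these coordinates with differentiable bilinear interpolation.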
Unsupervised training assumes that the surface appearance of the same object is consistent between frames. On top of a simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of L1 and SSIM losses is used as the reconstructed-image loss:

L_p = Σ_{p∈V} [ α (1 − SSIM(I_k, Î_k)) / 2 + (1 − α) |I_k(p) − Î_k(p)| ] (3)

where L_p is the loss function result; SSIM(·) is the structural similarity function; α = 0.85; SSIM is computed with a 3 × 3 window; and V is the valid co-visible region of adjacent frames;
in low-texture scenes or homogeneous regions, the assumed photometric invariance causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

L_s = Σ_{i,j} ( |∂_x d_{i,j}| e^{−|∂_x I_{i,j}|} + |∂_y d_{i,j}| e^{−|∂_y I_{i,j}|} ) (4)

where L_s is the smoothness loss result; d_{i,j} is the image disparity at pixel (i, j); the exponential image-gradient terms act as an edge map that down-weights the penalty at edges; the subscripts i, j are the pixel coordinates; x, y denote the pixel directions; I_{i,j} is the image pixel at location (i, j); and ∂_x, ∂_y are the first derivatives in the x and y directions;
for dynamic objects, masking is performed in conjunction with an image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move in sync with the camera, and the mask is computed automatically in the forward pass of the network:

ω = [ L_p(I_k, Î_k) < L_p(I_k, I_{k+1}) ] (5)

where ω is the binary mask, [·] is the Iverson bracket, L_p(·,·) is the image reconstruction loss function of equation (3), and I_k, Î_k and I_{k+1} denote the k-th image, the synthesized k-th image and the (k+1)-th image respectively.
A RAFT network, which is fast, accurate and generalizes well, is selected as the optical flow backbone. Compared with coarse-to-fine pyramid iterative networks, RAFT maintains and updates a single optical flow field at high resolution and shares weights across iterations, overcoming two difficulties of coarse-to-fine optimization: errors made at coarse resolution are hard to correct, and fast motion of small objects is hard to detect. The synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on their error:

L_f = Σ_{p∈V_flow} | F_R(p) − F_syn(p) | (6)

where L_f is the joint fine-tuning loss result; V_flow is the common valid pixel region of the image optical flow fields; F_R(p) is the RAFT network's optical flow prediction; and F_syn(p) is the synthetic optical flow prediction;
in training, for depth-structure consistency, D_{k+1} is aligned with D_k through the optical flow network predictions, and a depth consistency loss is computed:

L_dc = (1 / |S|) Σ_{p∈S} |D_k(p) − D′_k(p)| / (D_k(p) + D′_k(p)) (7)

where L_dc is the depth consistency loss result; D_k(p) is the depth of the k-th image; D′_k(p) is the depth of the k-th image computed by matching through the optical flow network; and S is the common valid region of optical flow and depth, so that consistent depth estimates are obtained.
In summary, the network total loss function L is:
L = L_p + λ_s L_s + λ_f L_f + λ_dc L_dc (8)

where λ_s, λ_f and λ_dc are the weights of the respective loss terms; all losses are applied jointly to the depth network and the optical flow network.
Step two: and carrying out sparse sampling in the dense optical flow according to the consistency error before and after the consistency error to obtain the corresponding relation.
The method comprises the following specific steps:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow F_f and the backward optical flow F_b from the optical flow network, and compute the forward-backward consistency error d_F = |F_f(p) + F_b(p + F_f(p))|².
Scale-consistent depth information is obtained from the depth network, and the triangulation alignment is carried out independently for each pair, which reduces the scale drift problem to the greatest extent.
To extract sparse matches from the optical flow network, the forward and backward optical flows are used together, and bidirectional consistency-error filtering yields precise sparse correspondences. Because matching is performed locally, without relying on a motion model, the large-scale feature mismatching caused by sudden turning motions is avoided.
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 groups of sparse matching points whose d_F is below the threshold δ.
Step three: selecting the optimal tracking mode according to the correspondences, and performing depth alignment with the depth network to obtain a scale-consistent visual odometer.
The method comprises the following specific steps:
(3-1) model selection: compute the essential matrix and the homography matrix, then compute the model score R_F = S_F / (S_F + S_H), where S_F and S_H are the scores of the F model and the H model respectively; if R_F > 0.5, select the 2D-2D tracking mode; otherwise select the 3D-3D tracking mode.
Inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered; ORB-SLAM uses the model-score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, the model score R_F alone is used to select the tracking mode. First, the homography matrix H_cr and the essential matrix F_cr are solved:

p_c = H_cr p_r, p_c^T F_cr p_r = 0 (9)

where p_c is the matching point in the current frame of two adjacent frames and p_r is the corresponding matching point in the reference frame; then the scores S_H and S_F are computed for the H model and the F model respectively:

S_M = Σ_{i′} [ ρ_M(d²_cr(p_c^{i′}, p_r^{i′})) + ρ_M(d²_rc(p_c^{i′}, p_r^{i′})) ], with ρ_M(d²) = Γ − d² if d² < T_M, and 0 otherwise (10)

where M is H or F; ρ_M is the intermediate result of the model score S; d² is the symmetric transfer error; Γ, equal to T_H, is the invalid-data rejection threshold; p_c^{i′} is the i′-th matching point of the current frame; p_r^{i′} is the i′-th matching point of the reference frame; d²_cr and d²_rc are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; and T_M is the distance threshold; following ORB-SLAM, T_H = 5.99, T_F = 3.84, and Γ has the same definition as T_H;
when the three-dimensional point-cloud structure degenerates, scale-consistent depth information is obtained from the depth network, decomposition of the homography matrix is avoided, and [R | t] is solved by the SVD method as the least-squares 3D-3D alignment

min_{R,t} Σ_{i′=1}^{n} || P_k^{i′} − (R P_{k+1}^{i′} + t) ||² (11)

where n is the number of feature matching points between the two adjacent images; i′ is the index of a matching point; R is the camera rotation matrix; and P_k^{i′} is the 3D point at the i′-th matching point of the k-th image.
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion [R | t̂] is obtained by decomposing the essential matrix, where R is the rotation matrix and t̂ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor s is recovered, giving T_{k→k+1} = [R | s t̂], where T_{k→k+1} is the camera motion and s t̂ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, ICP is used to solve T_{k→k+1} = [R | t], where [R | t] is the camera motion and t is the camera displacement vector.
Experimental validation and result analysis
An Ubuntu 20.04 system is used, with an i5-10300H CPU, an NVIDIA GeForce GTX 1660 Ti GPU with 6 GB of video memory, and 16 GB of system memory. Visual odometry experiments are performed on the KITTI dataset and compared with traditional methods and end-to-end deep-learning-based methods to verify the effectiveness of the proposed method.
Network architecture and parameter setting:
the depth estimation network is based on the general U-Net architecture, i.e. an encoder-decoder structure, with ResNet18 as the encoder network; the decoder uses skip connections between network layers, enabling it to fuse shallow geometric information with high-level abstract features. Because the motion between adjacent frames is very small, multi-scale output is neither accurate nor necessary, so only a single-scale depth prediction is output, which greatly saves computing resources. The optical flow network uses RAFT as the backbone; the relative pose network is a pose estimation network with a ResNet18 structure, using the axis-angle representation for three-dimensional rotations.
The network model is implemented with the PyTorch framework and trained in two stages with the Adam optimizer. The first stage trains for 20 epochs with the learning rate set to 10^-4 and a batch size of 8. The second stage trains for 100 epochs with the learning rate set to 10^-5 and a batch size of 4. In training, λ_s = 0.4, λ_f = 0.4 and λ_dc = 0.1, and the image sequences are resized to 480 × 640.
And (3) visual odometer:
the KITTI dataset provides 22 sequences, of which sequences 0-10 provide ground-truth trajectories. The experiments train on sequences 0-8 and evaluate on sequences 9-10 against ORB-SLAM2 and end-to-end deep learning methods. Since a monocular visual odometer cannot recover the real-world scale, all results are uniformly aligned to the ground-truth trajectory scale for a fair comparison. Qualitative trajectory results are shown in Figs. 5 and 6: compared with ORB-SLAM2, SfMLearner, SC-SfMLearner and Depth-VO-Feat, the trajectory translation drift of our method is significantly reduced, which benefits from our scale-consistent depth estimation. Although Depth-VO-Feat, trained with a binocular camera, can obtain results at real-world scale, its scale drift problem is the most severe. In contrast, because exact matching relationships are extracted, our method agrees better with the real trajectory after scale alignment.
Table 1. Comparison on KITTI sequences 09 & 10
Translation error (t_err) and rotation error (r_err) over subsequences of different lengths (100 m, 200 m, 800 m), relative pose error (RPE, m/°) and absolute trajectory error (ATE) are used for a more detailed analysis, with bold font marking the best result in each evaluation. As Table 1 shows, most indexes of our method are better than those of the traditional method and the purely deep learning methods. ORB-SLAM2 performs better on rotation error, because the vehicle mostly drives at a constant speed and ORB-SLAM2's motion model likewise assumes constant motion between two frames, yielding a smaller error; the gap to our method is very small. SC-SfMLearner also uses a depth consistency constraint during training; its absolute trajectory error on sequence 9 is slightly better than ours and clearly better than that of the other methods, but because it applies no multi-view geometric constraint in pose estimation, it falls behind our method on most other indexes. By extracting sparse matches with rich structural features and exploiting epipolar geometric constraints, our method shows clear advantages on the other evaluations and better overall performance.
Claims (6)
1. A design method of a monocular vision odometer based on unsupervised deep learning is characterized by comprising the following steps:
step one: combining depth consistency and an image similarity loss function to obtain a scale-consistent unsupervised deep learning network, and jointly training it with the RAFT optical flow network to obtain more robust optical flow;
step two: according to the forward-backward consistency error, sparsely sampling the dense optical flow to obtain correspondences;
step three: selecting an optimal tracking mode according to the correspondences, and performing scale alignment in combination with the depth network, thereby obtaining a scale-consistent visual odometer.
2. The design method of the monocular visual odometer based on unsupervised deep learning of claim 1, wherein in step one, the framework of the unsupervised deep learning network comprises three parts: the depth network takes a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both take two frames as input, the relative pose network outputting the six-degree-of-freedom relative pose between the two frames and the optical flow network outputting the two-channel optical flow between the two frames.
3. The design method of the monocular visual odometer based on unsupervised deep learning as claimed in claim 2, wherein in step one, during training, the depths of two adjacent frames are estimated simultaneously and made consistent using a spatial consistency constraint; two adjacent RGB images are input into the pose network and the optical flow network, a synthesized image is obtained by combining the relative pose estimate with the depth estimate, the depth information and camera pose are optimized with a photometric consistency loss function and an image smoothing loss function, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which serves only to help constrain them and is used only during training;
consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image; the relative pose network and the depth network yield the relative motion $\hat T_{k\to k+1}$ between adjacent frames and the single-view depths $\hat D_k$, $\hat D_{k+1}$, where $\hat D_k$ denotes the k-th single-view depth; according to the equation

$$\hat p_{k+1} = K\,\hat T_{k\to k+1}\,\hat D_k(p_k)\,K^{-1}\,p_k \tag{1}$$

the pixel coordinates of the k-th image are converted into the pixel coordinates of the (k+1)-th image, where $\hat T_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1, $\hat D_k(p_k)$ is the predicted depth of the k-th image at pixel $p_k$, $p$ denotes image pixel coordinates, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix; transforming the input image $I_k$ accordingly yields the synthesized image $\hat I_k$; since the image is a discrete grid, continuous pixel coordinate values are obtained using differentiable bilinear interpolation, from which the synthetic optical flow $F_{syn}$ is obtained:

$$F_{syn}(p_k) = \hat p_{k+1} - p_k \tag{2}$$
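The per-pixel warping and synthetic-flow computation described above can be sketched in NumPy for a single pixel; a minimal illustration under the usual pinhole model (function names are hypothetical):

```python
import numpy as np

def warp_pixel(p, depth, K, T):
    """Project pixel p in frame k into frame k+1: p' ~ K [R|t] (D(p) K^-1 p_h).

    p: (u, v) pixel; depth: scalar depth at p; K: 3x3 intrinsics; T: 4x4 relative pose.
    """
    p_h = np.array([p[0], p[1], 1.0])
    X = depth * (np.linalg.inv(K) @ p_h)   # back-project to a 3D point in frame k
    X2 = T[:3, :3] @ X + T[:3, 3]          # rigid transform into frame k+1
    proj = K @ X2
    return proj[:2] / proj[2]              # perspective division

def synthetic_flow(p, depth, K, T):
    """Synthetic flow at p: warped pixel minus original pixel."""
    return warp_pixel(p, depth, K, T) - np.asarray(p, dtype=float)
```

With an identity relative pose the pixel maps to itself and the synthetic flow is zero; in practice this is applied densely and the synthesized image is sampled with bilinear interpolation.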
Unsupervised training assumes that the appearance of the same object surface is unchanged across frames. On top of the simple per-pixel difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of L1 and SSIM losses is used as the reconstructed-image loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat I_k(p)\right)}{2} + (1-\alpha)\,\big\lVert I_k(p)-\hat I_k(p)\big\rVert_1\right) \tag{3}$$

where $L_p$ is the loss function result, SSIM is the structural similarity loss function computed over a 3 × 3 window, $V$ is the valid co-visible region of the adjacent frames, and $\alpha$ balances the two terms;
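The combined SSIM + L1 photometric loss can be sketched in NumPy as below; the 0.85 weighting is a common choice from the self-supervised depth literature, not stated in the patent, and the function names are hypothetical:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=3):
    """SSIM map over a win x win window (the patent uses 3x3)."""
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x ** 2
    var_y = uniform_filter(y * y, win) - mu_y ** 2
    cov = uniform_filter(x * y, win) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric_loss(img, recon, alpha=0.85):
    """L_p: alpha-weighted SSIM dissimilarity plus (1-alpha) L1, image average."""
    l_ssim = np.clip((1.0 - ssim(img, recon)) / 2.0, 0.0, 1.0)
    l1 = np.abs(img - recon)
    return np.mean(alpha * l_ssim + (1 - alpha) * l1)
```

For a perfect reconstruction the loss is zero; the valid-region mask V would simply restrict the final mean to co-visible pixels.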
in low-texture scenes or homogeneous regions the photometric-invariance assumption causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\lvert\partial_x d_{i,j}\rvert\, e^{-\lvert\partial_x I_{i,j}\rvert} + \lvert\partial_y d_{i,j}\rvert\, e^{-\lvert\partial_y I_{i,j}\rvert}\right) \tag{4}$$

where $L_s$ is the smoothness loss result, $d_{i,j}$ denotes the image disparity, the factor $e^{-\lvert\partial I_{i,j}\rvert}$ acts as an image edge probability map that down-weights the loss at edges, the subscripts $i, j$ denote pixel coordinates, $x, y$ denote pixel directions, $I_{i,j}$ is the pixel at image location $(i, j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions;
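The edge-aware smoothness term can be sketched as follows, assuming a (H, W) disparity map and an (H, W, 3) RGB image as NumPy arrays (a minimal sketch; the function name is hypothetical):

```python
import numpy as np

def smoothness_loss(disp, img):
    """First-order edge-aware smoothness: disparity gradients are penalized,
    down-weighted by exp(-|image gradient|) so depth may change at image edges."""
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])                 # horizontal disparity gradient
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])                 # vertical disparity gradient
    dx_i = np.mean(np.abs(img[:, 1:] - img[:, :-1]), axis=-1)  # image gradient, channel-avg
    dy_i = np.mean(np.abs(img[1:, :] - img[:-1, :]), axis=-1)
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```

A constant disparity map incurs zero loss regardless of image content, which is the intended behavior in homogeneous regions.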
for dynamic objects, the joint image segmentation network performs a masking process; following Monodepth2, a binary mask is used to ignore objects that move in sync with the camera, and the mask is computed automatically in the forward pass of the network:

$$\omega = \left[\, L_p(I_k,\hat I_k) < L_p(I_k,I_{k+1}) \,\right] \tag{5}$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(\cdot,\cdot)$ is the image reconstruction loss of equation (3), and $I_k$, $\hat I_k$, $I_{k+1}$ denote the k-th image, the synthesized image of the k-th image, and the (k+1)-th image, respectively; fine tuning is performed by jointly training the synthetic optical flow and the RAFT optical flow network on their error:

$$L_f = \frac{1}{|\Omega|}\sum_{p\in\Omega}\big\lVert F_R(p) - F_{syn}(p)\big\rVert_1 \tag{6}$$

where $L_f$ is the joint fine-tuning loss result, $\Omega$ is the common valid pixel region of the two optical flow fields, $F_R(p)$ is the RAFT network's optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction;
in training, the optical flow network aligns $D_{k+1}$ with $D_k$ for consistent depth prediction, and a depth consistency loss is computed:

$$L_{dc} = \frac{1}{|S|}\sum_{p\in S}\frac{\lvert D_k(p) - D'_k(p)\rvert}{D_k(p) + D'_k(p)} \tag{7}$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the k-th image depth, $D'_k(p)$ is the k-th image depth computed by matching through the optical flow network, and $S$ is the common valid region of the optical flow and the depth, thereby obtaining consistent depth estimation;
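The normalized depth-consistency loss over the valid region S can be sketched as below; the normalized-difference form is an assumption in the style of SC-SfMLearner, which the description names as using a similar constraint (function name hypothetical):

```python
import numpy as np

def depth_consistency_loss(d_k, d_k_warped, mask):
    """L_dc: mean normalized depth difference over the valid region S.

    d_k, d_k_warped: (H, W) depth maps (direct prediction vs. flow-matched);
    mask: boolean (H, W) array marking the common valid region S.
    """
    diff = np.abs(d_k - d_k_warped) / (d_k + d_k_warped)  # in [0, 1) for positive depths
    return diff[mask].mean()
```

The normalization keeps the loss bounded and scale-invariant, so near and far structures contribute comparably.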
in summary, the total network loss function $L$ is:

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \tag{8}$$

where $\lambda_s$, $\lambda_f$, and $\lambda_{dc}$ are the weights of the respective loss terms; all losses are applied jointly to the depth network and the optical flow network.
4. The design method of the monocular visual odometer based on unsupervised deep learning of claim 3, wherein step two specifically comprises:
(2-1) forward-backward optical flow consistency: the optical flow network yields the forward optical flow $F_f$ and the backward optical flow $F_b$, and the forward-backward consistency error is computed as $d_F = \lvert F_f(p) + F_b(p + F_f(p))\rvert^2$;
(2-2) sparse point sampling: the image is divided into a 10 × 10 grid of regions, and in each region the top 20 matching points whose $d_F$ is below the threshold $\delta$ are taken as sparse matches.
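Steps (2-1) and (2-2) can be sketched together as below; this is a minimal illustration in which the boundary clipping and the exact per-cell selection order are assumptions (function name hypothetical):

```python
import numpy as np

def sample_sparse_matches(f_fwd, f_bwd, grid=10, per_cell=20, delta=1.0):
    """Grid-based sparse sampling of flow matches with low forward-backward error.

    f_fwd, f_bwd: (H, W, 2) flow fields. Returns a list of ((u, v), (u', v')) pairs.
    """
    h, w = f_fwd.shape[:2]
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Pixel each p lands on under the forward flow (rounded, clipped to the image).
    tgt_u = np.clip(np.round(us + f_fwd[..., 0]).astype(int), 0, w - 1)
    tgt_v = np.clip(np.round(vs + f_fwd[..., 1]).astype(int), 0, h - 1)
    # d_F = |F_f(p) + F_b(p + F_f(p))|^2
    err = np.sum((f_fwd + f_bwd[tgt_v, tgt_u]) ** 2, axis=-1)
    matches = []
    for gy in range(grid):
        for gx in range(grid):
            r = slice(gy * h // grid, (gy + 1) * h // grid)
            c = slice(gx * w // grid, (gx + 1) * w // grid)
            cell = err[r, c]
            for idx in np.argsort(cell, axis=None)[:per_cell]:
                y, x = np.unravel_index(idx, cell.shape)
                if cell[y, x] >= delta:  # sorted ascending, so the rest fail too
                    break
                py, px = r.start + y, c.start + x
                matches.append(((px, py),
                                (px + f_fwd[py, px, 0], py + f_fwd[py, px, 1])))
    return matches
```

With perfectly consistent flows every cell yields its full quota of matches; inconsistent regions (occlusions, dynamic objects) are filtered out by the threshold.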
5. The design method of the monocular visual odometer based on unsupervised deep learning of claim 4, wherein step three specifically comprises:
(3-1) model selection: the essential matrix and the homography matrix are computed, and the model score $R_F = S_F/(S_F + S_H)$ is then calculated, where $S_F$ and $S_H$ are the scores of the F (essential) and H (homography) models, respectively; if $R_F > 0.5$, the 2D-2D tracking mode is selected; otherwise, the 3D-3D tracking mode is selected;

(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion $\hat T_{k\to k+1} = [R \mid \hat t]$ is obtained from the essential-matrix decomposition, where $R$ denotes the rotation matrix and $\hat t$ a displacement vector of unit length; triangulation is then used for scale alignment, recovering the scale factor $s$ to obtain $T_{k\to k+1} = [R \mid s\hat t]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat t$ the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, ICP is used to solve for $T_{k\to k+1} = [R \mid t]$, where $[R \mid t]$ denotes the camera motion and $t$ the camera displacement vector.
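For the scale recovery in step (3-2), one common estimator for the scale factor s is the median ratio between the depth network's depths and the triangulated depths of the matched points; the patent does not fix the estimator, so this is a hedged sketch (function name hypothetical):

```python
import numpy as np

def recover_scale(depth_net, depth_triangulated):
    """Scale factor aligning the unit-norm translation to the depth network's scale.

    depth_net, depth_triangulated: per-match depths of the same 3D points;
    the median of their ratios is robust to outlier matches.
    """
    ratios = np.asarray(depth_net) / np.asarray(depth_triangulated)
    return float(np.median(ratios))
```

The recovered s then multiplies the unit-length translation from the essential-matrix decomposition, giving the scale-consistent motion [R | s·t̂].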
6. The design method of the monocular visual odometer based on unsupervised deep learning of claim 5, wherein in step (3-1), inspired by the ORB-SLAM initialization method, the two tracking modes 2D-2D and 3D-3D are considered; ORB-SLAM uses the model-score method only for initial model selection and solves the motion trajectory during tracking with a constant-velocity motion model and the PnP method, whereas the present method, having both 2D-2D and 3D-3D correspondences available, selects the tracking mode using only the model score $R_F$; first, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved:

$$p_c = H_{cr}\,p_r, \qquad p_c^{\mathsf T} F_{cr}\,p_r = 0 \tag{9}$$

where $p_c$ is the matching point in the former of the two adjacent frames and $p_r$ the matching point in the latter; the scores $S_H$ and $S_F$ are then computed for the H and F models, respectively:

$$S_M = \sum_{i'}\Big(\rho_M\big(d_{cr}^2(p_c^{i'}, p_r^{i'})\big) + \rho_M\big(d_{rc}^2(p_c^{i'}, p_r^{i'})\big)\Big) \tag{10}$$

$$\rho_M(d^2) = \begin{cases} \Gamma - d^2, & d^2 < T_M \\ 0, & d^2 \ge T_M \end{cases} \tag{11}$$

where $M$ is H or F; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $p_c^{i'}$ and $p_r^{i'}$ are the $i'$-th matching points of the current frame and the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame, respectively; $T_M$ is the distance threshold for excluding invalid data; following ORB-SLAM, $T_H = 5.99$, $T_F = 3.84$, and $\Gamma = T_H$;
when the three-dimensional point cloud structure degenerates, depth information with consistent scale is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R \mid t]$ is solved by the SVD method:

$$W = \sum_{i'}\big(p_k^{i'} - \bar p_k\big)\big(p_{k+1}^{i'} - \bar p_{k+1}\big)^{\mathsf T} = U\Sigma V^{\mathsf T}, \qquad R = V U^{\mathsf T}, \qquad t = \bar p_{k+1} - R\,\bar p_k \tag{12}$$

where $p_k^{i'}$ and $p_{k+1}^{i'}$ are matched 3D points in frames $k$ and $k+1$, and $\bar p_k$, $\bar p_{k+1}$ are their centroids.
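The SVD solution for [R | t] from 3D-3D correspondences is the standard Kabsch closed form and can be sketched as follows (a minimal sketch with hypothetical names; no outlier handling):

```python
import numpy as np

def solve_rt_svd(P, Q):
    """Closed-form [R|t] minimizing ||R p_i + t - q_i|| over 3D-3D matches (Kabsch).

    P, Q: (N, 3) matched point clouds in frames k and k+1.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```

Given exact correspondences the recovered rotation and translation reproduce the true rigid motion; with noisy matches the result is the least-squares optimum.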
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210195358.9A CN114693720A (en) | 2022-02-28 | 2022-02-28 | Design method of monocular vision odometer based on unsupervised deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114693720A true CN114693720A (en) | 2022-07-01 |
Family
ID=82137606
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115187638A (en) * | 2022-09-07 | 2022-10-14 | 南京逸智网络空间技术创新研究院有限公司 | Unsupervised monocular depth estimation method based on optical flow mask |
CN115290084A (en) * | 2022-08-04 | 2022-11-04 | 中国人民解放军国防科技大学 | Visual inertia combined positioning method and device based on weak scale supervision |
CN116309036A (en) * | 2022-10-27 | 2023-06-23 | 杭州图谱光电科技有限公司 | Microscopic image real-time stitching method based on template matching and optical flow method |
CN117392228A (en) * | 2023-12-12 | 2024-01-12 | 华润数字科技有限公司 | Visual mileage calculation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||