CN114693720A - Design method of monocular vision odometer based on unsupervised deep learning - Google Patents

Design method of monocular vision odometer based on unsupervised deep learning

Info

Publication number
CN114693720A
CN114693720A (application number CN202210195358.9A)
Authority
CN
China
Prior art keywords
network
optical flow
image
depth
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210195358.9A
Other languages
Chinese (zh)
Inventor
李鹏
蔡成林
周彦
盘宏斌
陈洋卓
窦杰
孟步敏
蔡晓雯
张莹
黄鹏
李锡敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Xiangbo Intelligent Technology Co ltd
Original Assignee
Suzhou Xiangbo Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Xiangbo Intelligent Technology Co ltd filed Critical Suzhou Xiangbo Intelligent Technology Co ltd
Priority to CN202210195358.9A priority Critical patent/CN114693720A/en
Publication of CN114693720A publication Critical patent/CN114693720A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C22/00 Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a design method of a monocular visual odometer based on unsupervised deep learning. The method improves the performance of the visual odometer by jointly training depth, relative pose and optical flow: a depth network and an optical flow network provide depth information and dense optical flow that are consistent over long sequences; accurate sparse optical flow samples are drawn using the forward-backward consistency error; the optimal tracking mode is selected by a model score and aligned with the depth information to obtain a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and end-to-end deep learning methods on multiple error metrics, and experiments show that it effectively reduces scale inconsistency and scale drift.

Description

Design method of monocular vision odometer based on unsupervised deep learning
Technical Field
The invention relates to a design method of a monocular vision odometer based on unsupervised deep learning.
Background
In recent years, mobile robots and autonomous driving have developed rapidly, and the accuracy requirements for autonomous positioning and navigation keep rising. Indoors or in environments with weak satellite navigation signals, vision-based Simultaneous Localization And Mapping (SLAM) plays a crucial role, and visual odometry (VO), as a key link of visual SLAM, is attracting more and more attention and research.
Visual odometry can be divided into feature-based methods and direct methods. Feature-based methods detect feature points and extract local descriptors as an intermediate representation, perform feature matching between images, and optimize the camera pose with the reprojection error. Direct methods model the image formation process and optimize a photometric error function under the assumption of gray-scale invariance.
Deep learning has swept the field of computer vision in recent years, and deep-learning-based SLAM research has also made significant progress. Current work mainly focuses on sub-problems of the standard SLAM pipeline, such as feature extraction, feature matching, outlier rejection and Bundle Adjustment (BA). End-to-end visual odometry frameworks propose to regress the relative camera pose or positioning information directly from a CNN. CNN-SLAM replaces the depth estimation and image matching of LSD-SLAM with CNN-based methods, but its accuracy is seriously insufficient outdoors. GEN-SLAM uses a monocular RGB camera and trains the network on the pose and depth results of traditional geometric SLAM. SfM-Learner trains the pose and depth networks simultaneously and obtains results competitive with ORB-SLAM, while Depth-VO-Feat and D3VO train with a binocular camera and can recover trajectories at true scale when run with a monocular camera. However, because multi-view geometric constraints are missing, end-to-end deep learning methods often face severe scale drift, and it is difficult for them to compete with conventional VO. In all of the above research, the monocular visual odometer still suffers from scale drift and scale inconsistency.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a design method of a monocular visual odometer based on unsupervised deep learning, which effectively reduces scale inconsistency and scale drift.
The technical scheme for solving the technical problems is as follows: a design method of a monocular visual odometer based on unsupervised deep learning comprises the following steps:
step one: combine the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly train it with the RAFT optical flow network to obtain a more robust optical flow;
step two: according to the forward-backward consistency error, perform sparse sampling in the dense optical flow to obtain correspondences;
step three: select the optimal tracking mode according to the correspondences, and perform depth alignment with the depth network to obtain a scale-consistent visual odometer.
In the above design method of the monocular visual odometer based on unsupervised deep learning, in step one, the framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network. The depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
In the above design method, in step one, during training the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training;
consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image; the relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix; the input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$; since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
Unsupervised training assumes that the appearance of the same object surface is the same across frames; on top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha=0.85$, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames;
in low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively;
for dynamic objects, masking is applied in combination with the image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera, and the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively; the synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction;
during training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, yielding consistent depth estimates;
in summary, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step two specifically comprises:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and compute the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$;
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
In the above design method of the monocular visual odometer based on unsupervised deep learning, step three specifically comprises:
(3-1) model selection: compute the essential matrix and the homography matrix, and then compute the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, select the 2D-2D tracking mode; otherwise, select the 3D-3D tracking mode;
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion $[R\,|\,\hat{t}]$ is obtained from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor $s$ is recovered, giving $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, ICP is used to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
In the above design method of the monocular visual odometer based on unsupervised deep learning, in step (3-1), inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered. ORB-SLAM uses the model score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode. First, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$;
when the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
The invention has the following beneficial effects: the invention improves the performance of the visual odometer by jointly training depth, relative pose and optical flow; a depth network and an optical flow network provide depth information and dense optical flow that are consistent over long sequences; accurate sparse optical flow samples are drawn using the forward-backward consistency error; the optimal tracking mode is selected by a model score and aligned with the depth information to obtain a scale-consistent visual odometer. By combining the geometric constraints of traditional methods with the robust matching of deep networks, the method is clearly superior to purely geometric methods and end-to-end deep learning methods on multiple error metrics, and experiments prove that it effectively reduces scale inconsistency and scale drift.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is an architecture diagram of an unsupervised deep learning network according to the present invention.
Fig. 3 is a schematic diagram of depth estimation.
FIG. 4 is a schematic diagram of extracting sparse matching relationships from forward and backward optical flows.
Fig. 5 is a trajectory diagram of the KITTI 09 sequence.
Fig. 6 is a trajectory diagram of the KITTI 10 sequence.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in fig. 1, a design method of a monocular visual odometer based on unsupervised deep learning includes the following steps:
Step one: combine the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly train it with the RAFT optical flow network to obtain a more robust optical flow.
The key to unsupervised learning is to use the estimated depth, pose and optical flow together with the source image to synthesize a view, and to measure the difference between the synthesized image and the target image with an image reconstruction loss. The single-view depth network, the optical flow network and the relative camera pose network are three separate tasks, but they are linked by an image similarity constraint; on top of the existing unsupervised signal from view synthesis, a spatial consistency loss and an image similarity loss are combined to couple the training of the three networks.
The framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network. The depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
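For illustration only, the following minimal PyTorch-style skeleton (not part of the original disclosure; the layer choices are placeholders and the tensor shapes are assumptions) shows the input/output interfaces of the three networks:

import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Single RGB image (B,3,H,W) -> inverse depth map (B,1,H,W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
    def forward(self, img):
        return self.net(img)

class PoseNet(nn.Module):
    """Two frames -> 6-DoF relative pose (B,6): axis-angle rotation + translation."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(6, 16, 7, stride=2, padding=3)
        self.fc = nn.Linear(16, 6)
    def forward(self, img_a, img_b):
        x = torch.relu(self.conv(torch.cat([img_a, img_b], dim=1)))
        return self.fc(x.mean(dim=[2, 3]))

class FlowNet(nn.Module):
    """Two frames -> 2-channel optical flow (B,2,H,W); RAFT would be used in practice."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, 3, padding=1)
    def forward(self, img_a, img_b):
        return self.net(torch.cat([img_a, img_b], dim=1))

In the actual method the U-Net depth decoder, the RAFT flow estimator and the ResNet18 pose encoder described later would replace these placeholder bodies.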
The unsupervised deep learning network architecture is shown in FIG. 2. During training, the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow. Compared with training the networks independently, combining the multi-task consistency constraints strengthens the relation between the networks and yields more accurate and robust depth, pose and optical flow estimates.
Lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal. The depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training.
Consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image. The relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix. The input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$. Since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
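A minimal sketch of the warping of equation (1) and the synthetic flow of equation (2), assuming PyTorch tensors and grid_sample for the differentiable bilinear interpolation (the helper name and shape conventions are assumptions, not the patent's code):

import torch
import torch.nn.functional as F

def warp_and_synthetic_flow(img_next, depth_k, T_k_to_next, K):
    """img_next: (B,3,H,W); depth_k: (B,1,H,W); T_k_to_next: (B,4,4); K: (B,3,3).
    Returns the synthesized view of frame k and the synthetic flow F_syn (eq. 1-2)."""
    B, _, H, W = depth_k.shape
    device = depth_k.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)   # homogeneous p_k
    cam = torch.linalg.inv(K) @ pix * depth_k.view(B, 1, -1)                    # back-project: D_k * K^-1 * p_k
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_next = (T_k_to_next @ cam_h)[:, :3]                                     # rigid motion T_{k->k+1}
    proj = K @ cam_next
    p_next = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                         # projected pixel coords p_{k+1}
    # differentiable bilinear sampling of the next frame at p_{k+1}
    grid = torch.stack([2.0 * p_next[:, 0] / (W - 1) - 1.0,
                        2.0 * p_next[:, 1] / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    img_k_syn = F.grid_sample(img_next, grid, mode="bilinear",
                              padding_mode="border", align_corners=True)
    flow_syn = (p_next - pix[:, :2]).view(B, 2, H, W)                           # F_syn = p_{k+1} - p_k
    return img_k_syn, flow_syn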
Unsupervised training assumes that the appearance of the same object surface is the same across frames. On top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha=0.85$, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames.
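An illustrative implementation of the reconstruction loss of equation (3); the 3 × 3 SSIM below uses common constants c1, c2 that are assumptions, as is the explicit valid-region mask argument:

import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean-pooled SSIM over 3x3 windows; x, y: (B,C,H,W) in [0,1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(img_k, img_k_syn, valid_mask, alpha=0.85):
    """Eq. (3): alpha*(1-SSIM)/2 + (1-alpha)*L1, averaged over the valid co-visible region V."""
    l1 = (img_k - img_k_syn).abs()
    dssim = (1.0 - ssim(img_k, img_k_syn)) / 2.0
    per_pixel = (alpha * dssim + (1.0 - alpha) * l1).mean(dim=1, keepdim=True)  # average over channels
    return (per_pixel * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)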
In low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes. To obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively.
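A short sketch of the edge-aware smoothness term of equation (4), assuming the disparity and image are PyTorch tensors (whether the disparity is mean-normalized beforehand is left open here):

import torch

def edge_aware_smoothness(disp, img):
    """Eq. (4): first-order disparity gradients weighted by exp(-|image gradient|).
    disp: (B,1,H,W) inverse depth; img: (B,3,H,W)."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()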
For dynamic objects, masking is applied in combination with the image segmentation network. Following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera; the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively.
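The Monodepth2-style auto-mask of equation (5) can be sketched as follows, assuming a function that returns a per-pixel photometric error map is available (the name loss_fn is an assumption):

import torch

def auto_mask(img_k, img_k_syn, img_next, loss_fn):
    """Eq. (5): keep pixels where the reconstruction error of the synthesized view is lower
    than the error against the unwarped next frame. loss_fn returns a (B,1,H,W) error map."""
    with torch.no_grad():
        mask = (loss_fn(img_k, img_k_syn) < loss_fn(img_k, img_next)).float()
    return mask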
The RAFT network, which is fast, accurate and generalizes well, is selected as the optical flow backbone. Compared with coarse-to-fine pyramid iterative networks, RAFT maintains and updates a single optical flow field at high resolution and shares weights across iterations, overcoming the two difficulties of coarse-to-fine optimization, namely that errors made at coarse resolution are hard to correct and that fast motion of small objects is hard to detect. The synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction.
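A minimal sketch of the joint fine-tuning loss of equation (6), assuming the RAFT prediction, the synthetic flow and the valid-region mask are given as tensors:

import torch

def flow_finetune_loss(flow_raft, flow_syn, valid_mask):
    """Eq. (6): L1 difference between the RAFT prediction F_R and the synthetic flow F_syn
    over the common valid pixel region. flow_*: (B,2,H,W); valid_mask: (B,1,H,W)."""
    diff = (flow_raft - flow_syn).abs().sum(dim=1, keepdim=True)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)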
During training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, so that a consistent depth estimation is obtained.
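An illustrative sketch of the depth consistency loss of equation (7); here $D_{k+1}$ is simply warped to frame k with the optical flow before comparison, which is a simplification of the alignment described above:

import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_k, depth_next, flow_k_to_next, valid_mask):
    """Warp D_{k+1} to frame k with the optical flow to obtain D_k^flow, then penalize the
    normalized depth difference on the common region S (form assumed, after SC-SfMLearner)."""
    B, _, H, W = depth_k.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=depth_k.device, dtype=torch.float32),
                            torch.arange(W, device=depth_k.device, dtype=torch.float32),
                            indexing="ij")
    grid_x = (xs + flow_k_to_next[:, 0]) * 2.0 / (W - 1) - 1.0
    grid_y = (ys + flow_k_to_next[:, 1]) * 2.0 / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)
    depth_k_flow = F.grid_sample(depth_next, grid, mode="bilinear",
                                 padding_mode="border", align_corners=True)
    diff = (depth_k - depth_k_flow).abs() / (depth_k + depth_k_flow).clamp(min=1e-6)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)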
In summary, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
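The total loss of equation (8) then combines the four terms; the default weights shown below are the values given in the experimental section and are otherwise assumptions:

def total_loss(L_p, L_s, L_f, L_dc, lam_s=0.4, lam_f=0.4, lam_dc=0.1):
    """Eq. (8): weighted sum of the photometric, smoothness, flow fine-tuning and depth consistency losses."""
    return L_p + lam_s * L_s + lam_f * L_f + lam_dc * L_dc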
Step two: perform sparse sampling in the dense optical flow according to the forward-backward consistency error to obtain correspondences.
The method comprises the following specific steps:
(2-1) forward-backward optical flow consistency: obtain the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and compute the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$.
Scale-consistent depth information is obtained with the depth network, and the triangulation alignment process is carried out independently, so that the scale drift problem is reduced as much as possible.
To extract sparse matches from the optical flow network, the forward and backward optical flows are used simultaneously, and accurate sparse correspondences are obtained by filtering with the bidirectional consistency error. Because matching is performed locally and does not depend on a motion model, the large number of feature mismatches caused by sudden changes of motion direction is avoided.
(2-2) sparse point sampling: divide the image into 10 × 10 grid regions and, in each region, take the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
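A NumPy sketch of steps (2-1) and (2-2); the threshold value delta and the handling of empty grid cells are assumptions:

import numpy as np

def sample_sparse_matches(flow_fwd, flow_bwd, grid=10, top_k=20, delta=1.0):
    """Forward-backward consistency d_F = |F_f(p) + F_b(p + F_f(p))|^2, then keep, per cell of
    a grid x grid partition, the top_k matches with d_F < delta.
    flow_fwd, flow_bwd: (H,W,2); returns (N,2) source points and (N,2) target points."""
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    tx = np.clip(xs + flow_fwd[..., 0], 0, W - 1)
    ty = np.clip(ys + flow_fwd[..., 1], 0, H - 1)
    fb = flow_bwd[ty.round().astype(int), tx.round().astype(int)]   # F_b sampled at p + F_f(p)
    d_f = np.sum((flow_fwd + fb) ** 2, axis=-1)                     # squared consistency error
    src, dst = [], []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * H // grid, (gy + 1) * H // grid
            x0, x1 = gx * W // grid, (gx + 1) * W // grid
            cell = d_f[y0:y1, x0:x1]
            idx = np.argwhere(cell < delta)
            if idx.size == 0:
                continue
            order = np.argsort(cell[idx[:, 0], idx[:, 1]])[:top_k]  # best top_k in this cell
            for r, c in idx[order]:
                src.append((x0 + c, y0 + r))
                dst.append((tx[y0 + r, x0 + c], ty[y0 + r, x0 + c]))
    return np.float32(src), np.float32(dst)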
Step three: select the optimal tracking mode according to the correspondences, and perform depth alignment with the depth network to obtain a scale-consistent visual odometer.
The method comprises the following specific steps:
(3-1) model selection: compute the essential matrix and the homography matrix, and then compute the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, select the 2D-2D tracking mode; otherwise, select the 3D-3D tracking mode.
Inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered. ORB-SLAM uses the model score method only for initial model selection, and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode. First, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$.
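For illustration, the model scoring of equations (9)-(10) can be sketched with OpenCV as follows; the fundamental matrix is fitted here for scoring (with calibrated points the essential matrix could be used equivalently), the RANSAC thresholds are assumptions, and error handling for failed fits is omitted:

import numpy as np
import cv2

def model_score(pts_c, pts_r, T_H=5.99, T_F=3.84):
    """Fit H and F, score both with the truncated symmetric transfer error of eq. (10),
    and return R_F = S_F / (S_F + S_H). pts_c, pts_r: (N,2) current / reference points."""
    H, _ = cv2.findHomography(pts_r, pts_c, cv2.RANSAC, 3.0)
    F, _ = cv2.findFundamentalMat(pts_r, pts_c, cv2.FM_RANSAC, 1.0, 0.99)
    gamma = T_H                                                   # invalid-data rejection threshold

    def rho(d2, T_M):
        return np.where(d2 < T_M, gamma - d2, 0.0)

    def homo_d2(Hm, a, b):                                        # ||a - H b||^2 in pixels
        bh = np.hstack([b, np.ones((len(b), 1))]) @ Hm.T
        return np.sum((a - bh[:, :2] / bh[:, 2:3]) ** 2, axis=1)

    def epi_d2(Fm, a, b):                                         # squared point-to-epipolar-line distance of a
        bh = np.hstack([b, np.ones((len(b), 1))])
        l = bh @ Fm.T                                             # epipolar lines in the image of a
        num = (np.hstack([a, np.ones((len(a), 1))]) * l).sum(axis=1) ** 2
        return num / (l[:, 0] ** 2 + l[:, 1] ** 2)

    S_H = np.sum(rho(homo_d2(H, pts_c, pts_r), T_H) + rho(homo_d2(np.linalg.inv(H), pts_r, pts_c), T_H))
    S_F = np.sum(rho(epi_d2(F, pts_c, pts_r), T_F) + rho(epi_d2(F.T, pts_r, pts_c), T_F))
    R_F = S_F / (S_F + S_H)
    return R_F, ("2D-2D" if R_F > 0.5 else "3D-3D")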
When the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
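A minimal closed-form SVD (Kabsch) solution of the 3D-3D objective in equation (11), assuming the matched 3D points are given as NumPy arrays:

import numpy as np

def solve_rt_svd(P_k, P_next):
    """Closed-form [R|t] between two matched 3D point sets, minimizing
    (1/n) * sum ||P_k - (R P_{k+1} + t)||^2. P_k, P_next: (n,3)."""
    mu_k, mu_n = P_k.mean(axis=0), P_next.mean(axis=0)
    X, Y = P_next - mu_n, P_k - mu_k
    U, _, Vt = np.linalg.svd(X.T @ Y)                 # 3x3 cross-covariance
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                                # rotation mapping P_{k+1} onto P_k
    t = mu_k - R @ mu_n
    return R, t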
(3-2) scale recovery: for the 2D-2D tracking mode, the normalized camera motion $[R\,|\,\hat{t}]$ is obtained from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; scale alignment is then performed by triangulation and the scale factor $s$ is recovered, giving $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery. For the 3D-3D tracking mode, ICP is used to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
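An OpenCV-based sketch of the 2D-2D branch of (3-2); the median-ratio alignment between triangulated depths and the depth-network prediction is one reasonable way to recover the scale factor s and is an assumption, as are the RANSAC threshold and the transform-direction convention:

import numpy as np
import cv2

def pose_2d2d_with_scale(pts_k, pts_next, K, depth_k):
    """Recover normalized [R|t_hat] from the essential matrix, triangulate the matches, and
    recover the scale factor s against the depth-network prediction of frame k.
    pts_k, pts_next: (N,2); K: (3,3); depth_k: (H,W)."""
    E, inliers = cv2.findEssentialMat(pts_k, pts_next, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t_hat, mask = cv2.recoverPose(E, pts_k, pts_next, K, mask=inliers)
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t_hat])
    X = cv2.triangulatePoints(P0, P1, pts_k.T.astype(np.float64), pts_next.T.astype(np.float64))
    z_tri = X[2] / X[3]                                          # triangulated depths in frame k (up to scale)
    z_net = np.array([depth_k[int(round(v)), int(round(u))] for u, v in pts_k])
    good = (mask.ravel() > 0) & (z_tri > 0) & (z_net > 0)
    s = np.median(z_net[good] / z_tri[good])                     # scale factor aligning the two depth sources
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, (s * t_hat).ravel()                 # T_{k->k+1} = [R | s * t_hat]
    return T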
Experimental validation and result analysis
The experiments use a Ubuntu 20.04 system with an i5-10300H CPU, an NVIDIA GeForce GTX 1660 Ti GPU with 6 GB of graphics memory, and 16 GB of system memory. A visual odometry experiment is performed on the KITTI dataset and compared with traditional methods and end-to-end deep-learning-based methods to verify the effectiveness of the method.
Network architecture and parameter setting:
the depth estimation network is based on a universal U-net network architecture, namely an encoder-decoder structure, and ResNet18 is used as an encoder network; the decoder uses a hopping chaining architecture, with hopping connections between network layers enabling it to fuse both shallow geometry information and high-level abstract features. Because the motion between adjacent frames is very small, multi-scale output is neither accurate nor necessary, and only a single-scale depth prediction result is output, so that the computing resource is greatly saved. The optical flow network uses RAFT network as backbone network; the relative pose network is a pose estimation network with the structure of ResNet18, using axis angles to represent three-dimensional rotations.
The network model is implemented with the PyTorch framework and trained in two stages with the Adam optimizer. The first stage trains for 20 epochs with the learning rate set to $10^{-4}$ and a batch size of 8. The second stage trains for 100 epochs with the learning rate set to $10^{-5}$ and a batch size of 4. In training, $\lambda_s=0.4$, $\lambda_f=0.4$ and $\lambda_{dc}=0.1$, and the images are resized to 480 × 640.
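The two-stage schedule can be expressed as a small configuration helper; only the optimizer choice, learning rates, epoch counts and batch sizes come from the text, everything else is an assumption:

import torch

def build_stage(params, stage):
    """Two-stage Adam schedule described above: stage 1 trains 20 epochs at lr 1e-4 with
    batch size 8; stage 2 trains 100 epochs at lr 1e-5 with batch size 4."""
    if stage == 1:
        return torch.optim.Adam(params, lr=1e-4), {"epochs": 20, "batch_size": 8}
    return torch.optim.Adam(params, lr=1e-5), {"epochs": 100, "batch_size": 4}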
Visual odometry:
the KITTI data set provides 22 sets of sequence data, where 0-10 provides the true trajectory, the experiment was trained on the 0-8 sequence, the 9-10 sequence was evaluated and studied against ORB-SLAM2 and end-to-end deep learning methods. Since the monocular visual odometer cannot obtain the scale in the real world, the result is uniformly aligned with the real trajectory scale for fair comparison. Qualitative trajectory results As shown in FIGS. 5 and 6, the amount of trajectory translation drift for our method is significantly reduced compared to ORB-SLAM2, SfMLearner, SC-SfMLerner, and Depth-VO-Feat, which benefits from our scale-consistent Depth estimation. Although Depth-VO-Feat trained using binocular cameras can achieve results consistent with real world dimensions, the problem of scale drift is the most severe. On the contrary, since the exact matching relationship is extracted, after the scale alignment, the method is more consistent with the real track.
Table 1 Comparison on KITTI sequences 09 & 10 (the table is provided as an image in the original publication and is not reproduced here).
The results are analyzed in more detail using the translation error ($t_{err}$) and rotation error ($r_{err}$) over sub-sequences of different lengths (100 m, 200 m, ..., 800 m), the relative pose error RPE (m/°), and the absolute trajectory error (ATE); bold font marks the best result in each evaluation. As can be seen from Table 1, most indicators of our method are better than those of the traditional method and the pure deep learning methods. ORB-SLAM2 performs better on rotation error, because the vehicle mostly travels at a constant speed and its motion model also assumes constant motion between two frames, which leads to a smaller error with only a very small gap to our method. SC-SfMLearner also uses a depth consistency constraint during training; our absolute trajectory error on sequence 09 is only slightly better than SC-SfMLearner's while differing considerably from the other methods, but SC-SfMLearner is inferior on most other indicators because it does not apply multi-view geometric constraints in its pose estimation. By extracting sparse matches with rich structural features and exploiting epipolar geometric constraints, our method shows clear advantages in the other evaluations and better overall performance.
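For reference, the scale-aligned absolute trajectory error used in such evaluations can be sketched as follows (Umeyama similarity alignment; this is a generic evaluation utility, not code from the patent):

import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """ATE: align the estimated trajectory to ground truth with a similarity transform
    (scale, rotation, translation via Umeyama), then return the RMSE of the residual positions.
    est_xyz, gt_xyz: (N,3) camera positions."""
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(E))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est_xyz.T)).T + t
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))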

Claims (6)

1. A design method of a monocular vision odometer based on unsupervised deep learning is characterized by comprising the following steps:
step one: combining the depth consistency and image similarity loss functions to obtain a scale-consistent unsupervised deep learning network, and jointly training it with the RAFT optical flow network to obtain a more robust optical flow;
step two: according to the forward-backward consistency error, performing sparse sampling in the dense optical flow to obtain correspondences;
step three: selecting the optimal tracking mode according to the correspondences, and performing depth alignment with the depth network to obtain a scale-consistent visual odometer.
2. The design method of the monocular visual odometer based on unsupervised deep learning of claim 1, wherein in step one the framework of the unsupervised deep learning network comprises three parts: a depth network, a relative pose network and an optical flow network; the depth network receives a single RGB image as input and outputs an inverse depth map; the relative pose network and the optical flow network both receive two frames as input, the relative pose network outputs the six-degree-of-freedom relative pose between the two frames, and the optical flow network outputs the two-channel optical flow between the two frames.
3. The design method of the monocular visual odometer based on unsupervised deep learning of claim 2, wherein in step one, during training the depths of two adjacent frames are estimated simultaneously, and a spatial consistency constraint keeps the depth information consistent; two adjacent RGB images are fed into the pose network and the optical flow network, the relative pose estimate and the depth estimate are combined to synthesize an image, the depth information and camera pose are optimized with a photometric consistency loss and an image smoothness loss, and the RAFT network is jointly optimized through the synthetic optical flow;
lacking ground-truth depth and optical flow, the unsupervised deep learning network trains the network model using synthesized views, with inter-frame similarity as the supervision signal; the depth network and the optical flow network are geometrically related through the relative pose network, which helps constrain them and is used only during training;
consider two adjacent images $I_k$ and $I_{k+1}$, where $I_k$ denotes the k-th image; the relative pose network and the depth network give the relative motion $T_{k\to k+1}$ between adjacent frames and the single-view depths $D_k, D_{k+1}$, where $D_k$ denotes the k-th single-view depth, according to the equation

$$p_{k+1} \sim K\,\hat{T}_{k\to k+1}\,\hat{D}_k(p_k)\,K^{-1}\,p_k \quad (1)$$

where $p_{k+1}$ is the pixel coordinate in the (k+1)-th image obtained by transforming the pixel coordinates of the k-th image; $\hat{T}_{k\to k+1}$ is the predicted relative camera motion from frame k to frame k+1; $\hat{D}_k(p_k)$ is the predicted depth of the k-th image pixel; $p$ is an image pixel coordinate, $p_k$ is a pixel coordinate of the k-th image, and $K$ is the camera intrinsic matrix; the input image $I_k$ is warped accordingly to obtain the synthesized image $\hat{I}_k$; since image coordinates are discrete, differentiable bilinear interpolation is used to sample at the continuous pixel coordinates, and the synthetic optical flow $F_{syn}$ is obtained as

$$F_{syn}(p_k) = p_{k+1} - p_k \quad (2)$$
unsupervised training assumes that the appearance of the same object surface is the same across frames; on top of the simple pixel-wise difference, a structural similarity loss is introduced to learn structural information between adjacent frames, and a combination of the L1 and SSIM losses is used as the reconstruction loss:

$$L_p = \frac{1}{|V|}\sum_{p\in V}\left(\alpha\,\frac{1-\mathrm{SSIM}\!\left(I_k(p),\hat{I}_k(p)\right)}{2}+(1-\alpha)\left\|I_k(p)-\hat{I}_k(p)\right\|_1\right) \quad (3)$$

where $L_p$ is the loss function result, $\mathrm{SSIM}(\cdot)$ is the structural similarity function, $\alpha$ is the weighting coefficient between the SSIM and L1 terms, SSIM is computed with a 3 × 3 window, and $V$ is the valid co-visible region of the adjacent frames;
in low-texture scenes or homogeneous regions, the photometric-invariance assumption causes prediction holes; to obtain smooth depth predictions, a first-order edge-aware smoothness loss is introduced:

$$L_s = \sum_{i,j}\left(\left|\partial_x d_{i,j}\right| e^{-\left|\partial_x I_{i,j}\right|}+\left|\partial_y d_{i,j}\right| e^{-\left|\partial_y I_{i,j}\right|}\right) \quad (4)$$

where $L_s$ is the smoothness loss result; $d_{i,j}$ is the disparity at pixel $(i,j)$ and $\partial_x d_{i,j}, \partial_y d_{i,j}$ are the image disparity gradients; $e^{-|\partial_x I_{i,j}|}, e^{-|\partial_y I_{i,j}|}$ are edge weights derived from the image edge probability map gradients; the subscripts $i, j$ are pixel coordinates, $x, y$ are pixel directions, $I_{i,j}$ is the image pixel at $(i,j)$, and $\partial_x, \partial_y$ are the first derivatives in the x and y directions respectively;
for dynamic objects, masking is applied in combination with the image segmentation network; following Monodepth2, a binary mask is used to ignore objects that move synchronously with the camera, and the mask is computed automatically in the forward pass of the network:

$$\omega=\left[\,L_p\!\left(I_k,\hat{I}_k\right)<L_p\!\left(I_k,I_{k+1}\right)\,\right] \quad (5)$$

where $\omega$ is the binary mask, $[\cdot]$ is the Iverson bracket, $L_p(I_k,I_{k+1})$ is the image reconstruction loss of equation (3), and $I_k$, $\hat{I}_k$, $I_{k+1}$ denote the k-th image, the synthesized k-th image, and the (k+1)-th image respectively; the synthetic optical flow and the RAFT optical flow network are fine-tuned by joint training on the error

$$L_f=\frac{1}{|V_{flow}|}\sum_{p\in V_{flow}}\left|F_R(p)-F_{syn}(p)\right| \quad (6)$$

where $L_f$ is the joint fine-tuning loss result, $V_{flow}$ is the common valid pixel region of the image optical flow fields, $F_R(p)$ is the RAFT optical flow prediction, and $F_{syn}(p)$ is the synthetic optical flow prediction;
during training, for depth structure consistency, $D_{k+1}$ is aligned with $D_k$ through the optical flow prediction and a depth consistency loss is computed:

$$L_{dc}=\frac{1}{|S|}\sum_{p\in S}\frac{\left|D_k(p)-D_k^{flow}(p)\right|}{D_k(p)+D_k^{flow}(p)} \quad (7)$$

where $L_{dc}$ is the depth consistency loss result, $D_k(p)$ is the depth of the k-th image, $D_k^{flow}(p)$ is the depth of the k-th image obtained by matching through the optical flow network, and $S$ is the common valid region of optical flow and depth, so as to obtain consistent depth estimation;
to sum up, the total network loss function $L$ is

$$L = L_p + \lambda_s L_s + \lambda_f L_f + \lambda_{dc} L_{dc} \quad (8)$$

where $\lambda_s$, $\lambda_f$ and $\lambda_{dc}$ are the weights of the respective loss terms, and all losses are applied jointly to the depth network and the optical flow network.
4. The design method of the monocular visual odometer based on unsupervised deep learning of claim 3, wherein step two specifically comprises:
(2-1) forward-backward optical flow consistency: obtaining the forward optical flow $F_f$ and the backward optical flow $F_b$ from the optical flow network, and computing the forward-backward consistency error $d_F=\left|F_f(p)+F_b\!\left(p+F_f(p)\right)\right|^2$;
(2-2) sparse point sampling: dividing the image into 10 × 10 grid regions and, in each region, taking the first 20 sets of sparse matching points whose $d_F$ is less than the threshold $\delta$.
5. The design method of the monocular visual odometer based on unsupervised deep learning of claim 4, wherein step three specifically comprises:
(3-1) model selection: computing the essential matrix and the homography matrix, and then computing the model score $R_F = S_F/(S_F+S_H)$, where $S_F$ and $S_H$ are the scores of the F and H models respectively; if $R_F>0.5$, selecting the 2D-2D tracking mode; otherwise, selecting the 3D-3D tracking mode;
(3-2) scale recovery: for the 2D-2D tracking mode, obtaining the normalized camera motion $[R\,|\,\hat{t}]$ from the essential matrix decomposition, where $R$ is the rotation matrix and $\hat{t}$ is a displacement vector of unit length; then performing scale alignment by triangulation and recovering the scale factor $s$ to obtain $T_{k\to k+1}=[R\,|\,s\hat{t}]$, where $T_{k\to k+1}$ denotes the camera motion and $s\hat{t}$ is the actual displacement obtained by scale recovery; for the 3D-3D tracking mode, using ICP to solve for $T_{k\to k+1}=[R\,|\,t]$, where $[R\,|\,t]$ denotes the camera motion and $t$ is the camera displacement vector.
6. The design method of the monocular visual odometer based on unsupervised deep learning of claim 5, wherein in step (3-1), inspired by the ORB-SLAM initialization method, two tracking modes, 2D-2D and 3D-3D, are considered; ORB-SLAM uses the model score method only for initial model selection and its tracking process solves the motion trajectory with a constant-velocity motion model and the PnP method; since both 2D-2D and 3D-3D correspondences are available here, only the model score $R_F$ is used to select the tracking mode; first, the homography matrix $H_{cr}$ and the essential matrix $F_{cr}$ are solved from

$$p_c = H_{cr}\,p_r,\qquad p_c^{\top} F_{cr}\,p_r = 0 \quad (9)$$

where $p_c$ is a matching point in the current frame and $p_r$ is the corresponding matching point in the reference frame of the two adjacent frames; the scores $S_H$ and $S_F$ are then computed for the H model and the F model respectively:

$$S_M=\sum_{i'}\left(\rho_M\!\left(d_{cr}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)+\rho_M\!\left(d_{rc}^2\!\left(p_c^{i'},p_r^{i'}\right)\right)\right),\qquad \rho_M\!\left(d^2\right)=\begin{cases}\Gamma-d^2, & d^2<T_M\\ 0, & d^2\geq T_M\end{cases} \quad (10)$$

where $M$ is $H$ or $F$; $\rho_M$ is the intermediate result of the model score $S_M$; $d^2$ denotes the symmetric transfer error; $\Gamma$, equal to $T_H$, is the invalid-data rejection threshold; $p_c^{i'}$ is the $i'$-th matching point of the current frame and $p_r^{i'}$ is the $i'$-th matching point of the reference frame; $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from the current frame to the reference frame and from the reference frame to the current frame respectively; $T_M$ is the distance threshold, set following ORB-SLAM to $T_H=5.99$ and $T_F=3.84$, with $\Gamma$ defined the same as $T_H$;

when the three-dimensional point cloud structure degenerates, scale-consistent depth information is obtained through the depth network, decomposition of the homography matrix is avoided, and $[R\,|\,t]$ is solved by the SVD method:

$$\min_{R,t}\ \frac{1}{n}\sum_{i'=1}^{n}\left\|P_k^{i'}-\left(R\,P_{k+1}^{i'}+t\right)\right\|^2 \quad (11)$$

where $n$ is the number of feature matches between the two adjacent images, $i'$ is the index of the matching point, $R$ is the camera rotation matrix, and $P_k^{i'}$ is the matching point $i'$ of the k-th image.
CN202210195358.9A 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning Pending CN114693720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210195358.9A CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210195358.9A CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Publications (1)

Publication Number Publication Date
CN114693720A true CN114693720A (en) 2022-07-01

Family

ID=82137606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210195358.9A Pending CN114693720A (en) 2022-02-28 2022-02-28 Design method of monocular vision odometer based on unsupervised deep learning

Country Status (1)

Country Link
CN (1) CN114693720A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115290084A (en) * 2022-08-04 2022-11-04 中国人民解放军国防科技大学 Visual inertia combined positioning method and device based on weak scale supervision
CN115290084B (en) * 2022-08-04 2024-04-19 中国人民解放军国防科技大学 Visual inertial combined positioning method and device based on weak scale supervision
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
WO2024051184A1 (en) * 2022-09-07 2024-03-14 南京逸智网络空间技术创新研究院有限公司 Optical flow mask-based unsupervised monocular depth estimation method
CN116309036A (en) * 2022-10-27 2023-06-23 杭州图谱光电科技有限公司 Microscopic image real-time stitching method based on template matching and optical flow method
CN116309036B (en) * 2022-10-27 2023-12-29 杭州图谱光电科技有限公司 Microscopic image real-time stitching method based on template matching and optical flow method
CN117392228A (en) * 2023-12-12 2024-01-12 华润数字科技有限公司 Visual mileage calculation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
US20210142095A1 (en) Image disparity estimation
CN108986037B (en) Monocular vision odometer positioning method and positioning system based on semi-direct method
CN114693720A (en) Design method of monocular vision odometer based on unsupervised deep learning
CN107564061B (en) Binocular vision mileage calculation method based on image gradient joint optimization
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
US9613420B2 (en) Method for locating a camera and for 3D reconstruction in a partially known environment
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN108010081B (en) RGB-D visual odometer method based on Census transformation and local graph optimization
Liu et al. Direct visual odometry for a fisheye-stereo camera
CN104537709A (en) Real-time three-dimensional reconstruction key frame determination method based on position and orientation changes
CN105869120A (en) Image stitching real-time performance optimization method
CN112750198B (en) Dense correspondence prediction method based on non-rigid point cloud
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN112556719B (en) Visual inertial odometer implementation method based on CNN-EKF
CN107808391B (en) Video dynamic target extraction method based on feature selection and smooth representation clustering
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN112037282B (en) Aircraft attitude estimation method and system based on key points and skeleton
CN112419411B (en) Realization method of vision odometer based on convolutional neural network and optical flow characteristics
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN104156933A (en) Image registering method based on optical flow field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination