US20140139635A1 - Real-time monocular structure from motion - Google Patents

Real-time monocular structure from motion Download PDF

Info

Publication number
US20140139635A1
US20140139635A1 US13/858,041 US201313858041A US2014139635A1 US 20140139635 A1 US20140139635 A1 US 20140139635A1 US 201313858041 A US201313858041 A US 201313858041A US 2014139635 A1 US2014139635 A1 US 2014139635A1
Authority
US
United States
Prior art keywords
points
pose
frame
epipolar
ground plane
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/858,041
Inventor
Manmohan Chandraker
Shiyu Song
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Laboratories America Inc
Original Assignee
NEC Laboratories America Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories America Inc filed Critical NEC Laboratories America Inc
Priority to US13/858,041 priority Critical patent/US20140139635A1/en
Publication of US20140139635A1 publication Critical patent/US20140139635A1/en
Priority to US15/965,480 priority patent/US10706582B2/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • H04N13/0203
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Definitions

  • the present invention relates to monocular structure from motion (SFM).
  • SLAM Simultaneous Localization And Mapping
  • systems and methods are disclosed for multithreaded visual odometry by acquired with a single camera on-board a vehicle; using 2D-3D correspondences for continuous pose estimation; and combining the pose estimation with 2D-2D epipolar search to replenish 3D points.
  • an accurate, robust and real-time large-scale SFM system for real-world autonomous outdoor driving application uses multithreaded architectures for SFM that allow handling large motions and rapidly changing imagery for fast-moving vehicles.
  • the system includes parallel epipolar search for extensive validation of feature matches over long tracks and a keyframe architecture that allows insertion at low cost. This allows robust operation of the system at 30 fps on the average, with output guaranteed at every frame within 50 ms.
  • the system estimates the height of the ground plane at every frame.
  • Cues for ground plane estimation include triangulated 3D points and plane-guided dense stereo matching. These cues are combined in a flexible Kalman filtering framework, which is trained rigorously to operate with the correct empirical covariances.
  • the system makes judicious use of a multithreaded design to ensure that motion estimates (and the underlying structure variables) become available only after extensive validation with long-range constraints and thorough bundle adjustments, but without delay.
  • the system is optimized for worst-case timing scenarios, rather than the average-case optimization for most traditional systems.
  • the multithreaded system produces pose outputs in at most 50 ms, regardless of whether a keyframe is added or scale correction performed.
  • the average frame rate of the system is much higher, at above 30 fps.
  • the system provides a real-time, accurate, large-scale monocular visual odometry system for real-world autonomous outdoor driving applications.
  • the architecture of the system addresses the challenge of robust multithreading even for scenes with large motions and rapidly changing imagery.
  • the design is extensible for three or more parallel CPU threads.
  • the system uses 3D-2D correspondences for robust pose estimation across all threads, followed by local bundle adjustment in the primary thread.
  • epipolar search operates in parallel in other threads to generate new 3D points at every frame. This significantly boosts robustness and accuracy, since only extensively validated 3D points with long tracks are inserted at keyframes.
  • Fast-moving vehicles also necessitate immediate global bundle adjustment, which is triggered by the instant keyframe design in parallel with pose estimation in a thread-safe architecture.
  • Scale drift is corrected using a mechanism that detects (rather than assumes) local planarity of the road by combining information from triangulated 3D points and the inter-image planar homography.
  • the system is optimized to output pose within 50 ms in the worst case, while average case operation is over 30 fps. Evaluations are presented on the challenging KITTI dataset for autonomous driving, where the system achieves better rotation and translation accuracy than other systems.
  • FIG. 1 shows an exemplary multithreaded architecture with 2D-3D correspondences for continuous pose estimation.
  • FIG. 2 shows an exemplary pose-guided matching module to provide fast 3D-2D correspondences.
  • FIG. 3 shows an exemplary epipolar constrained search.
  • FIG. 4 shows an exemplary local bundle module.
  • FIG. 5 shows an exemplary multithreaded keyframe architecture to handle insertion of new 3D points in the main thread.
  • FIG. 6 shows an exemplary collection and refining module that allows bundle adjustment using long tracks.
  • FIG. 7 shows an exemplary real-time global bundle adjustment module in a thread-safe architecture with real-time pose estimation.
  • FIG. 8 shows in more details a process for ground plane estimation with 3D points 301 .
  • FIG. 9 shows an exemplary process for ground plane estimation with guided dense stereo.
  • FIG. 10 shows an exemplary process for real-time lane detection.
  • FIG. 11 shows an exemplary process for scale correction.
  • FIG. 1 shows a monocular SFM embodiment with performance equal to or better than that of stereo SFM systems.
  • the system design entails a multithreaded architecture. Camera poses are determined using 2D-3D correspondences, where a dedicated epipolar search thread continually updates candidate 3D points.
  • a keyframe design ensures that the epipolar search supplies long range constraints to a bundle adjustment module operating in parallel. This allows the system to utilize only points that have undergone extensive validation and output refined poses in real-time, at every frame—in contrast to browsing systems like PTAM that perform epipolar search on-demand and delayed refinement.
  • the monocular SFM handles scale drift, which does not occur in fixed baseline stereo SFM.
  • the most popular SLAM approach for scale correction is loop closure.
  • it is impractical for autonomous driving applications, since one cannot rely on the presence of loops in general road conditions.
  • applications like scene understanding and driver safety cannot compromise on accuracy or timing between loops closures.
  • the system achieves a high degree of robustness through an extensive, principled scale correction mechanism that relies on real-time estimates of camera height from the ground plane.
  • Ground plane estimation is itself a challenging problem, due to the lack of reliable texture on the road.
  • the system counters this by combining cues from multiple methods in a rigorously operated, but flexible, Kalman filter framework.
  • the cues used are triangulated 3D points and a dense stereo matching in real-time to find the optimal planar homography mapping between the road in two frames. While each of these methods is unreliable on its own, a well-designed Kalman filter successfully combines them to produce robust estimates. This allows the system to greatly improve and complete the KITTI test sequences with low errors.
  • FIG. 1 shows an exemplary multithreaded architecture with 2D-3D correspondences for continuous pose estimation.
  • One embodiment is a multithreaded structure-from-motion (SFM) architecture 100 .
  • the steady-state architecture 100 includes a pose local bundle adjustment (LBA) system 101 / 103 and an epipolar search unit 102 .
  • the multithreaded architecture uses 2D-3D correspondences for continuous pose estimation in 101 / 103 . This is combined with 2D-2D epipolar search to replenish 3D points.
  • This architecture allows camera pose estimation and 3D reconstruction using fast moving imagery acquired with a single camera on-board a vehicle, thereby enabling autonomous driving applications.
  • the output of the pose LBA is provided to a keyframe and keyframe+1 unit 201 / 202 .
  • Unit 201 provides a collection and refinding mechanism allows bundle adjustment using long tracks, as opposed to two-view estimations for previous works, while unit 202 handles the keyframe+1 where real-time global bundle adjustment is conducted in a thread-safe architecture with real-time pose estimation.
  • the system also includes a ground plane estimation unit 300 using 3D points 301 and guided dense stereo information 302 .
  • the system also performs lane detection 400 with real time line matching.
  • the result from ground plane estimation and lane detection are provided to a scale recovery 500 that uses a Kalman filter for cue combination in one embodiment.
  • the multithreaded architecture allows elegant extension to as many threads as desired. Besides speed advantages, multithreading also greatly contributes to the accuracy and robustness of the system.
  • a single-thread version of a system that relies on 2D-3D correspondences might update its stable point set by performing an epipolar search in the frame preceding a keyframe.
  • the support for the 3D points introduced by this mechanism is limited to just the triplet used for the circular matching and triangulation.
  • the system may supply 3D points with tracks of length up to the distance from the preceding keyframe.
  • the set of long tracks provided by the epipolar thread in the multithread system is far more likely to be free of outliers.
  • the above multithreaded architecture for monocular SFM is designed to meet the challenges of autonomous driving.
  • Real-time operation is achieved at 30 fps, with guaranteed refined pose output within 50 ms.
  • Robust scale correction is done by ground plane estimation using multiple cues in a flexible Kalman filter framework.
  • Data fusion is achieved with rigorous, data-driven computation of observation covariances for each cue.
  • the system achieves accuracy that outperforms or rivals state-of-the-art stereo systems in rotation and translation.
  • FIG. 2 shows one exemplary pose-guided matching module 100 ( FIG. 1 ) to provide fast 3D-2D correspondences.
  • the system has access to a stable set of 3D points.
  • the poses of the previous frames have undergone multiple non-linear optimizations and are considered accurate.
  • Each frame is processed by three modules: the pose module estimates the pose of the current frame and occupies all available threads; 2) the epipolar update module uses epipolar constrained search (ECS) and triangulation T to prepare new 3d points to be added; and 3) the local bundle adjustment (LBA) unit performs non-linear optimization over a sliding window of a few previous frames and takes place in the main thread.
  • ECS epipolar constrained search
  • T triangulation
  • LBA local bundle adjustment
  • Pose-guided matching with fast 3D-2D correspondences is supported by the architecture of FIG. 2 for every steady state frame.
  • the modules are depicted in their multithreading arrangement, in correct synchronization order but not to scale.
  • Camera images are provided to a detection module, then provided to a PGM unit for pose-guided matching, and then to a perspective n-point (PnP) pose estimation unit.
  • the outputs of the PnP unit are provided to the LBA unit, then to a finding unit R and an update motion model U.
  • the output of the PnP unit are also provided to an ECS unit for epipolar constrained search and then to a triangulation unit T.
  • the system extracts FAST corners with ORB descriptors and matches between consecutive frames using locality sensitive hashing (LSH). With sufficient baseline (around 5 frames), a set of 3D points is initialized by relative pose estimation, triangulation and bundle adjustment. Each frame during initialization is processed within 10 ms.
  • LSH locality sensitive hashing
  • the preceding poses have also undergone multiple non-linear refinements, so can be considered highly accurate.
  • the system architecture at every frame in steady state operation is illustrated in FIG. 1 .
  • the system can easily handle other choices for matching, in particular, it has achieved similar results using normalized cross-correlation (NCC) instead of ORB. But associating a descriptor like ORB with a 3D point can have ancillary benefits.
  • NCC normalized cross-correlation
  • the pose module is enough to maintain camera pose for extended periods.
  • scene points rapidly move out of view in outdoor applications and candidate sets of points usable for pose estimation must be continually updated in a thread of their own (rather than on-demand like PTAM).
  • a sliding window bundle adjustment operates in a parallel thread with the epipolar update module.
  • the publicly available SBA package [13] is used for bundle adjustment. Timings for epipolar update and local bundle adjustment are summarized in Table 1.
  • the pose module 101 is shown in FIG. 3 . As shown in FIG. 3 , the pose module has the following functions: detection of features and ORB descriptors; pose-guided matching (PGM); and pose estimation (PnP).
  • PGM pose-guided matching
  • PnP pose estimation
  • the pose module by itself would be sufficient to maintain the camera pose.
  • 3D scene points rapidly move out of view within a few frames.
  • the stable set of points used for pose computation must be continually updated, which is the task entrusted to the epipolar search module 102 of FIG. 1 .
  • FIG. 4 shows in more details the epipolar constrained search module 102 .
  • the epipolar search module is parallelized across two threads and follows pose estimation at each frame.
  • the mechanism for epipolar search is illustrated in FIG. 2 .
  • the system After pose computation at frame n, for every feature f 0 in the keyframe at location (x 0 , y 0 ), the system considers a square of side 2r e centered at (x 0 , y 0 ) in frame n.
  • the system considers the intersection region of this square with a rectilinear band p pixels wide, centered around the epipolar line corresponding to f 0 in frame n.
  • the ORB descriptors for all FAST corners that lie within this intersection region are compared to the descriptor for f 0 .
  • the closest match, f n is found in terms of Hamming distance.
  • This epipolar matching procedure is also repeated by computing the closest match to f n in frame n ⁇ 1, call it f n-1 .
  • a match is accepted only if f n-1 also matches f 0 . Note that only two sets of matches with respect to frames (0, n) and (n ⁇ 1, n) must be computed at the frame n, since the matches between (0, n ⁇ 1) have already been computed at frame n ⁇ 1.
  • the Hamming distance computation for 256-bit ORB descriptors in a region of interest is performed as a block, with a fast SSE implementation. To rapidly search for features that lie within the above region of interest, the detected features in an image are stored in a lookup table data structure.
  • the key into the table is the column index of the feature and within each bucket, features are stored in sorted row order. Across two threads, this allows circular matching for a triplet of images, with up to 500 features in each, in 10-15 ms. As opposed to brute-force searching, the lookup table results in speedups by up to a factor of 10, especially in the autonomous driving application where the images traditionally have wide aspect ratios (to cover greater field of view while limiting uninformative regions like sky).
  • the epipolar constrained search is implemented on a thread of its own to produce per-frame 2D-2D correspondences. For current frame n, only 3D points that are validated against all frames 1 to n ⁇ 1 are retained. Only persistent 3D points that survive for greater than L frames may be collected by the next keyframe.
  • the pose module by itself would be sufficient to maintain the camera pose over extended periods.
  • 3D scene points rapidly move out of view within a few frames.
  • the stable set of points used for pose computation must be continually updated, which is the task entrusted to the epipolar search module.
  • FIG. 4 shows an exemplary local bundle module 103 .
  • the local bundle adjustment refines cameras and 3D points.
  • Data structures are implemented to collect and refine 3D points from the epipolar thread.
  • the system implements a sliding window local bundle adjustment.
  • the key data structure is the local bundle cache, which is composed of a frame cache and a match cache.
  • the frame cache stores feature locations, descriptors and camera poses from the most recent N frames. It also stores images for those N frames, for display and debugging purposes.
  • N 10.
  • the match cache is a list of tables, one element corresponding to each frame. The key into the table is the identity of a 3D point visible in the frame and the stored entries are the identities of the corresponding 2D features in various frames.
  • One implementation uses the publicly available SBA package for bundle adjustment.
  • the motion model for predicting the pose of the next frame is also updated in this module.
  • the timings for the parallel epipolar update and local bundle adjustment modules are summarized in Table 2.
  • FIG. 6 shows an exemplary multithreaded keyframe architecture 201 to handle insertion of new 3D points in the main thread.
  • the system architecture for keyframes is similar to that of FIG. 1 , with the addition of a collection and refinding module C+R. It collates persistent 3D points tracked over at least frames in the epipolar thread and re-finds them in the current frame using the output of the pose module.
  • the LBA is now different from that for steady state, since its cache has been updated with 3D points and their corresponding 2D locations in all the relevant frames on the epipolar thread.
  • the modules that define operations at a keyframe are illustrated in FIG. 6 .
  • the collected 3D points must reside on a keyframe for all subsequent operations, so a re-finding operation is performed by projecting them using the estimated pose for the frame and finding the best ORB match in a circular region of radius 10 pixels.
  • a re-finding operation is performed by projecting them using the estimated pose for the frame and finding the best ORB match in a circular region of radius 10 pixels.
  • the existing stable 3D points, the collected 3D points from the epipolar thread, their projections in all the frames within the local bundle cache and the corresponding cameras undergo local bundle adjustment.
  • the bundle adjustment at keyframes differs from steady state operation, but adding long tracks into the bundle adjustment at keyframes is a reason the system can avoid more expensive multiview triangulation at each frame in the epipolar thread.
  • the refined 3D points are now ready to be added to the stable set.
  • the modules that define operations at the frame immediately after a keyframe are illustrated in FIG. 7 .
  • the pose module re-finds the (new) set of stable 3D points.
  • the pose module is now split across only two threads, in order to accommodate a global bundle adjustment in the main thread.
  • This bundle adjustment involves the previous K keyframes and their associated 3D points, in order to introduce long-range constraints to better optimize the newly added set of 3D points.
  • There are two reasons a more expensive bundle adjustment involving a much larger set of previous keyframes (or even the whole map) is not necessary to refine 3D points with long-range constraints.
  • the imagery in autonomous driving applications is fast moving and does not involve repetitions, so introducing more keyframes into the global bundle adjustment yields at best marginal benefits.
  • the goal is instantaneous pose output rather than map-building, so even keyframes are not afforded the luxury of delayed output. This is in contrast to parallel systems like where keyframes may produce a noticeable spike in the per-frame timings.
  • the previous K frames are provided to a GBA or global bundle adjustment unit.
  • the GBA unit usually finishes within the time consumed by the pose module.
  • the cache update module reconciles the 3D points modified by both PnP and GBA, before it is used by LBA. Following global bundle adjustment, the 3D coordinates of all the points are updated. Note that overlapping sets of 3D points are used by both global bundle adjustment and pose modules in parallel, however, both may also cause this set to change (PnP may reject 3D points that are outliers, while bundle adjustment may move the position of 3D points). To ensure thread safety, an update module is included that reconciles changes in the 3D point cloud from both the prior parallel modules.
  • the local bundle adjustment module which simply reads in 3D point identities, receives this updated set for optimization based on the N frames in the local bundle cache.
  • the epipolar search also makes use of the updated keyframe pose. While the keyframe pose has seen a global bundle adjustment, the pose of the subsequent frame has not. This does not cause any inconsistency in practice since poses tend to be much more stable than points—a camera is constrained by hundreds of points, but a point is visible only in a few cameras. Thereafter, the system resumes steady-state operation until the next keyframe, unless a recovery or firewall condition is triggered.
  • FIG. 8 shows in more details a process for ground plane estimation with 3D points 301 .
  • FIG. 9 shows an exemplary process for ground plane estimation with guided dense stereo.
  • FIG. 10 shows an exemplary process for real-time lane detection, and
  • FIG. 11 shows an exemplary process for scale correction.
  • a Plane-guided Dense Stereo technique is used.
  • the system assumes a region in the foreground (middle fifth of the lower third of the image) to be the road plane.
  • the stereo cost function computation determines the homography mapping between frames k and k+1 as
  • G R + 1 h ⁇ tn T .
  • Triangulated 3D Points are also used.
  • the system considers matched SIFT descriptors between frames k and k+1, computed within the above region of interest (ORB descriptors are not powerful enough for the low texture in this region).
  • ORB descriptors are not powerful enough for the low texture in this region.
  • one option is to estimate (h, n) using a 3-point RANSAC for plane-fitting, however, better results are obtained by assuming the camera pitch to be fixed at ⁇ .
  • the height difference ⁇ h is computed with respect to every other point.
  • the estimated ground plane height is the height of the point that maximizes the score
  • a natural framework is a Kalman filter.
  • the system performs a rigorous training to compute the involved covariances.
  • the Kalman filter model of state evolution is given by
  • x is the state variable
  • z the observation
  • Q and U are the covariances of the process and observation noise, respectively, that are assumed to be zero mean multivariate normal distributions.
  • the state transition matrix and the observation model are given by
  • the system makes the approximation that state variables are uncorrelated.
  • the system first fixes the values of n 1 and n 3 to the optimal values from dense stereo and for a range of h, then constructs the homography mapping from frame k to k+1, given by
  • a univariate Gaussian is now fit to the distribution of s and the variance ⁇ k s is recorded (a different ⁇ s is computed for h, n 1 and n 3 ).
  • the system fits a 1D Gaussian to the homography-mapped SAD scores to get the value of ⁇ s , corresponding to h, n 1 and n 3 .
  • the system can predict the value of ⁇ e s .
  • the covariance estimation for the method that uses triangulated 3D points differs from the stereo method, since the normal n is assumed known from the camera pitch and only the height h is an estimated entity.
  • the system computes the height error e k p with respect to the ground truth and the sum q defined in (1).
  • n 1 and n 3 are fixed to values from camera pitch angle, fixed variance estimates ⁇ p (n 1 ) and ⁇ p (n 3 ) are computed for them, as the variance of the errors in n 1 and n 3 with respect to ground truth.
  • the value of q is computed by the ground plane estimation using (1) and the corresponding value of ⁇ p (h) is estimated from the above line-fit.
  • U 1 diag( ⁇ p (n 1 ), ⁇ p (n 3 ), ⁇ p (h)).
  • the multithreaded system enables large-scale, real-time, monocular visual odometry, targeted towards autonomous driving applications with fast-changing imagery.
  • the multithreaded design can boost both the speed and accuracy for handling challenging road conditions.
  • the system is optimized to provide pose output in real-time at every frame, without delays for keyframe insertion or global bundle adjustment. This is achieved through a per-frame epipolar search mechanism that generates redundantly validated 3D points persistent across long tracks and an efficient keyframe architecture to perform online thread-safe global bundle adjustment in parallel with pose computation. Further, the system is accurate enough to require only occasional scale correction, for which an automatic mechanism is provided that detects planarity of the ground to compute reliable scale factors.
  • Multithreaded bundle adjustment can be optimized for small-sized problems that arise in autonomous driving applications. Real-time detection of pedestrians and cars can also be done for better handling of crowded scenes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

Systems and methods are disclosed for multithreaded navigation assistance by acquired with a single camera on-board a vehicle; using 2D-3D correspondences for continuous pose estimation; and combining the pose estimation with 2D-2D epipolar search to replenish 3D points.

Description

  • This application claims priority to Provisional Application Ser. Nos. 61/701,877 filed on Sep. 17, 2012 and 61/725,733 filed on Nov. 13, 2012, the content of which is incorporated by reference.
  • BACKGROUND
  • The present invention relates to monocular structure from motion (SFM).
  • Autonomous driving faces unique challenges as a difficult corner case for SFM. Traditional systems for unordered image collections are inapplicable in such situations, since fundamentally, forward motion is an ill-posed situation. This is compounded by the fact that high vehicle speeds lead to rapidly changing imagery, so successful indoor systems like that rely on repeated observations of the same scene elements also break down. Further, the timing demands on autonomous driving systems are extremely stringent—a reliable camera pose is expected at every frame, with no luxury of delayed verifications or bundle adjustments.
  • While stereo Simultaneous Localization And Mapping (SLAM) systems routinely achieve high accuracy and real-time performance, the challenge remains daunting for monocular ones. Yet, monocular systems are attractive for the automobile industry since they are cheaper and the calibration effort is lower. Costs of consumer cameras have steadily declined in recent years, but cameras for practical visual odometry in automobiles are expensive since they are produced in lesser volume, must support high frame rates and be robust to extreme temperatures, weather and jitters.
  • The challenges of monocular visual odometry for autonomous driving are both fundamental and practical. For instance, it has been observed empirically and theoretically that forward motion with epipoles within the image is a “high error” situation for SFM. Vehicle speeds in outdoor environments can be high, so even with cameras that capture imagery at high frame rates, large motions may occur between consecutive frames. This places severe demands on an autonomous driving visual odometry system, necessitating extensive validation and refinement mechanisms that conventional systems do not require. The timing requirements for visual odometry in autonomous driving are equally stringent—a pose must be output at every frame in a fixed amount of time. For instance, traditional systems may produce a spike in timings when keyframes are added, or loop closure is performed.
  • SUMMARY
  • In one aspect, systems and methods are disclosed for multithreaded visual odometry by acquired with a single camera on-board a vehicle; using 2D-3D correspondences for continuous pose estimation; and combining the pose estimation with 2D-2D epipolar search to replenish 3D points.
  • In another aspect, an accurate, robust and real-time large-scale SFM system for real-world autonomous outdoor driving application is disclosed. The system uses multithreaded architectures for SFM that allow handling large motions and rapidly changing imagery for fast-moving vehicles. The system includes parallel epipolar search for extensive validation of feature matches over long tracks and a keyframe architecture that allows insertion at low cost. This allows robust operation of the system at 30 fps on the average, with output guaranteed at every frame within 50 ms. To resolve the scale ambiguity of monocular SFM, the system estimates the height of the ground plane at every frame. Cues for ground plane estimation include triangulated 3D points and plane-guided dense stereo matching. These cues are combined in a flexible Kalman filtering framework, which is trained rigorously to operate with the correct empirical covariances.
  • Advantages of the above aspect may include one or more of the following. The system makes judicious use of a multithreaded design to ensure that motion estimates (and the underlying structure variables) become available only after extensive validation with long-range constraints and thorough bundle adjustments, but without delay. Thus, the system is optimized for worst-case timing scenarios, rather than the average-case optimization for most traditional systems. In particular, the multithreaded system produces pose outputs in at most 50 ms, regardless of whether a keyframe is added or scale correction performed. The average frame rate of the system is much higher, at above 30 fps.
  • The system provides a real-time, accurate, large-scale monocular visual odometry system for real-world autonomous outdoor driving applications. The architecture of the system addresses the challenge of robust multithreading even for scenes with large motions and rapidly changing imagery. The design is extensible for three or more parallel CPU threads. The system uses 3D-2D correspondences for robust pose estimation across all threads, followed by local bundle adjustment in the primary thread. In contrast to prior work, epipolar search operates in parallel in other threads to generate new 3D points at every frame. This significantly boosts robustness and accuracy, since only extensively validated 3D points with long tracks are inserted at keyframes. Fast-moving vehicles also necessitate immediate global bundle adjustment, which is triggered by the instant keyframe design in parallel with pose estimation in a thread-safe architecture. To handle inevitable tracking failures, a recovery method is provided. Scale drift is corrected using a mechanism that detects (rather than assumes) local planarity of the road by combining information from triangulated 3D points and the inter-image planar homography. The system is optimized to output pose within 50 ms in the worst case, while average case operation is over 30 fps. Evaluations are presented on the challenging KITTI dataset for autonomous driving, where the system achieves better rotation and translation accuracy than other systems.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows an exemplary multithreaded architecture with 2D-3D correspondences for continuous pose estimation.
  • FIG. 2 shows an exemplary pose-guided matching module to provide fast 3D-2D correspondences.
  • FIG. 3 shows an exemplary epipolar constrained search.
  • FIG. 4 shows an exemplary local bundle module.
  • FIG. 5 shows an exemplary multithreaded keyframe architecture to handle insertion of new 3D points in the main thread.
  • FIG. 6 shows an exemplary collection and refining module that allows bundle adjustment using long tracks.
  • FIG. 7 shows an exemplary real-time global bundle adjustment module in a thread-safe architecture with real-time pose estimation.
  • FIG. 8 shows in more details a process for ground plane estimation with 3D points 301.
  • FIG. 9 shows an exemplary process for ground plane estimation with guided dense stereo.
  • FIG. 10 shows an exemplary process for real-time lane detection.
  • FIG. 11 shows an exemplary process for scale correction.
  • DESCRIPTION
  • While stereo-based SFM is commonly available in commercial products, the lack of a fixed baseline renders monocular SFM a far more daunting challenge. Yet, the monocular SFM is cost-effective in mass production, thus, an attractive proposition for the autonomous driving industry.
  • FIG. 1 shows a monocular SFM embodiment with performance equal to or better than that of stereo SFM systems. To meet the competing demands of accuracy and timing, the system design entails a multithreaded architecture. Camera poses are determined using 2D-3D correspondences, where a dedicated epipolar search thread continually updates candidate 3D points. A keyframe design ensures that the epipolar search supplies long range constraints to a bundle adjustment module operating in parallel. This allows the system to utilize only points that have undergone extensive validation and output refined poses in real-time, at every frame—in contrast to browsing systems like PTAM that perform epipolar search on-demand and delayed refinement.
  • The monocular SFM handles scale drift, which does not occur in fixed baseline stereo SFM. The most popular SLAM approach for scale correction is loop closure. However, it is impractical for autonomous driving applications, since one cannot rely on the presence of loops in general road conditions. Moreover, applications like scene understanding and driver safety cannot compromise on accuracy or timing between loops closures. The system achieves a high degree of robustness through an extensive, principled scale correction mechanism that relies on real-time estimates of camera height from the ground plane.
  • Ground plane estimation is itself a challenging problem, due to the lack of reliable texture on the road. The system counters this by combining cues from multiple methods in a rigorously operated, but flexible, Kalman filter framework. The cues used are triangulated 3D points and a dense stereo matching in real-time to find the optimal planar homography mapping between the road in two frames. While each of these methods is unreliable on its own, a well-designed Kalman filter successfully combines them to produce robust estimates. This allows the system to greatly improve and complete the KITTI test sequences with low errors.
  • An innovation central to the excellent performance of the data fusion is the correct computation of observation covariances—an aspect often overlooked by other SFM systems that use Kalman filters. This is achieved through an extensive training procedure over long distances of real-world driving sequences, that learns relationships between variance in ground plane estimation to the underlying variables for each method. Thus, during the actual operation of the filter, the system predicts meaningful empirical covariances that are combined to produce a reliable ground plane estimate.
  • FIG. 1 shows an exemplary multithreaded architecture with 2D-3D correspondences for continuous pose estimation. One embodiment is a multithreaded structure-from-motion (SFM) architecture 100. The steady-state architecture 100 includes a pose local bundle adjustment (LBA) system 101/103 and an epipolar search unit 102. The multithreaded architecture uses 2D-3D correspondences for continuous pose estimation in 101/103. This is combined with 2D-2D epipolar search to replenish 3D points. This architecture allows camera pose estimation and 3D reconstruction using fast moving imagery acquired with a single camera on-board a vehicle, thereby enabling autonomous driving applications.
  • The output of the pose LBA is provided to a keyframe and keyframe+1 unit 201/202. Unit 201 provides a collection and refinding mechanism allows bundle adjustment using long tracks, as opposed to two-view estimations for previous works, while unit 202 handles the keyframe+1 where real-time global bundle adjustment is conducted in a thread-safe architecture with real-time pose estimation.
  • The system also includes a ground plane estimation unit 300 using 3D points 301 and guided dense stereo information 302. The system also performs lane detection 400 with real time line matching. The result from ground plane estimation and lane detection are provided to a scale recovery 500 that uses a Kalman filter for cue combination in one embodiment.
  • The multithreaded architecture allows elegant extension to as many threads as desired. Besides speed advantages, multithreading also greatly contributes to the accuracy and robustness of the system. As an example, consider the epipolar contrained search. A single-thread version of a system that relies on 2D-3D correspondences might update its stable point set by performing an epipolar search in the frame preceding a keyframe. However, the support for the 3D points introduced by this mechanism is limited to just the triplet used for the circular matching and triangulation. By moving the epipolar search to a separate thread and performing the circular matching at every frame, the system may supply 3D points with tracks of length up to the distance from the preceding keyframe. Clearly, the set of long tracks provided by the epipolar thread in the multithread system is far more likely to be free of outliers.
  • The above multithreaded architecture for monocular SFM is designed to meet the challenges of autonomous driving. Real-time operation is achieved at 30 fps, with guaranteed refined pose output within 50 ms. Robust scale correction is done by ground plane estimation using multiple cues in a flexible Kalman filter framework. Data fusion is achieved with rigorous, data-driven computation of observation covariances for each cue. The system achieves accuracy that outperforms or rivals state-of-the-art stereo systems in rotation and translation.
  • FIG. 2 shows one exemplary pose-guided matching module 100 (FIG. 1) to provide fast 3D-2D correspondences. At steady state, the system has access to a stable set of 3D points. The poses of the previous frames have undergone multiple non-linear optimizations and are considered accurate. Each frame is processed by three modules: the pose module estimates the pose of the current frame and occupies all available threads; 2) the epipolar update module uses epipolar constrained search (ECS) and triangulation T to prepare new 3d points to be added; and 3) the local bundle adjustment (LBA) unit performs non-linear optimization over a sliding window of a few previous frames and takes place in the main thread.
  • Pose-guided matching with fast 3D-2D correspondences is supported by the architecture of FIG. 2 for every steady state frame. The modules are depicted in their multithreading arrangement, in correct synchronization order but not to scale. Camera images are provided to a detection module, then provided to a PGM unit for pose-guided matching, and then to a perspective n-point (PnP) pose estimation unit. The outputs of the PnP unit are provided to the LBA unit, then to a finding unit R and an update motion model U. The output of the PnP unit are also provided to an ECS unit for epipolar constrained search and then to a triangulation unit T.
  • To initialize, the system extracts FAST corners with ORB descriptors and matches between consecutive frames using locality sensitive hashing (LSH). With sufficient baseline (around 5 frames), a set of 3D points is initialized by relative pose estimation, triangulation and bundle adjustment. Each frame during initialization is processed within 10 ms.
  • At steady state, the system has access to a stable set of at least N s 3D points that have undergone extensive bundle adjustment in prior frames (for Ns=100). The preceding poses have also undergone multiple non-linear refinements, so can be considered highly accurate. The system architecture at every frame in steady state operation is illustrated in FIG. 1.
  • Around 2000 FAST corners with Shi-Tomasi filtering are extracted from a typical outdoors image and ORB descriptors are computed. Using the pose of the previous frame, the pose of the current frame is predicted, assuming constant velocity. The system explicitly computes the camera pose at each frame using correspondences, the motion model is only used as guidance to expedite matching. The existing set of stable 3D points are projected into the image using the predicted pose and the ORB descriptor for each is compared to those within a square of side 2rs pixels (for example rs=15). Given these 2D-3D point correspondences, the system computes the actual camera pose using perspective n-point (PnP) pose estimation in a RANSAC framework. The particular implementation used is EPnP with a model size of 4 points. The RANSAC pose with the largest consensus set is refined using a Levenberg-Marquardt nonlinear optimization.
  • The system can easily handle other choices for matching, in particular, it has achieved similar results using normalized cross-correlation (NCC) instead of ORB. But associating a descriptor like ORB with a 3D point can have ancillary benefits.
  • If the set of 3D points in application scenes remains unchanged, the pose module is enough to maintain camera pose for extended periods. However, unlike small workspace environments, scene points rapidly move out of view in outdoor applications and candidate sets of points usable for pose estimation must be continually updated in a thread of their own (rather than on-demand like PTAM).
  • For every feature f0 in the most recent keyframe at location (x0, y0), a square of side 2re centered at (x0+Δx, y0+Δy) in frame n is considered. The displacement (Δx, Δy) is computed based on the distance of (x0, y0) from the vanishing point, which is a strong cue in highway sequences. This vastly improves the feature matching performance when the vehicle is moving at high speeds.
  • The ORB descriptors for all FAST corners within the intersection of this square with a rectilinear band p pixels wide centered around the epipolar line corresponding to f0 in frame n are compared with the descriptor for f0. The closest Hamming match, fn, is found. A match is accepted only if there is also a matching feature in frame fn-1. Only two matches, (0, n) and (n−1, n), are needed at frame n, since matches (0, n−1) have already been computed for fn-1. Parameter values used are p=3 and re=min{1200PωP2, 10}, where ω is the differential rotation between frames.
  • The features matched in frame n are triangulated with respect to the most recent keyframe, which takes about 2 ms per frame. The 3D point is back-projected into all frames 1, . . . , n−1 and retained only if a matching ORB descriptor is found within a tight square of side 2rb pixels (rb=3). A two-view triangulation suffices instead of a more expensive multiview alternative, since the long tracks built by a dedicated epipolar search module crucially ensure that only the most reliable 3D points are inserted at keyframes, as described next.
  • A sliding window bundle adjustment operates in a parallel thread with the epipolar update module. A frame cache is used for storing feature locations, descriptors, and camera poses for the most recent N frames (N=10). Another new feature of the system is that it forces the previous keyframe to be in the local bundle cache if it is not already present. Since a criterion for keyframes to be added is that the pose has changed significantly, adding the previous keyframe allows the system to produce stable pose results even when the vehicle is not moving (or moving slowly). In the LBA module, the system is also given a chance to re-find 3D points temporarily lost due to artifacts like blur or specularities. The publicly available SBA package [13] is used for bundle adjustment. Timings for epipolar update and local bundle adjustment are summarized in Table 1.
  • TABLE 1
    Pose, epipolar update and local bundle module timings in steady state,
    with the latter two designed to operate in parallel.
    Module Operation Timing
    4*Pose FAST + Shi-Tomasi 1 ms
    ORB descriptor 5 ms
    Pose-guided matching 1 ms
    PnP (RANSAC, 500 15 ms
    iter.)
    Nonlin. pose refine 1 ms
    2*Epi. Update Constrained search 10-15 ms
    Triangulation 1-3 ms
    3*Local Bundle Windowed bundle adj. 10-20 ms
    Re-find 3D points 1 ms
    Update motion model 0 ms
  • The pose module 101 is shown in FIG. 3. As shown in FIG. 3, the pose module has the following functions: detection of features and ORB descriptors; pose-guided matching (PGM); and pose estimation (PnP).
  • If the application scenario involves scenes where the same set of 3D points is viewed for extended periods of time, then the pose module by itself would be sufficient to maintain the camera pose. However, in outdoor applications like autonomous navigation, 3D scene points rapidly move out of view within a few frames. Thus, the stable set of points used for pose computation must be continually updated, which is the task entrusted to the epipolar search module 102 of FIG. 1.
  • FIG. 4 shows in more details the epipolar constrained search module 102. As depicted in FIG. 4, the epipolar search module is parallelized across two threads and follows pose estimation at each frame. The mechanism for epipolar search is illustrated in FIG. 2. Let the most recent prior keyframe be frame 0. After pose computation at frame n, for every feature f0 in the keyframe at location (x0, y0), the system considers a square of side 2re centered at (x0, y0) in frame n. The system considers the intersection region of this square with a rectilinear band p pixels wide, centered around the epipolar line corresponding to f0 in frame n. The ORB descriptors for all FAST corners that lie within this intersection region are compared to the descriptor for f0. The closest match, fn, is found in terms of Hamming distance. This epipolar matching procedure is also repeated by computing the closest match to fn in frame n−1, call it fn-1. A match is accepted only if fn-1 also matches f0. Note that only two sets of matches with respect to frames (0, n) and (n−1, n) must be computed at the frame n, since the matches between (0, n−1) have already been computed at frame n−1.
  • In one embodiment, the parameter re is automatically determined by the size of the motion, the system uses re=min{1200∥ω∥2, 10}, where ω is the differential rotation between frames n−1 and n. Since pose estimates are highly accurate due to continuous refinement by bundle adjustment, epipolar lines are deemed accurate and a stringent value of p=3 can be used to impose the epipolar constraint. The Hamming distance computation for 256-bit ORB descriptors in a region of interest is performed as a block, with a fast SSE implementation. To rapidly search for features that lie within the above region of interest, the detected features in an image are stored in a lookup table data structure. The key into the table is the column index of the feature and within each bucket, features are stored in sorted row order. Across two threads, this allows circular matching for a triplet of images, with up to 500 features in each, in 10-15 ms. As opposed to brute-force searching, the lookup table results in speedups by up to a factor of 10, especially in the autonomous driving application where the images traditionally have wide aspect ratios (to cover greater field of view while limiting uninformative regions like sky).
  • The features that are circularly matched in frame n are triangulated with respect to the most recent keyframe (frame 0). This two-view triangulation requires approximately 2 ms per frame. The reconstructed 3D point is back-projected in all the frames 1, . . . , n−1 and is retained only if a matching ORB descriptor is found within a very tight square of side 2rb pixels (set rb=3). This acts as a replacement for a more accurate, but expensive, multiview triangulation and is satisfactory since epipolar search produces a large number of 3D points, but only the most reliable ones may be used for pose estimation. However, these 3D points are not added to the stable point cloud yet. For that they must first undergo a local bundle adjustment and be collected by the main thread at a keyframe, which are aspects explained in the following sections.
  • The epipolar constrained search is implemented on a thread of its own to produce per-frame 2D-2D correspondences. For current frame n, only 3D points that are validated against all frames 1 to n−1 are retained. Only persistent 3D points that survive for greater than L frames may be collected by the next keyframe.
  • The advantage of the above approach is that the system can construct long tracks, so when new 3D points are inserted, they are guaranteed to be accurate. To boost robustness, each 3D point is validated against all the frames in real-time, while prior systems could only do this in computational off-cycles.
  • If the application scenario involves scenes where the set of 3D points being viewed remains unchanged, then the pose module by itself would be sufficient to maintain the camera pose over extended periods. However, in outdoor applications like autonomous navigation, 3D scene points rapidly move out of view within a few frames. Thus, the stable set of points used for pose computation must be continually updated, which is the task entrusted to the epipolar search module.
  • FIG. 4 shows an exemplary local bundle module 103. The local bundle adjustment refines cameras and 3D points. Data structures are implemented to collect and refine 3D points from the epipolar thread.
  • To refine camera poses and 3D points incorporating information from multiple frames, the system implements a sliding window local bundle adjustment. The key data structure is the local bundle cache, which is composed of a frame cache and a match cache. The frame cache stores feature locations, descriptors and camera poses from the most recent N frames. It also stores images for those N frames, for display and debugging purposes. In the system, N=10. The match cache is a list of tables, one element corresponding to each frame. The key into the table is the identity of a 3D point visible in the frame and the stored entries are the identities of the corresponding 2D features in various frames.
  • The local bundle adjustment module also has other functions. After bundle adjustment, the system has a chance to re-find lost 3D points using the optimized pose. Since the system spends considerable effort in maintaining a high-quality set of 3D points for pose computation, it is worthwhile to incur a small overhead to recover any temporarily lost ones (due to image artifacts like blur, specularities or shadows). In fact, a stable 3D point is permanently discarded only when its projection using the current pose falls outside the image boundaries. Since the bundle adjusted pose is highly accurate, the system can perform re-finding by matching ORB descriptors on FAST corners within a very tight square of side 2rf pixels (with rf=10). This ensures re-finding is rapidly achieved within 1 ms.
  • One implementation uses the publicly available SBA package for bundle adjustment. In parallel, the motion model for predicting the pose of the next frame is also updated in this module. The timings for the parallel epipolar update and local bundle adjustment modules are summarized in Table 2.
  • TABLE 2
    Epipolar update and local bundle timings in steady state (parallel modules)
    Module Operation Timing
    2*Epipolar Constrained search 10-15 ms
    Update
    Triangulation 1-3 ms
    3*Local Bundle Windowed bundle 10-20 ms
    adjustment
    Re-find
    3D points 1 ms
    Update motion model 0 ms
  • FIG. 6 shows an exemplary multithreaded keyframe architecture 201 to handle insertion of new 3D points in the main thread. The system architecture for keyframes is similar to that of FIG. 1, with the addition of a collection and refinding module C+R. It collates persistent 3D points tracked over at least frames in the epipolar thread and re-finds them in the current frame using the output of the pose module. The LBA is now different from that for steady state, since its cache has been updated with 3D points and their corresponding 2D locations in all the relevant frames on the epipolar thread.
  • The system cannot maintain steady state indefinitely, since 3D points are gradually lost due to tracking failures or when they move out of the field of view. The latter is an important consideration in “forward moving” systems for autonomous driving (as opposed to “browsing” systems such as PTAM), so the role of keyframes is very important in keeping the system alive. The purpose of a keyframe is threefold:
      • Collect 3D points with long tracks from the epipolar thread, refine them with local bundle adjustment and add to the set of stable points in the main thread.
      • Trigger global bundle adjustment based on previous few keyframes that refines 3D points and keyframe poses.
      • Provide the frame where newly added 3D points “reside”.
  • The modules that define operations at a keyframe are illustrated in FIG. 6. The pose module remains unchanged from the steady state. It is followed by a collection stage, where 3D points triangulated at each frame in the epipolar thread are gathered by the main thread. Only persistent 3D points that stem from features matched over at least L frames are collected (our circular matching for epipolar search ensures this is easily achieved by seeking 3D points only from at least L frames after the previous keyframe). Note that this mechanism imposes two necessary conditions for a point to be considered for inclusion into the stable set—it must be visible in at least two keyframes and must be tracked over at least L frames. While stringent, these conditions inherently enhance the chances that only reliable 3D points are added into the main thread. In the system, L=3 regardless of environment.
  • The collected 3D points must reside on a keyframe for all subsequent operations, so a re-finding operation is performed by projecting them using the estimated pose for the frame and finding the best ORB match in a circular region of radius 10 pixels. Now the existing stable 3D points, the collected 3D points from the epipolar thread, their projections in all the frames within the local bundle cache and the corresponding cameras undergo local bundle adjustment. The bundle adjustment at keyframes differs from steady state operation, but adding long tracks into the bundle adjustment at keyframes is a reason the system can avoid more expensive multiview triangulation at each frame in the epipolar thread. The refined 3D points are now ready to be added to the stable set.
  • The modules that define operations at the frame immediately after a keyframe are illustrated in FIG. 7. The pose module re-finds the (new) set of stable 3D points. The pose module is now split across only two threads, in order to accommodate a global bundle adjustment in the main thread. This bundle adjustment involves the previous K keyframes and their associated 3D points, in order to introduce long-range constraints to better optimize the newly added set of 3D points. For the system, choosing K=5 allows the global bundle adjustment to finish within 15 ms. There are two reasons a more expensive bundle adjustment involving a much larger set of previous keyframes (or even the whole map) is not necessary to refine 3D points with long-range constraints. First, the imagery in autonomous driving applications is fast moving and does not involve repetitions, so introducing more keyframes into the global bundle adjustment yields at best marginal benefits. Second, the goal is instantaneous pose output rather than map-building, so even keyframes are not afforded the luxury of delayed output. This is in contrast to parallel systems like where keyframes may produce a noticeable spike in the per-frame timings.
  • In FIG. 7, the previous K frames are provided to a GBA or global bundle adjustment unit. The GBA unit usually finishes within the time consumed by the pose module. The cache update module reconciles the 3D points modified by both PnP and GBA, before it is used by LBA. Following global bundle adjustment, the 3D coordinates of all the points are updated. Note that overlapping sets of 3D points are used by both global bundle adjustment and pose modules in parallel, however, both may also cause this set to change (PnP may reject 3D points that are outliers, while bundle adjustment may move the position of 3D points). To ensure thread safety, an update module is included that reconciles changes in the 3D point cloud from both the prior parallel modules. The local bundle adjustment module, which simply reads in 3D point identities, receives this updated set for optimization based on the N frames in the local bundle cache. In parallel with local bundle adjustment, the epipolar search also makes use of the updated keyframe pose. While the keyframe pose has seen a global bundle adjustment, the pose of the subsequent frame has not. This does not cause any inconsistency in practice since poses tend to be much more stable than points—a camera is constrained by hundreds of points, but a point is visible only in a few cameras. Thereafter, the system resumes steady-state operation until the next keyframe, unless a recovery or firewall condition is triggered. The following sections explain those concepts in detail.
  • FIG. 8 shows in more details a process for ground plane estimation with 3D points 301. FIG. 9 shows an exemplary process for ground plane estimation with guided dense stereo. FIG. 10 shows an exemplary process for real-time lane detection, and FIG. 11 shows an exemplary process for scale correction.
  • Since scale information is lost in monocular SFM, an integral component of monocular visual odometry is scale correction. Traditional methods for scale correction include loop closure and estimating the height of the ground plane. For autonomous driving applications, since loop closure is an unlikely scenario, the latter method is used. The KITTI dataset provides the ground truth mounting of the camera height as 1.70 meters, with a camera pitch angle of θ=−0.03 radians. Multiple methods are used for ground plane estimation and a principled approach is used to combine the cues using a Kalman filter, whose process covariances are rigorously learned from training data.
  • A Plane-guided Dense Stereo technique is used. The system assumes a region in the foreground (middle fifth of the lower third of the image) to be the road plane. For a hypothesized value of (h,n), the stereo cost function computation determines the homography mapping between frames k and k+1 as
  • G = R + 1 h tn T .
  • Pixels in frame k+1 are mapped to frame k and the sum of absolute differences (SAD) is computed over bilinearly interpolated image intensities. A Nelder-Mead simplex routine is used to estimate the (h, n) that minimize this cost function. Note that the optimization only involves the three variables h, n1 and n3, since ∥n∥=1.
  • Triangulated 3D Points are also used. The system considers matched SIFT descriptors between frames k and k+1, computed within the above region of interest (ORB descriptors are not powerful enough for the low texture in this region). To fit a plane through the triangulated 3D points, one option is to estimate (h, n) using a 3-point RANSAC for plane-fitting, however, better results are obtained by assuming the camera pitch to be fixed at θ. For every triangulated 3D point, the height difference Δh is computed with respect to every other point. The estimated ground plane height is the height of the point that maximizes the score
  • q = i = 1 n - μΔ h , where μ = 50. ( 1 )
  • To combine the height estimates from various methods, a natural framework is a Kalman filter. The system performs a rigorous training to compute the involved covariances. The Kalman filter model of state evolution is given by

  • x k =Ax k-1 +w k-1 , p(w):N(0,Q),  (2)

  • z k =Hx k +v k-1 , p(v):N(0,U),  (3)
  • where x is the state variable, z the observation, while Q and U are the covariances of the process and observation noise, respectively, that are assumed to be zero mean multivariate normal distributions. In one case, state variable is simply the equation of the ground plane, thus, x=(n1, n2, n3, h)T. Since ∥n∥=1, n2 is determined by n1 and n3 and the observation is z=(n1, n3, h)T. Thus, the state transition matrix and the observation model are given by
  • A = [ R t 0 T 1 ] T , H = [ 1 0 0 0 0 0 1 0 0 0 0 1 ] . ( 4 )
  • If methods j=1, . . . , m are used for estimating the ground plane, each with its observation covariance Uj Then, with Uk −1j=1 mUj,k −1, the fusion equations at time instant k are
  • z k = U k j = 1 m U j , k - 1 z j , k , H k = U k j = 1 m U j , k - 1 H k . ( 5 )
  • Meaningful estimation of Uk at every frame, with the correctly proportional Uj,k for each cue, is essential to the success of a Kalman filter-based cue combination. In the following, a comprehensive training procedure estimates the various observation covariances.
  • For Dense Stereo, the system makes the approximation that state variables are uncorrelated. The system first fixes the values of n1 and n3 to the optimal values from dense stereo and for a range of h, then constructs the homography mapping from frame k to k+1, given by
  • R + 1 h tn T .
  • For each homography mapping, the system computes the SAD score corresponding to the road region using bilinearly interpolated image intensities and consider the value s=1−η−SAD, where η=1.5. A univariate Gaussian is now fit to the distribution of s and the variance σk s is recorded (a different σs is computed for h, n1 and n3).
  • For each frame, let ek s be the error of the ground plane height estimated from dense stereo alone. Then, the system considers the histogram of ek s with B=1000 bins over the variances σk s, with the bin centers positioned to match the density of σk s (that is, distribute roughly F/B error observations within each bin). The variances σe s corresponding to the es are computed within each bin b=1, . . . , B and a curve is fitted to the distribution of σe s versus σs. Empirically, a straight line suffices to produce a good fit. A similar process is repeated for n1 and n3.
  • During testing when the Kalman filter is in operation, the system fits a 1D Gaussian to the homography-mapped SAD scores to get the value of σs, corresponding to h, n1 and n3. Using the line-fit parameters estimated above, the system can predict the value of σe s. The covariance matrix for the dense stereo method is now available as U1=diag(σe s(n1),σe s(n3),σe s(h)).
  • The covariance estimation for the method that uses triangulated 3D points differs from the stereo method, since the normal n is assumed known from the camera pitch and only the height h is an estimated entity. During training, for various trial values of h at frame k, the system computes the height error ek p with respect to the ground truth and the sum q defined in (1). As in the case of dense stereo, a histogram is computed for the with B=1000 bins and approximately F/B observations of ek p are recorded at each bin, centered at qb, for b=1, . . . , B.
  • Since n1 and n3 are fixed to values from camera pitch angle, fixed variance estimates σp(n1) and σp(n3) are computed for them, as the variance of the errors in n1 and n3 with respect to ground truth. During testing, the value of q is computed by the ground plane estimation using (1) and the corresponding value of σp(h) is estimated from the above line-fit. The covariance matrix for this method in the Kalman filter data fusion is now available as U1=diag(σp(n1),σp(n3),σp(h)).
  • The multithreaded system enables large-scale, real-time, monocular visual odometry, targeted towards autonomous driving applications with fast-changing imagery. The multithreaded design can boost both the speed and accuracy for handling challenging road conditions. The system is optimized to provide pose output in real-time at every frame, without delays for keyframe insertion or global bundle adjustment. This is achieved through a per-frame epipolar search mechanism that generates redundantly validated 3D points persistent across long tracks and an efficient keyframe architecture to perform online thread-safe global bundle adjustment in parallel with pose computation. Further, the system is accurate enough to require only occasional scale correction, for which an automatic mechanism is provided that detects planarity of the ground to compute reliable scale factors. Multithreaded bundle adjustment can be optimized for small-sized problems that arise in autonomous driving applications. Real-time detection of pedestrians and cars can also be done for better handling of crowded scenes.

Claims (26)

What is claimed is:
1. A method for vehicular navigation, comprising
acquiring images with a single camera on-board a vehicle;
using 2D-3D correspondences for continuous pose estimation on a multithreaded processor;
combining the pose estimation with 2D-2D epipolar search to replenish 3D points with the multithreaded processor to determine visual odometry; and
applying the visual odometry for driving the vehicle.
2. The method of claim 1, comprising pose-guided matching to provide fast 3D-2D correspondences.
3. The method of claim 1, comprising performing epipolar constrained search to produce per-frame 2D-2D correspondences.
4. The method of claim 1, comprising performing vanishing point detection to hypothesize a feature match search window along one or more radial lines from the VP for pruning mismatches due to repeated features.
5. The method of claim 1, comprising validating each 3D point against all frames in real-time, performing local bundle adjustment to refine cameras and 3D points, and collecting and refining 3D points from an epipolar thread.
6. The method of claim 1, comprising inserting new 3D points in a main thread.
7. The method of claim 1, comprising collecting and refinding to allow bundle adjustment using long tracks.
8. The method of claim 1, comprising performing real-time global bundle adjustment in a thread-safe architecture with real-time pose estimation.
9. The method of claim 1, comprising scale correcting by combining scale estimates from 3D points and planar homography mappings.
10. The method of claim 1, comprising estimating a camera height from the ground plane to solve scale ambiguity in monocular SFM.
11. The method of claim 1, comprising using a 1-point RANSAC method to estimate a height of a best-fitting plane to scattered 3D points that are hypothesized to lie on the ground.
12. The method of claim 1, comprising using a homography-guided dense stereo process to recover height and normal of a best fit ground plane.
13. The method of claim 1, comprising detecting a road lane in real-time as another cue for ground plane height and for driver and vehicle safety applications.
14. The method of claim 1, comprising applying a Kalman filter to combine cues for one or more ground plane height estimates.
15. The method of claim 1, comprising performing dense guided stereo estimation, where for a hypothesized value of (h,n), a stereo cost function determines a homography mapping between frames k and k+1 as
G = R + 1 h tn T ,
where pixels in frame k+1 are mapped to frame k and a sum of absolute differences (SAD) is computed over bilinearly interpolated image intensities.
16. The method of claim 1, wherein for each homography mapping, determining a SAD score corresponding to the road region using bilinearly interpolated image intensities and consider the value s=1−η−SAD.
17. The method of claim 1, comprising performing triangulation on 3D points with matched SIFT descriptors between frames k and k+1, computed within a region of interest.
18. The method of claim 1, comprising determining height estimates with a Kalman filter and training the Kalman filter to determining covariances.
19. The method of claim 18, comprising estimating the ground plane with techniques j=1, . . . , m, each with observation covariance Uj and with Uk −1j=1 mUj,k −1, determining fusion equations at time instant k as
z k = U k j = 1 m U j , k - 1 z j , k , H k = U k j = 1 m U j , k - 1 H k .
20. A system for navigation, comprising
a single camera on-board a vehicle; and
a multi-threaded processor coupled to the camera, the multi-threaded processor using 2D-3D correspondences for continuous pose estimation and combining the pose estimation with 2D-2D epipolar search to replenish 3D points and applying visual odometry for driving the vehicle.
21. The system of claim 20, comprising code for scale correcting by combining scale estimates from 3D points and planar homography mappings.
22. The system of claim 20, comprising code for pose-guided matching to provide fast 3D-2D correspondences.
23. The system of claim 20, comprising code to perform epipolar constrained search to produce per-frame 2D-2D correspondences.
24. The system of claim 20, comprising code for estimating a camera height from the ground plane to solve scale ambiguity in monocular SFM.
25. The system of claim 20, comprising code for detecting a road lane in real-time as another cue for ground plane height and for driver and vehicle safety applications.
26. The system of claim 20, comprising code for applying a Kalman filter to combine cues for one or more ground plane height estimates.
US13/858,041 2012-09-17 2013-04-06 Real-time monocular structure from motion Abandoned US20140139635A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/858,041 US20140139635A1 (en) 2012-09-17 2013-04-06 Real-time monocular structure from motion
US15/965,480 US10706582B2 (en) 2012-09-17 2018-04-27 Real-time monocular structure from motion

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201261701877P 2012-09-17 2012-09-17
US201261725733P 2012-11-13 2012-11-13
US13/858,041 US20140139635A1 (en) 2012-09-17 2013-04-06 Real-time monocular structure from motion

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/965,480 Continuation-In-Part US10706582B2 (en) 2012-09-17 2018-04-27 Real-time monocular structure from motion

Publications (1)

Publication Number Publication Date
US20140139635A1 true US20140139635A1 (en) 2014-05-22

Family

ID=50274054

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/858,040 Active 2034-03-04 US9148650B2 (en) 2012-09-17 2013-04-06 Real-time monocular visual odometry
US13/858,041 Abandoned US20140139635A1 (en) 2012-09-17 2013-04-06 Real-time monocular structure from motion

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/858,040 Active 2034-03-04 US9148650B2 (en) 2012-09-17 2013-04-06 Real-time monocular visual odometry

Country Status (1)

Country Link
US (2) US9148650B2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140729A1 (en) * 2014-11-04 2016-05-19 The Regents Of The University Of California Visual-inertial sensor fusion for navigation, localization, mapping, and 3d reconstruction
US9678210B2 (en) 2014-12-19 2017-06-13 Caterpillar Inc. Error estimation in real-time visual odometry system
US20170186177A1 (en) * 2015-12-29 2017-06-29 Texas Instruments Incorporated Method and System for Real Time Structure From Motion in a Computer Vision System
CN107356252A (en) * 2017-06-02 2017-11-17 青岛克路德机器人有限公司 A kind of Position Method for Indoor Robot for merging visual odometry and physics odometer
CN107421566A (en) * 2017-08-25 2017-12-01 北京理工大学 A kind of unmanned vehicle Multiple Source Sensor information emulator platform
FR3060115A1 (en) * 2016-12-14 2018-06-15 Commissariat A L'energie Atomique Et Aux Energies Alternatives LOCATION OF A VEHICLE
CN109255316A (en) * 2018-08-30 2019-01-22 深圳市路畅科技股份有限公司 A kind of lane shift detection method and system
CN109272532A (en) * 2018-08-31 2019-01-25 中国航空工业集团公司沈阳空气动力研究所 Model pose calculation method based on binocular vision
US10424070B2 (en) * 2016-04-21 2019-09-24 Texas Instruments Incorporated Methods and apparatus for structure from motion estimation
US10634777B2 (en) * 2018-05-30 2020-04-28 Ford Global Technologies, Llc Radar odometry for vehicle
US10957062B2 (en) 2018-10-31 2021-03-23 Bentley Systems, Incorporated Structure depth-aware weighting in bundle adjustment
WO2022033066A1 (en) * 2020-08-12 2022-02-17 华为技术有限公司 Target detection method and device

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6396499B2 (en) * 2014-05-27 2018-09-26 アプリケーション・ソリューションズ・(エレクトロニクス・アンド・ヴィジョン)・リミテッド Scale measurement of 3D information
DE102016211805A1 (en) 2015-10-09 2017-04-13 Volkswagen Aktiengesellschaft Fusion of position data using poses graph
US20170171525A1 (en) 2015-12-14 2017-06-15 Sony Corporation Electronic system including image processing unit for reconstructing 3d surfaces and iterative triangulation method
US10156441B2 (en) 2016-01-05 2018-12-18 Texas Instruments Incorporated Ground plane estimation in a computer vision system
EP3432291A4 (en) * 2016-03-15 2019-03-27 Ricoh Company, Ltd. Image processing apparatus, object recognition apparatus, device control system, image processing method, and program
DE102016124888A1 (en) * 2016-12-20 2018-06-21 Connaught Electronics Ltd. A method for assisting a driver of a motor vehicle when parking using a driver assistance device, corresponding computer program product and driver assistance device
US11348274B2 (en) 2017-01-23 2022-05-31 Oxford University Innovation Limited Determining the location of a mobile device
JP7221203B2 (en) 2017-01-23 2023-02-13 オックスフォード ユニヴァーシティ イノヴェーション リミテッド Mobile device location method
US10269148B2 (en) 2017-05-01 2019-04-23 Lockheed Martin Corporation Real-time image undistortion for incremental 3D reconstruction
US10269147B2 (en) 2017-05-01 2019-04-23 Lockheed Martin Corporation Real-time camera position estimation with drift mitigation in incremental structure from motion
US10977818B2 (en) * 2017-05-19 2021-04-13 Manor Financial, Inc. Machine learning based model localization system
US10282860B2 (en) 2017-05-22 2019-05-07 Honda Motor Co., Ltd. Monocular localization in urban environments using road markings
CN109325978B (en) * 2017-07-31 2022-04-05 深圳市腾讯计算机系统有限公司 Augmented reality display method, and attitude information determination method and apparatus
EP3486871B1 (en) 2017-11-16 2021-05-05 Veoneer Sweden AB A vision system and method for autonomous driving and/or driver assistance in a motor vehicle
CN109819178B (en) * 2017-11-21 2022-07-08 虹软科技股份有限公司 Method and device for frame processing
US10916135B2 (en) 2018-01-13 2021-02-09 Toyota Jidosha Kabushiki Kaisha Similarity learning and association between observations of multiple connected vehicles
US10586118B2 (en) 2018-01-13 2020-03-10 Toyota Jidosha Kabushiki Kaisha Localizing traffic situation using multi-vehicle collaboration
US10963706B2 (en) 2018-01-13 2021-03-30 Toyota Jidosha Kabushiki Kaisha Distributable representation learning for associating observations from multiple vehicles
CN108257161B (en) * 2018-01-16 2021-09-10 重庆邮电大学 Multi-camera-based vehicle environment three-dimensional reconstruction and motion estimation system and method
CN113139456A (en) * 2018-02-05 2021-07-20 浙江商汤科技开发有限公司 Electronic equipment state tracking method and device, electronic equipment and control system
CN109934108B (en) * 2019-01-31 2023-01-31 华南师范大学 Multi-target and multi-type vehicle detection and distance measurement system and implementation method
CN109933856B (en) * 2019-02-19 2023-07-07 百度在线网络技术(北京)有限公司 Automatic driving decision planning data reproduction method and device
CN109885929B (en) * 2019-02-19 2023-07-07 百度在线网络技术(北京)有限公司 Automatic driving decision planning data reproduction method and device
CN110533724B (en) * 2019-09-06 2021-10-22 电子科技大学 Computing method of monocular vision odometer based on deep learning and attention mechanism
US11435756B2 (en) * 2019-12-01 2022-09-06 Nvidia Corporation Visual odometry in autonomous machine applications
CN112902911B (en) * 2019-12-03 2023-06-20 上海高德威智能交通系统有限公司 Ranging method, device, equipment and storage medium based on monocular camera
US11640675B2 (en) 2019-12-13 2023-05-02 Toyota Motor Engineering & Manufacturing North America, Inc. Methods and systems for interpreting traffic scenes
US11688090B2 (en) 2021-03-16 2023-06-27 Toyota Research Institute, Inc. Shared median-scaling metric for multi-camera self-supervised depth evaluation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPP299498A0 (en) * 1998-04-15 1998-05-07 Commonwealth Scientific And Industrial Research Organisation Method of tracking and sensing position of objects
US8340349B2 (en) * 2006-06-20 2012-12-25 Sri International Moving target detection in the presence of parallax
US8442304B2 (en) * 2008-12-29 2013-05-14 Cognex Corporation System and method for three-dimensional alignment of objects using machine vision
US8259994B1 (en) * 2010-09-14 2012-09-04 Google Inc. Using image and laser constraints to obtain consistent and improved pose estimates in vehicle pose databases
US8761439B1 (en) * 2011-08-24 2014-06-24 Sri International Method and apparatus for generating three-dimensional pose using monocular visual sensor and inertial measurement unit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
I.Markelic, et.al., "The Driving School System: Learning Basic Driving Skills From a Teacher in a Real Car," IEEE Transactions on Intelligent Transportation Systems, Vol. 12, NO. 4 , December 2011. *
J. Arrospide L. Salgado M. Nieto R. Mohedano, "Homography-based ground plane detection using a single on-board camera," IET Intell. Transp. Syst., 2010, Vol. 4, Iss. 2, pp. 149-160. *
Paul Wohlhart, Peter M. Roth, and Horst Bischof 3D Camera Tracking in Unknown Environments by On-line Keypoint Learning “” Computer Vision Winter Workshop 2010, Libor Spacek, Vojtech Franc (eds.) Nove Hrady, Czech Republic, (February 3–5) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140729A1 (en) * 2014-11-04 2016-05-19 The Regents Of The University Of California Visual-inertial sensor fusion for navigation, localization, mapping, and 3d reconstruction
US20190236399A1 (en) * 2014-11-04 2019-08-01 The Regents Of The University Of California Visual-inertial sensor fusion for navigation, localization, mapping, and 3d reconstruction
US9678210B2 (en) 2014-12-19 2017-06-13 Caterpillar Inc. Error estimation in real-time visual odometry system
US10186024B2 (en) * 2015-12-29 2019-01-22 Texas Instruments Incorporated Method and system for real time structure from motion in a computer vision system
CN108352071A (en) * 2015-12-29 2018-07-31 德州仪器公司 Method for the motion structure processing in computer vision system
WO2017127220A1 (en) * 2015-12-29 2017-07-27 Texas Instruments Incorporated Method for structure from motion processing in a computer vision system
JP7222582B2 (en) 2015-12-29 2023-02-15 テキサス インスツルメンツ インコーポレイテッド Method for structure-from-motion (SfM) processing in computer vision systems
JP2021119477A (en) * 2015-12-29 2021-08-12 テキサス インスツルメンツ インコーポレイテッド METHOD FOR STRUCTURE FROM MOTION (SfM) PROCESSING IN COMPUTER-VISION SYSTEM
EP3398166A4 (en) * 2015-12-29 2019-06-26 Texas Instruments Incorporated Method for structure from motion processing in a computer vision system
US20170186177A1 (en) * 2015-12-29 2017-06-29 Texas Instruments Incorporated Method and System for Real Time Structure From Motion in a Computer Vision System
US10424070B2 (en) * 2016-04-21 2019-09-24 Texas Instruments Incorporated Methods and apparatus for structure from motion estimation
FR3060115A1 (en) * 2016-12-14 2018-06-15 Commissariat A L'energie Atomique Et Aux Energies Alternatives LOCATION OF A VEHICLE
WO2018109384A1 (en) * 2016-12-14 2018-06-21 Commissariat A L'energie Atomique Et Aux Energies Alternatives Locating a vehicle
CN107356252A (en) * 2017-06-02 2017-11-17 青岛克路德机器人有限公司 A kind of Position Method for Indoor Robot for merging visual odometry and physics odometer
CN107421566A (en) * 2017-08-25 2017-12-01 北京理工大学 A kind of unmanned vehicle Multiple Source Sensor information emulator platform
US10634777B2 (en) * 2018-05-30 2020-04-28 Ford Global Technologies, Llc Radar odometry for vehicle
CN109255316A (en) * 2018-08-30 2019-01-22 深圳市路畅科技股份有限公司 A kind of lane shift detection method and system
CN109272532A (en) * 2018-08-31 2019-01-25 中国航空工业集团公司沈阳空气动力研究所 Model pose calculation method based on binocular vision
US10957062B2 (en) 2018-10-31 2021-03-23 Bentley Systems, Incorporated Structure depth-aware weighting in bundle adjustment
WO2022033066A1 (en) * 2020-08-12 2022-02-17 华为技术有限公司 Target detection method and device

Also Published As

Publication number Publication date
US9148650B2 (en) 2015-09-29
US20140078258A1 (en) 2014-03-20

Similar Documents

Publication Publication Date Title
US20140139635A1 (en) Real-time monocular structure from motion
US10706582B2 (en) Real-time monocular structure from motion
Mur-Artal et al. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras
US9070202B2 (en) Moving object localization in 3D using a single camera
Lianos et al. Vso: Visual semantic odometry
Engel et al. Large-scale direct SLAM with stereo cameras
Cvišić et al. SOFT‐SLAM: Computationally efficient stereo visual simultaneous localization and mapping for autonomous unmanned aerial vehicles
Song et al. Parallel, real-time monocular visual odometry
US10762643B2 (en) Method for evaluating image data of a vehicle camera
Kneip et al. Robust real-time visual odometry with a single camera and an IMU
CN111445526B (en) Method, device and storage medium for estimating pose of image frame
Herrera et al. Dt-slam: Deferred triangulation for robust slam
Liu et al. Towards robust visual odometry with a multi-camera system
Maity et al. Edge slam: Edge points based monocular visual slam
WO2016035324A1 (en) Method for estimating motion, mobile agent and non-transitory computer-readable medium encoded with a computer program code for causing a processor to execute a method for estimating motion
Gräter et al. Robust scale estimation for monocular visual odometry using structure from motion and vanishing points
Sucar et al. Bayesian scale estimation for monocular slam based on generic object detection for correcting scale drift
US9189689B2 (en) Robust scale estimation in real-time monocular SFM for autonomous driving
WO2015043872A1 (en) Semi-dense simultaneous localization and mapping
Liu et al. Direct visual odometry for a fisheye-stereo camera
Zhu Image Gradient-based Joint Direct Visual Odometry for Stereo Camera.
Chen et al. A stereo visual-inertial SLAM approach for indoor mobile robots in unknown environments without occlusions
CN112802096A (en) Device and method for realizing real-time positioning and mapping
Barnada et al. Estimation of automotive pitch, yaw, and roll using enhanced phase correlation on multiple far-field windows
EP3718302B1 (en) Method and system for handling 360 degree image content

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION