WO2021243281A1 - System and method for visual odometry from log-logistic dense optical flow residuals - Google Patents

System and method for visual odometry from log-logistic dense optical flow residuals

Info

Publication number
WO2021243281A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
optical flow
depth
estimators
probabilistic model
Prior art date
Application number
PCT/US2021/034981
Other languages
French (fr)
Other versions
WO2021243281A9 (en)
Inventor
Enrique DUNN
Zhixiang MIN
Original Assignee
The Trustees Of The Stevens Institute Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of The Stevens Institute Of Technology filed Critical The Trustees Of The Stevens Institute Of Technology
Publication of WO2021243281A1 publication Critical patent/WO2021243281A1/en
Publication of WO2021243281A9 publication Critical patent/WO2021243281A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/77 Determining position or orientation of objects or cameras using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

A method for calculating camera poses and dense scene structure via probabilistic dense indirect visual odometry is disclosed. It may be used in various video analysis applications. Through use of a probabilistic model relating the dependencies among camera movement, scene geometry and the input optical flow estimates, a generalized expectation maximization inference framework may be employed. The inference is built around an empirically validated error model for the observations.

Description

SYSTEM AND METHOD FOR VISUAL ODOMETRY FROM LOG-LOGISTIC DENSE OPTICAL FLOW RESIDUALS

CROSS-REFERENCE TO RELATED APPLICATION

[001] This application claims priority to U.S. Provisional Patent Application Serial No. 63/031,374 filed May 28, 2020, the entire disclosure of which is incorporated herein by reference.

FIELD OF THE INVENTION

[002] The present invention relates to the estimation of the 3D movements of a camera from the analysis of image content in video analysis, and, more specifically, to methods of recovering camera locations and dense scene structure from video recordings.

BACKGROUND OF THE INVENTION

[003] Visual odometry (VO) is a fundamental problem in computer vision, addressing the recovery of camera poses from an input video sequence, which supports applications such as augmented reality, robotics and autonomous driving. Traditional indirect VO methods rely on the geometric analysis of sparse key-point correspondences to determine multi-view relationships among the input video frames. However, by virtue of relying on local feature detection and correspondence pre-processing modules, indirect methods pose the VO problem as a reprojection-error minimization task.

[004] State-of-the-art supervised learning frameworks have proven more effective at finding dense correspondences than standard hand-designed approaches. Deep-learning frameworks that jointly estimate depth, optical flow and camera motion have been presented, including those with the addition of recurrent neural networks for learning temporal information. However, deep learning methods are usually less explainable and have difficulty when transferring to unseen datasets or cameras with different calibration. As such, for fine-grained high-accuracy content-based estimation of camera motion, analytical solutions to multi-view geometry problems still provide the “gold-standard” in terms of accuracy.

SUMMARY OF THE INVENTION

[005] The present invention represents a novel dense-indirect approach for video analysis, eliminating the need for feature extraction for the geometric analysis of viewing rays and providing an alternative robust estimation framework with bounded computational burden. A log-logistic error model is employed, in contrast to the standard practice of using Gaussian error models. The proposed statistical error model and inference framework may be utilized as a general estimation procedure for applications beyond image-content analysis.

[006] The present invention provides information on both camera poses and dense scene structure. Camera poses are useful for applications that need localization, such as augmented reality, robotics and autonomous driving. Dense scene structure is useful for applications that need 3D reconstruction, such as 3D modelling, engineering surveys and geography research. Software embodying the present invention might benefit robotic, AR/VR, and autonomous navigation applications.

[007] These objectives are met by solving the problem of recovering camera locations and dense scene structure using video sequences from monocular image capture devices. More specifically, the present invention defines the dense indirect visual odometry (VO) problem as a probabilistic model and solves it with a generalized expectation maximization (i.e., EM) formulation for the joint inference of camera motion, pixel depth, and motion-track confidence from externally estimated optical flow fields under the supervision of an adaptive log-logistic distribution model.
[008] By applying learned correspondence estimators as input to a rigorous probabilistic geometric inference framework, the advantages of both supervised learning and analytical solutions may be leveraged.

[009] The methods of the present invention can be applied to autonomous driving applications. Such methods of the present invention can recover the driving track, as well as the scene geometry from car-mounted cameras that provide local and global environment geometry information to the autonomous driving system for further planning, such as to avoid collision, for navigation, etc.

[010] For augmented reality systems, the present invention can triangulate indoor geometry from phone cameras for the application of further providing occlusion-aware 3D interactions with a user interface.

[011] For engineering surveys, 3D modelling or geography, the present invention can triangulate an environment geometry (e.g., from drone-mounted cameras) that provides real-world 3D models for further applications, such as distance measuring, 3D-printing, geography analysis, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

[012] For a better understanding of the present invention, reference is made to the following detailed description of various exemplary embodiments considered in conjunction with the accompanying drawings, in which:
[013] FIG. 1 is a schematic diagram of a graphical probabilistic model in accordance with an embodiment of the present invention;
[014] FIG. 2 is a flow diagram of an algorithm in accordance with an embodiment of the present invention;
[015] FIG. 3 is a flow diagram illustrating an exemplary process for using the methods of the present invention;
[016] FIG. 4 is a conceptual diagram of the present invention;
[017] FIG. 5 is a model for depth inference;
[018] FIG. 6 is a schematic illustration of a Pose MAP approximation via a meanshift-based mode search;
[019] FIG. 7 shows graphical results of a Fisk model qualitative study;
[020] FIG. 8 shows graphical results of an ablation study and runtime; and
[021] FIG. 9 shows graphical results of performance analysis conducted over frame and speed on the KITTI benchmark.

DESCRIPTION OF EMBODIMENTS OF THE INVENTION

[022] Various embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the invention that can be embodied in various forms. In addition, each of the examples given in connection with the various embodiments is intended to be illustrative, and not restrictive. Further, the figures are not necessarily to scale, and some features may be exaggerated to show details of particular components (and any size, material and similar details shown in the figures are intended to be illustrative and not restrictive). Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the disclosed embodiments.

[023] Subject matter will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific exemplary embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or disclosed subject matter is intended to be construed as not being limited to any exemplary embodiments set forth herein, it being understood that such exemplary embodiments are provided merely to be illustrative.
Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. The following detailed description is, therefore, not intended to be taken in a limiting sense.

[024] Throughout the specification, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment and the phrases “in another embodiment” and “other embodiments” as used herein do not necessarily refer to a different embodiment. It is intended, for example, that covered or disclosed subject matter includes combinations of the exemplary embodiments in whole or in part.

[025] In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B, or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B, or C, here used in the exclusive sense. In addition, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

[026] The approaches proposed in accordance with the present invention provide an estimate of the relative displacements of a monocular camera system by taking as input dense optical flow estimates (which are dense pixel registration maps between two images) among successive frames.

[027] The present invention develops a dense indirect framework for monocular VO that takes as input externally computed optical flow from supervised-learning estimators. It was empirically observed that optical flow residuals tend to conform to a log-logistic (i.e., Fisk) distribution model parametrized by the optical flow magnitude. This insight was leveraged to propose a probabilistic framework that fuses dense optical flow sequences and jointly estimates camera motion, pixel depth, and motion-track confidence through a generalized-EM formulation. The approach is dense in the sense that each pixel corresponds to an instance of estimated random variables; it is indirect in the sense that it treats individual pixels as viewing rays within minimal feature-based multi-view geometry models (i.e. Perspective 3-Point, or P3P, for camera pose, 3D triangulation for pixel depth) and implicitly optimizes for reprojection error. Starting from a deterministic bootstrap of camera pose and pixel depths attained from optical flow inputs, the method of the present invention iteratively alternates the inference of depth, pose and track confidence over a batch of consecutive images.

[028] FIG. 1 is a VOLDOR probabilistic graphical model. Optical flow field sequences are modeled as observed variables subject to Fisk-distributed measurement errors.
Camera poses, depth maps and rigidness maps are modeled as hidden variables.

[029] FIG. 2 is an iterative estimation workflow. Externally computed optical flow estimates of a video sequence are input. Scene depth, camera poses and rigidness maps are alternately estimated by enforcing congruence between the predicted rigid flow and the input flow observations. Estimation is posed as a probabilistic inference task governed by a Fisk-distributed residual model.

[030] By proposing a probabilistic model relating the dependencies among camera movement, scene geometry and the input optical flow estimates, a generalized expectation maximization inference framework may be employed. Importantly, the inference is built around an empirically validated error model for the observations.

[031] Furthermore, the present invention constitutes a probabilistic dense, indirect visual odometry method taking as input externally estimated optical flow fields, as well as identifying and utilizing the log-logistic dense optical flow residual in its framework. The pose estimation relies on mode-seeking from dense sampling in the pose space of the cameras.

[032] There are several advantages to this framework. First, it constitutes a modular framework that is agnostic to the optical flow estimator engine, which takes full advantage of recent deep-learning optical flow methods. Contrary to learning-based monocular depth approaches, which develop and impose strong semantic priors, learning for optical flow estimation may be informed by photometric error and achieve better generalization. Moreover, by replacing sparse, hand-crafted feature inputs with learned dense optical flows, surface information for poorly textured (i.e., feature-less) regions can be obtained. Second, by leveraging the empirically-validated log-logistic residual model, highly accurate probabilistic estimates for scene depth and camera motion can be obtained, which estimates do not rely on Gaussian error assumptions. Third, the methods of the present invention are highly parallelizable, which also allows for real-time application on commodity GPU-based architectures. Given that there is only linear computational and storage growth, the inventive method is GPU-friendly.

[033] FIGS. 3 and 4 provide general overviews of the methods and components used in association with the present invention.

[034] Publicly available datasets and benchmarks were used to validate the correctness of the approach of the present invention. The present approach provides the most accurate estimation results in the current academic benchmarks. As a further modification of the processing related to the present invention, a hierarchical (i.e. coarse-to-fine) variant of the currently developed solution could be employed.

EXAMPLE 1

[035] Optical flow can be seen as a combination of rigid flow, which is relevant to the camera motion and scene structure, along with an unconstrained flow describing general object motion. The present VO method inputs a batch of externally computed optical flow fields and infers the underlying temporally-consistent scene structure (depth map), camera motions, as well as pixel-wise probabilities for the “rigidness” of each optical flow estimate. Furthermore, the system framework is posited under the supervision of an empirically validated adaptive log-logistic residual model over the end-point-error (EPE) between the estimated rigid flow and the input (i.e., observed) flow.
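The supervision signal just described is the per-pixel end-point error between the rigid flow predicted by the current scene and motion estimates and the externally observed flow. As a point of reference only, the following NumPy sketch (not part of the original disclosure; array shapes, names and the H x W x 2 flow layout are assumptions) computes the EPE and the flow magnitude used throughout the residual model:

    import numpy as np

    def end_point_error(observed_flow, rigid_flow):
        """Per-pixel end-point error (in pixels) between two H x W x 2 flow fields."""
        return np.linalg.norm(observed_flow - rigid_flow, axis=-1)

    def flow_magnitude(observed_flow):
        """Per-pixel magnitude of the observed flow, used to parametrize the residual model."""
        return np.linalg.norm(observed_flow, axis=-1)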
GEOMETRIC NOTATION

[036] A sequence of externally computed (observed) dense optical flow fields is input: X = {X_t | t = 1, ···, t_N}, where X_t is the optical flow map from image I_{t−1} to I_t, while X_t^j denotes the optical flow vector of pixel j at time t. The aim is to infer the camera poses T = {T_t | t = 1, ···, t_N}, where T_t ∈ SE(3) represents the relative motion from time t − 1 to t.

[037] To define a likelihood model relating prior observations X to T, two additional (latent) variable types were introduced: 1) a depth field θ defined over I_0, and 2) a rigidness probability map W_t associated to each X_t; where θ^j denotes the depth value at pixel j, W = {W_t | t = 1, ···, t_N} denotes the set of rigidness maps, and W_t^j denotes the rigidness probability of pixel j at time t.

[038] Having the depth map θ and rigidness maps W, the rigid flow ξ_t(θ^j) can be obtained through applying the rigid transformations T to the point cloud associated with θ, conditioned on W. Assuming T_0 = I, π_t(θ^j) denotes the pixel coordinate of projecting the 3D point associated with θ^j into the camera image plane at time t using the given camera poses T, by

    π_t(θ^j) ≅ K (T_t T_{t−1} ··· T_1) (θ^j K^{−1} (x_j, y_j, 1)^T),

[039] where K is the camera intrinsic matrix, x_j, y_j are the image coordinates of pixel j, and ≅ denotes equality of homogeneous image coordinates up to the projective scale. Hence, the rigid flow can be defined as ξ_t(θ^j) = π_t(θ^j) − π_{t−1}(θ^j).
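Paragraphs [036]-[039] amount to a back-projection and re-projection of every pixel of I_0. The following NumPy sketch is illustrative only: the function and argument names are my own, and the composed camera-to-frame poses (T_t ··· T_1) are assumed to be supplied as 4 x 4 homogeneous matrices.

    import numpy as np

    def back_project(depth, K):
        """Lift every pixel of the reference image I_0 to a 3D point: theta^j * K^{-1} (x_j, y_j, 1)^T."""
        h, w = depth.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.linalg.inv(K) @ np.stack([xs, ys, np.ones_like(xs)], 0).reshape(3, -1)
        return (rays * depth.reshape(1, -1)).T          # (H*W, 3) points in the I_0 camera frame

    def project(points, K, T_0_to_t):
        """Apply the composed pose T_t...T_1 and project with K, returning pixel coordinates."""
        p = T_0_to_t[:3, :3] @ points.T + T_0_to_t[:3, 3:4]   # rigid transform
        uv = K @ p
        return (uv[:2] / uv[2:3]).T                            # dehomogenize -> (H*W, 2)

    def rigid_flow(depth, K, T_prev, T_curr):
        """Rigid flow xi_t(theta^j) = pi_t(theta^j) - pi_{t-1}(theta^j) for every pixel of I_0."""
        pts = back_project(depth, K)
        return project(pts, K, T_curr) - project(pts, K, T_prev)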
MIXTURE LIKELIHOOD MODEL

[040] The residuals between the observation flow and the rigid flow were modeled with respect to the continuous rigidness probability W_t^j,

    P(X_t^j | θ^j, T_t, W_t^j) = ρ(X_t^j | ξ_t(θ^j))^(W_t^j) · μ(X_t^j)^(1 − W_t^j),        (2)

where the probability density function ρ(X_t^j | ξ_t(θ^j)) represents the probability of having the rigid flow ξ_t(θ^j) under the observation flow X_t^j, and μ(·) is a uniform distribution whose density varies with the observed flow vector X_t^j. These two functions will be defined hereinbelow. Henceforth, when modeling the probability of X_t, only T_t will be given in the conditional probability, although it should be noted that in these instances the projection also depends on the preceding camera poses T_1, ···, T_{t−1}, which are assumed fixed and for which X_t inherently does not contain any information. Moreover, jointly modeling for all previous camera poses along with X_t would bias them as well as increase the computational complexity. In the following paragraph, Eq. (2) will be noted simply as P(X_t^j | θ^j, T_t, W_t^j). At this point, the visual odometry problem can be modeled as a maximum likelihood estimation problem,

    {θ*, T*, W*} = argmax_{θ, T, W} ∏_t ∏_j P(X_t^j | θ^j, T_t, W_t^j).
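As a small illustration of Eq. (2) (a sketch only, not the original implementation), the per-pixel log-likelihood can be evaluated once the inlier density ρ and outlier density μ are available as callables; concrete Fisk-based choices are sketched after the FISK RESIDUAL MODEL section below, and can be bound to this signature with, e.g., functools.partial.

    import numpy as np

    def mixture_log_likelihood(epe, flow_mag, rigidness, rho, mu):
        """Per-pixel log of Eq. (2): W * log rho(EPE, ||X||) + (1 - W) * log mu(||X||).

        epe       : end-point error between observed and rigid flow, per pixel
        flow_mag  : magnitude of the observed flow, per pixel
        rigidness : continuous rigidness probability W_t^j in [0, 1]
        rho, mu   : callables returning the inlier / outlier densities
        """
        return (rigidness * np.log(rho(epe, flow_mag) + 1e-12)
                + (1.0 - rigidness) * np.log(mu(flow_mag) + 1e-12))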
[041] Furthermore, spatial consistency is promoted among both the dense hidden variables θ and W, through different mechanisms described hereinbelow.

FISK RESIDUAL MODEL

[042] The choice of an observation residual model impacts accurate statistical inference from a reduced number of observations, where in this instance a residual is defined as the end-point error in pixels between two optical flow vectors.

[043] In practice, hierarchical optical flow methods (i.e. relying on recursive scale-space analysis) tend to amplify estimation errors in proportion to the magnitude of the pixel flow vector. In light of this, an adaptive residual model was explored for determining the residual distribution w.r.t. the magnitude of the optical flow observations. The residual distribution of multiple leading optical flow methods w.r.t. the groundtruth was analyzed. The empirical distribution was fitted to five different analytic models, and it was found that the Fisk distribution yielded the most congruent shape over all flow magnitudes. To quantify the goodness of fit, Kolmogorov-Smirnov (K-S) test results were reported, which quantify the supremum distance (D value) between the CDFs of the empirical distribution and a reference analytic distribution.

[044] Hence, given the observed flow X_t^j, the probability of the rigid flow ξ_t(θ^j) matching the underlying groundtruth was modeled as

    ρ(X_t^j | ξ_t(θ^j)) = F( ‖X_t^j − ξ_t(θ^j)‖ ; α(‖X_t^j‖), β(‖X_t^j‖) ),

where the functional form of the PDF of the Fisk distribution F is given by

    F(x; α, β) = (β/α) (x/α)^(β−1) / (1 + (x/α)^β)^2,  x ≥ 0.

[045] The parameters of the Fisk distribution were determined as functions of the observed flow magnitude ‖X_t^j‖. Since a clear linear correspondence was shown in the results, instead of using a look-up table, fitted functions were applied to find the parameters (the fitted forms are reproduced as images in the original filing), where a_1, a_2 and b_1, b_2 are learned parameters depending on the optical flow estimation method.

[046] Next, the outlier likelihood function μ(·) was modeled. The general approach is to assign outliers to a uniform distribution to improve robustness. In the present example, to utilize the prior given by the observation flow, the density of the uniform distribution is provided as a function μ(·) of the observation flow vector,

    μ(X_t^j) = F( λ‖X_t^j‖ ; α(‖X_t^j‖), β(‖X_t^j‖) ),        (8)

where λ is a hyper-parameter adjusting the density, which is also the strictness for choosing inliers. The numerical interpretation of λ is the optical flow percentage EPE at which an observation becomes indistinguishable from an outlier. Hence, flows with different magnitudes can be compared under a fair metric when being selected as inliers.
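A minimal sketch of the adaptive Fisk residual model follows. It is illustrative only: fit_alpha and fit_beta stand in for whatever fitted parameter functions of the flow magnitude are learned per optical flow estimator (paragraph [045]), and lam corresponds to the hyper-parameter λ of Eq. (8).

    import numpy as np

    def fisk_pdf(x, alpha, beta):
        """Log-logistic (Fisk) PDF, defined for x >= 0."""
        x = np.maximum(x, 1e-12)                      # avoid division by zero at x = 0
        z = (x / alpha) ** beta
        return (beta / alpha) * (x / alpha) ** (beta - 1.0) / (1.0 + z) ** 2

    def adaptive_params(flow_mag, fit_alpha, fit_beta):
        """Fisk parameters as fitted functions of the observed flow magnitude (paragraph [045])."""
        return fit_alpha(flow_mag), fit_beta(flow_mag)

    def rho(epe, flow_mag, fit_alpha, fit_beta):
        """Inlier density: Fisk PDF of the end-point error, adapted to the flow magnitude."""
        a, b = adaptive_params(flow_mag, fit_alpha, fit_beta)
        return fisk_pdf(epe, a, b)

    def mu(flow_mag, lam, fit_alpha, fit_beta):
        """Outlier density of Eq. (8): constant per pixel, set to the Fisk density at EPE = lam * ||X||."""
        a, b = adaptive_params(flow_mag, fit_alpha, fit_beta)
        return fisk_pdf(lam * flow_mag, a, b)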
INFERENCE

[047] In this section, an iterative inference framework is introduced which alternately optimizes the depth map, the camera poses and the rigidness maps.

DEPTH AND RIGIDNESS UPDATE
GENERALIZED EXPECTATION-MAXIMIZATION (GEM)

[048] Depth θ and its rigidness W over time were inferred while assuming a fixed, known camera pose T. The true posterior P(θ, W | X, T) was approximated through a GEM framework. In this section, Eq. (2) will be given as P(X_t^j | θ^j, W_t^j), where the fixed T is omitted. The intractable real posterior was approximated with a restricted family of distributions, where

    q(θ^j, W^j) = q(θ^j) ∏_t q(W_t^j).

For tractability, q(θ^j) is further constrained to the family of Kronecker delta functions,

    q(θ^j) = δ(θ^j = θ̂^j),

where θ̂^j is a parameter to be estimated. Moreover, q(W_t) inherits the smoothness defined on the rigidness map W_t in Eq. (11), which was previously shown to minimize the KL divergence between the variational distribution and the true posterior. In the M step, an estimate of an optimal value for θ^j, given an estimated PDF on W_t^j, was sought. Next, the choice of the estimators used for this task will be explained.

MAXIMUM LIKELIHOOD ESTIMATOR (MLE)

[049] The standard definition of the MLE for the problem is given by

    θ̂^j = argmax_{θ^j} Σ_t E_{q(W_t^j)} [ log P(X_t^j | θ^j, W_t^j) ],

where q(W_t^j) is the estimated distribution density given by the E step. However, it was found empirically that the MLE criteria tended to be overly sensitive to inaccurate initialization. More specifically, the depth map was bootstrapped using only the first optical flow, and its depth values were used to sequentially bootstrap subsequent camera poses (further details will follow). Hence, for noisy/inaccurate initializations, using MLE for estimate refinement will impose high selectivity pressure on the rigidness probabilities W, favoring a reduced number of higher-accuracy initializations. Given the sequential nature of the image-batch analysis, this tends to effectively reduce the set of useful down-stream observations used to estimate subsequent cameras.

MAXIMUM INLIER ESTIMATION (MIE)

[050] To reduce the bias caused by initialization and sequential updating, the MLE criteria were relaxed to the following MIE criteria,

    θ̂^j = argmax_{θ^j} Σ_t q(W_t^j = 1 | θ^j),

which finds a depth maximizing the rigidness (inlier selection) map W. Experimental details regarding the MIE criteria are given hereinbelow.
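One plausible reading of the MLE and MIE criteria above is sketched below for a single pixel and a single candidate depth value. This is an assumption-laden illustration, not the original implementation: rho and mu are callables with the signatures used in the mixture sketch earlier, and the MIE branch re-evaluates the mixture responsibilities induced by the candidate to count expected inliers.

    import numpy as np

    def score_candidate_depth(epe_per_frame, flow_mag_per_frame, w_post, rho, mu, criterion="MIE"):
        """Score one candidate depth value theta^j for a single pixel.

        epe_per_frame      : end-point errors of the rigid flow induced by the candidate, one per frame t
        flow_mag_per_frame : observed flow magnitudes, one per frame t
        w_post             : E-step rigidness posteriors q(W_t^j = 1), one per frame t
        rho, mu            : inlier / outlier densities (see the Fisk sketch above)
        """
        p_in = rho(epe_per_frame, flow_mag_per_frame)
        p_out = mu(flow_mag_per_frame)
        if criterion == "MLE":                 # rigidness-weighted log-likelihood
            return np.sum(w_post * np.log(p_in + 1e-12))
        # MIE: expected inlier count induced by the candidate (responsibilities re-evaluated)
        resp = (w_post * p_in) / (w_post * p_in + (1.0 - w_post) * p_out + 1e-12)
        return np.sum(resp)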
[051] The per-pixel depth estimate was optimized through a sampling-propagation scheme as shown in FIG. 5. The 2D image field is broken into alternately directed 1D chains, while depth values are propagated through each chain. Hidden Markov chain smoothing is imposed on the rigidness maps. A randomly sampled depth value is compared with the previous depth value θ^j, together with a value propagated from the previous neighbor θ^{j−1}. Then θ^j will be updated to the best estimate among these three options. The updated θ^j will further be propagated to the neighbor pixel j + 1.
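A minimal sketch of one left-to-right sweep of this sampling-propagation scheme is given below. It is illustrative only: score_fn stands for the chosen MLE/MIE scoring of a candidate depth at pixel j (e.g., the sketch above), and the inverse-depth sampling range is an arbitrary assumption.

    import numpy as np

    def sweep_chain(depth_chain, score_fn, rng):
        """One pass of the sampling-propagation scheme over a 1D chain of pixels.

        depth_chain : current depth estimates along the chain (modified in place)
        score_fn    : callable(j, depth_value) -> scalar score of a candidate depth at pixel j
        rng         : numpy random Generator used to draw the random candidate (inverse-depth sampling)
        """
        for j in range(len(depth_chain)):
            candidates = [depth_chain[j]]                      # current value
            if j > 0:
                candidates.append(depth_chain[j - 1])          # value propagated from the previous neighbor
            candidates.append(1.0 / rng.uniform(1e-3, 1.0))    # random sample, drawn in inverse depth
            scores = [score_fn(j, d) for d in candidates]
            depth_chain[j] = candidates[int(np.argmax(scores))]
        return depth_chain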
UPDATING THE RIGIDNESS MAPS

[052] A scheme where the image is split into rows and columns was adopted, reducing the 2D image to several 1D hidden Markov chains, and a pairwise smoothness term is posed on the rigidness map,

    P(W_t^{j+1} | W_t^j) = γ if W_t^{j+1} = W_t^j, and 1 − γ otherwise,        (11)

where γ is a transition probability encouraging similar neighboring rigidness. In the E step, the rigidness maps W were updated according to θ. Given the smoothness defined in Eq. (11), the forward-backward algorithm is used for inferring W in the hidden Markov chain,

    q(W_t^j) = (1/A) · m_fwd(W_t^j) · m_bwd(W_t^j),

where A is a normalization factor, while m_fwd(W_t^j) and m_bwd(W_t^j) are the forward and backward messages, computed recursively as

    m_fwd(W_t^j) = P(X_t^j | θ^j, W_t^j) · Σ_{W_t^{j−1}} P(W_t^j | W_t^{j−1}) · m_fwd(W_t^{j−1}),
    m_bwd(W_t^j) = Σ_{W_t^{j+1}} P(X_t^{j+1} | θ^{j+1}, W_t^{j+1}) · P(W_t^{j+1} | W_t^j) · m_bwd(W_t^{j+1}),

where P(X_t^j | θ^j, W_t^j) is the emission probability referred to in Eq. (2).
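The forward-backward pass over one binary rigidness chain is standard and can be sketched as follows (illustrative only; the uniform prior on the first pixel and the per-step rescaling for numerical stability are assumptions of this sketch).

    import numpy as np

    def rigidness_posteriors(emit_inlier, emit_outlier, gamma=0.9):
        """Forward-backward pass over one 1D rigidness chain (binary states: outlier=0, rigid=1).

        emit_inlier  : per-pixel emission density under the rigid-flow model, rho(X | xi(theta))
        emit_outlier : per-pixel emission density under the outlier model, mu(X)
        gamma        : transition probability of keeping the same rigidness state (Eq. (11))
        """
        n = len(emit_inlier)
        emis = np.stack([emit_outlier, emit_inlier], axis=1)          # (n, 2)
        trans = np.array([[gamma, 1 - gamma], [1 - gamma, gamma]])
        fwd = np.zeros((n, 2))
        bwd = np.ones((n, 2))
        fwd[0] = emis[0] * 0.5                                        # uniform prior on the first pixel
        for j in range(1, n):
            fwd[j] = emis[j] * (trans @ fwd[j - 1])
            fwd[j] /= fwd[j].sum()                                    # rescale for numerical stability
        for j in range(n - 2, -1, -1):
            bwd[j] = trans @ (emis[j + 1] * bwd[j + 1])
            bwd[j] /= bwd[j].sum()
        post = fwd * bwd
        return post[:, 1] / post.sum(axis=1)                          # q(W_t^j = 1) per pixel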
POSE UPDATE

[053] Camera poses were updated while assuming fixed, known depth θ and rigidness maps W. The optical flow chains in X were used to determine the 2D projection of any given 3D point extracted from the depth map. Since the aim is to estimate relative camera motion, the scene depth was expressed relative to the camera pose at time t − 1, and the attained 3D-2D correspondences were used to define a dense PnP instance. This instance was solved by estimating the mode of an approximated posterior distribution given by Monte-Carlo sampling of the pose space through (minimal) P3P instances. The robustness to outlier correspondences and the independence from initialization provided to the present method are useful for bootstrapping the visual odometry system described herein, where the camera pose needs to be estimated from scratch with an uninformative rigidness map (all ones).

[054] The problem can be written down as a maximum a posteriori (MAP) estimation,

    T̂_t = argmax_{T_t} P(T_t | X, θ, W).

[055] Finding the optimum camera pose is equal to computing the maximum of the posterior distribution P(T_t | X, θ, W), which is not tractable since it requires integrating over T to compute P(X). A Monte-Carlo based approximation was used, where for each depth map position θ^j two additional distinct positions {k, l} were randomly sampled to form a 3-tuple θ^g = (θ^j, θ^k, θ^l), with associated rigidness values W_t^g = (W_t^j, W_t^k, W_t^l), to represent the g-th group. Then the posterior can be approximated as

    P(T_t | X, θ, W) ≈ (1/S) Σ_{g=1}^{S} P(T_t | X_t^g, θ^g, W_t^g),

where S is the total number of groups. Although the posterior P(T_t | X_t^g, θ^g, W_t^g) is still not tractable, using 3 pairs of 3D-2D correspondences, PnP reaches its minimal form of P3P, which can be solved efficiently using the P3P algorithm. Hence,

    T̂_t^g = P3P( {P_{t−1}^i}_{i∈g}, {p_t^i}_{i∈g} ),

where P3P(·,·) denotes the P3P solver, for which AP3P is used. The first input argument indicates the 3D coordinates P_{t−1}^i of the selected depth map pixels at time t − 1, obtained by combining the previous camera poses, while the second input argument is their 2D correspondences p_t^i at time t, obtained using the optical flow displacement.
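The Monte-Carlo sampling step can be sketched as below. This is not the original implementation: p3p_solve is a placeholder for any minimal P3P solver (the filing uses AP3P) that maps three 3D points and three calibrated 2D observations to a 4 x 4 pose (or None on failure), and weighting each group by the product of its rigidness values is one plausible reading of the weighting described in the following paragraph.

    import numpy as np

    def sample_pose_candidates(points3d_prev, points2d_curr, rigidness, n_samples, p3p_solve, rng):
        """Monte-Carlo sampling of the pose space through minimal P3P instances.

        points3d_prev : (N, 3) 3D coordinates of depth-map pixels in the camera frame at time t-1
        points2d_curr : (N, 2) their 2D correspondences at time t, in normalized (calibrated) coordinates
        rigidness     : (N,) rigidness probabilities W_t^j used to weight each group
        p3p_solve     : external minimal solver (assumed available), e.g. an AP3P implementation
        """
        poses, weights = [], []
        for _ in range(n_samples):
            g = rng.choice(len(points3d_prev), size=3, replace=False)   # a random 3-tuple of pixels
            T = p3p_solve(points3d_prev[g], points2d_curr[g])
            if T is None:                                               # solver may fail / be degenerate
                continue
            poses.append(T)
            weights.append(float(np.prod(rigidness[g])))                # down-weight groups with outliers
        return poses, np.array(weights)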
Hence, a tractable variational distribution q(T_t) is used to approximate the true posterior,

    q_g(T_t) = N(T_t; T̂_t^g, Σ),

where q_g(T_t) is a normal distribution with mean T̂_t^g and a predefined fixed covariance matrix Σ, for simplicity. Furthermore, each variational distribution is weighted with the rigidness values W_t^g of its group, such that potential outliers indicated by the rigidness maps can be excluded or down-weighted. Then, the full posterior can be approximated as

    P(T_t | X, θ, W) ≈ ( Σ_g w_g q_g(T_t) ) / ( Σ_g w_g ),        (18)

where w_g denotes the weight of the g-th group.

[056] The posterior P(T_t | X, θ, W) was thus approximated with a weighted combination of the q_g(T_t). Solving for the optimum T_t on the posterior equates to finding the mode of the posterior. Since all q_g(T_t) are assumed to share the same covariance structure, mode finding on this distribution equates to applying meanshift with Gaussian kernels of covariance Σ. Note that since T_t lies in SE(3), while meanshift is applied in a vector space, the obtained mode cannot be guaranteed to lie in SE(3). Thus, poses T_t are first converted to 6-vectors in the Lie algebra se(3) through logarithm mapping, and meanshift is applied in the 6-vector space. FIG. 6 shows the pose MAP approximation via a meanshift-based mode search. Each 3D-2D correspondence is part of a unique minimal P3P instance, constituting a pose sample that is weighted by the rigidness map. Samples were mapped to the Lie algebra of SE(3) and meanshift was run to find the mode.
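A weighted meanshift mode search on the 6-vector pose samples can be sketched as below. This is an illustration under stated assumptions: the logarithm/exponential maps between SE(3) and se(3) are assumed to be available elsewhere, and initialization from the highest-weighted sample is a simplification (the filing initializes from the previous iteration's pose estimate, as described under IMPLEMENTATION DETAILS).

    import numpy as np

    def meanshift_mode(xi_samples, weights, bandwidth_inv_cov, n_iters=50, tol=1e-6):
        """Weighted meanshift with a fixed Gaussian kernel on 6-vector (se(3)) pose samples.

        xi_samples        : (S, 6) pose samples after logarithm mapping to the Lie algebra
        weights           : (S,) per-sample weights derived from the rigidness maps
        bandwidth_inv_cov : (6, 6) inverse of the kernel covariance Sigma of Eq. (18)
        """
        mode = xi_samples[np.argmax(weights)].copy()          # start from the highest-weighted sample
        for _ in range(n_iters):
            d = xi_samples - mode
            k = weights * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, bandwidth_inv_cov, d))
            new_mode = (k[:, None] * xi_samples).sum(axis=0) / (k.sum() + 1e-12)
            if np.linalg.norm(new_mode - mode) < tol:
                break
            mode = new_mode
        return mode                                            # map back to SE(3) with the exponential map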
Algorithm Integration

[Table 1, which lists the integrated VOLDOR algorithm, is reproduced as an image in the original filing.]

[057] The integrated workflow of the present visual odometry algorithm, denoted VOLDOR, will now be described. Per Table 1, the input is a sequence of dense optical flows X = {X_1, ···, X_{t_N}} and the output will be the camera poses of each frame, T = {T_1, ···, T_{t_N}}, as well as the depth map θ of the first frame. Usually, 4-8 optical flows per batch are used. Firstly, VOLDOR initializes all W to one, and T_1 is initialized from epipolar geometry estimated from X_1 using a least median of squares estimator or, alternatively, from previous estimates if available (i.e. overlapping consecutive frame batches). Then, θ is obtained from two-view triangulation using T_1 and X_1. Next, the optimization loop between camera pose, depth map and rigidness map runs until convergence, usually within 3 to 5 iterations. Note that the rigidness maps were not smoothed before updating camera poses, to prevent loss of fine details indicating potential high-frequency noise in the observations.

EXPERIMENTS
KITTI BENCHMARK

[058] The KITTI odometry benchmark, featuring a car driving in urban and highway environments, was tested. PWC-Net was used as the external dense optical flow input. The sliding window size was set to 6 frames. λ in Eq. (8) was set to 0.15, and γ in Eq. (11) to 0.9. The Gaussian kernel covariance matrix Σ in Eq. (18) was set to diagonal, scaled to 0.1 and 0.004 in the dimensions of translation and rotation, respectively. The hyper-parameters for the Fisk residual model were obtained from the analysis of the residual distribution described hereinabove (the fitted values are reproduced as an image in the original filing). Finally, absolute scale is estimated from the ground plane by taking the mode of pixels whose surface normal vectors are near perpendicular to the ground plane. More details of ground plane estimation are provided hereinbelow.
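As a recap of the integrated workflow of paragraph [057] and Table 1, a minimal structural sketch of the alternating optimization is given below. It is a skeleton only, under the assumption that the individual initialization and update routines (epipolar bootstrap, triangulation, and the depth, rigidness and pose updates sketched in the preceding sections) are supplied as callables; it is not the original implementation.

    import numpy as np

    def voldor_batch(flows, K, init_pose, triangulate, update_depth, update_rigidness, update_pose,
                     n_iters=5):
        """One optimization window of a VOLDOR-style alternating inference (illustrative skeleton).

        flows       : list of dense optical flow fields X_1..X_N for the batch (H x W x 2 arrays)
        init_pose   : callable(X_1) -> T_1, e.g. an epipolar-geometry bootstrap
        triangulate : callable(T_1, X_1) -> initial depth map of the first frame
        update_depth / update_rigidness / update_pose : the alternating update steps
        """
        n = len(flows)
        rigidness = [np.ones(flows[0].shape[:2]) for _ in range(n)]   # uninformative rigidness (all one)
        poses = [init_pose(flows[0])] + [None] * (n - 1)
        depth = triangulate(poses[0], flows[0])
        for _ in range(n_iters):                                      # usually converges in 3-5 iterations
            for t in range(n):
                poses[t] = update_pose(flows, depth, rigidness, poses, t, K)
            depth = update_depth(flows, depth, rigidness, poses, K)
            rigidness = update_rigidness(flows, depth, poses, K)
        return poses, depth, rigidness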
[Table 2, reproduced as an image in the original filing.]
Table 2

[060] Table 2 shows results on the KITTI odometry training set sequences 0-10. The translation and rotation errors are averaged over all sub-sequences of length from 100 meters to 800 meters, in 100 meter steps. VISO2 and MLM-SFM were picked as baselines, whose scales are also estimated from ground height.

[061] Table 3 compares the present results with recent popular methods on KITTI test set sequences 11-21, as reported on the KITTI odometry official ranking board. As the results show, VOLDOR has achieved top-ranking accuracy on the KITTI dataset among monocular methods. An asterisk (*) indicates that a method is based on stereo input.
[Table 3, reproduced as an image in the original filing.]
Table 3

[062] Table 4 shows the depth map quality on the KITTI stereo benchmark. Foreground moving objects were masked out, and depth was aligned with the groundtruth to resolve scale ambiguity. The depth quality was evaluated separately for different rigidness probabilities, where W^j = Σ_t W_t^j denotes the sum of the pixel rigidness over the optimization window. The EPE of PSMNet, GC-Net and GA-Net was measured on the stereo 2012 test set, and the background outlier percentage was measured on the stereo 2015 test set, while the present method is measured on the stereo 2015 training set. In Table 4, a pixel is considered an outlier if its disparity EPE is both > 3 px and > 5% of the groundtruth disparity.

[Table 4, reproduced as an image in the original filing.]
Table 4

TUM RGB-D Benchmark

[Table 5, reproduced as an image in the original filing.]
Table 5

[063] Accuracy experiments on TUM RGB-D compared VOLDOR vs. full SLAM systems; the values in Table 5 are translation RMSE in meters. In all instances, trajectories were rigidly aligned to the groundtruth for segments of 6 frames, and the mean translation RMSE over all segments was estimated. Parameters remained the same as in the KITTI experiments. The comparative baselines are an indirect sparse method, ORB-SLAM2, a direct sparse method, DSO, and a dense direct method, DVO-SLAM. Per Table 5, VOLDOR performs well under indoor capture exhibiting smaller camera motions and diverse motion patterns.

ABLATION AND PERFORMANCE STUDY

[064] FIG. 7 visualizes the depth likelihood function and the camera pose sampling distribution. FIGS. 7(a) and 7(c) visualize the depth likelihood function under Gaussian and Fisk residual models with the MLE and MIE criteria. Dashed lines indicate the likelihood given by a single optical flow. Solid lines are the joint likelihood obtained by fusing all dashed lines. The MLE and MIE estimates are therefore depicted. FIGS. 7(b) and 7(d) visualize the epipole distribution for 40K camera pose samples. For better visualization, the density color bars of (b) and (d) are scaled differently. With the Fisk residual model, the depth likelihood from each single frame had a well localized extremum (FIG. 7(c)), compared to a Gaussian residual model (FIG. 7(a)). This leads to a joint likelihood having a more distinguishable optimum, and results in more concentrated camera pose samplings (FIGS. 7(b), 7(d)). Also, with the MIE criteria, the depth likelihood of the Fisk residual model is relaxed to a smoother shape whose effectiveness is further analyzed in FIG. 8, while a Gaussian residual model is agnostic to the choice between MIE and MLE (FIG. 7(a)). Per the quantitative study in FIG. 8(b), compared to other analytic distributions, the Fisk residual model gives significantly better depth estimation when only a small number of reliable observations (low W^j) are used. Performance across different residual models tends to converge as the number of reliable samples increases (high W^j), while the Fisk residual model still provides the lowest EPE.

[065] FIG. 8(a) shows the camera pose error of VOLDOR under different residual models and dense optical flow inputs (* indicating that, due to noisy ground estimations given by C2F-Flow, its scale is corrected using the groundtruth). FIG. 8(b) shows depth map accuracy under different residual models. FIGS. 8(c) and 8(d) show the runtime of the present method tested on a GTX 1080Ti GPU. FIG. 8(a) shows the ablation study on camera pose estimation for three optical flow methods, four residual models and the proposed MIE criteria. The accuracy of combining PWC-Net optical flow, a Fisk residual model and the presently proposed MIE criteria strictly dominates (in the Pareto sense) all other combinations. FIG. 8(b) shows that the MIE criteria yield depth estimates that are more consistent across the entire image sequence, leading to improved overall accuracy. However, in the extreme case of predominantly unreliable observations (very low W^j), MLE provides the most accurate depth. FIG. 8(c) shows the overall runtime evaluation for each component under different frame numbers. FIG. 8(d) shows the runtime for the pose update under different sample rates. It should be understood that the data (e.g., that in FIG. 8) can be represented in a variety of manners depending on operating conditions, definitions and other factors, such as scale.
GROUND PLANE ESTIMATION

[066] Table 6 provides the ground plane estimation algorithm used with the KITTI odometry benchmark.

[Table 6, reproduced as an image in the original filing.]
Table 6

[067] In the ground plane estimation algorithm of Table 6, a normal vector and a height were estimated for each pixel in the ROI, and the height was scaled by its median. Meanshift was applied to find the mode with prior knowledge of the ground normal and the height scale (median). Finally, the estimated normal vector was actively checked to determine if the ground estimate was correct.

COMPARISON WITH DEEP-LEARNING VO

[068] As Table 7 shows, the visual odometry result is compared with recent SOTA deep-learning VO methods, where ORB-SLAM is used as the baseline. Specifically, it shows a comparison with recent deep-learning methods on KITTI visual odometry benchmark sequences 09 and 10.

[Table 7, reproduced as an image in the original filing.]
Table 7

[069] Furthermore, the performance of the present method will now be detailed. Due to the camera motion, the geometry representation of the first frame's depth map causes the coverage of registrable depth pixels to vary with the frames within the optimization window. Some of the depth pixels can only be registered in a small number of frames, and this will yield less reliable depth estimates since they are triangulated from fewer observations. However, those depth pixels will increase the depth coverage of the frames where they are observed. With a different balance between the reliability and coverage of depth pixels, the performance of each frame within the optimization window varies. In FIG. 9, the performance over 10 frames was tested separately using the KITTI dataset. As the results show, the third and fourth frames usually have the best performance. This property will be utilized hereinbelow when discussing the fusion of segment estimates.

[070] FIG. 9(a) shows rotation error over frame. FIG. 9(b) shows translation error over frame. FIG. 9(c) shows rotation error over speed. FIG. 9(d) shows translation error over speed. FIGS. 9(c) and 9(d) specifically show the performance over different camera moving speeds in the KITTI dataset. The inventive method can work stably over a wide range of speeds of 10-50 km/h. When the speed is low, camera baselines are small and the triangulation has large uncertainty, which decreases the performance. When the speed is high, frames have limited overlap and provide little mutual information, which also decreases the performance.

IMPLEMENTATION DETAILS
DEPTH UPDATE

[071] In the depth update process, a depth value is sampled for each pixel from a uniform distribution in the inverse-depth representation, while the best depth value is kept. This process can be done multiple times to achieve better convergence. In practice, one depth value was sampled for each pixel when testing for camera pose accuracy, but two when testing depth accuracy. Finally, the propagation scheme was applied from four directions, one time each, to spread the depth values. While computing the likelihood of a depth value, if a pixel falls outside the image boundary, its rigidness from that time step on was set to zero. Bilinear interpolation was used to obtain flow vectors at continuous positions of the observed optical flow field.

POSE UPDATE

[072] In the pose sampling process, besides only referring to the rigidness map, pixels were selected that fall in a certain range of depth. Pixels with depth were picked,
POSE UPDATE

[072] In the pose sampling process, in addition to referring to the rigidness map, pixels were selected that fall within a certain depth range. Pixels were picked whose depth satisfies a criterion (reproduced as an equation in the original filing) in which θj is the pixel depth value and t is the translation vector magnitude. In the meanshift process, since a low variance of the pose weights was observed, the weights were binarized with a threshold of 0.5 for a faster meanshift implementation. For meanshift initialization, the estimated camera pose from the previous iteration was used. In the case of the first iteration, the camera pose of the previous time step is used if that initialization obtains a kernel weight larger than 0.1. Otherwise, 10 start points are randomly picked and the initialization uses the start point with the largest kernel weight.
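A hedged sketch of the pose-update bookkeeping in paragraph [072] follows: weights are binarized at 0.5, and the mean-shift start point is chosen from the previous-iteration pose, from the previous time step (if its kernel weight exceeds 0.1), or as the best of 10 random starts. The 6-vector pose parameterization, the Gaussian kernel form and its normalization, and the bandwidth are assumptions of the sketch, not details stated in the text.

```python
import numpy as np

def kernel_weight(pose, samples, weights, bandwidth):
    """Normalized Gaussian-kernel mass around `pose` (normalization is assumed here)."""
    d2 = np.sum((samples - pose) ** 2, axis=1) / bandwidth ** 2
    return np.sum(weights * np.exp(-0.5 * d2)) / (np.sum(weights) + 1e-12)

def init_meanshift(samples, raw_weights, prev_iter_pose=None, prev_step_pose=None,
                   bandwidth=0.05, n_random=10, rng=None):
    rng = rng or np.random.default_rng()
    w = (raw_weights > 0.5).astype(float)          # binarize pose weights at 0.5
    if prev_iter_pose is not None:                 # later iterations: reuse last estimate
        return prev_iter_pose
    if prev_step_pose is not None and \
       kernel_weight(prev_step_pose, samples, w, bandwidth) > 0.1:
        return prev_step_pose                      # first iteration: previous time step
    starts = samples[rng.choice(len(samples), size=n_random, replace=False)]
    scores = [kernel_weight(s, samples, w, bandwidth) for s in starts]
    return starts[int(np.argmax(scores))]          # else: best of 10 random starts

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 0.05, size=(4000, 6))    # hypothetical 6-DoF pose samples
raw_w = rng.uniform(0.0, 1.0, size=4000)
start = init_meanshift(samples, raw_w, prev_step_pose=np.zeros(6), rng=rng)
print("mean-shift start point:", start)
```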
TRUNCATION

[073] In the overall workflow, for robustness, badly registered frames are actively detected and the optimization window is truncated. This mainly happens when the camera motion is large, such that the depth map of the first frame cannot be registered to the tail frames. Two criteria are used to trigger a truncation at time step t: 1) the rigidness map at time t has fewer than 2000 inlier pixels, or 2) after the meanshift for the camera pose at time t converges, the kernel weight is less than 0.01. Truncation reduces the optimization window size to t − 1 and forces the algorithm to run 3 more iterations after truncation to ensure convergence.

FUSION

[074] The present method treats each optimization window independently. Thus, a fusion step is applied as post-processing to obtain the full visual odometry result. In the KITTI experiment, a sliding window of size 6 with step 1 was applied, yielding 6 pose candidates for each time step. As shown in FIG. 9, the pose quality of each frame in the optimization window varies. In light of this, poses of better quality are picked with higher priority according to the plot.
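To make the fusion rule of paragraph [074] concrete, the sketch below keeps, for each time step, the candidate coming from the most-preferred in-window frame index. The specific preference order is an illustrative assumption consistent with FIG. 9 (third and fourth frames performing best), not an order stated in the text, and the pose values are stand-ins.

```python
# Quality rank of a 1-based in-window frame index; lower rank is preferred.
PREFERENCE = {3: 0, 4: 1, 2: 2, 5: 3, 1: 4, 6: 5}

def fuse_pose_candidates(candidates):
    """candidates[t] is a list of (in_window_index, pose) pairs collected for
    time step t from every size-6 window covering it; returns one pose per t."""
    fused = {}
    for t, cand in candidates.items():
        fused[t] = min(cand, key=lambda c: PREFERENCE.get(c[0], 99))[1]
    return fused

# Toy usage with string stand-ins for the pose estimates.
cands = {7: [(1, "P7_from_w7"), (2, "P7_from_w6"), (3, "P7_from_w5"),
             (4, "P7_from_w4"), (5, "P7_from_w3"), (6, "P7_from_w2")]}
print(fuse_pose_candidates(cands))   # picks the candidate from in-window index 3
```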
RESULTS

[075] When there are moving objects in the foreground, the rigidness maps can correctly segment the objects and exclude them from the estimation of the depth map. Nevertheless, the resulting depth map can still provide correct background depth estimates in regions occluded by dynamic foreground objects, using information from frames in which the region is not occluded. With forward motion, depth pixels become invisible by exiting the field of view at the image boundary, which is reflected in the rigidness maps. Also, occlusions as well as outlier pixels (along object boundaries and in the sky region for C2F-Flow and EpicFlow) are clearly indicated by the rigidness maps. With rotating motion, depth pixels become invisible on the side opposite the rotation, which is also reflected in the rigidness maps. Good optical flow methods usually provide a more consistent output depth map along object and frame boundaries.

CONCLUSION

[076] In summary, conceptually, the VO problem was posed as an instance of geometric parameter inference under the supervision of an adaptive model of the empirical distribution of dense optical flow residuals. Pragmatically, a monocular VO pipeline was developed which obviates the need for a) feature extraction, b) RANSAC-based estimation, and c) local bundle adjustment, yet still achieves top-ranked performance on the KITTI and TUM RGB-D benchmarks. Moreover, the use of dense-indirect representations and adaptive data-driven supervision was posited as a general and extensible framework for multi-view geometric analysis tasks.

[077] It will be understood that the embodiments described herein are merely exemplary and that a person skilled in the art may make many variations and modifications without departing from the spirit and scope of the invention. All such variations and modifications, including the embodiments, features and characteristics disclosed in attached Exhibit A, are intended to be included within the scope of the present invention.

Claims

WHAT IS CLAIMED IS:

1. A method for recovering data from video recordings, said method comprising the steps of:
obtaining a video sequence;
providing a plurality of supervised learning estimators corresponding to said video sequence;
extracting an optical flow field sequence from said video sequence;
defining an indirect visual odometry probabilistic model;
treating said optical flow field sequence as known variables subject to error within said probabilistic model;
inputting said plurality of supervised learning estimators into said probabilistic model;
performing generalized expectation maximization on said probabilistic model; and
obtaining estimated random variables from said probabilistic model.
2. The method of Claim 1, wherein said learning estimators are computed externally.
3. The method of Claim 1, wherein said supervised learning estimators are computed with deep-learning techniques.
4. The method of Claim 1, further comprising the step of iteratively alternating between a plurality of parameters.
5. The method of Claim 1, wherein said probabilistic model comprises a geometric inference framework.
6. The method of Claim 1, wherein said probabilistic model is a log-logistic error model.
7. The method of Claim 1, wherein said estimates are inferred camera motion.
8. The method of Claim 1, wherein said estimates are inferred pixel depth.
9. The method of Claim 1, wherein said estimates are inferred motion-track confidence.
10. The method of Claim 1, wherein said step of obtaining a video sequence comprises the step of recording with a monocular image capture device.
11. The method of Claim 1, wherein said probabilistic model treats said optical flow sequences as Fisk-distribution variables.
12. The method of Claim 1, further comprising the step of performing a deterministic bootstrap for camera pose and pixel depth on said plurality of supervised learning estimators.
13. The method of Claim 1, wherein said probabilistic model includes a P3P method.
14. The method of Claim 1, wherein said probabilistic model includes a 3D triangulation method.
15. The method of Claim 1, further comprising the step of converting said estimates into a 3D model or 3D reconstruction.
16. The method of Claim 15, wherein said 3D model or 3D reconstruction represents geographical information.
17. The method of Claim 16, wherein said 3D model or 3D reconstruction is translated into a virtual reality representation of geography.
18. The method of Claim 1, further comprising the step of navigating an autonomous vehicle with said estimates.
19. The method of Claim 1, wherein said optical flow field sequence is dense.
20. An estimation method, said method comprising the steps of:
obtaining data;
providing a plurality of supervised learning estimators corresponding to said data;
extracting information from said data;
defining a generalized error model;
treating said information as an observed random variable within said error model;
inputting said plurality of supervised learning estimators into said error model;
performing generalized expectation maximization on said error model; and
obtaining estimated random variables from said error model.
