WO2021243281A1 - System and method for visual odometry from log-logistic dense optical flow residuals - Google Patents

System and method for visual odometry from log-logistic dense optical flow residuals

Info

Publication number
WO2021243281A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
optical flow
depth
estimators
probabilistic model
Prior art date
Application number
PCT/US2021/034981
Other languages
French (fr)
Other versions
WO2021243281A9 (en)
Inventor
Enrique DUNN
Zhixiang MIN
Original Assignee
The Trustees Of The Stevens Institute Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of The Stevens Institute Of Technology filed Critical The Trustees Of The Stevens Institute Of Technology
Publication of WO2021243281A1 publication Critical patent/WO2021243281A1/en
Publication of WO2021243281A9 publication Critical patent/WO2021243281A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/77 Determining position or orientation of objects or cameras using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

A method for calculating camera poses and dense scene structure via probabilistic dense indirect visual odometry is disclosed. It may be used in various video analysis applications. Through use of a probabilistic model relating the dependencies among camera movement, scene geometry and the input optical flow estimates, a generalized expectation maximization inference framework may be employed. The inference is built around an empirically validated error model for the observations.

Description

SYSTEM AND METHOD FOR VISUAL ODOMETRY FROM LOG-LOGISTIC DENSE OPTICAL FLOW RESIDUALS

CROSS-REFERENCE TO RELATED APPLICATION

[001] This application claims priority to U.S. Provisional Patent Application Serial No. 63/031,374 filed May 28, 2020, the entire disclosure of which is incorporated herein by reference.

FIELD OF THE INVENTION

[002] The present invention relates to the estimation of the 3D movements of a camera from the analysis of image content in video analysis, and, more specifically, to methods of recovering camera locations and dense scene structure from video recordings.

BACKGROUND OF THE INVENTION

[003] Visual odometry (VO) is a fundamental problem in computer vision, addressing the recovery of camera poses from an input video sequence, which supports applications such as augmented reality, robotics and autonomous driving. Traditional indirect VO methods rely on the geometric analysis of sparse key-point correspondences to determine multi-view relationships among the input video frames. However, by virtue of relying on local feature detection and correspondence pre-processing modules, indirect methods pose the VO problem as a reprojection-error minimization task.

[004] State-of-the-art supervised learning frameworks have proven more effective at finding dense correspondences than standard hand-designed approaches. Deep-learning frameworks that jointly estimate depth, optical flow and camera motion have been presented, including those with the addition of recurrent neural networks for learning temporal information. However, deep learning methods are usually less explainable and have difficulty when transferring to unseen datasets or cameras with different calibration. As such, for fine-grained high-accuracy content-based estimation of camera motion, analytical solutions to multi-view geometry problems still provide the “gold-standard” in terms of accuracy.

SUMMARY OF THE INVENTION

[005] The present invention represents a novel dense-indirect approach for video analysis, eliminating the need for feature extraction for the geometric analysis of viewing rays and providing an alternative robust estimation framework with bounded computational burden. A log-logistic error model is employed, in contrast to the standard practice of using Gaussian error models. The proposed statistical error model and inference framework may be utilized as a general estimation procedure for applications beyond image-content analysis.

[006] The present invention provides information on both camera poses and dense scene structure. Camera poses are useful for applications that need localization, such as augmented reality, robotics and autonomous driving. Dense scene structure is useful for applications that need 3D reconstruction, such as 3D modelling, engineering surveys and geography research. Software embodying the present invention might benefit robotic, AR/VR, and autonomous navigation applications.

[007] These objectives are met by solving the problem of recovering camera locations and dense scene structure using video sequences from monocular image capture devices. More specifically, the present invention defines the dense indirect visual odometry (VO) problem as a probabilistic model and solves it with a generalized expectation maximization (i.e., EM) formulation for the joint inference of camera motion, pixel depth, and motion-track confidence from externally estimated optical flow fields under the supervision of an adaptive log-logistic distribution model.
[008] By applying learned correspondence estimators as input to a rigorous probabilistic geometric inference framework, the advantages of both supervised learning and analytical solutions may be leveraged.

[009] The methods of the present invention can be applied to autonomous driving applications. Such methods of the present invention can recover the driving track, as well as the scene geometry from car-mounted cameras that provide local and global environment geometry information to the autonomous driving system for further planning, such as to avoid collision, for navigation, etc.

[010] For augmented reality systems, the present invention can triangulate indoor geometry from phone cameras for the application of further providing occlusion-aware 3D interactions with a user interface.

[011] For engineering surveys, 3D modelling or geography, the present invention can triangulate an environment geometry (e.g., from drone-mounted cameras) that provides real-world 3D models for further applications, such as distance measuring, 3D-printing, geography analysis, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

[012] For a better understanding of the present invention, reference is made to the following detailed description of various exemplary embodiments considered in conjunction with the accompanying drawings, in which:
[013] FIG. 1 is a schematic diagram of a graphical probabilistic model in accordance with an embodiment of the present invention;
[014] FIG. 2 is a flow diagram of an algorithm in accordance with an embodiment of the present invention;
[015] FIG. 3 is a flow diagram illustrating an exemplary process for using the methods of the present invention;
[016] FIG. 4 is a conceptual diagram of the present invention;
[017] FIG. 5 is a model for depth inference;
[018] FIG. 6 is a schematic illustration of a Pose MAP approximation via a meanshift-based mode search;
[019] FIG. 7 shows graphical results of a Fisk model qualitative study;
[020] FIG. 8 shows graphical results of an ablation study and runtime; and
[021] FIG. 9 shows graphical results of performance analysis conducted over frame and speed on the KITTI benchmark.

DESCRIPTION OF EMBODIMENTS OF THE INVENTION

[022] Various embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the invention that can be embodied in various forms. In addition, each of the examples given in connection with the various embodiments is intended to be illustrative, and not restrictive. Further, the figures are not necessarily to scale, and some features may be exaggerated to show details of particular components (and any size, material and similar details shown in the figures are intended to be illustrative and not restrictive). Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the disclosed embodiments.

[023] Subject matter will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific exemplary embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or disclosed subject matter is intended to be construed as not being limited to any exemplary embodiments set forth herein, it being understood that such exemplary embodiments are provided merely to be illustrative.
Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. The following detailed description is, therefore, not intended to be taken in a limiting sense.

[024] Throughout the specification, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment and the phrases “in another embodiment” and “other embodiments” as used herein do not necessarily refer to a different embodiment. It is intended, for example, that covered or disclosed subject matter includes combinations of the exemplary embodiments in whole or in part.

[025] In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B, or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B, or C, here used in the exclusive sense. In addition, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

[026] The approaches proposed in accordance with the present invention provide an estimate of the relative displacements of a monocular camera system by taking as input dense optical flow estimates (which are dense pixel registration maps between two images) among successive frames.

[027] The present invention develops a dense indirect framework for monocular VO that takes as input externally computed optical flow from supervised-learning estimators. It was empirically observed that optical flow residuals tend to conform to a log-logistic (i.e., Fisk) distribution model parametrized by the optical flow magnitude. This insight was leveraged to propose a probabilistic framework that fuses dense optical flow sequences and jointly estimates camera motion, pixel depth, and motion-track confidence through a generalized-EM formulation. The approach is dense in the sense that each pixel corresponds to an instance of estimated random variables; it is indirect in the sense that it treats individual pixels as viewing rays within minimal feature-based multi-view geometry models (i.e. Perspective 3-Point, or P3P, for camera pose, 3D triangulation for pixel depth) and implicitly optimizes for reprojection error. Starting from a deterministic bootstrap of camera pose and pixel depths attained from optical flow inputs, the method of the present invention iteratively alternates the inference of depth, pose and track confidence over a batch of consecutive images.

[028] FIG. 1 is a VOLDOR probabilistic graphical model. Optical flow field sequences are modeled as observed variables subject to Fisk-distributed measurement errors.
Camera poses, depth maps and rigidness maps are modeled as hidden variables.

[029] FIG. 2 is an iterative estimation workflow. Externally computed optical flow estimates of a video sequence are input. Scene depth, camera poses and rigidness maps are alternately estimated by enforcing congruence between the predicted rigid flow and the input flow observations. Estimation is posed as a probabilistic inference task governed by a Fisk-distributed residual model.

[030] By proposing a probabilistic model relating the dependencies among camera movement, scene geometry and the input optical flow estimates, a generalized expectation maximization inference framework may be employed. Importantly, the inference is built around an empirically validated error model for the observations.

[031] Furthermore, the present invention constitutes a probabilistic dense, indirect visual odometry method taking as input externally estimated optical flow fields, as well as identifying and utilizing the log-logistic dense optical flow residual in its framework. The pose estimation relies on mode-seeking from dense sampling in the pose space of the cameras.

[032] There are several advantages to this framework. First, it constitutes a modular framework that is agnostic to the optical flow estimator engine, which takes full advantage of recent deep-learning optical flow methods. Contrary to learning-based monocular depth approaches, which develop and impose strong semantic priors, learning for optical flow estimation may be informed by photometric error and achieve better generalization. Moreover, by replacing sparse, hand-crafted feature inputs with learned dense optical flows, surface information for poorly textured (i.e., feature-less) regions can be obtained. Second, by leveraging the empirically-validated log-logistic residual model, highly accurate probabilistic estimates for scene depth and camera motion can be obtained, which estimates do not rely on Gaussian error assumptions. Third, the methods of the present invention are highly parallelizable, which also allows for real-time application on commodity GPU-based architectures. Given that there is only linear computational and storage growth, the inventive method is GPU-friendly.

[033] FIGS. 3 and 4 provide general overviews of the methods and components used in association with the present invention.

[034] Publicly available datasets and benchmarks were used to validate the correctness of the approach of the present invention. The present approach provides the most accurate estimation results in the current academic benchmarks. As a further modification of the processing related to the present invention, a hierarchical (i.e. coarse-to-fine) variant of the currently developed solution could be employed.

EXAMPLE 1

[035] Optical flow can be seen as a combination of rigid flow, which is relevant to the camera motion and scene structure, along with an unconstrained flow describing general object motion. The present VO method inputs a batch of externally computed optical flow fields and infers the underlying temporally-consistent scene structure (depth map), camera motions, as well as pixel-wise probabilities for the “rigidness” of each optical flow estimate. Furthermore, the system framework is posited under the supervision of an empirically validated adaptive log-logistic residual model over the end-point-error (EPE) between the estimated rigid flow and the input (i.e., observed) flow.
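The supervision signal just described is the per-pixel end-point error between the rigid flow predicted by the current scene and motion estimates and the externally observed flow. As a point of reference only, the following NumPy sketch (not part of the original disclosure; array shapes, names and the H x W x 2 flow layout are assumptions) computes the EPE and the flow magnitude used throughout the residual model:

    import numpy as np

    def end_point_error(observed_flow, rigid_flow):
        """Per-pixel end-point error (in pixels) between two H x W x 2 flow fields."""
        return np.linalg.norm(observed_flow - rigid_flow, axis=-1)

    def flow_magnitude(observed_flow):
        """Per-pixel magnitude of the observed flow, used to parametrize the residual model."""
        return np.linalg.norm(observed_flow, axis=-1)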
GEOMETRIC NOTATION

[036] A sequence of externally computed (observed) dense optical flow fields is input: X = {X_t | t = 1, ···, t_N}, where X_t is the optical flow map from image I_{t−1} to I_t, while X_t^j denotes the optical flow vector of pixel j at time t. The aim is to infer the camera poses T = {T_t | t = 1, ···, t_N}, where T_t ∈ SE(3) represents the relative motion from time t − 1 to t.

[037] To define a likelihood model relating prior observations X to T, two additional (latent) variable types were introduced: 1) a depth field θ defined over I_0, and 2) a rigidness probability map W_t associated to each X_t; where θ^j denotes the depth value at pixel j, W = {W_t | t = 1, ···, t_N} denotes the set of rigidness maps, and W_t^j denotes the rigidness probability of pixel j at time t.

[038] Having the depth map θ and rigidness maps W, the rigid flow ξ_t(θ^j) can be obtained through applying the rigid transformations T to the point cloud associated with θ, conditioned on W. Assuming T_0 = I, π_t(θ^j) denotes the pixel coordinate of projecting the 3D point associated with θ^j into the camera image plane at time t using the given camera poses T, by

    π_t(θ^j) ≅ K (T_t T_{t−1} ··· T_1) (θ^j K^{−1} (x_j, y_j, 1)^T),

[039] where K is the camera intrinsic matrix, x_j, y_j are the image coordinates of pixel j, and ≅ denotes equality of homogeneous image coordinates up to the projective scale. Hence, the rigid flow can be defined as ξ_t(θ^j) = π_t(θ^j) − π_{t−1}(θ^j).
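Paragraphs [036]-[039] amount to a back-projection and re-projection of every pixel of I_0. The following NumPy sketch is illustrative only: the function and argument names are my own, and the composed camera-to-frame poses (T_t ··· T_1) are assumed to be supplied as 4 x 4 homogeneous matrices.

    import numpy as np

    def back_project(depth, K):
        """Lift every pixel of the reference image I_0 to a 3D point: theta^j * K^{-1} (x_j, y_j, 1)^T."""
        h, w = depth.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        rays = np.linalg.inv(K) @ np.stack([xs, ys, np.ones_like(xs)], 0).reshape(3, -1)
        return (rays * depth.reshape(1, -1)).T          # (H*W, 3) points in the I_0 camera frame

    def project(points, K, T_0_to_t):
        """Apply the composed pose T_t...T_1 and project with K, returning pixel coordinates."""
        p = T_0_to_t[:3, :3] @ points.T + T_0_to_t[:3, 3:4]   # rigid transform
        uv = K @ p
        return (uv[:2] / uv[2:3]).T                            # dehomogenize -> (H*W, 2)

    def rigid_flow(depth, K, T_prev, T_curr):
        """Rigid flow xi_t(theta^j) = pi_t(theta^j) - pi_{t-1}(theta^j) for every pixel of I_0."""
        pts = back_project(depth, K)
        return project(pts, K, T_curr) - project(pts, K, T_prev)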
MIXTURE LIKELIHOOD MODEL

[040] The residuals between the observation flow and the rigid flow were modeled with respect to the continuous rigidness probability W_t^j,

    P(X_t^j | θ^j, T_t, W_t^j) = ρ(X_t^j | ξ_t(θ^j))^(W_t^j) · μ(X_t^j)^(1 − W_t^j),        (2)

where the probability density function ρ(X_t^j | ξ_t(θ^j)) represents the probability of having the rigid flow ξ_t(θ^j) under the observation flow X_t^j, and μ(·) is a uniform distribution whose density varies with the observed flow vector X_t^j. These two functions will be defined hereinbelow. Henceforth, when modeling the probability of X_t, only T_t will be given in the conditional probability, although it should be noted that in these instances the projection also depends on the preceding camera poses T_1, ···, T_{t−1}, which are assumed fixed and for which X_t inherently does not contain any information. Moreover, jointly modeling for all previous camera poses along with X_t would bias them as well as increase the computational complexity. In the following paragraph, Eq. (2) will be noted simply as P(X_t^j | θ^j, T_t, W_t^j). At this point, the visual odometry problem can be modeled as a maximum likelihood estimation problem,

    {θ*, T*, W*} = argmax_{θ, T, W} ∏_t ∏_j P(X_t^j | θ^j, T_t, W_t^j).
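As a small illustration of Eq. (2) (a sketch only, not the original implementation), the per-pixel log-likelihood can be evaluated once the inlier density ρ and outlier density μ are available as callables; concrete Fisk-based choices are sketched after the FISK RESIDUAL MODEL section below, and can be bound to this signature with, e.g., functools.partial.

    import numpy as np

    def mixture_log_likelihood(epe, flow_mag, rigidness, rho, mu):
        """Per-pixel log of Eq. (2): W * log rho(EPE, ||X||) + (1 - W) * log mu(||X||).

        epe       : end-point error between observed and rigid flow, per pixel
        flow_mag  : magnitude of the observed flow, per pixel
        rigidness : continuous rigidness probability W_t^j in [0, 1]
        rho, mu   : callables returning the inlier / outlier densities
        """
        return (rigidness * np.log(rho(epe, flow_mag) + 1e-12)
                + (1.0 - rigidness) * np.log(mu(flow_mag) + 1e-12))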
[041] Furthermore, spatial consistency is promoted among both the dense hidden variables θ and W, through different mechanisms described hereinbelow.

FISK RESIDUAL MODEL

[042] The choice of an observation residual model impacts accurate statistical inference from a reduced number of observations, where in this instance a residual is defined as the end-point error in pixels between two optical flow vectors.

[043] In practice, hierarchical optical flow methods (i.e. relying on recursive scale-space analysis) tend to amplify estimation errors in proportion to the magnitude of the pixel flow vector. In light of this, an adaptive residual model was explored for determining the residual distribution w.r.t. the magnitude of the optical flow observations. The residual distribution of multiple leading optical flow methods w.r.t. the groundtruth was analyzed. The empirical distribution was fitted to five different analytic models, and it was found that the Fisk distribution yielded the most congruent shape over all flow magnitudes. To quantify the goodness of fit, Kolmogorov-Smirnov (K-S) test results were reported, which quantify the supremum distance (D value) between the CDFs of the empirical distribution and a reference analytic distribution.

[044] Hence, given the observed flow X_t^j, the probability of the rigid flow ξ_t(θ^j) matching the underlying groundtruth was modeled as

    ρ(X_t^j | ξ_t(θ^j)) = F( ‖X_t^j − ξ_t(θ^j)‖ ; α(‖X_t^j‖), β(‖X_t^j‖) ),

where the functional form of the PDF of the Fisk distribution F is given by

    F(x; α, β) = (β/α) (x/α)^(β−1) / (1 + (x/α)^β)^2,  x ≥ 0.

[045] The parameters of the Fisk distribution were determined as functions of the observed flow magnitude ‖X_t^j‖. Since a clear linear correspondence was shown in the results, instead of using a look-up table, fitted functions were applied to find the parameters (the fitted forms are reproduced as images in the original filing), where a_1, a_2 and b_1, b_2 are learned parameters depending on the optical flow estimation method.

[046] Next, the outlier likelihood function μ(·) was modeled. The general approach is to assign outliers to a uniform distribution to improve robustness. In the present example, to utilize the prior given by the observation flow, the density of the uniform distribution is provided as a function μ(·) of the observation flow vector,

    μ(X_t^j) = F( λ‖X_t^j‖ ; α(‖X_t^j‖), β(‖X_t^j‖) ),        (8)

where λ is a hyper-parameter adjusting the density, which is also the strictness for choosing inliers. The numerical interpretation of λ is the optical flow percentage EPE at which an observation becomes indistinguishable from an outlier. Hence, flows with different magnitudes can be compared under a fair metric when being selected as inliers.
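A minimal sketch of the adaptive Fisk residual model follows. It is illustrative only: fit_alpha and fit_beta stand in for whatever fitted parameter functions of the flow magnitude are learned per optical flow estimator (paragraph [045]), and lam corresponds to the hyper-parameter λ of Eq. (8).

    import numpy as np

    def fisk_pdf(x, alpha, beta):
        """Log-logistic (Fisk) PDF, defined for x >= 0."""
        x = np.maximum(x, 1e-12)                      # avoid division by zero at x = 0
        z = (x / alpha) ** beta
        return (beta / alpha) * (x / alpha) ** (beta - 1.0) / (1.0 + z) ** 2

    def adaptive_params(flow_mag, fit_alpha, fit_beta):
        """Fisk parameters as fitted functions of the observed flow magnitude (paragraph [045])."""
        return fit_alpha(flow_mag), fit_beta(flow_mag)

    def rho(epe, flow_mag, fit_alpha, fit_beta):
        """Inlier density: Fisk PDF of the end-point error, adapted to the flow magnitude."""
        a, b = adaptive_params(flow_mag, fit_alpha, fit_beta)
        return fisk_pdf(epe, a, b)

    def mu(flow_mag, lam, fit_alpha, fit_beta):
        """Outlier density of Eq. (8): constant per pixel, set to the Fisk density at EPE = lam * ||X||."""
        a, b = adaptive_params(flow_mag, fit_alpha, fit_beta)
        return fisk_pdf(lam * flow_mag, a, b)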
INFERENCE

[047] In this section, an iterative inference framework is introduced which alternately optimizes the depth map, the camera poses and the rigidness maps.

DEPTH AND RIGIDNESS UPDATE
GENERALIZED EXPECTATION-MAXIMIZATION (GEM)

[048] Depth θ and its rigidness W over time were inferred while assuming a fixed, known camera pose T. The true posterior P(θ, W | X, T) was approximated through a GEM framework. In this section, Eq. (2) will be given as P(X_t^j | θ^j, W_t^j), where the fixed T is omitted. The intractable real posterior was approximated with a restricted family of distributions, where

    q(θ^j, W^j) = q(θ^j) ∏_t q(W_t^j).

For tractability, q(θ^j) is further constrained to the family of Kronecker delta functions,

    q(θ^j) = δ(θ^j = θ̂^j),

where θ̂^j is a parameter to be estimated. Moreover, q(W_t) inherits the smoothness defined on the rigidness map W_t in Eq. (11), which was previously shown to minimize the KL divergence between the variational distribution and the true posterior. In the M step, an estimate of an optimal value for θ^j, given an estimated PDF on W_t^j, was sought. Next, the choice of the estimators used for this task will be explained.

MAXIMUM LIKELIHOOD ESTIMATOR (MLE)

[049] The standard definition of the MLE for the problem is given by

    θ̂^j = argmax_{θ^j} Σ_t E_{q(W_t^j)} [ log P(X_t^j | θ^j, W_t^j) ],

where q(W_t^j) is the estimated distribution density given by the E step. However, it was found empirically that the MLE criteria tended to be overly sensitive to inaccurate initialization. More specifically, the depth map was bootstrapped using only the first optical flow, and its depth values were used to sequentially bootstrap subsequent camera poses (further details will follow). Hence, for noisy/inaccurate initializations, using MLE for estimate refinement will impose high selectivity pressure on the rigidness probabilities W, favoring a reduced number of higher-accuracy initializations. Given the sequential nature of the image-batch analysis, this tends to effectively reduce the set of useful down-stream observations used to estimate subsequent cameras.

MAXIMUM INLIER ESTIMATION (MIE)

[050] To reduce the bias caused by initialization and sequential updating, the MLE criteria were relaxed to the following MIE criteria,

    θ̂^j = argmax_{θ^j} Σ_t q(W_t^j = 1 | θ^j),

which finds a depth maximizing the rigidness (inlier selection) map W. Experimental details regarding the MIE criteria are given hereinbelow.
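One plausible reading of the MLE and MIE criteria above is sketched below for a single pixel and a single candidate depth value. This is an assumption-laden illustration, not the original implementation: rho and mu are callables with the signatures used in the mixture sketch earlier, and the MIE branch re-evaluates the mixture responsibilities induced by the candidate to count expected inliers.

    import numpy as np

    def score_candidate_depth(epe_per_frame, flow_mag_per_frame, w_post, rho, mu, criterion="MIE"):
        """Score one candidate depth value theta^j for a single pixel.

        epe_per_frame      : end-point errors of the rigid flow induced by the candidate, one per frame t
        flow_mag_per_frame : observed flow magnitudes, one per frame t
        w_post             : E-step rigidness posteriors q(W_t^j = 1), one per frame t
        rho, mu            : inlier / outlier densities (see the Fisk sketch above)
        """
        p_in = rho(epe_per_frame, flow_mag_per_frame)
        p_out = mu(flow_mag_per_frame)
        if criterion == "MLE":                 # rigidness-weighted log-likelihood
            return np.sum(w_post * np.log(p_in + 1e-12))
        # MIE: expected inlier count induced by the candidate (responsibilities re-evaluated)
        resp = (w_post * p_in) / (w_post * p_in + (1.0 - w_post) * p_out + 1e-12)
        return np.sum(resp)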
[051] The per-pixel depth estimate was optimized through a sampling-propagation scheme as shown in FIG. 5. The 2D image field is broken into alternately directed 1D chains, while depth values are propagated through each chain. Hidden Markov chain smoothing is imposed on the rigidness maps. A randomly sampled depth value is compared with the previous depth value θ^j, together with a value propagated from the previous neighbor θ^{j−1}. Then θ^j will be updated to the best estimate among these three options. The updated θ^j will further be propagated to the neighbor pixel j + 1.
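A minimal sketch of one left-to-right sweep of this sampling-propagation scheme is given below. It is illustrative only: score_fn stands for the chosen MLE/MIE scoring of a candidate depth at pixel j (e.g., the sketch above), and the inverse-depth sampling range is an arbitrary assumption.

    import numpy as np

    def sweep_chain(depth_chain, score_fn, rng):
        """One pass of the sampling-propagation scheme over a 1D chain of pixels.

        depth_chain : current depth estimates along the chain (modified in place)
        score_fn    : callable(j, depth_value) -> scalar score of a candidate depth at pixel j
        rng         : numpy random Generator used to draw the random candidate (inverse-depth sampling)
        """
        for j in range(len(depth_chain)):
            candidates = [depth_chain[j]]                      # current value
            if j > 0:
                candidates.append(depth_chain[j - 1])          # value propagated from the previous neighbor
            candidates.append(1.0 / rng.uniform(1e-3, 1.0))    # random sample, drawn in inverse depth
            scores = [score_fn(j, d) for d in candidates]
            depth_chain[j] = candidates[int(np.argmax(scores))]
        return depth_chain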
UPDATING THE RIGIDNESS MAPS

[052] A scheme where the image is split into rows and columns was adopted, reducing the 2D image to several 1D hidden Markov chains, and a pairwise smoothness term is posed on the rigidness map,

    P(W_t^{j+1} | W_t^j) = γ if W_t^{j+1} = W_t^j, and 1 − γ otherwise,        (11)

where γ is a transition probability encouraging similar neighboring rigidness. In the E step, the rigidness maps W were updated according to θ. Given the smoothness defined in Eq. (11), the forward-backward algorithm is used for inferring W in the hidden Markov chain,

    q(W_t^j) = (1/A) · m_fwd(W_t^j) · m_bwd(W_t^j),

where A is a normalization factor, while m_fwd(W_t^j) and m_bwd(W_t^j) are the forward and backward messages, computed recursively as

    m_fwd(W_t^j) = P(X_t^j | θ^j, W_t^j) · Σ_{W_t^{j−1}} P(W_t^j | W_t^{j−1}) · m_fwd(W_t^{j−1}),
    m_bwd(W_t^j) = Σ_{W_t^{j+1}} P(X_t^{j+1} | θ^{j+1}, W_t^{j+1}) · P(W_t^{j+1} | W_t^j) · m_bwd(W_t^{j+1}),

where P(X_t^j | θ^j, W_t^j) is the emission probability referred to in Eq. (2).
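The forward-backward pass over one binary rigidness chain is standard and can be sketched as follows (illustrative only; the uniform prior on the first pixel and the per-step rescaling for numerical stability are assumptions of this sketch).

    import numpy as np

    def rigidness_posteriors(emit_inlier, emit_outlier, gamma=0.9):
        """Forward-backward pass over one 1D rigidness chain (binary states: outlier=0, rigid=1).

        emit_inlier  : per-pixel emission density under the rigid-flow model, rho(X | xi(theta))
        emit_outlier : per-pixel emission density under the outlier model, mu(X)
        gamma        : transition probability of keeping the same rigidness state (Eq. (11))
        """
        n = len(emit_inlier)
        emis = np.stack([emit_outlier, emit_inlier], axis=1)          # (n, 2)
        trans = np.array([[gamma, 1 - gamma], [1 - gamma, gamma]])
        fwd = np.zeros((n, 2))
        bwd = np.ones((n, 2))
        fwd[0] = emis[0] * 0.5                                        # uniform prior on the first pixel
        for j in range(1, n):
            fwd[j] = emis[j] * (trans @ fwd[j - 1])
            fwd[j] /= fwd[j].sum()                                    # rescale for numerical stability
        for j in range(n - 2, -1, -1):
            bwd[j] = trans @ (emis[j + 1] * bwd[j + 1])
            bwd[j] /= bwd[j].sum()
        post = fwd * bwd
        return post[:, 1] / post.sum(axis=1)                          # q(W_t^j = 1) per pixel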
POSE UPDATE

[053] Camera poses were updated while assuming fixed, known depth θ and rigidness maps W. The optical flow chains in X were used to determine the 2D projection of any given 3D point extracted from the depth map. Since the aim is to estimate relative camera motion, the scene depth was expressed relative to the camera pose at time t − 1, and the attained 3D-2D correspondences were used to define a dense PnP instance. This instance was solved by estimating the mode of an approximated posterior distribution given by Monte-Carlo sampling of the pose space through (minimal) P3P instances. The robustness to outlier correspondences and the independence from initialization provided to the present method are useful for bootstrapping the visual odometry system described herein, where the camera pose needs to be estimated from scratch with an uninformative rigidness map (all ones).

[054] The problem can be written down as a maximum a posteriori (MAP) estimation,

    T̂_t = argmax_{T_t} P(T_t | X, θ, W).

[055] Finding the optimum camera pose is equal to computing the maximum of the posterior distribution P(T_t | X, θ, W), which is not tractable since it requires integrating over T to compute P(X). A Monte-Carlo based approximation was used, where for each depth map position θ^j two additional distinct positions {k, l} were randomly sampled to form a 3-tuple θ^g = (θ^j, θ^k, θ^l), with associated rigidness values W_t^g = (W_t^j, W_t^k, W_t^l), to represent the g-th group. Then the posterior can be approximated as

    P(T_t | X, θ, W) ≈ (1/S) Σ_{g=1}^{S} P(T_t | X_t^g, θ^g, W_t^g),

where S is the total number of groups. Although the posterior P(T_t | X_t^g, θ^g, W_t^g) is still not tractable, using 3 pairs of 3D-2D correspondences, PnP reaches its minimal form of P3P, which can be solved efficiently using the P3P algorithm. Hence,

    T̂_t^g = P3P( {P_{t−1}^i}_{i∈g}, {p_t^i}_{i∈g} ),

where P3P(·,·) denotes the P3P solver, for which AP3P is used. The first input argument indicates the 3D coordinates P_{t−1}^i of the selected depth map pixels at time t − 1, obtained by combining the previous camera poses, while the second input argument is their 2D correspondences p_t^i at time t, obtained using the optical flow displacement.
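The Monte-Carlo sampling step can be sketched as below. This is not the original implementation: p3p_solve is a placeholder for any minimal P3P solver (the filing uses AP3P) that maps three 3D points and three calibrated 2D observations to a 4 x 4 pose (or None on failure), and weighting each group by the product of its rigidness values is one plausible reading of the weighting described in the following paragraph.

    import numpy as np

    def sample_pose_candidates(points3d_prev, points2d_curr, rigidness, n_samples, p3p_solve, rng):
        """Monte-Carlo sampling of the pose space through minimal P3P instances.

        points3d_prev : (N, 3) 3D coordinates of depth-map pixels in the camera frame at time t-1
        points2d_curr : (N, 2) their 2D correspondences at time t, in normalized (calibrated) coordinates
        rigidness     : (N,) rigidness probabilities W_t^j used to weight each group
        p3p_solve     : external minimal solver (assumed available), e.g. an AP3P implementation
        """
        poses, weights = [], []
        for _ in range(n_samples):
            g = rng.choice(len(points3d_prev), size=3, replace=False)   # a random 3-tuple of pixels
            T = p3p_solve(points3d_prev[g], points2d_curr[g])
            if T is None:                                               # solver may fail / be degenerate
                continue
            poses.append(T)
            weights.append(float(np.prod(rigidness[g])))                # down-weight groups with outliers
        return poses, np.array(weights)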
Hence, a tractable variational distribution q(T_t) is used to approximate the true posterior,

    q_g(T_t) = N(T_t; T̂_t^g, Σ),

where q_g(T_t) is a normal distribution with mean T̂_t^g and a predefined fixed covariance matrix Σ, for simplicity. Furthermore, each variational distribution is weighted with the rigidness values W_t^g of its group, such that potential outliers indicated by the rigidness maps can be excluded or down-weighted. Then, the full posterior can be approximated as

    P(T_t | X, θ, W) ≈ ( Σ_g w_g q_g(T_t) ) / ( Σ_g w_g ),        (18)

where w_g denotes the weight of the g-th group.

[056] The posterior P(T_t | X, θ, W) was thus approximated with a weighted combination of the q_g(T_t). Solving for the optimum T_t on the posterior equates to finding the mode of the posterior. Since all q_g(T_t) are assumed to share the same covariance structure, mode finding on this distribution equates to applying meanshift with Gaussian kernels of covariance Σ. Note that since T_t lies in SE(3), while meanshift is applied in a vector space, the obtained mode cannot be guaranteed to lie in SE(3). Thus, poses T_t are first converted to 6-vectors in the Lie algebra se(3) through logarithm mapping, and meanshift is applied in the 6-vector space. FIG. 6 shows the pose MAP approximation via a meanshift-based mode search. Each 3D-2D correspondence is part of a unique minimal P3P instance, constituting a pose sample that is weighted by the rigidness map. Samples were mapped to the Lie algebra of SE(3) and meanshift was run to find the mode.
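A weighted meanshift mode search on the 6-vector pose samples can be sketched as below. This is an illustration under stated assumptions: the logarithm/exponential maps between SE(3) and se(3) are assumed to be available elsewhere, and initialization from the highest-weighted sample is a simplification (the filing initializes from the previous iteration's pose estimate, as described under IMPLEMENTATION DETAILS).

    import numpy as np

    def meanshift_mode(xi_samples, weights, bandwidth_inv_cov, n_iters=50, tol=1e-6):
        """Weighted meanshift with a fixed Gaussian kernel on 6-vector (se(3)) pose samples.

        xi_samples        : (S, 6) pose samples after logarithm mapping to the Lie algebra
        weights           : (S,) per-sample weights derived from the rigidness maps
        bandwidth_inv_cov : (6, 6) inverse of the kernel covariance Sigma of Eq. (18)
        """
        mode = xi_samples[np.argmax(weights)].copy()          # start from the highest-weighted sample
        for _ in range(n_iters):
            d = xi_samples - mode
            k = weights * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, bandwidth_inv_cov, d))
            new_mode = (k[:, None] * xi_samples).sum(axis=0) / (k.sum() + 1e-12)
            if np.linalg.norm(new_mode - mode) < tol:
                break
            mode = new_mode
        return mode                                            # map back to SE(3) with the exponential map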
Algorithm Integration

[Table 1, which lists the integrated VOLDOR algorithm, is reproduced as an image in the original filing.]

[057] The integrated workflow of the present visual odometry algorithm, denoted VOLDOR, will now be described. Per Table 1, the input is a sequence of dense optical flows X = {X_1, ···, X_{t_N}} and the output will be the camera poses of each frame, T = {T_1, ···, T_{t_N}}, as well as the depth map θ of the first frame. Usually, 4-8 optical flows per batch are used. Firstly, VOLDOR initializes all W to one, and T_1 is initialized from epipolar geometry estimated from X_1 using a least median of squares estimator or, alternatively, from previous estimates if available (i.e. overlapping consecutive frame batches). Then, θ is obtained from two-view triangulation using T_1 and X_1. Next, the optimization loop between camera pose, depth map and rigidness map runs until convergence, usually within 3 to 5 iterations. Note that the rigidness maps were not smoothed before updating camera poses, to prevent loss of fine details indicating potential high-frequency noise in the observations.

EXPERIMENTS
KITTI BENCHMARK

[058] The KITTI odometry benchmark, featuring a car driving in urban and highway environments, was tested. PWC-Net was used as the external dense optical flow input. The sliding window size was set to 6 frames. λ in Eq. (8) was set to 0.15, and γ in Eq. (11) to 0.9. The Gaussian kernel covariance matrix Σ in Eq. (18) was set to diagonal, scaled to 0.1 and 0.004 in the dimensions of translation and rotation, respectively. The hyper-parameters for the Fisk residual model were obtained from the analysis of the residual distribution described hereinabove (the fitted values are reproduced as an image in the original filing). Finally, absolute scale is estimated from the ground plane by taking the mode of pixels whose surface normal vectors are near perpendicular to the ground plane. More details of ground plane estimation are provided hereinbelow.
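As a recap of the integrated workflow of paragraph [057] and Table 1, a minimal structural sketch of the alternating optimization is given below. It is a skeleton only, under the assumption that the individual initialization and update routines (epipolar bootstrap, triangulation, and the depth, rigidness and pose updates sketched in the preceding sections) are supplied as callables; it is not the original implementation.

    import numpy as np

    def voldor_batch(flows, K, init_pose, triangulate, update_depth, update_rigidness, update_pose,
                     n_iters=5):
        """One optimization window of a VOLDOR-style alternating inference (illustrative skeleton).

        flows       : list of dense optical flow fields X_1..X_N for the batch (H x W x 2 arrays)
        init_pose   : callable(X_1) -> T_1, e.g. an epipolar-geometry bootstrap
        triangulate : callable(T_1, X_1) -> initial depth map of the first frame
        update_depth / update_rigidness / update_pose : the alternating update steps
        """
        n = len(flows)
        rigidness = [np.ones(flows[0].shape[:2]) for _ in range(n)]   # uninformative rigidness (all one)
        poses = [init_pose(flows[0])] + [None] * (n - 1)
        depth = triangulate(poses[0], flows[0])
        for _ in range(n_iters):                                      # usually converges in 3-5 iterations
            for t in range(n):
                poses[t] = update_pose(flows, depth, rigidness, poses, t, K)
            depth = update_depth(flows, depth, rigidness, poses, K)
            rigidness = update_rigidness(flows, depth, poses, K)
        return poses, depth, rigidness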
[Table 2, reproduced as an image in the original filing.]
Table 2

[060] Table 2 shows results on the KITTI odometry training set sequences 0-10. The translation and rotation errors are averaged over all sub-sequences of length from 100 meters to 800 meters, in 100 meter steps. VISO2 and MLM-SFM were picked as baselines, whose scales are also estimated from ground height.

[061] Table 3 compares the present results with recent popular methods on KITTI test set sequences 11-21, as reported on the KITTI odometry official ranking board. As the results show, VOLDOR has achieved top-ranking accuracy on the KITTI dataset among monocular methods. An asterisk (*) indicates that a method is based on stereo input.
[Table 3, reproduced as an image in the original filing.]
Table 3

[062] Table 4 shows the depth map quality on the KITTI stereo benchmark. Foreground moving objects were masked out, and depth was aligned with the groundtruth to resolve scale ambiguity. The depth quality was evaluated separately for different rigidness probabilities, where W^j = Σ_t W_t^j denotes the sum of the pixel rigidness over the optimization window. The EPE of PSMNet, GC-Net and GA-Net was measured on the stereo 2012 test set, and the background outlier percentage was measured on the stereo 2015 test set, while the present method is measured on the stereo 2015 training set. In Table 4, a pixel is considered an outlier if its disparity EPE is both > 3 px and > 5% of the groundtruth disparity.

[Table 4, reproduced as an image in the original filing.]
Table 4

TUM RGB-D Benchmark

[Table 5, reproduced as an image in the original filing.]
Table 5

[063] Accuracy experiments on TUM RGB-D compared VOLDOR vs. full SLAM systems; the values in Table 5 are translation RMSE in meters. In all instances, trajectories were rigidly aligned to the groundtruth for segments of 6 frames, and the mean translation RMSE over all segments was estimated. Parameters remained the same as in the KITTI experiments. The comparative baselines are an indirect sparse method, ORB-SLAM2, a direct sparse method, DSO, and a dense direct method, DVO-SLAM. Per Table 5, VOLDOR performs well under indoor capture exhibiting smaller camera motions and diverse motion patterns.

ABLATION AND PERFORMANCE STUDY

[064] FIG. 7 visualizes the depth likelihood function and the camera pose sampling distribution. FIGS. 7(a) and 7(c) visualize the depth likelihood function under Gaussian and Fisk residual models with the MLE and MIE criteria. Dashed lines indicate the likelihood given by a single optical flow. Solid lines are the joint likelihood obtained by fusing all dashed lines. The MLE and MIE estimates are therefore depicted. FIGS. 7(b) and 7(d) visualize the epipole distribution for 40K camera pose samples. For better visualization, the density color bars of (b) and (d) are scaled differently. With the Fisk residual model, the depth likelihood from each single frame had a well localized extremum (FIG. 7(c)), compared to a Gaussian residual model (FIG. 7(a)). This leads to a joint likelihood having a more distinguishable optimum, and results in more concentrated camera pose samplings (FIGS. 7(b), 7(d)). Also, with the MIE criteria, the depth likelihood of the Fisk residual model is relaxed to a smoother shape whose effectiveness is further analyzed in FIG. 8, while a Gaussian residual model is agnostic to the choice between MIE and MLE (FIG. 7(a)). Per the quantitative study in FIG. 8(b), compared to other analytic distributions, the Fisk residual model gives significantly better depth estimation when only a small number of reliable observations (low W^j) are used. Performance across different residual models tends to converge as the number of reliable samples increases (high W^j), while the Fisk residual model still provides the lowest EPE.

[065] FIG. 8(a) shows the camera pose error of VOLDOR under different residual models and dense optical flow inputs (* indicating that, due to noisy ground estimations given by C2F-Flow, its scale is corrected using the groundtruth). FIG. 8(b) shows depth map accuracy under different residual models. FIGS. 8(c) and 8(d) show the runtime of the present method tested on a GTX 1080Ti GPU. FIG. 8(a) shows the ablation study on camera pose estimation for three optical flow methods, four residual models and the proposed MIE criteria. The accuracy of combining PWC-Net optical flow, a Fisk residual model and the presently proposed MIE criteria strictly dominates (in the Pareto sense) all other combinations. FIG. 8(b) shows that the MIE criteria yield depth estimates that are more consistent across the entire image sequence, leading to improved overall accuracy. However, in the extreme case of predominantly unreliable observations (very low W^j), MLE provides the most accurate depth. FIG. 8(c) shows the overall runtime evaluation for each component under different frame numbers. FIG. 8(d) shows the runtime for the pose update under different sample rates. It should be understood that the data (e.g., that in FIG. 8) can be represented in a variety of manners depending on operating conditions, definitions and other factors, such as scale.
GROUND PLANE ESTIMATION

[066] Table 6 provides the ground plane estimation algorithm used with the KITTI odometry benchmark.

[Table 6, reproduced as an image in the original filing.]
Table 6

[067] In the ground plane estimation algorithm of Table 6, a normal vector and a height were estimated for each pixel in the ROI, and the height was scaled by its median. Meanshift was applied to find the mode with prior knowledge of the ground normal and the height scale (median). Finally, the estimated normal vector was actively checked to determine if the ground estimate was correct.

COMPARISON WITH DEEP-LEARNING VO

[068] As Table 7 shows, the visual odometry result is compared with recent SOTA deep-learning VO methods, where ORB-SLAM is used as the baseline. Specifically, it shows a comparison with recent deep-learning methods on KITTI visual odometry benchmark sequences 09 and 10.

[Table 7, reproduced as an image in the original filing.]
Table 7

[069] Furthermore, the performance of the present method will now be detailed. Due to the camera motion, the geometry representation of the first frame's depth map causes the coverage of registrable depth pixels to vary with the frames within the optimization window. Some of the depth pixels can only be registered in a small number of frames, and this will yield less reliable depth estimates since they are triangulated from fewer observations. However, those depth pixels will increase the depth coverage of the frames where they are observed. With a different balance between the reliability and coverage of depth pixels, the performance of each frame within the optimization window varies. In FIG. 9, the performance over 10 frames was tested separately using the KITTI dataset. As the results show, the third and fourth frames usually have the best performance. This property will be utilized hereinbelow when discussing the fusion of segment estimates.

[070] FIG. 9(a) shows rotation error over frame. FIG. 9(b) shows translation error over frame. FIG. 9(c) shows rotation error over speed. FIG. 9(d) shows translation error over speed. FIGS. 9(c) and 9(d) specifically show the performance over different camera moving speeds in the KITTI dataset. The inventive method can work stably over a wide range of speeds of 10-50 km/h. When the speed is low, camera baselines are small and the triangulation has large uncertainty, which decreases the performance. When the speed is high, frames have limited overlap and provide little mutual information, which also decreases the performance.

IMPLEMENTATION DETAILS
DEPTH UPDATE

[071] In the depth update process, a depth value is sampled for each pixel from a uniform distribution in the inverse-depth representation, while the best depth value is kept. This process can be done multiple times to achieve better convergence. In practice, one depth value was sampled for each pixel when testing for camera pose accuracy, but two when testing depth accuracy. Finally, the propagation scheme was applied from four directions, one time each, to spread the depth values. While computing the likelihood of a depth value, if a pixel falls outside the image boundary, its rigidness from that time step on was set to zero. Bilinear interpolation was used to obtain flow vectors at continuous positions of the observed optical flow field.

POSE UPDATE

[072] In the pose sampling process, besides only referring to the rigidness map, pixels were selected that fall in a certain range of depth. Pixels with depth were picked,
POSE UPDATE

[072] In the pose sampling process, in addition to referring to the rigidness map, pixels were selected that fall within a certain depth range. Pixels were picked whose depth satisfies a criterion (reproduced as an equation in the original filing) in which θj is the pixel depth value and t is the translation vector magnitude. In the meanshift process, since a low variance of the pose weights was observed, the weights were binarized with a threshold of 0.5 for a faster meanshift implementation. For meanshift initialization, the estimated camera pose from the previous iteration was used. In the case of the first iteration, the camera pose of the previous time step is used if that initialization obtains a kernel weight larger than 0.1. Otherwise, 10 start points are randomly picked and the initialization uses the start point with the largest kernel weight.
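A hedged sketch of the pose-update bookkeeping in paragraph [072] follows: weights are binarized at 0.5, and the mean-shift start point is chosen from the previous-iteration pose, from the previous time step (if its kernel weight exceeds 0.1), or as the best of 10 random starts. The 6-vector pose parameterization, the Gaussian kernel form and its normalization, and the bandwidth are assumptions of the sketch, not details stated in the text.

```python
import numpy as np

def kernel_weight(pose, samples, weights, bandwidth):
    """Normalized Gaussian-kernel mass around `pose` (normalization is assumed here)."""
    d2 = np.sum((samples - pose) ** 2, axis=1) / bandwidth ** 2
    return np.sum(weights * np.exp(-0.5 * d2)) / (np.sum(weights) + 1e-12)

def init_meanshift(samples, raw_weights, prev_iter_pose=None, prev_step_pose=None,
                   bandwidth=0.05, n_random=10, rng=None):
    rng = rng or np.random.default_rng()
    w = (raw_weights > 0.5).astype(float)          # binarize pose weights at 0.5
    if prev_iter_pose is not None:                 # later iterations: reuse last estimate
        return prev_iter_pose
    if prev_step_pose is not None and \
       kernel_weight(prev_step_pose, samples, w, bandwidth) > 0.1:
        return prev_step_pose                      # first iteration: previous time step
    starts = samples[rng.choice(len(samples), size=n_random, replace=False)]
    scores = [kernel_weight(s, samples, w, bandwidth) for s in starts]
    return starts[int(np.argmax(scores))]          # else: best of 10 random starts

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 0.05, size=(4000, 6))    # hypothetical 6-DoF pose samples
raw_w = rng.uniform(0.0, 1.0, size=4000)
start = init_meanshift(samples, raw_w, prev_step_pose=np.zeros(6), rng=rng)
print("mean-shift start point:", start)
```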
TRUNCATION

[073] In the overall workflow, for robustness, badly registered frames are actively detected and the optimization window is truncated. This mainly happens when the camera motion is large, such that the depth map of the first frame cannot be registered to the tail frames. Two criteria are used to trigger a truncation at time step t: 1) the rigidness map at time t has fewer than 2000 inlier pixels, or 2) after the meanshift for the camera pose at time t converges, the kernel weight is less than 0.01. Truncation reduces the optimization window size to t − 1 and forces the algorithm to run 3 more iterations after truncation to ensure convergence.

FUSION

[074] The present method treats each optimization window independently. Thus, a fusion step is applied as post-processing to obtain the full visual odometry result. In the KITTI experiment, a sliding window of size 6 with step 1 was applied, yielding 6 pose candidates for each time step. As shown in FIG. 9, the pose quality of each frame in the optimization window varies. In light of this, poses of better quality are picked with higher priority according to the plot.
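To make the fusion rule of paragraph [074] concrete, the sketch below keeps, for each time step, the candidate coming from the most-preferred in-window frame index. The specific preference order is an illustrative assumption consistent with FIG. 9 (third and fourth frames performing best), not an order stated in the text, and the pose values are stand-ins.

```python
# Quality rank of a 1-based in-window frame index; lower rank is preferred.
PREFERENCE = {3: 0, 4: 1, 2: 2, 5: 3, 1: 4, 6: 5}

def fuse_pose_candidates(candidates):
    """candidates[t] is a list of (in_window_index, pose) pairs collected for
    time step t from every size-6 window covering it; returns one pose per t."""
    fused = {}
    for t, cand in candidates.items():
        fused[t] = min(cand, key=lambda c: PREFERENCE.get(c[0], 99))[1]
    return fused

# Toy usage with string stand-ins for the pose estimates.
cands = {7: [(1, "P7_from_w7"), (2, "P7_from_w6"), (3, "P7_from_w5"),
             (4, "P7_from_w4"), (5, "P7_from_w3"), (6, "P7_from_w2")]}
print(fuse_pose_candidates(cands))   # picks the candidate from in-window index 3
```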
RESULTS

[075] When there are moving objects in the foreground, the rigidness maps can correctly segment the objects and exclude them from the estimation of the depth map. Nevertheless, the resulting depth map can still provide correct background depth estimates in regions occluded by dynamic foreground objects, using information from frames in which the region is not occluded. With forward motion, depth pixels become invisible by exiting the field of view at the image boundary, which is reflected in the rigidness maps. Also, occlusions as well as outlier pixels (along object boundaries and in the sky region for C2F-Flow and EpicFlow) are clearly indicated by the rigidness maps. With rotating motion, depth pixels become invisible on the side opposite the rotation, which is also reflected in the rigidness maps. Good optical flow methods usually provide a more consistent output depth map along object and frame boundaries.

CONCLUSION

[076] In summary, conceptually, the VO problem was posed as an instance of geometric parameter inference under the supervision of an adaptive model of the empirical distribution of dense optical flow residuals. Pragmatically, a monocular VO pipeline was developed which obviates the need for a) feature extraction, b) RANSAC-based estimation, and c) local bundle adjustment, yet still achieves top-ranked performance on the KITTI and TUM RGB-D benchmarks. Moreover, the use of dense-indirect representations and adaptive data-driven supervision was posited as a general and extensible framework for multi-view geometric analysis tasks.

[077] It will be understood that the embodiments described herein are merely exemplary and that a person skilled in the art may make many variations and modifications without departing from the spirit and scope of the invention. All such variations and modifications, including the embodiments, features and characteristics disclosed in attached Exhibit A, are intended to be included within the scope of the present invention.

Claims

WHAT IS CLAIMED IS:

1. A method for recovering data from video recordings, said method comprising the steps of:
obtaining a video sequence;
providing a plurality of supervised learning estimators corresponding to said video sequence;
extracting an optical flow field sequence from said video sequence;
defining an indirect visual odometry probabilistic model;
treating said optical flow field sequence as known variables subject to error within said probabilistic model;
inputting said plurality of supervised learning estimators into said probabilistic model;
performing generalized expectation maximization on said probabilistic model; and
obtaining estimated random variables from said probabilistic model.
2. The method of Claim 1, wherein said learning estimators are computed externally.
3. The method of Claim 1, wherein said supervised learning estimators are computed with deep-learning techniques.
4. The method of Claim 1, further comprising the step of iteratively alternating between a plurality of parameters.
5. The method of Claim 1, wherein said probabilistic model comprises a geometric inference framework.
6. The method of Claim 1, wherein said probabilistic model is a log-logistic error model.
7. The method of Claim 1, wherein said estimates are inferred camera motion.
8. The method of Claim 1, wherein said estimates are inferred pixel depth.
9. The method of Claim 1, wherein said estimates are inferred motion-track confidence.
10. The method of Claim 1, wherein said step of obtaining a video sequence comprises the step of recording with a monocular image capture device.
11. The method of Claim 1, wherein said probabilistic model treats said optical flow sequences as Fisk-distribution variables.
12. The method of Claim 1, further comprising the step of performing a deterministic bootstrap for camera pose and pixel depth on said plurality of supervised learning estimators.
13. The method of Claim 1, wherein said probabilistic model includes a P3P method.
14. The method of Claim 1, wherein said probabilistic model includes a 3D triangulation method.
15. The method of Claim 1, further comprising the step of converting said estimates into a 3D model or 3D reconstruction.
16. The method of Claim 15, wherein said 3D model or 3D reconstruction represents geographical information.
17. The method of Claim 16, wherein said 3D model or 3D reconstruction is translated into a virtual reality representation of geography.
18. The method of Claim 1, further comprising the step of navigating an autonomous vehicle with said estimates.
19. The method of Claim 1, wherein said optical flow field sequence is dense.
20. An estimation method, said method comprising the steps of:
obtaining data;
providing a plurality of supervised learning estimators corresponding to said data;
extracting information from said data;
defining a generalized error model;
treating said information as an observed random variable within said error model;
inputting said plurality of supervised learning estimators into said error model;
performing generalized expectation maximization on said error model; and
obtaining estimated random variables from said error model.
