WO2022174198A1 - Self-supervised depth estimation framework for indoor environments - Google Patents

Self-supervised depth estimation framework for indoor environments

Info

Publication number
WO2022174198A1
WO2022174198A1 PCT/US2022/020511 US2022020511W WO2022174198A1 WO 2022174198 A1 WO2022174198 A1 WO 2022174198A1 US 2022020511 W US2022020511 W US 2022020511W WO 2022174198 A1 WO2022174198 A1 WO 2022174198A1
Authority
WO
WIPO (PCT)
Prior art keywords
depth
image
image frame
pose
depth map
Prior art date
Application number
PCT/US2022/020511
Other languages
English (en)
Inventor
Pan JI
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Priority to CN202280011051.7A (published as CN116745813A)
Publication of WO2022174198A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Definitions

  • the present disclosure relates generally to systems and methods for depth estimation from one or more images, and in particular, to self-supervised methods for depth estimation for indoor environments.
  • Depth estimation plays an essential role in a variety of 3D perceptual tasks, such as autonomous driving, virtual reality (VR), and augmented reality (AR).
  • Depth estimation may leverage a depth map that can be estimated from a single image in a supervised manner and/or a self-supervised manner.
  • Self-supervision frees the approach from having to capture ground-truth depth using depth sensors (e.g., LiDAR) and, therefore, may be more attractive in scenarios where obtaining the ground truth is not possible.
  • Systems and methods are provided for estimating a depth map from one or more images in a self-supervised manner.
  • In some embodiments, a method for depth estimation from monocular images comprises obtaining a plurality of image frames comprising at least a first image frame and a second image frame, wherein the plurality of image frames are captured by at least one image sensor; deriving a depth map for the first image frame based on a depth model; factorizing the depth map into a global scale factor for the first image frame; determining a relative depth map for the first image frame by updating the depth map using the global scale factor; and training a depth estimation model to predict the first image frame from the second image frame based on the relative depth map and the global scale factor.
  • a non-transitory computer-readable storage medium storing a plurality of instructions for depth estimation from monocular images.
  • The instructions are executable by one or more processors and, when executed by the one or more processors, cause the one or more processors to perform a method comprising obtaining a plurality of image frames comprising at least a first image frame and a second image frame, wherein the plurality of image frames are captured by at least one image sensor; determining a relative pose of the image sensor between the first image frame and the second image frame based on one or more synthesized image frames, wherein the one or more synthesized image frames are derived from the second image frame; and training the depth estimation model based on the determined relative pose.
  • a system for depth estimation comprises a memory configured to store instructions and one or more processors communicably coupled to the memory.
  • The one or more processors are configured to execute the instructions to execute a depth factorization module and a residual pose estimation module.
  • the depth factorization module comprises a depth network configured to determine a depth map from a target image as an input, and a scale network configured to determine a global scale factor from the target image as an input and determine a relative depth map based on updating the depth map with the global scale factor.
  • The residual pose estimation module is configured to iteratively predict residual camera poses between iteratively reconstructed synthesized images and the target image, wherein a first iteratively reconstructed synthesized image is based on a relative camera pose between the target image and a source image, and wherein each iteratively reconstructed synthesized image subsequent to the first iteratively reconstructed synthesized image is generated based on a residual camera pose between a preceding iteratively reconstructed synthesized image and the target image.
  • The one or more processors are configured to execute the instructions to train a depth estimation model based on the relative depth map, the global scale factor, and the iteratively predicted residual camera poses.
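  • Purely as an illustration of how the elements summarized above fit together, the following Python sketch outlines one self-supervised training step; the depth_net, scale_net, pose_net, warp_fn, and loss_fn callables are hypothetical placeholders rather than the claimed implementation.

```python
# A minimal, hypothetical sketch of the training flow summarized above; the
# callables (depth_net, scale_net, pose_net, warp_fn, loss_fn) are illustrative
# stand-ins, not components defined by the disclosure.
def training_step(target, source, depth_net, scale_net, pose_net, warp_fn, loss_fn, K):
    """One self-supervised step: predict a relative depth map and a global scale
    for the target frame, estimate the camera pose to the source frame,
    synthesize the target view from the source, and penalize the photometric
    difference between the real and synthesized target."""
    rel_depth = depth_net(target)                   # relative depth map in [0, 1]
    scale = scale_net(target)                       # global scale factor per image
    depth = rel_depth * scale                       # scaled depth used for warping
    pose = pose_net(target, source)                 # relative camera pose T_{t->t'}
    synthesized = warp_fn(source, depth, pose, K)   # inverse-warp source into target view
    return loss_fn(target, synthesized, depth)      # photometric (+ regularization) loss
```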
  • FIG. 1 is a diagram illustrating an example architecture for self-supervised depth estimation system according to embodiments disclosed herein.
  • FIG. 2 illustrates an example residual pose estimation for learning the relative camera pose between images according to embodiments disclosed herein.
  • FIG. 3 illustrates a qualitative comparison of depth prediction performed using the depth estimation system of FIG. 1 on the EuRoC MAV dataset.
  • FIG. 4 illustrates a qualitative comparison of depth prediction performed using the depth estimation system of FIG. 1 on the NYUv2 depth dataset.
  • FIG. 5 is an example computing component that may be used to implement various features of embodiments described in the present disclosure.
  • Embodiments of the systems and methods disclosed herein can provide for estimating a depth map from one or more images in a self-supervised manner.
  • embodiments of the present disclosure provide for self-supervised depth estimation using at least one of a depth factorization module and a residual pose estimation module.
  • the depth factorization module may be configured to learn a global scale factor and/or a relative depth map through a branch added to a depth network.
  • the residual pose estimation module may be configured to estimate accurate camera poses for view synthesis that in turn improves the depth model.
  • Various embodiments comprise both the depth factorization module and the residual pose estimation module.
  • Embodiments disclosed herein provide monocular self-supervised depth estimation systems and methods tailored for indoor environments.
  • Embodiments disclosed herein comprise modules, for example, a depth factorization module and a residual pose estimation module.
  • the depth factorization module is configured to factorize the depth map into a global depth scale (for a current image) and a relative depth map.
  • the depth scale factor may be separately predicted by an additional branch in the depth network.
  • the depth network has improved model plasticity to adapt to depth scale changes during training.
  • the residual pose estimation module is configured to mitigate issues of inaccurate rotation prediction.
  • The residual pose estimation module performs residual pose estimation in addition to an initial large pose prediction.
  • Embodiments disclosed herein provide numerous non-limiting advantages over existing technologies. For example, embodiments herein provide a depth factorization module that helps the depth network adapt to rapid scale changes; a residual pose estimation module that mitigates the inaccurate-rotation-prediction issue in the pose network and in turn improves depth prediction; and improved performance of self-supervised depth prediction on indoor datasets, for example, publicly available indoor datasets such as, but not limited to, EuRoC and NYUv2.
  • Self-supervised depth estimation does not require training with ground-truth depth.
  • One existing method proposes to use a color consistency loss between stereo images to train a monocular depth model.
  • Another method uses two networks (e.g., one depth network and one pose network) to construct photometric loss across temporal frames.
  • Many follow-up methods then tried to improve the self-supervision by new loss terms. For example, one method incorporated a left-right depth consistency loss for the stereo training, while another put forth a temporal depth consistency loss to encourage neighboring frames to have consistent depth predictions.
  • A third method observed a scale-diminishing issue in the depth model during training and came up with a simple normalization method to counter this effect.
  • Some methods used three networks (e.g., one depth network, one pose network, and one extra flow network) to enforce cross-task consistency between optical flow and dense depth. Some methods leveraged recurrent neural networks, such as long short-term memory (LSTMs), to model long-term dependency in the pose network and/or the depth network.
  • One method (referred to as Monodepth2) improved the performance over previous methods via a set of techniques, such as a per-pixel minimum photometric loss to handle occlusions, an auto-masking method to mask out static pixels, and a multi-scale depth estimation strategy to mitigate the texture-copying issue in depth. Due to this improved performance, some embodiments disclosed herein are based on Monodepth2, but introduce changes to both the depth and the pose networks.
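  • For illustration, the per-pixel minimum reprojection and auto-masking ideas referenced above can be sketched as follows; this is a simplified paraphrase of the Monodepth2 recipe rather than its exact implementation, and photometric_error is an assumed callable returning a per-pixel error map.

```python
import torch

def min_reprojection_loss(target, synthesized_views, source_views, photometric_error):
    """Per-pixel minimum reprojection with implicit auto-masking, in the spirit of
    Monodepth2 (a simplified sketch, not the exact implementation).

    synthesized_views: list of source views warped into the target frame.
    source_views: the same source views, unwarped (identity reprojection).
    photometric_error: callable returning a per-pixel error map (B, 1, H, W).
    """
    # Per-pixel minimum over warped views handles occlusions/disocclusions.
    reproj = torch.stack([photometric_error(target, s) for s in synthesized_views])
    # Identity errors of the unwarped sources; where these already beat the warped
    # errors (static pixels, no relative motion), those pixels are effectively masked.
    identity = torch.stack([photometric_error(target, s) for s in source_views])
    identity = identity + 1e-5 * torch.randn_like(identity)  # break ties
    combined = torch.cat([identity, reproj], dim=0)
    return torch.min(combined, dim=0).values.mean()
```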
  • FIG. 1 is a diagram illustrating an example architecture for self-supervised depth estimation system 100 according to embodiments disclosed herein.
  • the depth estimation system 100 may be configured to teach and execute a depth estimation model.
  • the system may be implemented using, for example, one or more processors and memory elements such as, for example, computer system 500 of FIG. 5.
  • the depth estimation system 100 may also be referred to as MonoIndoor according to some implementations.
  • The embodiments disclosed herein perform self-supervised depth estimation using a system that comprises at least one of a depth factorization module and a residual pose estimation module.
  • Depth factorization module 101 may include an encoder-decoder based depth network architecture configured to predict a relative depth map and a non-local scale network to estimate a global scale factor.
  • The residual pose estimation module 102 may include a pose network configured to predict an initial camera pose from a pair of image frames (e.g., a pair of monocular images) and a residual pose network configured to iteratively predict residual camera poses based on the predicted initial pose. While the following description is made with reference to an embodiment including both modules, it will be appreciated that embodiments disclosed herein may include the depth factorization module 101, the residual pose estimation module 102, or a combination of both.
  • Embodiments disclosed herein utilize images or image frames captured from cameras (e.g., visible light cameras, IR cameras, thermal cameras, ultrasound cameras, and other cameras) or other image sensors configured to capture video as a plurality of image frames and/or static images of an environment.
  • images may be captured by monocular image sensors configured to capture monocular videos as a plurality of image frames, each containing a different scene of the environment, in the form of monocular images.
  • a "monocular image” is an image from a single (e.g., monocular) camera, and encompasses a field-of-view (FOV) or a scene of a portion of the surrounding environment (e.g., a subregion of the surrounding environment).
  • a monocular image may not include any explicit additional modality indicating depth nor any explicit corresponding image from another camera from which the depth can be derived (e.g., no stereo image sensor pair).
  • a monocular image does not include explicit depth information such as disparity maps derived from comparing the stereo images pixel-by-pixel.
  • a monocular image may implicitly provide depth information in the relationships of perspective and size of elements depicted therein.
  • the monocular image may be of a forward-facing (e.g., the direction of travel), 60-degree FOV, 90-degree FOV, 120-degree FOV, a rear/side facing FOV, or some other subregion based on the positioning and characteristics of the image sensor.
  • the images or image frames processed by the depth estimation system 100 may be acquired directly or indirectly from the camera.
  • images from a camera may be fed via a wired or wireless connection to the system 100 and processed in real-time.
  • images may be stored in a memory and retrieved for processing.
  • one image may be processed in real-time as captured by the camera, while a second image may be retrieved from storage.
  • Embodiments disclosed herein consider self-supervised depth estimation as a view-synthesis problem by training a model to predict a target image from different scene viewpoints (e.g., different camera poses) of source images.
  • the image synthesis process may be trained and constrained by using the depth map as the bridging variable. As such, the image synthesis process may require both a predicted depth map of a target image and an estimated relative pose between the target and a source image (e.g., a target and source image pair).
  • Embodiments herein can be jointly trained to predict a dense depth map D_t of the target image I_t and a relative camera pose T_{t→t'} from the target image to the source image I_t'.
  • The photometric reprojection loss can then be constructed as L_p = pe(I_t, I_{t'→t}), where I_{t'→t} = I_t'⟨proj(D_t, T_{t→t'}, K)⟩ is the view synthesized from the source image.
  • The photometric reconstruction error pe is a weighted combination of an L1 term (e.g., a summation of the absolute value of the difference between each pixel value in the target frame and the synthesized frame) and a Structured SIMilarity (SSIM) loss, defined as pe(I_a, I_b) = (α/2)(1 − SSIM(I_a, I_b)) + (1 − α)‖I_a − I_b‖_1.
  • proj() is a transformation function that maps the image coordinates from the target image to their correspondences p_t' on the source image, and ⟨·⟩ is the bilinear sampling operator, which is locally sub-differentiable.
  • Camera intrinsics K of all images are assumed to be the same, and an edge-aware smoothness term is employed as L_s = |∂_x d_t| e^(−|∂_x I_t|) + |∂_y d_t| e^(−|∂_y I_t|),
  • where d_t represents a depth value from the corresponding depth map D_t and ∂_x, ∂_y represent partial derivatives with respect to the coordinates x and y of the depth map D_t.
  • Various embodiments herein use an additional depth consistency loss to enforce consistent depth predictions across neighboring frames. For example, the depth map D_t' of the source image is first warped by Equation (2) to generate a corresponding depth map in the coordinate system of the source image; this depth map is then transformed to the coordinate system of the target image via Equation (4) to produce a synthesized target depth map D_{t'→t}. The depth consistency loss can be written as L_c = |D_{t'→t} − D_t|.
  • The overall training loss combines these terms as L = L_p + τ·L_s + γ·L_c, where τ and γ are the weights for the edge-aware smoothness loss and the depth consistency loss, respectively.
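  • A PyTorch-style sketch of these loss terms is shown below; it assumes images and depth maps are (B, C, H, W) tensors, and the 3x3 SSIM window, the α = 0.85 weighting, and the mean-normalization of the depth in the smoothness term are common choices assumed for illustration rather than requirements of the embodiments.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighborhoods, returned as a dissimilarity in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(target, synthesized, alpha=0.85):
    """pe(I_a, I_b): weighted SSIM + L1 error, per pixel (B, 1, H, W)."""
    l1 = (target - synthesized).abs().mean(1, keepdim=True)
    return alpha * ssim(target, synthesized).mean(1, keepdim=True) + (1 - alpha) * l1

def edge_aware_smoothness(depth, image):
    """L_s: first-order smoothness on mean-normalized depth, down-weighted at image edges."""
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

def depth_consistency(d_synth_target, d_target):
    """L_c: one simple choice of consistency between synthesized and predicted target depth."""
    return (d_synth_target - d_target).abs().mean()

def total_loss(photo, smooth, consist, tau=0.001, gamma=0.05):
    """L = L_p + tau * L_s + gamma * L_c, with the weights reported later in the text."""
    return photo.mean() + tau * smooth + gamma * consist
```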
  • Embodiments herein provide a monocular self-supervised depth estimation architecture, as shown in FIG. 1, to provide improved predicted depth quality in indoor environments.
  • The architecture of system 100 takes, as input, a single color image and outputs a depth map based on executing a depth factorization module 101 and a residual pose estimation module 102.
  • Embodiments disclosed herein predict a depth model using a depth network 106 having an encoder 108/decoder 110 architecture that takes an image 104 (I_t) as input and outputs a relative depth map 112 (D_t) for the input image.
  • An example depth network 106 employs an auto-encoder structure with skip connections between the encoder and the decoder.
  • input image 104 is a color image.
  • An illustrative depth network 106 may be the Monodepth2 model for depth prediction.
  • the depth network 106 may include a set of neural network layers including convolutional components (e.g., 2D convolutional layers forming an encoder 108) that flow into decoder layers (e.g., 2D convolutional layers with upsampling operators forming a decoder 110).
  • the encoder 108 accepts an image 104 (e.g., a monocular color image), as an input and processes the image to extract features therefrom (e.g., feature representations).
  • the features may be aspects of the image that are indicative of spatial information that the image intrinsically encodes.
  • the encoder 108 comprises multiple encoding layers formed from a combination of two-dimensional (2D) convolutional layers, packing blocks, and residual blocks.
  • the separate encoding layers generate outputs in the form of encoded feature maps (also referred to as tensors), which the encoding layers provide to subsequent layers in the depth network 106.
  • the encoder 108 may include a variety of separate layers that operate on the image 104, and subsequently on derived/intermediate feature maps that convert the visual information of the image 104 into embedded state information in the form of encoded features of different channels.
  • the decoder 110 may unfold (e.g., adapt dimensions of the tensor to extract the features) the previously encoded spatial information in order to derive a depth map 112 for the image according to learned correlations associated with the encoded features.
  • the decoding layers may function to up-sample, through sub-pixel convolutions and other mechanisms, the previously encoded features into the depth map 112, which may be provided at different resolutions.
  • the decoding layers comprise unpacking blocks, two-dimensional convolutional layers, and inverse depth layers that function as output layers for different scales of the feature/depth map.
  • the depth map may be a data structure corresponding to the input image that indicates distances/depths to objects/features represented therein.
  • the depth map 112 may be a tensor with separate data values indicating depths for corresponding locations in the image on a per-pixel basis.
  • The depth network 106 may further include skip connections for providing residual information between the encoder 108 and the decoder 110 to facilitate memory of higher-level features between the separate components. While a particular depth network 106 is discussed, as previously noted, the depth network 106, in various approaches, may take different forms and generally functions to process the monocular images and provide depth maps that are per-pixel estimates of distances to objects/features depicted in the images.
  • Note that the resulting depth prediction (e.g., relative depth map 112) for the image may not come directly from the convolutional layers of the encoder 108, but is obtained after a sigmoid activation function and a linear scaling function.
  • a and b are specified to constrain the depth map D within a certain depth range, thereby providing a relative depth map.
  • a and b are respectively set as a minimum depth value and a maximum depth value that can be obtained in a known environment. For instance, on the KITTI dataset, a is chosen as 0.1 and b as 100. The reason for setting a and b as fixed values is that the depth range is consistent across the video sequences when the camera always sees the sky at the far point. However, this setting may not be valid for most indoor environments: as a camera travels through an environment, the depth range of each image captured by the camera varies. For example, the depth range in a bathroom (e.g., 0.1 m to 3 m) is much smaller than that in a lobby (e.g., 0.1 m to 10 m).
  • Presetting the depth range may therefore act as an inaccurate guidance that could be harmful for the model when capturing accurate depth scales. This may be especially true when there are rapid scale changes, which are commonly observed in indoor environments.
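  • As a concrete illustration of the scaled sigmoid output discussed above, the following sketch shows one common realization (in the style of Monodepth2), with min_depth and max_depth standing in for the bounds a and b; the exact linear scaling used in various embodiments may differ.

```python
import torch

def sigmoid_to_depth(logits, min_depth=0.1, max_depth=100.0):
    """Map raw network output to depth via a scaled sigmoid (Monodepth2-style):
    the sigmoid output is converted to a disparity in [1/max_depth, 1/min_depth]
    and inverted, so depth stays within [min_depth, max_depth]. The bounds play
    the role of a and b above; the values here are illustrative."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    disp = min_disp + (max_disp - min_disp) * torch.sigmoid(logits)
    return 1.0 / disp
```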
  • the depth factorization module 101 is configured to learn a disentangled representation in the form of a relative depth map D t (e.g., as described above) and a global scale factor.
  • a relative depth map refers to a matrix containing entries (e.g., depth values) lying between [0, 1], whereas an absolute depth map contains depth values in metric scale (e.g., in meters).
  • the architecture for system 100 employs the depth network 106 to predict relative depth and adds a self-attention-guided scale regression network 114 to predict the global scale factor for the current view.
  • the scale network 114 may be a branch 116 from the depth network 106 which takes as input an image (e.g., a color image in various embodiments) and outputs a global scale factor for the image.
  • the input may be the image 104 or another image.
  • Because the global scale factor may be informed by certain areas in the image (e.g., a far point, which represents the furthest point in the image), some embodiments use a self-attention block so that the network can be guided to pay more attention to a certain area of the image. This approach may be informative for inducing the depth scale factor of the current view (e.g., the current scene) in an environment.
  • A self-attention block uses the feature representation F as an input, forming a query Q = W_q F, a key K = W_k F, and a value V = W_h F,
  • where W_q, W_k, and W_h are parameters to be learnt by the embodiments herein. Parameters W_q, W_k, and W_h are convolution layers learnt in a manner similar to other network parameters and are illustrated as layers 118, 120, and 122, respectively.
  • The self-attention G_F = softmax(Q K^T) V is illustrated in FIG. 1 as layer 124 (e.g., a convolution layer). Next, the self-attention G_F and the feature representation F jointly contribute to the output by using Eq. (10): A_F = W_a G_F + F,
  • where W_a is a parameter to be learnt by the embodiments herein and is a convolution layer learnt in a manner similar to other network parameters.
  • To regress the global scale, each scale s is an enumeration of 0, 1, ..., D_max, where D_max is the maximum value over all depth maps. D_max may be set in advance to constrain the maximum bound of the global scale factor (e.g., set based on the maximum value from all depth maps, such as 10, 20, etc.).
  • The predicted global scale S_t is calculated as the sum of each scale s weighted by its probability, S_t = Σ_{s=0}^{D_max} s · p_s,
  • where p_s is the probability assigned to scale s, obtained by applying fully connected layers followed by a softmax to the attentive representations A_F. By doing so, the regression problem can be smoothly resolved by a probabilistic classification-based strategy (see the ablation results described below).
  • The predicted global scale may then be multiplied with the relative depth map when training the depth estimation model.
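  • A compact PyTorch-style sketch of such a self-attention-guided scale head is shown below; the use of 1x1 convolutions for W_q, W_k, and W_h, the global pooling, the layer widths, and the discrete scale bins are illustrative assumptions rather than the exact network of FIG. 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleRegressionHead(nn.Module):
    """Self-attention-guided global-scale head (illustrative sketch): 1x1 convs
    form query/key/value, softmax attention re-weights a coarse feature map, and
    a probabilistic regression over discrete scale bins 0..d_max yields the
    expected global scale S_t."""
    def __init__(self, channels, d_max=10):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)
        self.fc = nn.Linear(channels, d_max + 1)                 # logits over scale bins
        self.register_buffer("bins", torch.arange(d_max + 1, dtype=torch.float32))

    def forward(self, feat):                                     # feat: (B, C, H, W), coarse features
        b, c, h, w = feat.shape
        q = self.q(feat).flatten(2).transpose(1, 2)              # (B, HW, C/8)
        k = self.k(feat).flatten(2)                              # (B, C/8, HW)
        v = self.v(feat).flatten(2).transpose(1, 2)              # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), -1)   # (B, HW, HW)
        g = (attn @ v).transpose(1, 2).reshape(b, c, h, w)       # attended features G_F
        a = self.out(g) + feat                                   # A_F = W_a G_F + F
        probs = F.softmax(self.fc(a.mean(dim=(2, 3))), dim=-1)   # probability per scale bin
        return (probs * self.bins).sum(dim=1)                    # expected scale S_t, shape (B,)
```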
  • Embodiments of the self-supervised depth estimation system are built upon novel view synthesis, which uses both accurate depth maps and camera poses.
  • Estimating accurate relative poses may be key for the photometric reprojection loss because inaccurate poses might lead to wrong correspondences between the target and source pixels, causing problems in predicting the depth.
  • Existing methods generally employ a standalone pose network to estimate 6 Degrees-of-Freedom (DoF) pose between two images.
  • For driving scenarios (e.g., the KITTI dataset), relative camera poses are fairly simple because the cars are mostly moving forward, with large changes in translational pose but only minor changes in rotational pose. This means that pose estimation is normally less challenging in such scenarios.
  • In indoor environments, by contrast, the sequences are typically recorded with hand-held devices (e.g., Kinect®, smartphones, handheld recording devices, and the like), so more complicated ego-motions are involved, as well as much larger rotational motions. It is thus more difficult for the pose network to learn accurate camera poses.
  • embodiments herein include the residual pose estimation module 102 to learn the relative camera pose between a target image and a source image in an iterative manner.
  • A pose network 132 (also referred to herein as PoseNet 132) takes a target image I_t and a source image I_t' as inputs (e.g., merges the image pair via a concat function) and predicts an initial relative camera pose T^0_{t→t'}, where the index 0 indicates that no warping transformation has been applied yet.
  • An example PoseNet 132 accepts two monocular images (e.g., I_t and I_t'), each corresponding to a different camera pose and a different view (e.g., different combinations of objects/features and/or different viewpoints of objects/features) of an environment.
  • The PoseNet 132 processes the monocular images (e.g., I_t and I_t') to produce an estimate of a set of 6-DoF transformations (referred to as [R | t]) 134 that apply between the two images to represent the transition from one pose to the other (e.g., from the target view to the source view).
  • the PoseNet 132 may be implemented, for example, as a convolutional neural network (CNN) or another learning model that is differentiable and performs dimensional reduction of the input images to produce the transformation.
  • The PoseNet 132 may include 7 stride-2 convolutions, a 1x1 convolution with 6*(N-1) output channels corresponding to 3 Euler angles and a 3-D translation for one of the images, and global average pooling to aggregate predictions at all spatial locations.
  • the 6-DOF transformation 134 may be, in some embodiments, a 6-DOF rigid-body transformation belonging to the special Euclidean group SE(3) that represents the change in pose between the pair of images provided as inputs to the PoseNet 132.
  • the PoseNet 132 performs a dimensional reduction of the monocular images to derive the transformations 134 between images therefrom.
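  • A sketch of a pose network with the structure described above is shown below; the channel widths and activations are illustrative assumptions, and the output is a 6-vector of three rotation and three translation parameters per source image.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Pose network sketch: concatenated target/source images, 7 stride-2
    convolutions, a 1x1 convolution with 6*(N-1) output channels, and global
    average pooling over spatial locations (channel widths are illustrative)."""
    def __init__(self, num_input_images=2):
        super().__init__()
        widths = [16, 32, 64, 128, 256, 256, 256]                # 7 stride-2 stages
        layers, in_c = [], 3 * num_input_images
        for out_c in widths:
            layers += [nn.Conv2d(in_c, out_c, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            in_c = out_c
        self.encoder = nn.Sequential(*layers)
        self.pose_pred = nn.Conv2d(in_c, 6 * (num_input_images - 1), 1)

    def forward(self, target, source):
        x = self.encoder(torch.cat([target, source], dim=1))     # concat along channels
        x = self.pose_pred(x).mean(dim=(2, 3))                   # global average pooling
        return x.view(-1, 6)                                     # 3 rotation + 3 translation params
```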
  • Eq. (12) below is applied to bilinearly sample (e.g., a technique for performing inverse warp 136) from the source image I_t', reconstructing or generating a virtual view I^1_{t'→t} (also referred to as a synthesized image).
  • The synthesized image I^1_{t'→t} is expected to be the same as the target image I_t if the correspondences match accurately. However, this may not be the case due to an inaccurate pose prediction.
  • The warp 136 may be defined as I^1_{t'→t} = I_t'⟨proj(D_t, T^0_{t→t'}, K)⟩. (12)
  • A residual pose network 140 is then applied, which takes the target image I_t and the synthesized image I^1_{t'→t} as inputs (e.g., merges the image pair via a concat function).
  • The residual pose network 140 outputs a residual camera pose T^1_res (shown in FIG. 1 as 142) representing the camera pose of the synthesized image I^1_{t'→t} with respect to the target image I_t.
  • The residual pose network 140 may be similar to the PoseNet 132, except that the residual pose network 140 takes the synthesized image I^1_{t'→t} as an input, instead of the source image I_t'.
  • The synthesized image is then bilinearly sampled (e.g., inverse warp 144) as I^2_{t'→t} = I^1_{t'→t}⟨proj(D_t, T^1_res, K)⟩. (13)
  • Eq. (13) reconstructs a new synthesized image I^2_{t'→t}, which can be used to estimate the next residual pose for the next view synthesis.
  • The index 1 is replaced with 2 to indicate that one more warping transformation has been applied, and similarly a sequential warping transformation is applied to obtain each subsequent synthesized image.
  • Eq. (13) can be generalized to the form of Eq. (14): I^{i+1}_{t'→t} = I^i_{t'→t}⟨proj(D_t, T^i_res, K)⟩, where T^i_res denotes the i-th residual camera pose. (14)
  • The camera pose of the source image with respect to the target image I_t can then be written as the composition of the initial pose and the residual poses, T_{t→t'} = T^0_{t→t'} ⊗ T^1_res ⊗ ... ⊗ T^k_res. (15)
  • camera poses may be obtained that are more accurate than a pose predicted from a single-stage pose network.
  • the improved accuracy in the estimation of camera pose may provide for more accurate photometric reprojection loss that can be built up for better depth prediction.
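  • The iterative scheme can be sketched as follows, under stated assumptions: pose_net and res_pose_net are assumed to return 4x4 homogeneous transforms (e.g., after converting predicted rotation and translation parameters), depth is the target-frame depth map D_t, intrinsics are assumed shared and on the same device as the images, and poses are assumed to map target-frame points into the source frame; this is an illustrative sketch rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def inverse_warp(source, depth, T, K, K_inv):
    """Synthesize the target view from `source`, given the target depth map, a
    batched 4x4 pose T (target frame -> source frame) and 3x3 intrinsics K:
    back-project target pixels to 3D, transform, project, and bilinearly sample."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1).to(depth)
    cam = (K_inv @ pix) * depth.view(b, 1, -1)                        # back-project
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=depth.device)], 1)
    proj = K @ (T @ cam_h)[:, :3]                                     # transform + project
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    uv = uv.view(b, 2, h, w).permute(0, 2, 3, 1)
    grid = torch.stack([2 * uv[..., 0] / (w - 1) - 1,
                        2 * uv[..., 1] / (h - 1) - 1], dim=-1)
    return F.grid_sample(source, grid, padding_mode="border", align_corners=True)

def iterative_pose(target, source, depth, K, K_inv, pose_net, res_pose_net, iters=1):
    """Initial pose from (target, source), then residual poses from
    (target, current synthesized view), composed with the running estimate."""
    T = pose_net(target, source)                        # T^0, assumed (B, 4, 4)
    synth = inverse_warp(source, depth, T, K, K_inv)    # Eq. (12)-style first synthesis
    for _ in range(iters):
        T_res = res_pose_net(target, synth)             # residual pose T^i_res
        T = T @ T_res                                   # compose (order per the convention above)
        synth = inverse_warp(synth, depth, T_res, K, K_inv)  # Eq. (13)/(14)-style re-synthesis
    return T, synth
```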
  • FIG. 2 illustrates an example residual pose estimation for learning the relative camera pose between images according to embodiments disclosed herein.
  • FIG. 2 depicts how a single-stage pose can be decomposed into an initial pose and residual poses by virtual view synthesis, for example, through application of the residual pose estimation module 102.
  • FIG. 2 shows a target image and a source image
  • a single-stage pose network may estimate relative camera pose from the target image and the source image
  • the residual pose estimation module 102 iteratively reconstructs virtual views (e.g., synthesized images) using an inverse warping function.
  • An improved camera pose estimation can be determined by application of Eq. 15, based on the virtual views as compared to the single-stage approach.
  • The depth estimation system 100 may include the depth factorization module 101 or the residual pose estimation module 102; embodiments herein are not limited to the inclusion of both.
  • pose estimation may be performed using a pose network as is known in the art.
  • the depth estimation may then be performed using the depth map as estimated by the depth factorization module 101 as applied to a known pose network.
  • the depth map may be estimated using methods known in the art, which may then be applied to the residual pose estimation module 102 in a manner similar to that described above.
  • The self-supervised depth estimation system according to embodiments disclosed herein was evaluated on two indoor datasets: the EuRoC MAV dataset and the NYUv2 depth dataset. To evaluate the results, the mean absolute relative error (AbsRel), the root mean squared error (RMS), and the accuracy under threshold (δ) were used for both datasets.
  • The Monodepth2 depth network was used for the depth factorization module 101 and, for the scale network 114, two basic residual blocks followed by three fully-connected layers with a dropout layer in between were used. The dropout rate was set to 0.5.
  • For the residual pose estimation module 102, the residual pose networks used a common architecture similar to Monodepth2, consisting of a shared pose encoder and an independent pose regressor. Each experiment was trained for 40 epochs using the Adam optimizer; the learning rate was set to 10^-4 for the first 20 epochs and dropped to 10^-5 for the remaining epochs. The smoothness weight τ and the consistency weight γ were set to 0.001 and 0.05, respectively.
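  • An illustrative PyTorch configuration matching the reported schedule is shown below; model is assumed to hold the parameters of the depth, scale, and pose networks.

```python
import torch

def make_optimizer(model):
    """Adam for 40 epochs: lr 1e-4 for the first 20 epochs, then 1e-5 (gamma=0.1)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
    return optimizer, scheduler

# Loss weights reported in the text: smoothness 0.001, depth consistency 0.05.
SMOOTHNESS_WEIGHT = 0.001
CONSISTENCY_WEIGHT = 0.05
```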
  • the EuRoC MAV Dataset contains 11 video sequences of a collection of scenes captured in two main environments, a machine hall and a vicon room. Sequences are categorized as easy, medium and difficult according to the varying illumination and camera motions. For the training, three sequences of "Machine hall" (MH_01, MH_02, MH_04) and two sequences of "Vicon room" (V1_01 and V1_02) were used. Images are rectified with provided camera intrinsics to remove image distortion. During training, images are resized to 512x256. The Vicon room sequence V2_01 was used for testing where the ground-truth depths are generated by projecting Vicon 3D scans onto the image planes.
  • Table 1 below shows ablation results of design choices and the effectiveness of components in the depth factorization module 101 on the EuRoC MAV Dataset.
  • Prob Reg. refers to the probabilistic scale regression block. Note, here the residual pose estimation module 102 was used when experimenting with different network designs for the depth factorization module 101.
  • the depth estimation system 100 of FIG. 1 was compared with a baseline model Monodepth2 and the effectiveness of each module 101 and 102 was validated.
  • Table 2 below shows ablation results of the depth estimation system 100 (referred to in Table 2 as MonoIndoor) and a quantitative comparison with a baseline on the test sequence V2_01 of the EuRoC MAV dataset. Best results are in bold.
  • Adding the depth factorization module 101 reduced the AbsRel from 15.7% to 14.9%, and the residual pose estimation module 102 further decreased the AbsRel to 14.1%, which verifies the usefulness of each module.
  • the full system achieves the best performance across all evaluation metrics.
  • With both modules combined, the AbsRel of the depth estimation system disclosed herein significantly decreased from 15.7% to 12.5%, and the δ1 accuracy improved by around 6%, from 78.6% to 84.0%.
  • FIG. 3 illustrates a qualitative comparison of depth prediction performed using the depth estimation system 100 of FIG. 1 on the EuRoC MAV dataset. Input images are shown in the left most column, the depth map output for each input image by Monodepth2 in the middle column, and the depth map output by the depth estimation system 100 is shown on the right column.
  • the depth estimation system 100 was able to predict precise depths for the hole region
  • the depth estimation system 100 was evaluated on the NYUv2 depth dataset which contains 464 indoor video sequences captured by a hand-held Microsoft Kinect RGBD camera with a resolution of 640x480.
  • the official training and validation splits were used, which include 302 and 33 sequences respectively.
  • the images were rectified with provided camera parameters to remove distortions.
  • the raw dataset is first down-sampled by a factor of 10 along the temporal dimension to remove redundant frames, resulting in approximately 20K images for training. During training, images are resized to 320x256. For testing, 654 images with densely labelled depth maps were used.
  • two pose heads may be enough to represent global motion.
  • Quantitative results of the depth estimation system 100 and both state-of-the-art (SOTA) supervised and self-supervised methods on the NYUv2 depth dataset are shown in Table 4 below.
  • Table 4 provides a comparison of the depth estimation system 100 (referred to in Table 4 as MonoIndoor) to existing supervised and self-supervised methods, with the best results shown in bold.
  • Table 4 shows that the depth estimation system 100 outperforms previous self-supervised SOTA methods, reaching the best results across all metrics. Specifically, 13.4% for AbsRel and 82.3% for δ1 are achieved using the depth estimation system 100.
  • the depth estimation system 100 outperforms a group of supervised methods and closes the performance gap between the self-supervised methods and fully-supervised methods.
  • FIG. 4 illustrates a qualitative comparison of depth prediction performed using the depth estimation system 100 of FIG. 1 on the NYUv2 depth dataset. Input images are shown in the left most column, the depth map output for each input image by Monodepth2 in the second column, the depth map output by the depth estimation system 100 is shown on the third column, and ground truth is shown in the right most column.
  • FIG. 4 visualizes the predicted depth maps for a given input image for each of the Monodepth2, the depth estimation system 100, and ground truth (GT).
  • GT ground truth
  • depth maps predicted by the depth estimation system 100 are more precise and closer to the ground truth. For instance, for input image 401 having area 302 of chairs, the depth in the area 303 of chairs predicted by the depth estimation system 100 is much sharper and cleaner than the Monodepth2 estimated area 304, producing a depth map that more closely resembles the ground truth 305.
  • the depth estimation system 100 can produce better depth predictions for the area 307 than the area 308 from Monodepth2.
  • the embodiments herein provide for a monocular self-supervised depth estimation system configured for predicting depth maps in indoor environments.
  • the embodiments herein provide for a depth factorization module configured to jointly learn a global scale factor and a relative depth map.
  • embodiments herein estimate accurate camera poses for novel view synthesis via a residual pose estimation module that in turn improves the depth model.
  • Embodiments herein achieve state-of-the-art performance among self-supervised methods on indoor datasets, as set forth above.
  • FIG. 5 is an example computing component that may be used to implement various features of embodiments described in the present disclosure.
  • FIG. 5 depicts a block diagram of an example computer system 500 in which various embodiments of the self-supervised depth estimation system 100 described herein may be implemented.
  • the computer system 500 includes a bus 502 or other communication mechanism for communicating information, and one or more hardware processors 504 coupled with bus 502 for processing information.
  • Hardware processor(s) 504 may be, for example, one or more general purpose microprocessors.
  • the computer system 500 also includes a main memory 506, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 502 for storing information and instructions to be executed by processor 504, for example, instructions for executing the architecture of FIG. 1.
  • Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Such instructions, when stored in storage media accessible to processor 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • the computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504.
  • a storage device 510 such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 502 for storing information and instructions.
  • the computer system 500 may be coupled via bus 502 to a display 512, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user.
  • An input device 514 is coupled to bus 502 for communicating information and command selections to processor 504.
  • Another type of user input device is cursor control 516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512.
  • the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
  • One or more image sensors 518 may be coupled to bus 502 for capturing video as a plurality of image frames and/or static images of an environment.
  • Image sensors include any type of camera (e.g., visible light cameras, IR cameras, thermal cameras, ultrasound cameras, and other cameras) or other image sensor configured to capture video as a plurality of image frames and/or static images of an environment.
  • image sensors 518 may capture images that are processed according to the embodiments disclosed herein (e.g., in FIGS. 1 and 2).
  • image sensors 518 communicate information to main memory 506, ROM 508, and/or storage 510 for processing in real-time and/or for storage so as to be processed at a later time.
  • image sensors 518 need not be included and images for processing may be retrieved from memory.
  • the computing system 500 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s).
  • This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • the words “component,” “engine,” “module,” “system,” “database,” “data store,” and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++.
  • a software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution).
  • Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an EPROM.
  • hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
  • the computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor(s) 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor(s) 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • non-transitory media refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510.
  • Volatile media includes dynamic memory, such as main memory 506.
  • non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
  • Non-transitory media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between non-transitory media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502.
  • transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • the computer system 500 also includes a communication interface 518 coupled to bus 502.
  • Communication interface 518 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks.
  • communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN).
  • Wireless links may also be implemented.
  • communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • a network link typically provides data communication through one or more networks to other data devices.
  • a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP).
  • the ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet.”
  • Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link and through communication interface 518, which carry the digital data to and from computer system 500, are example forms of transmission media.
  • the computer system 500 can send messages and receive data, including program code, through the network(s), network link and communication interface 518.
  • a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 518.
  • the received code may be executed by processor 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.
  • Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware.
  • the one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service” (SaaS).
  • the processes and algorithms may be implemented partially or wholly in application-specific circuitry.
  • the various features and processes described above may be used independently of one another, or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations.
  • a circuit might be implemented utilizing any form of hardware, software, or a combination thereof.
  • processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit.
  • the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality.
  • a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 500.
  • circuit and component might describe a given unit of functionality that can be performed in accordance with one or more embodiments of the present application.
  • a component might be implemented utilizing any form of hardware, software, or a combination thereof.
  • processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a component.
  • Various components described herein may be implemented as discrete components or described functions and features can be shared in part or in total among one or more components. In other words, as would be apparent to one of ordinary skill in the art after reading this description, the various features and functionality described herein may be implemented in any given application.
  • The terms “computer program medium” and “computer usable medium” are used to generally refer to transitory or non-transitory media. These and other various forms of computer program media or computer usable media may be involved in carrying one or more sequences of one or more instructions to a processing device for execution. Such instructions embodied on the medium are generally referred to as “computer program code” or a “computer program product” (which may be grouped in the form of computer programs or other groupings). When executed, such instructions might enable a computing component to perform features or functions of the present application as discussed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods are provided for estimating a depth map from one or more images in a self-supervised manner. The systems and methods can execute a depth factorization module comprising a depth network configured to determine a depth map from a target image, and a scale network configured to determine a global scale factor from the target image and update the depth map with the global scale factor to determine a relative depth map. The systems and methods can also execute a residual pose estimation module configured to iteratively predict residual camera poses between reconstructed synthesized images and the target image, and to train a depth estimation model based on the relative depth map, the global scale factor, and the iteratively predicted residual camera poses.
PCT/US2022/020511 2021-03-18 2022-03-16 Self-supervised depth estimation framework for indoor environments WO2022174198A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280011051.7A 2021-03-18 2022-03-16 Self-supervised depth estimation framework for indoor environments (室内环境的自监督式深度估计框架)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163162940P 2021-03-18 2021-03-18
US63/162,940 2021-03-18

Publications (1)

Publication Number Publication Date
WO2022174198A1 true WO2022174198A1 (fr) 2022-08-18

Family

ID=82837262

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/020511 WO2022174198A1 (fr) 2021-03-18 2022-03-16 Self-supervised depth estimation framework for indoor environments

Country Status (2)

Country Link
CN (1) CN116745813A (fr)
WO (1) WO2022174198A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686077A (zh) * 2019-10-17 2021-04-20 北京极智嘉科技有限公司 一种自驱动机器人和障碍物识别的方法
CN115145289A (zh) * 2022-09-02 2022-10-04 汕头大学 多智能体协同围捕方法、系统、设备及存储介质
CN116168070A (zh) * 2023-01-16 2023-05-26 南京航空航天大学 一种基于红外图像的单目深度估计方法及系统
CN116245927A (zh) * 2023-02-09 2023-06-09 湖北工业大学 一种基于ConvDepth的自监督单目深度估计方法及系统
CN116403269A (zh) * 2023-05-17 2023-07-07 智慧眼科技股份有限公司 一种遮挡人脸解析方法、系统、设备及计算机存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090359A1 (en) * 2018-09-14 2020-03-19 Toyota Research Institute, Inc. Systems and methods for depth estimation using monocular images
US20200226773A1 (en) * 2018-07-27 2020-07-16 Shenzhen Sensetime Technology Co., Ltd. Method and apparatus for depth estimation of monocular image, and storage medium
WO2020221443A1 * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Scale-aware monocular localization and mapping

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226773A1 (en) * 2018-07-27 2020-07-16 Shenzhen Sensetime Technology Co., Ltd. Method and apparatus for depth estimation of monocular image, and storage medium
US20200090359A1 (en) * 2018-09-14 2020-03-19 Toyota Research Institute, Inc. Systems and methods for depth estimation using monocular images
WO2020221443A1 * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Scale-aware monocular localization and mapping

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686077A (zh) * 2019-10-17 2021-04-20 北京极智嘉科技有限公司 一种自驱动机器人和障碍物识别的方法
CN112686077B (zh) * 2019-10-17 2024-04-26 北京极智嘉科技股份有限公司 一种自驱动机器人和障碍物识别的方法
CN115145289A (zh) * 2022-09-02 2022-10-04 汕头大学 多智能体协同围捕方法、系统、设备及存储介质
CN116168070A (zh) * 2023-01-16 2023-05-26 南京航空航天大学 一种基于红外图像的单目深度估计方法及系统
CN116168070B (zh) * 2023-01-16 2023-10-13 南京航空航天大学 一种基于红外图像的单目深度估计方法及系统
CN116245927A (zh) * 2023-02-09 2023-06-09 湖北工业大学 一种基于ConvDepth的自监督单目深度估计方法及系统
CN116245927B (zh) * 2023-02-09 2024-01-16 湖北工业大学 一种基于ConvDepth的自监督单目深度估计方法及系统
CN116403269A (zh) * 2023-05-17 2023-07-07 智慧眼科技股份有限公司 一种遮挡人脸解析方法、系统、设备及计算机存储介质
CN116403269B (zh) * 2023-05-17 2024-03-26 智慧眼科技股份有限公司 一种遮挡人脸解析方法、系统、设备及计算机存储介质

Also Published As

Publication number Publication date
CN116745813A (zh) 2023-09-12

Similar Documents

Publication Publication Date Title
WO2022174198A1 (fr) Self-supervised depth estimation framework for indoor environments
Ming et al. Deep learning for monocular depth estimation: A review
AU2017324923B2 (en) Predicting depth from image data using a statistical model
Ji et al. Monoindoor: Towards good practice of self-supervised monocular depth estimation for indoor environments
Chiang et al. A unified point-based framework for 3d segmentation
US20210382497A1 (en) Scene representation using image processing
CN108491763B (zh) 三维场景识别网络的无监督训练方法、装置及存储介质
US20240119697A1 (en) Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
KR20190074519A (ko) 영상의 상호작용 처리 방법 및 장치
JP6902811B2 (ja) 視差推定システムと方法、電子機器及びコンピュータ可読記憶媒体
CN113711276A (zh) 尺度感知单目定位和地图构建
CN112651423A (zh) 一种智能视觉系统
US20230130281A1 (en) Figure-Ground Neural Radiance Fields For Three-Dimensional Object Category Modelling
WO2022187753A1 (fr) Système d'affinement de profondeur monoculaire guidé par slam à l'aide d'un apprentissage en ligne auto-supervisé
CN114339409A (zh) 视频处理方法、装置、计算机设备及存储介质
EP3048563A1 (fr) Procédé et système d'apprentissage de collecteur incrémentiel
US20240096001A1 (en) Geometry-Free Neural Scene Representations Through Novel-View Synthesis
US11887248B2 (en) Systems and methods for reconstructing a scene in three dimensions from a two-dimensional image
Huang et al. Learning optical flow with R-CNN for visual odometry
EP4292059A1 (fr) Prédiction humaine neuronale multivue à l?aide d?un moteur de rendu différentiable implicite pour l?expression faciale, la forme et la pose du corps, et la capture de performance de vêtements
CN111915587A (zh) 视频处理方法、装置、存储介质和电子设备
US20230254230A1 (en) Processing a time-varying signal
US20230344962A1 (en) Video frame interpolation using three-dimensional space-time convolution
CN115565039A (zh) 基于自注意力机制的单目输入动态场景新视图合成方法
CN117321631A (zh) 使用自监督在线学习的slam引导的单目深度改进系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22753533

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280011051.7

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22753533

Country of ref document: EP

Kind code of ref document: A1