WO2021096324A1 - Method for estimating depth of scene in image and computing device for implementation of the same - Google Patents
Method for estimating depth of scene in image and computing device for implementation of the same
- Publication number
- WO2021096324A1 (PCT/KR2020/016094, KR2020016094W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- depth
- neural network
- scene
- training
- image
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present invention relates to the field of artificial intelligence (AI) and, particularly, to a method for estimating a depth of a scene (with the possibility of reconstructing geometry of the scene) in a scene image and a computing device for implementing said method, in which a scene depth estimation model obtained by neural network technologies is used.
- AI artificial intelligence
- Single-image monocular depth estimation plays a key role in understanding geometry of a 3D scene for such applications as, for example, AR (Augmented Reality) and 3D modelling.
- Classical depth estimation methods use various efficient and inventive ways of utilizing image data, which search for helpful cues in visual data by detecting edges, estimating planes, or matching objects.
- deep learning-based approaches started to compete with classical computer vision algorithms that make use of hand-crafted features.
- the major advances in this field imply training convolutional neural networks to estimate the real-valued depth map from RGB image. Diverse training data is necessary for training a model able to perform in various real-world scenarios.
- the sources of depth data are numerous and have different characteristics. LiDAR (Light Detection and Ranging) scanners that are typically used for self-driving scenarios output precise yet sparse depth measurements. Thus, this data requires careful filtering and manual processing.
- Cheap and miniature commodity-grade depth sensors based on active stereo with structured light e.g. Microsoft Kinect
- Time-of-Flight sensors e.g. Microsoft Kinect Azure or depth sensors in many smartphones
- RGB-D a combination of RGB image and depth image
- SfM Structure from Motion
- Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos.
- CVPR Computer Vision and Pattern Recognition
- the MegaDepth RGB-D dataset has been published, which was obtained using SfM with iterative refinement.
- the same approach was used in the work Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people, 2019, for the Dataset of Frozen People.
- The SfM method works under the assumption that the scene is rigid and does not contain moving objects. Therefore, the SfM method is mainly applied to reconstruct pieces of architecture or skylines.
- UTS models can be trained on both absolute training data and UTS training data. It is proposed in the closest prior art specified in the previous paragraph of this specification to train a UTSS model on absolute data, UTS data, and UTSS data from different sources. The resulting model demonstrated an impressive generalization ability. However, UTSS models have a serious drawback of not being able to reconstruct scene geometry.
- a method for estimating a depth of a scene in an image, comprising the steps of: obtaining the image; estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture and trained using training images, wherein at each training iteration a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used in random proportions.
- UTS Up-to-Scale
- UTSS Up-to-Shift-Scale
- a user computing device comprising a processor and memory storing the trained neural network and processor-executable instructions, which, when executed, cause the processor to execute the method for estimating a depth of a scene in an image according to the first aspect of the present invention.
- the disclosed invention solves at least some or all of the above problems in the prior art by providing an accurate and reliable model for estimating a depth of a scene in an image by applying the scale-invariant model that is based on a neural network, wherein at each training iteration of the neural network in addition to the training images, either a combination of absolute data and UTS data corresponding to the training images or a combination of UTSS data and UTS data corresponding to the training images is used alternately.
- the proposed invention is suitable for use in computing devices having limited resources, since the architecture of the neural network in the used scale invariant model is lightweight.
- the use of the proposed invention does not require special-purpose components, such as a LiDAR scanner, a time-of-flight (ToF) sensor etc., for estimating a depth of a scene (on the basis of which the geometry of the scene can be further constructed), since for such an estimation only a single image (for example, an RGB image of a scene) is needed in the proposed invention, which can be obtained with a general-purpose camera.
- special-purpose components such as a LiDAR scanner, a time-of-flight (ToF) sensor etc.
- FIG. 1 illustrates a flowchart of a method for estimating a depth of a scene in an image according to an embodiment of the present invention.
- FIG. 2 illustrates a refinement branch of a neural network having the proposed architecture according to an embodiment of the present invention.
- FIG. 3 illustrates an example of the structure of CRP (Chain Residual Pooling) block of a neural network in the proposed architecture according to an embodiment of the present invention.
- FIG. 4 illustrates a block diagram of a computing device according to an embodiment of the present invention.
- Depth estimation is solved in the present invention as a dense labelling task in continuous space.
- An efficient solution to such a problem can be obtained using encoder-decoder architectures with skip connections originally developed for semantic segmentation.
- Such architectures make it possible to successfully combine a pre-trained main (backbone-) neural network that serves as a feature extractor with various decoder architectures.
- a typical feature extractor may be a powerful classification network such as ResNet or ResNeXt pre-trained on the large and diverse dataset, for example, ImageNet dataset.
- the generalization ability of these models allows using them for various visual recognition tasks, including the current task of estimating a depth of a scene from an image of the scene.
- a neural network model with a lightweight architecture such as MobileNetV2 or EfficientNet is applicable as an encoder, i.e. feature extractor.
- a decoder is applicable in a neural network architecture for semantic segmentation and object detection.
- the Light-Weight Refine Net decoder iteratively fuses deep feature maps with shallower feature maps.
- the EfficientDet decoder operates in a similar way, but adds reverse fusion procedure.
- the HRNet decoder implements a slightly different strategy: by processing input data in several parallel branches with different resolutions, it extracts high-level features and propagates low-level features. As a result, the output data comprise both structural and semantic information, so the input data are used efficiently.
- Absolute depth estimation. There are several approaches to address the depth estimation problem. Most of the solutions are devoted to estimating absolute depth in metric units. However, it is not always possible to determine the scale of a scene based on a single image. To acquire absolute depth training data, either a depth sensor should be used or stereo pairs obtained by a camera(s) with known extrinsic parameters should be provided, which greatly complicates the process of collecting training data.
- UTS depth is a depth that is defined up to an unknown coefficient (the same coefficient for the entire depth map). In other words, for UTS depth the units of measurement are unknown: it is not known whether the depth is measured in meters, kilometers or millimeters. Other approaches focus on estimating depth up to an unknown coefficient. They aim to reconstruct scene geometry rather than predicting distances to individual points of the scene.
- the UTS data for training models is easier to acquire than absolute training data, yet the pre-processing requires time and computational resources.
- UTSS depth (inverse depth data) is a depth that is defined with up-to-shift-scale precision.
- d* = a * d + b
- a and b are unknown coefficients.
- UTSS inverse depth estimation is applicable to solve the SVDE (Single-View Depth Estimation) task.
- SVDE Single-View Depth Estimation
- this approach has a serious drawback: the scene geometry cannot be restored properly if a shift b of inverse depth is unknown.
- the major advantage of this approach is the simplicity of data acquisition, as UTSS depth is accessible and easy to process.
- training a scale invariant UTS model that is based on a neural network having lightweight architecture can be performed on absolute, UTS, and UTSS training data.
- FIG. 1 illustrates a flowchart of a method for estimating an inverse depth of a scene in an image according to an embodiment of the present invention.
- the method comprises the step S100 of obtaining the image, and the step S110 of estimating the depth of a scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture, wherein the neural network is trained using training images.
- a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data in random proportions is used.
- any N images with any labels are randomly selected into the mixture of training images.
- a depth of the scene in the image is estimated using the scale-invariant model that is based on the neural network having lightweight architecture as the inverse logarithm of the depth of the scene in the image.
- a geometry of the scene may be further constructed based on such an estimate of the scene depth.
- the method may further comprise the step of constructing a scene geometry based on the obtained scene depth estimate.
- the image in step S100 may be any image from among an image captured by a camera of a computing device, an image retrieved from memory of a computing device, or an image downloaded over a network.
- the estimation, in step S110 can be performed by lightweight neural network-based scale-invariant model on a central processing unit (CPU) or any other dedicated processor (ASIC, SoC, FPGA, GPU) of a computing device.
- the lightweight neural network-based scale-invariant model comprises an encoder and a decoder.
- MobileNet encoder, MobileNetv2 encoder, or encoder architectures from EfficientNet namely EfficientNet-Lite0, EfficientNet-b0, b1, b2, b3, b4, b5
- EfficientNet-Lite0, EfficientNet-b0, b1, b2, b3, b4, b5 previously trained in the classification task on the ImageNet training dataset
- the decoder used in the proposed scale-invariant model is based on the vanilla Light-Weight Refine Net decoder, the architecture of which has been modified as follows to meet the requirements of computational efficiency and to solve stability problems.
- the first modification the layer that maps encoder output signal to 256 channels is replaced with a fusion block that does not change a number of channels (i.e., a number of output channels of the fusion block is equal to a number of channels at the corresponding encoder layer).
- a number of channels in each subsequent fusion block configured to fuse a signal from an output of a deeper layer of the decoder and a signal from a corresponding layer of the encoder in a cascade of fusion blocks is reduced relative to a previous fusion block and is equal to a number of channels at an output of the corresponding layer of the encoder.
- the cascade of fusion blocks in the preferred embodiment comprises four fusion blocks, but the present invention is not limited to the specific number, as more or fewer fusion blocks in the cascade may be used to balance estimation accuracy and computational efficiency on various hardware configurations on which the disclosed method is executed.
- the cascade of CRP blocks in the preferred embodiment comprises five CRP blocks, but the present invention is not limited to the specific number, as more or fewer CRP blocks in the cascade may be used to balance estimation accuracy and computational efficiency on various hardware configurations on which the disclosed method is executed.
- FIG. 2 is a schematic diagram of a refinement (decoder) branch of the neural network having the proposed architecture according to the embodiment of the present invention.
- the second modification the summation in a CRP (Chain Residual Pooling) block is supplemented with an averaging operation so that the CRP blocks do not prevent the trained scale invariant model from converging.
- each CRP block comprising two CRP modules, each configured to perform an additive modification of an input signal using a pooling operation (e.g. MaxPooling) and a convolution operation with a filter, an operation of dividing a signal at an output of the CRP block by a number of CRP modules comprised in the given CRP block plus one is added.
- the cascade of CRP modules in a CRP block in the preferred embodiment comprises two CRP modules, but the present invention is not limited to the specific number, as more or fewer CRP modules in the cascade may be used to balance estimation accuracy and computational efficiency on various hardware configurations on which the disclosed method is executed.
- FIG. 3 illustrates an example of the structure of a CRP block of the neural network having the proposed architecture according to the embodiment of the present invention.
- a scale-invariant model based on the neural network having the lightweight architecture as described above produces estimates of the logarithm of inverse depth (e.g., in the form of a map of the logarithm of inverse depth) at half the resolution of the target map of the logarithm of inverse depth, so the output estimates of the logarithm of inverse depth of the image are upscaled to the target (original) resolution using, for example, bilinear interpolation or any other known method.
- the output estimates of logarithms of inverse depth of an image are interpreted as values on a logarithmic scale.
- weights of the to-be-trained neural network may be initialized randomly. At each training iteration of the neural network, a limited number of randomly selected training images is used. Additionally, at the training stage of the neural network, a sum of the loss functions, which is to be minimized, may be applied.
- the pairwise L1 loss function (1) can be calculated more efficiently, in O(n log n) time: the differences between the logarithm of predicted inverse depth and the logarithm of inverse ground truth depth are ordered ascendingly, and after rearranging and grouping similar terms the pairwise loss function L1 can be written in the form (2).
- the pairwise scale invariant loss function indicated above under the number (2) may be applied.
- SI pairwise loss may be easily converted to SSI pairwise loss.
- a logarithm of depth d is replaced with a normalized depth.
- shift-and-scale invariant (SSI) pairwise loss function can be written as follows:
- the modified pairwise shift-and-scale invariant pairwise loss function indicated above under the number (4) may be applied.
- the cumulative loss function can be calculated as a weighted sum of the individual loss functions, where each weight is selected so that gradients from different loss functions are equal in absolute value.
- Loss functions are considered different for different datasets.
- SI and SSI loss functions are also considered different. Gradients may be calculated by averaging with an exponential moving average with a predetermined smoothing parameter. Other types of moving average may be used, such as simple, weighted, etc.
- a predetermined smoothing parameter for example, a size of the moving average window, may be predefined or empirically fitted.
- the sum of weights of the loss functions is equal to 1, and each of the weights is non-negative.
- SI loss may be used for training on absolute and UTS data
- SSI loss may be used for training on both absolute and UTS data as well as on UTSS data.
- Absolute data for training the neural network to estimate the depth and geometry of a scene may be obtained using a motion sensor.
- UTS data for training the neural network to estimate the depth of a scene are obtained with up-to-scale precision using the Structure from Motion algorithm from movies that are available on the Internet.
- UTSS data for training the neural network to estimate the depth and geometry of the scene are obtained from calibrated stereo images using an optical flow determination algorithm (RAFT).
- RAFT optical flow determination algorithm
- FIG. 4 illustrates a block diagram of a computing device 200 according to an embodiment of the present invention.
- the user computing device 200 comprises at least a processor 205 and a memory 210, which are operably connected to each other.
- the processor 205 may perform, among other operations, steps S100 and S110 of the method illustrated in FIG. 1.
- the memory 210 stores the trained neural network (a set of parameters/weights) and processor-executable instructions that, when executed, cause the processor to execute a method for estimating scene depth from an image using the trained neural network.
- Memory 210 is capable of storing any other data and information.
- the computing device 200 may comprise other not shown components, for example, a screen, a camera, a communication unit, a touch-sensitive panel, a speaker, a microphone, a Bluetooth module, an NFC module, a Wi-Fi module, a power supply and corresponding interconnections.
- the disclosed method for estimating a depth of a scene from an image can be implemented on a wide range of computing devices 200, such as laptops, smartphones, tablets, mobile robots and navigation systems.
- the implementation of the proposed method supports all kinds of devices capable of performing calculations on the CPU.
- the computing device has an additional device for accelerating the neural network, such as a GPU (graphics processing unit), NPU (neural processing unit), TPU (tensor data processing unit), faster implementation is possible on such devices.
- At least one of the plurality of modules, blocks, components, steps, sub-steps may be implemented through an AI model.
- a function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
- the processor may include one or a plurality of processors.
- One or a plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
- the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory.
- the predefined operating rule or artificial intelligence model is provided through training or learning.
- being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made.
- the learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
- the AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights.
- Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
- the learning algorithm is a method for training a predetermined target computing device using a plurality of learning data to cause, allow, or control the target computing device to make a determination, estimation, or prediction.
- Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to a method for estimating a depth of a scene in a scene image and a computing device for implementing said method, in which a scene depth estimation model obtained by neural network technologies is used. A technical result consists in enabling an accurate and reliable scene depth estimate from a single image on a computing device that has limited computing resources and does not have special-purpose components for estimating the scene depth. Provided is the method for estimating a depth of a scene in an image, comprising the steps of: obtaining the image; estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture and trained using training images, wherein at each training iteration a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used in random proportions.
Description
The present invention relates to the field of artificial intelligence (AI) and, particularly, to a method for estimating a depth of a scene (with the possibility of reconstructing geometry of the scene) in a scene image and a computing device for implementing said method, in which a scene depth estimation model obtained by neural network technologies is used.
Single-image monocular depth estimation plays a key role in understanding the geometry of a 3D scene for such applications as, for example, AR (Augmented Reality) and 3D modelling. Classical depth estimation methods use various efficient and inventive ways of utilizing image data, which search for helpful cues in visual data by detecting edges, estimating planes, or matching objects. Recently, deep learning-based approaches have started to compete with classical computer vision algorithms that make use of hand-crafted features. The major advances in this field imply training convolutional neural networks to estimate a real-valued depth map from an RGB image. Diverse training data is necessary for training a model able to perform in various real-world scenarios.
The sources of depth data are numerous and have different characteristics. LiDAR (Light Detection and Ranging) scanners, typically used in self-driving scenarios, output precise yet sparse depth measurements, so this data requires careful filtering and manual processing. Cheap and miniature commodity-grade depth sensors based on active stereo with structured light (e.g. Microsoft Kinect), or Time-of-Flight sensors (e.g. Microsoft Kinect Azure or depth sensors in many smartphones), provide relatively dense estimates, yet they are less accurate and have a limited range of detectable distances. These sensors are mainly used in indoor scenarios. In several RGB-D (a combination of RGB image and depth image) datasets such as RedWeb and DIML outdoor, stereo pairs serve as a source of depth information. However, the standard depth estimation procedure based on optical flow does not always provide accurate depth maps, especially for objects located at a large distance (10 meters and more).
Recently, the Structure from Motion (SfM) method has been applied to estimate depth maps via scene reconstruction, e.g., Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: Computer Vision and Pattern Recognition (CVPR), 2018. Based on the results of this work, the MegaDepth RGB-D dataset has been published, which was obtained using SfM with iterative refinement. The same approach was used in the work Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people, 2019, for the Dataset of Frozen People. However, the SfM method works under the assumption that the scene is rigid and does not contain moving objects. Therefore, the SfM method is mainly applied to reconstruct pieces of architecture or skylines.
While some of the datasets comprise absolute depth (usually measured by sensors or estimated from aligned stereo cameras with known intrinsic and extrinsic parameters), other datasets comprise only up-to-scale depth (UTS, usually reconstructed by the SfM method or estimated from aligned stereo cameras with unknown parameters). There are also several datasets comprising up-to-shift-scale inverse depth (UTSS, usually estimated from unaligned stereo cameras with unknown parameters).
Overall, none of the existing datasets used separately is sufficient in terms of accuracy, diversity and image quantity for training a robust depth estimation model. This drawback induced various strategies of mixing data from different sources during training, see, for example, Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2019. The first version of this paper may be considered as the closest prior art.
To train models that estimate depth in absolute values, only training data with depth absolute values can be used. UTS models can be trained on both absolute training data and UTS training data. It is proposed in the closest prior art specified in the previous paragraph of this specification to train a UTSS model on absolute data, UTS data, and UTSS data from different sources. The resulting model demonstrated an impressive generalization ability. However, UTSS models have a serious drawback of not being able to reconstruct scene geometry.
Provided in a first aspect of the present disclosure is a method for estimating a depth of a scene in an image, comprising the steps of: obtaining the image; estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture and is trained using training images, wherein at each training iteration a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used in random proportions.
Provided in a second aspect of the present disclosure is a user computing device comprising a processor and memory storing the trained neural network and processor-executable instructions, which, when executed, cause the processor to execute the method for estimating a depth of a scene in an image according to the first aspect of the present invention.
The disclosed invention solves at least some or all of the above problems in the prior art by providing an accurate and reliable model for estimating a depth of a scene in an image by applying the scale-invariant model that is based on a neural network, wherein at each training iteration of the neural network in addition to the training images, either a combination of absolute data and UTS data corresponding to the training images or a combination of UTSS data and UTS data corresponding to the training images is used alternately. Additionally, the proposed invention is suitable for use in computing devices having limited resources, since the architecture of the neural network in the used scale invariant model is lightweight. Finally, the use of the proposed invention does not require special-purpose components, such as a LiDAR scanner, a time-of-flight (ToF) sensor etc., for estimating a depth of a scene (on the basis of which the geometry of the scene can be further constructed), since for such an estimation only a single image (for example, an RGB image of a scene) is needed in the proposed invention, which can be obtained with a general-purpose camera.
Specific embodiments, implementations and other details of the disclosed invention are illustrated in the drawings, in which:
FIG. 1 illustrates a flowchart of a method for estimating a depth of a scene in an image according to an embodiment of the present invention.
FIG. 2 illustrates a refinement branch of a neural network having the proposed architecture according to an embodiment of the present invention.
FIG. 3 illustrates an example of the structure of CRP (Chain Residual Pooling) block of a neural network in the proposed architecture according to an embodiment of the present invention.
FIG. 4 illustrates a block diagram of a computing device according to an embodiment of the present invention.
First, some general concepts and terms of applicable neural network technologies will be described, and then this section will focus on the differences and modifications of these concepts in the present invention. A person skilled in the art will understand that what follows is not a complete theoretical description of all known neural network technologies, but only that part of such a description that borders on and is necessary for the theoretical foundation and practical implementation of the claimed invention. Special emphasis will be placed on various modifications and differences of the claimed invention from the prior art, as well as on various implementations and embodiments of the claimed invention.
Depth estimation is solved in the present invention as a dense labelling task in continuous space. An efficient solution to such a problem can be obtained using encoder-decoder architectures with skip connections originally developed for semantic segmentation. Such architectures make it possible to successfully combine a pre-trained main (backbone) neural network that serves as a feature extractor with various decoder architectures. By way of example, and not limitation, a typical feature extractor may be a powerful classification network such as ResNet or ResNeXt pre-trained on a large and diverse dataset, for example, the ImageNet dataset. The generalization ability of these models allows using them for various visual recognition tasks, including the current task of estimating a depth of a scene from an image of the scene.
The most well-known and widespread neural network architectures are too computationally expensive to run in real-time on resource-constrained computing devices such as smartphones, tablets, etc. In this case, a neural network model with a lightweight architecture such as MobileNetV2 or EfficientNet is applicable as an encoder, i.e. feature extractor.
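A minimal sketch of the encoder idea described above, assuming a MobileNetV2 backbone from torchvision and hypothetical tap indices for the intermediate feature maps; it is illustrative only and not the patented implementation:

```python
import torch
import torchvision

# Lightweight classification backbone used as a feature extractor.
# In practice ImageNet-pretrained weights would be loaded here.
backbone = torchvision.models.mobilenet_v2(weights=None).features

def extract_feature_maps(image, tap_indices=(3, 6, 13, 18)):
    """Run the encoder and collect intermediate feature maps at several scales."""
    feats, x = [], image
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in tap_indices:   # tap points are assumptions for illustration
            feats.append(x)
    return feats

maps = extract_feature_maps(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in maps])   # multi-scale feature maps for the decoder
```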
A decoder is applicable in a neural network architecture for semantic segmentation and object detection. There are a number of efficient decoders for this purpose, e.g. Light-Weight RefineNet, EfficientDet, and HRNet. The Light-Weight RefineNet decoder iteratively fuses deep feature maps with shallower feature maps. The EfficientDet decoder operates in a similar way, but adds a reverse fusion procedure. The HRNet decoder implements a slightly different strategy: by processing input data in several parallel branches with different resolutions, it extracts high-level features and propagates low-level features. As a result, the output data comprise both structural and semantic information, so the input data are used efficiently.
While the same lightweight encoders are often used in different efficient neural network architectures for solving various problems, a design of decoders tends to be more task-specific. Since computational efficiency is one of the key factors in implementing the present invention, a proper choice of decoder architecture is crucial. According to one of the possible techniques, it is possible to perform a search across various neural network architectures to find the most compact yet effective decoder block for scene depth estimation. According to the other technique, it is possible to balance performance and accuracy by training a lightweight architecture via transfer learning.
Absolute depth estimation. There are several approaches to address the depth estimation problem. Most of the solutions are devoted to estimating absolute depth in metric units. However, it is not always possible to determine the scale of a scene based on a single image. To acquire absolute depth training data, either a depth sensor should be used or stereo pairs obtained by a camera(s) with known extrinsic parameters should be provided, which greatly complicates the process of collecting training data.
Up-to-scale (UTS) depth estimation. UTS depth is a depth that is defined up to an unknown coefficient (the same coefficient for the entire depth map). In other words, for UTS depth the units of measurement are unknown: it is not known whether the depth is measured in meters, kilometers or millimeters. Other approaches focus on estimating depth up to an unknown coefficient. They aim to reconstruct scene geometry rather than predicting distances to individual points of the scene. The UTS data for training models is easier to acquire than absolute training data, yet the pre-processing requires time and computational resources.
Up-to-shift-scale (UTSS) inverse depth estimation. UTSS depth (inverse depth data) is a depth that is defined with up-to-shift-scale precision. In other words, if a value d of inverse depth is known, then the UTSS data about such inverse depth can be defined as d* = a * d + b, where a and b are unknown coefficients. UTSS inverse depth estimation is applicable to solve the SVDE (Single-View Depth Estimation) task. However, this approach has a serious drawback: the scene geometry cannot be restored properly if a shift b of inverse depth is unknown. The major advantage of this approach is the simplicity of data acquisition, as UTSS depth is accessible and easy to process. In this application, it will be disclosed that training a scale invariant UTS model that is based on a neural network having lightweight architecture can be performed on absolute, UTS, and UTSS training data.
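A small sketch of the UTSS relation d* = a * d + b discussed above: without knowing the shift b, inverse depth is defined only up to an affine transform, so geometry cannot be restored from the prediction alone. Fitting a and b by least squares against reference inverse depth (an illustrative assumption, not part of the claimed method) shows how the unknown coefficients could be recovered when a reference is available:

```python
import numpy as np

def align_shift_scale(pred_inv_depth: np.ndarray, ref_inv_depth: np.ndarray):
    """Fit a, b such that a * pred + b approximates the reference inverse depth."""
    A = np.stack([pred_inv_depth.ravel(), np.ones(pred_inv_depth.size)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, ref_inv_depth.ravel(), rcond=None)
    return a, b

rng = np.random.default_rng(0)
true_inv = rng.uniform(0.1, 1.0, size=(4, 4))   # reference inverse depth
utss_pred = 2.5 * true_inv + 0.3                # UTSS prediction with unknown a, b
a, b = align_shift_scale(utss_pred, true_inv)
print(a, b)   # approximately 0.4 and -0.12 (the inverse affine mapping)
```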
Proposed in the present application is a practical solution for estimating a depth (and optionally a geometry) of a scene based on an image of the scene, which ensures a balance between estimation accuracy and computational efficiency. Such a balance is achieved through the use of a neural network having a lightweight architecture and certain features of the neural network training on absolute, UTS and UTSS training data. FIG. 1 illustrates a flowchart of a method for estimating an inverse depth of a scene in an image according to an embodiment of the present invention. The method comprises the step S100 of obtaining the image, and the step S110 of estimating the depth of a scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture, wherein the neural network is trained using training images. At each training iteration, a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data in random proportions is used. In other words, any N images with any labels are randomly selected into the mixture of training images. In an embodiment of the present invention, the depth of the scene in the image is estimated using the scale-invariant model that is based on the neural network having a lightweight architecture as the inverse logarithm of the depth of the scene in the image. Since the scale-invariant model that is based on a neural network having a lightweight architecture estimates the depth with up-to-scale precision, a geometry of the scene may be further constructed based on such an estimate of the scene depth. Thus, in one embodiment of the present invention, the method may further comprise the step of constructing a scene geometry based on the obtained scene depth estimate.
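A sketch of forming a training batch as a random mixture of the three annotation types; the dataset names, labels and batch size are assumptions used only for illustration:

```python
import random

# Hypothetical pools of training samples with different annotation types.
absolute_set = [("abs_img_%d" % i, "absolute") for i in range(100)]
uts_set      = [("uts_img_%d" % i, "uts")      for i in range(100)]
utss_set     = [("utss_img_%d" % i, "utss")    for i in range(100)]
pool = absolute_set + uts_set + utss_set

def sample_training_batch(n: int = 8):
    """Randomly select N images with any labels into the mixture for one iteration."""
    return random.sample(pool, n)

batch = sample_training_batch()
print([label for _, label in batch])   # a random mix of absolute / UTS / UTSS labels
```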
The image in step S100 may be any image from among an image captured by a camera of a computing device, an image retrieved from memory of a computing device, or an image downloaded over a network. The estimation in step S110 can be performed by the lightweight neural network-based scale-invariant model on a central processing unit (CPU) or any other dedicated processor (ASIC, SoC, FPGA, GPU) of a computing device.
The lightweight neural network-based scale-invariant model comprises an encoder and a decoder. A MobileNet encoder, a MobileNetV2 encoder, or encoder architectures from the EfficientNet family (namely EfficientNet-Lite0, EfficientNet-b0, b1, b2, b3, b4, b5), previously trained on the classification task on the ImageNet training dataset, can be used as the encoder, without limitation. The decoder used in the proposed scale-invariant model is based on the vanilla Light-Weight RefineNet decoder, the architecture of which has been modified as follows to meet the requirements of computational efficiency and to solve stability problems. The first modification: the layer that maps the encoder output signal to 256 channels is replaced with a fusion block that does not change the number of channels (i.e., the number of output channels of the fusion block is equal to the number of channels at the corresponding encoder layer). The number of channels in each subsequent fusion block, which is configured to fuse a signal from an output of a deeper layer of the decoder with a signal from a corresponding layer of the encoder in a cascade of fusion blocks, is reduced relative to the previous fusion block and is equal to the number of channels at the output of the corresponding layer of the encoder. The cascade of fusion blocks in the preferred embodiment comprises four fusion blocks, but the present invention is not limited to this specific number, as more or fewer fusion blocks in the cascade may be used to balance estimation accuracy and computational efficiency on the various hardware configurations on which the disclosed method is executed. Similarly, the cascade of CRP (Chain Residual Pooling) blocks in the preferred embodiment comprises five CRP blocks, but more or fewer CRP blocks may be used for the same reason. The above features are illustrated with reference to FIG. 2, which is a schematic diagram of a refinement (decoder) branch of the neural network having the proposed architecture according to the embodiment of the present invention.
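An illustrative PyTorch sketch of the first modification, assuming 1x1 projections and bilinear upsampling inside the fusion block (the exact layer composition is not specified here): the output channel count follows the corresponding encoder layer instead of a fixed 256 channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    def __init__(self, decoder_channels: int, encoder_channels: int):
        super().__init__()
        # 1x1 convolutions project both inputs to the encoder layer's channel count.
        self.proj_deep = nn.Conv2d(decoder_channels, encoder_channels, 1, bias=False)
        self.proj_skip = nn.Conv2d(encoder_channels, encoder_channels, 1, bias=False)

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (coarser) decoder signal to the skip-connection size
        # and fuse by summation; the output channel count equals the encoder layer's.
        deep = F.interpolate(self.proj_deep(deep), size=skip.shape[-2:],
                             mode="bilinear", align_corners=False)
        return F.relu(deep + self.proj_skip(skip))

# Example: fuse a 160-channel decoder map with a 64-channel encoder map.
fuse = FusionBlock(decoder_channels=160, encoder_channels=64)
out = fuse(torch.randn(1, 160, 8, 8), torch.randn(1, 64, 16, 16))
print(out.shape)   # torch.Size([1, 64, 16, 16])
```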
The second modification: the summation in a CRP (Chain Residual Pooling) block is supplemented with an averaging operation so that the CRP blocks do not prevent the trained scale-invariant model from converging. In other words, in each CRP block comprising two CRP modules, each configured to perform an additive modification of an input signal using a pooling operation (e.g., MaxPooling) and a convolution operation with a filter, an operation is added that divides the signal at the output of the CRP block by the number of CRP modules comprised in the given CRP block plus one. The cascade of CRP modules in a CRP block in the preferred embodiment comprises two CRP modules, but the present invention is not limited to this specific number, as more or fewer CRP modules in the cascade may be used to balance estimation accuracy and computational efficiency on the various hardware configurations on which the disclosed method is executed. The above features are illustrated with reference to FIG. 3, which shows an example of the structure of a CRP block of the neural network having the proposed architecture according to the embodiment of the present invention.
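A sketch of the second modification (pooling kernel size and 1x1 convolutions are assumptions): two chained pool-and-convolve modules are summed with the input, and the sum is divided by the number of CRP modules plus one (here 2 + 1 = 3), which turns the residual summation into an average.

```python
import torch
import torch.nn as nn

class CRPBlock(nn.Module):
    def __init__(self, channels: int, n_modules: int = 2):
        super().__init__()
        self.n_modules = n_modules
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=5, stride=1, padding=2) for _ in range(n_modules))
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 1, bias=False) for _ in range(n_modules))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        total, path = x, x
        for pool, conv in zip(self.pools, self.convs):
            path = conv(pool(path))    # additive modification via pooling + convolution
            total = total + path
        return total / (self.n_modules + 1)   # the added averaging operation

block = CRPBlock(channels=64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```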
A scale-invariant model based on the neural network having the lightweight architecture as described above produces estimates of the logarithm of inverse depth (e.g., in the form of a map of the logarithm of inverse depth) at half the resolution of the target map of the logarithm of inverse depth, so the output estimates of the logarithm of inverse depth of the image are upscaled to the target (original) resolution using, for example, bilinear interpolation or any other known method. The output estimates of logarithms of inverse depth of an image are interpreted as values on a logarithmic scale.
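A post-processing sketch under the assumptions stated above (the output is the logarithm of inverse depth at half the target resolution): upscale bilinearly and convert to an up-to-scale depth map.

```python
import torch
import torch.nn.functional as F

def postprocess(log_inv_depth: torch.Tensor, target_hw: tuple) -> torch.Tensor:
    """Upscale the predicted log inverse depth and convert it to depth."""
    log_inv_depth = F.interpolate(log_inv_depth, size=target_hw,
                                  mode="bilinear", align_corners=False)
    inv_depth = torch.exp(log_inv_depth)      # inverse depth (up to scale)
    return 1.0 / inv_depth.clamp(min=1e-6)    # depth (up to scale)

pred = torch.randn(1, 1, 128, 160)            # half-resolution network output
depth = postprocess(pred, target_hw=(256, 320))
print(depth.shape)                            # torch.Size([1, 1, 256, 320])
```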
Before training the neural network, weights of the to-be-trained neural network may be initialized randomly. At each training iteration of the neural network, a limited number of randomly selected training images is used. Additionally, at the training stage of the neural network, a sum of the loss functions, which is to be minimized, may be applied.
Scale invariant pairwise loss. When training the neural network, it is proposed in the present application to use an L1 pairwise loss function, which can be calculated as follows:

$$\mathcal{L}_{SI} = \frac{1}{n^2} \sum_{i,j} \left| (d_i - d_i^*) - (d_j - d_j^*) \right|, \qquad (1)$$

where $d_i$ is a logarithm of predicted inverse depth and $d_i^*$ is a logarithm of inverse ground truth depth. The proposed pairwise loss L1 is scale invariant (SI), so it can be used for training on both absolute depth maps and UTS depth maps. To calculate this loss directly, the summation across $O(n^2)$ terms is performed.

However, the above pairwise loss L1 function (1) can be calculated more efficiently, in $O(n \log n)$ time. Let $t_1 \le t_2 \le \ldots \le t_n$ denote a list of ascendingly ordered difference values $t_i = d_i - d_i^*$ between values $d_i$ of the logarithm of predicted inverse depth and values $d_i^*$ of the logarithm of inverse ground truth depth. After rearranging and grouping similar terms, the pairwise loss function L1 can be written as follows:

$$\mathcal{L}_{SI} = \frac{2}{n^2} \sum_{i=1}^{n} (2i - n - 1)\, t_i, \qquad (2)$$

where $t_i \ge t_j$ if $i > j$, i.e., the differences form an ordered list. To sort the list, $O(n \log n)$ operations are required, and the sum (2) is computed in linear time. Overall, the computational cost of calculating the pairwise loss is $O(n \log n)$.

Therefore, at each training iteration of the neural network, for training images accompanied by absolute or UTS data, the pairwise scale invariant loss function indicated above under the number (2) may be applied.
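A sketch of the efficient computation of the pairwise scale-invariant loss via sorting; the normalization by n squared is an assumption made to keep forms (1) and (2) consistent here:

```python
import torch

def si_pairwise_loss(pred_log_inv_depth: torch.Tensor,
                     gt_log_inv_depth: torch.Tensor) -> torch.Tensor:
    """Pairwise SI loss computed in O(n log n) instead of summing over all pairs."""
    diff = (pred_log_inv_depth - gt_log_inv_depth).flatten()
    t, _ = torch.sort(diff)                                    # O(n log n)
    n = t.numel()
    i = torch.arange(1, n + 1, dtype=t.dtype, device=t.device)
    # sum_{i,j} |t_i - t_j| = 2 * sum_i (2i - n - 1) * t_(i) for the sorted list
    return (2.0 / (n * n)) * torch.sum((2 * i - n - 1) * t)

pred = torch.randn(64, 64)   # predicted log inverse depth (toy values)
gt = torch.randn(64, 64)     # ground-truth log inverse depth (toy values)
print(si_pairwise_loss(pred, gt))
```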
Shift-and-scale invariant (SSI) pairwise loss. SI pairwise loss may be easily converted to SSI pairwise loss. To do this, a logarithm of depth $d$ is replaced with a normalized depth:

$$\hat{d}_i = \frac{d_i - \mu}{\sigma}, \qquad (3)$$

where $\mu$ and $\sigma$ are the mean value and the standard deviation, respectively: $\mu = \frac{1}{n}\sum_{i=1}^{n} d_i$, $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (d_i - \mu)^2}$. Considering the above, the shift-and-scale invariant (SSI) pairwise loss function can be written as follows:

$$\mathcal{L}_{SSI} = \frac{1}{n^2} \sum_{i,j} \left| (\hat{d}_i - \hat{d}_i^*) - (\hat{d}_j - \hat{d}_j^*) \right|. \qquad (4)$$

Therefore, at each training iteration of the neural network, for training images accompanied by absolute, UTS or UTSS data, the modified pairwise shift-and-scale invariant loss function indicated above under the number (4) may be applied.
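A sketch of the SSI variant: predictions and ground truth are normalized by their mean and standard deviation before the same sorted pairwise loss is applied (the small epsilon guarding against division by zero is an assumption):

```python
import torch

def normalize(d: torch.Tensor) -> torch.Tensor:
    """Shift-and-scale normalization as in equation (3)."""
    return (d - d.mean()) / (d.std() + 1e-6)

def ssi_pairwise_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    diff = (normalize(pred) - normalize(gt)).flatten()
    t, _ = torch.sort(diff)
    n = t.numel()
    i = torch.arange(1, n + 1, dtype=t.dtype, device=t.device)
    return (2.0 / (n * n)) * torch.sum((2 * i - n - 1) * t)

# The loss is unaffected by an unknown shift and scale of the ground truth.
print(ssi_pairwise_loss(torch.randn(64, 64), 3.0 * torch.randn(64, 64) + 5.0))
```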
In a further embodiment of the present invention, the cumulative loss function can be calculated as $\mathcal{L} = \sum_k w_k \mathcal{L}_k$, where $\mathcal{L}_k$ is a corresponding loss function and $w_k$ is a corresponding weight of the loss function, wherein the weights $w_k$ are selected so that the gradients from different loss functions are equal in absolute value: $w_k \left\| \nabla \mathcal{L}_k \right\| = w_m \left\| \nabla \mathcal{L}_m \right\|$ for all $k$, $m$. Loss functions are considered different for different datasets. SI and SSI loss functions are also considered different. The gradients may be calculated by averaging with an exponential moving average with a predetermined smoothing parameter. Other types of moving average may be used, such as simple, weighted, etc. The predetermined smoothing parameter, for example, a size of the moving average window, may be predefined or empirically fitted. The sum of the weights of the loss functions is equal to 1, and each of the weights is non-negative. Thus, having a mixture of UTS and UTSS ground-truth data, SI loss may be used for training on absolute and UTS data, and SSI loss may be used for training on both absolute and UTS data as well as on UTSS data.
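A sketch of the gradient-balancing heuristic; the smoothing factor and the way the gradient magnitude is measured are assumptions. Each loss weight is made inversely proportional to an exponential moving average of that loss's gradient magnitude and the weights are renormalized to sum to 1:

```python
class GradientBalancer:
    def __init__(self, loss_names, smoothing: float = 0.99):
        self.smoothing = smoothing
        self.ema = {name: None for name in loss_names}

    def update(self, grad_norms: dict) -> dict:
        """grad_norms: measured |grad| per loss at this iteration -> new weights."""
        for name, g in grad_norms.items():
            prev = self.ema[name]
            self.ema[name] = g if prev is None else \
                self.smoothing * prev + (1.0 - self.smoothing) * g
        inv = {name: 1.0 / max(v, 1e-12) for name, v in self.ema.items()}
        total = sum(inv.values())
        return {name: v / total for name, v in inv.items()}   # non-negative, sum to 1

balancer = GradientBalancer(["si_datasetA", "ssi_datasetB"])
weights = balancer.update({"si_datasetA": 0.8, "ssi_datasetB": 0.2})
print(weights)   # the smaller-gradient loss receives the larger weight
```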
Absolute data for training the neural network to estimate the depth and geometry of a scene may be obtained using a motion sensor. UTS data for training the neural network to estimate the depth of a scene are obtained with up-to-scale precision using the Structure from Motion algorithm from movies that are available on the Internet. UTSS data for training the neural network to estimate the depth and geometry of the scene are obtained from calibrated stereo images using an optical flow determination algorithm (RAFT). At each iteration of training the neural network, training images from the mixture of training images are provided to the trained neural network in a random order.
FIG. 4 illustrates a block diagram of a computing device 200 according to an embodiment of the present invention. The user computing device 200 comprises at least a processor 205 and a memory 210, which are operably connected to each other. The processor 205 may perform, among other operations, steps S100 and S110 of the method illustrated in FIG. 1. The memory 210 stores the trained neural network (a set of parameters/weights) and processor-executable instructions that, when executed, cause the processor to execute a method for estimating scene depth from an image using the trained neural network. Memory 210 is capable of storing any other data and information. The computing device 200 may comprise other not shown components, for example, a screen, a camera, a communication unit, a touch-sensitive panel, a speaker, a microphone, a Bluetooth module, an NFC module, a Wi-Fi module, a power supply and corresponding interconnections. The disclosed method for estimating a depth of a scene from an image can be implemented on a wide range of computing devices 200, such as laptops, smartphones, tablets, mobile robots and navigation systems. The implementation of the proposed method supports all kinds of devices capable of performing calculations on the CPU. In addition, if the computing device has an additional device for accelerating the neural network, such as a GPU (graphics processing unit), NPU (neural processing unit), TPU (tensor data processing unit), faster implementation is possible on such devices.
At least one of the plurality of modules, blocks, components, steps, sub-steps may be implemented through an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. One or a plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks. The learning algorithm is a method for training a predetermined target computing device using a plurality of learning data to cause, allow, or control the target computing device to make a determination, estimation, or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It should be understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user enjoying some of these technical effects, while other embodiments may be implemented with the user enjoying other technical effects or none at all.
Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.
While the above-described implementations have been described and shown with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, sub-divided, or reordered without departing from the teachings of the present technology. Accordingly, an order and grouping of the steps is not a limitation of the present technology. The use of the singular form in relation to any element disclosed in this application does not preclude that two or more such elements may be in an actual implementation.
Claims (14)
- A method for estimating a depth of a scene in an image, comprising the steps of: obtaining (S100) the image; estimating (S110) the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture, wherein the neural network is trained using training images, and wherein at each training iteration a mixture of images randomly selected from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used in random proportions.
- The method of claim 1, wherein the depth of the scene in the image is estimated using the scale-invariant model that is based on the neural network having lightweight architecture as the inverse logarithm of the depth of the scene in the image.
- The method of claim 1, further comprising the step of constructing a geometry of the scene based on the obtained scene depth estimate.
- The method of claim 1, wherein before training the neural network, weights of the to-be-trained neural network are initialized randomly.
- The method of claim 1, wherein at the stage of training the neural network, one or more loss functions to be minimized are applied.
- The method of claim 5, wherein at each training iteration of the neural network for images accompanied by absolute or UTS data, a pairwise scale invariant loss function is applied: $\mathcal{L}_{SI} = \frac{1}{n^2} \sum_{i,j} \left| (d_i - d_i^*) - (d_j - d_j^*) \right|$, where $d_i$ is a logarithm of predicted inverse depth and $d_i^*$ is a logarithm of inverse ground truth depth.
- The method of claim 5, wherein at each training iteration of the neural network for images accompanied by absolute, UTS or UTSS data, the following modified pairwise shift-scale invariant loss function is applied: $\mathcal{L}_{SSI} = \frac{1}{n^2} \sum_{i,j} \left| (\hat{d}_i - \hat{d}_i^*) - (\hat{d}_j - \hat{d}_j^*) \right|$, where $\hat{d}_i = (d_i - \mu)/\sigma$ is the normalized depth with mean $\mu$ and standard deviation $\sigma$.
- The method of any one of claims 6-7, further comprising the step of calculating a cumulative loss function $\mathcal{L} = \sum_k w_k \mathcal{L}_k$, where the weights $w_k$ are selected so that the gradients of different loss functions are equal in absolute value: $w_k \left\| \nabla \mathcal{L}_k \right\| = w_m \left\| \nabla \mathcal{L}_m \right\|$, the gradients are calculated by averaging with an exponential moving average with a predetermined smoothing parameter, and the sum of the weights of the loss functions is equal to 1, and each of the weights is non-negative.
- The method of claim 1, wherein the neural network consists of an encoder and a decoder, the encoder is a MobileNetV2 or EfficientNet encoder, and the decoder is a modified Light-Weight RefineNet decoder, in which: a number of channels in each subsequent fusion block configured to fuse a signal from an output of a deeper layer of the decoder and a signal from a corresponding layer of the encoder in a cascade of fusion blocks is reduced relative to a previous fusion block and is equal to a number of channels at an output of the corresponding layer of the encoder, and, in the process of training, in each CRP (chain residual pooling) block comprising two CRP modules, each configured to perform an additive modification of an input signal using a pooling operation (MaxPooling) and a convolution operation with a filter, an operation of dividing a signal at an output of the CRP block by a number of CRP modules plus one is added; the cascade of fusion blocks comprises four fusion blocks, and the cascade of CRP blocks comprises five CRP blocks.
- The method of claim 1, wherein the absolute data for training the neural network to estimate the depth and geometry of the scene are obtained using a motion sensor.
- The method of claim 1, wherein the UTS data for training the neural network to estimate the depth and geometry of the scene are obtained with up-to-scale precision using Structure From Motion algorithm from movies that are available on the Internet.
- The method of claim 1, wherein the UTSS data for training the neural network to estimate the depth and geometry of the scene are obtained from calibrated stereo images using an optical flow determination algorithm (RAFT).
- The method of any one of claims 10-12, in which at each iteration of training the neural network, training images from the mixture of training images are provided to the trained neural network in a random order.
- A user computing device (200) comprising a processor (205) and a memory (210) storing the trained neural network and processor-executable instructions which, when executed, cause the processor to execute the method for estimating a depth of a scene in an image according to any one of claims 1 to 13 using the trained neural network.
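The following is a hedged, illustrative sketch of the mixed-batch sampling described in claims 1 and 13: at each iteration a batch is drawn from the absolute, UTS, and UTSS pools in random proportions and then shuffled into a random order. The pool and function names are assumptions introduced for illustration, not the claimed implementation.

```python
# Hedged sketch of mixed-batch sampling (claims 1 and 13); names are illustrative.
import random

def sample_mixed_batch(absolute_pool, uts_pool, utss_pool, batch_size):
    pools = [absolute_pool, uts_pool, utss_pool]
    # Two sorted uniform cuts split the batch into three random proportions.
    cuts = sorted(random.random() for _ in range(2))
    counts = [round(cuts[0] * batch_size),
              round((cuts[1] - cuts[0]) * batch_size)]
    counts.append(batch_size - sum(counts))
    batch = []
    for pool, n in zip(pools, counts):
        batch.extend(random.choices(pool, k=n))  # sample each source with replacement
    random.shuffle(batch)                        # present images in random order
    return batch
```

With this kind of sampling a single iteration may mix all three data sources, so the scale-invariant and shift-scale-invariant losses of claims 6 and 7 can both be active in the same step.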
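The formulas for the pairwise scale-invariant loss (claim 6) and the modified pairwise shift-scale-invariant loss (claim 7) are not reproduced in this text. As a hedged illustration only, with $\hat{d}_i$ and $d_i^{*}$ denoting the predicted and reference log-depths at pixel $i$, a pairwise scale-invariant loss is commonly written as

$$
\mathcal{L}_{\mathrm{SI}} = \frac{1}{2n^{2}} \sum_{i,j} \Big( (\hat{d}_i - \hat{d}_j) - (d_i^{*} - d_j^{*}) \Big)^{2},
$$

which compares all pixel pairs and is unaffected by a global multiplicative rescaling of the predicted depth. A shift-scale-invariant counterpart for UTSS data can additionally align the prediction with a least-squares scale $s$ and shift $t$ before measuring the error, for example over inverse-depth values $\hat{u}_i$, $u_i^{*}$:

$$
\mathcal{L}_{\mathrm{SSI}} = \min_{s,t} \; \frac{1}{n} \sum_{i} \big( s\,\hat{u}_i + t - u_i^{*} \big)^{2}.
$$

These are assumed, commonly used forms; the claimed losses may differ in detail.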
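The cumulative loss of claim 8 combines the individual losses with weights chosen so that their gradients are balanced in magnitude, with the gradient norms smoothed by an exponential moving average. A minimal PyTorch-style sketch of one way to realize this is given below; the class name, the smoothing default, and the choice to balance gradient norms over the shared parameters are assumptions made for illustration.

```python
# Hedged sketch (not the patent's code): weights chosen so that the EMA-smoothed
# gradient magnitudes of the individual losses are balanced; the weights are
# non-negative and normalized to sum to 1, as stated in claim 8.
import torch


class GradientBalancedLoss:
    def __init__(self, num_losses: int, smoothing: float = 0.9):
        self.smoothing = smoothing          # predetermined EMA smoothing parameter
        self.ema = [None] * num_losses      # EMA of each loss's gradient norm

    def __call__(self, losses, shared_params):
        norms = []
        for i, loss in enumerate(losses):
            # Gradient of this loss alone w.r.t. the shared network parameters.
            grads = torch.autograd.grad(loss, shared_params,
                                        retain_graph=True, allow_unused=True)
            sq = sum((g ** 2).sum() for g in grads if g is not None)
            norm = torch.as_tensor(sq, dtype=torch.float32).sqrt().detach()
            # Exponential moving average of the gradient norm.
            if self.ema[i] is None:
                self.ema[i] = norm
            else:
                self.ema[i] = self.smoothing * self.ema[i] + (1.0 - self.smoothing) * norm
            norms.append(self.ema[i])
        # Weights ~ 1 / gradient norm, so the weighted gradients have equal magnitude.
        inv = torch.stack([1.0 / (n + 1e-8) for n in norms])
        weights = inv / inv.sum()           # non-negative, sum to 1
        return sum(w * l for w, l in zip(weights, losses))
```

Because the weights are computed from detached gradient norms, they act as constants in the backward pass while still satisfying the non-negativity and sum-to-one conditions of the claim.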
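Claim 9 modifies the CRP (chained residual pooling) blocks of a Light-Weight RefineNet style decoder by dividing the block output by the number of CRP modules plus one. A hedged PyTorch-style sketch of such a block with two modules follows; the kernel sizes and the 1x1 convolutions follow the usual Light-Weight RefineNet design and are assumptions here, not quotations from the claims.

```python
# Hedged sketch (not the patent's code) of a CRP block with two modules, each
# applying max-pooling and a convolution and adding the result back, followed by
# the division of the block output by (number of modules + 1) from claim 9.
import torch.nn as nn


class CRPBlock(nn.Module):
    def __init__(self, channels: int, num_modules: int = 2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False)
            for _ in range(num_modules)
        )
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        out = x
        path = x
        for conv in self.convs:
            path = conv(self.pool(path))    # pooling, then convolution with a filter
            out = out + path                # additive modification of the input signal
        return out / (len(self.convs) + 1)  # divide by the number of CRP modules + 1
```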
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2019136634 | 2019-11-14 | ||
RU2019136634 | 2019-11-14 | ||
RU2020136895 | 2020-11-10 | ||
RU2020136895A RU2761768C1 (en) | 2020-11-10 | 2020-11-10 | Method for estimating the depth of a scene based on an image and computing apparatus for implementation thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021096324A1 (en) | 2021-05-20 |
Family
ID=75913123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2020/016094 WO2021096324A1 (en) | 2019-11-14 | 2020-11-16 | Method for estimating depth of scene in image and computing device for implementation of the same |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021096324A1 (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180247113A1 (en) * | 2016-10-10 | 2018-08-30 | Gyrfalcon Technology Inc. | Image Classification Systems Based On CNN Based IC and Light-Weight Classifier |
US20180137406A1 (en) * | 2016-11-15 | 2018-05-17 | Google Inc. | Efficient Convolutional Neural Networks and Techniques to Reduce Associated Computational Costs |
US20190147318A1 (en) * | 2017-11-14 | 2019-05-16 | Google Llc | Highly Efficient Convolutional Neural Networks |
Non-Patent Citations (2)
Title |
---|
KIM SANGWON, NAM JAEYEAL, KO BYOUNGCHUL: "Fast Depth Estimation in a Single Image Using Lightweight Efficient Neural Network", SENSORS, vol. 19, no. 20, pages 4434, XP055812320, DOI: 10.3390/s19204434 * |
MARK SANDLER, HOWARD ANDREW, ZHU MENGLONG, ZHMOGINOV ANDREY, CHEN LIANG-CHIEH: "MobileNetV2: Inverted Residuals and Linear Bottlenecks", 2 April 2018 (2018-04-02), XP055522020, Retrieved from the Internet <URL:https://arxiv.org/pdf/1801.04381.pdf> [retrieved on 20181107] * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506307A (en) * | 2021-06-29 | 2021-10-15 | 吉林大学 | Medical image segmentation method for improving U-Net neural network based on residual connection |
CN113705580A (en) * | 2021-08-31 | 2021-11-26 | 西安电子科技大学 | Hyperspectral image classification method based on deep migration learning |
CN113705580B (en) * | 2021-08-31 | 2024-05-14 | 西安电子科技大学 | Hyperspectral image classification method based on deep migration learning |
CN114141108A (en) * | 2021-12-03 | 2022-03-04 | 中国科学技术大学 | Blind-aiding voice-aided reading equipment and method |
CN114510959A (en) * | 2021-12-21 | 2022-05-17 | 中国人民解放军战略支援部队信息工程大学 | Radar signal modulation mode identification method and system based on split EfficientNet network under low signal-to-noise ratio |
CN114972517A (en) * | 2022-06-10 | 2022-08-30 | 上海人工智能创新中心 | RAFT-based self-supervision depth estimation method |
CN114972517B (en) * | 2022-06-10 | 2024-05-31 | 上海人工智能创新中心 | Self-supervision depth estimation method based on RAFT |
CN115424410A (en) * | 2022-11-03 | 2022-12-02 | 国网浙江省电力有限公司金华供电公司 | High-voltage environment protection method based on wireless radiation perception and multi-modal data |
CN115424410B (en) * | 2022-11-03 | 2023-12-19 | 国网浙江省电力有限公司金华供电公司 | High-pressure environment protection method based on wireless radiation sensing and multi-mode data |
CN118397068A (en) * | 2024-07-01 | 2024-07-26 | 杭州师范大学 | Monocular depth estimation method based on evolutionary neural network architecture search |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021096324A1 (en) | Method for estimating depth of scene in image and computing device for implementation of the same | |
CN111402130B (en) | Data processing method and data processing device | |
JP2022518322A (en) | Semantic segmentation with soft cross entropy loss | |
CN110717851A (en) | Image processing method and device, neural network training method and storage medium | |
CN112308200A (en) | Neural network searching method and device | |
WO2018176186A1 (en) | Semantic image segmentation using gated dense pyramid blocks | |
CN111797983A (en) | Neural network construction method and device | |
CN112258512A (en) | Point cloud segmentation method, device, equipment and storage medium | |
CN113066017A (en) | Image enhancement method, model training method and equipment | |
US20210064919A1 (en) | Method and apparatus for processing image | |
CN113807361B (en) | Neural network, target detection method, neural network training method and related products | |
WO2022228142A1 (en) | Object density determination method and apparatus, computer device and storage medium | |
US20210064955A1 (en) | Methods, apparatuses, and computer program products using a repeated convolution-based attention module for improved neural network implementations | |
CN113781519A (en) | Target tracking method and target tracking device | |
CN114359289A (en) | Image processing method and related device | |
CN113066018A (en) | Image enhancement method and related device | |
CN115018039A (en) | Neural network distillation method, target detection method and device | |
CN116402876A (en) | Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium | |
CN117496312A (en) | Three-dimensional multi-target detection method based on multi-mode fusion algorithm | |
CN115601551A (en) | Object identification method and device, storage medium and electronic equipment | |
CN117392488A (en) | Data processing method, neural network and related equipment | |
CN115049730B (en) | Component mounting method, component mounting device, electronic apparatus, and storage medium | |
CN115862012A (en) | Point cloud data semantic segmentation method and device, electronic equipment and storage medium | |
RU2761768C1 (en) | Method for estimating the depth of a scene based on an image and computing apparatus for implementation thereof | |
Zhang et al. | Densely connecting depth maps for monocular depth estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20886592; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20886592; Country of ref document: EP; Kind code of ref document: A1 |