WO2021096324A1 - Method for estimating depth of scene in image and computing device for implementation of the same - Google Patents

Info

Publication number: WO2021096324A1
Authority: WO (WIPO PCT)
Prior art keywords: depth, neural network, scene, training, image
Application number: PCT/KR2020/016094
Other languages: French (fr)
Inventors: Mikhail Viktorovich ROMANOV, Nikolay Andreevich PATAKIN, Ilia Igorevich BELIKOV, Anton Sergeevich Konushin
Original assignee: Samsung Electronics Co., Ltd.
Priority claimed from: RU2020136895A (see also RU2761768C1)
Application filed by: Samsung Electronics Co., Ltd.
Publication of: WO2021096324A1

Classifications

    • G06N 3/08 Learning methods (Computing arrangements based on biological models; Neural networks)
    • G06N 3/045 Combinations of networks (Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology)
    • G06T 7/50 Depth or shape recovery (Image analysis)
    • G06T 2207/10024 Color image (Indexing scheme for image analysis or image enhancement; Image acquisition modality)
    • G06T 2207/20081 Training; Learning (Indexing scheme for image analysis or image enhancement; Special algorithmic details)
    • G06T 2207/20084 Artificial neural networks [ANN] (Indexing scheme for image analysis or image enhancement; Special algorithmic details)

Abstract

The present invention relates to a method for estimating a depth of a scene in a scene image and a computing device for implementing said method, in which a scene depth estimation model obtained by neural network technologies is used. The technical result consists in enabling an accurate and reliable estimate of the scene depth from a single image on a computing device that has limited computing resources and does not have special-purpose components for estimating the scene depth. Provided is a method for estimating a depth of a scene in an image, comprising the steps of: obtaining the image; and estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture and is trained using training images, wherein at each training iteration a mixture of images randomly selected, in random proportions, from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used.

Description

METHOD FOR ESTIMATING DEPTH OF SCENE IN IMAGE AND COMPUTING DEVICE FOR IMPLEMENTATION OF THE SAME
The present invention relates to the field of artificial intelligence (AI) and, particularly, to a method for estimating a depth of a scene (with the possibility of reconstructing geometry of the scene) in a scene image and a computing device for implementing said method, in which a scene depth estimation model obtained by neural network technologies is used.
Single-image monocular depth estimation plays a key role in understanding the geometry of a 3D scene for such applications as, for example, AR (Augmented Reality) and 3D modelling. Classical depth estimation methods use various efficient and inventive ways of utilizing image data, searching for helpful cues in visual data by detecting edges, estimating planes, or matching objects. Recently, deep learning-based approaches have started to compete with classical computer vision algorithms that make use of hand-crafted features. The major advances in this field involve training convolutional neural networks to estimate a real-valued depth map from an RGB image. Diverse training data is necessary for training a model able to perform in various real-world scenarios.
The sources of depth data are numerous and have different characteristics. LiDAR (Light Detection and Ranging) scanners, which are typically used for self-driving scenarios, output precise yet sparse depth measurements; this data therefore requires careful filtering and manual processing. Cheap and miniature commodity-grade depth sensors based on active stereo with structured light (e.g. Microsoft Kinect) or Time-of-Flight sensors (e.g. Microsoft Kinect Azure or depth sensors in many smartphones) provide relatively dense estimates, but they are less accurate and have a limited range of detectable distances. These sensors are mainly used for indoor scenarios. In several RGB-D (a combination of RGB image and depth image) datasets, such as RedWeb and DIML outdoor, stereo pairs serve as a source of depth information. However, the standard depth estimation procedure based on optical flow does not always provide accurate depth maps, especially for objects located at a large distance (10 meters and more).
Recently, the Structure from Motion (SfM) method has been applied to estimate depth maps via scene reconstruction, e.g., Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: Computer Vision and Pattern Recognition (CVPR), 2018. Based on the results of this work, the MegaDepth RGB-D dataset has been published, which was obtained using SfM with iterative refinement. The same approach was used in the work Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people, 2019, for the Dataset of Frozen People. However, the SfM method works under the assumption that the scene is rigid and does not contain moving objects. Therefore, the SfM method is mainly applied to reconstruct pieces of architecture or skylines.
While some of the datasets comprise absolute depth (usually measured by sensors or estimated from aligned stereo cameras with known intrinsic and extrinsic parameters), other datasets comprise only up-to-scale depth (UTS, usually reconstructed by the SfM method or estimated from aligned stereo cameras with unknown parameters). There are also several datasets comprising up-to-shift-scale inverse depth (UTSS, usually estimated from unaligned stereo cameras with unknown parameters).
Overall, none of the existing datasets used separately is sufficient in terms of accuracy, diversity and image quantity for training a robust depth estimation model. This drawback induced various strategies of mixing data from different sources during training, see, for example, Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2019. The first version of this paper may be considered as the closest prior art.
To train models that estimate depth in absolute values, only training data with absolute depth values can be used. UTS models can be trained on both absolute training data and UTS training data. It is proposed in the closest prior art specified in the previous paragraph of this specification to train a UTSS model on absolute data, UTS data, and UTSS data from different sources. The resulting model demonstrated an impressive generalization ability. However, UTSS models have a serious drawback of not being able to reconstruct scene geometry.
Provided in a first aspect of the present disclosure is a method for estimating a depth of a scene in an image, comprising the steps of: obtaining the image; and estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture and is trained using training images, wherein at each training iteration a mixture of images randomly selected, in random proportions, from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used.
Provided in a second aspect of the present disclosure is a user computing device comprising a processor and memory storing the trained neural network and processor-executable instructions, which, when executed, cause the processor to execute the method for estimating a depth of a scene in an image according to the first aspect of the present invention.
The disclosed invention solves at least some or all of the above problems in the prior art by providing an accurate and reliable model for estimating a depth of a scene in an image by applying the scale-invariant model that is based on a neural network, wherein at each training iteration of the neural network in addition to the training images, either a combination of absolute data and UTS data corresponding to the training images or a combination of UTSS data and UTS data corresponding to the training images is used alternately. Additionally, the proposed invention is suitable for use in computing devices having limited resources, since the architecture of the neural network in the used scale invariant model is lightweight. Finally, the use of the proposed invention does not require special-purpose components, such as a LiDAR scanner, a time-of-flight (ToF) sensor etc., for estimating a depth of a scene (on the basis of which the geometry of the scene can be further constructed), since for such an estimation only a single image (for example, an RGB image of a scene) is needed in the proposed invention, which can be obtained with a general-purpose camera.
Specific embodiments, implementations and other details of the disclosed invention are illustrated in the drawings, in which:
FIG. 1 illustrates a flowchart of a method for estimating a depth of a scene in an image according to an embodiment of the present invention.
FIG. 2 illustrates a refinement branch of a neural network having the proposed architecture according to an embodiment of the present invention.
FIG. 3 illustrates an example of the structure of CRP (Chain Residual Pooling) block of a neural network in the proposed architecture according to an embodiment of the present invention.
FIG. 4 illustrates a block diagram of a computing device according to an embodiment of the present invention.
First, some general concepts and terms of applicable neural network technologies will be described, and then this section will focus on the differences and modifications of these concepts in the present invention. A person skilled in the art will understand that what follows is not a complete theoretical description of all known neural network technologies, but only that part of such a description that borders on, and is necessary for, the theoretical foundation and practical implementation of the claimed invention. Special emphasis will be placed on various modifications and differences of the claimed invention from the prior art, as well as on various implementations and embodiments of the claimed invention.
Depth estimation is solved in the present invention as a dense labelling task in continuous space. An efficient solution to such a problem can be obtained using encoder-decoder architectures with skip connections originally developed for semantic segmentation. Such architectures make it possible to successfully combine a pre-trained main (backbone) neural network that serves as a feature extractor with various decoder architectures. By way of example, and not limitation, a typical feature extractor may be a powerful classification network such as ResNet or ResNeXt pre-trained on a large and diverse dataset, for example, the ImageNet dataset. The generalization ability of these models allows using them for various visual recognition tasks, including the current task of estimating a depth of a scene from an image of the scene.
The most well-known and widespread neural network architectures are too computationally expensive to run in real-time on resource-constrained computing devices such as smartphones, tablets, etc. In this case, a neural network model with a lightweight architecture such as MobileNetV2 or EfficientNet is applicable as an encoder, i.e. feature extractor.
A decoder is applicable in a neural network architecture for semantic segmentation and object detection. There are a number of efficient decoders for this purpose, e.g. Light-Weight RefineNet, EfficientDet, and HRNet. The Light-Weight RefineNet decoder iteratively fuses deep feature maps with shallower feature maps. The EfficientDet decoder operates in a similar way, but adds a reverse fusion procedure. The HRNet decoder implements a slightly different strategy: by processing input data in several parallel branches with different resolutions, it extracts high-level features and propagates low-level features. As a result, the output data comprise both structural and semantic information, so the input data are used efficiently.
While the same lightweight encoders are often used in different efficient neural network architectures for solving various problems, the design of decoders tends to be more task-specific. Since computational efficiency is one of the key factors in implementing the present invention, a proper choice of decoder architecture is crucial. According to one possible technique, a search across various neural network architectures can be performed to find the most compact yet effective decoder block for scene depth estimation. According to another technique, performance and accuracy can be balanced by training a lightweight architecture via transfer learning.
Absolute depth estimation. There are several approaches to addressing the depth estimation problem. Most of the solutions are devoted to estimating absolute depth in metric units. However, it is not always possible to determine the scale of a scene based on a single image. To acquire absolute depth training data, either a depth sensor should be used or stereo pairs obtained by camera(s) with known extrinsic parameters should be provided, which greatly complicates the process of collecting training data.
Up-to-scale (UTS) depth estimation. UTS depth is a depth that is defined up to an unknown coefficient (common for the entire depth map). In other words, for UTS depth the units of measurement are unknown: it is not known whether the depth is measured in meters, kilometers or millimeters. Other approaches focus on estimating depth up to an unknown coefficient. They aim to reconstruct scene geometry rather than predicting distances to the single points of the scene. The UTS data for training models is easier to acquire than absolute training data, yet the pre-processing requires time and computational resources.
Up-to-shift-scale (UTSS) inverse depth estimation. UTSS depth (inverse depth data) is a depth that is defined with up-to-shift-scale precision. In other words, if a value d of inverse depth is known, then the UTSS data about such inverse depth can be defined as d* = a * d + b, where a and b are unknown coefficients. UTSS inverse depth estimation is applicable to solve the SVDE (Single-View Depth Estimation) task. However, this approach has a serious drawback: the scene geometry cannot be restored properly if a shift b of inverse depth is unknown. The major advantage of this approach is the simplicity of data acquisition, as UTSS depth is accessible and easy to process. In this application, it will be disclosed that training a scale invariant UTS model that is based on a neural network having lightweight architecture can be performed on absolute, UTS, and UTSS training data.
Proposed in the present application is a practical solution for estimating a depth (and optionally a geometry) of a scene based on an image of the scene, which ensures a balance between estimation accuracy and computational efficiency. Such a balance is achieved through the use of a neural network having a lightweight architecture and certain features of training the neural network on absolute, UTS and UTSS training data. FIG. 1 illustrates a flowchart of a method for estimating a depth of a scene in an image according to an embodiment of the present invention. The method comprises the step S100 of obtaining the image, and the step S110 of estimating the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture, where the neural network is trained using training images. At each training iteration, a mixture of images randomly selected, in random proportions, from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used. In other words, any N images with any labels are randomly selected into the mixture of training images. In an embodiment of the present invention, the depth of the scene in the image is estimated by the scale-invariant model that is based on the neural network having a lightweight architecture as the logarithm of the inverse depth of the scene in the image. Since the scale-invariant model estimates the depth with up-to-scale precision, a geometry of the scene may be further constructed based on such an estimate of the scene depth. Thus, in one embodiment of the present invention, the method may further comprise the step of constructing a scene geometry based on the obtained scene depth estimate.
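For illustration only, the sketch below shows one way such a mixed training batch could be assembled at each iteration; the pool names, batch size, and helper function are hypothetical and are not taken from the patent text.

```python
import random

def sample_mixed_batch(absolute_pool, uts_pool, utss_pool, batch_size):
    """Draw a training batch mixing absolute, UTS and UTSS samples in random
    proportions: each slot independently picks a pool, then a sample from it."""
    pools = [absolute_pool, uts_pool, utss_pool]
    return [random.choice(random.choice(pools)) for _ in range(batch_size)]

# Toy usage with placeholder sample identifiers.
batch = sample_mixed_batch(
    absolute_pool=["abs_img_0", "abs_img_1"],
    uts_pool=["uts_img_0", "uts_img_1"],
    utss_pool=["utss_img_0", "utss_img_1"],
    batch_size=8,
)
```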
The image in step S100 may be any image from among an image captured by a camera of a computing device, an image retrieved from memory of a computing device, or an image downloaded over a network. The estimation in step S110 can be performed by the lightweight neural network-based scale-invariant model on a central processing unit (CPU) or any other dedicated processor (ASIC, SoC, FPGA, GPU) of a computing device.
The lightweight neural network-based scale-invariant model comprises an encoder and a decoder. A MobileNet encoder, a MobileNetV2 encoder, or an encoder architecture from the EfficientNet family (namely EfficientNet-Lite0, EfficientNet-b0, b1, b2, b3, b4, b5), previously trained on the ImageNet classification task, can be used as the encoder, although the invention is not limited to these options. The decoder used in the proposed scale-invariant model is based on the vanilla Light-Weight RefineNet decoder, the architecture of which has been modified as follows to meet the requirements of computational efficiency and to solve stability problems. The first modification: the layer that maps the encoder output signal to 256 channels is replaced with a fusion block that does not change the number of channels (i.e., the number of output channels of the fusion block is equal to the number of channels at the corresponding encoder layer). The number of channels in each subsequent fusion block, configured to fuse a signal from an output of a deeper layer of the decoder and a signal from a corresponding layer of the encoder in a cascade of fusion blocks, is reduced relative to the previous fusion block and is equal to the number of channels at the output of the corresponding layer of the encoder. The cascade of fusion blocks in the preferred embodiment comprises four fusion blocks, but the present invention is not limited to this specific number, as more or fewer fusion blocks in the cascade may be used to balance estimation accuracy and computational efficiency on the various hardware configurations on which the disclosed method is executed. Likewise, the cascade of CRP blocks in the preferred embodiment comprises five CRP blocks, but the present invention is not limited to this specific number, as more or fewer CRP blocks may be used for the same reason. The above features are illustrated with reference to FIG. 2, which is a schematic diagram of a refinement (decoder) branch of the neural network having the proposed architecture according to the embodiment of the present invention.
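The exact internal composition of the modified fusion block is not spelled out above; the following PyTorch sketch is only one plausible reading, assuming 1x1 projection convolutions, bilinear upsampling of the deeper signal, and summation. The class and argument names are illustrative, not taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Hypothetical fusion block whose output channel count equals the channel
    count of the corresponding encoder layer (no expansion to 256 channels)."""

    def __init__(self, decoder_channels, encoder_channels):
        super().__init__()
        self.proj_decoder = nn.Conv2d(decoder_channels, encoder_channels, kernel_size=1, bias=False)
        self.proj_encoder = nn.Conv2d(encoder_channels, encoder_channels, kernel_size=1, bias=False)

    def forward(self, deep_feat, skip_feat):
        # Upsample the deeper, lower-resolution decoder signal to the skip resolution
        # and fuse it with the projected encoder feature map.
        deep = F.interpolate(self.proj_decoder(deep_feat), size=skip_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        return F.relu(deep + self.proj_encoder(skip_feat))
```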
The second modification: the summation in a CRP (Chain Residual Pooling) block is supplemented with an averaging operation so that the CRP blocks do not prevent the trained scale-invariant model from converging. In other words, to each CRP block comprising two CRP modules, each configured to perform an additive modification of an input signal using a pooling operation (e.g. MaxPooling) and a convolution operation with a filter, an operation of dividing the signal at the output of the CRP block by the number of CRP modules comprised in the given CRP block plus one is added. The cascade of CRP modules in a CRP block in the preferred embodiment comprises two CRP modules, but the present invention is not limited to this specific number, as more or fewer CRP modules in the cascade may be used to balance estimation accuracy and computational efficiency on the various hardware configurations on which the disclosed method is executed. The above features are illustrated with reference to FIG. 3, which shows an example of the structure of a CRP block of the neural network having the proposed architecture according to the embodiment of the present invention.
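A minimal PyTorch sketch of a CRP block with this averaging modification is given below; the 1x1 convolution kernel size and the 5x5 max-pooling window are assumptions chosen for illustration, not values stated in the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class CRPBlock(nn.Module):
    """Chain Residual Pooling block with the averaging modification:
    the accumulated sum is divided by (number of CRP modules + 1)."""

    def __init__(self, channels, num_modules=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False)
            for _ in range(num_modules)
        )

    def forward(self, x):
        out = x
        path = x
        for conv in self.convs:
            # Each CRP module: pooling followed by a convolution, added to the running sum.
            path = conv(F.max_pool2d(path, kernel_size=5, stride=1, padding=2))
            out = out + path
        # Averaging so the CRP blocks do not prevent the model from converging.
        return out / (len(self.convs) + 1)
```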
A scale-invariant model based on the neural network having the lightweight architecture as described above produces estimates of the logarithm of inverse depth (e.g., in the form of a map of the logarithm of inverse depth) that are half the size of the target map of the logarithm of inverse depth, so the output estimates of the logarithm of inverse depth of the image are upscaled to the target (original) resolution using, for example, bilinear interpolation or any other known method. The output estimates of logarithms of inverse depth of an image are interpreted as values on a logarithmic scale.
Before training the neural network, weights of the to-be-trained neural network may be initialized randomly. At each training iteration of the neural network, a limited number of randomly selected training images is used. Additionally, at the training stage of the neural network, a sum of the loss functions, which is to be minimized, may be applied.
Scale invariant pairwise loss. When training the neural network, it is proposed in the present application to use an L1 pairwise loss function, which can be calculated as follows:
$\mathcal{L}_{SI} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| (d_i - d_j) - (d^*_i - d^*_j) \right|$     (1)
where $d$ is a logarithm of predicted inverse depth and $d^*$ is a logarithm of inverse ground truth depth. The proposed pairwise loss L1 is scale invariant (SI), so it can be used for training on both absolute depth maps and UTS depth maps. To calculate this loss, the summation across $O(n^2)$ terms is performed.
However, the above pairwise loss L1 function (1) can be calculated more efficiently in $O(n \log n)$ time. Let $t$ denote a list of ascendingly ordered difference values $t_i = d_i - d^*_i$ between values $d$ of the logarithm of predicted inverse depth and values $d^*$ of the logarithm of inverse ground truth depth: $t_1 \le t_2 \le \ldots \le t_n$. After rearranging and grouping similar terms, the pairwise loss function L1 can be written as follows:
$\mathcal{L}_{SI} = \frac{2}{n^2} \sum_{i=1}^{n} (2i - n - 1)\, t_i$     (2)
where $t$ is an ordered list, i.e. $t_i \ge t_j$ if $i > j$. To sort the list, $O(n \log n)$ operations are required, and the sum in (2) is computed in linear time. Overall, the computational cost of calculating the pairwise loss $\mathcal{L}_{SI}$ is $O(n \log n)$.
Therefore, at each training iteration of the neural network, for training images accompanied by absolute or UTS data, the pairwise scale invariant loss function $\mathcal{L}_{SI}$ indicated above under the number (2) may be applied.
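A minimal sketch of computing this loss in the $O(n \log n)$ form (2) is shown below; the function name and the exact normalization constant are assumptions made for illustration.

```python
import torch

def pairwise_si_loss(pred_log_inv_depth, gt_log_inv_depth):
    """Scale-invariant pairwise L1 loss via the sorted-difference form (2)."""
    t, _ = torch.sort((pred_log_inv_depth - gt_log_inv_depth).flatten())  # ascending t_i
    n = t.numel()
    i = torch.arange(1, n + 1, dtype=t.dtype, device=t.device)
    # sum_{i,j} |t_i - t_j| = 2 * sum_i (2i - n - 1) * t_i for ascendingly sorted t.
    return 2.0 * ((2.0 * i - n - 1.0) * t).sum() / (n * n)
```

Because the loss depends only on pairwise differences of the logarithms of inverse depth, adding a common constant to all predictions (i.e., rescaling the inverse depth by an unknown factor) leaves its value unchanged, which is what makes it usable with UTS data.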
Shift-and-scale invariant (SSI) pairwise loss. The SI pairwise loss may be easily converted to an SSI pairwise loss. To do this, a logarithm of depth $d$ is replaced with a normalized depth:
$\hat{d}_i = \frac{d_i - \mu}{\sigma}$     (3)
where $\mu$ and $\sigma$ are the mean value and standard deviation, respectively: $\mu = \frac{1}{n} \sum_{i=1}^{n} d_i$ and $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - \mu)^2}$. Considering the above, the shift-and-scale invariant (SSI) pairwise loss function can be written as follows:
$\mathcal{L}_{SSI} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| (\hat{d}_i - \hat{d}_j) - (\hat{d}^*_i - \hat{d}^*_j) \right|$     (4)
Therefore, at each training iteration of the neural network, for training images accompanied by absolute, UTS or UTSS data, the modified pairwise shift-and-scale invariant loss function $\mathcal{L}_{SSI}$ indicated above under the number (4) may be applied.
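Under the same assumptions as the previous sketch, the SSI loss can be obtained by normalizing both the predicted and the ground-truth logarithms of inverse depth per equation (3) and then reusing the scale-invariant pairwise loss; the epsilon guard against a zero standard deviation is an added assumption.

```python
def pairwise_ssi_loss(pred_log_inv_depth, gt_log_inv_depth, eps=1e-6):
    """Shift-and-scale invariant pairwise loss: normalize per equation (3),
    then apply the scale-invariant pairwise loss from the previous sketch."""
    def normalize(d):
        d = d.flatten()
        return (d - d.mean()) / (d.std(unbiased=False) + eps)
    return pairwise_si_loss(normalize(pred_log_inv_depth),
                            normalize(gt_log_inv_depth))
```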
In a further embodiment of the present invention, the cumulative loss function can be calculated as $\mathcal{L} = \sum_{k} w_k \mathcal{L}_k$, where $\mathcal{L}_k$ is a corresponding loss function and $w_k$ is a corresponding weight of the loss function, wherein the weights $w_k$ are selected so that the gradients from different loss functions are equal in absolute value: $w_k \left| \nabla \mathcal{L}_k \right| = w_l \left| \nabla \mathcal{L}_l \right|$ for all $k, l$. Loss functions are considered different for different datasets. SI and SSI loss functions are also considered different. Gradients may be calculated by averaging with an exponential moving average with a predetermined smoothing parameter. Other types of moving average may be used, such as simple, weighted, etc. A predetermined smoothing parameter, for example, a size of the moving average window, may be predefined or empirically fitted. The sum of weights of the loss functions is equal to 1, and each of the weights is non-negative. Thus, having a mixture of UTS and UTSS ground-truth data, the SI loss may be used for training on absolute and UTS data, and the SSI loss may be used for training on absolute and UTS data as well as on UTSS data.
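One possible way to maintain such gradient-balanced weights with an exponential moving average is sketched below; the class name, the smoothing value, and the use of per-loss gradient-magnitude estimates are assumptions made for illustration only.

```python
class GradientBalancedWeights:
    """Tracks an exponential moving average (EMA) of each loss's gradient
    magnitude and derives weights inversely proportional to it, normalized
    so that the weights are non-negative and sum to 1."""

    def __init__(self, loss_names, smoothing=0.99):
        self.smoothing = smoothing
        self.ema = {name: None for name in loss_names}

    def update(self, name, grad_magnitude):
        prev = self.ema[name]
        self.ema[name] = grad_magnitude if prev is None else (
            self.smoothing * prev + (1.0 - self.smoothing) * grad_magnitude)

    def weights(self):
        # w_k proportional to 1 / EMA_k, so that w_k * |grad_k| is approximately
        # the same for every loss; weights are then normalized to sum to 1.
        inv = {k: 1.0 / max(v, 1e-12) for k, v in self.ema.items() if v is not None}
        total = sum(inv.values())
        return {k: v / total for k, v in inv.items()}
```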
Absolute data for training the neural network to estimate the depth and geometry of a scene may be obtained using a motion sensor. UTS data for training the neural network to estimate the depth of a scene are obtained with up-to-scale precision using the Structure from Motion algorithm from movies that are available on the Internet. UTSS data for training the neural network to estimate the depth and geometry of the scene are obtained from calibrated stereo images using an optical flow determination algorithm (RAFT). At each iteration of training the neural network, training images from the mixture of training images are provided to the neural network being trained in a random order.
FIG. 4 illustrates a block diagram of a computing device 200 according to an embodiment of the present invention. The user computing device 200 comprises at least a processor 205 and a memory 210, which are operably connected to each other. The processor 205 may perform, among other operations, steps S100 and S110 of the method illustrated in FIG. 1. The memory 210 stores the trained neural network (a set of parameters/weights) and processor-executable instructions that, when executed, cause the processor to execute a method for estimating scene depth from an image using the trained neural network. The memory 210 is capable of storing any other data and information. The computing device 200 may comprise other components not shown, for example, a screen, a camera, a communication unit, a touch-sensitive panel, a speaker, a microphone, a Bluetooth module, an NFC module, a Wi-Fi module, a power supply and corresponding interconnections. The disclosed method for estimating a depth of a scene from an image can be implemented on a wide range of computing devices 200, such as laptops, smartphones, tablets, mobile robots and navigation systems. The implementation of the proposed method supports all kinds of devices capable of performing calculations on a CPU. In addition, if the computing device has an additional device for accelerating the neural network, such as a GPU (graphics processing unit), an NPU (neural processing unit) or a TPU (tensor processing unit), faster implementation is possible on such devices.
At least one of the plurality of modules, blocks, components, steps, sub-steps may be implemented through an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. One or a plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs a layer operation by applying its plurality of weights to the output of the previous layer. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial network (GAN), and deep Q-network. The learning algorithm is a method for training a predetermined target computing device using a plurality of learning data to cause, allow, or control the target computing device to make a determination, estimation, or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
It should be understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user enjoying some of these technical effects, while other embodiments may be implemented with the user enjoying other technical effects or none at all.
Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.
While the above-described implementations have been described and shown with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, sub-divided, or reordered without departing from the teachings of the present technology. Accordingly, an order and grouping of the steps is not a limitation of the present technology. The use of the singular form in relation to any element disclosed in this application does not preclude that two or more such elements may be in an actual implementation.

Claims (14)

  1. A method for estimating a depth of a scene in an image, comprising the steps of:
    obtaining (S100) the image;
    estimating (S110) the depth of the scene in the image using a scale-invariant model that is based on a neural network having a lightweight architecture, wherein the neural network is trained using training images, and wherein at each training iteration a mixture of images randomly selected, in random proportions, from training images with absolute data, training images with UTS (Up-to-Scale) data, and training images with UTSS (Up-to-Shift-Scale) data is used.
  2. The method of claim 1, wherein the depth of the scene in the image is estimated, using the scale-invariant model that is based on the neural network having a lightweight architecture, as the logarithm of the inverse depth of the scene in the image.
  3. The method of claim 1, further comprising the step of constructing a geometry of the scene based on the obtained scene depth estimate.
  4. The method of claim 1, wherein before training the neural network, weights of the to-be-trained neural network are initialized randomly.
  5. The method of claim 1, wherein at the stage of training the neural network, one or more loss functions to be minimized are applied.
  6. The method of claim 5, wherein at each training iteration of the neural network for images accompanied by absolute or UTS data, a pairwise scale invariant loss function $\mathcal{L}_{SI}$ is applied:
    $\mathcal{L}_{SI} = \frac{2}{n^2} \sum_{i=1}^{n} (2i - n - 1)\, t_i$,
    where $t$ is a list of ascendingly sorted difference values $t_i = d_i - d^*_i$ between values $d$ of a predicted logarithm of inverse depth and values $d^*$ of a logarithm of ground truth inverse depth.
  7. The method of claim 5, wherein at each training iteration of the neural network for images accompanied by absolute, UTS or UTSS data, the following modified pairwise shift-scale-invariant loss function (Figure PCTKR2020016094-appb-img-000031) is applied: (Figure PCTKR2020016094-appb-img-000032), wherein (Figure PCTKR2020016094-appb-img-000033) is the normalized value of the predicted inverse depth, calculated as (Figure PCTKR2020016094-appb-img-000034), where μ and σ are the mean and standard deviation, respectively, and where (Figure PCTKR2020016094-appb-img-000035).
  8. The method of any one of claims 6-7, further comprising the step of calculating a cumulative loss function (Figure PCTKR2020016094-appb-img-000036), where (Figure PCTKR2020016094-appb-img-000037) is a corresponding loss function and (Figure PCTKR2020016094-appb-img-000038) is a corresponding weight of the loss function,
    where the weights (Figure PCTKR2020016094-appb-img-000039) are selected so that gradients of the different loss functions are equal in absolute value (Figure PCTKR2020016094-appb-img-000040),
    the gradients are calculated by averaging with an exponential moving average with a predetermined smoothing parameter, and
    the sum of the weights of the loss functions is equal to 1, and each of the weights is non-negative.
  9. The method of claim 1, wherein the neural network consists of an encoder and a decoder, the encoder is a MobileNetV2 or EfficientNet encoder, and the decoder is a modified Light-Weight RefineNet decoder, in which:
    a number of channels in each subsequent fusion block configured to fuse a signal from an output of a deeper layer of the decoder and a signal from a corresponding layer of the encoder in a cascade of fusion blocks is reduced relative to a previous fusion block and is equal to a number of channels at an output of the corresponding layer of the encoder, and
    in the process of training, in each CRP (chain residual pooling) block comprising two CRP modules, each configured to perform an additive modification of an input signal using a pooling operation (MaxPooling) and a convolution operation with a filter, an operation of dividing a signal at an output of the CRP block by a number of CRP modules plus one is added,
    the cascade of fusion blocks comprises four fusion blocks, and the cascade of CRP blocks comprises five CRP blocks.
  10. The method of claim 1, wherein the absolute data for training the neural network to estimate the depth and geometry of the scene are obtained using a motion sensor.
  11. The method of claim 1, wherein the UTS data for training the neural network to estimate the depth and geometry of the scene are obtained with up-to-scale precision using a Structure-from-Motion algorithm from movies available on the Internet.
  12. The method of claim 1, wherein the UTSS data for training the neural network to estimate the depth and geometry of the scene are obtained from calibrated stereo images using an optical flow determination algorithm (RAFT).
  13. The method of any one of claims 10-12, in which at each iteration of training the neural network, training images from the mixture of training images are provided to the neural network being trained in a random order.
  14. A user computing device (200) comprising a processor (205) and a memory (210) storing the trained neural network and processor-executable instructions which, when executed, cause the processor to execute the method for estimating a depth of a scene in an image according to any one of claims 1 to 13 using the trained neural network.
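The sketches below are editorial illustrations only and are not part of the claims. The first one, in Python, shows one possible way to assemble a training batch from a random mixture of images with absolute, UTS and UTSS data in random proportions, as recited in claim 1; the dataset objects, batch size, and sampling scheme are assumptions.

    import random

    def sample_mixed_batch(absolute_ds, uts_ds, utss_ds, batch_size=8):
        # Draw random mixing proportions for this training iteration.
        sources = [absolute_ds, uts_ds, utss_ds]
        weights = [random.random() for _ in sources]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Fill the batch by first sampling a source dataset, then an image from it.
        batch = []
        for _ in range(batch_size):
            ds = random.choices(sources, weights=weights, k=1)[0]
            batch.append(random.choice(ds))
        return batch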
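The next sketch illustrates loss functions of the kind recited in claims 6 and 7. The exact formulas of those claims are given by the equation images referenced above and are not reproduced here; the scale-invariant form below (written via per-pixel errors, which is algebraically equal to a pairwise formulation) and the mean/standard-deviation normalization of the predictions are assumptions based only on the claim wording.

    import torch

    def scale_invariant_loss(d_pred, d_gt):
        # d_pred, d_gt: predicted and ground-truth log inverse depth (absolute / UTS data).
        # Equals the pairwise form (1 / (2 n^2)) * sum_{i,j} (e_i - e_j)^2 with e = d_pred - d_gt.
        e = d_pred - d_gt
        n = e.numel()
        return (e ** 2).sum() / n - (e.sum() ** 2) / (n ** 2)

    def shift_scale_invariant_loss(d_pred, d_gt):
        # Both maps are normalized to zero mean and unit standard deviation
        # (mu and sigma in the wording of claim 7) before being compared.
        def normalize(d):
            return (d - d.mean()) / (d.std() + 1e-8)
        return torch.abs(normalize(d_pred) - normalize(d_gt)).mean()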
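The following sketch shows one possible way to pick loss weights so that the gradient magnitudes of the individual loss terms are balanced, with the magnitudes tracked by an exponential moving average and the weights kept non-negative and summing to 1, in the spirit of claim 8; the smoothing parameter and the inverse-norm weighting rule are assumptions.

    import torch

    class LossBalancer:
        def __init__(self, n_losses, beta=0.99):
            self.beta = beta                 # assumed EMA smoothing parameter
            self.ema = [None] * n_losses     # EMA of each loss's gradient norm

        def weights(self, losses, params):
            norms = []
            for i, loss in enumerate(losses):
                grads = torch.autograd.grad(loss, params, retain_graph=True)
                g = torch.sqrt(sum((gr ** 2).sum() for gr in grads)).item()
                self.ema[i] = g if self.ema[i] is None else \
                    self.beta * self.ema[i] + (1 - self.beta) * g
                norms.append(self.ema[i])
            # A loss with a larger (smoothed) gradient norm gets a smaller weight,
            # so the weighted gradients end up roughly equal in absolute value.
            inv = [1.0 / (n + 1e-8) for n in norms]
            total = sum(inv)
            return [w / total for w in inv]  # non-negative, sum to 1

    # Usage (hypothetical):
    # total = sum(w * L for w, L in zip(balancer.weights(losses, params), losses))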
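Finally, a sketch of a CRP (chained residual pooling) block with two modules, each performing an additive modification of its input via max-pooling and convolution, with the block output divided by the number of modules plus one, as described in claim 9; the kernel sizes, channel counts, and the choice to keep the division at inference time are assumptions.

    import torch
    import torch.nn as nn

    class CRPBlock(nn.Module):
        def __init__(self, channels, n_modules=2):
            super().__init__()
            self.n_modules = n_modules
            self.pools = nn.ModuleList(
                [nn.MaxPool2d(kernel_size=5, stride=1, padding=2) for _ in range(n_modules)])
            self.convs = nn.ModuleList(
                [nn.Conv2d(channels, channels, kernel_size=1, bias=False) for _ in range(n_modules)])

        def forward(self, x):
            out, path = x, x
            for pool, conv in zip(self.pools, self.convs):
                path = conv(pool(path))  # pooling followed by convolution with a filter
                out = out + path         # additive modification of the input signal
            return out / (self.n_modules + 1)  # division by the number of modules plus one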
PCT/KR2020/016094 2019-11-14 2020-11-16 Method for estimating depth of scene in image and computing device for implementation of the same WO2021096324A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
RU2019136634 2019-11-14
RU2019136634 2019-11-14
RU2020136895 2020-11-10
RU2020136895A RU2761768C1 (en) 2020-11-10 2020-11-10 Method for estimating the depth of a scene based on an image and computing apparatus for implementation thereof

Publications (1)

Publication Number Publication Date
WO2021096324A1 true WO2021096324A1 (en) 2021-05-20

Family

ID=75913123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/016094 WO2021096324A1 (en) 2019-11-14 2020-11-16 Method for estimating depth of scene in image and computing device for implementation of the same

Country Status (1)

Country Link
WO (1) WO2021096324A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180247113A1 (en) * 2016-10-10 2018-08-30 Gyrfalcon Technology Inc. Image Classification Systems Based On CNN Based IC and Light-Weight Classifier
US20180137406A1 (en) * 2016-11-15 2018-05-17 Google Inc. Efficient Convolutional Neural Networks and Techniques to Reduce Associated Computational Costs
US20190147318A1 (en) * 2017-11-14 2019-05-16 Google Llc Highly Efficient Convolutional Neural Networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KIM SANGWON, NAM JAEYEAL, KO BYOUNGCHUL: "Fast Depth Estimation in a Single Image Using Lightweight Efficient Neural Network", SENSORS, vol. 19, no. 20, pages 4434, XP055812320, DOI: 10.3390/s19204434 *
MARK SANDLER, HOWARD ANDREW, ZHU MENGLONG, ZHMOGINOV ANDREY, CHEN LIANG-CHIEH: "MobileNetV2: Inverted Residuals and Linear Bottlenecks", 2 April 2018 (2018-04-02), XP055522020, Retrieved from the Internet <URL:https://arxiv.org/pdf/1801.04381.pdf> [retrieved on 20181107] *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506307A (en) * 2021-06-29 2021-10-15 吉林大学 Medical image segmentation method for improving U-Net neural network based on residual connection
CN113705580A (en) * 2021-08-31 2021-11-26 西安电子科技大学 Hyperspectral image classification method based on deep migration learning
CN113705580B (en) * 2021-08-31 2024-05-14 西安电子科技大学 Hyperspectral image classification method based on deep migration learning
CN114141108A (en) * 2021-12-03 2022-03-04 中国科学技术大学 Blind-aiding voice-aided reading equipment and method
CN114510959A (en) * 2021-12-21 2022-05-17 中国人民解放军战略支援部队信息工程大学 Radar signal modulation mode identification method and system based on split EfficientNet network under low signal-to-noise ratio
CN114972517A (en) * 2022-06-10 2022-08-30 上海人工智能创新中心 RAFT-based self-supervision depth estimation method
CN114972517B (en) * 2022-06-10 2024-05-31 上海人工智能创新中心 Self-supervision depth estimation method based on RAFT
CN115424410A (en) * 2022-11-03 2022-12-02 国网浙江省电力有限公司金华供电公司 High-voltage environment protection method based on wireless radiation perception and multi-modal data
CN115424410B (en) * 2022-11-03 2023-12-19 国网浙江省电力有限公司金华供电公司 High-pressure environment protection method based on wireless radiation sensing and multi-mode data
CN118397068A (en) * 2024-07-01 2024-07-26 杭州师范大学 Monocular depth estimation method based on evolutionary neural network architecture search

Similar Documents

Publication Publication Date Title
WO2021096324A1 (en) Method for estimating depth of scene in image and computing device for implementation of the same
CN111402130B (en) Data processing method and data processing device
JP2022518322A (en) Semantic segmentation with soft cross entropy loss
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN112308200A (en) Neural network searching method and device
WO2018176186A1 (en) Semantic image segmentation using gated dense pyramid blocks
CN111797983A (en) Neural network construction method and device
CN112258512A (en) Point cloud segmentation method, device, equipment and storage medium
CN113066017A (en) Image enhancement method, model training method and equipment
US20210064919A1 (en) Method and apparatus for processing image
CN113807361B (en) Neural network, target detection method, neural network training method and related products
WO2022228142A1 (en) Object density determination method and apparatus, computer device and storage medium
US20210064955A1 (en) Methods, apparatuses, and computer program products using a repeated convolution-based attention module for improved neural network implementations
CN113781519A (en) Target tracking method and target tracking device
CN114359289A (en) Image processing method and related device
CN113066018A (en) Image enhancement method and related device
CN115018039A (en) Neural network distillation method, target detection method and device
CN116402876A (en) Binocular depth estimation method, binocular depth estimation device, embedded equipment and readable storage medium
CN117496312A (en) Three-dimensional multi-target detection method based on multi-mode fusion algorithm
CN115601551A (en) Object identification method and device, storage medium and electronic equipment
CN117392488A (en) Data processing method, neural network and related equipment
CN115049730B (en) Component mounting method, component mounting device, electronic apparatus, and storage medium
CN115862012A (en) Point cloud data semantic segmentation method and device, electronic equipment and storage medium
RU2761768C1 (en) Method for estimating the depth of a scene based on an image and computing apparatus for implementation thereof
Zhang et al. Densely connecting depth maps for monocular depth estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20886592; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20886592; Country of ref document: EP; Kind code of ref document: A1)