CN117036439A - Single image depth estimation method and system based on multi-scale residual error network - Google Patents

Single image depth estimation method and system based on multi-scale residual error network

Info

Publication number
CN117036439A
Authority
CN
China
Prior art keywords
network
scale
module
layer
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311298295.0A
Other languages
Chinese (zh)
Inventor
张炜
何露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Dawan District Virtual Reality Research Institute
Original Assignee
Guangzhou Dawan District Virtual Reality Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dawan District Virtual Reality Research Institute
Priority to CN202311298295.0A
Publication of CN117036439A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a single image depth estimation method and system based on a multi-scale residual network. During training of the network, a loss function that accommodates datasets with different characteristics and allows them to be mixed for training is introduced, so that the final depth prediction model achieves good results on images captured in a wide variety of environments. This solves the problem that single image depth estimation methods only work well on specific data, and improves the accuracy of depth estimation to a certain extent.

Description

Single image depth estimation method and system based on multi-scale residual error network
Technical Field
The invention relates to the technical field of image depth estimation, and in particular to a single image depth estimation method and system based on a multi-scale residual network.
Background
Virtual production can fully immerse people in an artificial virtual reality environment and let them interact with virtual objects and characters in that environment, and the realism of the virtual environment is one of the most important factors affecting the virtual experience. Existing three-dimensional rendering technology can handle the background of the virtual environment, but because the surface of human skin is highly complex, virtual reality rendering of real people is still not satisfactory and they can only be replaced by manually built character models or cartoon figures. Although this approach allows smooth interaction with the virtual character, its ability to enhance the realism and immersion of the virtual environment is limited.
With the rapid development of computer technology, virtual stereoscopic shooting technology is being applied more and more widely. Existing methods for creating virtual characters mainly include driving a virtual model with a motion capture system, and obtaining virtual assets for virtual-real fusion with a stereoscopic shooting system based on stereo image pairs. The motion capture approach places marker points on the captured subject, records the motion coordinates of those markers, and has a computer process their trajectories to obtain the subject's position in the real-world coordinate system. However, the trackers can suffer from occlusion or positional drift, which is one of the main factors affecting the accuracy and continuity of the generated three-dimensional data. The stereoscopic shooting approach uses two lenses to imitate human eyes, acquires left-eye and right-eye views separately, and processes and displays the stereo images with three-dimensional software. Because the captured left and right views must agree in every parameter except parallax, the hardware requirements are severe; moreover, since the cameras are sensitive to changes in ambient illumination, outdoor scenes are difficult to shoot and such systems are currently only suitable for indoor use.
Therefore, researching how to use deep learning to obtain the stereo image pair required for stereoscopy from a monocular image is one of the development directions of virtual production. Because a single image lacks depth information in three-dimensional space, such a method must first obtain the depth of the single image through a single image depth estimation algorithm, then convert the predicted depth into disparity values for the stereo image, generate the corresponding stereo image with image warping techniques, and finally apply image processing to complete the stereo image pair.
Most existing monocular image depth estimation algorithms are based on Markov random fields (Markov Random Field, MRF), simple geometric assumptions or non-parametric methods, or they exploit the expressive power of convolutional networks to recover scene depth directly from the input image. However, these methods all require real depth for supervision, so existing methods basically only achieve good results on specific scene types.
Disclosure of Invention
The primary purpose of the invention is to provide a single image depth estimation method based on a multi-scale residual network, which solves the problem that existing single image depth estimation methods only perform well on specific data.
It is a further object of the present invention to provide a single image depth estimation system based on a multi-scale residual network.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A single image depth estimation method based on a multi-scale residual network comprises the following steps:
s1: acquiring color RGB image datasets with different characteristics;
s2: constructing a multi-scale residual network model, wherein the model uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
s3: training the multi-scale residual network model of step S2 with the color RGB image datasets of step S1, updating it with a mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
s4: performing single image depth estimation with the trained multi-scale residual network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the improved residual network in step S2 is obtained as follows:
the last pooling layer, the fully connected layer and the Softmax layer of the residual network structure are removed to obtain the improved residual network.
Further, the upsampling applied to the output feature map of the last layer of the improved residual network uses residual convolution, specifically:
the output feature map of the last layer of the improved residual network is fed into a ReLU layer, which is followed in sequence by a 3×3 convolution layer, a ReLU layer and a 3×3 convolution layer, and the output of the last 3×3 convolution layer is added to the output feature map of the last layer of the improved residual network.
Further, the feature fusion module is specifically as follows:
the feature fusion module comprises a 3×3 convolution layer, residual convolution blocks and an upsampling operation; the output feature map of the corresponding layer of the improved residual network passes through the 3×3 convolution layer and a residual convolution block in sequence, is then combined with the output of the previous feature fusion module, and is finally passed through another residual convolution block and the upsampling operation before being output.
Further, the adaptive module comprises two 3×3 convolution layers connected in sequence.
Further, in step S3, the training loss function of the mixed data set is specifically:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
A single image depth estimation system based on a multi-scale residual network, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual network model, which uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and finally an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
the training module is used for training the multi-scale residual network model of the network model module with the color RGB image datasets of the data module, updating it with the mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the mixed-dataset training loss function is specifically:

L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves a single image depth estimation network model based on a multi-scale residual error network, takes a color RGB image as input, firstly extracts shared characteristics based on the residual error network, utilizes a characteristic fusion module to fuse a multi-scale pre-trained characteristic diagram and a re-trained characteristic diagram, and finally outputs a final result. In the training of the network, a loss function which is suitable for different characteristic data sets and can be mixed for training is introduced, so that a better effect can be obtained when a final depth prediction model can be used as input in pictures acquired in various environments, the problem that a single image depth estimation method has a better effect only in specific data is solved, and the accuracy of depth estimation is improved to a certain extent.
Drawings
Fig. 1 is a schematic flow chart of a single image depth estimation method based on a multi-scale residual error network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a residual network structure according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a network model based on multi-scale residuals, which is provided in an embodiment of the present invention.
Fig. 4 is an up-sampling schematic diagram provided in an embodiment of the present invention.
Fig. 5 is a schematic diagram of a feature fusion module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an adaptive module according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of an NYU v2 data sample according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a ReDWeb data sample according to an embodiment of the invention.
Fig. 9 is a schematic diagram of a DIML indoor data sample according to an embodiment of the present invention.
FIG. 10 is a graph comparing visual effects tested on various data sets provided by an embodiment of the present invention.
Fig. 11 is a schematic diagram of a single image depth estimation system based on a multi-scale residual error network according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a single image depth estimation method based on a multi-scale residual network, as shown in fig. 1, comprising the following steps:
s1: acquiring color RGB image datasets with different characteristics;
s2: constructing a multi-scale residual network model, wherein the model uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
s3: training the multi-scale residual network model of step S2 with the color RGB image datasets of step S1, updating it with a mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
s4: performing single image depth estimation with the trained multi-scale residual network model.
Example 2
The present embodiment continues to disclose the following on the basis of embodiment 1:
the multi-scale residual network model comprises an improved residual network, a plurality of feature fusion modules and a self-adaptive module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
In a specific embodiment, the network input is a W×H×3 three-channel RGB image, and the input size is set to 384×384. The final feature map of a residual network is typically 1/32 the size of the input image, so the final feature map of this network is 12×12. In a depth estimation algorithm, directly upsampling or convolving and merging such feature maps can only produce a coarse depth map. To avoid the coarse depth estimates that result from using high-level semantic features alone, this embodiment uses a refinement strategy that combines high-level semantic features with low-level, edge-sensitive features to improve prediction accuracy.
In a further embodiment, the improved residual network in step S2 is specifically:
and deleting the last pooling layer, the full connection layer and the Softmax layer of the residual network structure to obtain the improved residual network.
In this embodiment, the general residual network structure is shown in fig. 2. The core idea of the network is to introduce shortcut connections in the feedforward neural network: a shortcut connection performs an identity mapping of a module and adds its output to the stacked layers without adding extra parameters or computational complexity, skipping one or more layers and thereby allowing the depth of the network to be greatly increased. Because the residual network contains stride-2 convolutions and pooling operations, it enlarges the receptive field of the convolutions and captures more context, but it also reduces the resolution of the output features. The network therefore first removes the last pooling layer, the fully connected layer and the Softmax layer of the residual network structure, so that the residual network can be used for the dense per-pixel estimation task and, as a feedforward network, generates a series of feature maps with different semantics and different scales.
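As a concrete illustration of this modification, the following PyTorch sketch builds such a backbone from torchvision's ResNet-50 by simply not using its average-pooling and fully connected (Softmax) head and returning the intermediate feature maps. The class and variable names, and the use of the pretrained flag, are illustrative assumptions rather than details taken from the patent.

    import torch.nn as nn
    from torchvision.models import resnet50

    class MultiScaleResNetBackbone(nn.Module):
        """ResNet-50 with the final average-pooling, fully connected and Softmax
        layers removed, exposing feature maps at four scales (1/4 to 1/32)."""
        def __init__(self, pretrained=True):
            super().__init__()
            net = resnet50(pretrained=pretrained)  # newer torchvision uses weights=...
            # Stem: 7x7 conv, BN, ReLU, max-pool (output stride 4)
            self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
            self.layer1 = net.layer1   # 1/4  resolution,  256 channels
            self.layer2 = net.layer2   # 1/8  resolution,  512 channels
            self.layer3 = net.layer3   # 1/16 resolution, 1024 channels
            self.layer4 = net.layer4   # 1/32 resolution, 2048 channels
            # net.avgpool and net.fc are deliberately left unused.

        def forward(self, x):
            x = self.stem(x)
            f1 = self.layer1(x)
            f2 = self.layer2(f1)
            f3 = self.layer3(f2)
            f4 = self.layer4(f3)
            return f1, f2, f3, f4  # multi-scale feature maps

A 384×384 RGB input thus yields a 12×12 map from the deepest stage, matching the 1/32 ratio described above.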
In a further embodiment, the upsampling applied to the output feature map of the last layer of the improved residual network uses residual convolution, specifically:
as shown in fig. 4, the output feature map of the last layer of the improved residual network is fed into a ReLU layer, which is followed in sequence by a 3×3 convolution layer, a ReLU layer and a 3×3 convolution layer, and the output of the last 3×3 convolution layer is added to the output feature map of the last layer of the improved residual network.
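The block described here can be written compactly in PyTorch. The sketch below is a minimal pre-activation residual unit corresponding to fig. 4; the class name and channel argument are chosen for illustration only.

    import torch.nn as nn

    class ResidualConvUnit(nn.Module):
        """Residual convolution block of fig. 4: ReLU -> 3x3 conv -> ReLU ->
        3x3 conv, with the result added back to the input feature map."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.conv1(self.relu(x))
            out = self.conv2(self.relu(out))
            return out + x  # identity shortcut combines with the input feature map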
In a further embodiment, the feature fusion module is specifically:
as shown in fig. 5, the feature fusion module comprises a 3×3 convolution layer, residual convolution blocks and an upsampling operation; the output feature map of the corresponding layer of the improved residual network passes through the 3×3 convolution layer and a residual convolution block in sequence, is then combined with the output of the previous feature fusion module, and is finally passed through another residual convolution block and the upsampling operation before being output.
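A minimal PyTorch sketch of a fusion block matching this description is given below. It reuses the ResidualConvUnit from the previous sketch; the 256-channel width follows the transition-layer setting mentioned later in this embodiment, and all names are illustrative assumptions.

    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusionModule(nn.Module):
        """Fusion block of fig. 5: the encoder feature map passes through a 3x3
        transition conv and a residual conv unit, is summed with the previous
        fusion output, then refined and upsampled by a factor of two."""
        def __init__(self, in_channels, channels=256):
            super().__init__()
            self.transition = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
            self.rcu_in = ResidualConvUnit(channels)   # from the previous sketch
            self.rcu_out = ResidualConvUnit(channels)

        def forward(self, encoder_feat, prev_fused):
            x = self.rcu_in(self.transition(encoder_feat))
            x = x + prev_fused                         # combine the two feature streams
            x = self.rcu_out(x)
            return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)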
In a further embodiment, as shown in fig. 6, the adaptation module comprises two 3 x 3 convolutional layers connected in sequence.
In this embodiment, as shown in fig. 3, the multi-scale residual network model divides the residual network into 4 modules according to the resolution of the feature maps, with all feature maps within a module having the same scale. Before each residual convolution block, a 3×3 convolution layer is used to adjust the number of channels of the feature map; in this experiment the number of channels of each transition layer is set to 256. As shown in fig. 5, the network takes the output of the last layer of each convolution block as the input to the multi-scale feature fusion module. The multi-scale feature fusion module of this embodiment takes two sets of feature maps as input: one set pre-trained from the residual convolution network and the other generated by training from scratch. Each feature fusion module uses a residual convolution block to transfer the feature map from a specific layer of the pre-trained residual network and then combines it, by summation, with the fused feature map produced by the previous feature fusion module. Finally, an upsampling operation is applied to generate a feature map with the same resolution as the next input.
To perform progressive refinement and produce a more accurate depth prediction map, the network of this embodiment first upsamples the last set of feature maps generated by the residual convolution network; the residual convolution blocks propagate gradients effectively from high layers to low layers through their short- and long-range connections. The feature fusion module combines the two sets of input feature maps and upsamples them to produce a higher-resolution output. At the end of the network, an adaptive module, whose structure is shown in fig. 6, adjusts the number of channels of the feature map and the size of the final output depth estimation map.
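A sketch of the output head of fig. 6 under the same assumptions follows. The patent only specifies two successive 3×3 convolutions, so the intermediate channel count and the ReLU placed between the two convolutions are assumptions made for illustration.

    import torch.nn as nn

    class AdaptiveOutputModule(nn.Module):
        """Output head of fig. 6: two successive 3x3 convolutions that reduce
        the channel count and produce the single-channel depth map."""
        def __init__(self, in_channels=256, mid_channels=128):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(mid_channels, 1, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)  # assumed non-linearity between the convs

        def forward(self, x):
            return self.conv2(self.relu(self.conv1(x)))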
This embodiment exploits the diversity of samples within small batches. For each input image I, N point pairs (i, j) are randomly sampled, where N is the total number of point pairs and i and j denote the positions of the first and second points. To label the ordinal relation l_ij of each point pair, the depth values (g_i, g_j) are first obtained from the corresponding ground truth, and the ordinal relation of the ground truth is then defined as:

l_{ij} = \begin{cases} +1, & g_i / g_j \ge 1 + \tau \\ -1, & g_i / g_j \le 1 / (1 + \tau) \\ 0, & \text{otherwise} \end{cases}

where τ is an empirical threshold, set to 0.02 in this embodiment. The relative depth of the ground truth can thus be represented by l_ij, where i and j are the positions of the first and second points and l_ij is the corresponding ordinal relation.
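For illustration, a small NumPy sketch of this point-pair sampling and ordinal labelling is given below. Only the threshold τ = 0.02 and the labelling rule come from the description above; the pair count, the flattened indexing and the function name are assumptions.

    import numpy as np

    def ordinal_labels(gt_depth, num_pairs=3000, tau=0.02, rng=None):
        """Randomly sample point pairs from a ground-truth depth map and label
        their ordinal relation: +1 (first point deeper), -1 (shallower),
        0 (equal within the tolerance tau)."""
        rng = rng or np.random.default_rng()
        flat = gt_depth.reshape(-1)
        i = rng.integers(0, flat.size, size=num_pairs)
        j = rng.integers(0, flat.size, size=num_pairs)
        ratio = flat[i] / np.clip(flat[j], 1e-8, None)
        labels = np.zeros(num_pairs, dtype=np.int8)
        labels[ratio >= 1.0 + tau] = 1
        labels[ratio <= 1.0 / (1.0 + tau)] = -1
        return i, j, labels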
Existing datasets suitable for monocular depth estimation consist of RGB images with corresponding depth annotations in various forms. They differ mainly in the environments and objects from which the images were acquired, the type and accuracy of the depth annotation, image quality, camera setup and dataset size; examples include the indoor scene dataset NYU Depth v2, the outdoor scene dataset DIW, the dataset ReDWeb with relatively realistic and diverse dynamic scenes, and the dataset ETH3D with high-precision lidar-scanned ground truth for static scenes. High-precision data are difficult to acquire at scale, while data collected from the web are difficult to guarantee in terms of image quality and depth accuracy. As a result, existing monocular depth estimation algorithms achieve good performance when trained and tested on a single dataset, but their results on other datasets of different forms are not ideal. This experiment therefore selects three datasets, NYU Depth v2, ReDWeb and DIML Indoor, for mixed training and tests on the NYU Depth v2 dataset.
NYU Depth v2 is one of the most commonly used indoor datasets. As shown in fig. 7, the images are, from left to right, the color camera output, the preprocessed depth map and the depth label map. Silberman et al. acquired color and depth information for 464 scenes using a color camera and a Kinect, and the dataset contains 1449 annotated RGBD image pairs and 407,024 unlabeled RGBD image pairs. The 1449 images are densified with a colorization algorithm to obtain dense depth maps and are manually annotated with semantic information; the RGB images, depth maps and semantic label maps together form the training samples.
ReDWeb (RW) is a small dataset: Xian et al. collected 40K stereo images from Flickr and used a state-of-the-art optical flow algorithm to generate the corresponding stereo correspondence maps. The dataset contains 3600 images in total, covering a variety of relatively realistic and diverse dynamic scenes such as offices, night scenes and streets; as shown in fig. 8, it provides RGB images and the corresponding relative depth maps.
The DIML Indoor dataset is a static-scene RGBD image pair dataset captured synchronously with a ZED camera and a Kinect v2. Disparity maps are generated with a state-of-the-art stereo matching algorithm and converted using calibration parameters, and a per-pixel disparity confidence map is also provided. The dataset contains 200 indoor scene samples, each consisting of an RGB image and the corresponding depth map, as shown in fig. 9.
To evaluate the network's predictions, a loss function must be defined. The loss functions commonly used in depth estimation algorithms mainly include the L1, L2, SSIM and berHu losses and the cross-entropy loss. Zhou et al. proposed a learning framework for self-supervised monocular depth and camera motion estimation from unstructured video sequences, whose loss combines three terms: a photometric term measuring the difference between the current pixel and the warped original image, a smoothness loss, and a cross-entropy loss that prevents overfitting.
To achieve invariance to shift and scale in disparity space, Wang et al. proposed the normalized multi-scale gradient (NMG) loss, which evaluates the gradient difference between the ground truth and the rescaled estimate at multiple scales.
in this embodiment, the above-mentioned loss function is improved, and a loss function capable of performing hybrid training on different feature data sets is provided, where the main solution problem is: since truth labels exist in different forms, calculations need to be performed in a space compatible with all truths, and the penalty function should be flexible to handle a variety of data sources and take full advantage of the available information. The main difficulties of hybrid training are the inherent difference in depth representation, the difference in image scale and the ambiguity of viewpoint displacement.
This embodiment performs depth prediction in disparity space and handles the above difficulties with a family of scale- and shift-invariant dense losses. Let P denote the number of pixels in the image with valid ground truth, θ the parameters of the prediction model, d = d(θ) the predicted disparity, and d* the corresponding ground-truth disparity, with individual pixels indexed by subscripts. The scale- and shift-invariant loss of a single sample is defined as:

L_{ssi}(\hat{d}, \hat{d}^*) = \frac{1}{2P} \sum_{p=1}^{P} \rho\left(\hat{d}_p - \hat{d}^*_p\right)

where d̂ and d̂* are the scaled and shifted versions of the predicted and ground-truth disparities, respectively, and ρ defines the specific type of loss function.
The estimates of scale and shift are denoted by s and t. For the scale- and shift-invariant loss to be meaningful, the prediction and the ground truth must first be brought to a common scale and shift, i.e. d̂ = s·d + t and d̂* = d*. This embodiment performs the alignment with a least-squares criterion:

(s, t) = \arg\min_{s,t} \sum_{p=1}^{P} \left(s\, d_p + t - d^*_p\right)^2

where s·d + t and d* are the aligned predicted and ground-truth values. Letting h_p = (d_p, 1)^T, the objective can be rewritten so that the factors s and t are determined in closed form, from which the closed-form solution is obtained:

(s, t)^T = \left(\sum_{p=1}^{P} h_p h_p^T\right)^{-1} \left(\sum_{p=1}^{P} h_p\, d^*_p\right)
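A possible PyTorch implementation of this least-squares alignment is sketched below, solving the 2×2 normal equations in closed form over the valid pixels. The function name, the mask argument and the numerical clamping are assumptions, not details from the patent.

    import torch

    def align_scale_shift(pred, target, mask):
        """Closed-form least-squares scale s and shift t aligning a predicted
        disparity map to the ground truth over valid pixels."""
        d = pred[mask]
        d_star = target[mask]
        # Normal equations for the 2x2 system A [s, t]^T = b
        a00 = (d * d).sum()
        a01 = d.sum()
        a11 = torch.ones_like(d).sum()
        b0 = (d * d_star).sum()
        b1 = d_star.sum()
        det = (a00 * a11 - a01 * a01).clamp(min=1e-8)
        s = (a11 * b0 - a01 * b1) / det
        t = (a00 * b1 - a01 * b0) / det
        return s * pred + t, target   # aligned prediction, unchanged ground truth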
Choosing ρ(x) = x² yields a scale- and shift-invariant mean square error (Mean Square Error, MSE). However, MSE is not robust to outliers, and existing large-scale datasets can only provide imperfect ground truth, so a robust formulation should improve training of the network. The least-squares alignment is therefore replaced with robust estimates of scale and shift:

t(d) = \mathrm{median}(d), \qquad s(d) = \frac{1}{P} \sum_{p=1}^{P} \left|d_p - t(d)\right|

At the same time, both the predicted value and the ground truth are adjusted to zero translation and unit scale:

\hat{d} = \frac{d - t(d)}{s(d)}, \qquad \hat{d}^* = \frac{d^* - t(d^*)}{s(d^*)}

Finally, the scale- and shift-invariant loss function is obtained by trimming the largest residuals:

L_{ssi}(\hat{d}, \hat{d}^*) = \frac{1}{2P} \sum_{p=1}^{U} \rho\left(\left|\hat{d}_p - \hat{d}^*_p\right|\right)

where the residuals are sorted in ascending order and only the U smallest are kept; based on experimental experience with the ReDWeb dataset, the trimming threshold is set to U = 0.8P.
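A sketch of this robust scale- and shift-invariant loss under the assumptions above (median and mean-absolute-deviation alignment, trimming of the largest 20% of residuals) follows; details such as the epsilon clamps and the batch-level masking are illustrative simplifications.

    import torch

    def ssi_trimmed_loss(pred, target, mask, trim=0.2):
        """Scale- and shift-invariant loss with robust alignment and trimming
        of the largest residuals; trim=0.2 keeps U = 0.8 * P pixels."""
        d, d_star = pred[mask], target[mask]
        # Robust alignment to zero translation and unit scale
        t_d, t_g = d.median(), d_star.median()
        s_d = (d - t_d).abs().mean().clamp(min=1e-8)
        s_g = (d_star - t_g).abs().mean().clamp(min=1e-8)
        d_hat = (d - t_d) / s_d
        g_hat = (d_star - t_g) / s_g
        res = (d_hat - g_hat).abs()
        # Keep only the smallest (1 - trim) fraction of residuals (trimmed MAE)
        k = max(1, int((1.0 - trim) * res.numel()))
        trimmed, _ = torch.topk(res, k, largest=False)
        return trimmed.sum() / (2 * res.numel())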
The unknown and variable scale cannot be ignored when training a monocular depth estimation model, so this embodiment applies a multi-scale, scale-invariant gradient matching term in disparity space, defined as:

L_{reg}(\hat{d}, \hat{d}^*) = \frac{1}{P} \sum_{k=1}^{K} \sum_{p=1}^{P} \left(\left|\nabla_x R^k_p\right| + \left|\nabla_y R^k_p\right|\right)

where R^k = d̂^k − d̂^{*k} denotes the difference between the predicted and ground-truth disparity maps at scale k; K = 4 scale levels are used in this embodiment, with the image resolution halved at each level. This matching term makes predicted discontinuities sharper and better aligned with the ground-truth discontinuities.
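A hedged PyTorch sketch of this multi-scale gradient matching term, assuming aligned (N, 1, H, W) disparity tensors and a boolean validity mask of the same shape, is shown below; the masking and normalization details are simplifications of the idea rather than the patent's exact formulation.

    import torch.nn.functional as F

    def gradient_matching_loss(pred_aligned, target_aligned, mask, scales=4):
        """Multi-scale gradient matching: penalises x/y gradients of the
        residual R = d_hat - d_hat* at K scale levels, halving the resolution
        at each level."""
        diff = (pred_aligned - target_aligned) * mask.float()
        m = mask.float()
        loss = 0.0
        for _ in range(scales):
            grad_x = (diff[..., :, 1:] - diff[..., :, :-1]).abs()
            grad_y = (diff[..., 1:, :] - diff[..., :-1, :]).abs()
            valid_x = m[..., :, 1:] * m[..., :, :-1]
            valid_y = m[..., 1:, :] * m[..., :-1, :]
            loss = loss + (grad_x * valid_x).sum() + (grad_y * valid_y).sum()
            # Halve the resolution for the next scale level
            diff = F.avg_pool2d(diff, kernel_size=2)
            m = F.avg_pool2d(m, kernel_size=2)
        return loss / mask.float().sum().clamp(min=1.0)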
The training loss function of the mixed data set in the step S3 specifically includes:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
for the mixed training strategy over different datasets, a Pareto-optimal multi-task learning framework is used: the learning of each dataset is defined as a separate task and an approximately Pareto-optimal solution over the datasets is sought, with the multi-objective minimization criterion:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model, shared across the datasets.
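The per-dataset loss L_l can be assembled from the pieces sketched earlier. The snippet below simply sums the per-dataset losses with equal weights for clarity; the Pareto-optimal multi-task weighting described above is a separate optimization procedure and is not reproduced here, and the batch-level (rather than per-sample) alignment is a simplification.

    def mixed_dataset_loss(pred_batches, gt_batches, masks, alpha=0.5):
        """Sum of per-dataset losses L_l = L_ssi + alpha * L_reg over one
        mini-batch drawn from each dataset (uniform weighting, not Pareto)."""
        total = 0.0
        for pred, gt, mask in zip(pred_batches, gt_batches, masks):
            pred_aligned, gt_aligned = align_scale_shift(pred, gt, mask)
            l_ssi = ssi_trimmed_loss(pred, gt, mask)
            l_reg = gradient_matching_loss(pred_aligned, gt_aligned, mask)
            total = total + l_ssi + alpha * l_reg
        return total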
In a specific embodiment, the mixed-dataset single image depth estimation network is trained on the ReDWeb, NYU Depth v2 and DIML datasets. The deep learning framework used is PyTorch. The computer hardware is an Intel i7 8700 CPU and an NVIDIA GTX 1080Ti GPU with 12 GB of memory; the computer has 32 GB of RAM and runs Windows.
Depth estimation network parameter setting:
the ResNet-50 based multi-scale architecture is used as a backbone network to initialize the initial convolutional layer using a random Gaussian initialization method. The images are flipped randomly horizontally and cropped randomly so that the input images fed into the network are resized to 384 x 384, which can enhance the data and maintain the aspect ratio on different input images. In the experiment, other pre-training models with the best effect are used for carrying out mixed training on data sets, as three data sets are required to be mixed, batch size (batch size) is set to be 16, reLU is used as an activation function, adam optimization network is utilized to accelerate convergence of the network, the total number of training samples is 1000 images, and the network iterates 50 times.
The evaluation metrics for single image depth prediction mainly include:
Absolute root mean square error (Root Mean Square Error, RMSE):

RMSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(d_i - d^*_i\right)^2}

Average relative error (REL):

REL = \frac{1}{T} \sum_{i=1}^{T} \frac{\left|d_i - d^*_i\right|}{d^*_i}

Percentage of pixels (threshold accuracy):

\delta = \frac{1}{T} \left|\left\{ i : \max\left(\frac{d_i}{d^*_i}, \frac{d^*_i}{d_i}\right) < thr \right\}\right|, \quad thr \in \{1.25, 1.25^2, 1.25^3\}

Mean log error (log10):

\mathrm{log10} = \frac{1}{T} \sum_{i=1}^{T} \left|\log_{10} d_i - \log_{10} d^*_i\right|

Weighted Human Disagreement Rate (WHDR):

WHDR = \frac{\sum_{ij} w_{ij}\, \mathbb{1}\left(l_{ij} \ne \hat{l}_{ij}\right)}{\sum_{ij} w_{ij}}

where T is the total number of pixels in the test set, i is the pixel index, and d_i and d*_i are the network's predicted depth value and the ground-truth value, respectively. For WHDR, a threshold τ defines the equality relation between two points: when the difference between two predicted depth values is smaller than τ, the two depth values are considered equal; w_ij is the human confidence weight of the (i, j) point pair and, following the study by Chen et al., is set to 1.
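A compact NumPy sketch of the dense metrics above (RMSE, REL, log10 and the threshold accuracies) follows; WHDR is omitted because it operates on sampled point pairs rather than dense maps, and the function name and validity test are assumptions.

    import numpy as np

    def depth_metrics(pred, gt):
        """Dense single-image depth evaluation metrics over valid pixels."""
        valid = gt > 0
        p, g = pred[valid], gt[valid]
        rmse = np.sqrt(np.mean((p - g) ** 2))
        rel = np.mean(np.abs(p - g) / g)
        log10 = np.mean(np.abs(np.log10(p) - np.log10(g)))
        ratio = np.maximum(p / g, g / p)
        delta = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
        return {"RMSE": rmse, "REL": rel, "log10": log10,
                "delta<1.25": delta[0], "delta<1.25^2": delta[1], "delta<1.25^3": delta[2]}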
The performance of the monocular depth estimation method of this embodiment is compared from three aspects: different values of the weight α in the loss function, different loss functions, and different mainstream methods.
(1) Different α values in the loss function
In the loss function L_l, α is the weight that controls the contribution of the scale-related gradient matching term, and different α values affect the final depth estimation result. To verify the influence of α, this embodiment tests network performance with the same network structure and different α values; the results are shown in Table 1.
Table 1 Influence of different α values on network performance
When α is 0.5, the three error measures are smallest and the network performance is optimal, as shown by the underlined results in Table 1.
(2) Different loss functions
With α set to 0.5, this section compares the loss function L(I,G,z) of Xian et al. with the loss function Loss_l used in this embodiment in terms of the accuracy of the generated depth estimation maps; the experimental results are shown in Table 2.
Table 2 Influence of different loss functions on the network
In the single image depth estimation model, Loss_l performs better than L(I,G,z).
(3) Compared with the mainstream method
The loss function used in this experiment is Loss_l with α set to 0.5; the comparison with mainstream single image depth estimation methods is shown in Table 3.
Table 3 Performance comparison with mainstream methods
In the tables, the method of Eigen et al. is from Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network[C]//Advances in Neural Information Processing Systems. 2014: 2366-2374.
The method of Liu et al. is from Karsch K, Liu C, Kang S B. Depth extraction from video using non-parametric sampling[C]//European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2012: 775-788.
The method of Li et al. is from Li B, Shen C, Dai Y, et al. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1119-1127.
The method of Wang et al. is from Wang X, Fouhey D, Gupta A. Designing deep networks for surface normal estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 539-547.
As can be seen from Table 3, the depth estimation model of this embodiment, trained on the mixed dataset, achieves a clear improvement in depth estimation accuracy over other depth estimation models trained on a single training set. The visual comparison on the different test datasets is shown in fig. 10; from left to right are the input image, the mixed-loss depth result of this embodiment and the L(I,G,z) loss depth result of Xian et al. A model trained for a particular dataset works well on its own training data but poorly on other test sets, whereas the method of this embodiment benefits from the complementary features of the mixed datasets and produces more accurate depth results after optimization.
The method of Xian et al. is from Xian K, Shen C, Cao Z, et al. Monocular relative depth perception with web stereo data supervision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 311-320.
In this embodiment, images of different subjects at different depths in a green-screen scene were tested and new viewpoint images were generated with good depth results. The position of the person, as the foreground, needs to be adjusted: different degrees of overlap have a certain influence on the accuracy of the depth prediction, and reducing the overlap between people as much as possible yields a better depth map.
Example 3
This embodiment provides a single image depth estimation system based on a multi-scale residual network, as shown in fig. 11, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual network model, which uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and finally an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
the training module is used for training the multi-scale residual network model of the network model module with the color RGB image datasets of the data module, updating it with the mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the mixed-dataset training loss function is specifically:

L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The single image depth estimation method based on the multi-scale residual error network is characterized by comprising the following steps of:
s1: acquiring color RGB image datasets having different characteristics;
s2: constructing a multi-scale residual error network model, wherein the multi-scale residual error network model utilizes an improved residual error network as a feedforward network to generate feature images with different semantics and different proportions, a feature fusion module is used for feature combination, and an adaptive output module is used for adjusting the channel number of the feature images and the size of a final output depth estimation image;
s3: training the multi-scale residual error network model based on the step S2 by using the color RGB image data set of the step S1, and updating the multi-scale residual error network model based on the mixed data set training loss function to obtain a trained multi-scale residual error network model based on the mixed data set training loss function;
s4: and performing single image depth estimation by using the trained multi-scale residual error network model.
2. The multi-scale residual network-based single image depth estimation method of claim 1, wherein the multi-scale residual network-based model comprises a modified residual network, a number of feature fusion modules, and an adaptation module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
3. The single image depth estimation method based on multi-scale residual network according to claim 2, wherein the improved residual network in step S2 is specifically:
and deleting the last pooling layer, the full connection layer and the Softmax layer of the residual network structure to obtain the improved residual network.
4. The single image depth estimation method based on the multi-scale residual network according to claim 2, wherein the upsampling in the upsampled feature map of the last layer of the improved residual network adopts residual convolution, specifically:
and inputting the output characteristic diagram of the last layer of the improved residual network into a ReLU layer, sequentially connecting a 3X 3 convolution layer, the ReLU layer and the 3X 3 convolution layer after the ReLU layer, and combining the output of the 3X 3 convolution layer of the last layer with the output characteristic diagram of the last layer of the improved residual network.
5. The single image depth estimation method based on the multi-scale residual network according to claim 2, wherein the feature fusion module specifically comprises:
the characteristic fusion module comprises a 3X 3 convolution layer, a residual convolution block and an up-sampling operation, wherein the output characteristic diagram of the improved residual network corresponding layer number is sequentially combined with the output of the characteristic fusion module of the previous layer after passing through the 3X 3 convolution layer and the residual convolution block, and is sequentially output after passing through the residual convolution block and the up-sampling operation.
6. The single image depth estimation method based on a multi-scale residual network of claim 2, wherein the adaptation module comprises two 3 x 3 convolution layers connected in sequence.
7. The multi-scale residual network-based single image depth estimation method according to any one of claims 1 to 6, wherein the hybrid dataset training loss function in step S3 is specifically:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
8. A single image depth estimation system based on a multi-scale residual network, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual error network model, the multi-scale residual error network model is used as a feedforward network to generate feature images with different semantics and different proportions, the feature fusion module is used for feature combination, and finally the adaptive output module is used for adjusting the channel number of the feature images and the size of the final output depth estimation image;
the training module is used for training the multi-scale residual error based network model of the network model module by using the color RGB image data set of the data module, and updating the multi-scale residual error based network model based on the mixed data set training loss function to obtain a trained multi-scale residual error based network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
9. The multi-scale residual network based single image depth estimation system of claim 8, wherein the multi-scale residual network based model comprises a modified residual network, a number of feature fusion modules, and an adaptation module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
10. The multi-scale residual network-based single image depth estimation system according to claim 8 or 9, wherein the hybrid dataset trains a loss function, in particular:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
CN202311298295.0A 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network Pending CN117036439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311298295.0A CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311298295.0A CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Publications (1)

Publication Number Publication Date
CN117036439A true CN117036439A (en) 2023-11-10

Family

ID=88632252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311298295.0A Pending CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Country Status (1)

Country Link
CN (1) CN117036439A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288788A (en) * 2020-10-12 2021-01-29 南京邮电大学 Monocular image depth estimation method
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention
CN112288788A (en) * 2020-10-12 2021-01-29 南京邮电大学 Monocular image depth estimation method
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. XIAN ET AL.: "Monocular Relative Depth Perception with Web Stereo Data Supervision", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23 June 2018 (2018-06-23), pages 311 - 320, XP033475991, DOI: 10.1109/CVPR.2018.00040 *
R. RANFTL, K. LASINGER, D. HAFNER, K. SCHINDLER AND V. KOLTUN: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer", IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1 March 2022 (2022-03-01), pages 1623 - 1637 *
R. RANFTL, K. LASINGER, D. HAFNER, K. SCHINDLER AND V. KOLTUN: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer", IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), pages 1623 - 1637 *

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN108596974B (en) Dynamic scene robot positioning and mapping system and method
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN109377530A (en) A kind of binocular depth estimation method based on deep neural network
CN113393522B (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN108986136A (en) A kind of binocular scene flows based on semantic segmentation determine method and system
CN108510573A (en) A method of the multiple views human face three-dimensional model based on deep learning is rebuild
CN111462324B (en) Online spatiotemporal semantic fusion method and system
CN108171249B (en) RGBD data-based local descriptor learning method
CN110335299B (en) Monocular depth estimation system implementation method based on countermeasure network
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN112465021B (en) Pose track estimation method based on image frame interpolation method
CN113313810A (en) 6D attitude parameter calculation method for transparent object
Wang et al. Depth estimation of video sequences with perceptual losses
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN115423978A (en) Image laser data fusion method based on deep learning and used for building reconstruction
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
Jia et al. Depth measurement based on a convolutional neural network and structured light

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination