CN117036439A - Single image depth estimation method and system based on multi-scale residual error network - Google Patents

Single image depth estimation method and system based on multi-scale residual error network

Info

Publication number
CN117036439A
Authority
CN
China
Prior art keywords
network
scale
module
layer
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311298295.0A
Other languages
Chinese (zh)
Inventor
张炜
何露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Dawan District Virtual Reality Research Institute
Original Assignee
Guangzhou Dawan District Virtual Reality Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dawan District Virtual Reality Research Institute
Priority to CN202311298295.0A
Publication of CN117036439A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a single image depth estimation method and system based on a multi-scale residual network. During training of the network, a loss function that accommodates datasets with different characteristics and allows them to be mixed for training is introduced, so that the final depth prediction model achieves good results on images captured in a wide variety of environments. This solves the problem that single image depth estimation methods only work well on specific data, and improves the accuracy of depth estimation to a certain extent.

Description

Single image depth estimation method and system based on multi-scale residual error network
Technical Field
The invention relates to the technical field of image depth estimation, and in particular to a single image depth estimation method and system based on a multi-scale residual network.
Background
Virtual production can fully immerse people in an artificial virtual reality environment and let them interact with virtual objects and characters in that environment, and the realism of the virtual environment is one of the most important factors affecting the virtual experience. Existing three-dimensional rendering technology can handle the background of the virtual environment, but because the surface of human skin is highly complex, virtual reality rendering of real people is still not satisfactory and they can only be replaced by manually built character models or cartoon figures. Although this approach allows smooth interaction with the virtual character, its ability to enhance the realism and immersion of the virtual environment is limited.
With the rapid development of computer technology, virtual stereoscopic shooting technology is being applied more and more widely. Existing methods for creating virtual characters mainly include driving a virtual model with a motion capture system, and obtaining virtual assets for virtual-real fusion with a stereoscopic shooting system based on stereo image pairs. The motion capture approach places marker points on the captured subject, records the motion coordinates of those markers, and has a computer process their trajectories to obtain the subject's position in the real-world coordinate system. However, the trackers can suffer from occlusion or positional drift, which is one of the main factors affecting the accuracy and continuity of the generated three-dimensional data. The stereoscopic shooting approach uses two lenses to imitate human eyes, acquires left-eye and right-eye views separately, and processes and displays the stereo images with three-dimensional software. Because the captured left and right views must agree in every parameter except parallax, the hardware requirements are severe; moreover, since the cameras are sensitive to changes in ambient illumination, outdoor scenes are difficult to shoot and such systems are currently only suitable for indoor use.
Therefore, researching how to use deep learning to obtain the stereo image pair required for stereoscopy from a monocular image is one of the development directions of virtual production. Because a single image lacks depth information in three-dimensional space, such a method must first obtain the depth of the single image through a single image depth estimation algorithm, then convert the predicted depth into disparity values for the stereo image, generate the corresponding stereo image with image warping techniques, and finally apply image processing to complete the stereo image pair.
Most existing monocular image depth estimation algorithms are based on Markov random fields (Markov Random Field, MRF), simple geometric assumptions or non-parametric methods, or they exploit the expressive power of convolutional networks to recover scene depth directly from the input image. However, these methods all require real depth for supervision, so existing methods basically only achieve good results on specific scene types.
Disclosure of Invention
The primary purpose of the invention is to provide a single image depth estimation method based on a multi-scale residual network, which solves the problem that existing single image depth estimation methods only perform well on specific data.
It is a further object of the present invention to provide a single image depth estimation system based on a multi-scale residual network.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A single image depth estimation method based on a multi-scale residual network comprises the following steps:
s1: acquiring color RGB image datasets with different characteristics;
s2: constructing a multi-scale residual network model, wherein the model uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
s3: training the multi-scale residual network model of step S2 with the color RGB image datasets of step S1, updating it with a mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
s4: performing single image depth estimation with the trained multi-scale residual network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the improved residual network in step S2 is obtained as follows:
the last pooling layer, the fully connected layer and the Softmax layer of the residual network structure are removed to obtain the improved residual network.
Further, the upsampling applied to the output feature map of the last layer of the improved residual network uses residual convolution, specifically:
the output feature map of the last layer of the improved residual network is fed into a ReLU layer, which is followed in sequence by a 3×3 convolution layer, a ReLU layer and a 3×3 convolution layer, and the output of the last 3×3 convolution layer is added to the output feature map of the last layer of the improved residual network.
Further, the feature fusion module is specifically as follows:
the feature fusion module comprises a 3×3 convolution layer, residual convolution blocks and an upsampling operation; the output feature map of the corresponding layer of the improved residual network passes through the 3×3 convolution layer and a residual convolution block in sequence, is then combined with the output of the previous feature fusion module, and is finally passed through another residual convolution block and the upsampling operation before being output.
Further, the adaptive module comprises two 3×3 convolution layers connected in sequence.
Further, in step S3, the training loss function of the mixed data set is specifically:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
A single image depth estimation system based on a multi-scale residual network, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual network model, which uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and finally an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
the training module is used for training the multi-scale residual network model of the network model module with the color RGB image datasets of the data module, updating it with the mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the mixed-dataset training loss function is specifically:

L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention improves a single image depth estimation network model based on a multi-scale residual error network, takes a color RGB image as input, firstly extracts shared characteristics based on the residual error network, utilizes a characteristic fusion module to fuse a multi-scale pre-trained characteristic diagram and a re-trained characteristic diagram, and finally outputs a final result. In the training of the network, a loss function which is suitable for different characteristic data sets and can be mixed for training is introduced, so that a better effect can be obtained when a final depth prediction model can be used as input in pictures acquired in various environments, the problem that a single image depth estimation method has a better effect only in specific data is solved, and the accuracy of depth estimation is improved to a certain extent.
Drawings
Fig. 1 is a schematic flow chart of a single image depth estimation method based on a multi-scale residual error network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a residual network structure according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a network model based on multi-scale residuals, which is provided in an embodiment of the present invention.
Fig. 4 is an up-sampling schematic diagram provided in an embodiment of the present invention.
Fig. 5 is a schematic diagram of a feature fusion module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an adaptive module according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of an NYU v2 data sample according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a ReDWeb data sample according to an embodiment of the invention.
Fig. 9 is a schematic diagram of a DIML indoor data sample according to an embodiment of the present invention.
FIG. 10 is a graph comparing visual effects tested on various data sets provided by an embodiment of the present invention.
Fig. 11 is a schematic diagram of a single image depth estimation system based on a multi-scale residual error network according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a single image depth estimation method based on a multi-scale residual network, as shown in fig. 1, comprising the following steps:
s1: acquiring color RGB image datasets with different characteristics;
s2: constructing a multi-scale residual network model, wherein the model uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
s3: training the multi-scale residual network model of step S2 with the color RGB image datasets of step S1, updating it with a mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
s4: performing single image depth estimation with the trained multi-scale residual network model.
Example 2
The present embodiment continues to disclose the following on the basis of embodiment 1:
the multi-scale residual network model comprises an improved residual network, a plurality of feature fusion modules and a self-adaptive module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
In a specific embodiment, the network input is a W×H×3 three-channel RGB image, and the input size is set to 384×384. The final feature map of a residual network is typically 1/32 the size of the input image, so the final feature map of this network is 12×12. In a depth estimation algorithm, directly upsampling or convolving and merging such feature maps can only produce a coarse depth map. To avoid the coarse depth estimates that result from using high-level semantic features alone, this embodiment uses a refinement strategy that combines high-level semantic features with low-level, edge-sensitive features to improve prediction accuracy.
In a further embodiment, the improved residual network in step S2 is specifically:
and deleting the last pooling layer, the full connection layer and the Softmax layer of the residual network structure to obtain the improved residual network.
In this embodiment, the general residual network structure is shown in fig. 2. The core idea of the network is to introduce shortcut connections in the feedforward neural network: a shortcut connection performs an identity mapping of a module and adds its output to the stacked layers without adding extra parameters or computational complexity, skipping one or more layers and thereby allowing the depth of the network to be greatly increased. Because the residual network contains stride-2 convolutions and pooling operations, it enlarges the receptive field of the convolutions and captures more context, but it also reduces the resolution of the output features. The network therefore first removes the last pooling layer, the fully connected layer and the Softmax layer of the residual network structure, so that the residual network can be used for the dense per-pixel estimation task and, as a feedforward network, generates a series of feature maps with different semantics and different scales.
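As a concrete illustration of this modification, the following PyTorch sketch builds such a backbone from torchvision's ResNet-50 by simply not using its average-pooling and fully connected (Softmax) head and returning the intermediate feature maps. The class and variable names, and the use of the pretrained flag, are illustrative assumptions rather than details taken from the patent.

    import torch.nn as nn
    from torchvision.models import resnet50

    class MultiScaleResNetBackbone(nn.Module):
        """ResNet-50 with the final average-pooling, fully connected and Softmax
        layers removed, exposing feature maps at four scales (1/4 to 1/32)."""
        def __init__(self, pretrained=True):
            super().__init__()
            net = resnet50(pretrained=pretrained)  # newer torchvision uses weights=...
            # Stem: 7x7 conv, BN, ReLU, max-pool (output stride 4)
            self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
            self.layer1 = net.layer1   # 1/4  resolution,  256 channels
            self.layer2 = net.layer2   # 1/8  resolution,  512 channels
            self.layer3 = net.layer3   # 1/16 resolution, 1024 channels
            self.layer4 = net.layer4   # 1/32 resolution, 2048 channels
            # net.avgpool and net.fc are deliberately left unused.

        def forward(self, x):
            x = self.stem(x)
            f1 = self.layer1(x)
            f2 = self.layer2(f1)
            f3 = self.layer3(f2)
            f4 = self.layer4(f3)
            return f1, f2, f3, f4  # multi-scale feature maps

A 384×384 RGB input thus yields a 12×12 map from the deepest stage, matching the 1/32 ratio described above.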
In a further embodiment, the upsampling applied to the output feature map of the last layer of the improved residual network uses residual convolution, specifically:
as shown in fig. 4, the output feature map of the last layer of the improved residual network is fed into a ReLU layer, which is followed in sequence by a 3×3 convolution layer, a ReLU layer and a 3×3 convolution layer, and the output of the last 3×3 convolution layer is added to the output feature map of the last layer of the improved residual network.
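The block described here can be written compactly in PyTorch. The sketch below is a minimal pre-activation residual unit corresponding to fig. 4; the class name and channel argument are chosen for illustration only.

    import torch.nn as nn

    class ResidualConvUnit(nn.Module):
        """Residual convolution block of fig. 4: ReLU -> 3x3 conv -> ReLU ->
        3x3 conv, with the result added back to the input feature map."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.conv1(self.relu(x))
            out = self.conv2(self.relu(out))
            return out + x  # identity shortcut combines with the input feature map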
In a further embodiment, the feature fusion module is specifically:
as shown in fig. 5, the feature fusion module comprises a 3×3 convolution layer, residual convolution blocks and an upsampling operation; the output feature map of the corresponding layer of the improved residual network passes through the 3×3 convolution layer and a residual convolution block in sequence, is then combined with the output of the previous feature fusion module, and is finally passed through another residual convolution block and the upsampling operation before being output.
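A minimal PyTorch sketch of a fusion block matching this description is given below. It reuses the ResidualConvUnit from the previous sketch; the 256-channel width follows the transition-layer setting mentioned later in this embodiment, and all names are illustrative assumptions.

    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusionModule(nn.Module):
        """Fusion block of fig. 5: the encoder feature map passes through a 3x3
        transition conv and a residual conv unit, is summed with the previous
        fusion output, then refined and upsampled by a factor of two."""
        def __init__(self, in_channels, channels=256):
            super().__init__()
            self.transition = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
            self.rcu_in = ResidualConvUnit(channels)   # from the previous sketch
            self.rcu_out = ResidualConvUnit(channels)

        def forward(self, encoder_feat, prev_fused):
            x = self.rcu_in(self.transition(encoder_feat))
            x = x + prev_fused                         # combine the two feature streams
            x = self.rcu_out(x)
            return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)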
In a further embodiment, as shown in fig. 6, the adaptation module comprises two 3 x 3 convolutional layers connected in sequence.
In this embodiment, as shown in fig. 3, the multi-scale residual network model divides the residual network into 4 modules according to the resolution of the feature maps, with all feature maps within a module having the same scale. Before each residual convolution block, a 3×3 convolution layer is used to adjust the number of channels of the feature map; in this experiment the number of channels of each transition layer is set to 256. As shown in fig. 5, the network takes the output of the last layer of each convolution block as the input to the multi-scale feature fusion module. The multi-scale feature fusion module of this embodiment takes two sets of feature maps as input: one set pre-trained from the residual convolution network and the other generated by training from scratch. Each feature fusion module uses a residual convolution block to transfer the feature map from a specific layer of the pre-trained residual network and then combines it, by summation, with the fused feature map produced by the previous feature fusion module. Finally, an upsampling operation is applied to generate a feature map with the same resolution as the next input.
To perform progressive refinement and produce a more accurate depth prediction map, the network of this embodiment first upsamples the last set of feature maps generated by the residual convolution network; the residual convolution blocks propagate gradients effectively from high layers to low layers through their short- and long-range connections. The feature fusion module combines the two sets of input feature maps and upsamples them to produce a higher-resolution output. At the end of the network, an adaptive module, whose structure is shown in fig. 6, adjusts the number of channels of the feature map and the size of the final output depth estimation map.
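A sketch of the output head of fig. 6 under the same assumptions follows. The patent only specifies two successive 3×3 convolutions, so the intermediate channel count and the ReLU placed between the two convolutions are assumptions made for illustration.

    import torch.nn as nn

    class AdaptiveOutputModule(nn.Module):
        """Output head of fig. 6: two successive 3x3 convolutions that reduce
        the channel count and produce the single-channel depth map."""
        def __init__(self, in_channels=256, mid_channels=128):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(mid_channels, 1, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)  # assumed non-linearity between the convs

        def forward(self, x):
            return self.conv2(self.relu(self.conv1(x)))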
This embodiment exploits the diversity of samples within small batches. For each input image I, N point pairs (i, j) are randomly sampled, where N is the total number of point pairs and i and j denote the positions of the first and second points. To label the ordinal relation l_ij of each point pair, the depth values (g_i, g_j) are first obtained from the corresponding ground truth, and the ordinal relation of the ground truth is then defined as:

l_{ij} = \begin{cases} +1, & g_i / g_j \ge 1 + \tau \\ -1, & g_i / g_j \le 1 / (1 + \tau) \\ 0, & \text{otherwise} \end{cases}

where τ is an empirical threshold, set to 0.02 in this embodiment. The relative depth of the ground truth can thus be represented by l_ij, where i and j are the positions of the first and second points and l_ij is the corresponding ordinal relation.
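For illustration, a small NumPy sketch of this point-pair sampling and ordinal labelling is given below. Only the threshold τ = 0.02 and the labelling rule come from the description above; the pair count, the flattened indexing and the function name are assumptions.

    import numpy as np

    def ordinal_labels(gt_depth, num_pairs=3000, tau=0.02, rng=None):
        """Randomly sample point pairs from a ground-truth depth map and label
        their ordinal relation: +1 (first point deeper), -1 (shallower),
        0 (equal within the tolerance tau)."""
        rng = rng or np.random.default_rng()
        flat = gt_depth.reshape(-1)
        i = rng.integers(0, flat.size, size=num_pairs)
        j = rng.integers(0, flat.size, size=num_pairs)
        ratio = flat[i] / np.clip(flat[j], 1e-8, None)
        labels = np.zeros(num_pairs, dtype=np.int8)
        labels[ratio >= 1.0 + tau] = 1
        labels[ratio <= 1.0 / (1.0 + tau)] = -1
        return i, j, labels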
Existing datasets suitable for monocular depth estimation consist of RGB images with corresponding depth annotations in various forms. They differ mainly in the environments and objects from which the images were acquired, the type and accuracy of the depth annotation, image quality, camera setup and dataset size; examples include the indoor scene dataset NYU Depth v2, the outdoor scene dataset DIW, the dataset ReDWeb with relatively realistic and diverse dynamic scenes, and the dataset ETH3D with high-precision lidar-scanned ground truth for static scenes. High-precision data are difficult to acquire at scale, while data collected from the web are difficult to guarantee in terms of image quality and depth accuracy. As a result, existing monocular depth estimation algorithms achieve good performance when trained and tested on a single dataset, but their results on other datasets of different forms are not ideal. This experiment therefore selects three datasets, NYU Depth v2, ReDWeb and DIML Indoor, for mixed training and tests on the NYU Depth v2 dataset.
NYU Depth v2 is one of the most commonly used indoor datasets. As shown in fig. 7, the images are, from left to right, the color camera output, the preprocessed depth map and the depth label map. Silberman et al. acquired color and depth information for 464 scenes using a color camera and a Kinect, and the dataset contains 1449 annotated RGBD image pairs and 407,024 unlabeled RGBD image pairs. The 1449 images are densified with a colorization algorithm to obtain dense depth maps and are manually annotated with semantic information; the RGB images, depth maps and semantic label maps together form the training samples.
ReDWeb (RW) is a small dataset: Xian et al. collected 40K stereo images from Flickr and used a state-of-the-art optical flow algorithm to generate the corresponding stereo correspondence maps. The dataset contains 3600 images in total, covering a variety of relatively realistic and diverse dynamic scenes such as offices, night scenes and streets; as shown in fig. 8, it provides RGB images and the corresponding relative depth maps.
The DIML Indoor dataset is a static-scene RGBD image pair dataset captured synchronously with a ZED camera and a Kinect v2. Disparity maps are generated with a state-of-the-art stereo matching algorithm and converted using calibration parameters, and a per-pixel disparity confidence map is also provided. The dataset contains 200 indoor scene samples, each consisting of an RGB image and the corresponding depth map, as shown in fig. 9.
To evaluate the network's predictions, a loss function must be defined. The loss functions commonly used in depth estimation algorithms mainly include the L1, L2, SSIM and berHu losses and the cross-entropy loss. Zhou et al. proposed a learning framework for self-supervised monocular depth and camera motion estimation from unstructured video sequences, whose loss combines three terms: a photometric term measuring the difference between the current pixel and the warped original image, a smoothness loss, and a cross-entropy loss that prevents overfitting.
To achieve invariance to shift and scale in disparity space, Wang et al. proposed the normalized multi-scale gradient (NMG) loss, which evaluates the gradient difference between the ground truth and the rescaled estimate at multiple scales.
in this embodiment, the above-mentioned loss function is improved, and a loss function capable of performing hybrid training on different feature data sets is provided, where the main solution problem is: since truth labels exist in different forms, calculations need to be performed in a space compatible with all truths, and the penalty function should be flexible to handle a variety of data sources and take full advantage of the available information. The main difficulties of hybrid training are the inherent difference in depth representation, the difference in image scale and the ambiguity of viewpoint displacement.
This embodiment performs depth prediction in disparity space and handles the above difficulties with a family of scale- and shift-invariant dense losses. Let P denote the number of pixels in the image with valid ground truth, θ the parameters of the prediction model, d = d(θ) the predicted disparity, and d* the corresponding ground-truth disparity, with individual pixels indexed by subscripts. The scale- and shift-invariant loss of a single sample is defined as:

L_{ssi}(\hat{d}, \hat{d}^*) = \frac{1}{2P} \sum_{p=1}^{P} \rho\left(\hat{d}_p - \hat{d}^*_p\right)

where d̂ and d̂* are the scaled and shifted versions of the predicted and ground-truth disparities, respectively, and ρ defines the specific type of loss function.
The estimates of scale and shift are denoted by s and t. For the scale- and shift-invariant loss to be meaningful, the prediction and the ground truth must first be brought to a common scale and shift, i.e. d̂ = s·d + t and d̂* = d*. This embodiment performs the alignment with a least-squares criterion:

(s, t) = \arg\min_{s,t} \sum_{p=1}^{P} \left(s\, d_p + t - d^*_p\right)^2

where s·d + t and d* are the aligned predicted and ground-truth values. Letting h_p = (d_p, 1)^T, the objective can be rewritten so that the factors s and t are determined in closed form, from which the closed-form solution is obtained:

(s, t)^T = \left(\sum_{p=1}^{P} h_p h_p^T\right)^{-1} \left(\sum_{p=1}^{P} h_p\, d^*_p\right)
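A possible PyTorch implementation of this least-squares alignment is sketched below, solving the 2×2 normal equations in closed form over the valid pixels. The function name, the mask argument and the numerical clamping are assumptions, not details from the patent.

    import torch

    def align_scale_shift(pred, target, mask):
        """Closed-form least-squares scale s and shift t aligning a predicted
        disparity map to the ground truth over valid pixels."""
        d = pred[mask]
        d_star = target[mask]
        # Normal equations for the 2x2 system A [s, t]^T = b
        a00 = (d * d).sum()
        a01 = d.sum()
        a11 = torch.ones_like(d).sum()
        b0 = (d * d_star).sum()
        b1 = d_star.sum()
        det = (a00 * a11 - a01 * a01).clamp(min=1e-8)
        s = (a11 * b0 - a01 * b1) / det
        t = (a00 * b1 - a01 * b0) / det
        return s * pred + t, target   # aligned prediction, unchanged ground truth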
Choosing ρ(x) = x² yields a scale- and shift-invariant mean square error (Mean Square Error, MSE). However, MSE is not robust to outliers, and existing large-scale datasets can only provide imperfect ground truth, so a robust formulation should improve training of the network. The least-squares alignment is therefore replaced with robust estimates of scale and shift:

t(d) = \mathrm{median}(d), \qquad s(d) = \frac{1}{P} \sum_{p=1}^{P} \left|d_p - t(d)\right|

At the same time, both the predicted value and the ground truth are adjusted to zero translation and unit scale:

\hat{d} = \frac{d - t(d)}{s(d)}, \qquad \hat{d}^* = \frac{d^* - t(d^*)}{s(d^*)}

Finally, the scale- and shift-invariant loss function is obtained by trimming the largest residuals:

L_{ssi}(\hat{d}, \hat{d}^*) = \frac{1}{2P} \sum_{p=1}^{U} \rho\left(\left|\hat{d}_p - \hat{d}^*_p\right|\right)

where the residuals are sorted in ascending order and only the U smallest are kept; based on experimental experience with the ReDWeb dataset, the trimming threshold is set to U = 0.8P.
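A sketch of this robust scale- and shift-invariant loss under the assumptions above (median and mean-absolute-deviation alignment, trimming of the largest 20% of residuals) follows; details such as the epsilon clamps and the batch-level masking are illustrative simplifications.

    import torch

    def ssi_trimmed_loss(pred, target, mask, trim=0.2):
        """Scale- and shift-invariant loss with robust alignment and trimming
        of the largest residuals; trim=0.2 keeps U = 0.8 * P pixels."""
        d, d_star = pred[mask], target[mask]
        # Robust alignment to zero translation and unit scale
        t_d, t_g = d.median(), d_star.median()
        s_d = (d - t_d).abs().mean().clamp(min=1e-8)
        s_g = (d_star - t_g).abs().mean().clamp(min=1e-8)
        d_hat = (d - t_d) / s_d
        g_hat = (d_star - t_g) / s_g
        res = (d_hat - g_hat).abs()
        # Keep only the smallest (1 - trim) fraction of residuals (trimmed MAE)
        k = max(1, int((1.0 - trim) * res.numel()))
        trimmed, _ = torch.topk(res, k, largest=False)
        return trimmed.sum() / (2 * res.numel())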
The unknown and variable scale cannot be ignored when training a monocular depth estimation model, so this embodiment applies a multi-scale, scale-invariant gradient matching term in disparity space, defined as:

L_{reg}(\hat{d}, \hat{d}^*) = \frac{1}{P} \sum_{k=1}^{K} \sum_{p=1}^{P} \left(\left|\nabla_x R^k_p\right| + \left|\nabla_y R^k_p\right|\right)

where R^k = d̂^k − d̂^{*k} denotes the difference between the predicted and ground-truth disparity maps at scale k; K = 4 scale levels are used in this embodiment, with the image resolution halved at each level. This matching term makes predicted discontinuities sharper and better aligned with the ground-truth discontinuities.
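A hedged PyTorch sketch of this multi-scale gradient matching term, assuming aligned (N, 1, H, W) disparity tensors and a boolean validity mask of the same shape, is shown below; the masking and normalization details are simplifications of the idea rather than the patent's exact formulation.

    import torch.nn.functional as F

    def gradient_matching_loss(pred_aligned, target_aligned, mask, scales=4):
        """Multi-scale gradient matching: penalises x/y gradients of the
        residual R = d_hat - d_hat* at K scale levels, halving the resolution
        at each level."""
        diff = (pred_aligned - target_aligned) * mask.float()
        m = mask.float()
        loss = 0.0
        for _ in range(scales):
            grad_x = (diff[..., :, 1:] - diff[..., :, :-1]).abs()
            grad_y = (diff[..., 1:, :] - diff[..., :-1, :]).abs()
            valid_x = m[..., :, 1:] * m[..., :, :-1]
            valid_y = m[..., 1:, :] * m[..., :-1, :]
            loss = loss + (grad_x * valid_x).sum() + (grad_y * valid_y).sum()
            # Halve the resolution for the next scale level
            diff = F.avg_pool2d(diff, kernel_size=2)
            m = F.avg_pool2d(m, kernel_size=2)
        return loss / mask.float().sum().clamp(min=1.0)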
The training loss function of the mixed data set in the step S3 specifically includes:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
for the mixed training strategy over different datasets, a Pareto-optimal multi-task learning framework is used: the learning of each dataset is defined as a separate task and an approximately Pareto-optimal solution over the datasets is sought, with the multi-objective minimization criterion:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model, shared across the datasets.
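The per-dataset loss L_l can be assembled from the pieces sketched earlier. The snippet below simply sums the per-dataset losses with equal weights for clarity; the Pareto-optimal multi-task weighting described above is a separate optimization procedure and is not reproduced here, and the batch-level (rather than per-sample) alignment is a simplification.

    def mixed_dataset_loss(pred_batches, gt_batches, masks, alpha=0.5):
        """Sum of per-dataset losses L_l = L_ssi + alpha * L_reg over one
        mini-batch drawn from each dataset (uniform weighting, not Pareto)."""
        total = 0.0
        for pred, gt, mask in zip(pred_batches, gt_batches, masks):
            pred_aligned, gt_aligned = align_scale_shift(pred, gt, mask)
            l_ssi = ssi_trimmed_loss(pred, gt, mask)
            l_reg = gradient_matching_loss(pred_aligned, gt_aligned, mask)
            total = total + l_ssi + alpha * l_reg
        return total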
In a specific embodiment, the mixed-dataset single image depth estimation network is trained on the ReDWeb, NYU Depth v2 and DIML datasets. The deep learning framework used is PyTorch. The computer hardware is an Intel i7 8700 CPU and an NVIDIA GTX 1080Ti GPU with 12 GB of memory; the computer has 32 GB of RAM and runs Windows.
Depth estimation network parameter setting:
the ResNet-50 based multi-scale architecture is used as a backbone network to initialize the initial convolutional layer using a random Gaussian initialization method. The images are flipped randomly horizontally and cropped randomly so that the input images fed into the network are resized to 384 x 384, which can enhance the data and maintain the aspect ratio on different input images. In the experiment, other pre-training models with the best effect are used for carrying out mixed training on data sets, as three data sets are required to be mixed, batch size (batch size) is set to be 16, reLU is used as an activation function, adam optimization network is utilized to accelerate convergence of the network, the total number of training samples is 1000 images, and the network iterates 50 times.
The evaluation metrics for single image depth prediction mainly include:
Absolute root mean square error (Root Mean Square Error, RMSE):

RMSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(d_i - d^*_i\right)^2}

Average relative error (REL):

REL = \frac{1}{T} \sum_{i=1}^{T} \frac{\left|d_i - d^*_i\right|}{d^*_i}

Percentage of pixels (threshold accuracy):

\delta = \frac{1}{T} \left|\left\{ i : \max\left(\frac{d_i}{d^*_i}, \frac{d^*_i}{d_i}\right) < thr \right\}\right|, \quad thr \in \{1.25, 1.25^2, 1.25^3\}

Mean log error (log10):

\mathrm{log10} = \frac{1}{T} \sum_{i=1}^{T} \left|\log_{10} d_i - \log_{10} d^*_i\right|

Weighted Human Disagreement Rate (WHDR):

WHDR = \frac{\sum_{ij} w_{ij}\, \mathbb{1}\left(l_{ij} \ne \hat{l}_{ij}\right)}{\sum_{ij} w_{ij}}

where T is the total number of pixels in the test set, i is the pixel index, and d_i and d*_i are the network's predicted depth value and the ground-truth value, respectively. For WHDR, a threshold τ defines the equality relation between two points: when the difference between two predicted depth values is smaller than τ, the two depth values are considered equal; w_ij is the human confidence weight of the (i, j) point pair and, following the study by Chen et al., is set to 1.
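A compact NumPy sketch of the dense metrics above (RMSE, REL, log10 and the threshold accuracies) follows; WHDR is omitted because it operates on sampled point pairs rather than dense maps, and the function name and validity test are assumptions.

    import numpy as np

    def depth_metrics(pred, gt):
        """Dense single-image depth evaluation metrics over valid pixels."""
        valid = gt > 0
        p, g = pred[valid], gt[valid]
        rmse = np.sqrt(np.mean((p - g) ** 2))
        rel = np.mean(np.abs(p - g) / g)
        log10 = np.mean(np.abs(np.log10(p) - np.log10(g)))
        ratio = np.maximum(p / g, g / p)
        delta = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
        return {"RMSE": rmse, "REL": rel, "log10": log10,
                "delta<1.25": delta[0], "delta<1.25^2": delta[1], "delta<1.25^3": delta[2]}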
The performance of the monocular depth estimation method of this embodiment is compared from three aspects: different values of the weight α in the loss function, different loss functions, and different mainstream methods.
(1) Different α values in the loss function
In the loss function L_l, α is the weight that controls the contribution of the scale-related gradient matching term, and different α values affect the final depth estimation result. To verify the influence of α, this embodiment tests network performance with the same network structure and different α values; the results are shown in Table 1.
Table 1 Influence of different α values on network performance
When α is 0.5, the three error measures are smallest and the network performance is optimal, as shown by the underlined results in Table 1.
(2) Different loss functions
With α set to 0.5, this section compares the loss function L(I,G,z) of Xian et al. with the loss function Loss_l used in this embodiment in terms of the accuracy of the generated depth estimation maps; the experimental results are shown in Table 2.
Table 2 Influence of different loss functions on the network
In the single image depth estimation model, Loss_l performs better than L(I,G,z).
(3) Compared with the mainstream method
The loss function used in this experiment is Loss_l with α set to 0.5; the comparison with mainstream single image depth estimation methods is shown in Table 3.
Table 3 Performance comparison with mainstream methods
In the tables, the method of Eigen et al. is from Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network[C]//Advances in Neural Information Processing Systems. 2014: 2366-2374.
The method of Liu et al. is from Karsch K, Liu C, Kang S B. Depth extraction from video using non-parametric sampling[C]//European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2012: 775-788.
The method of Li et al. is from Li B, Shen C, Dai Y, et al. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1119-1127.
The method of Wang et al. is from Wang X, Fouhey D, Gupta A. Designing deep networks for surface normal estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 539-547.
As can be seen from Table 3, the depth estimation model of this embodiment, trained on the mixed dataset, achieves a clear improvement in depth estimation accuracy over other depth estimation models trained on a single training set. The visual comparison on the different test datasets is shown in fig. 10; from left to right are the input image, the mixed-loss depth result of this embodiment and the L(I,G,z) loss depth result of Xian et al. A model trained for a particular dataset works well on its own training data but poorly on other test sets, whereas the method of this embodiment benefits from the complementary features of the mixed datasets and produces more accurate depth results after optimization.
The method of Xian et al. is from Xian K, Shen C, Cao Z, et al. Monocular relative depth perception with web stereo data supervision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 311-320.
In this embodiment, images of different subjects at different depths in a green-screen scene were tested and new viewpoint images were generated with good depth results. The position of the person, as the foreground, needs to be adjusted: different degrees of overlap have a certain influence on the accuracy of the depth prediction, and reducing the overlap between people as much as possible yields a better depth map.
Example 3
This embodiment provides a single image depth estimation system based on a multi-scale residual network, as shown in fig. 11, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual network model, which uses an improved residual network as a feedforward network to generate feature maps with different semantics and different scales, feature fusion modules to combine features, and finally an adaptive output module to adjust the number of channels of the feature maps and the size of the final output depth estimation map;
the training module is used for training the multi-scale residual network model of the network model module with the color RGB image datasets of the data module, updating it with the mixed-dataset training loss function, and obtaining a trained multi-scale residual network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
Further, the multi-scale residual network model comprises an improved residual network, several feature fusion modules and an adaptive module, wherein:
the input of the improved residual network is a three-channel RGB image, and the improved residual network outputs feature maps with different semantics and different scales; the improved residual network is connected to the feature fusion modules in sequence, where each feature fusion module takes as input the output feature map of the corresponding layer of the improved residual network and the output of the previous feature fusion module, and the first feature fusion module takes as input the output feature map of the penultimate layer of the improved residual network together with the feature map obtained by upsampling the output feature map of the last layer of the improved residual network; the last feature fusion module is connected to the adaptive module, and the adaptive module outputs the depth estimation map.
Further, the mixed-dataset training loss function is specifically:

L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The single image depth estimation method based on the multi-scale residual error network is characterized by comprising the following steps of:
s1: acquiring color RGB image datasets having different characteristics;
s2: constructing a multi-scale residual error network model, wherein the multi-scale residual error network model utilizes an improved residual error network as a feedforward network to generate feature images with different semantics and different proportions, a feature fusion module is used for feature combination, and an adaptive output module is used for adjusting the channel number of the feature images and the size of a final output depth estimation image;
s3: training the multi-scale residual error network model based on the step S2 by using the color RGB image data set of the step S1, and updating the multi-scale residual error network model based on the mixed data set training loss function to obtain a trained multi-scale residual error network model based on the mixed data set training loss function;
s4: and performing single image depth estimation by using the trained multi-scale residual error network model.
2. The multi-scale residual network-based single image depth estimation method of claim 1, wherein the multi-scale residual network-based model comprises a modified residual network, a number of feature fusion modules, and an adaptation module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
3. The single image depth estimation method based on multi-scale residual network according to claim 2, wherein the improved residual network in step S2 is specifically:
and deleting the last pooling layer, the full connection layer and the Softmax layer of the residual network structure to obtain the improved residual network.
4. The single image depth estimation method based on the multi-scale residual network according to claim 2, wherein the upsampling in the upsampled feature map of the last layer of the improved residual network adopts residual convolution, specifically:
and inputting the output characteristic diagram of the last layer of the improved residual network into a ReLU layer, sequentially connecting a 3X 3 convolution layer, the ReLU layer and the 3X 3 convolution layer after the ReLU layer, and combining the output of the 3X 3 convolution layer of the last layer with the output characteristic diagram of the last layer of the improved residual network.
5. The single image depth estimation method based on the multi-scale residual network according to claim 2, wherein the feature fusion module specifically comprises:
the characteristic fusion module comprises a 3X 3 convolution layer, a residual convolution block and an up-sampling operation, wherein the output characteristic diagram of the improved residual network corresponding layer number is sequentially combined with the output of the characteristic fusion module of the previous layer after passing through the 3X 3 convolution layer and the residual convolution block, and is sequentially output after passing through the residual convolution block and the up-sampling operation.
6. The single image depth estimation method based on a multi-scale residual network of claim 2, wherein the adaptation module comprises two 3 x 3 convolution layers connected in sequence.
7. The multi-scale residual network-based single image depth estimation method according to any one of claims 1 to 6, wherein the hybrid dataset training loss function in step S3 is specifically:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
8. A single image depth estimation system based on a multi-scale residual network, comprising:
a data module that obtains color RGB image datasets having different characteristics;
the network model module is used for constructing a multi-scale residual error network model, the multi-scale residual error network model is used as a feedforward network to generate feature images with different semantics and different proportions, the feature fusion module is used for feature combination, and finally the adaptive output module is used for adjusting the channel number of the feature images and the size of the final output depth estimation image;
the training module is used for training the multi-scale residual error based network model of the network model module by using the color RGB image data set of the data module, and updating the multi-scale residual error based network model based on the mixed data set training loss function to obtain a trained multi-scale residual error based network model;
and the depth estimation module is used for carrying out single image depth estimation by utilizing the trained multi-scale residual error network model.
9. The multi-scale residual network based single image depth estimation system of claim 8, wherein the multi-scale residual network based model comprises a modified residual network, a number of feature fusion modules, and an adaptation module, wherein:
the input of the improved residual error network is a three-channel RGB image, and the improved residual error network outputs feature images with different semantics and different proportions; the improved residual network is connected with the feature fusion modules in sequence, the input of the feature fusion module is respectively an output feature map of the corresponding layer number of the improved residual network and the output of the feature fusion module of the upper layer, and the input of the first feature fusion module is an output feature map of the penultimate layer of the improved residual network and a feature map obtained by upsampling the output feature map of the last layer of the improved residual network; and the last feature fusion module is connected with the self-adaptive module, and the self-adaptive module outputs a depth estimation image.
10. The multi-scale residual network-based single image depth estimation system according to claim 8 or 9, wherein the hybrid dataset trains a loss function, in particular:
L_l = \frac{1}{N_l} \sum_{n=1}^{N_l} \left( L_{ssi}(\hat{d}^{\,n}, \hat{d}^{*n}) + \alpha\, L_{reg}(\hat{d}^{\,n}, \hat{d}^{*n}) \right)

where L_l is the training loss function for dataset l, N_l is the number of samples in dataset l, L_ssi is the scale- and shift-invariant loss function, L_reg is the multi-scale, scale-invariant gradient matching term, d̂^n and d̂^{*n} are the aligned predicted and ground-truth values of the n-th sample, and α is the weight;
defining the learning of each dataset as a separate task and seeking an approximately Pareto-optimal solution over the datasets, the multi-objective minimization criterion is as follows:

\min_{\theta} \left( L_1(\theta), L_2(\theta), \ldots, L_D(\theta) \right)

where D is the number of datasets, L_l is the loss function of the respective dataset, and θ are the model parameters of the multi-scale residual network model.
CN202311298295.0A 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network Pending CN117036439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311298295.0A CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311298295.0A CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Publications (1)

Publication Number Publication Date
CN117036439A true CN117036439A (en) 2023-11-10

Family

ID=88632252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311298295.0A Pending CN117036439A (en) 2023-10-09 2023-10-09 Single image depth estimation method and system based on multi-scale residual error network

Country Status (1)

Country Link
CN (1) CN117036439A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288788A (en) * 2020-10-12 2021-01-29 南京邮电大学 Monocular image depth estimation method
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention
CN112288788A (en) * 2020-10-12 2021-01-29 南京邮电大学 Monocular image depth estimation method
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. XIAN ET AL.: "Monocular Relative Depth Perception with Web Stereo Data Supervision", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23 June 2018 (2018-06-23), pages 311 - 320, XP033475991, DOI: 10.1109/CVPR.2018.00040 *
R. RANFTL, K. LASINGER, D. HAFNER, K. SCHINDLER AND V. KOLTUN: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer", IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 1 March 2022 (2022-03-01), pages 1623 - 1637 *
R. RANFTL, K. LASINGER, D. HAFNER, K. SCHINDLER AND V. KOLTUN: "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer", IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), pages 1623 - 1637 *

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN108596974B (en) Dynamic scene robot positioning and mapping system and method
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN109377530A (en) A kind of binocular depth estimation method based on deep neural network
CN113393522B (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN108986136A (en) A kind of binocular scene flows based on semantic segmentation determine method and system
CN108510573A (en) A method of the multiple views human face three-dimensional model based on deep learning is rebuild
CN111462324B (en) Online spatiotemporal semantic fusion method and system
CN108171249B (en) RGBD data-based local descriptor learning method
CN110335299B (en) Monocular depth estimation system implementation method based on countermeasure network
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN112465021B (en) Pose track estimation method based on image frame interpolation method
CN113313810A (en) 6D attitude parameter calculation method for transparent object
Wang et al. Depth estimation of video sequences with perceptual losses
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN115423978A (en) Image laser data fusion method based on deep learning and used for building reconstruction
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Basak et al. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image
Jia et al. Depth measurement based on a convolutional neural network and structured light

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination