CN114638870A - Indoor scene monocular image depth estimation method based on deep learning - Google Patents


Info

Publication number
CN114638870A
CN114638870A
Authority
CN
China
Prior art keywords
depth
prediction
interval
pixel
tensor
Prior art date
Legal status
Pending
Application number
CN202210251724.8A
Other languages
Chinese (zh)
Inventor
刘佳涛
张亚萍
Current Assignee
Yunnan Normal University
Original Assignee
Yunnan Normal University
Priority date
Filing date
Publication date
Application filed by Yunnan Normal University filed Critical Yunnan Normal University
Priority to CN202210251724.8A priority Critical patent/CN114638870A/en
Publication of CN114638870A publication Critical patent/CN114638870A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/251 - Fusion techniques of input or preprocessed data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks


Abstract

The invention relates to an indoor scene monocular image depth estimation method based on deep learning, and belongs to the technical field of three-dimensional scene perception. The method first introduces the neural network EfficientNet-b7, pre-trained for image classification on ImageNet, to construct an encoder, and adds SENet-based residual connections together with convolution and resampling operations at the different stages of the encoder. On the basis of the idea of depth interval division, a loss function that attends to the image from the global level to the local level is then constructed and applied to the predictions of the different stages. Finally, a Transformer structure based on the self-attention mechanism fuses the depth information predicted at the different stages and outputs the scene depth prediction result. By designing a novel, efficient and lightweight decoder, the invention replaces the traditional serial fusion of the features from the different encoder stages with parallel fusion, thereby improving the model's ability to jointly exploit the global and local information of the image during depth estimation.

Description

Indoor scene monocular image depth estimation method based on deep learning
Technical Field
The invention relates to an indoor scene monocular image depth estimation method based on deep learning, and belongs to the technical field of three-dimensional scene perception.
Background
Depth estimation from two-dimensional RGB images has a wide range of applications, for example three-dimensional reconstruction, scene understanding, autonomous driving and robotics. With the advent of large-scale data sets and the improvement of hardware computing power, recent research on image depth estimation has focused mainly on two-dimensional to three-dimensional reconstruction using deep learning and convolutional neural networks. Depth estimation from a single RGB image is an ill-posed problem, because one picture can correspond to an unlimited number of three-dimensional scenes. Furthermore, insufficient scene coverage, translucent or reflective materials and similar factors can lead to ambiguity, where the geometry cannot be deduced from the appearance alone.
Deep-learning-based monocular depth estimation began with the two-scale network proposed by Eigen et al., after which many effective methods based on convolutional neural networks were proposed. The document "Laina et al., Deeper Depth Prediction with Fully Convolutional Residual Networks" uses a fully convolutional residual network based on ResNet-50 and replaces the fully connected layer with a series of upsampling blocks. The document "Alhashim et al., High Quality Monocular Depth Estimation via Transfer Learning" introduces skip connections into a simple encoder-decoder architecture and trains the model with transfer learning. The document "Lee et al., From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation" proposes replacing the standard upsampling layers with local planar guidance layers that guide the features to full resolution in the decoder. The document "Fu et al., Deep Ordinal Regression Network for Monocular Depth Estimation" found that converting the depth regression task into a classification task can improve performance. The document "Bhat et al., AdaBins: Depth Estimation Using Adaptive Bins" designs an AdaBins module that divides the depth range into 256 intervals, takes the center value of each interval as the depth value of the pixels falling in that interval, and computes the final depth of a pixel as a linear combination of the interval centers. The document "Ranftl et al., Vision Transformers for Dense Prediction" applies the Vision Transformer to monocular depth estimation and obtains a highly accurate depth estimation model by training on a large data set.
Although great progress has been made in deep-learning-based depth estimation of indoor monocular images, some problems remain: 1) in most encoder-decoder structures used by deep neural networks, the encoder suffers from insufficient feature extraction and loss of spatial information in the feature extraction stage due to operations such as explicit downsampling, so the network easily loses the fine-grained information of the image; 2) the actual scene structure faced by indoor monocular depth estimation is usually complex, and if the global and local relations in the scene are not considered effectively, the accuracy of depth estimation is low; 3) although the Vision Transformer can greatly alleviate the loss of image granularity, its models have a large number of parameters and require a large amount of labeled data to drive training.
Disclosure of Invention
The invention aims to solve the technical problem of providing an indoor scene monocular image depth estimation method based on deep learning. To address the tendency of convolutional monocular depth estimation encoders to lose fine-grained image information in their deep layers, the method makes comprehensive use of the features of multiple encoding stages. In the decoding network, to address the difficulty traditional networks have in effectively considering the global and local relations of complex scenes, a decoder is designed that predicts global-to-local depth information in parallel branches and then adjusts and fuses them, and a loss function is designed accordingly, thereby solving the above problems.
The technical scheme of the invention is as follows: a depth estimation method for monocular images of indoor scenes based on deep learning specifically comprises the following steps:
Step 1: Construct an encoder by introducing the neural network EfficientNet-b7 pre-trained for image classification on ImageNet.
Step 2: Introduce SENet-based residual connections and convolution and resampling operations at different stages of the encoder to obtain predictions at different stages.
Step 3: Based on a depth interval division method, construct a loss function that attends to the image from the global level to the local level and apply it to the predictions of the different stages.
Step 4: Fuse the depth information predicted at the different stages with a Transformer structure based on the self-attention mechanism and output the scene depth prediction result.
The Step1 is specifically as follows: downloading from the Internet an EfficientNet-b7 network pre-trained on ImageNet and obtaining the feature vectors encoded at its 3rd, 5th, 6th, 8th and 12th blocks, the resolutions of these feature vectors being, respectively, fixed fractions of the input image resolution.
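As a non-limiting illustration of Step1, the multi-stage feature extraction can be sketched in PyTorch as follows. The sketch assumes the timm library, whose features_only interface exposes one feature map per downsampling stage; these stage outputs stand in for the block-3/5/6/8/12 features described above, and the 480 × 640 input size is only an example.

```python
# Minimal sketch of the multi-stage feature extraction in Step 1, assuming the
# timm library; timm's features_only API returns one feature map per
# down-sampling stage, standing in for the patent's blocks 3/5/6/8/12.
import timm
import torch

# EfficientNet-b7 pre-trained on ImageNet, exposing intermediate stage outputs.
encoder = timm.create_model(
    "tf_efficientnet_b7", pretrained=True, features_only=True
)

x = torch.randn(1, 3, 480, 640)   # an example RGB input (N, C, H, W)
features = encoder(x)             # list of 5 feature maps, from shallow to deep
for f in features:
    print(f.shape)                # resolutions are fixed fractions of 480 x 640
```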
The Step2 is specifically as follows:
Step2.1: The feature vector encoded by the 3rd block is input into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block.
Step2.2: A channel attention layer is added after the last residual block of each stage, together with a residual connection from the encoder to this layer.
Step2.3: The features of each stage are passed step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution.
Step2.4: The features of the 1st, 2nd and 5th stages are added and fused pixel by pixel, the features of the 2nd, 3rd and 5th stages are added and fused pixel by pixel, the features of the 1st, 3rd and 4th stages are added and fused pixel by pixel, and the features of the 1st, 4th and 5th stages are added and fused pixel by pixel; each fused result then passes through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
With this choice of fusion groups, the predictions go from local to global as the network goes from shallow to deep: the first-stage features serve as a reference for the latter two predictions and the fifth-stage features serve as a reference for the first two predictions, so that the global and the local predictions can each be carried out more effectively; during fusion, the rich spatial information contained in the shallow features improves the accuracy of the fused result. Compared with the traditional approach of serially fusing the inputs in a single step and then producing one output, this improves both efficiency and accuracy.
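As a non-limiting illustration of Step2, the following PyTorch sketch shows an SENet-style residual block and the parallel pixel-wise fusion of five 30-channel, half-resolution stage features into predictions 1 to 4. The 3 × 3 kernel sizes, the squeeze-and-excitation reduction ratio of 8 and the 240 × 320 example resolution are assumptions rather than values fixed by the invention.

```python
# Sketch of the SENet-based residual block and the parallel pixel-wise fusion
# of Step 2 (assumed: reduction ratio 8 in the SE block, 3x3 convolutions).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class SEResidualBlock(nn.Module):
    """Residual block with an SE attention layer on its output."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.body(x))

def fuse(stages, head):
    """Pixel-wise addition of same-shape stage features followed by a conv head."""
    return head(torch.stack(stages, dim=0).sum(dim=0))

# Five decoder stages, each already reduced to 30 channels at half input resolution.
f1, f2, f3, f4, f5 = [torch.randn(1, 30, 240, 320) for _ in range(5)]
heads = [nn.Conv2d(30, 1, 3, padding=1) for _ in range(4)]

pred1 = fuse([f1, f2, f5], heads[0])   # shallow: more local
pred2 = fuse([f2, f3, f5], heads[1])
pred3 = fuse([f1, f3, f4], heads[2])
pred4 = fuse([f1, f4, f5], heads[3])   # deep: more global
```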
The Step3 specifically comprises the following steps:
Step3.1: The maximum depth d_max and the minimum depth d_min are obtained from the ground-truth depth map.
Step3.2: The depth interval [d_min, d_max] is divided equally into 10 intervals; the length of one interval is calculated as:
len = (d_max - d_min) / 10
Among these 10 intervals, the depth value range of the i-th interval is:
[d_min + (i - 1) × len, d_min + i × len]
Step3.3: A histogram of the ground-truth depth map is computed to find, among the 10 intervals, the interval that accounts for the largest share of the scene depth; this interval contains the most global information, and correspondingly the intervals with smaller shares contain more local information.
Step3.4: The 10 depth intervals are sorted in descending order of the share they occupy; the mean squared error of prediction 1 from Step2.4 is computed over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals.
Step3.5: The four error terms are combined into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training. The calculation formula is:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
The depth-interval-division method used in Step3 yields a dedicated loss function that serves as a supplementary constraint for the staged predictions. Applying it to the predictions of the different stages gradually separates global and local prediction into different stages during training, so that each stage of the model attends to the depth intervals that it can predict most accurately.
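A non-limiting PyTorch sketch of this interval loss is given below. The λ weights and the per-prediction interval ranges follow Step3.4 and Step3.5; the per-image computation of d_min and d_max and the histogram-based ranking follow the description, while the tensor shapes and the guard against empty masks are implementation assumptions (torch.isin requires PyTorch 1.10 or later).

```python
# Minimal sketch of the interval-based loss of Step 3 (lambda1=0.5,
# lambda2=lambda3=0.6, lambda4=1; interval ranges per Step 3.4).
# Prediction and ground-truth tensors are assumed to have shape (N, 1, H, W).
import torch

LAMBDAS = [0.5, 0.6, 0.6, 1.0]
# interval positions (after sorting by share, descending) each prediction is scored on
INTERVAL_RANGES = [(5, 10), (4, 8), (2, 4), (1, 2)]

def interval_loss(preds, gt, num_bins: int = 10, eps: float = 1e-6):
    d_min, d_max = gt.min(), gt.max()
    length = (d_max - d_min) / num_bins

    # Assign each ground-truth pixel to one of the 10 depth intervals.
    bin_idx = torch.clamp(((gt - d_min) / (length + eps)).long(), 0, num_bins - 1)

    # Rank intervals by the share of the scene depth they occupy (descending).
    counts = torch.bincount(bin_idx.flatten(), minlength=num_bins)
    order = torch.argsort(counts, descending=True)   # order[0] = most "global" interval

    loss = gt.new_zeros(())
    for pred, lam, (lo, hi) in zip(preds, LAMBDAS, INTERVAL_RANGES):
        wanted = order[lo - 1:hi]                    # intervals this prediction focuses on
        mask = torch.isin(bin_idx, wanted)
        n_i = mask.sum().clamp(min=1)                # guard against empty masks
        loss = loss + lam * ((pred - gt)[mask] ** 2).sum() / n_i
    return loss

# usage: preds = [pred1, pred2, pred3, pred4]; loss = interval_loss(preds, gt_depth)
```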
The Step4 is specifically as follows:
Step4.1: The prediction results of the 4 stages are concatenated into a four-channel tensor F.
Step4.2: A convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels is applied to the four-channel tensor F.
Step4.3: The two-dimensional tensor obtained after the convolution is flattened into one dimension.
Step4.4: The one-dimensional tensor is input into the Transformer Encoder, and the output one-dimensional tensor is restored to a two-dimensional tensor that serves as the weight matrix W.
Step4.5: A convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels is applied to the four-channel tensor F to obtain a 128-channel tensor G.
Step4.6: After a pixel-by-pixel dot product between the weight matrix W and the tensor G, the final prediction result is output through a series of convolution layers.
Compared with the traditional approach of convolution after channel concatenation, the fusion method adopted in Step4 greatly improves the accuracy of the model's predictions while introducing almost no additional parameters.
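As a non-limiting illustration of Step4, the fusion can be sketched in PyTorch as follows. The 16 × 16 patch convolution with stride 16 and 4 output channels, the Transformer Encoder and the 128-channel feature path follow the text; the number of encoder layers and attention heads, and the spatial alignment of the weight matrix W with the tensor G (bilinear upsampling followed by channel-summed gating), are assumptions.

```python
# Sketch of the Transformer-based fusion of Step 4. The patch embedding
# (16x16 kernel, stride 16, 4 channels) and the 128-channel path follow the
# description; layer/head counts and the W-to-G alignment are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerFusion(nn.Module):
    def __init__(self, embed_dim: int = 4, num_layers: int = 2, num_heads: int = 2):
        super().__init__()
        self.patch_embed = nn.Conv2d(4, embed_dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.feat_conv = nn.Conv2d(4, 128, kernel_size=3, stride=1, padding=1)
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, preds):                        # preds: four (N, 1, H, W) tensors
        f = torch.cat(preds, dim=1)                  # four-channel tensor F
        tokens = self.patch_embed(f)                 # (N, 4, H/16, W/16)
        n, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)   # flatten into a token sequence
        tokens = self.encoder(tokens)                # Transformer Encoder
        weight = tokens.transpose(1, 2).reshape(n, c, h, w)   # restore 2-D weight matrix W
        weight = F.interpolate(weight, size=f.shape[-2:], mode="bilinear",
                               align_corners=False)  # align W with F (assumption)
        g = self.feat_conv(f)                        # 128-channel tensor G
        fused = g * weight.sum(dim=1, keepdim=True)  # pixel-wise gating (assumption)
        return self.head(fused)                      # final prediction via conv layers

# usage: out = TransformerFusion()([torch.randn(1, 1, 240, 320) for _ in range(4)])
```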
The invention has the beneficial effects that:
(1) To make full use of the fine-grained information of the input image, feature vectors are extracted from multiple stages of the encoder, which overcomes the tendency of traditional methods to lose fine-grained information in the deep layers of the encoding network;
(2) A decoder is designed that predicts global-to-local depth information in parallel branches and then adjusts and fuses it, together with a correspondingly designed loss function, which alleviates the difficulty traditional networks have in effectively considering the global and local relations of complex scenes;
(3) The invention is based on convolutional neural networks and achieves accurate results without requiring a large data set to drive training;
(4) Training does not require a very large data set, and the accuracy of monocular depth estimation is effectively improved.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a schematic comparison, in several scenes, of the depth maps predicted by the monocular depth estimation network adopted by the present invention and by the current state-of-the-art networks AdaBins and DPT-Hybrid, where:
(a) is an input RGB image;
(b) is a true depth map;
(c) is a depth map for AdaBins prediction;
(d) is a depth map of DPT-Hybrid prediction;
(e) is the depth map predicted by the present invention;
FIG. 3 is an exemplary diagram of the present invention for generating a three-dimensional point cloud from a single RGB image by predicting depth.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and not for limiting it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in FIG. 1, an indoor scene monocular image depth estimation method based on deep learning specifically includes the following steps:
In the feature extraction stage of the encoder, the feature vectors encoded by the 3rd, 5th, 6th, 8th and 12th blocks of the EfficientNet-b7 encoder are extracted; their spatial sizes are fixed fractions of H × W, where H and W are the height and width of the input image, respectively.
The feature vector encoded by the 3rd block is then input into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block.
A channel attention layer is added after the last residual block of each stage, and a residual connection from the encoder to this layer is added to build a large residual block; the features of each stage are then passed step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution.
Next, the features of the 1st, 2nd and 5th stages are added and fused pixel by pixel, the features of the 2nd, 3rd and 5th stages are added and fused pixel by pixel, the features of the 1st, 3rd and 4th stages are added and fused pixel by pixel, and the features of the 1st, 4th and 5th stages are added and fused pixel by pixel; each fused result then passes through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
A loss function focusing on local-to-global depth is then designed for predictions 1 to 4.
First, the maximum depth d_max and the minimum depth d_min are obtained from the ground-truth depth map, and the depth interval [d_min, d_max] is divided equally into 10 intervals; the length of one interval is calculated as:
len = (d_max - d_min) / 10
Among these 10 intervals, the depth value range of the i-th interval is:
[d_min + (i - 1) × len, d_min + i × len]
then, a histogram is made for the real depth map to find an interval occupying the largest scene depth proportion in 10 intervals, wherein the interval contains most global information, and correspondingly, the interval occupying the smaller proportion contains more local information.
Next, the 10 depth intervals are sorted in descending order of the share they occupy; the mean squared error of prediction 1 is computed over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals.
The four error terms are combined into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training. The calculation formula is:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
After predictions 1 to 4 are obtained, the four predictions need to be fused.
First, the prediction results of the 4 stages are concatenated into a four-channel tensor F. A convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels is then applied to the four-channel tensor F, and the two-dimensional tensor obtained after the convolution is flattened into one dimension. Next, the one-dimensional tensor is input into the Transformer Encoder, and the one-dimensional tensor it outputs is restored to a two-dimensional tensor that serves as the weight matrix W. The four-channel tensor F then undergoes a convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels, yielding a 128-channel tensor G. Finally, after a pixel-by-pixel dot product between the weight matrix W and the tensor G, the final prediction result is output through a series of convolution layers.
The proposed deep-learning-based indoor scene monocular image depth estimation method is evaluated on the NYU Depth v2 and SUN RGB-D data sets. NYU Depth v2 was captured in indoor scenes with a Microsoft Kinect RGB-D camera, while SUN RGB-D was captured with Intel RealSense, Asus Xtion, Kinect v1 and Kinect v2 devices. Both are indoor scene data sets, and SUN RGB-D contains more complex scenes.
Table 1 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, together with the number of model parameters, for the parallel decoder used in the present invention and for a traditional simple serial decoder. As the data in Table 1 show, the method of the present invention obtains better results than the traditional method, improves the accuracy of depth map estimation to a certain extent, and reduces the model parameters by 29.4%.
TABLE 1
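The error and accuracy measures reported in Tables 1 to 7 are the metrics customarily used for monocular depth estimation. A minimal sketch of how they are typically computed is given below, assuming the standard definitions, with the accuracy under threshold taken as the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^k.

```python
# Minimal sketch of the evaluation metrics reported in the tables, assuming the
# standard monocular-depth definitions (average relative error, root mean square
# error, log10 mean error, accuracy under thresholds 1.25^k).
import torch

def depth_metrics(pred, gt):
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()                 # average relative error
    rmse = ((pred - gt) ** 2).mean().sqrt()                   # root mean square error
    log10 = (pred.log10() - gt.log10()).abs().mean()          # logarithmic mean error
    ratio = torch.maximum(pred / gt, gt / pred)
    deltas = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]  # threshold accuracy
    return abs_rel, rmse, log10, deltas
```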
Table 2 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, with and without the loss term for prediction i designed by the present invention. As the data in Table 2 show, the designed loss term effectively reduces the model's prediction error.
TABLE 2
Table 3 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, together with the number of model parameters, for the Transformer fusion method used in the present invention and for direct training with the softmax-style computational fusion method, which is calculated as follows:
output = Σ_{i=1..4} σ(Conv(block_i)) ⊙ block_i
where block_i is prediction i in FIG. 1, σ denotes the sigmoid function and ⊙ denotes pixel-wise multiplication. Specifically, prediction i is convolved, mapped to the range 0 to 1 by the sigmoid function, multiplied by prediction i, and the results are finally added pixel by pixel to obtain the output. The encoder in this set of experiments uses EfficientNet-B3 pre-trained on ImageNet, which has fewer parameters and trains faster. As the data in Table 3 show, compared with this simple computational fusion, using the Transformer is advantageous in terms of both accuracy and error while adding few parameters.
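For illustration only, the computational fusion baseline just described can be sketched as follows; the 3 × 3 kernel size of the gating convolution is an assumption.

```python
# Sketch of the computational fusion baseline compared against in Table 3:
# each prediction block_i is convolved, passed through a sigmoid to obtain a
# 0-1 gate, multiplied back onto block_i, and the gated maps are summed
# pixel by pixel (the 3x3 kernel size is an assumption).
import torch
import torch.nn as nn

class GatedSumFusion(nn.Module):
    def __init__(self, num_preds: int = 4):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=3, padding=1) for _ in range(num_preds)]
        )

    def forward(self, preds):                        # preds: list of (N, 1, H, W) tensors
        gated = [torch.sigmoid(g(p)) * p for g, p in zip(self.gates, preds)]
        return torch.stack(gated, dim=0).sum(dim=0)  # pixel-wise addition
```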
TABLE 3
Table 4 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training when the number of intervals in the loss term for prediction i is set to 1, 4 and 10, respectively. The encoder in this set of experiments also uses EfficientNet-B3 pre-trained on ImageNet. As the data in Table 4 show, dividing the depth range into 10 intervals achieves the best effect.
TABLE 4
FIG. 2 compares the depth maps predicted in several scenes by the monocular depth estimation network adopted by the present invention and by the current state-of-the-art networks AdaBins and DPT-Hybrid, wherein: (a) the input RGB image; (b) the ground-truth depth map; (c) the depth map predicted by AdaBins; (d) the depth map predicted by DPT-Hybrid; (e) the depth map predicted by the present invention. As can be seen from the figure, the invention accurately predicts the depth information of indoor monocular RGB images, and compared with AdaBins the edge contours of objects are clearer.
FIG. 3 shows a three-dimensional point cloud generated from a depth map predicted by the present invention. As can be seen from the figure, the invention effectively recovers three-dimensional depth information from a two-dimensional image and provides useful guidance for tasks such as three-dimensional reconstruction and scene understanding.
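For reference, a point cloud such as the one in FIG. 3 is typically obtained from a predicted depth map by pinhole back-projection. A minimal sketch follows; the intrinsic parameters fx, fy, cx and cy are placeholder values, not the camera calibration actually used for FIG. 3.

```python
# Minimal sketch of back-projecting a predicted depth map to a 3-D point cloud
# with a pinhole camera model; the intrinsics below are placeholder values.
import torch

def depth_to_point_cloud(depth, fx=518.8, fy=519.5, cx=325.6, cy=253.7):
    """depth: (H, W) tensor in metres -> (H*W, 3) tensor of XYZ points."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack((x, y, z), dim=-1).reshape(-1, 3)
```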
Table 5 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold of the present invention and of the current state-of-the-art methods AdaBins and DPT-Hybrid on the NYU Depth v2 data set. As the data in Table 5 show, the method of the present invention achieves better results on several metrics and improves the accuracy of depth map estimation to a certain extent. Although the encoders of these state-of-the-art methods and of the present method are all pre-trained on ImageNet, DPT-Hybrid requires a large amount of additional training data and fine-tuning on NYU Depth v2 to achieve good results. Specifically, DPT-Hybrid must first be trained for 60 epochs on a data set containing 1.4 million images and then fine-tuned on NYU Depth v2, whereas AdaBins and the model of the invention only require training for 25 epochs and 20 epochs, respectively, on a 50,000-image subset of NYU Depth v2.
TABLE 5
Table 6 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold obtained when the method of the present invention and the current state-of-the-art methods AdaBins and DPT-Hybrid, all trained on the NYU Depth v2 data set, are tested on the SUN RGB-D data set. Since no effective way of handling the inverse depth in the ground-truth depth maps was found for this test, only missing values are masked in the experiment; and since DPT-Hybrid only accepts input at a resolution of 480 × 640, the image resolution is uniformly adjusted to 480 × 640. As the data in Table 6 show, the generalization ability of the model of the invention ranks roughly second among the three.
TABLE 6
Table 7 reports the number of model parameters of the method of the present invention and of the current state-of-the-art methods AdaBins and DPT-Hybrid, together with the time required for a single prediction. As the data in Table 7 show, the model of the invention has fewer parameters than the other two models. In the time required for one prediction, the model of the invention is slightly slower than AdaBins and ranks second. The inference-speed experiment was run on a machine equipped with an NVIDIA GeForce GTX 1660 Ti GPU, with input images at a resolution of 480 × 640. Because the output resolution of DPT-Hybrid is twice that of AdaBins and of the present invention, the timings of AdaBins and of the present invention also include the time for 2× resampling. The times in the table are averages over 5 thousand inferences.
TABLE 7
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (5)

1. A deep-learning-based indoor scene monocular image depth estimation method, which is characterized by comprising the following steps:
Step 1: introducing a neural network EfficientNet-b7 pre-trained by image classification on ImageNet, and constructing an encoder;
Step 2: introducing SENet-based residual connections and convolution and resampling operations at different stages of the encoder to obtain predictions at different stages;
Step 3: constructing, based on a depth interval division method, a loss function that attends to the image from the global level to the local level, and applying the loss function to the predictions at different stages;
Step 4: fusing the depth information predicted at different stages by using a Transformer structure based on an attention mechanism, and outputting a scene depth prediction result.
2. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 1, wherein Step 1 is specifically: downloading from the Internet an EfficientNet-b7 network pre-trained on ImageNet and obtaining the feature vectors encoded at its 3rd, 5th, 6th, 8th and 12th blocks, the resolutions of which are, respectively, fixed fractions of the input image resolution.
3. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 2, wherein Step 2 is specifically:
Step 2.1: inputting the feature vector encoded by the 3rd block into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block;
Step 2.2: adding a channel attention layer after the last residual block of each stage and adding a residual connection from the encoder to this layer;
Step 2.3: passing the features of each stage step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution;
Step 2.4: adding and fusing the features of the 1st, 2nd and 5th stages pixel by pixel, adding and fusing the features of the 2nd, 3rd and 5th stages pixel by pixel, adding and fusing the features of the 1st, 3rd and 4th stages pixel by pixel, adding and fusing the features of the 1st, 4th and 5th stages pixel by pixel, and then passing each fused result through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
4. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 3, wherein Step 3 is specifically:
Step 3.1: acquiring a maximum depth d_max and a minimum depth d_min from the ground-truth depth map;
Step 3.2: dividing the depth interval [d_min, d_max] equally into 10 intervals, the length of one interval being calculated as:
len = (d_max - d_min) / 10
among these 10 intervals, the depth value range of the i-th interval being:
[d_min + (i - 1) × len, d_min + i × len]
Step 3.3: computing a histogram of the ground-truth depth map to find, among the 10 intervals, the interval that accounts for the largest share of the scene depth;
Step 3.4: sorting the 10 depth intervals in descending order of the share they occupy, and computing the mean squared error of prediction 1 from Step 2.4 over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals;
Step 3.5: combining the four error terms into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training, the calculation formula being:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
5. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 1, wherein Step 4 is specifically:
Step 4.1: concatenating the prediction results of the 4 stages into a four-channel tensor F;
Step 4.2: applying to the four-channel tensor F a convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels;
Step 4.3: flattening the two-dimensional tensor obtained after the convolution into one dimension;
Step 4.4: inputting the one-dimensional tensor into the Transformer Encoder, and restoring the output one-dimensional tensor to a two-dimensional tensor that serves as the weight matrix W;
Step 4.5: applying to the four-channel tensor F a convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels to obtain a 128-channel tensor G;
Step 4.6: after performing a pixel-by-pixel dot product between the weight matrix W and the tensor G, outputting a final prediction result through a series of convolution layers.
CN202210251724.8A 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning Pending CN114638870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210251724.8A CN114638870A (en) 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning


Publications (1)

Publication Number Publication Date
CN114638870A true CN114638870A (en) 2022-06-17

Family

ID=81947769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210251724.8A Pending CN114638870A (en) 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN114638870A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883479A (en) * 2023-05-29 2023-10-13 杭州飞步科技有限公司 Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium
CN116883479B (en) * 2023-05-29 2023-11-28 杭州飞步科技有限公司 Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination