CN114170286A - Monocular depth estimation method based on unsupervised depth learning - Google Patents
- Publication number: CN114170286A (application CN202111297537.5A)
- Authority: CN (China)
- Prior art keywords: convolution, channels, network, feature, depth estimation
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T7/50 — Image analysis; depth or shape recovery
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- G06N3/048 — Neural networks; activation functions
- G06N3/08 — Neural networks; learning methods
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
- G06N3/088 — Learning methods; non-supervised learning, e.g. competitive learning
- G06T2207/10028 — Image acquisition modality; range image, depth image, 3D point clouds
- Y02T10/40 — Climate change mitigation in transportation; engine management systems
Abstract
The invention discloses a monocular depth estimation method based on unsupervised deep learning. The method first constructs a depth estimation and pose estimation network framework based on unsupervised deep learning, then trains the constructed neural network, and finally tests the trained network. On the basis of ensuring good accuracy, the method overcomes the limitations of supervised learning in practical application.
Description
Technical Field
The invention belongs to the technical field of machine vision, and relates to a monocular depth estimation method based on unsupervised deep learning.
Background
Depth estimation is a classic problem in machine vision and is important for three-dimensional scene reconstruction and for occlusion and illumination handling in augmented reality. With the rapid development of deep learning in recent years, monocular depth estimation based on deep learning has been widely studied and has achieved good accuracy. Monocular depth estimation typically takes image data from a single viewpoint as input and predicts, in an end-to-end manner, a depth value for each pixel in the image, where a depth value is the distance from the image sensor to the corresponding point in the scene. Monocular depth estimation methods based on supervised deep learning require each RGB image to have a corresponding depth label. Acquiring depth labels usually needs a depth camera or a lidar: the former has a limited range, the latter is expensive, and the raw depth labels obtained are usually sparse points that cannot be well aligned with the original image.
Disclosure of Invention
The invention aims to provide a monocular depth estimation method based on unsupervised deep learning, which overcomes the limitations of supervised learning in practical application while ensuring good accuracy.
The technical solution adopted by the invention is a monocular depth estimation method based on unsupervised deep learning, implemented according to the following steps:
step 1, constructing a depth estimation and pose estimation network framework based on unsupervised deep learning;
step 2, training the neural network established in step 1;
step 3, testing the network trained in step 2.
The invention is also characterized in that:
the construction in step 1 comprises a feature encoding module and a feature decoding module, and is implemented according to the following steps:
step 1.1, constructing an encoding and decoding structure of a depth estimation network;
step 1.2, constructing an encoding and decoding structure of a pose estimation network;
the construction of the coding and decoding structure of the depth estimation network in the step 1.1 is implemented according to the following steps:
step 1.1.1, pictures are input, an ordinary 7×7 convolution operation is performed to adjust the number of picture channels to 64, and batch normalization and ReLU activation are applied;
step 1.1.2, the feature map FM1 obtained in step 1.1.1 is subjected to a maximum pooling operation and then passed into a residual block to obtain a feature map FM2 with 256 channels;
step 1.1.3, the feature map FM2 obtained in step 1.1.2 is passed into a residual block to obtain a feature map FM3 with 512 channels;
step 1.1.4, the feature map FM3 obtained in step 1.1.3 is passed into a residual block to obtain a feature map FM4 with 1024 channels;
step 1.1.5, the feature map FM4 obtained in step 1.1.4 is passed into a residual block to obtain a feature map FM5 with 2048 channels;
step 1.1.6, FM5 is taken as input and up-sampled to the size of FM4, the result being called FM5'; FM4 and FM5' are then feature-fused, the resulting feature map being called FM45; after a convolution operation, the estimated disparity map Disparity1 is output; FM45 is then taken as input and the above operations are repeated to generate disparity maps Disparity2, Disparity3 and Disparity4 at different scales as the outputs of the depth estimation network;
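As an illustration of step 1.1.6, one decoder stage (up-sample the high-level map, fuse it with the skip feature, predict a disparity map) can be sketched in PyTorch as below; the module name, the sigmoid disparity head, and the toy channel counts and spatial shapes are my own assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One upsample-fuse-predict stage of the depth decoder (illustrative)."""
    def __init__(self, ch_high, ch_low):
        super().__init__()
        # fuse the upsampled high-level map with the skip-connection feature
        self.fuse = nn.Conv2d(ch_high + ch_low, ch_low, 3, padding=1)
        # 1-channel disparity head; sigmoid keeps values in (0, 1)
        self.disp = nn.Sequential(nn.Conv2d(ch_low, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, fm_high, fm_low):
        # up-sample FM5 to the size of FM4, giving FM5'
        fm_up = F.interpolate(fm_high, size=fm_low.shape[2:], mode="bilinear",
                              align_corners=False)
        # feature fusion of FM4 and FM5' produces FM45
        fm_cat = torch.relu(self.fuse(torch.cat([fm_up, fm_low], dim=1)))
        return fm_cat, self.disp(fm_cat)

# toy shapes: FM5 (2048 channels) upsampled to FM4 (1024 channels)
stage = DecoderStage(2048, 1024)
fm45, disp1 = stage(torch.randn(1, 2048, 4, 13), torch.randn(1, 1024, 8, 26))
```

Calling such a stage repeatedly on FM45 with the next skip feature would yield the remaining disparity scales.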
the residual block used in constructing the coding and decoding structure of the depth estimation network is built as follows: a feature map is input, a dimension-reduction operation is performed through a 1×1 convolution to adjust the number of channels, and batch normalization and ReLU activation are applied; a convolution operation is then applied through a blueprint depth convolution module, followed by batch normalization and ReLU activation; a 1×1 convolution is performed again to adjust the number of channels; a channel attention module then learns the correlation among the channels and applies channel-wise attention weights; finally, a shortcut connection adds the initial input feature map to the channel-attention output, and ReLU activation is applied;
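A minimal PyTorch skeleton of the bottleneck residual block just described; a plain 3×3 convolution stands in for the blueprint depth convolution module, `nn.Identity` stands in for the channel attention module, and the 1×1 projection on the shortcut is my own addition so that the channel counts match:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block skeleton: 1x1 reduce, middle conv,
    1x1 expand, attention, shortcut add, ReLU (stand-ins noted above)."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # stand-in for the blueprint depth convolution module
        self.conv = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
                                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.expand = nn.Conv2d(mid_ch, out_ch, 1)
        self.attn = nn.Identity()   # placeholder for the channel attention module
        self.skip = (nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch
                     else nn.Identity())

    def forward(self, x):
        out = self.attn(self.expand(self.conv(self.reduce(x))))
        return torch.relu(out + self.skip(x))  # shortcut connection + ReLU

y = Bottleneck(64, 64, 256)(torch.randn(1, 64, 8, 8))
```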
the construction process of the blueprint depth convolution module is as follows:
first, a point-wise convolution is performed to weight and combine the feature maps along the depth direction, with a 1×1 kernel over M input channels, where M is the number of channels of the previous layer; the number of output channels is M×p, where p is a scaling parameter and p = 0.5; a second point-wise convolution is performed with a 1×1 kernel over the M×p channels, again weighting and combining the upper-layer output feature map along the depth direction, with M output channels; finally, a channel-by-channel (depthwise) convolution is performed using dilated (atrous) convolution with a 3×3 kernel, and dilation rates 1, 1, 2 and 3 are set in the four different residual block stages, respectively;
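The blueprint depth convolution module as described (pointwise M→M·p, pointwise M·p→M, then a dilated depthwise 3×3) might be sketched in PyTorch as follows; the class name and the default bias settings are my own assumptions:

```python
import torch
import torch.nn as nn

class BlueprintConv(nn.Module):
    """Blueprint-style separable convolution: two pointwise convolutions
    forming a channel bottleneck (p = 0.5), then a dilated depthwise 3x3."""
    def __init__(self, channels, dilation=1, p=0.5):
        super().__init__()
        mid = max(1, int(channels * p))
        self.pw1 = nn.Conv2d(channels, mid, kernel_size=1)   # 1x1: M -> M*p
        self.pw2 = nn.Conv2d(mid, channels, kernel_size=1)   # 1x1: M*p -> M
        # channel-by-channel (depthwise) dilated 3x3 convolution
        self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                            padding=dilation, dilation=dilation, groups=channels)

    def forward(self, x):
        return self.dw(self.pw2(self.pw1(x)))

# dilation rates 1, 1, 2, 3 would be used in the four residual stages
out = BlueprintConv(64, dilation=2)(torch.randn(1, 64, 16, 16))
```

With `groups=channels` the 3×3 convolution acts independently per channel, and `padding=dilation` keeps the spatial size unchanged.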
the channel attention module construction process comprises the following steps:
the size of the input feature map is set as W×H×C, where W, H and C denote the width, height and number of channels of the feature map respectively; the first step is a squeeze operation, in which the feature map is compressed into a 1×1×C vector through global average pooling; an excitation operation follows, passing through a fully connected layer (equivalent to a 1×1 convolution) with C×R neurons, where R is a scaling parameter, giving an output of 1×1×(C×R); a second fully connected layer takes the 1×1×(C×R) input and outputs 1×1×C; finally, channel-wise weight multiplication is applied to the input feature map: the original feature tensor is W×H×C, and the per-channel weight vector 1×1×C computed by the channel attention module is multiplied with the two-dimensional matrix of the corresponding channel of the original feature map to produce the output;
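A squeeze-and-excitation style sketch of the channel attention module in PyTorch; the sigmoid gate at the end is my assumption (the text does not name the activation that produces the weights), and I write the reduced width as `channels // r`, reading the text's scaling parameter R as a reduction factor:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pool to 1x1xC, two fully connected
    (1x1 conv) layers with reduction r, sigmoid weights multiplied back
    onto each channel of the input."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: W*H*C -> 1*1*C
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(self.pool(x))                  # per-channel weights, 1x1xC
        return x * w                               # recalibrate each channel

y = ChannelAttention(64, r=16)(torch.randn(2, 64, 8, 8))
```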
the coding and decoding structure of the pose estimation network in step 1.2 is constructed according to the following steps:
step 1.2.1, two pictures are input, an ordinary 7×7 convolution operation is performed to adjust the number of picture channels to 64, and batch normalization and ReLU activation are applied;
step 1.2.2, the feature map FM1 obtained in step 1.2.1 is subjected to a maximum pooling operation and then passed into a residual block to obtain a feature map FM2 with 64 channels;
step 1.2.3, the feature map FM2 obtained in step 1.2.2 is passed into a residual block to obtain a feature map FM3 with 128 channels;
step 1.2.4, the feature map FM3 obtained in step 1.2.3 is passed into a residual block to obtain a feature map FM4 with 256 channels;
step 1.2.5, the feature map FM4 obtained in step 1.2.4 is passed into a residual block to obtain a feature map FM5 with 512 channels;
step 1.2.6, FM5 is taken as input, the number of channels is changed to 256 using a 1×1 convolution, and a ReLU activation is applied to obtain feature map FM6;
step 1.2.7, features are extracted from FM6 using a 3×3 convolution with 256 channels to output FM7;
step 1.2.8, features are extracted from FM7 using a 3×3 convolution with 256 channels to output FM8;
step 1.2.9, the number of feature channels of FM8 is changed to 6 using a 1×1 convolution to output FM9;
step 1.2.10, FM9 is averaged over its spatial dimensions and the result is reshaped into a vector of shape [4, 6]; this vector is the relative camera pose change between adjacent frames;
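Steps 1.2.6–1.2.10 (the pose decoder head) can be sketched in PyTorch roughly as follows; the padding choices and the batch size of 4 in the toy input are my own assumptions:

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Pose decoder head: 1x1 conv to 256 channels, two 3x3 convs, 1x1 conv
    to 6 channels, then a spatial mean giving one 6-DoF vector per sample."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 1), nn.ReLU(inplace=True),      # step 1.2.6
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),  # 1.2.7
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),  # 1.2.8
            nn.Conv2d(256, 6, 1))                                  # step 1.2.9

    def forward(self, fm5):
        out = self.net(fm5)
        return out.mean(dim=(2, 3))   # average over H, W -> [batch, 6]

pose = PoseHead()(torch.randn(4, 512, 4, 13))   # [4, 6] relative pose vectors
```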
the residual block used in constructing the coding and decoding structure of the pose estimation network is built as follows:
a feature map of size W×H×C is input; feature extraction is performed through a 3×3 convolution that changes the number of output channels, giving W×H×2C, followed by batch normalization and ReLU activation; feature extraction is performed again through a 3×3 convolution with output W×H×2C, followed by batch normalization;
the network training in step 2 specifically uses the KITTI data set and is implemented according to the following steps:
step 2.1, the data set is shuffled to generate training samples and test samples;
step 2.2, pre-training weights are acquired;
step 2.3, an Adam optimizer is used, the initial learning rate is set to 1e-4 and automatically reduced during training, with β1 = 0.9 and β2 = 0.999;
step 2.4, the training loss and validation loss are computed after each epoch;
step 2.5, the validation losses of all epochs are compared, and the model parameters with the minimum validation loss are saved;
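A hedged sketch of the training procedure of steps 2.3–2.5 in PyTorch: Adam with lr 1e-4 and betas (0.9, 0.999), a decaying learning rate, and keeping the checkpoint with the lowest validation loss. The StepLR schedule, the function names, and the checkpoint path are my own choices, not the patent's:

```python
import torch

def train(model, train_loader, val_loader, epochs, compute_loss):
    """Train `model`, saving the weights with the lowest validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            opt.step()
        sched.step()                              # reduce the learning rate
        model.eval()
        with torch.no_grad():
            val = sum(compute_loss(model, b).item() for b in val_loader)
        if val < best_val:                        # keep best-validation weights
            best_val = val
            torch.save(model.state_dict(), "best_model.pth")
    return best_val
```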
the specific process of the network test in the step 3 is as follows:
inputting the test image into a network to obtain a depth estimation result, calculating each loss and accuracy in an evaluation index of the depth estimation, and evaluating the network performance.
The invention has the beneficial effects that:
aiming at the problems that, under supervised deep learning methods, label acquisition is range-limited and costly and the sparse raw depth labels cannot be well matched with the pixels of the original image, the monocular depth estimation method based on unsupervised deep learning proposes to use a photometric loss function in place of labels as the training constraint, which guarantees the accuracy of the depth map while avoiding the trouble of label acquisition. An attention mechanism is adopted in the network structure to emphasize the important information of the target object, suppress irrelevant information, and generate more discriminative feature representations. Skip connections make it possible both to exploit the stronger semantic information of high-level features and to incorporate the richer position and detail information of low-level features, improving model performance. Blueprint separable convolution is adopted, greatly reducing the number of parameters while preserving the model's effectiveness.
Drawings
FIG. 1 is a block diagram of a monocular depth estimation method based on unsupervised deep learning according to the present invention;
FIG. 2 is a schematic diagram of a depth estimation network model in the monocular depth estimation method based on unsupervised deep learning according to the present invention;
FIG. 3 is a schematic structural diagram of a pose estimation network model in the monocular depth estimation method based on unsupervised deep learning according to the present invention;
FIG. 4 is a schematic structural diagram of a dense residual block in a depth estimation network model structure in a monocular depth estimation method based on unsupervised deep learning according to the present invention;
FIG. 5 is a schematic structural diagram of a channel attention mechanism module in the method for monocular depth estimation based on unsupervised deep learning according to the present invention;
FIG. 6 is a schematic structural diagram of the blueprint depth convolution in the monocular depth estimation method based on unsupervised deep learning according to the present invention;
FIG. 7 is the result of the depth map estimated in the monocular depth estimation method based on unsupervised deep learning according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a monocular depth estimation method based on unsupervised deep learning, which is implemented according to the following steps:
Step 1, network construction:
The coding structure of the depth estimation network takes a ResNet-50 encoder as its backbone, with 3 embedded modules: 1) dense residual blocks; 2) a channel attention module; 3) blueprint depth convolution. The decoding structure comprises 2 modules: 4) an up-sampling module; 5) a feature fusion module.
In the pose estimation network, the coding structure is a ResNet-18 encoder; the decoding part changes the number of channels through three layers of 1×1 convolution operations and finally outputs the 6D pose.
Two adjacent frames, denoted Ia and Ib, are input. The two frames are fed into the depth estimation network in sequence, and the encoder extracts multi-scale features to obtain 5 feature maps of different scales: FM1, FM2, FM3, FM4, FM5. The five feature maps are passed to the decoder. First, FM5 is taken as input and up-sampled to the size of FM4; the result is called FM5'. FM4 and FM5' are then feature-fused, and the resulting feature map is called FM45; after a convolution operation, the estimated depth map Depth1 is output. FM45 is then taken as input and the above operations are repeated to generate depth maps Depth2, Depth3 and Depth4 at different scales as the outputs of the depth estimation network. In the training state, the 4 depth maps of different scales are restored to the same high resolution through bilinear interpolation, their loss functions are computed at the same scale for joint training, and an accurate high-resolution reconstruction of the target image is performed. In the test state, the depth map Depth4 is output directly;
At the same time, the two frames are input together into the pose estimation network; features are extracted by the encoder, the top-level feature map FM5 is passed to the decoder, and the estimated 6D pose between the two frames is finally output;
1) dense residual block
The residual block is divided into a direct-mapping part and a residual part. The core of the ResNet model is to establish shortcut connections between front and rear layers, ensuring that the network at layer L+1 contains more image information than at layer L, thereby alleviating the network degradation problem in which the image information contained in the feature maps decreases layer by layer as the network deepens. The dense residual block follows the same idea as ResNet but establishes dense connections from all preceding layers to the following layers; as shown in Fig. 4, it consists of two parts, DenseBlock and Transition.
In the DenseBlock, 3 layers are set; the feature maps of these layers are of consistent size and can be concatenated along the channel dimension. The nonlinear composite function in the DenseBlock adopts a BatchNormalization + ReLU + 3×3 Conv structure; each layer in the DenseBlock outputs K feature maps after convolution, so the number of channels of the resulting feature map is K. K is a hyper-parameter, set here to 256. Since features are continuously reused, the input to later layers becomes very large; to reduce the amount of computation, a bottleneck layer is adopted inside the DenseBlock by adding a 1×1 Conv to the structure.
The Transition layer mainly connects two adjacent blocks; it has a BatchNormalization + ReLU + 1×1 Conv structure and mainly serves to compress the model;
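A 3-layer DenseBlock with bottleneck layers as described might look like the following PyTorch sketch; the growth rate here is a small toy value rather than the K = 256 used in the text, and the 4×growth bottleneck width follows the common DenseNet convention (an assumption, not stated in the text):

```python
import torch
import torch.nn as nn

def dense_layer(in_ch, growth):
    """BN + ReLU + 1x1 bottleneck + BN + ReLU + 3x3 conv, DenseNet-style."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * growth, 1),                 # bottleneck layer
        nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
        nn.Conv2d(4 * growth, growth, 3, padding=1))

class DenseBlock(nn.Module):
    """Each layer sees the channel-wise concatenation of all previous
    feature maps; the growth rate K is a hyper-parameter."""
    def __init__(self, in_ch, growth=32, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            dense_layer(in_ch + i * growth, growth) for i in range(layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connection
        return torch.cat(feats, dim=1)

y = DenseBlock(64, growth=32)(torch.randn(1, 64, 8, 8))  # 64 + 3*32 channels
```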
2) channel attention module
The convolution kernel generally aggregates spatial information and characteristic dimension information on a local receptive field to obtain global information. The core of the channel attention module is to model the interdependence relationship between channels explicitly from the relationship between feature channels, specifically, the importance degree of each feature channel is automatically obtained in a learning mode, beneficial channels are selectively enhanced and useless channels are suppressed by using global information, and therefore, the feature map channel adaptive calibration is realized.
The feature map is input and a feature compression operation is performed through global average pooling, compressing the feature map into a 1×1×C vector, where C is the channel dimension. This squeeze operation transforms each two-dimensional feature channel into a real number that has, to some extent, a global receptive field and characterizes the global distribution over that feature channel. An excitation operation follows: a fully connected layer (equivalent to a 1×1 convolution) with C×R neurons, where R is a scaling parameter whose purpose is to reduce the number of channels and thus the amount of computation, produces an output of 1×1×(C×R); a second fully connected layer takes the 1×1×(C×R) input and outputs 1×1×C. Finally, channel weight multiplication is applied: the weights are applied channel by channel to the input feature map, completing the recalibration of the original features along the channel dimension.
3) Blue depth convolution
In some lightweight networks, depthwise separable convolution is used to extract features; its parameter count and computational cost are lower than those of conventional convolution. Depthwise separable convolution relies on cross-kernel correlation, but research shows that intra-kernel correlation is dominant, so standard convolution can be separated more effectively. The blueprint convolution consists of pointwise and depthwise parts:
First, the feature maps are input and a point-wise convolution is performed to weight and combine them along the depth direction, with a 1×1 kernel over M input channels, where M is the number of channels of the previous layer; the number of output channels is M×p, where p is a scaling parameter set to 0.5, whose purpose is to reduce the number of channels and thus the amount of computation. A second point-wise convolution is performed with a 1×1 kernel over the M×p channels, again weighting and combining the upper-layer output feature map along the depth direction, with M output channels. Finally, a channel-by-channel (depthwise) convolution is performed using dilated (atrous) convolution with a 3×3 kernel; dilation rates 1, 1, 2 and 3 are set in the four different residual block stages respectively, which enlarges the receptive field without losing information.
Step 2, network training: the network structure is built with the PyTorch framework, training parameters are optimized with the Adam algorithm, and the network is trained on the KITTI data set. During training, a weighted photometric loss function, a smoothness loss function and a geometric consistency loss serve as supervision signals, and finally the model parameters with the minimum validation loss are saved as the optimal model:
wherein the loss function is specifically defined as follows:
L = α·Lp^M + β·Ls + γ·LGC (1)

where Lp^M is the photometric loss function Lp weighted with the mask M, Ls denotes the smoothness loss, and LGC is the geometric consistency loss, which maximizes data usage by training the network in both the forward and the reverse direction;
The photometric loss function follows the photometric-consistency principle: using the estimated depth map Da and the relative pose Pab, the image Ib is warped by differentiable bilinear interpolation into a synthesized image Ia' corresponding to Ia, forming the following objective function:

Lp = (1/|V|) Σ_{p∈V} ||Ia(p) − Ia'(p)||1 (2)

where V denotes the set of points of Ia that project successfully onto Ib; the L1 norm is used for robustness against outliers. To cope with the illumination changes present in real conditions, the structural similarity loss SSIM is added to normalize the pixel brightness, and the photometric loss function becomes:

Lp = (1/|V|) Σ_{p∈V} ( λi·||Ia(p) − Ia'(p)||1 + λs·(1 − SSIMaa'(p))/2 ) (3)

where λi = 0.15 and λs = 0.85;
The smoothness loss compensates, according to a smoothness prior, for the failure of the photometric loss in low-texture or repetitive-feature regions; the edge-consistent smoothness loss is defined as follows:

Ls = Σ_p ( e^{−∇Ia(p)} · ∇Da(p) )² (4)

where ∇ denotes the first derivative along the spatial directions, ensuring that the smoothness follows the image edges;
the geometric consistency loss is specifically defined as follows:

LGC = (1/|V|) Σ_{p∈V} Ddiff(p) (5)

By minimizing the geometric distance between the depth values predicted for each consecutive pair of images, scale consistency between them is promoted, and during training this consistency can propagate to the entire video sequence;

where Ddiff is defined as follows:

Ddiff(p) = |Db^a(p) − Db'(p)| / (Db^a(p) + Db'(p)) (6)

where Db^a is the depth map of Ib obtained by warping the estimated depth map Da of Ia with the pose Pab estimated between the two frames, and Db' is the interpolated depth map obtained from the estimate;
the mask is defined as follows: a pixel-by-pixel auto-mask selectively weights the pixels, filtering out static pixels, for example when the camera and another object move at similar speeds; the mask-weighted photometric loss is

Lp^M = (1/|V|) Σ_{p∈V} M(p)·Lp(p) (7)

M = 1 − Ddiff (8)

Through the mask M, the weights of moving objects and occluded regions are reduced, decreasing the adverse influence of these regions when computing the loss;
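The geometric-consistency term Ddiff and the mask M = 1 − Ddiff can be computed as in this NumPy sketch; the epsilon guard against division by zero is my own addition:

```python
import numpy as np

def geometry_consistency(d_b_from_a, d_b_interp, eps=1e-7):
    """Per-pixel inconsistency D_diff = |Db^a - Db'| / (Db^a + Db'),
    the geometric-consistency loss (its mean), and the mask M = 1 - D_diff."""
    d_diff = np.abs(d_b_from_a - d_b_interp) / (d_b_from_a + d_b_interp + eps)
    mask = 1.0 - d_diff          # down-weights moving / occluded regions
    return d_diff, d_diff.mean(), mask

d1 = np.full((4, 4), 10.0)
d2 = np.full((4, 4), 10.0)
d2[0, 0] = 30.0                  # one inconsistent pixel (e.g. a moving object)
d_diff, l_gc, mask = geometry_consistency(d1, d2)
```

The inconsistent pixel gets Ddiff = 20/40 = 0.5, so its mask weight drops to 0.5 while consistent pixels keep weight 1.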
where N is the total number of pixels, Di is the estimated depth value of the i-th pixel, and Di* is the true depth value corresponding to the i-th pixel;
and 3, testing the network trained in the step 2:
step 3.1, loading the model and reading a data set;
step 3.2, transmitting the data set image into a depth estimation model and a pose estimation model, and calculating the pose between two frames and the pixel point depth of each frame to obtain a depth map;
and 3.3, calculating various losses and accuracy rates between the estimated depth map and the label by using the depth estimation evaluation indexes.
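Step 3.3's evaluation indexes (absolute relative error, squared relative error, RMSE, log RMSE, and threshold accuracies δ < 1.25^k) can be sketched as:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth evaluation metrics over paired arrays of
    predicted and ground-truth depths (all values assumed positive)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)            # absolute relative
    sq_rel = np.mean((pred - gt) ** 2 / gt)              # squared relative
    rmse = np.sqrt(np.mean((pred - gt) ** 2))            # root mean square
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas

abs_rel, sq_rel, rmse, rmse_log, deltas = depth_metrics(
    [9.0, 10.0, 12.0], [10.0, 10.0, 10.0])
```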
On the KITTI data set the input picture size is 128×416. The losses and accuracies for each evaluation index, compared with other supervised learning algorithms, are shown in Table 1, where Depth denotes supervision with depth labels, Stereo denotes the use of binocular images, Mono denotes the use of monocular images, L denotes the use of semantic labels, and F denotes the addition of optical flow information.
TABLE 1 depth estimation method Performance comparison
The monocular depth estimation method based on unsupervised learning provided by the invention achieves monocular depth estimation through unsupervised learning and eliminates the difficulty of obtaining ground-truth labels in supervised learning. The invention introduces an attention mechanism into the depth estimation network: added within the encoder structure, it obtains richer context information and captures the correlation between features along the channel dimension. To take full advantage of these features, dense blocks are integrated into the network. Blueprint separable convolution replaces ordinary convolution in the bottleneck structure, achieving the goal of reducing parameters. In view synthesis, single-scale images are used to complete the synthesis, and the synthesized images are used to calculate the loss. Ill-posed regions in monocular depth estimation, such as occlusions and dynamic objects, are better handled by the joint action of the two masks. Experiments on the KITTI data set show that the processing speed on video frames can reach 59 FPS, and the absolute relative error, squared relative error, root mean square error, logarithmic root mean square error and the accuracies at different thresholds are respectively: 0.122, 0.934, 4.885, 0.197, 0.866, 0.955, 0.980. The method achieves higher performance in the depth estimation task than other state-of-the-art methods, and with the geometric consistency loss the pose estimation network can achieve globally scale-consistent trajectories, producing accuracy competitive with models trained on stereo video.
Claims (10)
1. A monocular depth estimation method based on unsupervised deep learning is characterized by comprising the following steps:
step 1, constructing a depth estimation and pose estimation network framework based on unsupervised deep learning;
step 2, training the neural network established in the step 1;
and 3, testing the network trained in the step 2.
2. The method for monocular depth estimation based on unsupervised deep learning according to claim 1, wherein the construction in step 1 comprises a feature encoding module and a feature decoding module, and is specifically implemented according to the following steps:
step 1.1, constructing an encoding and decoding structure of a depth estimation network;
and 1.2, constructing a coding and decoding structure of the pose estimation network.
3. The monocular depth estimation method based on unsupervised deep learning of claim 2, wherein the coding and decoding structure construction of the depth estimation network in the step 1.1 is implemented by the following steps:
step 1.1.1, pictures are input, an ordinary 7×7 convolution operation is performed to adjust the number of picture channels to 64, and batch normalization and ReLU activation are applied;
step 1.1.2, the feature map FM1 obtained in step 1.1.1 is subjected to a maximum pooling operation and then passed into a residual block to obtain a feature map FM2 with 256 channels;
step 1.1.3, the feature map FM2 obtained in step 1.1.2 is passed into a residual block to obtain a feature map FM3 with 512 channels;
step 1.1.4, the feature map FM3 obtained in step 1.1.3 is passed into a residual block to obtain a feature map FM4 with 1024 channels;
step 1.1.5, the feature map FM4 obtained in step 1.1.4 is passed into a residual block to obtain a feature map FM5 with 2048 channels;
step 1.1.6, FM5 is taken as input and up-sampled to the size of FM4, the result being called FM5'; FM4 and FM5' are then feature-fused, the resulting feature map being called FM45; after a convolution operation, the estimated disparity map Disparity1 is output; FM45 is then taken as input and the above operations are repeated to generate disparity maps Disparity2, Disparity3 and Disparity4 at different scales as the outputs of the depth estimation network.
4. The method according to claim 3, wherein the residual block structure in the process of constructing the coding and decoding structure of the depth estimation network is specifically: a feature map is input, a dimension-reduction operation is performed through a 1×1 convolution to adjust the number of channels, and batch normalization and ReLU activation are applied; a convolution operation is applied through a blueprint depth convolution module, followed by batch normalization and ReLU activation; a 1×1 convolution is performed again to adjust the number of channels; a channel attention module then learns the correlation among the channels and applies channel-wise attention weights; ReLU activation is used after a shortcut connection adds the original input feature map to the channel-attention output.
5. The monocular depth estimation method based on unsupervised deep learning according to claim 4, wherein the blueprint depth convolution module is constructed as follows:
performing a point-by-point convolution that forms a weighted combination of the feature maps along the depth direction, the convolution kernel having size 1 × 1 × M, where M is the number of channels of the previous layer, with M × p output channels, p being a scaling parameter set to p = 0.5; performing a second point-by-point convolution with kernel size 1 × 1 × (M × p), again forming a weighted combination of the previous layer's output feature maps along the depth direction, with M output channels; and finally performing a channel-by-channel (depthwise) convolution, implemented as a dilated (atrous) convolution with a 3 × 3 kernel, the dilation rates being set to 1, 1, 2 and 3 in the four layers of different residual blocks, respectively.
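A minimal numpy sketch of the blueprint depth convolution described above: two point-by-point channel mixes (M → M × p → M) followed by a dilated channel-by-channel 3 × 3 convolution. The zero padding and the random weights are illustrative assumptions, not part of the claim:

```python
import numpy as np

def pointwise(fm, w):
    """1x1 convolution: a per-pixel linear mix of channels.
    fm: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, fm)

def depthwise3x3(fm, k, dilation=1):
    """Channel-by-channel 3x3 convolution with dilation ('holes'), zero padded.
    fm: (C, H, W); k: (C, 3, 3)."""
    C, H, W = fm.shape
    d = dilation
    pad = np.pad(fm, ((0, 0), (d, d), (d, d)))
    out = np.zeros_like(fm)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * pad[:, i*d:i*d+H, j*d:j*d+W]
    return out

def blueprint_conv(fm, w1, w2, k, dilation=1):
    """Pointwise M -> M*p, pointwise M*p -> M, then dilated depthwise 3x3."""
    return depthwise3x3(pointwise(pointwise(fm, w1), w2), k, dilation)

rng = np.random.default_rng(0)
M, p = 8, 0.5
fm = rng.normal(size=(M, 16, 16))
w1 = rng.normal(size=(int(M * p), M))    # first pointwise: M -> M*p
w2 = rng.normal(size=(M, int(M * p)))    # second pointwise: M*p -> M
k = rng.normal(size=(M, 3, 3))           # one 3x3 kernel per channel
out = blueprint_conv(fm, w1, w2, k, dilation=2)
print(out.shape)                         # spatial size and channel count preserved
```

With dilation 2 the 3 × 3 kernel covers a 5 × 5 receptive field, matching the purpose of the "holes" in the claim.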
6. The method of claim 4, wherein the channel attention module is constructed as follows:
let the size of the input feature map be W × H × C, where W, H and C denote the width, height and number of channels of the feature map, respectively; the first step is a squeeze operation, in which the feature map is compressed into a 1 × 1 × C vector through global average pooling; next comes an excitation operation: a fully connected layer with C × R neurons, where R is a scaling parameter, produces an output of size 1 × 1 × (C × R); this is followed by another fully connected layer whose input is 1 × 1 × (C × R) and whose output is 1 × 1 × C; finally, the input feature map is reweighted channel by channel: the original feature tensor is W × H × C, and each channel's two-dimensional W × H matrix is multiplied by the corresponding weight in the 1 × 1 × C weight vector computed by the channel attention module, and the result is output.
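The squeeze-and-excitation style flow above can be sketched in numpy. The ReLU between the two fully connected layers and the sigmoid gate on the weights are standard SE-block choices assumed here, since the claim does not name the activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fm, w1, w2):
    """SE-style channel attention.
    fm: (C, H, W); w1: (C*R, C) excitation layer 1; w2: (C, C*R) layer 2."""
    s = fm.mean(axis=(1, 2))        # squeeze: global average pool -> (C,)
    e = np.maximum(w1 @ s, 0.0)     # first FC (C -> C*R) + ReLU (assumed)
    a = sigmoid(w2 @ e)             # second FC (C*R -> C) + sigmoid gate (assumed)
    return fm * a[:, None, None]    # reweight each channel's 2-D W x H slice

rng = np.random.default_rng(1)
C, R = 16, 0.25
fm = rng.normal(size=(C, 8, 8))
w1 = rng.normal(size=(int(C * R), C))
w2 = rng.normal(size=(C, int(C * R)))
out = channel_attention(fm, w1, w2)
print(out.shape)                    # same shape; channels rescaled by weights in (0, 1)
```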
7. The monocular depth estimation method based on unsupervised deep learning according to claim 2, wherein the coding and decoding structure of the pose estimation network in step 1.2 is implemented by the following steps:
step 1.2.1, inputting two pictures, performing a standard 7 × 7 convolution operation, adjusting the number of picture channels to 64, and performing batch normalization and ReLU activation;
step 1.2.2, passing the feature map FM1 obtained in step 1.2.1 through a maximum pooling operation and then into a residual block to obtain a feature map FM2 with 64 channels;
step 1.2.3, passing the feature map FM2 obtained in step 1.2.2 into a residual block to obtain a feature map FM3 with 128 channels;
step 1.2.4, passing the feature map FM3 obtained in step 1.2.3 into a residual block to obtain a feature map FM4 with 256 channels;
step 1.2.5, passing the feature map FM4 obtained in step 1.2.4 into a residual block to obtain a feature map FM5 with 512 channels;
step 1.2.6, taking FM5 as input, changing the number of channels to 256 with a 1 × 1 convolution, and then applying ReLU activation to obtain a feature map FM6;
step 1.2.7, extracting features from FM6 using a 3 × 3 convolution with 256 channels to output FM7;
step 1.2.8, extracting features from FM7 using a 3 × 3 convolution with 256 channels to output FM8;
step 1.2.9, changing the number of feature map channels of FM8 to 6 using a 1 × 1 convolution to output FM9;
step 1.2.10, averaging FM9 over its spatial dimensions and transforming the result into a vector of shape [4, 6]; this vector represents the relative camera pose change between adjacent frames.
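Step 1.2.10 amounts to a spatial average followed by a reshape. A numpy sketch, assuming the 4 in [4, 6] is a batch of frame pairs (the claim does not say) and that the 6 values parameterise rotation plus translation, e.g. axis-angle, which is a common but here assumed choice:

```python
import numpy as np

# FM9 from step 1.2.9: a batch of 4 frame pairs, 6 channels, spatial map
fm9 = np.random.default_rng(2).normal(size=(4, 6, 8, 26))

# Step 1.2.10: average over the spatial dimensions -> one 6-DoF vector per pair
pose = fm9.mean(axis=(2, 3))   # shape (4, 6): relative camera pose per adjacent-frame pair
print(pose.shape)
```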
8. The monocular depth estimation method based on unsupervised deep learning according to claim 7, wherein the residual block used in constructing the coding and decoding structure of the pose estimation network is built as follows:
inputting a feature map of size W × H × C, performing feature extraction through a 3 × 3 convolution that doubles the channel count to output W × H × 2C, followed by batch normalization and ReLU activation; performing feature extraction again through another 3 × 3 convolution to output W × H × 2C, followed by batch normalization.
9. The method of claim 1, wherein the network training in step 2 specifically trains the network on the KITTI dataset, implemented by the following steps:
step 2.1, shuffling the dataset to generate training samples and test samples;
step 2.2, acquiring pre-training weights;
step 2.3, using an Adam optimizer with an initial learning rate of 1e-4, automatically reduced during training, and β1 = 0.9, β2 = 0.999;
step 2.4, calculating the training loss and validation loss after each epoch;
step 2.5, comparing the validation loss across epochs and saving the model parameters with the minimum validation loss.
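The Adam settings in step 2.3 (initial learning rate 1e-4, β1 = 0.9, β2 = 0.999) follow the standard bias-corrected update rule, sketched here in numpy on a toy quadratic loss:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the hyperparameters from step 2.3."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment (variance) estimate
    m_hat = m / (1 - beta1**t)               # bias correction for warm-up
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 4):
    grad = 2 * theta                         # gradient of the toy loss ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                 # each step moves ~lr toward zero
```

Because the bias-corrected ratio m_hat / sqrt(v_hat) is close to the gradient's sign, each parameter moves by roughly the learning rate per step, which is why the initial rate matters more than the raw gradient scale.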
10. The monocular depth estimation method based on unsupervised deep learning according to claim 1, wherein the network test in step 3 specifically comprises:
inputting the test images into the network to obtain depth estimation results, calculating each loss and accuracy among the depth estimation evaluation indicators, and evaluating the network performance.
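The claim does not enumerate the evaluation indicators; a numpy sketch of the metric set conventionally used for KITTI monocular depth evaluation (absolute relative error, squared relative error, RMSE, log-RMSE and the δ < 1.25^i accuracies) is one plausible instantiation:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Conventional monocular depth evaluation metrics (assumed metric set)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    abs_rel  = np.mean(np.abs(pred - gt) / gt)
    sq_rel   = np.mean((pred - gt) ** 2 / gt)
    rmse     = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = {f'delta<1.25^{i}': np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return abs_rel, sq_rel, rmse, rmse_log, acc

# Toy example: three predicted vs. ground-truth depths (metres)
pred = np.array([1.0, 2.0, 4.0])
gt   = np.array([1.0, 2.0, 5.0])
abs_rel, sq_rel, rmse, rmse_log, acc = depth_metrics(pred, gt)
print(round(abs_rel, 4), acc['delta<1.25^1'])
```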
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111297537.5A CN114170286B (en) | 2021-11-04 | 2021-11-04 | Monocular depth estimation method based on unsupervised deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114170286A true CN114170286A (en) | 2022-03-11 |
CN114170286B CN114170286B (en) | 2023-04-28 |
Family
ID=80478016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111297537.5A Active CN114170286B (en) | 2021-11-04 | 2021-11-04 | Monocular depth estimation method based on unsupervised deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114170286B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108986166A (en) * | 2018-07-20 | 2018-12-11 | 山东大学 | Monocular visual odometry prediction method and odometer based on semi-supervised learning |
CN109377530A (en) * | 2018-11-30 | 2019-02-22 | 天津大学 | Binocular depth estimation method based on deep neural network |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | Camera pose estimation method based on deep neural network |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | Monocular scene depth estimation method based on unsupervised convolutional neural network |
CN111354030A (en) * | 2020-02-29 | 2020-06-30 | 同济大学 | Method for generating unsupervised monocular image depth map embedded into SENET unit |
CN111739082A (en) * | 2020-06-15 | 2020-10-02 | 大连理工大学 | Stereo vision unsupervised depth estimation method based on convolutional neural network |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114998411A (en) * | 2022-04-29 | 2022-09-02 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss |
CN114998411B (en) * | 2022-04-29 | 2024-01-09 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss |
WO2023245321A1 (en) * | 2022-06-20 | 2023-12-28 | 北京小米移动软件有限公司 | Image depth prediction method and apparatus, device, and storage medium |
CN116245927A (en) * | 2023-02-09 | 2023-06-09 | 湖北工业大学 | ConvDepth-based self-supervision monocular depth estimation method and system |
CN116245927B (en) * | 2023-02-09 | 2024-01-16 | 湖北工业大学 | ConvDepth-based self-supervision monocular depth estimation method and system |
CN118397063A (en) * | 2024-04-22 | 2024-07-26 | 中国矿业大学 | Self-supervision monocular depth estimation method and system in unmanned monorail crane of coal mine |
Also Published As
Publication number | Publication date |
---|---|
CN114170286B (en) | 2023-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111739078B (en) | Monocular unsupervised depth estimation method based on context attention mechanism | |
CN109064507B (en) | Multi-motion-stream deep convolution network model method for video prediction | |
CN111325794B (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
CN114170286B (en) | Monocular depth estimation method based on unsupervised deep learning | |
CN111784602B (en) | Method for generating countermeasure network for image restoration | |
CN111739082B (en) | Stereo vision unsupervised depth estimation method based on convolutional neural network | |
CN110782490A (en) | Video depth map estimation method and device with space-time consistency | |
CN110689599A (en) | 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement | |
CN111383200B (en) | CFA image demosaicing method based on generated antagonistic neural network | |
CN114463218B (en) | Video deblurring method based on event data driving | |
CN112270691B (en) | Monocular video structure and motion prediction method based on dynamic filter network | |
CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
CN112767283A (en) | Non-uniform image defogging method based on multi-image block division | |
CN109871790A (en) | A kind of video decolorizing method based on hybrid production style | |
CN113077505A (en) | Optimization method of monocular depth estimation network based on contrast learning | |
CN115631223A (en) | Multi-view stereo reconstruction method based on self-adaptive learning and aggregation | |
CN110889868A (en) | Monocular image depth estimation method combining gradient and texture features | |
CN117710429A (en) | Improved lightweight monocular depth estimation method integrating CNN and transducer | |
CN112446245A (en) | Efficient motion characterization method and device based on small displacement of motion boundary | |
CN114022371B (en) | Defogging device and defogging method based on space and channel attention residual error network | |
CN117333682A (en) | Multi-view three-dimensional reconstruction method based on self-attention mechanism | |
CN114119704A (en) | Light field image depth estimation method based on spatial pyramid pooling | |
CN111127587B (en) | Reference-free image quality map generation method based on countermeasure generation network | |
Zhang et al. | Unsupervised learning of depth estimation based on attention model from monocular images | |
TWI748426B (en) | Method, system and computer program product for generating depth maps of monocular video frames |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||