CN112085702A - Monocular depth estimation method based on sparse depth of key region - Google Patents
Monocular depth estimation method based on sparse depth of key region
- Publication number
- CN112085702A CN112085702A CN202010777954.9A CN202010777954A CN112085702A CN 112085702 A CN112085702 A CN 112085702A CN 202010777954 A CN202010777954 A CN 202010777954A CN 112085702 A CN112085702 A CN 112085702A
- Authority
- CN
- China
- Prior art keywords
- depth
- layer
- sparse
- network
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000000034 method Methods 0.000 title claims abstract description 31
- 238000012549 training Methods 0.000 claims abstract description 40
- 238000012360 testing method Methods 0.000 claims abstract description 23
- 238000005070 sampling Methods 0.000 claims abstract description 15
- 238000013528 artificial neural network Methods 0.000 claims abstract description 11
- 230000000694 effects Effects 0.000 claims abstract description 7
- 238000000605 extraction Methods 0.000 claims abstract description 7
- 238000010606 normalization Methods 0.000 claims description 14
- 239000011159 matrix material Substances 0.000 claims description 13
- 238000011176 pooling Methods 0.000 claims description 8
- 238000007781 pre-processing Methods 0.000 claims description 7
- 238000011156 evaluation Methods 0.000 claims description 6
- 238000001914 filtration Methods 0.000 claims description 4
- 238000007796 conventional method Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 12
- 238000013135 deep learning Methods 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011161 development Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000003708 edge detection Methods 0.000 description 1
- 238000005265 energy consumption Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000001575 pathological effect Effects 0.000 description 1
- 238000000513 principal component analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a monocular depth estimation method based on the sparse depth of key regions. The RGB images of the training set, together with the corresponding sparse depths, are input into an encoder for feature extraction and then up-sampled to obtain a predicted depth map of the same size as the input image; the loss function of the network is computed, back propagation is performed, and the connection weights are optimized with the selected optimizer and its parameters. After multiple rounds of training, the final network model is obtained and then evaluated on the test set. The method has the advantages that the sampling points are more reasonable and targeted, selecting the points that are key to the neural network's depth estimate; the quantitative results of depth estimation are improved, the predicted depth is more accurate with smaller error than that of conventional methods, and the generated depth map is clearer.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a monocular depth estimation method based on a sparse depth map.
Background
Depth estimation is widely applied in engineering practice, for example in autonomous driving and augmented reality; it is a very important direction in computer vision and a very popular research topic in recent years. Besides monocular camera ranging, several other methods are common: ranging with a lidar, whose accuracy is unmatched by the alternatives; ranging with a structured-light sensor such as the Kinect; and binocular (stereo) camera ranging. Lidar ranging has extremely high accuracy, but its selling price of thousands of dollars is often prohibitive, and laser sensors are easily affected by environmental factors such as haze. Structured-light sensors such as the Kinect have a short detection range, relatively high power consumption, and are sensitive to illumination intensity. Binocular camera ranging requires careful manual calibration.
Given the various drawbacks of the above methods and the rapid development of deep learning in recent years, depth estimation with a monocular camera combined with deep-learning-based methods has attracted wide attention from researchers and developed quickly. Many deep-learning-based monocular depth estimation methods have been proposed in the literature; by the kind of supervision information used, they can be roughly divided into three types: supervised, weakly supervised, and unsupervised. The main input data are still RGB images, and although this line of work has made progress in recent years, depth estimation from RGB images alone is inherently an ill-posed problem, so the overall accuracy and reliability remain unsatisfactory.
Therefore, sparse depth is adopted as a supplement to the RGB image, providing reference points from which the neural network predicts depth; the sparse depth can be acquired in various ways, such as SLAM or an inexpensive lidar. This greatly improves the accuracy of monocular depth estimation.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: because of the high cost of lidar and the low accuracy and reliability of monocular depth estimation from RGB images alone, existing deep-learning-based monocular depth estimation methods cannot provide an accurate depth map.
In view of the above situation, the present invention provides a monocular depth estimation method based on a sparse depth map: the sparse depth of key regions is extracted as auxiliary input for monocular depth estimation, and a deep neural network extracts features to generate a dense depth map. The main design points are: 1) splitting and augmenting the data set; 2) designing a network with an encoder-decoder (auto-encoder) structure that performs feature extraction and up-sampling on the input data; 3) extracting key regions of the RGB image to obtain the sparse depth of those regions; 4) testing the trained model on a test set.
A monocular depth estimation method based on sparse depth of a key region comprises the following steps:
Step 1, preprocessing the data set: cropping, rotating and changing the brightness of the training set, and cropping the test set.
Step 2, designing a network model structure;
The network model is divided into two parts: an encoder and an up-sampling network. The encoder adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer. The up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer;
and 3, training the network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a predicted depth map pred with the same size as the input image.
And 4, calculating a loss function of the network, performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters. And training for multiple rounds to obtain a final network model.
And 5, testing the network model. And inputting the data image of the test set and the corresponding sparse depth into the trained model to obtain a predicted depth map pred, and calculating each evaluation index.
The above steps are specifically described below:
the step 1 is as follows:
The preprocessing of the training set comprises: scaling the training data, randomly flipping it horizontally, randomly rotating it, center-cropping it, applying color data enhancement, and finally normalizing it.
The preprocessing of the test set comprises: scaling the test data, center-cropping it, and finally normalizing it.
The network structure of step 2 is specifically as follows:
The encoder of the network adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer. The up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer. The UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the upper and lower branches respectively, the branch outputs are added, and the result is activated by a final ReLU. The lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer. The upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
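The UpProj module described above can be sketched in PyTorch as follows. This is a hedged sketch, not the authoritative implementation: the unpooling layer is approximated by 2× nearest-neighbour upsampling, the kernel size of the pre-branch convolution is not stated in the text and is taken as 3 × 3 here, and the channel counts are free parameters.

```python
import torch
import torch.nn as nn

class UpProj(nn.Module):
    """Sketch of the UpProj block: unpool, conv+BN+ReLU, then two
    branches whose outputs are summed and passed through a final ReLU."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 'unpool' approximated by 2x nearest-neighbour upsampling (assumption)
        self.unpool = nn.Upsample(scale_factor=2, mode="nearest")
        # conv + normalization + ReLU applied before the two branches
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # lower branch: 5x5 conv + BN, ReLU, then 3x3 conv + BN
        self.lower = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 5, padding=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # upper branch: 5x5 conv + BN
        self.upper = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 5, padding=2),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.stem(self.unpool(x))                    # upsample, then conv+BN+ReLU
        return self.relu(self.lower(x) + self.upper(x))  # add branches, final ReLU
```

Chaining four such modules halves the channel count and doubles the spatial size at each stage, consistent with the 1024 × 8 × 10 → 64 × 128 × 160 progression described in the embodiment.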
The sparse depth in step 3 is obtained specifically as follows:
and performing Gaussian filtering on the RGB image in the input training set, wherein the Gaussian filtering adopts a convolution layer with a convolution kernel size of 3 x 3. And extracting the image edge by using a canny operator to obtain a mask. And generating a random number matrix s _ mask which has the same size as the mask and the value of 0 to 1 through numpy, and setting a threshold prob to enable the value of the random number matrix s _ mask which is smaller than the threshold prob to be 0 and the rest to be 1. And carrying out bitwise AND operation on the random number matrix s _ mask and the extracted mask to obtain a final sparse depth mask _ depth _ mask. And generating a full 0 matrix sparse _ depth with the same size as the depth map through numpy, and taking the depth value of the position with the sparse depth mask sparse _ depth _ mask value of 1 in the corresponding depth map in the data set as the value of the corresponding position in the full 0 matrix sparse _ depth to obtain the final sparse depth of the key area.
Further, the parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
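As a concrete illustration, the extraction pipeline above can be sketched in numpy alone. Two assumptions are made to keep the sketch dependency-free and are not from the patent: a simple gradient-magnitude threshold stands in for the Canny operator, and a pixel is sampled when its random value falls below prob (consistent with prob being the fraction of edge pixels to keep).

```python
import numpy as np

def key_region_sparse_depth(gray, depth, prob, seed=0):
    """Sketch of key-region sparse-depth extraction.

    `gray` is a 2-D grayscale image, `depth` the aligned dense depth map.
    A gradient-magnitude threshold is used as a stand-in for the Canny
    edge detector of the patent (an assumption for this sketch).
    """
    # Edge mask (proxy for Canny): finite differences plus a threshold.
    gx = np.abs(np.diff(gray.astype(float), axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray.astype(float), axis=0, prepend=gray[:1, :]))
    mask = ((gx + gy) > 30).astype(np.uint8)

    # Random matrix in [0, 1); an entry below `prob` marks a sampled pixel.
    rng = np.random.default_rng(seed)
    s_mask = (rng.random(mask.shape) < prob).astype(np.uint8)

    # Bitwise AND: sample only on edge pixels.
    sparse_depth_mask = s_mask & mask

    # All-zero matrix; copy ground-truth depth at the sampled positions.
    sparse_depth = np.zeros_like(depth)
    sparse_depth[sparse_depth_mask == 1] = depth[sparse_depth_mask == 1]
    return sparse_depth
```

With prob = 1.0 every edge pixel is kept; smaller values of prob thin the edge set down to the desired number of samples.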
The loss function L and related parameters of step 4 are as follows:
the loss function L contains three terms:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the training-set Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network, ŷ_p is one pixel of the depth predicted by the neural network, and y_p is a pixel of the training-set Ground Truth.
l_grad is the gradient error, used to make the edges of the generated depth map sharper and clearer; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

where ∇_x is the differential in the x direction and ∇_y is the differential in the y direction.
l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))

where μ denotes the mean, σ_x² the variance of x, σ_y² the variance of y, σ_xy the covariance of x and y, and c1, c2 are two constants.
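Numerically, the three loss terms can be sketched as follows. This is a hedged sketch: the depth term is taken in L1 form, and the SSIM term is computed globally over the whole image with the constants from the embodiment, whereas practical implementations often use a local window.

```python
import numpy as np

def depth_loss(pred, gt, c1=1.0, c2=9.0):
    """Sketch of the three-term loss L = l_depth + l_grad + l_ssim."""
    n = pred.size
    # l_depth: mean absolute error between prediction and ground truth.
    l_depth = np.abs(pred - gt).sum() / n

    # l_grad: finite-difference gradients of the residual in x and y.
    res = pred - gt
    l_grad = (np.abs(np.diff(res, axis=1)).mean()
              + np.abs(np.diff(res, axis=0)).mean())

    # l_ssim: 1 - SSIM, computed globally (an assumption for this sketch).
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov = ((pred - mu_x) * (gt - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
    l_ssim = 1.0 - ssim
    return l_depth + l_grad + l_ssim
```

For identical prediction and ground truth the loss is zero (SSIM equals 1), and any deviation increases one or more of the three terms.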
The evaluation indices in step 5 include the root-mean-square error RMSE, the mean relative error REL, and the threshold accuracy δ_i:

RMSE = sqrt( (1/n) Σ_p (ŷ_p − y_p)² ),  REL = (1/n) Σ_p |ŷ_p − y_p| / y_p,
δ_i = Card( { ŷ_p : max(ŷ_p / y_p, y_p / ŷ_p) < 1.25^i } ) / Card( { y_p } )

where Card(x) is the number of elements in the set x, used in the present invention to count pixels.
The invention has the following beneficial effects:
According to the invention, the sparse depth at object edges is extracted as auxiliary input, so the network obtains depth information at key positions and can make better predictions. Meanwhile, adding the gradient loss and the structural-similarity loss to the loss function further improves the quality of the predicted image: the generated image is sharper with clearer edges.
1. The sampling points are more reasonable and targeted, selecting the points that are key to the neural network's depth estimation.
2. The quantitative results of depth estimation are improved: compared with conventional methods, the predicted depth is more accurate and the error smaller.
3. The generated depth map is clearer.
Drawings
FIG. 1 is an algorithm flow diagram;
FIG. 2 is a flow chart of edge sparse depth extraction;
fig. 3 is a network configuration diagram.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the present invention comprises the steps of:
step 1, preprocessing a data set, taking NYUv2 data set as an example, the size of an original image (the original image includes an RGB image and a depth map) is 480 × 640, and performing the following processing:
Step 1.1, the short side is scaled to 240 and the long side correspondingly to 320, reducing the resources consumed by subsequent operations.
Step 1.2, random horizontal flipping and random rotation are applied, with a rotation angle of ±5°, enhancing the diversity of the data.
Step 1.3, center cropping is performed, cropping the image to 228 × 304 and converting it to the tensor data type.
Step 1.4, principal component analysis of the color channels is performed, with eigenvalues
[0.2175, 0.0188, 0.0045]
and the corresponding eigenvectors.
Step 1.5, color jitter is applied: contrast, brightness and saturation each vary within ±0.4.
Step 1.6, data normalization is performed, with mean
[0.485, 0.456, 0.406]
and standard deviation
[0.229, 0.224, 0.225].
The training set undergoes all 6 processing steps above; the test set undergoes only steps 1.1, 1.3 and 1.6.
Step 2. As shown in fig. 3, the network backbone comprises an encoder and a decoder that performs up-sampling. The encoder is ResNet-50, which consists of 6 parts: a 7 × 7 convolution layer; four convolution modules, Block1 through Block4, with 9, 12, 18 and 9 convolution layers respectively; and, in place of the removed final average-pooling and fully connected layers, a convolution layer with kernel size 1 × 1 and a normalization layer. After the input enters the encoder, the first 5 parts produce a 2048 × 8 × 10 feature vector, which the 1 × 1 convolution of the 6th part reduces to a 1024 × 8 × 10 feature vector.
Then comes the up-sampling part, which also comprises 6 parts: the first 4 parts are repeated up-sampling layers, i.e. UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer. The UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the two branches respectively, whose outputs are added and activated. The lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer. The upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
And 3, training the network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a predicted depth map pred with the same size as the input image.
As shown in fig. 2, the method for extracting edge sparse depth includes:
Taking the NYUv2 dataset as an example, the picture processed in step 1 has size 228 × 304. The RGB image in the dataset is Gaussian-filtered with kernel size 3 × 3, and the image edges are extracted with the Canny operator to obtain a mask. A random matrix s_mask of size 228 × 304, with values between 0 and 1, is generated, and a threshold prob is set so that entries of s_mask smaller than prob become 1 and the rest become 0:

prob = num_samples / Card({mask ≠ 0})
The meaning of the formula is that the threshold is set to the ratio of the number of pixels to be sampled to the number of non-zero elements in the edge-detection result; in this embodiment num_samples = 200. A bitwise AND of s_mask and the extracted mask gives the final sparse depth mask sparse_depth_mask. An all-zero matrix sparse_depth of size 228 × 304 is generated, and at every position where sparse_depth_mask is 1 the depth value of the corresponding depth map in the data set is copied into sparse_depth, finally giving the key-region sparse depth sparse_depth. The image and sparse_depth are fused into a 4-channel matrix to obtain the final network input, of size 4 × 228 × 304 per sample.
The parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
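The threshold computation and the 4-channel fusion just described can be sketched as follows (the function names are illustrative, not from the patent):

```python
import numpy as np

def sampling_threshold(mask, num_samples=200):
    """prob = num_samples / Card({mask != 0}): the threshold is the ratio
    of the number of pixels to sample to the number of edge pixels."""
    nonzero = np.count_nonzero(mask)
    return num_samples / nonzero if nonzero else 0.0

def fuse_input(image, sparse_depth):
    """Stack a 3 x 228 x 304 RGB image with a 228 x 304 sparse depth map
    into the 4 x 228 x 304 network input described above."""
    return np.concatenate([image, sparse_depth[None]], axis=0)
```

On average, num_samples edge pixels then survive the random mask, regardless of how many edge pixels the Canny step produced.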
The RGB images of the training set, together with the corresponding sparse depths, are input into the encoder for feature extraction. Each time the encoder's output feature vector passes through an up-sampling layer, the number of channels is halved and the height and width are doubled; after the four up-sampling layers the feature vector changes from 1024 × 8 × 10 to 64 × 128 × 160, and after the 3 × 3 convolution layer and the bilinear interpolation layer it becomes 1 × 228 × 304, the same size as the depth map processed in step 1.
Step 4. The loss function L is constructed, the error of each forward pass is computed, and the weights of the neural network are updated by the back-propagation algorithm. Taking the NYUv2 data set as an example, it contains 50688 training samples; with the batch size set to 32, each epoch performs 1584 iterations, and 20 epochs are trained in total. At each iteration the optimizer searches for the weights minimizing the loss function, yielding the final weights when training ends; by comparing the models obtained after each epoch, the best-performing model can be saved.
The specific loss function is:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network (after the processing of step 1, n = 69312 in this embodiment), ŷ_p is one pixel of the depth predicted by the neural network, and y_p is one pixel of the Ground Truth in the data set.
l_grad is the gradient error, used to make the edges of the generated depth map clearer and sharper; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))
where μ denotes the mean, σ_x² the variance of x, σ_y² the variance of y, σ_xy the covariance of x and y, and c1, c2 equal 1 and 9 respectively.
The optimizer is stochastic gradient descent (SGD) with a learning rate of 0.01, decayed to 10% of its value every 5 epochs; the momentum is 0.9 and the weight decay is 0.0004.
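The stated schedule amounts to a step decay of the learning rate, which can be written as a small function:

```python
def learning_rate(epoch, base_lr=0.01, decay_every=5, gamma=0.1):
    """Step learning-rate schedule from the embodiment: start at 0.01
    and multiply by 0.1 every 5 epochs ('decayed to 10% every 5 rounds')."""
    return base_lr * gamma ** (epoch // decay_every)
```

Over the 20 training epochs of the embodiment this yields learning rates 0.01, 0.001, 1e-4 and 1e-5 in successive 5-epoch blocks.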
Step 5. The model saved in step 4 is loaded and tested on the test set. For each sample of the test set, the trained model produces a predicted depth map, which is compared with the Ground Truth of the test set, and each evaluation index is computed.
The training and testing environment in this embodiment is:
the system comprises the following steps: ubuntu 16.04
CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz × 4
Memory: 128GB
GPU:RTX2080Ti*4
The evaluation indices in this embodiment include the root-mean-square error RMSE, the mean relative error REL, and the threshold accuracy δ_1:

RMSE = sqrt( (1/n) Σ_p (ŷ_p − y_p)² ),  REL = (1/n) Σ_p |ŷ_p − y_p| / y_p,
δ_1 = Card( { ŷ_p : max(ŷ_p / y_p, y_p / ŷ_p) < 1.25 } ) / Card( { y_p } )

where Card(x) is the number of elements in the set x, here the number of pixels.
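Under the standard definitions of these indices (an assumption consistent with the RMSE, REL and δ1 figures reported below), they can be computed as:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rel(pred, gt):
    """Mean relative error, normalized by the ground-truth depth."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta(pred, gt, thr=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold;
    Card(.) from the text appears here as count_nonzero / size."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.count_nonzero(ratio < thr) / ratio.size)
```

thr = 1.25 gives δ_1; 1.25² and 1.25³ give the δ_2 and δ_3 variants commonly reported alongside it.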
The test results are:
RMSE = 0.221, REL = 0.044, δ_1 = 0.972.
Claims (7)
1. a monocular depth estimation method based on sparse depth of a key region is characterized by comprising the following steps:
step 1, preprocessing a data set, cutting, rotating and changing brightness of a training set, and cutting a test set;
step 2, designing a network model structure;
the network model is divided into two parts: an encoder and an up-sampling network; the encoder adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer; the up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer;
step 3, training a network, inputting the RGB image in the training set and the corresponding sparse depth into an encoder for feature extraction, and then performing up-sampling to obtain a prediction depth map pred with the same size as the input image;
step 4, calculating a loss function of the network, performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters; training for multiple rounds to obtain a final network model;
step 5, testing the network model; and inputting the data image of the test set and the corresponding sparse depth into the trained model to obtain a predicted depth map pred, and calculating each evaluation index.
2. The method for monocular depth estimation based on sparse depth of key regions according to claim 1, wherein step 1 specifically comprises:
the preprocessing of the training set comprises: scaling the training data, randomly flipping it horizontally, randomly rotating it, center-cropping it, applying color data enhancement, and finally normalizing it;
the preprocessing of the test set comprises: scaling the test data, center-cropping it, and finally normalizing it.
3. The method for monocular depth estimation based on sparse depth of key regions as claimed in claim 2, wherein the network structure of step 2 is specifically as follows:
the encoder of the network adopts ResNet-50 with the final average-pooling layer and fully connected layer removed and replaced by a convolution layer with kernel size 1 × 1 and a normalization layer; the up-sampling network is divided into 6 parts: the first 4 parts are UpProj modules, followed by a convolution layer with kernel size 3 × 3 and a bilinear interpolation layer; the UpProj module comprises an upper branch and a lower branch: the input is first up-sampled by an unpooling layer, then passes through a convolution layer and a normalization layer and is activated by the ReLU function; it then passes through the upper and lower branches respectively, the branch outputs are added, and the result is activated by a final ReLU; the lower branch is, in order, a convolution layer with kernel size 5 × 5 and a normalization layer, followed, after ReLU activation, by a convolution layer with kernel size 3 × 3 and a normalization layer; the upper branch is a convolution layer with kernel size 5 × 5 and a normalization layer.
4. The method for monocular depth estimation based on sparse depth of a key region according to claim 1, wherein the sparse depth of step 3 is obtained specifically as follows:
Gaussian filtering is applied to the input RGB images of the training set, implemented as a convolution with kernel size 3 × 3; the image edges are extracted with the Canny operator to obtain a mask; a random matrix s_mask of the same size as the mask, with values between 0 and 1, is generated with numpy, and a threshold prob is set so that entries of s_mask smaller than prob become 1 and the rest become 0; a bitwise AND of s_mask and the extracted mask gives the final sparse depth mask sparse_depth_mask; an all-zero matrix sparse_depth of the same size as the depth map is generated with numpy, and at every position where sparse_depth_mask is 1 the depth value of the corresponding depth map in the data set is copied into sparse_depth, giving the final sparse depth of the key region.
5. The method of claim 3, wherein the loss function L and related parameters in step 4 are as follows:
the loss function L contains three terms:
L = l_depth + l_grad + l_ssim
where l_depth is the depth error, used to bring the prediction closer to the training-set Ground Truth; it is defined as:

l_depth = (1/n) Σ_{p=1}^{n} |ŷ_p − y_p|

where n is the number of pixels of the image input into the neural network, ŷ_p is one pixel of the depth predicted by the neural network, and y_p is a pixel of the training-set Ground Truth;
l_grad is the gradient error, used to make the edges of the generated depth map sharper and clearer; it is defined as:

l_grad = (1/n) Σ_{p=1}^{n} ( |∇_x(ŷ_p − y_p)| + |∇_y(ŷ_p − y_p)| )

where ∇_x is the differential in the x direction and ∇_y is the differential in the y direction;
l_ssim is the structural-similarity error, adopted so that the generated depth map has a better visual effect:

l_ssim = 1 − SSIM(ŷ, y),  SSIM(x, y) = (2 μ_x μ_y + c1)(2 σ_xy + c2) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2)).
7. The method according to claim 4, wherein the parameters of the Canny operator are set as follows: the ratio of the low threshold to the high threshold is 1:3, and the kernel size is 3 × 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010777954.9A CN112085702A (en) | 2020-08-05 | 2020-08-05 | Monocular depth estimation method based on sparse depth of key region |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112085702A true CN112085702A (en) | 2020-12-15 |
Family
ID=73736029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010777954.9A Withdrawn CN112085702A (en) | 2020-08-05 | 2020-08-05 | Monocular depth estimation method based on sparse depth of key region |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112085702A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114627351A (en) * | 2022-02-18 | 2022-06-14 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
CN114627351B (en) * | 2022-02-18 | 2023-05-16 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN108230329B (en) | Semantic segmentation method based on multi-scale convolution neural network | |
CN109544555B (en) | Tiny crack segmentation method based on generation type countermeasure network | |
CN112184577B (en) | Single image defogging method based on multiscale self-attention generation countermeasure network | |
CN111612008B (en) | Image segmentation method based on convolution network | |
CN110648334A (en) | Multi-feature cyclic convolution saliency target detection method based on attention mechanism | |
CN111625608B (en) | Method and system for generating electronic map according to remote sensing image based on GAN model | |
CN110796009A (en) | Method and system for detecting marine vessel based on multi-scale convolution neural network model | |
CN110070517B (en) | Blurred image synthesis method based on degradation imaging mechanism and generation countermeasure mechanism | |
CN111242026B (en) | Remote sensing image target detection method based on spatial hierarchy perception module and metric learning | |
CN111652273A (en) | Deep learning-based RGB-D image classification method | |
CN115908772A (en) | Target detection method and system based on Transformer and fusion attention mechanism | |
CN115861756A (en) | Earth background small target identification method based on cascade combination network | |
CN115330703A (en) | Remote sensing image cloud and cloud shadow detection method based on context information fusion | |
CN116883650A (en) | Image-level weak supervision semantic segmentation method based on attention and local stitching | |
Peng et al. | Incorporating generic and specific prior knowledge in a multiscale phase field model for road extraction from VHR images | |
CN115170978A (en) | Vehicle target detection method and device, electronic equipment and storage medium | |
CN114155165A (en) | Image defogging method based on semi-supervision | |
Babu et al. | An efficient image dahazing using Googlenet based convolution neural networks | |
CN112132867B (en) | Remote sensing image change detection method and device | |
Shit et al. | An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection | |
CN116503677B (en) | Wetland classification information extraction method, system, electronic equipment and storage medium | |
CN112085702A (en) | Monocular depth estimation method based on sparse depth of key region | |
Di et al. | FDNet: An end-to-end fusion decomposition network for infrared and visible images | |
CN116740362A (en) | Attention-based lightweight asymmetric scene semantic segmentation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201215 ||