CN109460815B - Monocular vision depth estimation method

Monocular vision depth estimation method

Info

Publication number
CN109460815B
CN109460815B (application CN201811246664.0A)
Authority
CN
China
Prior art keywords: layer, output, feature maps, height, width
Prior art date
Legal status
Active
Application number
CN201811246664.0A
Other languages
Chinese (zh)
Other versions
CN109460815A (en)
Inventor
周武杰
袁建中
吕思嘉
钱亚冠
向坚
张宇来
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201811246664.0A
Publication of CN109460815A
Application granted
Publication of CN109460815B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a monocular vision depth estimation method. First, a convolutional neural network is constructed, comprising an input layer, a hidden layer and an output layer, where the hidden layer comprises an encoding framework, a decoding framework and an upsampling framework. Next, the monocular images in a training set are used as original input images and input into the convolutional neural network for training, obtaining an estimated depth image corresponding to each original monocular image in the training set. Loss function values between the estimated depth images corresponding to the monocular images in the training set and the corresponding real depth images are then calculated, yielding a trained convolutional neural network model together with an optimal weight vector and an optimal bias term. Finally, a monocular image to be predicted is input into the trained convolutional neural network model, and the corresponding predicted depth image is obtained using the optimal weight vector and the optimal bias term. The method has the advantage of high prediction accuracy.

Description

Monocular vision depth estimation method
Technical Field
The invention relates to an image signal processing technology, in particular to a monocular visual depth estimation method.
Background
Rapid economic development has brought a continuous rise in living standards, and with people's growing demand for a good quality of life, transportation has become ever more convenient. As an important part of transportation, the automobile has received particular attention. Driven by the boom in artificial intelligence, driverless vehicles have become one of the most popular topics in recent years, and interest has only grown since driverless cars were announced for mass production and deployment. Monocular visual depth estimation of the scene in front of a vehicle is one component of the driverless-driving field, and it can effectively safeguard the vehicle during driving.
Monocular visual depth estimation methods fall mainly into traditional methods and deep learning methods. Before deep learning appeared, the results of depth estimation based on traditional methods fell far short of even minimal expectations; deep learning methods, which are trained end to end on large amounts of training data, greatly improved the accuracy of the estimated depth. In the paper "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", Eigen et al. proposed a two-scale neural network for depth estimation: a coarse-scale network predicts the global depth distribution, and a fine-scale network locally refines the depth map. Eigen et al. later extended this to three scales: the three-scale architecture first predicts a rough result over the whole image at the first scale, then refines that result at medium resolution at the second scale, and finally applies fine refinement at the third scale to obtain the predicted depth map.
Disclosure of Invention
The invention aims to provide a monocular visual depth estimation method with high prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set; the nth original monocular image in the training set is denoted Qn(x,y), and the real depth image in the training set corresponding to Qn(x,y) is denoted Qn^depth(x,y); wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x,y) and Qn^depth(x,y), L represents the height of Qn(x,y) and Qn^depth(x,y), and R and L are each divisible by 2;
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
For the encoding framework: it consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first perforated (dilated) convolutional layer, a sixth batch normalization layer, a sixth activation layer, a second perforated convolutional layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third perforated convolutional layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence.
For the decoding framework: it consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, arranged in sequence.
For the upsampling framework: it consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, arranged in sequence.
For the output layer: it consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, arranged in sequence.
The convolution kernel size of the first to thirteenth convolutional layers, the first to third perforated convolutional layers and the first to fourth deconvolution layers is 3 × 3 in every case. The number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second perforated convolutional layers, 512 for the third perforated convolutional layer, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer. The convolution strides of the first to thirteenth convolutional layers and the first to third perforated convolutional layers take default values; the stride of the first to fourth deconvolution layers is 2 × 2; the parameters of the first to twentieth batch normalization layers take default values; ReLU is used as the activation function of the first to twentieth activation layers; the pooling stride of the first to fourth max pooling layers is 2 × 2; and the sampling stride of the first to fourth upsampling layers is 2 × 2.
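The patent does not name an implementation framework. As a hedged illustration of how these hyperparameters (3 × 3 kernels, default convolution stride, default batch normalization parameters, ReLU, 2 × 2 pooling) might map onto tf.keras, one encoder stage could read as follows; the 'same' padding is an assumption made so that the stated feature-map sizes hold:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(None, None, 3))    # original input image, width R, height L
x = layers.Conv2D(32, 3, padding='same')(inp)  # first convolutional layer: 32 kernels of 3x3, default stride
x = layers.BatchNormalization()(x)             # first batch normalization layer, default parameters
x = layers.Activation('relu')(x)               # first activation layer: ReLU
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # first max pooling layer: pooling stride 2x2 -> R/2 x L/2
```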
For the encoding framework: the input end of the first convolutional layer receives the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, whose set is denoted J1; each feature map in J1 has a width of R and a height of L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1, each of width R and height L. The first activation layer receives all feature maps in P1 and outputs 32 feature maps, denoted H1, each of width R and height L. The first max pooling layer receives all feature maps in H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2. The second convolutional layer receives all feature maps in Z1 and outputs 64 feature maps, denoted J2, each of width R/2 and height L/2. The second batch normalization layer receives all feature maps in J2 and outputs 64 feature maps, denoted P2, each of width R/2 and height L/2. The second activation layer receives all feature maps in P2 and outputs 64 feature maps, denoted H2, each of width R/2 and height L/2. The third convolutional layer receives all feature maps in H2 and outputs 64 feature maps, denoted J3, each of width R/2 and height L/2. The third batch normalization layer receives all feature maps in J3 and outputs 64 feature maps, denoted P3, each of width R/2 and height L/2. The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1, each of width R/2 and height L/2. The third activation layer receives all feature maps in C1 and outputs 128 feature maps, denoted H3, each of width R/2 and height L/2. The second max pooling layer receives all feature maps in H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4. The fourth convolutional layer receives all feature maps in Z2 and outputs 128 feature maps, denoted J4, each of width R/4 and height L/4. The fourth batch normalization layer receives all feature maps in J4 and outputs 128 feature maps, denoted P4, each of width R/4 and height L/4. The fourth activation layer receives all feature maps in P4 and outputs 128 feature maps, denoted H4, each of width R/4 and height L/4. The fifth convolutional layer receives all feature maps in H4 and outputs 128 feature maps, denoted J5, each of width R/4 and height L/4. The fifth batch normalization layer receives all feature maps in J5 and outputs 128 feature maps, denoted P5, each of width R/4 and height L/4. The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2, each of width R/4 and height L/4. The fifth activation layer receives all feature maps in C2 and outputs 256 feature maps, denoted H5, each of width R/4 and height L/4. The third max pooling layer receives all feature maps in H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8. The first perforated convolutional layer receives all feature maps in Z3 and outputs 256 feature maps, denoted K1, each of width R/8 and height L/8. The sixth batch normalization layer receives all feature maps in K1 and outputs 256 feature maps, denoted P6, each of width R/8 and height L/8. The sixth activation layer receives all feature maps in P6 and outputs 256 feature maps, denoted H6, each of width R/8 and height L/8. The second perforated convolutional layer receives all feature maps in H6 and outputs 256 feature maps, denoted K2, each of width R/8 and height L/8. The seventh batch normalization layer receives all feature maps in K2 and outputs 256 feature maps, denoted P7, each of width R/8 and height L/8. The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3, each of width R/8 and height L/8. The seventh activation layer receives all feature maps in C3 and outputs 512 feature maps, denoted H7, each of width R/8 and height L/8. The fourth max pooling layer receives all feature maps in H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16. The third perforated convolutional layer receives all feature maps in Z4 and outputs 512 feature maps, denoted K3, each of width R/16 and height L/16. The eighth batch normalization layer receives all feature maps in K3 and outputs 512 feature maps, denoted P8, each of width R/16 and height L/16. The eighth activation layer receives all feature maps in P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, and each feature map in H8 has a width of R/16 and a height of L/16.
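Gathering the chain above into one place, the following is a sketch of the encoding framework in tf.keras. It is not the patent's reference code: the 'same' padding and a dilation rate of 2 for the perforated convolutional layers are assumptions the patent leaves unstated, and the variable names simply mirror the J/P/H/Z/K/C notation of the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, dilation=1):
    """3x3 convolution -> batch normalization, the repeated encoder motif."""
    x = layers.Conv2D(filters, 3, padding='same', dilation_rate=dilation)(x)
    return layers.BatchNormalization()(x)

def encoder(inp):
    # Stage 1: conv1 -> BN1 -> ReLU (H1) -> max pool (Z1: R/2 x L/2)
    h1 = layers.Activation('relu')(conv_block(inp, 32))
    z1 = layers.MaxPooling2D(2)(h1)
    # Stage 2: conv2, conv3 with a short skip (first Concatenate fusion layer)
    h2 = layers.Activation('relu')(conv_block(z1, 64))
    p3 = conv_block(h2, 64)
    h3 = layers.Activation('relu')(layers.Concatenate()([p3, h2]))  # C1: 128 maps
    z2 = layers.MaxPooling2D(2)(h3)                                 # R/4 x L/4
    # Stage 3: conv4, conv5 with a short skip (second Concatenate fusion layer)
    h4 = layers.Activation('relu')(conv_block(z2, 128))
    p5 = conv_block(h4, 128)
    h5 = layers.Activation('relu')(layers.Concatenate()([p5, h4]))  # C2: 256 maps
    z3 = layers.MaxPooling2D(2)(h5)                                 # R/8 x L/8
    # Stage 4: two perforated (dilated) convolutions with a short skip (third fusion layer)
    h6 = layers.Activation('relu')(conv_block(z3, 256, dilation=2))
    p7 = conv_block(h6, 256, dilation=2)
    h7 = layers.Activation('relu')(layers.Concatenate()([p7, h6]))  # C3: 512 maps
    z4 = layers.MaxPooling2D(2)(h7)                                 # R/16 x L/16
    # Stage 5: third perforated convolution -> BN8 -> ReLU = encoder output H8
    h8 = layers.Activation('relu')(conv_block(z4, 512, dilation=2))
    return h1, p3, p5, p7, z4, h8   # tensors reused later by the skip connections
```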
For the decoding framework: the input end of the first deconvolution layer receives the output H8 of the encoding framework and outputs 256 feature maps, denoted F1, each of width R/8 and height L/8. The ninth batch normalization layer receives all feature maps in F1 and outputs 256 feature maps, denoted P9, each of width R/8 and height L/8. The fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4, each of width R/8 and height L/8. The ninth activation layer receives all feature maps in C4 and outputs 512 feature maps, denoted H9, each of width R/8 and height L/8. The sixth convolutional layer receives all feature maps in H9 and outputs 256 feature maps, denoted J6, each of width R/8 and height L/8. The tenth batch normalization layer receives all feature maps in J6 and outputs 256 feature maps, denoted P10, each of width R/8 and height L/8. The tenth activation layer receives all feature maps in P10 and outputs 256 feature maps, denoted H10, each of width R/8 and height L/8. The second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2, each of width R/4 and height L/4. The eleventh batch normalization layer receives all feature maps in F2 and outputs 128 feature maps, denoted P11, each of width R/4 and height L/4. The fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5, each of width R/4 and height L/4. The eleventh activation layer receives all feature maps in C5 and outputs 256 feature maps, denoted H11, each of width R/4 and height L/4. The seventh convolutional layer receives all feature maps in H11 and outputs 128 feature maps, denoted J7, each of width R/4 and height L/4. The twelfth batch normalization layer receives all feature maps in J7 and outputs 128 feature maps, denoted P12, each of width R/4 and height L/4. The twelfth activation layer receives all feature maps in P12 and outputs 128 feature maps, denoted H12, each of width R/4 and height L/4. The third deconvolution layer receives all feature maps in H12 and outputs 64 feature maps, denoted F3, each of width R/2 and height L/2. The thirteenth batch normalization layer receives all feature maps in F3 and outputs 64 feature maps, denoted P13, each of width R/2 and height L/2. The sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6, each of width R/2 and height L/2. The thirteenth activation layer receives all feature maps in C6 and outputs 128 feature maps, denoted H13, each of width R/2 and height L/2. The eighth convolutional layer receives all feature maps in H13 and outputs 64 feature maps, denoted J8, each of width R/2 and height L/2. The fourteenth batch normalization layer receives all feature maps in J8 and outputs 64 feature maps, denoted P14, each of width R/2 and height L/2. The fourteenth activation layer receives all feature maps in P14 and outputs 64 feature maps, denoted H14, each of width R/2 and height L/2. The fourth deconvolution layer receives all feature maps in H14 and outputs 32 feature maps, denoted F4, each of width R and height L. The fifteenth batch normalization layer receives all feature maps in F4 and outputs 32 feature maps, denoted P15, each of width R and height L. The seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and all feature maps in H20 (the output of the upsampling framework, described below) and outputs 96 feature maps, denoted C7, each of width R and height L.
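Under the same assumptions as the encoder sketch, the decoding framework might be sketched as below, with Conv2DTranspose (stride 2 × 2) standing in for the deconvolution layers and the long skip connections realized by Concatenate; h20 is the upsampling-framework output described next.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deconv_bn(x, filters):
    """3x3 deconvolution with stride 2x2 -> batch normalization."""
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(x)
    return layers.BatchNormalization()(x)

def conv_bn(x, filters):
    x = layers.Conv2D(filters, 3, padding='same')(x)
    return layers.BatchNormalization()(x)

def decoder(h8, p7, p5, p3, h1, h20):
    # deconv1 -> BN9 -> long skip to P7 (fourth Concatenate fusion layer) -> ReLU -> conv6
    h9  = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h8, 256), p7]))   # C4: 512
    h10 = layers.Activation('relu')(conv_bn(h9, 256))
    # deconv2 -> BN11 -> long skip to P5 (fifth fusion layer) -> ReLU -> conv7
    h11 = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h10, 128), p5]))  # C5: 256
    h12 = layers.Activation('relu')(conv_bn(h11, 128))
    # deconv3 -> BN13 -> long skip to P3 (sixth fusion layer) -> ReLU -> conv8
    h13 = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h12, 64), p3]))   # C6: 128
    h14 = layers.Activation('relu')(conv_bn(h13, 64))
    # deconv4 -> BN15 -> seventh fusion layer joins P15 with H1 and the upsampling output H20
    p15 = deconv_bn(h14, 32)
    c7  = layers.Concatenate()([p15, h1, h20])   # 96 maps at full resolution R x L
    return c7
```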
For the upsampling framework: the input end of the first upsampling layer receives all feature maps in Z4 and outputs 512 feature maps, denoted Y1, each of width R/8 and height L/8. The tenth convolutional layer receives all feature maps in Y1 and outputs 256 feature maps, denoted J10, each of width R/8 and height L/8. The seventeenth batch normalization layer receives all feature maps in J10 and outputs 256 feature maps, denoted P17, each of width R/8 and height L/8. The seventeenth activation layer receives all feature maps in P17 and outputs 256 feature maps, denoted H17, each of width R/8 and height L/8. The second upsampling layer receives all feature maps in H17 and outputs 256 feature maps, denoted Y2, each of width R/4 and height L/4. The eleventh convolutional layer receives all feature maps in Y2 and outputs 128 feature maps, denoted J11, each of width R/4 and height L/4. The eighteenth batch normalization layer receives all feature maps in J11 and outputs 128 feature maps, denoted P18, each of width R/4 and height L/4. The eighteenth activation layer receives all feature maps in P18 and outputs 128 feature maps, denoted H18, each of width R/4 and height L/4. The third upsampling layer receives all feature maps in H18 and outputs 128 feature maps, denoted Y3, each of width R/2 and height L/2. The twelfth convolutional layer receives all feature maps in Y3 and outputs 64 feature maps, denoted J12, each of width R/2 and height L/2. The nineteenth batch normalization layer receives all feature maps in J12 and outputs 64 feature maps, denoted P19, each of width R/2 and height L/2. The nineteenth activation layer receives all feature maps in P19 and outputs 64 feature maps, denoted H19, each of width R/2 and height L/2. The fourth upsampling layer receives all feature maps in H19 and outputs 64 feature maps, denoted Y4, each of width R and height L. The thirteenth convolutional layer receives all feature maps in Y4 and outputs 32 feature maps, denoted J13, each of width R and height L. The twentieth batch normalization layer receives all feature maps in J13 and outputs 32 feature maps, denoted P20, each of width R and height L. The twentieth activation layer receives all feature maps in P20 and outputs 32 feature maps, denoted H20, each of width R and height L.
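Continuing the tf.keras sketch under the same assumptions, the four identical (upsample, 3 × 3 convolution, batch normalization, ReLU) stages of the upsampling framework can be written as a loop:

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_branch(z4):
    """Upsampling framework: four (UpSampling -> 3x3 conv -> BN -> ReLU) stages from Z4."""
    x = z4                               # 512 maps at R/16 x L/16
    for filters in (256, 128, 64, 32):   # kernel counts of conv10..conv13
        x = layers.UpSampling2D(size=(2, 2))(x)
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x                             # H20: 32 maps at R x L
```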
For the output layer: the input end of the fifteenth activation layer receives the output of the decoding framework, namely all feature maps in C7, and outputs 96 feature maps, denoted H15, each of width R and height L. The ninth convolutional layer receives all feature maps in H15 and outputs 1 feature map, denoted J9; the feature map in J9 has a width of R and a height of L. The sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16; the feature map in P16 has a width of R and a height of L. The sixteenth activation layer receives the feature map in P16 and outputs 1 feature map, denoted H16; the feature map in H16 has a width of R and a height of L, and it is the estimated depth image corresponding to the original input image.
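Reusing the encoder, decoder and upsample_branch sketches above, the output layer and overall assembly might then read as follows (again a hedged sketch, not the patent's reference code; the example input size is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(192, 640, 3)):        # (L, R, 3); example sizes only
    inp = tf.keras.Input(shape=input_shape)
    h1, p3, p5, p7, z4, h8 = encoder(inp)          # encoding framework (sketched earlier)
    h20 = upsample_branch(z4)                      # upsampling framework
    c7  = decoder(h8, p7, p5, p3, h1, h20)         # decoding framework -> 96 maps, R x L
    # Output layer: ReLU -> 3x3 conv with 1 kernel -> BN -> ReLU
    x = layers.Activation('relu')(c7)
    x = layers.Conv2D(1, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    out = layers.Activation('relu')(x)             # H16: the estimated depth image
    return tf.keras.Model(inp, out)
```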
Step 1_3: input each original monocular image in the training set as an original input image into the convolutional neural network for training, obtaining the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to Qn(x,y) is denoted Q̂n^depth(x,y).
Step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image
Figure GDA0003061601460000132
And
Figure GDA0003061601460000133
the value of the loss function in between is recorded as
Figure GDA0003061601460000134
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times to obtain a trained convolutional neural network training model, and obtaining N multiplied by V loss function values; then finding out the loss function value with the minimum value from the N multiplied by V loss function values; and then, corresponding the weight vector and the bias item corresponding to the loss function value with the minimum value to be used as the optimal weight vector and the optimal bias item of the trained convolutional neural network training model, and correspondingly marking as WbestAnd bbest(ii) a Wherein V is greater than 1;
the test stage process comprises the following specific steps:
Step 2_1: let Q(x',y') denote the monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x',y'), and L' denotes the height of Q(x',y');
Step 2_2: input Q(x',y') into the trained convolutional neural network training model and predict using Wbest and bbest, obtaining the predicted depth image corresponding to Q(x',y'), denoted Qdepth(x',y').
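The corresponding inference step, as a sketch (the test image is assumed here to match the training resolution; file names are illustrative):

```python
import numpy as np

model.load_weights('best_weights.h5')              # W_best and b_best from training
q = np.zeros((1, 192, 640, 3), dtype=np.float32)   # monocular image Q(x', y') to be predicted
q_depth = model.predict(q)[0, :, :, 0]             # predicted depth image Q_depth(x', y')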
In step 1_4, the loss function value Loss_n is obtained using a mean squared error function.
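Written out with the notation of step 1_4, the mean squared error takes the standard form below (the patent states that an MSE function is used but does not print the formula itself):

```latex
\mathrm{Loss}_n \;=\; \frac{1}{R\,L}\sum_{x=1}^{R}\sum_{y=1}^{L}
\left(\hat{Q}_n^{\mathrm{depth}}(x,y)-Q_n^{\mathrm{depth}}(x,y)\right)^{2}
```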
Compared with the prior art, the invention has the advantages that:
1) In constructing the convolutional neural network, the method adopts skip-layer connections, namely the Concatenate fusion layers: short skip connections are used inside the encoding framework (the first, second and third Concatenate fusion layers), and long skip connections are used between the encoding framework and the decoding framework (the fourth, fifth, sixth and seventh Concatenate fusion layers). Skip connections aid multi-scale feature fusion and boundary preservation; the short skip connections enrich the diversity of information during encoding, and the long skip connections counteract the loss of original boundary information in the decoding part, so the depth estimation of the trained convolutional neural network model is more accurate.
2) The method uses an end-to-end convolutional neural network training framework in which three perforated (dilated) convolutional layers, placed after the third max pooling layer of the encoding framework, extract feature information; without increasing the number of training parameters, the perforated convolutions enlarge the receptive field of the neurons and capture more feature information.
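To see why a perforated (dilated) convolution adds receptive field without adding parameters, compare a 3 × 3 convolution at dilation rates 1 and 2; this is a generic property of dilated convolutions, not something specific to the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(64, 64, 256))
plain   = layers.Conv2D(256, 3, padding='same', dilation_rate=1)
dilated = layers.Conv2D(256, 3, padding='same', dilation_rate=2)
_ = plain(inp)     # build both layers on the same input
_ = dilated(inp)

# Identical parameter count: 3*3*256*256 weights + 256 biases in both cases...
print(plain.count_params() == dilated.count_params())   # True
# ...but the dilated 3x3 kernel covers a 5x5 input window instead of 3x3,
# enlarging each neuron's receptive field at no extra training cost.
```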
3) The hidden layer of the convolutional neural network created by the method comprises an encoding framework, a decoding framework and an upsampling framework; the combination of the three frameworks enables the trained convolutional neural network model to extract information-rich features, so that highly accurate depth information is obtained and the precision of the depth estimation result is improved.
4) The predicted depth image obtained by the method has the same size as the original monocular image, which facilitates direct use of the depth information in the predicted depth image.
Drawings
FIG. 1 is a schematic diagram of the structure of the coding framework in the hidden layer of the convolutional neural network created in the method of the present invention;
FIG. 2 is a schematic diagram of the composition of the decoding framework in the hidden layer of the convolutional neural network created in the method of the present invention, together with the output layer of that network;
fig. 3 is a schematic structural diagram of the composition of an upsampling frame in the hidden layer of the convolutional neural network created in the method of the present invention.
Detailed Description
The invention is described in further detail below in conjunction with the embodiments.
The invention provides a monocular vision depth estimation method which is characterized by comprising a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set; the nth original monocular image in the training set is denoted Qn(x,y), and the real depth image in the training set corresponding to Qn(x,y) is denoted Qn^depth(x,y); wherein N is a positive integer, N ≥ 100, e.g. N = 1000, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x,y) and Qn^depth(x,y), L represents the height of Qn(x,y) and Qn^depth(x,y), and R and L are each divisible by 2. Here, the original monocular images and their corresponding real depth images are provided directly by the official KITTI website.
Step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame.
For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein the input end of the input layer is required to receive the original input image with a width of R and a height of L.
As shown in fig. 1, the encoding framework consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first perforated convolutional layer, a sixth batch normalization layer, a sixth activation layer, a second perforated convolutional layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third perforated convolutional layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence. As shown in fig. 2, the decoding framework consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, arranged in sequence. As shown in fig. 3, the upsampling framework consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, arranged in sequence. As also shown in fig. 2, the output layer consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, arranged in sequence.
The convolution kernel size of the first to thirteenth convolutional layers, the first to third perforated convolutional layers and the first to fourth deconvolution layers is 3 × 3 in every case. The number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second perforated convolutional layers, 512 for the third perforated convolutional layer, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer. The convolution strides of the first to thirteenth convolutional layers and the first to third perforated convolutional layers take default values; the stride of the first to fourth deconvolution layers is 2 × 2; the parameters of the first to twentieth batch normalization layers take default values; ReLU is used as the activation function of the first to twentieth activation layers; the pooling stride of the first to fourth max pooling layers is 2 × 2; and the sampling stride of the first to fourth upsampling layers is 2 × 2.
For the coding framework, the input end of the first convolution layer receives the original input image output by the output end of the input layer, the output end of the first convolution layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as J1Wherein, J1Each feature map in (1) has a width of R and a height of L; the input of the first normalization layer receives J1The output end of the first normalization layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as P1Wherein P is1Each feature map in (1) has a width of R and a height of L; the input terminal of the first active layer receives P1The output end of the first activation layer outputs 32 feature maps, and the set formed by all the output feature maps is recorded as H1Wherein H is1Each feature map in (1) has a width of R and a height of L; the input of the first largest pooling layer receives H1The output end of the first maximum pooling layer outputs 32 feature maps, and the set formed by all the output feature maps is recorded as Z1Wherein Z is1Each feature map of (1) has a width of
Figure GDA0003061601460000181
Has a height of
Figure GDA0003061601460000182
Input terminal of the second convolution layer receives Z1The output end of the second convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as J2Wherein, J2Each feature map of (1) has a width of
Figure GDA0003061601460000183
Has a height of
Figure GDA0003061601460000184
Input of second normalization layer receives J2The output end of the second normalization layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as P2Wherein P is2Each feature map of (1) has a width of
Figure GDA0003061601460000185
Has a height of
Figure GDA0003061601460000186
The input terminal of the second active layer receives P2The output end of the second activation layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as H2Wherein H is2Each feature map of (1) has a width of
Figure GDA0003061601460000187
Has a height of
Figure GDA0003061601460000188
Input terminal of the third convolution layer receives H2The output end of the third convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as J3Wherein, J3Each feature map of (1) has a width of
Figure GDA0003061601460000189
Has a height of
Figure GDA00030616014600001810
Input of third normalization layer receives J3The output end of the third normalization layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as P3Wherein P is3Each feature map of (1) has a width of
Figure GDA00030616014600001811
Has a height of
Figure GDA00030616014600001812
The input terminal of the first splice fusion layer receives P3All characteristic diagrams in (1) and (H)2The output end of the first concatemate fusion layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as C1Wherein, C1Each feature map of (1) has a width of
Figure GDA00030616014600001813
Has a height of
Figure GDA00030616014600001814
Input terminal of third active layer receives C1The output end of the third activation layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as H3Wherein H is3Each feature map of (1) has a width of
Figure GDA00030616014600001815
Has a height of
Figure GDA00030616014600001816
The input of the second largest pooling layer receives H3The output end of the second maximum pooling layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as Z2Wherein Z is2Each feature map of (1) has a width of
Figure GDA0003061601460000191
Has a height of
Figure GDA0003061601460000192
Input terminal of the fourth convolution layer receiving Z2The output end of the fourth convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as J4Wherein, J4Each feature map of (1) has a width of
Figure GDA0003061601460000193
Has a height of
Figure GDA0003061601460000194
Input of fourth normalization layer receives J4The output end of the fourth normalization layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as P4Wherein P is4Each feature map of (1) has a width of
Figure GDA0003061601460000195
Has a height of
Figure GDA0003061601460000196
The input terminal of the fourth active layer receives P4The output end of the fourth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as H4Wherein H is4Each feature map of (1) has a width of
Figure GDA0003061601460000197
Has a height of
Figure GDA0003061601460000198
Input terminal of fifth convolution layer receives H4The output end of the fifth convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as J5Wherein, J5Each feature map of (1) has a width of
Figure GDA0003061601460000199
Has a height of
Figure GDA00030616014600001910
The input end of the fifth batch normalization layer receives all of the feature maps in J5; the output end of the fifth batch normalization layer outputs 128 feature maps, whose set is denoted as P5, where each feature map in P5 has a width of R/4 and a height of L/4. The input end of the second Concatenate fusion layer receives all of the feature maps in P5 and all of the feature maps in H4; its output end outputs 256 feature maps, whose set is denoted as C2, where each feature map in C2 has a width of R/4 and a height of L/4. The input end of the fifth activation layer receives all of the feature maps in C2; its output end outputs 256 feature maps, whose set is denoted as H5, where each feature map in H5 has a width of R/4 and a height of L/4. The input end of the third maximum pooling layer receives all of the feature maps in H5; its output end outputs 256 feature maps, whose set is denoted as Z3, where each feature map in Z3 has a width of R/8 and a height of L/8. The input end of the first convolutional layer with holes receives all of the feature maps in Z3; its output end outputs 256 feature maps, whose set is denoted as K1, where each feature map in K1 has a width of R/8 and a height of L/8. The input end of the sixth batch normalization layer receives all of the feature maps in K1; its output end outputs 256 feature maps, whose set is denoted as P6, where each feature map in P6 has a width of R/8 and a height of L/8. The input end of the sixth activation layer receives all of the feature maps in P6; its output end outputs 256 feature maps, whose set is denoted as H6, where each feature map in H6 has a width of R/8 and a height of L/8. The input end of the second convolutional layer with holes receives all of the feature maps in H6; its output end outputs 256 feature maps, whose set is denoted as K2, where each feature map in K2 has a width of R/8 and a height of L/8. The input end of the seventh batch normalization layer receives all of the feature maps in K2; its output end outputs 256 feature maps, whose set is denoted as P7, where each feature map in P7 has a width of R/8 and a height of L/8. The input end of the third Concatenate fusion layer receives all of the feature maps in P7 and all of the feature maps in H6; its output end outputs 512 feature maps, whose set is denoted as C3, where each feature map in C3 has a width of R/8 and a height of L/8. The input end of the seventh activation layer receives all of the feature maps in C3; its output end outputs 512 feature maps, whose set is denoted as H7, where each feature map in H7 has a width of R/8 and a height of L/8. The input end of the fourth maximum pooling layer receives all of the feature maps in H7; its output end outputs 512 feature maps, whose set is denoted as Z4, where each feature map in Z4 has a width of R/16 and a height of L/16. The input end of the third convolutional layer with holes receives all of the feature maps in Z4; its output end outputs 512 feature maps, whose set is denoted as K3, where each feature map in K3 has a width of R/16 and a height of L/16. The input end of the eighth batch normalization layer receives all of the feature maps in K3; its output end outputs 512 feature maps, whose set is denoted as P8, where each feature map in P8 has a width of R/16 and a height of L/16. The input end of the eighth activation layer receives all of the feature maps in P8; its output end outputs 512 feature maps, whose set is denoted as H8, and H8 is the output of the coding framework, where each feature map in H8 has a width of R/16 and a height of L/16.
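For readability, the coding framework just described can be condensed into the following PyTorch-style sketch (an illustrative reconstruction, not the patented implementation: the 3-channel input, the size-preserving padding of the 3 × 3 convolutions and the dilation rate of 2 for the convolutional layers with holes are assumptions, since the text fixes only the kernel sizes, kernel counts and strides):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Coding framework sketch: conv/BN/ReLU stages with Concatenate fusions
    # and 2 x 2 max pooling, finished by convolutions with holes (atrous
    # convolutions); dilation=2 with padding=2 keeps the 3 x 3 kernels
    # size-preserving and is an assumption of this sketch.
    def __init__(self, in_ch=3):                      # 3-channel input is an assumption
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU())   # J1 -> P1 -> H1
        self.pool = nn.MaxPool2d(2)                                  # pooling stride 2 x 2
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                   nn.BatchNorm2d(64), nn.ReLU())    # J2 -> P2 -> H2
        self.conv3 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                   nn.BatchNorm2d(64))               # J3 -> P3
        self.conv4 = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1),
                                   nn.BatchNorm2d(128), nn.ReLU())   # J4 -> P4 -> H4
        self.conv5 = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1),
                                   nn.BatchNorm2d(128))              # J5 -> P5
        self.atr1 = nn.Sequential(nn.Conv2d(256, 256, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(256), nn.ReLU())    # K1 -> P6 -> H6
        self.atr2 = nn.Sequential(nn.Conv2d(256, 256, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(256))               # K2 -> P7
        self.atr3 = nn.Sequential(nn.Conv2d(512, 512, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(512), nn.ReLU())    # K3 -> P8 -> H8

    def forward(self, x):
        h1 = self.block1(x)                           # H1: 32 maps, R x L
        h2 = self.conv2(self.pool(h1))                # Z1 -> H2: 64 maps, R/2 x L/2
        p3 = self.conv3(h2)                           # P3: 64 maps
        h3 = torch.relu(torch.cat([p3, h2], dim=1))   # C1 -> H3: 128 maps
        h4 = self.conv4(self.pool(h3))                # Z2 -> H4: 128 maps, R/4 x L/4
        p5 = self.conv5(h4)                           # P5: 128 maps
        h5 = torch.relu(torch.cat([p5, h4], dim=1))   # C2 -> H5: 256 maps
        h6 = self.atr1(self.pool(h5))                 # Z3 -> H6: 256 maps, R/8 x L/8
        p7 = self.atr2(h6)                            # P7: 256 maps
        h7 = torch.relu(torch.cat([p7, h6], dim=1))   # C3 -> H7: 512 maps
        z4 = self.pool(h7)                            # Z4: 512 maps, R/16 x L/16
        h8 = self.atr3(z4)                            # H8: encoder output, 512 maps
        return h1, p3, p5, p7, z4, h8

Besides H8, the forward pass returns exactly the sets that are reused later: H1 and P3, P5, P7 by the Concatenate fusion layers of the decoding framework, and Z4 by the upsampling framework.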
For the decoding framework, the input end of the first deconvolution layer receives the output H8 of the coding framework; the output end of the first deconvolution layer outputs 256 feature maps, whose set is denoted as F1, where each feature map in F1 has a width of R/8 and a height of L/8. The input end of the ninth batch normalization layer receives all of the feature maps in F1; its output end outputs 256 feature maps, whose set is denoted as P9, where each feature map in P9 has a width of R/8 and a height of L/8. The input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7; its output end outputs 512 feature maps, whose set is denoted as C4, where each feature map in C4 has a width of R/8 and a height of L/8. The input end of the ninth activation layer receives all of the feature maps in C4; its output end outputs 512 feature maps, whose set is denoted as H9, where each feature map in H9 has a width of R/8 and a height of L/8. The input end of the sixth convolutional layer receives all of the feature maps in H9; its output end outputs 256 feature maps, whose set is denoted as J6, where each feature map in J6 has a width of R/8 and a height of L/8. The input end of the tenth batch normalization layer receives all of the feature maps in J6; its output end outputs 256 feature maps, whose set is denoted as P10, where each feature map in P10 has a width of R/8 and a height of L/8. The input end of the tenth activation layer receives all of the feature maps in P10; its output end outputs 256 feature maps, whose set is denoted as H10, where each feature map in H10 has a width of R/8 and a height of L/8.
The input end of the second deconvolution layer receives all of the feature maps in H10; the output end of the second deconvolution layer outputs 128 feature maps, whose set is denoted as F2, where each feature map in F2 has a width of R/4 and a height of L/4. The input end of the eleventh batch normalization layer receives all of the feature maps in F2; its output end outputs 128 feature maps, whose set is denoted as P11, where each feature map in P11 has a width of R/4 and a height of L/4. The input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5; its output end outputs 256 feature maps, whose set is denoted as C5, where each feature map in C5 has a width of R/4 and a height of L/4. The input end of the eleventh activation layer receives all of the feature maps in C5; its output end outputs 256 feature maps, whose set is denoted as H11, where each feature map in H11 has a width of R/4 and a height of L/4. The input end of the seventh convolutional layer receives all of the feature maps in H11; its output end outputs 128 feature maps, whose set is denoted as J7, where each feature map in J7 has a width of R/4 and a height of L/4. The input end of the twelfth batch normalization layer receives all of the feature maps in J7; its output end outputs 128 feature maps, whose set is denoted as P12, where each feature map in P12 has a width of R/4 and a height of L/4. The input end of the twelfth activation layer receives all of the feature maps in P12; its output end outputs 128 feature maps, whose set is denoted as H12, where each feature map in H12 has a width of R/4 and a height of L/4. The input end of the third deconvolution layer receives all of the feature maps in H12; its output end outputs 64 feature maps, whose set is denoted as F3, where each feature map in F3 has a width of R/2 and a height of L/2. The input end of the thirteenth batch normalization layer receives all of the feature maps in F3; its output end outputs 64 feature maps, whose set is denoted as P13, where each feature map in P13 has a width of R/2 and a height of L/2. The input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3; its output end outputs 128 feature maps, whose set is denoted as C6, where each feature map in C6 has a width of R/2 and a height of L/2. The input end of the thirteenth activation layer receives all of the feature maps in C6; its output end outputs 128 feature maps, whose set is denoted as H13, where each feature map in H13 has a width of R/2 and a height of L/2. The input end of the eighth convolutional layer receives all of the feature maps in H13; its output end outputs 64 feature maps, whose set is denoted as J8, where each feature map in J8 has a width of R/2 and a height of L/2. The input end of the fourteenth batch normalization layer receives all of the feature maps in J8; its output end outputs 64 feature maps, whose set is denoted as P14, where each feature map in P14 has a width of R/2 and a height of L/2. The input end of the fourteenth activation layer receives all of the feature maps in P14; its output end outputs 64 feature maps, whose set is denoted as H14, where each feature map in H14 has a width of R/2 and a height of L/2. The input end of the fourth deconvolution layer receives all of the feature maps in H14; its output end outputs 32 feature maps, whose set is denoted as F4, where each feature map in F4 has a width of R and a height of L. The input end of the fifteenth batch normalization layer receives all of the feature maps in F4; its output end outputs 32 feature maps, whose set is denoted as P15, where each feature map in P15 has a width of R and a height of L. The input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20 output by the upsampling framework; its output end outputs 96 feature maps, whose set is denoted as C7, where each feature map in C7 has a width of R and a height of L.
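The decoding framework can be sketched in the same style (again an assumption-laden reconstruction; the skip tensors p7, p5, p3 and h1 come from the encoder sketch above, and h20 from the upsampling framework described below):

import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # 3 x 3 deconvolution with stride 2 x 2 followed by batch normalization;
    # padding/output_padding are chosen so the resolution exactly doubles.
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                            padding=1, output_padding=1),
                         nn.BatchNorm2d(out_ch))

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv1 = up_block(512, 256)                            # F1 -> P9
        self.conv6 = nn.Sequential(nn.Conv2d(512, 256, 3, padding=1),
                                   nn.BatchNorm2d(256), nn.ReLU())   # J6 -> P10 -> H10
        self.deconv2 = up_block(256, 128)                            # F2 -> P11
        self.conv7 = nn.Sequential(nn.Conv2d(256, 128, 3, padding=1),
                                   nn.BatchNorm2d(128), nn.ReLU())   # J7 -> P12 -> H12
        self.deconv3 = up_block(128, 64)                             # F3 -> P13
        self.conv8 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1),
                                   nn.BatchNorm2d(64), nn.ReLU())    # J8 -> P14 -> H14
        self.deconv4 = up_block(64, 32)                              # F4 -> P15

    def forward(self, h8, p7, p5, p3, h1, h20):
        h9 = torch.relu(torch.cat([self.deconv1(h8), p7], dim=1))    # C4 -> H9: 512 maps
        h10 = self.conv6(h9)                                         # H10: 256 maps
        h11 = torch.relu(torch.cat([self.deconv2(h10), p5], dim=1))  # C5 -> H11: 256 maps
        h12 = self.conv7(h11)                                        # H12: 128 maps
        h13 = torch.relu(torch.cat([self.deconv3(h12), p3], dim=1))  # C6 -> H13: 128 maps
        h14 = self.conv8(h13)                                        # H14: 64 maps
        c7 = torch.cat([self.deconv4(h14), h1, h20], dim=1)          # C7: 96 maps, R x L
        return c7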
For the upsampling framework, the input end of the first upsampling layer receives all of the feature maps in Z4; the output end of the first upsampling layer outputs 512 feature maps, whose set is denoted as Y1, where each feature map in Y1 has a width of R/8 and a height of L/8. The input end of the tenth convolutional layer receives all of the feature maps in Y1; its output end outputs 256 feature maps, whose set is denoted as J10, where each feature map in J10 has a width of R/8 and a height of L/8. The input end of the seventeenth batch normalization layer receives all of the feature maps in J10; its output end outputs 256 feature maps, whose set is denoted as P17, where each feature map in P17 has a width of R/8 and a height of L/8. The input end of the seventeenth activation layer receives all of the feature maps in P17; its output end outputs 256 feature maps, whose set is denoted as H17, where each feature map in H17 has a width of R/8 and a height of L/8. The input end of the second upsampling layer receives all of the feature maps in H17; its output end outputs 256 feature maps, whose set is denoted as Y2, where each feature map in Y2 has a width of R/4 and a height of L/4. The input end of the eleventh convolutional layer receives all of the feature maps in Y2; its output end outputs 128 feature maps, whose set is denoted as J11, where each feature map in J11 has a width of R/4 and a height of L/4. The input end of the eighteenth batch normalization layer receives all of the feature maps in J11; its output end outputs 128 feature maps, whose set is denoted as P18, where each feature map in P18 has a width of R/4 and a height of L/4. The input end of the eighteenth activation layer receives all of the feature maps in P18; its output end outputs 128 feature maps, whose set is denoted as H18, where each feature map in H18 has a width of R/4 and a height of L/4. The input end of the third upsampling layer receives all of the feature maps in H18; its output end outputs 128 feature maps, whose set is denoted as Y3, where each feature map in Y3 has a width of R/2 and a height of L/2. The input end of the twelfth convolutional layer receives all of the feature maps in Y3; its output end outputs 64 feature maps, whose set is denoted as J12, where each feature map in J12 has a width of R/2 and a height of L/2. The input end of the nineteenth batch normalization layer receives all of the feature maps in J12; its output end outputs 64 feature maps, whose set is denoted as P19, where each feature map in P19 has a width of R/2 and a height of L/2. The input end of the nineteenth activation layer receives all of the feature maps in P19; its output end outputs 64 feature maps, whose set is denoted as H19, where each feature map in H19 has a width of R/2 and a height of L/2. The input end of the fourth upsampling layer receives all of the feature maps in H19; its output end outputs 64 feature maps, whose set is denoted as Y4, where each feature map in Y4 has a width of R and a height of L. The input end of the thirteenth convolutional layer receives all of the feature maps in Y4; its output end outputs 32 feature maps, whose set is denoted as J13, where each feature map in J13 has a width of R and a height of L. The input end of the twentieth batch normalization layer receives all of the feature maps in J13; its output end outputs 32 feature maps, whose set is denoted as P20, where each feature map in P20 has a width of R and a height of L. The input end of the twentieth activation layer receives all of the feature maps in P20; its output end outputs 32 feature maps, whose set is denoted as H20, where each feature map in H20 has a width of R and a height of L.
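A matching sketch of the upsampling framework (the nearest-neighbour interpolation mode is an assumption; the text fixes only the 2 × 2 sampling stride):

import torch.nn as nn

class UpsamplingFramework(nn.Module):
    # Four 2 x 2 upsampling steps, each followed by conv/BN/ReLU.
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')        # assumed mode
        self.conv10 = nn.Sequential(nn.Conv2d(512, 256, 3, padding=1),
                                    nn.BatchNorm2d(256), nn.ReLU())  # J10 -> P17 -> H17
        self.conv11 = nn.Sequential(nn.Conv2d(256, 128, 3, padding=1),
                                    nn.BatchNorm2d(128), nn.ReLU())  # J11 -> P18 -> H18
        self.conv12 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1),
                                    nn.BatchNorm2d(64), nn.ReLU())   # J12 -> P19 -> H19
        self.conv13 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU())   # J13 -> P20 -> H20

    def forward(self, z4):
        h17 = self.conv10(self.up(z4))    # Y1 -> H17: 256 maps, R/8 x L/8
        h18 = self.conv11(self.up(h17))   # Y2 -> H18: 128 maps, R/4 x L/4
        h19 = self.conv12(self.up(h18))   # Y3 -> H19: 64 maps, R/2 x L/2
        h20 = self.conv13(self.up(h19))   # Y4 -> H20: 32 maps, R x L
        return h20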
For the output layer, the input end of the fifteenth activation layer receives the output C7 of the decoding framework; the output end of the fifteenth activation layer outputs 96 feature maps, whose set is denoted as H15, where each feature map in H15 has a width of R and a height of L. The input end of the ninth convolutional layer receives all of the feature maps in H15; its output end outputs 1 feature map, denoted as J9, with a width of R and a height of L. The input end of the sixteenth batch normalization layer receives the feature map in J9; its output end outputs 1 feature map, denoted as P16, with a width of R and a height of L. The input end of the sixteenth activation layer receives the feature map in P16; its output end outputs 1 feature map, denoted as H16, with a width of R and a height of L; the feature map in H16 is the estimated depth image corresponding to the original input image.
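Wiring the three frameworks and the output layer together gives the complete estimator; the following sketch reuses the classes defined above and checks that a single-channel, full-resolution depth map is produced:

import torch
import torch.nn as nn

class DepthNet(nn.Module):
    # Overall network: coding framework -> (decoding + upsampling) -> output layer.
    def __init__(self):
        super().__init__()
        self.enc = Encoder()
        self.ups = UpsamplingFramework()
        self.dec = Decoder()
        # Output layer: act15 (ReLU) -> conv9 (96 -> 1) -> bn16 -> act16 (ReLU).
        self.out = nn.Sequential(nn.ReLU(), nn.Conv2d(96, 1, 3, padding=1),
                                 nn.BatchNorm2d(1), nn.ReLU())

    def forward(self, x):
        h1, p3, p5, p7, z4, h8 = self.enc(x)
        h20 = self.ups(z4)
        c7 = self.dec(h8, p7, p5, p3, h1, h20)
        return self.out(c7)              # H16: the estimated depth image, R x L

model = DepthNet()
# A width and height divisible by 16 keep all skip shapes aligned in this
# sketch, since the four 2 x 2 poolings halve the resolution four times.
depth = model(torch.randn(1, 3, 128, 416))
print(depth.shape)                       # torch.Size([1, 1, 128, 416])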
Step 1_3: inputting each original monocular image in the training set as an original input image into the convolutional neural network for training to obtain the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to Qn(x, y) is denoted as D̂n(x, y).
Step 1_4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image; the loss function value between D̂n(x, y) and the real depth image Dn(x, y) is denoted as Lossn and is obtained by using a mean square error function.
Step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model and N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; and then taking the weight vector and the bias term corresponding to that smallest loss function value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted as Wbest and bbest; where V > 1, and in this example V = 20.
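Steps 1_3 to 1_5 amount to the following training loop (a minimal sketch: the Adam optimizer, the per-image batching and the train_pairs iterable are assumptions, as the patent fixes none of them):

import copy
import torch
import torch.nn as nn

criterion = nn.MSELoss()                              # step 1_4: mean square error
optimizer = torch.optim.Adam(model.parameters())      # optimizer choice is an assumption
V = 20                                                # number of repetitions, as in the example
best_loss, best_state = float('inf'), None

for v in range(V):                                    # step 1_5: repeat V times
    for image, depth_gt in train_pairs:               # N original monocular images (placeholder iterable)
        depth_pred = model(image)                     # step 1_3: estimated depth image
        loss = criterion(depth_pred, depth_gt)        # one of the N x V loss values
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best_loss:                   # keep the weights of the smallest loss value
            best_loss = loss.item()
            best_state = copy.deepcopy(model.state_dict())   # plays the role of W_best, b_best

model.load_state_dict(best_state)                     # trained model with optimal weights and biases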
The test stage process comprises the following specific steps:
Step 2_1: let Q(x', y') denote the monocular image to be predicted; where 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' represents the width of Q(x', y'), and L' represents the height of Q(x', y').
Step 2_2: inputting Q(x', y') into the trained convolutional neural network training model and predicting, by using Wbest and bbest, the predicted depth image corresponding to Q(x', y'), denoted as Qdepth(x', y').
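The test stage then reduces to a single forward pass (sketch; test_image stands for a monocular image of width R' and height L' already prepared as a tensor):

model.eval()                        # step 2_2: trained model with W_best, b_best loaded
with torch.no_grad():
    q_depth = model(test_image)     # the predicted depth image Qdepth(x', y')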
To verify the feasibility and effectiveness of the method of the invention, experiments were carried out.
In the method, the monocular images forming the training set and the monocular images used for testing are both provided by the KITTI official website, so the accuracy of the method is analyzed directly on the test data set given by the KITTI official website. Each monocular image in the test data set is input, as a monocular image to be predicted, into the trained deep convolutional neural network training model, the optimal weight vector Wbest obtained in the training stage is loaded, and the corresponding predicted depth image is obtained.
Here, 6 objective parameters commonly used to evaluate monocular visual depth prediction are adopted as evaluation indexes, namely: root mean square error (rms), logarithmic root mean square error (log_rms), average log10 error (log10), and the threshold accuracies (thr) δ1, δ2, δ3. Lower values of the root mean square error, the logarithmic root mean square error and the average log10 error indicate that the predicted depth image is closer to the real depth image, while higher values of δ1, δ2 and δ3 indicate higher prediction accuracy. The root mean square error, logarithmic root mean square error, average log10 error and δ1, δ2, δ3 reflecting the evaluation performance of the method of the invention are listed in Table 1. As can be seen from the data listed in Table 1, the difference between the predicted depth images obtained by the method of the invention and the real depth images is very small, which shows that the prediction results of the method are highly accurate and reflects its feasibility and effectiveness.
TABLE 1 Comparison of evaluation indexes between the predicted depth images obtained by the method of the present invention and the real depth images
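For reference, the six indexes can be computed as follows (a sketch using the definitions conventional in the depth-estimation literature, including the 1.25, 1.25² and 1.25³ thresholds for δ1, δ2, δ3; the patent lists the indexes by name only):

import numpy as np

def evaluate(pred, gt):
    # pred, gt: positive depth arrays of identical shape.
    rms = np.sqrt(np.mean((pred - gt) ** 2))                        # root mean square error
    log_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))    # logarithmic rms error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))          # average log10 error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                                  # threshold accuracies
    delta2 = np.mean(ratio < 1.25 ** 2)
    delta3 = np.mean(ratio < 1.25 ** 3)
    return rms, log_rms, log10, delta1, delta2, delta3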

Claims (2)

1. A monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original monocular images and a real depth image corresponding to each original monocular image to form a training set, recording the nth original monocular image in the training set as Qn(x, y), and recording the real depth image corresponding to Qn(x, y) in the training set as Dn(x, y); wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x, y) and Dn(x, y), L represents the height of Qn(x, y) and Dn(x, y), and R and L are each divisible by 2;
step 1_2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises a coding framework, a decoding framework and an upsampling framework;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
for the coding framework: the coding framework consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first maximum pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second maximum pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third maximum pooling layer, a first convolutional layer with holes, a sixth batch normalization layer, a sixth activation layer, a second convolutional layer with holes, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth maximum pooling layer, a third convolutional layer with holes, an eighth batch normalization layer and an eighth activation layer, which are arranged in sequence; for the decoding framework: the decoding framework consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, which are arranged in sequence; for the upsampling framework: the upsampling framework consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, which are arranged in sequence; and for the output layer: the output layer consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, which are arranged in sequence; wherein the convolution kernels of the first to thirteenth convolutional layers, of the first to third convolutional layers with holes and of the first to fourth deconvolution layers are all of size 3 × 3; the number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second convolutional layers with holes, 512 for the third convolutional layer with holes, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer; the convolution strides of the first to thirteenth convolutional layers and of the first to third convolutional layers with holes adopt default values, and the convolution strides of the first to fourth deconvolution layers are 2 × 2; the parameters of the first to twentieth batch normalization layers adopt default values; the first to twentieth activation layers all adopt ReLU as the activation function; the pooling strides of the first to fourth maximum pooling layers are 2 × 2; and the sampling strides of the first to fourth upsampling layers are 2 × 2;
for the coding framework, the input end of the first convolutional layer receives the original input image output by the output end of the input layer; the output end of the first convolutional layer outputs 32 feature maps, whose set is denoted as J1, where each feature map in J1 has a width of R and a height of L; the input end of the first batch normalization layer receives all of the feature maps in J1; its output end outputs 32 feature maps, whose set is denoted as P1, where each feature map in P1 has a width of R and a height of L; the input end of the first activation layer receives all of the feature maps in P1; its output end outputs 32 feature maps, whose set is denoted as H1, where each feature map in H1 has a width of R and a height of L; the input end of the first maximum pooling layer receives all of the feature maps in H1; its output end outputs 32 feature maps, whose set is denoted as Z1, where each feature map in Z1 has a width of R/2 and a height of L/2; the input end of the second convolutional layer receives all of the feature maps in Z1; its output end outputs 64 feature maps, whose set is denoted as J2, where each feature map in J2 has a width of R/2 and a height of L/2; the input end of the second batch normalization layer receives all of the feature maps in J2; its output end outputs 64 feature maps, whose set is denoted as P2, where each feature map in P2 has a width of R/2 and a height of L/2; the input end of the second activation layer receives all of the feature maps in P2; its output end outputs 64 feature maps, whose set is denoted as H2, where each feature map in H2 has a width of R/2 and a height of L/2; the input end of the third convolutional layer receives all of the feature maps in H2; its output end outputs 64 feature maps, whose set is denoted as J3, where each feature map in J3 has a width of R/2 and a height of L/2; the input end of the third batch normalization layer receives all of the feature maps in J3; its output end outputs 64 feature maps, whose set is denoted as P3, where each feature map in P3 has a width of R/2 and a height of L/2; the input end of the first Concatenate fusion layer receives all of the feature maps in P3 and all of the feature maps in H2; its output end outputs 128 feature maps, whose set is denoted as C1, where each feature map in C1 has a width of R/2 and a height of L/2; the input end of the third activation layer receives all of the feature maps in C1; its output end outputs 128 feature maps, whose set is denoted as H3, where each feature map in H3 has a width of R/2 and a height of L/2; the input end of the second maximum pooling layer receives all of the feature maps in H3; its output end outputs 128 feature maps, whose set is denoted as Z2, where each feature map in Z2 has a width of R/4 and a height of L/4;
the input end of the fourth convolutional layer receives all of the feature maps in Z2; its output end outputs 128 feature maps, whose set is denoted as J4, where each feature map in J4 has a width of R/4 and a height of L/4; the input end of the fourth batch normalization layer receives all of the feature maps in J4; its output end outputs 128 feature maps, whose set is denoted as P4, where each feature map in P4 has a width of R/4 and a height of L/4; the input end of the fourth activation layer receives all of the feature maps in P4; its output end outputs 128 feature maps, whose set is denoted as H4, where each feature map in H4 has a width of R/4 and a height of L/4; the input end of the fifth convolutional layer receives all of the feature maps in H4; its output end outputs 128 feature maps, whose set is denoted as J5, where each feature map in J5 has a width of R/4 and a height of L/4; the input end of the fifth batch normalization layer receives all of the feature maps in J5; its output end outputs 128 feature maps, whose set is denoted as P5, where each feature map in P5 has a width of R/4 and a height of L/4; the input end of the second Concatenate fusion layer receives all of the feature maps in P5 and all of the feature maps in H4; its output end outputs 256 feature maps, whose set is denoted as C2, where each feature map in C2 has a width of R/4 and a height of L/4; the input end of the fifth activation layer receives all of the feature maps in C2; its output end outputs 256 feature maps, whose set is denoted as H5, where each feature map in H5 has a width of R/4 and a height of L/4; the input end of the third maximum pooling layer receives all of the feature maps in H5; its output end outputs 256 feature maps, whose set is denoted as Z3, where each feature map in Z3 has a width of R/8 and a height of L/8;
the input end of the first convolutional layer with holes receives all of the feature maps in Z3; its output end outputs 256 feature maps, whose set is denoted as K1, where each feature map in K1 has a width of R/8 and a height of L/8; the input end of the sixth batch normalization layer receives all of the feature maps in K1; its output end outputs 256 feature maps, whose set is denoted as P6, where each feature map in P6 has a width of R/8 and a height of L/8; the input end of the sixth activation layer receives all of the feature maps in P6; its output end outputs 256 feature maps, whose set is denoted as H6, where each feature map in H6 has a width of R/8 and a height of L/8; the input end of the second convolutional layer with holes receives all of the feature maps in H6; its output end outputs 256 feature maps, whose set is denoted as K2, where each feature map in K2 has a width of R/8 and a height of L/8; the input end of the seventh batch normalization layer receives all of the feature maps in K2; its output end outputs 256 feature maps, whose set is denoted as P7, where each feature map in P7 has a width of R/8 and a height of L/8; the input end of the third Concatenate fusion layer receives all of the feature maps in P7 and all of the feature maps in H6; its output end outputs 512 feature maps, whose set is denoted as C3, where each feature map in C3 has a width of R/8 and a height of L/8; the input end of the seventh activation layer receives all of the feature maps in C3; its output end outputs 512 feature maps, whose set is denoted as H7, where each feature map in H7 has a width of R/8 and a height of L/8; the input end of the fourth maximum pooling layer receives all of the feature maps in H7; its output end outputs 512 feature maps, whose set is denoted as Z4, where each feature map in Z4 has a width of R/16 and a height of L/16; the input end of the third convolutional layer with holes receives all of the feature maps in Z4; its output end outputs 512 feature maps, whose set is denoted as K3, where each feature map in K3 has a width of R/16 and a height of L/16; the input end of the eighth batch normalization layer receives all of the feature maps in K3; its output end outputs 512 feature maps, whose set is denoted as P8, where each feature map in P8 has a width of R/16 and a height of L/16; the input end of the eighth activation layer receives all of the feature maps in P8; its output end outputs 512 feature maps, whose set is denoted as H8, and H8 is the output of the coding framework, where each feature map in H8 has a width of R/16 and a height of L/16;
for the decoding framework, the input end of the first deconvolution layer receives the output H8 of the coding framework; the output end of the first deconvolution layer outputs 256 feature maps, whose set is denoted as F1, where each feature map in F1 has a width of R/8 and a height of L/8; the input end of the ninth batch normalization layer receives all of the feature maps in F1; its output end outputs 256 feature maps, whose set is denoted as P9, where each feature map in P9 has a width of R/8 and a height of L/8; the input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7; its output end outputs 512 feature maps, whose set is denoted as C4, where each feature map in C4 has a width of R/8 and a height of L/8; the input end of the ninth activation layer receives all of the feature maps in C4; its output end outputs 512 feature maps, whose set is denoted as H9, where each feature map in H9 has a width of R/8 and a height of L/8; the input end of the sixth convolutional layer receives all of the feature maps in H9; its output end outputs 256 feature maps, whose set is denoted as J6, where each feature map in J6 has a width of R/8 and a height of L/8; the input end of the tenth batch normalization layer receives all of the feature maps in J6; its output end outputs 256 feature maps, whose set is denoted as P10, where each feature map in P10 has a width of R/8 and a height of L/8; the input end of the tenth activation layer receives all of the feature maps in P10; its output end outputs 256 feature maps, whose set is denoted as H10, where each feature map in H10 has a width of R/8 and a height of L/8;
the input end of the second deconvolution layer receives all of the feature maps in H10; the output end of the second deconvolution layer outputs 128 feature maps, whose set is denoted as F2, where each feature map in F2 has a width of R/4 and a height of L/4; the input end of the eleventh batch normalization layer receives all of the feature maps in F2; its output end outputs 128 feature maps, whose set is denoted as P11, where each feature map in P11 has a width of R/4 and a height of L/4;
the input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5; its output end outputs 256 feature maps, whose set is denoted as C5, where each feature map in C5 has a width of R/4 and a height of L/4; the input end of the eleventh activation layer receives all of the feature maps in C5; its output end outputs 256 feature maps, whose set is denoted as H11, where each feature map in H11 has a width of R/4 and a height of L/4; the input end of the seventh convolutional layer receives all of the feature maps in H11; its output end outputs 128 feature maps, whose set is denoted as J7, where each feature map in J7 has a width of R/4 and a height of L/4; the input end of the twelfth batch normalization layer receives all of the feature maps in J7; its output end outputs 128 feature maps, whose set is denoted as P12, where each feature map in P12 has a width of R/4 and a height of L/4; the input end of the twelfth activation layer receives all of the feature maps in P12; its output end outputs 128 feature maps, whose set is denoted as H12, where each feature map in H12 has a width of R/4 and a height of L/4; the input end of the third deconvolution layer receives all of the feature maps in H12; its output end outputs 64 feature maps, whose set is denoted as F3, where each feature map in F3 has a width of R/2 and a height of L/2; the input end of the thirteenth batch normalization layer receives all of the feature maps in F3; its output end outputs 64 feature maps, whose set is denoted as P13, where each feature map in P13 has a width of R/2 and a height of L/2; the input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3; its output end outputs 128 feature maps, whose set is denoted as C6, where each feature map in C6 has a width of R/2 and a height of L/2; the input end of the thirteenth activation layer receives all of the feature maps in C6; its output end outputs 128 feature maps, whose set is denoted as H13, where each feature map in H13 has a width of R/2 and a height of L/2; the input end of the eighth convolutional layer receives all of the feature maps in H13; its output end outputs 64 feature maps, whose set is denoted as J8, where each feature map in J8 has a width of R/2 and a height of L/2; the input end of the fourteenth batch normalization layer receives all of the feature maps in J8; its output end outputs 64 feature maps, whose set is denoted as P14, where each feature map in P14 has a width of R/2 and a height of L/2; the input end of the fourteenth activation layer receives all of the feature maps in P14; its output end outputs 64 feature maps, whose set is denoted as H14, where each feature map in H14 has a width of R/2 and a height of L/2;
the input end of the fourth deconvolution layer receives all of the feature maps in H14; its output end outputs 32 feature maps, whose set is denoted as F4, where each feature map in F4 has a width of R and a height of L; the input end of the fifteenth batch normalization layer receives all of the feature maps in F4; its output end outputs 32 feature maps, whose set is denoted as P15, where each feature map in P15 has a width of R and a height of L; the input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20; its output end outputs 96 feature maps, whose set is denoted as C7, where each feature map in C7 has a width of R and a height of L;
for the upsampling framework, the input end of the first upsampling layer receives all of the feature maps in Z4; the output end of the first upsampling layer outputs 512 feature maps, whose set is denoted as Y1, where each feature map in Y1 has a width of R/8 and a height of L/8; the input end of the tenth convolutional layer receives all of the feature maps in Y1; its output end outputs 256 feature maps, whose set is denoted as J10, where each feature map in J10 has a width of R/8 and a height of L/8; the input end of the seventeenth batch normalization layer receives all of the feature maps in J10; its output end outputs 256 feature maps, whose set is denoted as P17, where each feature map in P17 has a width of R/8 and a height of L/8; the input end of the seventeenth activation layer receives all of the feature maps in P17; its output end outputs 256 feature maps, whose set is denoted as H17, where each feature map in H17 has a width of R/8 and a height of L/8; the input end of the second upsampling layer receives all of the feature maps in H17; its output end outputs 256 feature maps, whose set is denoted as Y2, where each feature map in Y2 has a width of R/4 and a height of L/4; the input end of the eleventh convolutional layer receives all of the feature maps in Y2; its output end outputs 128 feature maps, whose set is denoted as J11, where each feature map in J11 has a width of R/4 and a height of L/4; the input end of the eighteenth batch normalization layer receives all of the feature maps in J11; its output end outputs 128 feature maps, whose set is denoted as P18, where each feature map in P18 has a width of R/4 and a height of L/4; the input end of the eighteenth activation layer receives all of the feature maps in P18; its output end outputs 128 feature maps, whose set is denoted as H18, where each feature map in H18 has a width of R/4 and a height of L/4; the input end of the third upsampling layer receives all of the feature maps in H18; its output end outputs 128 feature maps, whose set is denoted as Y3, where each feature map in Y3 has a width of R/2 and a height of L/2; the input end of the twelfth convolutional layer receives all of the feature maps in Y3; its output end outputs 64 feature maps, whose set is denoted as J12, where each feature map in J12 has a width of R/2 and a height of L/2; the input end of the nineteenth batch normalization layer receives all of the feature maps in J12; its output end outputs 64 feature maps, whose set is denoted as P19, where each feature map in P19 has a width of R/2 and a height of L/2; the input end of the nineteenth activation layer receives all of the feature maps in P19; its output end outputs 64 feature maps, whose set is denoted as H19, where each feature map in H19 has a width of R/2 and a height of L/2; the input end of the fourth upsampling layer receives all of the feature maps in H19; its output end outputs 64 feature maps, whose set is denoted as Y4, where each feature map in Y4 has a width of R and a height of L; the input end of the thirteenth convolutional layer receives all of the feature maps in Y4; its output end outputs 32 feature maps, whose set is denoted as J13, where each feature map in J13 has a width of R and a height of L; the input end of the twentieth batch normalization layer receives all of the feature maps in J13; its output end outputs 32 feature maps, whose set is denoted as P20, where each feature map in P20 has a width of R and a height of L; the input end of the twentieth activation layer receives all of the feature maps in P20; its output end outputs 32 feature maps, whose set is denoted as H20, where each feature map in H20 has a width of R and a height of L;
for the output layer, the input end of the fifteenth activation layer receives the output C7 of the decoding framework; the output end of the fifteenth activation layer outputs 96 feature maps, whose set is denoted as H15, where each feature map in H15 has a width of R and a height of L; the input end of the ninth convolutional layer receives all of the feature maps in H15; its output end outputs 1 feature map, denoted as J9, with a width of R and a height of L; the input end of the sixteenth batch normalization layer receives the feature map in J9; its output end outputs 1 feature map, denoted as P16, with a width of R and a height of L; the input end of the sixteenth activation layer receives the feature map in P16; its output end outputs 1 feature map, denoted as H16, with a width of R and a height of L; the feature map in H16 is the estimated depth image corresponding to the original input image;
step 1_ 3: inputting each original monocular image in the training set as an original input image into a convolutional neural network for training to obtain an estimated depth image corresponding to each original monocular image in the training set, and inputting QnThe corresponding estimated depth image of (x, y) is noted as
Figure FDA0003061601450000121
Step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image
Figure FDA0003061601450000122
And
Figure FDA0003061601450000123
the value of the loss function in between is recorded as
Figure FDA0003061601450000124
step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model and N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; and then taking the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted as Wbest and bbest; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: letting Q(x', y') denote the monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x', y'), and L' denotes the height of Q(x', y');
step 2_2: inputting Q(x', y') into the trained convolutional neural network training model and predicting, by using Wbest and bbest, the corresponding predicted depth image, denoted as Qdepth(x', y').
2. The method of claim 1, wherein in step 1_4, Lossn is obtained by using a mean square error function.
CN201811246664.0A 2018-10-25 2018-10-25 Monocular vision depth estimation method Active CN109460815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811246664.0A CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811246664.0A CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Publications (2)

Publication Number Publication Date
CN109460815A CN109460815A (en) 2019-03-12
CN109460815B true CN109460815B (en) 2021-12-10

Family

ID=65608334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811246664.0A Active CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Country Status (1)

Country Link
CN (1) CN109460815B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11907829B2 (en) * 2019-03-14 2024-02-20 Infineon Technologies Ag FMCW radar with interference signal suppression using artificial neural network
DE102019106529A1 (en) 2019-03-14 2020-09-17 Infineon Technologies Ag FMCW RADAR WITH INTERFERENCE SUPPRESSION USING AN ARTIFICIAL NEURAL NETWORK
CN110414674B (en) * 2019-07-31 2021-09-10 浙江科技学院 Monocular depth estimation method based on residual error network and local refinement
CN111161166A (en) * 2019-12-16 2020-05-15 西安交通大学 Image moire eliminating method based on depth multi-resolution network
US20220301211A1 (en) * 2021-03-16 2022-09-22 Huan Liu Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886165A (en) * 2017-12-30 2018-04-06 北京工业大学 A kind of parallel-convolution neural net method based on CRT technology
CN108090472A (en) * 2018-01-12 2018-05-29 浙江大学 Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN108681692A (en) * 2018-04-10 2018-10-19 华南理工大学 Increase Building recognition method in a kind of remote sensing images based on deep learning newly

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018098230A1 (en) * 2016-11-22 2018-05-31 Massachusetts Institute Of Technology Systems and methods for training neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886165A (en) * 2017-12-30 2018-04-06 北京工业大学 A kind of parallel-convolution neural net method based on CRT technology
CN108090472A (en) * 2018-01-12 2018-05-29 浙江大学 Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN108681692A (en) * 2018-04-10 2018-10-19 华南理工大学 Increase Building recognition method in a kind of remote sensing images based on deep learning newly

Also Published As

Publication number Publication date
CN109460815A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109460815B (en) Monocular vision depth estimation method
CN109146944B (en) Visual depth estimation method based on depth separable convolutional neural network
CN110992275B (en) Refined single image rain removing method based on generation of countermeasure network
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN110334580A (en) The equipment fault classification method of changeable weight combination based on integrated increment
CN109410917A (en) Voice data classification method based on modified capsule network
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN109461177B (en) Monocular image depth prediction method based on neural network
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN111179187A (en) Single image rain removing method based on cyclic generation countermeasure network
CN113658189B (en) Cross-scale feature fusion real-time semantic segmentation method and system
CN113160265A (en) Construction method of prediction image for brain corpus callosum segmentation for corpus callosum state evaluation
CN110555461A (en) scene classification method and system based on multi-structure convolutional neural network feature fusion
CN111127331A (en) Image denoising method based on pixel-level global noise estimation coding and decoding network
CN114913493A (en) Lane line detection method based on deep learning
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN110633706B (en) Semantic segmentation method based on pyramid network
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN114639002A (en) Infrared and visible light image fusion method based on multi-mode characteristics
CN108805844A (en) A kind of lightweight Recurrent networks construction method based on priori filtering
CN110517272B (en) Deep learning-based blood cell segmentation method
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant