CN109460815B - Monocular vision depth estimation method - Google Patents
- Publication number: CN109460815B (application CN201811246664A)
- Authority
- CN
- China
- Prior art keywords
- layer
- output
- feature maps
- height
- width
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a monocular vision depth estimation method. First, a convolutional neural network is constructed, comprising an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding framework, a decoding framework and an up-sampling framework. Next, the monocular images in a training set are fed into the convolutional neural network as original input images for training, yielding an estimated depth image corresponding to each original monocular image in the training set. A loss function value is then computed between each estimated depth image and the corresponding real depth image, producing a trained convolutional neural network model together with an optimal weight vector and an optimal bias term. Finally, a monocular image to be predicted is input into the trained model, and the corresponding predicted depth image is obtained using the optimal weight vector and optimal bias term. The method has the advantage of high prediction accuracy.
Description
Technical Field
The invention relates to an image signal processing technology, in particular to a monocular visual depth estimation method.
Background
Rapid economic development has steadily raised living standards, and with people's growing demand for a good quality of life, transportation has become ever more convenient. As an important part of transportation, the automobile has received particular attention. With the rise of artificial intelligence, driverless vehicles have become one of the most popular topics of recent years, and interest has continued to grow as driverless cars approach mass production and deployment. Monocular visual depth estimation of the scene in front of the vehicle is one component of the driverless-driving field and can effectively safeguard the vehicle during driving.
Monocular visual depth estimation methods fall mainly into traditional methods and deep learning methods. Before deep learning emerged, depth estimation based on traditional methods produced results that fell far short of even the lowest expectations. Deep learning methods, trained end to end on large amounts of data, greatly improved the accuracy of the estimated depth. In the document "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", Eigen et al. proposed a two-scale neural network for depth estimation: a coarse-scale network predicts the global depth distribution, and a fine-scale network locally refines the depth map. Eigen et al. later extended this two-scale network to three scales: the first scale predicts a rough result from the whole image, the second scale optimizes this result at medium resolution, and the third scale finely refines it to obtain the predicted depth map.
Disclosure of Invention
The invention aims to provide a monocular visual depth estimation method which is high in prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set; denote the nth original monocular image in the training set as Qn(x,y), and denote the real depth image corresponding to Qn(x,y) as Dn(x,y); here N is a positive integer with N ≥ 100, n is a positive integer with 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R denotes the width of Qn(x,y) and Dn(x,y), L denotes the height of Qn(x,y) and Dn(x,y), and R and L are each divisible by 2;
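Since the hidden layer described below applies four 2 × 2 max-poolings that are later undone by stride-2 deconvolutions and up-samplings, exact size restoration in practice requires R and L to be multiples of 2^4 = 16, a stricter condition than divisibility by 2 alone. The following sketch checks this; it is illustrative only and the example sizes are assumptions, not values from the patent:

```python
def valid_input_size(r: int, l: int, num_pools: int = 4) -> bool:
    """Check that width r and height l survive num_pools 2x2 max-poolings
    and can be exactly restored by the matching stride-2 deconvolutions."""
    factor = 2 ** num_pools          # 16 for four pooling layers
    return r % factor == 0 and l % factor == 0

# Illustrative crop sizes (assumed, not taken from the patent):
print(valid_input_size(608, 160))   # True: 608/16 = 38, 160/16 = 10
print(valid_input_size(610, 160))   # False: 610 is not divisible by 16
```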
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
For the encoding framework: it consists of, arranged in sequence, a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first dilated convolutional layer, a sixth batch normalization layer, a sixth activation layer, a second dilated convolutional layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third dilated convolutional layer, an eighth batch normalization layer and an eighth activation layer. For the decoding framework: it consists of, arranged in sequence, a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer. For the up-sampling framework: it consists of, arranged in sequence, a first up-sampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer. For the output layer: it consists of, arranged in sequence, a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer.
The convolution kernels of the first through thirteenth convolutional layers, the first through third dilated convolutional layers and the first through fourth deconvolution layers are all of size 3 × 3. The numbers of convolution kernels are: 32 for the first convolutional layer; 64 for the second and third convolutional layers; 128 for the fourth and fifth convolutional layers; 256 for the first and second dilated convolutional layers; 512 for the third dilated convolutional layer; 256 for the first deconvolution layer and the sixth convolutional layer; 128 for the second deconvolution layer and the seventh convolutional layer; 64 for the third deconvolution layer and the eighth convolutional layer; 32 for the fourth deconvolution layer; 1 for the ninth convolutional layer; 256 for the tenth convolutional layer; 128 for the eleventh convolutional layer; 64 for the twelfth convolutional layer; and 32 for the thirteenth convolutional layer. The convolution strides of the first through thirteenth convolutional layers and of the first through third dilated convolutional layers take default values; the strides of the first through fourth deconvolution layers are 2 × 2; the parameters of the first through twentieth batch normalization layers take default values; the first through twentieth activation layers use ReLU as the activation function; the pooling stride of the first through fourth max pooling layers is 2 × 2; and the sampling stride of the first through fourth up-sampling layers is 2 × 2;
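The layer specification above fixes the spatial size and channel count at every stage: convolutions preserve spatial size, each 2 × 2 max-pooling halves it, and each Concatenate fusion sums the channel counts of its inputs. A short stdlib-only tracer (illustrative, not part of the claimed method; the example input size is an assumption) makes the bookkeeping explicit:

```python
def trace_shapes(r, l):
    """Trace (width, height, channels) through the encoding framework:
    convolutions keep the spatial size, each 2x2 max-pool halves it,
    and each Concatenate fusion sums its inputs' channel counts."""
    shapes = {}
    shapes["H1"] = (r, l, 32)               # conv1 -> BN -> ReLU
    shapes["Z1"] = (r // 2, l // 2, 32)     # max-pool 1
    shapes["C1"] = (r // 2, l // 2, 128)    # concat(P3, H2): 64 + 64
    shapes["Z2"] = (r // 4, l // 4, 128)    # max-pool 2
    shapes["C2"] = (r // 4, l // 4, 256)    # concat(P5, H4): 128 + 128
    shapes["Z3"] = (r // 8, l // 8, 256)    # max-pool 3
    shapes["C3"] = (r // 8, l // 8, 512)    # concat(P7, H6): 256 + 256
    shapes["Z4"] = (r // 16, l // 16, 512)  # max-pool 4
    shapes["H8"] = (r // 16, l // 16, 512)  # dilated conv 3 (encoder output)
    return shapes

# Assumed example size R = 608, L = 160:
print(trace_shapes(608, 160)["H8"])  # (38, 10, 512)
```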
For the encoding framework: the input end of the first convolutional layer receives the original input image from the output end of the input layer; its output end outputs 32 feature maps, whose set is denoted J1, each of width R and height L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1, each of width R and height L. The first activation layer receives all feature maps in P1 and outputs 32 feature maps, denoted H1, each of width R and height L. The first max pooling layer receives all feature maps in H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2. The second convolutional layer receives all feature maps in Z1 and outputs 64 feature maps, denoted J2, each of width R/2 and height L/2. The second batch normalization layer receives J2 and outputs 64 feature maps, denoted P2, each of width R/2 and height L/2. The second activation layer receives P2 and outputs 64 feature maps, denoted H2, each of width R/2 and height L/2. The third convolutional layer receives H2 and outputs 64 feature maps, denoted J3, each of width R/2 and height L/2. The third batch normalization layer receives J3 and outputs 64 feature maps, denoted P3, each of width R/2 and height L/2. The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1, each of width R/2 and height L/2. The third activation layer receives C1 and outputs 128 feature maps, denoted H3, each of width R/2 and height L/2. The second max pooling layer receives H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4. The fourth convolutional layer receives Z2 and outputs 128 feature maps, denoted J4, each of width R/4 and height L/4. The fourth batch normalization layer receives J4 and outputs 128 feature maps, denoted P4, each of width R/4 and height L/4. The fourth activation layer receives P4 and outputs 128 feature maps, denoted H4, each of width R/4 and height L/4. The fifth convolutional layer receives H4 and outputs 128 feature maps, denoted J5, each of width R/4 and height L/4. The fifth batch normalization layer receives J5 and outputs 128 feature maps, denoted P5, each of width R/4 and height L/4. The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2, each of width R/4 and height L/4. The fifth activation layer receives C2 and outputs 256 feature maps, denoted H5, each of width R/4 and height L/4. The third max pooling layer receives H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8. The first dilated convolutional layer receives Z3 and outputs 256 feature maps, denoted K1, each of width R/8 and height L/8. The sixth batch normalization layer receives K1 and outputs 256 feature maps, denoted P6, each of width R/8 and height L/8. The sixth activation layer receives P6 and outputs 256 feature maps, denoted H6, each of width R/8 and height L/8. The second dilated convolutional layer receives H6 and outputs 256 feature maps, denoted K2, each of width R/8 and height L/8. The seventh batch normalization layer receives K2 and outputs 256 feature maps, denoted P7, each of width R/8 and height L/8. The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3, each of width R/8 and height L/8. The seventh activation layer receives C3 and outputs 512 feature maps, denoted H7, each of width R/8 and height L/8. The fourth max pooling layer receives H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16. The third dilated convolutional layer receives Z4 and outputs 512 feature maps, denoted K3, each of width R/16 and height L/16. The eighth batch normalization layer receives K3 and outputs 512 feature maps, denoted P8, each of width R/16 and height L/16. The eighth activation layer receives P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, and each feature map in H8 has width R/16 and height L/16.
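The dilated convolutional layers used at the deeper encoder stages enlarge the receptive field without further pooling, because their kernel taps are spaced apart by the dilation rate. A minimal one-dimensional sketch (illustrative only; the signal and kernel values are assumptions) shows the mechanism:

```python
def dilated_conv1d(signal, kernel, dilation=2):
    """1-D dilated ('atrous') convolution with valid padding: kernel taps
    are spaced `dilation` apart, so a 3-tap kernel with dilation 2 covers
    5 input samples per output, enlarging the receptive field."""
    span = (len(kernel) - 1) * dilation + 1   # receptive field of one output
    out = []
    for i in range(len(signal) - span + 1):
        out.append(sum(kernel[k] * signal[i + k * dilation]
                       for k in range(len(kernel))))
    return out

# With dilation 1 this reduces to an ordinary convolution:
print(dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 0, -1], dilation=2))  # [-4, -4]
```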
For the decoding framework: the input end of the first deconvolution layer receives the output H8 of the encoding framework; its output end outputs 256 feature maps, denoted F1, each of width R/8 and height L/8. The ninth batch normalization layer receives F1 and outputs 256 feature maps, denoted P9, each of width R/8 and height L/8. The fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4, each of width R/8 and height L/8. The ninth activation layer receives C4 and outputs 512 feature maps, denoted H9, each of width R/8 and height L/8. The sixth convolutional layer receives H9 and outputs 256 feature maps, denoted J6, each of width R/8 and height L/8. The tenth batch normalization layer receives J6 and outputs 256 feature maps, denoted P10, each of width R/8 and height L/8. The tenth activation layer receives P10 and outputs 256 feature maps, denoted H10, each of width R/8 and height L/8. The second deconvolution layer receives H10 and outputs 128 feature maps, denoted F2, each of width R/4 and height L/4. The eleventh batch normalization layer receives F2 and outputs 128 feature maps, denoted P11, each of width R/4 and height L/4. The fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5, each of width R/4 and height L/4. The eleventh activation layer receives C5 and outputs 256 feature maps, denoted H11, each of width R/4 and height L/4. The seventh convolutional layer receives H11 and outputs 128 feature maps, denoted J7, each of width R/4 and height L/4. The twelfth batch normalization layer receives J7 and outputs 128 feature maps, denoted P12, each of width R/4 and height L/4. The twelfth activation layer receives P12 and outputs 128 feature maps, denoted H12, each of width R/4 and height L/4. The third deconvolution layer receives H12 and outputs 64 feature maps, denoted F3, each of width R/2 and height L/2. The thirteenth batch normalization layer receives F3 and outputs 64 feature maps, denoted P13, each of width R/2 and height L/2. The sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6, each of width R/2 and height L/2. The thirteenth activation layer receives C6 and outputs 128 feature maps, denoted H13, each of width R/2 and height L/2. The eighth convolutional layer receives H13 and outputs 64 feature maps, denoted J8, each of width R/2 and height L/2. The fourteenth batch normalization layer receives J8 and outputs 64 feature maps, denoted P14, each of width R/2 and height L/2. The fourteenth activation layer receives P14 and outputs 64 feature maps, denoted H14, each of width R/2 and height L/2. The fourth deconvolution layer receives H14 and outputs 32 feature maps, denoted F4, each of width R and height L. The fifteenth batch normalization layer receives F4 and outputs 32 feature maps, denoted P15, each of width R and height L. The seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and all feature maps in H20 and outputs 96 feature maps, denoted C7, each of width R and height L;
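The channel counts of the fused sets follow directly from the Concatenate fusion rule: the output channel count is the sum of the input branches' channel counts (their spatial sizes must match). A trivial sketch (illustrative only) confirms the figures quoted above, including the 96-channel C7:

```python
def concat_channels(*branches):
    """Channel count after a Concatenate fusion layer: the sum of the
    channel counts of its input branches (spatial sizes must agree)."""
    return sum(branches)

# Long-skip fusions in the decoding framework, per the layer description:
print(concat_channels(256, 256))    # C4: P9 + P7 -> 512
print(concat_channels(128, 128))    # C5: P11 + P5 -> 256
print(concat_channels(64, 64))      # C6: P13 + P3 -> 128
print(concat_channels(32, 32, 32))  # C7: P15 + H1 + H20 -> 96
```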
for the upsampling frame, the input of the first upsampling layer receives Z4512 feature maps are output from the output end of the first upsampling layer, and the set formed by all the output feature maps is recorded as Y1Wherein Y is1Each feature map of (1) has a width ofHas a height ofThe input terminal of the tenth convolutional layer receives Y1256 feature maps are output from the output terminal of the tenth convolutional layerIs denoted as J10Wherein, J10Each feature map of (1) has a width ofHas a height ofInput terminal of the seventeenth normalization layer receives J10256 feature maps are output from the output end of the seventeenth normalization layer, and the set of all the output feature maps is denoted as P17Wherein P is17Each feature map of (1) has a width ofHas a height ofThe input end of the seventeenth active layer receives P17256 feature maps are output from the output end of the seventeenth active layer, and the set of all output feature maps is recorded as H17Wherein H is17Each feature map of (1) has a width ofHas a height ofThe input of the second up-sampling layer receives H17256 feature maps are output from the output end of the second upsampling layer, and the set formed by all the output feature maps is marked as Y2Wherein Y is2Each feature map of (1) has a width ofHas a height ofAn input terminal of the eleventh convolution layer receives Y2All feature maps in (1), eleventh convolutionThe output end of the layer outputs 128 characteristic graphs, and the set formed by all the output characteristic graphs is recorded as J11Wherein, J11Each feature map of (1) has a width ofHas a height ofInput terminal of eighteenth normalization layer receives J11The eighteenth normalization layer outputs 128 feature maps, and the set of all feature maps is denoted as P18Wherein P is18Each feature map of (1) has a width ofHas a height ofThe input end of the eighteenth active layer receives P18The eighteenth active layer outputs 128 feature maps, and the 
set formed by all the output feature maps is recorded as H18Wherein H is18Each feature map of (1) has a width ofHas a height ofThe input of the third upsampling layer receives H18The output end of the third up-sampling layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as Y3Wherein Y is3Each feature map of (1) has a width ofHas a height ofThe input terminal of the twelfth convolution layer receives Y3The output end of the twelfth convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as J12Wherein, J12Each feature map of (1) has a width ofHas a height ofInput reception of the nineteenth batch normalization layer J12The output end of the nineteenth batch of normalization layers outputs 64 feature maps, and the set formed by all the output feature maps is marked as P19Wherein P is19Each feature map of (1) has a width ofHas a height ofThe input terminal of the nineteenth active layer receives P19The output end of the nineteenth activation layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as H19Wherein H is19Each feature map of (1) has a width ofHas a height ofThe input of the fourth upsampling layer receives H19The output end of the fourth up-sampling layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as Y4Wherein Y is4Each feature map in (1) has a width of R and a height of L; the input terminal of the thirteenth convolution layer receives Y4The output end of the thirteenth convolutional layer outputs 32 feature maps, and the set formed by all the output feature maps is denoted as J13Wherein, J13Each feature map of (1) has a width ofR and L are height; input of twentieth normalization layer receives J13The output end of the twentieth normalization layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as P20Wherein P is20Each feature map in (1) has a width of 
R and a height of L; the input terminal of the twentieth active layer receives P20The output end of the twentieth activation layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as H20Wherein H is20Each feature map in (1) has a width of R and a height of L;
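Each up-sampling layer doubles the spatial size with sampling stride 2 × 2. The patent does not state the interpolation mode, so the following sketch assumes nearest-neighbour repetition (the common default) purely for illustration:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2-D feature map given as a
    list of rows: every value is repeated along both axes, matching a
    sampling stride of 2x2. Interpolation mode is an assumption here."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                   # repeat each row
    return out

print(upsample2x([[1, 2],
                  [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```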
For the output layer: the input end of the fifteenth activation layer receives the output of the decoding framework, namely all feature maps in C7; its output end outputs 96 feature maps, denoted H15, each of width R and height L. The ninth convolutional layer receives H15 and outputs 1 feature map, denoted J9, of width R and height L. The sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16, of width R and height L. The sixteenth activation layer receives P16 and outputs 1 feature map, denoted H16, of width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image;
Step 1_3: input each original monocular image in the training set into the convolutional neural network as an original input image for training, obtaining the estimated depth image corresponding to each original monocular image in the training set, including the estimated depth image corresponding to Qn(x,y);
Step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth imageAndthe value of the loss function in between is recorded as
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times to obtain a trained convolutional neural network training model, and obtaining N multiplied by V loss function values; then finding out the loss function value with the minimum value from the N multiplied by V loss function values; and then, corresponding the weight vector and the bias item corresponding to the loss function value with the minimum value to be used as the optimal weight vector and the optimal bias item of the trained convolutional neural network training model, and correspondingly marking as WbestAnd bbest(ii) a Wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_ 1: let Q (x ', y') denote the monocular image to be predicted; wherein 1. ltoreq. x '. ltoreq.R', 1. ltoreq. y '. ltoreq.L', R 'denotes the width of Q (x', y '), L' denotes the height of Q (x ', y');
step 2_ 2: inputting Q (x ', y') into a trained convolutional neural network training model and utilizing WbestAnd bbestPredicting to obtain a predicted depth image corresponding to Q (x ', y'), and recording the predicted depth image as Qdepth(x',y')。
Compared with the prior art, the invention has the advantages that:
1) In the process of constructing the convolutional neural network, the method adopts skip-layer connections implemented with Concatenate fusion layers. Short-skip connections are used within the coding framework, namely the first, second and third Concatenate fusion layers; long-skip connections are used between the coding framework and the decoding framework, namely the fourth, fifth, sixth and seventh Concatenate fusion layers. Skip-layer connection favors multi-scale feature fusion and boundary preservation: the short-skip connections enrich the diversity of information in the coding process, and the long-skip connections compensate for the loss of original boundary information in the decoding part, so that the depth estimation of the convolutional neural network training model obtained by training is more accurate.
2) The method uses an end-to-end convolutional neural network training framework in which three perforated (dilated) convolution layers are used after the third maximum pooling layer of the coding framework to extract feature information; a perforated convolution layer enlarges the receptive field of the neurons and obtains more feature information without increasing the number of training parameters.
3) The hidden layer of the convolutional neural network created by the method comprises an encoding framework, a decoding framework and an up-sampling framework; the combination of the three frameworks enables the convolutional neural network training model obtained by training to extract information-rich features, so that depth information of high accuracy can be obtained and the precision of the depth estimation result is improved.
4) The predicted depth image obtained by the method has the same size as the original monocular image, which facilitates direct use of the depth information in the predicted depth image.
Drawings
FIG. 1 is a schematic diagram of the structure of the coding framework in the hidden layer of the convolutional neural network created in the method of the present invention;
FIG. 2 is a schematic diagram of the respective constituent structures of the decoding framework in the hidden layer of the convolutional neural network created in the method of the present invention and the output layer of the convolutional neural network created;
fig. 3 is a schematic structural diagram of the composition of an upsampling frame in the hidden layer of the convolutional neural network created in the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a monocular vision depth estimation method which is characterized by comprising a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set, record the n-th original monocular image in the training set as Qn(x, y), and denote the real depth image corresponding to Qn(x, y) accordingly; wherein N is a positive integer with N ≥ 100 (for example, N = 1000), n is a positive integer with 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R denotes the width of Qn(x, y) and of its real depth image, L denotes their height, and R and L are each divisible by 2; here, the original monocular images and their corresponding real depth images are provided directly by the KITTI official website.
Step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame.
For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein the input end of the input layer is required to receive the original input image with a width of R and a height of L.
As shown in fig. 1, the coding framework is composed of, arranged in sequence: a first convolution layer, a first batch normalization layer, a first activation layer, a first maximum pooling layer, a second convolution layer, a second batch normalization layer, a second activation layer, a third convolution layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second maximum pooling layer, a fourth convolution layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolution layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third maximum pooling layer, a first perforated (dilated) convolution layer, a sixth batch normalization layer, a sixth activation layer, a second perforated convolution layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth maximum pooling layer, a third perforated convolution layer, an eighth batch normalization layer and an eighth activation layer.
As shown in fig. 2, the decoding framework is composed of, arranged in sequence: a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolution layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer.
As shown in fig. 3, the up-sampling framework is composed of, arranged in sequence: a first up-sampling layer, a tenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolution layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolution layer, a twentieth batch normalization layer and a twentieth activation layer. As shown in fig. 2, the output layer is composed of a fifteenth activation layer, a ninth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer, arranged in sequence.
The convolution kernels of the first to thirteenth convolution layers, the first to third perforated convolution layers and the first to fourth deconvolution layers all have size 3 × 3. The numbers of convolution kernels are: 32 for the first convolution layer; 64 for the second and third convolution layers; 128 for the fourth and fifth convolution layers; 256 for the first and second perforated convolution layers; 512 for the third perforated convolution layer; 256 for the first deconvolution layer and the sixth convolution layer; 128 for the second deconvolution layer and the seventh convolution layer; 64 for the third deconvolution layer and the eighth convolution layer; 32 for the fourth deconvolution layer; 1 for the ninth convolution layer; 256 for the tenth convolution layer; 128 for the eleventh convolution layer; 64 for the twelfth convolution layer; and 32 for the thirteenth convolution layer. The convolution step lengths of the first to thirteenth convolution layers and the first to third perforated convolution layers adopt default values, the convolution step lengths of the first to fourth deconvolution layers are 2 × 2, the parameters of the first to twentieth batch normalization layers adopt default values, ReLU is adopted as the activation function of the first to twentieth activation layers, the pooling step length of the first to fourth maximum pooling layers is 2 × 2, and the sampling step length of the first to fourth up-sampling layers is 2 × 2.
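The 2 × 2 pooling and up-sampling steps above fix every feature-map size in the network; note they also imply that R and L should be divisible by 16 for the four halvings to yield integer sizes. A minimal sketch (hypothetical helper names, example image size assumed) that checks this bookkeeping:

```python
# Sketch: track feature-map sizes through the network described above.
# Assumption: each 2x2 max-pooling halves width/height, each 2x2
# up-sampling doubles them, and convolutions keep the spatial size.

def encoder_sizes(R, L):
    """Return (width, height) after each of the four max-pooling layers."""
    sizes = []
    w, h = R, L
    for _ in range(4):           # four max-pooling layers, step 2x2
        w, h = w // 2, h // 2
        sizes.append((w, h))
    return sizes

def upsample_sizes(w, h, steps=4):
    """Return (width, height) after each 2x2 up-sampling layer."""
    sizes = []
    for _ in range(steps):
        w, h = w * 2, h * 2
        sizes.append((w, h))
    return sizes

R, L = 640, 192                  # illustrative KITTI-like size, divisible by 16
enc = encoder_sizes(R, L)        # R/2 x L/2 down to R/16 x L/16
up = upsample_sizes(*enc[-1])    # four doublings back to R x L
assert up[-1] == (R, L)          # predicted depth matches the input size
```

This is why the predicted depth image has the same width R and height L as the original monocular image (advantage 4).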
For the coding framework: the input end of the first convolution layer receives the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, the set of which is denoted J1, each feature map in J1 having width R and height L; the first batch normalization layer receives all the feature maps in J1 and outputs 32 feature maps denoted P1 (width R, height L); the first activation layer receives P1 and outputs 32 feature maps denoted H1 (width R, height L); the first maximum pooling layer receives H1 and outputs 32 feature maps denoted Z1 (width R/2, height L/2). The second convolution layer receives Z1 and outputs 64 feature maps denoted J2 (R/2 × L/2); the second batch normalization layer receives J2 and outputs 64 feature maps denoted P2 (R/2 × L/2); the second activation layer receives P2 and outputs 64 feature maps denoted H2 (R/2 × L/2); the third convolution layer receives H2 and outputs 64 feature maps denoted J3 (R/2 × L/2); the third batch normalization layer receives J3 and outputs 64 feature maps denoted P3 (R/2 × L/2). The first Concatenate fusion layer receives all the feature maps in P3 and all the feature maps in H2 and outputs 128 feature maps denoted C1 (R/2 × L/2); the third activation layer receives C1 and outputs 128 feature maps denoted H3 (R/2 × L/2); the second maximum pooling layer receives H3 and outputs 128 feature maps denoted Z2 (width R/4, height L/4). The fourth convolution layer receives Z2 and outputs 128 feature maps denoted J4 (R/4 × L/4); the fourth batch normalization layer receives J4 and outputs 128 feature maps denoted P4 (R/4 × L/4); the fourth activation layer receives P4 and outputs 128 feature maps denoted H4 (R/4 × L/4); the fifth convolution layer receives H4 and outputs 128 feature maps denoted J5 (R/4 × L/4); the fifth batch normalization layer receives J5 and outputs 128 feature maps denoted P5 (R/4 × L/4). The second Concatenate fusion layer receives all the feature maps in P5 and all the feature maps in H4 and outputs 256 feature maps denoted C2 (R/4 × L/4); the fifth activation layer receives C2 and outputs 256 feature maps denoted H5 (R/4 × L/4); the third maximum pooling layer receives H5 and outputs 256 feature maps denoted Z3 (width R/8, height L/8). The first perforated convolution layer receives Z3 and outputs 256 feature maps denoted K1 (R/8 × L/8); the sixth batch normalization layer receives K1 and outputs 256 feature maps denoted P6 (R/8 × L/8); the sixth activation layer receives P6 and outputs 256 feature maps denoted H6 (R/8 × L/8); the second perforated convolution layer receives H6 and outputs 256 feature maps denoted K2 (R/8 × L/8); the seventh batch normalization layer receives K2 and outputs 256 feature maps denoted P7 (R/8 × L/8). The third Concatenate fusion layer receives all the feature maps in P7 and all the feature maps in H6 and outputs 512 feature maps denoted C3 (R/8 × L/8); the seventh activation layer receives C3 and outputs 512 feature maps denoted H7 (R/8 × L/8); the fourth maximum pooling layer receives H7 and outputs 512 feature maps denoted Z4 (width R/16, height L/16). The third perforated convolution layer receives Z4 and outputs 512 feature maps denoted K3 (R/16 × L/16); the eighth batch normalization layer receives K3 and outputs 512 feature maps denoted P8 (R/16 × L/16); the eighth activation layer receives P8 and outputs 512 feature maps denoted H8 (R/16 × L/16). H8 is the output of the coding framework.
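The perforated (dilated) convolution layers in the coding framework enlarge the receptive field without adding parameters. A small sketch of the standard receptive-field recursion illustrates the effect; the dilation rate of 2 used here is illustrative only, since the text does not specify the rates:

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions.
    layers: list of (kernel_size, dilation, stride) tuples."""
    rf, jump = 1, 1
    for k, d, s in layers:
        k_eff = d * (k - 1) + 1      # effective kernel size with "holes"
        rf += (k_eff - 1) * jump     # growth seen at the input
        jump *= s                    # distance between adjacent outputs
    return rf

# Three stacked 3x3 layers: dense vs. dilation rate 2 (rate assumed)
dense = receptive_field([(3, 1, 1)] * 3)      # 7x7
dilated = receptive_field([(3, 2, 1)] * 3)    # 13x13, same parameter count
assert dilated > dense
```

Both stacks have the same number of 3 × 3 kernels, so the dilated stack sees a wider context at no extra training cost, which is the rationale given for using perforated layers after the third maximum pooling layer.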
For the decoding framework: the input end of the first deconvolution layer receives the output H8 of the coding framework, and its output end outputs 256 feature maps denoted F1 (width R/8, height L/8); the ninth batch normalization layer receives F1 and outputs 256 feature maps denoted P9 (R/8 × L/8). The fourth Concatenate fusion layer receives all the feature maps in P9 and all the feature maps in P7 and outputs 512 feature maps denoted C4 (R/8 × L/8); the ninth activation layer receives C4 and outputs 512 feature maps denoted H9 (R/8 × L/8); the sixth convolution layer receives H9 and outputs 256 feature maps denoted J6 (R/8 × L/8); the tenth batch normalization layer receives J6 and outputs 256 feature maps denoted P10 (R/8 × L/8); the tenth activation layer receives P10 and outputs 256 feature maps denoted H10 (R/8 × L/8). The second deconvolution layer receives H10 and outputs 128 feature maps denoted F2 (width R/4, height L/4); the eleventh batch normalization layer receives F2 and outputs 128 feature maps denoted P11 (R/4 × L/4). The fifth Concatenate fusion layer receives all the feature maps in P11 and all the feature maps in P5 and outputs 256 feature maps denoted C5 (R/4 × L/4); the eleventh activation layer receives C5 and outputs 256 feature maps denoted H11 (R/4 × L/4); the seventh convolution layer receives H11 and outputs 128 feature maps denoted J7 (R/4 × L/4); the twelfth batch normalization layer receives J7 and outputs 128 feature maps denoted P12 (R/4 × L/4); the twelfth activation layer receives P12 and outputs 128 feature maps denoted H12 (R/4 × L/4). The third deconvolution layer receives H12 and outputs 64 feature maps denoted F3 (width R/2, height L/2); the thirteenth batch normalization layer receives F3 and outputs 64 feature maps denoted P13 (R/2 × L/2). The sixth Concatenate fusion layer receives all the feature maps in P13 and all the feature maps in P3 and outputs 128 feature maps denoted C6 (R/2 × L/2); the thirteenth activation layer receives C6 and outputs 128 feature maps denoted H13 (R/2 × L/2); the eighth convolution layer receives H13 and outputs 64 feature maps denoted J8 (R/2 × L/2); the fourteenth batch normalization layer receives J8 and outputs 64 feature maps denoted P14 (R/2 × L/2); the fourteenth activation layer receives P14 and outputs 64 feature maps denoted H14 (R/2 × L/2). The fourth deconvolution layer receives H14 and outputs 32 feature maps denoted F4 (width R, height L); the fifteenth batch normalization layer receives F4 and outputs 32 feature maps denoted P15 (width R, height L). The seventh Concatenate fusion layer receives all the feature maps in P15, all the feature maps in H1 and all the feature maps in H20 output by the up-sampling framework, and outputs 96 feature maps denoted C7, each feature map in C7 having width R and height L.
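Each Concatenate fusion layer stacks its input sets of feature maps along the channel dimension. For the seventh fusion layer the text names P15 and H1 as inputs yet states 96 output maps; since each of those sets has 32 maps and the up-sampling framework's H20 also has 32 maps of size R × L, H20 is presumably the third input (32 + 32 + 32 = 96). A NumPy sketch of this channel-wise fusion, with illustrative shapes:

```python
import numpy as np

R, L = 64, 32                        # illustrative spatial size only
P15 = np.zeros((32, L, R))           # 32 maps from the 15th batch norm layer
H1  = np.zeros((32, L, R))           # 32 maps from the 1st activation layer
H20 = np.zeros((32, L, R))           # 32 maps from the up-sampling framework
                                     # (inferred third input, see text)

# Concatenate fusion = stacking along the channel axis (axis 0 here)
C7 = np.concatenate([P15, H1, H20], axis=0)
assert C7.shape == (96, L, R)        # 96 fused feature maps of width R, height L
```

Stacking (rather than adding) preserves each source's maps unchanged, which is what lets the long-skip connections carry original boundary information into the decoder.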
For the up-sampling framework: the input end of the first up-sampling layer receives Z4, and its output end outputs 512 feature maps denoted Y1 (width R/8, height L/8); the tenth convolution layer receives Y1 and outputs 256 feature maps denoted J10 (R/8 × L/8); the seventeenth batch normalization layer receives J10 and outputs 256 feature maps denoted P17 (R/8 × L/8); the seventeenth activation layer receives P17 and outputs 256 feature maps denoted H17 (R/8 × L/8). The second up-sampling layer receives H17 and outputs 256 feature maps denoted Y2 (width R/4, height L/4); the eleventh convolution layer receives Y2 and outputs 128 feature maps denoted J11 (R/4 × L/4); the eighteenth batch normalization layer receives J11 and outputs 128 feature maps denoted P18 (R/4 × L/4); the eighteenth activation layer receives P18 and outputs 128 feature maps denoted H18 (R/4 × L/4). The third up-sampling layer receives H18 and outputs 128 feature maps denoted Y3 (width R/2, height L/2); the twelfth convolution layer receives Y3 and outputs 64 feature maps denoted J12 (R/2 × L/2); the nineteenth batch normalization layer receives J12 and outputs 64 feature maps denoted P19 (R/2 × L/2); the nineteenth activation layer receives P19 and outputs 64 feature maps denoted H19 (R/2 × L/2). The fourth up-sampling layer receives H19 and outputs 64 feature maps denoted Y4, each feature map in Y4 having width R and height L; the thirteenth convolution layer receives Y4 and outputs 32 feature maps denoted J13 (width R, height L); the twentieth batch normalization layer receives J13 and outputs 32 feature maps denoted P20 (width R, height L); the twentieth activation layer receives P20 and outputs 32 feature maps denoted H20, each feature map in H20 having width R and height L.
For the output layer: the input end of the fifteenth activation layer receives the output of the decoding framework, namely all the feature maps in C7, and its output end outputs 96 feature maps denoted H15, each of width R and height L; the ninth convolution layer receives H15 and outputs 1 feature map denoted J9 (width R, height L); the sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map denoted P16 (width R, height L); the sixteenth activation layer receives P16 and outputs 1 feature map denoted H16 (width R, height L). The feature map in H16 is the estimated depth image corresponding to the original input image.
Step 1_3: input each original monocular image in the training set, as an original input image, into the convolutional neural network for training to obtain the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to Qn(x, y) is denoted accordingly.
Step 1_4: calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image; the loss function value is obtained by using a mean square error function.
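The mean square error of Step 1_4 averages the squared per-pixel differences between the estimated and the real depth image. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mse_loss(pred, truth):
    """Mean square error between an estimated and a real depth image,
    averaged over all pixels."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.mean((pred - truth) ** 2)

# Two-pixel example: errors 0 and 2 give (0 + 4) / 2 = 2.0
assert mse_loss([[1.0, 2.0]], [[1.0, 4.0]]) == 2.0
```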
Step 1_5: repeat step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model together with N × V loss function values; then find the loss function value with the smallest value among the N × V loss function values; then take the weight vector and the bias term corresponding to that smallest loss function value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1, and in this example V = 20.
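Step 1_5 amounts to a min-loss checkpoint selection over the N × V recorded loss values. A sketch of that bookkeeping, with `params` standing in for the weight vector and bias term (names assumed, not from the patent):

```python
def select_best(loss_history):
    """Pick the parameters with the smallest recorded loss.
    loss_history: list of (loss_value, params) pairs collected over
    the N x V loss evaluations of steps 1_3 and 1_4."""
    best_loss, best_params = min(loss_history, key=lambda t: t[0])
    return best_loss, best_params

# Illustrative history: the 7th pass happened to give the lowest loss
history = [(0.9, "params_pass_1"), (0.4, "params_pass_7"), (0.6, "params_pass_20")]
best_loss, W_best = select_best(history)
assert (best_loss, W_best) == (0.4, "params_pass_7")
```

In practice this means saving the model parameters whenever a new minimum loss is observed, so that Wbest and bbest are available after the V repetitions.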
The test stage process comprises the following specific steps:
Step 2_1: let Q(x', y') denote the monocular image to be predicted, where 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x', y'), and L' denotes the height of Q(x', y').
Step 2_2: input Q(x', y') into the trained convolutional neural network training model and, using Wbest and bbest, predict the corresponding predicted depth image, denoted Qdepth(x', y').
To verify the feasibility and effectiveness of the method of the invention, experiments were carried out on it.
In the method, the monocular images forming the training set and the monocular images used for testing are both taken from the KITTI official website, so the accuracy of the method is analyzed by testing directly on the test data set provided by the KITTI official website. Each monocular image in the test data set is input, as a monocular image to be predicted, into the trained deep convolutional neural network training model; the optimal weight vector Wbest and optimal bias term bbest obtained in the training stage are loaded, and the corresponding predicted depth image is obtained.
Here, 6 common objective parameters of monocular visual depth prediction evaluation are used as evaluation indices, namely: root mean square error (rms), logarithmic root mean square error (log_rms), average logarithmic error (average log10 error, log10), and the threshold accuracies (thr) δ1, δ2, δ3. Lower values of the root mean square error, logarithmic root mean square error and average logarithmic error indicate that the predicted depth image is closer to the real depth image, while higher values of δ1, δ2, δ3 indicate higher accuracy of the predicted depth image. The root mean square error, logarithmic root mean square error, average logarithmic error, and δ1, δ2, δ3 reflecting the evaluation performance of the method of the invention are listed in Table 1. As can be seen from the data listed in Table 1, the difference between the predicted depth image obtained by the method of the invention and the real depth image is very small, which shows that the accuracy of the prediction results of the method is very high and reflects the feasibility and effectiveness of the method of the invention.
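The six indices can be computed as below. The threshold-accuracy definition δi = fraction of pixels with max(pred/gt, gt/pred) < 1.25^i follows the usual KITTI convention, which the patent does not spell out, so it is an assumption here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """rms, log_rms, log10, and (delta1, delta2, delta3) for positive depths."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    log_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = (float(np.mean(ratio < 1.25 ** i)) for i in (1, 2, 3))
    return rms, log_rms, log10, (d1, d2, d3)

# A perfect prediction gives zero errors and all threshold accuracies equal to 1.
print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))
```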
TABLE 1 Evaluation indices comparing the predicted depth images obtained by the method of the present invention with the real depth images
Claims (2)
1. A monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original monocular images and the real depth image corresponding to each original monocular image to form a training set, denoting the nth original monocular image in the training set as Qn(x, y), and denoting the real depth image in the training set corresponding to Qn(x, y) correspondingly; wherein N is a positive integer with N ≥ 100, n is a positive integer with 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x, y) and of the corresponding real depth image, L represents their height, and R and L are each divisible by 2;
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
for the coding frame, the coding frame consists of a first convolution layer, a first batch normalization layer, a first activation layer, a first maximum pooling layer, a second convolution layer, a second batch normalization layer, a second activation layer, a third convolution layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second maximum pooling layer, a fourth convolution layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolution layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third maximum pooling layer, a first convolution layer with holes, a sixth batch normalization layer, a sixth activation layer, a second convolution layer with holes, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth maximum pooling layer, a third convolution layer with holes, an eighth batch normalization layer and an eighth activation layer, which are arranged in sequence; for the decoding frame, the decoding frame consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolution layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, which are arranged in sequence; for the up-sampling frame, the up-sampling frame consists of a first up-sampling layer, a tenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolution layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolution layer, a twentieth batch normalization layer and a twentieth activation layer, which are arranged in sequence; and for the output layer, the output layer consists of a fifteenth activation layer, a ninth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer, which are arranged in sequence; wherein the convolution kernel sizes of the first to thirteenth convolution layers, the first to third convolution layers with holes and the first to fourth deconvolution layers are all 3 × 3; the number of convolution kernels of the first convolution layer is 32, of the second and third convolution layers is 64, of the fourth and fifth convolution layers is 128, of the first and second convolution layers with holes is 256, of the third convolution layer with holes is 512, of the first deconvolution layer and the sixth convolution layer is 256, of the second deconvolution layer and the seventh convolution layer is 128, of the third deconvolution layer and the eighth convolution layer is 64, of the fourth deconvolution layer is 32, of the ninth convolution layer is 1, of the tenth convolution layer is 256, of the eleventh convolution layer is 128, of the twelfth convolution layer is 64, and of the thirteenth convolution layer is 32; the convolution step sizes of the first to thirteenth convolution layers and the first to third convolution layers with holes adopt default values; the convolution step sizes of the first to fourth deconvolution layers are 2 × 2; the parameters of the first to twentieth batch normalization layers adopt default values; the activation functions of the first to twentieth activation layers all adopt ReLU; the pooling step sizes of the first to fourth maximum pooling layers are 2 × 2; and the sampling step sizes of the first to fourth up-sampling layers are 2 × 2;
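The "convolution layers with holes" above are dilated (atrous) convolutions; with a 3 × 3 kernel and padding equal to the dilation rate, they keep the input's spatial size, consistent with the widths and heights listed for K1 to K3 below. The dilation rate of 2 in this sketch is an assumption, since the claim does not state the rate.

```python
import torch
import torch.nn as nn

# A 3x3 convolution "with holes": dilation enlarges the receptive field
# without pooling; padding = dilation preserves the spatial size.
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 256, 16, 32)  # e.g. 256 feature maps of size L/8 x R/8
y = atrous(x)
print(y.shape)                   # torch.Size([1, 256, 16, 32])
```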
for the coding frame, the input end of the first convolution layer receives the original input image output by the output end of the input layer; the first convolution layer outputs 32 feature maps, the set of which is denoted J1, each of width R and height L; the first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1, each of width R and height L; the first activation layer receives P1 and outputs 32 feature maps, denoted H1, each of width R and height L; the first maximum pooling layer receives H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2; the second convolution layer receives Z1 and outputs 64 feature maps, denoted J2, each of width R/2 and height L/2; the second batch normalization layer receives J2 and outputs 64 feature maps, denoted P2, each of width R/2 and height L/2; the second activation layer receives P2 and outputs 64 feature maps, denoted H2, each of width R/2 and height L/2; the third convolution layer receives H2 and outputs 64 feature maps, denoted J3, each of width R/2 and height L/2; the third batch normalization layer receives J3 and outputs 64 feature maps, denoted P3, each of width R/2 and height L/2; the first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1, each of width R/2 and height L/2; the third activation layer receives C1 and outputs 128 feature maps, denoted H3, each of width R/2 and height L/2; the second maximum pooling layer receives H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4; the fourth convolution layer receives Z2 and outputs 128 feature maps, denoted J4, each of width R/4 and height L/4; the fourth batch normalization layer receives J4 and outputs 128 feature maps, denoted P4, each of width R/4 and height L/4; the fourth activation layer receives P4 and outputs 128 feature maps, denoted H4, each of width R/4 and height L/4; the fifth convolution layer receives H4 and outputs 128 feature maps, denoted J5, each of width R/4 and height L/4; the fifth batch normalization layer receives J5 and outputs 128 feature maps, denoted P5, each of width R/4 and height L/4; the second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2, each of width R/4 and height L/4; the fifth activation layer receives C2 and outputs 256 feature maps, denoted H5, each of width R/4 and height L/4; the third maximum pooling layer receives H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8; the first convolution layer with holes receives Z3 and outputs 256 feature maps, denoted K1, each of width R/8 and height L/8; the sixth batch normalization layer receives K1 and outputs 256 feature maps, denoted P6, each of width R/8 and height L/8; the sixth activation layer receives P6 and outputs 256 feature maps, denoted H6, each of width R/8 and height L/8; the second convolution layer with holes receives H6 and outputs 256 feature maps, denoted K2, each of width R/8 and height L/8; the seventh batch normalization layer receives K2 and outputs 256 feature maps, denoted P7, each of width R/8 and height L/8; the third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3, each of width R/8 and height L/8; the seventh activation layer receives C3 and outputs 512 feature maps, denoted H7, each of width R/8 and height L/8; the fourth maximum pooling layer receives H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16; the third convolution layer with holes receives Z4 and outputs 512 feature maps, denoted K3, each of width R/16 and height L/16; the eighth batch normalization layer receives K3 and outputs 512 feature maps, denoted P8, each of width R/16 and height L/16; the eighth activation layer receives P8 and outputs 512 feature maps, denoted H8; H8 is the output of the coding frame, and each feature map in H8 has width R/16 and height L/16;
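One encoder stage of the pattern above, conv-BN-ReLU followed by a conv-BN whose output is Concatenate-fused with the earlier activation before pooling (e.g. P3 with H2 giving C1), can be sketched in PyTorch. This is an illustrative reading, not the patented implementation; strides and paddings are assumed defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderStage(nn.Module):
    """Sketch of the second encoder stage: Z1 -> H2 -> P3, fuse (P3, H2) -> C1,
    then ReLU and 2x2 max pooling down to Z2."""
    def __init__(self):
        super().__init__()
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)   # second convolution layer
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, 3, padding=1)   # third convolution layer
        self.bn3 = nn.BatchNorm2d(64)
        self.pool = nn.MaxPool2d(2)                    # second maximum pooling layer

    def forward(self, z1):                 # z1: 32 maps at R/2 x L/2
        h2 = F.relu(self.bn2(self.conv2(z1)))
        p3 = self.bn3(self.conv3(h2))
        c1 = torch.cat([p3, h2], dim=1)    # first Concatenate fusion: 64+64 = 128 maps
        return self.pool(F.relu(c1))       # Z2: 128 maps at R/4 x L/4

z1 = torch.randn(1, 32, 32, 64)
out = EncoderStage()(z1)
print(out.shape)  # torch.Size([1, 128, 16, 32])
```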
For the decoding frame, the input end of the first deconvolution layer receives the output H8 of the coding frame; the first deconvolution layer outputs 256 feature maps, the set of which is denoted F1, each of width R/8 and height L/8; the ninth batch normalization layer receives F1 and outputs 256 feature maps, denoted P9, each of width R/8 and height L/8; the fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4, each of width R/8 and height L/8; the ninth activation layer receives C4 and outputs 512 feature maps, denoted H9, each of width R/8 and height L/8; the sixth convolution layer receives H9 and outputs 256 feature maps, denoted J6, each of width R/8 and height L/8; the tenth batch normalization layer receives J6 and outputs 256 feature maps, denoted P10, each of width R/8 and height L/8; the tenth activation layer receives P10 and outputs 256 feature maps, denoted H10, each of width R/8 and height L/8; the second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2, each of width R/4 and height L/4; the eleventh batch normalization layer receives F2 and outputs 128 feature maps, denoted P11, each of width R/4 and height L/4; the fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5, each of width R/4 and height L/4; the eleventh activation layer receives C5 and outputs 256 feature maps, denoted H11, each of width R/4 and height L/4; the seventh convolution layer receives H11 and outputs 128 feature maps, denoted J7, each of width R/4 and height L/4; the twelfth batch normalization layer receives J7 and outputs 128 feature maps, denoted P12, each of width R/4 and height L/4; the twelfth activation layer receives P12 and outputs 128 feature maps, denoted H12, each of width R/4 and height L/4; the third deconvolution layer receives H12 and outputs 64 feature maps, denoted F3, each of width R/2 and height L/2; the thirteenth batch normalization layer receives F3 and outputs 64 feature maps, denoted P13, each of width R/2 and height L/2; the sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6, each of width R/2 and height L/2; the thirteenth activation layer receives C6 and outputs 128 feature maps, denoted H13, each of width R/2 and height L/2; the eighth convolution layer receives H13 and outputs 64 feature maps, denoted J8, each of width R/2 and height L/2; the fourteenth batch normalization layer receives J8 and outputs 64 feature maps, denoted P14, each of width R/2 and height L/2; the fourteenth activation layer receives P14 and outputs 64 feature maps, denoted H14, each of width R/2 and height L/2; the fourth deconvolution layer receives H14 and outputs 32 feature maps, denoted F4, each of width R and height L; the fifteenth batch normalization layer receives F4 and outputs 32 feature maps, denoted P15, each of width R and height L; the seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and all feature maps in H20 and outputs 96 feature maps, denoted C7, each of width R and height L;
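One decoder stage of the pattern above, stride-2 deconvolution, batch normalization, Concatenate fusion with the matching encoder maps, ReLU, then a convolution that halves the channel count (e.g. F1/P9 fused with P7 giving C4, then J6), can be sketched as follows. The deconvolution padding settings are assumptions needed to double the spatial size exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """Sketch of one decoder stage: deconv -> BN -> concat skip -> ReLU -> conv."""
    def __init__(self, in_ch=512, out_ch=256):
        super().__init__()
        # 3x3 transposed conv with stride 2; padding/output_padding chosen so
        # the spatial size exactly doubles (an assumption, not in the claim).
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                         padding=1, output_padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.conv = nn.Conv2d(out_ch * 2, out_ch, 3, padding=1)

    def forward(self, x, skip):
        p = self.bn(self.deconv(x))      # doubles width and height
        c = torch.cat([p, skip], dim=1)  # Concatenate fusion with encoder maps
        return self.conv(F.relu(c))

x = torch.randn(1, 512, 4, 8)      # H8-like input at R/16 x L/16
skip = torch.randn(1, 256, 8, 16)  # P7-like encoder maps at R/8 x L/8
out = DecoderStage()(x, skip)
print(out.shape)  # torch.Size([1, 256, 8, 16])
```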
for the up-sampling frame, the input end of the first up-sampling layer receives all feature maps in Z4; the first up-sampling layer outputs 512 feature maps, the set of which is denoted Y1, each of width R/8 and height L/8; the tenth convolution layer receives Y1 and outputs 256 feature maps, denoted J10, each of width R/8 and height L/8; the seventeenth batch normalization layer receives J10 and outputs 256 feature maps, denoted P17, each of width R/8 and height L/8; the seventeenth activation layer receives P17 and outputs 256 feature maps, denoted H17, each of width R/8 and height L/8; the second up-sampling layer receives H17 and outputs 256 feature maps, denoted Y2, each of width R/4 and height L/4; the eleventh convolution layer receives Y2 and outputs 128 feature maps, denoted J11, each of width R/4 and height L/4; the eighteenth batch normalization layer receives J11 and outputs 128 feature maps, denoted P18, each of width R/4 and height L/4; the eighteenth activation layer receives P18 and outputs 128 feature maps, denoted H18, each of width R/4 and height L/4; the third up-sampling layer receives H18 and outputs 128 feature maps, denoted Y3, each of width R/2 and height L/2; the twelfth convolution layer receives Y3 and outputs 64 feature maps, denoted J12, each of width R/2 and height L/2; the nineteenth batch normalization layer receives J12 and outputs 64 feature maps, denoted P19, each of width R/2 and height L/2; the nineteenth activation layer receives P19 and outputs 64 feature maps, denoted H19, each of width R/2 and height L/2; the fourth up-sampling layer receives H19 and outputs 64 feature maps, denoted Y4, each of width R and height L; the thirteenth convolution layer receives Y4 and outputs 32 feature maps, denoted J13, each of width R and height L; the twentieth batch normalization layer receives J13 and outputs 32 feature maps, denoted P20, each of width R and height L; the twentieth activation layer receives P20 and outputs 32 feature maps, denoted H20, each of width R and height L;
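The whole up-sampling frame is four repetitions of (up-sample ×2, convolution, batch normalization, ReLU) that take Z4 (512 maps at R/16 × L/16) back to 32 maps at R × L (H20). A compact PyTorch sketch, with nearest-neighbour up-sampling assumed since the claim does not name the interpolation mode:

```python
import torch
import torch.nn as nn

# Four (upsample x2, 3x3 conv, BN, ReLU) stages: 512 -> 256 -> 128 -> 64 -> 32.
stages = nn.Sequential(*[
    layer
    for ch_in, ch_out in [(512, 256), (256, 128), (128, 64), (64, 32)]
    for layer in (nn.Upsample(scale_factor=2),         # sampling step 2x2
                  nn.Conv2d(ch_in, ch_out, 3, padding=1),
                  nn.BatchNorm2d(ch_out),
                  nn.ReLU())
])

z4 = torch.randn(1, 512, 4, 8)  # Z4-like input at R/16 x L/16
h20 = stages(z4)
print(h20.shape)                # torch.Size([1, 32, 64, 128])
```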
for the output layer, the input end of the fifteenth activation layer receives the output of the decoding frame, namely C7; the fifteenth activation layer outputs 96 feature maps, the set of which is denoted H15, each of width R and height L; the ninth convolution layer receives all feature maps in H15 and outputs 1 feature map, denoted J9, of width R and height L; the sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16, of width R and height L; the sixteenth activation layer receives the feature map in P16 and outputs 1 feature map, denoted H16, of width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image;
step 1_ 3: inputting each original monocular image in the training set as an original input image into a convolutional neural network for training to obtain an estimated depth image corresponding to each original monocular image in the training set, and inputting QnThe corresponding estimated depth image of (x, y) is noted as
step 1_4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image, the loss function value being recorded correspondingly;
step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model and N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; then taking the weight vector and bias term corresponding to the loss function value with the smallest value as the optimal weight vector and optimal bias term of the trained convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: letting Q(x', y') denote the monocular image to be predicted, wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x', y'), and L' denotes the height of Q(x', y');
step 2_2: inputting Q(x', y') into the trained convolutional neural network training model and predicting, using Wbest and bbest, the corresponding predicted depth image, denoted Qdepth(x', y').
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811246664.0A CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811246664.0A CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109460815A CN109460815A (en) | 2019-03-12 |
CN109460815B true CN109460815B (en) | 2021-12-10 |
Family
ID=65608334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811246664.0A Active CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109460815B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11907829B2 (en) * | 2019-03-14 | 2024-02-20 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
DE102019106529A1 (en) | 2019-03-14 | 2020-09-17 | Infineon Technologies Ag | FMCW RADAR WITH INTERFERENCE SUPPRESSION USING AN ARTIFICIAL NEURAL NETWORK |
CN110414674B (en) * | 2019-07-31 | 2021-09-10 | 浙江科技学院 | Monocular depth estimation method based on residual error network and local refinement |
CN111161166A (en) * | 2019-12-16 | 2020-05-15 | 西安交通大学 | Image moire eliminating method based on depth multi-resolution network |
US20220301211A1 (en) * | 2021-03-16 | 2022-09-22 | Huan Liu | Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886165A (en) * | 2017-12-30 | 2018-04-06 | 北京工业大学 | A kind of parallel-convolution neural net method based on CRT technology |
CN108090472A (en) * | 2018-01-12 | 2018-05-29 | 浙江大学 | Pedestrian based on multichannel uniformity feature recognition methods and its system again |
CN108681692A (en) * | 2018-04-10 | 2018-10-19 | 华南理工大学 | Increase Building recognition method in a kind of remote sensing images based on deep learning newly |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018098230A1 (en) * | 2016-11-22 | 2018-05-31 | Massachusetts Institute Of Technology | Systems and methods for training neural networks |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109460815B (en) | Monocular vision depth estimation method | |
CN109146944B (en) | Visual depth estimation method based on depth separable convolutional neural network | |
CN110992275B (en) | Refined single image rain removing method based on generation of countermeasure network | |
CN113159051B (en) | Remote sensing image lightweight semantic segmentation method based on edge decoupling | |
CN110363716B (en) | High-quality reconstruction method for generating confrontation network composite degraded image based on conditions | |
CN110334580A (en) | The equipment fault classification method of changeable weight combination based on integrated increment | |
CN109410917A (en) | Voice data classification method based on modified capsule network | |
CN110009700B (en) | Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph | |
CN113240683B (en) | Attention mechanism-based lightweight semantic segmentation model construction method | |
CN109461177B (en) | Monocular image depth prediction method based on neural network | |
CN109448039B (en) | Monocular vision depth estimation method based on deep convolutional neural network | |
CN111179187A (en) | Single image rain removing method based on cyclic generation countermeasure network | |
CN113658189B (en) | Cross-scale feature fusion real-time semantic segmentation method and system | |
CN113160265A (en) | Construction method of prediction image for brain corpus callosum segmentation for corpus callosum state evaluation | |
CN110555461A (en) | scene classification method and system based on multi-structure convolutional neural network feature fusion | |
CN111127331A (en) | Image denoising method based on pixel-level global noise estimation coding and decoding network | |
CN114913493A (en) | Lane line detection method based on deep learning | |
CN109508639B (en) | Road scene semantic segmentation method based on multi-scale porous convolutional neural network | |
CN116109920A (en) | Remote sensing image building extraction method based on transducer | |
CN110633706B (en) | Semantic segmentation method based on pyramid network | |
CN116469100A (en) | Dual-band image semantic segmentation method based on Transformer | |
CN114639002A (en) | Infrared and visible light image fusion method based on multi-mode characteristics | |
CN108805844A (en) | A kind of lightweight Recurrent networks construction method based on priori filtering | |
CN110517272B (en) | Deep learning-based blood cell segmentation method | |
CN112418229A (en) | Unmanned ship marine scene image real-time segmentation method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||