CN109460815B - Monocular vision depth estimation method

Monocular vision depth estimation method

Info

Publication number
CN109460815B
CN109460815B (application CN201811246664.0A)
Authority
CN
China
Prior art keywords: layer, output, feature maps, height, width
Prior art date
Legal status
Active
Application number
CN201811246664.0A
Other languages
Chinese (zh)
Other versions
CN109460815A (en)
Inventor
周武杰
袁建中
吕思嘉
钱亚冠
向坚
张宇来
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201811246664.0A
Publication of CN109460815A
Application granted
Publication of CN109460815B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a monocular vision depth estimation method. First, a convolutional neural network is constructed, comprising an input layer, a hidden layer and an output layer, where the hidden layer comprises an encoding framework, a decoding framework and an upsampling framework. Next, the monocular images in a training set are used as original input images and input into the convolutional neural network for training, obtaining an estimated depth image corresponding to each original monocular image in the training set. Loss function values between the estimated depth images corresponding to the monocular images in the training set and the corresponding real depth images are then calculated, yielding a trained convolutional neural network model together with an optimal weight vector and an optimal bias term. Finally, a monocular image to be predicted is input into the trained convolutional neural network model, and the corresponding predicted depth image is obtained using the optimal weight vector and the optimal bias term. The method has the advantage of high prediction accuracy.

Description

Monocular vision depth estimation method
Technical Field
The invention relates to an image signal processing technology, in particular to a monocular visual depth estimation method.
Background
Rapid economic development has brought a continuous rise in living standards, and with people's growing demand for a good quality of life, transportation has become ever more convenient. As an important part of transportation, the automobile has received particular attention. Driven by the boom in artificial intelligence, driverless vehicles have become one of the most popular topics in recent years, and interest has only grown since driverless cars were announced for mass production and deployment. Monocular visual depth estimation of the scene in front of a vehicle is one component of the driverless-driving field, and it can effectively safeguard the vehicle during driving.
Monocular visual depth estimation methods fall mainly into traditional methods and deep learning methods. Before deep learning appeared, the results of depth estimation based on traditional methods fell far short of even minimal expectations; deep learning methods, which are trained end to end on large amounts of training data, greatly improved the accuracy of the estimated depth. In the paper "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", Eigen et al. proposed a two-scale neural network for depth estimation: a coarse-scale network predicts the global depth distribution, and a fine-scale network locally refines the depth map. Eigen et al. later extended this to three scales: the three-scale architecture first predicts a rough result over the whole image at the first scale, then refines that result at medium resolution at the second scale, and finally applies fine refinement at the third scale to obtain the predicted depth map.
Disclosure of Invention
The invention aims to provide a monocular visual depth estimation method with high prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set; the nth original monocular image in the training set is denoted Qn(x,y), and the real depth image in the training set corresponding to Qn(x,y) is denoted Qn^depth(x,y); wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x,y) and Qn^depth(x,y), L represents the height of Qn(x,y) and Qn^depth(x,y), and R and L are each divisible by 2;
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
For the encoding framework: it consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first perforated (dilated) convolutional layer, a sixth batch normalization layer, a sixth activation layer, a second perforated convolutional layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third perforated convolutional layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence.
For the decoding framework: it consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, arranged in sequence.
For the upsampling framework: it consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, arranged in sequence.
For the output layer: it consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, arranged in sequence.
The convolution kernel size of the first to thirteenth convolutional layers, the first to third perforated convolutional layers and the first to fourth deconvolution layers is 3 × 3 in every case. The number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second perforated convolutional layers, 512 for the third perforated convolutional layer, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer. The convolution strides of the first to thirteenth convolutional layers and the first to third perforated convolutional layers take default values; the stride of the first to fourth deconvolution layers is 2 × 2; the parameters of the first to twentieth batch normalization layers take default values; ReLU is used as the activation function of the first to twentieth activation layers; the pooling stride of the first to fourth max pooling layers is 2 × 2; and the sampling stride of the first to fourth upsampling layers is 2 × 2.
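The patent does not name an implementation framework. As a hedged illustration of how these hyperparameters (3 × 3 kernels, default convolution stride, default batch normalization parameters, ReLU, 2 × 2 pooling) might map onto tf.keras, one encoder stage could read as follows; the 'same' padding is an assumption made so that the stated feature-map sizes hold:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(None, None, 3))    # original input image, width R, height L
x = layers.Conv2D(32, 3, padding='same')(inp)  # first convolutional layer: 32 kernels of 3x3, default stride
x = layers.BatchNormalization()(x)             # first batch normalization layer, default parameters
x = layers.Activation('relu')(x)               # first activation layer: ReLU
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # first max pooling layer: pooling stride 2x2 -> R/2 x L/2
```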
For the encoding framework: the input end of the first convolutional layer receives the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, whose set is denoted J1; each feature map in J1 has a width of R and a height of L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1, each of width R and height L. The first activation layer receives all feature maps in P1 and outputs 32 feature maps, denoted H1, each of width R and height L. The first max pooling layer receives all feature maps in H1 and outputs 32 feature maps, denoted Z1, each of width R/2 and height L/2. The second convolutional layer receives all feature maps in Z1 and outputs 64 feature maps, denoted J2, each of width R/2 and height L/2. The second batch normalization layer receives all feature maps in J2 and outputs 64 feature maps, denoted P2, each of width R/2 and height L/2. The second activation layer receives all feature maps in P2 and outputs 64 feature maps, denoted H2, each of width R/2 and height L/2. The third convolutional layer receives all feature maps in H2 and outputs 64 feature maps, denoted J3, each of width R/2 and height L/2. The third batch normalization layer receives all feature maps in J3 and outputs 64 feature maps, denoted P3, each of width R/2 and height L/2. The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1, each of width R/2 and height L/2. The third activation layer receives all feature maps in C1 and outputs 128 feature maps, denoted H3, each of width R/2 and height L/2. The second max pooling layer receives all feature maps in H3 and outputs 128 feature maps, denoted Z2, each of width R/4 and height L/4. The fourth convolutional layer receives all feature maps in Z2 and outputs 128 feature maps, denoted J4, each of width R/4 and height L/4. The fourth batch normalization layer receives all feature maps in J4 and outputs 128 feature maps, denoted P4, each of width R/4 and height L/4. The fourth activation layer receives all feature maps in P4 and outputs 128 feature maps, denoted H4, each of width R/4 and height L/4. The fifth convolutional layer receives all feature maps in H4 and outputs 128 feature maps, denoted J5, each of width R/4 and height L/4. The fifth batch normalization layer receives all feature maps in J5 and outputs 128 feature maps, denoted P5, each of width R/4 and height L/4. The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2, each of width R/4 and height L/4. The fifth activation layer receives all feature maps in C2 and outputs 256 feature maps, denoted H5, each of width R/4 and height L/4. The third max pooling layer receives all feature maps in H5 and outputs 256 feature maps, denoted Z3, each of width R/8 and height L/8. The first perforated convolutional layer receives all feature maps in Z3 and outputs 256 feature maps, denoted K1, each of width R/8 and height L/8. The sixth batch normalization layer receives all feature maps in K1 and outputs 256 feature maps, denoted P6, each of width R/8 and height L/8. The sixth activation layer receives all feature maps in P6 and outputs 256 feature maps, denoted H6, each of width R/8 and height L/8. The second perforated convolutional layer receives all feature maps in H6 and outputs 256 feature maps, denoted K2, each of width R/8 and height L/8. The seventh batch normalization layer receives all feature maps in K2 and outputs 256 feature maps, denoted P7, each of width R/8 and height L/8. The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3, each of width R/8 and height L/8. The seventh activation layer receives all feature maps in C3 and outputs 512 feature maps, denoted H7, each of width R/8 and height L/8. The fourth max pooling layer receives all feature maps in H7 and outputs 512 feature maps, denoted Z4, each of width R/16 and height L/16. The third perforated convolutional layer receives all feature maps in Z4 and outputs 512 feature maps, denoted K3, each of width R/16 and height L/16. The eighth batch normalization layer receives all feature maps in K3 and outputs 512 feature maps, denoted P8, each of width R/16 and height L/16. The eighth activation layer receives all feature maps in P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, and each feature map in H8 has a width of R/16 and a height of L/16.
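Gathering the chain above into one place, the following is a sketch of the encoding framework in tf.keras. It is not the patent's reference code: the 'same' padding and a dilation rate of 2 for the perforated convolutional layers are assumptions the patent leaves unstated, and the variable names simply mirror the J/P/H/Z/K/C notation of the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, dilation=1):
    """3x3 convolution -> batch normalization, the repeated encoder motif."""
    x = layers.Conv2D(filters, 3, padding='same', dilation_rate=dilation)(x)
    return layers.BatchNormalization()(x)

def encoder(inp):
    # Stage 1: conv1 -> BN1 -> ReLU (H1) -> max pool (Z1: R/2 x L/2)
    h1 = layers.Activation('relu')(conv_block(inp, 32))
    z1 = layers.MaxPooling2D(2)(h1)
    # Stage 2: conv2, conv3 with a short skip (first Concatenate fusion layer)
    h2 = layers.Activation('relu')(conv_block(z1, 64))
    p3 = conv_block(h2, 64)
    h3 = layers.Activation('relu')(layers.Concatenate()([p3, h2]))  # C1: 128 maps
    z2 = layers.MaxPooling2D(2)(h3)                                 # R/4 x L/4
    # Stage 3: conv4, conv5 with a short skip (second Concatenate fusion layer)
    h4 = layers.Activation('relu')(conv_block(z2, 128))
    p5 = conv_block(h4, 128)
    h5 = layers.Activation('relu')(layers.Concatenate()([p5, h4]))  # C2: 256 maps
    z3 = layers.MaxPooling2D(2)(h5)                                 # R/8 x L/8
    # Stage 4: two perforated (dilated) convolutions with a short skip (third fusion layer)
    h6 = layers.Activation('relu')(conv_block(z3, 256, dilation=2))
    p7 = conv_block(h6, 256, dilation=2)
    h7 = layers.Activation('relu')(layers.Concatenate()([p7, h6]))  # C3: 512 maps
    z4 = layers.MaxPooling2D(2)(h7)                                 # R/16 x L/16
    # Stage 5: third perforated convolution -> BN8 -> ReLU = encoder output H8
    h8 = layers.Activation('relu')(conv_block(z4, 512, dilation=2))
    return h1, p3, p5, p7, z4, h8   # tensors reused later by the skip connections
```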
For the decoding framework: the input end of the first deconvolution layer receives the output H8 of the encoding framework and outputs 256 feature maps, denoted F1, each of width R/8 and height L/8. The ninth batch normalization layer receives all feature maps in F1 and outputs 256 feature maps, denoted P9, each of width R/8 and height L/8. The fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4, each of width R/8 and height L/8. The ninth activation layer receives all feature maps in C4 and outputs 512 feature maps, denoted H9, each of width R/8 and height L/8. The sixth convolutional layer receives all feature maps in H9 and outputs 256 feature maps, denoted J6, each of width R/8 and height L/8. The tenth batch normalization layer receives all feature maps in J6 and outputs 256 feature maps, denoted P10, each of width R/8 and height L/8. The tenth activation layer receives all feature maps in P10 and outputs 256 feature maps, denoted H10, each of width R/8 and height L/8. The second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2, each of width R/4 and height L/4. The eleventh batch normalization layer receives all feature maps in F2 and outputs 128 feature maps, denoted P11, each of width R/4 and height L/4. The fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5, each of width R/4 and height L/4. The eleventh activation layer receives all feature maps in C5 and outputs 256 feature maps, denoted H11, each of width R/4 and height L/4. The seventh convolutional layer receives all feature maps in H11 and outputs 128 feature maps, denoted J7, each of width R/4 and height L/4. The twelfth batch normalization layer receives all feature maps in J7 and outputs 128 feature maps, denoted P12, each of width R/4 and height L/4. The twelfth activation layer receives all feature maps in P12 and outputs 128 feature maps, denoted H12, each of width R/4 and height L/4. The third deconvolution layer receives all feature maps in H12 and outputs 64 feature maps, denoted F3, each of width R/2 and height L/2. The thirteenth batch normalization layer receives all feature maps in F3 and outputs 64 feature maps, denoted P13, each of width R/2 and height L/2. The sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6, each of width R/2 and height L/2. The thirteenth activation layer receives all feature maps in C6 and outputs 128 feature maps, denoted H13, each of width R/2 and height L/2. The eighth convolutional layer receives all feature maps in H13 and outputs 64 feature maps, denoted J8, each of width R/2 and height L/2. The fourteenth batch normalization layer receives all feature maps in J8 and outputs 64 feature maps, denoted P14, each of width R/2 and height L/2. The fourteenth activation layer receives all feature maps in P14 and outputs 64 feature maps, denoted H14, each of width R/2 and height L/2. The fourth deconvolution layer receives all feature maps in H14 and outputs 32 feature maps, denoted F4, each of width R and height L. The fifteenth batch normalization layer receives all feature maps in F4 and outputs 32 feature maps, denoted P15, each of width R and height L. The seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and all feature maps in H20 (the output of the upsampling framework, described below) and outputs 96 feature maps, denoted C7, each of width R and height L.
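Under the same assumptions as the encoder sketch, the decoding framework might be sketched as below, with Conv2DTranspose (stride 2 × 2) standing in for the deconvolution layers and the long skip connections realized by Concatenate; h20 is the upsampling-framework output described next.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deconv_bn(x, filters):
    """3x3 deconvolution with stride 2x2 -> batch normalization."""
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(x)
    return layers.BatchNormalization()(x)

def conv_bn(x, filters):
    x = layers.Conv2D(filters, 3, padding='same')(x)
    return layers.BatchNormalization()(x)

def decoder(h8, p7, p5, p3, h1, h20):
    # deconv1 -> BN9 -> long skip to P7 (fourth Concatenate fusion layer) -> ReLU -> conv6
    h9  = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h8, 256), p7]))   # C4: 512
    h10 = layers.Activation('relu')(conv_bn(h9, 256))
    # deconv2 -> BN11 -> long skip to P5 (fifth fusion layer) -> ReLU -> conv7
    h11 = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h10, 128), p5]))  # C5: 256
    h12 = layers.Activation('relu')(conv_bn(h11, 128))
    # deconv3 -> BN13 -> long skip to P3 (sixth fusion layer) -> ReLU -> conv8
    h13 = layers.Activation('relu')(layers.Concatenate()([deconv_bn(h12, 64), p3]))   # C6: 128
    h14 = layers.Activation('relu')(conv_bn(h13, 64))
    # deconv4 -> BN15 -> seventh fusion layer joins P15 with H1 and the upsampling output H20
    p15 = deconv_bn(h14, 32)
    c7  = layers.Concatenate()([p15, h1, h20])   # 96 maps at full resolution R x L
    return c7
```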
For the upsampling framework: the input end of the first upsampling layer receives all feature maps in Z4 and outputs 512 feature maps, denoted Y1, each of width R/8 and height L/8. The tenth convolutional layer receives all feature maps in Y1 and outputs 256 feature maps, denoted J10, each of width R/8 and height L/8. The seventeenth batch normalization layer receives all feature maps in J10 and outputs 256 feature maps, denoted P17, each of width R/8 and height L/8. The seventeenth activation layer receives all feature maps in P17 and outputs 256 feature maps, denoted H17, each of width R/8 and height L/8. The second upsampling layer receives all feature maps in H17 and outputs 256 feature maps, denoted Y2, each of width R/4 and height L/4. The eleventh convolutional layer receives all feature maps in Y2 and outputs 128 feature maps, denoted J11, each of width R/4 and height L/4. The eighteenth batch normalization layer receives all feature maps in J11 and outputs 128 feature maps, denoted P18, each of width R/4 and height L/4. The eighteenth activation layer receives all feature maps in P18 and outputs 128 feature maps, denoted H18, each of width R/4 and height L/4. The third upsampling layer receives all feature maps in H18 and outputs 128 feature maps, denoted Y3, each of width R/2 and height L/2. The twelfth convolutional layer receives all feature maps in Y3 and outputs 64 feature maps, denoted J12, each of width R/2 and height L/2. The nineteenth batch normalization layer receives all feature maps in J12 and outputs 64 feature maps, denoted P19, each of width R/2 and height L/2. The nineteenth activation layer receives all feature maps in P19 and outputs 64 feature maps, denoted H19, each of width R/2 and height L/2. The fourth upsampling layer receives all feature maps in H19 and outputs 64 feature maps, denoted Y4, each of width R and height L. The thirteenth convolutional layer receives all feature maps in Y4 and outputs 32 feature maps, denoted J13, each of width R and height L. The twentieth batch normalization layer receives all feature maps in J13 and outputs 32 feature maps, denoted P20, each of width R and height L. The twentieth activation layer receives all feature maps in P20 and outputs 32 feature maps, denoted H20, each of width R and height L.
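Continuing the tf.keras sketch under the same assumptions, the four identical (upsample, 3 × 3 convolution, batch normalization, ReLU) stages of the upsampling framework can be written as a loop:

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_branch(z4):
    """Upsampling framework: four (UpSampling -> 3x3 conv -> BN -> ReLU) stages from Z4."""
    x = z4                               # 512 maps at R/16 x L/16
    for filters in (256, 128, 64, 32):   # kernel counts of conv10..conv13
        x = layers.UpSampling2D(size=(2, 2))(x)
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x                             # H20: 32 maps at R x L
```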
For the output layer: the input end of the fifteenth activation layer receives the output of the decoding framework, namely all feature maps in C7, and outputs 96 feature maps, denoted H15, each of width R and height L. The ninth convolutional layer receives all feature maps in H15 and outputs 1 feature map, denoted J9; the feature map in J9 has a width of R and a height of L. The sixteenth batch normalization layer receives the feature map in J9 and outputs 1 feature map, denoted P16; the feature map in P16 has a width of R and a height of L. The sixteenth activation layer receives the feature map in P16 and outputs 1 feature map, denoted H16; the feature map in H16 has a width of R and a height of L, and it is the estimated depth image corresponding to the original input image.
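Reusing the encoder, decoder and upsample_branch sketches above, the output layer and overall assembly might then read as follows (again a hedged sketch, not the patent's reference code; the example input size is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(192, 640, 3)):        # (L, R, 3); example sizes only
    inp = tf.keras.Input(shape=input_shape)
    h1, p3, p5, p7, z4, h8 = encoder(inp)          # encoding framework (sketched earlier)
    h20 = upsample_branch(z4)                      # upsampling framework
    c7  = decoder(h8, p7, p5, p3, h1, h20)         # decoding framework -> 96 maps, R x L
    # Output layer: ReLU -> 3x3 conv with 1 kernel -> BN -> ReLU
    x = layers.Activation('relu')(c7)
    x = layers.Conv2D(1, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    out = layers.Activation('relu')(x)             # H16: the estimated depth image
    return tf.keras.Model(inp, out)
```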
Step 1_3: input each original monocular image in the training set as an original input image into the convolutional neural network for training, obtaining the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to Qn(x,y) is denoted Q̂n^depth(x,y).
Step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image
Figure GDA0003061601460000132
And
Figure GDA0003061601460000133
the value of the loss function in between is recorded as
Figure GDA0003061601460000134
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times to obtain a trained convolutional neural network training model, and obtaining N multiplied by V loss function values; then finding out the loss function value with the minimum value from the N multiplied by V loss function values; and then, corresponding the weight vector and the bias item corresponding to the loss function value with the minimum value to be used as the optimal weight vector and the optimal bias item of the trained convolutional neural network training model, and correspondingly marking as WbestAnd bbest(ii) a Wherein V is greater than 1;
the test stage process comprises the following specific steps:
Step 2_1: let Q(x',y') denote the monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x',y'), and L' denotes the height of Q(x',y');
Step 2_2: input Q(x',y') into the trained convolutional neural network training model and predict using Wbest and bbest, obtaining the predicted depth image corresponding to Q(x',y'), denoted Qdepth(x',y').
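The corresponding inference step, as a sketch (the test image is assumed here to match the training resolution; file names are illustrative):

```python
import numpy as np

model.load_weights('best_weights.h5')              # W_best and b_best from training
q = np.zeros((1, 192, 640, 3), dtype=np.float32)   # monocular image Q(x', y') to be predicted
q_depth = model.predict(q)[0, :, :, 0]             # predicted depth image Q_depth(x', y')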
In step 1_4, the loss function value Loss_n is obtained using a mean squared error function.
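Written out with the notation of step 1_4, the mean squared error takes the standard form below (the patent states that an MSE function is used but does not print the formula itself):

```latex
\mathrm{Loss}_n \;=\; \frac{1}{R\,L}\sum_{x=1}^{R}\sum_{y=1}^{L}
\left(\hat{Q}_n^{\mathrm{depth}}(x,y)-Q_n^{\mathrm{depth}}(x,y)\right)^{2}
```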
Compared with the prior art, the invention has the advantages that:
1) In constructing the convolutional neural network, the method adopts skip-layer connections, namely the Concatenate fusion layers: short skip connections are used inside the encoding framework (the first, second and third Concatenate fusion layers), and long skip connections are used between the encoding framework and the decoding framework (the fourth, fifth, sixth and seventh Concatenate fusion layers). Skip connections aid multi-scale feature fusion and boundary preservation; the short skip connections enrich the diversity of information during encoding, and the long skip connections counteract the loss of original boundary information in the decoding part, so the depth estimation of the trained convolutional neural network model is more accurate.
2) The method uses an end-to-end convolutional neural network training framework in which three perforated (dilated) convolutional layers, placed after the third max pooling layer of the encoding framework, extract feature information; without increasing the number of training parameters, the perforated convolutions enlarge the receptive field of the neurons and capture more feature information.
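To see why a perforated (dilated) convolution adds receptive field without adding parameters, compare a 3 × 3 convolution at dilation rates 1 and 2; this is a generic property of dilated convolutions, not something specific to the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(64, 64, 256))
plain   = layers.Conv2D(256, 3, padding='same', dilation_rate=1)
dilated = layers.Conv2D(256, 3, padding='same', dilation_rate=2)
_ = plain(inp)     # build both layers on the same input
_ = dilated(inp)

# Identical parameter count: 3*3*256*256 weights + 256 biases in both cases...
print(plain.count_params() == dilated.count_params())   # True
# ...but the dilated 3x3 kernel covers a 5x5 input window instead of 3x3,
# enlarging each neuron's receptive field at no extra training cost.
```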
3) The hidden layer of the convolutional neural network created by the method comprises an encoding framework, a decoding framework and an upsampling framework; the combination of the three frameworks enables the trained convolutional neural network model to extract information-rich features, so that highly accurate depth information is obtained and the precision of the depth estimation result is improved.
4) The predicted depth image obtained by the method has the same size as the original monocular image, which facilitates direct use of the depth information in the predicted depth image.
Drawings
FIG. 1 is a schematic diagram of the structure of the coding framework in the hidden layer of the convolutional neural network created in the method of the present invention;
FIG. 2 is a schematic diagram of the composition of the decoding framework in the hidden layer of the convolutional neural network created in the method of the present invention, together with the output layer of that network;
fig. 3 is a schematic structural diagram of the composition of an upsampling frame in the hidden layer of the convolutional neural network created in the method of the present invention.
Detailed Description
The invention is described in further detail below in conjunction with the embodiments.
The invention provides a monocular vision depth estimation method which is characterized by comprising a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set; the nth original monocular image in the training set is denoted Qn(x,y), and the real depth image in the training set corresponding to Qn(x,y) is denoted Qn^depth(x,y); wherein N is a positive integer, N ≥ 100, e.g. N = 1000, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x,y) and Qn^depth(x,y), L represents the height of Qn(x,y) and Qn^depth(x,y), and R and L are each divisible by 2. Here, the original monocular images and their corresponding real depth images are provided directly by the official KITTI website.
Step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame.
For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein the input end of the input layer is required to receive the original input image with a width of R and a height of L.
As shown in fig. 1, the encoding framework consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first perforated convolutional layer, a sixth batch normalization layer, a sixth activation layer, a second perforated convolutional layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third perforated convolutional layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence. As shown in fig. 2, the decoding framework consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, arranged in sequence. As shown in fig. 3, the upsampling framework consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, arranged in sequence. As also shown in fig. 2, the output layer consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, arranged in sequence.
The convolution kernel size of the first to thirteenth convolutional layers, the first to third perforated convolutional layers and the first to fourth deconvolution layers is 3 × 3 in every case. The number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second perforated convolutional layers, 512 for the third perforated convolutional layer, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer. The convolution strides of the first to thirteenth convolutional layers and the first to third perforated convolutional layers take default values; the stride of the first to fourth deconvolution layers is 2 × 2; the parameters of the first to twentieth batch normalization layers take default values; ReLU is used as the activation function of the first to twentieth activation layers; the pooling stride of the first to fourth max pooling layers is 2 × 2; and the sampling stride of the first to fourth upsampling layers is 2 × 2.
For the coding framework, the input end of the first convolution layer receives the original input image output by the output end of the input layer, the output end of the first convolution layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as J1Wherein, J1Each feature map in (1) has a width of R and a height of L; the input of the first normalization layer receives J1The output end of the first normalization layer outputs 32 feature maps, and the set formed by all the output feature maps is marked as P1Wherein P is1Each feature map in (1) has a width of R and a height of L; the input terminal of the first active layer receives P1The output end of the first activation layer outputs 32 feature maps, and the set formed by all the output feature maps is recorded as H1Wherein H is1Each feature map in (1) has a width of R and a height of L; the input of the first largest pooling layer receives H1The output end of the first maximum pooling layer outputs 32 feature maps, and the set formed by all the output feature maps is recorded as Z1Wherein Z is1Each feature map of (1) has a width of
Figure GDA0003061601460000181
Has a height of
Figure GDA0003061601460000182
Input terminal of the second convolution layer receives Z1The output end of the second convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as J2Wherein, J2Each feature map of (1) has a width of
Figure GDA0003061601460000183
Has a height of
Figure GDA0003061601460000184
Input of second normalization layer receives J2The output end of the second normalization layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as P2Wherein P is2Each feature map of (1) has a width of
Figure GDA0003061601460000185
Has a height of
Figure GDA0003061601460000186
The input terminal of the second active layer receives P2The output end of the second activation layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as H2Wherein H is2Each feature map of (1) has a width of
Figure GDA0003061601460000187
Has a height of
Figure GDA0003061601460000188
Input terminal of the third convolution layer receives H2The output end of the third convolutional layer outputs 64 feature maps, and the set formed by all the output feature maps is recorded as J3Wherein, J3Each feature map of (1) has a width of
Figure GDA0003061601460000189
Has a height of
Figure GDA00030616014600001810
Input of third normalization layer receives J3The output end of the third normalization layer outputs 64 feature maps, and the set formed by all the output feature maps is marked as P3Wherein P is3Each feature map of (1) has a width of
Figure GDA00030616014600001811
Has a height of
Figure GDA00030616014600001812
The input terminal of the first splice fusion layer receives P3All characteristic diagrams in (1) and (H)2The output end of the first concatemate fusion layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as C1Wherein, C1Each feature map of (1) has a width of
Figure GDA00030616014600001813
Has a height of
Figure GDA00030616014600001814
Input terminal of third active layer receives C1The output end of the third activation layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as H3Wherein H is3Each feature map of (1) has a width of
Figure GDA00030616014600001815
Has a height of
Figure GDA00030616014600001816
The input of the second largest pooling layer receives H3The output end of the second maximum pooling layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as Z2Wherein Z is2Each feature map of (1) has a width of
Figure GDA0003061601460000191
Has a height of
Figure GDA0003061601460000192
Input terminal of the fourth convolution layer receiving Z2The output end of the fourth convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as J4Wherein, J4Each feature map of (1) has a width of
Figure GDA0003061601460000193
Has a height of
Figure GDA0003061601460000194
Input of fourth normalization layer receives J4The output end of the fourth normalization layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as P4Wherein P is4Each feature map of (1) has a width of
Figure GDA0003061601460000195
Has a height of
Figure GDA0003061601460000196
The input terminal of the fourth active layer receives P4The output end of the fourth activation layer outputs 128 feature maps, and the set formed by all the output feature maps is recorded as H4Wherein H is4Each feature map of (1) has a width of
Figure GDA0003061601460000197
Has a height of
Figure GDA0003061601460000198
Input terminal of fifth convolution layer receives H4The output end of the fifth convolutional layer outputs 128 feature maps, and the set formed by all the output feature maps is marked as J5Wherein, J5Each feature map of (1) has a width of
Figure GDA0003061601460000199
Has a height of
Figure GDA00030616014600001910
The input end of the fifth batch normalization layer receives all of the feature maps in J5; the output end of the fifth batch normalization layer outputs 128 feature maps, whose set is denoted as P5, where each feature map in P5 has a width of R/4 and a height of L/4. The input end of the second Concatenate fusion layer receives all of the feature maps in P5 and all of the feature maps in H4; its output end outputs 256 feature maps, whose set is denoted as C2, where each feature map in C2 has a width of R/4 and a height of L/4. The input end of the fifth activation layer receives all of the feature maps in C2; its output end outputs 256 feature maps, whose set is denoted as H5, where each feature map in H5 has a width of R/4 and a height of L/4. The input end of the third maximum pooling layer receives all of the feature maps in H5; its output end outputs 256 feature maps, whose set is denoted as Z3, where each feature map in Z3 has a width of R/8 and a height of L/8. The input end of the first convolutional layer with holes receives all of the feature maps in Z3; its output end outputs 256 feature maps, whose set is denoted as K1, where each feature map in K1 has a width of R/8 and a height of L/8. The input end of the sixth batch normalization layer receives all of the feature maps in K1; its output end outputs 256 feature maps, whose set is denoted as P6, where each feature map in P6 has a width of R/8 and a height of L/8. The input end of the sixth activation layer receives all of the feature maps in P6; its output end outputs 256 feature maps, whose set is denoted as H6, where each feature map in H6 has a width of R/8 and a height of L/8. The input end of the second convolutional layer with holes receives all of the feature maps in H6; its output end outputs 256 feature maps, whose set is denoted as K2, where each feature map in K2 has a width of R/8 and a height of L/8. The input end of the seventh batch normalization layer receives all of the feature maps in K2; its output end outputs 256 feature maps, whose set is denoted as P7, where each feature map in P7 has a width of R/8 and a height of L/8. The input end of the third Concatenate fusion layer receives all of the feature maps in P7 and all of the feature maps in H6; its output end outputs 512 feature maps, whose set is denoted as C3, where each feature map in C3 has a width of R/8 and a height of L/8. The input end of the seventh activation layer receives all of the feature maps in C3; its output end outputs 512 feature maps, whose set is denoted as H7, where each feature map in H7 has a width of R/8 and a height of L/8. The input end of the fourth maximum pooling layer receives all of the feature maps in H7; its output end outputs 512 feature maps, whose set is denoted as Z4, where each feature map in Z4 has a width of R/16 and a height of L/16. The input end of the third convolutional layer with holes receives all of the feature maps in Z4; its output end outputs 512 feature maps, whose set is denoted as K3, where each feature map in K3 has a width of R/16 and a height of L/16. The input end of the eighth batch normalization layer receives all of the feature maps in K3; its output end outputs 512 feature maps, whose set is denoted as P8, where each feature map in P8 has a width of R/16 and a height of L/16. The input end of the eighth activation layer receives all of the feature maps in P8; its output end outputs 512 feature maps, whose set is denoted as H8, and H8 is the output of the coding framework, where each feature map in H8 has a width of R/16 and a height of L/16.
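For readability, the coding framework just described can be condensed into the following PyTorch-style sketch (an illustrative reconstruction, not the patented implementation: the 3-channel input, the size-preserving padding of the 3 × 3 convolutions and the dilation rate of 2 for the convolutional layers with holes are assumptions, since the text fixes only the kernel sizes, kernel counts and strides):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Coding framework sketch: conv/BN/ReLU stages with Concatenate fusions
    # and 2 x 2 max pooling, finished by convolutions with holes (atrous
    # convolutions); dilation=2 with padding=2 keeps the 3 x 3 kernels
    # size-preserving and is an assumption of this sketch.
    def __init__(self, in_ch=3):                      # 3-channel input is an assumption
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU())   # J1 -> P1 -> H1
        self.pool = nn.MaxPool2d(2)                                  # pooling stride 2 x 2
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                   nn.BatchNorm2d(64), nn.ReLU())    # J2 -> P2 -> H2
        self.conv3 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                   nn.BatchNorm2d(64))               # J3 -> P3
        self.conv4 = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1),
                                   nn.BatchNorm2d(128), nn.ReLU())   # J4 -> P4 -> H4
        self.conv5 = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1),
                                   nn.BatchNorm2d(128))              # J5 -> P5
        self.atr1 = nn.Sequential(nn.Conv2d(256, 256, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(256), nn.ReLU())    # K1 -> P6 -> H6
        self.atr2 = nn.Sequential(nn.Conv2d(256, 256, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(256))               # K2 -> P7
        self.atr3 = nn.Sequential(nn.Conv2d(512, 512, 3, padding=2, dilation=2),
                                  nn.BatchNorm2d(512), nn.ReLU())    # K3 -> P8 -> H8

    def forward(self, x):
        h1 = self.block1(x)                           # H1: 32 maps, R x L
        h2 = self.conv2(self.pool(h1))                # Z1 -> H2: 64 maps, R/2 x L/2
        p3 = self.conv3(h2)                           # P3: 64 maps
        h3 = torch.relu(torch.cat([p3, h2], dim=1))   # C1 -> H3: 128 maps
        h4 = self.conv4(self.pool(h3))                # Z2 -> H4: 128 maps, R/4 x L/4
        p5 = self.conv5(h4)                           # P5: 128 maps
        h5 = torch.relu(torch.cat([p5, h4], dim=1))   # C2 -> H5: 256 maps
        h6 = self.atr1(self.pool(h5))                 # Z3 -> H6: 256 maps, R/8 x L/8
        p7 = self.atr2(h6)                            # P7: 256 maps
        h7 = torch.relu(torch.cat([p7, h6], dim=1))   # C3 -> H7: 512 maps
        z4 = self.pool(h7)                            # Z4: 512 maps, R/16 x L/16
        h8 = self.atr3(z4)                            # H8: encoder output, 512 maps
        return h1, p3, p5, p7, z4, h8

Besides H8, the forward pass returns exactly the sets that are reused later: H1 and P3, P5, P7 by the Concatenate fusion layers of the decoding framework, and Z4 by the upsampling framework.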
For the decoding framework, the input end of the first deconvolution layer receives the output H8 of the coding framework; the output end of the first deconvolution layer outputs 256 feature maps, whose set is denoted as F1, where each feature map in F1 has a width of R/8 and a height of L/8. The input end of the ninth batch normalization layer receives all of the feature maps in F1; its output end outputs 256 feature maps, whose set is denoted as P9, where each feature map in P9 has a width of R/8 and a height of L/8. The input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7; its output end outputs 512 feature maps, whose set is denoted as C4, where each feature map in C4 has a width of R/8 and a height of L/8. The input end of the ninth activation layer receives all of the feature maps in C4; its output end outputs 512 feature maps, whose set is denoted as H9, where each feature map in H9 has a width of R/8 and a height of L/8. The input end of the sixth convolutional layer receives all of the feature maps in H9; its output end outputs 256 feature maps, whose set is denoted as J6, where each feature map in J6 has a width of R/8 and a height of L/8. The input end of the tenth batch normalization layer receives all of the feature maps in J6; its output end outputs 256 feature maps, whose set is denoted as P10, where each feature map in P10 has a width of R/8 and a height of L/8. The input end of the tenth activation layer receives all of the feature maps in P10; its output end outputs 256 feature maps, whose set is denoted as H10, where each feature map in H10 has a width of R/8 and a height of L/8.
The input end of the second deconvolution layer receives all of the feature maps in H10; the output end of the second deconvolution layer outputs 128 feature maps, whose set is denoted as F2, where each feature map in F2 has a width of R/4 and a height of L/4. The input end of the eleventh batch normalization layer receives all of the feature maps in F2; its output end outputs 128 feature maps, whose set is denoted as P11, where each feature map in P11 has a width of R/4 and a height of L/4. The input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5; its output end outputs 256 feature maps, whose set is denoted as C5, where each feature map in C5 has a width of R/4 and a height of L/4. The input end of the eleventh activation layer receives all of the feature maps in C5; its output end outputs 256 feature maps, whose set is denoted as H11, where each feature map in H11 has a width of R/4 and a height of L/4. The input end of the seventh convolutional layer receives all of the feature maps in H11; its output end outputs 128 feature maps, whose set is denoted as J7, where each feature map in J7 has a width of R/4 and a height of L/4. The input end of the twelfth batch normalization layer receives all of the feature maps in J7; its output end outputs 128 feature maps, whose set is denoted as P12, where each feature map in P12 has a width of R/4 and a height of L/4. The input end of the twelfth activation layer receives all of the feature maps in P12; its output end outputs 128 feature maps, whose set is denoted as H12, where each feature map in H12 has a width of R/4 and a height of L/4. The input end of the third deconvolution layer receives all of the feature maps in H12; its output end outputs 64 feature maps, whose set is denoted as F3, where each feature map in F3 has a width of R/2 and a height of L/2. The input end of the thirteenth batch normalization layer receives all of the feature maps in F3; its output end outputs 64 feature maps, whose set is denoted as P13, where each feature map in P13 has a width of R/2 and a height of L/2. The input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3; its output end outputs 128 feature maps, whose set is denoted as C6, where each feature map in C6 has a width of R/2 and a height of L/2. The input end of the thirteenth activation layer receives all of the feature maps in C6; its output end outputs 128 feature maps, whose set is denoted as H13, where each feature map in H13 has a width of R/2 and a height of L/2. The input end of the eighth convolutional layer receives all of the feature maps in H13; its output end outputs 64 feature maps, whose set is denoted as J8, where each feature map in J8 has a width of R/2 and a height of L/2. The input end of the fourteenth batch normalization layer receives all of the feature maps in J8; its output end outputs 64 feature maps, whose set is denoted as P14, where each feature map in P14 has a width of R/2 and a height of L/2. The input end of the fourteenth activation layer receives all of the feature maps in P14; its output end outputs 64 feature maps, whose set is denoted as H14, where each feature map in H14 has a width of R/2 and a height of L/2. The input end of the fourth deconvolution layer receives all of the feature maps in H14; its output end outputs 32 feature maps, whose set is denoted as F4, where each feature map in F4 has a width of R and a height of L. The input end of the fifteenth batch normalization layer receives all of the feature maps in F4; its output end outputs 32 feature maps, whose set is denoted as P15, where each feature map in P15 has a width of R and a height of L. The input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20 output by the upsampling framework; its output end outputs 96 feature maps, whose set is denoted as C7, where each feature map in C7 has a width of R and a height of L.
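The decoding framework can be sketched in the same style (again an assumption-laden reconstruction; the skip tensors p7, p5, p3 and h1 come from the encoder sketch above, and h20 from the upsampling framework described below):

import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # 3 x 3 deconvolution with stride 2 x 2 followed by batch normalization;
    # padding/output_padding are chosen so the resolution exactly doubles.
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                            padding=1, output_padding=1),
                         nn.BatchNorm2d(out_ch))

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv1 = up_block(512, 256)                            # F1 -> P9
        self.conv6 = nn.Sequential(nn.Conv2d(512, 256, 3, padding=1),
                                   nn.BatchNorm2d(256), nn.ReLU())   # J6 -> P10 -> H10
        self.deconv2 = up_block(256, 128)                            # F2 -> P11
        self.conv7 = nn.Sequential(nn.Conv2d(256, 128, 3, padding=1),
                                   nn.BatchNorm2d(128), nn.ReLU())   # J7 -> P12 -> H12
        self.deconv3 = up_block(128, 64)                             # F3 -> P13
        self.conv8 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1),
                                   nn.BatchNorm2d(64), nn.ReLU())    # J8 -> P14 -> H14
        self.deconv4 = up_block(64, 32)                              # F4 -> P15

    def forward(self, h8, p7, p5, p3, h1, h20):
        h9 = torch.relu(torch.cat([self.deconv1(h8), p7], dim=1))    # C4 -> H9: 512 maps
        h10 = self.conv6(h9)                                         # H10: 256 maps
        h11 = torch.relu(torch.cat([self.deconv2(h10), p5], dim=1))  # C5 -> H11: 256 maps
        h12 = self.conv7(h11)                                        # H12: 128 maps
        h13 = torch.relu(torch.cat([self.deconv3(h12), p3], dim=1))  # C6 -> H13: 128 maps
        h14 = self.conv8(h13)                                        # H14: 64 maps
        c7 = torch.cat([self.deconv4(h14), h1, h20], dim=1)          # C7: 96 maps, R x L
        return c7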
For the upsampling framework, the input end of the first upsampling layer receives all of the feature maps in Z4; the output end of the first upsampling layer outputs 512 feature maps, whose set is denoted as Y1, where each feature map in Y1 has a width of R/8 and a height of L/8. The input end of the tenth convolutional layer receives all of the feature maps in Y1; its output end outputs 256 feature maps, whose set is denoted as J10, where each feature map in J10 has a width of R/8 and a height of L/8. The input end of the seventeenth batch normalization layer receives all of the feature maps in J10; its output end outputs 256 feature maps, whose set is denoted as P17, where each feature map in P17 has a width of R/8 and a height of L/8. The input end of the seventeenth activation layer receives all of the feature maps in P17; its output end outputs 256 feature maps, whose set is denoted as H17, where each feature map in H17 has a width of R/8 and a height of L/8. The input end of the second upsampling layer receives all of the feature maps in H17; its output end outputs 256 feature maps, whose set is denoted as Y2, where each feature map in Y2 has a width of R/4 and a height of L/4. The input end of the eleventh convolutional layer receives all of the feature maps in Y2; its output end outputs 128 feature maps, whose set is denoted as J11, where each feature map in J11 has a width of R/4 and a height of L/4. The input end of the eighteenth batch normalization layer receives all of the feature maps in J11; its output end outputs 128 feature maps, whose set is denoted as P18, where each feature map in P18 has a width of R/4 and a height of L/4. The input end of the eighteenth activation layer receives all of the feature maps in P18; its output end outputs 128 feature maps, whose set is denoted as H18, where each feature map in H18 has a width of R/4 and a height of L/4. The input end of the third upsampling layer receives all of the feature maps in H18; its output end outputs 128 feature maps, whose set is denoted as Y3, where each feature map in Y3 has a width of R/2 and a height of L/2. The input end of the twelfth convolutional layer receives all of the feature maps in Y3; its output end outputs 64 feature maps, whose set is denoted as J12, where each feature map in J12 has a width of R/2 and a height of L/2. The input end of the nineteenth batch normalization layer receives all of the feature maps in J12; its output end outputs 64 feature maps, whose set is denoted as P19, where each feature map in P19 has a width of R/2 and a height of L/2. The input end of the nineteenth activation layer receives all of the feature maps in P19; its output end outputs 64 feature maps, whose set is denoted as H19, where each feature map in H19 has a width of R/2 and a height of L/2. The input end of the fourth upsampling layer receives all of the feature maps in H19; its output end outputs 64 feature maps, whose set is denoted as Y4, where each feature map in Y4 has a width of R and a height of L. The input end of the thirteenth convolutional layer receives all of the feature maps in Y4; its output end outputs 32 feature maps, whose set is denoted as J13, where each feature map in J13 has a width of R and a height of L. The input end of the twentieth batch normalization layer receives all of the feature maps in J13; its output end outputs 32 feature maps, whose set is denoted as P20, where each feature map in P20 has a width of R and a height of L. The input end of the twentieth activation layer receives all of the feature maps in P20; its output end outputs 32 feature maps, whose set is denoted as H20, where each feature map in H20 has a width of R and a height of L.
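A matching sketch of the upsampling framework (the nearest-neighbour interpolation mode is an assumption; the text fixes only the 2 × 2 sampling stride):

import torch.nn as nn

class UpsamplingFramework(nn.Module):
    # Four 2 x 2 upsampling steps, each followed by conv/BN/ReLU.
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')        # assumed mode
        self.conv10 = nn.Sequential(nn.Conv2d(512, 256, 3, padding=1),
                                    nn.BatchNorm2d(256), nn.ReLU())  # J10 -> P17 -> H17
        self.conv11 = nn.Sequential(nn.Conv2d(256, 128, 3, padding=1),
                                    nn.BatchNorm2d(128), nn.ReLU())  # J11 -> P18 -> H18
        self.conv12 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1),
                                    nn.BatchNorm2d(64), nn.ReLU())   # J12 -> P19 -> H19
        self.conv13 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU())   # J13 -> P20 -> H20

    def forward(self, z4):
        h17 = self.conv10(self.up(z4))    # Y1 -> H17: 256 maps, R/8 x L/8
        h18 = self.conv11(self.up(h17))   # Y2 -> H18: 128 maps, R/4 x L/4
        h19 = self.conv12(self.up(h18))   # Y3 -> H19: 64 maps, R/2 x L/2
        h20 = self.conv13(self.up(h19))   # Y4 -> H20: 32 maps, R x L
        return h20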
For the output layer, the input end of the fifteenth activation layer receives the output C7 of the decoding framework; the output end of the fifteenth activation layer outputs 96 feature maps, whose set is denoted as H15, where each feature map in H15 has a width of R and a height of L. The input end of the ninth convolutional layer receives all of the feature maps in H15; its output end outputs 1 feature map, denoted as J9, with a width of R and a height of L. The input end of the sixteenth batch normalization layer receives the feature map in J9; its output end outputs 1 feature map, denoted as P16, with a width of R and a height of L. The input end of the sixteenth activation layer receives the feature map in P16; its output end outputs 1 feature map, denoted as H16, with a width of R and a height of L; the feature map in H16 is the estimated depth image corresponding to the original input image.
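Wiring the three frameworks and the output layer together gives the complete estimator; the following sketch reuses the classes defined above and checks that a single-channel, full-resolution depth map is produced:

import torch
import torch.nn as nn

class DepthNet(nn.Module):
    # Overall network: coding framework -> (decoding + upsampling) -> output layer.
    def __init__(self):
        super().__init__()
        self.enc = Encoder()
        self.ups = UpsamplingFramework()
        self.dec = Decoder()
        # Output layer: act15 (ReLU) -> conv9 (96 -> 1) -> bn16 -> act16 (ReLU).
        self.out = nn.Sequential(nn.ReLU(), nn.Conv2d(96, 1, 3, padding=1),
                                 nn.BatchNorm2d(1), nn.ReLU())

    def forward(self, x):
        h1, p3, p5, p7, z4, h8 = self.enc(x)
        h20 = self.ups(z4)
        c7 = self.dec(h8, p7, p5, p3, h1, h20)
        return self.out(c7)              # H16: the estimated depth image, R x L

model = DepthNet()
# A width and height divisible by 16 keep all skip shapes aligned in this
# sketch, since the four 2 x 2 poolings halve the resolution four times.
depth = model(torch.randn(1, 3, 128, 416))
print(depth.shape)                       # torch.Size([1, 1, 128, 416])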
Step 1_3: inputting each original monocular image in the training set as an original input image into the convolutional neural network for training to obtain the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to Qn(x, y) is denoted as D̂n(x, y).
Step 1_4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image; the loss function value between D̂n(x, y) and the real depth image Dn(x, y) is denoted as Lossn and is obtained by using a mean square error function.
Step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model and N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; and then taking the weight vector and the bias term corresponding to that smallest loss function value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted as Wbest and bbest; where V > 1, and in this example V = 20.
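Steps 1_3 to 1_5 amount to the following training loop (a minimal sketch: the Adam optimizer, the per-image batching and the train_pairs iterable are assumptions, as the patent fixes none of them):

import copy
import torch
import torch.nn as nn

criterion = nn.MSELoss()                              # step 1_4: mean square error
optimizer = torch.optim.Adam(model.parameters())      # optimizer choice is an assumption
V = 20                                                # number of repetitions, as in the example
best_loss, best_state = float('inf'), None

for v in range(V):                                    # step 1_5: repeat V times
    for image, depth_gt in train_pairs:               # N original monocular images (placeholder iterable)
        depth_pred = model(image)                     # step 1_3: estimated depth image
        loss = criterion(depth_pred, depth_gt)        # one of the N x V loss values
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best_loss:                   # keep the weights of the smallest loss value
            best_loss = loss.item()
            best_state = copy.deepcopy(model.state_dict())   # plays the role of W_best, b_best

model.load_state_dict(best_state)                     # trained model with optimal weights and biases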
The test stage process comprises the following specific steps:
Step 2_1: let Q(x', y') denote the monocular image to be predicted; where 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' represents the width of Q(x', y'), and L' represents the height of Q(x', y').
Step 2_2: inputting Q(x', y') into the trained convolutional neural network training model and predicting, by using Wbest and bbest, the predicted depth image corresponding to Q(x', y'), denoted as Qdepth(x', y').
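The test stage then reduces to a single forward pass (sketch; test_image stands for a monocular image of width R' and height L' already prepared as a tensor):

model.eval()                        # step 2_2: trained model with W_best, b_best loaded
with torch.no_grad():
    q_depth = model(test_image)     # the predicted depth image Qdepth(x', y')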
To verify the feasibility and effectiveness of the method of the invention, experiments were carried out.
In the method, the monocular images forming the training set and the monocular images used for testing are both provided by the KITTI official website, so the accuracy of the method is analyzed directly on the test data set given by the KITTI official website. Each monocular image in the test data set is input, as a monocular image to be predicted, into the trained deep convolutional neural network training model, the optimal weight vector Wbest obtained in the training stage is loaded, and the corresponding predicted depth image is obtained.
Here, 6 objective parameters commonly used to evaluate monocular visual depth prediction are adopted as evaluation indexes, namely: root mean square error (rms), logarithmic root mean square error (log_rms), average log10 error (log10), and the threshold accuracies (thr) δ1, δ2, δ3. Lower values of the root mean square error, the logarithmic root mean square error and the average log10 error indicate that the predicted depth image is closer to the real depth image, while higher values of δ1, δ2 and δ3 indicate higher prediction accuracy. The root mean square error, logarithmic root mean square error, average log10 error and δ1, δ2, δ3 reflecting the evaluation performance of the method of the invention are listed in Table 1. As can be seen from the data listed in Table 1, the difference between the predicted depth images obtained by the method of the invention and the real depth images is very small, which shows that the prediction results of the method are highly accurate and reflects its feasibility and effectiveness.
TABLE 1 Comparison of evaluation indexes between the predicted depth images obtained by the method of the present invention and the real depth images
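For reference, the six indexes can be computed as follows (a sketch using the definitions conventional in the depth-estimation literature, including the 1.25, 1.25² and 1.25³ thresholds for δ1, δ2, δ3; the patent lists the indexes by name only):

import numpy as np

def evaluate(pred, gt):
    # pred, gt: positive depth arrays of identical shape.
    rms = np.sqrt(np.mean((pred - gt) ** 2))                        # root mean square error
    log_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))    # logarithmic rms error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))          # average log10 error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                                  # threshold accuracies
    delta2 = np.mean(ratio < 1.25 ** 2)
    delta3 = np.mean(ratio < 1.25 ** 3)
    return rms, log_rms, log10, delta1, delta2, delta3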

Claims (2)

1. A monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original monocular images and a real depth image corresponding to each original monocular image to form a training set, recording the nth original monocular image in the training set as Qn(x, y), and recording the real depth image corresponding to Qn(x, y) in the training set as Dn(x, y); wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of Qn(x, y) and Dn(x, y), L represents the height of Qn(x, y) and Dn(x, y), and R and L are each divisible by 2;
step 1_2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises a coding framework, a decoding framework and an upsampling framework;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
for the coding framework: the coding framework consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first maximum pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second maximum pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third maximum pooling layer, a first convolutional layer with holes, a sixth batch normalization layer, a sixth activation layer, a second convolutional layer with holes, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth maximum pooling layer, a third convolutional layer with holes, an eighth batch normalization layer and an eighth activation layer, which are arranged in sequence; for the decoding framework: the decoding framework consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer, which are arranged in sequence; for the upsampling framework: the upsampling framework consists of a first upsampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second upsampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third upsampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth upsampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer, which are arranged in sequence; and for the output layer: the output layer consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer, which are arranged in sequence; wherein the convolution kernels of the first to thirteenth convolutional layers, of the first to third convolutional layers with holes and of the first to fourth deconvolution layers are all of size 3 × 3; the number of convolution kernels is 32 for the first convolutional layer, 64 for the second and third convolutional layers, 128 for the fourth and fifth convolutional layers, 256 for the first and second convolutional layers with holes, 512 for the third convolutional layer with holes, 256 for the first deconvolution layer and the sixth convolutional layer, 128 for the second deconvolution layer and the seventh convolutional layer, 64 for the third deconvolution layer and the eighth convolutional layer, 32 for the fourth deconvolution layer, 1 for the ninth convolutional layer, 256 for the tenth convolutional layer, 128 for the eleventh convolutional layer, 64 for the twelfth convolutional layer and 32 for the thirteenth convolutional layer; the convolution strides of the first to thirteenth convolutional layers and of the first to third convolutional layers with holes adopt default values, and the convolution strides of the first to fourth deconvolution layers are 2 × 2; the parameters of the first to twentieth batch normalization layers adopt default values; the first to twentieth activation layers all adopt ReLU as the activation function; the pooling strides of the first to fourth maximum pooling layers are 2 × 2; and the sampling strides of the first to fourth upsampling layers are 2 × 2;
for the coding framework, the input end of the first convolutional layer receives the original input image output by the output end of the input layer; the output end of the first convolutional layer outputs 32 feature maps, whose set is denoted as J1, where each feature map in J1 has a width of R and a height of L; the input end of the first batch normalization layer receives all of the feature maps in J1; its output end outputs 32 feature maps, whose set is denoted as P1, where each feature map in P1 has a width of R and a height of L; the input end of the first activation layer receives all of the feature maps in P1; its output end outputs 32 feature maps, whose set is denoted as H1, where each feature map in H1 has a width of R and a height of L; the input end of the first maximum pooling layer receives all of the feature maps in H1; its output end outputs 32 feature maps, whose set is denoted as Z1, where each feature map in Z1 has a width of R/2 and a height of L/2; the input end of the second convolutional layer receives all of the feature maps in Z1; its output end outputs 64 feature maps, whose set is denoted as J2, where each feature map in J2 has a width of R/2 and a height of L/2; the input end of the second batch normalization layer receives all of the feature maps in J2; its output end outputs 64 feature maps, whose set is denoted as P2, where each feature map in P2 has a width of R/2 and a height of L/2; the input end of the second activation layer receives all of the feature maps in P2; its output end outputs 64 feature maps, whose set is denoted as H2, where each feature map in H2 has a width of R/2 and a height of L/2; the input end of the third convolutional layer receives all of the feature maps in H2; its output end outputs 64 feature maps, whose set is denoted as J3, where each feature map in J3 has a width of R/2 and a height of L/2; the input end of the third batch normalization layer receives all of the feature maps in J3; its output end outputs 64 feature maps, whose set is denoted as P3, where each feature map in P3 has a width of R/2 and a height of L/2; the input end of the first Concatenate fusion layer receives all of the feature maps in P3 and all of the feature maps in H2; its output end outputs 128 feature maps, whose set is denoted as C1, where each feature map in C1 has a width of R/2 and a height of L/2; the input end of the third activation layer receives all of the feature maps in C1; its output end outputs 128 feature maps, whose set is denoted as H3, where each feature map in H3 has a width of R/2 and a height of L/2; the input end of the second maximum pooling layer receives all of the feature maps in H3; its output end outputs 128 feature maps, whose set is denoted as Z2, where each feature map in Z2 has a width of R/4 and a height of L/4;
the input end of the fourth convolutional layer receives all of the feature maps in Z2; its output end outputs 128 feature maps, whose set is denoted as J4, where each feature map in J4 has a width of R/4 and a height of L/4; the input end of the fourth batch normalization layer receives all of the feature maps in J4; its output end outputs 128 feature maps, whose set is denoted as P4, where each feature map in P4 has a width of R/4 and a height of L/4; the input end of the fourth activation layer receives all of the feature maps in P4; its output end outputs 128 feature maps, whose set is denoted as H4, where each feature map in H4 has a width of R/4 and a height of L/4; the input end of the fifth convolutional layer receives all of the feature maps in H4; its output end outputs 128 feature maps, whose set is denoted as J5, where each feature map in J5 has a width of R/4 and a height of L/4; the input end of the fifth batch normalization layer receives all of the feature maps in J5; its output end outputs 128 feature maps, whose set is denoted as P5, where each feature map in P5 has a width of R/4 and a height of L/4; the input end of the second Concatenate fusion layer receives all of the feature maps in P5 and all of the feature maps in H4; its output end outputs 256 feature maps, whose set is denoted as C2, where each feature map in C2 has a width of R/4 and a height of L/4; the input end of the fifth activation layer receives all of the feature maps in C2; its output end outputs 256 feature maps, whose set is denoted as H5, where each feature map in H5 has a width of R/4 and a height of L/4; the input end of the third maximum pooling layer receives all of the feature maps in H5; its output end outputs 256 feature maps, whose set is denoted as Z3, where each feature map in Z3 has a width of R/8 and a height of L/8;
the input end of the first convolutional layer with holes receives all of the feature maps in Z3; its output end outputs 256 feature maps, whose set is denoted as K1, where each feature map in K1 has a width of R/8 and a height of L/8; the input end of the sixth batch normalization layer receives all of the feature maps in K1; its output end outputs 256 feature maps, whose set is denoted as P6, where each feature map in P6 has a width of R/8 and a height of L/8; the input end of the sixth activation layer receives all of the feature maps in P6; its output end outputs 256 feature maps, whose set is denoted as H6, where each feature map in H6 has a width of R/8 and a height of L/8; the input end of the second convolutional layer with holes receives all of the feature maps in H6; its output end outputs 256 feature maps, whose set is denoted as K2, where each feature map in K2 has a width of R/8 and a height of L/8; the input end of the seventh batch normalization layer receives all of the feature maps in K2; its output end outputs 256 feature maps, whose set is denoted as P7, where each feature map in P7 has a width of R/8 and a height of L/8; the input end of the third Concatenate fusion layer receives all of the feature maps in P7 and all of the feature maps in H6; its output end outputs 512 feature maps, whose set is denoted as C3, where each feature map in C3 has a width of R/8 and a height of L/8; the input end of the seventh activation layer receives all of the feature maps in C3; its output end outputs 512 feature maps, whose set is denoted as H7, where each feature map in H7 has a width of R/8 and a height of L/8; the input end of the fourth maximum pooling layer receives all of the feature maps in H7; its output end outputs 512 feature maps, whose set is denoted as Z4, where each feature map in Z4 has a width of R/16 and a height of L/16; the input end of the third convolutional layer with holes receives all of the feature maps in Z4; its output end outputs 512 feature maps, whose set is denoted as K3, where each feature map in K3 has a width of R/16 and a height of L/16; the input end of the eighth batch normalization layer receives all of the feature maps in K3; its output end outputs 512 feature maps, whose set is denoted as P8, where each feature map in P8 has a width of R/16 and a height of L/16; the input end of the eighth activation layer receives all of the feature maps in P8; its output end outputs 512 feature maps, whose set is denoted as H8, and H8 is the output of the coding framework, where each feature map in H8 has a width of R/16 and a height of L/16;
for the decoding framework, the input end of the first deconvolution layer receives the output H8 of the coding framework; the output end of the first deconvolution layer outputs 256 feature maps, whose set is denoted as F1, where each feature map in F1 has a width of R/8 and a height of L/8; the input end of the ninth batch normalization layer receives all of the feature maps in F1; its output end outputs 256 feature maps, whose set is denoted as P9, where each feature map in P9 has a width of R/8 and a height of L/8; the input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7; its output end outputs 512 feature maps, whose set is denoted as C4, where each feature map in C4 has a width of R/8 and a height of L/8; the input end of the ninth activation layer receives all of the feature maps in C4; its output end outputs 512 feature maps, whose set is denoted as H9, where each feature map in H9 has a width of R/8 and a height of L/8; the input end of the sixth convolutional layer receives all of the feature maps in H9; its output end outputs 256 feature maps, whose set is denoted as J6, where each feature map in J6 has a width of R/8 and a height of L/8; the input end of the tenth batch normalization layer receives all of the feature maps in J6; its output end outputs 256 feature maps, whose set is denoted as P10, where each feature map in P10 has a width of R/8 and a height of L/8; the input end of the tenth activation layer receives all of the feature maps in P10; its output end outputs 256 feature maps, whose set is denoted as H10, where each feature map in H10 has a width of R/8 and a height of L/8;
the input end of the second deconvolution layer receives all of the feature maps in H10; the output end of the second deconvolution layer outputs 128 feature maps, whose set is denoted as F2, where each feature map in F2 has a width of R/4 and a height of L/4; the input end of the eleventh batch normalization layer receives all of the feature maps in F2; its output end outputs 128 feature maps, whose set is denoted as P11, where each feature map in P11 has a width of R/4 and a height of L/4;
the input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5; its output end outputs 256 feature maps, whose set is denoted as C5, where each feature map in C5 has a width of R/4 and a height of L/4; the input end of the eleventh activation layer receives all of the feature maps in C5; its output end outputs 256 feature maps, whose set is denoted as H11, where each feature map in H11 has a width of R/4 and a height of L/4; the input end of the seventh convolutional layer receives all of the feature maps in H11; its output end outputs 128 feature maps, whose set is denoted as J7, where each feature map in J7 has a width of R/4 and a height of L/4; the input end of the twelfth batch normalization layer receives all of the feature maps in J7; its output end outputs 128 feature maps, whose set is denoted as P12, where each feature map in P12 has a width of R/4 and a height of L/4; the input end of the twelfth activation layer receives all of the feature maps in P12; its output end outputs 128 feature maps, whose set is denoted as H12, where each feature map in H12 has a width of R/4 and a height of L/4; the input end of the third deconvolution layer receives all of the feature maps in H12; its output end outputs 64 feature maps, whose set is denoted as F3, where each feature map in F3 has a width of R/2 and a height of L/2; the input end of the thirteenth batch normalization layer receives all of the feature maps in F3; its output end outputs 64 feature maps, whose set is denoted as P13, where each feature map in P13 has a width of R/2 and a height of L/2; the input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3; its output end outputs 128 feature maps, whose set is denoted as C6, where each feature map in C6 has a width of R/2 and a height of L/2; the input end of the thirteenth activation layer receives all of the feature maps in C6; its output end outputs 128 feature maps, whose set is denoted as H13, where each feature map in H13 has a width of R/2 and a height of L/2; the input end of the eighth convolutional layer receives all of the feature maps in H13; its output end outputs 64 feature maps, whose set is denoted as J8, where each feature map in J8 has a width of R/2 and a height of L/2; the input end of the fourteenth batch normalization layer receives all of the feature maps in J8; its output end outputs 64 feature maps, whose set is denoted as P14, where each feature map in P14 has a width of R/2 and a height of L/2; the input end of the fourteenth activation layer receives all of the feature maps in P14; its output end outputs 64 feature maps, whose set is denoted as H14, where each feature map in H14 has a width of R/2 and a height of L/2;
the input end of the fourth deconvolution layer receives all of the feature maps in H14; its output end outputs 32 feature maps, whose set is denoted as F4, where each feature map in F4 has a width of R and a height of L; the input end of the fifteenth batch normalization layer receives all of the feature maps in F4; its output end outputs 32 feature maps, whose set is denoted as P15, where each feature map in P15 has a width of R and a height of L; the input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20; its output end outputs 96 feature maps, whose set is denoted as C7, where each feature map in C7 has a width of R and a height of L;
for the upsampling framework, the input end of the first upsampling layer receives all of the feature maps in Z4; the output end of the first upsampling layer outputs 512 feature maps, whose set is denoted as Y1, where each feature map in Y1 has a width of R/8 and a height of L/8; the input end of the tenth convolutional layer receives all of the feature maps in Y1; its output end outputs 256 feature maps, whose set is denoted as J10, where each feature map in J10 has a width of R/8 and a height of L/8; the input end of the seventeenth batch normalization layer receives all of the feature maps in J10; its output end outputs 256 feature maps, whose set is denoted as P17, where each feature map in P17 has a width of R/8 and a height of L/8; the input end of the seventeenth activation layer receives all of the feature maps in P17; its output end outputs 256 feature maps, whose set is denoted as H17, where each feature map in H17 has a width of R/8 and a height of L/8; the input end of the second upsampling layer receives all of the feature maps in H17; its output end outputs 256 feature maps, whose set is denoted as Y2, where each feature map in Y2 has a width of R/4 and a height of L/4; the input end of the eleventh convolutional layer receives all of the feature maps in Y2; its output end outputs 128 feature maps, whose set is denoted as J11, where each feature map in J11 has a width of R/4 and a height of L/4; the input end of the eighteenth batch normalization layer receives all of the feature maps in J11; its output end outputs 128 feature maps, whose set is denoted as P18, where each feature map in P18 has a width of R/4 and a height of L/4; the input end of the eighteenth activation layer receives all of the feature maps in P18; its output end outputs 128 feature maps, whose set is denoted as H18, where each feature map in H18 has a width of R/4 and a height of L/4; the input end of the third upsampling layer receives all of the feature maps in H18; its output end outputs 128 feature maps, whose set is denoted as Y3, where each feature map in Y3 has a width of R/2 and a height of L/2; the input end of the twelfth convolutional layer receives all of the feature maps in Y3; its output end outputs 64 feature maps, whose set is denoted as J12, where each feature map in J12 has a width of R/2 and a height of L/2; the input end of the nineteenth batch normalization layer receives all of the feature maps in J12; its output end outputs 64 feature maps, whose set is denoted as P19, where each feature map in P19 has a width of R/2 and a height of L/2; the input end of the nineteenth activation layer receives all of the feature maps in P19; its output end outputs 64 feature maps, whose set is denoted as H19, where each feature map in H19 has a width of R/2 and a height of L/2; the input end of the fourth upsampling layer receives all of the feature maps in H19; its output end outputs 64 feature maps, whose set is denoted as Y4, where each feature map in Y4 has a width of R and a height of L; the input end of the thirteenth convolutional layer receives all of the feature maps in Y4; its output end outputs 32 feature maps, whose set is denoted as J13, where each feature map in J13 has a width of R and a height of L; the input end of the twentieth batch normalization layer receives all of the feature maps in J13; its output end outputs 32 feature maps, whose set is denoted as P20, where each feature map in P20 has a width of R and a height of L; the input end of the twentieth activation layer receives all of the feature maps in P20; its output end outputs 32 feature maps, whose set is denoted as H20, where each feature map in H20 has a width of R and a height of L;
for the output layer, the input end of the fifteenth activation layer receives the output C7 of the decoding framework; the output end of the fifteenth activation layer outputs 96 feature maps, whose set is denoted as H15, where each feature map in H15 has a width of R and a height of L; the input end of the ninth convolutional layer receives all of the feature maps in H15; its output end outputs 1 feature map, denoted as J9, with a width of R and a height of L; the input end of the sixteenth batch normalization layer receives the feature map in J9; its output end outputs 1 feature map, denoted as P16, with a width of R and a height of L; the input end of the sixteenth activation layer receives the feature map in P16; its output end outputs 1 feature map, denoted as H16, with a width of R and a height of L; the feature map in H16 is the estimated depth image corresponding to the original input image;
step 1_ 3: inputting each original monocular image in the training set as an original input image into a convolutional neural network for training to obtain an estimated depth image corresponding to each original monocular image in the training set, and inputting QnThe corresponding estimated depth image of (x, y) is noted as
Figure FDA0003061601450000121
Step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image
Figure FDA0003061601450000122
And
Figure FDA0003061601450000123
the value of the loss function in between is recorded as
Figure FDA0003061601450000124
step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model and N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; and then taking the weight vector and the bias term corresponding to the loss function value with the smallest value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted as Wbest and bbest; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: letting Q(x', y') denote the monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' denotes the width of Q(x', y'), and L' denotes the height of Q(x', y');
step 2_2: inputting Q(x', y') into the trained convolutional neural network training model and predicting, by using Wbest and bbest, the corresponding predicted depth image, denoted as Qdepth(x', y').
2. The method of claim 1, wherein in step 1_4, Lossn is obtained by using a mean square error function.
CN201811246664.0A 2018-10-25 2018-10-25 Monocular vision depth estimation method Active CN109460815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811246664.0A CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811246664.0A CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Publications (2)

Publication Number Publication Date
CN109460815A CN109460815A (en) 2019-03-12
CN109460815B true CN109460815B (en) 2021-12-10

Family

ID=65608334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811246664.0A Active CN109460815B (en) 2018-10-25 2018-10-25 Monocular vision depth estimation method

Country Status (1)

Country Link
CN (1) CN109460815B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11907829B2 (en) * 2019-03-14 2024-02-20 Infineon Technologies Ag FMCW radar with interference signal suppression using artificial neural network
DE102019106529A1 (en) 2019-03-14 2020-09-17 Infineon Technologies Ag FMCW RADAR WITH INTERFERENCE SUPPRESSION USING AN ARTIFICIAL NEURAL NETWORK
CN110414674B (en) * 2019-07-31 2021-09-10 浙江科技学院 Monocular depth estimation method based on residual error network and local refinement
CN111161166A (en) * 2019-12-16 2020-05-15 西安交通大学 Image moire eliminating method based on depth multi-resolution network
US20220301211A1 (en) * 2021-03-16 2022-09-22 Huan Liu Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886165A (en) * 2017-12-30 2018-04-06 北京工业大学 A kind of parallel-convolution neural net method based on CRT technology
CN108090472A (en) * 2018-01-12 2018-05-29 浙江大学 Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN108681692A (en) * 2018-04-10 2018-10-19 华南理工大学 Increase Building recognition method in a kind of remote sensing images based on deep learning newly

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018098230A1 (en) * 2016-11-22 2018-05-31 Massachusetts Institute Of Technology Systems and methods for training neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886165A (en) * 2017-12-30 2018-04-06 北京工业大学 A kind of parallel-convolution neural net method based on CRT technology
CN108090472A (en) * 2018-01-12 2018-05-29 浙江大学 Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN108681692A (en) * 2018-04-10 2018-10-19 华南理工大学 Increase Building recognition method in a kind of remote sensing images based on deep learning newly

Also Published As

Publication number Publication date
CN109460815A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109460815B (en) Monocular vision depth estimation method
CN109146944B (en) Visual depth estimation method based on depth separable convolutional neural network
CN110992275B (en) Refined single image rain removing method based on generation of countermeasure network
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN110334580A (en) The equipment fault classification method of changeable weight combination based on integrated increment
CN109410917A (en) Voice data classification method based on modified capsule network
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN109461177B (en) Monocular image depth prediction method based on neural network
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN111179187A (en) Single image rain removing method based on cyclic generation countermeasure network
CN113658189B (en) Cross-scale feature fusion real-time semantic segmentation method and system
CN113160265A (en) Construction method of prediction image for brain corpus callosum segmentation for corpus callosum state evaluation
CN110555461A (en) scene classification method and system based on multi-structure convolutional neural network feature fusion
CN111127331A (en) Image denoising method based on pixel-level global noise estimation coding and decoding network
CN114913493A (en) Lane line detection method based on deep learning
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN116109920A (en) Remote sensing image building extraction method based on transducer
CN110633706B (en) Semantic segmentation method based on pyramid network
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN114639002A (en) Infrared and visible light image fusion method based on multi-mode characteristics
CN108805844A (en) A kind of lightweight Recurrent networks construction method based on priori filtering
CN110517272B (en) Deep learning-based blood cell segmentation method
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant