CN109460815A - A kind of monocular depth estimation method - Google Patents
- Publication number
- CN109460815A (application number CN201811246664.0A)
- Authority
- CN
- China
- Prior art keywords
- layer
- feature maps
- output
- height
- width
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a monocular depth estimation method. First, a convolutional neural network is constructed, comprising an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding framework, a decoding framework and an up-sampling framework. Each monocular image in a training set is then used as an original input image and fed into the convolutional neural network for training, yielding an estimated depth image corresponding to every original monocular image in the training set. By calculating the loss function value between each estimated depth image and the corresponding real depth image, a trained convolutional neural network model with an optimal weight vector and an optimal bias term is obtained. Finally, a monocular image to be predicted is input into the trained model and, using the optimal weight vector and optimal bias term, the corresponding predicted depth image is obtained. The advantage of the method is its high prediction accuracy.
Description
Technical Field
The invention relates to image signal processing technology, and in particular to a monocular visual depth estimation method.
Background
Rapid economic development has steadily raised living standards, and with people's growing expectations for quality of life, transportation has become increasingly convenient. As an important part of transportation, automobiles have received ever more attention in their development. Since driverless vehicles were announced to be entering mass production and use, enthusiasm for artificial intelligence and autonomous driving has continued to grow, making the driverless car one of the most popular topics of recent years. Monocular visual depth estimation of the scene in front of an automobile is one part of the autonomous-driving field and can effectively help guarantee the safety of the automobile while driving.
Monocular visual depth estimation methods mainly comprise traditional methods and deep learning methods. Before deep learning methods appeared, depth estimation relying on traditional methods produced results far below the minimum expected standard; with the advent of deep learning, end-to-end training on large amounts of training data greatly improved the accuracy of the estimated depth. In the document "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", Eigen et al. proposed using a two-scale neural network for depth estimation: a coarse-scale network predicts the global depth distribution, and a fine-scale network locally refines the depth map. Eigen et al. later extended this to three scales. The three-scale architecture first uses the first scale to predict a coarse result from the whole image, then uses the second scale to optimize the result at medium resolution, and finally uses the third scale to further refine the result and obtain the predicted depth map.
Disclosure of Invention
The invention aims to provide a monocular visual depth estimation method with high prediction accuracy.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a monocular visual depth estimation method, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N original monocular images and the real depth image corresponding to each original monocular image to form a training set, recording the n-th original monocular image in the training set as {Q_n(x,y)} and recording the real depth image in the training set corresponding to {Q_n(x,y)} as {D_n(x,y)}; wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of {Q_n(x,y)} and {D_n(x,y)}, L represents the height of {Q_n(x,y)} and {D_n(x,y)}, R and L are both divisible by 2, Q_n(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {Q_n(x,y)}, and D_n(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {D_n(x,y)};
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
for the encoding framework, it consists of, in sequence: a first convolution layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolution layer, a second batch normalization layer, a second activation layer, a third convolution layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolution layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolution layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first dilated convolution layer (convolution layer with holes), a sixth batch normalization layer, a sixth activation layer, a second dilated convolution layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third dilated convolution layer, an eighth batch normalization layer and an eighth activation layer; for the decoding framework, it consists of, in sequence: a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolution layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer; for the up-sampling framework, it consists of, in sequence: a first up-sampling layer, a tenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolution layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolution layer, a twentieth batch normalization layer and a twentieth activation layer; and for the output layer, it consists of a fifteenth activation layer, a ninth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer arranged in sequence. The convolution kernels of the first to thirteenth convolution layers, the first to third dilated convolution layers and the first to fourth deconvolution layers are all of size 3 × 3. The number of convolution kernels is 32 for the first convolution layer, 64 for the second and third convolution layers, 128 for the fourth and fifth convolution layers, 256 for the first and second dilated convolution layers, 512 for the third dilated convolution layer, 256 for the first deconvolution layer and the sixth convolution layer, 128 for the second deconvolution layer and the seventh convolution layer, 64 for the third deconvolution layer and the eighth convolution layer, 32 for the fourth deconvolution layer, 1 for the ninth convolution layer, 256 for the tenth convolution layer, 128 for the eleventh convolution layer, 64 for the twelfth convolution layer and 32 for the thirteenth convolution layer. The convolution strides of the first to thirteenth convolution layers and of the first to third dilated convolution layers take default values, the convolution strides of the first to fourth deconvolution layers are 2 × 2, the parameters of the first to twentieth batch normalization layers take default values, the activation function of the first to twentieth activation layers is ReLU, the pooling stride of the first to fourth max pooling layers is 2 × 2, and the sampling stride of the first to fourth up-sampling layers is 2 × 2;
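To make the layer hyper-parameters above concrete, the following is a minimal sketch of the three repeated building blocks: a 3 × 3 convolution followed by batch normalization and ReLU, a dilated (holed) convolution, and a stride-2 deconvolution. PyTorch is assumed only for illustration (the patent names no framework), and the dilation rate of 2 and the padding values are assumptions chosen so that plain and dilated convolutions preserve the feature-map size while a deconvolution exactly doubles it, matching the widths and heights stated in the claims.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # 3x3 convolution + batch normalization + ReLU; padding=1 keeps width/height unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def dilated_conv_bn_relu(in_ch, out_ch, dilation=2):
    # 3x3 convolution "with holes"; dilation=2 is an assumption, padding=dilation keeps the size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def deconv_bn(in_ch, out_ch):
    # 3x3 deconvolution with stride 2; output_padding=1 makes the output exactly twice as large
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch),
    )
```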
For the encoding framework: the input end of the first convolution layer receives the original input image output by the output end of the input layer, and the output end of the first convolution layer outputs 32 feature maps, whose set is denoted J1; each feature map in J1 has width R and height L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1 (width R, height L). The first activation layer receives P1 and outputs 32 feature maps, denoted H1 (width R, height L). The first max pooling layer receives H1 and outputs 32 feature maps, denoted Z1 (width R/2, height L/2). The second convolution layer receives Z1 and outputs 64 feature maps, denoted J2 (width R/2, height L/2). The second batch normalization layer receives J2 and outputs 64 feature maps, denoted P2 (width R/2, height L/2). The second activation layer receives P2 and outputs 64 feature maps, denoted H2 (width R/2, height L/2). The third convolution layer receives H2 and outputs 64 feature maps, denoted J3 (width R/2, height L/2). The third batch normalization layer receives J3 and outputs 64 feature maps, denoted P3 (width R/2, height L/2). The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1 (width R/2, height L/2). The third activation layer receives C1 and outputs 128 feature maps, denoted H3 (width R/2, height L/2). The second max pooling layer receives H3 and outputs 128 feature maps, denoted Z2 (width R/4, height L/4). The fourth convolution layer receives Z2 and outputs 128 feature maps, denoted J4 (width R/4, height L/4). The fourth batch normalization layer receives J4 and outputs 128 feature maps, denoted P4 (width R/4, height L/4). The fourth activation layer receives P4 and outputs 128 feature maps, denoted H4 (width R/4, height L/4). The fifth convolution layer receives H4 and outputs 128 feature maps, denoted J5 (width R/4, height L/4). The fifth batch normalization layer receives J5 and outputs 128 feature maps, denoted P5 (width R/4, height L/4). The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2 (width R/4, height L/4). The fifth activation layer receives C2 and outputs 256 feature maps, denoted H5 (width R/4, height L/4). The third max pooling layer receives H5 and outputs 256 feature maps, denoted Z3 (width R/8, height L/8). The first dilated convolution layer receives Z3 and outputs 256 feature maps, denoted K1 (width R/8, height L/8). The sixth batch normalization layer receives K1 and outputs 256 feature maps, denoted P6 (width R/8, height L/8). The sixth activation layer receives P6 and outputs 256 feature maps, denoted H6 (width R/8, height L/8). The second dilated convolution layer receives H6 and outputs 256 feature maps, denoted K2 (width R/8, height L/8). The seventh batch normalization layer receives K2 and outputs 256 feature maps, denoted P7 (width R/8, height L/8). The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3 (width R/8, height L/8). The seventh activation layer receives C3 and outputs 512 feature maps, denoted H7 (width R/8, height L/8). The fourth max pooling layer receives H7 and outputs 512 feature maps, denoted Z4 (width R/16, height L/16). The third dilated convolution layer receives Z4 and outputs 512 feature maps, denoted K3 (width R/16, height L/16). The eighth batch normalization layer receives K3 and outputs 512 feature maps, denoted P8 (width R/16, height L/16). The eighth activation layer receives P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, and each feature map in H8 has width R/16 and height L/16;
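The per-stage pattern of the encoder data flow above (two convolutions, a short-skip Concatenate fusion of the stage's own intermediate outputs, a ReLU, then 2 × 2 max pooling) can be sketched as follows. This is an illustrative reading of the second and third pooling stages only; the class name EncoderStage and the channel numbers in the usage lines are assumptions, and the last two stages follow the same pattern with dilated convolutions instead.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One encoder stage (illustrative): conv+BN+ReLU, conv+BN,
    short-skip Concatenate fusion, ReLU, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_b = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        h = self.conv_a(x)             # e.g. H2: out_ch maps
        p = self.conv_b(h)             # e.g. P3: out_ch maps
        c = torch.cat([p, h], dim=1)   # short-skip fusion, 2*out_ch maps (e.g. C1)
        h_fused = self.relu(c)         # e.g. H3
        z = self.pool(h_fused)         # half the width and height (e.g. Z2)
        return p, z                    # p is reused later as a long-skip input to the decoder

# Example: the stage built from the second and third convolution layers takes Z1 (32 maps)
stage = EncoderStage(32, 64)
p3, z2 = stage(torch.randn(1, 32, 64, 128))   # dummy input of width 128 and height 64
```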
For the decoding framework: the input end of the first deconvolution layer receives the output H8 of the encoding framework, and the output end of the first deconvolution layer outputs 256 feature maps, denoted F1 (width R/8, height L/8). The ninth batch normalization layer receives F1 and outputs 256 feature maps, denoted P9 (width R/8, height L/8). The fourth Concatenate fusion layer receives all feature maps in P9 and all feature maps in P7 and outputs 512 feature maps, denoted C4 (width R/8, height L/8). The ninth activation layer receives C4 and outputs 512 feature maps, denoted H9 (width R/8, height L/8). The sixth convolution layer receives H9 and outputs 256 feature maps, denoted J6 (width R/8, height L/8). The tenth batch normalization layer receives J6 and outputs 256 feature maps, denoted P10 (width R/8, height L/8). The tenth activation layer receives P10 and outputs 256 feature maps, denoted H10 (width R/8, height L/8). The second deconvolution layer receives all feature maps in H10 and outputs 128 feature maps, denoted F2 (width R/4, height L/4). The eleventh batch normalization layer receives F2 and outputs 128 feature maps, denoted P11 (width R/4, height L/4). The fifth Concatenate fusion layer receives all feature maps in P11 and all feature maps in P5 and outputs 256 feature maps, denoted C5 (width R/4, height L/4). The eleventh activation layer receives C5 and outputs 256 feature maps, denoted H11 (width R/4, height L/4). The seventh convolution layer receives H11 and outputs 128 feature maps, denoted J7 (width R/4, height L/4). The twelfth batch normalization layer receives J7 and outputs 128 feature maps, denoted P12 (width R/4, height L/4). The twelfth activation layer receives P12 and outputs 128 feature maps, denoted H12 (width R/4, height L/4). The third deconvolution layer receives H12 and outputs 64 feature maps, denoted F3 (width R/2, height L/2). The thirteenth batch normalization layer receives F3 and outputs 64 feature maps, denoted P13 (width R/2, height L/2). The sixth Concatenate fusion layer receives all feature maps in P13 and all feature maps in P3 and outputs 128 feature maps, denoted C6 (width R/2, height L/2). The thirteenth activation layer receives C6 and outputs 128 feature maps, denoted H13 (width R/2, height L/2). The eighth convolution layer receives H13 and outputs 64 feature maps, denoted J8 (width R/2, height L/2). The fourteenth batch normalization layer receives J8 and outputs 64 feature maps, denoted P14 (width R/2, height L/2). The fourteenth activation layer receives P14 and outputs 64 feature maps, denoted H14 (width R/2, height L/2). The fourth deconvolution layer receives H14 and outputs 32 feature maps, denoted F4; each feature map in F4 has width R and height L. The fifteenth batch normalization layer receives F4 and outputs 32 feature maps, denoted P15 (width R, height L). The seventh Concatenate fusion layer receives all feature maps in P15, all feature maps in H1 and all feature maps in H20 (the output of the up-sampling framework described below) and outputs 96 feature maps, denoted C7; each feature map in C7 has width R and height L;
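Each decoder stage above follows one pattern: a stride-2 deconvolution with batch normalization, a long-skip Concatenate fusion with encoder feature maps (P7, P5 or P3), a ReLU, then a convolution with batch normalization and ReLU. A hedged sketch, with the class name DecoderStage and the exact tensor sizes in the usage line chosen only for illustration:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage (illustrative): deconv (stride 2) + BN, long-skip
    Concatenate fusion with an encoder feature set, ReLU, then conv + BN + ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                         padding=1, output_padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        p = self.bn(self.deconv(x))                   # e.g. P9: doubled width and height
        c = self.relu(torch.cat([p, skip], dim=1))    # long-skip fusion, e.g. C4 -> H9
        return self.conv(c)                           # e.g. H10

# Example: H8 (512 maps, R/16 x L/16) fused with the encoder's P7 (256 maps, R/8 x L/8)
stage = DecoderStage(in_ch=512, skip_ch=256, out_ch=256)
h10 = stage(torch.randn(1, 512, 8, 16), torch.randn(1, 256, 16, 32))
```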
For the up-sampling framework: the input end of the first up-sampling layer receives all feature maps in Z4, and the output end of the first up-sampling layer outputs 512 feature maps, denoted Y1 (width R/8, height L/8). The tenth convolution layer receives Y1 and outputs 256 feature maps, denoted J10 (width R/8, height L/8). The seventeenth batch normalization layer receives J10 and outputs 256 feature maps, denoted P17 (width R/8, height L/8). The seventeenth activation layer receives P17 and outputs 256 feature maps, denoted H17 (width R/8, height L/8). The second up-sampling layer receives H17 and outputs 256 feature maps, denoted Y2 (width R/4, height L/4). The eleventh convolution layer receives Y2 and outputs 128 feature maps, denoted J11 (width R/4, height L/4). The eighteenth batch normalization layer receives J11 and outputs 128 feature maps, denoted P18 (width R/4, height L/4). The eighteenth activation layer receives P18 and outputs 128 feature maps, denoted H18 (width R/4, height L/4). The third up-sampling layer receives H18 and outputs 128 feature maps, denoted Y3 (width R/2, height L/2). The twelfth convolution layer receives Y3 and outputs 64 feature maps, denoted J12 (width R/2, height L/2). The nineteenth batch normalization layer receives J12 and outputs 64 feature maps, denoted P19 (width R/2, height L/2). The nineteenth activation layer receives P19 and outputs 64 feature maps, denoted H19 (width R/2, height L/2). The fourth up-sampling layer receives H19 and outputs 64 feature maps, denoted Y4; each feature map in Y4 has width R and height L. The thirteenth convolution layer receives Y4 and outputs 32 feature maps, denoted J13 (width R, height L). The twentieth batch normalization layer receives J13 and outputs 32 feature maps, denoted P20 (width R, height L). The twentieth activation layer receives P20 and outputs 32 feature maps, denoted H20; each feature map in H20 has width R and height L;
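The up-sampling framework above is four repetitions of the same block: 2 × 2 up-sampling followed by a 3 × 3 convolution, batch normalization and ReLU, taking Z4 back to the full resolution R × L. A minimal sketch; the nearest-neighbour interpolation mode is an assumption, since the patent only specifies a 2 × 2 sampling stride:

```python
import torch
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    # 2x up-sampling, then 3x3 conv + BN + ReLU (interpolation mode assumed)
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Four blocks take Z4 (512 maps, R/16 x L/16) back to full resolution (32 maps, R x L)
branch = nn.Sequential(
    upsample_block(512, 256),
    upsample_block(256, 128),
    upsample_block(128, 64),
    upsample_block(64, 32),
)
h20 = branch(torch.randn(1, 512, 4, 8))   # -> shape (1, 32, 64, 128)
```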
For the output layer: the input end of the fifteenth activation layer receives the output of the decoding framework, namely all feature maps in C7, and the output end of the fifteenth activation layer outputs 96 feature maps, denoted H15 (width R, height L). The ninth convolution layer receives H15 and outputs 1 feature map, denoted J9 (width R, height L). The sixteenth batch normalization layer receives J9 and outputs 1 feature map, denoted P16 (width R, height L). The sixteenth activation layer receives P16 and outputs 1 feature map, denoted H16 (width R, height L); the feature map in H16 is the estimated depth image corresponding to the original input image;
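A small sketch of the output layer as described: a ReLU over the 96 fused feature maps in C7, a single-kernel 3 × 3 convolution, batch normalization, and a final ReLU producing the one-channel estimated depth image (padding chosen to keep the R × L size; framework assumed as before):

```python
import torch.nn as nn

# Illustrative output head mirroring the fifteenth activation layer, ninth convolution
# layer, sixteenth batch normalization layer and sixteenth activation layer.
output_head = nn.Sequential(
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 1, 3, padding=1),
    nn.BatchNorm2d(1),
    nn.ReLU(inplace=True),
)
```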
step 1_ 3: inputting each original monocular image in the training set into the convolutional neural network as an original input image for training, obtaining the estimated depth image corresponding to each original monocular image in the training set, and recording the estimated depth image corresponding to {Q_n(x,y)} as {D̂_n(x,y)}; wherein D̂_n(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {D̂_n(x,y)};
step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image, and recording the loss function value between {D̂_n(x,y)} and {D_n(x,y)} as Loss_n;
Step 1_ 5: repeatedly executing step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model, thereby obtaining N × V loss function values; then finding the loss function value with the smallest value among the N × V loss function values; then taking the weight vector and the bias term corresponding to that smallest loss function value as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted W_best and b_best; wherein V > 1;
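A hedged sketch of the training procedure of steps 1_3 to 1_5: iterate V times over the training set, compute the loss for each monocular image, and keep the weights and bias terms that produced the smallest loss value. The optimizer, learning rate and the loader (assumed to yield one image/depth pair at a time) are assumptions; the patent does not specify them.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, V=50, lr=1e-4, device="cpu"):
    """loader yields (monocular_image, true_depth) pairs; names are illustrative."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(V):                          # step 1_5: repeat steps 1_3 and 1_4 V times
        for image, depth_gt in loader:
            image, depth_gt = image.to(device), depth_gt.to(device)
            depth_est = model(image)                # step 1_3: estimated depth image
            loss = criterion(depth_est, depth_gt)   # step 1_4: mean square error loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:             # keep weights giving the smallest loss value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```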
the test stage process comprises the following specific steps:
step 2_ 1: let {Q(x',y')} denote a monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' represents the width of {Q(x',y')}, L' represents the height of {Q(x',y')}, and Q(x',y') represents the pixel value of the pixel point whose coordinate position is (x',y') in {Q(x',y')};
step 2_ 2: inputting {Q(x',y')} into the trained convolutional neural network training model and, using W_best and b_best, predicting the predicted depth image corresponding to {Q(x',y')}, recorded as {Q_depth(x',y')}; wherein Q_depth(x',y') represents the pixel value of the pixel point whose coordinate position is (x',y') in {Q_depth(x',y')}.
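A minimal sketch of the test stage of step 2_2: load the stored optimal weights and bias terms W_best and b_best and run a single forward pass on the monocular image to be predicted. The file name best_model.pth and the tensor layout are illustrative assumptions.

```python
import torch

def predict_depth(model, image, weights_path="best_model.pth", device="cpu"):
    """image: tensor of shape (1, 3, L', R'); weights_path is an illustrative file name."""
    model.load_state_dict(torch.load(weights_path, map_location=device))  # W_best and b_best
    model.to(device).eval()
    with torch.no_grad():
        depth = model(image.to(device))   # predicted depth image, same width/height as the input
    return depth
```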
In step 1_4, Loss_n is obtained by using a mean square error (MSE) function.
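With the mean square error choice stated above, the loss for the n-th training image can be written as follows (the standard per-pixel formulation; whether the patent normalizes by the pixel count R × L is not stated):

$$\mathrm{Loss}_n = \frac{1}{R \times L} \sum_{x=1}^{R} \sum_{y=1}^{L} \left( \hat{D}_n(x,y) - D_n(x,y) \right)^2$$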
Compared with the prior art, the invention has the advantages that:
1) In the process of constructing the convolutional neural network, the method adopts skip-layer connections, implemented with Concatenate fusion layers. Short-skip connections are used inside the encoding framework, namely the first, second and third Concatenate fusion layers; long-skip connections are used between the encoding framework and the decoding framework, namely the fourth, fifth, sixth and seventh Concatenate fusion layers. Skip-layer connection benefits multi-scale feature fusion and boundary preservation: the short-skip connections enrich the diversity of information in the encoding process, and the long-skip connections compensate for the loss of original boundary information in the decoding part, so that the depth estimation of the trained convolutional neural network model is more accurate.
2) The method uses an end-to-end convolutional neural network training framework, and three dilated (holed) convolution layers are used after the third max pooling layer of the encoding framework to extract feature information; dilated convolution enlarges the receptive field of the neurons and obtains more feature information without increasing the number of training parameters.
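The parameter argument can be checked directly: a 3 × 3 convolution with dilation 2 covers a 5 × 5 neighbourhood while keeping exactly the same number of weights as an ordinary 3 × 3 convolution (the dilation rate of 2 is an illustrative assumption; the patent does not state it):

```python
import torch.nn as nn

# Same 3x3 = 9 weights per channel pair, larger receptive field: no extra training parameters.
plain   = nn.Conv2d(256, 256, kernel_size=3, padding=1)
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)

count = lambda m: sum(p.numel() for p in m.parameters())
assert count(plain) == count(dilated)   # identical number of parameters
```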
3) The hidden layer of the convolutional neural network created by the method comprises an encoding framework, a decoding framework and an up-sampling framework; the combination of the three frameworks enables the trained convolutional neural network model to extract information-rich features, so that depth information of high accuracy can be obtained and the precision of the depth estimation result is improved.
4) The size of the predicted depth image obtained by the method is the same as that of the original monocular image, and direct use of depth information in the predicted depth image is facilitated.
Drawings
FIG. 1 is a schematic diagram of the structure of the coding framework in the hidden layer of the convolutional neural network created in the method of the present invention;
FIG. 2 is a schematic diagram of the respective constituent structures of the decoding framework in the hidden layer of the convolutional neural network created in the method of the present invention and the output layer of the convolutional neural network created;
fig. 3 is a schematic structural diagram of the composition of an upsampling frame in the hidden layer of the convolutional neural network created in the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a monocular vision depth estimation method which is characterized by comprising a training stage and a testing stage.
The specific steps of the training phase process are as follows:
step 1_ 1: selecting N original monocular images and a real depth image corresponding to each original monocular image, forming a training set, and recording the nth original monocular image in the training set as { Qn(x, y) }, the training set is summed with { Q }n(x, y) } the corresponding true depth image is recorded asWherein N is a positive integer, N is more than or equal to 100, if N is 1000, N is a positive integer, N is more than or equal to 1 and less than or equal to N, x is more than or equal to 1 and less than or equal to R, y is more than or equal to 1 and less than or equal to L, and R represents { Q ≦ Ln(x, y) } andl represents { Q ]n(x, y) } andr and L can be divided by 2, Qn(x, y) represents { QnThe coordinate position in (x, y) is the pixel value of the pixel point of (x, y),to representThe middle coordinate position is the pixel value of the pixel point of (x, y); here, the original monocular image and its corresponding true depth image are provided directly by the KITTI official network.
Step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer includes an encoding frame, a decoding frame, and an upsampling frame.
For the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein the input end of the input layer is required to receive the original input image with a width of R and a height of L.
As shown in fig. 1, the encoding framework consists of, in sequence: a first convolution layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolution layer, a second batch normalization layer, a second activation layer, a third convolution layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolution layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolution layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first dilated convolution layer (convolution layer with holes), a sixth batch normalization layer, a sixth activation layer, a second dilated convolution layer, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third dilated convolution layer, an eighth batch normalization layer and an eighth activation layer. As shown in fig. 2, the decoding framework consists of, in sequence: a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolution layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer. As shown in fig. 3, the up-sampling framework consists of, in sequence: a first up-sampling layer, a tenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolution layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolution layer, a twentieth batch normalization layer and a twentieth activation layer. As shown in fig. 2, the output layer consists of a fifteenth activation layer, a ninth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer arranged in sequence. The convolution kernels of the first to thirteenth convolution layers, the first to third dilated convolution layers and the first to fourth deconvolution layers are all of size 3 × 3. The number of convolution kernels is 32 for the first convolution layer, 64 for the second and third convolution layers, 128 for the fourth and fifth convolution layers, 256 for the first and second dilated convolution layers, 512 for the third dilated convolution layer, 256 for the first deconvolution layer and the sixth convolution layer, 128 for the second deconvolution layer and the seventh convolution layer, 64 for the third deconvolution layer and the eighth convolution layer, 32 for the fourth deconvolution layer, 1 for the ninth convolution layer, 256 for the tenth convolution layer, 128 for the eleventh convolution layer, 64 for the twelfth convolution layer and 32 for the thirteenth convolution layer. The convolution strides of the first to thirteenth convolution layers and of the first to third dilated convolution layers take default values, the convolution strides of the first to fourth deconvolution layers are 2 × 2, the parameters of the first to twentieth batch normalization layers take default values, the activation function of the first to twentieth activation layers is ReLU, the pooling stride of the first to fourth max pooling layers is 2 × 2, and the sampling stride of the first to fourth up-sampling layers is 2 × 2.
For the encoding framework: the input end of the first convolution layer receives the original input image output by the output end of the input layer, and the output end of the first convolution layer outputs 32 feature maps, whose set is denoted J1; each feature map in J1 has width R and height L. The first batch normalization layer receives all feature maps in J1 and outputs 32 feature maps, denoted P1 (width R, height L). The first activation layer receives P1 and outputs 32 feature maps, denoted H1 (width R, height L). The first max pooling layer receives H1 and outputs 32 feature maps, denoted Z1 (width R/2, height L/2). The second convolution layer receives Z1 and outputs 64 feature maps, denoted J2 (width R/2, height L/2). The second batch normalization layer receives J2 and outputs 64 feature maps, denoted P2 (width R/2, height L/2). The second activation layer receives P2 and outputs 64 feature maps, denoted H2 (width R/2, height L/2). The third convolution layer receives H2 and outputs 64 feature maps, denoted J3 (width R/2, height L/2). The third batch normalization layer receives J3 and outputs 64 feature maps, denoted P3 (width R/2, height L/2). The first Concatenate fusion layer receives all feature maps in P3 and all feature maps in H2 and outputs 128 feature maps, denoted C1 (width R/2, height L/2). The third activation layer receives C1 and outputs 128 feature maps, denoted H3 (width R/2, height L/2). The second max pooling layer receives H3 and outputs 128 feature maps, denoted Z2 (width R/4, height L/4). The fourth convolution layer receives Z2 and outputs 128 feature maps, denoted J4 (width R/4, height L/4). The fourth batch normalization layer receives J4 and outputs 128 feature maps, denoted P4 (width R/4, height L/4). The fourth activation layer receives P4 and outputs 128 feature maps, denoted H4 (width R/4, height L/4). The fifth convolution layer receives H4 and outputs 128 feature maps, denoted J5 (width R/4, height L/4). The fifth batch normalization layer receives J5 and outputs 128 feature maps, denoted P5 (width R/4, height L/4). The second Concatenate fusion layer receives all feature maps in P5 and all feature maps in H4 and outputs 256 feature maps, denoted C2 (width R/4, height L/4). The fifth activation layer receives C2 and outputs 256 feature maps, denoted H5 (width R/4, height L/4). The third max pooling layer receives H5 and outputs 256 feature maps, denoted Z3 (width R/8, height L/8). The first dilated convolution layer receives Z3 and outputs 256 feature maps, denoted K1 (width R/8, height L/8). The sixth batch normalization layer receives K1 and outputs 256 feature maps, denoted P6 (width R/8, height L/8). The sixth activation layer receives P6 and outputs 256 feature maps, denoted H6 (width R/8, height L/8). The second dilated convolution layer receives H6 and outputs 256 feature maps, denoted K2 (width R/8, height L/8). The seventh batch normalization layer receives K2 and outputs 256 feature maps, denoted P7 (width R/8, height L/8). The third Concatenate fusion layer receives all feature maps in P7 and all feature maps in H6 and outputs 512 feature maps, denoted C3 (width R/8, height L/8). The seventh activation layer receives C3 and outputs 512 feature maps, denoted H7 (width R/8, height L/8). The fourth max pooling layer receives H7 and outputs 512 feature maps, denoted Z4 (width R/16, height L/16). The third dilated convolution layer receives Z4 and outputs 512 feature maps, denoted K3 (width R/16, height L/16). The eighth batch normalization layer receives K3 and outputs 512 feature maps, denoted P8 (width R/16, height L/16). The eighth activation layer receives P8 and outputs 512 feature maps, denoted H8; H8 is the output of the encoding framework, and each feature map in H8 has width R/16 and height L/16.
For the decoding framework, the input end of the first deconvolution layer receives H8, the output of the coding framework, and its output end outputs 256 feature maps, forming a set denoted F1, each of width R/8 and height L/8; the input end of the ninth batch normalization layer receives all of the feature maps in F1, and its output end outputs 256 feature maps, forming a set denoted P9, each of width R/8 and height L/8; the input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7, and its output end outputs 512 feature maps, forming a set denoted C4, each of width R/8 and height L/8; the input end of the ninth activation layer receives all of the feature maps in C4, and its output end outputs 512 feature maps, forming a set denoted H9, each of width R/8 and height L/8; the input end of the sixth convolutional layer receives all of the feature maps in H9, and its output end outputs 256 feature maps, forming a set denoted J6, each of width R/8 and height L/8; the input end of the tenth batch normalization layer receives all of the feature maps in J6, and its output end outputs 256 feature maps, forming a set denoted P10, each of width R/8 and height L/8; the input end of the tenth activation layer receives all of the feature maps in P10, and its output end outputs 256 feature maps, forming a set denoted H10, each of width R/8 and height L/8; the input end of the second deconvolution layer receives all of the feature maps in H10, and its output end outputs 128 feature maps, forming a set denoted F2, each of width R/4 and height L/4; the input end of the eleventh batch normalization layer receives all of the feature maps in F2, and its output end outputs 128 feature maps, forming a set denoted P11, each of width R/4 and height L/4; the input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5, and its output end outputs 256 feature maps, forming a set denoted C5, each of width R/4 and height L/4; the input end of the eleventh activation layer receives all of the feature maps in C5, and its output end outputs 256 feature maps, forming a set denoted H11, each of width R/4 and height L/4; the input end of the seventh convolutional layer receives all of the feature maps in H11, and its output end outputs 128 feature maps, forming a set denoted J7, each of width R/4 and height L/4; the input end of the twelfth batch normalization layer receives all of the feature maps in J7, and its output end outputs 128 feature maps, forming a set denoted P12, each of width R/4 and height L/4; the input end of the twelfth activation layer receives all of the feature maps in P12, and its output end outputs 128 feature maps, forming a set denoted H12, each of width R/4 and height L/4; the input end of the third deconvolution layer receives all of the feature maps in H12, and its output end outputs 64 feature maps, forming a set denoted F3, each of width R/2 and height L/2; the input end of the thirteenth batch normalization layer receives all of the feature maps in F3, and its output end outputs 64 feature maps, forming a set denoted P13, each of width R/2 and height L/2; the input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3, and its output end outputs 128 feature maps, forming a set denoted C6, each of width R/2 and height L/2; the input end of the thirteenth activation layer receives all of the feature maps in C6, and its output end outputs 128 feature maps, forming a set denoted H13, each of width R/2 and height L/2; the input end of the eighth convolutional layer receives all of the feature maps in H13, and its output end outputs 64 feature maps, forming a set denoted J8, each of width R/2 and height L/2; the input end of the fourteenth batch normalization layer receives all of the feature maps in J8, and its output end outputs 64 feature maps, forming a set denoted P14, each of width R/2 and height L/2; the input end of the fourteenth activation layer receives all of the feature maps in P14, and its output end outputs 64 feature maps, forming a set denoted H14, each of width R/2 and height L/2; the input end of the fourth deconvolution layer receives all of the feature maps in H14, and its output end outputs 32 feature maps, forming a set denoted F4, each of width R and height L; the input end of the fifteenth batch normalization layer receives all of the feature maps in F4, and its output end outputs 32 feature maps, forming a set denoted P15, each of width R and height L; the input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20 (the output of the up-sampling framework), and its output end outputs 96 feature maps, forming a set denoted C7, each of width R and height L.
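The decoding framework thus behaves like a U-Net-style decoder: each stride-2 deconvolution doubles the spatial size, its batch-normalized output is concatenated with an encoder skip (P7, P5, P3, and finally H1 together with the up-sampling branch output H20), and a 3 × 3 convolution refines the fused maps. A minimal sketch, reusing conv_bn and the skip outputs from the encoder sketch above (again an illustration, not the patented implementation):

```python
# Hedged sketch of the decoding framework: stride-2 deconvolutions, batch
# normalization, Concatenate fusion with encoder skips and the up-sampling
# branch output, and 3x3 refinement convolutions.
import torch
import torch.nn as nn


def deconv_bn(in_ch, out_ch):
    """3x3 transposed convolution with stride 2 (doubles H and W) + batch norm."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch),
    )


class DecoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.up1, self.refine1 = deconv_bn(512, 256), conv_bn(512, 256)
        self.up2, self.refine2 = deconv_bn(256, 128), conv_bn(256, 128)
        self.up3, self.refine3 = deconv_bn(128, 64), conv_bn(128, 64)
        self.up4 = deconv_bn(64, 32)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, h8, skips, h20):
        h1, p3, p5, p7, _ = skips                        # encoder skip connections
        x = self.relu(torch.cat([self.up1(h8), p7], 1))  # C4 -> H9: 512 maps, R/8 x L/8
        x = self.relu(self.refine1(x))                   # H10: 256 maps
        x = self.relu(torch.cat([self.up2(x), p5], 1))   # C5 -> H11: 256 maps, R/4 x L/4
        x = self.relu(self.refine2(x))                   # H12: 128 maps
        x = self.relu(torch.cat([self.up3(x), p3], 1))   # C6 -> H13: 128 maps, R/2 x L/2
        x = self.relu(self.refine3(x))                   # H14: 64 maps
        return torch.cat([self.up4(x), h1, h20], 1)      # C7: 96 maps, R x L
```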
For the up-sampling framework, the input end of the first up-sampling layer receives all of the feature maps in Z4, and its output end outputs 512 feature maps, forming a set denoted Y1, each of width R/8 and height L/8; the input end of the tenth convolutional layer receives all of the feature maps in Y1, and its output end outputs 256 feature maps, forming a set denoted J10, each of width R/8 and height L/8; the input end of the seventeenth batch normalization layer receives all of the feature maps in J10, and its output end outputs 256 feature maps, forming a set denoted P17, each of width R/8 and height L/8; the input end of the seventeenth activation layer receives all of the feature maps in P17, and its output end outputs 256 feature maps, forming a set denoted H17, each of width R/8 and height L/8; the input end of the second up-sampling layer receives all of the feature maps in H17, and its output end outputs 256 feature maps, forming a set denoted Y2, each of width R/4 and height L/4; the input end of the eleventh convolutional layer receives all of the feature maps in Y2, and its output end outputs 128 feature maps, forming a set denoted J11, each of width R/4 and height L/4; the input end of the eighteenth batch normalization layer receives all of the feature maps in J11, and its output end outputs 128 feature maps, forming a set denoted P18, each of width R/4 and height L/4; the input end of the eighteenth activation layer receives all of the feature maps in P18, and its output end outputs 128 feature maps, forming a set denoted H18, each of width R/4 and height L/4; the input end of the third up-sampling layer receives all of the feature maps in H18, and its output end outputs 128 feature maps, forming a set denoted Y3, each of width R/2 and height L/2; the input end of the twelfth convolutional layer receives all of the feature maps in Y3, and its output end outputs 64 feature maps, forming a set denoted J12, each of width R/2 and height L/2; the input end of the nineteenth batch normalization layer receives all of the feature maps in J12, and its output end outputs 64 feature maps, forming a set denoted P19, each of width R/2 and height L/2; the input end of the nineteenth activation layer receives all of the feature maps in P19, and its output end outputs 64 feature maps, forming a set denoted H19, each of width R/2 and height L/2; the input end of the fourth up-sampling layer receives all of the feature maps in H19, and its output end outputs 64 feature maps, forming a set denoted Y4, each of width R and height L; the input end of the thirteenth convolutional layer receives all of the feature maps in Y4, and its output end outputs 32 feature maps, forming a set denoted J13, each of width R and height L; the input end of the twentieth batch normalization layer receives all of the feature maps in J13, and its output end outputs 32 feature maps, forming a set denoted P20, each of width R and height L; the input end of the twentieth activation layer receives all of the feature maps in P20, and its output end outputs 32 feature maps, forming a set denoted H20; each feature map in H20 has a width of R and a height of L.
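The up-sampling framework is therefore a separate refinement branch that carries Z4 (512 maps at 1/16 resolution) back to the full input resolution through four 2 × 2 up-sampling steps, each followed by a convolution, batch normalization and ReLU; its 32-map output H20 is what the seventh Concatenate fusion layer folds back into the decoder. A sketch under the same assumptions as above (the interpolation mode of the up-sampling layers is not specified in the patent, so nearest-neighbour is assumed), reusing the conv_bn helper:

```python
# Hedged sketch of the up-sampling framework: four 2x upsample + conv/BN/ReLU
# stages that take Z4 (512 maps, R/16 x L/16) back to 32 maps at R x L.
import torch.nn as nn


class UpsamplingBranchSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')  # interpolation mode is an assumption
        self.stages = nn.ModuleList([
            nn.Sequential(conv_bn(512, 256), nn.ReLU(inplace=True)),  # Y1 -> H17
            nn.Sequential(conv_bn(256, 128), nn.ReLU(inplace=True)),  # Y2 -> H18
            nn.Sequential(conv_bn(128, 64), nn.ReLU(inplace=True)),   # Y3 -> H19
            nn.Sequential(conv_bn(64, 32), nn.ReLU(inplace=True)),    # Y4 -> H20
        ])

    def forward(self, z4):
        x = z4
        for stage in self.stages:
            x = stage(self.up(x))  # double the resolution, then conv/BN/ReLU
        return x                   # H20: 32 maps, R x L
```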
For the output layer, the input end of the fifteenth activation layer receives the output of the decoding framework, namely C7, and its output end outputs 96 feature maps, forming a set denoted H15, each of width R and height L; the input end of the ninth convolutional layer receives all of the feature maps in H15, and its output end outputs 1 feature map, denoted J9, of width R and height L; the input end of the sixteenth batch normalization layer receives the feature map in J9, and its output end outputs 1 feature map, denoted P16, of width R and height L; the input end of the sixteenth activation layer receives the feature map in P16, and its output end outputs 1 feature map, denoted H16, of width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image.
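Putting the three frameworks and the single-map output head together, the whole network could be assembled as follows; this is a sketch built from the illustrative modules above, and MonocularDepthNetSketch is not the patent's name for anything:

```python
# Hedged sketch assembling encoder, up-sampling branch, decoder and the
# 1-channel output head (activation -> conv -> batch norm -> activation).
import torch.nn as nn


class MonocularDepthNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = EncoderSketch()
        self.upsampling_branch = UpsamplingBranchSketch()
        self.decoder = DecoderSketch()
        self.head = nn.Sequential(
            nn.ReLU(inplace=True),  # fifteenth activation layer
            conv_bn(96, 1),         # ninth convolutional layer + sixteenth batch normalization layer
            nn.ReLU(inplace=True),  # sixteenth activation layer
        )

    def forward(self, image):
        h8, skips = self.encoder(image)
        h20 = self.upsampling_branch(skips[-1])  # skips[-1] is Z4
        c7 = self.decoder(h8, skips, h20)        # 96 maps at full resolution
        return self.head(c7)                     # 1-channel estimated depth image
```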
Step 1_3: each original monocular image in the training set is used as the original input image and input into the convolutional neural network for training, so as to obtain the estimated depth image corresponding to each original monocular image in the training set; the estimated depth image corresponding to {Qn(x, y)} is denoted {Qn,depth(x, y)}, where Qn,depth(x, y) represents the pixel value of the pixel point whose coordinate position is (x, y) in {Qn,depth(x, y)}.
Step 1_4: calculate the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth image; the loss function value between {Qn,depth(x, y)} and its corresponding real depth image is obtained by using a mean square error function.
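Since the loss named in step 1_4 is a mean square error between the estimated and real depth images, a per-image loss of this kind can be sketched very simply (variable names are illustrative):

```python
# Hedged sketch of the mean-square-error loss used in step 1_4.
import torch.nn.functional as F


def depth_mse_loss(estimated_depth, real_depth):
    """Both tensors have shape (batch, 1, height, width)."""
    return F.mse_loss(estimated_depth, real_depth)
```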
Step 1_5: repeat step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model, and obtain N × V loss function values; the loss function value with the minimum value is then found among the N × V loss function values, and the weight vector and the bias term corresponding to that minimum loss function value are taken as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted Wbest and bbest; where V > 1, and in this example V is 20.
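A minimal training loop in the spirit of steps 1_3 to 1_5, which keeps the parameters that produced the smallest loss value seen during the V passes over the training set, might look as follows; the optimizer, learning rate and data-loader interface are assumptions, since the patent does not specify them:

```python
# Hedged sketch of steps 1_3 to 1_5: train for V epochs and keep the weights
# and biases that gave the smallest loss value (W_best, b_best).
import copy
import torch


def train_sketch(model, train_loader, epochs=20, lr=1e-4, device='cuda'):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs):                     # V repetitions of steps 1_3 and 1_4
        for image, real_depth in train_loader:      # one (monocular image, real depth) pair per sample
            image, real_depth = image.to(device), real_depth.to(device)
            estimated_depth = model(image)                         # step 1_3: forward pass
            loss = depth_mse_loss(estimated_depth, real_depth)     # step 1_4: MSE loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:             # step 1_5: remember the best parameters
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```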
The test stage process comprises the following specific steps:
Step 2_1: let {Q(x', y')} denote a monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' represents the width of {Q(x', y')}, L' represents the height of {Q(x', y')}, and Q(x', y') represents the pixel value of the pixel point whose coordinate position is (x', y') in {Q(x', y')}.
Step 2_2: input {Q(x', y')} into the trained convolutional neural network training model and use Wbest and bbest for prediction, obtaining the predicted depth image corresponding to {Q(x', y')}, denoted {Qdepth(x', y')}; where Qdepth(x', y') represents the pixel value of the pixel point whose coordinate position is (x', y') in {Qdepth(x', y')}.
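The prediction step then amounts to loading the stored optimal parameters and running one forward pass (illustrative only):

```python
# Hedged sketch of step 2_2: predict the depth image for a monocular image
# to be predicted using the best parameters found in the training stage.
import torch


def predict_depth(model, best_state, image):
    """image: tensor of shape (1, 3, L', R'); returns the predicted depth image."""
    model.load_state_dict(best_state)  # load W_best and b_best
    model.eval()
    with torch.no_grad():
        return model(image)            # {Q_depth(x', y')}
```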
In order to verify the feasibility and effectiveness of the method of the invention, experiments were carried out.
In the method of the invention, the monocular images forming the training set and the monocular images used for testing are provided by the KITTI official website, so the test data set given by the KITTI official website is used directly to analyse and test the accuracy of the method. Each monocular image in the test data set is input, as a monocular image to be predicted, into the trained convolutional neural network training model, the optimal weight vector Wbest obtained in the training stage is loaded, and the corresponding predicted depth image is obtained.
Here, 6 objective parameters commonly used for evaluating monocular visual depth prediction are used as evaluation indexes, namely: root mean square error (rms), logarithmic root mean square error (log_rms), mean log10 error (log10), and the threshold accuracies δ1, δ2 and δ3. Lower values of the root mean square error, the logarithmic root mean square error and the mean log10 error indicate that the predicted depth image is closer to the real depth image, while higher values of δ1, δ2 and δ3 indicate a higher accuracy of the predicted depth image. The root mean square error, logarithmic root mean square error, mean log10 error, δ1, δ2 and δ3 reflecting the performance of the method of the invention are listed in Table 1. As can be seen from the data listed in Table 1, the difference between the predicted depth image obtained by the method of the invention and the real depth image is very small, which shows that the accuracy of the prediction result of the method of the invention is very high and reflects the feasibility and effectiveness of the method of the invention.
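The six indexes are only named in the text above; under their usual definitions in the monocular depth estimation literature they can be computed as in the following sketch:

```python
# Hedged sketch of the six evaluation indexes (usual definitions assumed):
# rms, log_rms, log10 and the threshold accuracies delta1, delta2, delta3.
import numpy as np


def depth_metrics(pred, gt):
    """pred, gt: NumPy arrays of positive depth values with the same shape."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        'rms':     np.sqrt(np.mean((pred - gt) ** 2)),
        'log_rms': np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        'log10':   np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        'delta1':  np.mean(ratio < 1.25),        # threshold accuracy, thr = 1.25
        'delta2':  np.mean(ratio < 1.25 ** 2),
        'delta3':  np.mean(ratio < 1.25 ** 3),
    }
```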
TABLE 1 Evaluation indexes between the predicted depth images obtained by the method of the invention and the corresponding real depth images
Claims (2)
1. A monocular visual depth estimation method is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select N original monocular images and the real depth image corresponding to each original monocular image to form a training set, and denote the nth original monocular image in the training set as {Qn(x, y)}; wherein N is a positive integer, N ≥ 100, n is a positive integer, 1 ≤ n ≤ N, 1 ≤ x ≤ R, 1 ≤ y ≤ L, R represents the width of {Qn(x, y)} and of its corresponding real depth image, L represents the height of {Qn(x, y)} and of its corresponding real depth image, both R and L are divisible by 2, Qn(x, y) represents the pixel value of the pixel point whose coordinate position is (x, y) in {Qn(x, y)}, and the corresponding real depth image is indexed by the same coordinates (x, y);
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises an encoding frame, a decoding frame and an up-sampling frame;
for the input layer, the input end of the input layer receives an original input image, and the output end of the input layer outputs the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be R, and the height of the original input image is required to be L;
for the coding framework, the coding framework consists of a first convolutional layer, a first batch normalization layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second batch normalization layer, a second activation layer, a third convolutional layer, a third batch normalization layer, a first Concatenate fusion layer, a third activation layer, a second max pooling layer, a fourth convolutional layer, a fourth batch normalization layer, a fourth activation layer, a fifth convolutional layer, a fifth batch normalization layer, a second Concatenate fusion layer, a fifth activation layer, a third max pooling layer, a first convolutional layer with holes, a sixth batch normalization layer, a sixth activation layer, a second convolutional layer with holes, a seventh batch normalization layer, a third Concatenate fusion layer, a seventh activation layer, a fourth max pooling layer, a third convolutional layer with holes, an eighth batch normalization layer and an eighth activation layer which are arranged in sequence; the decoding framework consists of a first deconvolution layer, a ninth batch normalization layer, a fourth Concatenate fusion layer, a ninth activation layer, a sixth convolutional layer, a tenth batch normalization layer, a tenth activation layer, a second deconvolution layer, an eleventh batch normalization layer, a fifth Concatenate fusion layer, an eleventh activation layer, a seventh convolutional layer, a twelfth batch normalization layer, a twelfth activation layer, a third deconvolution layer, a thirteenth batch normalization layer, a sixth Concatenate fusion layer, a thirteenth activation layer, an eighth convolutional layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fourth deconvolution layer, a fifteenth batch normalization layer and a seventh Concatenate fusion layer which are arranged in sequence; the up-sampling framework consists of a first up-sampling layer, a tenth convolutional layer, a seventeenth batch normalization layer, a seventeenth activation layer, a second up-sampling layer, an eleventh convolutional layer, an eighteenth batch normalization layer, an eighteenth activation layer, a third up-sampling layer, a twelfth convolutional layer, a nineteenth batch normalization layer, a nineteenth activation layer, a fourth up-sampling layer, a thirteenth convolutional layer, a twentieth batch normalization layer and a twentieth activation layer which are arranged in sequence; and the output layer consists of a fifteenth activation layer, a ninth convolutional layer, a sixteenth batch normalization layer and a sixteenth activation layer which are arranged in sequence; wherein the convolution kernels of the first to thirteenth convolutional layers, of the first to third convolutional layers with holes and of the first to fourth deconvolution layers all have a size of 3 × 3; the number of convolution kernels of the first convolutional layer is 32, of the second and third convolutional layers is 64, of the fourth and fifth convolutional layers is 128, of the first and second convolutional layers with holes is 256, of the third convolutional layer with holes is 512, of the first deconvolution layer and the sixth convolutional layer is 256, of the second deconvolution layer and the seventh convolutional layer is 128, of the third deconvolution layer and the eighth convolutional layer is 64, of the fourth deconvolution layer is 32, of the ninth convolutional layer is 1, of the tenth convolutional layer is 256, of the eleventh convolutional layer is 128, of the twelfth convolutional layer is 64, and of the thirteenth convolutional layer is 32; the convolution strides of the first to thirteenth convolutional layers and of the first to third convolutional layers with holes adopt default values, and the strides of the first to fourth deconvolution layers are 2 × 2; the parameters of the first to twentieth batch normalization layers adopt default values; ReLU is adopted as the activation function of the first to twentieth activation layers; the pooling stride of the first to fourth max pooling layers is 2 × 2; and the sampling stride of the first to fourth up-sampling layers is 2 × 2;
for the coding framework, the input end of the first convolutional layer receives the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, forming a set denoted J1, each of width R and height L; the input end of the first batch normalization layer receives all of the feature maps in J1, and its output end outputs 32 feature maps, forming a set denoted P1, each of width R and height L; the input end of the first activation layer receives all of the feature maps in P1, and its output end outputs 32 feature maps, forming a set denoted H1, each of width R and height L; the input end of the first max pooling layer receives all of the feature maps in H1, and its output end outputs 32 feature maps, forming a set denoted Z1, each of width R/2 and height L/2; the input end of the second convolutional layer receives all of the feature maps in Z1, and its output end outputs 64 feature maps, forming a set denoted J2, each of width R/2 and height L/2; the input end of the second batch normalization layer receives all of the feature maps in J2, and its output end outputs 64 feature maps, forming a set denoted P2, each of width R/2 and height L/2; the input end of the second activation layer receives all of the feature maps in P2, and its output end outputs 64 feature maps, forming a set denoted H2, each of width R/2 and height L/2; the input end of the third convolutional layer receives all of the feature maps in H2, and its output end outputs 64 feature maps, forming a set denoted J3, each of width R/2 and height L/2; the input end of the third batch normalization layer receives all of the feature maps in J3, and its output end outputs 64 feature maps, forming a set denoted P3, each of width R/2 and height L/2; the input end of the first Concatenate fusion layer receives all of the feature maps in P3 and all of the feature maps in H2, and its output end outputs 128 feature maps, forming a set denoted C1, each of width R/2 and height L/2; the input end of the third activation layer receives all of the feature maps in C1, and its output end outputs 128 feature maps, forming a set denoted H3, each of width R/2 and height L/2; the input end of the second max pooling layer receives all of the feature maps in H3, and its output end outputs 128 feature maps, forming a set denoted Z2, each of width R/4 and height L/4; the input end of the fourth convolutional layer receives all of the feature maps in Z2, and its output end outputs 128 feature maps, forming a set denoted J4, each of width R/4 and height L/4; the input end of the fourth batch normalization layer receives all of the feature maps in J4, and its output end outputs 128 feature maps, forming a set denoted P4, each of width R/4 and height L/4; the input end of the fourth activation layer receives all of the feature maps in P4, and its output end outputs 128 feature maps, forming a set denoted H4, each of width R/4 and height L/4; the input end of the fifth convolutional layer receives all of the feature maps in H4, and its output end outputs 128 feature maps, forming a set denoted J5, each of width R/4 and height L/4; the input end of the fifth batch normalization layer receives all of the feature maps in J5, and its output end outputs 128 feature maps, forming a set denoted P5, each of width R/4 and height L/4; the input end of the second Concatenate fusion layer receives all of the feature maps in P5 and all of the feature maps in H4, and its output end outputs 256 feature maps, forming a set denoted C2, each of width R/4 and height L/4; the input end of the fifth activation layer receives all of the feature maps in C2, and its output end outputs 256 feature maps, forming a set denoted H5, each of width R/4 and height L/4; the input end of the third max pooling layer receives all of the feature maps in H5, and its output end outputs 256 feature maps, forming a set denoted Z3, each of width R/8 and height L/8; the input end of the first convolutional layer with holes receives all of the feature maps in Z3, and its output end outputs 256 feature maps, forming a set denoted K1, each of width R/8 and height L/8; the input end of the sixth batch normalization layer receives all of the feature maps in K1, and its output end outputs 256 feature maps, forming a set denoted P6, each of width R/8 and height L/8; the input end of the sixth activation layer receives all of the feature maps in P6, and its output end outputs 256 feature maps, forming a set denoted H6, each of width R/8 and height L/8; the input end of the second convolutional layer with holes receives all of the feature maps in H6, and its output end outputs 256 feature maps, forming a set denoted K2, each of width R/8 and height L/8; the input end of the seventh batch normalization layer receives all of the feature maps in K2, and its output end outputs 256 feature maps, forming a set denoted P7, each of width R/8 and height L/8; the input end of the third Concatenate fusion layer receives all of the feature maps in P7 and all of the feature maps in H6, and its output end outputs 512 feature maps, forming a set denoted C3, each of width R/8 and height L/8; the input end of the seventh activation layer receives all of the feature maps in C3, and its output end outputs 512 feature maps, forming a set denoted H7, each of width R/8 and height L/8; the input end of the fourth max pooling layer receives all of the feature maps in H7, and its output end outputs 512 feature maps, forming a set denoted Z4, each of width R/16 and height L/16; the input end of the third convolutional layer with holes receives all of the feature maps in Z4, and its output end outputs 512 feature maps, forming a set denoted K3, each of width R/16 and height L/16; the input end of the eighth batch normalization layer receives all of the feature maps in K3, and its output end outputs 512 feature maps, forming a set denoted P8, each of width R/16 and height L/16; the input end of the eighth activation layer receives all of the feature maps in P8, and its output end outputs 512 feature maps, forming a set denoted H8; H8 is the output of the coding framework, and each feature map in H8 has a width of R/16 and a height of L/16;
for the decoding framework, the input end of the first deconvolution layer receives H8, the output of the coding framework, and its output end outputs 256 feature maps, forming a set denoted F1, each of width R/8 and height L/8; the input end of the ninth batch normalization layer receives all of the feature maps in F1, and its output end outputs 256 feature maps, forming a set denoted P9, each of width R/8 and height L/8; the input end of the fourth Concatenate fusion layer receives all of the feature maps in P9 and all of the feature maps in P7, and its output end outputs 512 feature maps, forming a set denoted C4, each of width R/8 and height L/8; the input end of the ninth activation layer receives all of the feature maps in C4, and its output end outputs 512 feature maps, forming a set denoted H9, each of width R/8 and height L/8; the input end of the sixth convolutional layer receives all of the feature maps in H9, and its output end outputs 256 feature maps, forming a set denoted J6, each of width R/8 and height L/8; the input end of the tenth batch normalization layer receives all of the feature maps in J6, and its output end outputs 256 feature maps, forming a set denoted P10, each of width R/8 and height L/8; the input end of the tenth activation layer receives all of the feature maps in P10, and its output end outputs 256 feature maps, forming a set denoted H10, each of width R/8 and height L/8; the input end of the second deconvolution layer receives all of the feature maps in H10, and its output end outputs 128 feature maps, forming a set denoted F2, each of width R/4 and height L/4; the input end of the eleventh batch normalization layer receives all of the feature maps in F2, and its output end outputs 128 feature maps, forming a set denoted P11, each of width R/4 and height L/4; the input end of the fifth Concatenate fusion layer receives all of the feature maps in P11 and all of the feature maps in P5, and its output end outputs 256 feature maps, forming a set denoted C5, each of width R/4 and height L/4; the input end of the eleventh activation layer receives all of the feature maps in C5, and its output end outputs 256 feature maps, forming a set denoted H11, each of width R/4 and height L/4; the input end of the seventh convolutional layer receives all of the feature maps in H11, and its output end outputs 128 feature maps, forming a set denoted J7, each of width R/4 and height L/4; the input end of the twelfth batch normalization layer receives all of the feature maps in J7, and its output end outputs 128 feature maps, forming a set denoted P12, each of width R/4 and height L/4; the input end of the twelfth activation layer receives all of the feature maps in P12, and its output end outputs 128 feature maps, forming a set denoted H12, each of width R/4 and height L/4; the input end of the third deconvolution layer receives all of the feature maps in H12, and its output end outputs 64 feature maps, forming a set denoted F3, each of width R/2 and height L/2; the input end of the thirteenth batch normalization layer receives all of the feature maps in F3, and its output end outputs 64 feature maps, forming a set denoted P13, each of width R/2 and height L/2; the input end of the sixth Concatenate fusion layer receives all of the feature maps in P13 and all of the feature maps in P3, and its output end outputs 128 feature maps, forming a set denoted C6, each of width R/2 and height L/2; the input end of the thirteenth activation layer receives all of the feature maps in C6, and its output end outputs 128 feature maps, forming a set denoted H13, each of width R/2 and height L/2; the input end of the eighth convolutional layer receives all of the feature maps in H13, and its output end outputs 64 feature maps, forming a set denoted J8, each of width R/2 and height L/2; the input end of the fourteenth batch normalization layer receives all of the feature maps in J8, and its output end outputs 64 feature maps, forming a set denoted P14, each of width R/2 and height L/2; the input end of the fourteenth activation layer receives all of the feature maps in P14, and its output end outputs 64 feature maps, forming a set denoted H14, each of width R/2 and height L/2; the input end of the fourth deconvolution layer receives all of the feature maps in H14, and its output end outputs 32 feature maps, forming a set denoted F4, each of width R and height L; the input end of the fifteenth batch normalization layer receives all of the feature maps in F4, and its output end outputs 32 feature maps, forming a set denoted P15, each of width R and height L; the input end of the seventh Concatenate fusion layer receives all of the feature maps in P15, all of the feature maps in H1 and all of the feature maps in H20 (the output of the up-sampling framework), and its output end outputs 96 feature maps, forming a set denoted C7, each of width R and height L;
for the up-sampling framework, the input end of the first up-sampling layer receives all of the feature maps in Z4, and its output end outputs 512 feature maps, forming a set denoted Y1, each of width R/8 and height L/8; the input end of the tenth convolutional layer receives all of the feature maps in Y1, and its output end outputs 256 feature maps, forming a set denoted J10, each of width R/8 and height L/8; the input end of the seventeenth batch normalization layer receives all of the feature maps in J10, and its output end outputs 256 feature maps, forming a set denoted P17, each of width R/8 and height L/8; the input end of the seventeenth activation layer receives all of the feature maps in P17, and its output end outputs 256 feature maps, forming a set denoted H17, each of width R/8 and height L/8; the input end of the second up-sampling layer receives all of the feature maps in H17, and its output end outputs 256 feature maps, forming a set denoted Y2, each of width R/4 and height L/4; the input end of the eleventh convolutional layer receives all of the feature maps in Y2, and its output end outputs 128 feature maps, forming a set denoted J11, each of width R/4 and height L/4; the input end of the eighteenth batch normalization layer receives all of the feature maps in J11, and its output end outputs 128 feature maps, forming a set denoted P18, each of width R/4 and height L/4; the input end of the eighteenth activation layer receives all of the feature maps in P18, and its output end outputs 128 feature maps, forming a set denoted H18, each of width R/4 and height L/4; the input end of the third up-sampling layer receives all of the feature maps in H18, and its output end outputs 128 feature maps, forming a set denoted Y3, each of width R/2 and height L/2; the input end of the twelfth convolutional layer receives all of the feature maps in Y3, and its output end outputs 64 feature maps, forming a set denoted J12, each of width R/2 and height L/2; the input end of the nineteenth batch normalization layer receives all of the feature maps in J12, and its output end outputs 64 feature maps, forming a set denoted P19, each of width R/2 and height L/2; the input end of the nineteenth activation layer receives all of the feature maps in P19, and its output end outputs 64 feature maps, forming a set denoted H19, each of width R/2 and height L/2; the input end of the fourth up-sampling layer receives all of the feature maps in H19, and its output end outputs 64 feature maps, forming a set denoted Y4, each of width R and height L; the input end of the thirteenth convolutional layer receives all of the feature maps in Y4, and its output end outputs 32 feature maps, forming a set denoted J13, each of width R and height L; the input end of the twentieth batch normalization layer receives all of the feature maps in J13, and its output end outputs 32 feature maps, forming a set denoted P20, each of width R and height L; the input end of the twentieth activation layer receives all of the feature maps in P20, and its output end outputs 32 feature maps, forming a set denoted H20, each of width R and height L;
for the output layer, the input end of the fifteenth activation layer receives the output of the decoding framework, namely C7, and its output end outputs 96 feature maps, forming a set denoted H15, each of width R and height L; the input end of the ninth convolutional layer receives all of the feature maps in H15, and its output end outputs 1 feature map, denoted J9, of width R and height L; the input end of the sixteenth batch normalization layer receives the feature map in J9, and its output end outputs 1 feature map, denoted P16, of width R and height L; the input end of the sixteenth activation layer receives the feature map in P16, and its output end outputs 1 feature map, denoted H16, of width R and height L; the feature map in H16 is the estimated depth image corresponding to the original input image;
step 1_ 3: inputting each original monocular image in the training set as an original input image into a convolutional neural network for training to obtain an estimated depth image corresponding to each original monocular image in the training set, and taking the { Q value as the value of the estimated depth imagen(x, y) } corresponding estimated depth image is noted asWherein,to representThe middle coordinate position is the pixel value of the pixel point of (x, y);
step 1_ 4: calculating the loss function value between the estimated depth image corresponding to each original monocular image in the training set and the corresponding real depth imageAndthe value of the loss function in between is recorded as
Step 1_5: repeat step 1_3 and step 1_4 V times to obtain a trained convolutional neural network training model, and obtain N × V loss function values; the loss function value with the minimum value is then found among the N × V loss function values, and the weight vector and the bias term corresponding to that minimum loss function value are taken as the optimal weight vector and the optimal bias term of the trained convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: let {Q(x', y')} denote a monocular image to be predicted; wherein 1 ≤ x' ≤ R', 1 ≤ y' ≤ L', R' represents the width of {Q(x', y')}, L' represents the height of {Q(x', y')}, and Q(x', y') represents the pixel value of the pixel point whose coordinate position is (x', y') in {Q(x', y')};
step 2_2: input {Q(x', y')} into the trained convolutional neural network training model and use Wbest and bbest for prediction, obtaining the predicted depth image corresponding to {Q(x', y')}, denoted {Qdepth(x', y')}; wherein Qdepth(x', y') represents the pixel value of the pixel point whose coordinate position is (x', y') in {Qdepth(x', y')}.
2. The method according to claim 1, wherein in step 1_4, the loss function value between the estimated depth image and the corresponding real depth image is obtained by using a mean square error function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811246664.0A CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811246664.0A CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109460815A true CN109460815A (en) | 2019-03-12 |
CN109460815B CN109460815B (en) | 2021-12-10 |
Family
ID=65608334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811246664.0A Active CN109460815B (en) | 2018-10-25 | 2018-10-25 | Monocular vision depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109460815B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110414674A (en) * | 2019-07-31 | 2019-11-05 | 浙江科技学院 | A kind of monocular depth estimation method based on residual error network and local refinement |
CN111161166A (en) * | 2019-12-16 | 2020-05-15 | 西安交通大学 | Image moire eliminating method based on depth multi-resolution network |
US20210209453A1 (en) * | 2019-03-14 | 2021-07-08 | Infineon Technologies Ag | Fmcw radar with interference signal suppression using artificial neural network |
WO2022193866A1 (en) * | 2021-03-16 | 2022-09-22 | Huawei Technologies Co., Ltd. | Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos |
US11885903B2 (en) | 2019-03-14 | 2024-01-30 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
US12032089B2 (en) | 2019-03-14 | 2024-07-09 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180260703A1 (en) * | 2016-11-22 | 2018-09-13 | Massachusetts Institute Of Technology | Systems and methods for training neural networks |
CN107886165A (en) * | 2017-12-30 | 2018-04-06 | 北京工业大学 | A kind of parallel-convolution neural net method based on CRT technology |
CN108090472A (en) * | 2018-01-12 | 2018-05-29 | 浙江大学 | Pedestrian based on multichannel uniformity feature recognition methods and its system again |
CN108681692A (en) * | 2018-04-10 | 2018-10-19 | 华南理工大学 | Increase Building recognition method in a kind of remote sensing images based on deep learning newly |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210209453A1 (en) * | 2019-03-14 | 2021-07-08 | Infineon Technologies Ag | Fmcw radar with interference signal suppression using artificial neural network |
US11885903B2 (en) | 2019-03-14 | 2024-01-30 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
US11907829B2 (en) * | 2019-03-14 | 2024-02-20 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
US12032089B2 (en) | 2019-03-14 | 2024-07-09 | Infineon Technologies Ag | FMCW radar with interference signal suppression using artificial neural network |
CN110414674A (en) * | 2019-07-31 | 2019-11-05 | 浙江科技学院 | A kind of monocular depth estimation method based on residual error network and local refinement |
CN110414674B (en) * | 2019-07-31 | 2021-09-10 | 浙江科技学院 | Monocular depth estimation method based on residual error network and local refinement |
CN111161166A (en) * | 2019-12-16 | 2020-05-15 | 西安交通大学 | Image moire eliminating method based on depth multi-resolution network |
WO2022193866A1 (en) * | 2021-03-16 | 2022-09-22 | Huawei Technologies Co., Ltd. | Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos |
US12033342B2 (en) | 2021-03-16 | 2024-07-09 | Huawei Technologies Co., Ltd. | Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos |
Also Published As
Publication number | Publication date |
---|---|
CN109460815B (en) | 2021-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109460815B (en) | Monocular vision depth estimation method | |
CN110992275B (en) | Refined single image rain removing method based on generation of countermeasure network | |
CN113688723B (en) | Infrared image pedestrian target detection method based on improved YOLOv5 | |
CN109146944B (en) | Visual depth estimation method based on depth separable convolutional neural network | |
CN113780296B (en) | Remote sensing image semantic segmentation method and system based on multi-scale information fusion | |
CN110490082B (en) | Road scene semantic segmentation method capable of effectively fusing neural network features | |
CN109635662B (en) | Road scene semantic segmentation method based on convolutional neural network | |
CN112489164B (en) | Image coloring method based on improved depth separable convolutional neural network | |
CN109740451B (en) | Road scene image semantic segmentation method based on importance weighting | |
CN110322499A (en) | A kind of monocular image depth estimation method based on multilayer feature | |
CN110009700B (en) | Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph | |
CN109508639B (en) | Road scene semantic segmentation method based on multi-scale porous convolutional neural network | |
CN109461177B (en) | Monocular image depth prediction method based on neural network | |
CN116469100A (en) | Dual-band image semantic segmentation method based on Transformer | |
CN112818777B (en) | Remote sensing image target detection method based on dense connection and feature enhancement | |
CN113160265A (en) | Construction method of prediction image for brain corpus callosum segmentation for corpus callosum state evaluation | |
CN110555461A (en) | scene classification method and system based on multi-structure convolutional neural network feature fusion | |
CN109448039B (en) | Monocular vision depth estimation method based on deep convolutional neural network | |
CN112819096A (en) | Method for constructing fossil image classification model based on composite convolutional neural network | |
CN114913493A (en) | Lane line detection method based on deep learning | |
CN112215199A (en) | SAR image ship detection method based on multi-receptive-field and dense feature aggregation network | |
CN115937693A (en) | Road identification method and system based on remote sensing image | |
CN116206214A (en) | Automatic landslide recognition method, system, equipment and medium based on lightweight convolutional neural network and double attention | |
CN112149496A (en) | Real-time road scene segmentation method based on convolutional neural network | |
CN115512100A (en) | Point cloud segmentation method, device and medium based on multi-scale feature extraction and fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |