CN110009674B - Monocular image depth of field real-time calculation method based on unsupervised deep learning - Google Patents

Monocular image depth of field real-time calculation method based on unsupervised deep learning

Info

Publication number
CN110009674B
Authority
CN
China
Prior art keywords
image
depth
neural network
convolutional neural
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910256117.9A
Other languages
Chinese (zh)
Other versions
CN110009674A (en)
Inventor
仲训昱
杨德龙
殷昕
彭侠夫
邹朝圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Winjoin Technology Co ltd
Xiamen University
Original Assignee
Xiamen Winjoin Technology Co ltd
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Winjoin Technology Co ltd, Xiamen University filed Critical Xiamen Winjoin Technology Co ltd
Priority to CN201910256117.9A priority Critical patent/CN110009674B/en
Publication of CN110009674A publication Critical patent/CN110009674A/en
Application granted granted Critical
Publication of CN110009674B publication Critical patent/CN110009674B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Abstract

The invention discloses a monocular image depth of field real-time calculation method based on unsupervised deep learning. A supervision signal is constructed from the geometric constraint relation between binocular sequence images, replacing the traditional manually labeled data set and completing the unsupervised algorithm design. In the Depth-CNN network, the loss function accounts for the geometric constraints between images and additionally includes a consistency constraint term on the depth-of-field estimation results of the left and right images, which improves the algorithm accuracy. The output of the Depth-CNN is used as part of the input of the Pose-CNN to construct an overall objective function, and the geometric relationship between binocular images and the geometric relationship between sequence images are both used to construct the supervision signal, further improving the accuracy and robustness of the algorithm.

Description

Monocular image depth of field real-time calculation method based on unsupervised deep learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Background
Owing to its low purchase price and its ability to acquire complete scene information in real time, the camera is widely used in scene perception research for service robots, autonomous navigation robots and unmanned vehicles. With the development of high-performance computing devices, artificial intelligence technology that analyses 2D image information with deep neural networks plays an increasingly irreplaceable role in fields such as unmanned driving and robot navigation. Real-time calculation of scene depth from a monocular image is a prerequisite for three-dimensional scene perception technology. In 2014, David Eigen first used a deep neural network to calculate the scene depth corresponding to a 2D image, establishing a mapping from 2D to 3D.
At present, scene depth of field calculation algorithms based on monocular images are mainly classified into supervised and unsupervised algorithms. Supervised algorithms require a large amount of manually labeled data. David Eigen proposes a coarse-to-fine estimation method that uses two deep convolutional neural networks in sequence to obtain scene depth in the document "D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014". However, such manually labeled data mostly depend on a laser scanner; they are expensive to acquire, difficult to obtain and limited in application range. Unsupervised algorithms only use the scene images themselves as the training set and are therefore widely applicable. In the document "T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017", Zhou Tinghui et al. use sequence images as input and calculate scene depth directly without manual labeling. However, such deep neural networks only analyse scene information from a large number of images to acquire scene depth of field, and their accuracy cannot meet application requirements.
Analysis of the above problems shows that existing methods either require a large number of manually labeled images as a training data set or fail to meet the requirements of accurate computation, with varying degrees of loss of detail. A high-precision, real-time depth of field calculation result is of great significance in image-based unmanned driving application scenarios, and therefore an unsupervised real-time depth of field calculation method for unmanned driving scenes needs to be developed.
Disclosure of Invention
The invention provides a monocular image depth of field real-time calculation method based on unsupervised deep learning, aiming at the problems of three-dimensional scene perception for outdoor unmanned automobiles and unmanned autonomous robots, the difficulty of acquiring large manually labeled data sets, and the limited application scenarios of existing methods.
In order to solve the problems, the invention is realized by the following technical scheme:
the monocular image depth of field real-time calculation method based on unsupervised deep learning comprises the following steps:
step 1, using binocular sequence images in an unmanned driving data set KITTI as input data, and classifying the binocular sequence images into two types through data preprocessing, namely a stereo image pair for a depth-of-field estimation convolutional neural network and a sequence image for a camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual network, constructing an end-to-end system, taking a stereo image pair as input, outputting a corresponding depth-of-field estimation image, and designing a loss function corresponding to the depth-of-field estimation convolutional neural network for back-propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on a convolutional neural network module, constructing an end-to-end system, outputting an attitude change matrix between sequence images by taking the sequence images and the depth-of-field estimation images as input, and designing a loss function corresponding to the camera attitude estimation convolutional neural network for back-propagation;
step 4, constructing a target function based on the loss function corresponding to the depth-of-field estimation convolutional neural network designed in the step 2 and the loss function corresponding to the camera attitude estimation convolutional neural network designed in the step 3;
step 5, completing construction of a deep neural network based on the depth-of-field estimation convolutional neural network of step 2 and the camera attitude estimation convolutional neural network of step 3, completing the design of the objective function based on step 4, and simultaneously training the depth-of-field estimation convolutional neural network and the camera attitude estimation convolutional neural network in the deep neural network by using all the data in the unmanned driving data set KITTI obtained in step 1, so as to fix the network parameter values and the network structure of the deep neural network and obtain a final calculation model;
and 6, inputting the monocular image actually obtained by the camera into the calculation model obtained in the step 5, wherein the output of the calculation model is the scene depth image corresponding to the image.
In the step 4, the constructed objective function is:

Loss_final = λ1·depth_loss + λ2·pose_loss

where the two component losses are respectively

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function, λ2 represents the weight of the camera attitude estimation convolutional neural network loss function, depth_loss represents the loss function of the depth-of-field estimation convolutional neural network, and pose_loss represents the loss function of the camera attitude estimation convolutional neural network; α represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera attitude estimation convolutional neural network; s represents the image scale; N represents the total number of pixel points; ‖·‖₁ represents the L1 norm; T represents the transpose of an image; SSIM() is a function that measures differences in image surface structure; I_l and I_r respectively represent the left view and the right view of the stereo image; Ĩ_l and Ĩ_r respectively represent the left view and the right view of the stereo image reconstructed using the binocular camera geometric principle; d_l and d_r respectively represent the left disparity map and the right disparity map generated by the depth-of-field estimation convolutional neural network; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left view of the stereo image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right view of the stereo image in the abscissa and ordinate directions; I^l and I^r respectively represent the left image and the right image of the sequence images; Ĩ^l_s and Ĩ^r_s respectively represent the target images reconstructed from the reference images of the left and right sequences at scale s; ∂D represents the gradient map of the depth image; and ∂I^l and ∂I^r respectively represent the gradient images of the left and right sequence images.
In the step 1, aiming at the depth-of-field estimation convolutional neural network, extracting a corresponding stereo image pair from a binocular sequence image to be used as input data of a training data set; and aiming at the camera attitude estimation convolutional neural network, three continuous images are respectively extracted from two sequence images of a binocular sequence image, wherein the second image is used as a target image, the first image and the third image are used as reference images, and the two sequence images are used as input data of a training data set.
Compared with the prior art, the invention has the following characteristics:
1. a supervision signal is constructed by using a geometric constraint relation between binocular sequence images, and a traditional manual marking data set is replaced, so that the design of an unsupervised algorithm is completed.
2. In the Depth-CNN network, the loss function considers the geometric constraint between images, and also designs a Depth of field estimation result consistency constraint term aiming at left and right images, thereby improving the algorithm accuracy.
3. The output of the Depth-CNN is used as a part of the input of the Pose-CNN to construct an integral target function, and meanwhile, the geometric relationship between binocular images and the geometric relationship between sequence images are used to construct a supervision signal, so that the accuracy and the robustness of the algorithm are further improved.
Drawings
Fig. 1 is a flow chart of a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Fig. 2 is a Depth-CNN network framework diagram.
Fig. 3 is a diagram of a pos-CNN network framework.
Fig. 4 is an overall structural diagram of the objective function construction.
Fig. 5 is a graph of the algorithm results.
Fig. 5(a) is an input binocular sequence image, fig. 5(b) is an algorithm result of Zhou Tinghui, and fig. 5(c) is an algorithm result of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.
A monocular image depth of field real-time calculation method based on unsupervised deep learning is disclosed, as shown in FIG. 1, and specifically comprises the following steps:
Step 1: preprocessing of the training data.
Using binocular sequence images in the unmanned data set KITTI as input data, the required images were classified into two types by data preprocessing: (1) stereoscopic image pairs for a Depth-CNN network; (2) sequence images for a Pose-CNN network.
All data in the KITTI database are preprocessed. Each original image is first converted into a 256 × 512 image with three channels R, G, B, with grey values between 0 and 1. The data are then reorganized according to the deep neural network they feed. For the Depth-CNN, corresponding stereo image pairs are extracted from the binocular sequence images as input data of the training data set. For the Pose-CNN, three consecutive images are extracted from each of the two sequences of the binocular sequence images (corresponding to the left and right cameras respectively), the second image being used as the target image and the first and third images as reference images; both sequences are used as input data of the training data set.
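For illustration, a minimal Python sketch of this preprocessing is given below; the directory layout, helper names and use of PIL are assumptions for illustration and are not taken from the patent.

```python
# Sketch of the data preprocessing in step 1 (file layout and helper names are
# assumptions, not taken from the patent).
import glob
import numpy as np
from PIL import Image

def load_image(path, size=(512, 256)):
    """Resize to 256x512 (H x W) RGB and scale grey values into [0, 1]."""
    img = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0

def make_stereo_pairs(left_dir, right_dir):
    """Input data for Depth-CNN: corresponding left/right stereo pairs."""
    lefts = sorted(glob.glob(left_dir + "/*.png"))
    rights = sorted(glob.glob(right_dir + "/*.png"))
    return [(load_image(l), load_image(r)) for l, r in zip(lefts, rights)]

def make_sequence_triplets(seq_dir):
    """Input data for Pose-CNN: three consecutive frames; the middle one is the
    target image, the first and third are reference images."""
    frames = sorted(glob.glob(seq_dir + "/*.png"))
    return [(load_image(frames[i]), load_image(frames[i + 1]), load_image(frames[i + 2]))
            for i in range(len(frames) - 2)]
```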
Step 2: a Depth-CNN network (Depth of field estimation convolutional neural network) was established as shown in fig. 2.
A Depth-CNN network is established based on the residual network and an end-to-end system is constructed, which takes the stereo image pairs mentioned in step 1 as input and outputs the corresponding depth-of-field estimation images; a corresponding loss function is designed for back-propagation.
An encoding-decoding model is established based on a residual network. In the encoding process, convolution kernels successively extract high-dimensional features of the input image to generate multi-scale feature images; in the decoding process, deconvolution kernels deconvolve the upper-layer feature images, and the scales of the generated target images correspond one-to-one with the feature images from the encoding process. The encoding-decoding process is an end-to-end learning process whose aim is to learn, through the network, an objective mapping d = f(I) that establishes a pixel-level correspondence and yields the depth value corresponding to each pixel of the input image.
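As an illustration of this encoder-decoder idea, a minimal PyTorch sketch is given below; the channel widths, layer counts and two-channel disparity head are assumptions made for illustration and not the patented network.

```python
# Minimal PyTorch sketch of the encoder-decoder idea behind Depth-CNN: a residual
# encoder extracts multi-scale features and a deconvolution decoder produces
# disparity maps at four scales.  Channel sizes and layer counts are assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ELU()          # the patent reports ELU working better than ReLU
    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class DepthCNN(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256]
        self.encoders = nn.ModuleList()
        in_ch = 3
        for ch in chs:                              # encoder: stride-2 conv + residual block
            self.encoders.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ELU(), ResBlock(ch)))
            in_ch = ch
        self.decoders = nn.ModuleList()
        self.disp_heads = nn.ModuleList()
        for ch_in, ch_out in zip(chs[::-1], chs[-2::-1] + [16]):   # decoder: upconvolutions
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(ch_in, ch_out, 4, stride=2, padding=1), nn.ELU()))
            self.disp_heads.append(nn.Conv2d(ch_out, 2, 3, padding=1))  # left+right disparity
    def forward(self, x):
        feats = []
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)
        disps = []
        for i, (dec, head) in enumerate(zip(self.decoders, self.disp_heads)):
            x = dec(x)
            if i + 2 <= len(feats):                 # skip connection to the same-scale encoder feature
                x = x + feats[-(i + 2)]
            disps.append(torch.sigmoid(head(x)))    # disparities at four scales, coarse to fine
        return disps
```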
Solving for the mapping d = f(I) is an iterative process. Here the disparity map disp is used in place of the depth image depth; the relationship between the two is depth(i, j) = b·f / disp(i, j), where b and f are respectively the baseline and focal length of the binocular camera and (i, j) are the coordinates of the pixel in the image. Let I_l and I_r be the input stereo image pair of the Depth-CNN network; the outputs are the corresponding disparity maps disp_left and disp_right.
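A minimal sketch of this disparity-to-depth conversion, assuming a disparity map in pixels and camera parameters b and f supplied by the caller:

```python
# Disparity-to-depth conversion used by the patent, depth(i, j) = b * f / disp(i, j)
# (b: stereo baseline in metres, f: focal length in pixels; both camera-specific).
import numpy as np

def disparity_to_depth(disp, baseline, focal_length, eps=1e-6):
    """Convert a disparity map (in pixels) into a metric depth map."""
    return baseline * focal_length / np.maximum(disp, eps)
```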
The loss function is divided into three parts: the image reconstruction terms corresponding to the left and right images respectively, and a disparity-map consistency term. The reconstruction principle is the same for the left and right images; taking the loss term for reconstructing the right image from the left image as an example, the left image I_l is input to the Depth-CNN, which outputs the disparity map disp_left corresponding to the left image. From the geometric principle of the binocular camera, the following correspondence can be obtained:

Ĩ_r(i, j) = I_l(i, j + disp_left(i, j)), (i, j) ∈ Ω_l   (1)

wherein Ω_l is the region to which the known image pixels belong, I_l and I_r are respectively the left and right input images, and Ĩ_l and Ĩ_r are respectively the reconstructed left and right images. Equation (1) establishes the relational expression by which the right input image is reconstructed from the left input image and the disparity map output by the Depth-CNN, yielding Ĩ_r. Comparing the difference between the reconstructed right image and the original right input image provides the supervision signal of the deep convolutional neural network, namely:

C_ap^r = (1/N)·Σ_(i,j) [ α·(1 − SSIM(I_r, Ĩ_r))/2 + (1 − α)·‖I_r(i, j) − Ĩ_r(i, j)‖₁ ]   (2)
where α is a weighting parameter and SSIM() is a function that measures the difference between the structural results of two input images (see Wang Z, Bovik A C, Sheikh H R, et al. Image Quality Assessment: From Error Visibility to Structural Similarity [J]. IEEE Transactions on Image Processing, 2004, 13(4)). Meanwhile, considering that scene depth values are discontinuous mainly in the edge areas of objects, a loss term is constructed from the image edge information in order to preserve image detail:

C_ds^r = (1/N)·Σ_(i,j) [ |∂x disp_right|·e^(−‖∂x I_r‖) + |∂y disp_right|·e^(−‖∂y I_r‖) ]   (3)

wherein N is the number of image pixels, and ∂x and ∂y denote the image gradients along the abscissa and ordinate directions.
The reconstruction loss function for the right image can be obtained from equations (2) and (3) as:

depth_right = Σ_s ( C_ap^(r,s) + C_ds^(r,s) )   (4)

The reconstruction loss function for the left image is derived in the same way as equation (4):

depth_left = Σ_s ( C_ap^(l,s) + C_ds^(l,s) )   (5)
where s is the image scale, in the present embodiment, s is 4, i.e., corresponding images of four scales are extracted as output results.
Since the left and right images are acquired by the binocular camera at the same moment, the left and right disparity maps should have equal magnitudes; this principle is used to design the disparity consistency loss term:

LR_loss_s = (1/N)·Σ_(i,j) ‖disp_left(i, j) − disp_right(i, j)‖₁   (6)

The loss function corresponding to the Depth-CNN is obtained by combining equations (4), (5) and (6):

depth_loss = depth_right + depth_left + Σ_s LR_loss_s   (7)
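For illustration, a minimal PyTorch sketch (not the patent's own implementation) of the loss components behind equations (2), (3), (6) and (7) is given below; the 3 × 3 average-pooling SSIM approximation and the single-scale form are simplifying assumptions.

```python
# A sketch of the Depth-CNN loss terms: appearance matching (SSIM + L1 weighted by
# alpha), edge-aware disparity smoothness, and left-right disparity consistency.
# Single-scale version; the patent sums these terms over four scales.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods; inputs are NCHW tensors in [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return num / den

def appearance_loss(img, recon, alpha=0.85):          # cf. eq. (2)
    ssim_term = ((1 - ssim(img, recon)) / 2).mean()
    l1_term = (img - recon).abs().mean()
    return alpha * ssim_term + (1 - alpha) * l1_term

def edge_aware_smoothness(disp, img):                 # cf. eq. (3)
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def depth_loss(I_l, I_r, recon_l, recon_r, d_l, d_r, alpha=0.85):   # cf. eq. (7)
    loss = appearance_loss(I_r, recon_r, alpha) + appearance_loss(I_l, recon_l, alpha)
    loss = loss + edge_aware_smoothness(d_l, I_l) + edge_aware_smoothness(d_r, I_r)
    return loss + (d_l - d_r).abs().mean()            # cf. eq. (6): left-right consistency
```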
and step 3: a pos-CNN network (camera Pose estimation convolutional neural network) was established as shown in fig. 3.
A Pose-CNN network is established based on a convolutional neural network module and an end-to-end system is constructed, which takes the sequence images mentioned in step 1 and the depth-of-field estimation images of step 2 as input and outputs the pose change matrices between the sequence images; a corresponding loss function is designed for back-propagation.
A deep convolutional neural network is established based on a convolutional neural network module. The network uses the sequence images from the data preprocessing result of step 1 as input and outputs four transformation matrices, corresponding to the transformations from the reference images to the target images in the left and right sequences. Each transformation matrix comprises six degrees of freedom corresponding to the spatial rotation and translation of the camera. An image is then reconstructed from the depth-of-field image of step 2 and the matrices output by the Pose-CNN network, and the reconstructed image is used as the supervision signal of the network.
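For illustration, a minimal PyTorch sketch of such a pose regression network is given below; the layer sizes, the channel-wise stacking of frames and the output scaling are assumptions, and in this sketch the depth maps are assumed to enter only through the reconstruction loss rather than as a direct network input.

```python
# Sketch of the Pose-CNN idea: a plain convolutional network that takes the target
# image concatenated with the reference images and regresses a 6-DoF pose
# (3 translations + 3 rotations) per reference image.  Layer sizes are assumptions.
import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    def __init__(self, num_refs=2):
        super().__init__()
        self.num_refs = num_refs
        chs = [16, 32, 64, 128, 256]
        layers, in_ch = [], 3 * (1 + num_refs)            # target + reference frames stacked on channels
        for ch in chs:
            layers += [nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ELU()]
            in_ch = ch
        self.features = nn.Sequential(*layers)
        self.pose_pred = nn.Conv2d(in_ch, 6 * num_refs, 1)  # six degrees of freedom per reference image
    def forward(self, target, refs):
        x = self.features(torch.cat([target] + refs, dim=1))
        pose = self.pose_pred(x).mean(dim=[2, 3])          # global average over spatial dimensions
        return 0.01 * pose.view(-1, self.num_refs, 6)      # small scaling keeps early training stable

# Usage: poses = PoseCNN()(I2, [I1, I3]) gives the parameters of T_{s1->t} and T_{s2->t}.
```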
The reconstruction process is similar for the left and right sequences. Taking the left sequence as an example, let {I_1, I_2, I_3} denote the left image sequence, where I_2 is the target image and I_1 and I_3 are the reference images. The goal is to use the depth maps of these three images output by the Depth-CNN, together with the transformation matrices from I_1 and I_3 to I_2 output by the Pose-CNN, to reconstruct the target image I_2, and then to construct a loss function by comparing the reconstruction with the original input target image I_2. The construction principle is as follows:

I_(s1→t)(p_t) = I_t( K·T_(s1→t)·D_(s1)(p_(s1))·K⁻¹·p_(s1) )   (8)

I_(s2→t)(p_t) = I_t( K·T_(s2→t)·D_(s2)(p_(s2))·K⁻¹·p_(s2) )   (9)

wherein p_(s1) and p_(s2) respectively refer to pixels of the reference images I_1 and I_3; D_(s1)(p_(s1)) and D_(s2)(p_(s2)) respectively refer to the depth values corresponding to those pixels obtained in step 2; K is the camera intrinsic parameter matrix; T_(s1→t) and T_(s2→t) respectively refer to the transformation matrices from the reference images I_1 and I_3 to the target image I_2 output by the Pose-CNN; and I_(s1→t)(p_t) and I_(s2→t)(p_t) respectively refer to the target image reconstructed from the reference image at scale s.
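For illustration, a minimal Python sketch of the reprojection K·T·D(p)·K⁻¹·p that underlies equations (8) and (9) is given below, written in the common inverse-warping form in which the pixel grid carrying the depth map is reprojected and the other image is bilinearly sampled at the projected locations; this is an interpretation of the construction, not the patent's verbatim operation, and the tensor shapes and border handling are simplifying assumptions.

```python
# Hedged sketch of pinhole reprojection followed by bilinear sampling, the
# operation behind the view reconstruction of equations (8)-(9).
import torch
import torch.nn.functional as F

def warp(src_img, depth, K, T):
    """src_img: 1x3xHxW image to sample; depth: 1x1xHxW depth aligned with the
    output pixel grid; K: 3x3 intrinsics; T: 4x4 transform from that grid's
    camera to the camera of src_img."""
    _, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)        # homogeneous pixel coordinates
    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)         # back-project: D * K^-1 * p
    cam_h = torch.cat([cam, torch.ones(1, H * W)], dim=0)          # homogeneous 3D points
    proj = K @ (T @ cam_h)[:3]                                     # K * T * D * K^-1 * p
    z = proj[2].clamp(min=1e-6)
    u, v = proj[0] / z, proj[1] / z
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)  # normalise to [-1, 1]
    grid = grid.reshape(1, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)        # bilinear sampling of src_img
```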
As with the image difference function constructed in step 2, a loss term is designed here for the sequence images as a supervision signal:

C_ap^(l,s) = (1/N)·Σ_(p_t) Σ_(n∈{s1,s2}) [ β·(1 − SSIM(I_2, I_(n→t)))/2 + (1 − β)·‖I_2(p_t) − I_(n→t)(p_t)‖₁ ]   (10)

where β is a weighting parameter; in the present embodiment β = 0.85.
Similar to step 2, a loss term is constructed using the image edge information:

C_ds^(l,s) = (1/N)·Σ_(p_t) [ |∂x D_2|·e^(−‖∂x I_2‖) + |∂y D_2|·e^(−‖∂y I_2‖) ]   (11)
From equations (10) and (11), the loss term of the left sequence images at scale s is:

pose_left^s = C_ap^(l,s) + C_ds^(l,s)   (12)

Similarly, the loss term of the right sequence images at scale s is:

pose_right^s = C_ap^(r,s) + C_ds^(r,s)   (13)

Thus, with equations (12) and (13), the total loss function for the sequence images is:

pose_loss = Σ_(s=1..4) ( pose_left^s + pose_right^s )   (14)
the constructed objective functions are respectively designed according to four scales, and finally summed.
Step 4: constructing the objective function.
In the training process of the network, the Depth-CNN and the Pose-CNN are trained simultaneously, and the loss terms of the two parts all participate, as part of the final loss function, in the back-propagation process of the network, as shown in FIG. 4. The final objective function is composed of the loss function terms of the Depth-CNN and the Pose-CNN, as shown in equation (15):
Loss_final = λ1·depth_loss + λ2·pose_loss   (15)
wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function and λ2 represents the weight of the camera pose estimation convolutional neural network loss function; in this embodiment, λ1 = 1.0 and λ2 = 0.8. The objective function thus considers the geometric constraints of the image reconstruction process for both the stereo image pairs and the sequence image pairs.
The loss function of the depth-of-field estimation convolutional neural network is:

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

The loss function of the camera pose estimation convolutional neural network is:

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein α represents the weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents the weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera pose estimation convolutional neural network; s represents the image scale; T represents the transpose of an image; ‖·‖₁ represents the L1 norm; SSIM() is a function that measures differences in image surface structure; i and j respectively represent the abscissa and the ordinate of a pixel point in an image; N represents the total number of pixel points; I_l represents the left image and I_r the right image of the input stereo image pair; d_l and d_r respectively represent the left and right disparity maps generated by the depth-of-field estimation convolutional neural network; Ĩ_l and Ĩ_r respectively represent the left and right images reconstructed using the binocular camera geometric principle, taking the opposite input image and the corresponding disparity map as input; I^l and I^r respectively represent the target images of the input sequences corresponding to the left and right cameras; Ĩ^l_s and Ĩ^r_s respectively represent, for the left and right input sequences, the target-image reconstruction results obtained by taking a reference image, the depth image corresponding to the reference image and the camera pose transformation matrix as input and combining them with the camera parameter matrix; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left input image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right input image in the abscissa and ordinate directions; ∂D represents the gradient map of the depth image; ∂I^l and ∂I^r respectively represent the gradient images of the left and right input sequence images; p_t represents the coordinates of a pixel point in the image; and n → t denotes the transformation from the two reference images to the target image.
Step 5: deep neural network training.
After the deep convolutional neural network construction and the objective function design are completed through steps 2-4, the network training process begins. About 180 GB of data from the KITTI data set are selected, and 22,600 stereo image pairs are obtained after preprocessing. Three groups of images are fed into the network at a time to train the network parameters, of which there are about 65 million. The network is set to iterate a total of 300,000 times, eventually yielding the calculation model used in the actual testing process.
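A hedged sketch of such a joint training loop is given below; the optimiser choice (Adam), learning rate and the loss_final callable are assumptions, while the batches of three samples and the 300,000 iterations follow the text.

```python
# Sketch of the joint training in step 5: Depth-CNN and Pose-CNN are optimised
# together under the combined objective Loss_final.
import torch

def train(depth_cnn, pose_cnn, data_loader, loss_final, iterations=300_000,
          lr=1e-4, lambda1=1.0, lambda2=0.8):
    params = list(depth_cnn.parameters()) + list(pose_cnn.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    data_iter = iter(data_loader)                      # yields batches of 3 stereo/sequence samples
    for step in range(iterations):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            batch = next(data_iter)
        loss = loss_final(depth_cnn, pose_cnn, batch, lambda1, lambda2)
        optimizer.zero_grad()
        loss.backward()                                # both sub-networks receive gradients
        optimizer.step()
    torch.save({"depth": depth_cnn.state_dict(), "pose": pose_cnn.state_dict()},
               "depth_pose_model.pth")                 # fixed parameters for the test stage
```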
Step 6: actual testing.
The design of the deep neural network and the calculation of network parameters are completed, and the monocular image is used as input data in the actual use process, so that the scene depth image corresponding to the image is directly obtained.
After step 5 is completed, the network parameter values and the network structure are fixed. A monocular image can then be input directly, and the network directly outputs the corresponding depth-of-field image, processing each image in about 35 ms, which meets the requirement for processing video data. A correspondence from the 2D image to three-dimensional spatial perception is thereby established.
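For illustration, a minimal sketch of this test stage is given below, assuming the Depth-CNN sketch introduced in step 2; the timing print is only indicative.

```python
# Sketch of the test-time use in step 6: with parameters fixed, a single monocular
# image goes in and a depth-of-field estimate comes out.
import time
import torch

def predict_depth(depth_cnn, image):                    # image: 1x3x256x512 tensor in [0, 1]
    depth_cnn.eval()
    with torch.no_grad():
        start = time.time()
        disparities = depth_cnn(image)                  # multi-scale output; keep the finest scale
        elapsed = (time.time() - start) * 1000.0
    print(f"inference time: {elapsed:.1f} ms")          # ~35 ms per image is reported in the patent
    return disparities[-1]                              # convert with depth = b * f / disparity (step 2)
```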
The effects of the present invention are further illustrated by the following simulation results.
1. Simulation conditions
(1) Any image in the KITTI data set is selected and converted into a 256 × 512 RGB image.
(2) Experimental parameters are set as: λ1 = 1.0, λ2 = 0.8, α = 0.85, β = 0.85.
2. Simulation content and results
Simulation content: using a 256 × 512 RGB image as input, the results are compared with two classical algorithms using uniform error evaluation criteria. The error evaluation criteria are defined as follows:
Absolute relative error (Abs Rel):

Abs Rel = (1/N)·Σ_i |y_i − y_i*| / y_i*

Squared relative error (Sq Rel):

Sq Rel = (1/N)·Σ_i (y_i − y_i*)² / y_i*

Root mean square error (RMSE):

RMSE = sqrt( (1/N)·Σ_i (y_i − y_i*)² )

Logarithmic root mean square error (RMSE log10):

RMSE log10 = sqrt( (1/N)·Σ_i (log y_i − log y_i*)² )

Threshold accuracy: the percentage of pixels for which max(y_i / y_i*, y_i* / y_i) = δ < threshold

where N is the number of pixels, y_i is the predicted depth-of-field value and y_i* is the true depth-of-field value.
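A minimal NumPy sketch of these metrics, in the form they are commonly computed for KITTI depth evaluation (the threshold value 1.25 is the usual choice and an assumption here):

```python
# Sketch of the evaluation metrics listed above (y: predicted depth, y_star: ground truth).
import numpy as np

def depth_metrics(y, y_star, threshold=1.25):
    y = np.asarray(y, dtype=np.float64)
    y_star = np.asarray(y_star, dtype=np.float64)
    abs_rel = np.mean(np.abs(y - y_star) / y_star)
    sq_rel = np.mean((y - y_star) ** 2 / y_star)
    rmse = np.sqrt(np.mean((y - y_star) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(y) - np.log(y_star)) ** 2))
    delta = np.maximum(y / y_star, y_star / y)
    acc = np.mean(delta < threshold)                    # threshold accuracy, e.g. delta < 1.25
    return {"Abs Rel": abs_rel, "Sq Rel": sq_rel, "RMSE": rmse,
            "RMSE log": rmse_log, "Threshold": acc}
```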
The experimental results are as follows:
The experimental results are shown in Table 1. Compared with the supervised algorithm proposed by David Eigen and the unsupervised algorithm proposed by Zhou Tinghui, the accuracy of the proposed method is improved, and the method meets the real-time and accuracy requirements of navigation for unmanned automobiles and outdoor unmanned autonomous robots.
TABLE 1
Method Abs Rel Sq Rel RMSE RMSE log10 Threshold value
David Eigen 0.214 1.605 6.563 0.292 0.957
Zhou Tinghui 0.208 1.768 6.856 0.283 0.957
The invention 0.151 1.325 5.653 0.231 0.975
The evaluation indices in Table 1, namely the absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE) and logarithmic root mean square error (RMSE log10), are algorithm error values used to measure accuracy; a smaller error value indicates higher accuracy. The threshold value measures how close the predicted depth-of-field value is to the true value; the higher the threshold accuracy, the better the stability of the algorithm. The experimental results show that the precision of the proposed method is clearly superior to that of the two reference methods. Since the algorithm of David Eigen is a supervised algorithm, only the experimental results of the present invention and the Zhou Tinghui algorithm are compared visually, as shown in fig. 5(a)-(c). The test results show that the proposed method is clearly superior to the method of Zhou Tinghui in recovering detail in the target image.
In the neural network training process, the choice of activation function has a large influence on the result, and almost all methods use the rectified linear unit (ReLU) as the activation function. After a number of experiments, the exponential linear unit (ELU) was selected as the activation function here. The experimental results are shown in Table 2: the results obtained with the exponential linear unit are clearly better than those obtained with the rectified linear unit. In the present embodiment, the exponential linear unit ELU(x) = x for x > 0 and ELU(x) = α·(e^x − 1) for x ≤ 0 is used as the activation function.
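For reference, the two activation functions compared in Table 2, written out as a small NumPy sketch (alpha = 1.0 is the usual ELU default and an assumption here):

```python
# The two activation functions compared in Table 2, written out explicitly.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=np.float64)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```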
TABLE 2
Activation function Abs Rel Sq Rel RMSE RMSE log10 Threshold value
ReLU 0.204 2.078 7.004 0.343 0.922
ELU 0.151 1.325 5.653 0.231 0.975
Aiming at the problems of three-dimensional space perception in current unmanned driving and outdoor unmanned robot autonomous navigation, and at the high cost incurred when laser radar is adopted, the invention provides a low-cost scene depth real-time calculation method suitable for unmanned driving and outdoor unmanned robot autonomous navigation. The method uses a monocular camera as the sensor and directly calculates scene depth through a deep convolutional neural network trained offline; it is an end-to-end method from an input image to a scene depth image. The method is real-time and highly accurate, solves the depth-of-field calculation problem in three-dimensional scene perception while relying only on a low-cost image sensor, and provides an economical and reliable scene depth real-time calculation method for unmanned driving and unmanned robot autonomous navigation technologies.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (2)

1. The monocular image depth of field real-time calculation method based on unsupervised deep learning is characterized by comprising the following steps of:
step 1, using binocular sequence images in an unmanned driving data set KITTI as input data, and classifying the binocular sequence images into two types through data preprocessing, namely a stereo image pair for a depth-of-field estimation convolutional neural network and a sequence image for a camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual network, constructing an end-to-end system, taking a stereo image pair as input, outputting a corresponding depth-of-field estimation image, and designing a loss function corresponding to the depth-of-field estimation convolutional neural network for back-propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on a convolutional neural network module, constructing an end-to-end system, outputting an attitude change matrix between sequence images by taking the sequence images and the depth-of-field estimation images as input, and designing a loss function corresponding to the camera attitude estimation convolutional neural network for back-propagation;
step 4, constructing an objective function based on the loss function corresponding to the depth-of-field estimation convolutional neural network designed in step 2 and the loss function corresponding to the camera attitude estimation convolutional neural network designed in step 3; wherein the constructed objective function is:

Loss_final = λ1·depth_loss + λ2·pose_loss

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function, λ2 represents the weight of the camera attitude estimation convolutional neural network loss function, depth_loss represents the loss function of the depth-of-field estimation convolutional neural network, and pose_loss represents the loss function of the camera attitude estimation convolutional neural network; α represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera attitude estimation convolutional neural network; s represents the image scale; N represents the total number of pixel points; ‖·‖₁ represents the L1 norm; T represents the transpose of an image; SSIM() is a function that measures differences in image surface structure; I_l and I_r respectively represent the left view and the right view of the stereo image; Ĩ_l and Ĩ_r respectively represent the left view and the right view of the stereo image reconstructed using the binocular camera geometric principle; d_l and d_r respectively represent the left disparity map and the right disparity map generated by the depth-of-field estimation convolutional neural network; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left view of the stereo image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right view of the stereo image in the abscissa and ordinate directions; I^l and I^r respectively represent the left image and the right image of the sequence images; Ĩ^l_s and Ĩ^r_s respectively represent the target images reconstructed from the reference images of the left and right sequences at scale s; ∂D represents the gradient map of the depth image; and ∂I^l and ∂I^r respectively represent the gradient images of the left and right sequence images;
step 5, completing construction of a deep neural network based on the depth-of-field estimation convolutional neural network of step 2 and the camera attitude estimation convolutional neural network of step 3, completing the design of the objective function based on step 4, and simultaneously training the depth-of-field estimation convolutional neural network and the camera attitude estimation convolutional neural network in the deep neural network by using all the data in the unmanned driving data set KITTI obtained in step 1, so as to fix the network parameter values and the network structure of the deep neural network and obtain a final calculation model;
and 6, inputting the monocular image actually obtained by the camera into the calculation model obtained in the step 5, wherein the output of the calculation model is the scene depth image corresponding to the image.
2. The method for calculating the depth of field of a monocular image based on unsupervised deep learning in real time according to claim 1, wherein in step 1, a corresponding stereo image pair is extracted from a binocular sequence image aiming at a depth of field estimation convolutional neural network and is used as one input data of a training data set; and aiming at the camera attitude estimation convolutional neural network, three continuous images are respectively extracted from two sequence images of a binocular sequence image, wherein the second image is used as a target image, the first image and the third image are used as reference images, and the two sequence images are used as input data of a training data set.
CN201910256117.9A 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning Active CN110009674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910256117.9A CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910256117.9A CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Publications (2)

Publication Number Publication Date
CN110009674A CN110009674A (en) 2019-07-12
CN110009674B true CN110009674B (en) 2021-04-13

Family

ID=67169169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910256117.9A Active CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Country Status (1)

Country Link
CN (1) CN110009674B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258565B (en) * 2019-07-22 2023-03-28 华为技术有限公司 Image processing method and device
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method
CN110751100A (en) * 2019-10-22 2020-02-04 北京理工大学 Auxiliary training method and system for stadium
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN113393510B (en) * 2020-03-12 2023-05-12 武汉Tcl集团工业研究院有限公司 Image processing method, intelligent terminal and storage medium
CN111583345B (en) * 2020-05-09 2022-09-27 吉林大学 Method, device and equipment for acquiring camera parameters and storage medium
CN111753961B (en) 2020-06-26 2023-07-28 北京百度网讯科技有限公司 Model training method and device, prediction method and device
CN112150531B (en) * 2020-09-29 2022-12-09 西北工业大学 Robust self-supervised learning single-frame image depth estimation method
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN113763474B (en) * 2021-09-16 2024-04-09 上海交通大学 Indoor monocular depth estimation method based on scene geometric constraint
CN114332187B (en) * 2022-03-09 2022-06-14 深圳安智杰科技有限公司 Monocular target ranging method and device
CN114967121B (en) * 2022-05-13 2023-02-03 哈尔滨工业大学 Design method of end-to-end single lens imaging system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210450A (en) * 2016-07-20 2016-12-07 罗轶 Video display artificial intelligence based on SLAM
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010B (en) * 2017-04-28 2019-11-19 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327B (en) * 2018-05-22 2021-03-30 深圳市商汤科技有限公司 Monocular depth estimation method and device, equipment and storage medium thereof
CN109063746A (en) * 2018-07-14 2018-12-21 深圳市唯特视科技有限公司 A kind of visual similarity learning method based on depth unsupervised learning
CN109472830A (en) * 2018-09-28 2019-03-15 中山大学 A kind of monocular visual positioning method based on unsupervised learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210450A (en) * 2016-07-20 2016-12-07 罗轶 Video display artificial intelligence based on SLAM
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN110009674A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN111325797B (en) Pose estimation method based on self-supervision learning
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN110910437B (en) Depth prediction method for complex indoor scene
CN108171249B (en) RGBD data-based local descriptor learning method
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN113962858A (en) Multi-view depth acquisition method
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN114359509A (en) Multi-view natural scene reconstruction method based on deep learning
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN114299405A (en) Unmanned aerial vehicle image real-time target detection method
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN116958420A (en) High-precision modeling method for three-dimensional face of digital human teacher
CN115984349A (en) Depth stereo matching algorithm based on central pixel gradient fusion and global cost aggregation
CN116772820A (en) Local refinement mapping system and method based on SLAM and semantic segmentation
CN115375838A (en) Binocular gray image three-dimensional reconstruction method based on unmanned aerial vehicle
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112686830B (en) Super-resolution method of single depth map based on image decomposition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant