CN110009674B - Monocular image depth of field real-time calculation method based on unsupervised deep learning - Google Patents

Monocular image depth of field real-time calculation method based on unsupervised deep learning

Info

Publication number
CN110009674B
Authority
CN
China
Prior art keywords
image
depth
neural network
convolutional neural
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910256117.9A
Other languages
Chinese (zh)
Other versions
CN110009674A (en)
Inventor
仲训昱
杨德龙
殷昕
彭侠夫
邹朝圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Winjoin Technology Co ltd
Xiamen University
Original Assignee
Xiamen Winjoin Technology Co ltd
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Winjoin Technology Co ltd, Xiamen University filed Critical Xiamen Winjoin Technology Co ltd
Priority to CN201910256117.9A priority Critical patent/CN110009674B/en
Publication of CN110009674A publication Critical patent/CN110009674A/en
Application granted granted Critical
Publication of CN110009674B publication Critical patent/CN110009674B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Abstract

The invention discloses a monocular image depth of field real-time calculation method based on unsupervised deep learning. A supervision signal is constructed from the geometric constraint relation between binocular sequence images, replacing the traditional manually labeled data set and completing the unsupervised algorithm design. In the Depth-CNN network, the loss function accounts for the geometric constraints between images and additionally includes a consistency constraint term on the depth-of-field estimation results of the left and right images, which improves the algorithm accuracy. The output of the Depth-CNN is used as part of the input of the Pose-CNN to construct an overall objective function, and the geometric relationship between binocular images and the geometric relationship between sequence images are both used to construct the supervision signal, further improving the accuracy and robustness of the algorithm.

Description

Monocular image depth of field real-time calculation method based on unsupervised deep learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Background
Owing to its low purchase price and its ability to acquire complete scene information in real time, the camera is widely used in scene perception research for service robots, autonomous navigation robots and unmanned vehicles. With the development of high-performance computing devices, artificial intelligence technology that analyses 2D image information with deep neural networks plays an increasingly irreplaceable role in fields such as unmanned driving and robot navigation. Real-time calculation of scene depth from a monocular image is a prerequisite for three-dimensional scene perception technology. In 2014, David Eigen first used a deep neural network to calculate the scene depth corresponding to a 2D image, establishing a mapping from 2D to 3D.
At present, scene depth of field calculation algorithms based on monocular images are mainly classified into supervised and unsupervised algorithms. Supervised algorithms require a large amount of manually labeled data. David Eigen proposes a coarse-to-fine estimation method that uses two deep convolutional neural networks in sequence to obtain scene depth in the document "D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014". However, such manually labeled data mostly depend on a laser scanner; they are expensive to acquire, difficult to obtain and limited in application range. Unsupervised algorithms only use the scene images themselves as the training set and are therefore widely applicable. In the document "T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017", Zhou Tinghui et al. use sequence images as input and calculate scene depth directly without manual labeling. However, such deep neural networks only analyse scene information from a large number of images to acquire scene depth of field, and their accuracy cannot meet application requirements.
Analysis of the above problems shows that existing methods either require a large number of manually labeled images as a training data set or fail to meet the requirements of accurate computation, with varying degrees of loss of detail. A high-precision, real-time depth of field calculation result is of great significance in image-based unmanned driving application scenarios, and therefore an unsupervised real-time depth of field calculation method for unmanned driving scenes needs to be developed.
Disclosure of Invention
The invention provides a monocular image depth of field real-time calculation method based on unsupervised deep learning, aiming at the problems of three-dimensional scene perception for outdoor unmanned automobiles and unmanned autonomous robots, the difficulty of acquiring large manually labeled data sets, and the limited application scenarios of existing methods.
In order to solve the problems, the invention is realized by the following technical scheme:
the monocular image depth of field real-time calculation method based on unsupervised deep learning comprises the following steps:
step 1, using binocular sequence images in an unmanned driving data set KITTI as input data, and classifying the binocular sequence images into two types through data preprocessing, namely a stereo image pair for a depth-of-field estimation convolutional neural network and a sequence image for a camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual network, constructing an end-to-end system, taking a stereo image pair as input, outputting a corresponding depth-of-field estimation image, and designing a loss function corresponding to the depth-of-field estimation convolutional neural network for back-propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on a convolutional neural network module, constructing an end-to-end system, outputting an attitude change matrix between sequence images by taking the sequence images and the depth-of-field estimation images as input, and designing a loss function corresponding to the camera attitude estimation convolutional neural network for back-propagation;
step 4, constructing a target function based on the loss function corresponding to the depth-of-field estimation convolutional neural network designed in the step 2 and the loss function corresponding to the camera attitude estimation convolutional neural network designed in the step 3;
step 5, completing construction of a deep neural network based on the depth-of-field estimation convolutional neural network of step 2 and the camera attitude estimation convolutional neural network of step 3, completing the design of the objective function based on step 4, and simultaneously training the depth-of-field estimation convolutional neural network and the camera attitude estimation convolutional neural network in the deep neural network by using all the data in the unmanned driving data set KITTI obtained in step 1, so as to fix the network parameter values and the network structure of the deep neural network and obtain a final calculation model;
and 6, inputting the monocular image actually obtained by the camera into the calculation model obtained in the step 5, wherein the output of the calculation model is the scene depth image corresponding to the image.
In the step 4, the constructed objective function is:

Loss_final = λ1·depth_loss + λ2·pose_loss

where the two component losses are respectively

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function, λ2 represents the weight of the camera attitude estimation convolutional neural network loss function, depth_loss represents the loss function of the depth-of-field estimation convolutional neural network, and pose_loss represents the loss function of the camera attitude estimation convolutional neural network; α represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera attitude estimation convolutional neural network; s represents the image scale; N represents the total number of pixel points; ‖·‖₁ represents the L1 norm; T represents the transpose of an image; SSIM() is a function that measures differences in image surface structure; I_l and I_r respectively represent the left view and the right view of the stereo image; Ĩ_l and Ĩ_r respectively represent the left view and the right view of the stereo image reconstructed using the binocular camera geometric principle; d_l and d_r respectively represent the left disparity map and the right disparity map generated by the depth-of-field estimation convolutional neural network; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left view of the stereo image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right view of the stereo image in the abscissa and ordinate directions; I^l and I^r respectively represent the left image and the right image of the sequence images; Ĩ^l_s and Ĩ^r_s respectively represent the target images reconstructed from the reference images of the left and right sequences at scale s; ∂D represents the gradient map of the depth image; and ∂I^l and ∂I^r respectively represent the gradient images of the left and right sequence images.
In the step 1, aiming at the depth-of-field estimation convolutional neural network, extracting a corresponding stereo image pair from a binocular sequence image to be used as input data of a training data set; and aiming at the camera attitude estimation convolutional neural network, three continuous images are respectively extracted from two sequence images of a binocular sequence image, wherein the second image is used as a target image, the first image and the third image are used as reference images, and the two sequence images are used as input data of a training data set.
Compared with the prior art, the invention has the following characteristics:
1. a supervision signal is constructed by using a geometric constraint relation between binocular sequence images, and a traditional manual marking data set is replaced, so that the design of an unsupervised algorithm is completed.
2. In the Depth-CNN network, the loss function considers the geometric constraint between images, and also designs a Depth of field estimation result consistency constraint term aiming at left and right images, thereby improving the algorithm accuracy.
3. The output of the Depth-CNN is used as a part of the input of the Pose-CNN to construct an integral target function, and meanwhile, the geometric relationship between binocular images and the geometric relationship between sequence images are used to construct a supervision signal, so that the accuracy and the robustness of the algorithm are further improved.
Drawings
Fig. 1 is a flow chart of a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Fig. 2 is a Depth-CNN network framework diagram.
Fig. 3 is a diagram of a pos-CNN network framework.
Fig. 4 is an overall structural diagram of the objective function construction.
Fig. 5 is a graph of the algorithm results.
Fig. 5(a) is an input binocular sequence image, fig. 5(b) is an algorithm result of Zhou Tinghui, and fig. 5(c) is an algorithm result of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.
A monocular image depth of field real-time calculation method based on unsupervised deep learning is disclosed, as shown in FIG. 1, and specifically comprises the following steps:
Step 1: preprocessing of the training data.
Using binocular sequence images in the unmanned data set KITTI as input data, the required images were classified into two types by data preprocessing: (1) stereoscopic image pairs for a Depth-CNN network; (2) sequence images for a Pose-CNN network.
All data in the KITTI database are preprocessed. Each original image is first converted into a 256 × 512 image with three channels R, G, B, with grey values between 0 and 1. The data are then reorganized according to the deep neural network they feed. For the Depth-CNN, corresponding stereo image pairs are extracted from the binocular sequence images as input data of the training data set. For the Pose-CNN, three consecutive images are extracted from each of the two sequences of the binocular sequence images (corresponding to the left and right cameras respectively), the second image being used as the target image and the first and third images as reference images; both sequences are used as input data of the training data set.
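For illustration, a minimal Python sketch of this preprocessing is given below; the directory layout, helper names and use of PIL are assumptions for illustration and are not taken from the patent.

```python
# Sketch of the data preprocessing in step 1 (file layout and helper names are
# assumptions, not taken from the patent).
import glob
import numpy as np
from PIL import Image

def load_image(path, size=(512, 256)):
    """Resize to 256x512 (H x W) RGB and scale grey values into [0, 1]."""
    img = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0

def make_stereo_pairs(left_dir, right_dir):
    """Input data for Depth-CNN: corresponding left/right stereo pairs."""
    lefts = sorted(glob.glob(left_dir + "/*.png"))
    rights = sorted(glob.glob(right_dir + "/*.png"))
    return [(load_image(l), load_image(r)) for l, r in zip(lefts, rights)]

def make_sequence_triplets(seq_dir):
    """Input data for Pose-CNN: three consecutive frames; the middle one is the
    target image, the first and third are reference images."""
    frames = sorted(glob.glob(seq_dir + "/*.png"))
    return [(load_image(frames[i]), load_image(frames[i + 1]), load_image(frames[i + 2]))
            for i in range(len(frames) - 2)]
```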
Step 2: a Depth-CNN network (Depth of field estimation convolutional neural network) was established as shown in fig. 2.
A Depth-CNN network is established based on the residual network and an end-to-end system is constructed, which takes the stereo image pairs mentioned in step 1 as input and outputs the corresponding depth-of-field estimation images; a corresponding loss function is designed for back-propagation.
An encoding-decoding model is established based on a residual network. In the encoding process, convolution kernels successively extract high-dimensional features of the input image to generate multi-scale feature images; in the decoding process, deconvolution kernels deconvolve the upper-layer feature images, and the scales of the generated target images correspond one-to-one with the feature images from the encoding process. The encoding-decoding process is an end-to-end learning process whose aim is to learn, through the network, an objective mapping d = f(I) that establishes a pixel-level correspondence and yields the depth value corresponding to each pixel of the input image.
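As an illustration of this encoder-decoder idea, a minimal PyTorch sketch is given below; the channel widths, layer counts and two-channel disparity head are assumptions made for illustration and not the patented network.

```python
# Minimal PyTorch sketch of the encoder-decoder idea behind Depth-CNN: a residual
# encoder extracts multi-scale features and a deconvolution decoder produces
# disparity maps at four scales.  Channel sizes and layer counts are assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.ELU()          # the patent reports ELU working better than ReLU
    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class DepthCNN(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256]
        self.encoders = nn.ModuleList()
        in_ch = 3
        for ch in chs:                              # encoder: stride-2 conv + residual block
            self.encoders.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ELU(), ResBlock(ch)))
            in_ch = ch
        self.decoders = nn.ModuleList()
        self.disp_heads = nn.ModuleList()
        for ch_in, ch_out in zip(chs[::-1], chs[-2::-1] + [16]):   # decoder: upconvolutions
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(ch_in, ch_out, 4, stride=2, padding=1), nn.ELU()))
            self.disp_heads.append(nn.Conv2d(ch_out, 2, 3, padding=1))  # left+right disparity
    def forward(self, x):
        feats = []
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)
        disps = []
        for i, (dec, head) in enumerate(zip(self.decoders, self.disp_heads)):
            x = dec(x)
            if i + 2 <= len(feats):                 # skip connection to the same-scale encoder feature
                x = x + feats[-(i + 2)]
            disps.append(torch.sigmoid(head(x)))    # disparities at four scales, coarse to fine
        return disps
```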
Solving for the mapping d = f(I) is an iterative process. Here the disparity map disp is used in place of the depth image depth; the relationship between the two is depth(i, j) = b·f / disp(i, j), where b and f are respectively the baseline and focal length of the binocular camera and (i, j) are the coordinates of the pixel in the image. Let I_l and I_r be the input stereo image pair of the Depth-CNN network; the outputs are the corresponding disparity maps disp_left and disp_right.
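A minimal sketch of this disparity-to-depth conversion, assuming a disparity map in pixels and camera parameters b and f supplied by the caller:

```python
# Disparity-to-depth conversion used by the patent, depth(i, j) = b * f / disp(i, j)
# (b: stereo baseline in metres, f: focal length in pixels; both camera-specific).
import numpy as np

def disparity_to_depth(disp, baseline, focal_length, eps=1e-6):
    """Convert a disparity map (in pixels) into a metric depth map."""
    return baseline * focal_length / np.maximum(disp, eps)
```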
The loss function is divided into three parts: the image reconstruction terms corresponding to the left and right images respectively, and a disparity-map consistency term. The reconstruction principle is the same for the left and right images; taking the loss term for reconstructing the right image from the left image as an example, the left image I_l is input to the Depth-CNN, which outputs the disparity map disp_left corresponding to the left image. From the geometric principle of the binocular camera, the following correspondence can be obtained:

Ĩ_r(i, j) = I_l(i, j + disp_left(i, j)), (i, j) ∈ Ω_l   (1)

wherein Ω_l is the region to which the known image pixels belong, I_l and I_r are respectively the left and right input images, and Ĩ_l and Ĩ_r are respectively the reconstructed left and right images. Equation (1) establishes the relational expression by which the right input image is reconstructed from the left input image and the disparity map output by the Depth-CNN, yielding Ĩ_r. Comparing the difference between the reconstructed right image and the original right input image provides the supervision signal of the deep convolutional neural network, namely:

C_ap^r = (1/N)·Σ_(i,j) [ α·(1 − SSIM(I_r, Ĩ_r))/2 + (1 − α)·‖I_r(i, j) − Ĩ_r(i, j)‖₁ ]   (2)
where α is a weighting parameter and SSIM() is a function that measures the difference between the structural results of two input images (see Wang Z, Bovik A C, Sheikh H R, et al. Image Quality Assessment: From Error Visibility to Structural Similarity [J]. IEEE Transactions on Image Processing, 2004, 13(4)). Meanwhile, considering that scene depth values are discontinuous mainly in the edge areas of objects, a loss term is constructed from the image edge information in order to preserve image detail:

C_ds^r = (1/N)·Σ_(i,j) [ |∂x disp_right|·e^(−‖∂x I_r‖) + |∂y disp_right|·e^(−‖∂y I_r‖) ]   (3)

wherein N is the number of image pixels, and ∂x and ∂y denote the image gradients along the abscissa and ordinate directions.
The reconstruction loss function for the right image can be obtained from equations (2) and (3) as:

depth_right = Σ_s ( C_ap^(r,s) + C_ds^(r,s) )   (4)

The reconstruction loss function for the left image is derived in the same way as equation (4):

depth_left = Σ_s ( C_ap^(l,s) + C_ds^(l,s) )   (5)
where s is the image scale, in the present embodiment, s is 4, i.e., corresponding images of four scales are extracted as output results.
Since the left and right images are acquired by the binocular camera at the same moment, the left and right disparity maps should have equal magnitudes; this principle is used to design the disparity consistency loss term:

LR_loss_s = (1/N)·Σ_(i,j) ‖disp_left(i, j) − disp_right(i, j)‖₁   (6)

The loss function corresponding to the Depth-CNN is obtained by combining equations (4), (5) and (6):

depth_loss = depth_right + depth_left + Σ_s LR_loss_s   (7)
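For illustration, a minimal PyTorch sketch (not the patent's own implementation) of the loss components behind equations (2), (3), (6) and (7) is given below; the 3 × 3 average-pooling SSIM approximation and the single-scale form are simplifying assumptions.

```python
# A sketch of the Depth-CNN loss terms: appearance matching (SSIM + L1 weighted by
# alpha), edge-aware disparity smoothness, and left-right disparity consistency.
# Single-scale version; the patent sums these terms over four scales.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods; inputs are NCHW tensors in [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return num / den

def appearance_loss(img, recon, alpha=0.85):          # cf. eq. (2)
    ssim_term = ((1 - ssim(img, recon)) / 2).mean()
    l1_term = (img - recon).abs().mean()
    return alpha * ssim_term + (1 - alpha) * l1_term

def edge_aware_smoothness(disp, img):                 # cf. eq. (3)
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def depth_loss(I_l, I_r, recon_l, recon_r, d_l, d_r, alpha=0.85):   # cf. eq. (7)
    loss = appearance_loss(I_r, recon_r, alpha) + appearance_loss(I_l, recon_l, alpha)
    loss = loss + edge_aware_smoothness(d_l, I_l) + edge_aware_smoothness(d_r, I_r)
    return loss + (d_l - d_r).abs().mean()            # cf. eq. (6): left-right consistency
```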
and step 3: a pos-CNN network (camera Pose estimation convolutional neural network) was established as shown in fig. 3.
A Pose-CNN network is established based on a convolutional neural network module and an end-to-end system is constructed, which takes the sequence images mentioned in step 1 and the depth-of-field estimation images of step 2 as input and outputs the pose change matrices between the sequence images; a corresponding loss function is designed for back-propagation.
A deep convolutional neural network is established based on a convolutional neural network module. The network uses the sequence images from the data preprocessing result of step 1 as input and outputs four transformation matrices, corresponding to the transformations from the reference images to the target images in the left and right sequences. Each transformation matrix comprises six degrees of freedom corresponding to the spatial rotation and translation of the camera. An image is then reconstructed from the depth-of-field image of step 2 and the matrices output by the Pose-CNN network, and the reconstructed image is used as the supervision signal of the network.
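For illustration, a minimal PyTorch sketch of such a pose regression network is given below; the layer sizes, the channel-wise stacking of frames and the output scaling are assumptions, and in this sketch the depth maps are assumed to enter only through the reconstruction loss rather than as a direct network input.

```python
# Sketch of the Pose-CNN idea: a plain convolutional network that takes the target
# image concatenated with the reference images and regresses a 6-DoF pose
# (3 translations + 3 rotations) per reference image.  Layer sizes are assumptions.
import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    def __init__(self, num_refs=2):
        super().__init__()
        self.num_refs = num_refs
        chs = [16, 32, 64, 128, 256]
        layers, in_ch = [], 3 * (1 + num_refs)            # target + reference frames stacked on channels
        for ch in chs:
            layers += [nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ELU()]
            in_ch = ch
        self.features = nn.Sequential(*layers)
        self.pose_pred = nn.Conv2d(in_ch, 6 * num_refs, 1)  # six degrees of freedom per reference image
    def forward(self, target, refs):
        x = self.features(torch.cat([target] + refs, dim=1))
        pose = self.pose_pred(x).mean(dim=[2, 3])          # global average over spatial dimensions
        return 0.01 * pose.view(-1, self.num_refs, 6)      # small scaling keeps early training stable

# Usage: poses = PoseCNN()(I2, [I1, I3]) gives the parameters of T_{s1->t} and T_{s2->t}.
```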
The reconstruction process is similar for the left and right sequences. Taking the left sequence as an example, let {I_1, I_2, I_3} denote the left image sequence, where I_2 is the target image and I_1 and I_3 are the reference images. The goal is to use the depth maps of these three images output by the Depth-CNN, together with the transformation matrices from I_1 and I_3 to I_2 output by the Pose-CNN, to reconstruct the target image I_2, and then to construct a loss function by comparing the reconstruction with the original input target image I_2. The construction principle is as follows:

I_(s1→t)(p_t) = I_t( K·T_(s1→t)·D_(s1)(p_(s1))·K⁻¹·p_(s1) )   (8)

I_(s2→t)(p_t) = I_t( K·T_(s2→t)·D_(s2)(p_(s2))·K⁻¹·p_(s2) )   (9)

wherein p_(s1) and p_(s2) respectively refer to pixels of the reference images I_1 and I_3; D_(s1)(p_(s1)) and D_(s2)(p_(s2)) respectively refer to the depth values corresponding to those pixels obtained in step 2; K is the camera intrinsic parameter matrix; T_(s1→t) and T_(s2→t) respectively refer to the transformation matrices from the reference images I_1 and I_3 to the target image I_2 output by the Pose-CNN; and I_(s1→t)(p_t) and I_(s2→t)(p_t) respectively refer to the target image reconstructed from the reference image at scale s.
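For illustration, a minimal Python sketch of the reprojection K·T·D(p)·K⁻¹·p that underlies equations (8) and (9) is given below, written in the common inverse-warping form in which the pixel grid carrying the depth map is reprojected and the other image is bilinearly sampled at the projected locations; this is an interpretation of the construction, not the patent's verbatim operation, and the tensor shapes and border handling are simplifying assumptions.

```python
# Hedged sketch of pinhole reprojection followed by bilinear sampling, the
# operation behind the view reconstruction of equations (8)-(9).
import torch
import torch.nn.functional as F

def warp(src_img, depth, K, T):
    """src_img: 1x3xHxW image to sample; depth: 1x1xHxW depth aligned with the
    output pixel grid; K: 3x3 intrinsics; T: 4x4 transform from that grid's
    camera to the camera of src_img."""
    _, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)        # homogeneous pixel coordinates
    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)         # back-project: D * K^-1 * p
    cam_h = torch.cat([cam, torch.ones(1, H * W)], dim=0)          # homogeneous 3D points
    proj = K @ (T @ cam_h)[:3]                                     # K * T * D * K^-1 * p
    z = proj[2].clamp(min=1e-6)
    u, v = proj[0] / z, proj[1] / z
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)  # normalise to [-1, 1]
    grid = grid.reshape(1, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)        # bilinear sampling of src_img
```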
As with the image difference function constructed in step 2, a loss term is designed here for the sequence images as a supervision signal:

C_ap^(l,s) = (1/N)·Σ_(p_t) Σ_(n∈{s1,s2}) [ β·(1 − SSIM(I_2, I_(n→t)))/2 + (1 − β)·‖I_2(p_t) − I_(n→t)(p_t)‖₁ ]   (10)

where β is a weighting parameter; in the present embodiment β = 0.85.
Similar to step 2, a loss term is constructed using the image edge information:

C_ds^(l,s) = (1/N)·Σ_(p_t) [ |∂x D_2|·e^(−‖∂x I_2‖) + |∂y D_2|·e^(−‖∂y I_2‖) ]   (11)
From equations (10) and (11), the loss term of the left sequence images at scale s is:

pose_left^s = C_ap^(l,s) + C_ds^(l,s)   (12)

Similarly, the loss term of the right sequence images at scale s is:

pose_right^s = C_ap^(r,s) + C_ds^(r,s)   (13)

Thus, with equations (12) and (13), the total loss function for the sequence images is:

pose_loss = Σ_(s=1..4) ( pose_left^s + pose_right^s )   (14)
the constructed objective functions are respectively designed according to four scales, and finally summed.
Step 4: constructing the objective function.
In the training process of the network, the Depth-CNN and the Pose-CNN are trained simultaneously, and the loss terms of the two parts all participate, as part of the final loss function, in the back-propagation process of the network, as shown in FIG. 4. The final objective function is composed of the loss function terms of the Depth-CNN and the Pose-CNN, as shown in equation (15):
Loss_final = λ1·depth_loss + λ2·pose_loss   (15)
wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function and λ2 represents the weight of the camera pose estimation convolutional neural network loss function; in this embodiment, λ1 = 1.0 and λ2 = 0.8. The objective function thus considers the geometric constraints of the image reconstruction process for both the stereo image pairs and the sequence image pairs.
The loss function of the depth-of-field estimation convolutional neural network is:

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

The loss function of the camera pose estimation convolutional neural network is:

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein α represents the weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents the weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera pose estimation convolutional neural network; s represents the image scale; T represents the transpose of an image; ‖·‖₁ represents the L1 norm; SSIM() is a function that measures differences in image surface structure; i and j respectively represent the abscissa and the ordinate of a pixel point in an image; N represents the total number of pixel points; I_l represents the left image and I_r the right image of the input stereo image pair; d_l and d_r respectively represent the left and right disparity maps generated by the depth-of-field estimation convolutional neural network; Ĩ_l and Ĩ_r respectively represent the left and right images reconstructed using the binocular camera geometric principle, taking the opposite input image and the corresponding disparity map as input; I^l and I^r respectively represent the target images of the input sequences corresponding to the left and right cameras; Ĩ^l_s and Ĩ^r_s respectively represent, for the left and right input sequences, the target-image reconstruction results obtained by taking a reference image, the depth image corresponding to the reference image and the camera pose transformation matrix as input and combining them with the camera parameter matrix; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left input image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right input image in the abscissa and ordinate directions; ∂D represents the gradient map of the depth image; ∂I^l and ∂I^r respectively represent the gradient images of the left and right input sequence images; p_t represents the coordinates of a pixel point in the image; and n → t denotes the transformation from the two reference images to the target image.
Step 5: deep neural network training.
After the deep convolutional neural network construction and the objective function design are completed through steps 2-4, the network training process begins. About 180 GB of data from the KITTI data set are selected, and 22,600 stereo image pairs are obtained after preprocessing. Three groups of images are fed into the network at a time to train the network parameters, of which there are about 65 million. The network is set to iterate a total of 300,000 times, eventually yielding the calculation model used in the actual testing process.
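A hedged sketch of such a joint training loop is given below; the optimiser choice (Adam), learning rate and the loss_final callable are assumptions, while the batches of three samples and the 300,000 iterations follow the text.

```python
# Sketch of the joint training in step 5: Depth-CNN and Pose-CNN are optimised
# together under the combined objective Loss_final.
import torch

def train(depth_cnn, pose_cnn, data_loader, loss_final, iterations=300_000,
          lr=1e-4, lambda1=1.0, lambda2=0.8):
    params = list(depth_cnn.parameters()) + list(pose_cnn.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    data_iter = iter(data_loader)                      # yields batches of 3 stereo/sequence samples
    for step in range(iterations):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            batch = next(data_iter)
        loss = loss_final(depth_cnn, pose_cnn, batch, lambda1, lambda2)
        optimizer.zero_grad()
        loss.backward()                                # both sub-networks receive gradients
        optimizer.step()
    torch.save({"depth": depth_cnn.state_dict(), "pose": pose_cnn.state_dict()},
               "depth_pose_model.pth")                 # fixed parameters for the test stage
```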
Step 6: actual testing.
The design of the deep neural network and the calculation of network parameters are completed, and the monocular image is used as input data in the actual use process, so that the scene depth image corresponding to the image is directly obtained.
After step 5 is completed, the network parameter values and the network structure are fixed. A monocular image can then be input directly, and the network directly outputs the corresponding depth-of-field image, processing each image in about 35 ms, which meets the requirement for processing video data. A correspondence from the 2D image to three-dimensional spatial perception is thereby established.
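For illustration, a minimal sketch of this test stage is given below, assuming the Depth-CNN sketch introduced in step 2; the timing print is only indicative.

```python
# Sketch of the test-time use in step 6: with parameters fixed, a single monocular
# image goes in and a depth-of-field estimate comes out.
import time
import torch

def predict_depth(depth_cnn, image):                    # image: 1x3x256x512 tensor in [0, 1]
    depth_cnn.eval()
    with torch.no_grad():
        start = time.time()
        disparities = depth_cnn(image)                  # multi-scale output; keep the finest scale
        elapsed = (time.time() - start) * 1000.0
    print(f"inference time: {elapsed:.1f} ms")          # ~35 ms per image is reported in the patent
    return disparities[-1]                              # convert with depth = b * f / disparity (step 2)
```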
The effects of the present invention are further illustrated by the following simulation results.
1. Simulation conditions
(1) Any image in the KITTI data set is selected and converted into a 256 × 512 RGB image.
(2) Experimental parameters are set as: λ1 = 1.0, λ2 = 0.8, α = 0.85, β = 0.85.
2. Simulation content and results
Simulation content: using a 256 × 512 RGB image as input, the results are compared with two classical algorithms using uniform error evaluation criteria. The error evaluation criteria are defined as follows:
Absolute relative error (Abs Rel):

Abs Rel = (1/N)·Σ_i |y_i − y_i*| / y_i*

Squared relative error (Sq Rel):

Sq Rel = (1/N)·Σ_i (y_i − y_i*)² / y_i*

Root mean square error (RMSE):

RMSE = sqrt( (1/N)·Σ_i (y_i − y_i*)² )

Logarithmic root mean square error (RMSE log10):

RMSE log10 = sqrt( (1/N)·Σ_i (log y_i − log y_i*)² )

Threshold accuracy: the percentage of pixels for which max(y_i / y_i*, y_i* / y_i) = δ < threshold

where N is the number of pixels, y_i is the predicted depth-of-field value and y_i* is the true depth-of-field value.
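A minimal NumPy sketch of these metrics, in the form they are commonly computed for KITTI depth evaluation (the threshold value 1.25 is the usual choice and an assumption here):

```python
# Sketch of the evaluation metrics listed above (y: predicted depth, y_star: ground truth).
import numpy as np

def depth_metrics(y, y_star, threshold=1.25):
    y = np.asarray(y, dtype=np.float64)
    y_star = np.asarray(y_star, dtype=np.float64)
    abs_rel = np.mean(np.abs(y - y_star) / y_star)
    sq_rel = np.mean((y - y_star) ** 2 / y_star)
    rmse = np.sqrt(np.mean((y - y_star) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(y) - np.log(y_star)) ** 2))
    delta = np.maximum(y / y_star, y_star / y)
    acc = np.mean(delta < threshold)                    # threshold accuracy, e.g. delta < 1.25
    return {"Abs Rel": abs_rel, "Sq Rel": sq_rel, "RMSE": rmse,
            "RMSE log": rmse_log, "Threshold": acc}
```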
The experimental results are as follows:
The experimental results are shown in Table 1. Compared with the supervised algorithm proposed by David Eigen and the unsupervised algorithm proposed by Zhou Tinghui, the accuracy of the proposed method is improved, and the method meets the real-time and accuracy requirements of navigation for unmanned automobiles and outdoor unmanned autonomous robots.
TABLE 1
Method Abs Rel Sq Rel RMSE RMSE log10 Threshold value
David Eigen 0.214 1.605 6.563 0.292 0.957
Zhou Tinghui 0.208 1.768 6.856 0.283 0.957
The invention 0.151 1.325 5.653 0.231 0.975
The evaluation indices in Table 1, namely the absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMSE) and logarithmic root mean square error (RMSE log10), are algorithm error values used to measure accuracy; a smaller error value indicates higher accuracy. The threshold value measures how close the predicted depth-of-field value is to the true value; the higher the threshold accuracy, the better the stability of the algorithm. The experimental results show that the precision of the proposed method is clearly superior to that of the two reference methods. Since the algorithm of David Eigen is a supervised algorithm, only the experimental results of the present invention and the Zhou Tinghui algorithm are compared visually, as shown in fig. 5(a)-(c). The test results show that the proposed method is clearly superior to the method of Zhou Tinghui in recovering detail in the target image.
In the neural network training process, the choice of activation function has a large influence on the result, and almost all methods use the rectified linear unit (ReLU) as the activation function. After a number of experiments, the exponential linear unit (ELU) was selected as the activation function here. The experimental results are shown in Table 2: the results obtained with the exponential linear unit are clearly better than those obtained with the rectified linear unit. In the present embodiment, the exponential linear unit ELU(x) = x for x > 0 and ELU(x) = α·(e^x − 1) for x ≤ 0 is used as the activation function.
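For reference, the two activation functions compared in Table 2, written out as a small NumPy sketch (alpha = 1.0 is the usual ELU default and an assumption here):

```python
# The two activation functions compared in Table 2, written out explicitly.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=np.float64)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```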
TABLE 2
Activation function Abs Rel Sq Rel RMSE RMSE log10 Threshold value
ReLU 0.204 2.078 7.004 0.343 0.922
ELU 0.151 1.325 5.653 0.231 0.975
Aiming at the problems of three-dimensional space perception in current unmanned driving and outdoor unmanned robot autonomous navigation, and at the high cost incurred when laser radar is adopted, the invention provides a low-cost scene depth real-time calculation method suitable for unmanned driving and outdoor unmanned robot autonomous navigation. The method uses a monocular camera as the sensor and directly calculates scene depth through a deep convolutional neural network trained offline; it is an end-to-end method from an input image to a scene depth image. The method is real-time and highly accurate, solves the depth-of-field calculation problem in three-dimensional scene perception while relying only on a low-cost image sensor, and provides an economical and reliable scene depth real-time calculation method for unmanned driving and unmanned robot autonomous navigation technologies.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (2)

1. The monocular image depth of field real-time calculation method based on unsupervised deep learning is characterized by comprising the following steps of:
step 1, using binocular sequence images in an unmanned driving data set KITTI as input data, and classifying the binocular sequence images into two types through data preprocessing, namely a stereo image pair for a depth-of-field estimation convolutional neural network and a sequence image for a camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual network, constructing an end-to-end system, taking a stereo image pair as input, outputting a corresponding depth-of-field estimation image, and designing a loss function corresponding to the depth-of-field estimation convolutional neural network for back-propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on a convolutional neural network module, constructing an end-to-end system, outputting an attitude change matrix between sequence images by taking the sequence images and the depth-of-field estimation images as input, and designing a loss function corresponding to the camera attitude estimation convolutional neural network for back-propagation;
step 4, constructing an objective function based on the loss function corresponding to the depth-of-field estimation convolutional neural network designed in step 2 and the loss function corresponding to the camera attitude estimation convolutional neural network designed in step 3; wherein the constructed objective function is:

Loss_final = λ1·depth_loss + λ2·pose_loss

depth_loss = Σ_s { (α/N)·Σ_(i,j) [ (1 − SSIM(I_l, Ĩ_l))/2 + (1 − SSIM(I_r, Ĩ_r))/2 ] + ((1 − α)/N)·Σ_(i,j) [ ‖I_l − Ĩ_l‖₁ + ‖I_r − Ĩ_r‖₁ ] + (1/N)·Σ_(i,j) [ |∂x d_l|·e^(−‖∂x I_l‖) + |∂y d_l|·e^(−‖∂y I_l‖) + |∂x d_r|·e^(−‖∂x I_r‖) + |∂y d_r|·e^(−‖∂y I_r‖) ] + (1/N)·Σ_(i,j) ‖d_l − d_r‖₁ }

pose_loss = Σ_s { (β/N)·Σ_(i,j) [ (1 − SSIM(I^l, Ĩ^l_s))/2 + (1 − SSIM(I^r, Ĩ^r_s))/2 ] + ((1 − β)/N)·Σ_(i,j) [ ‖I^l − Ĩ^l_s‖₁ + ‖I^r − Ĩ^r_s‖₁ ] + (1/N)·Σ_(i,j) [ |∂x D|·e^(−‖∂x I^l‖) + |∂y D|·e^(−‖∂y I^l‖) + |∂x D|·e^(−‖∂x I^r‖) + |∂y D|·e^(−‖∂y I^r‖) ] }

wherein λ1 represents the weight of the depth-of-field estimation convolutional neural network loss function, λ2 represents the weight of the camera attitude estimation convolutional neural network loss function, depth_loss represents the loss function of the depth-of-field estimation convolutional neural network, and pose_loss represents the loss function of the camera attitude estimation convolutional neural network; α represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents a weight measuring the relative importance of the image surface reconstruction result and the regularization term in the camera attitude estimation convolutional neural network; s represents the image scale; N represents the total number of pixel points; ‖·‖₁ represents the L1 norm; T represents the transpose of an image; SSIM() is a function that measures differences in image surface structure; I_l and I_r respectively represent the left view and the right view of the stereo image; Ĩ_l and Ĩ_r respectively represent the left view and the right view of the stereo image reconstructed using the binocular camera geometric principle; d_l and d_r respectively represent the left disparity map and the right disparity map generated by the depth-of-field estimation convolutional neural network; ∂x d_l and ∂y d_l respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂x d_r and ∂y d_r respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂x I_l and ∂y I_l respectively represent the gradient images of the left view of the stereo image in the abscissa and ordinate directions; ∂x I_r and ∂y I_r respectively represent the gradient images of the right view of the stereo image in the abscissa and ordinate directions; I^l and I^r respectively represent the left image and the right image of the sequence images; Ĩ^l_s and Ĩ^r_s respectively represent the target images reconstructed from the reference images of the left and right sequences at scale s; ∂D represents the gradient map of the depth image; and ∂I^l and ∂I^r respectively represent the gradient images of the left and right sequence images;
step 5, completing construction of a deep neural network based on the depth-of-field estimation convolutional neural network of step 2 and the camera attitude estimation convolutional neural network of step 3, completing the design of the objective function based on step 4, and simultaneously training the depth-of-field estimation convolutional neural network and the camera attitude estimation convolutional neural network in the deep neural network by using all the data in the unmanned driving data set KITTI obtained in step 1, so as to fix the network parameter values and the network structure of the deep neural network and obtain a final calculation model;
and 6, inputting the monocular image actually obtained by the camera into the calculation model obtained in the step 5, wherein the output of the calculation model is the scene depth image corresponding to the image.
2. The method for calculating the depth of field of a monocular image based on unsupervised deep learning in real time according to claim 1, wherein in step 1, a corresponding stereo image pair is extracted from a binocular sequence image aiming at a depth of field estimation convolutional neural network and is used as one input data of a training data set; and aiming at the camera attitude estimation convolutional neural network, three continuous images are respectively extracted from two sequence images of a binocular sequence image, wherein the second image is used as a target image, the first image and the third image are used as reference images, and the two sequence images are used as input data of a training data set.
CN201910256117.9A 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning Active CN110009674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910256117.9A CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910256117.9A CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Publications (2)

Publication Number Publication Date
CN110009674A CN110009674A (en) 2019-07-12
CN110009674B true CN110009674B (en) 2021-04-13

Family

ID=67169169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910256117.9A Active CN110009674B (en) 2019-04-01 2019-04-01 Monocular image depth of field real-time calculation method based on unsupervised depth learning

Country Status (1)

Country Link
CN (1) CN110009674B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258565B (en) * 2019-07-22 2023-03-28 华为技术有限公司 Image processing method and device
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method
CN110751100A (en) * 2019-10-22 2020-02-04 北京理工大学 Auxiliary training method and system for stadium
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN113393510B (en) * 2020-03-12 2023-05-12 武汉Tcl集团工业研究院有限公司 Image processing method, intelligent terminal and storage medium
CN111583345B (en) * 2020-05-09 2022-09-27 吉林大学 Method, device and equipment for acquiring camera parameters and storage medium
CN111753961B (en) 2020-06-26 2023-07-28 北京百度网讯科技有限公司 Model training method and device, prediction method and device
CN112150531B (en) * 2020-09-29 2022-12-09 西北工业大学 Robust self-supervised learning single-frame image depth estimation method
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN113763474B (en) * 2021-09-16 2024-04-09 上海交通大学 Indoor monocular depth estimation method based on scene geometric constraint
CN114332187B (en) * 2022-03-09 2022-06-14 深圳安智杰科技有限公司 Monocular target ranging method and device
CN114967121B (en) * 2022-05-13 2023-02-03 哈尔滨工业大学 Design method of end-to-end single lens imaging system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210450A (en) * 2016-07-20 2016-12-07 罗轶 Video display artificial intelligence based on SLAM
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204010B (en) * 2017-04-28 2019-11-19 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327B (en) * 2018-05-22 2021-03-30 深圳市商汤科技有限公司 Monocular depth estimation method and device, equipment and storage medium thereof
CN109063746A (en) * 2018-07-14 2018-12-21 深圳市唯特视科技有限公司 A kind of visual similarity learning method based on depth unsupervised learning
CN109472830A (en) * 2018-09-28 2019-03-15 中山大学 A kind of monocular visual positioning method based on unsupervised learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210450A (en) * 2016-07-20 2016-12-07 罗轶 Video display artificial intelligence based on SLAM
CN109377530A (en) * 2018-11-30 2019-02-22 天津大学 A kind of binocular depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN110009674A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN111325797B (en) Pose estimation method based on self-supervision learning
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN110910437B (en) Depth prediction method for complex indoor scene
CN108171249B (en) RGBD data-based local descriptor learning method
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN113962858A (en) Multi-view depth acquisition method
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN114359509A (en) Multi-view natural scene reconstruction method based on deep learning
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN114299405A (en) Unmanned aerial vehicle image real-time target detection method
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN116958420A (en) High-precision modeling method for three-dimensional face of digital human teacher
CN115984349A (en) Depth stereo matching algorithm based on central pixel gradient fusion and global cost aggregation
CN116772820A (en) Local refinement mapping system and method based on SLAM and semantic segmentation
CN115375838A (en) Binocular gray image three-dimensional reconstruction method based on unmanned aerial vehicle
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112686830B (en) Super-resolution method of single depth map based on image decomposition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant