CN108389226A - An unsupervised depth prediction method based on convolutional neural networks and binocular disparity - Google Patents
An unsupervised depth prediction method based on convolutional neural networks and binocular disparity
- Publication number
- CN108389226A CN108389226A CN201810144465.2A CN201810144465A CN108389226A CN 108389226 A CN108389226 A CN 108389226A CN 201810144465 A CN201810144465 A CN 201810144465A CN 108389226 A CN108389226 A CN 108389226A
- Authority
- CN
- China
- Prior art keywords
- image
- convolutional layer
- camera
- neural networks
- pixels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/55 — Depth or shape recovery from multiple images (G06T7/00 Image analysis; G06T7/50 Depth or shape recovery)
- G06T7/97 — Determining parameters from multiple pictures (G06T7/00 Image analysis)
- G06T2207/10024 — Color image (G06T2207/10 Image acquisition modality)
- G06T2207/20081 — Training; Learning (G06T2207/20 Special algorithmic details)
- G06T2207/20084 — Artificial neural networks [ANN] (G06T2207/20 Special algorithmic details)
- G06T2207/20228 — Disparity calculation for image-based rendering (G06T2207/20 Special algorithmic details)
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The present invention discloses an unsupervised depth prediction method based on convolutional neural networks and binocular disparity, comprising the following steps. First, a convolutional neural network is used to fit a nonlinear function that converts two RGB images into the corresponding depth image. Then, using the depth information, the pixel coordinates of the left image are transformed to obtain the corresponding pixel locations in the right image. After the right-image pixel locations are obtained, the right-image pixel coordinates and corresponding pixel values are computed by bilinear interpolation. Finally, the prediction loss is computed from the sampled pixel values and the corresponding pixel values of the left image. With this training scheme, which requires no ground-truth depth information, the corresponding depth image can be predicted.
Description
Technical field
The invention belongs to the field of deep learning and in particular relates to an unsupervised depth prediction method based on convolutional neural networks and binocular disparity, applicable to autonomous driving and distance estimation.
Background technology
Humans can very easily infer their own motion and the three-dimensional structure of a scene in a short period of time. For example, when walking down the street we readily spot obstacles and react quickly to avoid them. For a computer, however, completing these tasks is extremely complex: computers fall far short of humans at reconstructing real-world scenes, especially when handling occlusions and regions lacking texture.
Why can humans do better than computers at these tasks? A reasonable hypothesis is that we have developed an understanding of scene structure through our knowledge of the world, built up by moving about and observing extensively. From millions of such observations we have learned regularities of the world: roads are flat, buildings are vertical, cars sit on the road. When we observe a new scene, we judge it using these rules. In this work we simulate this ability by training a model: the network is trained on RGB image pairs captured by a left-right camera pair to account for camera motion and scene structure.
In recent years deep learning has been widely adopted, especially after convolutional neural networks (CNNs) achieved great success in the image domain. Researchers recognized that CNNs perform well on image tasks because they can capture complex, implicit relationships. Moreover, thanks to very large hand-labeled datasets such as ImageNet, supervised deep learning methods have successfully solved a great many problems.
However, an obvious drawback of convolutional neural networks today is that they require large amounts of hand-labeled data for training. On the one hand, hand-labeling a dataset as large as ImageNet consumes enormous manpower and material resources; on the other hand, the labeling process is error-prone. Acquiring depth information for outdoor scenes in particular generally requires expensive hardware and painstakingly careful acquisition. Although datasets such as KITTI are collected with advanced 3D sensors and multiple calibrated cameras, the reliable depth they acquire is still limited in range, and the acquisition cost is high.
Current methods that train CNNs for depth prediction in a supervised manner all use datasets such as NYUv2 and KITTI, training on RGB images paired with their corresponding depth maps. But these supervised methods do not generalize beyond their direct application domain. The reason is that if a trained single-view depth estimation model is applied to another scene, RGB images paired with depth images from that other scene are needed, and the network must be retrained.
Summary of the invention
To solve the above problems, the present invention proposes an unsupervised depth prediction method based on convolutional neural networks and binocular disparity that trains the convolutional neural network without requiring any ground-truth depth information.
To achieve the above object, the present invention adopts the following technical scheme that:
An unsupervised depth prediction method based on convolutional neural networks and binocular disparity, comprising the following steps:
Step 1: fit a nonlinear function with a convolutional neural network that converts the two RGB images acquired by the left and right cameras into the corresponding depth image;
Step 2: using the depth information, transform the pixel coordinates of the left image to obtain the corresponding pixel locations in the right image;
Step 3: after obtaining the right-image pixel locations, compute the right-image pixel coordinates and corresponding pixel values by bilinear interpolation;
Step 4: compute the prediction loss from the sampled pixel values and the corresponding pixel values of the left image.
With this training scheme, which requires no ground-truth depth information, the present invention can predict the corresponding depth image.
Description of the drawings
Fig. 1: method flow chart;
Fig. 2: structure of the convolutional neural network used by the present invention;
Fig. 3a, Fig. 3b, Fig. 3c: test results, where Fig. 3a is the left image, Fig. 3b is the right image, and Fig. 3c is the predicted depth map.
Detailed description of embodiments
The invention is described in further detail below with reference to the drawings and embodiments.
The present invention provides an unsupervised depth prediction method based on convolutional neural networks and binocular disparity that trains the convolutional neural network without requiring any ground-truth depth information. To train the network, pairs of RGB images captured by the left and right color cameras of the KITTI dataset are used as the training set; such data are much easier to obtain than calibrated RGB images paired with their corresponding ground-truth depth images.
The present invention uses a CNN to model a complex nonlinear transformation which, guided by the disparity between the left and right images, converts the two RGB images into the corresponding depth map.
The symbols used in the method are described as follows:
I_L, I_R | left and right images
K_L, K_R | intrinsic matrices of the left and right cameras
T | extrinsic matrix between the left and right cameras
p_L, p_R | pixel coordinates in the left and right images
I_D | depth map predicted by the CNN
I_w | image newly generated by bilinear interpolation
q0, q1, q2, q3 | the four quaternion components representing the rotation matrix
X_L, Y_L, Z_L | 3-D coordinates in the left camera coordinate system
X_R, Y_R, Z_R | 3-D coordinates in the right camera coordinate system
The flow chart of the present invention is shown in Fig. 1 and comprises four steps:
Step 1: convert the two RGB images into the corresponding depth image.
The present invention uses the convolutional neural network shown in Fig. 2. The first five layers are very similar to the first five layers of the AlexNet network; we replace the fully connected layers of AlexNet with a fully convolutional layer, and finally use five deconvolution layers for upsampling.
The network layers are, in order: left convolutional layer 1, left convolutional layer 2, right convolutional layer 1, right convolutional layer 2, channel concatenation layer, convolutional layer 3, convolutional layer 4, convolutional layer 5, fully convolutional layer, deconvolution layer 1, deconvolution layer 2, deconvolution layer 3, deconvolution layer 4, deconvolution layer 5.
We extract features from the left image with left convolutional layer 1 and left convolutional layer 2; likewise, we extract features from the right image with right convolutional layer 1 and right convolutional layer 2. The features extracted from the left and right images are then concatenated along the channel dimension. To improve the fitting capability of the network, the concatenated result is passed in turn through convolutional layer 3, convolutional layer 4, convolutional layer 5 and the fully convolutional layer. Finally, for upsampling, the resulting feature maps are passed in turn through deconvolution layer 1, deconvolution layer 2, deconvolution layer 3, deconvolution layer 4 and deconvolution layer 5.
We use the convolutional neural network described above to fit a nonlinear function:
D(I_L, I_R) = I_D
That is, the network converts the two RGB images I_L and I_R into the corresponding depth image I_D.
Step 2: compute the projected position using the depth information.
First the pixel coordinates of the left image are converted to camera coordinates of the left camera; these are then transformed to camera coordinates of the right camera; finally the right-camera coordinates are projected to pixel locations in the right image. The whole process can be expressed as:
p_R ~ K_R ( R · I_D(p_L) · K_L⁻¹ · p_L + t )
where p_L and p_R are homogeneous pixel coordinates and ~ denotes equality up to the projective scale.
Step 2.1: invert the left-image pixel coordinates into the left camera coordinate system.
The pixel coordinates of the left image are converted to camera coordinates of the left camera, formulated as:
X_L = I_D(p_L) · (u_L − u_L0) / k_Lx
Y_L = I_D(p_L) · (v_L − v_L0) / k_Ly
Z_L = I_D(p_L)
where u_L, v_L are the pixel coordinates in the image, and u_L0, v_L0, k_Lx, k_Ly are the intrinsic parameters of the camera (the principal point and the focal lengths f_x, f_y expressed in pixels).
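As an illustration, Step 2.1 can be sketched in Python as follows. This sketch and its function name are ours, not part of the patent; the intrinsics in the example mirror the initialization values used later in the embodiment (principal point 304, 80 and focal length 950):

```python
def backproject_left(u_l, v_l, depth, u0, v0, kx, ky):
    """Invert a left-image pixel (u_l, v_l) with predicted depth I_D(p_L)
    into 3-D coordinates in the left camera frame (Step 2.1)."""
    x_l = depth * (u_l - u0) / kx
    y_l = depth * (v_l - v0) / ky
    z_l = depth  # Z_L is the predicted depth itself
    return x_l, y_l, z_l

# Example with illustrative intrinsics (u0=304, v0=80, kx=ky=950):
X, Y, Z = backproject_left(400.0, 100.0, 10.0, 304.0, 80.0, 950.0, 950.0)
```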
Step 2.2: transform the left camera coordinate system to the right camera coordinate system.
The coordinate transformation is performed with a rotation-translation matrix, formulated as:
[X_R, Y_R, Z_R]ᵀ = R · [X_L, Y_L, Z_L]ᵀ + t
where the rotation matrix R can be expressed with the four quaternion components:
R = [[1 − 2(q2² + q3²), 2(q1·q2 − q0·q3), 2(q1·q3 + q0·q2)],
     [2(q1·q2 + q0·q3), 1 − 2(q1² + q3²), 2(q2·q3 − q0·q1)],
     [2(q1·q3 − q0·q2), 2(q2·q3 + q0·q1), 1 − 2(q1² + q2²)]]
and the quaternion components must satisfy the constraint:
q0² + q1² + q2² + q3² = 1
Step 2.3: project the right camera coordinates to right-image pixel locations.
The camera coordinates of the right camera are projected to pixel locations in the right image, formulated as:
u_R = k_Rx · X_R / Z_R + u_R0
v_R = k_Ry · Y_R / Z_R + v_R0
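Steps 2.2 and 2.3 can be sketched together as follows, using the standard unit-quaternion parameterization of the rotation matrix; the function names and the example parameter values are illustrative only:

```python
def quat_to_rot(q0, q1, q2, q3):
    """Standard rotation matrix from a unit quaternion (q0^2+q1^2+q2^2+q3^2 = 1)."""
    return [
        [1 - 2*(q2*q2 + q3*q3), 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     1 - 2*(q1*q1 + q3*q3), 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     1 - 2*(q1*q1 + q2*q2)],
    ]

def left_to_right_pixel(p_left, quat, t, u0, v0, kx, ky):
    """Transform a 3-D point in the left camera frame to the right camera frame
    (Step 2.2) and project it to right-image pixel coordinates (Step 2.3)."""
    R = quat_to_rot(*quat)
    x, y, z = [sum(R[i][j] * p_left[j] for j in range(3)) + t[i] for i in range(3)]
    u_r = kx * x / z + u0
    v_r = ky * y / z + v0
    return u_r, v_r

# Identity rotation q = (1, 0, 0, 0) and a purely horizontal baseline t = (50, 0, 0):
u_r, v_r = left_to_right_pixel((1.0, 0.2, 10.0), (1, 0, 0, 0),
                               (50.0, 0.0, 0.0), 304.0, 80.0, 950.0, 950.0)
```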
Step 3: convert the computed positions to pixel coordinates using bilinear interpolation.
The right-image pixel locations (u_R, v_R) projected in Step 2 are continuous values. We denote the projected right-image pixel location by p'_R = (u_R, v_R). To obtain a better pixel filling effect, we use bilinear interpolation, interpolating from the pixel values of the four neighbouring pixels (top-left, top-right, bottom-left and bottom-right). This can be expressed by the formula:
I_w(p'_R) = Σ_(i,j) w_ij · I_R(p_ij),  (i, j) ∈ {top, bottom} × {left, right}
where p_ij denote the pixel coordinates of the four neighbours of p'_R (top-left, top-right, bottom-left and bottom-right); each weight w_ij is computed from the linear spatial distance between p'_R and p_ij, and the weights satisfy the equality constraint Σ_(i,j) w_ij = 1.
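A minimal sketch of the bilinear sampling in Step 3, assuming the image is a plain nested list indexed as img[row][column]; the function name is ours:

```python
import math

def bilinear_sample(img, u, v):
    """Sample image `img` at continuous position (u, v): interpolate the four
    neighbouring pixels (Step 3) with bilinear weights, which sum to 1."""
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    du, dv = u - u0, v - v0
    # (i, j) offsets: top-left, top-right, bottom-left, bottom-right
    w = {(0, 0): (1 - du) * (1 - dv), (1, 0): du * (1 - dv),
         (0, 1): (1 - du) * dv,       (1, 1): du * dv}
    return sum(wt * img[v0 + j][u0 + i] for (i, j), wt in w.items())

# 2x2 image; sampling at the centre averages all four pixels:
img = [[0.0, 1.0],
       [2.0, 3.0]]
val = bilinear_sample(img, 0.5, 0.5)
```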
Step 4: compute the prediction loss.
The reconstruction loss used by the present invention follows the absolute-error loss, expressed as:
L_recon = Σ_(p_L) | I_w(p_L) − I_L(p_L) |
Since the gradient of this loss function comes mainly from the intensity differences of the four surrounding neighbour pixels, two problems can arise: if the projected position falls in a weakly textured region there is almost no gradient, and if the projected position is far from the true position the pixel difference, and hence the gradient, becomes excessively large. To make the resulting depth map smoother, we use a simple L2 regularization to constrain the gradient ∇I_D of the depth image:
L_smooth = Σ ‖∇I_D‖²
We express the final loss function as:
L = (1/n) Σ ( L_recon + λ · L_smooth )
where n is the number of images, and λ is a hyperparameter that, as the regularization coefficient, adjusts the strength of the regularization.
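The combined loss of Step 4 can be sketched as follows; this is our illustration, using plain nested lists for the images and forward differences to approximate the depth-map gradient:

```python
def prediction_loss(I_left, I_warp, depth, lam=0.05):
    """Absolute-error reconstruction loss plus an L2 penalty on the
    depth-map gradient, weighted by the hyperparameter lambda (Step 4)."""
    H, W = len(I_left), len(I_left[0])
    # reconstruction term: per-pixel absolute difference
    recon = sum(abs(I_warp[y][x] - I_left[y][x]) for y in range(H) for x in range(W))
    # smoothness term: squared forward differences of the depth map
    smooth = sum((depth[y][x + 1] - depth[y][x]) ** 2 for y in range(H) for x in range(W - 1))
    smooth += sum((depth[y + 1][x] - depth[y][x]) ** 2 for y in range(H - 1) for x in range(W))
    return recon + lam * smooth

loss = prediction_loss([[0.0, 0.0]], [[0.1, 0.3]], [[1.0, 3.0]], lam=0.05)
```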
Embodiment 1:
The present invention uses an NVIDIA GPU as the computing platform and the Caffe deep learning framework as the CNN framework. The specific implementation steps are as follows:
Step 1: dataset preparation.
We train our neural network on the public KITTI dataset, which was acquired over several outdoor scenes with a pair of grayscale cameras, a pair of color cameras and a lidar mounted on a moving vehicle. We use the city, residential and road scene data acquired by the same vehicle on the same day, November 26, taking the images captured by the left and right color cameras as our training data. The raw RGB images from the left and right cameras are downsampled to a resolution of 160x608 as the input to our neural network.
The training set consists of 13855 left-right image pairs; we use 500 samples with ground-truth depth information as the test set to evaluate our results.
Step 2: build the convolutional neural network.
We use the network structure shown in Fig. 2, in which the layers are, in order: left convolutional layer 1, left convolutional layer 2, right convolutional layer 1, right convolutional layer 2, channel concatenation layer, convolutional layer 3, convolutional layer 4, convolutional layer 5, fully convolutional layer, deconvolution layer 1, deconvolution layer 2, deconvolution layer 3, deconvolution layer 4, deconvolution layer 5.
We extract features from the left image with left convolutional layers 1 and 2; likewise, we extract features from the right image with right convolutional layers 1 and 2. The features extracted from the left and right images are then concatenated along the channel dimension. To improve the fitting capability of the network, the concatenated result is passed in turn through convolutional layer 3, convolutional layer 4, convolutional layer 5 and the fully convolutional layer. Finally, for upsampling, the result generated by convolutional layer 5 is passed in turn through deconvolution layers 1 to 5.
The kernel sizes of left convolutional layer 1 and left convolutional layer 2 are 11x11 and 5x5, and their numbers of output feature maps are 96 and 256, respectively; the corresponding right convolutional layers 1 and 2 are identical. The channel concatenation layer is the concat layer provided by the Caffe framework. Convolutional layers 3, 4 and 5 all use 3x3 kernels and output 384, 384 and 256 feature maps, respectively. The fully convolutional layer uses 1024 kernels of size 1x1. The five deconvolution layers all use 4x4 kernels and each outputs a single feature map.
Step 3: initialize the intrinsic and extrinsic parameters of the left and right cameras.
When training our neural network to recover depth information, the intrinsic and extrinsic parameters of the left and right cameras must be initialized reasonably in order to converge to good depth. The initialization procedure is as follows:
Camera intrinsic initialization: we initialize u_L0, v_L0 to half the input image size after downsampling, i.e. 304 and 80 respectively. For convenience of solving, the corresponding right-image parameters u_R0, v_R0 are likewise initialized to 304 and 80. We initialize k_Lx, k_Ly to 950 each, and similarly initialize k_Rx, k_Ry to the same values.
Camera extrinsic initialization: because the motion between the two cameras of our dataset is mainly a horizontal translation, when initializing the extrinsic matrix we set the rotation matrix to the identity and, in the translation vector, initialize only the horizontal component; the components in the other directions are initialized to 0. Since the rotation matrix must satisfy complicated constraints, we represent the identity rotation with a quaternion; as the quaternion must satisfy the equality constraint q0² + q1² + q2² + q3² = 1, we initialize it to q0 = 1, q1 = 0, q2 = 0, q3 = 0. We initialize the translation parameters t_x, t_y, t_z to 50, 0 and 0, respectively.
We train our neural network with the camera intrinsic and extrinsic parameters initialized as above.
Step 4: training of the neural network and setting of the network parameters.
When training the convolutional neural network, we read in 7 image pairs as a batch each time. We optimize the network with SGD using a momentum of 0.9 and a weight decay of 0.0005. We subtract the corresponding per-channel RGB means (104, 117, 123) and then divide by 255 so that the left and right image pixel values are distributed in the interval [−0.5, 0.5]. In the loss function we set the hyperparameter λ to 0.05.
To save training time, we initialize part of the weights from the model trained for 40000 iterations by Ravi Garg et al. and start our training from there.
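The per-pixel input normalization described above (subtract the per-channel mean, then divide by 255) can be sketched as follows; the helper name is ours:

```python
def preprocess(rgb_pixel, means=(104.0, 117.0, 123.0)):
    """Normalize one RGB pixel: subtract the per-channel mean and divide
    by 255 so values fall roughly in [-0.5, 0.5]."""
    return tuple((c - m) / 255.0 for c, m in zip(rgb_pixel, means))

p = preprocess((255.0, 0.0, 123.0))
```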
The specific implementation steps are now complete; the resulting effect is shown in Fig. 3a, Fig. 3b and Fig. 3c. The test results of the present invention are given below. The experimental environment is a TITAN GPU with CUDA version 7.5; testing uses the KITTI dataset, and the results are compared with several supervised monocular prediction methods. We evaluate our results with the following metrics:
Comparison of results:
where c and f denote the coarse network and the fine network of Eigen's method, respectively.
Our unsupervised depth prediction method based on convolutional neural networks and binocular vision does not differ greatly in accuracy from the supervised methods, and has considerable room for further development as well as research significance.
Claims (5)
1. An unsupervised depth prediction method based on convolutional neural networks and binocular disparity, characterized by comprising the following steps:
Step 1: fit a nonlinear function with a convolutional neural network that converts the two RGB images acquired by the left and right cameras into the corresponding depth image;
Step 2: using the depth information, transform the pixel coordinates of the left image to obtain the corresponding pixel locations in the right image;
Step 3: after obtaining the right-image pixel locations, compute the right-image pixel coordinates and corresponding pixel values by bilinear interpolation;
Step 4: compute the prediction loss from the sampled pixel values and the corresponding pixel values of the left image.
2. The unsupervised depth prediction method based on convolutional neural networks and binocular disparity according to claim 1, characterized in that Step 1 is specifically:
the convolutional neural network comprises, in order: left convolutional layer 1, left convolutional layer 2, right convolutional layer 1, right convolutional layer 2, channel concatenation layer, convolutional layer 3, convolutional layer 4, convolutional layer 5, fully convolutional layer, deconvolution layer 1, deconvolution layer 2, deconvolution layer 3, deconvolution layer 4, deconvolution layer 5; features are extracted from the left image with left convolutional layers 1 and 2, and likewise from the right image with right convolutional layers 1 and 2; the features extracted from the left and right images are then concatenated along the channel dimension; the concatenated result is passed in turn through convolutional layer 3, convolutional layer 4, convolutional layer 5 and the fully convolutional layer; finally, for upsampling, the result generated by convolutional layer 5 is passed in turn through deconvolution layers 1 to 5;
the convolutional neural network described above fits a nonlinear function:
D(I_L, I_R) = I_D
that is, this convolutional neural network converts the two RGB images I_L and I_R into the corresponding depth image I_D.
3. The unsupervised depth prediction method based on convolutional neural networks and binocular disparity according to claim 1, characterized in that, in Step 2, the pixel coordinates of the left image are first converted to camera coordinates of the left camera, the camera coordinates of the left camera are then transformed to camera coordinates of the right camera, and the camera coordinates of the right camera are finally projected to pixel locations in the right image; the whole process can be expressed as:
p_R ~ K_R ( R · I_D(p_L) · K_L⁻¹ · p_L + t )
It specifically includes:
Step 2.1: invert the left-image pixel coordinates into the left camera coordinate system.
The pixel coordinates of the left image are converted to camera coordinates of the left camera, formulated as:
X_L = I_D(p_L) · (u_L − u_L0) / k_Lx
Y_L = I_D(p_L) · (v_L − v_L0) / k_Ly
Z_L = I_D(p_L)
where u_L, v_L are the pixel coordinates in the image, and u_L0, v_L0, k_Lx, k_Ly are the intrinsic parameters of the camera (the principal point and the focal lengths f_x, f_y expressed in pixels);
Step 2.2: transform the left camera coordinate system to the right camera coordinate system.
The coordinate transformation is performed with a rotation-translation matrix, formulated as:
[X_R, Y_R, Z_R]ᵀ = R · [X_L, Y_L, Z_L]ᵀ + t
where the rotation matrix R can be expressed with the four quaternion components:
R = [[1 − 2(q2² + q3²), 2(q1·q2 − q0·q3), 2(q1·q3 + q0·q2)],
     [2(q1·q2 + q0·q3), 1 − 2(q1² + q3²), 2(q2·q3 − q0·q1)],
     [2(q1·q3 − q0·q2), 2(q2·q3 + q0·q1), 1 − 2(q1² + q2²)]]
and the quaternion components must satisfy the constraint q0² + q1² + q2² + q3² = 1;
Step 2.3: project the right camera coordinates to right-image pixel locations.
The camera coordinates of the right camera are projected to pixel locations in the right image, formulated as:
u_R = k_Rx · X_R / Z_R + u_R0
v_R = k_Ry · Y_R / Z_R + v_R0
4. The unsupervised depth prediction method based on convolutional neural networks and binocular disparity according to claim 1, characterized in that Step 3 is specifically:
the right-image pixel locations (u_R, v_R) projected in Step 2 are continuous values; the projected right-image pixel location is denoted p'_R = (u_R, v_R); using bilinear interpolation, the pixel values of the four neighbouring pixels (top-left, top-right, bottom-left and bottom-right) are interpolated, which can be expressed by the formula:
I_w(p'_R) = Σ_(i,j) w_ij · I_R(p_ij),  (i, j) ∈ {top, bottom} × {left, right}
where p_ij denote the pixel coordinates of the four neighbours of p'_R (top-left, top-right, bottom-left and bottom-right); each weight w_ij is computed from the linear spatial distance between p'_R and p_ij, and the weights satisfy the equality constraint Σ_(i,j) w_ij = 1.
5. The unsupervised depth prediction method based on convolutional neural networks and binocular disparity according to claim 1, characterized in that Step 4 is specifically:
the reconstruction loss follows the absolute-error loss, expressed as:
L_recon = Σ_(p_L) | I_w(p_L) − I_L(p_L) |
the gradient of this loss function comes from the intensity differences of the four surrounding neighbour pixels; to make the resulting depth map smoother, a simple L2 regularization is used to constrain the gradient ∇I_D of the depth image:
L_smooth = Σ ‖∇I_D‖²
the final loss function is expressed as:
L = (1/n) Σ ( L_recon + λ · L_smooth )
where n is the number of images, and λ is a hyperparameter that, as the regularization coefficient, adjusts the strength of the regularization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810144465.2A CN108389226A (en) | 2018-02-12 | 2018-02-12 | An unsupervised depth prediction method based on convolutional neural networks and binocular disparity
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810144465.2A CN108389226A (en) | 2018-02-12 | 2018-02-12 | An unsupervised depth prediction method based on convolutional neural networks and binocular disparity
Publications (1)
Publication Number | Publication Date |
---|---|
CN108389226A true CN108389226A (en) | 2018-08-10 |
Family
ID=63068766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810144465.2A Pending CN108389226A (en) | 2018-02-12 | 2018-02-12 | A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108389226A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299656A (en) * | 2018-08-13 | 2019-02-01 | 浙江零跑科技有限公司 | A kind of deeply determining method of vehicle-mounted vision system scene visual |
CN109377530A (en) * | 2018-11-30 | 2019-02-22 | 天津大学 | A kind of binocular depth estimation method based on deep neural network |
CN109584340A (en) * | 2018-12-11 | 2019-04-05 | 苏州中科广视文化科技有限公司 | New Century Planned Textbook synthetic method based on depth convolutional neural networks |
CN109615674A (en) * | 2018-11-28 | 2019-04-12 | 浙江大学 | The double tracer PET method for reconstructing of dynamic based on losses by mixture function 3D CNN |
CN109801323A (en) * | 2018-12-14 | 2019-05-24 | 中国科学院深圳先进技术研究院 | Pyramid binocular depth with self-promotion ability estimates model |
CN110009691A (en) * | 2019-03-28 | 2019-07-12 | 北京清微智能科技有限公司 | Based on the matched anaglyph generation method of binocular stereo vision and system |
CN110175603A (en) * | 2019-04-01 | 2019-08-27 | 佛山缔乐视觉科技有限公司 | A kind of engraving character recognition methods, system and storage medium |
CN110414393A (en) * | 2019-07-15 | 2019-11-05 | 福州瑞芯微电子股份有限公司 | A kind of natural interactive method and terminal based on deep learning |
CN110702015A (en) * | 2019-09-26 | 2020-01-17 | 中国南方电网有限责任公司超高压输电公司曲靖局 | Method and device for measuring icing thickness of power transmission line |
WO2020046066A1 (en) * | 2018-08-30 | 2020-03-05 | Samsung Electronics Co., Ltd. | Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image |
CN111462208A (en) * | 2020-04-05 | 2020-07-28 | 北京工业大学 | Non-supervision depth prediction method based on binocular parallax and epipolar line constraint |
CN111862321A (en) * | 2019-04-30 | 2020-10-30 | 北京四维图新科技股份有限公司 | Method, device and system for acquiring disparity map and storage medium |
CN112639878A (en) * | 2018-09-05 | 2021-04-09 | 谷歌有限责任公司 | Unsupervised depth prediction neural network |
EP3731528A4 (en) * | 2017-12-21 | 2021-08-11 | Sony Interactive Entertainment Inc. | Image processing device, content processing device, content processing system, and image processing method |
US11841921B2 (en) | 2020-06-26 | 2023-12-12 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Model training method and apparatus, and prediction method and apparatus |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101577004A (en) * | 2009-06-25 | 2009-11-11 | 青岛海信数字多媒体技术国家重点实验室有限公司 | Rectification method for polar lines, appliance and system thereof |
CN105869167A (en) * | 2016-03-30 | 2016-08-17 | 天津大学 | High-resolution depth map acquisition method based on active and passive fusion |
CN105975915A (en) * | 2016-04-28 | 2016-09-28 | 大连理工大学 | Front vehicle parameter identification method based on multitask convolution nerve network |
CN106204731A (en) * | 2016-07-18 | 2016-12-07 | 华南理工大学 | A kind of multi-view angle three-dimensional method for reconstructing based on Binocular Stereo Vision System |
CN106612427A (en) * | 2016-12-29 | 2017-05-03 | 浙江工商大学 | Method for generating spatial-temporal consistency depth map sequence based on convolution neural network |
CN106934765A (en) * | 2017-03-14 | 2017-07-07 | 长沙全度影像科技有限公司 | Panoramic picture fusion method based on depth convolutional neural networks Yu depth information |
KR20180012638A (en) * | 2016-07-27 | 2018-02-06 | 한국전자통신연구원 | Method and apparatus for detecting object in vision recognition with aggregate channel features |
- 2018-02-12: CN application CN201810144465.2A filed (CN108389226A); status: active, pending
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11503267B2 (en) | 2017-12-21 | 2022-11-15 | Sony Interactive Entertainment Inc. | Image processing device, content processing device, content processing system, and image processing method |
EP3731528A4 (en) * | 2017-12-21 | 2021-08-11 | Sony Interactive Entertainment Inc. | Image processing device, content processing device, content processing system, and image processing method |
CN109299656A (en) * | 2018-08-13 | 2019-02-01 | 浙江零跑科技有限公司 | Scene depth determination method for a vehicle-mounted vision system |
US10832432B2 (en) | 2018-08-30 | 2020-11-10 | Samsung Electronics Co., Ltd | Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image |
WO2020046066A1 (en) * | 2018-08-30 | 2020-03-05 | Samsung Electronics Co., Ltd. | Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image |
US11410323B2 (en) | 2018-08-30 | 2022-08-09 | Samsung Electronics., Ltd | Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image |
CN112639878A (en) * | 2018-09-05 | 2021-04-09 | 谷歌有限责任公司 | Unsupervised depth prediction neural network |
CN109615674A (en) * | 2018-11-28 | 2019-04-12 | 浙江大学 | Dynamic dual-tracer PET reconstruction method based on mixed loss function 3D CNN |
CN109615674B (en) * | 2018-11-28 | 2020-09-18 | 浙江大学 | Dynamic dual-tracer PET reconstruction method based on mixed loss function 3D CNN |
CN109377530B (en) * | 2018-11-30 | 2021-07-27 | 天津大学 | Binocular depth estimation method based on a deep neural network |
CN109377530A (en) * | 2018-11-30 | 2019-02-22 | 天津大学 | Binocular depth estimation method based on a deep neural network |
CN109584340B (en) * | 2018-12-11 | 2023-04-18 | 苏州中科广视文化科技有限公司 | Novel view synthesis method based on a deep convolutional neural network |
CN109584340A (en) * | 2018-12-11 | 2019-04-05 | 苏州中科广视文化科技有限公司 | Novel view synthesis method based on a deep convolutional neural network |
CN109801323A (en) * | 2018-12-14 | 2019-05-24 | 中国科学院深圳先进技术研究院 | Pyramid binocular depth estimation model with self-improvement capability |
CN110009691A (en) * | 2019-03-28 | 2019-07-12 | 北京清微智能科技有限公司 | Disparity map generation method and system based on binocular stereo matching |
CN110175603A (en) * | 2019-04-01 | 2019-08-27 | 佛山缔乐视觉科技有限公司 | Engraved character recognition method, system and storage medium |
CN111862321A (en) * | 2019-04-30 | 2020-10-30 | 北京四维图新科技股份有限公司 | Method, device and system for acquiring disparity map and storage medium |
CN111862321B (en) * | 2019-04-30 | 2024-05-03 | 北京四维图新科技股份有限公司 | Disparity map acquisition method, device, system and storage medium |
CN110414393A (en) * | 2019-07-15 | 2019-11-05 | 福州瑞芯微电子股份有限公司 | Natural interaction method and terminal based on deep learning |
CN110702015A (en) * | 2019-09-26 | 2020-01-17 | 中国南方电网有限责任公司超高压输电公司曲靖局 | Method and device for measuring icing thickness of power transmission line |
CN111462208A (en) * | 2020-04-05 | 2020-07-28 | 北京工业大学 | Unsupervised depth prediction method based on binocular disparity and epipolar constraint |
US11841921B2 (en) | 2020-06-26 | 2023-12-12 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Model training method and apparatus, and prediction method and apparatus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108389226A (en) | Unsupervised depth prediction method based on convolutional neural networks and binocular disparity | |
Amirkolaee et al. | Height estimation from single aerial images using a deep convolutional encoder-decoder network | |
CN111462329B (en) | Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning | |
Mitrokhin et al. | EV-IMO: Motion segmentation dataset and learning pipeline for event cameras | |
CN106803267B (en) | Kinect-based indoor scene three-dimensional reconstruction method | |
CN104778671B (en) | Image super-resolution method based on SAE and sparse representation | |
CN105069746B (en) | Real-time video face replacement method and system based on local affine invariance and color transfer technology | |
CN108416840A (en) | Dense three-dimensional scene reconstruction method based on a monocular camera | |
Kropatsch et al. | Digital image analysis: selected techniques and applications | |
CN106981080A (en) | Night-time unmanned vehicle scene depth estimation method based on infrared images and radar data | |
CN110310227A (en) | Image super-resolution reconstruction method based on high- and low-frequency information decomposition | |
CN106780592A (en) | Kinect depth reconstruction algorithm based on camera motion and image shading | |
CN104835130A (en) | Multi-exposure image fusion method | |
CN102142153A (en) | Image-based three-dimensional model reconstruction method | |
CN104869387A (en) | Method for acquiring maximum disparity of binocular images based on optical flow | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN116258658B (en) | Swin Transformer-based image fusion method | |
CN113313828B (en) | Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition | |
CN111462208A (en) | Unsupervised depth prediction method based on binocular disparity and epipolar constraint | |
CN110097634A (en) | Adaptive multi-scale three-dimensional ghost imaging method | |
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning | |
CN113516693A (en) | Rapid and universal image registration method | |
CN112686830B (en) | Super-resolution method of single depth map based on image decomposition | |
CN112116646B (en) | Depth estimation method for light field images based on a deep convolutional neural network | |
CN117315169A (en) | Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2018-08-10 |