CN110503680B - Unsupervised convolutional neural network-based monocular scene depth estimation method - Google Patents
- Publication number
- Publication number: CN110503680B (application CN201910807213.8A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- image
- depth estimation
- unsupervised
- residual
- Prior art date
- Legal status (assumed; not a legal conclusion)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an unsupervised convolutional neural network-based monocular scene depth estimation method, which comprises the following steps: obtaining the depth value of each pixel point of a target image; acquiring the camera pose value obtained when pixel coordinates on the target image are transferred to the next frame image; constructing a loss function; and performing depth estimation based on the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The invention adopts an unsupervised method that well solves the problem of difficult manual data annotation, saves manpower, and improves economic benefit. The invention adopts the linear-chain conditional random field idea to realize the feature expression of the original image, and combines it with the unsupervised residual convolutional neural network scene depth estimation model to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The model of the invention is superior to the other three compared models in average relative error (rel) and accuracy (acc).
Description
Technical Field
The invention relates to a scene depth estimation method, in particular to an unsupervised convolutional neural network-based monocular scene depth estimation method.
Background
Computer vision is primarily the simulation of biological vision by a computer and associated vision sensors. External images are first acquired with a camera and then converted into digital signals by a computer, realizing digital processing of the images; from this a new discipline, computer vision, was created. It involves many application fields, including target tracking, image classification, face recognition, scene understanding, and so on. The research objective of computer vision is to give a computer the ability to observe its environment, understand it, and adapt to it autonomously, like a person.
However, most current computer vision technology is directed at digital image processing. Because depth information of the real scene and pose information of the camera are lacking during image processing, errors in scene understanding and recognition can arise to a certain extent. Therefore, how to reconstruct the three-dimensional structure of a scene from images using depth information and camera pose information is a very important topic in computer vision. At present, three-dimensional scene reconstruction using a depth map is an important approach, and there are two main methods for acquiring the depth information of an image. The traditional method is to acquire depth information directly through hardware equipment, such as a laser range finder; but such equipment is difficult to manufacture, costly, and expensive, which restricts its popularization. The other method is to acquire the depth information of an image through computer vision technology, i.e., the scene depth estimation method.
Scene depth estimation methods are mainly divided into monocular and binocular methods. Binocular scene depth estimation must first be performed under the assumption that the optical geometric constraints are unchanged, as in stereo image matching. Monocular scene depth estimation needs no prior assumption, places low requirements on the camera setup, and is convenient to apply; its drawback is that rich scene structure features from which to infer scene depth are difficult to obtain from a monocular image. In recent years, convolutional neural networks have achieved many excellent results in the field of computer vision. In 2016, Liu et al. combined a convolutional neural network with a conditional random field and proposed the DCNF-FCSP scene depth estimation model, in which the convolutional neural network mainly acquires the low-level features of the image and the conditional random field smooths the estimated depth map according to the similarity of adjacent superpixels. When a depth image dataset is produced, interference from various external conditions such as illumination and weather changes prevents the depth sensor from obtaining reliable and accurate image depth information, which may affect the accuracy of the estimation result of a depth estimation model; moreover, the supervised learning method suffers from the difficulty of manual data annotation. On the other hand, as the number of layers of a convolutional neural network increases, the vanishing-gradient problem may arise, making the network harder to train, so the obtained results are not accurate enough.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unsupervised convolutional neural network-based monocular scene depth estimation method that requires no manual data annotation and produces more accurate results.
In order to achieve the above object, the technical scheme of the present invention is as follows: an unsupervised convolutional neural network-based monocular scene depth estimation method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive images I_{t-1}, I_t, I_{t+1} of a scene are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image.
A2. Use the target image I_t as the input of a monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual items. The feature maps obtained after the input target image I_t passes through the convolution layers are expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L),  L ∈ {1, 2, ..., L-1}   (1)
H_{L+1} = w_{L+1} T_L + b_{L+1},  L ∈ {1, 2, ..., L-1}   (2)
T_{L+1} = f(H_{L-1} + H_{L+1}),  L ∈ {1, 2, ..., L-1}   (3)
where L denotes the index of the convolution layer; w_L and w_{L+1} denote the weight values of the layer-L and layer-(L+1) convolution layers of the monocular depth estimation residual convolutional neural network model being trained; b_L denotes the bias value for training the model; f(·) denotes the activation function in the model; H_{L-1} and H_{L+1} denote the feature maps output by the layer-(L-1) and layer-(L+1) convolution layers. T_L denotes the output of the layer-L convolution layer before the residual term, and T_{L+1} denotes the output of the layer-(L+1) convolution layer after activation through the residual term.
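Since the formula images are not reproduced in this text, the following minimal NumPy sketch illustrates how equations (1)–(3) chain together inside one residual unit. Plain matrix products stand in for the convolutions, and the shapes, initial values, and ReLU choice are illustrative assumptions, not the patent's actual layer configuration.

```python
import numpy as np

def relu(x):
    # Assumed activation f(.); the patent does not name its activation.
    return np.maximum(x, 0.0)

def residual_block(h_prev, w_l, b_l, w_next, b_next):
    """One residual unit following equations (1)-(3).

    h_prev: feature map H_{L-1} from the previous layer (a flat vector here);
    convolutions are replaced by matrix products for illustration.
    """
    t_l = relu(w_l @ h_prev + b_l)        # eq. (1): T_L = f(w_L H_{L-1} + b_L)
    h_next = w_next @ t_l + b_next        # eq. (2): H_{L+1} = w_{L+1} T_L + b_{L+1}
    t_next = relu(h_prev + h_next)        # eq. (3): skip connection, T_{L+1} = f(H_{L-1} + H_{L+1})
    return t_next

rng = np.random.default_rng(0)
h = rng.standard_normal(8)
w1, w2 = rng.standard_normal((8, 8)) * 0.1, rng.standard_normal((8, 8)) * 0.1
out = residual_block(h, w1, np.zeros(8), w2, np.zeros(8))
```

Note that equation (3) adds the pre-block input H_{L-1} back in, which is what lets gradients bypass the two weighted layers and mitigates the vanishing-gradient problem mentioned in the Background.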
A3. A depth linear regression function is added after the final feature map output by the deconvolution layers, mapping each pixel point in the feature map to a corresponding depth value, as shown in formula (4):
where G denotes the last deconvolution layer, w_Gd denotes the weight of the monocular depth estimation residual convolutional neural network model being trained, b_Gd denotes the bias vector, H_G denotes the feature map obtained by the final deconvolution layer, and d̂ denotes the resulting depth value.
B. Obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
Use a pose residual convolutional neural network model to compute the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1}. The pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual items. The specific steps are as follows:
b1, assume a given image I of two consecutive RGB t 、I t+1 And the sizes are 426 multiplied by 128 multiplied by 3, and the residual volumes are input into the pose residual volumeAnd (5) integrating the neural network model.
B2. After the seven convolution layers of the pose residual convolutional neural network model, a feature vector AX + b of size 1 × 768 is obtained for each of the two images I_t, I_{t+1}, where A denotes the convolution kernel, X denotes the image feature, i.e., the gray-level matrix, and b denotes the bias value.
B3. A camera pose estimation algorithm yields the camera pose value T̂ from image I_t to image I_{t+1}, i.e., the rotation matrix R and translation vector v, expressed in six degrees of freedom. This means that the coordinates of pixels in image I_t can be located at the corresponding pixel coordinates in image I_{t+1} through the camera pose transformation value.
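As a hedged illustration of step B3, the sketch below expands a six-degree-of-freedom pose vector into the rotation matrix R and translation vector v. The patent does not specify its rotation parameterization, so the axis-angle convention and the Rodrigues expansion used here are assumptions.

```python
import numpy as np

def pose_vec_to_rt(pose6):
    """Convert an assumed 6-DoF vector (rx, ry, rz, tx, ty, tz) into a
    rotation matrix R and translation vector v, as step B3 describes.
    The axis-angle part is expanded with the Rodrigues formula."""
    r = np.asarray(pose6[:3], float)
    v = np.asarray(pose6[3:], float)
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3), v                 # near-zero rotation: identity
    k = r / theta                           # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # skew-symmetric cross-product matrix
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return R, v

R, v = pose_vec_to_rt([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
```

A 90° rotation about the z-axis maps the x-axis onto the y-axis, which is a quick sanity check that the expansion produces a proper rotation.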
C. Construction of a loss function
C1. Using the predicted image depth value D̂, the camera motion pose estimate T̂, the target image I_t, and the camera built-in (intrinsic) parameter K as input, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained. This process is called view synthesis, as shown in equation (5).
where K denotes the built-in (intrinsic) parameters of the camera, T̂_{t→t+1} denotes the pose estimate of the camera moving from time t to time t+1, and D̂_t(p_t) denotes the depth value at pixel coordinate point p_t.
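Equation (5) itself is not reproduced in this text. The sketch below assumes the standard view-synthesis projection p_{t+1} ~ K T̂ (D̂(p_t) K⁻¹ p_t) used in unsupervised depth estimation; the intrinsic values are illustrative only.

```python
import numpy as np

def project_pixel(p_t, depth, K, R, v):
    """Map a pixel p_t in frame t to its location in frame t+1, assuming
    the relation p_{t+1} ~ K [R|v] (depth * K^{-1} p_t) sketched for eq. (5)."""
    p_h = np.array([p_t[0], p_t[1], 1.0])          # homogeneous pixel coordinate
    cam_point = depth * (np.linalg.inv(K) @ p_h)   # back-project to 3-D camera coords
    cam_next = R @ cam_point + v                   # rigid motion T̂ to frame t+1
    p_next = K @ cam_next                          # re-project with intrinsics
    return p_next[:2] / p_next[2]                  # perspective divide

# Illustrative intrinsics (focal length 100, principal point (64, 64)).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
p_next = project_pixel((10.0, 20.0), 5.0, K, np.eye(3), np.zeros(3))
```

With zero camera motion (R = I, v = 0) every pixel projects back onto itself regardless of depth, which is a useful correctness check for the warp.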
C2. Through view synthesis, the correspondence between each pixel point of the target image and the next frame image is found and used as the unsupervised signal of the loss function. Because the coordinate values in an image are all discrete integers, in order to guarantee that I_t(p_{t+1}) has a pixel value, the values of the four neighbours of I_t(p_{t+1}) (top-left, bottom-left, top-right, bottom-right) are proportionally blended by bilinear interpolation to obtain Î_t, representing the new image after coordinate conversion, as shown in equation (6).
where ω_ij is linearly related to the distance between p_{t+1} and its neighbours, Σω_ij = 1, and the ω_ij are the bilinear interpolation weights; N_p denotes the neighbourhood of pixel coordinate p on the image, i denotes linear interpolation along the vertical axis within the neighbourhood of p, and j denotes linear interpolation along the horizontal axis within the neighbourhood of p.
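The four-neighbour blending described in C2 can be sketched as follows; the function name and single-channel image layout are illustrative assumptions, but the weights follow the standard bilinear scheme in which the four ω_ij sum to 1.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image at continuous coordinate (x, y) by
    blending the four integer neighbours (top-left, top-right,
    bottom-left, bottom-right) with weights that sum to 1, as in eq. (6)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                     # fractional offsets
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.arange(16, dtype=float).reshape(4, 4)
center = bilinear_sample(img, 1.5, 1.5)   # average of the four surrounding pixels
```

Because the weights are differentiable in (x, y), this sampling step lets the photometric loss propagate gradients back through the projected coordinates to the depth and pose networks.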
And C3, jointly forming an unsupervised residual error convolutional neural network scene depth estimation model by the monocular depth estimation residual error convolutional neural network model and the pose residual error convolutional neural network model, obtaining a loss function of the unsupervised residual error convolutional neural network scene depth estimation model by view synthesis, and realizing mapping from an original image to a depth image, wherein the loss function is shown in a formula (7):
where I_t(p) denotes the pixel at coordinate p on the target image I_t, Î_t(p) denotes the pixel at coordinate p on the reconstructed new image, N denotes the number of pixel points, and M denotes the number of images; t indexes frames, and p is a pixel coordinate.
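Formula (7) is not reproduced in this text. Assuming, as the surrounding description states, that the loss averages the absolute photometric difference between I_t(p) and Î_t(p) over the N pixels and M images, a minimal sketch is:

```python
import numpy as np

def view_synthesis_loss(targets, reconstructions):
    """Mean absolute photometric error over M images and N pixels each,
    a sketch of the unsupervised loss L_VS described for formula (7)."""
    total = 0.0
    for i_t, i_hat in zip(targets, reconstructions):
        total += np.abs(i_t - i_hat).mean()     # average over the N pixels
    return total / len(targets)                 # average over the M images

targets = [np.zeros((2, 2)), np.ones((2, 2))]
recons = [np.full((2, 2), 0.5), np.ones((2, 2))]
loss = view_synthesis_loss(targets, recons)
```

Minimizing this quantity needs no ground-truth depth labels: the "supervision" comes entirely from how well the warped next frame reconstructs the target frame.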
D. Depth estimation based on unsupervised conditional random field residual convolution neural network scene
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the conditional random field consists of an input layer, seven convolution layers, and two residual items.
D2. For scene depth estimation, given an image X, the conditional probability of the depth d̂ is written in the form of a Gibbs distribution, as shown in equation (8):
where E(d̂, X) is the energy function; the normalization factor Z(X) is as shown in equation (9):
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by the maximum conditional likelihood estimation method, so the negative log-likelihood function L serving as the model's loss function is as shown in formula (10):
where the energy function E(d̂, X) is defined in a form containing a univariate term and a bivariate term, as shown in formula (11):
where d_p and d_j denote the depth values at points p and j. D_ij(d_p, d_j) is the feature expression of the original image, as shown in formula (12); the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is as shown in formula (7):
where π_pj denotes a smoothness penalty measuring the similarity of the features of neighbouring pixels. U_p and U_j denote the feature values of the pixel points at p and j; the smaller the difference between the two feature values, the more strongly the depths of the two pixel points are penalized toward similarity.
D4. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of steps A and B map the original target image to the depth map and form the univariate part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The conditional random field corresponds to the bivariate part of the model and realizes the feature expression of the original image. Based on the outputs of these two parts, the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed. Similar to the traditional way of training the parameters of a conditional random field, the model is trained by the maximum conditional likelihood estimation method, and the negative log-likelihood is adopted as its loss function L(W), as shown in formula (13):
where W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model being trained. To facilitate differentiating the loss function with respect to the parameters, the following expression is introduced:
Q=B+C-R
where B denotes an n × n identity matrix, R is the similarity matrix (a square matrix formed from the π_pj), C is the diagonal matrix of R, and C − R is a Laplacian matrix. The loss can thus be rearranged into formula (14):
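Interpreting C as the diagonal degree matrix of the similarity matrix R (an assumption, but the one consistent with C − R being a graph Laplacian), the construction of Q = B + C − R can be sketched as:

```python
import numpy as np

def build_q(similarity):
    """Build Q = B + C - R from a symmetric similarity matrix R whose
    entries are the smoothness penalties pi_pj: B is the identity, C the
    assumed diagonal degree matrix of R, and C - R the graph Laplacian."""
    R = np.asarray(similarity, float)
    n = R.shape[0]
    B = np.eye(n)
    C = np.diag(R.sum(axis=1))   # degree matrix: row sums of R on the diagonal
    L = C - R                    # Laplacian: each row sums to zero
    return B + L

Q = build_q([[0.0, 1.0],
             [1.0, 0.0]])
```

Adding the identity B makes Q strictly positive definite even though the Laplacian alone is only positive semi-definite, which is what allows the partition function Z(X, W) to be evaluated in closed form in the following equations.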
where T denotes the matrix transpose. θ1 and θ2 denote the parameters of the univariate and bivariate terms of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, respectively. Since E(d̂, X) is a quadratic function of the depth vector d̂, the function Z(X, W) can be computed directly, as shown in equation (15):
where d_I denotes the depth value of each pixel point on the image. Combining equations (14)–(15) yields the probability distribution function shown in equation (16):
The loss function is then rearranged into the negative log-likelihood function shown in equation (17):
D5. By minimizing the loss function of the unsupervised conditional random field residual convolutional neural network scene depth estimation model using the stochastic gradient descent method, the univariate and bivariate parts of the model are trained synchronously and the parameters are learned. When differentiating with respect to the model parameter θ1, since θ1 is related only to the univariate term, the derivative of L(W) with respect to θ1 is as shown in equation (18).
When differentiating with respect to the convolutional neural network parameter θ2, likewise, since θ2 is related only to the bivariate term, the derivative of L(W) with respect to θ2 is as shown in formula (19):
In formulas (18) and (19), Tr(·) denotes the trace of a matrix.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts an unsupervised residual convolutional neural network scene depth estimation model: the depth value d̂ of each pixel of the original image is obtained in step A; the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1} is obtained in step B; a new image Î_t is reconstructed in step C1; the unsupervised signal of the loss function is obtained in step C2; and finally the loss function of the unsupervised model is obtained using the view-synthesis idea in step C3, thereby realizing an unsupervised method. The unsupervised method adopted by the invention well solves the problem of difficult manual data annotation, saves manpower, and improves economic benefit.
2. The invention adopts the linear-chain conditional random field idea to realize the feature expression of the original image. Combining it with the unsupervised residual convolutional neural network scene depth estimation model forms the unsupervised conditional random field residual convolutional neural network scene depth estimation model, which was experimentally compared under the same dataset with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc), and is almost level with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this level state can also be considered superior to the other three models.
Drawings
The invention is illustrated by 3 figures, in which:
fig. 1 is a flow chart of the present invention.
Fig. 2 is an original image of a scene.
Fig. 3 is a scene depth image obtained after convergence of model training.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The scene image is processed according to the flow chart shown in fig. 1. First, a camera is used to shoot video of the scene to be processed, and images of consecutive frames are selected as the original images for the scene depth estimation of the invention, as shown in fig. 2. According to steps A, B, and C of the invention, the mapping from the original image to the depth map is realized using the view-synthesis idea, and the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is obtained, as shown in formula (7). Then, according to step D of the invention, the feature expression D_ij(d_p, d_j) of the target image is obtained, and the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed based on the outputs of the two parts. By minimizing the model loss function, training and parameter learning of the model are achieved using the stochastic gradient descent method. After the model training converges, the final scene depth map is obtained, as shown in fig. 3.
The present invention was experimentally compared under the same dataset with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al., as shown in Table 1. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc), and is almost level with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this level state can also be considered superior to the other three models.
Table 1 comparison of experimental results
The present invention is not limited to this embodiment; any equivalent concept or modification within the technical scope disclosed by the present invention falls within the protection scope of the present invention.
Claims (1)
1. An unsupervised convolutional neural network-based monocular scene depth estimation method is characterized by comprising the following steps of: the method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive images I_{t-1}, I_t, I_{t+1} of a scene are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image;
A2. Use the target image I_t as the input of a monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual items; the feature maps obtained after the input target image I_t passes through the convolution layers are expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L),  L ∈ {1, 2, ..., L-1}   (1)
H_{L+1} = w_{L+1} T_L + b_{L+1},  L ∈ {1, 2, ..., L-1}   (2)
T_{L+1} = f(H_{L-1} + H_{L+1}),  L ∈ {1, 2, ..., L-1}   (3)
where L denotes the index of the convolution layer; w_L and w_{L+1} denote the weight values of the layer-L and layer-(L+1) convolution layers of the monocular depth estimation residual convolutional neural network model being trained; b_L denotes the bias value for training the model; f(·) denotes the activation function in the model; H_{L-1} and H_{L+1} denote the feature maps output by the layer-(L-1) and layer-(L+1) convolution layers; T_L denotes the output of the layer-L convolution layer before the residual term, and T_{L+1} denotes the output of the layer-(L+1) convolution layer after activation through the residual term;
a3, adding a depth linear regression function after the final feature map output by the convolution layer, and mapping each pixel point in the feature map into a corresponding depth value, as shown in a formula (4):
where G denotes the last deconvolution layer; w_Gd denotes the weight of the monocular depth estimation residual convolutional neural network model being trained; b_Gd denotes the bias vector; H_G denotes the feature map obtained by the final deconvolution layer; and d̂ denotes the resulting depth value;
B. obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
Use a pose residual convolutional neural network model to compute the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1}; the pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual items; the specific steps are as follows:
B1. Assume two consecutive RGB images I_t, I_{t+1} are given, each of size 426 × 128 × 3, and input them into the pose residual convolutional neural network model;
B2. After the seven convolution layers of the pose residual convolutional neural network model, a feature vector AX + b of size 1 × 768 is obtained for each of the two images I_t, I_{t+1}, where A denotes the convolution kernel, X denotes the image feature, i.e., the gray-level matrix, and b denotes the bias value;
B3. A camera pose estimation algorithm yields the camera pose value T̂ from image I_t to image I_{t+1}, i.e., the rotation matrix R and translation vector v, expressed in six degrees of freedom, meaning that the coordinates of pixels in image I_t are located at the corresponding pixel coordinates in image I_{t+1} through the camera pose transformation value;
C. construction of a loss function
C1. Using the predicted image depth value D̂, the camera motion pose estimate T̂, the target image I_t, and the camera built-in (intrinsic) parameter K as input, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained; this process is called view synthesis, as shown in formula (5);
where K denotes the built-in (intrinsic) parameters of the camera, T̂_{t→t+1} denotes the pose estimate of the camera moving from time t to time t+1, and D̂_t(p_t) denotes the depth value at pixel coordinate point p_t;
C2. Through view synthesis, the correspondence between each pixel point of the target image and the next frame image is found and used as the unsupervised signal of the loss function; because the coordinate values in an image are all discrete integers, in order to guarantee that I_t(p_{t+1}) has a pixel value, the values of the four neighbours of I_t(p_{t+1}), namely the top-left, bottom-left, top-right, and bottom-right values, are proportionally blended by bilinear interpolation to obtain Î_t, representing the new image after coordinate conversion, as shown in formula (6);
where ω_ij is linearly related to the distance between p_{t+1} and its neighbours, Σω_ij = 1, and the ω_ij are the bilinear interpolation weights; N_p denotes the neighbourhood of pixel coordinate p on the image, i denotes linear interpolation along the vertical axis within the neighbourhood of p, and j denotes linear interpolation along the horizontal axis within the neighbourhood of p;
C3. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model jointly form the unsupervised residual convolutional neural network scene depth estimation model; the loss function of this model is obtained by view synthesis and realizes the mapping from the original image to the depth image, as shown in formula (7):

L_vs = Σ_{t=1}^{M} Σ_{p=1}^{N} | I_t(p) − Î_t(p) |        (7)
wherein I_t(p) represents the pixel value of the target image I_t at pixel coordinate p, Î_t(p) represents the corresponding pixel on the reconstructed new image, N represents the number of pixels and M represents the number of images; t indexes the frame and p is the pixel point coordinate;
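A direct reading of formula (7) as code (a sketch only; the patent does not prescribe an implementation, and `view_synthesis_loss` is an illustrative name):

```python
import numpy as np

def view_synthesis_loss(targets, reconstructions):
    """L_vs: sum over the M target images and their N pixels of the
    absolute photometric difference |I_t(p) - I_hat_t(p)| (formula (7))."""
    return sum(np.abs(I - I_hat).sum()
               for I, I_hat in zip(targets, reconstructions))
```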
D. Scene depth estimation based on an unsupervised conditional random field residual convolutional neural network
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model, wherein the conditional random field consists of an input layer, seven convolutional layers and two residual terms;
D2. For scene depth estimation, given an image X, the conditional probability of the depth d is written in the form of a Gibbs distribution, as shown in formula (8):

P(d|X) = exp(−E(d, X)) / Z(X)        (8)

wherein E(d, X) is the energy function, and the normalization factor Z(X) is as shown in formula (9):

Z(X) = ∫ exp(−E(d, X)) dd        (9)
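On a discretised set of depth hypotheses, the Gibbs normalization of formulas (8)-(9) reduces to a softmax over negative energies (a toy sketch for intuition only; the patent's Z(X) is an integral over continuous depths):

```python
import numpy as np

def gibbs_prob(energies):
    """P(d|X) = exp(-E(d,X)) / Z(X) over a finite set of depth hypotheses;
    Z(X) is the sum of exp(-E), so the probabilities sum to 1 and lower
    energy means higher probability."""
    w = np.exp(-np.asarray(energies, dtype=float))
    return w / w.sum()
```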
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by the maximum conditional likelihood estimation method, so that the negative log likelihood function L of its loss function is as shown in formula (10):

L = −log P(d|X) = E(d, X) + log Z(X)        (10)
wherein the energy function E(d, X) is defined in a form containing a unary term and a binary term, as shown in formula (11):

E(d, X) = Σ_p U(d_p) + Σ_{(p,j)} D_pj(d_p, d_j)        (11)
wherein d_p and d_j represent the depth values at points p and j; D_pj(d_p, d_j) is the feature expression of the original image, as shown in formula (12), and the unary term is given by the loss function L_vs of the unsupervised residual convolutional neural network scene depth estimation model as in formula (7):

D_pj(d_p, d_j) = (1/2) π_pj (d_p − d_j)²        (12)
wherein π_pj represents the smoothness penalty, used to measure the similarity of adjacent pixel features; U_p and U_j respectively represent the feature values of the pixel points at point p and point j; the smaller the difference between the feature values of two pixel points, the more strongly their depths are penalized towards being similar;
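One common realisation of this binary term, assuming a Gaussian kernel on the feature difference for π_pj (the patent does not fix the exact kernel; `pairwise_energy` and `beta` are illustrative):

```python
import numpy as np

def pairwise_energy(d, U, neighbours, beta=1.0):
    """Binary term of formula (11): sum over neighbouring pairs (p, j) of
    0.5 * pi_pj * (d_p - d_j)^2, with pi_pj = exp(-beta * ||U_p - U_j||^2),
    so pixels with similar features are pushed towards similar depths."""
    e = 0.0
    for p, j in neighbours:
        pi = np.exp(-beta * np.sum((U[p] - U[j]) ** 2))
        e += 0.5 * pi * (d[p] - d[j]) ** 2
    return e
```

Identical features give π_pj = 1 (full smoothness pressure), while very different features drive π_pj toward 0, allowing a depth discontinuity at that edge.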
D4. The mapping from the original target image to the depth map is achieved by using the monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of step A and step B, which forms the unary term part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the binary term part of the model realizes the feature expression of the original image; the outputs of these two parts are used to construct the unsupervised conditional random field residual convolutional neural network scene depth estimation model; similar to the conventional method for training parameters in a conditional random field, the method adopts the maximum conditional likelihood estimation method to train the model, with the negative log likelihood function adopted as the loss function of the model, as shown in formula (13):

L(W) = −log P(d|X; W)        (13)
wherein W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model to be trained; to facilitate taking derivatives of the parameters in the loss function, the following expression is introduced:
Q = B + C − R
wherein B represents an n×n identity matrix, R is the similarity square matrix formed from the terms π_pj, C is the diagonal degree matrix of R, and C − R is a Laplacian matrix; writing z for the depth map regressed by the unary network part, E(d, X) is thus rearranged into formula (14):

E(d, X) = d^T Q d − 2 z^T d + z^T z        (14)

wherein T represents the transpose of the matrix; W_1 and W_2 respectively represent the parameters of the unary term and of the binary term of the unsupervised conditional random field residual convolutional neural network scene depth estimation model; because E(d, X) is a quadratic function of the depth vector d, the normalization function Z(X, W) can be calculated directly, as shown in formula (15):

Z(X, W) = ∫ exp(−E(d, X)) dd = π^{n/2} |Q|^{−1/2} exp( z^T Q^{−1} z − z^T z )        (15)

wherein d_I denotes the depth value of each pixel point on the image, over which the integral runs; combining formulas (14)-(15) yields the probability distribution function shown in formula (16):

P(d|X) = π^{−n/2} |Q|^{1/2} exp( −d^T Q d + 2 z^T d − z^T Q^{−1} z )        (16)

The loss function is then arranged into the negative log likelihood function shown in formula (17):

L(W) = d^T Q d − 2 z^T d + z^T Q^{−1} z − (1/2) log|Q| + (n/2) log π        (17)
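The matrix Q = B + C − R introduced at step D4 can be assembled directly (a sketch; here R is any symmetric similarity matrix built from the π_pj terms, and `build_Q` is an illustrative name):

```python
import numpy as np

def build_Q(R):
    """Q = B + C - R, where B is the n x n identity, C = diag(row sums of R)
    is the degree matrix, and C - R is the graph Laplacian of the
    similarity matrix R."""
    n = R.shape[0]
    B = np.eye(n)
    C = np.diag(R.sum(axis=1))
    return B + C - R
```

Because C − R is positive semi-definite and B is the identity, Q is symmetric positive definite, which is what makes the Gaussian integral for the normalization function well defined.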
D5. The unary part and the binary part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model are trained synchronously and their parameters learned by minimizing the model's loss function using the stochastic gradient descent method; when taking the derivative with respect to the parameter W_1 of the model, because W_1 is related only to the parameters of the unary term, the derivative of L(W) with respect to W_1 is as shown in formula (18);
when taking the derivative with respect to the convolutional neural network parameter W_2, because W_2 is related only to the parameters of the binary term, the derivative of L(W) with respect to W_2 is likewise obtained, as shown in formula (19):
wherein Tr(·) in formulas (18) and (19) represents the trace of a matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910807213.8A CN110503680B (en) | 2019-08-29 | 2019-08-29 | Unsupervised convolutional neural network-based monocular scene depth estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503680A CN110503680A (en) | 2019-11-26 |
CN110503680B (en) | 2023-08-18
Family
ID=68590325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910807213.8A Active CN110503680B (en) | 2019-08-29 | 2019-08-29 | Unsupervised convolutional neural network-based monocular scene depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503680B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111340864B (en) * | 2020-02-26 | 2023-12-12 | 浙江大华技术股份有限公司 | Three-dimensional scene fusion method and device based on monocular estimation |
CN111354030B (en) * | 2020-02-29 | 2023-08-04 | 同济大学 | Method for generating unsupervised monocular image depth map embedded into SENet unit |
CN113822918A (en) * | 2020-04-28 | 2021-12-21 | 深圳市商汤科技有限公司 | Scene depth and camera motion prediction method and device, electronic device and medium |
CN111583345B (en) * | 2020-05-09 | 2022-09-27 | 吉林大学 | Method, device and equipment for acquiring camera parameters and storage medium |
CN112150531B (en) * | 2020-09-29 | 2022-12-09 | 西北工业大学 | Robust self-supervised learning single-frame image depth estimation method |
CN112270692B (en) * | 2020-10-15 | 2022-07-05 | 电子科技大学 | Monocular video structure and motion prediction self-supervision method based on super-resolution |
CN112561947A (en) * | 2020-12-10 | 2021-03-26 | 中国科学院深圳先进技术研究院 | Image self-adaptive motion estimation method and application |
WO2022165722A1 (en) * | 2021-02-04 | 2022-08-11 | 华为技术有限公司 | Monocular depth estimation method, apparatus and device |
CN112767468B (en) * | 2021-02-05 | 2023-11-03 | 中国科学院深圳先进技术研究院 | Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement |
CN113129370B (en) * | 2021-03-04 | 2022-08-19 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113160294B (en) * | 2021-03-31 | 2022-12-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
CN112801074B (en) * | 2021-04-15 | 2021-07-16 | 速度时空信息科技股份有限公司 | Depth map estimation method based on traffic camera |
CN114170286B (en) * | 2021-11-04 | 2023-04-28 | 西安理工大学 | Monocular depth estimation method based on unsupervised deep learning |
TWI823416B (en) * | 2022-06-08 | 2023-11-21 | 鴻海精密工業股份有限公司 | Training method, device, electronic device and storage medium for depth estimation network |
CN116245927B (en) * | 2023-02-09 | 2024-01-16 | 湖北工业大学 | ConvDepth-based self-supervision monocular depth estimation method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180231871A1 (en) * | 2016-06-27 | 2018-08-16 | Zhejiang Gongshang University | Depth estimation method for monocular image based on multi-scale CNN and continuous CRF |
CN108765479A (en) * | 2018-04-04 | 2018-11-06 | 上海工程技术大学 | Using deep learning to monocular view estimation of Depth optimization method in video sequence |
CN110009674A (en) * | 2019-04-01 | 2019-07-12 | 厦门大学 | Monocular image depth of field real-time computing technique based on unsupervised deep learning |
Non-Patent Citations (1)
Title |
---|
Li Yaoyu. "Monocular Image Depth Estimation Based on Structured Deep Learning". Robot. 2017, Vol. 39 (No. 6), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN110503680A (en) | 2019-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503680B (en) | Unsupervised convolutional neural network-based monocular scene depth estimation method | |
CN108416840B (en) | Three-dimensional scene dense reconstruction method based on monocular camera | |
CN111325794B (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
CN108921926B (en) | End-to-end three-dimensional face reconstruction method based on single image | |
CN109584353B (en) | Method for reconstructing three-dimensional facial expression model based on monocular video | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN105654492B (en) | Robust real-time three-dimensional method for reconstructing based on consumer level camera | |
CN110009674B (en) | Monocular image depth of field real-time calculation method based on unsupervised depth learning | |
CN110163974B (en) | Single-image picture reconstruction method based on undirected graph learning model | |
CN112766160A (en) | Face replacement method based on multi-stage attribute encoder and attention mechanism | |
CN112215050A (en) | Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment | |
CN112784736B (en) | Character interaction behavior recognition method based on multi-modal feature fusion | |
CN108932536A (en) | Human face posture method for reconstructing based on deep neural network | |
CN109684969B (en) | Gaze position estimation method, computer device, and storage medium | |
CN108280858B (en) | Linear global camera motion parameter estimation method in multi-view reconstruction | |
CN110853075A (en) | Visual tracking positioning method based on dense point cloud and synthetic view | |
CN111783582A (en) | Unsupervised monocular depth estimation algorithm based on deep learning | |
CN110176023B (en) | Optical flow estimation method based on pyramid structure | |
CN113313732A (en) | Forward-looking scene depth estimation method based on self-supervision learning | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN111462274A (en) | Human body image synthesis method and system based on SMPL model | |
CN113570658A (en) | Monocular video depth estimation method based on depth convolutional network | |
CN115471423A (en) | Point cloud denoising method based on generation countermeasure network and self-attention mechanism | |
Jiang et al. | H2-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||