CN110503680B - Unsupervised convolutional neural network-based monocular scene depth estimation method - Google Patents

Unsupervised convolutional neural network-based monocular scene depth estimation method Download PDF

Info

Publication number
CN110503680B
CN110503680B
Authority
CN
China
Prior art keywords
neural network
image
depth estimation
unsupervised
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910807213.8A
Other languages
Chinese (zh)
Other versions
CN110503680A (en)
Inventor
刘洪波
岳晓彤
江同棒
张博
马茜
王乃尧
杨丽平
林正奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN201910807213.8A priority Critical patent/CN110503680B/en
Publication of CN110503680A publication Critical patent/CN110503680A/en
Application granted granted Critical
Publication of CN110503680B publication Critical patent/CN110503680B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a monocular scene depth estimation method based on an unsupervised convolutional neural network, which comprises the following steps: obtaining the depth value of each pixel of a target image; obtaining the camera pose value that transfers pixel coordinates on the target image to the next frame image; constructing a loss function; and performing depth estimation with the unsupervised conditional random field residual convolutional neural network scene depth estimation model. By adopting an unsupervised method, the invention avoids the difficulty of manual data annotation, saves labor, and improves economic benefit. The invention uses the linear-chain conditional random field idea to realize the feature expression of the original image, and combines it with the unsupervised residual convolutional neural network scene depth estimation model to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The model of the invention is superior to the three compared models in average relative error (rel) and accuracy (acc).

Description

Unsupervised convolutional neural network-based monocular scene depth estimation method
Technical Field
The invention relates to a scene depth estimation method, in particular to an unsupervised convolutional neural network-based monocular scene depth estimation method.
Background
Computer vision is primarily the simulation of biological vision through a computer and associated vision sensors. A camera first acquires images of the outside world, and a computer then converts the images into digital signals so that they can be processed digitally; this gave rise to a new discipline, computer vision, which covers many application fields, including target tracking, image classification, face recognition, scene understanding, and so on. The research objective of computer vision is to enable a computer to observe, understand, and autonomously adapt to its environment in the way a person does.
However, most current computer vision technology is directed at digital image processing. Because the depth information of the real scene and the pose information of the camera are missing during image processing, errors in scene understanding and recognition can arise to a certain extent. Therefore, how to reconstruct the three-dimensional structure of a scene from images using depth information and camera pose information is a very important topic in computer vision. At present, three-dimensional scene reconstruction using a depth map is an important approach, and there are two main ways to acquire the depth information of an image. The traditional way is to acquire depth information directly through hardware such as a laser range finder, but such equipment is difficult to manufacture and expensive, which restricts its popularization. The other way is to acquire the depth information of an image through computer vision technology, i.e. a scene depth estimation method.
Scene depth estimation methods are mainly divided into monocular and binocular scene depth estimation methods. Binocular scene depth estimation first requires the assumption that the optical geometric constraints are unchanged, as in stereo image matching. Monocular scene depth estimation needs no prior assumption, places low requirements on the camera setup, and is convenient to apply; its drawback is that rich scene structure features are difficult to obtain from a monocular image to infer scene depth. In recent years, convolutional neural networks have achieved many excellent results in computer vision. In 2016, Liu et al. combined a convolutional neural network with a conditional random field and proposed the DCNF-FCSP scene depth estimation model, in which the convolutional neural network mainly acquires the low-level features of the image and the conditional random field smooths the estimated depth map according to the similarity of adjacent superpixels. When a depth image dataset is produced, interference from external conditions such as illumination and weather changes prevents the depth sensor from obtaining reliable and accurate image depth information, which may affect the accuracy of the depth estimation model, and such supervised learning raises the problem of difficult manual data annotation. On the other hand, as the number of layers of a convolutional neural network grows, the vanishing-gradient problem may appear, which makes the network harder to train and the obtained results less accurate.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unsupervised convolutional neural network-based monocular scene depth estimation method that requires no manual data annotation and gives more accurate results.
In order to achieve the above object, the technical scheme of the present invention is as follows: an unsupervised convolutional neural network-based monocular scene depth estimation method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive scene images I_{t-1}, I_t, I_{t+1} are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image.
A2. The target image I_t is used as the input of the monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual terms. The feature map obtained after the input target image I_t passes through the convolution layers is expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L), L ∈ {1, 2, ..., L-1}    (1)
H_{L+1} = w_{L+1} T_L + b_{L+1}, L ∈ {1, 2, ..., L-1}    (2)
T_{L+1} = f(H_{L-1} + H_{L+1}), L ∈ {1, 2, ..., L-1}    (3)
where L denotes the convolution layer index, w_L and w_{L+1} denote the weights of the L-th and (L+1)-th convolution layers of the trained monocular depth estimation residual convolutional neural network model, b_L is the bias of the model, f(·) is the activation function of the model, and H_{L-1} and H_{L+1} are the feature maps output by the (L-1)-th and (L+1)-th convolution layers. T_L is the value of the L-th convolution layer before the residual term is applied, and T_{L+1} is the value of the (L+1)-th convolution layer after activation with the residual term.
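Step A2 describes an identity-skip residual unit. The following is a minimal sketch of equations (1)-(3), assuming PyTorch, 3×3 convolutions, and ReLU as the activation f; the exact kernel sizes and channel widths of the patent's seven-layer network are not stated here and are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One residual term: two convolutions plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        # w_L, b_L and w_{L+1}, b_{L+1}: two stacked convolution layers
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, h_prev):
        # Equation (1): T_L = f(w_L H_{L-1} + b_L)
        t = F.relu(self.conv1(h_prev))
        # Equation (2): H_{L+1} = w_{L+1} T_L + b_{L+1}
        h_next = self.conv2(t)
        # Equation (3): T_{L+1} = f(H_{L-1} + H_{L+1}), the identity skip connection
        return F.relu(h_prev + h_next)
```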
A3. A depth linear regression function is added after the final feature map output by the convolution layers, and each pixel in the feature map is mapped to a corresponding depth value, as shown in formula (4):
where G denotes the last deconvolution layer, w_Gd is the weight of the trained monocular depth estimation residual convolutional neural network model, b_Gd denotes the bias vector, H_G denotes the feature map produced by the final deconvolution layer, and the output is the estimated depth value of each pixel.
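One possible reading of step A3, sketched below under the assumption of PyTorch: the per-pixel linear regression w_Gd H_G + b_Gd over the final deconvolution feature map is a 1×1 convolution with a single output channel; the Softplus used to keep depths positive is an extra assumption, not stated in the patent.

```python
import torch.nn as nn

class DepthRegression(nn.Module):
    """Map the final feature map H_G to one depth value per pixel."""
    def __init__(self, in_channels):
        super().__init__()
        # Per-pixel linear regression: depth = w_Gd * H_G + b_Gd
        self.regress = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.positive = nn.Softplus()  # assumption: keep predicted depths positive

    def forward(self, h_g):
        return self.positive(self.regress(h_g))  # shape (B, 1, H, W)
```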
B. Obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
The pose residual convolutional neural network model is used to compute the camera pose value that transfers the pixel coordinates of each point in the current target image I_t to the corresponding pixel coordinates in the next frame image I_{t+1}. The pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual terms. The specific steps are as follows:
B1. Assume two consecutive RGB images I_t and I_{t+1} are given, each of size 426 × 128 × 3, and input them into the pose residual convolutional neural network model.
B2. After the seven convolution layers of the pose residual convolutional neural network model, each of the two images I_t and I_{t+1} yields a corresponding 1 × 768 feature vector AX + b, where A denotes the convolution kernel, X denotes the image feature, i.e. the gray-scale matrix, and b denotes the bias value.
B3. A camera pose estimation algorithm yields the camera pose value from image I_t to image I_{t+1}, i.e. the rotation matrix R and translation vector v expressed in six degrees of freedom, meaning that the pixel coordinates in image I_t can be mapped through the camera pose transformation to the corresponding pixel coordinates in image I_{t+1}.
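A minimal sketch of the pose branch of step B, assuming PyTorch: the two frames are stacked, passed through seven strided convolutions, and regressed to a 6-degree-of-freedom pose vector (three rotation and three translation parameters). The channel widths and the global-average pose head are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Two stacked RGB frames -> 6 input channels; widths are placeholders
        chans = [6, 16, 32, 64, 128, 256, 256, 256]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):   # seven convolution layers
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*layers)
        self.pose = nn.Conv2d(chans[-1], 6, kernel_size=1)  # 6-DoF regression head

    def forward(self, img_t, img_t1):
        x = torch.cat([img_t, img_t1], dim=1)        # (B, 6, H, W)
        feat = self.encoder(x)
        pose = self.pose(feat).mean(dim=[2, 3])      # global average -> (B, 6)
        return pose                                  # (rx, ry, rz, tx, ty, tz)
```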
C. Construction of a loss function
C1. Using the predicted image depth value, the camera motion pose estimate, the target image I_t, and the camera intrinsic parameter K as inputs, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained. This process is called view synthesis, as shown in formula (5).
where K denotes the camera intrinsic parameters, the pose estimate describes the camera motion from time t to time t+1, and the depth value is that of the pixel coordinate point p_t.
C2. View synthesis establishes the correspondence between each pixel of the target image and the next frame image, which serves as the unsupervised signal of the loss function. Because pixel coordinates in an image are discrete integer values, in order to ensure that I_t(p_{t+1}) has a pixel value, bilinear interpolation proportionally combines the values of its four neighbours (upper-left, lower-left, upper-right, and lower-right) to obtain the new image after coordinate transformation, as shown in formula (6).
where ω_ij is linearly related to p_{t+1} and its neighbours, Σω_ij = 1, and ω_ij is a bilinear interpolation parameter; N_p denotes the neighbourhood of pixel coordinate p on the image, i indexes the linear interpolation along the vertical axis within the neighbourhood of p, and j indexes the linear interpolation along the horizontal axis within the neighbourhood of p.
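A minimal sketch of the view synthesis of steps C1-C2, assuming PyTorch and that the 6-DoF pose of step B has already been converted to a 4×4 homogeneous transform: each pixel p_t is back-projected with its predicted depth, moved by the estimated camera motion, re-projected through K, and the next frame is sampled at the resulting sub-pixel location, with F.grid_sample performing the four-neighbour bilinear weighting of formula (6).

```python
import torch
import torch.nn.functional as F

def view_synthesis(img_next, depth_t, T_t_to_next, K):
    """img_next: (B,3,H,W), depth_t: (B,1,H,W), T_t_to_next: (B,4,4), K: (B,3,3)."""
    B, _, H, W = depth_t.shape
    device = depth_t.device
    # Homogeneous pixel grid p_t = (x, y, 1)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, 3, H * W)
    # Back-project: D(p_t) * K^{-1} p_t, apply the rigid motion, project with K
    cam = torch.linalg.inv(K) @ pix * depth_t.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    proj = K @ (T_t_to_next @ cam_h)[:, :3, :]
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    # Normalize coordinates to [-1, 1] and sample I_{t+1} bilinearly (formula (6))
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
    return F.grid_sample(img_next, grid.view(B, H, W, 2),
                         mode="bilinear", align_corners=True)
```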
C3. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model together form the unsupervised residual convolutional neural network scene depth estimation model. Its loss function is obtained through view synthesis and realizes the mapping from the original image to the depth image, as shown in formula (7):
where I_t(p) denotes the pixel at coordinate p on the target image I_t, the second term denotes the corresponding pixel on the reconstructed new image, N denotes the number of pixels, and M denotes the number of images; t indexes the frame and p is the pixel coordinate.
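The loss of formula (7) is the mean absolute photometric difference between the target images and their view-synthesized reconstructions; a minimal sketch, assuming PyTorch:

```python
import torch

def photometric_loss(target_imgs, warped_imgs):
    """Both tensors: (M, 3, H, W), i.e. M images with N = H*W pixels each."""
    # L_VS: average of |I_t(p) - reconstructed pixel| over all images and pixels
    return torch.mean(torch.abs(target_imgs - warped_imgs))
```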
D. Depth estimation based on unsupervised conditional random field residual convolution neural network scene
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the conditional random field consists of an input layer, seven convolution layers, and two residual terms.
D2. For scene depth estimation, given an image X, the conditional probability of the depth is written in the form of a Gibbs distribution, as shown in formula (8):
where the energy function appears in the exponent, and the normalization factor Z(X) is as shown in formula (9):
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by maximum conditional likelihood estimation, so its loss function is the negative log-likelihood function L, as shown in formula (10):
where the energy function is defined as a form containing a unary term and a binary term, as shown in formula (11):
where d_p and d_j denote the depth values at points p and j. D_ij(d_p, d_j) is the feature expression of the original image, as shown in formula (12), and the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is as shown in formula (7):
where π_pj is a smoothness penalty that measures the similarity of the features of neighbouring pixels. U_p and U_j denote the feature values of the pixels at points p and j; the smaller the difference between the two feature values, the more the two pixels are penalized toward similar depths.
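The exact functional form of D_ij in formula (12) is not reproduced in the text above, so the sketch below only illustrates the idea of the binary term of formula (11) under a common assumption: a Gaussian feature-similarity weight π_pj multiplying the squared depth difference of neighbouring pixels (PyTorch).

```python
import torch

def pairwise_energy(depth, features, gamma=1.0):
    """depth: (B,1,H,W); features U: (B,C,H,W). Assumed form, not the patent's exact D_ij."""
    def term(dim):
        # Differences between each pixel and its neighbour along one image axis.
        # torch.roll wraps at the border; acceptable for a sketch.
        d_diff = depth - torch.roll(depth, shifts=1, dims=dim)
        f_diff = features - torch.roll(features, shifts=1, dims=dim)
        pi = torch.exp(-gamma * (f_diff ** 2).sum(dim=1, keepdim=True))  # similarity pi_pj
        return 0.5 * pi * d_diff ** 2   # similar features push depths to agree
    return (term(2) + term(3)).sum()    # vertical (dim 2) and horizontal (dim 3) neighbours
```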
D4. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of steps A and B realize the mapping from the original target image to the depth map and form the unary part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The conditional random field corresponds to the binary part of the model and realizes the feature expression of the original image. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed from the outputs of these two parts. Similar to the conventional training of the parameters of a conditional random field, the model is trained by maximum conditional likelihood estimation, and the negative log-likelihood function is adopted as its loss function L(W), as shown in formula (13):
where W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model to be trained. To facilitate differentiating the loss function with respect to its parameters, the following expression is introduced:
Q=B+C-R
where B denotes the n × n identity matrix, R is the square similarity matrix formed by the π_pj values, C is the diagonal matrix of R, and C - R is a Laplacian matrix. The distribution can thus be arranged into formula (14):
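A minimal sketch, assuming PyTorch, of how the matrix Q = B + C - R of formula (14) can be assembled from the pairwise similarities π_pj:

```python
import torch

def build_Q(R):
    """R: (n, n) symmetric matrix of pi_pj for neighbouring pixels, 0 elsewhere."""
    n = R.shape[0]
    B = torch.eye(n, dtype=R.dtype, device=R.device)  # n x n identity
    C = torch.diag(R.sum(dim=1))                      # diagonal (degree) matrix of R
    return B + C - R                                  # identity plus the graph Laplacian C - R
```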
where T denotes the matrix transpose. W_1 and W_2 denote the parameters of the unary term and the binary term of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, respectively. Since the energy is a function of the depth vector, the normalization function Z(X, W) can be computed directly, as shown in formula (15):
where d_I denotes the depth value of each pixel on the image. Combining formulas (14) and (15) gives the probability distribution function shown in formula (16):
The loss function is then arranged into the negative log-likelihood function shown in formula (17):
D5. By minimizing the loss function of the unsupervised conditional random field residual convolutional neural network scene depth estimation model with stochastic gradient descent, the unary part and the binary part of the model are trained and their parameters are learned synchronously. When differentiating with respect to the parameter W_1 of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, L(W) depends only on the parameters of the unary term, so the derivative with respect to W_1 is as shown in formula (18).
When differentiating with respect to the parameter W_2, L(W) depends only on the parameters of the binary term, so the derivative with respect to W_2 is as shown in formula (19):
p in formulas (18), (19) r (·)、T r (. Cndot.) all represent traces of the matrix.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts an unsupervised residual convolutional neural network scene depth estimation model. Step A obtains the depth value of each pixel of the original image; step B obtains the camera pose value that transfers the pixel coordinates of each point in the current target image I_t to the corresponding pixel coordinates in the next frame image I_{t+1}; step C1 reconstructs a new image; step C2 obtains the supervisory signal of the loss function; and step C3 finally obtains the loss function of the unsupervised model using the view synthesis idea, thereby realizing an unsupervised method. The unsupervised method adopted by the invention solves the problem of difficult manual data annotation, saves labor, and improves economic benefit.
2. The invention adopts the linear-chain conditional random field idea to realize the feature expression of the original image, and combines it with the unsupervised residual convolutional neural network scene depth estimation model to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model, which was compared experimentally on the same dataset with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc) and roughly on par with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this parity can itself be regarded as an advantage over the other three models.
Drawings
The invention is illustrated by 3 figures, in which:
fig. 1 is a flow chart of the present invention.
Fig. 2 is an original image of a scene.
Fig. 3 is a scene depth image obtained after convergence of model training.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The scene image is processed according to the flow chart shown in fig. 1. First, a camera is used to capture video of the scene to be processed and images of consecutive frames are selected; these serve as the original images for scene depth estimation, as shown in fig. 2. According to steps A, B, and C of the invention, the mapping from the original image to the depth map is realized using the view synthesis idea, and the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is obtained, as shown in formula (7). Then, according to step D of the invention, the feature expression D_ij(d_p, d_j) of the target image is obtained, and the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed from the outputs of the two parts. By minimizing the model loss function with stochastic gradient descent, the model is trained and its parameters are learned. After the model training converges, the final scene depth map is obtained, as shown in fig. 3.
On the same dataset, the present invention was compared experimentally with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al., as shown in Table 1. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc) and roughly on par with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this parity can itself be regarded as an advantage over the other three models.
Table 1 comparison of experimental results
The present invention is not limited to the present embodiment; any equivalent concept or modification within the technical scope of the present invention falls within the protection scope of the present invention.

Claims (1)

1. An unsupervised convolutional neural network-based monocular scene depth estimation method is characterized by comprising the following steps of: the method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive scene images I_{t-1}, I_t, I_{t+1} are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image;
A2. The target image I_t is used as the input of the monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual terms; the feature map obtained after the input target image I_t passes through the convolution layers is expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L), L ∈ {1, 2, ..., L-1}    (1)
H_{L+1} = w_{L+1} T_L + b_{L+1}, L ∈ {1, 2, ..., L-1}    (2)
T_{L+1} = f(H_{L-1} + H_{L+1}), L ∈ {1, 2, ..., L-1}    (3)
where L denotes the convolution layer index, w_L and w_{L+1} denote the weights of the L-th and (L+1)-th convolution layers of the trained monocular depth estimation residual convolutional neural network model, b_L is the bias of the model, f(·) is the activation function of the model, and H_{L-1} and H_{L+1} are the feature maps output by the (L-1)-th and (L+1)-th convolution layers; T_L is the value of the L-th convolution layer before the residual term is applied, and T_{L+1} is the value of the (L+1)-th convolution layer after activation with the residual term;
A3. A depth linear regression function is added after the final feature map output by the convolution layers, and each pixel in the feature map is mapped to a corresponding depth value, as shown in formula (4):
where G denotes the last deconvolution layer, w_Gd is the weight of the trained monocular depth estimation residual convolutional neural network model, b_Gd denotes the bias vector, H_G denotes the feature map produced by the final deconvolution layer, and the output is the estimated depth value of each pixel;
B. obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
The pose residual convolutional neural network model is used to compute the camera pose value that transfers the pixel coordinates of each point in the current target image I_t to the corresponding pixel coordinates in the next frame image I_{t+1}; the pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual terms, and comprises the following specific steps:
B1. Assume two consecutive RGB images I_t and I_{t+1} are given, each of size 426 × 128 × 3, and input them into the pose residual convolutional neural network model;
B2. After the seven convolution layers of the pose residual convolutional neural network model, each of the two images I_t and I_{t+1} yields a corresponding 1 × 768 feature vector AX + b, where A denotes the convolution kernel, X denotes the image feature, i.e. the gray-scale matrix, and b denotes the bias value;
B3. A camera pose estimation algorithm yields the camera pose value from image I_t to image I_{t+1}, i.e. the rotation matrix R and translation vector v expressed in six degrees of freedom, meaning that the pixel coordinates in image I_t can be mapped through the camera pose transformation to the corresponding pixel coordinates in image I_{t+1};
C. construction of a loss function
C1. Using the predicted image depth value, the camera motion pose estimate, the target image I_t, and the camera intrinsic parameter K as inputs, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained; this process is called view synthesis, as shown in formula (5);
where K denotes the camera intrinsic parameters, the pose estimate describes the camera motion from time t to time t+1, and the depth value is that of the pixel coordinate point p_t;
C2. View synthesis establishes the correspondence between each pixel of the target image and the next frame image, which serves as the unsupervised signal of the loss function; because pixel coordinates in an image are discrete integer values, in order to ensure that I_t(p_{t+1}) has a pixel value, bilinear interpolation proportionally combines the values of its four neighbours, namely the upper-left, lower-left, upper-right, and lower-right values, to obtain the new image after coordinate transformation, as shown in formula (6);
where ω_ij is linearly related to p_{t+1} and its neighbours, Σω_ij = 1, and ω_ij is a bilinear interpolation parameter; N_p denotes the neighbourhood of pixel coordinate p on the image, i indexes the linear interpolation along the vertical axis within the neighbourhood of p, and j indexes the linear interpolation along the horizontal axis within the neighbourhood of p;
C3. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model together form the unsupervised residual convolutional neural network scene depth estimation model; its loss function is obtained through view synthesis and realizes the mapping from the original image to the depth image, as shown in formula (7):
where I_t(p) denotes the pixel at coordinate p on the target image I_t, the second term denotes the corresponding pixel on the reconstructed new image, N denotes the number of pixels, and M denotes the number of images; t indexes the frame and p is the pixel coordinate;
D. depth estimation based on unsupervised conditional random field residual convolution neural network scene
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the conditional random field consists of an input layer, seven convolution layers, and two residual terms;
D2. For scene depth estimation, given an image X, the conditional probability of the depth is written in the form of a Gibbs distribution, as shown in formula (8):
where the energy function appears in the exponent, and the normalization factor Z(X) is as shown in formula (9):
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by maximum conditional likelihood estimation, so its loss function is the negative log-likelihood function L, as shown in formula (10):
where the energy function is defined as a form containing a unary term and a binary term, as shown in formula (11):
where d_p and d_j denote the depth values at points p and j; D_ij(d_p, d_j) is the feature expression of the original image, as shown in formula (12), and the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is as shown in formula (7):
where π_pj is a smoothness penalty that measures the similarity of the features of neighbouring pixels; U_p and U_j denote the feature values of the pixels at points p and j, and the smaller the difference between the two feature values, the more the two pixels are penalized toward similar depths;
D4. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of steps A and B realize the mapping from the original target image to the depth map and form the unary part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the conditional random field corresponds to the binary part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model and realizes the feature expression of the original image; the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed from the outputs of these two parts; similar to the conventional training of the parameters of a conditional random field, the model is trained by maximum conditional likelihood estimation, and the negative log-likelihood function is adopted as its loss function L(W), as shown in formula (13):
where W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model to be trained; to facilitate differentiating the loss function with respect to its parameters, the following expression is introduced:
Q = B + C - R
where B denotes the n × n identity matrix, R is the square similarity matrix formed by the π_pj values, C is the diagonal matrix of R, and C - R is a Laplacian matrix; the distribution can thus be arranged into formula (14):
where T denotes the matrix transpose; W_1 and W_2 denote the parameters of the unary term and the binary term of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, respectively; since the energy is a function of the depth vector, the normalization function Z(X, W) can be computed directly, as shown in formula (15):
where d_I denotes the depth value of each pixel on the image; combining formulas (14) and (15) gives the probability distribution function shown in formula (16):
the loss function is then arranged into the negative log-likelihood function shown in formula (17):
D5. By minimizing the loss function of the unsupervised conditional random field residual convolutional neural network scene depth estimation model with stochastic gradient descent, the unary part and the binary part of the model are trained and their parameters are learned synchronously; when differentiating with respect to the parameter W_1 of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, L(W) depends only on the parameters of the unary term, so the derivative with respect to W_1 is as shown in formula (18);
when differentiating with respect to the parameter W_2, L(W) depends only on the parameters of the binary term, so the derivative with respect to W_2 is as shown in formula (19):
Pr(·) and Tr(·) in formulas (18) and (19) both denote the trace of the matrix.
CN201910807213.8A 2019-08-29 2019-08-29 Unsupervised convolutional neural network-based monocular scene depth estimation method Active CN110503680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910807213.8A CN110503680B (en) 2019-08-29 2019-08-29 Unsupervised convolutional neural network-based monocular scene depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910807213.8A CN110503680B (en) 2019-08-29 2019-08-29 Unsupervised convolutional neural network-based monocular scene depth estimation method

Publications (2)

Publication Number Publication Date
CN110503680A CN110503680A (en) 2019-11-26
CN110503680B true CN110503680B (en) 2023-08-18

Family

ID=68590325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910807213.8A Active CN110503680B (en) 2019-08-29 2019-08-29 Unsupervised convolutional neural network-based monocular scene depth estimation method

Country Status (1)

Country Link
CN (1) CN110503680B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340864B (en) * 2020-02-26 2023-12-12 浙江大华技术股份有限公司 Three-dimensional scene fusion method and device based on monocular estimation
CN111354030B (en) * 2020-02-29 2023-08-04 同济大学 Method for generating unsupervised monocular image depth map embedded into SENet unit
CN113822918A (en) * 2020-04-28 2021-12-21 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN111583345B (en) * 2020-05-09 2022-09-27 吉林大学 Method, device and equipment for acquiring camera parameters and storage medium
CN112150531B (en) * 2020-09-29 2022-12-09 西北工业大学 Robust self-supervised learning single-frame image depth estimation method
CN112270692B (en) * 2020-10-15 2022-07-05 电子科技大学 Monocular video structure and motion prediction self-supervision method based on super-resolution
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
WO2022165722A1 (en) * 2021-02-04 2022-08-11 华为技术有限公司 Monocular depth estimation method, apparatus and device
CN112767468B (en) * 2021-02-05 2023-11-03 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN112801074B (en) * 2021-04-15 2021-07-16 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN114170286B (en) * 2021-11-04 2023-04-28 西安理工大学 Monocular depth estimation method based on unsupervised deep learning
TWI823416B (en) * 2022-06-08 2023-11-21 鴻海精密工業股份有限公司 Training method, device, electronic device and storage medium for depth estimation network
CN116245927B (en) * 2023-02-09 2024-01-16 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180231871A1 (en) * 2016-06-27 2018-08-16 Zhejiang Gongshang University Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN108765479A (en) * 2018-04-04 2018-11-06 上海工程技术大学 Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180231871A1 (en) * 2016-06-27 2018-08-16 Zhejiang Gongshang University Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN108765479A (en) * 2018-04-04 2018-11-06 上海工程技术大学 Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yaoyu, "Monocular Image Depth Estimation Based on Structured Deep Learning", Robot, 2017, Vol. 39, No. 6, full text. *

Also Published As

Publication number Publication date
CN110503680A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN109584353B (en) Method for reconstructing three-dimensional facial expression model based on monocular video
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN105654492B (en) Robust real-time three-dimensional method for reconstructing based on consumer level camera
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN110163974B (en) Single-image picture reconstruction method based on undirected graph learning model
CN112766160A (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
CN108932536A (en) Human face posture method for reconstructing based on deep neural network
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
CN108280858B (en) Linear global camera motion parameter estimation method in multi-view reconstruction
CN110853075A (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
CN110176023B (en) Optical flow estimation method based on pyramid structure
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN115471423A (en) Point cloud denoising method based on generation countermeasure network and self-attention mechanism
Jiang et al. H $ _ {2} $-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant