CN110503680B - Unsupervised convolutional neural network-based monocular scene depth estimation method - Google Patents
- Publication number
- Publication number: CN110503680B (application CN201910807213.8A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- image
- depth estimation
- unsupervised
- residual
- Prior art date
- Legal status (assumed; not a legal conclusion)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an unsupervised convolutional neural network-based monocular scene depth estimation method, which comprises the following steps: obtaining the depth value of each pixel point of a target image; acquiring the camera pose value obtained when pixel coordinates on the target image are transferred to the next frame image; constructing a loss function; and performing depth estimation based on the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The invention adopts an unsupervised method that well solves the problem of difficult manual data annotation, saves manpower, and improves economic benefit. The invention adopts the linear-chain conditional random field idea to realize the feature expression of the original image, and combines it with the unsupervised residual convolutional neural network scene depth estimation model to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The model of the invention is superior to the other three compared models in average relative error (rel) and accuracy (acc).
Description
Technical Field
The invention relates to a scene depth estimation method, in particular to an unsupervised convolutional neural network-based monocular scene depth estimation method.
Background
Computer vision is primarily the simulation of biological vision by a computer and associated vision sensors. External images are first acquired with a camera and then converted into digital signals by a computer, realizing digital processing of the images; from this a new discipline, computer vision, was created. It involves many application fields, including target tracking, image classification, face recognition, scene understanding, and so on. The research objective of computer vision is to give a computer the ability to observe its environment, understand it, and adapt to it autonomously, like a person.
However, most current computer vision technology is directed at digital image processing. Because depth information of the real scene and pose information of the camera are lacking during image processing, errors in scene understanding and recognition can arise to a certain extent. Therefore, how to reconstruct the three-dimensional structure of a scene from images using depth information and camera pose information is a very important topic in computer vision. At present, three-dimensional scene reconstruction using a depth map is an important approach, and there are two main methods for acquiring the depth information of an image. The traditional method is to acquire depth information directly through hardware equipment, such as a laser range finder; but such equipment is difficult to manufacture, costly, and expensive, which restricts its popularization. The other method is to acquire the depth information of an image through computer vision technology, i.e., the scene depth estimation method.
Scene depth estimation methods are mainly divided into monocular and binocular methods. Binocular scene depth estimation must first be performed under the assumption that the optical geometric constraints are unchanged, as in stereo image matching. Monocular scene depth estimation needs no prior assumption, places low requirements on the camera setup, and is convenient to apply; its drawback is that rich scene structure features from which to infer scene depth are difficult to obtain from a monocular image. In recent years, convolutional neural networks have achieved many excellent results in the field of computer vision. In 2016, Liu et al. combined a convolutional neural network with a conditional random field and proposed the DCNF-FCSP scene depth estimation model, in which the convolutional neural network mainly acquires the low-level features of the image and the conditional random field smooths the estimated depth map according to the similarity of adjacent superpixels. When a depth image dataset is produced, interference from various external conditions such as illumination and weather changes prevents the depth sensor from obtaining reliable and accurate image depth information, which may affect the accuracy of the estimation result of a depth estimation model; moreover, the supervised learning method suffers from the difficulty of manual data annotation. On the other hand, as the number of layers of a convolutional neural network increases, the vanishing-gradient problem may arise, making the network harder to train, so the obtained results are not accurate enough.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unsupervised convolutional neural network-based monocular scene depth estimation method that requires no manual data annotation and produces more accurate results.
In order to achieve the above object, the technical scheme of the present invention is as follows: an unsupervised convolutional neural network-based monocular scene depth estimation method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive images I_{t-1}, I_t, I_{t+1} of a scene are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image.
A2. Use the target image I_t as the input of a monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual items. The feature maps obtained after the input target image I_t passes through the convolution layers are expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L),  L ∈ {1, 2, ..., L-1}   (1)
H_{L+1} = w_{L+1} T_L + b_{L+1},  L ∈ {1, 2, ..., L-1}   (2)
T_{L+1} = f(H_{L-1} + H_{L+1}),  L ∈ {1, 2, ..., L-1}   (3)
where L denotes the index of the convolution layer; w_L and w_{L+1} denote the weight values of the layer-L and layer-(L+1) convolution layers of the monocular depth estimation residual convolutional neural network model being trained; b_L denotes the bias value for training the model; f(·) denotes the activation function in the model; H_{L-1} and H_{L+1} denote the feature maps output by the layer-(L-1) and layer-(L+1) convolution layers. T_L denotes the output of the layer-L convolution layer before the residual term, and T_{L+1} denotes the output of the layer-(L+1) convolution layer after activation through the residual term.
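Since the formula images are not reproduced in this text, the following minimal NumPy sketch illustrates how equations (1)–(3) chain together inside one residual unit. Plain matrix products stand in for the convolutions, and the shapes, initial values, and ReLU choice are illustrative assumptions, not the patent's actual layer configuration.

```python
import numpy as np

def relu(x):
    # Assumed activation f(.); the patent does not name its activation.
    return np.maximum(x, 0.0)

def residual_block(h_prev, w_l, b_l, w_next, b_next):
    """One residual unit following equations (1)-(3).

    h_prev: feature map H_{L-1} from the previous layer (a flat vector here);
    convolutions are replaced by matrix products for illustration.
    """
    t_l = relu(w_l @ h_prev + b_l)        # eq. (1): T_L = f(w_L H_{L-1} + b_L)
    h_next = w_next @ t_l + b_next        # eq. (2): H_{L+1} = w_{L+1} T_L + b_{L+1}
    t_next = relu(h_prev + h_next)        # eq. (3): skip connection, T_{L+1} = f(H_{L-1} + H_{L+1})
    return t_next

rng = np.random.default_rng(0)
h = rng.standard_normal(8)
w1, w2 = rng.standard_normal((8, 8)) * 0.1, rng.standard_normal((8, 8)) * 0.1
out = residual_block(h, w1, np.zeros(8), w2, np.zeros(8))
```

Note that equation (3) adds the pre-block input H_{L-1} back in, which is what lets gradients bypass the two weighted layers and mitigates the vanishing-gradient problem mentioned in the Background.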
A3. A depth linear regression function is added after the final feature map output by the deconvolution layers, mapping each pixel point in the feature map to a corresponding depth value, as shown in formula (4):
where G denotes the last deconvolution layer, w_Gd denotes the weight of the monocular depth estimation residual convolutional neural network model being trained, b_Gd denotes the bias vector, H_G denotes the feature map obtained by the final deconvolution layer, and d̂ denotes the resulting depth value.
B. Obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
Use a pose residual convolutional neural network model to compute the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1}. The pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual items. The specific steps are as follows:
b1, assume a given image I of two consecutive RGB t 、I t+1 And the sizes are 426 multiplied by 128 multiplied by 3, and the residual volumes are input into the pose residual volumeAnd (5) integrating the neural network model.
B2. After the seven convolution layers of the pose residual convolutional neural network model, a feature vector AX + b of size 1 × 768 is obtained for each of the two images I_t, I_{t+1}, where A denotes the convolution kernel, X denotes the image feature, i.e., the gray-level matrix, and b denotes the bias value.
B3. A camera pose estimation algorithm yields the camera pose value T̂ from image I_t to image I_{t+1}, i.e., the rotation matrix R and translation vector v, expressed in six degrees of freedom. This means that the coordinates of pixels in image I_t can be located at the corresponding pixel coordinates in image I_{t+1} through the camera pose transformation value.
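As a hedged illustration of step B3, the sketch below expands a six-degree-of-freedom pose vector into the rotation matrix R and translation vector v. The patent does not specify its rotation parameterization, so the axis-angle convention and the Rodrigues expansion used here are assumptions.

```python
import numpy as np

def pose_vec_to_rt(pose6):
    """Convert an assumed 6-DoF vector (rx, ry, rz, tx, ty, tz) into a
    rotation matrix R and translation vector v, as step B3 describes.
    The axis-angle part is expanded with the Rodrigues formula."""
    r = np.asarray(pose6[:3], float)
    v = np.asarray(pose6[3:], float)
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3), v                 # near-zero rotation: identity
    k = r / theta                           # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # skew-symmetric cross-product matrix
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return R, v

R, v = pose_vec_to_rt([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
```

A 90° rotation about the z-axis maps the x-axis onto the y-axis, which is a quick sanity check that the expansion produces a proper rotation.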
C. Construction of a loss function
C1. Using the predicted image depth value D̂, the camera motion pose estimate T̂, the target image I_t, and the camera built-in (intrinsic) parameter K as input, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained. This process is called view synthesis, as shown in equation (5).
where K denotes the built-in (intrinsic) parameters of the camera, T̂_{t→t+1} denotes the pose estimate of the camera moving from time t to time t+1, and D̂_t(p_t) denotes the depth value at pixel coordinate point p_t.
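Equation (5) itself is not reproduced in this text. The sketch below assumes the standard view-synthesis projection p_{t+1} ~ K T̂ (D̂(p_t) K⁻¹ p_t) used in unsupervised depth estimation; the intrinsic values are illustrative only.

```python
import numpy as np

def project_pixel(p_t, depth, K, R, v):
    """Map a pixel p_t in frame t to its location in frame t+1, assuming
    the relation p_{t+1} ~ K [R|v] (depth * K^{-1} p_t) sketched for eq. (5)."""
    p_h = np.array([p_t[0], p_t[1], 1.0])          # homogeneous pixel coordinate
    cam_point = depth * (np.linalg.inv(K) @ p_h)   # back-project to 3-D camera coords
    cam_next = R @ cam_point + v                   # rigid motion T̂ to frame t+1
    p_next = K @ cam_next                          # re-project with intrinsics
    return p_next[:2] / p_next[2]                  # perspective divide

# Illustrative intrinsics (focal length 100, principal point (64, 64)).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
p_next = project_pixel((10.0, 20.0), 5.0, K, np.eye(3), np.zeros(3))
```

With zero camera motion (R = I, v = 0) every pixel projects back onto itself regardless of depth, which is a useful correctness check for the warp.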
C2. Through view synthesis, the correspondence between each pixel point of the target image and the next frame image is found and used as the unsupervised signal of the loss function. Because the coordinate values in an image are all discrete integers, in order to guarantee that I_t(p_{t+1}) has a pixel value, the values of the four neighbours of I_t(p_{t+1}) (top-left, bottom-left, top-right, bottom-right) are proportionally blended by bilinear interpolation to obtain Î_t, representing the new image after coordinate conversion, as shown in equation (6).
where ω_ij is linearly related to the distance between p_{t+1} and its neighbours, Σω_ij = 1, and the ω_ij are the bilinear interpolation weights; N_p denotes the neighbourhood of pixel coordinate p on the image, i denotes linear interpolation along the vertical axis within the neighbourhood of p, and j denotes linear interpolation along the horizontal axis within the neighbourhood of p.
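The four-neighbour blending described in C2 can be sketched as follows; the function name and single-channel image layout are illustrative assumptions, but the weights follow the standard bilinear scheme in which the four ω_ij sum to 1.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image at continuous coordinate (x, y) by
    blending the four integer neighbours (top-left, top-right,
    bottom-left, bottom-right) with weights that sum to 1, as in eq. (6)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                     # fractional offsets
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.arange(16, dtype=float).reshape(4, 4)
center = bilinear_sample(img, 1.5, 1.5)   # average of the four surrounding pixels
```

Because the weights are differentiable in (x, y), this sampling step lets the photometric loss propagate gradients back through the projected coordinates to the depth and pose networks.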
And C3, jointly forming an unsupervised residual error convolutional neural network scene depth estimation model by the monocular depth estimation residual error convolutional neural network model and the pose residual error convolutional neural network model, obtaining a loss function of the unsupervised residual error convolutional neural network scene depth estimation model by view synthesis, and realizing mapping from an original image to a depth image, wherein the loss function is shown in a formula (7):
where I_t(p) denotes the pixel at coordinate p on the target image I_t, Î_t(p) denotes the pixel at coordinate p on the reconstructed new image, N denotes the number of pixel points, and M denotes the number of images; t indexes frames, and p is a pixel coordinate.
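Formula (7) is not reproduced in this text. Assuming, as the surrounding description states, that the loss averages the absolute photometric difference between I_t(p) and Î_t(p) over the N pixels and M images, a minimal sketch is:

```python
import numpy as np

def view_synthesis_loss(targets, reconstructions):
    """Mean absolute photometric error over M images and N pixels each,
    a sketch of the unsupervised loss L_VS described for formula (7)."""
    total = 0.0
    for i_t, i_hat in zip(targets, reconstructions):
        total += np.abs(i_t - i_hat).mean()     # average over the N pixels
    return total / len(targets)                 # average over the M images

targets = [np.zeros((2, 2)), np.ones((2, 2))]
recons = [np.full((2, 2), 0.5), np.ones((2, 2))]
loss = view_synthesis_loss(targets, recons)
```

Minimizing this quantity needs no ground-truth depth labels: the "supervision" comes entirely from how well the warped next frame reconstructs the target frame.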
D. Depth estimation based on unsupervised conditional random field residual convolution neural network scene
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the conditional random field consists of an input layer, seven convolution layers, and two residual items.
D2. For scene depth estimation, given an image X, the conditional probability of the depth d̂ is written in the form of a Gibbs distribution, as shown in equation (8):
where E(d̂, X) is the energy function; the normalization factor Z(X) is as shown in equation (9):
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by the maximum conditional likelihood estimation method, so the negative log-likelihood function L serving as the model's loss function is as shown in formula (10):
where the energy function E(d̂, X) is defined in a form containing a univariate term and a bivariate term, as shown in formula (11):
where d_p and d_j denote the depth values at points p and j. D_ij(d_p, d_j) is the feature expression of the original image, as shown in formula (12); the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is as shown in formula (7):
where π_pj denotes a smoothness penalty measuring the similarity of the features of neighbouring pixels. U_p and U_j denote the feature values of the pixel points at p and j; the smaller the difference between the two feature values, the more strongly the depths of the two pixel points are penalized toward similarity.
D4. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of steps A and B map the original target image to the depth map and form the univariate part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model. The conditional random field corresponds to the bivariate part of the model and realizes the feature expression of the original image. Based on the outputs of these two parts, the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed. Similar to the traditional way of training the parameters of a conditional random field, the model is trained by the maximum conditional likelihood estimation method, and the negative log-likelihood is adopted as its loss function L(W), as shown in formula (13):
where W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model being trained. To facilitate differentiating the loss function with respect to the parameters, the following expression is introduced:
Q=B+C-R
where B denotes an n × n identity matrix, R is the similarity matrix (a square matrix formed from the π_pj), C is the diagonal matrix of R, and C − R is a Laplacian matrix. The loss can thus be rearranged into formula (14):
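Interpreting C as the diagonal degree matrix of the similarity matrix R (an assumption, but the one consistent with C − R being a graph Laplacian), the construction of Q = B + C − R can be sketched as:

```python
import numpy as np

def build_q(similarity):
    """Build Q = B + C - R from a symmetric similarity matrix R whose
    entries are the smoothness penalties pi_pj: B is the identity, C the
    assumed diagonal degree matrix of R, and C - R the graph Laplacian."""
    R = np.asarray(similarity, float)
    n = R.shape[0]
    B = np.eye(n)
    C = np.diag(R.sum(axis=1))   # degree matrix: row sums of R on the diagonal
    L = C - R                    # Laplacian: each row sums to zero
    return B + L

Q = build_q([[0.0, 1.0],
             [1.0, 0.0]])
```

Adding the identity B makes Q strictly positive definite even though the Laplacian alone is only positive semi-definite, which is what allows the partition function Z(X, W) to be evaluated in closed form in the following equations.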
where T denotes the matrix transpose. θ1 and θ2 denote the parameters of the univariate and bivariate terms of the unsupervised conditional random field residual convolutional neural network scene depth estimation model, respectively. Since E(d̂, X) is a quadratic function of the depth vector d̂, the function Z(X, W) can be computed directly, as shown in equation (15):
where d_I denotes the depth value of each pixel point on the image. Combining equations (14)–(15) yields the probability distribution function shown in equation (16):
The loss function is then rearranged into the negative log-likelihood function shown in equation (17):
D5. By minimizing the loss function of the unsupervised conditional random field residual convolutional neural network scene depth estimation model using the stochastic gradient descent method, the univariate and bivariate parts of the model are trained synchronously and the parameters are learned. When differentiating with respect to the model parameter θ1, since θ1 is related only to the univariate term, the derivative of L(W) with respect to θ1 is as shown in equation (18).
When differentiating with respect to the convolutional neural network parameter θ2, likewise, since θ2 is related only to the bivariate term, the derivative of L(W) with respect to θ2 is as shown in formula (19):
In formulas (18) and (19), Tr(·) denotes the trace of a matrix.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts an unsupervised residual convolutional neural network scene depth estimation model: the depth value d̂ of each pixel of the original image is obtained in step A; the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1} is obtained in step B; a new image Î_t is reconstructed in step C1; the unsupervised signal of the loss function is obtained in step C2; and finally the loss function of the unsupervised model is obtained using the view-synthesis idea in step C3, thereby realizing an unsupervised method. The unsupervised method adopted by the invention well solves the problem of difficult manual data annotation, saves manpower, and improves economic benefit.
2. The invention adopts the linear-chain conditional random field idea to realize the feature expression of the original image. Combining it with the unsupervised residual convolutional neural network scene depth estimation model forms the unsupervised conditional random field residual convolutional neural network scene depth estimation model, which was experimentally compared under the same dataset with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc), and is almost level with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this level state can also be considered superior to the other three models.
Drawings
The invention is illustrated by 3 figures, in which:
fig. 1 is a flow chart of the present invention.
Fig. 2 is an original image of a scene.
Fig. 3 is a scene depth image obtained after convergence of model training.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The scene image is processed according to the flow chart shown in fig. 1. First, a camera is used to shoot video of the scene to be processed, and images of consecutive frames are selected as the original images for the scene depth estimation of the invention, as shown in fig. 2. According to steps A, B, and C of the invention, the mapping from the original image to the depth map is realized using the view-synthesis idea, and the loss function L_VS of the unsupervised residual convolutional neural network scene depth estimation model is obtained, as shown in formula (7). Then, according to step D of the invention, the feature expression D_ij(d_p, d_j) of the target image is obtained, and the unsupervised conditional random field residual convolutional neural network scene depth estimation model is constructed based on the outputs of the two parts. By minimizing the model loss function, training and parameter learning of the model are achieved using the stochastic gradient descent method. After the model training converges, the final scene depth map is obtained, as shown in fig. 3.
The present invention was experimentally compared under the same dataset with the supervised depth estimation model proposed by Eigen et al., the supervised depth estimation model proposed by Liu et al., and the unsupervised depth estimation model proposed by Godard et al., as shown in Table 1. The model of the invention is superior to the other three models in average relative error (rel) and accuracy (acc), and is almost level with them in root mean square error (rms) and average logarithmic error (log); since the invention performs depth estimation with an unsupervised method, this level state can also be considered superior to the other three models.
Table 1 comparison of experimental results
The present invention is not limited to this embodiment; any equivalent concept or modification within the technical scope disclosed by the present invention falls within the protection scope of the present invention.
Claims (1)
1. An unsupervised convolutional neural network-based monocular scene depth estimation method is characterized by comprising the following steps of: the method comprises the following steps:
A. obtaining depth values of all pixel points of target image
A1. Assume three consecutive images I_{t-1}, I_t, I_{t+1} of a scene are input, where I_t denotes the current frame image, I_{t-1} the previous frame image, and I_{t+1} the next frame image; the subscript t denotes the current frame, and I_t is defined as the target image;
A2. Use the target image I_t as the input of a monocular depth estimation residual convolutional neural network model, which comprises an input layer, seven convolution layers, seven deconvolution layers, and four residual items; the feature maps obtained after the input target image I_t passes through the convolution layers are expressed by the following formulas:
T_L = f(w_L H_{L-1} + b_L),  L ∈ {1, 2, ..., L-1}   (1)
H_{L+1} = w_{L+1} T_L + b_{L+1},  L ∈ {1, 2, ..., L-1}   (2)
T_{L+1} = f(H_{L-1} + H_{L+1}),  L ∈ {1, 2, ..., L-1}   (3)
where L denotes the index of the convolution layer; w_L and w_{L+1} denote the weight values of the layer-L and layer-(L+1) convolution layers of the monocular depth estimation residual convolutional neural network model being trained; b_L denotes the bias value for training the model; f(·) denotes the activation function in the model; H_{L-1} and H_{L+1} denote the feature maps output by the layer-(L-1) and layer-(L+1) convolution layers; T_L denotes the output of the layer-L convolution layer before the residual term, and T_{L+1} denotes the output of the layer-(L+1) convolution layer after activation through the residual term;
a3, adding a depth linear regression function after the final feature map output by the convolution layer, and mapping each pixel point in the feature map into a corresponding depth value, as shown in a formula (4):
where G denotes the last deconvolution layer; w_Gd denotes the weight of the monocular depth estimation residual convolutional neural network model being trained; b_Gd denotes the bias vector; H_G denotes the feature map obtained by the final deconvolution layer; and d̂ denotes the resulting depth value;
B. obtaining a camera pose value when pixel coordinates on a target image are transferred to a next frame image
Use a pose residual convolutional neural network model to compute the camera pose value T̂ obtained when the pixel coordinates of each point in the current target image I_t are transferred to the corresponding pixel coordinates in the next frame image I_{t+1}; the pose residual convolutional neural network model consists of an input layer, seven convolution layers, and two residual items; the specific steps are as follows:
B1. Assume two consecutive RGB images I_t, I_{t+1} are given, each of size 426 × 128 × 3, and input them into the pose residual convolutional neural network model;
B2. After the seven convolution layers of the pose residual convolutional neural network model, a feature vector AX + b of size 1 × 768 is obtained for each of the two images I_t, I_{t+1}, where A denotes the convolution kernel, X denotes the image feature, i.e., the gray-level matrix, and b denotes the bias value;
B3. A camera pose estimation algorithm yields the camera pose value T̂ from image I_t to image I_{t+1}, i.e., the rotation matrix R and translation vector v, expressed in six degrees of freedom, meaning that the coordinates of pixels in image I_t are located at the corresponding pixel coordinates in image I_{t+1} through the camera pose transformation value;
C. construction of a loss function
C1. Using the predicted image depth value D̂, the camera motion pose estimate T̂, the target image I_t, and the camera built-in (intrinsic) parameter K as input, the mapping relation between a pixel coordinate point p_t in the target image and the pixel coordinate point p_{t+1} in the next frame image can be obtained; this process is called view synthesis, as shown in formula (5);
where K denotes the built-in (intrinsic) parameters of the camera, T̂_{t→t+1} denotes the pose estimate of the camera moving from time t to time t+1, and D̂_t(p_t) denotes the depth value at pixel coordinate point p_t;
C2. Through view synthesis, the correspondence between each pixel point of the target image and the next frame image is found and used as the unsupervised signal of the loss function; because the coordinate values in an image are all discrete integers, in order to guarantee that I_t(p_{t+1}) has a pixel value, the values of the four neighbours of I_t(p_{t+1}), namely the top-left, bottom-left, top-right, and bottom-right values, are proportionally blended by bilinear interpolation to obtain Î_t, representing the new image after coordinate conversion, as shown in formula (6);
where ω_ij is linearly related to the distance between p_{t+1} and its neighbours, Σω_ij = 1, and the ω_ij are the bilinear interpolation weights; N_p denotes the neighbourhood of pixel coordinate p on the image, i denotes linear interpolation along the vertical axis within the neighbourhood of p, and j denotes linear interpolation along the horizontal axis within the neighbourhood of p;
C3. The monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model jointly form the unsupervised residual convolutional neural network scene depth estimation model; the loss function of this model is obtained by view synthesis and realizes the mapping from the original image to the depth image, as shown in formula (7):

L_vs = Σ_{t=1}^{M} Σ_{p=1}^{N} | I_t(p) − Î_t(p) |        (7)
wherein I_t(p) represents the pixel value of the target image I_t at pixel coordinate p, Î_t(p) represents the corresponding pixel on the reconstructed new image, N represents the number of pixels and M represents the number of images; t indexes the frame and p is the pixel point coordinate;
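A direct reading of formula (7) as code (a sketch only; the patent does not prescribe an implementation, and `view_synthesis_loss` is an illustrative name):

```python
import numpy as np

def view_synthesis_loss(targets, reconstructions):
    """L_vs: sum over the M target images and their N pixels of the
    absolute photometric difference |I_t(p) - I_hat_t(p)| (formula (7))."""
    return sum(np.abs(I - I_hat).sum()
               for I, I_hat in zip(targets, reconstructions))
```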
D. Scene depth estimation based on an unsupervised conditional random field residual convolutional neural network
D1. A conditional random field is added on the basis of steps A and B to form the unsupervised conditional random field residual convolutional neural network scene depth estimation model, wherein the conditional random field consists of an input layer, seven convolutional layers and two residual terms;
D2. For scene depth estimation, given an image X, the conditional probability of the depth d is written in the form of a Gibbs distribution, as shown in formula (8):

P(d|X) = exp(−E(d, X)) / Z(X)        (8)

wherein E(d, X) is the energy function, and the normalization factor Z(X) is as shown in formula (9):

Z(X) = ∫ exp(−E(d, X)) dd        (9)
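On a discretised set of depth hypotheses, the Gibbs normalization of formulas (8)-(9) reduces to a softmax over negative energies (a toy sketch for intuition only; the patent's Z(X) is an integral over continuous depths):

```python
import numpy as np

def gibbs_prob(energies):
    """P(d|X) = exp(-E(d,X)) / Z(X) over a finite set of depth hypotheses;
    Z(X) is the sum of exp(-E), so the probabilities sum to 1 and lower
    energy means higher probability."""
    w = np.exp(-np.asarray(energies, dtype=float))
    return w / w.sum()
```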
D3. The unsupervised conditional random field residual convolutional neural network scene depth estimation model is trained by the maximum conditional likelihood estimation method, so that the negative log likelihood function L of its loss function is as shown in formula (10):

L = −log P(d|X) = E(d, X) + log Z(X)        (10)
wherein the energy function E(d, X) is defined in a form containing a unary term and a binary term, as shown in formula (11):

E(d, X) = Σ_p U(d_p) + Σ_{(p,j)} D_pj(d_p, d_j)        (11)
wherein d_p and d_j represent the depth values at points p and j; D_pj(d_p, d_j) is the feature expression of the original image, as shown in formula (12), and the unary term is given by the loss function L_vs of the unsupervised residual convolutional neural network scene depth estimation model as in formula (7):

D_pj(d_p, d_j) = (1/2) π_pj (d_p − d_j)²        (12)
wherein π_pj represents the smoothness penalty, used to measure the similarity of adjacent pixel features; U_p and U_j respectively represent the feature values of the pixel points at point p and point j; the smaller the difference between the feature values of two pixel points, the more strongly their depths are penalized towards being similar;
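One common realisation of this binary term, assuming a Gaussian kernel on the feature difference for π_pj (the patent does not fix the exact kernel; `pairwise_energy` and `beta` are illustrative):

```python
import numpy as np

def pairwise_energy(d, U, neighbours, beta=1.0):
    """Binary term of formula (11): sum over neighbouring pairs (p, j) of
    0.5 * pi_pj * (d_p - d_j)^2, with pi_pj = exp(-beta * ||U_p - U_j||^2),
    so pixels with similar features are pushed towards similar depths."""
    e = 0.0
    for p, j in neighbours:
        pi = np.exp(-beta * np.sum((U[p] - U[j]) ** 2))
        e += 0.5 * pi * (d[p] - d[j]) ** 2
    return e
```

Identical features give π_pj = 1 (full smoothness pressure), while very different features drive π_pj toward 0, allowing a depth discontinuity at that edge.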
D4. The mapping from the original target image to the depth map is achieved by using the monocular depth estimation residual convolutional neural network model and the pose residual convolutional neural network model of step A and step B, which forms the unary term part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model; the binary term part of the model realizes the feature expression of the original image; the outputs of these two parts are used to construct the unsupervised conditional random field residual convolutional neural network scene depth estimation model; similar to the conventional method for training parameters in a conditional random field, the method adopts the maximum conditional likelihood estimation method to train the model, with the negative log likelihood function adopted as the loss function of the model, as shown in formula (13):

L(W) = −log P(d|X; W)        (13)
wherein W denotes the parameters of the unsupervised conditional random field residual convolutional neural network scene depth estimation model to be trained; to facilitate taking derivatives of the parameters in the loss function, the following expression is introduced:
Q = B + C − R
wherein B represents an n×n identity matrix, R is the similarity square matrix formed from the terms π_pj, C is the diagonal degree matrix of R, and C − R is a Laplacian matrix; writing z for the depth map regressed by the unary network part, E(d, X) is thus rearranged into formula (14):

E(d, X) = d^T Q d − 2 z^T d + z^T z        (14)

wherein T represents the transpose of the matrix; W_1 and W_2 respectively represent the parameters of the unary term and of the binary term of the unsupervised conditional random field residual convolutional neural network scene depth estimation model; because E(d, X) is a quadratic function of the depth vector d, the normalization function Z(X, W) can be calculated directly, as shown in formula (15):

Z(X, W) = ∫ exp(−E(d, X)) dd = π^{n/2} |Q|^{−1/2} exp( z^T Q^{−1} z − z^T z )        (15)

wherein d_I denotes the depth value of each pixel point on the image, over which the integral runs; combining formulas (14)-(15) yields the probability distribution function shown in formula (16):

P(d|X) = π^{−n/2} |Q|^{1/2} exp( −d^T Q d + 2 z^T d − z^T Q^{−1} z )        (16)

The loss function is then arranged into the negative log likelihood function shown in formula (17):

L(W) = d^T Q d − 2 z^T d + z^T Q^{−1} z − (1/2) log|Q| + (n/2) log π        (17)
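The matrix Q = B + C − R introduced at step D4 can be assembled directly (a sketch; here R is any symmetric similarity matrix built from the π_pj terms, and `build_Q` is an illustrative name):

```python
import numpy as np

def build_Q(R):
    """Q = B + C - R, where B is the n x n identity, C = diag(row sums of R)
    is the degree matrix, and C - R is the graph Laplacian of the
    similarity matrix R."""
    n = R.shape[0]
    B = np.eye(n)
    C = np.diag(R.sum(axis=1))
    return B + C - R
```

Because C − R is positive semi-definite and B is the identity, Q is symmetric positive definite, which is what makes the Gaussian integral for the normalization function well defined.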
D5. The unary part and the binary part of the unsupervised conditional random field residual convolutional neural network scene depth estimation model are trained synchronously and their parameters learned by minimizing the model's loss function using the stochastic gradient descent method; when taking the derivative with respect to the parameter W_1 of the model, because W_1 is related only to the parameters of the unary term, the derivative of L(W) with respect to W_1 is as shown in formula (18);
when taking the derivative with respect to the convolutional neural network parameter W_2, because W_2 is related only to the parameters of the binary term, the derivative of L(W) with respect to W_2 is likewise obtained, as shown in formula (19):
wherein Tr(·) in formulas (18) and (19) represents the trace of a matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910807213.8A CN110503680B (en) | 2019-08-29 | 2019-08-29 | Unsupervised convolutional neural network-based monocular scene depth estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110503680A CN110503680A (en) | 2019-11-26 |
CN110503680B (en) | 2023-08-18
Family
ID=68590325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910807213.8A Active CN110503680B (en) | 2019-08-29 | 2019-08-29 | Unsupervised convolutional neural network-based monocular scene depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503680B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111340864B (en) * | 2020-02-26 | 2023-12-12 | 浙江大华技术股份有限公司 | Three-dimensional scene fusion method and device based on monocular estimation |
CN111354030B (en) * | 2020-02-29 | 2023-08-04 | 同济大学 | Method for generating unsupervised monocular image depth map embedded into SENet unit |
CN113822918A (en) * | 2020-04-28 | 2021-12-21 | 深圳市商汤科技有限公司 | Scene depth and camera motion prediction method and device, electronic device and medium |
CN111583345B (en) * | 2020-05-09 | 2022-09-27 | 吉林大学 | Method, device and equipment for acquiring camera parameters and storage medium |
CN112150531B (en) * | 2020-09-29 | 2022-12-09 | 西北工业大学 | Robust self-supervised learning single-frame image depth estimation method |
CN112270692B (en) * | 2020-10-15 | 2022-07-05 | 电子科技大学 | Monocular video structure and motion prediction self-supervision method based on super-resolution |
CN112561947A (en) * | 2020-12-10 | 2021-03-26 | 中国科学院深圳先进技术研究院 | Image self-adaptive motion estimation method and application |
WO2022165722A1 (en) * | 2021-02-04 | 2022-08-11 | 华为技术有限公司 | Monocular depth estimation method, apparatus and device |
CN112767468B (en) * | 2021-02-05 | 2023-11-03 | 中国科学院深圳先进技术研究院 | Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement |
CN113129370B (en) * | 2021-03-04 | 2022-08-19 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113160294B (en) * | 2021-03-31 | 2022-12-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
CN112801074B (en) * | 2021-04-15 | 2021-07-16 | 速度时空信息科技股份有限公司 | Depth map estimation method based on traffic camera |
CN114170286B (en) * | 2021-11-04 | 2023-04-28 | 西安理工大学 | Monocular depth estimation method based on unsupervised deep learning |
TWI823416B (en) * | 2022-06-08 | 2023-11-21 | 鴻海精密工業股份有限公司 | Training method, device, electronic device and storage medium for depth estimation network |
CN116245927B (en) * | 2023-02-09 | 2024-01-16 | 湖北工业大学 | ConvDepth-based self-supervision monocular depth estimation method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180231871A1 (en) * | 2016-06-27 | 2018-08-16 | Zhejiang Gongshang University | Depth estimation method for monocular image based on multi-scale CNN and continuous CRF |
CN108765479A (en) * | 2018-04-04 | 2018-11-06 | 上海工程技术大学 | Using deep learning to monocular view estimation of Depth optimization method in video sequence |
CN110009674A (en) * | 2019-04-01 | 2019-07-12 | 厦门大学 | Monocular image depth of field real-time computing technique based on unsupervised deep learning |
Non-Patent Citations (1)
Title |
---|
Li Yaoyu. "Monocular Image Depth Estimation Based on Structured Deep Learning". Robot. 2017, Vol. 39 (No. 6), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN110503680A (en) | 2019-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503680B (en) | Unsupervised convolutional neural network-based monocular scene depth estimation method | |
CN108416840B (en) | Three-dimensional scene dense reconstruction method based on monocular camera | |
CN111325794B (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
CN108921926B (en) | End-to-end three-dimensional face reconstruction method based on single image | |
CN109584353B (en) | Method for reconstructing three-dimensional facial expression model based on monocular video | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN105654492B (en) | Robust real-time three-dimensional method for reconstructing based on consumer level camera | |
CN110009674B (en) | Monocular image depth of field real-time calculation method based on unsupervised depth learning | |
CN110163974B (en) | Single-image picture reconstruction method based on undirected graph learning model | |
CN112766160A (en) | Face replacement method based on multi-stage attribute encoder and attention mechanism | |
CN112215050A (en) | Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment | |
CN112784736B (en) | Character interaction behavior recognition method based on multi-modal feature fusion | |
CN108932536A (en) | Human face posture method for reconstructing based on deep neural network | |
CN109684969B (en) | Gaze position estimation method, computer device, and storage medium | |
CN108280858B (en) | Linear global camera motion parameter estimation method in multi-view reconstruction | |
CN110853075A (en) | Visual tracking positioning method based on dense point cloud and synthetic view | |
CN111783582A (en) | Unsupervised monocular depth estimation algorithm based on deep learning | |
CN110176023B (en) | Optical flow estimation method based on pyramid structure | |
CN113313732A (en) | Forward-looking scene depth estimation method based on self-supervision learning | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN111462274A (en) | Human body image synthesis method and system based on SMPL model | |
CN113570658A (en) | Monocular video depth estimation method based on depth convolutional network | |
CN115471423A (en) | Point cloud denoising method based on generation countermeasure network and self-attention mechanism | |
Jiang et al. | H2-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||