CN110009674B - Monocular image depth of field real-time calculation method based on unsupervised depth learning - Google Patents
- Publication number: CN110009674B (application CN201910256117.9A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06T7/593: Depth or shape recovery from multiple images from stereo images
- G06T2207/10004: Still image; Photographic image
- G06T2207/10012: Stereo images
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30248: Vehicle exterior or interior
- G06T2207/30252: Vehicle exterior; Vicinity of vehicle
Abstract
The invention discloses a monocular image depth-of-field real-time calculation method based on unsupervised deep learning. The method uses the geometric constraint relationship between binocular sequence images to construct a supervision signal, replacing the traditional manually labeled data set and completing an unsupervised algorithm design. In the Depth-CNN network, the loss function considers the geometric constraints between images and also includes a consistency constraint term on the depth-of-field estimates for the left and right images, which improves the algorithm's accuracy. The output of the Depth-CNN serves as part of the input of the Pose-CNN to construct an overall objective function, and the geometric relationships both between binocular images and between sequence images are used to construct the supervision signal, further improving the accuracy and robustness of the algorithm.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Background
Owing to its low purchase price and its ability to acquire complete scene information in real time, the camera is widely used in scene perception research for service robots, autonomous navigation robots and unmanned vehicles. With the development of high-performance computing devices, artificial intelligence techniques that analyze 2D image information with deep neural networks play an increasingly irreplaceable role in fields such as unmanned driving and robot navigation. Real-time computation of scene depth from a monocular image is a prerequisite for three-dimensional scene perception. In 2014, David Eigen first used a deep neural network to calculate the scene depth corresponding to a 2D image, establishing a mapping from 2D to 3D.
At present, scene depth calculation algorithms based on monocular images fall mainly into supervised and unsupervised algorithms. Supervised algorithms require a large amount of manually labeled data; David Eigen proposed a coarse-to-fine method using two deep convolutional neural network stages to obtain scene depth in "D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014". However, such manually labeled data mostly depend on a laser scanner, and are costly and difficult to acquire, which limits the range of application. Unsupervised algorithms use only scene images as the training set and are therefore widely applicable; in "T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017", Zhou Tinghui et al. take sequence images as input and calculate scene depth directly without manual labels. However, such deep neural networks acquire scene depth only by analyzing scene information across a large number of images, and their accuracy cannot meet the stated requirements.
Analysis of the above problems shows that existing methods either require a large number of manually labeled images as training data or fail to meet accuracy requirements, with varying degrees of loss of detail. High-precision, real-time depth-of-field calculation results are of great significance in image-based unmanned-driving applications, so an unsupervised real-time depth-of-field calculation method for unmanned scenarios needs to be developed.
Disclosure of Invention
Aiming at problems such as three-dimensional scene perception for outdoor unmanned automobiles and unmanned autonomous robots, the difficulty of acquiring large manually labeled data sets, and limited application scenarios, the invention provides a monocular image depth-of-field real-time calculation method based on unsupervised deep learning.
In order to solve the problems, the invention is realized by the following technical scheme:
the monocular image depth of field real-time calculation method based on unsupervised deep learning comprises the following steps:
step 1, using binocular sequence images in an unmanned driving data set KITTI as input data, and classifying the binocular sequence images into two types through data preprocessing, namely a stereo image pair for a depth-of-field estimation convolutional neural network and a sequence image for a camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual error network, constructing an end-to-end system, taking a stereo image pair as input, outputting a corresponding depth-of-field estimation image, and designing a loss function corresponding to the depth-of-field estimation convolutional neural network for feedback propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on a convolutional neural network module, constructing an end-to-end system, outputting an attitude change matrix between sequence images by taking the sequence images and the depth of field estimation images as input, and designing a loss function corresponding to the camera attitude estimation convolutional neural network for feedback propagation;
step 4, constructing a target function based on the loss function corresponding to the depth-of-field estimation convolutional neural network designed in the step 2 and the loss function corresponding to the camera attitude estimation convolutional neural network designed in the step 3;
step 5, completing construction of a depth neural network based on the depth-of-field estimation convolutional neural network in the step 2 and the camera attitude estimation convolutional neural network in the step 3, completing design of a target function based on the step 4, and simultaneously training the depth-of-field estimation convolutional neural network and the camera attitude estimation convolutional neural network in the depth neural network by using all data in the unmanned data set KITTI obtained in the step 1 to fix a network parameter value and a network structure of the depth neural network to obtain a final calculation model;
and 6, inputting the monocular image actually obtained by the camera into the calculation model obtained in the step 5, wherein the output of the calculation model is the scene depth image corresponding to the image.
In the step 4, the constructed objective function is:
Lossfinal=λ1depth_loss+λ2pose_loss
wherein λ is1Weight, λ, representing the depth of field estimated convolutional neural network loss function2Representing the weight of a loss function of the camera attitude estimation convolutional neural network, depth _ loss representing the loss function of the depth-of-field estimation convolutional neural network, and pos _ loss representing the loss function of the camera attitude estimation convolutional neural network; alpha represents a weight value for measuring the importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; beta represents a weight value for measuring the importance of the image surface reconstruction result and the regularization term in the camera attitude estimation convolution neural network; s represents an image scale; n represents the total number of pixel points; i | · | purple wind1Represents the L1 norm; t represents a transposition of an image; SSIM () is a function that measures differences in image surface structuring; i islAnd IrA left view and a right view respectively representing a stereoscopic image;anda left view and a right view respectively representing a stereoscopic image reconstructed using a binocular camera geometric principle; dlAnd drRespectively representing a left disparity map and a right disparity map generated by a depth estimation convolutional neural network;andgradient images of the left disparity map in the abscissa and ordinate directions are respectively represented;andgradient images of the right disparity map in the abscissa and ordinate directions are respectively represented;andgradient images of a left image of the stereoscopic image in abscissa and ordinate directions, respectively;andgradient images of a right image of the stereoscopic image in abscissa and ordinate directions, respectively;anda left graph and a right graph respectively representing sequential images;andthe left image and the right image 
respectively represent a target image corresponding to a reference image in the sequence image under the s scale;a gradient map representing a depth image,andgradient images representing the left and right images of the sequence image, respectively.
In the step 1, aiming at the depth-of-field estimation convolutional neural network, extracting a corresponding stereo image pair from a binocular sequence image to be used as input data of a training data set; and aiming at the camera attitude estimation convolutional neural network, three continuous images are respectively extracted from two sequence images of a binocular sequence image, wherein the second image is used as a target image, the first image and the third image are used as reference images, and the two sequence images are used as input data of a training data set.
Compared with the prior art, the invention has the following characteristics:
1. a supervision signal is constructed by using a geometric constraint relation between binocular sequence images, and a traditional manual marking data set is replaced, so that the design of an unsupervised algorithm is completed.
2. In the Depth-CNN network, the loss function considers the geometric constraint between images, and also designs a Depth of field estimation result consistency constraint term aiming at left and right images, thereby improving the algorithm accuracy.
3. The output of the Depth-CNN is used as a part of the input of the Pose-CNN to construct an integral target function, and meanwhile, the geometric relationship between binocular images and the geometric relationship between sequence images are used to construct a supervision signal, so that the accuracy and the robustness of the algorithm are further improved.
Drawings
Fig. 1 is a flow chart of a monocular image depth of field real-time calculation method based on unsupervised deep learning.
Fig. 2 is a Depth-CNN network framework diagram.
Fig. 3 is a Pose-CNN network framework diagram.
Fig. 4 is an overall structural diagram of the objective function construction.
Fig. 5 is a graph of the algorithm results.
Fig. 5(a) is an input binocular sequence image, fig. 5(b) is an algorithm result of Zhou Tinghui, and fig. 5(c) is an algorithm result of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.
A monocular image depth of field real-time calculation method based on unsupervised deep learning is disclosed, as shown in FIG. 1, and specifically comprises the following steps:
step 1: and (5) preprocessing training data.
Using binocular sequence images in the unmanned data set KITTI as input data, the required images were classified into two types by data preprocessing: (1) stereoscopic image pairs for a Depth-CNN network; (2) sequence images for a Pose-CNN network.
All data in the KITTI database are preprocessed: each original image is first converted into a 256 × 512 three-channel (R, G, B) image with values normalized to the range 0 to 1. The data are then reorganized according to the target network. For the Depth-CNN, corresponding stereo image pairs are extracted from the binocular sequence images as input data for the training set. For the Pose-CNN, three consecutive images are extracted from each of the two sequences of the binocular sequence images (corresponding to the left and right cameras respectively), wherein the second image serves as the target image and the first and third images serve as reference images; both sequences are used as input data for the training set.
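The preprocessing of step 1 can be illustrated with a minimal NumPy sketch. The function names and the 8-bit input assumption are ours, not from the patent, and resizing to 256 × 512 is assumed to happen upstream:

```python
import numpy as np

def preprocess_frame(img_uint8):
    """Convert an 8-bit RGB frame to float values in [0, 1].
    Resizing to 256 x 512 is assumed to happen before this step."""
    return img_uint8.astype(np.float32) / 255.0

def make_pose_snippets(frames):
    """Group one camera's frame list into overlapping 3-frame snippets
    (reference, target, reference), as the Pose-CNN input requires."""
    return [(frames[i], frames[i + 1], frames[i + 2])
            for i in range(len(frames) - 2)]
```

The Depth-CNN side of the split simply pairs each left frame with its simultaneous right frame, so it needs no snippet grouping.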
Step 2: a Depth-CNN network (Depth of field estimation convolutional neural network) was established as shown in fig. 2.
And (3) establishing a Depth-CNN network based on the residual error network, constructing an end-to-end system, outputting a corresponding Depth-of-field estimation image by taking the stereo image pair mentioned in the step (1) as input, and designing a corresponding loss function for feedback propagation.
An encoding-decoding model is established based on the residual network. In the encoding process, convolution kernels successively extract high-dimensional features of the input image to generate multi-scale feature maps; in the decoding process, deconvolution kernels upsample the upper-layer feature maps, and the scales of the generated target images correspond one-to-one with the feature maps from the encoding process. Encoding and decoding form an end-to-end learning process whose aim is to learn an objective function d = f(I) through the network; this function establishes a pixel-level correspondence, giving the depth value for each pixel of the input image.
The solving process of the objective function d = f(I) is an iterative process. Here the disparity map disp is used in place of the depth image depth; the relationship between the two is depth(i, j) = b·f / disp(i, j), where b and f are the baseline and focal length of the binocular camera and (i, j) are the coordinates of a pixel in the image. Let I_l and I_r be the input stereo image pair of the Depth-CNN, with corresponding output disparity maps disp_left and disp_right.
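The disparity-to-depth relation can be sketched directly (the zero-disparity floor is our guard, not part of the patent's formula):

```python
import numpy as np

def disparity_to_depth(disp, baseline, focal):
    """depth(i, j) = b * f / disp(i, j); a small floor on the disparity
    avoids division by zero (the floor value is our choice)."""
    return baseline * focal / np.maximum(disp, 1e-6)
```

For KITTI-like values (baseline ≈ 0.54 m, focal ≈ 721 px), halving the disparity doubles the recovered depth.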
The loss function is divided into three parts: an image-reconstruction term for each of the left and right images, and a disparity-map consistency term. The reconstruction terms for the left and right images follow the same principle. Taking the reconstruction of the right image from the left image as an example, the left image I_l is input to the Depth-CNN, which outputs the disparity map disp_left corresponding to the left image. From the geometric principle of the binocular camera, the following correspondence can be obtained:

Î_r(i, j) = I_l(i, j + disp_left(i, j)), (i, j) ∈ Ω_l    (1)

wherein Ω_l is the known region to which the image pixels belong, I_l and I_r are the left and right input images, and Î_l, Î_r are the left and right reconstructed images. Equation (1) establishes a relational expression for reconstructing the right input image from the left input image and the disparity map output by the Depth-CNN. Comparing the difference between the reconstructed right image and the original right input image yields the supervision signal of the deep convolutional neural network, namely:

L_ap^r = α · (1 − SSIM(I_r, Î_r)) / 2 + (1 − α) · ||I_r − Î_r||₁    (2)

where α is a weight parameter and SSIM() is a function that measures the difference in the structural results of two input images (see Wang Z., Bovik A. C., Sheikh H. R., et al. Image Quality Assessment: From Error Visibility to Structural Similarity [J]. IEEE Transactions on Image Processing, 2004, 13(4)). Meanwhile, considering that scene depth values are most discontinuous in the edge areas of objects, a loss term is constructed from the image edge information in order to preserve image detail:

L_ds^r = (1/N) Σ_{i,j} ( |∂x d_r(i,j)| · e^{−|∂x I_r(i,j)|} + |∂y d_r(i,j)| · e^{−|∂y I_r(i,j)|} )    (3)

wherein N is the number of image pixels and ∂x, ∂y denote the image gradients along the abscissa and ordinate directions.
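The appearance and edge-aware smoothness terms just described can be sketched in NumPy. Note the simplification: a whole-image SSIM stands in for the windowed SSIM of Wang et al., so the numbers are illustrative only, and all names are ours:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Single-window SSIM over the whole image; the cited Wang et al.
    # SSIM uses local windows, so this global variant is a sketch.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def appearance_loss(img, recon, alpha=0.85):
    # Alpha-weighted SSIM dissimilarity plus an L1 photometric term.
    ssim_term = (1.0 - ssim_global(img, recon)) / 2.0
    return alpha * ssim_term + (1.0 - alpha) * np.abs(img - recon).mean()

def smoothness_loss(disp, img):
    # Edge-aware smoothness: disparity gradients are down-weighted
    # where the image itself has strong gradients.
    dx_d, dy_d = np.abs(np.diff(disp, axis=1)), np.abs(np.diff(disp, axis=0))
    dx_i, dy_i = np.abs(np.diff(img, axis=1)), np.abs(np.diff(img, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A perfect reconstruction drives the appearance term to zero, and a constant disparity map incurs no smoothness penalty.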
From equations (2) and (3), the reconstruction loss function for the right image is:

depth_right = Σ_s (L_ap^r + L_ds^r)    (4)

The reconstruction loss function for the left image can be derived by the same method as equation (4):

depth_left = Σ_s (L_ap^l + L_ds^l)    (5)
where s is the image scale; in the present embodiment, s = 4, i.e., corresponding images at four scales are extracted as output results.
Since the left and right images are acquired by the binocular camera at the same moment, the left and right disparity maps should be mutually consistent. Using this principle, a disparity consistency loss term is designed, namely:

LR_loss_s = (1/N) Σ_{i,j} |d_l(i, j) − d_r(i, j + d_l(i, j))|    (6)
the loss function corresponding to the Depth-CNN obtained by simultaneous equations (4), (5) and (6) is:
depth_loss = depth_right + depth_left + LR_loss_s    (7)
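The left-right consistency check can be sketched as follows. The nearest-neighbour sampling and the j − d sign convention are our simplifications; the actual convention depends on how the stereo pair is rectified:

```python
import numpy as np

def lr_consistency_loss(disp_l, disp_r):
    """Compare each left disparity with the right disparity sampled at
    the disparity-shifted column, averaged over all pixels. Sampling is
    nearest-neighbour and clamped at the image border (our choices)."""
    h, w = disp_l.shape
    total = 0.0
    for i in range(h):
        for j in range(w):
            jj = min(max(int(round(j - disp_l[i, j])), 0), w - 1)
            total += abs(disp_l[i, j] - disp_r[i, jj])
    return total / (h * w)
```

Identical constant disparity maps give zero loss; any systematic disagreement between the two maps raises it.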
and step 3: a pos-CNN network (camera Pose estimation convolutional neural network) was established as shown in fig. 3.
And establishing a Pose-CNN network based on a convolutional neural network module, constructing an end-to-end system, taking the sequence images mentioned in the step 1 and the depth of field estimation images mentioned in the step 2 as input, outputting a posture change matrix between the sequence images, and designing a corresponding loss function for feedback propagation.
A deep convolutional neural network is established based on convolutional neural network modules. The network takes the sequence images from the data preprocessing result of step 1 as input and outputs the four transformation matrices from the reference images to the target images in the left and right sequences. Each transformation matrix comprises six degrees of freedom corresponding to the spatial rotation and translation of the camera. Images are then reconstructed from the depth-of-field images of step 2 and the matrices output by the Pose-CNN network, and the reconstructed images serve as the supervision signal of the network.
The reconstruction processes for the left and right sequence images are similar. Taking the left sequence as an example, let {I_1, I_2, I_3} denote the left image sequence, where I_2 is the target image and I_1 and I_3 are reference images. The goal is to use the depth maps output by the Depth-CNN for these three images, together with the transformation matrices from I_1 and I_3 to I_2 output by the Pose-CNN, to reconstruct the target image I_2, and then compare the reconstruction with the original target image I_2 to construct a loss function. The construction principle is as follows:
I_{s1→t}(p_t) = I_1(K · T_{s1→t} · D_{s1}(p_{s1}) · K⁻¹ · p_{s1})    (8)

I_{s2→t}(p_t) = I_3(K · T_{s2→t} · D_{s2}(p_{s2}) · K⁻¹ · p_{s2})    (9)

wherein p_{s1} and p_{s2} refer to the pixels of the reference images I_1 and I_3 respectively; D_{s1}(p_{s1}) and D_{s2}(p_{s2}) refer to the depth values corresponding to those pixels obtained in step 2; K is the camera intrinsic matrix; T_{s1→t} and T_{s2→t} refer to the transformation matrices, output by the Pose-CNN, from the reference images I_1 and I_3 to the target image I_2; and I_{s1→t}(p_t) and I_{s2→t}(p_t) refer to the target images reconstructed from the reference images at scale s.
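The projection pattern of equations (8) and (9) can be sketched for a single pixel. A real implementation would do this densely over the image and bilinearly sample the reference view; this scalar version, with names of our choosing, only shows the geometry:

```python
import numpy as np

def project_to_reference(p_t, depth, K, T):
    """Warp a homogeneous pixel p_t = (u, v, 1) into another view:
    back-project with K^-1 and the predicted depth, apply the 4x4
    rigid transform T, and re-project with the intrinsics K."""
    cam = depth * (np.linalg.inv(K) @ p_t)   # 3D point in source camera
    cam_ref = (T @ np.append(cam, 1.0))[:3]  # move into the other frame
    pix = K @ cam_ref
    return pix[:2] / pix[2]                  # perspective divide
```

With the identity transform the pixel maps to itself, which is a convenient sanity check for the intrinsic/extrinsic chain.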
Like the image-difference construction of step 2, a loss term serving as the supervision signal is designed for the sequence images:

L_vs = Σ_{n∈{s1,s2}} ( β · (1 − SSIM(I_2, I_{n→t})) / 2 + (1 − β) · ||I_2 − I_{n→t}||₁ )    (10)

where β is a weight parameter; in the present embodiment, β = 0.85.
Similar to step 2, a smoothness loss term is constructed using the image edge information:

L_sm = (1/N) Σ_{i,j} ( |∂x D(i,j)| · e^{−|∂x I_2(i,j)|} + |∂y D(i,j)| · e^{−|∂y I_2(i,j)|} )    (11)
from equations (10) and (11), the corresponding loss term of the left sequence image at the scale s is:
similarly, the corresponding loss term of the right sequence image at the scale s is:
thus, for the sequence images having the equations (12) (13), the total loss function is:
the constructed objective functions are respectively designed according to four scales, and finally summed.
And 4, step 4: and constructing an objective function.
In the training process of the network, the Depth-CNN and the Pose-CNN are trained simultaneously, and the loss terms of the two parts are all involved in the feedback propagation process of the network as a part of the final loss function, as shown in FIG. 4. The final objective function is composed of the loss function terms of the Depth-CNN and the Pose-CNN, and is shown as the formula (15):
Loss_final = λ1 · depth_loss + λ2 · pose_loss    (15)

wherein λ1 represents the weight of the depth-of-field estimation network loss function and λ2 represents the weight of the camera attitude estimation network loss function; in this embodiment, λ1 = 1.0 and λ2 = 0.8. The objective function thus considers the geometric constraints of the image reconstruction process for both the stereo image pairs and the sequence images.
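The weighted combination of equation (15) is simple enough to state directly; the defaults below follow the embodiment's λ settings:

```python
def total_objective(depth_loss, pose_loss, lam1=1.0, lam2=0.8):
    # Loss_final = lambda1 * depth_loss + lambda2 * pose_loss,
    # with the embodiment's weights as defaults.
    return lam1 * depth_loss + lam2 * pose_loss
```

Both networks backpropagate from this single scalar, which is what couples the Depth-CNN and Pose-CNN during training.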
The loss function of the depth-of-field estimation convolutional neural network is the depth_loss constructed in step 2, and the loss function of the camera attitude estimation convolutional neural network is the pose_loss constructed in step 3.
wherein α represents the weight measuring the relative importance of the image surface reconstruction result and the regularization term in the depth-of-field estimation convolutional neural network; β represents the corresponding weight in the camera attitude estimation convolutional neural network; s represents the image scale; t denotes the target image; ||·||₁ represents the L1 norm; SSIM() is a function measuring differences in image surface structure; i and j represent the abscissa and ordinate of a pixel in the image; N represents the total number of pixel points; I_l and I_r represent the left and right images of the input stereo image pair; d_l and d_r represent the left and right disparity maps generated by the depth-of-field estimation convolutional neural network; Î_l represents the left image reconstructed from I_r and d_l using the binocular-camera geometric principle, and Î_r the right image reconstructed from I_l and d_r in the same way; I_t^l and I_t^r represent the target images in the input sequences of the left and right cameras; Î_t^l and Î_t^r represent the left and right target-image reconstructions obtained by taking the reference images, their corresponding depth images and the camera pose-change matrices as input, combined with the camera parameter matrix; ∂x(d_l), ∂y(d_l) and ∂x(d_r), ∂y(d_r) represent the gradient images of the left and right disparity maps along the abscissa and ordinate directions; ∂x(I_l), ∂y(I_l) and ∂x(I_r), ∂y(I_r) represent the gradient images of the left and right input images; ∂D represents the gradient map of the depth image; ∂(I_l) and ∂(I_r) represent the gradient images of the left and right input sequence images; p_t represents the coordinates of a pixel point in the image; and s1→t, s2→t denote the transformations from the two reference images to the target image.
And 5: and (5) deep neural network training.
After completing the deep convolutional neural network construction and objective function design in steps 2-4, the network training process begins. About 180 GB of data from the KITTI data set are selected, yielding 22,600 stereo image pairs after preprocessing. Three stereo pairs are input per iteration to train the network parameters, of which there are about 65 million. The network is set to iterate 300,000 times in total, finally yielding the calculation model for the actual testing process.
Step 6: and (5) actual testing.
The design of the deep neural network and the calculation of network parameters are completed, and the monocular image is used as input data in the actual use process, so that the scene depth image corresponding to the image is directly obtained.
After step 5 is completed, the network parameter values and network structure are fixed. A monocular image can then be input directly, and the network directly outputs the corresponding depth-of-field image, processing each image in 35 ms, which meets the requirement for processing video data. This establishes the correspondence from 2D images to three-dimensional spatial perception.
The effects of the present invention are further illustrated by the following simulation results.
1. Simulation conditions
(1) Any image in the KITTI data set is selected and converted into a 256 × 512 RGB image.
(2) Experimental parameters are set: λ1 = 1.0, λ2 = 0.8, α = 0.85, β = 0.85.
2. Simulation content and results
Simulation content: using a 256 × 512 RGB image as input, the results of two classical algorithms were compared under a uniform error evaluation criterion. The error evaluation criteria are as follows:
where N is the number of pixels, y_i is the depth-of-field prediction and y_i* is the true depth-of-field value. The criteria are the standard ones: Abs Rel = (1/N) Σ |y_i − y_i*| / y_i*; Sq Rel = (1/N) Σ (y_i − y_i*)² / y_i*; RMSE = sqrt((1/N) Σ (y_i − y_i*)²); RMSE log10 computed analogously on the logarithms of the depths; and the threshold accuracy, the fraction of pixels with max(y_i / y_i*, y_i* / y_i) below a threshold δ.
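The evaluation criteria can be sketched in NumPy (Eigen-style definitions; we assume the table's threshold column uses δ < 1.25):

```python
import numpy as np

def depth_metrics(pred, gt):
    # Standard monocular-depth error metrics matching Table 1's columns;
    # the threshold column is assumed to be the delta < 1.25 accuracy.
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    delta = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return abs_rel, sq_rel, rmse, rmse_log, delta
```

A perfect prediction gives zero for the four error columns and 1.0 for the threshold column, matching the direction of "smaller error, higher threshold is better".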
The experimental results are as follows:
the experimental results are shown in table 1, and compared with the supervised algorithm proposed by David Eigen and the unsupervised algorithm proposed by Zhou Tinghui, the accuracy of the method is improved, and the method meets the application requirements of navigation of unmanned automobiles and outdoor unmanned autonomous robots in real time and accuracy.
TABLE 1
Method | Abs Rel | Sq Rel | RMSE | RMSE log10 | Threshold value
David Eigen | 0.214 | 1.605 | 6.563 | 0.292 | 0.957
Zhou Tinghui | 0.208 | 1.768 | 6.856 | 0.283 | 0.957
The invention | 0.151 | 1.325 | 5.653 | 0.231 | 0.975
The evaluation indices in Table 1, namely the absolute relative difference (Abs Rel), squared relative difference (Sq Rel), root mean square error (RMSE) and logarithmic root mean square error (RMSE log10), represent algorithm error values used to measure accuracy; a smaller error value indicates higher accuracy. The threshold value represents the closeness of the depth-of-field prediction to the true value; the higher the threshold accuracy, the better the stability of the algorithm. The experimental results show that the precision of the method is clearly superior to that of the two compared methods. Since David Eigen's algorithm is supervised, we compare the qualitative results of the present invention only with the Zhou Tinghui algorithm, as shown in fig. 5(a)-(c). The test results show that the method is clearly superior to Zhou Tinghui's method in detecting target image details.
In the neural network training process, the choice of activation function has a large influence on the result, and almost all such methods use the rectified linear unit (ReLU) as the activation function. After multiple experiments, the exponential linear unit (ELU) was selected as the activation function; the experimental results in Table 2 show that using the ELU is clearly better than using the ReLU. In the present embodiment, the ELU serves as the activation function.
TABLE 2
Activation function | Abs Rel | Sq Rel | RMSE | RMSE log10 | Threshold value
ReLU | 0.204 | 2.078 | 7.004 | 0.343 | 0.922
ELU | 0.151 | 1.325 | 5.653 | 0.231 | 0.975
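The ELU favoured in Table 2 is a one-liner; unlike the ReLU, it passes small negative values instead of clipping them to zero:

```python
import numpy as np

def elu(x, a=1.0):
    # ELU: identity for x > 0, a * (exp(x) - 1) otherwise, so negative
    # inputs saturate smoothly at -a instead of being zeroed like ReLU.
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))
```

Keeping a non-zero gradient for negative inputs is the usual explanation for the accuracy gap the table reports, though the patent itself only states the empirical result.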
Aiming at the problems of three-dimensional space perception in the autonomous navigation of the current unmanned and outdoor unmanned robots and high cost caused by the adoption of laser radar, the invention provides a low-cost scene depth real-time calculation method suitable for the autonomous navigation of the unmanned and outdoor unmanned robots. The method uses a monocular camera as a sensor, directly calculates scene depth through a depth convolution neural network trained under a line, and is an end-to-end method from an input image to a scene depth image. The method has the characteristics of real-time performance, high accuracy and the like, solves the problem of field depth calculation in three-dimensional scene perception only by relying on a low-cost image sensor, and provides an economical and reliable scene depth real-time calculation method for unmanned driving and unmanned robot autonomous navigation technologies.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.
Claims (2)
1. A real-time monocular image depth-of-field calculation method based on unsupervised deep learning, characterized by comprising the following steps:
step 1, using binocular sequence images from the autonomous-driving dataset KITTI as input data, and dividing them through data preprocessing into two types: stereo image pairs for the depth-of-field estimation convolutional neural network, and sequence images for the camera attitude estimation convolutional neural network;
step 2, establishing a depth-of-field estimation convolutional neural network based on a residual network, constructing an end-to-end system that takes a stereo image pair as input and outputs the corresponding depth-of-field estimation image, and designing the loss function of the depth-of-field estimation convolutional neural network for back-propagation;
step 3, establishing a camera attitude estimation convolutional neural network based on convolutional neural network modules, constructing an end-to-end system that takes the sequence images and the depth-of-field estimation images as input and outputs the attitude change matrices between sequence images, and designing the loss function of the camera attitude estimation convolutional neural network for back-propagation;
step 4, constructing an objective function based on the loss function of the depth-of-field estimation convolutional neural network designed in step 2 and the loss function of the camera attitude estimation convolutional neural network designed in step 3; the constructed objective function is:
Loss_final = λ1·depth_loss + λ2·pose_loss
wherein λ1 represents the weight of the loss function of the depth-of-field estimation convolutional neural network; λ2 represents the weight of the loss function of the camera attitude estimation convolutional neural network; depth_loss represents the loss function of the depth-of-field estimation convolutional neural network; pose_loss represents the loss function of the camera attitude estimation convolutional neural network; α represents a weight balancing the image surface-reconstruction term against the regularization term in the depth-of-field estimation convolutional neural network; β represents the corresponding weight in the camera attitude estimation convolutional neural network; s represents the image scale; N represents the total number of pixels; ‖·‖1 represents the L1 norm; T represents the transposition of an image; SSIM(·) is a function measuring differences in image surface structure; I^l and I^r respectively represent the left and right views of a stereo image; Î^l and Î^r respectively represent the left and right views reconstructed using binocular camera geometry; d^l and d^r respectively represent the left and right disparity maps generated by the depth-of-field estimation convolutional neural network; ∂d^l/∂x and ∂d^l/∂y respectively represent the gradient images of the left disparity map in the abscissa and ordinate directions; ∂d^r/∂x and ∂d^r/∂y respectively represent the gradient images of the right disparity map in the abscissa and ordinate directions; ∂I^l/∂x and ∂I^l/∂y respectively represent the gradient images of the left stereo view in the abscissa and ordinate directions; ∂I^r/∂x and ∂I^r/∂y respectively represent the gradient images of the right stereo view in the abscissa and ordinate directions; I′^l and I′^r respectively represent the left and right images of the sequence images; Î′^l_s and Î′^r_s respectively represent, at scale s, the target images reconstructed from the reference images of the left and right sequences; ∇D represents the gradient map of the depth image; ∇I′^l and ∇I′^r respectively represent the gradient images of the left and right sequence images;
step 5, completing the construction of the deep neural network from the depth-of-field estimation convolutional neural network of step 2 and the camera attitude estimation convolutional neural network of step 3, completing the design of the objective function according to step 4, and jointly training the two networks with all data of the KITTI dataset obtained in step 1, thereby fixing the parameter values and structure of the deep neural network to obtain the final calculation model;
step 6, inputting a monocular image actually captured by the camera into the calculation model obtained in step 5; the output of the calculation model is the scene depth image corresponding to that image.
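The weighted objective of step 4 can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the patent's exact formulation: the global (non-windowed) SSIM, the α/λ values, and the function names are all illustrative choices; the patent combines SSIM and L1 terms per scale with regularization terms omitted here.

```python
import numpy as np

def photometric_loss(img, recon, alpha=0.85):
    """Appearance-matching term mixing (1 - SSIM)/2 and L1, as is common
    in unsupervised depth estimation. A single global SSIM is computed
    here as a simplification of the usual windowed SSIM."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizing constants
    mu_x, mu_y = img.mean(), recon.mean()
    var_x, var_y = img.var(), recon.var()
    cov = ((img - mu_x) * (recon - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    l1 = np.abs(img - recon).mean()
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1

def total_loss(depth_loss, pose_loss, lam1=1.0, lam2=0.5):
    """Loss_final = lambda1 * depth_loss + lambda2 * pose_loss (claim 1, step 4).
    The lambda values here are placeholders, not values from the patent."""
    return lam1 * depth_loss + lam2 * pose_loss
```

For a perfectly reconstructed view the photometric term vanishes, so the combined objective is driven entirely by whichever network still produces reconstruction error.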
2. The real-time monocular image depth-of-field calculation method based on unsupervised deep learning according to claim 1, wherein in step 1, for the depth-of-field estimation convolutional neural network, corresponding stereo image pairs are extracted from the binocular sequence images as one kind of input data of the training dataset; and for the camera attitude estimation convolutional neural network, three consecutive images are extracted from each of the two sequences of the binocular sequence images, the second image serving as the target image and the first and third images serving as reference images, these forming the other kind of input data of the training dataset.
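The preprocessing of claim 2 can be sketched as a simple splitting routine. The function name and data layout are illustrative assumptions; the claim itself only specifies the two output types: stereo pairs, and sliding three-frame windows in which the middle frame is the target.

```python
def make_training_samples(left_seq, right_seq):
    """Split a binocular image sequence into the two input types of claim 2:
    - stereo pairs (left_t, right_t) for the depth-of-field network;
    - per-camera 3-frame windows for the attitude (pose) network, where
      the middle frame is the target and frames 1 and 3 are references."""
    assert len(left_seq) == len(right_seq)
    stereo_pairs = list(zip(left_seq, right_seq))

    def triplets(seq):
        # (reference, target, reference) sliding windows
        return [(seq[i], seq[i + 1], seq[i + 2]) for i in range(len(seq) - 2)]

    return stereo_pairs, triplets(left_seq), triplets(right_seq)
```

A sequence of n synchronized frame pairs thus yields n stereo pairs and n-2 triplets per camera, with every interior frame appearing once as a target.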
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256117.9A CN110009674B (en) | 2019-04-01 | 2019-04-01 | Monocular image depth of field real-time calculation method based on unsupervised depth learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110009674A CN110009674A (en) | 2019-07-12 |
CN110009674B true CN110009674B (en) | 2021-04-13 |
Family
ID=67169169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910256117.9A Active CN110009674B (en) | 2019-04-01 | 2019-04-01 | Monocular image depth of field real-time calculation method based on unsupervised depth learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110009674B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112258565B (en) * | 2019-07-22 | 2023-03-28 | 华为技术有限公司 | Image processing method and device |
CN110503680B (en) * | 2019-08-29 | 2023-08-18 | 大连海事大学 | Unsupervised convolutional neural network-based monocular scene depth estimation method |
CN110751100A (en) * | 2019-10-22 | 2020-02-04 | 北京理工大学 | Auxiliary training method and system for stadium |
CN111311664B (en) * | 2020-03-03 | 2023-04-21 | 上海交通大学 | Combined unsupervised estimation method and system for depth, pose and scene flow |
CN113393510B (en) * | 2020-03-12 | 2023-05-12 | 武汉Tcl集团工业研究院有限公司 | Image processing method, intelligent terminal and storage medium |
CN111583345B (en) * | 2020-05-09 | 2022-09-27 | 吉林大学 | Method, device and equipment for acquiring camera parameters and storage medium |
CN111753961B (en) | 2020-06-26 | 2023-07-28 | 北京百度网讯科技有限公司 | Model training method and device, prediction method and device |
CN112150531B (en) * | 2020-09-29 | 2022-12-09 | 西北工业大学 | Robust self-supervised learning single-frame image depth estimation method |
CN112561947A (en) * | 2020-12-10 | 2021-03-26 | 中国科学院深圳先进技术研究院 | Image self-adaptive motion estimation method and application |
CN113763474B (en) * | 2021-09-16 | 2024-04-09 | 上海交通大学 | Indoor monocular depth estimation method based on scene geometric constraint |
CN114332187B (en) * | 2022-03-09 | 2022-06-14 | 深圳安智杰科技有限公司 | Monocular target ranging method and device |
CN114967121B (en) * | 2022-05-13 | 2023-02-03 | 哈尔滨工业大学 | Design method of end-to-end single lens imaging system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106210450A (en) * | 2016-07-20 | 2016-12-07 | 罗轶 | Video display artificial intelligence based on SLAM |
CN109377530A (en) * | 2018-11-30 | 2019-02-22 | 天津大学 | A kind of binocular depth estimation method based on deep neural network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107204010B (en) * | 2017-04-28 | 2019-11-19 | 中国科学院计算技术研究所 | A kind of monocular image depth estimation method and system |
CN108961327B (en) * | 2018-05-22 | 2021-03-30 | 深圳市商汤科技有限公司 | Monocular depth estimation method and device, equipment and storage medium thereof |
CN109063746A (en) * | 2018-07-14 | 2018-12-21 | 深圳市唯特视科技有限公司 | A kind of visual similarity learning method based on depth unsupervised learning |
CN109472830A (en) * | 2018-09-28 | 2019-03-15 | 中山大学 | A kind of monocular visual positioning method based on unsupervised learning |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106210450A (en) * | 2016-07-20 | 2016-12-07 | 罗轶 | Video display artificial intelligence based on SLAM |
CN109377530A (en) * | 2018-11-30 | 2019-02-22 | 天津大学 | A kind of binocular depth estimation method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN110009674A (en) | 2019-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110009674B (en) | Monocular image depth of field real-time calculation method based on unsupervised depth learning | |
CN108416840B (en) | Three-dimensional scene dense reconstruction method based on monocular camera | |
CN111325797B (en) | Pose estimation method based on self-supervision learning | |
CN108921926B (en) | End-to-end three-dimensional face reconstruction method based on single image | |
CN111862126B (en) | Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm | |
CN110853075B (en) | Visual tracking positioning method based on dense point cloud and synthetic view | |
CN110675423A (en) | Unmanned aerial vehicle tracking method based on twin neural network and attention model | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN113160375B (en) | Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN108171249B (en) | RGBD data-based local descriptor learning method | |
CN112767467B (en) | Double-image depth estimation method based on self-supervision deep learning | |
CN113962858A (en) | Multi-view depth acquisition method | |
CN113762358A (en) | Semi-supervised learning three-dimensional reconstruction method based on relative deep training | |
CN114359509A (en) | Multi-view natural scene reconstruction method based on deep learning | |
CN113313732A (en) | Forward-looking scene depth estimation method based on self-supervision learning | |
CN113570658A (en) | Monocular video depth estimation method based on depth convolutional network | |
CN114299405A (en) | Unmanned aerial vehicle image real-time target detection method | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN116958420A (en) | High-precision modeling method for three-dimensional face of digital human teacher | |
CN115984349A (en) | Depth stereo matching algorithm based on central pixel gradient fusion and global cost aggregation | |
CN116772820A (en) | Local refinement mapping system and method based on SLAM and semantic segmentation | |
CN115375838A (en) | Binocular gray image three-dimensional reconstruction method based on unmanned aerial vehicle | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN112686830B (en) | Super-resolution method of single depth map based on image decomposition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||