CN111340867A - Depth estimation method and device for image frame, electronic equipment and storage medium - Google Patents

Depth estimation method and device for image frame, electronic equipment and storage medium Download PDF

Info

Publication number
CN111340867A
CN111340867A
Authority
CN
China
Prior art keywords
depth
image
image frames
optical flow
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010121139.7A
Other languages
Chinese (zh)
Other versions
CN111340867B (en)
Inventor
刘永进
赵旺
舒叶芷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Deep Blue Technology Shanghai Co Ltd
Original Assignee
Tsinghua University
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Deep Blue Technology Shanghai Co Ltd filed Critical Tsinghua University
Priority to CN202010121139.7A priority Critical patent/CN111340867B/en
Publication of CN111340867A publication Critical patent/CN111340867A/en
Application granted granted Critical
Publication of CN111340867B publication Critical patent/CN111340867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a depth estimation method and device for an image frame, an electronic device and a storage medium. Pixel correspondences are extracted with an optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness. The camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.

Description

Depth estimation method and device for image frame, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a depth estimation method and device for an image frame, electronic equipment and a storage medium.
Background
Monocular depth estimation is a hot topic in computer vision and robotics and has wide application in fields such as autonomous driving, robot navigation and three-dimensional reconstruction; visual odometry, i.e. camera pose estimation, is likewise an important problem in robotics. In a video sequence, depth information and camera pose information constrain and influence each other, so the joint solution and application of depth estimation and pose estimation is receiving increasing attention.
The development of neural networks and the rise of deep learning have brought new ideas and solutions to traditional computer vision tasks, making it possible to learn depth estimation from large-scale data. Data-driven supervised learning requires raw data and corresponding labels; however, obtaining depth labels for color (RGB) images is difficult, especially in outdoor scenes, where lidar provides only sparse point-cloud depth and an RGB-D camera cannot accurately measure depth at long range. This poses challenges for collecting and using data and for designing algorithmic systems.
The core idea of deep learning systems based on multi-task and unsupervised learning is to exploit the mutual constraints among several tasks to construct a loss function that supervises the neural network, so that no label supervision is needed. The image depth prediction task and the camera pose prediction task can be trained without label information because a constraint can be constructed through back-projection, pose transformation and projection reconstruction of the depth map. Most existing multi-task, unsupervised systems for image depth prediction and pose prediction use two independent end-to-end neural networks, PoseNet and DepthNet, to predict camera pose change and image depth respectively, and then compute a loss function. However, using PoseNet to predict camera pose changes is not robust enough: a trained PoseNet cannot give effective predictions for pose distributions that do not appear in the training set, indicating limited generalization.
Therefore, how to improve the stability and robustness of the system while inheriting the advantages of the unsupervised training neural network is a problem to be solved.
Disclosure of Invention
In view of the problems of the existing methods, embodiments of the present invention provide a method and an apparatus for estimating the depth of an image frame, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a method for estimating depth of an image frame, including:
acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the camera pose change estimated value and the corresponding relation between partial pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
Optionally, the performing confidence level sampling on the corresponding relationship between all pixels of the two image frames, and estimating a relative change value of the camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value specifically includes:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Optionally, the depth prediction network is based on an encoder-decoder architecture and adds a skip connection between the encoder and the decoder.
Optionally, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
In a second aspect, an embodiment of the present invention further provides a depth estimation apparatus for an image frame, including:
the optical flow prediction module is used for acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
the point cloud acquisition module is used for carrying out confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and carrying out triangularization operation according to the corresponding relation between the camera pose change estimated value and part of pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
the network training module is used for calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the camera pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and the depth estimation module is used for respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
Optionally, the point cloud obtaining module is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Optionally, the depth prediction network is based on an encoder-decoder architecture and adds a skip connection between the encoder and the decoder.
Optionally, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the technical scheme, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a depth estimation method for an image frame according to an embodiment of the present invention;
FIG. 2 is an interaction diagram of a prediction model for depth estimation of an image frame according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image frame depth estimation apparatus according to an embodiment of the present invention;
fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a flowchart of a depth estimation method for an image frame provided in this embodiment, which includes:
s101, two adjacent image frames are obtained from a training video sequence, and the two image frames are respectively input into an optical flow prediction network obtained through unsupervised training, so that the corresponding relation between all pixels of the two image frames output by the optical flow prediction network is obtained.
Wherein the training video sequence is a video sequence used for training an optical flow prediction network and a depth prediction network.
The optical flow prediction network and the depth prediction network are unsupervised and do not need to input annotation information other than image frames.
An unsupervised training process for the optical flow prediction network comprises: inputting two image frames into the optical flow prediction network, reconstructing the second frame image by using the first frame image and the corresponding optical flow prediction result, and training the neural network by minimizing the error between the reconstructed image and the original image; the optical flow prediction network explicitly computes an image occlusion mask from the predicted optical flow values, and occluded regions do not participate in the computation of the reconstruction error; at the same time, the consistency loss between the forward and backward optical flows of the two frames is computed to give a confidence for the pixel correspondences established from the optical flow.
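As an illustration of this warping-based supervision, the following is a minimal sketch of the photometric reconstruction term, assuming PyTorch tensors of shape (B, C, H, W), flow stored as per-pixel (Δx, Δy) displacements, and an occlusion mask equal to 1 at visible pixels; it is not the patent's exact implementation, and the SSIM component of the reconstruction error is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Bilinearly sample `image` at positions shifted by `flow` (B, 2, H, W), in pixels."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W), (x, y) order
    coords = grid.unsqueeze(0) + flow                               # target pixel coordinates
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(image, norm_grid, mode="bilinear", align_corners=True)

def masked_reconstruction_loss(frame1, frame2, flow_fwd, occlusion_mask):
    """L1 photometric error between frame1 and frame2 warped back by the forward flow,
    evaluated only where occlusion_mask == 1 (pixels visible in both frames)."""
    frame1_rec = warp_with_flow(frame2, flow_fwd)
    diff = occlusion_mask * (frame1 - frame1_rec).abs()
    return diff.sum() / occlusion_mask.sum().clamp(min=1.0)
```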
S102, performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of the camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the corresponding relation between the camera pose change estimated value and part of the pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system.
The estimation of the camera pose means solving for the change of the camera extrinsic parameters from an image or video sequence and recovering the motion trajectory of the camera. Pose estimation is similar to the localization task in simultaneous localization and mapping (SLAM), where localization is achieved by solving for relative poses. In the traditional SLAM algorithm, correspondences between pixels are established using hand-crafted visual features, such as scale-invariant feature transform (SIFT) descriptors or oriented FAST and rotated BRIEF (ORB) descriptors, and the relative pose change of the camera is then obtained by solving a least-squares problem under the epipolar geometry constraint. By combining noise-suppression algorithms such as random sample consensus (RANSAC), a more accurate and robust camera pose can be recovered. However, the accuracy of the pose obtained in this way depends on the pixel correspondences established from SIFT or ORB, so pose estimation accuracy degrades greatly when the camera moves fast or image quality is low and the correspondences become inaccurate. The depth estimation method for image frames provided by this embodiment establishes pixel correspondences from deep-learning-based optical flow, making the acquisition of pixel correspondences more accurate and robust.
The triangulation operation is differentiable, so end-to-end training can be guaranteed; the triangulation that produces the point cloud is based on the small number of high-confidence pixel correspondences obtained by confidence sampling, i.e., the triangulated point cloud is sparse.
S103, calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map.
Specifically, the obtained sparse point cloud is projected to reconstruct a sparse depth map; the predicted depth map is scale-transformed so as to minimize its error with respect to the sparse reconstructed depth map, and the L2 error between the sparse reconstructed depth map and the scale-transformed predicted depth map is then used to supervise the depth prediction network.
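As a sketch of this scale alignment, the snippet below (an illustrative assumption rather than the patent's code) computes the closed-form least-squares scale factor over the pixels where the triangulated sparse depth is defined, and then the masked L2 supervision term.

```python
import torch

def sparse_depth_loss(pred_depth, sparse_depth, valid_mask):
    """pred_depth, sparse_depth, valid_mask: (B, 1, H, W) tensors;
    valid_mask is 1 where the projected triangulated point cloud gives a depth value."""
    # closed-form scale c minimizing || valid * (sparse - c * pred) ||^2
    num = (valid_mask * sparse_depth * pred_depth).sum()
    den = (valid_mask * pred_depth * pred_depth).sum().clamp(min=1e-8)
    c = num / den
    err = valid_mask * (sparse_depth - c * pred_depth) ** 2
    return err.sum() / valid_mask.sum().clamp(min=1.0)
```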
S104, respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
The depth estimation of the image frame refers to acquiring a scene depth value corresponding to each pixel from a single color image. The acquisition of scene depth values from a sequence of images of a monocular camera or from a single image is an underdetermined problem: the projections of all depth estimation values differing by a constant scale factor are the same on the image plane and conform to the current color picture, so that the true size of the depth value cannot be recovered, and only the relative size of the depth value can be recovered.
The embodiment provides a monocular camera pose and image depth prediction method based on multitask and unsupervised learning, which comprises the following steps: acquiring two adjacent image frames from a training video sequence; inputting the two frames of images into an optical flow prediction network obtained through unsupervised training together, and outputting the corresponding relation between all pixels of the two frames of images; inputting the two frames of images into a depth prediction network respectively, and outputting depth estimation values of the two frames of images; performing confidence coefficient sampling on the corresponding relation between all pixels of the two frames of images, selecting partial pixels with the highest confidence coefficient, inputting the partial pixels into an eight-point method and a random sample consensus (RANSAC) algorithm, and estimating a relative change value of the camera pose; performing triangularization operation according to the estimated value of the pose change of the camera and the corresponding relation of the pixels to obtain point cloud in a coordinate system of the three-dimensional camera; calculating a point cloud projection reconstruction depth map to supervise a depth prediction network, simultaneously performing inverse projection-transformation-projection reconstruction on depth prediction values of two frames of images by using a camera pose relative change estimation value, and realizing the training of the depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; and applying the trained optical flow network, depth prediction network and pose calculation algorithm to the pose and image depth estimation of the monocular camera.
In this embodiment, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Further, on the basis of the above method embodiment, the performing confidence level sampling on the corresponding relationship between all pixels of the two image frames in S102, and estimating a relative change value of the camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value specifically includes:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
The confidence sampling is carried out according to the occlusion mask output by the optical flow prediction network and the consistency loss of the forward and backward optical flows; after the camera fundamental matrix is obtained by the eight-point method and the RANSAC algorithm, the optimal rotation and translation matrices are determined by verifying the feasibility of the triangulated point cloud, where feasibility means that the rotation and translation solved from the fundamental matrix must place the triangulated points in front of both camera planes.
Further, on the basis of the above method embodiment, the depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
Where the output of the depth prediction network is disparity, i.e. the inverse of depth.
The fineness of the output results is improved by adding a skip connection between the encoder and the decoder.
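A minimal sketch of such an encoder-decoder with a skip connection is shown below; the patent does not specify the network at this level of detail, so the two-stage structure, the layer sizes and the sigmoid disparity output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder predicting disparity (inverse depth) with one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # the last decoder stage receives the concatenation of its input and the
        # matching encoder feature map (the skip connection)
        self.dec1 = nn.ConvTranspose2d(32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # 1/2 resolution
        e2 = self.enc2(e1)                       # 1/4 resolution
        d2 = self.dec2(e2)                       # back to 1/2 resolution
        d1 = self.dec1(torch.cat([d2, e1], 1))   # skip connection from enc1
        return torch.sigmoid(d1)                 # disparity in (0, 1)
```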
Further, on the basis of the above method embodiment, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating an L1 error and an SSIM error between the reconstructed image and the original image as a supervision signal;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
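Since the exact form of L_smooth appears only as an equation image in the original filing, the sketch below uses a commonly used edge-aware smoothness formulation (flow gradients down-weighted across strong image gradients) purely as an assumed stand-in for that term.

```python
import torch

def edge_aware_smoothness(flow, image):
    """flow: (B, 2, H, W) predicted optical flow; image: (B, 3, H, W) first frame."""
    flow_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    flow_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    img_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    img_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # penalize flow gradients less strongly across image edges
    return (flow_dx * torch.exp(-img_dx)).mean() + (flow_dy * torch.exp(-img_dy)).mean()
```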
Specifically, a constraint relation is established from the predicted depth maps of the two image frames and the obtained relative camera pose change, namely: the predicted depth map of the first frame is back-projected into three-dimensional space, the resulting point cloud is transformed with the camera pose change parameters (rotation and translation) and projected onto the image plane to reconstruct the depth map of the second frame, and the L2 error between the reconstructed second-frame depth map and the predicted second-frame depth map is used to supervise the depth prediction network.
And calculating an edge-related image smoothing loss function through the obtained predicted depth maps of the two image frames, so that the predicted depth maps conform to the smooth change characteristic of the image and the sharp change characteristic of the edge.
The obtained prediction depth maps of the two image frames are subjected to inverse projection-rotational translation transformation-projection operation, so that a pixel corresponding relation between the two image frames can be established, the corresponding relation is compared with the corresponding relation established based on the optical flow, and supervision signals for a depth prediction network and the optical flow prediction network are generated.
The method comprises the steps that a deep neural network system based on multi-task learning and unsupervised learning simultaneously predicts an optical flow, a pose and a depth value during training, and obtains a loss function by utilizing inherent constraints of three tasks; during testing, tasks are decoupled, a depth prediction network or an optical flow prediction network can be tested independently, or the pose of a camera can be solved according to the optical flow prediction result.
Specifically, the depth estimation method for image frames provided by this embodiment may include the following specific steps:
step a1, two adjacent image frames are obtained from the training video sequence.
The adjacent two frames of images are not necessarily two adjacent frames in the original video sequence, but may be two images separated by several frames in time.
Step A2, inputting the two frames of images into an optical flow prediction network obtained through unsupervised training, and outputting the corresponding relation between all pixels of the two frames of images;
the optical flow prediction network inputs the stacking of two images and outputs the movement distance (delta x, delta y) from each pixel in the first image to the corresponding pixel in the second image; the corresponding pixels here refer to a pair of pixels in the two frame images that refer to the same region of the real world.
The optical flow prediction network is used to obtain the optical flow f_1 from the first frame to the second frame and the optical flow f_2 from the second frame to the first frame. Ideally, a pixel transformed by f_1 and then by f_2 should return to its original position. The distance ΔD between the pixel position after applying f_1 followed by f_2 and the original position before the transformation gives the consistency confidence C of the pixel correspondences (the exact expression of C is given as an equation image in the original filing).
The confidence is used in subsequent sampling to obtain reliable pixel correspondences for calculating camera pose changes and triangularizing operations.
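A minimal sketch of this forward-backward check is given below, reusing the warp_with_flow helper from the earlier sketch; since the exact confidence expression appears only as an equation image in the original filing, the exponential mapping from ΔD to a confidence is an assumed illustrative choice.

```python
import torch

def flow_consistency_confidence(flow_fwd, flow_bwd):
    """flow_fwd, flow_bwd: (B, 2, H, W). For consistent, unoccluded pixels the
    composition f1 + warp(f2, f1) should be close to zero."""
    flow_bwd_warped = warp_with_flow(flow_bwd, flow_fwd)   # f2 sampled where f1 points
    delta = torch.norm(flow_fwd + flow_bwd_warped, dim=1, keepdim=True)  # ΔD, (B, 1, H, W)
    return torch.exp(-delta)   # assumed mapping: confidence in (0, 1], high where flows agree
```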
And step A3, inputting the two frames of images into a depth prediction network respectively, and outputting depth estimation values of the two frames of images.
The depth prediction network is based on an encoder-decoder structure, and a skip connection between the encoder and the decoder is added to improve the fineness of the output; the output of the depth prediction network is the disparity, i.e. the inverse form of the depth.
And A4, performing confidence coefficient sampling on the corresponding relation between all pixels of the two frames of images, and selecting partial pixels with the highest confidence coefficient to input into an eight-point method and a random sample consensus RANSAC algorithm to estimate the relative change value of the camera pose.
The confidence sampling is based on the occlusion mask M_o output by the optical flow prediction network and the forward-backward optical flow consistency confidence C; after the camera fundamental matrix is obtained by the eight-point method and the random sample consensus (RANSAC) algorithm, the optimal rotation and translation matrices are determined by verifying the feasibility of the triangulated point cloud.
The eight-point method obtains the fundamental matrix by solving a least-squares problem, and the random sample consensus algorithm checks how well the obtained fundamental matrix agrees with the existing pixel correspondences through a loop of sampling, hypothesizing, solving and verification, finally removing noisy points such as dynamic objects in the scene and inaccurately predicted pixel correspondences. From the finally obtained fundamental matrix, the distance from each pair of corresponding pixels to its corresponding epipolar line can be computed; ideally, corresponding pixels should fall exactly on the epipolar line. Binarizing this distance with a suitable threshold yields an inlier mask M_i, in which a value of 0 marks a likely dynamic object or an inaccurate pixel correspondence and a value of 1 marks a reliable pixel correspondence.
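A minimal sketch of this step, using OpenCV as one possible implementation (an assumption for illustration, not the patent's own code), is shown below: the highest-confidence flow correspondences are selected, the relative pose is estimated with RANSAC, and the returned inlier mask plays the role of M_i. The calibrated essential-matrix variant is used here; cv2.findFundamentalMat with cv2.FM_RANSAC would be the uncalibrated eight-point alternative.

```python
import numpy as np
import cv2

def estimate_relative_pose(pts1, pts2, confidence, K, num_samples=2000):
    """pts1, pts2: (N, 2) matched pixel coordinates from the predicted flow;
    confidence: (N,) per-correspondence confidence; K: (3, 3) camera intrinsics."""
    # confidence sampling: keep only the most reliable correspondences
    keep = np.argsort(-confidence)[:num_samples]
    p1 = pts1[keep].astype(np.float64)
    p2 = pts2[keep].astype(np.float64)
    # RANSAC-based estimation; the returned mask plays the role of the inlier mask M_i
    E, inlier_mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # cheirality check: recoverPose keeps the (R, t) whose triangulated points
    # lie in front of both camera planes
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=inlier_mask)
    return R, t, inlier_mask
```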
And A5, performing triangularization operation according to the camera pose change estimation value and the pixel corresponding relation to obtain point cloud in a three-dimensional camera coordinate system.
The triangulation operation is differentiable, so end-to-end training can be guaranteed; the triangulation that produces the point cloud is based on the small number of high-confidence pixel correspondences obtained by confidence sampling, i.e., the triangulated point cloud is sparse.
The triangulation method adopts the mid-point triangulation algorithm, which has the advantages of being simple, easy to compute and geometrically clear.
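The following NumPy sketch illustrates mid-point triangulation for a single pixel correspondence; the frame convention (camera 1 at the origin, with (R, t) mapping camera-1 coordinates to camera-2 coordinates) and the variable names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def midpoint_triangulate(p1, p2, K, R, t):
    """p1, p2: (2,) pixel coordinates in frames 1 and 2; K: (3, 3) intrinsics;
    R, t: relative pose mapping camera-1 coordinates to camera-2 coordinates."""
    K_inv = np.linalg.inv(K)
    d1 = K_inv @ np.array([p1[0], p1[1], 1.0])           # ray direction of camera 1
    d2 = R.T @ (K_inv @ np.array([p2[0], p2[1], 1.0]))   # camera-2 ray, expressed in frame 1
    c1 = np.zeros(3)                                     # camera-1 center
    c2 = -R.T @ t                                        # camera-2 center in frame 1
    # closest points on the two rays: solve s1*d1 - s2*d2 = c2 - c1 in least squares
    A = np.stack([d1, -d2], axis=1)                      # (3, 2)
    s, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    x1 = c1 + s[0] * d1
    x2 = c2 + s[1] * d2
    return 0.5 * (x1 + x2)                               # midpoint of the shortest segment
```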
A6, calculating a point cloud projection reconstruction depth map to supervise a depth prediction network, performing inverse projection-transformation-projection reconstruction on depth prediction values of two frames of images by using a camera pose relative change estimation value, and realizing the training of the depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; the projection relation of the two frames of depth maps can be used for establishing the corresponding relation of pixels, the corresponding relation of the pixels is compared with the corresponding relation of the pixels established according to the predicted optical flow, and supervision signals for the depth prediction network and the optical flow prediction network are generated.
The loss function L_d supervising the depth prediction network comprises four terms:
L_d = w_1·L_td + w_2·L_pd + w_3·L_sd + w_4·L_fd
where L_td is the loss term supervised by the triangulated depth map, L_pd is the loss term of the mutual projection reconstruction between the two frame depth maps, L_sd is the smoothness loss term of the depth maps, and L_fd is the loss term comparing the pixel correspondences computed from the two depth maps and the camera pose with the pixel correspondences computed from the predicted optical flow. Specifically:
The expression of L_td is given as an equation image in the original filing, where D_tri is the sparse depth map reconstructed by projecting the triangulated sparse point cloud D_t, and c is the calculated optimal scale transformation factor. The expression of L_pd is likewise given as an equation image, with
p_2d = φ(K[T_12·D_1(p_1)·K⁻¹·h(p_1)])
where p_1 denotes a pixel position (x, y), M_o is the occlusion mask, M_i is the inlier mask, and a normalization term (shown only as an image in the original) is used for regularization; D_1 is the predicted depth map of the first frame, and the expression also involves the second-frame depth map reconstructed from the first-frame depth map and the second-frame depth map obtained by interpolation at the corresponding grid points p_2d (both denoted by symbols given only as images in the original); K is the camera intrinsic matrix, T_12 denotes the relative change of the camera pose, φ denotes the transformation from the camera coordinate system to the pixel coordinate system, and h denotes conversion to homogeneous coordinates. The expression of L_fd is given as an equation image, with
p_2f = p_1 + F_12(p_1)
where F_12 denotes the optical flow prediction result. The expression of L_sd is also given as an equation image in the original filing.
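The back-projection, pose transformation and projection expressed by p_2d = φ(K[T_12·D_1(p_1)·K⁻¹·h(p_1)]) can be sketched as follows (a NumPy illustration with assumed variable names, taking T_12 as a rotation R and translation t); the resulting pixel positions can be compared with the flow-based correspondences for L_fd, or used to interpolate a reconstructed second-frame depth map for L_pd.

```python
import numpy as np

def reproject(depth1, K, R, t):
    """depth1: (H, W) predicted depth of frame 1; K: (3, 3) intrinsics;
    R, t: relative pose T_12 from frame 1 to frame 2.
    Returns p2d: (H, W, 2) pixel positions in frame 2, and z2: (H, W) transformed depth."""
    h, w = depth1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix_h = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # h(p1)
    rays = pix_h @ np.linalg.inv(K).T              # K^{-1} h(p1), applied per pixel
    pts1 = rays * depth1[..., None]                # back-projected 3D points in frame 1
    pts2 = pts1 @ R.T + t                          # T_12 applied: points in frame 2
    proj = pts2 @ K.T                              # K [X Y Z]^T, per pixel
    p2d = proj[..., :2] / np.clip(proj[..., 2:3], 1e-8, None)  # phi: perspective division
    return p2d, pts2[..., 2]
```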
and A7, applying the trained optical flow prediction network, depth prediction network and pose calculation algorithm to the pose and image depth estimation of the monocular camera.
Fig. 2 is a frame diagram of a testing process of the method for predicting camera pose, optical flow between images, and pose change according to this embodiment, where the process includes:
and step B1, inputting a video sequence into the trained optical flow prediction network to obtain the optical flow prediction between adjacent image frames.
The occlusion mask and the forward-backward optical flow consistency confidence still need to be calculated for confidence sampling.
And step B2, obtaining the correspondences between pixels from the optical flow prediction, obtaining more reliable pixel correspondences through confidence sampling, solving the fundamental matrix with the eight-point method and the random sample consensus algorithm, and obtaining the optimal rotation and translation matrices through triangulation verification.
The sampling and solving procedure is the same as in training, but the random sample consensus algorithm can be run iteratively many times to obtain more accurate results.
And step B3, inputting the same video sequence into the trained depth prediction network to obtain the depth map prediction result of each frame.
The prediction of the depth map is done frame by frame and does not require the entire piece of video context information.
The deep neural network system based on multi-task learning and unsupervised learning predicts optical flow, pose and depth value simultaneously during training, and obtains a loss function by utilizing the inherent constraints of three tasks; during testing, tasks are decoupled, a depth prediction network or an optical flow prediction network can be tested independently, or the pose of a camera can be solved according to the optical flow prediction result.
In prior-art work on unsupervised learning of image depth and camera pose, the learning of an end-to-end pose prediction neural network depends on the distribution of the training data, its generalization capability is poor, and its performance on data that does not appear in the training set is poor. Therefore, this embodiment provides a camera pose and image depth prediction method based on multi-task and unsupervised learning that combines a neural network, which effectively estimates optical flow, with an epipolar-geometry physical model for solving the pose, realizing a more robust and better-generalizing deep learning system.
Fig. 3 is a schematic structural diagram illustrating a depth estimation apparatus for image frames according to this embodiment, where the apparatus includes: an optical flow prediction module 301, a point cloud acquisition module 302, a network training module 303 and a depth estimation module 304; wherein:
the optical flow prediction module 301 is configured to obtain two adjacent image frames from a training video sequence, and input the two image frames into an optical flow prediction network obtained through unsupervised training, respectively, to obtain a corresponding relationship between all pixels of the two image frames output by the optical flow prediction network;
the point cloud obtaining module 302 is configured to perform confidence level sampling on correspondence between all pixels of the two image frames, estimate a relative change value of a camera pose according to a result of the confidence level sampling to obtain a camera pose change estimation value, and perform triangulation operation according to the camera pose change estimation value and correspondence between some pixels after sampling between the two image frames to obtain a point cloud in a three-dimensional camera coordinate system;
the network training module 303 is configured to calculate a projection of the point cloud, reconstruct a depth map, perform inverse projection-transformation-projection reconstruction on depth prediction values of the two image frames in the depth map according to the pose change estimation value, and implement training on a depth prediction network by minimizing an error between the reconstructed depth map and a predicted depth map;
the depth estimation module 304 is configured to input the two image frames into the depth prediction network, respectively, to obtain depth estimation values of the two image frames output by the depth prediction network.
Specifically, the optical flow prediction module 301 obtains two adjacent image frames from a training video sequence, and respectively inputs the two image frames into an optical flow prediction network obtained through unsupervised training, so as to obtain a corresponding relationship between all pixels of the two image frames output by the optical flow prediction network; the point cloud obtaining module 302 performs confidence coefficient sampling on the corresponding relationship between all pixels of the two image frames, estimates a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimation value, and performs triangulation operation according to the camera pose change estimation value and the corresponding relationship between part of the pixels after sampling between the two image frames to obtain a point cloud in a three-dimensional camera coordinate system; the network training module 303 calculates the projection of the point cloud, reconstructs a depth map, performs inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the machine pose change estimation value, and realizes the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map; the depth estimation module 304 inputs the two image frames into the depth prediction network, respectively, to obtain depth estimation values of the two image frames output by the depth prediction network.
In this embodiment, pixel correspondences are extracted with the optical flow prediction network obtained through unsupervised training, replacing matching based on traditional hand-crafted image features such as SIFT, so that the determination of relationships between pixels becomes more accurate, and confidence sampling is introduced to further improve robustness; the camera pose relationship is solved from the established pixel correspondences instead of being estimated end to end, which greatly improves the generalization capability and the application performance of the whole system.
Further, on the basis of the above apparatus embodiment, the point cloud obtaining module 302 is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
Further, on the basis of the above apparatus embodiment, the depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
Further, on the basis of the above device embodiment, the training process of the optical flow prediction network is as follows:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
The depth estimation apparatus for image frames described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Referring to fig. 4, the electronic device includes: a processor (processor)401, a memory (memory)402, and a bus 403;
wherein,
the processor 401 and the memory 402 complete communication with each other through the bus 403;
the processor 401 is configured to call program instructions in the memory 402 to perform the methods provided by the above-described method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of depth estimation of an image frame, comprising:
acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
performing confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to a result of the confidence coefficient sampling to obtain a camera pose change estimated value, and performing triangularization operation according to the camera pose change estimated value and the corresponding relation between partial pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
2. The method for depth estimation of image frames according to claim 1, wherein the confidence sampling is performed on correspondence between all pixels of the two image frames, and a relative change value of a camera pose is estimated according to a result of the confidence sampling to obtain a camera pose change estimation value, specifically comprising:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
3. The method of depth estimation of image frames according to claim 1, characterized in that said depth prediction network is based on an encoder-decoder structure and adds a skip connection between the encoder and the decoder.
4. The method for estimating the depth of the image frame according to claim 1, wherein the training process of the optical flow prediction network is:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
5. An apparatus for depth estimation of an image frame, comprising:
the optical flow prediction module is used for acquiring two adjacent image frames from a training video sequence, and respectively inputting the two image frames into an optical flow prediction network obtained through unsupervised training to obtain the corresponding relation between all pixels of the two image frames output by the optical flow prediction network;
the point cloud acquisition module is used for carrying out confidence coefficient sampling on the corresponding relation between all pixels of the two image frames, estimating a relative change value of a camera pose according to the result of the confidence coefficient sampling to obtain a camera pose change estimated value, and carrying out triangularization operation according to the corresponding relation between the camera pose change estimated value and part of pixels after sampling between the two image frames to obtain point cloud in a three-dimensional camera coordinate system;
the network training module is used for calculating the projection of the point cloud, reconstructing a depth map, performing inverse projection-transformation-projection reconstruction on depth predicted values of the two image frames in the depth map according to the camera pose change estimation value, and realizing the training of a depth prediction network by minimizing the errors of the reconstructed depth map and the predicted depth map;
and the depth estimation module is used for respectively inputting the image frames to be estimated into the depth prediction network to obtain the depth estimation values of the image frames to be estimated, which are output by the depth prediction network.
6. The image frame depth estimation device of claim 5, wherein the point cloud acquisition module is specifically configured to:
and performing confidence sampling on the correspondence between all pixels of the two image frames, and inputting the pixels with the highest confidence into the eight-point method and the random sample consensus algorithm to estimate the relative change value of the camera pose, so as to obtain a camera pose change estimation value.
7. The apparatus for depth estimation of image frames according to claim 5, wherein said depth prediction network is based on an encoder-decoder architecture and adds a skip connection between encoder and decoder.
8. The image frame depth estimation device of claim 5, wherein the optical flow prediction network is trained by:
inputting the two image frames to obtain a first corresponding relation between all pixels of the two image frames output by the optical flow prediction network; the first corresponding relation is optical flow from a first frame image to a second frame image;
carrying out bilinear interpolation sampling according to the first corresponding relation, the second frame image and the optical flow prediction result to reconstruct the first frame image, and calculating the L1 error and the SSIM error between the reconstructed image and the original image as supervision signals;
adding an edge-sensitive smoothness loss function L_smooth to the overall training loss function L_flow of the optical flow network, namely:
L_flow = L_recons + L_smooth
L_recons = ‖M_o(I_1 − I′_1)‖ + (1 − SSIM(M_o I_1, M_o I′_1))
(the expression of the edge-sensitive smoothness term L_smooth is given as an equation image in the original filing)
where I_1 is the first frame image, I′_1 is the reconstructed first frame image, f_1 is the forward optical flow prediction from the first frame to the second frame, and M_o is the occlusion mask computed from f_1, equal to 1 at unoccluded pixels and 0 at occluded pixels.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements a method of depth estimation of image frames according to any of claims 1 to 4.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, which, when being executed by a processor, implements a method of depth estimation of image frames according to any one of claims 1 to 4.
CN202010121139.7A 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium Active CN111340867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121139.7A CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121139.7A CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111340867A true CN111340867A (en) 2020-06-26
CN111340867B CN111340867B (en) 2022-10-18

Family

ID=71187112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121139.7A Active CN111340867B (en) 2020-02-26 2020-02-26 Depth estimation method and device for image frame, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111340867B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
CN109472830A (en) * 2018-09-28 2019-03-15 中山大学 A kind of monocular visual positioning method based on unsupervised learning
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348843A (en) * 2020-10-29 2021-02-09 北京嘀嘀无限科技发展有限公司 Method and device for adjusting depth image prediction model and electronic equipment
CN112184611A (en) * 2020-11-03 2021-01-05 支付宝(杭州)信息技术有限公司 Image generation model training method and device
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112381868B (en) * 2020-11-13 2024-08-02 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112116653B (en) * 2020-11-23 2021-03-30 华南理工大学 Object posture estimation method for multiple RGB pictures
CN112116653A (en) * 2020-11-23 2020-12-22 华南理工大学 Object posture estimation method for multiple RGB pictures
CN113298860A (en) * 2020-12-14 2021-08-24 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and storage medium
CN112561978A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Training method of depth estimation network, depth estimation method of image and equipment
CN112561978B (en) * 2020-12-18 2023-11-17 北京百度网讯科技有限公司 Training method of depth estimation network, depth estimation method of image and equipment
CN112672150A (en) * 2020-12-22 2021-04-16 福州大学 Video coding method based on video prediction
CN112954293A (en) * 2021-01-27 2021-06-11 北京达佳互联信息技术有限公司 Depth map acquisition method, reference frame generation method, encoding and decoding method and device
CN112954293B (en) * 2021-01-27 2023-03-24 北京达佳互联信息技术有限公司 Depth map acquisition method, reference frame generation method, encoding and decoding method and device
CN112991418A (en) * 2021-03-09 2021-06-18 北京地平线信息技术有限公司 Image depth prediction and neural network training method and device, medium and equipment
CN112991418B (en) * 2021-03-09 2024-03-29 北京地平线信息技术有限公司 Image depth prediction and neural network training method and device, medium and equipment
WO2022241874A1 (en) * 2021-05-18 2022-11-24 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision ranging method and related assembly
CN113781542A (en) * 2021-09-23 2021-12-10 Oppo广东移动通信有限公司 Model generation method, depth estimation device and electronic equipment
CN113899363A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Vehicle positioning method and device and automatic driving vehicle
US11953609B2 (en) 2021-09-29 2024-04-09 Beijing Baidu Netcom Science Technology Co., Ltd. Vehicle positioning method, apparatus and autonomous driving vehicle
CN114463409A (en) * 2022-02-11 2022-05-10 北京百度网讯科技有限公司 Method and device for determining image depth information, electronic equipment and medium
US11783501B2 (en) 2022-02-11 2023-10-10 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for determining image depth information, electronic device, and media
CN114463409B (en) * 2022-02-11 2023-09-26 北京百度网讯科技有限公司 Image depth information determining method and device, electronic equipment and medium
CN115272423B (en) * 2022-09-19 2022-12-16 深圳比特微电子科技有限公司 Method and device for training optical flow estimation model and readable storage medium
CN115272423A (en) * 2022-09-19 2022-11-01 深圳比特微电子科技有限公司 Method and device for training optical flow estimation model and readable storage medium
CN116758131B (en) * 2023-08-21 2023-11-28 之江实验室 Monocular image depth estimation method and device and computer equipment
CN116758131A (en) * 2023-08-21 2023-09-15 之江实验室 Monocular image depth estimation method and device and computer equipment

Also Published As

Publication number Publication date
CN111340867B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN111340867B (en) Depth estimation method and device for image frame, electronic equipment and storage medium
Zhu et al. Unsupervised event-based learning of optical flow, depth, and egomotion
US10553026B2 (en) Dense visual SLAM with probabilistic surfel map
US20170278302A1 (en) Method and device for registering an image to a model
CN106934827A (en) The method for reconstructing and device of three-dimensional scenic
EP3293700B1 (en) 3d reconstruction for vehicle
CN113392584B (en) Visual navigation method based on deep reinforcement learning and direction estimation
CN112639878A (en) Unsupervised depth prediction neural network
CN111598927B (en) Positioning reconstruction method and device
Wang et al. Quadtree-accelerated real-time monocular dense mapping
Jeon et al. Struct-MDC: Mesh-refined unsupervised depth completion leveraging structural regularities from visual SLAM
Karaoglu et al. Dynamon: Motion-aware fast and robust camera localization for dynamic nerf
Degol et al. Feats: Synthetic feature tracks for structure from motion evaluation
Li et al. Unsupervised joint learning of depth, optical flow, ego-motion from video
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Li et al. Dvonet: unsupervised monocular depth estimation and visual odometry
CN115638788A (en) Semantic vector map construction method, computer equipment and storage medium
WO2022087932A1 (en) Non-rigid 3d object modeling using scene flow estimation
Thakur et al. A conditional adversarial network for scene flow estimation
Wang et al. Motion Degeneracy in Self-supervised Learning of Elevation Angle Estimation for 2D Forward-Looking Sonar
Kim et al. Complex-Motion NeRF: Joint Reconstruction and Pose Optimization With Motion and Depth Priors
CN118037965B (en) Human body 3D gesture analysis method based on automatic variation correction under multi-eye vision
Wang et al. Self-supervised learning of depth and camera motion from 360° videos
Nadar et al. Sensor simulation for monocular depth estimation using deep neural networks
CN116465827B (en) Viewpoint path planning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant