CN114998411A - Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss - Google Patents

Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss

Info

Publication number
CN114998411A
Authority
CN
China
Prior art keywords
luminosity
reconstruction
depth
max
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210475411.0A
Other languages
Chinese (zh)
Other versions
CN114998411B (en)
Inventor
李嘉茂
张天宇
朱冬晨
张广慧
石文君
刘衍青
张晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202210475411.0A priority Critical patent/CN114998411B/en
Publication of CN114998411A publication Critical patent/CN114998411A/en
Application granted granted Critical
Publication of CN114998411B publication Critical patent/CN114998411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss. The method comprises the following steps: acquiring a plurality of adjacent frame images in an image sequence; and inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional automatic mask prevents pixels of moving objects from participating in the photometric error calculation. The method improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network.

Description

Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss
Technical Field
The invention relates to the technical field of computer vision, and in particular to a self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss.
Background
Estimating the depth information of a scene from an image, i.e., image depth estimation, is a fundamental and important task in computer vision. A good image depth estimation algorithm can be applied to outdoor driving scenes, indoor small robots, and other fields, and therefore has great application value. During operation, a robot or an autonomous vehicle uses a depth estimation algorithm to obtain scene depth information that assists path planning or obstacle avoidance for its next movement.
Depth estimation from images is divided into supervised and self-supervised approaches. The supervised approach mainly uses a neural network to establish a mapping between an image and a depth map and trains it under the supervision of ground truth, so that the network gradually learns to fit depth. However, because ground truth for supervised training is expensive to acquire, the self-supervised approach has become mainstream in recent years. Compared with methods that require binocular image pairs for training, methods based on image sequences have attracted wide attention from researchers owing to their broader range of applications.
A self-supervised monocular depth framework based on image sequences mainly comprises a depth estimation network and a pose estimation network, which respectively predict the depth of a target frame and the pose transformation between the target frame and a source frame. Using the estimated depth and pose, the source frame is warped into the coordinate system of the target frame to obtain a reconstructed image, and the photometric difference between the target frame and the reconstructed image, i.e., the photometric loss, supervises the simultaneous training of the two networks. As the photometric loss decreases, the depth estimated by the network becomes increasingly accurate.
A spatial transformation model must be adopted when generating the photometric loss. Although the existing spatial transformation model theoretically conforms to rigid-body transformation, errors in the translation vector of the pose introduce depth estimation errors during the computation: the larger the depth, the larger the depth estimation error. In addition, to address the inaccurate photometric loss caused by moving pixels that violate photometric consistency, the main idea of existing approaches is to compute, during training, a binary mask that filters out pixels whose appearance does not change from one frame to the next; however, such a binary mask can only distinguish objects that move in the same direction as the camera.
Disclosure of Invention
The inventors of the present invention found that the reason why a larger depth leads to a larger depth estimation error is as follows. The purpose of the spatial transformation is to transform the corresponding pixels of the target frame and the source frame so that they coincide on the pixel plane; suppose a near point $P_N$ is used to solve for the correspondence between the pixels $p_t$ and $p_s$, as shown in FIG. 1. The principle of self-supervised depth estimation is to make the estimated pose and depth more accurate by minimizing the distance between $p_t$ and $p_s$. For near regions, as shown in FIG. 1, given a certain number of points, the estimated pose only becomes accurate, and the depth only performs well, when $p_t$ and the transformed point $p_F$ roughly coincide. For distant regions, as shown in FIG. 2, the accuracy of the predicted rotation matrix alone is enough to make the photometric error between $p_t$ and $p_s$ small. Consequently, if the photometric error is constructed from the estimated rotation matrix and translation vector without distinguishing near from far, its uncertainty increases greatly, which degrades the depth estimation result.
The technical problem to be solved by the invention is to provide a self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss, which can improve the accuracy of the photometric loss and thereby better supervise the learning of the depth network.
The technical solution adopted by the invention to solve this technical problem is as follows: a self-supervised monocular depth estimation method combined with spatio-temporal enhanced photometric loss, comprising the following steps:
acquiring a plurality of adjacent frame images in an image sequence;
inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional automatic mask prevents pixels of moving objects from participating in the photometric error calculation.
Obtaining the photometric loss information from the spatial transformation model based on depth-aware pixel correspondence specifically comprises:
performing a spatial transformation of the far region with a homography matrix and constructing a first reconstruction map, wherein the far region is treated as the plane at infinity;
performing a spatial transformation with the fundamental matrix and constructing a second reconstruction map;
computing a photometric error map from the first reconstruction map and a photometric error map from the second reconstruction map through the two pixel correspondences, and then taking the pixel-wise minimum to obtain the final photometric loss information.
Preventing pixels of a moving object from participating in the photometric error calculation with the omnidirectional automatic mask specifically comprises:
predicting an initial depth and an initial pose of the target frame through a pre-trained network, and generating an initial reconstruction map;
adding perturbation terms to the initial pose and obtaining a plurality of hypothetical reconstruction frames by spatial transformation; generating a plurality of photometric error maps from the hypothetical reconstruction frames combined with the photometric values of the target frame, and obtaining a plurality of binary masks from the photometric error maps;
selecting the pixel-wise minimum of the plurality of binary masks as the final mask.
The perturbation terms are translational perturbation terms comprising $[t_{max}, 0, 0]$, $[-t_{max}, 0, 0]$, $[0, 0, t_{max}]$ and $[0, 0, -t_{max}]$, where $t_{max}$ denotes the maximum value in the initialized translation vector.
The technical scheme adopted by the invention for solving the technical problem is as follows: there is provided an apparatus for self-supervised monocular depth estimation in combination with spatio-temporal enhancement of photometric loss, comprising:
the acquisition module is used for acquiring a plurality of adjacent frame images in the image sequence;
the estimation module is used for inputting the image into a trained deep learning network to obtain depth information and pose information; luminosity loss information of the deep learning network is obtained based on a space transformation model of the depth perception pixel corresponding relation module, and pixels of a moving object are prevented from participating in luminosity error calculation by utilizing the omnidirectional automatic mask module.
The depth perception pixel correspondence module comprises:
the first construction unit is used for carrying out spatial transformation on the far region by using the homography matrix and constructing a first reconstruction map; wherein the far zone treats the far zone as a plane of infinity;
a second construction unit for performing spatial transformation using the basis matrix and constructing a second reconstruction map;
and the luminosity loss information acquisition unit is used for solving a luminosity error map based on the first reconstruction map and a luminosity error map based on the second reconstruction map through the corresponding relation of two pixels, and then selecting the minimum value pixel by pixel to obtain the final luminosity loss information.
The omnidirectional automatic mask module includes:
the initial reconstruction image generating unit is used for predicting the initial depth and the initial pose of the target frame through a pre-training network and generating an initial reconstruction image;
a binarization mask generating unit, which is used for adding interference items to the initial pose and obtaining a plurality of assumed reconstruction frames by utilizing space transformation; generating a plurality of luminosity error maps by using the assumed reconstruction frame and combining the luminosity of the target frame, and obtaining a plurality of binarization masks by using the luminosity error maps;
and the mask selecting unit is used for selecting the minimum value from the plurality of binary masks as a final mask.
The disturbance term is a translational disturbance term, and comprises the following steps: [ t ] of max ,0,0]、[-t max ,0,0]、[0,0,t max ]And [0,0, -t max ]Wherein, t max Representing the maximum value in the initialized translation vector.
Advantageous effects
Owing to the adoption of the above technical solution, the invention has the following advantages and positive effects compared with the prior art. The invention mines the pixel correspondence of distant regions through depth-aware pixel correspondence, alleviating the problem of inaccurate pixel correspondence in far regions, and obtains an omnidirectional binary mask through the omnidirectional automatic masking scheme so that pixels of moving objects do not participate in the photometric error calculation. By improving the spatial transformation and generating an automatic mask for dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network.
Drawings
FIG. 1 is a schematic diagram of a near point pose solution;
FIG. 2 is a schematic diagram of a remote point pose solution;
FIG. 3 is a schematic representation of the Monodepth2 basic framework;
FIG. 4 is a schematic diagram of the generation of the photometric loss in the first embodiment of the present invention;
FIG. 5 is a schematic diagram of the omnidirectional automatic mask in the first embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that, after reading the teaching of the invention, those skilled in the art may make various changes or modifications to the invention, and such equivalent forms likewise fall within the scope defined by the appended claims of the present application.
The first embodiment of the invention relates to a self-supervised monocular depth estimation method combined with spatio-temporal enhanced photometric loss, comprising: acquiring a plurality of adjacent frame images in an image sequence; and inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional automatic mask prevents pixels of moving objects from participating in the photometric error calculation.
The method of this embodiment can be used directly in general self-supervised monocular depth estimation; any work that takes the SfMLearner framework as its implementation principle can use the method of this embodiment. It is only necessary to replace the spatial transformation part of the original framework with the spatial transformation model based on depth-aware pixel correspondence of this embodiment, and to replace the automatic mask part with the omnidirectional automatic mask of the present application.
The invention is further illustrated below using the basic Monodepth2 framework of Godard et al. as an example.
For easier understanding, the overall framework of Monodepth2 is described first. As shown in FIG. 3, its input is three adjacent RGB frames of a sequence, and its output is the depth of the target frame and the pose transformation between the target frame and the source frames.
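As a purely illustrative aid, the following Python (PyTorch) sketch shows how such a two-network framework is commonly wired; the module names `depth_net` and `pose_net` and the tensor layout are assumptions of this sketch, not elements of the patented method.

```python
import torch

def framework_forward(depth_net, pose_net, frames):
    """Minimal sketch of a Monodepth2-style forward pass.

    frames: dict with 'target', 'prev' and 'next' RGB tensors of shape (B, 3, H, W).
    depth_net maps an image to a dense depth map; pose_net maps a concatenated
    frame pair to a relative pose (both assumed to be ordinary nn.Modules).
    """
    depth_t = depth_net(frames['target'])                       # (B, 1, H, W) target-frame depth
    poses = {k: pose_net(torch.cat([frames['target'], frames[k]], dim=1))
             for k in ('prev', 'next')}                          # pose from target to each source frame
    return depth_t, poses                                        # later used to reconstruct the target frame
```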
The basic framework of this embodiment is the same as that of FIG. 3. Since the improvements of this embodiment mainly concern the generation of the photometric loss by spatial transformation and the automatic mask, these two parts of Monodepth2 are described first.
Monodepth2 uses the same spatial transformation model as SfMLearner. Given the depth $D_t$ of the target frame $I_t$ obtained by the depth network and the pose $T_{t\to s} = [R_{t\to s} \mid t_{t\to s}]$ between the target frame $I_t$ and the source frame $I_s$ obtained by the pose network, a pair of corresponding pixels $p_t$ and $p_s$ in the two frames that project from the same 3D point satisfies

$D_s K^{-1} p_s = T_{t\to s} D_t K^{-1} p_t$

where $K$ is the camera intrinsic matrix. Since monocular depth has scale ambiguity, the following relation is used for the spatial transformation:

$p_s \sim K T_{t\to s} D_t K^{-1} p_t$

In this geometric transformation, $K T_{t\to s} K^{-1}$ is defined as the fundamental matrix $F$ used for the pixel correspondence between frames. This relation is then used to construct a reconstructed frame $\hat{I}_t^F$ by sampling the source frame at the projected pixel locations.
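A minimal PyTorch sketch of this warping step is given below; the tensor shapes, the variable names, and the use of bilinear `grid_sample` sampling are implementation assumptions for illustration, not requirements of the patent.

```python
import torch
import torch.nn.functional as F

def reconstruct_target(source, depth_t, K, K_inv, T_t2s):
    """Warp the source frame into the target view using D_t and T_{t->s}.

    source: (B, 3, H, W), depth_t: (B, 1, H, W), K / K_inv: (B, 3, 3), T_t2s: (B, 3, 4).
    """
    B, _, H, W = source.shape
    device = source.device
    # Homogeneous pixel grid p_t of shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
    pix = pix.unsqueeze(0).expand(B, -1, -1).to(device)
    # Back-project: P = D_t * K^{-1} p_t, then append a row of ones for the rigid transform
    cam_points = depth_t.view(B, 1, -1) * (K_inv @ pix)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    # Project into the source view: p_s ~ K [R | t] P
    proj = K @ (T_t2s @ cam_points)
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    # Normalize coordinates to [-1, 1] and bilinearly sample the source frame
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(source, grid, padding_mode='border', align_corners=True)
```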
From the target frame and the reconstructed frame, the photometric loss pe can be constructed; it consists of an L1 term and a Structural Similarity (SSIM) term:

$pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1$

where $\alpha$ is a hyper-parameter, set to 0.85 in Monodepth2.
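A sketch of this loss in PyTorch follows; the 3×3 average-pooling SSIM and the constants C1, C2 are the simplifications used in common re-implementations and are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified per-pixel SSIM computed with a 3x3 average-pooling window."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(img_a, img_b, alpha=0.85):
    """pe = alpha/2 * (1 - SSIM) + (1 - alpha) * L1, averaged over channels -> (B, 1, H, W)."""
    l1 = (img_a - img_b).abs().mean(dim=1, keepdim=True)
    ssim_term = (1 - ssim(img_a, img_b)).mean(dim=1, keepdim=True)
    return alpha / 2 * ssim_term + (1 - alpha) * l1
```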
The automatic mask of Monodepth2 mainly addresses the inaccurate photometric loss caused by moving pixels in the image that violate photometric consistency. Its main idea is to filter out, during training, the pixels whose appearance does not change from one frame to the next; the resulting binary mask μ is

$\mu = \bigl[\, pe(I_t, \hat{I}_t) < pe(I_t, I_s) \,\bigr]$

where $[\cdot]$ is the Iverson bracket used to generate the binary mask, $I_t$ is the target frame, $I_s$ is the source frame, and $\hat{I}_t$ is the reconstructed frame resulting from the spatial transformation.
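Expressed with the helpers sketched above (again illustrative only):

```python
# Monodepth2-style auto-mask: keep a pixel only when warping the source frame
# explains it better than the un-warped source frame itself (Iverson bracket).
def auto_mask(target, source, reconstructed, alpha=0.85):
    mu = photometric_error(target, reconstructed, alpha) < photometric_error(target, source, alpha)
    return mu.float()   # 1 where photometric consistency holds, 0 for suspect pixels
```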
For the photometric loss generated by the spatial transformation, this embodiment obtains it from a spatial transformation model based on depth-aware pixel correspondence, as shown in FIG. 4. The details are as follows.

During the spatial transformation, a sufficiently distant region can be regarded as the plane at infinity. Such a plane satisfies

$n^T P + d = 0$

where $n$ is the normal vector of the plane, $P$ is a 3D point on the plane, and $d$ is the distance from the camera center to the plane. Rearranging gives

$-\dfrac{n^T P}{d} = 1$

Substituting this into the spatial transformation relation yields

$p_s \sim K\left(R_{t\to s} - \dfrac{t_{t\to s}\, n^T}{d}\right) D_t K^{-1} p_t$

When $D_t$ tends to infinity, i.e. on the plane at infinity ($d \to \infty$), the translation term vanishes and the relation reduces to

$p_s \sim K R_{t\to s} D_t K^{-1} p_t$

Here $K R_{t\to s} K^{-1}$ is defined as the homography at infinity $H_\infty$. For the distant region, a reconstruction map $\hat{I}_t^H$ is therefore constructed by a spatial transformation that uses only the rotation matrix. For distinction, the reconstruction map obtained with the fundamental matrix is denoted $\hat{I}_t^F$.

Since monocular depth estimates suffer from scale ambiguity, the predicted depth cannot be used directly to choose between the two pixel correspondences. This embodiment therefore designs an adaptive selection: the two photometric error maps are computed from the two correspondences, and the minimum is taken pixel by pixel, giving the final photometric error

$pe_{\mathrm{final}} = \min\bigl( pe(I_t, \hat{I}_t^F),\; pe(I_t, \hat{I}_t^H) \bigr)$
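A sketch of this adaptive selection, reusing the helpers above, is given below; zeroing the translation column to obtain the rotation-only warp is one possible realization of the $H_\infty$ correspondence and is an assumption of this sketch.

```python
import torch

def depth_aware_photometric_error(target, source, depth_t, K, K_inv, T_t2s, alpha=0.85):
    """Pixel-wise minimum of the photometric errors of two reconstructions:
    full pose [R | t] (fundamental-matrix correspondence) vs. rotation only (H_inf)."""
    recon_F = reconstruct_target(source, depth_t, K, K_inv, T_t2s)      # full predicted pose
    T_rot = T_t2s.clone()
    T_rot[:, :, 3] = 0.0                                                # drop the translation
    recon_H = reconstruct_target(source, depth_t, K, K_inv, T_rot)      # rotation-only warp
    pe_F = photometric_error(target, recon_F, alpha)
    pe_H = photometric_error(target, recon_H, alpha)
    return torch.minimum(pe_F, pe_H)                                    # depth-aware photometric error
```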
for the omnidirectional automatic mask, the embodiment directly inputs the image sequence into the module, and after obtaining the mask result, the mask result is applied to the photometric error to shield the unreliable part, as shown in fig. 5, specifically as follows:
the embodiment introduces a Monodepth2 pre-training network to predict the initial of the target frameDepth D init And initial frame pose T init Further generating an initial reconstruction map I init . Because the depth and the pose are accurate, the luminosity error of the region which accords with the luminosity consistency is small, but the potential of the region which does not accord with the luminosity consistency is small.
According to the method, interference items are added to the initial pose, a plurality of interfered poses are introduced, and a plurality of assumed reconstruction frames are obtained after space transformation is utilized. Using these reconstructed frames I i Wherein, i ∈ {1,2, … }, in combination with the luminance of the target frame, a plurality of luminance error maps can be generated, and a plurality of binary masks can be obtained by using the magnitudes of the luminance error values, corresponding to the pixels of the moving object in each direction, as follows:
M i =[pe(I t ,I init ),pe(I t ,I i )]
in order to capture the object moving in each direction, the generated masks are minimized to obtain the final mask, namely:
M oA =min(M 1 ,M 2 ,…)
in the implementation process of the embodiment, only the translation vector is disturbed, and the specific translation disturbance item t i : t 1 =[t max ,0,0]、t 2 =[-t max ,0,0]、t 3 =[0,0,t max ]And t 4 =[0,0,-t max ]Wherein, t max Is the maximum value in the initialized translation vector.
In summary, the invention mines the pixel correspondence of distant regions through depth-aware pixel correspondence, alleviating the problem of inaccurate pixel correspondence in far regions, and obtains an omnidirectional binary mask through the omnidirectional automatic masking scheme so that pixels of moving objects do not participate in the photometric error calculation. By improving the spatial transformation and generating an automatic mask for dynamic objects, the invention improves the accuracy of the photometric loss and thereby better supervises the learning of the depth network. Applying the depth-aware pixel correspondence and the omnidirectional automatic mask of this embodiment to the Monodepth2 framework of Godard et al. therefore yields more accurate monocular depth estimation results.
A second embodiment of the present invention relates to an apparatus for self-supervised monocular depth estimation combined with spatio-temporal enhanced photometric loss, comprising: an acquisition module for acquiring a plurality of adjacent frame images in an image sequence; and an estimation module for inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained through a depth-aware pixel correspondence module based on a spatial transformation model, and an omnidirectional automatic mask module prevents pixels of moving objects from participating in the photometric error calculation.
The depth-aware pixel correspondence module comprises: a first construction unit for performing a spatial transformation of the far region with a homography matrix and constructing a first reconstruction map, wherein the far region is treated as the plane at infinity; a second construction unit for performing a spatial transformation with the fundamental matrix and constructing a second reconstruction map; and a photometric loss information acquisition unit for computing a photometric error map from the first reconstruction map and a photometric error map from the second reconstruction map through the two pixel correspondences, and then taking the pixel-wise minimum to obtain the final photometric loss information.
The omnidirectional automatic mask module comprises: an initial reconstruction map generating unit for predicting an initial depth and an initial pose of the target frame through a pre-trained network and generating an initial reconstruction map; a binary mask generating unit for adding perturbation terms to the initial pose, obtaining a plurality of hypothetical reconstruction frames by spatial transformation, generating a plurality of photometric error maps from the hypothetical reconstruction frames combined with the photometric values of the target frame, and obtaining a plurality of binary masks from the photometric error maps; and a mask selecting unit for selecting the pixel-wise minimum of the plurality of binary masks as the final mask. The perturbation terms are translational perturbation terms comprising $[t_{max}, 0, 0]$, $[-t_{max}, 0, 0]$, $[0, 0, t_{max}]$ and $[0, 0, -t_{max}]$, where $t_{max}$ denotes the maximum value in the initialized translation vector.

Claims (8)

1. A self-supervised monocular depth estimation method combined with spatio-temporal enhanced photometric loss, characterized by comprising the following steps:
acquiring a plurality of adjacent frame images in an image sequence;
inputting the images into a trained deep learning network to obtain depth information and pose information, wherein photometric loss information of the deep learning network is obtained from a spatial transformation model based on depth-aware pixel correspondence, and an omnidirectional automatic mask prevents pixels of moving objects from participating in the calculation of photometric errors.
2. The self-supervised monocular depth estimation method combined with spatio-temporal enhanced photometric loss according to claim 1, wherein obtaining the photometric loss information from the spatial transformation model based on depth-aware pixel correspondence specifically comprises:
performing a spatial transformation of the far region with a homography matrix and constructing a first reconstruction map, wherein the far region is treated as the plane at infinity;
performing a spatial transformation with the fundamental matrix and constructing a second reconstruction map;
computing a photometric error map from the first reconstruction map and a photometric error map from the second reconstruction map through the two pixel correspondences, and then taking the pixel-wise minimum to obtain the final photometric loss information.
3. The self-supervised monocular depth estimation method combined with spatio-temporal enhanced photometric loss according to claim 1, wherein preventing pixels of a moving object from participating in the photometric error calculation with the omnidirectional automatic mask specifically comprises:
predicting an initial depth and an initial pose of the target frame through a pre-trained network, and generating an initial reconstruction map;
adding perturbation terms to the initial pose and obtaining a plurality of hypothetical reconstruction frames by spatial transformation; generating a plurality of photometric error maps from the hypothetical reconstruction frames combined with the photometric values of the target frame, and obtaining a plurality of binary masks from the photometric error maps;
selecting the pixel-wise minimum of the plurality of binary masks as the final mask.
4. The method according to claim 3, wherein the perturbation terms are translational perturbation terms comprising $[t_{max}, 0, 0]$, $[-t_{max}, 0, 0]$, $[0, 0, t_{max}]$ and $[0, 0, -t_{max}]$, where $t_{max}$ denotes the maximum value in the initialized translation vector.
5. An apparatus for self-supervised monocular depth estimation combined with spatio-temporal enhanced photometric loss, characterized by comprising:
an acquisition module for acquiring a plurality of adjacent frame images in an image sequence;
an estimation module for inputting the images into a trained deep learning network to obtain depth information and pose information;
wherein photometric loss information of the deep learning network is obtained through a depth-aware pixel correspondence module based on a spatial transformation model, and an omnidirectional automatic mask module prevents pixels of moving objects from participating in the photometric error calculation.
6. The apparatus according to claim 5, wherein the depth-aware pixel correspondence module comprises:
a first construction unit for performing a spatial transformation of the far region with a homography matrix and constructing a first reconstruction map;
wherein the far region is treated as the plane at infinity;
a second construction unit for performing a spatial transformation with the fundamental matrix and constructing a second reconstruction map;
a photometric loss information acquisition unit for computing a photometric error map from the first reconstruction map and a photometric error map from the second reconstruction map through the two pixel correspondences, and then taking the pixel-wise minimum to obtain the final photometric loss information.
7. The apparatus for self-supervised monocular depth estimation combined with spatio-temporal enhanced photometric loss according to claim 5, wherein the omnidirectional automatic mask module comprises:
an initial reconstruction map generating unit for predicting an initial depth and an initial pose of the target frame through a pre-trained network and generating an initial reconstruction map;
a binary mask generating unit for adding perturbation terms to the initial pose and obtaining a plurality of hypothetical reconstruction frames by spatial transformation, generating a plurality of photometric error maps from the hypothetical reconstruction frames combined with the photometric values of the target frame, and obtaining a plurality of binary masks from the photometric error maps;
a mask selecting unit for selecting the pixel-wise minimum of the plurality of binary masks as the final mask.
8. The apparatus according to claim 7, wherein the perturbation terms are translational perturbation terms comprising $[t_{max}, 0, 0]$, $[-t_{max}, 0, 0]$, $[0, 0, t_{max}]$ and $[0, 0, -t_{max}]$, where $t_{max}$ denotes the maximum value in the initialized translation vector.
CN202210475411.0A 2022-04-29 2022-04-29 Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss Active CN114998411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210475411.0A CN114998411B (en) 2022-04-29 2022-04-29 Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210475411.0A CN114998411B (en) 2022-04-29 2022-04-29 Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss

Publications (2)

Publication Number Publication Date
CN114998411A true CN114998411A (en) 2022-09-02
CN114998411B CN114998411B (en) 2024-01-09

Family

ID=83025390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210475411.0A Active CN114998411B (en) 2022-04-29 2022-04-29 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss

Country Status (1)

Country Link
CN (1) CN114998411B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264509A (en) * 2018-04-27 2019-09-20 腾讯科技(深圳)有限公司 Determine the method, apparatus and its storage medium of the pose of image-capturing apparatus
US20190356905A1 (en) * 2018-05-17 2019-11-21 Niantic, Inc. Self-supervised training of a depth estimation system
CN111260680A (en) * 2020-01-13 2020-06-09 杭州电子科技大学 RGBD camera-based unsupervised pose estimation network construction method
US20200211206A1 (en) * 2018-12-27 2020-07-02 Baidu Usa Llc Joint learning of geometry and motion with three-dimensional holistic understanding
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
US20210118184A1 (en) * 2019-10-17 2021-04-22 Toyota Research Institute, Inc. Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation
CN113160390A (en) * 2021-04-28 2021-07-23 北京理工大学 Three-dimensional dense reconstruction method and system
CN113240722A (en) * 2021-04-28 2021-08-10 浙江大学 Self-supervision depth estimation method based on multi-frame attention
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113450410A (en) * 2021-06-29 2021-09-28 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
US20210398301A1 (en) * 2020-06-17 2021-12-23 Toyota Research Institute, Inc. Camera agnostic depth network
CN114022799A (en) * 2021-09-23 2022-02-08 中国人民解放军军事科学院国防科技创新研究院 Self-supervision monocular depth estimation method and device
CN114170286A (en) * 2021-11-04 2022-03-11 西安理工大学 Monocular depth estimation method based on unsupervised depth learning

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264509A (en) * 2018-04-27 2019-09-20 腾讯科技(深圳)有限公司 Determine the method, apparatus and its storage medium of the pose of image-capturing apparatus
US20190356905A1 (en) * 2018-05-17 2019-11-21 Niantic, Inc. Self-supervised training of a depth estimation system
US20200211206A1 (en) * 2018-12-27 2020-07-02 Baidu Usa Llc Joint learning of geometry and motion with three-dimensional holistic understanding
US20210118184A1 (en) * 2019-10-17 2021-04-22 Toyota Research Institute, Inc. Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation
CN111260680A (en) * 2020-01-13 2020-06-09 杭州电子科技大学 RGBD camera-based unsupervised pose estimation network construction method
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN111739078A (en) * 2020-06-15 2020-10-02 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
US20210398301A1 (en) * 2020-06-17 2021-12-23 Toyota Research Institute, Inc. Camera agnostic depth network
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN113160390A (en) * 2021-04-28 2021-07-23 北京理工大学 Three-dimensional dense reconstruction method and system
CN113240722A (en) * 2021-04-28 2021-08-10 浙江大学 Self-supervision depth estimation method based on multi-frame attention
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN113450410A (en) * 2021-06-29 2021-09-28 浙江大学 Monocular depth and pose joint estimation method based on epipolar geometry
CN114022799A (en) * 2021-09-23 2022-02-08 中国人民解放军军事科学院国防科技创新研究院 Self-supervision monocular depth estimation method and device
CN114170286A (en) * 2021-11-04 2022-03-11 西安理工大学 Monocular depth estimation method based on unsupervised depth learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
T. ZHOU et al.: "Unsupervised learning of depth and ego-motion from video", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1851-1858 *
姜昊辰 et al.: "Indoor dynamic scene RGB-D SLAM algorithm based on semantic priors and depth constraints", Information and Control (《信息与控制》), vol. 50, 2021, pages 275-286 *
岑仕杰 et al.: "Monocular depth estimation combining attention and unsupervised deep learning", Journal of Guangdong University of Technology (《广东工业大学学报》), no. 04, pages 35-41 *
胡智程: "Monocular image depth estimation based on unsupervised learning", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), 2021, pages 138-615 *
詹雁: "Research on image depth estimation methods based on domain adaptation", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), 2021, pages 138-811 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116245927B (en) * 2023-02-09 2024-01-16 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system

Also Published As

Publication number Publication date
CN114998411B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
WO2020046066A1 (en) Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image
CN112505065B (en) Method for detecting surface defects of large part by indoor unmanned aerial vehicle
CN109815847B (en) Visual SLAM method based on semantic constraint
Zuo et al. Devo: Depth-event camera visual odometry in challenging conditions
CN111783582A (en) Unsupervised monocular depth estimation algorithm based on deep learning
US8867826B2 (en) Disparity estimation for misaligned stereo image pairs
CN111462210A (en) Monocular line feature map construction method based on epipolar constraint
Shreyas et al. 3D object detection and tracking methods using deep learning for computer vision applications
Zhong et al. WF-SLAM: A robust VSLAM for dynamic scenarios via weighted features
CN114332394A (en) Semantic information assistance-based dynamic scene three-dimensional reconstruction method
CN112686952A (en) Image optical flow computing system, method and application
CN116452752A (en) Intestinal wall reconstruction method combining monocular dense SLAM and residual error network
CN114998411A (en) Self-supervised monocular depth estimation method and device combined with spatio-temporal enhanced photometric loss
Yang et al. SAM-Net: Semantic probabilistic and attention mechanisms of dynamic objects for self-supervised depth and camera pose estimation in visual odometry applications
Zhang et al. Depth map prediction from a single image with generative adversarial nets
Bhutani et al. Unsupervised Depth and Confidence Prediction from Monocular Images using Bayesian Inference
CN117011660A (en) Dot line feature SLAM method for fusing depth information in low-texture scene
Buck et al. Capturing uncertainty in monocular depth estimation: Towards fuzzy voxel maps
Wirges et al. Self-supervised flow estimation using geometric regularization with applications to camera image and grid map sequences
CN114935316B (en) Standard depth image generation method based on optical tracking and monocular vision
CN112308893B (en) Monocular depth estimation method based on iterative search strategy
Liu et al. Binocular depth estimation using convolutional neural network with Siamese branches
Taguchi et al. Unsupervised Simultaneous Learning for Camera Re-Localization and Depth Estimation from Video
Liu et al. Stereo Visual Odometry with Information Enhancement at Feature Points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant