CN113313732A - Forward-looking scene depth estimation method based on self-supervision learning - Google Patents

Forward-looking scene depth estimation method based on self-supervision learning

Info

Publication number
CN113313732A
CN113313732A (application CN202110708650.1A / CN202110708650A)
Authority
CN
China
Prior art keywords
depth
image
camera
training
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110708650.1A
Other languages
Chinese (zh)
Inventor
丁萌
尹利董
徐一鸣
李旭
宫淑丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110708650.1A
Publication of CN113313732A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a forward-looking scene depth estimation method based on self-supervised learning, which comprises the following steps: calculating a self-supervised learning reprojection formula; constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model; and migrating the visible light pre-training model to FLIR infrared data for training, thereby achieving dense depth estimation of infrared images. The method solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility. It achieves three-dimensional depth estimation from infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation.

Description

Forward-looking scene depth estimation method based on self-supervision learning
Technical Field
The invention relates to the technical field of image processing, in particular to a forward-looking scene depth estimation method based on self-supervision learning.
Background
In the field of automatic driving, vision-aided driving systems are receiving more and more attention. With the continuous growth in the computing power of hardware devices, the ability of a computer to acquire scene information from a single image keeps improving. The core of a vision-aided driving system is to acquire the depth information of the scene ahead of the vehicle so as to support downstream tasks such as obstacle avoidance and distance measurement. At present, however, vision-based estimation of the three-dimensional depth of a forward-looking scene can only be carried out under visible light conditions and cannot be carried out at night or in low visibility. For example, in visual depth estimation methods based on binocular or multi-view cameras, the depth estimation range is limited by the installation baseline of the cameras, so these methods are not suitable for observing long distances. The most representative method among geometric-vision depth estimation methods is structure from motion, but it is an offline algorithm and cannot meet the real-time requirements of the automatic driving field. Supervised deep learning methods require a large amount of data with real depth labels to be calibrated in advance, so acquiring training data consumes a large amount of manpower and material resources; moreover, during training only the difference between the machine's fitted depth estimate and the label value is considered, while the geometric constraints of vision are not taken into account.
Disclosure of Invention
The invention discloses a forward-looking scene depth estimation method based on self-supervised learning. It solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility; it can realize three-dimensional depth estimation of infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the invention discloses a forward-looking scene depth estimation method based on self-supervision learning, which comprises the following steps of:
calculating a self-supervision learning reprojection formula;
constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model;
and transferring the visible light pre-training model to an FLIR infrared data set for training, and realizing the dense depth estimation of the infrared image.
Further, the specific steps of calculating the self-supervised learning reprojection formula include:
calculating an internal parameter matrix k of the camera according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system;
$$D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth of a point in three-dimensional space from the camera, u and v are the coordinates of the point in the camera imaging plane, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, x_c, y_c and z_c are its coordinates in the camera coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system, R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1.
Further, the specific steps of constructing a depth estimation and pose estimation joint training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model comprise:
two networks requiring joint training are constructed according to the learning task: a ResNet-18 structure is adopted as the encoder of the depth estimation network, and the encoded features are restored to a depth map through up-sampling; a ResNet-18 structure is also adopted as the encoder of the pose estimation network, with small convolution kernels used for dimensionality reduction, to estimate the six-degree-of-freedom motion of the camera between two frames;
in the depth estimation network, the encoder successively extracts features of different dimensions at each layer, reduces the scale of the high-dimensional features through max pooling, and performs the next stage of feature extraction, yielding feature maps smaller than the original input image; in the pose estimation network, two stacked images are input to the same encoder, and the features are extracted to 2014 dimensions;
the depth maps of different scales extracted from the depth network are uniformly up-sampled: the maps obtained by the decoder that are smaller than the original input image are up-sampled back to the original input size; in the pose estimation network, dimensionality reduction is performed with a combination of convolutions to obtain the six-degree-of-freedom pose transformation;
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image and \hat{I}_n the reprojected image; \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, and p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames; N is the total number of image pixels; SSIM is the structural similarity evaluation between the original image and the reprojected image; d_p denotes the estimated depth and I_p the image value at pixel p; the depth smoothing term |∂d| is used to suppress local singular noise in the depth map; the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly; α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss;
selecting the minimum of the losses of the previous and subsequent frames according to the minimum reprojection error, and pre-training on the KITTI visible light data set to obtain a pre-training model;
further, the specific steps for realizing the infrared image dense depth estimation comprise:
reading the pre-training model, and initializing the convolutional neurons of the infrared training network by transferring its weights;
and performing hyper-parameter adjustment on the pre-training model, and selecting the optimal model as the final result.
Further, the method of pre-training on the KITTI visible light data includes multi-scale up-sampling and the re-projection error.
The beneficial technical effects are as follows:
1. The invention discloses a forward-looking scene depth estimation method based on self-supervised learning, comprising the following steps: calculating a self-supervised learning reprojection formula; constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model; and transferring the visible light pre-training model to the FLIR infrared data set for training to achieve dense depth estimation of infrared images. The method solves the problem that existing vision-based forward-looking scene three-dimensional depth estimation methods are only applicable under visible light conditions and cannot be used at night or in low visibility; it achieves three-dimensional depth estimation from infrared monocular images at night or in low visibility without supervision from real depth data, thereby remedying the shortcoming of vision-aided driving systems in night-time infrared image depth estimation;
2. The invention uses the geometric constraints between frames of a monocular video sequence and takes the difference between the reprojected image and the real image as the supervision signal, thereby realizing self-supervised learning and reducing the acquisition cost of training data;
3. The invention adopts the minimum reprojection error: a frame is reprojected from the two frames before and after it, and when computing the loss at a given point, the smaller of the two losses is taken as the loss value back-propagated through the neural network, which effectively reduces the influence of occasional noise in the infrared image;
4. The invention adopts a multi-scale up-sampling loss calculation method that up-samples the small-scale depth maps output by the depth estimation network decoder back to the resolution of the original image and computes the loss on that basis, thereby alleviating the hole phenomenon in the depth map caused by low-resolution regions;
5. Transfer learning is introduced, so that the neural network constructed by the method can fully learn infrared image characteristics by using prior knowledge of visible light road scenes.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the embodiments will be briefly described below.
FIG. 1 is a flow chart of a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 2 is a technical route diagram of a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 3 is a diagram of a depth estimation network in a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
FIG. 4 is a diagram of a pose estimation network in a forward-looking scene depth estimation method based on self-supervised learning according to the present invention;
fig. 5 is a multi-scale up-sampling process diagram in the forward-looking scene depth estimation method based on the self-supervised learning according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention discloses a forward-looking scene depth estimation method based on self-supervision learning, which trains and tests a neural network on a desktop workstation, and the realization of a software and hardware platform is shown in table 1.
Table 1: software and hardware platform configuration (presented as an image in the original publication).
the implementation of the forward-looking scene depth estimation method based on the self-supervised learning specifically comprises the following steps, as shown in fig. 1-2:
s1: calculating a self-supervision learning reprojection formula;
specifically, an internal parameter matrix k of the camera is calculated according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
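As a worked example of equation (1), the following sketch (an illustration, not taken from the patent; the function name and the example resolution and field-of-view values are assumptions) builds the intrinsic matrix from the image resolution and the fields of view, keeping the notation above in which H is the horizontal resolution and W the vertical resolution:

```python
import math
import numpy as np

def intrinsic_from_fov(H, W, fov_h_deg, fov_w_deg):
    """Pinhole intrinsic matrix k of Eq. (1), assuming the principal point lies
    at the image center; H = horizontal resolution, W = vertical resolution."""
    fx = H / (2.0 * math.tan(math.radians(fov_h_deg) / 2.0))  # f / d_x
    fy = W / (2.0 * math.tan(math.radians(fov_w_deg) / 2.0))  # f / d_y
    return np.array([[fx, 0.0, H / 2.0],
                     [0.0, fy, W / 2.0],
                     [0.0, 0.0, 1.0]])

# Example with placeholder values: a 640 x 512 camera with a 45 x 37 degree field of view.
K = intrinsic_from_fov(640, 512, 45.0, 37.0)
```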
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system; according to the camera projection model, observing a certain point in the three-dimensional space from the camera angle:
$$D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
the coordinate transformation relationship between the camera coordinate system and the world coordinate system is described by the rotation matrix and the displacement vector as:
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth of the point in three-dimensional space from the camera, u and v are the coordinates of the point in the camera imaging plane, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system (composed of the three translational degrees of freedom Δx, Δy and Δz in a Cartesian coordinate system), R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1. It should be noted that the symbol "~" indicates that the depth at time t-1 has been omitted from the projection formula, which expresses the constraint between the two adjacent frames. Specifically, the original derivation is D_{t-1}·p_{t-1} = D_t k T_{t→t-1} k^{-1} p_t; omitting the depth D_{t-1} at time t-1 yields equation (4), so the "~" can also be read as "approximately equal", relating the two sides of the reprojection equation.
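Equation (4) can be sketched in PyTorch roughly as follows; this is a minimal illustration assuming batched depth, intrinsics and pose tensors, with illustrative function and variable names rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def reproject(img_prev, depth_t, K, T_t_to_prev):
    """Warp frame t-1 into frame t using Eq. (4): p_{t-1} ~ k T_{t->t-1} D_t(p_t) k^{-1} p_t.

    img_prev    : (B, C, H, W) image at time t-1 (source being sampled)
    depth_t     : (B, 1, H, W) predicted depth for frame t
    K           : (B, 3, 3)    camera intrinsic matrix
    T_t_to_prev : (B, 4, 4)    pose transformation from frame t to frame t-1
    returns     : (B, C, H, W) reconstruction of frame t built from frame t-1
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Homogeneous pixel grid p_t, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)
    pix = pix.unsqueeze(0).repeat(B, 1, 1)

    # Back-project: D_t(p_t) * k^{-1} p_t -> 3-D points in the frame-t camera system
    cam_points = depth_t.view(B, 1, -1) * torch.bmm(torch.inverse(K), pix)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)

    # Rigid transform to the frame t-1 camera system, then project with k
    cam_prev = torch.bmm(T_t_to_prev, cam_points)[:, :3, :]
    proj = torch.bmm(K, cam_prev)
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    # Normalise to [-1, 1] and bilinearly sample frame t-1 at the projected locations
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_prev, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

In training, the same warp would also be applied to the frame at time t+1, and the two reconstructions feed the minimum-reprojection loss described below.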
S2: constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training on KITTI visible light data to obtain a visible light pre-training model;
specifically, two networks requiring joint training are constructed according to the learning task; the depth network is shown in fig. 3 and the pose estimation network in fig. 4. A ResNet-18 structure is adopted as the encoder of the depth estimation network, and the encoded features are restored to a depth map through up-sampling; a ResNet-18 structure is also adopted as the encoder of the pose estimation network, with small convolution kernels used for dimensionality reduction, to estimate the six-degree-of-freedom motion of the camera between two frames;
in the depth estimation network, the encoder successively extracts features of 64, 128, 256 and 512 dimensions; each layer applies a 3 x 3 convolution, batch normalization and a ReLU activation in turn and reduces the scale of the high-dimensional features through max pooling before the next stage of feature extraction, yielding feature maps at 1/2, 1/4 and 1/8 of the original input size; in the pose estimation network, two stacked images are input to the same encoder, and the features are extracted to 2014 dimensions;
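As a rough illustration of the two encoders just described, the following sketch uses the torchvision ResNet-18 backbone; the class names and the six-channel first convolution for the stacked frame pair are assumptions for illustration, not details taken from the patent.

```python
import torch.nn as nn
from torchvision.models import resnet18

class DepthEncoder(nn.Module):
    """ResNet-18 backbone used as the depth-estimation encoder; the feature maps
    of every stage (64, 128, 256, 512 channels) are kept so that the decoder can
    upsample them back to a depth map."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # conv + BN + ReLU stem
        self.pool = net.maxpool                                   # max pooling reduces scale
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = [self.stem(x)]
        x = self.pool(feats[-1])
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # successively 64, 128, 256, 512 channels
        return feats

class PoseEncoder(nn.Module):
    """Same ResNet-18 structure, but the first convolution takes the two stacked
    input frames (6 channels) so inter-frame motion can be encoded."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                      net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, image_pair):  # image_pair: (B, 6, H, W)
        return self.backbone(image_pair)
```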
the depth maps of different scales extracted from the depth network are uniformly up-sampled; the 1/2, 1/4 and 1/8 size maps obtained by the decoder are processed as shown in fig. 5: after the decoder obtains the depth maps at four scales, all four are restored to the original resolution, and performing the reprojection on these four full-resolution depth maps alleviates the problem of holes appearing in low-texture regions of the infrared image; for the pose estimation network, a small convolutional decoder is designed, and dimensionality reduction is performed with a combination of 3 x 3 and 1 x 1 convolutions, finally yielding the six-degree-of-freedom pose transformation;
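The multi-scale up-sampling of fig. 5 and the small convolutional pose decoder might be sketched as follows; the channel widths, the averaging over spatial locations and the 0.01 output scaling are assumptions, not values from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

def upsample_to_input_size(depth_maps, input_hw):
    """Restore the depth maps predicted at 1/1, 1/2, 1/4 and 1/8 scale to the
    original input resolution before the reprojection loss is computed."""
    H, W = input_hw
    return [F.interpolate(d, size=(H, W), mode="bilinear", align_corners=False)
            for d in depth_maps]

class PoseDecoder(nn.Module):
    """Small decoder combining 3x3 and 1x1 convolutions to reduce the encoder
    features to a six-degree-of-freedom motion (3 rotation + 3 translation)."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 256, kernel_size=1)    # 1x1 conv
        self.conv1 = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 3x3 conv
        self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 3x3 conv
        self.pose = nn.Conv2d(256, 6, kernel_size=1)                # 1x1 conv -> 6 DoF
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):
        x = self.relu(self.reduce(feat))
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        pose = self.pose(x).mean(dim=[2, 3])  # global average over spatial dims
        return 0.01 * pose                    # small-motion scaling (assumed)
```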
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image and \hat{I}_n the reprojected image; \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, and p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames; N is the total number of image pixels; SSIM is the structural similarity evaluation between the original image and the reprojected image; d_p denotes the estimated depth and I_p the image value at pixel p; the depth smoothing term |∂d| is used to suppress local singular noise in the depth map; the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly, so that the contours of the depth map remain sharp; α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss. It should be noted that the reprojection loss L_rc comprises two parts: an SSIM structural loss and an L1-norm loss describing the difference between the projection result and the true value. The L1-norm loss, also called least absolute deviation or least absolute error, here represents the absolute error between the pixel values of the predicted image and those of the real image.
The minimum of the losses computed from the previous and subsequent frames is selected according to the minimum reprojection error, and pre-training is carried out on the KITTI visible light data set so that the structural characteristics of ground road scenes are learned in advance.
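Under the definitions of equations (5)-(7), a minimal sketch of the loss computation, including the per-pixel minimum over the reconstructions from the previous and next frames, could look as follows; the 3 x 3 SSIM window and the default weights alpha, mu and lambda are assumptions rather than values taken from the patent.

```python
import torch
import torch.nn.functional as F

def ssim_cost(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel (1 - SSIM) / 2 between two images, over 3x3 averaging windows."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def reprojection_loss(target, reconstructions, alpha=0.85):
    """Eq. (5) with the per-pixel minimum over the reconstructions from the
    previous and next frames (minimum reprojection error)."""
    per_frame = []
    for recon in reconstructions:  # [I_hat from t-1, I_hat from t+1]
        l1 = torch.abs(target - recon).mean(1, keepdim=True)
        structural = ssim_cost(target, recon).mean(1, keepdim=True)
        per_frame.append(alpha * structural + (1 - alpha) * l1)
    return torch.min(torch.cat(per_frame, dim=1), dim=1)[0].mean()

def edge_aware_smoothness(depth, img):
    """Eq. (6): depth-gradient term weighted by the edge-aware term e^{-|dI|}."""
    dx_d = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    dy_d = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    dx_i = torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]).mean(1, keepdim=True)
    dy_i = torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]).mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(target, reconstructions, depth, mu=1.0, lam=1e-3):
    """Eq. (7): L = mu * L_rc + lambda * L_s (weights are illustrative)."""
    return mu * reprojection_loss(target, reconstructions) + \
        lam * edge_aware_smoothness(depth, target)
```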
S3: migrating the visible light pre-training model to an FLIR infrared data set for training to realize dense depth estimation of the infrared image;
specifically, the pre-training model is read, and the convolutional neurons of the infrared training network are initialized by transferring its weights;
hyper-parameters are then adjusted, including the Adam optimizer parameters, the learning rate and the number of epochs, and the optimal model is selected as the final result.
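A sketch of this transfer step under stated assumptions: the checkpoint layout, the hypothetical helper compute_self_supervised_loss (standing in for the loss of equations (5)-(7)), and the learning-rate, epoch and scheduler settings are placeholders, not details from the patent.

```python
import torch

def finetune_on_flir(depth_net, pose_net, flir_loader, pretrained_path,
                     lr=1e-4, epochs=20, device="cuda"):
    """Initialise both networks from the KITTI visible-light pre-training and
    fine-tune them on FLIR infrared sequences with the Adam optimizer."""
    ckpt = torch.load(pretrained_path, map_location=device)
    depth_net.load_state_dict(ckpt["depth"], strict=False)  # migrate convolutional weights
    pose_net.load_state_dict(ckpt["pose"], strict=False)
    depth_net.to(device).train()
    pose_net.to(device).train()

    params = list(depth_net.parameters()) + list(pose_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

    for _ in range(epochs):
        for batch in flir_loader:
            optimizer.zero_grad()
            # hypothetical helper combining Eqs. (5)-(7) on an infrared frame triplet
            loss = compute_self_supervised_loss(depth_net, pose_net, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return depth_net, pose_net
```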
The forward-looking scene depth estimation method based on self-supervised learning disclosed by the invention is trained on infrared images captured at night or in low-visibility environments, remedying the shortcoming of vision-aided driving systems in estimating depth from infrared images at night.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above examples are only for describing the preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (5)

1. A forward-looking scene depth estimation method based on self-supervision learning is characterized by comprising the following steps:
calculating a self-supervision learning reprojection formula;
constructing a depth estimation and pose estimation combined training network, designing a loss function, and pre-training KITTI visible light data to obtain a visible light pre-training model;
and transferring the visible light pre-training model to an FLIR infrared data set for training, and realizing the dense depth estimation of the infrared image.
2. The method for estimating the depth of the forward-looking scene based on the self-supervised learning as recited in claim 1, wherein the step of calculating the self-supervised learning reprojection formula comprises:
calculating an internal parameter matrix k of the camera according to the equipment parameters;
$$k=\begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix} \frac{H}{2\tan(fov_h/2)} & 0 & \frac{H}{2} \\ 0 & \frac{W}{2\tan(fov_w/2)} & \frac{W}{2} \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$
wherein f is the focal length of the camera, d_x and d_y are the pixel sizes of the camera imaging sensor, u_0 and v_0 are the coordinates of the image center point, H is the horizontal resolution of the image, W is the vertical resolution of the image, fov_h is the horizontal field of view of the camera, and fov_w is the vertical field of view of the camera;
projecting the three-dimensional point to a two-dimensional plane, and calculating the coordinate transformation of a camera coordinate system and a world coordinate system;
$$D\,P_{uv}=D\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=k\,P_c=k\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \quad (2)$$
$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}=\begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}=T\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \quad (3)$$
wherein D is the horizontal depth, i.e. the pixel depth, of a point in three-dimensional space as viewed from the camera, u and v are the coordinates of the point in the camera imaging plane, P_uv and P_c respectively denote the two-dimensional pixel coordinates and the three-dimensional camera-coordinate values, k is the camera intrinsic matrix, x_w, y_w and z_w are the coordinates of the point in the world coordinate system, x_c, y_c and z_c are the coordinates of the point in the camera coordinate system, t is the displacement vector between the camera coordinate system and the world coordinate system, R is the rotation matrix between the camera coordinate system and the world coordinate system, and T is the pose transformation matrix;
obtaining a self-supervision learning core formula;
$$p_{t-1}\sim k\,T_{t\to t-1}\,D_t(p_t)\,k^{-1}\,p_t \quad (4)$$
wherein p_t is the coordinate of a pixel at time t, k is the camera intrinsic matrix, D is the pixel depth, and T_{t→t-1} is the pose transformation matrix of the camera between time t and time t-1.
3. The method for estimating the depth of the forward-looking scene based on the self-supervised learning as recited in claim 1, wherein the specific step of obtaining the visible light pre-training model comprises:
two networks needing joint training are constructed according to a learning task, a ResNet-18 network structure is adopted as a depth estimation network encoder, encoding characteristics are restored into a depth map through upsampling, the ResNet-18 network structure is adopted as a pose estimation network encoder, dimension reduction is carried out by using a small convolution kernel, and six-degree-of-freedom motion of a camera between two frames is estimated;
in the depth estimation network, an encoder sequentially samples the extracted feature quantity of each layer to different dimensions, reduces the scale of the high-dimensional features through maximum pooling, and performs next-step feature extraction to obtain an image with the size smaller than that of the original input image; in the pose estimation network, two superposed images are input, the same encoder is used, and the characteristic dimension is extracted to 2014 dimension;
uniformly upsampling different depth maps extracted from a depth network, and uniformly upsampling the image with the size smaller than that of the original input image obtained by a decoder to the size of the original input image; performing dimensionality reduction in a pose estimation network by using a convolution combination mode to obtain a pose transformation relation with six degrees of freedom;
designing a loss function;
reprojection loss:
$$L_{rc}=\frac{1}{N}\sum_{p}\left[\frac{\alpha}{2}\bigl(1-\mathrm{SSIM}(I_n,\hat{I}_n)\bigr)+(1-\alpha)\,\lvert p_n-\hat{p}_n\rvert\right],\quad n\in\{-1,+1\} \quad (5)$$
loss of edge smoothing:
$$L_{s}=\frac{1}{N}\sum_{p}\Bigl(\lvert\partial_x d_p\rvert\,e^{-\lvert\partial_x I_p\rvert}+\lvert\partial_y d_p\rvert\,e^{-\lvert\partial_y I_p\rvert}\Bigr) \quad (6)$$
total loss function:
$$L=\mu L_{rc}+\lambda L_{s} \quad (7)$$
wherein I_n represents the original image, \hat{I}_n represents the reprojected image, \hat{p}_{-1} and \hat{p}_{+1} are the reconstructed pixel values of the previous and subsequent frames, p_{-1} and p_{+1} are the actual pixel values of the previous and subsequent frames, N is the total number of image pixels, SSIM is the structural similarity evaluation between the original image and the reprojected image, d_p denotes the estimated depth and I_p the image value at pixel p, the depth smoothing term |∂d| is used to suppress local singular noise in the depth map, the edge perception term e^{-|∂I|} is used to encourage the model to learn edge information where the depth gradient changes strongly, α is the weighting coefficient between the SSIM and L1-norm losses in the reprojection loss, and μ and λ are the weighting coefficients of the two losses in the total loss;
and selecting the minimum value of the frame loss before and after the frame loss according to the minimum reprojection error, and pre-training in the KITTI visible light data set to obtain a pre-training model.
4. The forward-looking scene depth estimation method based on the self-supervised learning as recited in claim 1, wherein the specific steps of realizing the infrared image dense depth estimation comprise:
reading a pre-training model, and carrying out migration initialization on convolutional neurons of the infrared training network on the basis of the pre-training model;
and carrying out hyper-parameter adjustment on the pre-training model, and selecting the optimal model as a final result.
5. The method of claim 1, wherein the method of pre-training visible light data for KITTI comprises multi-scale up-sampling and re-projection errors.
CN202110708650.1A 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning Pending CN113313732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708650.1A CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110708650.1A CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN113313732A (en) 2021-08-27

Family

ID=77380207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708650.1A Pending CN113313732A (en) 2021-06-25 2021-06-25 Forward-looking scene depth estimation method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN113313732A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763474A (en) * 2021-09-16 2021-12-07 上海交通大学 Scene geometric constraint-based indoor monocular depth estimation method
CN114549612A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training and image processing method, device, equipment and storage medium
CN114972517A (en) * 2022-06-10 2022-08-30 上海人工智能创新中心 RAFT-based self-supervision depth estimation method
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN116168070A (en) * 2023-01-16 2023-05-26 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116563458A (en) * 2023-04-07 2023-08-08 郑州大学 Three-dimensional reconstruction method for internal diseases of drainage pipeline based on image depth estimation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325782A (en) * 2020-02-18 2020-06-23 南京航空航天大学 Unsupervised monocular view depth estimation method based on multi-scale unification
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325782A (en) * 2020-02-18 2020-06-23 南京航空航天大学 Unsupervised monocular view depth estimation method based on multi-scale unification
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李旭 et al.: "Depth estimation method based on monocular infrared images in VDAS", Systems Engineering and Electronics (系统工程与电子技术) *
高宏伟 et al.: "Fundamentals of Electronic Packaging Processes and Equipment Technology" (电子封装工艺与装备技术基础教程), 30 June 2017, Xidian University Press *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763474A (en) * 2021-09-16 2021-12-07 上海交通大学 Scene geometric constraint-based indoor monocular depth estimation method
CN113763474B (en) * 2021-09-16 2024-04-09 上海交通大学 Indoor monocular depth estimation method based on scene geometric constraint
CN114549612A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Model training and image processing method, device, equipment and storage medium
CN114998411A (en) * 2022-04-29 2022-09-02 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combined with space-time enhanced luminosity loss
CN114998411B (en) * 2022-04-29 2024-01-09 中国科学院上海微系统与信息技术研究所 Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss
CN114972517A (en) * 2022-06-10 2022-08-30 上海人工智能创新中心 RAFT-based self-supervision depth estimation method
CN114972517B (en) * 2022-06-10 2024-05-31 上海人工智能创新中心 Self-supervision depth estimation method based on RAFT
CN116168070A (en) * 2023-01-16 2023-05-26 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116168070B (en) * 2023-01-16 2023-10-13 南京航空航天大学 Monocular depth estimation method and system based on infrared image
CN116245927A (en) * 2023-02-09 2023-06-09 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116245927B (en) * 2023-02-09 2024-01-16 湖北工业大学 ConvDepth-based self-supervision monocular depth estimation method and system
CN116563458A (en) * 2023-04-07 2023-08-08 郑州大学 Three-dimensional reconstruction method for internal diseases of drainage pipeline based on image depth estimation

Similar Documents

Publication Publication Date Title
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN108665496B (en) End-to-end semantic instant positioning and mapping method based on deep learning
CN110738697B (en) Monocular depth estimation method based on deep learning
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN111325797B (en) Pose estimation method based on self-supervision learning
CN108986136B (en) Binocular scene flow determination method and system based on semantic segmentation
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
CN113160375B (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113283525B (en) Image matching method based on deep learning
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN113657388A (en) Image semantic segmentation method fusing image super-resolution reconstruction
CN110910327B (en) Unsupervised deep completion method based on mask enhanced network model
WO2024051184A1 (en) Optical flow mask-based unsupervised monocular depth estimation method
CN114119889B (en) Cross-modal fusion-based 360-degree environmental depth completion and map reconstruction method
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115546273A (en) Scene structure depth estimation method for indoor fisheye image
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN116152442A (en) Three-dimensional point cloud model generation method and device
CN112802202A (en) Image processing method, image processing device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210827