CN112085776A

CN112085776A - Method for estimating scene depth of unsupervised monocular image by direct method

Info

Publication number: CN112085776A
Application number: CN202010754803.1A
Authority: CN
Inventors: 张治国; 孙业昊; 孙浩然; 王海霞; 卢晓; 盛春阳; 李玉霞
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2020-12-15
Anticipated expiration: 2040-07-31
Also published as: CN112085776B

Abstract

The invention discloses a direct method unsupervised monocular image scene depth estimation method, which belongs to the field of computer vision and depth estimation and comprises the following steps: constructing a neural network; calculating an image reprojection error; image mask calculation and camera pose updating. The method overcomes the defects that monocular image depth estimation has high requirement on environment, is easily interfered by low texture regions, has poor camera pose estimation precision and the like, and not only obviously improves the depth estimation precision but also has auxiliary action on positioning and navigation of a moving vehicle by combining the traditional monocular depth estimation problem with a visual odometer; the method has the advantages of high precision, strong flexibility, wide application range and the like, can be used for sensing, preventing collision and positioning navigation of the surrounding environment of equipment such as an automatic driving vehicle, a mobile robot and the like, and has various application scenes.

Description

Method for estimating scene depth of unsupervised monocular image by direct method

Technical Field

The invention belongs to the field of computer vision, neural networks and depth estimation, and particularly relates to a direct unsupervised monocular image scene depth estimation method.

Background

At present, depth estimation is rapidly developed under the promotion of related technologies such as neural networks and sensors, and is widely applied to the fields of intelligent robots, pedestrian recognition, face unlocking, VR application, automatic driving and the like. The primary task of depth estimation is to estimate the distance from the front object to the camera according to a single color image collected by the camera.

There are two main ways to obtain three-dimensional information corresponding to a scene from a real scene: the method is characterized in that one method is to use a sensor capable of sensing three-dimensional depth information of a scene to acquire the depth information in the scene, and the other method is to recover the three-dimensional depth information from a two-dimensional image corresponding to the scene. Structured light and time-of-flight ranging (ToF) are currently common sensor-based depth estimation algorithms. The structured light technology is composed of a special projector and a camera, and the camera acquires three-dimensional information of a scene by acquiring the change of specific light information emitted by the projector after the light information is projected to the scene. Structured light uses a special sensor to acquire high-precision scene three-dimensional information, and is widely applied to the fields of face unlocking, safe payment and the like at present. But the depth estimation method is not suitable for the depth estimation in the road scene because the technical principle limits that the method can be only used in the range of short-distance object ranging and a small scene. ToF is another commonly used depth information acquisition technique that uses the round-trip time-of-flight of a signal between two transceivers to obtain depth information. Depth cameras and laser radar ranging devices which are commonly used in mobile phones acquire high-precision depth maps of scenes by using a ToF technology, the ToF technology is widely applied to the fields of AR, somatosensory games and automatic driving, however, laser radars and sensors required by ToF are expensive, and the size of a roof laser radar applied to automatic driving is too large and has limitation, so that the recovery of three-dimensional depth information from two-dimensional images corresponding to the scenes gradually becomes mainstream.

The depth estimation based on the images can be divided into supervised learning and unsupervised learning, and the supervised learning method depends on a depth three-dimensional map corresponding to a training picture. Usually, the deep three-dimensional map is acquired by a radar sensor, so the scale of a training data set of the supervised learning method based on the neural network is usually very small, and the cost for acquiring the data set is also very high, which greatly limits the mobility and the adaptability of the supervised learning method.

Unsupervised learning can be generally divided into multi-ocular, monocular, and binocular depth estimation methods depending on the number of cameras needed to restore depth. The traditional depth estimation method is mainly based on the geometric constraint of feature point matching and environment assumption, the binocular method and the multi-view depth estimation method need accurate camera external parameters, and errors caused by external parameter change cannot be eliminated. The monocular image depth estimation method only needs the internal reference of the camera and does not need the characteristic matching process, so that the algorithm is simpler and wider in application range.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides a direct unsupervised monocular image depth estimation method, which not only obviously improves the depth estimation precision, but also has auxiliary effects on the positioning and navigation of a moving vehicle, has reasonable design, overcomes the defects of the prior art and has good effect.

In order to achieve the purpose, the invention adopts the following technical scheme:

a direct method unsupervised monocular image scene depth estimation method comprises the following steps:

step 1: constructing a depth estimation neural network, taking a monocular continuous image as an input image, and outputting a depth estimation image by using the depth estimation neural network;

step 2: calculating an initial camera pose, calculating a re-projection image by using a previous frame image and a depth estimation neural image of an input image and the camera pose, calculating a re-projection error of the re-projection image and a current frame image, and updating parameters of a depth estimation neural network by using back propagation to obtain a new depth estimation image;

and step 3: and (3) calculating an image mask by utilizing the reprojected image and the input image, updating the camera pose estimation between the front frame image and the rear frame image, and repeating the iteration steps 2 and 3.

Preferably, in step 1, the method specifically comprises the following steps:

step 1.1: the constructed depth estimation neural network is in a full convolution U shape, and the convolution part uses a convolution network in a Res-Net18 network structure as a main structure network;

step 1.2: the deconvolution part comprises a plurality of deconvolution layers and ReLu activated layer stacking as a main body structure, and each deconvolution layer is connected with the last convolution layer in the convolution blocks with the same mesoscale in the convolution part to form a final deconvolution layer;

step 1.3: monocular continuous images are used as a training data set, and sample expansion and generalization operations such as image inversion, gamma transformation, color channel change and the like are carried out before the monocular continuous images are input into a depth estimation neural network.

Preferably, in step 2, the method specifically comprises the following steps:

step 2.1: pre-calibrating intrinsic parameters of the camera or obtaining the intrinsic parameters of the camera from the data set;

step 2.2: calculating the initial camera pose by using the current frame image and the previous frame image and adopting a direct method;

step 2.3: calculating a reprojection image by using a last frame image of the input image, the depth estimation image and the camera pose;

step 2.4: calculating a reprojection error between the current frame image and the reprojection image;

step 2.5: and updating parameters of the depth estimation neural network by utilizing the depth estimation image and the image reprojection error output by the depth estimation neural network through back propagation to obtain a new depth estimation image.

Preferably, in step 3, the method specifically comprises the following steps:

step 3.1: calculating a similarity error according to the reprojected image and the current frame image to obtain an image mask;

step 3.2: calculating a Jacobian matrix between the last frame of image and the obtained camera pose;

step 3.3: multiplying the image mask by the Jacobian matrix to obtain an improved Jacobian matrix;

step 3.4: and updating the camera pose estimation between the front frame image and the rear frame image by using the improved Jacobian matrix.

Step 3.5: and repeating the iteration steps 2 and 3.

Preferably, the convolution part uses a convolution network in a Res-Net18 network structure as a main structure network, and is composed of 5 convolution blocks and 5 deconvolution blocks, where the 1 st convolution block includes 1 convolution group, the convolution group includes 1 convolution layer, the input of the convolution layer is a 3-channel color image, the output is 64 channels, the 2 nd, 3 rd, 4 th and 5 th convolution blocks include 2 convolution groups, each convolution group includes 2 convolution layers, the number of channels included in the 2 nd, 3 th, 4 th and 5 th convolution blocks is 64, 128, 256 and 512, the number of deconvolution groups in the deconvolution blocks is 2 times of the number of convolution groups in the corresponding scale convolution blocks, and the number of channels included in each convolution group in each deconvolution block is 512, 256, 128, 64 and 1.

Preferably, the last volume block and the next volume block are connected by a maximum pool operation, the maximum pool operation also scales the size of the next volume block in the two adjacent volume blocks to be one half of the size of the last volume block, the size of the next deconvolution block in the two adjacent volume blocks is 2 times of the size of the last deconvolution block, the volume layer and the deconvolution layer in each volume unit and the deconvolution group are connected by 3 × 3 convolution, and in addition, the content of the second volume block is copied into the fourth deconvolution block, and the content of the fifth volume block is copied into the third deconvolution block.

The invention has the following beneficial technical effects:

according to the method, a monocular camera is used for acquiring a two-dimensional image of an environment, and then a three-dimensional depth map corresponding to the two-dimensional image is obtained through calculation by using a designed full convolution neural network. When the network is trained, the invention uses the designed image reprojection error and the image mask, thereby increasing the training efficiency and the depth estimation precision. The image mask used in the method can effectively remove interference factors such as a texture area, a moving vehicle and the like in a road environment, greatly improve the accuracy of monocular depth estimation, and simultaneously reduce the requirement on the environment and the training cost; in addition, the positioning method provided by the invention can be used for acquiring the depth map in front of the camera in real time, can be used for navigation and automatic driving of automobiles of ground mobile robots, and is also suitable for unmanned aerial vehicles flying in the air.

The method not only overcomes the defects of strong dependence on a radar sensor, high requirement on environment, inflexibility and the like of the traditional depth estimation method, but also effectively overcomes the problem of cavities generated by low texture areas in monocular depth estimation, and is also very suitable for positioning and navigating the unmanned aerial vehicle; the intelligent mobile navigation system has the advantages of high precision, strong flexibility, wide application range and the like, can be used for navigation and obstacle avoidance of intelligent mobile equipment such as mobile robots and unmanned aerial vehicles in indoor environments, and widens application scenes.

Drawings

FIG. 1 is a schematic diagram of an unsupervised monocular depth estimation module based on a full convolution neural network according to the present invention;

FIG. 2 is a schematic diagram of a training process for a full convolution neural network of the present invention;

Detailed Description

The invention is described in further detail below with reference to the following figures and detailed description:

as shown in fig. 1, a direct unsupervised monocular image scene depth estimation method includes:

step 1: the constructed depth estimation neural network is in a full convolution U shape and comprises a convolution part and a deconvolution part. The depth estimation neural network takes monocular continuous images as input images, and the depth estimation neural network is used for outputting depth estimation images.

Step 1 comprises the following substeps:

step 1.1: the convolution portion uses the convolution network in the Res-Net18 network structure as the main structure network. The convolution part is composed of a plurality of convolution blocks, the last convolution block is connected with the next convolution block through maximum pool operation, each convolution block comprises a plurality of convolution groups, each convolution group comprises a plurality of convolution layers, and meanwhile, the maximum pool operation also scales the size of the next convolution block in the two adjacent convolution blocks to be one half of the size of the last convolution block. As shown in fig. 2, in the embodiment of the present invention, the number of convolution blocks is 5, and each convolution module includes a plurality of different convolution groups. The 1 st convolution block comprises 1 convolution group, the convolution group comprises 1 convolution layer, the input of the convolution layer is 3-channel color image, and the output is 64 channels; the 2 nd, 3 rd, 4 th and 5 th convolution blocks respectively comprise 2 convolution groups, each convolution group respectively comprises 2 convolution layers, and the number of channels contained in the 2 nd, 3 rd, 4 th and 5 th convolution blocks is respectively 64, 128, 256 and 512. The convolution kernel of 7 × 7 is used in the 1 st volume block, the convolution kernel of 3 × 3 is used in the 2 nd, 3 rd, 4 th and 5 th volume blocks, and a Linear correction unit (ReLu) is used between the volume layers in each volume block as an activation function.

Step 1.2: the deconvolution part is composed of a plurality of deconvolution blocks. The deconvolution block with the same scale in the deconvolution part consists of two parts, namely a basic block with the scale in the convolution part being the same as that in the current deconvolution part and a deconvolution block with the current scale. The next deconvolution block has a dimension 2 times the dimension of the last deconvolution block. Each deconvolution block contains several deconvolution groups. As shown in fig. 2, the number of deconvolution blocks in the present invention is 5, the number of deconvolution groups in a deconvolution block is 2 times the number of convolution groups in a corresponding scale convolution block, the number of convolution layers included in each deconvolution group is the same as the number of convolution layers in the corresponding scale convolution group, and the number of channels included in each convolution group in each deconvolution block is 512, 256, 128, 64, and 1, respectively.

Step 1.3: and constructing a depth estimation neural network by utilizing the convolution part and the deconvolution part, inputting the constructed neural network for depth estimation by taking a monocular continuous image as a training data set, and performing sample expansion and generalization operations such as image inversion, gamma conversion, color channel change and the like before inputting the depth estimation neural network.

Step 2: calculating a re-projection image by using a previous frame image, a depth estimation image and a camera pose of an input image, calculating a re-projection error of the re-projection image and a current frame image to obtain a re-projection error of the image, and updating parameters of a depth estimation neural network by using back propagation to obtain a new depth estimation image.

Step 2 comprises the following substeps:

step 2.1: the internal parameters of the camera are calibrated in advance or obtained from the data set, and the internal parameters of the camera are calibrated in an off-line mode by using a Zhang Yongyou camera calibration method and adopting a checkerboard.

Step 2.2: and calculating the initial camera pose by using the current frame image and the previous frame image and adopting a direct method.

Step 2.3: a re-projection image will be calculated using the last frame image of the input image and the depth estimation image and the camera pose;

step 2.4: and calculating a re-projection error between the current frame image and the re-projection image, wherein the re-projection error is a loss function of the depth estimation neural network.

The loss function in the depth estimation neural network adopts the error between a re-projection image and an original image, and is defined as follows:

wherein, I_i，

Respectively representing the pixel values of the input image and the re-projected image, and i represents the index of the pixel value.

Step 2.5: and updating parameters of the depth estimation network through back propagation by using the depth estimation image and the image reprojection error output by the depth estimation neural network to obtain a new depth estimation image.

The step 3 comprises the following substeps:

step 3.1: and calculating a similarity error according to the re-projected image and the current frame image to obtain an image mask.

Specifically, the image mask is a binary image. When the reprojection error is smaller than the difference between the images of the previous and subsequent frames, the image mask is set to 1, otherwise, it is set to 0. In the method of the invention, the image masks of the low texture area and the moving vehicle area are 0, and the image masks of the other areas are 1.

Step 3.3: and calculating a Jacobian matrix between the image of the last frame and the obtained camera pose.

Step 3.4: the image mask is multiplied by the Jacobian matrix to obtain an improved Jacobian matrix.

Step 3.5: and updating the camera pose estimation between the front frame image and the rear frame image by using the improved Jacobian matrix.

Step 3.6: and repeating the iteration steps 2 and 3.

The iteration times are set to be 6 ten thousand, the best estimation effect is achieved on the model stored in the 5 th ten thousand training, and the total training time is 40 hours.

The above is a complete implementation process of the present embodiment.

It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims

1. A direct method unsupervised monocular image scene depth estimation method is characterized by comprising the following steps: the method comprises the following steps:

step 2: calculating an initial camera pose, calculating a re-projection image by using a previous frame image, a depth estimation image and the camera pose of an input image, calculating a re-projection error of the re-projection image and a current frame image, and updating parameters of a depth estimation network by using back propagation to obtain a new depth estimation image;

2. The direct method unsupervised monocular image scene depth estimation method of claim 1, characterized by: in step 1, the method specifically comprises the following steps:

3. The direct method unsupervised monocular image scene depth estimation method of claim 1, characterized by: in the step 2, the method specifically comprises the following steps:

4. The direct method unsupervised monocular image scene depth estimation method of claim 1, characterized by: in step 3, the method specifically comprises the following steps:

step 3.4: updating the camera pose estimation between the front frame image and the rear frame image by using the improved Jacobian matrix;

step 3.5: and repeating the iteration steps 2 and 3.

5. The direct method unsupervised monocular image scene depth estimation method of claim 2, characterized by: the convolution part uses a convolution network in a Res-Net18 network structure as a main structure network, and comprises 5 convolution blocks and 5 deconvolution blocks, wherein the 1 st convolution block comprises 1 convolution group, the convolution group comprises 1 convolution layer, the input of the convolution layer is a 3-channel color image, the output is 64 channels, the 2 nd, 3 rd, 4 th and 5 th convolution blocks respectively comprise 2 convolution groups, each convolution group comprises 2 convolution layers, the number of channels contained in the 2 nd, 3 th, 4 th and 5 th convolution blocks is respectively 64, 128, 256 and 512, the number of deconvolution groups in the deconvolution blocks is 2 times of the number of convolution groups in the corresponding scale convolution blocks, and the number of channels contained in each convolution group in each deconvolution block is respectively 512, 256, 128, 64 and 1.

6. The direct method unsupervised monocular image scene depth estimation method of claim 2, characterized by: the last volume block and the next volume block are connected through a maximum pool operation, the maximum pool operation also scales the size of the next volume block in the two adjacent volume blocks to be one half of the size of the last volume block, the size of the next deconvolution block in the two adjacent volume blocks is 2 times of the size of the last deconvolution block, the convolution layer and the deconvolution layer in each volume unit and the deconvolution group are connected through a 3 x 3 convolution mode, in addition, the content of the second volume block is copied into the fourth deconvolution block, and the content of the fifth volume block is copied into the third deconvolution block.