CN113096176B - Semantic segmentation-assisted binocular vision unsupervised depth estimation method - Google Patents

Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Info

Publication number
CN113096176B
CN113096176B (application CN202110329765.XA)
Authority
CN
China
Prior art keywords
neural network
semantic segmentation
parallax
task
convolution neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110329765.XA
Other languages
Chinese (zh)
Other versions
CN113096176A (en)
Inventor
任鹏举 (Ren Pengju)
李凌阁 (Li Lingge)
丁焱 (Ding Yan)
景鑫 (Jing Xin)
赵文哲 (Zhao Wenzhe)
夏天 (Xia Tian)
郑南宁 (Zheng Nanning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202110329765.XA
Publication of CN113096176A
Application granted
Publication of CN113096176B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A semantic segmentation-assisted binocular vision unsupervised depth estimation method, the method comprising: S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder; S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view; S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network; S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.

Description

Semantic segmentation-assisted binocular vision unsupervised depth estimation method
Technical Field
The present disclosure belongs to the field of computer vision and computer graphics, and in particular relates to a semantic segmentation-assisted binocular vision unsupervised depth estimation method.
Background
Depth estimation is an important research topic in the field of computer vision. It is of great significance in applications such as autonomous driving, AR/VR, three-dimensional reconstruction, and object grasping. Although several kinds of hardware and methods can obtain depth-of-field information today, each has its own drawbacks. For example, disparity computation from monocular camera images or binocular camera images tends to be computationally expensive, has low accuracy, and depends heavily on scene texture information. Lidar is expensive, and relying solely on a single sensor is itself a risk. Structured-light depth sensors suffer from a limited measurement range and perform poorly in outdoor environments. With the advent of deep learning, the industry has turned its attention to learning-based visual depth estimation. Prior-art methods that use supervised learning for depth estimation achieve good results, but supervised methods suffer from strong dataset limitations and poor generalization. Unsupervised depth estimation has therefore become a popular research direction.
Semantic segmentation may be defined as the process of creating a mask over an image in which pixels are assigned to a predefined set of semantic categories. The segmentation may be binary (e.g., "human pixels" vs. "non-human pixels") or multi-class (e.g., pixels labeled "human", "car", "building", etc.). With the increasing accuracy and adoption of semantic segmentation, it is becoming increasingly important to develop techniques that exploit such segmentation and integrate the segmentation information into existing computer vision applications such as depth or disparity estimation. The most advanced methods for semantic segmentation are based on deep learning.
Current unsupervised deep-learning depth estimation methods divide into methods based on monocular consecutive frames and methods based on binocular image pairs, the latter being more common and practical. Monocular consecutive-frame methods: one network is a depth prediction network that predicts a depth map, the other is a pose estimation network that estimates camera pose; adjacent frames are then reconstructed using the predicted depth map and pose, and the reconstructed frames are compared with the original adjacent frames to compute a loss function and train the networks. Binocular image-pair methods: the left view of a binocular camera image pair is first input to a fully convolutional network to obtain predicted left and right disparity maps; the predicted disparity maps are then used together with the original left and right views to reconstruct the left and right views; finally the original views are compared with the reconstructed views to compute a loss function and train the network. Among these, a network trained with binocular images can directly estimate absolute depth with accuracy that can even exceed supervised methods, whereas a monocular image dataset contains no absolute scale information and the predicted depth is only relative; this disclosure therefore mainly studies unsupervised depth estimation based on binocular image pairs.
Among existing methods of environmental depth estimation, lidar is the most common and effective, but lidar is expensive, does not provide a dense depth map, and relying heavily on a single sensor is itself a risk. A binocular camera can obtain depth directly by computing disparity, but the computation is expensive and accuracy is low when images lack texture information. Depth estimation by deep learning can provide dense depth maps, adapts well to different scenes, and reduces cost. Previous methods that use semantic segmentation to assist depth estimation generally combine depth and semantic information through iterative networks, which makes the network bulkier; at the same time, in the forward-inference stage the network lacks perception of the structural information of the other view, degrading final disparity prediction.
Disclosure of Invention
In order to solve the above-mentioned problems, the present disclosure provides a semantic segmentation-assisted binocular vision unsupervised depth estimation method, which includes:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.
Through this technical scheme, scene depth estimation from binocular images is combined with semantics, and the disparity computed by the traditional disparity algorithm SGM is added as a supervision signal. In this method the semantic segmentation task is supervised learning, and the disparity map computed by the SGM algorithm serves as one of the supervision signals for network disparity estimation; however, the views are reconstructed from the predicted disparity maps and the original images without any depth information provided by lidar, and the reconstructed images are compared with the originals to construct a loss function, so depth estimation itself remains an unsupervised learning task. Suppose the input left view yields a corresponding predicted left disparity map; the right view corresponding to the input left view is then combined with this predicted left disparity map to obtain a reconstructed left view. Inputting the right view yields a reconstructed right view in the same way.
Beneficial effects of the invention:
1) Instead of the conventional multi-task learning networks used in the past, the semantic segmentation task and the disparity estimation task share the same encoder network. The underlying rationale is that the semantic information and the structural information of objects in an image are consistent in content, so the two can mutually reinforce each other during training, improving accuracy while reducing the scale of the network.
2) Previous methods input a left view and directly predict both disparity maps $d_l$ and $d_r$, but this causes a mismatch between the predicted right disparity map $d_r$ and the true right image $I_r$: the structure and texture information of the left and right views of a binocular camera differ, and without the structure and texture information of the right view it is difficult to obtain a right disparity map from the left view $I_l$ alone. The method therefore outputs only the single disparity map corresponding to the input image, instead of outputting two predicted disparity maps from one view as before.
3) In the disparity estimation stage, the traditional stereo matching algorithm SGM is introduced to compute a disparity map for the current input image; the computed disparity map amounts to prior information added to the predicted disparity map, which helps the network converge and improves prediction accuracy.
Drawings
FIG. 1 is a flow chart of a semantic segmentation assisted binocular vision unsupervised depth estimation method provided in one embodiment of the present disclosure;
FIG. 2 is a block diagram of a full convolutional neural network provided in one embodiment of the present disclosure;
fig. 3 is a graph of experimental results provided in one embodiment of the present disclosure.
Detailed Description
The invention is described in further detail below with reference to fig. 1 to 3.
In one embodiment, referring to fig. 1, a semantic segmentation assisted binocular vision unsupervised depth estimation method is disclosed, the method comprising:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.
The overall structure of the fully convolutional neural network is shown in fig. 2. It consists of an encoder and a decoder: the first convolution layer through the sixth convolution layer b form the encoder, and the fifth deconvolution layer through prediction layer one plus loss one form the decoder. The network is a suitably modified version of the classic ResNet. The ordinary convolutions in the encoder are replaced with dilated convolutions, which enlarge the receptive field without increasing the parameter computation, so that feature maps of the scene can be extracted better. When the encoder extracts features from an input image, the task is denoted by a flag t, a feature map of all 0s or all 1s (0 for the disparity estimation task, 1 for the semantic segmentation task); the extracted feature map, consisting of one or more matrices, then enters the decoder. The decoder fuses shallow local features with deep abstract features, using four long skip connections to enhance the prediction. The model outputs at four scales, namely (1, 1/2, 1/4, 1/8) of the input image, and the loss function is computed at all four scales.
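As a minimal illustration of this shared-backbone design, the following PyTorch sketch shows a dilated-convolution encoder and a decoder that receives the encoder features concatenated with the task flag t; the layer counts, channel widths, and head names are simplifying assumptions, not the exact patented network (which is a modified ResNet with four long skip connections and multi-scale outputs).

```python
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    """Sketch: one encoder/decoder shared by disparity and segmentation."""
    def __init__(self, num_classes=19):
        super().__init__()
        # a few dilated conv stages stand in for the modified ResNet encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # +1 input channel for the task flag concatenated onto the features
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128 + 1, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.disp_head = nn.Conv2d(16, 1, 3, padding=1)            # t = 0
        self.seg_head = nn.Conv2d(16, num_classes, 3, padding=1)   # t = 1

    def forward(self, image, t):
        feat = self.encoder(image)
        # flag bit t: an all-0 or all-1 map matching the feature map's size
        flag = torch.full_like(feat[:, :1], float(t))
        out = self.decoder(torch.cat([feat, flag], dim=1))
        if t == 0:
            return torch.sigmoid(self.disp_head(out))    # normalized disparity
        return torch.softmax(self.seg_head(out), dim=1)  # class probabilities
```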
The color left and right views acquired in step S300 come in the form of an image pair (a left view and a right view). The input order of the two views does not matter; by convention the left view is input first, and the predicted left disparity map obtained from the network allows the image reconstruction loss to be computed. The right view is handled analogously, and the two losses are added into a total loss function. Computing the loss in this way is precisely where the unsupervised nature of the method lies. Example: the left view is input and the network predicts a left disparity map; image reconstruction with the predicted left disparity map and the right view yields a reconstructed left view; comparing the reconstructed left view with the original input left view gives the loss. The right view is treated the same way. The network is thus trained to predict disparity maps, and the conversion between a disparity map and a depth map requires only a single formula, giving the desired depth map.
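The single conversion formula referred to here is the standard stereo relation depth = baseline × focal length / disparity; a minimal sketch, in which the KITTI-like baseline and focal length are placeholder values that must match the actual stereo rig:

```python
def disparity_to_depth(disparity, baseline_m=0.54, focal_px=721.0, eps=1e-6):
    """Convert a predicted disparity map (in pixels) to metric depth.

    baseline_m and focal_px are illustrative KITTI-style values, not
    parameters taken from the patent.
    """
    return baseline_m * focal_px / (disparity + eps)  # eps avoids divide-by-zero
```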
The main flow of the fully convolutional neural network is as follows. First, the color left view is input to the network; disparity is being predicted, so the flag bit t is set to 0, the feature map extracted by the encoder is concatenated with t and enters the decoder, the decoder is followed by avg. pooling to obtain the corresponding predicted disparity map, and image reconstruction is performed to compute the loss function. The same operation is performed on the color right view. The SGM algorithm — a traditional stereo matching algorithm with a fixed set of formulas — is then applied to the color left and right views to obtain SGM-computed left and right disparity maps. These actually contain sparse true disparity values, so they can be used to supervise the previously predicted disparity maps and yield a loss term. Next, the color left view is input again; semantics are now being predicted, so the flag bit t is set to 1, the encoder features concatenated with t enter the decoder, the decoder is followed by a softmax function to obtain the predicted semantic map, and the loss is computed under supervision with the corresponding label. The same operation is performed on the color right view. The losses computed by the two tasks are then combined into a total loss function, one iteration is completed, and the network is trained by backpropagation before moving to the next batch of data. The network input is always a color image, from one of two sources: 1) the left or right view of a binocular pair — for predicting disparity; 2) the left or right view of a binocular pair plus the corresponding semantic segmentation label — for predicting semantic segmentation. The flag bit t is a tensor of all 0s or all 1s with the same size as the feature map.
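A hedged sketch of one such training iteration — disparity passes on both views, SGM supervision, then segmentation passes. `model` is the shared encoder-decoder sketched above; `sgm` stands for any classical SGM implementation returning left and right disparity maps; `reconstruction_loss` and `sgm_loss` are sketched further below. All of these names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, left, right, label_l, label_r, sgm):
    # disparity passes (t = 0): one predicted disparity map per input view
    d_l = model(left, t=0)
    d_r = model(right, t=0)
    d_l_sgm, d_r_sgm = sgm(left, right)  # sparse "true" disparities from SGM
    loss = (reconstruction_loss(left, right, d_l, d_r)
            + sgm_loss(d_l, d_l_sgm) + sgm_loss(d_r, d_r_sgm))
    # segmentation passes (t = 1), supervised by the dataset labels;
    # the model outputs softmax probabilities, hence NLL on their log
    s_l = model(left, t=1)
    s_r = model(right, t=1)
    loss = loss + F.nll_loss(torch.log(s_l + 1e-8), label_l) \
                + F.nll_loss(torch.log(s_r + 1e-8), label_r)
    optimizer.zero_grad()
    loss.backward()  # one backpropagation step trains both tasks
    optimizer.step()
    return loss.item()
```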
Strictly, the image reconstruction is defined through the "image reconstruction loss", which belongs to the depth-error category. The reconstruction process: for example, a color left view is input and a left disparity map is obtained through network prediction; the predicted left disparity map and the color right view can then be used to reconstruct a left view. The image reconstruction loss is constructed between the color left view and the reconstructed left view, giving the following function:

$$L_{re} = |I_l - I_{r-l}| + |I_r - I_{l-r}|$$
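A minimal sketch of this reconstruction and of $L_{re}$ using bilinear warping via `torch.nn.functional.grid_sample`; the assumption that disparity is normalized by image width, and the sign of the horizontal shift, depend on the rectified stereo setup and are not specified by the patent.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp):
    """Reconstruct one view from the other using a predicted disparity map.

    src:  (B, 3, H, W) color image of the opposite view
    disp: (B, 1, H, W) disparity normalized by image width (assumption)
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=src.device),
                            torch.linspace(-1, 1, w, device=src.device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # shift sampling positions horizontally by the disparity (sign is setup-dependent)
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)
    return F.grid_sample(src, grid, align_corners=True)

def reconstruction_loss(left, right, d_l, d_r):
    # L_re = |I_l - I_{r->l}| + |I_r - I_{l->r}|
    i_rl = warp_with_disparity(right, d_l)  # reconstructed left view
    i_lr = warp_with_disparity(left, d_r)   # reconstructed right view
    return (left - i_rl).abs().mean() + (right - i_lr).abs().mean()
```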
For this embodiment, compared with previous work introducing semantic information into depth estimation — where the two tasks used two different networks trained iteratively — the present method shares one encoder and decoder between both tasks and distinguishes them during training with the flag bit t.
Secondly, the traditional stereo matching algorithm SGM is introduced to provide sparse true disparity values; although sparse, they improve disparity prediction performance. The network is again based on ResNet, but modified to better fit the task: ordinary convolutions are changed to dilated convolutions (with dilation rates 2, 4, and 8 in the third, fourth, and fifth convolution layers b, respectively), enlarging the receptive field without increasing parameter computation. To fuse deep abstract information with shallow local information, long skip connections are used between layers: the encoder reduces the image to 1/64 of its original size while extracting features; the decoder then recovers, by deconvolution, feature maps at 1/2, 1/4, and 1/8 of the original size; these feature maps (four scales in total, counting full resolution) are concatenated with the same-size feature maps from the encoder — and with the disparity map at that scale once one has been generated — and each finally passes through one convolution layer. The loss function is computed at multiple scales. In this way the network learns the detail in the feature maps better and the "hole" effect is reduced (i.e., the inaccurate "holes" that would otherwise appear in the predicted disparity map).
In another embodiment, step S300 further includes the following steps:
S301: passing the input color left and right views through the encoder of the fully convolutional neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: concatenating the feature map with the flag bit and inputting it to the decoder part of the fully convolutional neural network, and selecting whether to perform the semantic segmentation task or the disparity estimation task at the prediction-loss fusion layer according to the value of the flag bit.
For this embodiment, the reason the two tasks can share one encoder is inspired by the content-level consistency of semantic information and structural information of objects in an image. The same input color image is routed to the semantic segmentation task or the disparity estimation task by the flag bit t; when the loss function is computed, both the segmentation result and the disparity estimation result of the image enter the calculation, a loss is obtained from the loss function, and the network is trained by backpropagation. The loss function contains a term coupling semantics-guided disparity. The semantic segmentation task is supervised while the disparity estimation is unsupervised, so the depth estimation task itself remains unsupervised.
Existing work that uses semantic information to assist depth estimation often relies on iterative networks, i.e., two different networks for the two tasks, which leads to oversized networks that are hard to train. Here the two tasks share the same encoder and decoder, and training of the different tasks is controlled by the flag t; the theoretical basis is that semantic information and structural information of objects in an image are consistent in content. In addition, the traditional stereo matching algorithm SGM is introduced, which amounts to adding what can be understood as a weakly supervised prior in the absence of true depth values in the dataset, helping to improve performance.
The left and right predicted disparity maps in step S303 are the output of the decoder, obtained by concatenating the feature map with the flag bit and feeding it to the decoder of the fully convolutional network. The network essentially predicts disparity, but depth and disparity are related by the single formula depth = b × f / d (baseline times focal length over disparity). When the loss function stabilizes and approaches 0, training of the fully convolutional neural network is complete; the true depth values measured by lidar are then compared with the depth values inferred by the network to judge its effectiveness.
Because the network is essentially a multi-task model, semantic segmentation and disparity estimation share one network but differ in their loss functions. For disparity estimation, t is set to an all-0 feature map, concatenated with the feature map generated by the encoder, and fed to the decoder; since t is 0, the last layer performs the disparity estimation task. For semantic segmentation, t is set to an all-1 feature map, concatenated with the encoder feature map, fed to the decoder, and the semantic segmentation loss function is then used.
In another embodiment, step S303 further includes the following steps:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to compute disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps serve as a supervision signal for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input left and right views undergo feature extraction in the encoder, pass through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
In this embodiment, inputting a left view to directly predict both disparity maps $d_l$ and $d_r$ causes a mismatch between the predicted right disparity map $d_r$ and the true right image $I_r$: because the structure and texture information of the left and right views of a binocular camera differ, lacking the structure and texture of the right view makes it difficult to obtain a right disparity map from the left view $I_l$ alone. Therefore only the single disparity map corresponding to the input image is output, instead of outputting two predicted disparity maps from one view as before.
The disparity estimation result is obtained through the pooling layer; inputting one color image yields exactly one corresponding disparity map. The SGM-computed left and right disparity maps are obtained by applying the SGM algorithm to the color left and right views, and are then used to supervise the two disparity maps predicted by the network. In the existing approach, a color left view is input and both left and right disparity maps are predicted; however, since the input color left and right views genuinely differ, predicting both maps from one input view is inappropriate, so this method predicts from the input color left view only the corresponding left disparity map, and similarly for the right.
The SGM algorithm thus supplies sparse true disparity values from traditional stereo matching, and using them to supervise the predicted disparity maps improves network prediction performance. The semantic segmentation task merely guides the disparity estimation task during training, refining object edges and smoothing the result. The SGM disparity maps for the input views are computed in a separate step: SGM has a complete fixed computation procedure and does not pass through the fully convolutional network. The supervision signal in step S3031 refers to the left and right disparity maps computed from the color views by the SGM algorithm and is used only for the disparity estimation task. The supervision signal of the supervised training in S3032 refers only to the ground truth in the dataset. The disparity estimation task must be followed by the semantic segmentation task, because the result of the semantic segmentation task is needed to guide the disparity estimation task.
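Because SGM disparities are sparse, the supervision is naturally applied through a validity mask; a hedged sketch, under the assumption that pixels where SGM failed to match are encoded as 0:

```python
def sgm_loss(d_pred, d_sgm):
    """L1 loss against SGM disparities, evaluated only where SGM is valid.

    Assumption: invalid SGM pixels are stored as 0 and are excluded from
    the supervision signal.
    """
    valid = (d_sgm > 0).float()
    num_valid = valid.sum().clamp(min=1.0)  # avoid division by zero
    return ((d_pred - d_sgm).abs() * valid).sum() / num_valid
```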
In another embodiment, the encoder uses ResNet with dilated convolutions.
In another embodiment, the decoder portion employs four long skip connections.
In another embodiment, the output of the fully convolutional neural network has four scales, namely (1, 1/2, 1/4, 1/8) of the input image, on which the loss function calculation is performed.
In another embodiment, the flag bit is a feature map of all 0 s or all 1 s.
In another embodiment, the result of the semantic segmentation task is obtained by a Softmax function.
In another embodiment, the result of the disparity estimation task is obtained by an avg. pooling layer.
In another embodiment, the loss function L is composed of five sub-functions: a depth error, a semantic segmentation error, a left-right semantic consistency error, a semantics-guided disparity smoothing error, and an SGM algorithm error.
For this embodiment, the loss function is constructed at the four scales; in short, the loss function L is computed on four different scales. Total loss function:

$$L = L_{depth} + \alpha_{seg} L_{seg} + \alpha_{lrsc} L_{lrsc} + \alpha_{smooth} L_{smooth} + \alpha_{sgm} L_{sgm}$$

It is composed of five sub-functions: $L_{depth}$ is the depth error, $L_{seg}$ the semantic segmentation error, $L_{lrsc}$ the left-right semantic consistency error, $L_{smooth}$ the semantics-guided disparity smoothing error, and $L_{sgm}$ the SGM algorithm error; $\alpha_{seg}$, $\alpha_{lrsc}$, $\alpha_{smooth}$, and $\alpha_{sgm}$ are the corresponding weight coefficients.
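Assembling the total loss is then a direct weighted sum; a sketch, with each sub-loss assumed to be computed as described (and, per the text, accumulated over the four output scales). The default weights mirror the values reported later in this text and are otherwise tunable hyperparameters.

```python
def total_loss(losses, a_seg=0.1, a_lrsc=0.2, a_smooth=2.0, a_sgm=0.1):
    """Weighted sum of the five sub-losses.

    losses: dict with keys 'depth', 'seg', 'lrsc', 'smooth', 'sgm',
    each holding an already-computed scalar loss tensor.
    """
    return (losses['depth']
            + a_seg * losses['seg']
            + a_lrsc * losses['lrsc']
            + a_smooth * losses['smooth']
            + a_sgm * losses['sgm'])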
(1) Depth error: the depth error combines the image reconstruction loss, a left-right disparity consistency loss, and a disparity smoothing loss. The image reconstruction loss is

$$L_{re} = |I_l - I_{r-l}| + |I_r - I_{l-r}|,$$

where $I_l$ is the color left view and $I_{r-l}$ is the reconstructed left view obtained from the color right view and the predicted left disparity map; $I_r$ and $I_{l-r}$ are defined analogously.

$\alpha_{lr}$ and $\alpha_{ds}$ denote the weight of the left-right disparity consistency loss and the weight of disparity smoothing, respectively. $d_l$ is the predicted left disparity map and $d_{r-l}$ is the predicted right disparity map warped into a left disparity map; $d_r$ and $d_{l-r}$ are defined analogously. $\partial_x$ and $\partial_y$ denote the image gradients in the x and y directions.
(2) Semantic segmentation error: the cross-entropy function.
(3) Left-right semantic consistency error:

$$L_{lrsc} = |s_l - s_{r-l}| + |s_r - s_{l-r}|,$$

where $s_l$ is the predicted left semantic map and $s_{r-l}$ is obtained by reconstructing the predicted right semantic map $s_r$ with the predicted left disparity map $d_l$; $s_r$ and $s_{l-r}$ are defined analogously.
(4) Semantics-guided disparity smoothing error: here $d$ denotes a predicted disparity map and $s$ a semantic map; $\odot$ denotes element-wise multiplication of corresponding vector components; $\psi$ is the operation that sets the maximum value of each channel to 1 and the remaining values to 0; $f_{\rightarrow}$ is the operation of shifting the input by one pixel along the horizontal axis.
(5) SGM algorithm error: the disparity map computed by the SGM algorithm on the original images is compared with the predicted disparity map. Throughout the above, $d$ denotes a predicted disparity map, $I$ an original image, and $s$ a semantic segmentation map.
In another embodiment, the network is trained on the binocular images of the KITTI dataset, and the Cityscapes semantic segmentation dataset is also used. The network is implemented in the PyTorch deep learning framework and runs on an NVIDIA GTX 1080Ti graphics card. During training, the input images are resized to 256x512 resolution, and data augmentation is performed to avoid overfitting: three numbers are sampled from uniform distributions over the ranges [0.8, 1.2], [0.5, 2.0], and [0.8, 1.2] and applied as gamma, brightness, and color shifts. The optimizer is Adam with initial learning rate $\lambda = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-5}$. The weights of the different objective terms are $\alpha_{lr} = 0.2$, $\alpha_{ds} = 0.02$, $\alpha_{seg} = 0.1$, $\alpha_{lrsc} = 0.2$, $\alpha_{smooth} = 2.0$, $\alpha_{sgm} = 0.1$. The encoder is pre-trained on the ImageNet dataset before training.
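A sketch of the optimizer setup and augmentation sampling matching the hyperparameters above; mapping the three sampling ranges onto gamma, brightness, and per-channel color factors is our reading of the original and an assumption, as is reusing the `SharedEncoderDecoder` sketched earlier.

```python
import random
import torch

model = SharedEncoderDecoder()  # from the architecture sketch above (assumption)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999), eps=1e-5)

def sample_augmentation():
    # Uniformly sampled photometric factors; the role of each range is assumed.
    gamma = random.uniform(0.8, 1.2)
    brightness = random.uniform(0.5, 2.0)
    color = torch.empty(3).uniform_(0.8, 1.2)  # one factor per RGB channel
    return gamma, brightness, color
```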
Note that the network selects the task via the flag t, so when an input image undergoes forward inference, setting t to 0 yields a disparity map and setting t to 1 yields a semantic segmentation map.
In another embodiment, the method is run on the Cityscapes dataset, with the results shown in fig. 3. The first row of fig. 3 shows the original color images, the second row the semantic segmentation results for the corresponding inputs, and the last row the depth estimation results. Comparing the results of the two tasks shows that the disparity estimation task benefits from the semantic segmentation task and that both perform very well; in particular, semantic segmentation visibly refines object edges in the disparity estimation. At the same time, the content-level consistency of the structural and semantic information of objects in the scene is verified.
Although embodiments of the present invention have been described above with reference to the accompanying drawings, the invention is not limited to the specific embodiments and application fields described, which are merely illustrative and not restrictive. Those skilled in the art, having the benefit of this disclosure, may devise numerous other forms of the invention without departing from the scope of the claims.

Claims (8)

1. A semantic segmentation-assisted binocular vision unsupervised depth estimation method, the method comprising:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map;
the step S300 further comprising:
S301: passing the input color left and right views through the encoder of the fully convolutional neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: concatenating the feature map with the flag bit and inputting it to the decoder part of the fully convolutional neural network, and selecting whether to perform the semantic segmentation task or the disparity estimation task at the prediction-loss fusion layer according to the value of the flag bit;
the step S303 further comprising:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, applying the traditional SGM algorithm to the input color left and right views to compute disparity maps corresponding to the input views, providing sparse true disparity values, the computed left and right disparity maps serving as a supervision signal for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input left and right views undergo feature extraction in the encoder, pass through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
2. The method of claim 1, wherein the encoder uses ResNet with dilated convolutions.
3. The method of claim 1, wherein the decoder portion employs four long skip connections.
4. The method of claim 1, wherein the output of the fully convolutional neural network has four scales, namely (1, 1/2, 1/4, 1/8) of the input image, on which the loss function calculation is performed.
5. The method of claim 1, wherein the loss function consists of five sub-functions, namely a depth error, a semantic segmentation error, a left-right semantic consistency error, a semantics-guided disparity smoothing error, and an SGM algorithm error.
6. The method of claim 1, wherein the flag bit is a feature map of all 0 s or all 1 s.
7. The method of claim 1, the result of the semantic segmentation task is obtained by a Softmax function.
8. The method of claim 1, wherein the result of the disparity estimation task is obtained by an avg. pooling layer.
CN202110329765.XA 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method Active CN113096176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Publications (2)

Publication Number Publication Date
CN113096176A CN113096176A (en) 2021-07-09
CN113096176B (en) 2024-04-05

Family

ID=76670490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110329765.XA Active CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Country Status (1)

Country Link
CN (1) CN113096176B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882091B (en) * 2022-04-29 2024-02-13 中国科学院上海微系统与信息技术研究所 Depth estimation method combining semantic edges

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xinsheng; Zhang Guiling. Monocular depth estimation based on convolutional neural networks. Computer Engineering and Applications, 2020, (13), full text. *

Also Published As

Publication number Publication date
CN113096176A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN109919887B (en) Unsupervised image fusion method based on deep learning
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN113762358B (en) Semi-supervised learning three-dimensional reconstruction method based on relative depth training
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN110610486A (en) Monocular image depth estimation method and device
Maslov et al. Online supervised attention-based recurrent depth estimation from monocular video
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
Yue et al. Semi-supervised monocular depth estimation based on semantic supervision
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN113096176B (en) Semantic segmentation-assisted binocular vision unsupervised depth estimation method
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN115830094A (en) Unsupervised stereo matching method
CN116402874A (en) Spacecraft depth complementing method based on time sequence optical image and laser radar data
Yang et al. Unsupervised deep learning of depth, ego-motion, and optical flow from stereo images
Shi et al. Improved event-based dense depth estimation via optical flow compensation
Ji et al. RDRF-Net: A pyramid architecture network with residual-based dynamic receptive fields for unsupervised depth estimation
CN114693744A (en) Optical flow unsupervised estimation method based on improved cycle generation countermeasure network
CN113379715A (en) Underwater image enhancement and data set true value image acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant