CN113096176A - Semantic segmentation assisted binocular vision unsupervised depth estimation method - Google Patents

Semantic segmentation assisted binocular vision unsupervised depth estimation method

Info

Publication number
CN113096176A
CN113096176A (application CN202110329765.XA)
Authority
CN
China
Prior art keywords
neural network
convolution neural
semantic segmentation
full convolution
disparity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110329765.XA
Other languages
Chinese (zh)
Other versions
CN113096176B (en)
Inventor
任鹏举
李凌阁
丁焱
景鑫
赵文哲
夏天
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202110329765.XA priority Critical patent/CN113096176B/en
Publication of CN113096176A publication Critical patent/CN113096176A/en
Application granted granted Critical
Publication of CN113096176B publication Critical patent/CN113096176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A semantic segmentation assisted binocular vision unsupervised depth estimation method comprises the following steps: S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder; S200: capturing a left view and a right view using a device equipped with a binocular camera; S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network; S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.

Description

Semantic segmentation assisted binocular vision unsupervised depth estimation method
Technical Field
The present disclosure belongs to the field of computer vision and computer graphics, and particularly relates to a semantic segmentation assisted binocular vision unsupervised depth estimation method.
Background
Depth estimation is an important research topic in the field of computer vision, with significant applications in autonomous driving, AR/VR, three-dimensional reconstruction, and object grasping. Although several kinds of hardware and methods can acquire scene depth information today, each has its own drawbacks. For example, triangulation from monocular camera images and disparity computation from binocular camera images tend to be computationally expensive, have limited accuracy, and depend strongly on the texture information of the scene. Lidar is expensive, and relying on a single sensor is itself a risk. Structured-light depth sensors suffer from a limited measurement range and perform poorly in outdoor environments. With the development of deep learning, the industry has turned its attention to visual depth estimation methods based on deep learning. In the prior art, supervised learning has been applied to depth estimation with good results; however, supervised methods are strongly limited by their need for labels and generalize poorly. Research on unsupervised depth estimation methods has therefore become increasingly popular.
Semantic segmentation may be defined as the process of creating a mask over an image in which pixels are assigned to a predefined set of semantic classes. The segmentation may be binary (e.g., "person" vs. "non-person" pixels) or multi-class (e.g., pixels labeled "person", "car", "building", etc.). As semantic segmentation techniques grow in accuracy and adoption, it becomes increasingly important to develop methods that exploit such segmentations and integrate the segmentation information into existing computer vision applications such as depth or disparity estimation. Today, the most accurate semantic segmentation methods are based on deep learning.
At present, depth estimation methods based on unsupervised deep learning fall into two types, those based on monocular consecutive frames and those based on binocular image pairs, the latter being the more common and practical. Methods based on monocular consecutive frames use two networks: a depth prediction network that predicts the depth map, and a pose estimation network that estimates the camera pose. An adjacent frame is then reconstructed from the predicted depth map and pose, and the reconstruction is compared with the original adjacent frame to compute a loss function and train the networks. Methods based on binocular image pairs first input the left view of a binocular image pair into a full convolutional network to obtain predicted left and right disparity maps, then reconstruct left and right views from the predicted disparity maps and the original left and right views, and finally compare the original views with the reconstructed views to compute the loss function that trains the network. Among these, a network trained on binocular images can estimate absolute depth directly, with accuracy that is high and sometimes even better than supervised methods, whereas monocular data sets contain no absolute scale information, so the predicted depth is only relative. The unsupervised depth estimation method based on binocular image pairs is therefore the focus here.
Existing environmental depth estimation most often relies on lidar, but lidar is expensive, does not provide dense depth maps, and over-reliance on a single sensor is itself a risk. A binocular camera can obtain depth directly by computing disparity, but the computation is expensive and accuracy drops when the image lacks texture. Depth estimation based on deep learning can provide dense depth maps, adapts well to different scenes, and reduces cost. Previous methods that used semantic segmentation to assist depth estimation generally combined depth and semantic information through iterative networks, which makes the network cumbersome and leaves it, in the forward inference stage, without perception of the structural information of the other view, degrading the final disparity prediction.
Disclosure of Invention
In order to solve the above problem, the present disclosure provides a binocular vision unsupervised depth estimation method assisted by semantic segmentation, including:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
Through this technical scheme, binocular images are combined with scene depth estimation and semantics, and the disparity computed by the traditional disparity calculation method SGM is added as a supervision signal. In this method the semantic segmentation task is supervised learning, and the disparity map computed by the SGM algorithm serves as one of the supervision signals for the network's disparity estimation; however, no depth information from lidar is used. Instead, a reconstructed image is formed from the predicted disparity map and an original view, and the reconstruction is compared with the original to construct the loss function, so with respect to depth estimation the method is an unsupervised learning task. Assuming the input left view yields a corresponding predicted left disparity map, the original right view is then warped with this predicted left disparity map to obtain a reconstructed left view; inputting the right view yields the reconstructed right view in the same way.
The invention has the following advantages:
1) The conventional multi-task learning network is replaced by an encoder-decoder network shared by the semantic segmentation task and the disparity estimation task. The underlying rationale is that the semantic information and the structural information of objects in an image are consistent in content, so the two can reinforce each other during training, improving accuracy while simplifying the scale of the network.
2) In previous methods, a left view is input and both the left and right disparity maps d_l and d_r are predicted directly. However, this causes a mismatch between the predicted right disparity map d_r and the true right image I_r: because the structure and texture information of the left and right views of a binocular camera differ, it is difficult to obtain a right disparity map from the left view I_l alone. Therefore the network outputs only the single disparity map corresponding to the input image, instead of outputting two predicted disparity maps from one view as in the prior art.
3) In the disparity estimation stage, the traditional stereo matching algorithm SGM is introduced to compute a disparity map for the current input image; the computed disparity map adds prior information to the predicted disparity map, which aids network convergence and improves prediction accuracy.
Drawings
Fig. 1 is a flow chart of a semantic segmentation assisted binocular vision unsupervised depth estimation method provided in an embodiment of the present disclosure;
Fig. 2 is a structural diagram of the full convolution neural network provided in one embodiment of the present disclosure;
fig. 3 is a graph of experimental results provided in one embodiment of the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 3.
In one embodiment, referring to fig. 1, a semantic segmentation assisted binocular vision unsupervised depth estimation method is disclosed, the method comprising:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
The overall structure of the full convolution neural network is shown in fig. 2 and comprises an encoder and a decoder: the layers from convolution layer 1 through convolution layer 6b form the encoder, and the layers from deconvolution layer 5 through prediction layer i, plus the loss, form the decoder. The network is adapted from the classical ResNet. The ordinary convolutions in the encoder are replaced by dilated convolutions, which enlarge the receptive field as much as possible without increasing the parameter count, so that feature maps of the scene are better extracted. After the encoder extracts the features of the input image, the task is represented by t, a feature map filled entirely with 0s or 1s according to the task (0 for the disparity estimation task, 1 for the semantic segmentation task); a feature map here consists of one or more matrices. The extracted features, concatenated with t, then enter the decoder. The decoder fuses shallow local features with deep abstract features and adopts four long skip connections to enhance the prediction. The model outputs at four scales relative to the input image (1, 1/2, 1/4, 1/8), and the loss function is computed at all four scales.
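As an illustration, the following minimal PyTorch sketch shows how a single encoder-decoder can serve both tasks through the flag tensor t; the class name, channel sizes, dilation rates and layer count are assumptions for illustration, not the patent's exact ResNet-based configuration.

```python
# Minimal sketch, assuming illustrative layer sizes (not the patent's exact
# ResNet-based layout): one encoder-decoder serves both tasks, with an
# all-0 (disparity) or all-1 (segmentation) flag map t concatenated to the
# encoder features before decoding.
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    def __init__(self, feat_ch=64, num_classes=19):
        super().__init__()
        # Dilated convolutions enlarge the receptive field without
        # increasing the parameter count (rates here are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # The decoder receives one extra channel: the task flag map t.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + 1, feat_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, num_classes, 3, padding=1),
        )

    def forward(self, x, task):
        """task = 0: disparity estimation; task = 1: semantic segmentation."""
        f = self.encoder(x)
        t = torch.full_like(f[:, :1], float(task))   # all-0 or all-1 flag map
        y = self.decoder(torch.cat([f, t], dim=1))
        if task == 0:
            return torch.sigmoid(y[:, :1])           # single normalized disparity map
        return y                                     # class logits; softmax in the loss
```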
The color left and right views input in step S300 take the form of an image pair (i.e., a left view and a right view). The left-then-right input order simply follows common practice: the left view is input first, the predicted left disparity map is obtained from the neural network, and the image reconstruction loss is computed; the same is done for the right view, and both losses are added into the total loss function. This loss computation is where the unsupervised nature of the method shows itself. For example: the left view is input and the network predicts the left disparity map; through image reconstruction, the predicted left disparity map and the original right view yield a reconstructed left view, which is compared with the original input left view to compute the loss. The same applies to the right view. In this way the network is trained and gains the ability to predict disparity maps. Converting between a disparity map and a depth map takes only a single formula, which yields the desired depth map.
The main flow of the full convolution neural network is as follows. A color left view is input to the network; when disparity is to be predicted, the flag bit t is set to 0, the feature map extracted by the encoder is concatenated with t, and the result enters the decoder. Since t = 0, the decoder is followed by average pooling and outputs the predicted left disparity map. The same operation is performed on the color right view. At this point the SGM algorithm, a traditional stereo matching algorithm with a fixed set of formulas, is applied to the color left and right views to compute left and right disparity maps. These computed disparity maps contain sparse true disparity values, which are used to supervise the disparity maps predicted by the network, yielding another loss term. Next, a color left view is input again; when semantics are to be predicted, the flag bit t is set to 1, the encoder feature map is concatenated with t and enters the decoder, and since t = 1 the decoder is followed by a softmax function to produce the predicted semantic map, whose supervised loss is computed against the corresponding label. The same operation is performed on the color right view. The loss functions of the two tasks are then combined into an overall loss function, and one iteration of backpropagation trains the network; the next batch of data follows, and so on, until the network is trained. The input to the network is always a color image, from one of two sources: 1) the left or right view of a binocular pair, for predicting disparity; 2) the left or right view of a binocular pair together with the corresponding semantic segmentation label, for predicting semantic segmentation. The flag bit t is a tensor of the same size as the feature map, all 0s or all 1s.
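Under the assumptions of the sketch above, one training iteration might look as follows; all names are hypothetical, and the loss terms stand in for the functions defined later in this description:

```python
# Illustrative training iteration (hypothetical names); sgm_dl/sgm_dr are
# disparities precomputed by the classical SGM algorithm, and label_l/label_r
# are semantic segmentation labels.
def train_step(net, optimizer, I_l, I_r, label_l, label_r, sgm_dl, sgm_dr, losses):
    d_l = net(I_l, task=0)            # flag t = 0: predicted left disparity
    d_r = net(I_r, task=0)            # flag t = 0: predicted right disparity
    s_l = net(I_l, task=1)            # flag t = 1: left semantic logits
    s_r = net(I_r, task=1)            # flag t = 1: right semantic logits
    loss = (losses.depth(I_l, I_r, d_l, d_r)
            + losses.seg(s_l, label_l) + losses.seg(s_r, label_r)
            + losses.lrsc(s_l, s_r, d_l, d_r)
            + losses.smooth(d_l, s_l) + losses.smooth(d_r, s_r)
            + losses.sgm(d_l, sgm_dl) + losses.sgm(d_r, sgm_dr))
    optimizer.zero_grad()
    loss.backward()                   # one backpropagation pass per iteration
    optimizer.step()
    return loss.item()
```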
Strictly, image reconstruction enters the loss as the "image reconstruction loss", which belongs to the category of depth error. The reconstruction process is as follows: for example, a color left view is input and the network predicts the left disparity map; the predicted left disparity map and the color right view are then used to reconstruct a left view. The image reconstruction loss is constructed from the original color left view and the reconstructed left view (and symmetrically for the right), giving the following function:
L_re = |I_l - I_(r→l)| + |I_r - I_(l→r)|
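The patent does not spell out the warping implementation; a common differentiable realization in unsupervised stereo work (a sketch under that assumption, with hypothetical function names) samples the right view at horizontally disparity-shifted coordinates:

```python
# Hedged sketch: differentiable view reconstruction by horizontal warping,
# as commonly used in unsupervised stereo depth estimation.
import torch
import torch.nn.functional as F

def reconstruct_left(right_img, left_disp):
    """Synthesize the left view by sampling the right view at x - d(x).
    right_img: (B,3,H,W); left_disp: (B,1,H,W), in pixels."""
    b, _, h, w = right_img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=right_img.dtype, device=right_img.device),
        torch.arange(w, dtype=right_img.dtype, device=right_img.device),
        indexing="ij")
    x_src = xs.unsqueeze(0) - left_disp[:, 0]                 # disparity shift
    grid = torch.stack(
        [2 * x_src / (w - 1) - 1,                             # normalize x to [-1, 1]
         (2 * ys / (h - 1) - 1).unsqueeze(0).expand_as(x_src)],
        dim=-1)
    return F.grid_sample(right_img, grid, padding_mode="border",
                         align_corners=True)

def image_reconstruction_loss(I_l, I_r, I_rl, I_lr):
    """L_re = |I_l - I_(r->l)| + |I_r - I_(l->r)|."""
    return (I_l - I_rl).abs().mean() + (I_r - I_lr).abs().mean()
```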
for this embodiment, first, compared to the previous work of introducing semantic information into depth estimation, they are two different networks for two tasks, and iterative training is performed, and the method is to share one encoder and one decoder for the two tasks, and to use the flag bit t to distinguish the tasks in training.
Secondly, the traditional stereo matching algorithm SGM is introduced to provide sparse true disparity values; although sparse, they still improve disparity prediction. Thirdly, the network, while based on ResNet, is modified to better suit the task. For example, ordinary convolutions are replaced by dilated convolutions (convolution layers 3b, 4b and 5b use dilation rates of 2, 4 and 8, respectively), increasing the receptive field without increasing the parameter count. To fuse deep abstract information with shallow local information, long skip connections fuse information between layers: the encoder reduces the image to 1/64 of its size while extracting features; the decoder then uses deconvolution to obtain feature maps at the original size and at 1/2, 1/4 and 1/8 of it; each of these four feature maps is concatenated with the same-size feature map from the encoder (and, where a disparity map has been generated, that disparity map is added to the concatenation); finally each passes through one convolution layer. The loss function is computed at multiple scales. In this way the network learns the detailed information in the feature maps better and reduces the "holes" effect (inaccurate "holes" appearing in the predicted disparity map). One decoder stage is sketched below.
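The following sketch shows one decoder stage with such a long skip connection; the channel sizes and per-stage structure are assumptions for illustration:

```python
# Hedged sketch of one decoder stage: upsample, concatenate the same-size
# encoder feature map (long skip connection) and the upsampled lower-scale
# disparity, then fuse and predict disparity at this scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoderStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(out_ch + skip_ch + 1, out_ch, 3, padding=1)
        self.disp = nn.Conv2d(out_ch, 1, 3, padding=1)

    def forward(self, x, skip, prev_disp):
        x = F.relu(self.up(x))
        d_up = F.interpolate(prev_disp, size=x.shape[-2:],
                             mode="bilinear", align_corners=True)
        x = F.relu(self.fuse(torch.cat([x, skip, d_up], dim=1)))
        return x, self.disp(x)   # features for the next stage + this scale's disparity
```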
In another embodiment, the step S300 further comprises the steps of:
S301: the input color left and right views pass through the encoder part of the full convolution neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: after concatenation with the flag bit, the feature map is input to the decoder part of the full convolution neural network, and the semantic segmentation task or the disparity estimation task is selected according to the value of the flag bit in the prediction-loss fusion layer.
For this embodiment, the reason the two tasks can share one encoder is inspired by the consistency in content between the semantic information and the structural information of objects in an image. For the same input color image, the flag bit t distinguishes the semantic segmentation task from the disparity estimation task. When the loss function is computed, the segmentation result and the disparity estimation result of the image are both evaluated, the loss is computed according to the loss function, and the network is then trained by backpropagation. The loss function contains a term in which semantics guide the disparity. The semantic segmentation task is supervised while the disparity estimation is unsupervised, so the method remains unsupervised with respect to the depth estimation task.
Existing work that uses semantic information to assist depth estimation usually adopts iterative networks, i.e., two different networks for the two tasks, which makes the network oversized and hard to train. Here the two tasks share the same encoder and decoder, and the different tasks are trained via the flag t. The rationale is that the semantic information and the structural information of objects in an image are consistent in content. In addition, the traditional stereo matching algorithm SGM is introduced; even without true depth values for the data set, this adds what may be understood as a weakly supervised prior and improves performance.
The left and right predicted disparity maps in step S303 are obtained at the output of the decoder after the feature map, concatenated with the flag bit, is input to the decoder part of the full convolution neural network. The full convolution neural network essentially predicts disparity, but depth and disparity are related by the simple formula Depth = b × f / d, where b is the stereo baseline and f the focal length. When the loss function stabilizes and approaches 0, the full convolution neural network is trained; the true depth values computed from real lidar are then compared with the depth values inferred by the full convolution neural network to judge its effectiveness.
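A sketch of this conversion (the clamp that guards against zero disparity is an assumption):

```python
# Sketch of the disparity-to-depth conversion Depth = b * f / d, where b is
# the stereo baseline and f the focal length in pixels; the clamp is an
# assumption to avoid division by zero.
import torch

def disparity_to_depth(disp: torch.Tensor, baseline: float, focal: float,
                       eps: float = 1e-6) -> torch.Tensor:
    return baseline * focal / disp.clamp(min=eps)
```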
Since this network is essentially a multi-task model, semantic segmentation and disparity estimation share one network, but their loss functions differ. When performing disparity estimation, t is set to an all-0 feature map, concatenated with the feature map generated by the encoder, and passed to the decoder; since t = 0, the last layer performs the disparity estimation task. For the semantic segmentation task, t is set to an all-1 feature map, concatenated with the encoder's feature map, and passed to the decoder, after which the semantic segmentation loss function is used.
In another embodiment, the step S303 further comprises the steps of:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to obtain computed disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps then serve as supervision signals for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input color left and right views pass through the encoder for feature extraction and then through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
For this embodiment: inputting a left view and directly predicting both disparity maps d_l and d_r leads to a mismatch between the predicted right disparity map d_r and the true right image I_r. Because the structure and texture information of the left and right views of a binocular camera differ, it is difficult to obtain a right disparity map from the left view I_l alone. Therefore only the single disparity map corresponding to the input image is output, instead of outputting two predicted disparity maps from one view as before.
The result of the disparity estimation is obtained through the pooling layer. Inputting one color image yields exactly one corresponding disparity map. After the color left and right views obtain their corresponding predicted left and right disparity maps from the network, the SGM algorithm is applied to the color left and right views to obtain the computed left and right disparity maps, which are then used to supervise the two network-predicted disparity maps. Each computed disparity map is a single map. In conventional methods, one input color left view yields both left and right disparity maps; but since the input color left and right views differ in nature, predicting both disparity maps from one input view is inappropriate, so this method predicts the corresponding left disparity map from the input color left view, and likewise for the right view.
The SGM algorithm is used here to obtain sparse true disparity values through a traditional stereo matching algorithm; using them to supervise the predicted disparity map improves prediction performance. The semantic segmentation task only guides the disparity estimation task, refining object edges and smoothing the result during training. This processing is implemented separately with the traditional SGM algorithm, which computes the disparity map for the input views through its own complete computation procedure, without passing through the full convolution neural network. The supervision signal in step S3031 is obtained by applying the SGM algorithm to the left and right color views and is used only for the disparity estimation task; the supervision signal in S3032 refers only to the ground truth in the data set. After the disparity estimation task finishes, the network switches to the semantic segmentation task, because the disparity estimation task needs to be guided by the result of the semantic segmentation task. An illustrative stand-in for the SGM step is sketched below.
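As an illustration only, OpenCV's semi-global block matcher can play the role of the classical SGM step; the matcher parameters below are hypothetical, not the patent's configuration:

```python
# Hedged sketch: OpenCV's semi-global block matcher as a stand-in for the
# classical SGM step; invalid pixels are masked out so the resulting
# supervision is sparse.
import cv2
import numpy as np
import torch

def sgm_disparity(left_gray: np.ndarray, right_gray: np.ndarray) -> torch.Tensor:
    """left_gray/right_gray: rectified 8-bit grayscale images."""
    block = 5
    sgm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                blockSize=block,
                                P1=8 * block * block, P2=32 * block * block,
                                uniquenessRatio=10, speckleWindowSize=100,
                                speckleRange=2)
    disp = sgm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disp[disp <= 0] = 0.0                       # mark invalid matches
    return torch.from_numpy(disp)

def sgm_supervision_loss(pred_disp: torch.Tensor, sgm_disp: torch.Tensor) -> torch.Tensor:
    """L1 loss only on pixels where SGM produced a valid disparity."""
    mask = sgm_disp > 0
    if not mask.any():
        return pred_disp.new_zeros(())
    return (pred_disp[mask] - sgm_disp[mask]).abs().mean()
```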
In another embodiment, the encoder uses ResNet with dilated convolutions.
In another embodiment, the decoder part adopts four long skip connections.
In another embodiment, the output of the full convolution neural network has four scales, corresponding to 1, 1/2, 1/4 and 1/8 of the input image, and the loss function is computed at all four scales.
In another embodiment, the flag bit is an all-0 or all-1 feature map.
In another embodiment, the result of the semantic segmentation task is obtained by a Softmax function.
In another embodiment, the result of the disparity estimation task is obtained by average pooling.
In another embodiment, the loss function L is composed of five sub-functions, which are depth error, semantic segmentation error, left-right semantic consistency error, semantic-guided disparity smoothing error, and error of SGM algorithm, respectively.
For the purposes of this embodiment, the loss function is constructed at four scales; in brief, the loss function L is built at four different scales. The overall loss function is:
L = L_depth + α_seg·L_seg + α_lrsc·L_lrsc + α_smooth·L_smooth + α_sgm·L_sgm

It is composed of five sub-functions: L_depth is the depth error, L_seg the semantic segmentation error, L_lrsc the left-right semantic consistency error, L_smooth the semantic-guided disparity smoothing error, and L_sgm the error of the SGM algorithm; α_seg, α_lrsc, α_smooth and α_sgm are the corresponding weight coefficients.
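A sketch of how the five sub-losses combine; the weight values follow the embodiment described later (α_seg = 0.1, α_lrsc = 0.2, α_smooth = 2.0, α_sgm = 0.1), and each sub-loss is assumed to already be summed over the four scales:

```python
# Sketch of the overall objective; the default weights follow the training
# configuration stated later in this description.
def total_loss(L_depth, L_seg, L_lrsc, L_smooth, L_sgm,
               a_seg=0.1, a_lrsc=0.2, a_smooth=2.0, a_sgm=0.1):
    return (L_depth + a_seg * L_seg + a_lrsc * L_lrsc
            + a_smooth * L_smooth + a_sgm * L_sgm)
```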
(1) Depth error:
L_depth = L_re + α_lr·L_lr + α_ds·L_ds

where L_re is the image reconstruction loss, L_re = |I_l - I_(r→l)| + |I_r - I_(l→r)|; I_l is the color left view and I_(r→l) is the reconstructed left view obtained from the color right view and the predicted left disparity map (I_r and I_(l→r) are analogous). α_lr and α_ds are the weights of the left-right disparity consistency loss and the disparity smoothing loss, respectively; d_l is the predicted left disparity map and d_(r→l) is the predicted right disparity map warped into a left disparity map (d_r and d_(l→r) analogous), so that L_lr = |d_l - d_(r→l)| + |d_r - d_(l→r)|. ∂_x and ∂_y denote the gradients of the image in the x and y directions, and L_ds is an edge-aware smoothness term, L_ds = |∂_x d|·e^(-|∂_x I|) + |∂_y d|·e^(-|∂_y I|).
(2) Semantic segmentation error: the cross-entropy function.
(3) Left and right semantic consistency errors:
L_lrsc = |s_l - s_(r→l)| + |s_r - s_(l→r)|, where s_l is the predicted left semantic map and s_(r→l) is obtained by reconstructing the predicted right semantic map s_r with the predicted left disparity map d_l (s_r and s_(l→r) are analogous).
(4) Semantic-guided disparity smoothing error:
L_smooth = |∂_x d| ⊙ ψ(s) ⊙ f_→(ψ(s)), summed over the semantic channels,

where d denotes a predicted disparity map and s a semantic map; ⊙ denotes element-wise multiplication (corresponding components are multiplied); ψ is an operation that sets the maximum value of each channel to 1 and the remaining values to 0; and f_→ is an operation that translates its input by one pixel along the horizontal axis.
(5) Error of the SGM algorithm: the disparity map of the original image is computed by the SGM algorithm and compared with the predicted disparity map.
Throughout, d denotes a predicted disparity map, I an original image, and s a semantic segmentation map.
In another embodiment, the network is trained using the binocular images of the KITTI data set, and the Cityscapes semantic segmentation data set is also used. The network is implemented in the PyTorch deep learning framework and runs on an NVIDIA GTX 1080Ti graphics card. During training, input images are resized to a resolution of 256×512, and data augmentation is performed to avoid overfitting: more specifically, three numbers are sampled from uniform distributions over the ranges [0.8, 1.2], [0.5, 2.0] and [0.8, 1.2] and applied as photometric shifts, including gamma offsets. The optimizer is Adam with initial learning rate λ = 1e-4, β1 = 0.9, β2 = 0.999 and ε = 1e-5. The weights of the different objective terms are α_lr = 0.2, α_ds = 0.02, α_seg = 0.1, α_lrsc = 0.2, α_smooth = 2.0 and α_sgm = 0.1. The encoder part is pre-trained on the ImageNet data set before training.
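For concreteness, the stated optimizer settings map to PyTorch as below (a sketch; net is the full convolution neural network, and the assignment of the three sampled ranges to gamma, brightness and color shifts is an assumption):

```python
# Sketch of the training configuration stated above; the mapping of the
# three sampled ranges to gamma/brightness/color is an assumption.
import random
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-5)

def sample_augmentation():
    gamma      = random.uniform(0.8, 1.2)   # gamma shift
    brightness = random.uniform(0.5, 2.0)   # brightness scale (assumed)
    color      = random.uniform(0.8, 1.2)   # per-channel color scale (assumed)
    return gamma, brightness, color
```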
Note that the network selects its task through the flag t: during forward inference, setting t = 0 yields a disparity map and setting t = 1 yields a semantic segmentation map.
In another embodiment, the method is run on the Cityscapes data set, with results as shown in fig. 3. In fig. 3, the first row shows the original color images, the second row the semantic segmentation results for the corresponding inputs, and the last row the depth estimation results. Comparing the results of the semantic segmentation and disparity estimation tasks shows that semantic segmentation promotes disparity estimation and that both perform well; in particular, the semantic segmentation task noticeably refines object edges in the disparity estimation. This also verifies the consistency in content between the structural and semantic information of objects in the scene.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims (10)

1. A semantic segmentation assisted binocular vision unsupervised depth estimation method, comprising the following steps:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
2. The method according to claim 1, wherein the step S300 further comprises the steps of:
S301: the input color left and right views pass through the encoder part of the full convolution neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: after concatenation with the flag bit, the feature map is input to the decoder part of the full convolution neural network, and the semantic segmentation task or the disparity estimation task is selected according to the value of the flag bit in the prediction-loss fusion layer.
3. The method of claim 2, wherein the step S303 further comprises the steps of:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to obtain computed disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps then serve as supervision signals for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input color left and right views pass through the encoder for feature extraction and then through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
4. The method of claim 1, wherein the encoder uses ResNet with dilated convolutions.
5. The method of claim 1, wherein the decoder part adopts four long skip connections.
6. The method of claim 1, wherein the output of the full convolution neural network has four scales, corresponding to 1, 1/2, 1/4 and 1/8 of the input image, and the loss function is computed at all four scales.
7. The method of claim 1, wherein the loss function L is composed of five sub-functions: the depth error, the semantic segmentation error, the left-right semantic consistency error, the semantic-guided disparity smoothing error, and the error of the SGM algorithm.
8. The method of claim 2, wherein the flag bit is an all-0 or all-1 feature map.
9. The method of claim 2, wherein the result of the semantic segmentation task is obtained by a Softmax function.
10. The method of claim 2, wherein the result of the disparity estimation task is obtained by average pooling.
CN202110329765.XA 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method Active CN113096176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Publications (2)

Publication Number Publication Date
CN113096176A true CN113096176A (en) 2021-07-09
CN113096176B CN113096176B (en) 2024-04-05

Family

ID=76670490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110329765.XA Active CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Country Status (1)

Country Link
CN (1) CN113096176B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Xinsheng; ZHANG Guiling: "Monocular Depth Estimation Based on Convolutional Neural Network", Computer Engineering and Applications, No. 13.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882091A (en) * 2022-04-29 2022-08-09 中国科学院上海微系统与信息技术研究所 Depth estimation method combined with semantic edge
CN114882091B (en) * 2022-04-29 2024-02-13 中国科学院上海微系统与信息技术研究所 Depth estimation method combining semantic edges

Also Published As

Publication number Publication date
CN113096176B (en) 2024-04-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant