CN113096176B - Semantic segmentation-assisted binocular vision unsupervised depth estimation method - Google Patents

Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Info

Publication number
CN113096176B
CN113096176B (application CN202110329765.XA)
Authority
CN
China
Prior art keywords
neural network
semantic segmentation
parallax
task
convolution neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110329765.XA
Other languages
Chinese (zh)
Other versions
CN113096176A (en)
Inventor
任鹏举 (Ren Pengju)
李凌阁 (Li Lingge)
丁焱 (Ding Yan)
景鑫 (Jing Xin)
赵文哲 (Zhao Wenzhe)
夏天 (Xia Tian)
郑南宁 (Zheng Nanning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202110329765.XA
Publication of CN113096176A
Application granted
Publication of CN113096176B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A semantic segmentation-assisted binocular vision unsupervised depth estimation method, the method comprising: S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder; S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view; S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network; S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.

Description

Semantic segmentation-assisted binocular vision unsupervised depth estimation method
Technical Field
The present disclosure belongs to the field of computer vision and computer graphics, and in particular relates to a semantic segmentation-assisted binocular vision unsupervised depth estimation method.
Background
Depth estimation is an important research topic in the field of computer vision. It is of great significance in applications such as autonomous driving, AR/VR, three-dimensional reconstruction, and object grasping. Although several kinds of hardware and methods can obtain depth-of-field information today, each has its own drawbacks. For example, disparity computation from monocular camera images or binocular camera images tends to be computationally expensive, has low accuracy, and depends heavily on scene texture information. Lidar is expensive, and relying solely on a single sensor is itself a risk. Structured-light depth sensors suffer from a limited measurement range and perform poorly in outdoor environments. With the advent of deep learning, the industry has turned its attention to learning-based visual depth estimation. Prior-art methods that use supervised learning for depth estimation achieve good results, but supervised methods suffer from strong dataset limitations and poor generalization. Unsupervised depth estimation has therefore become a popular research direction.
Semantic segmentation may be defined as the process of creating a mask over an image in which pixels are assigned to a predefined set of semantic categories. The segmentation may be binary (e.g., "human pixels" vs. "non-human pixels") or multi-class (e.g., pixels labeled "human", "car", "building", etc.). With the increasing accuracy and adoption of semantic segmentation, it is becoming increasingly important to develop techniques that exploit such segmentation and integrate the segmentation information into existing computer vision applications such as depth or disparity estimation. The most advanced methods for semantic segmentation are based on deep learning.
Current unsupervised deep-learning depth estimation methods divide into methods based on monocular consecutive frames and methods based on binocular image pairs, the latter being more common and practical. Monocular consecutive-frame methods: one network is a depth prediction network that predicts a depth map, the other is a pose estimation network that estimates camera pose; adjacent frames are then reconstructed using the predicted depth map and pose, and the reconstructed frames are compared with the original adjacent frames to compute a loss function and train the networks. Binocular image-pair methods: the left view of a binocular camera image pair is first input to a fully convolutional network to obtain predicted left and right disparity maps; the predicted disparity maps are then used together with the original left and right views to reconstruct the left and right views; finally the original views are compared with the reconstructed views to compute a loss function and train the network. Among these, a network trained with binocular images can directly estimate absolute depth with accuracy that can even exceed supervised methods, whereas a monocular image dataset contains no absolute scale information and the predicted depth is only relative; this disclosure therefore mainly studies unsupervised depth estimation based on binocular image pairs.
Among existing methods of environmental depth estimation, lidar is the most common and effective, but lidar is expensive, does not provide a dense depth map, and relying heavily on a single sensor is itself a risk. A binocular camera can obtain depth directly by computing disparity, but the computation is expensive and accuracy is low when images lack texture information. Depth estimation by deep learning can provide dense depth maps, adapts well to different scenes, and reduces cost. Previous methods that use semantic segmentation to assist depth estimation generally combine depth and semantic information through iterative networks, which makes the network bulkier; at the same time, in the forward-inference stage the network lacks perception of the structural information of the other view, degrading final disparity prediction.
Disclosure of Invention
In order to solve the above-mentioned problems, the present disclosure provides a semantic segmentation-assisted binocular vision unsupervised depth estimation method, which includes:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.
Through this technical scheme, scene depth estimation from binocular images is combined with semantics, and the disparity computed by the traditional disparity algorithm SGM is added as a supervision signal. In this method the semantic segmentation task is supervised learning, and the disparity map computed by the SGM algorithm serves as one of the supervision signals for network disparity estimation; however, the views are reconstructed from the predicted disparity maps and the original images without any depth information provided by lidar, and the reconstructed images are compared with the originals to construct a loss function, so depth estimation itself remains an unsupervised learning task. Suppose the input left view yields a corresponding predicted left disparity map; the right view corresponding to the input left view is then combined with this predicted left disparity map to obtain a reconstructed left view. Inputting the right view yields a reconstructed right view in the same way.
Beneficial effects of the invention:
1) Instead of the conventional multi-task learning networks used in the past, the semantic segmentation task and the disparity estimation task share the same encoder network. The underlying rationale is that the semantic information and the structural information of objects in an image are consistent in content, so the two can mutually reinforce each other during training, improving accuracy while reducing the scale of the network.
2) Previous methods input a left view and directly predict both disparity maps $d_l$ and $d_r$, but this causes a mismatch between the predicted right disparity map $d_r$ and the true right image $I_r$: the structure and texture information of the left and right views of a binocular camera differ, and without the structure and texture information of the right view it is difficult to obtain a right disparity map from the left view $I_l$ alone. The method therefore outputs only the single disparity map corresponding to the input image, instead of outputting two predicted disparity maps from one view as before.
3) In the disparity estimation stage, the traditional stereo matching algorithm SGM is introduced to compute a disparity map for the current input image; the computed disparity map amounts to prior information added to the predicted disparity map, which helps the network converge and improves prediction accuracy.
Drawings
FIG. 1 is a flow chart of a semantic segmentation assisted binocular vision unsupervised depth estimation method provided in one embodiment of the present disclosure;
FIG. 2 is a block diagram of a full convolutional neural network provided in one embodiment of the present disclosure;
fig. 3 is a graph of experimental results provided in one embodiment of the present disclosure.
Detailed Description
The invention is described in further detail below with reference to fig. 1 to 3.
In one embodiment, referring to fig. 1, a semantic segmentation assisted binocular vision unsupervised depth estimation method is disclosed, the method comprising:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map.
The overall structure of the fully convolutional neural network is shown in fig. 2. It consists of an encoder and a decoder: the first convolution layer through the sixth convolution layer b form the encoder, and the fifth deconvolution layer through prediction layer one plus loss one form the decoder. The network is a suitably modified version of the classic ResNet. The ordinary convolutions in the encoder are replaced with dilated convolutions, which enlarge the receptive field without increasing the parameter computation, so that feature maps of the scene can be extracted better. When the encoder extracts features from an input image, the task is denoted by a flag t, a feature map of all 0s or all 1s (0 for the disparity estimation task, 1 for the semantic segmentation task); the extracted feature map, consisting of one or more matrices, then enters the decoder. The decoder fuses shallow local features with deep abstract features, using four long skip connections to enhance the prediction. The model outputs at four scales, namely (1, 1/2, 1/4, 1/8) of the input image, and the loss function is computed at all four scales.
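As a minimal illustration of this shared-backbone design, the following PyTorch sketch shows a dilated-convolution encoder and a decoder that receives the encoder features concatenated with the task flag t; the layer counts, channel widths, and head names are simplifying assumptions, not the exact patented network (which is a modified ResNet with four long skip connections and multi-scale outputs).

```python
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    """Sketch: one encoder/decoder shared by disparity and segmentation."""
    def __init__(self, num_classes=19):
        super().__init__()
        # a few dilated conv stages stand in for the modified ResNet encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # +1 input channel for the task flag concatenated onto the features
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128 + 1, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.disp_head = nn.Conv2d(16, 1, 3, padding=1)            # t = 0
        self.seg_head = nn.Conv2d(16, num_classes, 3, padding=1)   # t = 1

    def forward(self, image, t):
        feat = self.encoder(image)
        # flag bit t: an all-0 or all-1 map matching the feature map's size
        flag = torch.full_like(feat[:, :1], float(t))
        out = self.decoder(torch.cat([feat, flag], dim=1))
        if t == 0:
            return torch.sigmoid(self.disp_head(out))    # normalized disparity
        return torch.softmax(self.seg_head(out), dim=1)  # class probabilities
```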
The color left and right views acquired in step S300 come in the form of an image pair (a left view and a right view). The input order of the two views does not matter; by convention the left view is input first, and the predicted left disparity map obtained from the network allows the image reconstruction loss to be computed. The right view is handled analogously, and the two losses are added into a total loss function. Computing the loss in this way is precisely where the unsupervised nature of the method lies. Example: the left view is input and the network predicts a left disparity map; image reconstruction with the predicted left disparity map and the right view yields a reconstructed left view; comparing the reconstructed left view with the original input left view gives the loss. The right view is treated the same way. The network is thus trained to predict disparity maps, and the conversion between a disparity map and a depth map requires only a single formula, giving the desired depth map.
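The single conversion formula referred to here is the standard stereo relation depth = baseline × focal length / disparity; a minimal sketch, in which the KITTI-like baseline and focal length are placeholder values that must match the actual stereo rig:

```python
def disparity_to_depth(disparity, baseline_m=0.54, focal_px=721.0, eps=1e-6):
    """Convert a predicted disparity map (in pixels) to metric depth.

    baseline_m and focal_px are illustrative KITTI-style values, not
    parameters taken from the patent.
    """
    return baseline_m * focal_px / (disparity + eps)  # eps avoids divide-by-zero
```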
The main flow of the fully convolutional neural network is as follows. First, the color left view is input to the network; disparity is being predicted, so the flag bit t is set to 0, the feature map extracted by the encoder is concatenated with t and enters the decoder, the decoder is followed by avg. pooling to obtain the corresponding predicted disparity map, and image reconstruction is performed to compute the loss function. The same operation is performed on the color right view. The SGM algorithm — a traditional stereo matching algorithm with a fixed set of formulas — is then applied to the color left and right views to obtain SGM-computed left and right disparity maps. These actually contain sparse true disparity values, so they can be used to supervise the previously predicted disparity maps and yield a loss term. Next, the color left view is input again; semantics are now being predicted, so the flag bit t is set to 1, the encoder features concatenated with t enter the decoder, the decoder is followed by a softmax function to obtain the predicted semantic map, and the loss is computed under supervision with the corresponding label. The same operation is performed on the color right view. The losses computed by the two tasks are then combined into a total loss function, one iteration is completed, and the network is trained by backpropagation before moving to the next batch of data. The network input is always a color image, from one of two sources: 1) the left or right view of a binocular pair — for predicting disparity; 2) the left or right view of a binocular pair plus the corresponding semantic segmentation label — for predicting semantic segmentation. The flag bit t is a tensor of all 0s or all 1s with the same size as the feature map.
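A hedged sketch of one such training iteration — disparity passes on both views, SGM supervision, then segmentation passes. `model` is the shared encoder-decoder sketched above; `sgm` stands for any classical SGM implementation returning left and right disparity maps; `reconstruction_loss` and `sgm_loss` are sketched further below. All of these names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, left, right, label_l, label_r, sgm):
    # disparity passes (t = 0): one predicted disparity map per input view
    d_l = model(left, t=0)
    d_r = model(right, t=0)
    d_l_sgm, d_r_sgm = sgm(left, right)  # sparse "true" disparities from SGM
    loss = (reconstruction_loss(left, right, d_l, d_r)
            + sgm_loss(d_l, d_l_sgm) + sgm_loss(d_r, d_r_sgm))
    # segmentation passes (t = 1), supervised by the dataset labels;
    # the model outputs softmax probabilities, hence NLL on their log
    s_l = model(left, t=1)
    s_r = model(right, t=1)
    loss = loss + F.nll_loss(torch.log(s_l + 1e-8), label_l) \
                + F.nll_loss(torch.log(s_r + 1e-8), label_r)
    optimizer.zero_grad()
    loss.backward()  # one backpropagation step trains both tasks
    optimizer.step()
    return loss.item()
```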
Strictly, the image reconstruction is defined through the "image reconstruction loss", which belongs to the depth-error category. The reconstruction process: for example, a color left view is input and a left disparity map is obtained through network prediction; the predicted left disparity map and the color right view can then be used to reconstruct a left view. The image reconstruction loss is constructed between the color left view and the reconstructed left view, giving the following function:

$$L_{re} = |I_l - I_{r-l}| + |I_r - I_{l-r}|$$
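A minimal sketch of this reconstruction and of $L_{re}$ using bilinear warping via `torch.nn.functional.grid_sample`; the assumption that disparity is normalized by image width, and the sign of the horizontal shift, depend on the rectified stereo setup and are not specified by the patent.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp):
    """Reconstruct one view from the other using a predicted disparity map.

    src:  (B, 3, H, W) color image of the opposite view
    disp: (B, 1, H, W) disparity normalized by image width (assumption)
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=src.device),
                            torch.linspace(-1, 1, w, device=src.device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # shift sampling positions horizontally by the disparity (sign is setup-dependent)
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)
    return F.grid_sample(src, grid, align_corners=True)

def reconstruction_loss(left, right, d_l, d_r):
    # L_re = |I_l - I_{r->l}| + |I_r - I_{l->r}|
    i_rl = warp_with_disparity(right, d_l)  # reconstructed left view
    i_lr = warp_with_disparity(left, d_r)   # reconstructed right view
    return (left - i_rl).abs().mean() + (right - i_lr).abs().mean()
```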
For this embodiment, compared with previous work introducing semantic information into depth estimation — where the two tasks used two different networks trained iteratively — the present method shares one encoder and decoder between both tasks and distinguishes them during training with the flag bit t.
Secondly, the traditional stereo matching algorithm SGM is introduced to provide sparse true disparity values; although sparse, they improve disparity prediction performance. The network is again based on ResNet, but modified to better fit the task: ordinary convolutions are changed to dilated convolutions (with dilation rates 2, 4, and 8 in the third, fourth, and fifth convolution layers b, respectively), enlarging the receptive field without increasing parameter computation. To fuse deep abstract information with shallow local information, long skip connections are used between layers: the encoder reduces the image to 1/64 of its original size while extracting features; the decoder then recovers, by deconvolution, feature maps at 1/2, 1/4, and 1/8 of the original size; these feature maps (four scales in total, counting full resolution) are concatenated with the same-size feature maps from the encoder — and with the disparity map at that scale once one has been generated — and each finally passes through one convolution layer. The loss function is computed at multiple scales. In this way the network learns the detail in the feature maps better and the "hole" effect is reduced (i.e., the inaccurate "holes" that would otherwise appear in the predicted disparity map).
In another embodiment, step S300 further includes the following steps:
S301: passing the input color left and right views through the encoder of the fully convolutional neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: concatenating the feature map with the flag bit and inputting it to the decoder part of the fully convolutional neural network, and selecting whether to perform the semantic segmentation task or the disparity estimation task at the prediction-loss fusion layer according to the value of the flag bit.
For this embodiment, the reason the two tasks can share one encoder is inspired by the content-level consistency of semantic information and structural information of objects in an image. The same input color image is routed to the semantic segmentation task or the disparity estimation task by the flag bit t; when the loss function is computed, both the segmentation result and the disparity estimation result of the image enter the calculation, a loss is obtained from the loss function, and the network is trained by backpropagation. The loss function contains a term coupling semantics-guided disparity. The semantic segmentation task is supervised while the disparity estimation is unsupervised, so the depth estimation task itself remains unsupervised.
Existing work that uses semantic information to assist depth estimation often relies on iterative networks, i.e., two different networks for the two tasks, which leads to oversized networks that are hard to train. Here the two tasks share the same encoder and decoder, and training of the different tasks is controlled by the flag t; the theoretical basis is that semantic information and structural information of objects in an image are consistent in content. In addition, the traditional stereo matching algorithm SGM is introduced, which amounts to adding what can be understood as a weakly supervised prior in the absence of true depth values in the dataset, helping to improve performance.
The left and right predicted disparity maps in step S303 are the output of the decoder, obtained by concatenating the feature map with the flag bit and feeding it to the decoder of the fully convolutional network. The network essentially predicts disparity, but depth and disparity are related by the single formula depth = b × f / d (baseline times focal length over disparity). When the loss function stabilizes and approaches 0, training of the fully convolutional neural network is complete; the true depth values measured by lidar are then compared with the depth values inferred by the network to judge its effectiveness.
Because the network is essentially a multi-task model, semantic segmentation and disparity estimation share one network but differ in their loss functions. For disparity estimation, t is set to an all-0 feature map, concatenated with the feature map generated by the encoder, and fed to the decoder; since t is 0, the last layer performs the disparity estimation task. For semantic segmentation, t is set to an all-1 feature map, concatenated with the encoder feature map, fed to the decoder, and the semantic segmentation loss function is then used.
In another embodiment, step S303 further includes the following steps:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to compute disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps serve as a supervision signal for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input left and right views undergo feature extraction in the encoder, pass through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
In this embodiment, inputting a left view to directly predict both disparity maps $d_l$ and $d_r$ causes a mismatch between the predicted right disparity map $d_r$ and the true right image $I_r$: because the structure and texture information of the left and right views of a binocular camera differ, lacking the structure and texture of the right view makes it difficult to obtain a right disparity map from the left view $I_l$ alone. Therefore only the single disparity map corresponding to the input image is output, instead of outputting two predicted disparity maps from one view as before.
The disparity estimation result is obtained through the pooling layer; inputting one color image yields exactly one corresponding disparity map. The SGM-computed left and right disparity maps are obtained by applying the SGM algorithm to the color left and right views, and are then used to supervise the two disparity maps predicted by the network. In the existing approach, a color left view is input and both left and right disparity maps are predicted; however, since the input color left and right views genuinely differ, predicting both maps from one input view is inappropriate, so this method predicts from the input color left view only the corresponding left disparity map, and similarly for the right.
The SGM algorithm thus supplies sparse true disparity values from traditional stereo matching, and using them to supervise the predicted disparity maps improves network prediction performance. The semantic segmentation task merely guides the disparity estimation task during training, refining object edges and smoothing the result. The SGM disparity maps for the input views are computed in a separate step: SGM has a complete fixed computation procedure and does not pass through the fully convolutional network. The supervision signal in step S3031 refers to the left and right disparity maps computed from the color views by the SGM algorithm and is used only for the disparity estimation task. The supervision signal of the supervised training in S3032 refers only to the ground truth in the dataset. The disparity estimation task must be followed by the semantic segmentation task, because the result of the semantic segmentation task is needed to guide the disparity estimation task.
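Because SGM disparities are sparse, the supervision is naturally applied through a validity mask; a hedged sketch, under the assumption that pixels where SGM failed to match are encoded as 0:

```python
def sgm_loss(d_pred, d_sgm):
    """L1 loss against SGM disparities, evaluated only where SGM is valid.

    Assumption: invalid SGM pixels are stored as 0 and are excluded from
    the supervision signal.
    """
    valid = (d_sgm > 0).float()
    num_valid = valid.sum().clamp(min=1.0)  # avoid division by zero
    return ((d_pred - d_sgm).abs() * valid).sum() / num_valid
```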
In another embodiment, the encoder uses ResNet with dilated convolutions.
In another embodiment, the decoder portion employs four long skip connections.
In another embodiment, the output of the fully convolutional neural network has four scales, namely (1, 1/2, 1/4, 1/8) of the input image, on which the loss function calculation is performed.
In another embodiment, the flag bit is a feature map of all 0 s or all 1 s.
In another embodiment, the result of the semantic segmentation task is obtained by a Softmax function.
In another embodiment, the result of the disparity estimation task is obtained by an avg. pooling layer.
In another embodiment, the loss function L is composed of five sub-functions: a depth error, a semantic segmentation error, a left-right semantic consistency error, a semantics-guided disparity smoothing error, and an SGM algorithm error.
For this embodiment, the loss function is constructed at the four scales; in short, the loss function L is computed on four different scales. Total loss function:

$$L = L_{depth} + \alpha_{seg} L_{seg} + \alpha_{lrsc} L_{lrsc} + \alpha_{smooth} L_{smooth} + \alpha_{sgm} L_{sgm}$$

It is composed of five sub-functions: $L_{depth}$ is the depth error, $L_{seg}$ the semantic segmentation error, $L_{lrsc}$ the left-right semantic consistency error, $L_{smooth}$ the semantics-guided disparity smoothing error, and $L_{sgm}$ the SGM algorithm error; $\alpha_{seg}$, $\alpha_{lrsc}$, $\alpha_{smooth}$, and $\alpha_{sgm}$ are the corresponding weight coefficients.
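Assembling the total loss is then a direct weighted sum; a sketch, with each sub-loss assumed to be computed as described (and, per the text, accumulated over the four output scales). The default weights mirror the values reported later in this text and are otherwise tunable hyperparameters.

```python
def total_loss(losses, a_seg=0.1, a_lrsc=0.2, a_smooth=2.0, a_sgm=0.1):
    """Weighted sum of the five sub-losses.

    losses: dict with keys 'depth', 'seg', 'lrsc', 'smooth', 'sgm',
    each holding an already-computed scalar loss tensor.
    """
    return (losses['depth']
            + a_seg * losses['seg']
            + a_lrsc * losses['lrsc']
            + a_smooth * losses['smooth']
            + a_sgm * losses['sgm'])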
(1) Depth error: the depth error combines the image reconstruction loss, a left-right disparity consistency loss, and a disparity smoothing loss. The image reconstruction loss is

$$L_{re} = |I_l - I_{r-l}| + |I_r - I_{l-r}|,$$

where $I_l$ is the color left view and $I_{r-l}$ is the reconstructed left view obtained from the color right view and the predicted left disparity map; $I_r$ and $I_{l-r}$ are defined analogously.

$\alpha_{lr}$ and $\alpha_{ds}$ denote the weight of the left-right disparity consistency loss and the weight of disparity smoothing, respectively. $d_l$ is the predicted left disparity map and $d_{r-l}$ is the predicted right disparity map warped into a left disparity map; $d_r$ and $d_{l-r}$ are defined analogously. $\partial_x$ and $\partial_y$ denote the image gradients in the x and y directions.
(2) Semantic segmentation error: the cross-entropy function.
(3) Left-right semantic consistency error:

$$L_{lrsc} = |s_l - s_{r-l}| + |s_r - s_{l-r}|,$$

where $s_l$ is the predicted left semantic map and $s_{r-l}$ is obtained by reconstructing the predicted right semantic map $s_r$ with the predicted left disparity map $d_l$; $s_r$ and $s_{l-r}$ are defined analogously.
(4) Semantics-guided disparity smoothing error: here $d$ denotes a predicted disparity map and $s$ a semantic map; $\odot$ denotes element-wise multiplication of corresponding vector components; $\psi$ is the operation that sets the maximum value of each channel to 1 and the remaining values to 0; $f_{\rightarrow}$ is the operation of shifting the input by one pixel along the horizontal axis.
(5) SGM algorithm error: the disparity map computed by the SGM algorithm on the original images is compared with the predicted disparity map. Throughout the above, $d$ denotes a predicted disparity map, $I$ an original image, and $s$ a semantic segmentation map.
In another embodiment, the network is trained on the binocular images of the KITTI dataset, and the Cityscapes semantic segmentation dataset is also used. The network is implemented in the PyTorch deep learning framework and runs on an NVIDIA GTX 1080Ti graphics card. During training, the input images are resized to 256x512 resolution, and data augmentation is performed to avoid overfitting: three numbers are sampled from uniform distributions over the ranges [0.8, 1.2], [0.5, 2.0], and [0.8, 1.2] and applied as gamma, brightness, and color shifts. The optimizer is Adam with initial learning rate $\lambda = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-5}$. The weights of the different objective terms are $\alpha_{lr} = 0.2$, $\alpha_{ds} = 0.02$, $\alpha_{seg} = 0.1$, $\alpha_{lrsc} = 0.2$, $\alpha_{smooth} = 2.0$, $\alpha_{sgm} = 0.1$. The encoder is pre-trained on the ImageNet dataset before training.
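A sketch of the optimizer setup and augmentation sampling matching the hyperparameters above; mapping the three sampling ranges onto gamma, brightness, and per-channel color factors is our reading of the original and an assumption, as is reusing the `SharedEncoderDecoder` sketched earlier.

```python
import random
import torch

model = SharedEncoderDecoder()  # from the architecture sketch above (assumption)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999), eps=1e-5)

def sample_augmentation():
    # Uniformly sampled photometric factors; the role of each range is assumed.
    gamma = random.uniform(0.8, 1.2)
    brightness = random.uniform(0.5, 2.0)
    color = torch.empty(3).uniform_(0.8, 1.2)  # one factor per RGB channel
    return gamma, brightness, color
```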
Note that the network selects the task via the flag t, so when an input image undergoes forward inference, setting t to 0 yields a disparity map and setting t to 1 yields a semantic segmentation map.
In another embodiment, the method is run on the Cityscapes dataset, with the results shown in fig. 3. The first row of fig. 3 shows the original color images, the second row the semantic segmentation results for the corresponding inputs, and the last row the depth estimation results. Comparing the results of the two tasks shows that the disparity estimation task benefits from the semantic segmentation task and that both perform very well; in particular, semantic segmentation visibly refines object edges in the disparity estimation. At the same time, the content-level consistency of the structural and semantic information of objects in the scene is verified.
Although embodiments of the present invention have been described above with reference to the accompanying drawings, the invention is not limited to the specific embodiments and application fields described, which are merely illustrative and not restrictive. Those skilled in the art, having the benefit of this disclosure, may devise numerous other forms of the invention without departing from the scope of the claims.

Claims (8)

1. A semantic segmentation-assisted binocular vision unsupervised depth estimation method, the method comprising:
S100: constructing a fully convolutional neural network for predicting a depth map, wherein the network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: photographing with a device equipped with a binocular camera to obtain a left view and a right view;
S300: inputting the obtained color left and right views into the fully convolutional neural network, and substituting the left and right predicted disparity maps output by the network into a loss function to compute the loss and train the network;
S400: inputting a single prepared color image into the trained fully convolutional neural network, outputting a predicted disparity map, and from it obtaining the predicted depth map;
the step S300 further comprising:
S301: passing the input color left and right views through the encoder of the fully convolutional neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: concatenating the feature map with the flag bit and inputting it to the decoder part of the fully convolutional neural network, and selecting whether to perform the semantic segmentation task or the disparity estimation task at the prediction-loss fusion layer according to the value of the flag bit;
the step S303 further comprising:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, applying the traditional SGM algorithm to the input color left and right views to compute disparity maps corresponding to the input views, providing sparse true disparity values, the computed left and right disparity maps serving as a supervision signal for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input left and right views undergo feature extraction in the encoder, pass through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
2. The method of claim 1, wherein the encoder uses ResNet with dilated convolutions.
3. The method of claim 1, wherein the decoder portion employs four long skip connections.
4. The method of claim 1, wherein the output of the fully convolutional neural network has four scales, namely (1, 1/2, 1/4, 1/8) of the input image, on which the loss function calculation is performed.
5. The method of claim 1, wherein the loss function consists of five sub-functions, namely a depth error, a semantic segmentation error, a left-right semantic consistency error, a semantics-guided disparity smoothing error, and an SGM algorithm error.
6. The method of claim 1, wherein the flag bit is a feature map of all 0 s or all 1 s.
7. The method of claim 1, the result of the semantic segmentation task is obtained by a Softmax function.
8. The method of claim 1, wherein the result of the disparity estimation task is obtained by an avg. pooling layer.
CN202110329765.XA 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method Active CN113096176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Publications (2)

Publication Number Publication Date
CN113096176A CN113096176A (en) 2021-07-09
CN113096176B (en) 2024-04-05

Family

ID=76670490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110329765.XA Active CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Country Status (1)

Country Link
CN (1) CN113096176B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882091B (en) * 2022-04-29 2024-02-13 中国科学院上海微系统与信息技术研究所 Depth estimation method combining semantic edges

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xinsheng; Zhang Guiling. Monocular depth estimation based on convolutional neural networks. Computer Engineering and Applications, 2020, (13), full text. *

Also Published As

Publication number Publication date
CN113096176A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN109919887B (en) Unsupervised image fusion method based on deep learning
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN113762358B (en) Semi-supervised learning three-dimensional reconstruction method based on relative depth training
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN110610486A (en) Monocular image depth estimation method and device
Maslov et al. Online supervised attention-based recurrent depth estimation from monocular video
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
Yue et al. Semi-supervised monocular depth estimation based on semantic supervision
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN113096176B (en) Semantic segmentation-assisted binocular vision unsupervised depth estimation method
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN115830094A (en) Unsupervised stereo matching method
CN116402874A (en) Spacecraft depth complementing method based on time sequence optical image and laser radar data
Yang et al. Unsupervised deep learning of depth, ego-motion, and optical flow from stereo images
Shi et al. Improved event-based dense depth estimation via optical flow compensation
Ji et al. RDRF-Net: A pyramid architecture network with residual-based dynamic receptive fields for unsupervised depth estimation
CN114693744A (en) Optical flow unsupervised estimation method based on improved cycle generation countermeasure network
CN113379715A (en) Underwater image enhancement and data set true value image acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant