CN113096176A - Semantic segmentation assisted binocular vision unsupervised depth estimation method - Google Patents

Semantic segmentation assisted binocular vision unsupervised depth estimation method

Info

Publication number
CN113096176A
CN113096176A (application CN202110329765.XA)
Authority
CN
China
Prior art keywords
neural network
convolution neural
semantic segmentation
full convolution
disparity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110329765.XA
Other languages
Chinese (zh)
Other versions
CN113096176B (en)
Inventor
任鹏举
李凌阁
丁焱
景鑫
赵文哲
夏天
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202110329765.XA priority Critical patent/CN113096176B/en
Publication of CN113096176A publication Critical patent/CN113096176A/en
Application granted granted Critical
Publication of CN113096176B publication Critical patent/CN113096176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A semantic segmentation assisted binocular vision unsupervised depth estimation method comprises the following steps: S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder; S200: capturing a left view and a right view using a device equipped with a binocular camera; S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network; S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.

Description

Semantic segmentation assisted binocular vision unsupervised depth estimation method
Technical Field
The present disclosure belongs to the field of computer vision and computer graphics, and particularly relates to a semantic segmentation assisted binocular vision unsupervised depth estimation method.
Background
Depth estimation is an important research topic in the field of computer vision, with significant applications in autonomous driving, AR/VR, three-dimensional reconstruction, and object grasping. Although several kinds of hardware and methods can acquire scene depth information today, each has its own drawbacks. For example, triangulation from monocular camera images and disparity computation from binocular camera images tend to be computationally expensive, have limited accuracy, and depend strongly on the texture information of the scene. Lidar is expensive, and relying on a single sensor is itself a risk. Structured-light depth sensors suffer from a limited measurement range and perform poorly in outdoor environments. With the development of deep learning, the industry has turned its attention to visual depth estimation methods based on deep learning. In the prior art, supervised learning has been applied to depth estimation with good results; however, supervised methods are strongly limited by their need for labels and generalize poorly. Research on unsupervised depth estimation methods has therefore become increasingly popular.
Semantic segmentation may be defined as the process of creating a mask over an image in which pixels are assigned to a predefined set of semantic classes. The segmentation may be binary (e.g., "person" vs. "non-person" pixels) or multi-class (e.g., pixels labeled "person", "car", "building", etc.). As semantic segmentation techniques grow in accuracy and adoption, it becomes increasingly important to develop methods that exploit such segmentations and integrate the segmentation information into existing computer vision applications such as depth or disparity estimation. Today, the most accurate semantic segmentation methods are based on deep learning.
At present, depth estimation methods based on unsupervised deep learning fall into two types, those based on monocular consecutive frames and those based on binocular image pairs, the latter being the more common and practical. Methods based on monocular consecutive frames use two networks: a depth prediction network that predicts the depth map, and a pose estimation network that estimates the camera pose. An adjacent frame is then reconstructed from the predicted depth map and pose, and the reconstruction is compared with the original adjacent frame to compute a loss function and train the networks. Methods based on binocular image pairs first input the left view of a binocular image pair into a full convolutional network to obtain predicted left and right disparity maps, then reconstruct left and right views from the predicted disparity maps and the original left and right views, and finally compare the original views with the reconstructed views to compute the loss function that trains the network. Among these, a network trained on binocular images can estimate absolute depth directly, with accuracy that is high and sometimes even better than supervised methods, whereas monocular data sets contain no absolute scale information, so the predicted depth is only relative. The unsupervised depth estimation method based on binocular image pairs is therefore the focus here.
Existing environmental depth estimation most often relies on lidar, but lidar is expensive, does not provide dense depth maps, and over-reliance on a single sensor is itself a risk. A binocular camera can obtain depth directly by computing disparity, but the computation is expensive and accuracy drops when the image lacks texture. Depth estimation based on deep learning can provide dense depth maps, adapts well to different scenes, and reduces cost. Previous methods that used semantic segmentation to assist depth estimation generally combined depth and semantic information through iterative networks, which makes the network cumbersome and leaves it, in the forward inference stage, without perception of the structural information of the other view, degrading the final disparity prediction.
Disclosure of Invention
In order to solve the above problem, the present disclosure provides a binocular vision unsupervised depth estimation method assisted by semantic segmentation, including:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
Through this technical scheme, binocular images are combined with scene depth estimation and semantics, and the disparity computed by the traditional disparity calculation method SGM is added as a supervision signal. In this method the semantic segmentation task is supervised learning, and the disparity map computed by the SGM algorithm serves as one of the supervision signals for the network's disparity estimation; however, no depth information from lidar is used. Instead, a reconstructed image is formed from the predicted disparity map and an original view, and the reconstruction is compared with the original to construct the loss function, so with respect to depth estimation the method is an unsupervised learning task. Assuming the input left view yields a corresponding predicted left disparity map, the original right view is then warped with this predicted left disparity map to obtain a reconstructed left view; inputting the right view yields the reconstructed right view in the same way.
The invention has the following advantages:
1) The conventional multi-task learning network is replaced by an encoder-decoder network shared by the semantic segmentation task and the disparity estimation task. The underlying rationale is that the semantic information and the structural information of objects in an image are consistent in content, so the two can reinforce each other during training, improving accuracy while simplifying the scale of the network.
2) In previous methods, a left view is input and both the left and right disparity maps d_l and d_r are predicted directly. However, this causes a mismatch between the predicted right disparity map d_r and the true right image I_r: because the structure and texture information of the left and right views of a binocular camera differ, it is difficult to obtain a right disparity map from the left view I_l alone. Therefore the network outputs only the single disparity map corresponding to the input image, instead of outputting two predicted disparity maps from one view as in the prior art.
3) In the disparity estimation stage, the traditional stereo matching algorithm SGM is introduced to compute a disparity map for the current input image; the computed disparity map adds prior information to the predicted disparity map, which aids network convergence and improves prediction accuracy.
Drawings
Fig. 1 is a flow chart of a semantic segmentation assisted binocular vision unsupervised depth estimation method provided in an embodiment of the present disclosure;
Fig. 2 is a structural diagram of the full convolution neural network provided in one embodiment of the present disclosure;
fig. 3 is a graph of experimental results provided in one embodiment of the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 3.
In one embodiment, referring to fig. 1, a semantic segmentation assisted binocular vision unsupervised depth estimation method is disclosed, the method comprising:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
The overall structure of the full convolution neural network is shown in fig. 2 and comprises an encoder and a decoder: the layers from convolution layer 1 through convolution layer 6b form the encoder, and the layers from deconvolution layer 5 through prediction layer i, plus the loss, form the decoder. The network is adapted from the classical ResNet. The ordinary convolutions in the encoder are replaced by dilated convolutions, which enlarge the receptive field as much as possible without increasing the parameter count, so that feature maps of the scene are better extracted. After the encoder extracts the features of the input image, the task is represented by t, a feature map filled entirely with 0s or 1s according to the task (0 for the disparity estimation task, 1 for the semantic segmentation task); a feature map here consists of one or more matrices. The extracted features, concatenated with t, then enter the decoder. The decoder fuses shallow local features with deep abstract features and adopts four long skip connections to enhance the prediction. The model outputs at four scales relative to the input image (1, 1/2, 1/4, 1/8), and the loss function is computed at all four scales.
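As an illustration, the following minimal PyTorch sketch shows how a single encoder-decoder can serve both tasks through the flag tensor t; the class name, channel sizes, dilation rates and layer count are assumptions for illustration, not the patent's exact ResNet-based configuration.

```python
# Minimal sketch, assuming illustrative layer sizes (not the patent's exact
# ResNet-based layout): one encoder-decoder serves both tasks, with an
# all-0 (disparity) or all-1 (segmentation) flag map t concatenated to the
# encoder features before decoding.
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    def __init__(self, feat_ch=64, num_classes=19):
        super().__init__()
        # Dilated convolutions enlarge the receptive field without
        # increasing the parameter count (rates here are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )
        # The decoder receives one extra channel: the task flag map t.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + 1, feat_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, num_classes, 3, padding=1),
        )

    def forward(self, x, task):
        """task = 0: disparity estimation; task = 1: semantic segmentation."""
        f = self.encoder(x)
        t = torch.full_like(f[:, :1], float(task))   # all-0 or all-1 flag map
        y = self.decoder(torch.cat([f, t], dim=1))
        if task == 0:
            return torch.sigmoid(y[:, :1])           # single normalized disparity map
        return y                                     # class logits; softmax in the loss
```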
The color left and right views input in step S300 take the form of an image pair (i.e., a left view and a right view). The left-then-right input order simply follows common practice: the left view is input first, the predicted left disparity map is obtained from the neural network, and the image reconstruction loss is computed; the same is done for the right view, and both losses are added into the total loss function. This loss computation is where the unsupervised nature of the method shows itself. For example: the left view is input and the network predicts the left disparity map; through image reconstruction, the predicted left disparity map and the original right view yield a reconstructed left view, which is compared with the original input left view to compute the loss. The same applies to the right view. In this way the network is trained and gains the ability to predict disparity maps. Converting between a disparity map and a depth map takes only a single formula, which yields the desired depth map.
The main flow of the full convolution neural network is as follows. A color left view is input to the network; when disparity is to be predicted, the flag bit t is set to 0, the feature map extracted by the encoder is concatenated with t, and the result enters the decoder. Since t = 0, the decoder is followed by average pooling and outputs the predicted left disparity map. The same operation is performed on the color right view. At this point the SGM algorithm, a traditional stereo matching algorithm with a fixed set of formulas, is applied to the color left and right views to compute left and right disparity maps. These computed disparity maps contain sparse true disparity values, which are used to supervise the disparity maps predicted by the network, yielding another loss term. Next, a color left view is input again; when semantics are to be predicted, the flag bit t is set to 1, the encoder feature map is concatenated with t and enters the decoder, and since t = 1 the decoder is followed by a softmax function to produce the predicted semantic map, whose supervised loss is computed against the corresponding label. The same operation is performed on the color right view. The loss functions of the two tasks are then combined into an overall loss function, and one iteration of backpropagation trains the network; the next batch of data follows, and so on, until the network is trained. The input to the network is always a color image, from one of two sources: 1) the left or right view of a binocular pair, for predicting disparity; 2) the left or right view of a binocular pair together with the corresponding semantic segmentation label, for predicting semantic segmentation. The flag bit t is a tensor of the same size as the feature map, all 0s or all 1s.
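Under the assumptions of the sketch above, one training iteration might look as follows; all names are hypothetical, and the loss terms stand in for the functions defined later in this description:

```python
# Illustrative training iteration (hypothetical names); sgm_dl/sgm_dr are
# disparities precomputed by the classical SGM algorithm, and label_l/label_r
# are semantic segmentation labels.
def train_step(net, optimizer, I_l, I_r, label_l, label_r, sgm_dl, sgm_dr, losses):
    d_l = net(I_l, task=0)            # flag t = 0: predicted left disparity
    d_r = net(I_r, task=0)            # flag t = 0: predicted right disparity
    s_l = net(I_l, task=1)            # flag t = 1: left semantic logits
    s_r = net(I_r, task=1)            # flag t = 1: right semantic logits
    loss = (losses.depth(I_l, I_r, d_l, d_r)
            + losses.seg(s_l, label_l) + losses.seg(s_r, label_r)
            + losses.lrsc(s_l, s_r, d_l, d_r)
            + losses.smooth(d_l, s_l) + losses.smooth(d_r, s_r)
            + losses.sgm(d_l, sgm_dl) + losses.sgm(d_r, sgm_dr))
    optimizer.zero_grad()
    loss.backward()                   # one backpropagation pass per iteration
    optimizer.step()
    return loss.item()
```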
Strictly, image reconstruction enters the loss as the "image reconstruction loss", which belongs to the category of depth error. The reconstruction process is as follows: for example, a color left view is input and the network predicts the left disparity map; the predicted left disparity map and the color right view are then used to reconstruct a left view. The image reconstruction loss is constructed from the original color left view and the reconstructed left view (and symmetrically for the right), giving the following function:
L_re = |I_l - I_(r→l)| + |I_r - I_(l→r)|
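The patent does not spell out the warping implementation; a common differentiable realization in unsupervised stereo work (a sketch under that assumption, with hypothetical function names) samples the right view at horizontally disparity-shifted coordinates:

```python
# Hedged sketch: differentiable view reconstruction by horizontal warping,
# as commonly used in unsupervised stereo depth estimation.
import torch
import torch.nn.functional as F

def reconstruct_left(right_img, left_disp):
    """Synthesize the left view by sampling the right view at x - d(x).
    right_img: (B,3,H,W); left_disp: (B,1,H,W), in pixels."""
    b, _, h, w = right_img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=right_img.dtype, device=right_img.device),
        torch.arange(w, dtype=right_img.dtype, device=right_img.device),
        indexing="ij")
    x_src = xs.unsqueeze(0) - left_disp[:, 0]                 # disparity shift
    grid = torch.stack(
        [2 * x_src / (w - 1) - 1,                             # normalize x to [-1, 1]
         (2 * ys / (h - 1) - 1).unsqueeze(0).expand_as(x_src)],
        dim=-1)
    return F.grid_sample(right_img, grid, padding_mode="border",
                         align_corners=True)

def image_reconstruction_loss(I_l, I_r, I_rl, I_lr):
    """L_re = |I_l - I_(r->l)| + |I_r - I_(l->r)|."""
    return (I_l - I_rl).abs().mean() + (I_r - I_lr).abs().mean()
```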
for this embodiment, first, compared to the previous work of introducing semantic information into depth estimation, they are two different networks for two tasks, and iterative training is performed, and the method is to share one encoder and one decoder for the two tasks, and to use the flag bit t to distinguish the tasks in training.
Secondly, the traditional stereo matching algorithm SGM is introduced to provide sparse true disparity values; although sparse, they still improve disparity prediction. Thirdly, the network, while based on ResNet, is modified to better suit the task. For example, ordinary convolutions are replaced by dilated convolutions (convolution layers 3b, 4b and 5b use dilation rates of 2, 4 and 8, respectively), increasing the receptive field without increasing the parameter count. To fuse deep abstract information with shallow local information, long skip connections fuse information between layers: the encoder reduces the image to 1/64 of its size while extracting features; the decoder then uses deconvolution to obtain feature maps at the original size and at 1/2, 1/4 and 1/8 of it; each of these four feature maps is concatenated with the same-size feature map from the encoder (and, where a disparity map has been generated, that disparity map is added to the concatenation); finally each passes through one convolution layer. The loss function is computed at multiple scales. In this way the network learns the detailed information in the feature maps better and reduces the "holes" effect (inaccurate "holes" appearing in the predicted disparity map). One decoder stage is sketched below.
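The following sketch shows one decoder stage with such a long skip connection; the channel sizes and per-stage structure are assumptions for illustration:

```python
# Hedged sketch of one decoder stage: upsample, concatenate the same-size
# encoder feature map (long skip connection) and the upsampled lower-scale
# disparity, then fuse and predict disparity at this scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoderStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(out_ch + skip_ch + 1, out_ch, 3, padding=1)
        self.disp = nn.Conv2d(out_ch, 1, 3, padding=1)

    def forward(self, x, skip, prev_disp):
        x = F.relu(self.up(x))
        d_up = F.interpolate(prev_disp, size=x.shape[-2:],
                             mode="bilinear", align_corners=True)
        x = F.relu(self.fuse(torch.cat([x, skip, d_up], dim=1)))
        return x, self.disp(x)   # features for the next stage + this scale's disparity
```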
In another embodiment, the step S300 further comprises the steps of:
S301: the input color left and right views pass through the encoder part of the full convolution neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: after concatenation with the flag bit, the feature map is input to the decoder part of the full convolution neural network, and the semantic segmentation task or the disparity estimation task is selected according to the value of the flag bit in the prediction-loss fusion layer.
For this embodiment, the reason the two tasks can share one encoder is inspired by the consistency in content between the semantic information and the structural information of objects in an image. For the same input color image, the flag bit t distinguishes the semantic segmentation task from the disparity estimation task. When the loss function is computed, the segmentation result and the disparity estimation result of the image are both evaluated, the loss is computed according to the loss function, and the network is then trained by backpropagation. The loss function contains a term in which semantics guide the disparity. The semantic segmentation task is supervised while the disparity estimation is unsupervised, so the method remains unsupervised with respect to the depth estimation task.
Existing work that uses semantic information to assist depth estimation usually adopts iterative networks, i.e., two different networks for the two tasks, which makes the network oversized and hard to train. Here the two tasks share the same encoder and decoder, and the different tasks are trained via the flag t. The rationale is that the semantic information and the structural information of objects in an image are consistent in content. In addition, the traditional stereo matching algorithm SGM is introduced; even without true depth values for the data set, this adds what may be understood as a weakly supervised prior and improves performance.
The left and right predicted disparity maps in step S303 are obtained at the output of the decoder after the feature map, concatenated with the flag bit, is input to the decoder part of the full convolution neural network. The full convolution neural network essentially predicts disparity, but depth and disparity are related by the simple formula Depth = b × f / d, where b is the stereo baseline and f the focal length. When the loss function stabilizes and approaches 0, the full convolution neural network is trained; the true depth values computed from real lidar are then compared with the depth values inferred by the full convolution neural network to judge its effectiveness.
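A sketch of this conversion (the clamp that guards against zero disparity is an assumption):

```python
# Sketch of the disparity-to-depth conversion Depth = b * f / d, where b is
# the stereo baseline and f the focal length in pixels; the clamp is an
# assumption to avoid division by zero.
import torch

def disparity_to_depth(disp: torch.Tensor, baseline: float, focal: float,
                       eps: float = 1e-6) -> torch.Tensor:
    return baseline * focal / disp.clamp(min=eps)
```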
Since this network is essentially a multi-task model, semantic segmentation and disparity estimation share one network, but their loss functions differ. When performing disparity estimation, t is set to an all-0 feature map, concatenated with the feature map generated by the encoder, and passed to the decoder; since t = 0, the last layer performs the disparity estimation task. For the semantic segmentation task, t is set to an all-1 feature map, concatenated with the encoder's feature map, and passed to the decoder, after which the semantic segmentation loss function is used.
In another embodiment, the step S303 further comprises the steps of:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to obtain computed disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps then serve as supervision signals for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input color left and right views pass through the encoder for feature extraction and then through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
For this embodiment: inputting a left view and directly predicting both disparity maps d_l and d_r leads to a mismatch between the predicted right disparity map d_r and the true right image I_r. Because the structure and texture information of the left and right views of a binocular camera differ, it is difficult to obtain a right disparity map from the left view I_l alone. Therefore only the single disparity map corresponding to the input image is output, instead of outputting two predicted disparity maps from one view as before.
The result of the disparity estimation is obtained through the pooling layer. Inputting one color image yields exactly one corresponding disparity map. After the color left and right views obtain their corresponding predicted left and right disparity maps from the network, the SGM algorithm is applied to the color left and right views to obtain the computed left and right disparity maps, which are then used to supervise the two network-predicted disparity maps. Each computed disparity map is a single map. In conventional methods, one input color left view yields both left and right disparity maps; but since the input color left and right views differ in nature, predicting both disparity maps from one input view is inappropriate, so this method predicts the corresponding left disparity map from the input color left view, and likewise for the right view.
The SGM algorithm is used here to obtain sparse true disparity values through a traditional stereo matching algorithm; using them to supervise the predicted disparity map improves prediction performance. The semantic segmentation task only guides the disparity estimation task, refining object edges and smoothing the result during training. This processing is implemented separately with the traditional SGM algorithm, which computes the disparity map for the input views through its own complete computation procedure, without passing through the full convolution neural network. The supervision signal in step S3031 is obtained by applying the SGM algorithm to the left and right color views and is used only for the disparity estimation task; the supervision signal in S3032 refers only to the ground truth in the data set. After the disparity estimation task finishes, the network switches to the semantic segmentation task, because the disparity estimation task needs to be guided by the result of the semantic segmentation task. An illustrative stand-in for the SGM step is sketched below.
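As an illustration only, OpenCV's semi-global block matcher can play the role of the classical SGM step; the matcher parameters below are hypothetical, not the patent's configuration:

```python
# Hedged sketch: OpenCV's semi-global block matcher as a stand-in for the
# classical SGM step; invalid pixels are masked out so the resulting
# supervision is sparse.
import cv2
import numpy as np
import torch

def sgm_disparity(left_gray: np.ndarray, right_gray: np.ndarray) -> torch.Tensor:
    """left_gray/right_gray: rectified 8-bit grayscale images."""
    block = 5
    sgm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                blockSize=block,
                                P1=8 * block * block, P2=32 * block * block,
                                uniquenessRatio=10, speckleWindowSize=100,
                                speckleRange=2)
    disp = sgm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disp[disp <= 0] = 0.0                       # mark invalid matches
    return torch.from_numpy(disp)

def sgm_supervision_loss(pred_disp: torch.Tensor, sgm_disp: torch.Tensor) -> torch.Tensor:
    """L1 loss only on pixels where SGM produced a valid disparity."""
    mask = sgm_disp > 0
    if not mask.any():
        return pred_disp.new_zeros(())
    return (pred_disp[mask] - sgm_disp[mask]).abs().mean()
```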
In another embodiment, the encoder uses ResNet with dilated convolutions.
In another embodiment, the decoder part adopts four long skip connections.
In another embodiment, the output of the full convolution neural network has four scales, corresponding to 1, 1/2, 1/4 and 1/8 of the input image, and the loss function is computed at all four scales.
In another embodiment, the flag bit is an all-0 or all-1 feature map.
In another embodiment, the result of the semantic segmentation task is obtained by a Softmax function.
In another embodiment, the result of the disparity estimation task is obtained by average pooling.
In another embodiment, the loss function L is composed of five sub-functions, which are depth error, semantic segmentation error, left-right semantic consistency error, semantic-guided disparity smoothing error, and error of SGM algorithm, respectively.
For the purposes of this embodiment, the loss function is constructed at four scales; in brief, the loss function L is built at four different scales. The overall loss function is:
L = L_depth + α_seg·L_seg + α_lrsc·L_lrsc + α_smooth·L_smooth + α_sgm·L_sgm

It is composed of five sub-functions: L_depth is the depth error, L_seg the semantic segmentation error, L_lrsc the left-right semantic consistency error, L_smooth the semantic-guided disparity smoothing error, and L_sgm the error of the SGM algorithm; α_seg, α_lrsc, α_smooth and α_sgm are the corresponding weight coefficients.
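A sketch of how the five sub-losses combine; the weight values follow the embodiment described later (α_seg = 0.1, α_lrsc = 0.2, α_smooth = 2.0, α_sgm = 0.1), and each sub-loss is assumed to already be summed over the four scales:

```python
# Sketch of the overall objective; the default weights follow the training
# configuration stated later in this description.
def total_loss(L_depth, L_seg, L_lrsc, L_smooth, L_sgm,
               a_seg=0.1, a_lrsc=0.2, a_smooth=2.0, a_sgm=0.1):
    return (L_depth + a_seg * L_seg + a_lrsc * L_lrsc
            + a_smooth * L_smooth + a_sgm * L_sgm)
```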
(1) Depth error:
L_depth = L_re + α_lr·L_lr + α_ds·L_ds

where L_re is the image reconstruction loss, L_re = |I_l - I_(r→l)| + |I_r - I_(l→r)|; I_l is the color left view and I_(r→l) is the reconstructed left view obtained from the color right view and the predicted left disparity map (I_r and I_(l→r) are analogous). α_lr and α_ds are the weights of the left-right disparity consistency loss and the disparity smoothing loss, respectively; d_l is the predicted left disparity map and d_(r→l) is the predicted right disparity map warped into a left disparity map (d_r and d_(l→r) analogous), so that L_lr = |d_l - d_(r→l)| + |d_r - d_(l→r)|. ∂_x and ∂_y denote the gradients of the image in the x and y directions, and L_ds is an edge-aware smoothness term, L_ds = |∂_x d|·e^(-|∂_x I|) + |∂_y d|·e^(-|∂_y I|).
(2) Semantic segmentation error: the cross-entropy function.
(3) Left and right semantic consistency errors:
L_lrsc = |s_l - s_(r→l)| + |s_r - s_(l→r)|, where s_l is the predicted left semantic map and s_(r→l) is obtained by reconstructing the predicted right semantic map s_r with the predicted left disparity map d_l (s_r and s_(l→r) are analogous).
(4) Semantic-guided disparity smoothing error:
L_smooth = |∂_x d| ⊙ ψ(s) ⊙ f_→(ψ(s)), summed over the semantic channels,

where d denotes a predicted disparity map and s a semantic map; ⊙ denotes element-wise multiplication (corresponding components are multiplied); ψ is an operation that sets the maximum value of each channel to 1 and the remaining values to 0; and f_→ is an operation that translates its input by one pixel along the horizontal axis.
(5) Error of the SGM algorithm: the disparity map of the original image is computed by the SGM algorithm and compared with the predicted disparity map.
Throughout, d denotes a predicted disparity map, I an original image, and s a semantic segmentation map.
In another embodiment, the network is trained using the binocular images of the KITTI data set, and the Cityscapes semantic segmentation data set is also used. The network is implemented in the PyTorch deep learning framework and runs on an NVIDIA GTX 1080Ti graphics card. During training, input images are resized to a resolution of 256×512, and data augmentation is performed to avoid overfitting: more specifically, three numbers are sampled from uniform distributions over the ranges [0.8, 1.2], [0.5, 2.0] and [0.8, 1.2] and applied as photometric shifts, including gamma offsets. The optimizer is Adam with initial learning rate λ = 1e-4, β1 = 0.9, β2 = 0.999 and ε = 1e-5. The weights of the different objective terms are α_lr = 0.2, α_ds = 0.02, α_seg = 0.1, α_lrsc = 0.2, α_smooth = 2.0 and α_sgm = 0.1. The encoder part is pre-trained on the ImageNet data set before training.
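For concreteness, the stated optimizer settings map to PyTorch as below (a sketch; net is the full convolution neural network, and the assignment of the three sampled ranges to gamma, brightness and color shifts is an assumption):

```python
# Sketch of the training configuration stated above; the mapping of the
# three sampled ranges to gamma/brightness/color is an assumption.
import random
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-5)

def sample_augmentation():
    gamma      = random.uniform(0.8, 1.2)   # gamma shift
    brightness = random.uniform(0.5, 2.0)   # brightness scale (assumed)
    color      = random.uniform(0.8, 1.2)   # per-channel color scale (assumed)
    return gamma, brightness, color
```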
Note that the network selects its task through the flag t: during forward inference, setting t = 0 yields a disparity map and setting t = 1 yields a semantic segmentation map.
In another embodiment, the method is run on the Cityscapes data set, with results as shown in fig. 3. In fig. 3, the first row shows the original color images, the second row the semantic segmentation results for the corresponding inputs, and the last row the depth estimation results. Comparing the results of the semantic segmentation and disparity estimation tasks shows that semantic segmentation promotes disparity estimation and that both perform well; in particular, the semantic segmentation task noticeably refines object edges in the disparity estimation. This also verifies the consistency in content between the structural and semantic information of objects in the scene.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims (10)

1. A semantic segmentation assisted binocular vision unsupervised depth estimation method, comprising the following steps:
S100: building a full convolution neural network for predicting a depth map, wherein the full convolution neural network consists of an encoder and a decoder, and the semantic segmentation task and the disparity estimation task share the same encoder and decoder;
S200: capturing a left view and a right view using a device equipped with a binocular camera;
S300: inputting the acquired left and right color views into the full convolution neural network respectively, and substituting the predicted left and right disparity maps output by the network into a loss function to compute the training loss, thereby training the full convolution neural network;
S400: inputting a single color image into the trained full convolution neural network, which outputs a predicted disparity map from which the predicted depth map is obtained.
2. The method according to claim 1, wherein the step S300 further comprises the steps of:
S301: the input color left and right views pass through the encoder part of the full convolution neural network to obtain feature maps;
S302: acquiring a flag bit t that distinguishes whether the prediction layer performs the disparity estimation task or the semantic segmentation task;
S303: after concatenation with the flag bit, the feature map is input to the decoder part of the full convolution neural network, and the semantic segmentation task or the disparity estimation task is selected according to the value of the flag bit in the prediction-loss fusion layer.
3. The method of claim 2, wherein the step S303 further comprises the steps of:
S3031: disparity estimation task processing: after the corresponding left and right predicted disparity maps are obtained, the traditional SGM algorithm is applied to the input color left and right views to obtain computed disparity maps corresponding to the input views, providing sparse true disparity values; the computed left and right disparity maps then serve as supervision signals for the left and right predicted disparity maps;
S3032: semantic segmentation task processing: the input color left and right views pass through the encoder for feature extraction and then through the decoder, and are finally trained with supervision using a conventional supervised semantic segmentation method.
4. The method of claim 1, wherein the encoder uses ResNet with dilated convolutions.
5. The method of claim 1, wherein the decoder part adopts four long skip connections.
6. The method of claim 1, wherein the output of the full convolution neural network has four scales, corresponding to 1, 1/2, 1/4 and 1/8 of the input image, and the loss function is computed at all four scales.
7. The method of claim 1, wherein the loss function L is composed of five sub-functions: the depth error, the semantic segmentation error, the left-right semantic consistency error, the semantic-guided disparity smoothing error, and the error of the SGM algorithm.
8. The method of claim 2, wherein the flag bit is an all-0 or all-1 feature map.
9. The method of claim 2, wherein the result of the semantic segmentation task is obtained by a Softmax function.
10. The method of claim 2, wherein the result of the disparity estimation task is obtained by average pooling.
CN202110329765.XA 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method Active CN113096176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110329765.XA CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Publications (2)

Publication Number Publication Date
CN113096176A true CN113096176A (en) 2021-07-09
CN113096176B CN113096176B (en) 2024-04-05

Family

ID=76670490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110329765.XA Active CN113096176B (en) 2021-03-26 2021-03-26 Semantic segmentation-assisted binocular vision unsupervised depth estimation method

Country Status (1)

Country Link
CN (1) CN113096176B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Xinsheng; ZHANG Guiling: "Monocular Depth Estimation Based on Convolutional Neural Network", Computer Engineering and Applications, No. 13.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882091A (en) * 2022-04-29 2022-08-09 中国科学院上海微系统与信息技术研究所 Depth estimation method combined with semantic edge
CN114882091B (en) * 2022-04-29 2024-02-13 中国科学院上海微系统与信息技术研究所 Depth estimation method combining semantic edges

Also Published As

Publication number Publication date
CN113096176B (en) 2024-04-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant