CN113920317B - Semantic segmentation method based on visible light image and low-resolution depth image - Google Patents

Semantic segmentation method based on visible light image and low-resolution depth image

Info

Publication number
CN113920317B
CN113920317B (application number CN202111369121.XA)
Authority
CN
China
Prior art keywords
resolution
image
semantic segmentation
depth image
visible light
Prior art date
Legal status
Active
Application number
CN202111369121.XA
Other languages
Chinese (zh)
Other versions
CN113920317A (en)
Inventor
袁媛 (Yuan Yuan)
苏月皎 (Su Yuejiao)
姜志宇 (Jiang Zhiyu)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202111369121.XA
Publication of CN113920317A
Application granted
Publication of CN113920317B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformation in the plane of the image
    • G06T3/40: Scaling the whole image or part thereof
    • G06T3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, a super-resolution module network first processes the low-resolution depth image under the guidance of the high-resolution visible light image to obtain a high-resolution depth image, and a semantic segmentation module network then produces the semantic segmentation result. In this way the high-resolution visible light information available in practice is exploited, the resolution and quality of the semantic segmentation are ensured, the problem of resolution alignment is solved, and the application of semantic segmentation in real life is broadened.

Description

Semantic segmentation method based on visible light image and low-resolution depth image
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a semantic segmentation method based on a visible light image and a low-resolution depth image.
Background
Visible light and depth image (RGB-Depth, abbreviated RGB-D) semantic segmentation is the segmentation of scene objects and regions using the visual appearance information of the scene together with the scene distance information acquired by a depth sensor. As an underlying core technology, RGB-D semantic segmentation has wide applications such as indoor navigation, autonomous driving, and machine vision.
With the development of deep learning methods, RGB-D semantic segmentation technology has improved substantially and is widely applied to practical vision tasks. Gupta et al., in the document "S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In European Conference on Computer Vision, 2014, pp. 345-360", propose a two-stream encoder-decoder network containing two encoder branches that extract visible light features and depth features, respectively, to predict the segmentation result. Most subsequent RGB-D semantic segmentation methods adopt such a two-stream network as the basis of the RGB-D semantic segmentation network to extract, fuse and upsample multi-modal features. Chen et al., in the document "X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In European Conference on Computer Vision, 2020", propose gate-based bi-directional cross-modal feature propagation, improving the feature fusion of the two modalities.
While the above methods all achieve good results, they usually rely on a large number of pairs of visible light and depth images of the same resolution as training data. In real situations, however, the resolution of the depth image is often low due to the depth sensor principle and hardware limitations, whereas visible light cameras have developed rapidly and the acquired visible light images have higher resolution. For this reason, many existing studies capture or downsample the visible light image to a lower resolution so that it matches the resolution of the depth image. On the one hand this does not fully utilize the high-resolution visible light image information; on the other hand it limits the generalization ability of the RGB-D semantic segmentation model and its ability to solve practical problems in real life.
Disclosure of Invention
In order to overcome the defect that existing RGB-D semantic segmentation methods cannot process RGB-D image pairs of non-aligned resolution, the invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, a super-resolution module network first processes the high-resolution visible light image and the low-resolution depth image to obtain a high-resolution depth image, and a semantic segmentation module network then produces the semantic segmentation result. In this way the high-resolution visible light information available in practice is exploited, the resolution and quality of the semantic segmentation are ensured, the problem of resolution alignment is solved, and the application of semantic segmentation in real life is broadened.
A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:
step 1: downsampling the depth image in the input RGB-D data set to obtain a low-resolution depth image; taking a visible light image and a low-resolution depth image in an RGB-D data set as input data, taking the depth image as supervision information, and training a super-resolution module network to obtain a trained network;
the RGB-D data set is a public visible light and depth image data set;
the super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map to obtain a depth image with the same resolution as the input visible light image;
step 2: taking a visible light image in the RGB-D data set and a depth image obtained by the super-resolution module as input data, taking the middle two-layer parameters of the encoder of the trained super-resolution module network as the supervision information of the middle two-layer parameters of the encoder of the semantic segmentation module network, and training the semantic segmentation module network to obtain the trained network;
the semantic segmentation module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map, and the resulting image is the semantic segmentation result image;
step 3: the RGB-D data obtained through real acquisition is input into a trained super-resolution module network, and then is output into a semantic segmentation result image through a trained semantic segmentation module network.
Specifically, the loss function L_SR of the super-resolution module network is:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image in the RGB-D dataset, D_LR^i represents the low-resolution depth image obtained by downsampling, I_rgb represents the visible light image in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
Specifically, the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolution networks, and the loss function L_seg of the semantic segmentation module network is:

where L(x, class) represents the weighted cross entropy between the predicted segmentation result and the ground-truth label, x represents the predicted semantic segmentation result, class represents the object category, W_sr^i represents the parameters of the i-th layer of the encoder in the super-resolution module network, and W_seg^i represents the parameters of the i-th layer of the encoder in the semantic segmentation module network;

the calculation formula of L(x, class) is:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability that pixel j is predicted as the class class.
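As an illustrative, non-limiting sketch (assuming a PyTorch implementation, consistent with the experiments described later; the function names are hypothetical), the weighted cross entropy with dataset-frequency class weights described above could be computed as follows:

```python
import torch
import torch.nn as nn

def class_pixel_weights(label_maps, num_classes):
    """weight[class]: proportion of pixels of each class in the dataset,
    as described above. Classes absent from the dataset receive weight 0."""
    counts = torch.zeros(num_classes)
    for labels in label_maps:  # each labels: (H, W) integer tensor
        counts += torch.bincount(labels.flatten(), minlength=num_classes).float()
    return counts / counts.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Weighted cross entropy between predicted segmentation logits
    (N, C, H, W) and ground-truth label maps (N, H, W)."""
    return nn.CrossEntropyLoss(weight=weights)(logits, labels)

# Hypothetical usage with random data (40 classes, as in SUN RGB-D):
labels = [torch.randint(0, 40, (480, 640)) for _ in range(4)]
w = class_pixel_weights(labels, num_classes=40)
logits = torch.randn(4, 40, 480, 640)
print(weighted_cross_entropy(logits, torch.stack(labels), w))
```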
The beneficial effects of the invention are as follows: the introduction of the super-resolution subtask bridges the gap between the depth resolution and the visible light resolution obtainable in real life, makes full use of the high-resolution visible light information available in practice, and therefore has practical and industrial value; the correlation between the super-resolution subtask and the semantic segmentation subtask is used to assist in optimizing the semantic segmentation subtask, which improves the accuracy of the semantic segmentation network. The invention is suitable for vehicle-mounted assistance systems, automatic driving systems, indoor autonomous navigation systems and the like, and has good practical value.
Drawings
FIG. 1 is a flow chart of a semantic segmentation method based on a visible light image and a low resolution depth image according to the present invention;
FIG. 2 is a comparison of depth image results of the super-resolution module network of the present invention;
in the figure: (a) the original depth image; (b) the depth image output by the super-resolution module network; (c) the actual high-resolution depth image; (d) the high-resolution visible light image;
FIG. 3 is a comparison of segmentation results of the present invention;
in the figure: (a) the input depth image; (b) the input visible light image; (c) the segmentation result image of the present invention; (d) the ground-truth segmentation image.
Detailed Description
The invention is further illustrated below with reference to the figures and an embodiment; the invention includes but is not limited to the following embodiment.
As shown in fig. 1, the invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image, which comprises the following specific implementation processes:
the visible light sensor in real life has higher resolution than the depth sensor, and in order to enable the method to be applied to practice, the processing object of the invention is a high-resolution visible light image I rgb And corresponding low resolution depth imageThe resolutions of two modal images in the existing large RGB-D data set are consistent, so that in order to train the super-resolution subtask, the depth images in the RGB-D data set are firstly downsampled to obtain the depth images with relatively low resolution
1. Super-resolution subtasks
The visible light image I_rgb and the low-resolution depth image D_LR are taken as input to train the super-resolution module network. The super-resolution module network comprises two parallel encoders, a fusion module and a decoder; the two encoder branches extract features from the high-resolution visible light image and the low-resolution depth image, and the extracted features are fused and passed into the decoder for up-sampling. The original high-resolution depth information D_HR in the dataset is used as the supervisory signal for the generated high-resolution depth image, thereby training the super-resolution module network:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image of the i-th pair in the RGB-D dataset, D_LR^i represents the low-resolution depth image of the i-th pair obtained by downsampling, I_rgb^i represents the visible light image of the i-th pair in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
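As a minimal sketch of this training objective (assuming PyTorch; the choice of an L1 reconstruction penalty and averaging over the batch are assumptions, since the exact formula is not reproduced here):

```python
import torch.nn.functional as F

def sr_loss(f_sr, rgb_hr, depth_lr, depth_hr):
    """L_SR over a batch of N image pairs: compare the super-resolution
    module's prediction f_sr(I_rgb^i, D_LR^i; W_sr) with the original
    high-resolution depth D_HR^i."""
    pred = f_sr(rgb_hr, depth_lr)     # (N, 1, H, W) predicted HR depth
    return F.l1_loss(pred, depth_hr)  # averaged over the N pairs

# Hypothetical training step:
# optimizer.zero_grad(); loss = sr_loss(sr_net, rgb, d_lr, d_hr)
# loss.backward(); optimizer.step()
```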
2. Semantic segmentation subtasks
The super-resolution subtask thus obtains an aligned-resolution RGB-D image pair from a non-aligned-resolution RGB-D image pair. The predicted aligned RGB-D image pair is input into the semantic segmentation module network, and the features of the visible light image and of the depth image predicted by the super-resolution module network are extracted by two encoders comprising K convolution layers, where K is the number of network layers and can be adjusted according to the actual situation. The extracted features are fused and passed into a decoder for up-sampling, finally yielding the segmentation result.
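Both module networks share the two-encoder, additive-fusion, decoder structure described above. A minimal sketch under stated assumptions follows (PyTorch; the channel widths, strides and decoder design are assumptions, only the overall structure is taken from the description; K = 4 layers as stated above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """A K-layer convolutional encoder (K = 4); widths/strides are assumptions."""
    def __init__(self, in_ch, widths=(32, 64, 128, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(prev, w, 3, stride=2, padding=1),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            prev = w
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

class DualEncoderNet(nn.Module):
    """Two parallel encoders, additive fusion, and an up-sampling decoder.
    With out_ch = 1 this sketches the super-resolution module (high-resolution
    depth output); with out_ch = num_classes it sketches the semantic
    segmentation module."""
    def __init__(self, out_ch, rgb_in_ch=3, depth_in_ch=1):
        super().__init__()
        self.rgb_encoder = Encoder(rgb_in_ch)
        self.depth_encoder = Encoder(depth_in_ch)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, rgb, depth):
        if depth.shape[-2:] != rgb.shape[-2:]:
            # Assumption: bring a low-resolution depth input up to the RGB
            # resolution before encoding so the feature maps align for fusion.
            depth = F.interpolate(depth, size=rgb.shape[-2:],
                                  mode="bilinear", align_corners=False)
        fused = self.rgb_encoder(rgb) + self.depth_encoder(depth)  # additive fusion
        return self.decoder(fused)

# Example: sr_net = DualEncoderNet(out_ch=1); seg_net = DualEncoderNet(out_ch=40)
```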
In order to improve the generalization performance of the model, the training of the semantic segmentation module network and the optimization of its loss function adopt a two-part joint scheme. One part is the cross entropy loss between the predicted segmentation result and the ground-truth label in the dataset; the other part is a coupling constraint between the super-resolution module network and the semantic segmentation module network: after the super-resolution module network has been trained, the parameters of the middle two layers of its encoder are taken out and used as supervision information for the middle two-layer parameters of the semantic segmentation module network's encoder, assisting the training of those layers. The following objective function is used to optimize the semantic segmentation module network parameters:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability, ranging from 0 to 1, that pixel j is predicted as the class class.
3. Model application
After the model parameters have been optimized and learned, the highest-resolution visible light image that the visible light sensor can acquire in a real scene and the highest-resolution depth image that the depth sensor can acquire are input into the super-resolution module network, yielding a depth image whose resolution is aligned upward to that of the visible light image. The predicted depth image and the visible light image acquired by the visible light sensor are then input into the semantic segmentation module network to predict the segmentation result.
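The two-stage inference can be sketched as follows (PyTorch; function and argument names are hypothetical):

```python
import torch

@torch.no_grad()
def segment_scene(sr_net, seg_net, rgb_hr, depth_lr):
    """Stage 1: the super-resolution module lifts the low-resolution depth map
    to the visible-light resolution. Stage 2: the semantic segmentation module
    predicts the per-pixel class map from the aligned-resolution pair."""
    depth_hr_pred = sr_net(rgb_hr, depth_lr)   # aligned-resolution depth
    logits = seg_net(rgb_hr, depth_hr_pred)    # (N, num_classes, H, W)
    return logits.argmax(dim=1)                # (N, H, W) label map
```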
In order to verify the effectiveness of the method, a simulation experiment was carried out with PyTorch on a Linux operating system with an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz and 60 GB of memory. The data used in the experiments is the public SUN RGB-D dataset. SUN RGB-D is currently the largest RGB-D semantic segmentation dataset, containing 10335 annotated RGB-D images over 40 classes, of which 5285 pairs are used for training and 5050 pairs for testing. It was captured by four different sensors: Kinect V1, Kinect V2, Xtion and RealSense.
To demonstrate the effectiveness of the algorithm, RedNet, ACNet, PAP and FSFNet were chosen as comparison methods. RedNet is proposed in "J. Jiang, L. Zheng, F. Luo, and Z. Zhang. RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation. arXiv preprint, 2018"; PAP is proposed in "Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, J. Yang. Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4106-4115"; FSFNet is proposed in "Y. Su, Y. Yuan, Z. Jiang. Deep Feature Selection-and-Fusion for RGB-D Semantic Segmentation. In International Conference on Multimedia & Expo, 2021"; ACNet is proposed in "X. Hu, K. Yang, L. Fei, and K. Wang. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In IEEE International Conference on Image Processing, 2019, pp. 1440-1444". The mean intersection over union (mIoU) and pixel accuracy (Pixel Acc) are calculated to evaluate the RGB-D semantic segmentation quality; the larger the index value, the better the segmentation. The calculation results are shown in Table 1, from which it can be seen that both indexes of the invention perform better than those of the other methods.
TABLE 1
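The two reported indexes can be computed from a class confusion matrix; a minimal sketch (assuming PyTorch and integer label maps; the function name is hypothetical) is:

```python
import torch

def segmentation_metrics(preds, labels, num_classes):
    """Compute mean intersection-over-union (mIoU) and pixel accuracy from
    predicted and ground-truth integer label maps; classes that never appear
    in prediction or ground truth are ignored in the mIoU average."""
    conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for p, g in zip(preds, labels):
        idx = g.flatten() * num_classes + p.flatten()
        conf += torch.bincount(idx, minlength=num_classes ** 2) \
                     .reshape(num_classes, num_classes)
    tp = conf.diag().float()
    union = conf.sum(0).float() + conf.sum(1).float() - tp
    iou = tp[union > 0] / union[union > 0]
    pixel_acc = tp.sum() / conf.sum().float()
    return iou.mean().item(), pixel_acc.item()

# Hypothetical usage on one image pair:
pred = torch.randint(0, 40, (480, 640))
gt = torch.randint(0, 40, (480, 640))
print(segmentation_metrics([pred], [gt], num_classes=40))
```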
Fig. 2 shows that the super-resolution subtask of the invention can super-resolve non-aligned-resolution RGB-D image data into aligned-resolution RGB-D data, thereby laying the foundation for the subsequent semantic segmentation subtask.
Fig. 3 shows a semantic segmentation result image and a real segmentation image obtained by the method of the invention when a low resolution depth image and a visible light image are input. From the above, the invention can achieve better semantic segmentation performance under the condition that the resolution of the input depth image is lower than that of the visible light image.

Claims (3)

1. A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:
step 1: downsampling the depth image in the input RGB-D data set to obtain a low-resolution depth image; taking a visible light image and a low-resolution depth image in an RGB-D data set as input data, taking the depth image as supervision information, and training a super-resolution module network to obtain a trained network;
the RGB-D data set is a public visible light and depth image data set;
the super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map to obtain a depth image with the same resolution as the input visible light image;
step 2: taking a visible light image in the RGB-D data set and a depth image obtained by the super-resolution module as input data, taking the middle two-layer parameters of the encoder of the trained super-resolution module network as the supervision information of the middle two-layer parameters of the encoder of the semantic segmentation module network, and training the semantic segmentation module network to obtain the trained network;
the semantic segmentation module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map, and the resulting image is the semantic segmentation result image;
step 3: the RGB-D data obtained through real acquisition is input into a trained super-resolution module network, and then is output into a semantic segmentation result image through a trained semantic segmentation module network.
2. The semantic segmentation method based on the visible light image and the low-resolution depth image according to claim 1, wherein the loss function L_SR of the super-resolution module network is:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image in the RGB-D dataset, D_LR^i represents the low-resolution depth image obtained by downsampling, I_rgb represents the visible light image in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
3. The semantic segmentation method based on the visible light image and the low-resolution depth image according to claim 1 or 2, wherein the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolution networks, and the loss function L_seg of the semantic segmentation module network is:

where L(x, class) represents the weighted cross entropy between the predicted segmentation result and the ground-truth label, x represents the predicted semantic segmentation result, class represents the object category, W_sr^i represents the parameters of the i-th layer of the encoder in the super-resolution module network, and W_seg^i represents the parameters of the i-th layer of the encoder in the semantic segmentation module network;

the calculation formula of L(x, class) is:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability that pixel j is predicted as the class class.
CN202111369121.XA 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image Active CN113920317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111369121.XA CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111369121.XA CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Publications (2)

Publication Number Publication Date
CN113920317A (en) 2022-01-11
CN113920317B (en) 2024-02-27

Family

ID=79247396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111369121.XA Active CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Country Status (1)

Country Link
CN (1) CN113920317B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908531B (en) * 2023-03-09 2023-06-13 深圳市灵明光子科技有限公司 Vehicle-mounted ranging method and device, vehicle-mounted terminal and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN112861911A (en) * 2021-01-10 2021-05-28 西北工业大学 RGB-D semantic segmentation method based on depth feature selection fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263756B2 (en) * 2019-12-09 2022-03-01 Naver Corporation Method and apparatus for semantic segmentation and depth completion using a convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN112861911A (en) * 2021-01-10 2021-05-28 西北工业大学 RGB-D semantic segmentation method based on depth feature selection fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王子羽 (Wang Ziyu); 张颖敏 (Zhang Yingmin); 陈永彬 (Chen Yongbin); 王桂棠 (Wang Guitang). Optimization of semantic segmentation networks for indoor scenes based on RGB-D images. Automation & Information Engineering, 2020, (02), full text. *

Also Published As

Publication number Publication date
CN113920317A (en) 2022-01-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant