CN112861911A - RGB-D semantic segmentation method based on depth feature selection fusion - Google Patents
RGB-D semantic segmentation method based on depth feature selection fusion Download PDFInfo
- Publication number
- CN112861911A (application number CN202110027615.3A)
- Authority
- CN
- China
- Prior art keywords
- layer
- image
- rgb
- semantic segmentation
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention discloses an RGB-D semantic segmentation method based on depth feature selection and fusion. A dual-stream convolutional neural network is adopted as the semantic segmentation model: the RGB-D image and its corresponding label image are first preprocessed; the encoders then extract the per-layer feature maps of the visible-light image and the depth image, and these feature maps are fused to obtain the fusion feature of each layer; spatial attention is then used to select among the encoder's fusion features, and the selected result is upsampled to obtain the final segmentation. The method segments small objects and contours more accurately, is more robust to illumination changes and to objects of similar appearance, and achieves higher pixel accuracy and mean intersection-over-union.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an RGB-D semantic segmentation method.
Background
Semantic segmentation refers to classifying an image at the pixel level according to semantic information. With the development of depth sensors, depth information has come to be regarded, alongside the widely used visual information, as a complementary cue for improving scene-understanding performance. Depth information encodes 3D geometry that is insensitive to illumination variation and can distinguish objects of similar appearance; depth cues can therefore compensate, to some extent, for the shortcomings of semantic segmentation based on visual cues alone. RGB-D semantic segmentation is important for many applications, such as autonomous driving, robot vision and indoor navigation.
With the development of deep learning, dual-stream networks have achieved excellent performance in RGB-D semantic segmentation. However, how to effectively fuse visible-light and depth information into a unified representation remains a basic but difficult problem in RGB-D semantic segmentation. Hazirbas et al. in the document "C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Asian Conference on Computer Vision, 2016, pp. 213-228" directly add the depth feature map to the visible-light feature map to fuse the two kinds of information. Deng et al. in the document "L. Deng, M. Yang, T. Li, Y. He, and C. Wang. RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation. Eprint Arxiv, 2019" propose a residual fusion block to achieve bottom-up interaction and fusion between the two modalities. Hu et al. in the document "X. Hu, K. Yang, L. Fei, and K. Wang. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In IEEE International Conference on Image Processing, 2019, pp. 1440-1444" propose an attention complementary module that assigns different weights to the two modalities for better integration. Lee et al. in the document "S. Lee, S. J. Park, and K. S. Hong. RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation. In IEEE International Conference on Computer Vision, 2017, pp. 4990-4999" use modality-specific residual connections to learn RGB and depth features and their combinations so as to exploit their complementarity. Although these methods provide structured models that integrate the two kinds of information, how to ensure that the network takes full advantage of both modalities for fine semantic segmentation remains an open question.
In addition, most current RGB-D semantic segmentation methods are based on an encoder-decoder architecture. Successive downsampling in the encoder loses part of the detail information, and reducing the feature-map resolution to a very small size through cascaded pooling layers is unfavourable for producing accurate segmentation results. To recover spatial information lost in the encoding stage, O. Ronneberger et al. propose skip connections in the document "O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241" to reuse information lost in the encoder to assist upsampling and finally obtain a fine segmentation result. Although this enables some lost features to be reused, it lacks pertinence and does not explicitly model the recovery of important detail information.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an RGB-D semantic segmentation method based on depth feature selection and fusion. A dual-stream convolutional neural network is adopted as the semantic segmentation model: the RGB-D image and its corresponding label image are first preprocessed; the encoders then extract the per-layer feature maps of the visible-light image and the depth image, and these feature maps are fused to obtain the fusion feature of each layer; spatial attention is then used to select among the encoder's fusion features, and the selected result is upsampled to obtain the final segmentation. The method segments small objects and contours more accurately, is more robust to illumination changes and to objects of similar appearance, and achieves higher pixel accuracy and mean intersection-over-union.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: constructing a semantic segmentation model:
the method comprises the steps that a double-current convolutional neural network is adopted as a semantic segmentation model and comprises two encoders and a decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
taking the RGB-D images and the corresponding label images from the public training set as a training sample set, and uniformly changing the sizes of the images in the training sample set into A x A; the RGB-D image consists of a visible light image and a depth image;
coding the depth image of the single channel into a three-channel HHA image;
step 2-2: respectively inputting the visible light image and the HHA image into two encoders for feature extraction to obtain two feature sequences, wherein each feature sequence comprises an m-layer feature diagram and is represented as formula (1):
whereinAndj-th layer feature maps of the visible light image and the depth image respectively, wherein H, W and C respectively represent the height, width and channel number of the feature maps;
step 2-3: will be provided withAndobtaining selected characteristic graphs through convolution of 1 x 1 respectively, wherein the selected characteristic graphs are expressed as a formula (2):
whereinAndselected feature maps for the j-th layer of the visible image and the depth image respectively,andrespectively representing the process of selection through convolution of 1 x 1;
step 2-4: calculating the fusion characteristics of j layers of the two characteristic sequences by adopting an equation (3):
wherein the content of the first and second substances,andrespectively representing the fusion characteristics of the j-th layer and the j-1 st layer of the two characteristic sequences, fconv(.) represents convolution, fdown(.) represents downsampling, fconcat(.) represents a cascade; fusion features of layer 1 of two feature sequencesComprises the following steps:
Step 3: reusing detail information;
Step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
Step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
Step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
Downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
Use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
The training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
Step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
Preferably, A = 480, m = 4, λ_1 = 0.1, λ_2 = 0.1, λ_3 = 0.2.
Preferably, in step 4, nearest-neighbour interpolation is used when the label images of the training sample set are downsampled m-2 times in succession.
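As a concrete illustration of why nearest-neighbour sampling is preferred, interpolating a label map would average class indices and invent labels that exist in neither neighbourhood; picking the nearest source pixel keeps every output value a valid class. A minimal NumPy sketch (the helper name is ours, not from the patent):

```python
import numpy as np

def downsample_labels_nearest(labels: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an integer label map by nearest-neighbour sampling.

    Bilinear interpolation would average class indices and create
    meaningless labels, so supervision maps are shrunk by picking the
    nearest source pixel instead.
    """
    h, w = labels.shape
    rows = np.arange(h // factor) * factor
    cols = np.arange(w // factor) * factor
    return labels[np.ix_(rows, cols)]

labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
small = downsample_labels_nearest(labels, 2)
# small == [[0, 1], [2, 3]]; every output value is still a valid class index
```

Applying the helper twice (factors 2 and 4 with m = 4) yields the supervision maps for decoder layers 3 and 2.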
The invention has the following beneficial effects:
1. More accurate segmentation of small objects and contours. The proposed spatial attention mechanism attends to and reuses features that are easily lost during downsampling, so that these features contribute to recovering the segmentation mask, which particularly helps small-object and contour segmentation.
2. Stronger robustness to illumination changes and to objects of similar appearance. The method takes both visible-light and depth images as input and fuses the two kinds of information with an information fusion module into a unified, highly discriminative representation. This representation compensates for the weaknesses of visible-light-only semantic segmentation: it reduces the influence of illumination changes on model performance and allows objects of similar appearance in the scene to be distinguished.
3. Higher pixel accuracy and mean intersection-over-union of segmentation results. The method helps existing model architectures perform better scene parsing.
4. Stronger generalization ability. The proposed RGB-D information fusion module generalizes to other multi-modal fusion tasks, such as fusing visible-light and infrared information.
5. Greater practical and industrial value. The invention is applicable to vehicle driver-assistance systems and indoor robot navigation, and thus has high practical value.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a graph of semantic segmentation results generated by the method of the present invention and the comparison method.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The method mainly addresses two problems in RGB-D semantic segmentation: multi-modal feature fusion is incomplete, and important features are lost in the downsampling process. Specifically:
1. Existing algorithms do not explicitly model the fusion of visible-light and depth information.
2. Existing algorithms lose some important features during downsampling, which degrades small-object and contour segmentation performance.
As shown in fig. 1, an RGB-D semantic segmentation method based on depth feature selection and fusion includes the following steps:
step 1: constructing a semantic segmentation model:
the method comprises the steps that a double-current convolutional neural network is adopted as a semantic segmentation model and comprises two encoders and a decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
Take the RGB-D images and their corresponding label images from a public training set as the training sample set, and uniformly resize all images in the training sample set to A × A; each RGB-D image consists of a visible-light image and a depth image;
Encode the single-channel depth image into a three-channel HHA image;
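For illustration only, the following NumPy sketch conveys the spirit of an HHA-style encoding: stacking a horizontal-disparity channel, a height channel and a gravity-angle channel into a three-channel image. The channel formulas are crude stand-ins (a faithful HHA encoder needs camera intrinsics and a gravity-direction estimate); the function name and defaults are our assumptions, not the patent's:

```python
import numpy as np

def encode_hha_simplified(depth: np.ndarray, cam_height: float = 1.0) -> np.ndarray:
    """Very simplified HHA-style encoding of a single-channel depth map.

    HHA stacks horizontal disparity, height above ground, and the angle of
    the local surface normal with gravity. Here we approximate:
      channel 0: disparity = 1 / depth
      channel 1: height proxy, image-plane row back-projected by depth
      channel 2: angle proxy from the vertical depth gradient
    Each channel is min-max normalised to [0, 1].
    """
    d = np.clip(depth, 1e-3, None)
    disparity = 1.0 / d
    rows = np.linspace(1.0, -1.0, depth.shape[0])[:, None]  # image-plane y
    height = cam_height + rows * d                          # crude back-projection
    angle = np.arctan(np.gradient(d, axis=0))               # slope of depth w.r.t. rows

    def norm(c):
        rng = c.max() - c.min()
        return (c - c.min()) / rng if rng > 0 else np.zeros_like(c)

    return np.stack([norm(disparity), norm(height), norm(angle)], axis=-1)

depth = np.random.default_rng(0).random((8, 8)) + 0.5   # synthetic depth in [0.5, 1.5)
hha = encode_hha_simplified(depth)
# hha.shape == (8, 8, 3), all values in [0, 1]
```

The three-channel result can then be fed to the depth-stream encoder exactly like an RGB image.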
Step 2-2: input the visible-light image and the HHA image into the two encoders respectively for feature extraction, obtaining two feature sequences, each comprising m layers of feature maps, expressed as formula (1):

$$\{F_j^{rgb}\}_{j=1}^{m},\qquad \{F_j^{d}\}_{j=1}^{m},\qquad F_j^{rgb},\,F_j^{d}\in\mathbb{R}^{H\times W\times C} \tag{1}$$

where $F_j^{rgb}$ and $F_j^{d}$ are the j-th layer feature maps of the visible-light image and the depth image respectively, and H, W and C denote the height, width and number of channels of the feature map;
Step 2-3: pass $F_j^{rgb}$ and $F_j^{d}$ through separate 1 × 1 convolutions to obtain the selected feature maps, expressed as formula (2):

$$\tilde{F}_j^{rgb}=f_{sel}^{rgb}(F_j^{rgb}),\qquad \tilde{F}_j^{d}=f_{sel}^{d}(F_j^{d}) \tag{2}$$

where $\tilde{F}_j^{rgb}$ and $\tilde{F}_j^{d}$ are the selected feature maps of the j-th layer of the visible-light image and the depth image respectively, and $f_{sel}^{rgb}(\cdot)$ and $f_{sel}^{d}(\cdot)$ denote selection by 1 × 1 convolution;
Step 2-4: compute the fusion feature of layer j of the two feature sequences with formula (3):

$$M_j=f_{conv}\big(f_{concat}\big(f_{down}(M_{j-1}),\ \tilde{F}_j^{rgb},\ \tilde{F}_j^{d}\big)\big),\qquad j=2,\dots,m \tag{3}$$

where $M_j$ and $M_{j-1}$ are the fusion features of layers j and j-1 of the two feature sequences, $f_{conv}(\cdot)$ denotes convolution, $f_{down}(\cdot)$ denotes downsampling, and $f_{concat}(\cdot)$ denotes concatenation; the fusion feature $M_1$ of layer 1 of the two feature sequences is given by formula (4):

$$M_1=f_{conv}\big(f_{concat}\big(\tilde{F}_1^{rgb},\ \tilde{F}_1^{d}\big)\big) \tag{4}$$
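The selection and fusion of steps 2-3 and 2-4 can be sketched in NumPy as follows, with the 1 × 1 "selection" convolutions realised as per-pixel channel mixing, stride-2 subsampling standing in for the downsampling, and concatenation followed by another 1 × 1 convolution as the fusion. All shapes and weights below are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-pixel linear map over channels
    (the 'selection' step of step 2-3)."""
    return np.einsum('hwc,cd->hwd', x, w)

def downsample2x(x):
    """Stride-2 subsampling standing in for the downsampling operation."""
    return x[::2, ::2, :]

def fuse_layer(m_prev, f_rgb, f_d, w_rgb, w_d, w_out):
    """Fusion feature of one layer (step 2-4): select each modality with a
    1x1 convolution, concatenate with the downsampled previous fusion
    feature (if any), and mix with another 1x1 convolution."""
    parts = [conv1x1(f_rgb, w_rgb), conv1x1(f_d, w_d)]
    if m_prev is not None:
        parts.insert(0, downsample2x(m_prev))
    cat = np.concatenate(parts, axis=-1)   # concatenation
    return conv1x1(cat, w_out)             # mixing convolution

rng = np.random.default_rng(0)
C = 4
f_rgb1 = rng.normal(size=(8, 8, C)); f_d1 = rng.normal(size=(8, 8, C))
f_rgb2 = rng.normal(size=(4, 4, C)); f_d2 = rng.normal(size=(4, 4, C))
w = lambda cin: rng.normal(size=(cin, C))

m1 = fuse_layer(None, f_rgb1, f_d1, w(C), w(C), w(2 * C))   # layer 1 fusion
m2 = fuse_layer(m1, f_rgb2, f_d2, w(C), w(C), w(3 * C))     # layer 2 fusion
# m1.shape == (8, 8, 4), m2.shape == (4, 4, 4)
```

The same `fuse_layer` call is repeated down the encoder, one invocation per layer j, each time consuming the previous fusion feature.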
Step 3: reusing detail information;
Step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
Step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
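One decoder step of step 3-2 can be sketched as below, under two assumptions of ours that the patent text does not pin down: the spatial-attention selection is a sigmoid over the channel-mean response, and the re-fusion is element-wise addition before upsampling:

```python
import numpy as np

def spatial_attention(feat):
    """Single-channel spatial attention map: sigmoid of the channel-mean
    response, broadcast back over all channels when multiplied."""
    score = feat.mean(axis=-1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-score))

def upsample2x(x):
    """Nearest-neighbour upsampling standing in for the decoder upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(m_i, d_layer):
    """One decoder step: weight the encoder fusion feature with spatial
    attention, re-fuse it with the current decoder layer by element-wise
    addition (our assumed fusion operator), then upsample."""
    selected = spatial_attention(m_i) * m_i   # attention-selected detail
    refused = selected + d_layer              # re-fusion (additive variant)
    return upsample2x(refused)

m_i = np.random.default_rng(1).normal(size=(4, 4, 8))   # encoder fusion feature
d_cur = np.zeros((4, 4, 8))                             # current decoder layer
d_next = decoder_step(m_i, d_cur)
# d_next.shape == (8, 8, 8)
```

Iterating this step from the deepest fusion feature back up reproduces the coarse-to-fine recovery of the segmentation mask.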
Step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
Downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
Use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
The training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
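The weighted cross-entropy term can be sketched as follows; this is the standard weighted log-softmax cross entropy (the form computed by PyTorch's nn.CrossEntropyLoss with a weight vector), written in NumPy so the sketch is self-contained:

```python
import numpy as np

def weighted_cross_entropy(x, target, weight):
    """Per-pixel weighted cross entropy: for class logits x[..., class],
    l = -weight[class] * (x[class] - log(sum_j exp(x[j]))),
    averaged over pixels with the class weights as averaging weights.
    """
    x = x - x.max(axis=-1, keepdims=True)            # numerical stability
    log_softmax = x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_softmax, target[..., None], axis=-1)[..., 0]
    w = weight[target]
    return -(w * picked).sum() / w.sum()

logits = np.array([[[2.0, 0.0], [0.0, 2.0]]])        # 1 x 2 pixels x 2 classes
target = np.array([[0, 1]])                          # ground-truth class per pixel
loss = weighted_cross_entropy(logits, target, np.array([1.0, 1.0]))
# both pixels are predicted correctly with margin 2, so the loss is small
```

The per-layer losses computed this way are combined as a weighted sum with the λ_i coefficients to form the overall training objective.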
Step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
The specific embodiment is as follows:
In this embodiment, simulations are performed with PyTorch on a Linux system with an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz and 60 GB of memory. The data used in the simulations come from the public NYUDv2 and SUN RGB-D datasets. NYUDv2 contains 1449 densely labelled RGB-D image pairs captured by a Microsoft Kinect, with 795 pairs for training and 654 pairs for testing. SUN RGB-D is currently the largest RGB-D semantic segmentation dataset, with 10,335 densely annotated RGB-D images taken from 20 different scenes and captured by four different sensors: Kinect v1, Kinect v2, Xtion and RealSense. The official split uses 5285 RGB-D image/label pairs for training, with the remaining 5050 pairs used for testing. Both datasets have 40 categories.
To demonstrate its effectiveness, the invention is compared against 3DGNN, RedNet, CFN, ACNet, PAP and SA-Gate on both datasets. 3DGNN is the method proposed in the document "X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D Graph Neural Networks for RGBD Semantic Segmentation. In IEEE International Conference on Computer Vision, 2017, pp. 5209-5218"; RedNet is proposed in the document "J. Jiang, L. Zheng, F. Luo, and Z. Zhang. RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation. Eprint Arxiv, 2018"; CFN is described in the document "D. Lin, G. Chen, D. Cohen-Or, P. A. Heng, and H. Huang. Cascaded Feature Network for Semantic Segmentation of RGB-D Images. In International Conference on Computer Vision, 2017, pp. 1320-1328"; PAP is proposed in the document "Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang. Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4106-4115"; SA-Gate is proposed in the document "X. Chen, K. Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In European Conference on Computer Vision, 2020". FSFNet is the method proposed in this invention; mIoU and Pixel Acc. are the evaluation metrics for RGB-D semantic segmentation quality. The comparison results are shown in Table 1:
TABLE 1
As can be seen from Table 1, on the NYUDv2 dataset the invention is comparable to the current best algorithm on Pixel Acc.; on the SUN RGB-D dataset it outperforms the other algorithms on the mIoU metric.
FIG. 2 shows semantic segmentation results generated by the invention and the comparison algorithms. As seen from the figure, the invention segments objects of different classes, such as ceilings and tables, better than the comparison algorithms, which demonstrates that it effectively combines the features of the RGB and depth information. In addition, the invention distinguishes small objects and produces more accurate contour segmentation, demonstrating that it makes better use of the detail information lost during downsampling.
Claims (3)
1. A depth feature selection fusion-based RGB-D semantic segmentation method is characterized by comprising the following steps:
step 1: constructing a semantic segmentation model:
a dual-stream convolutional neural network is adopted as the semantic segmentation model, comprising two encoders and one decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
take the RGB-D images and their corresponding label images from a public training set as the training sample set, and uniformly resize all images in the training sample set to A × A; each RGB-D image consists of a visible-light image and a depth image;
encode the single-channel depth image into a three-channel HHA image;
step 2-2: input the visible-light image and the HHA image into the two encoders respectively for feature extraction, obtaining two feature sequences, each comprising m layers of feature maps, expressed as formula (1):

$$\{F_j^{rgb}\}_{j=1}^{m},\qquad \{F_j^{d}\}_{j=1}^{m},\qquad F_j^{rgb},\,F_j^{d}\in\mathbb{R}^{H\times W\times C} \tag{1}$$

where $F_j^{rgb}$ and $F_j^{d}$ are the j-th layer feature maps of the visible-light image and the depth image respectively, and H, W and C denote the height, width and number of channels of the feature map;
step 2-3: pass $F_j^{rgb}$ and $F_j^{d}$ through separate 1 × 1 convolutions to obtain the selected feature maps, expressed as formula (2):

$$\tilde{F}_j^{rgb}=f_{sel}^{rgb}(F_j^{rgb}),\qquad \tilde{F}_j^{d}=f_{sel}^{d}(F_j^{d}) \tag{2}$$

where $\tilde{F}_j^{rgb}$ and $\tilde{F}_j^{d}$ are the selected feature maps of the j-th layer of the visible-light image and the depth image respectively, and $f_{sel}^{rgb}(\cdot)$ and $f_{sel}^{d}(\cdot)$ denote selection by 1 × 1 convolution;
step 2-4: compute the fusion feature of layer j of the two feature sequences with formula (3):

$$M_j=f_{conv}\big(f_{concat}\big(f_{down}(M_{j-1}),\ \tilde{F}_j^{rgb},\ \tilde{F}_j^{d}\big)\big),\qquad j=2,\dots,m \tag{3}$$

where $M_j$ and $M_{j-1}$ are the fusion features of layers j and j-1 of the two feature sequences, $f_{conv}(\cdot)$ denotes convolution, $f_{down}(\cdot)$ denotes downsampling, and $f_{concat}(\cdot)$ denotes concatenation; the fusion feature $M_1$ of layer 1 of the two feature sequences is given by formula (4):

$$M_1=f_{conv}\big(f_{concat}\big(\tilde{F}_1^{rgb},\ \tilde{F}_1^{d}\big)\big) \tag{4}$$
step 3: reusing detail information;
step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
the training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
2. The RGB-D semantic segmentation method based on depth feature selection fusion as claimed in claim 1, wherein A = 480, m = 4, λ_1 = 0.1, λ_2 = 0.1, λ_3 = 0.2.
3. The RGB-D semantic segmentation method based on depth feature selection fusion as claimed in claim 1, wherein in step 4, nearest neighbor interpolation downsampling is adopted when downsampling is performed m-2 times on the label images of the training sample set in sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110027615.3A CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110027615.3A CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861911A true CN112861911A (en) | 2021-05-28 |
CN112861911B CN112861911B (en) | 2024-05-28 |
Family
ID=76002060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110027615.3A Active CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861911B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762263A (en) * | 2021-08-17 | 2021-12-07 | 慧影医疗科技(北京)有限公司 | Semantic segmentation method and system for small-scale similar structure |
CN113920317A (en) * | 2021-11-15 | 2022-01-11 | 西北工业大学 | Semantic segmentation method based on visible light image and low-resolution depth image |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271990A (en) * | 2018-09-03 | 2019-01-25 | 北京邮电大学 | A kind of semantic segmentation method and device for RGB-D image |
CN110298361A (en) * | 2019-05-22 | 2019-10-01 | 浙江省北大信息技术高等研究院 | A kind of semantic segmentation method and system of RGB-D image |
CN110490884A (en) * | 2019-08-23 | 2019-11-22 | 北京工业大学 | A kind of lightweight network semantic segmentation method based on confrontation |
US20200134375A1 (en) * | 2017-08-01 | 2020-04-30 | Beijing Sensetime Technology Development Co., Ltd. | Semantic segmentation model training methods and apparatuses, electronic devices, and storage media |
US20200402300A1 (en) * | 2019-06-21 | 2020-12-24 | Harbin Institute Of Technology | Terrain modeling method that fuses geometric characteristics and mechanical charateristics, computer readable storage medium, and terrain modeling system thereof |
Non-Patent Citations (1)
Title |
---|
QIAN Zhengfang et al.: "A Brief Analysis of the Application of Deep Learning to Future Unmanned Surface Vehicle Platforms", Shipbuilding of China, 30 August 2020 (2020-08-30) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | Salient object detection in stereoscopic 3D images using a deep convolutional residual autoencoder | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN108734210B (en) | Object detection method based on cross-modal multi-scale feature fusion | |
CN112581409B (en) | Image defogging method based on end-to-end multiple information distillation network | |
CN110751111B (en) | Road extraction method and system based on high-order spatial information global automatic perception | |
CN111563909A (en) | Semantic segmentation method for complex street view image | |
CN113657388A (en) | Image semantic segmentation method fusing image super-resolution reconstruction | |
CN114820579A (en) | Semantic segmentation based image composite defect detection method and system | |
CN112861911A (en) | RGB-D semantic segmentation method based on depth feature selection fusion | |
CN111488884A (en) | Real-time semantic segmentation method with low calculation amount and high feature fusion | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN116311254B (en) | Image target detection method, system and equipment under severe weather condition | |
WO2024040973A1 (en) | Multi-scale fused dehazing method based on stacked hourglass network | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN114913493A (en) | Lane line detection method based on deep learning | |
CN111860116A (en) | Scene identification method based on deep learning and privilege information | |
CN112766056A (en) | Method and device for detecting lane line in low-light environment based on deep neural network | |
CN114743027B (en) | Weak supervision learning-guided cooperative significance detection method | |
CN115905838A (en) | Audio-visual auxiliary fine-grained tactile signal reconstruction method | |
Sun et al. | TSINIT: a two-stage Inpainting network for incomplete text | |
CN117079237A (en) | Self-supervision monocular vehicle distance detection method | |
CN113920317B (en) | Semantic segmentation method based on visible light image and low-resolution depth image | |
CN112733934B (en) | Multi-mode feature fusion road scene semantic segmentation method in complex environment | |
CN114220098A (en) | Improved multi-scale full-convolution network semantic segmentation method | |
Gao et al. | RGBD semantic segmentation based on global convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||