CN116452805A - Transformer-based RGB-D semantic segmentation method of cross-modal fusion network - Google Patents

Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Info

Publication number
CN116452805A
CN116452805A (application CN202310401129.2A)
Authority
CN
China
Prior art keywords
rgb
depth
features
semantic segmentation
fusion
Prior art date
Legal status
Pending
Application number
CN202310401129.2A
Other languages
Chinese (zh)
Inventor
葛斌
朱序
夏晨星
张梦格
卢洋
陆一鸣
Current Assignee
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202310401129.2A priority Critical patent/CN116452805A/en
Publication of CN116452805A publication Critical patent/CN116452805A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Title of the invention: Transformer-based RGB-D semantic segmentation method of a cross-modal fusion network. Abstract: the invention provides a Transformer-based cross-modal fusion RGB-D semantic segmentation method, which uses the multi-modal data of RGB images and depth images to extract cross-modal features for the semantic segmentation task in computer vision. The main contribution of the invention is to account for the unreliable depth information obtained by depth sensors (for example, distant objects or reflective surfaces often yield inaccurate readings or holes from some depth sensors), to enhance the depth features with bilateral filtering, and to effectively fuse the RGB features and the depth features through a cross-modal residual fusion module. The proposed method effectively handles the challenge faced by semantic segmentation of RGB images alone (the difficulty of distinguishing instances with similar colors and textures) while making effective use of the depth image.

Description

Transformer-based RGB-D semantic segmentation method of cross-modal fusion network
Technical Field
The invention relates to the field of image processing, and in particular to a semantic segmentation method based on feature extraction and fusion of different modalities.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Semantic segmentation is one of the most challenging problems in computer vision. Its purpose is to partition an input image into regions according to their underlying semantic meaning, achieving the pixel-level dense scene understanding required by many real-world applications. With the rise of popular computer vision topics such as scene understanding, reconstruction and image processing, image semantic segmentation, as the foundation of these topics, has attracted increasing attention from researchers in the field. Semantic segmentation is a fundamental and long-standing problem in computer vision; treated as a per-pixel classification problem in which a class label is assigned to each pixel, it is suitable for a wide range of applications (such as automatic driving, object classification, image retrieval, and the detection of medical instruments in human-machine interactive surgery). Although some excellent results have been achieved in semantic segmentation, most studies focus on RGB images only. Since RGB images provide the model with distinct colors and textures but no geometric information, it is difficult to distinguish instances with similar colors and textures. To solve this problem, researchers began to use depth information to assist RGB semantic segmentation. Combining RGB and depth information, known as RGB-D, is an important approach: the depth image provides the required geometric information, which can enrich the representation of the RGB image and help better distinguish various objects.
Existing RGB-D semantic segmentation methods face two main challenges: how to effectively extract features from the additional depth modality, and how to effectively merge the different features of the two modalities. Current approaches mostly treat the depth map as a single-channel image and use a convolutional neural network (Convolutional Neural Network, CNN) to extract features from it in the same way as from the RGB image; however, this ignores the fact that not every depth value obtained by the depth sensor is reliable. Since RGB images and depth images belong to two different modalities, how to effectively fuse the features of the two modalities is also an important challenge for RGB-D semantic segmentation.
To address the shortcomings of convolutional-neural-network-based methods, the invention designs a framework capable of efficiently extracting RGB and depth features; during feature extraction, the reliability of the input depth values is explicitly considered and the depth image is denoised, so that its features can be used effectively. To solve the problem of fusing RGB features and depth features, the invention designs a cross-modal residual fusion module.
Disclosure of Invention
In view of the above problems, the invention aims to provide a Transformer-based RGB-D semantic segmentation method of a cross-modal fusion network, which adopts the following technical scheme:
1. RGB-D data sets for training and testing are acquired and collated.
1.1) The acquired datasets (the NYU Depth V2 dataset and the SUN RGB-D dataset) are sorted into the following categories: RGB images P_RGB, depth images P_Depth, and manually annotated ground-truth images P_GT.
1.2) The collected data are divided into a training set and a test set. NYU Depth V2 contains 1449 images in total, of which 795 are selected as the training set and the remaining 654 as the test set. SUN RGB-D consists of 10335 indoor RGB-D images, divided into a training set of 5285 samples and a test set of 5050 samples.
2. The network framework of the invention consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
2.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
2.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
2.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering: the neighborhood used for filtering is partitioned (classified) according to pixel values, a relatively high weight is assigned to the class to which the center pixel belongs, and a weighted sum over the neighborhood then produces the final result.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain. The filtered value at the kernel center is the neighborhood sum of f(i, j) weighted by d·r, normalized by the sum of these weights.
2.4) The invention is implemented and trained with the PyTorch framework. The encoders use the default configuration of Swin-S.
3. Based on the RGB features F_i^RGB and depth features F_i^Depth extracted in step 2, the invention fuses the features between the RGB encoder and the depth encoder using the proposed cross-modal residual fusion module, combining the features of the two modalities into a single fused feature. The fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
3.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
3.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
3.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
4. Through the above steps, the cross-modal fused features F_i are obtained. The semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result. The invention uses the UPerNet decoder as the semantic decoder, which offers high efficiency.
5. The semantic segmentation map P_pre predicted by the invention is compared with the manually annotated ground-truth segmentation map P_GT to compute the loss; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined. The invention uses the cross-entropy loss function:

L = -(1/N) Σ_n Σ_c y_{n,c} log(p_{n,c})

where N is the number of pixels, y_{n,c} is the one-hot ground-truth label of pixel n for class c, and p_{n,c} is the predicted probability that pixel n belongs to class c.
6. On the basis of the model structure and weight parameters determined in step 5, the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a flow chart of a bilateral filtering module
FIG. 3 is a schematic diagram of a cross-modal residual fusion module
FIG. 4 is a schematic diagram of an improved coordinate attention module
Detailed Description
The following describes the embodiments of the present invention more fully and clearly with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of this invention, fall within the scope of protection of the present invention.
Referring to fig. 1, a Transformer-based cross-modal fusion network RGB-D semantic segmentation method mainly includes the following steps:
1. RGB-D data sets for training and testing are acquired and collated.
1.1) The acquired datasets (the NYU Depth V2 dataset and the SUN RGB-D dataset) are sorted into the following categories: RGB images P_RGB, depth images P_Depth, and manually annotated ground-truth images P_GT.
1.2) The collected data are divided into a training set and a test set. NYU Depth V2 contains 1449 images in total, of which 795 are selected as the training set and the remaining 654 as the test set. SUN RGB-D consists of 10335 indoor RGB-D images, divided into a training set of 5285 samples and a test set of 5050 samples. A sketch of this data organization is given below.
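For illustration, the sketch below shows one way the P_RGB / P_Depth / P_GT triplets and the train/test split described above could be organized in PyTorch; the directory layout, file names and split-list files are assumptions for this sketch, not part of the patent.

```python
import os
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class RGBDSegmentationDataset(Dataset):
    """Returns one RGB-D sample as the triplet (P_RGB, P_Depth, P_GT)."""
    def __init__(self, root, split_file):
        self.root = root
        # split_file lists sample ids, one per line (assumed format)
        with open(split_file) as f:
            self.ids = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sid = self.ids[idx]
        rgb   = np.array(Image.open(os.path.join(self.root, "rgb",   sid + ".png")))
        depth = np.array(Image.open(os.path.join(self.root, "depth", sid + ".png")))
        gt    = np.array(Image.open(os.path.join(self.root, "label", sid + ".png")))
        return {"P_RGB": rgb, "P_Depth": depth, "P_GT": gt}

# NYU Depth V2: 795 training / 654 test images; SUN RGB-D: 5285 / 5050 samples.
train_set = RGBDSegmentationDataset("data/nyu_depth_v2", "splits/train_795.txt")
test_set  = RGBDSegmentationDataset("data/nyu_depth_v2", "splits/test_654.txt")
```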
2. The network framework of the invention consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
2.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
2.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
2.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering: the neighborhood used for filtering is partitioned (classified) according to pixel values, a relatively high weight is assigned to the class to which the center pixel belongs, and a weighted sum over the neighborhood then produces the final result.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain. The filtered value at the kernel center is the neighborhood sum of f(i, j) weighted by d·r, normalized by the sum of these weights.
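A minimal NumPy sketch of this depth pre-processing step is given below, assuming the validity map U and the kernel definitions above; the window radius and the σ values are illustrative placeholders, not the patent's settings.

```python
import numpy as np

def bilateral_filter_depth(depth, U, radius=3, sigma_d=2.0, sigma_r=0.1):
    """Bilateral filtering of a depth map; U is the binary validity map (0 = missing, 1 = valid)."""
    H, W = depth.shape
    out = depth.astype(np.float32).copy()
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_d ** 2))         # space-domain kernel d
    for i in range(radius, H - radius):
        for j in range(radius, W - radius):
            patch = depth[i - radius:i + radius + 1, j - radius:j + radius + 1].astype(np.float32)
            valid = U[i - radius:i + radius + 1, j - radius:j + radius + 1]
            if not valid.any():
                continue                                                   # no usable neighbours
            center = depth[i, j] if U[i, j] else patch[valid == 1].mean()  # fall back for holes
            rng = np.exp(-((patch - center) ** 2) / (2.0 * sigma_r ** 2))  # color-domain kernel r
            w = spatial * rng * valid                                      # invalid pixels get zero weight
            out[i, j] = float((w * patch).sum() / w.sum())
    return out

# usage: U marks the pixels where the sensor returned a reading
depth = np.random.rand(480, 640).astype(np.float32)
U = (depth > 0.05).astype(np.uint8)
filtered = bilateral_filter_depth(depth, U)
```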
2.4) The invention is implemented and trained with the PyTorch framework. The encoders use the default configuration of Swin-S.
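The structural sketch below illustrates the dual-stream, four-stage encoder arrangement described in 2.1–2.4. `SwinStage` is a hypothetical placeholder standing in for a real Swin-S stage (patch merging plus Swin Transformer blocks), not a library class; the channel widths follow the usual Swin-S defaults only as an assumption, and the depth input is assumed to be replicated to three channels.

```python
import torch
import torch.nn as nn

class SwinStage(nn.Module):
    """Placeholder for one Swin-S stage (patch merging + Swin Transformer blocks)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)    # stands in for patch merging
        self.blocks = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU())

    def forward(self, x):
        return self.blocks(self.down(x))

class DualStreamEncoder(nn.Module):
    """Two parallel four-stage backbones producing F_i^RGB and F_i^Depth for i = 1..4."""
    def __init__(self, dims=(96, 192, 384, 768)):                         # Swin-S-like widths (assumed)
        super().__init__()
        chans = [3] + list(dims)
        self.rgb_stages   = nn.ModuleList([SwinStage(chans[i], chans[i + 1]) for i in range(4)])
        self.depth_stages = nn.ModuleList([SwinStage(chans[i], chans[i + 1]) for i in range(4)])

    def forward(self, rgb, depth):
        feats = []
        for rs, ds in zip(self.rgb_stages, self.depth_stages):
            rgb, depth = rs(rgb), ds(depth)                                # F_i^RGB, F_i^Depth
            feats.append((rgb, depth))                                     # handed to the fusion modules
        return feats

# usage
enc = DualStreamEncoder()
features = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```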
3. Based on the RGB features F_i^RGB and depth features F_i^Depth extracted in step 2, the invention fuses the features between the RGB encoder and the depth encoder using the proposed cross-modal residual fusion module, combining the features of the two modalities into a single fused feature. The fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
3.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
3.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
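As a reference point, a minimal sketch of a standard coordinate attention block (Hou et al., 2021) is shown below; the patent's "improved" variant is not fully specified in this text, so this baseline is only an approximation of the module used in the first stage.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Standard coordinate attention: direction-aware pooling followed by per-axis gating."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        xh = self.pool_h(x)                              # (N, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (N, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                        # attention along height
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))    # attention along width
        return x * ah * aw
```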
3.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
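The sketch below is one possible reading of the CRFFM in 3.1.1–3.1.2, reusing the CoordinateAttention sketch above; the exact form of the feature-selection stage is not spelled out in this text, so the cross-modal residual connections here are an assumption rather than the patent's definitive design.

```python
import torch
import torch.nn as nn

class CRFFM(nn.Module):
    """Cross-modal residual feature fusion (simplified interpretation)."""
    def __init__(self, channels):
        super().__init__()
        self.ca_rgb = CoordinateAttention(channels)      # from the sketch above
        self.ca_dep = CoordinateAttention(channels)
        self.conv_rgb = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_dep = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_dep):
        a_rgb, a_dep = self.ca_rgb(f_rgb), self.ca_dep(f_dep)      # attention-enhanced features
        # symmetric feature selection: each branch absorbs complementary information via a residual link
        r_rgb = f_rgb + a_dep
        r_dep = f_dep + a_rgb
        x, y = self.conv_rgb(r_rgb), self.conv_dep(r_dep)          # Conv3x3 on each branch
        fused = self.conv_out(torch.cat([x * y, torch.maximum(x, y)], dim=1))  # multiply, max, concat, Conv3x3
        return fused, r_rgb, r_dep   # fused feature -> decoder; residual features -> next encoder stage
```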
4. Through the above steps, the cross-modal fused features F_i are obtained. The semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result. The invention uses the UPerNet decoder as the semantic decoder, which offers high efficiency.
5. The semantic segmentation map P_pre predicted by the invention is compared with the manually annotated ground-truth segmentation map P_GT to compute the loss; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined. The invention uses the cross-entropy loss function:

L = -(1/N) Σ_n Σ_c y_{n,c} log(p_{n,c})

where N is the number of pixels, y_{n,c} is the one-hot ground-truth label of pixel n for class c, and p_{n,c} is the predicted probability that pixel n belongs to class c.
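For concreteness, the snippet below shows a standard pixel-wise cross-entropy loss with back-propagation in PyTorch, as described in step 5; the number of classes and the ignore index are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

num_classes = 40                                    # e.g. the common 40-class NYU Depth V2 setting (assumed)
criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 marks unlabeled pixels (assumed convention)

logits = torch.randn(2, num_classes, 480, 640, requires_grad=True)  # stands in for the predicted P_pre
gt = torch.randint(0, num_classes, (2, 480, 640))                    # stands in for the annotated P_GT

loss = criterion(logits, gt)   # pixel-wise cross entropy between prediction and ground truth
loss.backward()                # back-propagation; an optimizer step would then update the weights
```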
6. On the basis of the model structure and weight parameters determined in step 5, the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
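A small sketch of the Pixel Acc metric used in step 6 is given below; the ignore index for unlabeled pixels is again an assumption.

```python
import torch

def pixel_accuracy(pred, gt, ignore_index=255):
    """pred and gt are integer label maps of the same shape; ignored pixels are excluded."""
    valid = gt != ignore_index
    correct = (pred == gt) & valid
    return correct.sum().item() / max(valid.sum().item(), 1)

# usage with dummy predictions
pred = torch.randint(0, 40, (480, 640))
gt = torch.randint(0, 40, (480, 640))
print(f"Pixel Acc: {pixel_accuracy(pred, gt):.4f}")
```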
The foregoing is a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (7)

1. A Transformer-based cross-modal fusion network RGB-D semantic segmentation method, characterized by comprising: acquiring and organizing image samples for training and testing, constructing a dual-stream encoder, extracting and fusing cross-modal features, and a bilateral filtering module for processing depth images.
2. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 1, wherein the data used comprise the NYU V2 dataset and the SUN RGB-D dataset, and each sample is divided into an RGB image P_RGB, a depth image P_Depth, and a manually annotated semantic segmentation image P_GT; the training set consists of 795 samples from the NYU V2 dataset and 5285 samples from the SUN RGB-D dataset, with the remaining samples used as the test set.
3. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 1, wherein the network framework consists of two parallel encoders (an RGB encoder and a depth encoder) that extract modality-specific features from the RGB image and the depth image respectively, and a semantic decoder that generates the final semantic segmentation result.
3.1) Two parallel, independent backbones extract features from the RGB and depth inputs respectively, and the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
3.2) The RGB and depth inputs are passed through the two parallel encoder backbones, each consisting of four sequential Transformer blocks, yielding four stages of RGB features and depth features, denoted F_i^RGB and F_i^Depth (i = 1, ..., 4), respectively.
3.3) In general, existing depth sensors have difficulty measuring the depth of highly reflective or light-absorbing surfaces, because the measurement may be affected by the physical environment. Conventional depth sensors, such as the Kinect, simply return a null value when the depth cannot be measured accurately. In these cases, we represent the uncertainty as a binary map U ∈ {0, 1}^{H×W}, where 0 indicates that there is no sensor reading at that location and 1 indicates a valid sensor reading. For the depth image measured by the sensor, the depth-uncertainty problem is addressed with bilateral filtering.
A space-domain kernel is generated with a two-dimensional Gaussian function, and a color-domain kernel is generated with a one-dimensional Gaussian function:

d(i, j, k, l) = exp(-((i - k)^2 + (j - l)^2) / (2σ_d^2))

r(i, j, k, l) = exp(-((f(i, j) - f(k, l))^2) / (2σ_r^2))

wherein (k, l) is the kernel-center coordinate and (i, j) is a neighborhood coordinate inside the kernel; σ_d is the standard deviation of the spatial Gaussian and σ_r that of the range Gaussian. f(i, j) represents the gray value of the image at (i, j), and the other symbols are consistent with the space domain.
4. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 3, wherein the output of each encoder block is fused with the proposed cross-modal residual fusion module, combining the features of the RGB encoder and the depth encoder into a single fused feature; the fusion module takes its input from the RGB branch and the depth branch and returns the updated features to the corresponding encoder of the next block, enhancing the complementarity of the features between the two modalities.
4.1) First, the invention designs a cross-modal residual feature fusion module (Cross-Modal Residual Feature Fusion Module, CRFFM), which first selects, from one modality, the features that are complementary to the other modality, and then performs feature fusion across modalities and levels.
4.1.1) In the first stage of the fusion module, the RGB image features and the depth image features are each fed into an improved coordinate attention module (Coordinate Attention, CAM) to enhance the feature representation. The RGB features and depth features then pass through a symmetric feature-selection stage, in which complementary information from the other modality is selected for a residual connection; the residually connected features serve as the input to the encoder of the next stage and as the input to the fusion stage.
4.1.2) The residually connected RGB features and depth features are each passed through a Conv3×3 convolution; cross element-wise multiplication and element-wise maximization are then performed, the features produced by the two operations are concatenated, and a final Conv3×3 convolution outputs the fused feature.
5. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 4, wherein the semantic decoder takes the fused feature of each fusion module as input to generate the final segmentation result.
6. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 5, wherein the semantic segmentation map P_pre predicted by the method is compared with the manually annotated ground-truth segmentation map P_GT to compute a loss function; the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, and the structure and weight parameters of the RGB-D semantic segmentation algorithm are finally determined.
7. The Transformer-based cross-modal fusion network RGB-D semantic segmentation method according to claim 6, wherein the RGB-D images in the test set are tested to generate semantic segmentation maps, which are evaluated with the pixel accuracy (Pixel Acc) metric.
CN202310401129.2A 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network Pending CN116452805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310401129.2A CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310401129.2A CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Publications (1)

Publication Number Publication Date
CN116452805A true CN116452805A (en) 2023-07-18

Family

ID=87129776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310401129.2A Pending CN116452805A (en) 2023-04-15 2023-04-15 Transformer-based RGB-D semantic segmentation method of cross-modal fusion network

Country Status (1)

Country Link
CN (1) CN116452805A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036891A (en) * 2023-08-22 2023-11-10 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117036891B (en) * 2023-08-22 2024-03-29 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117115061A (en) * 2023-09-11 2023-11-24 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium
CN117115061B (en) * 2023-09-11 2024-04-09 北京理工大学 Multi-mode image fusion method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination