CN116486112A - RGB-D salient object detection method based on a lightweight cross-modal fusion network - Google Patents
- Publication number
- CN116486112A (application CN202310410912.5A)
- Authority
- CN
- China
- Prior art keywords
- rgb
- features
- modal
- saliency
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/768—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention belongs to the field of computer vision and provides an RGB-D salient object detection method based on a lightweight cross-modal fusion network, comprising the following steps: 1) acquire an RGB-D dataset for training and testing the task and define the algorithm target of the invention; 2) construct an encoder for extracting RGB image features and an encoder for extracting depth image features; 3) establish a cross-modal feature fusion network that enhances the expression of the RGB and depth image features through a progressively guided attention mechanism; 4) on the multi-modal features produced by cross-modal fusion, construct a lightweight global context integration module to extract multi-scale context features of the fused modality; 5) construct a simple and efficient multi-path aggregation module to integrate the fused features with the original RGB and depth-map features, and obtain the final predicted saliency map through an activation function.
Description
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to an efficient, lightweight RGB-D (red-green-blue plus depth) salient object detection method.
Background:
The task of salient object detection (SOD) is to find the most visually attractive objects in a scene by simulating the human visual attention mechanism. It is instructive for many computer vision tasks, including weakly supervised semantic segmentation, visual tracking, object recognition, and video analysis. Existing SOD methods focus mainly on processing RGB images and achieve good performance. However, they can only exploit the visual cues in RGB images, and in some complex scenes, such as those with cluttered backgrounds or similar foreground and background, they encounter serious obstacles. The main reason is that RGB images provide rich visual cues but lack explicit spatial structure information. Meanwhile, with the popularization of depth sensors, depth maps can now be acquired conveniently. The depth information embedded in a complex scene supplements the spatial structure information and helps the RGB modality achieve robust saliency detection. Owing to the introduction of depth maps, RGB-D SOD has made tremendous progress in recent years.
Many RGB-D SOD methods benefit greatly from very deep and wide models and achieve remarkable results. However, this success comes at the cost of a heavy computational burden and slow running speed: such models increase the depth and width of the network by enlarging the number of layers and channels, which brings huge parameter counts and computation. Considering the computation and memory consumption of a model, the invention designs an efficient lightweight cross-modal fusion network for RGB-D SOD to realize lightweight and efficient RGB-D salient object detection and segmentation. Specifically, a cross-modal feature interaction module (CMI) for fusing RGB and depth maps is first proposed. Context information is extracted from each single modality by depth-separable convolution, the RGB and depth-map features are enhanced separately by a progressively guided attention mechanism (PAG), and the features of all modalities are integrated by a multi-source feature integration unit (MAU). Finally, considering the saliency information retained in the original RGB and depth maps, the invention designs a multi-path aggregation module (MPA) in the decoder to integrate the fused features from different layers in a coarse-to-fine manner.
Summary of the invention:
To address these problems, the invention provides an RGB-D salient object detection method based on a lightweight cross-modal fusion network, which adopts the following technical scheme:
1. an RGB-D dataset is acquired that trains and tests the task.
1.1) The NJU2K and NLPR datasets serve as training sets; the remaining NJU2K data, the remaining NLPR data, and the SIP, STERE, and DES datasets serve as test sets.
1.2) Each RGB-D image pair in the dataset consists of a color image I_RGB, a corresponding depth image I_Depth, and a corresponding manually annotated salient-object segmentation map G.
2. A salient-object detection model network is constructed with a convolutional neural network to extract RGB image features and depth image features.
2.1) MobileNet-v3 serves as the backbone network of the model, extracting from each input image pair the RGB image features f_i^r and the depth image features f_i^d, respectively.
3. Based on the multi-scale RGB image features f_i^r extracted in step 2 and the corresponding depth image features f_i^d, the extracted features of each level are used for cross-modal feature fusion. Because the lowest-level features contain too much noise, they are not used here; only the features of levels 2, 3, 4, and 5 are fused. The depth encoder likewise uses only the level-2 to level-5 features.
3.1) The cross-modal feature interaction network consists of four levels of CMI modules; it takes the four levels of RGB image features f_i^r and the corresponding depth image features f_i^d and generates four levels of multi-modal features F_i^fusion.
3.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and its multi-source integration unit outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
3.3 The CMI module generates multi-modal features through a progressively guided attention mechanism, the specific process is as follows:
3.3.1) First, a depth-separable convolution module extracts features from each single modality, enhancing the saliency expression capability of the features; this convolution module further strengthens the expression of the RGB and depth features.
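The depth-separable convolution used throughout the CMI module can be sketched as follows. This is an illustrative NumPy implementation (naive loops, "same" padding, stride 1), not the patent's MobileNet-v3 code; all shapes and names are assumptions.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: (C, H, W); dw_kernels: (C, k, k), one spatial filter per channel;
    pw_weights: (C_out, C), 1x1 pointwise channel mixing."""
    C, H, W = x.shape
    k = dw_kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # Depthwise stage: each channel is convolved with its own k x k filter.
    dw = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * dw_kernels[c])
    # Pointwise stage: a 1x1 convolution mixes channels.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))  # (C_out, H, W)
```

The lightweight appeal is in the parameter count: C·k² + C_out·C weights instead of the C_out·C·k² of a standard convolution.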
3.3.2) Two parallel progressively guided attention mechanisms then perform further feature extraction and enhancement on the RGB and depth features, respectively. To obtain the global information of each single modality, two parallel channel attentions extract features from the RGB and depth maps:

X_r = Sigmoid(AVG(DSConv(f_i^r))) ⊗ DSConv(f_i^r), X_d = Sigmoid(AVG(DSConv(f_i^d))) ⊗ DSConv(f_i^d)

where DSConv(·) denotes a depth-separable convolution module, AVG(·) a channel-wise global average pooling operation, Sigmoid(·) the sigmoid activation function, and ⊗ channel-wise multiplication.
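The channel-attention step above can be illustrated with a minimal NumPy sketch (the DSConv stage is omitted for brevity; function and variable names are illustrative, not the patent's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f):
    """f: (C, H, W). Returns f reweighted by per-channel sigmoid gates."""
    gap = f.mean(axis=(1, 2))        # AVG(): one global descriptor per channel
    gate = sigmoid(gap)              # Sigmoid(): per-channel weight in (0, 1)
    return f * gate[:, None, None]   # broadcast the gates back onto f
```

Channels whose global response is strong are passed through almost unchanged, while weakly responding channels are suppressed.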
3.3.3) After obtaining the global information X_r and X_d, local details are learned on X_r and X_d to prevent the loss of fine local details of salient objects; a multi-scale spatial attention mechanism generates spatial feature maps with multiple receptive fields:

Z_r = Sigmoid(Conv3(Cat(C_1(max(X_r)), C_3(max(X_r)), C_5(max(X_r))))) ⊗ X_r, Z_d = Sigmoid(Conv3(Cat(C_1(max(X_d)), C_3(max(X_d)), C_5(max(X_d))))) ⊗ X_d

where Conv3(·) denotes a convolution module with a 3×3 kernel, max(·) a max-pooling operation, Cat(·) a concatenation (stitching) operation, and C_1(·), C_3(·), C_5(·) dilated (atrous) convolutions with dilation rates 1, 3, and 5, respectively.
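A hedged NumPy sketch of the multi-scale spatial attention: the channel dimension is reduced by a max operation, three 3×3 dilated convolutions with rates 1, 3, and 5 are applied, and their averaged response gates the feature map through a sigmoid. The exact combination in the patent may differ; kernel handling and shapes here are assumptions.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """x: (H, W) single map; kernel: (3, 3); 'same' padding, stride 1."""
    H, W = x.shape
    pad = rate  # effective radius of a 3x3 kernel at this dilation rate
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # sample the 3x3 neighbourhood at rate-spaced offsets
            patch = xp[i:i + 2 * pad + 1:rate, j:j + 2 * pad + 1:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def multiscale_spatial_attention(x, k1, k3, k5):
    """x: (C, H, W). Channel max -> dilated convs at rates 1/3/5 ->
    stack ('Cat') -> average -> sigmoid gate applied spatially."""
    m = x.max(axis=0)                                 # max(): channel-wise max map
    maps = np.stack([dilated_conv2d(m, k, r)          # C1 / C3 / C5 branches
                     for k, r in ((k1, 1), (k3, 3), (k5, 5))])
    gate = 1.0 / (1.0 + np.exp(-maps.mean(axis=0)))   # sigmoid spatial gate
    return x * gate[None]
```

The averaging here stands in for the patent's Conv3-over-Cat fusion of the three branches.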
3.3.4) A multi-source integration unit integrates all the features Z_r, Z_d, X_r, and X_d, fusing the RGB feature Z_r and the depth feature Z_d to finally obtain the fused feature F_i^fusion. Here i ∈ {2,3,4,5} denotes the level of the model at which the feature sits; the unit combines a convolution with a 1×1 kernel (Conv1), a depth-separable convolution (DSConv), a feature stitching operation (Cat), and an element-wise addition (add).
4) Through the above operations, four levels of multi-modal features F_i^fusion (i ∈ {2,3,4,5}) are extracted. These four levels are input to the context-information extraction module, where convolutions at multiple levels and of different sizes enlarge the receptive-field information of the multi-modal features and promote the expression of salient objects.
4.1) Context information is extracted from each fused feature by the context operation:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction module.
4.2) The context-enhanced modal features generated in the above steps are input to the decoder, where a multi-path aggregation module integrates the fused features together with the RGB features f_i^r and depth features f_i^d of each level:

S_out = Sigmoid(Deconv(MPA(F_i^gcm, f_i^r, f_i^d)))

where MPA(·) denotes the multi-path aggregation module, Deconv(·) a deconvolution operation, and S_out the predicted saliency map; i ∈ {2,3,4,5} indexes the level of the fused feature. The final saliency map S_out is obtained from the last decoding stage.
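The coarse-to-fine aggregation performed by the MPA decoder can be illustrated as repeated upsample-and-add over the level features, followed by a sigmoid. This is a stand-in sketch, not the patented module; nearest-neighbour upsampling replaces the Deconv operation.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for a (C, H, W) tensor,
    standing in for Deconv()."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def coarse_to_fine_decode(feats):
    """feats: list of (C, H_i, W_i) fused features, finest first, coarsest last.
    Each step upsamples the running estimate and adds the next-finer feature,
    mimicking coarse-to-fine aggregation. Returns a map in (0, 1)."""
    out = feats[-1]
    for f in reversed(feats[:-1]):
        out = upsample2x(out) + f
    return 1.0 / (1.0 + np.exp(-out))   # sigmoid -> saliency map
```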
5) A loss function is computed between the saliency map S_out predicted by the invention and the manually annotated salient-object segmentation map G; the parameter weights of the proposed model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
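The patent does not name the loss explicitly; binary cross-entropy is a common choice for supervising saliency maps and is sketched here as an assumption:

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    """Binary cross-entropy between a predicted saliency map 'pred'
    (values in (0, 1)) and a binary ground-truth mask 'gt', averaged
    over pixels. Clipping avoids log(0)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)).mean())
```

During training, the gradient of this loss with respect to the network parameters is what SGD back-propagates.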
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps S_test, which are evaluated with the MAE, S-measure, F-measure, and E-measure evaluation indices.
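Of the evaluation indices in step 6, the MAE is the simplest and can be computed as follows (S-measure, F-measure, and E-measure require more involved formulations and are omitted here):

```python
import numpy as np

def mae(saliency, gt):
    """Mean Absolute Error between a predicted saliency map and the
    ground-truth mask, both normalised to [0, 1]. Lower is better."""
    return float(np.abs(saliency.astype(float) - gt.astype(float)).mean())
```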
The invention uses the lightweight MobileNet-v3 as the backbone network, avoiding heavy computation. Unlike previous fusion methods, excessive modal interaction is avoided here so as not to create invalid fusion features; instead, a simple and efficient attention-guiding mechanism enhances the representation capability of the features, and a multi-source integration unit performs the final fusion. In the decoder, to obtain more effective saliency information, a simple multi-path integration module produces the final saliency map. To make the whole network more lightweight, depth-separable convolutions are used to learn the features. The method also exhibits a degree of robustness.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a global context module
FIG. 4 is a schematic diagram of a multi-path aggregation module
FIG. 5 is a schematic diagram of model training and testing
Detailed Description
The following describes embodiments of the invention with reference to the accompanying drawings, in which embodiments of the invention are shown by way of illustration only; they are not all embodiments in which the invention may be practiced. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of the present invention, fall within the scope of the present invention.
Referring to fig. 1, an RGB-D saliency target detection method based on a lightweight cross-modal fusion network mainly includes the following steps:
1. The RGB-D datasets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set for training and testing the algorithm are determined. The NJU2K and NLPR datasets serve as training sets, and the remaining datasets serve as test sets, comprising the remaining NJU2K data, the remaining NLPR data, and the SIP, STERE, and DES datasets.
2. A salient-object detection backbone network for extracting RGB image features and depth image features is constructed with a MobileNet-v3 network; it comprises an encoder for extracting RGB image features and an encoder for extracting depth image features:
2.1. The three-channel RGB image is input to the RGB encoder, generating four levels of RGB features f_2^r, f_3^r, f_4^r, and f_5^r. Because the lowest-level features contain too much noise, they are not used here; only the features of levels 2, 3, 4, and 5 are fused. The depth encoder likewise uses only the level-2 to level-5 features.
2.2. The three-channel depth image is input to the depth encoder, generating four levels of depth image features f_2^d, f_3^d, f_4^d, and f_5^d.
3. Referring to fig. 2, the four levels of RGB image features f_i^r generated in step 2 and the corresponding depth image features f_i^d are fused across modalities by the cross-modal fusion module to obtain four levels of multi-modal features F_i^fusion. The main steps are as follows:
3.1. The cross-modal feature fusion network consists of four levels of CMI modules; it takes the four levels of RGB image features f_i^r and the corresponding depth image features f_i^d and generates four levels of multi-modal features F_i^fusion.
3.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and its multi-source integration unit outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
3.3 The CMI module generates multi-modal features through a progressively guided attention mechanism, the specific process is as follows:
3.3.1) First, a depth-separable convolution module extracts features from each single modality, enhancing the saliency expression capability of the features; this convolution module further strengthens the expression of the RGB and depth features.
3.3.2) Two parallel progressively guided attention mechanisms then perform further feature extraction and enhancement on the RGB and depth features, respectively. To obtain the global information of each single modality, two parallel channel attentions extract features from the RGB and depth maps:

X_r = Sigmoid(AVG(DSConv(f_i^r))) ⊗ DSConv(f_i^r), X_d = Sigmoid(AVG(DSConv(f_i^d))) ⊗ DSConv(f_i^d)

where DSConv(·) denotes a depth-separable convolution module, AVG(·) a channel-wise global average pooling operation, Sigmoid(·) the sigmoid activation function, and ⊗ channel-wise multiplication.
3.3.3) After obtaining the global information X_r and X_d, local details are learned on X_r and X_d to prevent the loss of fine local details of salient objects; a multi-scale spatial attention mechanism generates spatial feature maps with multiple receptive fields:

Z_r = Sigmoid(Conv3(Cat(C_1(max(X_r)), C_3(max(X_r)), C_5(max(X_r))))) ⊗ X_r, Z_d = Sigmoid(Conv3(Cat(C_1(max(X_d)), C_3(max(X_d)), C_5(max(X_d))))) ⊗ X_d

where Conv3(·) denotes a convolution module with a 3×3 kernel, max(·) a max-pooling operation, Cat(·) a concatenation (stitching) operation, and C_1(·), C_3(·), C_5(·) dilated (atrous) convolutions with dilation rates 1, 3, and 5, respectively.
3.3.4) A multi-source feature integration unit integrates all the features Z_r, Z_d, X_r, and X_d, fusing the RGB feature Z_r and the depth feature Z_d to finally obtain the fused feature F_i^fusion. Here i ∈ {2,3,4,5} denotes the level of the model at which the feature sits; the unit combines a convolution with a 1×1 kernel (Conv1), a depth-separable convolution (DSConv), a feature stitching operation (Cat), and an element-wise addition (add).
4. Referring to fig. 3, the four levels of multi-modal features F_i^fusion extracted above are input to the context-information extraction module, where convolutions at multiple levels and of different sizes enlarge the receptive-field information of the multi-modal features and promote the expression of salient objects.
4.1) Context information is extracted from each fused feature by the context operation:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction module.
4.2) Referring to fig. 4, the context-enhanced modal features generated in the above steps are input to the decoder, where a multi-path aggregation module integrates the fused features together with the RGB features f_i^r and depth features f_i^d of each level:

S_out = Sigmoid(Deconv(MPA(F_i^gcm, f_i^r, f_i^d)))

where MPA(·) denotes the multi-path aggregation module, Deconv(·) a deconvolution operation, and S_out the predicted saliency map; i ∈ {2,3,4,5} indexes the level of the fused feature. The final saliency map S_out is obtained from the last decoding stage.
5) A loss function is computed between the saliency map S_out predicted by the invention and the manually annotated salient-object segmentation map G; the parameter weights of the proposed model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps S_test, which are evaluated with the MAE, S-measure, F-measure, and E-measure evaluation indices.
The foregoing is a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
Claims (5)
1. An RGB-D salient object detection method based on a lightweight cross-modal fusion network, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing an algorithm;
2) Constructing an encoder for extracting RGB image features and an encoder for extracting depth image features;
3) Establishing a lightweight network for fusing RGB features and depth-map features, and guiding the fusion of the RGB features and the depth image features through depth-separable convolution and an attention mechanism;
4) Based on the multi-modal characteristics fused by the cross-modal characteristics, a multi-scale context capturing mechanism is constructed to extract multi-modal characteristic context information;
5) Establishing a simple and efficient multipath aggregation decoder for fusing RGB, depth features and fusion features, and obtaining a final predicted saliency map through an activation function;
6) Computing a loss function from the predicted saliency map P and the manually annotated salient-object segmentation map G, gradually updating the parameter weights of the model through SGD and a back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
7) On the basis of the model structure and parameter weights determined in step 6), testing RGB-D image pairs on the test set, generating saliency maps S, and performing performance evaluation with the evaluation indices.
2. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 2) is as follows:
2.1) The NJU2K and NLPR datasets serve as training sets, and the remaining NLPR data, the remaining NJU2K data, and the SIP, STERE, and DES datasets serve as test sets.
2.2) Each RGB-D image pair in the dataset consists of a single color image I_RGB, a corresponding depth image I_Depth, and a corresponding manually annotated salient-object segmentation map G.
3. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 3) is as follows:
3.1) MobileNet-v3 is used as the backbone network of the model, extracting the RGB image features f_i^r and the corresponding depth image features f_i^d, respectively.
3.2) The MobileNet-v3 backbone weights are initialized with MobileNet-v3 parameter weights pre-trained on the ImageNet dataset.
4. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 4) is as follows:
4.1) The cross-modal feature fusion network consists of four levels of CMI modules and generates four levels of multi-modal features F_i^fusion.
4.2) The input of the i-th level CMI module is the pair (f_i^r, f_i^d), and the progressively guided attention mechanism outputs the i-th level multi-modal feature F_i^fusion, where i ∈ {2,3,4,5}.
5. The RGB-D salient object detection method based on the lightweight cross-modal fusion network of claim 1, characterized in that the specific method of step 5) is as follows:
5.1) Multi-scale depth-separable convolution operations with different kernel sizes are applied to obtain multiple receptive fields, which capture rich context information:

F_i^gcm = GCM(F_i^fusion)

where i ∈ {2,3,4,5} denotes the level of the fused feature and GCM(·) denotes the context feature extraction operation.
6) The four levels of multi-modal features with multiple receptive fields obtained in step 5 are input to the decoder formed by the multi-path integration network to obtain the final fused feature, which is activated by a sigmoid function to yield the predicted saliency map S:

S = Sigmoid(MPA(F_2^gcm, F_3^gcm, F_4^gcm, F_5^gcm))

where MPA(·) denotes the multi-path aggregation module.
7) The loss function is computed from the predicted saliency map S and the manually annotated salient-object segmentation map G; the parameter weights of the model are updated step by step via SGD and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D saliency detection algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410912.5A CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310410912.5A CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116486112A true CN116486112A (en) | 2023-07-25 |
Family
ID=87222602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310410912.5A Pending CN116486112A (en) | 2023-04-18 | 2023-04-18 | RGB-D significance target detection method based on lightweight cross-modal fusion network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116486112A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |