CN116310396A - RGB-D salient object detection method based on depth quality weighting - Google Patents
RGB-D salient object detection method based on depth quality weighting
- Publication number
- CN116310396A CN116310396A CN202310201765.0A CN202310201765A CN116310396A CN 116310396 A CN116310396 A CN 116310396A CN 202310201765 A CN202310201765 A CN 202310201765A CN 116310396 A CN116310396 A CN 116310396A
- Authority
- CN
- China
- Prior art keywords
- depth
- rgb
- modal
- dataset
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 28
- 230000004927 fusion Effects 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 23
- 230000007246 mechanism Effects 0.000 claims abstract description 11
- 238000012549 training Methods 0.000 claims abstract description 11
- 238000000034 method Methods 0.000 claims abstract description 9
- 238000011156 evaluation Methods 0.000 claims abstract description 8
- 238000001303 quality assessment method Methods 0.000 claims abstract description 8
- 230000011218 segmentation Effects 0.000 claims abstract description 6
- 230000000295 complement effect Effects 0.000 claims abstract description 5
- 230000002457 bidirectional effect Effects 0.000 claims abstract 2
- 230000003213 activating effect Effects 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 3
- 230000001419 dependent effect Effects 0.000 abstract description 3
- 238000000605 extraction Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 6
- 239000011159 matrix material Substances 0.000 description 6
- 230000008447 perception Effects 0.000 description 6
- 238000011176 pooling Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 3
- 238000007500 overflow downdraw method Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000002411 adverse Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000004807 localization Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of computer vision and provides an RGB-D salient object detection method based on depth quality weighting, comprising the following steps: 1) acquiring RGB-D datasets for training and testing the task and defining the algorithm target of the invention; 2) constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features; 3) constructing a cross-modal weighted fusion module that guides the weighted fusion of the extracted RGB and Depth image features through a weight-guided depth quality assessment mechanism; 4) constructing a bidirectional scale-dependent convolution mechanism for multi-scale feature extraction and fusion, enhancing the high-level semantic information of the multi-modal features; 5) building a decoder to generate the saliency map P_est; 6) computing the loss between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT; 7) testing on the test datasets to generate saliency maps P_est and performing performance evaluation with the evaluation indices. The method effectively integrates complementary information from images of the two modalities and improves the accuracy of salient object prediction in complex scenes.
Description
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to an RGB-D salient object detection method based on depth quality weighting.
Background art:
In the field of computer vision and image processing, Salient Object Detection (SOD) aims to identify and segment the most visually attention-grabbing objects or regions in given data (e.g., RGB images, RGB-D images, video, etc.) by simulating the human visual attention mechanism, and has been widely used in various computer vision tasks such as semantic segmentation, image compression, and object tracking.
Because single-modality RGB salient object detection algorithms face challenging factors such as complex backgrounds and difficult illumination conditions, it is hard for them to locate salient objects against cluttered backgrounds. One way to overcome these challenges is to use a depth map to compensate for the spatial information missing from the RGB image. RGB images contain detail information (e.g., rich texture, color, and visual cues), while depth maps provide spatial information that expresses geometry and distance. Combining RGB images with depth maps for the SOD task (termed RGB-D SOD) is therefore a reasonable choice: it can handle more complex scenes and meet the requirements of advanced detection.
Although existing RGB-D SOD approaches have made significant progress, most ignore the problem that low-quality depth maps can adversely affect the RGB-D SOD task. A high-quality depth map has clear boundaries and accurate object localization, which benefits SOD. A low-quality depth map, however, not only blurs object edges but also localizes objects inaccurately, which may introduce noise into cross-modal feature fusion and thus degrade SOD performance. The quality of the depth map must therefore be taken into account in the RGB-D SOD task.
Considering that low-quality depth maps inevitably affect salient object detection, the invention explores an efficient cross-modal feature fusion method that effectively reduces their influence on saliency detection. In addition, to further exploit the complementary information among multi-scale features, the correlation between high-level features and long-range information is fully utilized, and multi-scale feature fusion is mined to help the saliency detection model predict salient objects more accurately.
Summary of the invention:
Aiming at the above problems, the invention provides an RGB-D salient object detection method based on depth quality weighting, which specifically adopts the following technical scheme:
1. RGB-D datasets for training and testing the task are acquired.
Portions of the NJUD, NLPR, and DUT-RGBD datasets are used as the training set; the remaining portions of the NJUD and NLPR datasets, together with the SIP, LFSD, and RGBD135 datasets, are used as the test sets.
2. A salient object detection network for extracting RGB image features and Depth image features is constructed using a convolutional neural network.
2.1 VGG16 is used as the backbone network of the model for extracting the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i ∈ {1, ..., 5} denotes the level, corresponding to each stage output of VGG16.
2.2 The VGG16 weights used to construct the backbone network of the invention are initialized with VGG16 parameter weights pre-trained on the ImageNet dataset.
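For illustration, a minimal PyTorch sketch of the two-stream VGG16 encoder described in steps 2.1-2.2 follows. The stage boundaries, the replication of the single-channel depth map to three channels, and all variable names are assumptions of this sketch, not details fixed by the invention:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGGEncoder(nn.Module):
    """Returns the 5 levels of features f_1..f_5 (one per VGG16 conv stage)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # Split VGG16 at its pooling layers into 5 convolutional stages.
        self.stages = nn.ModuleList([
            feats[0:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30],
        ])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return outs  # channel widths: 64, 128, 256, 512, 512

rgb_encoder, depth_encoder = VGGEncoder(), VGGEncoder()
rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 3, 256, 256)  # depth map repeated to 3 channels
f_r, f_d = rgb_encoder(rgb), depth_encoder(depth)  # f_i^r and f_i^d
```

The two encoders share the architecture but not the weights, so each modality learns its own representation.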
3. Based on the multi-scale RGB image features f_i^r extracted in step 2 and the corresponding Depth image features f_i^d, multi-scale cross-modal weighted feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1 The cross-modal feature fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate the 5 levels of multi-modal features f_i^m.
3.2 The input of the CMWF module at level i consists of f_i^r and f_i^d; the module generates the multi-modal feature f_i^m of level i through the weight-guided depth quality assessment mechanism.
3.3 The CMWF module generates the multi-modal feature through the weight-guided depth quality assessment mechanism as follows:
3.3.1 First, the invention constructs a channel-spatial attention feature enhancement module that filters and enhances the features to strengthen their saliency expression capability. Through this module, unnecessary noise is further removed and common salient objects are emphasized, yielding the enhanced features f_i^{r'} and f_i^{d'}. The module applies channel attention CA_i^c followed by spatial attention SA_i^c to f_i^c, i.e., f_i^{c'} = Multi(SA_i^c, Multi(CA_i^c, f_i^c)),
where c ∈ {r, d}, CA_i^c and SA_i^c denote the channel attention and spatial attention at level i and are built from GAP (global average pooling), GMP (global max pooling), Cat (the feature concatenation operation), Conv_k (a convolution with kernel size k × k), and the Sigmoid activation function; Multi denotes element-wise matrix multiplication.
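A hedged PyTorch sketch of such a channel-spatial attention enhancement module is given below. Since the exact attention formulas are rendered as images in the original, the CBAM-style arrangement of the named operators (GAP, GMP, Cat, Conv_k, Sigmoid, Multi) and the spatial kernel size k = 7 are assumptions:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Filters a feature map with channel attention, then spatial attention."""
    def __init__(self, channels, k=7):
        super().__init__()
        self.channel_fc = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)

    def forward(self, f):
        # Channel attention CA from GAP/GMP descriptors (Cat, Conv_1, Sigmoid).
        gap = f.mean(dim=(2, 3), keepdim=True)
        gmp = f.amax(dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.channel_fc(torch.cat([gap, gmp], dim=1)))
        f = f * ca  # Multi: element-wise multiplication
        # Spatial attention SA from per-pixel mean/max over channels.
        avg = f.mean(dim=1, keepdim=True)
        mx = f.amax(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))
        return f * sa  # enhanced feature f_i^{c'}
```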
3.3.2 The difference between the two modalities at the feature level is embodied by computing the difference between the enhanced RGB attention feature map and the enhanced depth attention feature map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i:
λ_i = ||Subtra(f_i^{r'}, f_i^{d'})|| / ||f_i^{r'}||
where Subtra denotes element-wise matrix subtraction, || · || denotes the mean-absolute operation (the average of absolute values over the H × W spatial positions), and H and W are the height and width of the feature f.
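Read this way, λ_i can be computed per sample as below; a minimal sketch assuming batched PyTorch tensors, with an epsilon added by this sketch to avoid division by zero:

```python
import torch

def depth_quality_weight(f_r_enh: torch.Tensor, f_d_enh: torch.Tensor,
                         eps: float = 1e-8) -> torch.Tensor:
    """lambda_i: mean |f_r' - f_d'| normalized by mean |f_r'| (step 3.3.2)."""
    diff = (f_r_enh - f_d_enh).abs().mean(dim=(1, 2, 3), keepdim=True)
    norm = f_r_enh.abs().mean(dim=(1, 2, 3), keepdim=True)
    return diff / (norm + eps)  # shape (B, 1, 1, 1), broadcastable
```

A large λ_i indicates a large discrepancy between the enhanced RGB and depth features, which the fusion step below uses to weight the depth branch.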
3.3.3 A cross-enhancement strategy is further adopted: the original RGB feature f_i^r and the original depth feature f_i^d are enhanced with the channel-spatial-attention-enhanced features of the opposite modality, f_i^{d'} and f_i^{r'} respectively, yielding the cross-enhanced features f_i^{rd} and f_i^{dr}.
3.3.4 After the weighting coefficient and the cross-enhanced features are obtained, the weighted fusion method fuses the cross-enhanced RGB feature f_i^{rd} with the corresponding cross-enhanced Depth feature f_i^{dr}, weighted by λ_i, to obtain the fused feature f_i^m, where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature resides, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
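Combining the pieces, a hedged sketch of one CMWF level follows, reusing the ChannelSpatialAttention and depth_quality_weight sketches above. The exact fusion formula is rendered as an image in the original, so this particular arrangement of the cross enhancement (Add), the λ_i-weighting, and the concatenation (Cat) is an assumption:

```python
import torch
import torch.nn as nn

class CMWF(nn.Module):
    """One level of cross-modal weighted fusion (steps 3.3.1-3.3.4)."""
    def __init__(self, channels):
        super().__init__()
        self.enhance_r = ChannelSpatialAttention(channels)
        self.enhance_d = ChannelSpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_r, f_d):
        f_r_enh, f_d_enh = self.enhance_r(f_r), self.enhance_d(f_d)
        lam = depth_quality_weight(f_r_enh, f_d_enh)  # lambda_i
        # Cross enhancement: each original feature is boosted by the
        # enhanced feature of the opposite modality (Add).
        f_rd = f_r + f_d_enh
        f_dr = f_d + f_r_enh
        # Weighted fusion: the depth branch is scaled by lambda_i, then both
        # branches are concatenated (Cat) and projected back to `channels`.
        return self.fuse(torch.cat([f_rd, lam * f_dr], dim=1))  # f_i^m
```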
4) Through the above operations, the 5 levels of multi-modal features f_i^m are extracted; the features of the 4th and 5th levels are input to the bidirectional scale-dependent convolution module, and the receptive-field information and high-level semantic information of the multi-modal features are enhanced through depth-separable convolution operations.
4.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
4.2 All the multi-scale features are concatenated and a residual connection is added to obtain the high-level feature f_c^h, where c ∈ {4, 5} and A denotes global average pooling.
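A hedged sketch of this module is shown below, following Eqs. (9)-(10): a cascade of depth-separable convolutions with kernel sizes 3, 5, 7, and 9, each with a residual connection back to the input R, followed by concatenation and a residual connection (step 4.2). The bidirectional pathway of FIG. 3 and the global-average-pooling term A are simplified away in this sketch:

```python
import torch
import torch.nn as nn

def dconv(channels, k):
    # Depth-separable convolution: depthwise k x k, then pointwise 1 x 1.
    return nn.Sequential(
        nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
        nn.Conv2d(channels, channels, 1),
    )

class ScaleDependentConv(nn.Module):
    """Multi-scale receptive-field enhancement per Eqs. (9)-(10)."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(dconv(channels, k) for k in (3, 5, 7, 9))
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, r):
        outs, x = [], r
        for branch in self.branches:
            x = branch(x) + r        # R_i = DConv_{2i+1}(R_{i-1}) + R
            outs.append(x)
        out = self.fuse(torch.cat(outs, dim=1))  # concatenate all scales
        return out + r               # residual connection of step 4.2
```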
4.3 The low-level features f_1^m, f_2^m, and f_3^m generated in the above steps and the high-level features f_4^h and f_5^h are input into the decoder network to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est.
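A hedged decoder sketch follows: a simple top-down pathway that projects each feature to a common width, upsamples, and merges, ending in a sigmoid-activated one-channel prediction. The real decoder of FIG. 4 may differ in detail; the channel widths assume the VGG16 encoder sketched above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512, 512)):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in channels)
        self.smooth = nn.Conv2d(64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, feats):  # [f1m, f2m, f3m, f4h, f5h], fine to coarse
        x = self.laterals[-1](feats[-1])
        for lat, f in zip(reversed(self.laterals[:-1]), reversed(feats[:-1])):
            x = F.interpolate(x, size=f.shape[2:], mode="bilinear",
                              align_corners=False)
            x = self.smooth(x + lat(f))  # merge coarse context with skip
        return torch.sigmoid(self.head(x))  # P_est in [0, 1]
```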
5) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
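A minimal training-loop sketch for step 5 is given below. The patent does not state the loss function; binary cross-entropy, a common choice for saliency maps, is assumed here, as are the learning rate and the names model and train_loader:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # model: the network above
for rgb, depth, gt in train_loader:        # gt: P_GT, shape (B, 1, H, W)
    p_est = model(rgb, depth)              # predicted saliency map P_est
    loss = F.binary_cross_entropy(p_est, gt)
    optimizer.zero_grad()
    loss.backward()                        # back-propagation
    optimizer.step()                       # Adam parameter update
```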
6) On the basis of the model structure and parameter weights determined in step 5, the RGB-D image pairs of the test set are tested to generate saliency maps P_est, and evaluation is performed using the MAE, S-measure, F-measure, and E-measure indices.
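Of the four indices, MAE has the simplest definition and is sketched below; the S-, F-, and E-measure have more involved definitions and are not reproduced here:

```python
import torch

def mae(p_est: torch.Tensor, p_gt: torch.Tensor) -> float:
    """Mean absolute error between predicted and ground-truth saliency maps."""
    return (p_est - p_gt).abs().mean().item()
```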
The multi-modal salient object detection method based on the deep convolutional neural network exploits the rich spatial structure information of the Depth image and fuses the Depth features with the features extracted from the RGB image through the weight-guided depth quality assessment mechanism, so the method can meet the requirements of salient object detection in different scenes and, in particular, retains a degree of robustness in challenging scenes (complex backgrounds, low contrast, transparent objects, etc.). Compared with previous RGB-D salient object detection methods, the beneficial effects are as follows:
First, using deep learning techniques, the relationship between an RGB-D image pair and the salient objects of the image is established through an encoder-decoder structure, and the saliency prediction is obtained by extracting and fusing cross-modal features. Second, the complementary information that the Depth image features provide to the RGB image features is effectively modulated by weighted fusion; the depth distribution information guides cross-modal feature fusion and suppresses interference from background information in the RGB image, laying the foundation for the prediction of salient objects in the next stage. Finally, multi-scale multi-modal feature fusion is performed by the constructed decoder to predict the final saliency map.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a two-way scale dependent convolution module
FIG. 4 is a schematic diagram of a Decoder (Decoder)
FIG. 5 is a schematic diagram of model training and testing
FIG. 6 is a graph comparing results of the present invention with other RGB-D salient object detection methods
Detailed Description
The following describes the embodiments of the invention clearly and completely with reference to the accompanying drawings, which show some but not all examples of the invention. All other examples obtained by a person of ordinary skill in the art without inventive effort, based on the examples given here, fall within the scope of the invention.
Referring to FIG. 1, the RGB-D salient object detection method based on depth quality weighting mainly includes the following steps:
1. The RGB-D datasets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set used to train and test the algorithm are determined. The NJUD, NLPR, and DUT-RGBD datasets are used as the training set, and the remaining datasets are used as the test set, including the remaining portions of the NJUD and NLPR datasets, the SIP dataset, the LFSD dataset, and the RGBD135 dataset.
2. A salient object detection network for extracting RGB image features and Depth image features is constructed using a convolutional neural network, comprising an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features:
2.1. The three-channel RGB image is input into the RGB encoder to generate 5 levels of RGB image features, namely f_1^r, ..., f_5^r.
2.2. The three-channel Depth image is input into the Depth encoder to generate 5 levels of Depth image features, namely f_1^d, ..., f_5^d.
3. Referring to FIG. 2, the 5 levels of RGB image features f_i^r and Depth image features f_i^d generated in step 2 are weighted-fused by the cross-modal fusion module to obtain the 5 levels of multi-modal features f_i^m. The main steps are as follows:
3.1. The cross-modal feature fusion network consists of 5 levels of cross-modal weighted fusion (CMWF) modules, which take the 5 levels of RGB image features f_i^r and the corresponding Depth image features f_i^d as input and generate the 5 levels of multi-modal features f_i^m.
3.2. The input data of the CMWF module at level i are f_i^r and f_i^d; the multi-modal feature f_i^m of level i is output through the weight-guided depth quality assessment mechanism.
3.3. The specific process by which the CMWF module generates the multi-modal feature through the weight-guided depth quality assessment mechanism is as follows:
3.3.1. First, the invention constructs a channel-spatial attention feature enhancement module that filters and enhances the features to strengthen their saliency expression capability. Through this module, unnecessary noise is further removed and common salient objects are emphasized, yielding the enhanced features f_i^{r'} and f_i^{d'}. The module applies channel attention CA_i^c followed by spatial attention SA_i^c to f_i^c, i.e., f_i^{c'} = Multi(SA_i^c, Multi(CA_i^c, f_i^c)),
where c ∈ {r, d}, CA_i^c and SA_i^c denote the channel attention and spatial attention at level i and are built from GAP (global average pooling), GMP (global max pooling), Cat (the feature concatenation operation), Conv_k (a convolution with kernel size k × k), and the Sigmoid activation function; Multi denotes element-wise matrix multiplication.
3.3.2. The difference between the two modalities at the feature level is represented by computing the difference between the enhanced RGB attention feature map and the enhanced depth attention feature map; the resulting difference is then divided by the mean absolute value of the enhanced RGB feature to obtain the weighting coefficient λ_i:
λ_i = ||Subtra(f_i^{r'}, f_i^{d'})|| / ||f_i^{r'}||
where Subtra denotes element-wise matrix subtraction, || · || denotes the mean-absolute operation over the H × W spatial positions, and H and W are the height and width of the feature f.
3.3.3. A cross-enhancement strategy is further adopted: the original RGB feature f_i^r and the original depth feature f_i^d are enhanced with the channel-spatial-attention-enhanced features of the opposite modality, f_i^{d'} and f_i^{r'} respectively, yielding the cross-enhanced features f_i^{rd} and f_i^{dr}.
3.3.4. After the weighting coefficient and the cross-enhanced features are obtained, the weighted fusion method fuses the cross-enhanced RGB feature f_i^{rd} with the corresponding cross-enhanced Depth feature f_i^{dr}, weighted by λ_i, to obtain the fused feature f_i^m, where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature resides, Add denotes element-wise matrix addition, and Cat denotes the feature concatenation operation.
4. Referring to FIG. 3, the receptive-field information and high-level semantic information of the multi-modal features are enhanced by the bidirectional scale-dependent convolution module:
4.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
4.2 All the multi-scale features are concatenated and a residual connection is added for stability and optimization, yielding the high-level feature f_c^h, where c ∈ {4, 5} and A denotes global average pooling.
5. Referring to FIG. 4, the acquired low-level features f_1^m, f_2^m, and f_3^m and the high-level features f_4^h and f_5^h are input into the decoder network and activated by a sigmoid function to obtain the predicted saliency map P_est.
6) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate saliency maps P_est, and evaluation is performed using the MAE, S-measure, F-measure, and E-measure indices.
It will be appreciated by persons skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described above or substitute equivalents for some of their technical features. Any modifications, equivalent substitutions, and the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (5)
1. An RGB-D salient object detection method based on depth quality weighting, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, defining an algorithm target of the invention, and determining a training set and a testing set for training and testing an algorithm;
2) Constructing an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features;
3) Establishing a cross-modal feature fusion network, and guiding the RGB image features and Depth image features to undergo cross-weighted fusion through a weight-guided depth quality assessment mechanism;
4) Based on the multi-modal features obtained by the cross-modal feature fusion, constructing a bidirectional scale-dependent convolution fusion mechanism to enhance the high-level semantic information of the multi-modal features;
5) Establishing a decoder, and obtaining a final predicted saliency map through an activation function;
6) Computing the loss function between the predicted saliency map P_est and the manually annotated salient object segmentation map P_GT, gradually updating the parameter weights of the proposed model through Adam and the back-propagation algorithm, and finally determining the structure and parameter weights of the RGB-D salient object detection algorithm;
7) On the basis of the model structure and parameter weights determined in step 6, testing the RGB-D image pairs of the test set to generate saliency maps P_est, and performing performance evaluation using the evaluation indices.
2. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 1) is as follows:
Portions of the NJUD, NLPR, and DUT-RGBD datasets are used as the training set; the remaining portions of the NJUD and NLPR datasets, together with the SIP, LFSD, and RGBD135 datasets, are used as the test sets.
3. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 2) is as follows:
3.1 VGG16 is used as the backbone network of the model for extracting the RGB image features f_i^r and the corresponding Depth image features f_i^d, where i ∈ {1, ..., 5} denotes the level, corresponding to each stage output of VGG16.
3.2 VGG16 weights for constructing the backbone network of the present invention are initialized with VGG16 parameter weights pre-trained in the ImageNet dataset.
4. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 3) is as follows:
4.1 The cross-modal weighted fusion network is composed of 5 levels of cross-modal weighted fusion (CMWF) modules and generates the 5 levels of multi-modal features f_i^m.
5. The RGB-D salient object detection method based on depth quality weighting of claim 1, characterized in that the specific method of step 4) is as follows:
5.1 The multi-modal features of the 4th and 5th levels are processed by depth-separable convolutions to extract multi-scale receptive-field information, using depth-separable convolutions with different kernel sizes:
R_1 = DConv_3(R) + R    (9)
R_i = DConv_{2i+1}(R_{i-1}) + R,  i ∈ {2, 3, 4}    (10)
where R denotes the input feature, DConv_3 denotes a 3 × 3 depth-separable convolution, and DConv_{2i+1} denotes a depth-separable convolution with kernel size (2i+1) × (2i+1).
5.2 All the multi-scale features are concatenated and a residual connection is added for stability and optimization, yielding the high-level feature f_c^h.
6) The first 3 levels of low-level multi-modal features f_1^m, f_2^m, and f_3^m obtained in step 4 and the 2 levels of high-level multi-scale complementary features f_4^h and f_5^h obtained in step 5 are input into the decoder to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted saliency map P_est.
7) The loss function is computed between the saliency map P_est predicted by the invention and the manually annotated salient object segmentation map P_GT; the parameter weights of the proposed model are gradually updated through Adam and the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D salient object detection algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting
Publications (1)
Publication Number | Publication Date |
---|---|
CN116310396A true CN116310396A (en) | 2023-06-23 |
Family
ID=86831836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310201765.0A CN116310396A (en) | 2023-02-28 | 2023-02-28 | RGB-D salient object detection method based on depth quality weighting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116310396A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036891A (en) * | 2023-08-22 | 2023-11-10 | 睿尔曼智能科技(北京)有限公司 | Cross-modal feature fusion-based image recognition method and system |
CN117036891B (en) * | 2023-08-22 | 2024-03-29 | 睿尔曼智能科技(北京)有限公司 | Cross-modal feature fusion-based image recognition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111476292B (en) | Small sample element learning training method for medical image classification processing artificial intelligence | |
Huang et al. | Indoor depth completion with boundary consistency and self-attention | |
CN113240691B (en) | Medical image segmentation method based on U-shaped network | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN111325750B (en) | Medical image segmentation method based on multi-scale fusion U-shaped chain neural network | |
CN113554125A (en) | Object detection apparatus, method and storage medium combining global and local features | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN113379707A (en) | RGB-D significance detection method based on dynamic filtering decoupling convolution network | |
CN113850900A (en) | Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction | |
CN114663371A (en) | Image salient target detection method based on modal unique and common feature extraction | |
CN116310396A (en) | RGB-D salient object detection method based on depth quality weighting | |
CN114283315A (en) | RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
Wang et al. | INSPIRATION: A reinforcement learning-based human visual perception-driven image enhancement paradigm for underwater scenes | |
Liu et al. | Video decolorization based on the CNN and LSTM neural network | |
CN116433904A (en) | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution | |
CN115830420A (en) | RGB-D significance target detection method based on boundary deformable convolution guidance | |
CN114972937A (en) | Feature point detection and descriptor generation method based on deep learning | |
CN114693953A (en) | RGB-D significance target detection method based on cross-modal bidirectional complementary network | |
CN114463346A (en) | Complex environment rapid tongue segmentation device based on mobile terminal | |
Zhuge et al. | Automatic colorization using fully convolutional networks | |
CN113096176A (en) | Semantic segmentation assisted binocular vision unsupervised depth estimation method | |
CN116503618B (en) | Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation | |
CN112597847B (en) | Face pose estimation method and device, electronic equipment and storage medium | |
Moorthy et al. | SEM and TEM images’ dehazing using multiscale progressive feature fusion techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||