CN114693951A - RGB-D salient object detection method based on global context information exploration - Google Patents
- Publication number
- CN114693951A (application CN202210300694.5A)
- Authority
- CN
- China
- Prior art keywords: rgb, layer, feature, scale, features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention belongs to the field of computer vision and discloses an RGB-D salient object detection method based on global context information exploration, comprising the following steps: 1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective of the invention, and determining the training set and test set used to train and evaluate the algorithm; 2) constructing a cross-modal context feature module that extracts feature information from a stack of successive convolutional layers; 3) defining the multi-path fusion of the multi-scale features produced by the stacked convolutional layers, together with spatial-channel attention; 4) constructing a multi-scale feature decoder (MFD) that fuses the multi-scale features in a top-down aggregation strategy and generates the saliency result; 5) training the model with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks, which computes the error between the predicted value and the ground-truth value at each pixel.
Description
Technical field:
The invention relates to the fields of computer vision and image processing, and provides a novel global context exploration network (GCENet) for the RGB-D salient object detection (SOD) task, which explores the performance gain of multi-scale context features at a fine-grained level.
Background art:
Salient object detection aims at segmenting the most visually attractive objects from a given scene. As a preprocessing tool, SOD has been widely used in computer vision tasks such as image retrieval and visual tracking. Most previous SOD methods focus on RGB images, but they struggle with challenging scenes such as low-contrast environments, similar foreground and background, and complex backgrounds. With the popularization of depth-sensing devices such as the Microsoft Kinect, iPhone XR and Huawei Mate 30, acquiring RGB-D images has become practical. Since depth cues affect visual attention in addition to 2D features such as texture, orientation, and brightness, RGB-D SOD is receiving increasing attention and study. Effective use of multi-scale context features endows the features with richer global context information, helps the network understand the whole scene better, and improves the performance of RGB-D SOD networks.
Inspired by the advantages of multi-scale features, many RGB-D SOD methods exploit them to improve performance. However, they focus primarily on hierarchical multi-scale representations and cannot capture fine-grained global context cues within a single layer. In contrast to these methods, the present invention proposes a global context exploration network (GCENet) for RGB-D SOD that explores the gain of multi-scale context features at a fine-grained level. Specifically, a cross-modal context feature module (CCFM) is proposed that extracts cross-modal global features from RGB images and depth maps through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism. The fused features are then aggregated in a cascaded manner. Furthermore, the multi-scale information from multiple blocks of the backbone needs to be considered and integrated to produce the final saliency result. To this end, the invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks in a top-down aggregation.
Summary of the invention:
To address the above problems, a new global context exploration network (GCENet) is proposed for the RGB-D SOD task, together with a multi-scale feature decoder. The technical scheme is as follows:
1. obtaining RGB-D data set for training and testing the task
1.1) randomly selecting 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assigning the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
1.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images. STERE was the first such dataset proposed, containing 1000 pairs in total with low-quality depth maps.
2. Construct the cross-modal context feature module (CCFM), which extracts feature information from a stack of successive convolutional layers
2.1) A multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2) Four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, $f_i^{\alpha,0}$ is the feature of the i-th layer of the backbone network, and $i \in \{1,2,3,4,5\}$ indexes that layer;
2.3) The multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network;
2.4) The spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
2.5) The remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features. Finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers.
3. Construct the multi-scale feature decoder
In the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling. The per-level weights are computed as:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$, $UP_{2^{5-t}}$ denotes $2^{5-t}\times$ upsampling, $FU_t$ denotes the fused feature at level t, which contains more global information, and $W_t$ is the weight matrix generated from $FU_t$;
4. Compute the loss function. In the training stage, the invention trains the network with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks. It computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
Unlike most methods, which integrate the multi-scale features of the backbone in a hierarchical manner, the invention proposes a fine-grained approach that extracts and integrates multi-scale features on a single feature scale rather than across multiple feature scales, thereby capturing fine-grained global context cues within a single layer. First, a cross-modal context feature module (CCFM) is proposed, which extracts cross-modal global features from the RGB image and the depth map through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism; the fused features are then aggregated in a cascaded manner; finally, the invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks of the backbone in a top-down aggregation, so that the multi-scale information from these blocks is considered and integrated to produce the final saliency result.
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention;
FIG. 2 is a block diagram of the cross-modal context feature module;
FIG. 3 is a diagram of multi-path fusion.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings; the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without inventive effort fall within the scope of the present invention.
Referring to fig. 1, an RGB-D salient object detection method based on global context information exploration mainly includes the following steps:
1. Acquire an RGB-D dataset for training and testing the task, define the algorithm objective of the invention, and determine the training set and test set used to train and evaluate the algorithm: randomly select 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assign the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
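For illustration, the random split described above can be written as the following minimal Python sketch. The sample lists are hypothetical placeholders for the actual (RGB, depth, ground truth) triplets, and the NLPR and DUT set sizes (1000 and 1200 pairs) are assumptions not stated in this text.

```python
import random

random.seed(0)  # fix the seed so the random split is reproducible

# Hypothetical sample identifiers standing in for (RGB, depth, GT) triplets.
nlpr = [f"NLPR/{i:04d}" for i in range(1000)]    # assumed 1000 pairs
nju2k = [f"NJU2K/{i:04d}" for i in range(1985)]  # 1985 pairs (see step 1.2)
dut = [f"DUT/{i:04d}" for i in range(1200)]      # assumed 1200 pairs

def split(samples, n_train):
    """Randomly pick n_train samples for training; the rest become test samples."""
    pool = samples[:]
    random.shuffle(pool)
    return pool[:n_train], pool[n_train:]

nlpr_tr, nlpr_te = split(nlpr, 650)
nju_tr, nju_te = split(nju2k, 1400)
dut_tr, dut_te = split(dut, 800)

train_set = nlpr_tr + nju_tr + dut_tr   # 2850 training samples in total
test_set = nlpr_te + nju_te + dut_te    # remainders; STERE etc. are added whole
```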
2. Construct the cross-modal context feature module (CCFM), which extracts feature information from a stack of successive convolutional layers
2.1 A multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2 Four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, $f_i^{\alpha,0}$ is the feature of the i-th layer of the backbone network, and $i \in \{1,2,3,4,5\}$ indexes that layer;
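A minimal PyTorch sketch of this stacked-convolution extraction is given below; the channel width of 64 and the BatchNorm/ReLU placement are assumptions, and one stack is instantiated per modality.

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    """Four successive 3x3 convolutions: each output f^{alpha,j} has a larger
    receptive field than the last, yielding multi-scale context features on a
    single feature scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(4)
        )

    def forward(self, f0):
        """f0 is the i-th backbone feature f^{alpha,0}; returns [f^1, ..., f^4]."""
        feats, f = [], f0
        for conv in self.convs:  # f^{alpha,j} = Conv3(f^{alpha,j-1})
            f = conv(f)
            feats.append(f)
        return feats

rgb_stack, depth_stack = ConvStack(), ConvStack()   # one stack per modality
rgb_feats = rgb_stack(torch.randn(1, 64, 56, 56))   # dummy backbone feature
```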
2.3 The multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network; a plausible arrangement is sketched below;
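The three fusion paths are named above, but their exact composition is not reproduced here; the sketch below shows one plausible arrangement in PyTorch, with the merging 3 × 3 convolution being an assumption.

```python
import torch
import torch.nn as nn

class MultiPathFusion(nn.Module):
    """Plausible MPF sketch: three parallel cross-modal paths (O_ad, O_ml,
    O_ct) merged back to the original channel width; the merging convolution
    is an assumption, not taken from the text."""
    def __init__(self, channels=64):
        super().__init__()
        # add path (C) + mul path (C) + concat path (2C) = 4C input channels
        self.merge = nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_rgb, f_depth):
        o_ad = f_rgb + f_depth                      # O_ad: element-wise addition
        o_ml = f_rgb * f_depth                      # O_ml: element-wise multiplication
        o_ct = torch.cat([f_rgb, f_depth], dim=1)   # O_ct: concatenation
        return self.merge(torch.cat([o_ad, o_ml, o_ct], dim=1))

mpf = MultiPathFusion()
fused = mpf(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```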
2.4 The spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
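A compact sketch of the spatial-channel attention follows; the internal designs (a 7 × 7 spatial mask, squeeze-and-excitation channel weights) and the SA-before-CA ordering are assumptions chosen to match the name.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SA: a spatial mask from channel-wise mean and max maps (assumed design)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ChannelAttention(nn.Module):
    """CA: squeeze-and-excitation style channel reweighting (assumed design)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

sa, ca = SpatialAttention(), ChannelAttention()
f_sc = ca(sa(torch.randn(1, 64, 56, 56)))  # F^sc = CA(SA(F^mpf))
```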
2.5 The remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features. Finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers; one plausible realization is sketched below.
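The guidance equations are not reproduced here, so the sketch below shows one common realization under stated assumptions: the deepest fused feature, which carries the most global context, gates each shallower output through a sigmoid while a residual branch keeps the original signal.

```python
import torch
import torch.nn.functional as F

def global_guidance(fused):
    """fused: list of per-layer fused features [F_1, ..., F_4], shallow to deep.
    The deepest feature reweights each shallower one (assumed realization)."""
    guide = fused[-1]
    out = []
    for f in fused[:-1]:
        g = F.interpolate(guide, size=f.shape[2:], mode="bilinear",
                          align_corners=False)
        out.append(f * torch.sigmoid(g) + f)  # gated refinement + residual
    out.append(guide)
    return out

feats = [torch.randn(1, 64, s, s) for s in (56, 28, 14, 7)]
guided = global_guidance(feats)
```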
3. Construct the multi-scale feature decoder
3.1 In the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling;
3.2 Decoding then proceeds level by level in the same manner until the final saliency prediction is produced, with the per-level weights expressed by the following formula:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$, $UP_{2^{5-t}}$ denotes $2^{5-t}\times$ upsampling, $FU_t$ denotes the fused feature at level t, which contains more global information, and $W_t$ is the weight matrix generated from $FU_t$;
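To put Eq. (9) in context, the following sketch assembles a minimal top-down decoder; the exact wiring between levels is not fully reproduced above, so the weighted aggregation around $W_t = \mathrm{Sigmoid}(\mathrm{Conv1}(FU_t))$ is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFD(nn.Module):
    """Top-down multi-scale feature decoder sketch. Each deeper fused feature
    FU_t is upsampled, turned into a weight map W_t = Sigmoid(Conv1(FU_t)),
    and used to reweight the shallower feature before aggregation."""
    def __init__(self, channels=64):
        super().__init__()
        self.w_conv = nn.ModuleList(nn.Conv2d(channels, 1, kernel_size=1)
                                    for _ in range(3))        # Conv1 per level t
        self.bn = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(3))
        self.head = nn.Conv2d(channels, 1, kernel_size=1)     # saliency logits

    def forward(self, feats):
        """feats: [F_1, ..., F_4] from shallow to deep (largest map first)."""
        fu = feats[-1]
        for t in range(len(feats) - 2, -1, -1):               # top-down: t = 2, 1, 0
            up = F.interpolate(fu, size=feats[t].shape[2:], mode="bilinear",
                               align_corners=False)           # upsample FU_t
            w = torch.sigmoid(self.w_conv[t](up))             # W_t = Sigmoid(Conv1(FU_t))
            fu = self.bn[t](feats[t] * w + up)                # weighted aggregation + BN
        return self.head(fu)

mfd = MFD()
sal = mfd([torch.randn(1, 64, s, s) for s in (56, 28, 14, 7)])  # -> [1, 1, 56, 56]
```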
4. Compute the loss function. In the training stage, the invention trains the network with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks. It computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
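Finally, the BCE objective can be exercised with PyTorch's built-in implementation; the batch dimension and the use of a logits-based variant are additions for the example.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred_logits, gt):
    """Pixel-wise binary cross entropy between the predicted saliency map P
    and the ground truth G, averaged over the H x W map."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt)

pred = torch.randn(2, 1, 256, 256, requires_grad=True)  # raw saliency logits
gt = torch.randint(0, 2, (2, 1, 256, 256)).float()      # binary ground truth
loss = bce_loss(pred, gt)
loss.backward()  # gradients flow back through the network during training
```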
The above description is for the purpose of illustrating preferred embodiments of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (5)
1. An RGB-D salient object detection method based on global context information exploration, characterized by comprising the following steps:
1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective, and determining the training set and test set used to train and evaluate the algorithm;
2) constructing a cross-modal context feature module that extracts feature information from a stack of successive convolutional layers;
3) defining the multi-path fusion of the multi-scale features produced by the stacked convolutional layers, together with spatial-channel attention;
4) constructing a multi-scale feature decoder that fuses the multi-scale features in a top-down aggregation strategy and generates the saliency result;
5) training the model with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks, which computes the error between the predicted value and the ground-truth value at each pixel.
2. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 2 is as follows:
2.1) randomly selecting 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assigning the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
2.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images; STERE was the first such dataset proposed, containing 1000 pairs in total with low-quality depth maps.
3. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 3 is as follows:
3.1) a multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation; in addition, to reduce redundant information and non-salient features during cross-channel integration, a spatial-channel attention mechanism is used to filter out unwanted information;
3.2) four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network;
3.3) the multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, and $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM;
3.4) the spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
3.5) the remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features; finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers.
4. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 4 is as follows: in the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling; the per-level weights are computed as:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$ and $FU_t$ denotes the fused feature at level t.
5. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 5 is as follows:
5.1) computing the loss function: in the training stage, the network is trained with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks; it computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210300694.5A (published as CN114693951A) | 2022-03-24 | 2022-03-24 | RGB-D salient object detection method based on global context information exploration
Publications (1)
Publication Number | Publication Date |
---|---|
CN114693951A | 2022-07-01
Family
ID=82138691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210300694.5A (pending, published as CN114693951A) | RGB-D salient object detection method based on global context information exploration | 2022-03-24 | 2022-03-24
Country Status (1)
Country | Link |
---|---|
CN | CN114693951A
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117245672A | 2023-11-20 | 2023-12-19 | 南昌工控机器人有限公司 | Intelligent motion control system and method for modularized assembly of camera support
CN117245672B | 2023-11-20 | 2024-02-02 | 南昌工控机器人有限公司 | Intelligent motion control system and method for modularized assembly of camera support
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |