CN114693951A - RGB-D salient object detection method based on global context information exploration - Google Patents

RGB-D salient object detection method based on global context information exploration

Info

Publication number
CN114693951A
CN114693951A (application number CN202210300694.5A)
Authority
CN
China
Prior art keywords
rgb
layer
feature
scale
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210300694.5A
Other languages
Chinese (zh)
Inventor
Huang Rongmei (黄荣梅)
Liao Tao (廖涛)
Duan Songsong (段松松)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology
Priority to CN202210300694.5A
Publication of CN114693951A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and discloses an RGB-D salient object detection method based on global context information exploration, which comprises the following steps: 1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective of the invention, and determining the training set and test set used to train and evaluate the algorithm; 2) constructing a cross-modal context feature module that extracts feature information through a stack of consecutive convolutional layers; 3) defining the stack of consecutive convolutional layers, the multi-scale features and the spatial-channel attention used by the multi-scale feature decoder (MFD); 4) constructing the multi-scale feature decoder, which fuses the multi-scale features with a top-down aggregation strategy and generates the saliency result; 5) training the model with binary cross-entropy (BCE), a loss function widely used in SOD tasks, which computes the error between the predicted value and the ground-truth value at each pixel.

Description

RGB-D salient object detection method based on global context information exploration
Technical field:
The invention relates to the fields of computer vision and image processing, and provides a novel global context exploration network (GCENet) for the RGB-D salient object detection (SOD) task, which explores the performance gain of multi-scale context features in a fine-grained manner.
Background art:
Salient object detection aims at segmenting the most visually attractive objects from a given scene. As a preprocessing tool, SOD has been widely used in computer vision tasks such as image retrieval and visual tracking. Most previous SOD methods focus on RGB images, but they struggle with challenging scenes such as low-contrast environments, similar foreground and background, and complex backgrounds. With the popularization of depth-sensor devices such as the Microsoft Kinect, iPhone XR and Huawei Mate 30, acquiring RGB-D images has become practical. Since depth cues affect visual attention in addition to 2D features such as texture, orientation and brightness, RGB-D SOD has attracted increasing attention and research. Effective use of multi-scale context features endows the features with richer global context information, helps the network understand the whole scene better, and improves the performance of RGB-D SOD networks.
Inspired by these advantages, many RGB-D SOD methods exploit multi-scale features to improve performance. However, they focus primarily on hierarchical multi-scale representations and cannot capture fine-grained global context cues within a single layer. In contrast to these methods, the present invention proposes a global context exploration network (GCENet) for RGB-D SOD that explores the gain of multi-scale context features at a fine-grained level. Specifically, a cross-modal context feature module (CCFM) is proposed that extracts cross-modal global features from the RGB image and the depth map through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism. The fused features are then aggregated in a cascaded manner. Furthermore, the multi-scale information from multiple blocks of the backbone needs to be considered and integrated to produce the final saliency result. To this end, the present invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks in a top-down aggregation.
Summary of the invention:
To address the above problems, a new global context exploration network (GCENet) with a multi-scale feature decoder is provided for the RGB-D SOD task. The technical scheme is specifically as follows:
1. Acquire the RGB-D datasets used to train and test the task
1.1) 650 samples from the NLPR dataset, 1400 samples from the NJU2K dataset and 800 samples from the DUT dataset are randomly selected as the training set, and the remaining samples of these three datasets, together with the samples of the RGBD and STERE datasets, are used as the test set;
1.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images. STERE was the first proposed stereo dataset and contains 1000 pairs in total, with relatively low-quality depth maps.
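As a non-limiting illustration of step 1.1, the following Python sketch shows how such a training/test split could be assembled; the directory layout, file extensions and helper names are assumptions made for this example and are not part of the disclosure.

```python
import random
from pathlib import Path

# Assumed layout: each dataset folder holds paired RGB images and depth maps
# with matching file stems, e.g. NLPR/RGB/0001.jpg and NLPR/depth/0001.png.
def list_pairs(root):
    rgb_dir, depth_dir = Path(root) / "RGB", Path(root) / "depth"
    return sorted((p, depth_dir / (p.stem + ".png")) for p in rgb_dir.glob("*.jpg"))

def split_dataset(root, n_train, seed=0):
    pairs = list_pairs(root)
    random.Random(seed).shuffle(pairs)
    return pairs[:n_train], pairs[n_train:]

# Training set: 650 NLPR + 1400 NJU2K + 800 DUT samples (as in step 1.1);
# the remaining samples of these datasets are kept for testing.
train, test = [], []
for name, n_train in [("NLPR", 650), ("NJU2K", 1400), ("DUT", 800)]:
    tr, te = split_dataset(name, n_train)
    train += tr
    test += te
```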
2. Construct a cross-modal context feature module that extracts feature information through a stack of consecutive convolutional layers
2.1) A multi-path fusion (MPF) strategy is proposed to fuse the cross-modal features, employing a cooperative set of element-level operations, including element-wise addition, element-wise multiplication and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2) For the i-th backbone layer, four RGB features F^R_i,1, …, F^R_i,4 and four depth features F^D_i,1, …, F^D_i,4 are extracted by a stack of consecutive convolutional layers, which can be described as:
F^α_i,j = Conv3(F^α_i,j-1), j ∈ {1,2,3,4}   (1)
where Conv3 denotes a convolution operation with a 3 × 3 kernel, α ∈ {R, D}, F^α_i,0 is the i-th layer feature of the RGB or depth backbone, F^α_i,1, …, F^α_i,4 denote the outputs of the four consecutive convolutional layers, and i ∈ {1,2,3,4,5} denotes the i-th layer of the backbone network;
2.3) Given the multi-scale features defined above, MPF is calculated as:
F^mpf_i,1 = MPF(F^R_i,1, F^D_i,1)   (2)
where the MPF operator combines O_ad, O_ml and O_ct, which denote element-wise addition, element-wise multiplication and concatenation respectively, F^R_i,1 and F^D_i,1 are the RGB and depth features of the first layer of the CCFM, and i ∈ {1,2,3,4,5} denotes the i-th layer of the hierarchical backbone;
2.4) Spatial-channel attention is then applied: the MPF output is processed by spatial attention SA and channel attention CA, producing the enhanced feature F^en_i,1 of this MPF layer;
2.5) The remaining layers of the MPF perform a similar procedure to the first layer, and three further fused features F^en_i,2, F^en_i,3 and F^en_i,4 can be obtained. Finally, a high-level global information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers, producing the CCFM output F^ccfm_i, which represents the feature of the i-th layer of the hierarchical backbone;
3. Construct the multi-scale feature decoder
3.1) The fusion starts from the highest-level features: the fused feature FU_4 is formed from F^ccfm_4 and the 2× upsampled F^ccfm_5, a weight matrix W_4 = Sigmoid(Conv1(FU_4)) is generated from it, and the weighted feature is passed through a channel-converting convolutional layer Conv1 and a batch normalization layer BN to produce D_4, the output of the 4th layer of the MFD, where Sigmoid denotes the activation function and UP_2 denotes a 2× upsampling operation;
3.2) The above step is repeated until D_1 is generated; the weighting at each level is computed as
W_t = Sigmoid(Conv1(FU_t))   (9)
where t ∈ {1,2,3}, UP_{2^(5-t)} denotes a 2^(5-t)× upsampling operation, FU_t denotes the fused feature at level t, which contains more global information, and W_t denotes the weight matrix generated from FU_t;
4. Compute the loss function. In the training stage, the invention trains the network with binary cross-entropy (BCE), a loss function widely used in SOD tasks. It computes the error at each pixel and is defined as:
L_bce = -Σ_(x,y) [G(x,y)·log P(x,y) + (1 - G(x,y))·log(1 - P(x,y))]   (11)
where P = {p | 0 < p < 1} ∈ R^(1×H×W) and G = {g | 0 < g < 1} ∈ R^(1×H×W) denote the predicted saliency map and the corresponding ground truth respectively, H and W denote the height and width of the input image, and L_bce is the error between the predicted and ground-truth values at each pixel.
Unlike most methods, which integrate the multi-scale features of the backbone network in a hierarchical manner, the invention provides a fine-grained approach that extracts and integrates multi-scale features on a single feature scale rather than across multiple feature scales, thereby capturing fine-grained global context cues within a single layer. First, a cross-modal context feature module (CCFM) is proposed that extracts cross-modal global features from the RGB image and the depth map through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism. The fused features are then aggregated in a cascaded manner. Finally, the invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks in a top-down aggregation, so that the multi-scale information from multiple blocks of the backbone is considered and integrated to produce the final saliency result.
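For orientation only, the following Python (PyTorch) skeleton sketches how the modules described above could be connected in a forward pass; the module and variable names (rgb_backbone, ccfms, mfd, etc.) are illustrative assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class GCENetSketch(nn.Module):
    """Illustrative skeleton: backbone features -> per-layer CCFM -> top-down MFD."""
    def __init__(self, rgb_backbone, depth_backbone, ccfms, mfd):
        super().__init__()
        self.rgb_backbone = rgb_backbone      # returns 5 feature maps
        self.depth_backbone = depth_backbone  # returns 5 feature maps
        self.ccfms = nn.ModuleList(ccfms)     # one CCFM per backbone layer
        self.mfd = mfd                        # top-down multi-scale feature decoder

    def forward(self, rgb, depth):
        rgb_feats = self.rgb_backbone(rgb)        # [F1_r, ..., F5_r]
        depth_feats = self.depth_backbone(depth)  # [F1_d, ..., F5_d]
        fused = [ccfm(fr, fd)
                 for ccfm, fr, fd in zip(self.ccfms, rgb_feats, depth_feats)]
        return self.mfd(fused)                    # predicted saliency map
```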
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention
FIG. 2 is a block diagram of the cross-modal context feature module
FIG. 3 is a diagram of the multi-path fusion mechanism
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings of the embodiments. The described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of the present invention.
Referring to FIG. 1, an RGB-D salient object detection method based on global context information exploration mainly includes the following steps:
1. Acquire the RGB-D datasets used to train and test the task, define the algorithm objective of the invention, and determine the training set and test set used to train and evaluate the algorithm. 650 samples from the NLPR dataset, 1400 samples from the NJU2K dataset and 800 samples from the DUT dataset are randomly selected as the training set, and the remaining samples of these three datasets, together with the samples of the RGBD and STERE datasets, are used as the test set;
2. Construct a cross-modal context feature module that extracts feature information through a stack of consecutive convolutional layers
2.1 A multi-path fusion (MPF) strategy is proposed to fuse the cross-modal features, employing a cooperative set of element-level operations, including element-wise addition, element-wise multiplication and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2 For the i-th backbone layer, four RGB features F^R_i,1, …, F^R_i,4 and four depth features F^D_i,1, …, F^D_i,4 are extracted by a stack of consecutive convolutional layers, which can be described as:
F^α_i,j = Conv3(F^α_i,j-1), j ∈ {1,2,3,4}   (1)
where Conv3 denotes a convolution operation with a 3 × 3 kernel, α ∈ {R, D}, F^α_i,0 is the i-th layer feature of the RGB or depth backbone, F^α_i,1, …, F^α_i,4 denote the outputs of the four consecutive convolutional layers, and i ∈ {1,2,3,4,5} denotes the i-th layer of the backbone network;
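A minimal PyTorch sketch of the stacked-convolution extraction in step 2.2 is given below as a non-limiting example; the channel width and the wrapping module name are assumptions.

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    """Four consecutive 3x3 convolutions (Conv3) applied to one backbone feature,
    returning the four intermediate outputs F_i,1 ... F_i,4 of formula (1)."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(4)])

    def forward(self, x):
        outs = []
        for conv in self.convs:
            x = conv(x)        # F_i,j = Conv3(F_i,j-1)
            outs.append(x)
        return outs            # [F_i,1, F_i,2, F_i,3, F_i,4]

# Example: one backbone feature map with 64 channels
feat = torch.randn(1, 64, 56, 56)
rgb_scales = ConvStack(64)(feat)
```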
2.3 Given the multi-scale features defined above, MPF is calculated as:
F^mpf_i,1 = MPF(F^R_i,1, F^D_i,1)   (2)
where the MPF operator combines O_ad, O_ml and O_ct, which denote element-wise addition, element-wise multiplication and concatenation respectively, F^R_i,1 and F^D_i,1 are the RGB and depth features of the first layer of the CCFM, and i ∈ {1,2,3,4,5} denotes the i-th layer of the hierarchical backbone;
2.4 Spatial-channel attention is then applied: the MPF output is processed by spatial attention SA and channel attention CA, producing the enhanced feature F^en_i,1 of this MPF layer;
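The following non-limiting sketch illustrates one plausible reading of steps 2.3 and 2.4 (multi-path fusion followed by spatial and channel attention); the exact composition of the fusion paths and the internal structure of the attention blocks are assumptions, since only the operators O_ad, O_ml, O_ct, SA and CA are named above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Simple spatial attention: gate each location by a sigmoid map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class ChannelAttention(nn.Module):
    """Simple channel attention: gate each channel by a squeeze-and-excite weight."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class MPF(nn.Module):
    """Fuse RGB and depth features along addition and multiplication paths,
    join them by concatenation, then filter with spatial-channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)  # merge the concatenated paths
        self.sa, self.ca = SpatialAttention(channels), ChannelAttention(channels)

    def forward(self, fr, fd):
        fused = torch.cat([fr + fd, fr * fd], dim=1)   # O_ad and O_ml paths, joined by O_ct
        return self.ca(self.sa(self.reduce(fused)))

# Example with 64-channel features from one CCFM layer
fr, fd = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
enhanced = MPF(64)(fr, fd)
```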
2.5 The remaining layers of the MPF perform a similar procedure to the first layer, and three further fused features F^en_i,2, F^en_i,3 and F^en_i,4 can be obtained. Finally, a high-level global information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers, producing the CCFM output F^ccfm_i, which represents the feature of the i-th layer of the hierarchical backbone;
3. Construct the multi-scale feature decoder
3.1 The fusion starts from the highest-level features: the fused feature FU_4 is formed from F^ccfm_4 and the 2× upsampled F^ccfm_5, a weight matrix W_4 = Sigmoid(Conv1(FU_4)) is generated from it, and the weighted feature is passed through a channel-converting convolutional layer Conv1 and a batch normalization layer BN to produce D_4, the output of the 4th layer of the MFD, where Sigmoid denotes the activation function and UP_2 denotes a 2× upsampling operation;
3.2 The above step is repeated until D_1 is generated; the weighting at each level is computed as
W_t = Sigmoid(Conv1(FU_t))   (9)
where t ∈ {1,2,3}, UP_{2^(5-t)} denotes a 2^(5-t)× upsampling operation, FU_t denotes the fused feature at level t, which contains more global information, and W_t denotes the weight matrix generated from FU_t;
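A hedged sketch of one decoder stage from steps 3.1 and 3.2 follows; how FU_t is assembled and how the weight W_t is applied are assumptions chosen to be consistent with the named operators (Conv1, BN, Sigmoid, UP_2), not the exact disclosed equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFDStage(nn.Module):
    """One decoder stage: fuse the CCFM feature with the upsampled deeper decoder
    output, weight it with Sigmoid(Conv1(.)), and refine with Conv1 + BN."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * in_channels, in_channels, 1)      # build FU_t
        self.weight = nn.Conv2d(in_channels, in_channels, 1)        # W_t = Sigmoid(Conv1(FU_t))
        self.refine = nn.Sequential(nn.Conv2d(in_channels, out_channels, 1),
                                    nn.BatchNorm2d(out_channels))   # Conv1 + BN

    def forward(self, f_ccfm, d_deeper):
        up = F.interpolate(d_deeper, size=f_ccfm.shape[-2:],
                           mode="bilinear", align_corners=False)    # UP_2
        fu = self.fuse(torch.cat([f_ccfm, up], dim=1))
        w = torch.sigmoid(self.weight(fu))
        return self.refine(w * fu)                                  # D_t

# Example: fuse a 64-channel CCFM output with the deeper decoder feature
f4, d5 = torch.randn(1, 64, 28, 28), torch.randn(1, 64, 14, 14)
d4 = MFDStage(64, 64)(f4, d5)
```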
4. Compute the loss function. In the training stage, the invention trains the network with binary cross-entropy (BCE), a loss function widely used in SOD tasks. It computes the error at each pixel and is defined as:
L_bce = -Σ_(x,y) [G(x,y)·log P(x,y) + (1 - G(x,y))·log(1 - P(x,y))]   (11)
where P = {p | 0 < p < 1} ∈ R^(1×H×W) and G = {g | 0 < g < 1} ∈ R^(1×H×W) denote the predicted saliency map and the corresponding ground truth respectively, H and W denote the height and width of the input image, and L_bce is the error between the predicted and ground-truth values at each pixel.
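As a brief non-limiting example of step 4, the per-pixel BCE loss can be evaluated with PyTorch's built-in binary cross-entropy; the tensor shapes below are placeholders.

```python
import torch
import torch.nn.functional as F

# P: predicted saliency map in (0, 1), G: ground-truth map, both of shape (1, H, W)
P = torch.rand(1, 224, 224).clamp(1e-6, 1 - 1e-6)
G = (torch.rand(1, 224, 224) > 0.5).float()

# Standard BCE averaged over all pixels
l_bce = F.binary_cross_entropy(P, G, reduction="mean")
print(l_bce.item())
```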
The above description concerns preferred embodiments of the present application and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (5)

1. An RGB-D salient object detection method based on global context information exploration, characterized by comprising the following steps:
1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective of the invention, and determining the training set and test set used to train and evaluate the algorithm;
2) constructing a cross-modal context feature module that extracts feature information through a stack of consecutive convolutional layers;
3) defining the stack of consecutive convolutional layers, the multi-scale features and the spatial-channel attention used by the multi-scale feature decoder (MFD);
4) constructing the multi-scale feature decoder, fusing the multi-scale features with a top-down aggregation strategy, and generating the saliency result;
5) training the model of the invention with binary cross-entropy (BCE), a loss function widely used in SOD tasks, and calculating the error between the predicted value and the ground-truth value at each pixel.
2. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 2 is as follows:
2.1) 650 samples from the NLPR dataset, 1400 samples from the NJU2K dataset and 800 samples from the DUT dataset are randomly selected as the training set, and the remaining samples of these three datasets, together with the samples of the RGBD and STERE datasets, are used as the test set;
2.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images; STERE was the first proposed stereo dataset and contains 1000 pairs in total, with relatively low-quality depth maps.
3. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 3 is as follows:
3.1) a multi-path fusion (MPF) strategy is proposed to fuse the cross-modal features, employing a cooperative set of element-level operations, including element-wise addition, element-wise multiplication and concatenation; in addition, to reduce redundant information and non-salient features during cross-channel integration, a spatial-channel attention mechanism is used to filter out unwanted information;
3.2) for the i-th backbone layer, four RGB features F^R_i,1, …, F^R_i,4 and four depth features F^D_i,1, …, F^D_i,4 are extracted by a stack of consecutive convolutional layers, which can be described as:
F^α_i,j = Conv3(F^α_i,j-1), j ∈ {1,2,3,4}   (1)
where Conv3 denotes a convolution operation with a 3 × 3 kernel, α ∈ {R, D}, F^α_i,0 is the i-th layer feature of the RGB or depth backbone, F^α_i,1, …, F^α_i,4 denote the outputs of the four consecutive convolutional layers, and i ∈ {1,2,3,4,5} denotes the i-th layer of the backbone network;
3.3) given the multi-scale features defined above, MPF is calculated as:
F^mpf_i,1 = MPF(F^R_i,1, F^D_i,1)   (2)
where the MPF operator combines O_ad, O_ml and O_ct, which denote element-wise addition, element-wise multiplication and concatenation respectively, F^R_i,1 and F^D_i,1 are the RGB and depth features of the first layer of the CCFM, and i ∈ {1,2,3,4,5} denotes the i-th layer of the hierarchical backbone;
3.4) spatial-channel attention is then applied: the MPF output is processed by spatial attention SA and channel attention CA, producing the enhanced feature F^en_i,1 of this MPF layer;
3.5) the remaining layers of the MPF perform a similar procedure to the first layer, and three further fused features F^en_i,2, F^en_i,3 and F^en_i,4 can be obtained; finally, a high-level global information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers, producing the CCFM output F^ccfm_i, which represents the feature of the i-th layer of the hierarchical backbone.
4. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 4 is as follows:
4.1) the fusion starts from the highest-level features: the fused feature FU_4 is formed from F^ccfm_4 and the 2× upsampled F^ccfm_5, a weight matrix W_4 = Sigmoid(Conv1(FU_4)) is generated from it, and the weighted feature is passed through a channel-converting convolutional layer Conv1 and a batch normalization layer BN to produce D_4, the output of the 4th layer of the MFD, where Sigmoid denotes the activation function and UP_2 denotes a 2× upsampling operation;
4.2) the above step is repeated until D_1 is generated; the weighting at each level is computed as
W_t = Sigmoid(Conv1(FU_t))   (9)
where t ∈ {1,2,3}, UP_{2^(5-t)} denotes a 2^(5-t)× upsampling operation, FU_t denotes the fused feature at level t, which contains more global information, and W_t denotes the weight matrix generated from FU_t.
5. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 5 is as follows:
5.1) the loss function is calculated: in the training stage, the network is trained with binary cross-entropy (BCE), a loss function widely used in SOD tasks. It computes the error at each pixel and is defined as:
L_bce = -Σ_(x,y) [G(x,y)·log P(x,y) + (1 - G(x,y))·log(1 - P(x,y))]   (11)
where P = {p | 0 < p < 1} ∈ R^(1×H×W) and G = {g | 0 < g < 1} ∈ R^(1×H×W) denote the predicted saliency map and the corresponding ground truth respectively, H and W denote the height and width of the input image, and L_bce is the error between the predicted and ground-truth values at each pixel.
CN202210300694.5A 2022-03-24 2022-03-24 RGB-D salient object detection method based on global context information exploration Pending CN114693951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210300694.5A CN114693951A (en) 2022-03-24 2022-03-24 RGB-D significance target detection method based on global context information exploration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210300694.5A CN114693951A (en) 2022-03-24 2022-03-24 RGB-D significance target detection method based on global context information exploration

Publications (1)

Publication Number Publication Date
CN114693951A true CN114693951A (en) 2022-07-01

Family

ID=82138691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210300694.5A Pending CN114693951A (en) 2022-03-24 2022-03-24 RGB-D significance target detection method based on global context information exploration

Country Status (1)

Country Link
CN (1) CN114693951A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117245672A (en) * 2023-11-20 2023-12-19 南昌工控机器人有限公司 Intelligent motion control system and method for modularized assembly of camera support
CN117245672B (en) * 2023-11-20 2024-02-02 南昌工控机器人有限公司 Intelligent motion control system and method for modularized assembly of camera support

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination