CN114693951A - RGB-D salient object detection method based on global context information exploration - Google Patents
- Publication number
- CN114693951A (application CN202210300694.5A)
- Authority
- CN
- China
- Prior art keywords: rgb, layer, feature, scale, features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention belongs to the field of computer vision and discloses an RGB-D salient object detection method based on global context information exploration, comprising the following steps: 1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective of the invention, and determining the training set and test set used to train and evaluate the algorithm; 2) constructing a cross-modal context feature module that extracts feature information from a stack of successive convolutional layers; 3) defining the multi-path fusion of the multi-scale features produced by the stacked convolutional layers, together with spatial-channel attention; 4) constructing a multi-scale feature decoder (MFD) that fuses the multi-scale features in a top-down aggregation strategy and generates the saliency result; 5) training the model with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks, which computes the error between the predicted value and the ground-truth value at each pixel.
Description
Technical field:
The invention relates to the fields of computer vision and image processing, and provides a novel global context exploration network (GCENet) for the RGB-D salient object detection (SOD) task, which explores the performance gain of multi-scale context features at a fine-grained level.
Background art:
Salient object detection aims at segmenting the most visually attractive objects from a given scene. As a preprocessing tool, SOD has been widely used in computer vision tasks such as image retrieval and visual tracking. Most previous SOD methods focus on RGB images, but they struggle with challenging scenes such as low-contrast environments, similar foreground and background, and complex backgrounds. With the popularization of depth-sensing devices such as the Microsoft Kinect, iPhone XR and Huawei Mate 30, acquiring RGB-D images has become practical. Since depth cues affect visual attention in addition to 2D features such as texture, orientation, and brightness, RGB-D SOD is receiving increasing attention and study. Effective use of multi-scale context features endows the features with richer global context information, helps the network understand the whole scene better, and improves the performance of RGB-D SOD networks.
Inspired by the advantages of multi-scale features, many RGB-D SOD methods exploit them to improve performance. However, they focus primarily on hierarchical multi-scale representations and cannot capture fine-grained global context cues within a single layer. In contrast to these methods, the present invention proposes a global context exploration network (GCENet) for RGB-D SOD that explores the gain of multi-scale context features at a fine-grained level. Specifically, a cross-modal context feature module (CCFM) is proposed that extracts cross-modal global features from RGB images and depth maps through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism. The fused features are then aggregated in a cascaded manner. Furthermore, the multi-scale information from multiple blocks of the backbone needs to be considered and integrated to produce the final saliency result. To this end, the invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks in a top-down aggregation.
Summary of the invention:
To address the above problems, a new global context exploration network (GCENet) is proposed for the RGB-D SOD task, together with a multi-scale feature decoder. The technical scheme is as follows:
1. obtaining RGB-D data set for training and testing the task
1.1) randomly selecting 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assigning the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
1.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images. STERE was the first such dataset proposed, containing 1000 pairs in total with low-quality depth maps.
2. Construct the cross-modal context feature module (CCFM), which extracts feature information from a stack of successive convolutional layers
2.1) A multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2) Four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, $f_i^{\alpha,0}$ is the feature of the i-th layer of the backbone network, and $i \in \{1,2,3,4,5\}$ indexes that layer;
2.3) The multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network;
2.4) The spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
2.5) The remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features. Finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers.
3. Construct the multi-scale feature decoder
In the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling. The per-level weights are computed as:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$, $UP_{2^{5-t}}$ denotes $2^{5-t}\times$ upsampling, $FU_t$ denotes the fused feature at level t, which contains more global information, and $W_t$ is the weight matrix generated from $FU_t$;
4. Compute the loss function. In the training stage, the invention trains the network with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks. It computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
Unlike most methods, which integrate the multi-scale features of the backbone in a hierarchical manner, the invention proposes a fine-grained approach that extracts and integrates multi-scale features on a single feature scale rather than across multiple feature scales, thereby capturing fine-grained global context cues within a single layer. First, a cross-modal context feature module (CCFM) is proposed, which extracts cross-modal global features from the RGB image and the depth map through a stack of convolution operators on a single feature scale, and then fuses the multi-scale multi-modal features with a multi-path fusion (MPF) mechanism; the fused features are then aggregated in a cascaded manner; finally, the invention designs a multi-scale feature decoder (MFD) that fuses the multi-scale features from multiple blocks of the backbone in a top-down aggregation, so that the multi-scale information from these blocks is considered and integrated to produce the final saliency result.
Drawings
FIG. 1 is a schematic diagram of the model structure of the present invention;
FIG. 2 is a block diagram of the cross-modal context feature module;
FIG. 3 is a diagram of multi-path fusion.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings; the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without inventive effort fall within the scope of the present invention.
Referring to fig. 1, an RGB-D salient object detection method based on global context information exploration mainly includes the following steps:
1. Acquire an RGB-D dataset for training and testing the task, define the algorithm objective of the invention, and determine the training set and test set used to train and evaluate the algorithm: randomly select 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assign the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
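For illustration, the random split described above can be written as the following minimal Python sketch. The sample lists are hypothetical placeholders for the actual (RGB, depth, ground truth) triplets, and the NLPR and DUT set sizes (1000 and 1200 pairs) are assumptions not stated in this text.

```python
import random

random.seed(0)  # fix the seed so the random split is reproducible

# Hypothetical sample identifiers standing in for (RGB, depth, GT) triplets.
nlpr = [f"NLPR/{i:04d}" for i in range(1000)]    # assumed 1000 pairs
nju2k = [f"NJU2K/{i:04d}" for i in range(1985)]  # 1985 pairs (see step 1.2)
dut = [f"DUT/{i:04d}" for i in range(1200)]      # assumed 1200 pairs

def split(samples, n_train):
    """Randomly pick n_train samples for training; the rest become test samples."""
    pool = samples[:]
    random.shuffle(pool)
    return pool[:n_train], pool[n_train:]

nlpr_tr, nlpr_te = split(nlpr, 650)
nju_tr, nju_te = split(nju2k, 1400)
dut_tr, dut_te = split(dut, 800)

train_set = nlpr_tr + nju_tr + dut_tr   # 2850 training samples in total
test_set = nlpr_te + nju_te + dut_te    # remainders; STERE etc. are added whole
```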
2. Construct the cross-modal context feature module (CCFM), which extracts feature information from a stack of successive convolutional layers
2.1 A multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation. In addition, to reduce redundant information and non-salient features during cross-channel integration, the invention uses a spatial-channel attention mechanism to filter out unwanted information;
2.2 Four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, $f_i^{\alpha,0}$ is the feature of the i-th layer of the backbone network, and $i \in \{1,2,3,4,5\}$ indexes that layer;
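A minimal PyTorch sketch of this stacked-convolution extraction is given below; the channel width of 64 and the BatchNorm/ReLU placement are assumptions, and one stack is instantiated per modality.

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    """Four successive 3x3 convolutions: each output f^{alpha,j} has a larger
    receptive field than the last, yielding multi-scale context features on a
    single feature scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(4)
        )

    def forward(self, f0):
        """f0 is the i-th backbone feature f^{alpha,0}; returns [f^1, ..., f^4]."""
        feats, f = [], f0
        for conv in self.convs:  # f^{alpha,j} = Conv3(f^{alpha,j-1})
            f = conv(f)
            feats.append(f)
        return feats

rgb_stack, depth_stack = ConvStack(), ConvStack()   # one stack per modality
rgb_feats = rgb_stack(torch.randn(1, 64, 56, 56))   # dummy backbone feature
```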
2.3 The multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network; a plausible arrangement is sketched below;
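The three fusion paths are named above, but their exact composition is not reproduced here; the sketch below shows one plausible arrangement in PyTorch, with the merging 3 × 3 convolution being an assumption.

```python
import torch
import torch.nn as nn

class MultiPathFusion(nn.Module):
    """Plausible MPF sketch: three parallel cross-modal paths (O_ad, O_ml,
    O_ct) merged back to the original channel width; the merging convolution
    is an assumption, not taken from the text."""
    def __init__(self, channels=64):
        super().__init__()
        # add path (C) + mul path (C) + concat path (2C) = 4C input channels
        self.merge = nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_rgb, f_depth):
        o_ad = f_rgb + f_depth                      # O_ad: element-wise addition
        o_ml = f_rgb * f_depth                      # O_ml: element-wise multiplication
        o_ct = torch.cat([f_rgb, f_depth], dim=1)   # O_ct: concatenation
        return self.merge(torch.cat([o_ad, o_ml, o_ct], dim=1))

mpf = MultiPathFusion()
fused = mpf(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```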
2.4 The spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
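A compact sketch of the spatial-channel attention follows; the internal designs (a 7 × 7 spatial mask, squeeze-and-excitation channel weights) and the SA-before-CA ordering are assumptions chosen to match the name.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SA: a spatial mask from channel-wise mean and max maps (assumed design)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ChannelAttention(nn.Module):
    """CA: squeeze-and-excitation style channel reweighting (assumed design)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

sa, ca = SpatialAttention(), ChannelAttention()
f_sc = ca(sa(torch.randn(1, 64, 56, 56)))  # F^sc = CA(SA(F^mpf))
```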
2.5 The remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features. Finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers; one plausible realization is sketched below.
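The guidance equations are not reproduced here, so the sketch below shows one common realization under stated assumptions: the deepest fused feature, which carries the most global context, gates each shallower output through a sigmoid while a residual branch keeps the original signal.

```python
import torch
import torch.nn.functional as F

def global_guidance(fused):
    """fused: list of per-layer fused features [F_1, ..., F_4], shallow to deep.
    The deepest feature reweights each shallower one (assumed realization)."""
    guide = fused[-1]
    out = []
    for f in fused[:-1]:
        g = F.interpolate(guide, size=f.shape[2:], mode="bilinear",
                          align_corners=False)
        out.append(f * torch.sigmoid(g) + f)  # gated refinement + residual
    out.append(guide)
    return out

feats = [torch.randn(1, 64, s, s) for s in (56, 28, 14, 7)]
guided = global_guidance(feats)
```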
3. Construct the multi-scale feature decoder
3.1 In the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling;
3.2 Decoding then proceeds level by level in the same manner until the final saliency prediction is produced, with the per-level weights expressed by the following formula:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$, $UP_{2^{5-t}}$ denotes $2^{5-t}\times$ upsampling, $FU_t$ denotes the fused feature at level t, which contains more global information, and $W_t$ is the weight matrix generated from $FU_t$;
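To put Eq. (9) in context, the following sketch assembles a minimal top-down decoder; the exact wiring between levels is not fully reproduced above, so the weighted aggregation around $W_t = \mathrm{Sigmoid}(\mathrm{Conv1}(FU_t))$ is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFD(nn.Module):
    """Top-down multi-scale feature decoder sketch. Each deeper fused feature
    FU_t is upsampled, turned into a weight map W_t = Sigmoid(Conv1(FU_t)),
    and used to reweight the shallower feature before aggregation."""
    def __init__(self, channels=64):
        super().__init__()
        self.w_conv = nn.ModuleList(nn.Conv2d(channels, 1, kernel_size=1)
                                    for _ in range(3))        # Conv1 per level t
        self.bn = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(3))
        self.head = nn.Conv2d(channels, 1, kernel_size=1)     # saliency logits

    def forward(self, feats):
        """feats: [F_1, ..., F_4] from shallow to deep (largest map first)."""
        fu = feats[-1]
        for t in range(len(feats) - 2, -1, -1):               # top-down: t = 2, 1, 0
            up = F.interpolate(fu, size=feats[t].shape[2:], mode="bilinear",
                               align_corners=False)           # upsample FU_t
            w = torch.sigmoid(self.w_conv[t](up))             # W_t = Sigmoid(Conv1(FU_t))
            fu = self.bn[t](feats[t] * w + up)                # weighted aggregation + BN
        return self.head(fu)

mfd = MFD()
sal = mfd([torch.randn(1, 64, s, s) for s in (56, 28, 14, 7)])  # -> [1, 1, 56, 56]
```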
4. Compute the loss function. In the training stage, the invention trains the network with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks. It computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
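Finally, the BCE objective can be exercised with PyTorch's built-in implementation; the batch dimension and the use of a logits-based variant are additions for the example.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred_logits, gt):
    """Pixel-wise binary cross entropy between the predicted saliency map P
    and the ground truth G, averaged over the H x W map."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt)

pred = torch.randn(2, 1, 256, 256, requires_grad=True)  # raw saliency logits
gt = torch.randint(0, 2, (2, 1, 256, 256)).float()      # binary ground truth
loss = bce_loss(pred, gt)
loss.backward()  # gradients flow back through the network during training
```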
The above description is for the purpose of illustrating preferred embodiments of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (5)
1. An RGB-D salient object detection method based on global context information exploration, characterized by comprising the following steps:
1) acquiring RGB-D datasets for training and testing the task, defining the algorithm objective, and determining the training set and test set used to train and evaluate the algorithm;
2) constructing a cross-modal context feature module that extracts feature information from a stack of successive convolutional layers;
3) defining the multi-path fusion of the multi-scale features produced by the stacked convolutional layers, together with spatial-channel attention;
4) constructing a multi-scale feature decoder that fuses the multi-scale features in a top-down aggregation strategy and generates the saliency result;
5) training the model with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks, which computes the error between the predicted value and the ground-truth value at each pixel.
2. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 2 is as follows:
2.1) randomly selecting 650 samples of the NLPR dataset, 1400 samples of the NJU2K dataset and 800 samples of the DUT dataset as the training set, and assigning the remaining samples of these three datasets, together with the samples of the STERE and RGBD datasets, to the test set;
2.2) NJU2K contains 1985 pairs of RGB images and depth maps, where the depth maps are estimated from stereo images; STERE was the first such dataset proposed, containing 1000 pairs in total with low-quality depth maps.
3. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 3 is as follows:
3.1) a multi-path fusion (MPF) strategy is proposed that fuses cross-modal features using a cooperative set of element-wise operations, including element-wise addition, element-wise multiplication, and concatenation; in addition, to reduce redundant information and non-salient features during cross-channel integration, a spatial-channel attention mechanism is used to filter out unwanted information;
3.2) four RGB features $f_i^{R,j}$ and depth features $f_i^{D,j}$ are extracted from a stack of successive convolutional layers:

$$f_i^{\alpha,j} = \mathrm{Conv3}\big(f_i^{\alpha,j-1}\big), \quad j \in \{1,2,3,4\}, \quad \alpha \in \{R,D\}$$

where Conv3 denotes a convolution with a 3 × 3 kernel, $f_i^{\alpha,1}, \ldots, f_i^{\alpha,4}$ denote the outputs of the four successive convolutional layers, and $i \in \{1,2,3,4,5\}$ indexes the i-th layer of the backbone network;
3.3) the multi-path fusion (MPF) of the multi-scale features is calculated by combining three parallel paths, where $O_{ad}$, $O_{ml}$ and $O_{ct}$ denote element-wise addition, element-wise multiplication and concatenation respectively, and $f_i^R$ and $f_i^D$ are the RGB and depth features of the i-th layer of the CCFM;
3.4) the spatial-channel attention is defined as follows:

$$F_i^{sc} = \mathrm{CA}\big(\mathrm{SA}(F_i^{mpf})\big)$$

where SA and CA denote spatial attention and channel attention respectively, and $F_i^{sc}$ is the enhanced feature obtained by applying spatial-channel attention to the MPF output of the i-th layer;
3.5) the remaining layers of the MPF perform the same procedure as the first layer, yielding three additional fused features; finally, a high-level global-information guidance mechanism is employed to enhance the correlation between the outputs of the different convolutional layers.
4. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 4 is as follows: in the decoder, BN is a batch normalization layer, Conv1 denotes a 1 × 1 convolutional layer used to convert the number of channels, $F_k^{mfd}$ is the output of the k-th MFD layer, $W_4$ is the weight matrix generated from the top-level fused feature, Sigmoid denotes the activation function, and $UP_2$ denotes 2× upsampling; the per-level weights are computed as:

$$W_t = \mathrm{Sigmoid}\big(\mathrm{Conv1}(FU_t)\big) \tag{9}$$

where $t \in \{1,2,3\}$ and $FU_t$ denotes the fused feature at level t.
5. The RGB-D salient object detection method based on global context information exploration according to claim 1, characterized in that the specific method of step 5 is as follows:
5.1) computing the loss function: in the training stage, the network is trained with binary cross entropy (BCE), a loss function ubiquitous in SOD tasks; it computes the error at each pixel and is defined as:

$$L_{bce} = -\sum_{(x,y)} \big[ G(x,y)\log P(x,y) + (1 - G(x,y))\log(1 - P(x,y)) \big]$$

where $P = \{p \mid 0 < p < 1\} \in \mathbb{R}^{1 \times H \times W}$ and $G = \{g \mid 0 < g < 1\} \in \mathbb{R}^{1 \times H \times W}$ denote the predicted saliency map and the corresponding ground truth respectively, $H$ and $W$ denote the height and width of the input image, and $L_{bce}$ is the error between the predicted and ground-truth values at each pixel.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210300694.5A (published as CN114693951A) | 2022-03-24 | 2022-03-24 | RGB-D salient object detection method based on global context information exploration
Publications (1)
Publication Number | Publication Date |
---|---|
CN114693951A | 2022-07-01
Family
ID=82138691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210300694.5A (pending, published as CN114693951A) | RGB-D salient object detection method based on global context information exploration | 2022-03-24 | 2022-03-24
Country Status (1)
Country | Link |
---|---|
CN | CN114693951A
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117245672A | 2023-11-20 | 2023-12-19 | 南昌工控机器人有限公司 | Intelligent motion control system and method for modularized assembly of camera support
CN117245672B | 2023-11-20 | 2024-02-02 | 南昌工控机器人有限公司 | Intelligent motion control system and method for modularized assembly of camera support
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |