CN116704174A - RGB-D image salient object detection method based on deep learning - Google Patents

RGB-D image salient object detection method based on deep learning

Info

Publication number
CN116704174A
CN116704174A (application CN202310668228.7A)
Authority
CN
China
Prior art keywords
rgb
encoder
image
depth
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310668228.7A
Other languages
Chinese (zh)
Inventor
张继勇
戚媛媛
周晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202310668228.7A
Publication of CN116704174A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based RGB-D image salient object detection method, which comprises the following steps: S1, constructing an encoder to acquire multi-level features, specifically comprising RGB-branch and depth-branch feature extraction and the construction of an interaction attention module; S2, constructing a decoder module, specifically comprising a cross-level feature fusion module for the RGB and depth branches and a cross-modal feature fusion module; S3, constructing a deep-learning-based RGB-D image salient object detection model from the encoder and the decoder; S4, training the established model and saving its parameters. The invention improves the detection capability of the model by comprehensively exploring cross-modal feature fusion.

Description

RGB-D image salient object detection method based on deep learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a deep-learning-based RGB-D image salient object detection method.
Background
RGB-D images comprise an RGB image, which provides rich appearance and color information, and a depth image, which provides additional spatial cues. Each pixel value of the depth image represents the actual distance between the sensor and the object, and there is typically a one-to-one correspondence between the pixels of the RGB image and those of the depth image.
Image salient object detection simulates the human visual system to detect salient objects or regions, and previous salient object detection work mainly processed RGB images. Although algorithms for image salient object detection are continually improving toward processing mechanisms as capable as the human visual system, many problems remain in complex scenes: when the object and background colors are close, or when the object is small relative to the background, it is difficult to detect the salient object accurately from the RGB image alone. With the development of three-dimensional perception sensing technology, not only the shape and color information of an object but also its spatial position information can be obtained, further improving the perception of the scene. Depth information complements RGB image information, is of great significance for image salient object detection, and can effectively improve the accuracy of detection and recognition results. For the same scene, different data sources provide additional information in different modalities, making the scene representation richer and more comprehensive; therefore, better salient object detection results can be obtained by jointly considering the fusion of RGB image information and depth image information.
According to the feature extraction strategy, RGB-D image salient object detection can be roughly divided into research based on traditional models and research based on deep learning models. Traditional RGB-D image salient object detection methods mainly rely on prior knowledge to design object features and extract handcrafted features. While good results have been achieved, this relies heavily on the designer's prior knowledge and requires manual adjustment in different situations, making it less generalizable in complex scenes. In view of the various shortcomings of traditional handcrafted-feature methods, more and more researchers have begun to apply neural networks to RGB-D image salient object detection research.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deep-learning-based RGB-D image salient object detection method.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
A method for detecting an RGB-D image salient object based on deep learning comprises the following steps:
S1, acquiring an image dataset and preprocessing it;
S2, constructing and training an RGB-D image salient object detection model based on deep learning, wherein the model comprises an encoder and a decoder;
the encoder comprises an RGB encoder, a depth encoder, and an interaction attention module at the tail of the RGB encoder and the depth encoder;
the decoder comprises a cross-level feature fusion module, a cross-modal fusion module, and a convolution layer with a 3×3 kernel;
S3, performing salient object detection through the trained RGB-D image salient object detection model based on deep learning:
S3-1, extracting corresponding-level encoder features through the RGB encoder and the depth encoder respectively, wherein the multi-level encoder features of the RGB image and the depth image are denoted f_i^r and f_i^d respectively, where r denotes the RGB image, d denotes the depth image, and i denotes the feature level;
S3-2, strengthening the relationship between the multi-level encoder features of the RGB image and the depth image through the interaction attention module in the encoder to obtain a fusion feature f_5^{rd};
S3-3, taking the fusion feature f_5^{rd} as the input of the cross-level feature fusion module to obtain the RGB-branch and depth-branch decoding features;
S3-4, inputting the RGB-branch and depth-branch decoding features of each layer into the cross-modal feature fusion module of the corresponding level to obtain f_i^{rgbd}; finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}; the saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained.
Preferably, in step S1, the image dataset is preprocessed as follows: data expansion is performed by random flipping, rotation, and multi-scale input.
Preferably, the multi-scale input mode randomly adjusts the image size to 128×128, 256×256, and 352×352.
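As an illustrative sketch only (the patent publishes no code), the preprocessing described above could look as follows in PyTorch/torchvision; the rotation range and the helper name `augment` are assumptions, while the flip, rotation, and the three target sizes come from the text:

```python
import random
import torchvision.transforms.functional as TF

def augment(rgb, depth, gt):
    # Random horizontal flip, applied identically to RGB, depth, and ground truth.
    if random.random() < 0.5:
        rgb, depth, gt = TF.hflip(rgb), TF.hflip(depth), TF.hflip(gt)
    # Random rotation; the +/-15 degree range is an assumption.
    angle = random.uniform(-15, 15)
    rgb, depth, gt = (TF.rotate(t, angle) for t in (rgb, depth, gt))
    # Multi-scale input: one of the three sizes named in the text.
    size = random.choice([128, 256, 352])
    return tuple(TF.resize(t, [size, size]) for t in (rgb, depth, gt))
```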
Preferably, the RGB encoder and the depth encoder employ a ResNet50 backbone network to extract the multi-level encoder features of the RGB image and the depth image.
Preferably, the interactive attention module comprises a self-attention mechanism, a cross-attention layer and a feed-forward network.
Preferably, in step S2, the RGB-D image salient object detection model based on deep learning is trained as follows: the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total;
in training the network, a binary cross entropy loss function is used for optimization, and the final loss function is defined as:
Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
Preferably, the fusion feature f_5^{rd} is obtained through the interaction attention module as follows: first, the input fifth-layer RGB encoder feature and depth encoder feature are divided into a number of 1×1×128 tokens; second, a self-attention mechanism is applied to the RGB encoder feature for feature self-enhancement; a cross-attention layer is then introduced between the self-enhanced RGB encoder feature and the associated depth encoder feature; finally, a feed-forward network is adopted to obtain the fusion feature f_5^{rd}, expressed as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))

where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network.
Preferably, in the cross-level feature fusion module, two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features; the cross-level feature fusion module comprises an RGB branch and a depth branch, yielding the RGB-branch and depth-branch decoding features respectively.
Preferably, in the cross-modal fusion module, a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain a fused RGB feature f_i^r, and the fused depth feature f_i^d is obtained in the same way from the depth decoder feature; the output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}; for the fifth-level cross-modal fusion module, the feature f_5^{rd} refined by the interaction attention module replaces the previous-level output; finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}, specifically:

f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
The invention has the following characteristics and beneficial effects:
(1) An interaction attention module is added at the encoder stage; the RGB features are enhanced using both the RGB features and the depth features and are then input into the subsequent decoder module, realizing effective fusion of cross-modal information while reducing the amount of computation to a certain extent.
(2) At the decoder stage, a dual-branch structure is adopted: the RGB encoder features and the depth encoder features are first decoded separately, and the decoded information is input into the cross-modal fusion module of the corresponding level to explore a more comprehensive and deeper cross-modal fusion. A channel attention mechanism is also introduced into the cross-modal fusion module to enhance the fused features.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a diagram of an overall network framework in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of an interactive attention module in an embodiment of the invention.
FIG. 3 is a block diagram of a cross-modality fusion module in an embodiment of the invention.
FIG. 4 is a graph showing the results of an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a deep-learning-based RGB-D image salient object detection method, as shown in fig. 1, comprising the following steps:
s1, constructing an encoder to obtain multi-level features;
S1-1, RGB-branch and depth-branch feature extraction. The multi-level features of the RGB image and the depth image are denoted f_i^r and f_i^d respectively.
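For illustration, a minimal sketch of multi-level feature extraction from a ResNet50 backbone (the backbone named in S4) is given below; the split into five stages follows common practice and is an assumption, as are all names. The depth branch would apply an identical extractor to the depth map (commonly replicated to three channels):

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiLevelBackbone(nn.Module):
    """Splits ResNet50 into five stages whose outputs serve as f_1..f_5."""
    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)  # ImageNet initialization (see S4)
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu)  # 1/2 res.
        self.stage2 = nn.Sequential(net.maxpool, net.layer1)       # 1/4
        self.stage3 = net.layer2                                   # 1/8
        self.stage4 = net.layer3                                   # 1/16
        self.stage5 = net.layer4                                   # 1/32

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3,
                      self.stage4, self.stage5):
            x = stage(x)
            feats.append(x)
        return feats  # [f_i for i = 1..5], for either the RGB or depth branch
```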
S1-2, constructing the interaction attention module. As shown in fig. 2, first, the input fifth-layer RGB encoder feature and depth encoder feature are divided into several 1×1×128 tokens. Second, a self-attention mechanism is applied to the RGB encoder features to fully mine the relationships within them. A cross-attention layer is introduced between the self-enhanced RGB encoder features and the associated depth encoder features to further explore the relationship between the two modalities. Finally, a feed-forward layer is employed. The whole process is described as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))
where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network. Both the self-attention and cross-attention mechanisms use efficient attention to reduce memory and computational costs. Each attention mechanism is followed by a residual connection and a normalization operation. In the intra-feature self-attention mechanism, the RGB encoder feature serves as the query, key, and value, and the self-enhanced RGB encoder feature is output. In the inter-feature cross-attention mechanism, the self-enhanced RGB encoder feature serves as the query and the depth encoder feature as the key and value. Positional encodings are also introduced to avoid losing positional information among the tokens when computing the attention weights.
By stacking the above architecture L=2 times, the final RGB encoder feature both strengthens its intra-feature relationships and is further refined by the depth encoder features.
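The following is a hedged PyTorch sketch of one interaction attention block and the L=2 stack. It substitutes standard `nn.MultiheadAttention` for the efficient attention named above, omits positional encodings for brevity, and assumes the fifth-layer features have already been projected to 128-channel tokens; all names are illustrative:

```python
import torch.nn as nn

class InteractionAttentionBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(inplace=True),
            nn.Linear(4 * dim, dim))

    def forward(self, rgb, depth):
        # SA: self-enhancement of the RGB tokens (query = key = value).
        sa_out, _ = self.self_attn(rgb, rgb, rgb)
        rgb = self.norm1(rgb + sa_out)                 # residual + normalization
        # CA: self-enhanced RGB as query, depth tokens as key and value.
        ca_out, _ = self.cross_attn(rgb, depth, depth)
        fused = self.norm2(rgb + ca_out)
        # FF: feed-forward network, again with residual + normalization.
        return self.norm3(fused + self.ffn(fused))

class InteractionAttention(nn.Module):
    """Stacks L = 2 blocks, yielding the fusion feature f_5^{rd}."""
    def __init__(self, dim=128, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [InteractionAttentionBlock(dim) for _ in range(num_blocks)])

    def forward(self, rgb_tokens, depth_tokens):
        for blk in self.blocks:
            rgb_tokens = blk(rgb_tokens, depth_tokens)
        return rgb_tokens
```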
S2, constructing a decoder module;
S2-1, constructing the RGB-branch and depth-branch cross-level feature fusion module. Two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features, thereby constituting the cross-level feature fusion module.
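A minimal sketch of one cross-level fusion step follows, under the assumption that the "cascade fusion" above means channel-wise concatenation of the current encoder feature with the upsampled previous-level output; channel sizes and names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    def __init__(self, enc_channels, prev_channels, out_channels):
        super().__init__()
        # Two convolution layers, as stated in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(enc_channels + prev_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, enc_feat, prev_out):
        # Bring the previous-level decoder output to the current resolution.
        prev_out = F.interpolate(prev_out, size=enc_feat.shape[2:],
                                 mode='bilinear', align_corners=False)
        return self.conv(torch.cat([enc_feat, prev_out], dim=1))
```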
S2-2, constructing the cross-modal feature fusion module. As shown in fig. 3, first, a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain the fused RGB feature f_i^r; the fused depth feature f_i^d is obtained in the same way. The output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}. For the fifth-level cross-modal fusion module, which lacks a previous-level output, the feature f_5^{rd} refined by the interaction attention module is used instead. Finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}. Specifically:
f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
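A hedged sketch of this module is given below; the SE-style channel attention and all channel counts are assumptions, since the text fixes only the overall formula f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}])):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Assumed squeeze-and-excitation style; the text does not fix its form."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class CrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse_r = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_d = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv = nn.Conv2d(4 * channels, channels, 3, padding=1)
        self.ca = ChannelAttention(channels)

    def forward(self, dec_r, enc_r, dec_d, enc_d, prev):
        # prev: previous-level cross-modal output (f_5^{rd} at the fifth level),
        # brought to the current spatial size.
        prev = F.interpolate(prev, size=dec_r.shape[2:], mode='bilinear',
                             align_corners=False)
        f_r = self.fuse_r(torch.cat([dec_r, enc_r], dim=1))   # fused f_i^r
        f_d = self.fuse_d(torch.cat([dec_d, enc_d], dim=1))   # fused f_i^d
        f_r_p = torch.cat([f_r, prev], dim=1)                 # f_i^{r'}
        f_d_p = torch.cat([f_d, prev], dim=1)                 # f_i^{d'}
        # f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))
        return self.ca(self.conv(torch.cat([f_r_p, f_d_p], dim=1)))
```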
S3, constructing the deep-learning-based RGB-D image salient object detection model from the encoder and the decoder. The RGB features f_i^r and the depth features f_i^d extracted hierarchically by the encoder from the input images are fed into the decoder together with the fusion feature f_5^{rd} obtained by the interaction attention module. In the decoder, the RGB-branch and depth-branch decoding features are obtained through the cross-level feature fusion module of S2-1, and the decoder features of each layer are then input into the cross-modal feature fusion module of S2-2 to obtain f_i^{rgbd}. Finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}. The saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained. When training the network, a binary cross-entropy (BCE) loss function is used to optimize the RGB branch, the depth branch, and the RGB-D branch simultaneously. The final loss function is defined as:
Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
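As a sketch, this three-branch loss maps directly onto PyTorch's built-in binary cross entropy, assuming sigmoid-activated predictions; the function name is illustrative:

```python
import torch.nn.functional as F

def total_loss(s_r, s_d, s_rgbd, g):
    """Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G).

    s_* are predicted saliency maps in [0, 1]; g is the ground-truth map G.
    """
    return (F.binary_cross_entropy(s_r, g)
            + F.binary_cross_entropy(s_d, g)
            + F.binary_cross_entropy(s_rgbd, g))
```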
S4, training the established model and saving the parameters. The invention is implemented in PyTorch and trained on a 2080 Ti GPU. Data expansion is performed by random flipping, rotation, and a multi-scale input strategy that randomly adjusts the image size to 128×128, 256×256, and 352×352. During training, ResNet50 serves as the backbone network of the encoder stage; the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total.
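The training configuration described above could be sketched as follows; `model` and `train_loader` are placeholders, `total_loss` is the loss sketched earlier, and the step-decay factor of 0.1 is an assumption (the text says only that the learning rate is adjusted every 40 rounds):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(150):                      # 150 epochs in total
    for rgb, depth, gt in train_loader:       # batch size 8
        optimizer.zero_grad()
        s_r, s_d, s_rgbd = model(rgb, depth)  # three-branch predictions
        loss = total_loss(s_r, s_d, s_rgbd, gt)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # adjust the lr every 40 epochs
```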
S5, inputting the image data to be detected into the trained RGB-D image salient object detection model based on deep learning, which outputs the final saliency prediction map S^{rgbd} for the image data to be detected.
Fig. 4 is a comparison chart of results: the first column is the RGB image, the second column the depth image, the third column the ground-truth map, and the fourth column the result of the method of the present invention. Comparison shows that the final saliency prediction map S^{rgbd} produced by the scheme of this embodiment is closest to the ground-truth map in the third column.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (9)

1. A method for detecting an RGB-D image salient object based on deep learning, characterized by comprising the following steps:
S1, acquiring an image dataset and preprocessing it;
S2, constructing and training an RGB-D image salient object detection model based on deep learning, wherein the model comprises an encoder and a decoder;
the encoder comprises an RGB encoder, a depth encoder, and an interaction attention module at the tail of the RGB encoder and the depth encoder;
the decoder comprises a cross-level feature fusion module, a cross-modal fusion module, and a convolution layer with a 3×3 kernel;
S3, performing salient object detection through the trained RGB-D image salient object detection model based on deep learning:
S3-1, extracting corresponding-level encoder features through the RGB encoder and the depth encoder respectively, the multi-level encoder features of the RGB image and the depth image being denoted f_i^r and f_i^d respectively, where r denotes the RGB image, d denotes the depth image, and i denotes the feature level;
S3-2, strengthening the relationship between the multi-level encoder features of the RGB image and the depth image through the interaction attention module in the encoder to obtain a fusion feature f_5^{rd};
S3-3, taking the fusion feature f_5^{rd} as the input of the cross-level feature fusion module to obtain the RGB-branch and depth-branch decoding features;
S3-4, inputting the RGB-branch and depth-branch decoding features of each layer into the cross-modal feature fusion module of the corresponding level to obtain f_i^{rgbd}; finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}; the saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained.
2. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein in step S1 the image dataset is preprocessed as follows: data expansion is performed by random flipping, rotation, and multi-scale input.
3. The method for detecting an RGB-D image salient object based on deep learning according to claim 2, wherein the multi-scale input randomly adjusts the image size to 128×128, 256×256, and 352×352.
4. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein the RGB encoder and the depth encoder employ a ResNet50 backbone network to extract the multi-level encoder features of the RGB image and the depth image.
5. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein the interaction attention module comprises a self-attention mechanism, a cross-attention layer, and a feed-forward network.
6. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein in step S2 the RGB-D image salient object detection model based on deep learning is trained as follows: the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total;
in training the network, a binary cross-entropy loss function is used for optimization, and the final loss function is defined as:

Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
7. The method for detecting an RGB-D image salient object based on deep learning according to claim 5, wherein the fusion feature f_5^{rd} is obtained through the interaction attention module as follows: the input fifth-layer RGB encoder feature and depth encoder feature are divided into a number of 1×1×128 tokens; a self-attention mechanism is applied to the RGB encoder feature for feature self-enhancement; a cross-attention layer is introduced between the self-enhanced RGB encoder feature and the associated depth encoder feature; and a feed-forward network is adopted to obtain the fusion feature f_5^{rd}, expressed as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))

where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network.
8. The method for detecting an RGB-D image salient object based on deep learning according to claim 4, wherein in the cross-level feature fusion module two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features, the cross-level feature fusion module comprising an RGB branch and a depth branch to obtain the RGB-branch and depth-branch decoding features respectively.
9. The method for detecting an RGB-D image salient object based on deep learning according to claim 8, wherein in the cross-modal fusion module a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain a fused RGB feature f_i^r, and the fused depth feature f_i^d is obtained in the same way from the depth decoder feature; the output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}; for the fifth-level cross-modal fusion module, the feature f_5^{rd} refined by the interaction attention module replaces the previous-level output; finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}, specifically:

f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
CN202310668228.7A | Priority date 2023-06-07 | Filing date 2023-06-07 | RGB-D image salient object detection method based on deep learning | Status: Pending | Publication: CN116704174A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310668228.7A | 2023-06-07 | 2023-06-07 | RGB-D image salient object detection method based on deep learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310668228.7A | 2023-06-07 | 2023-06-07 | RGB-D image salient object detection method based on deep learning

Publications (1)

Publication Number | Publication Date
CN116704174A | 2023-09-05

Family

ID=87838676

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310668228.7A | RGB-D image salient object detection method based on deep learning | 2023-06-07 | 2023-06-07

Country Status (1)

Country Link
CN (1) CN116704174A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117854009A * | 2024-01-29 | 2024-04-09 | Nantong University | Cross-collaboration fusion light-weight cross-modal crowd counting method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination