CN116704174A - RGB-D image salient object detection method based on deep learning - Google Patents

RGB-D image salient object detection method based on deep learning

Info

Publication number
CN116704174A
CN116704174A (application CN202310668228.7A)
Authority
CN
China
Prior art keywords
rgb
encoder
image
depth
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310668228.7A
Other languages
Chinese (zh)
Inventor
张继勇
戚媛媛
周晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202310668228.7A
Publication of CN116704174A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based RGB-D image salient object detection method, which comprises the following steps: S1, constructing an encoder to acquire multi-level features, specifically comprising RGB-branch and depth-branch feature extraction and the construction of an interaction attention module; S2, constructing a decoder module, specifically comprising a cross-level feature fusion module for the RGB and depth branches and a cross-modal feature fusion module; S3, constructing a deep-learning-based RGB-D image salient object detection model from the encoder and the decoder; S4, training the established model and saving its parameters. The invention improves the detection capability of the model by comprehensively exploring cross-modal feature fusion.

Description

RGB-D image salient object detection method based on deep learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a deep-learning-based RGB-D image salient object detection method.
Background
RGB-D images comprise an RGB image, which provides rich appearance and color information, and a depth image, which provides additional spatial cues. Each pixel value of the depth image represents the actual distance between the sensor and the object, and there is typically a one-to-one correspondence between the pixels of the RGB image and those of the depth image.
Image salient object detection simulates the human visual system to detect salient objects or regions, and previous salient object detection work mainly processed RGB images. Although algorithms for image salient object detection are continually improving toward processing mechanisms as capable as the human visual system, many problems remain in complex scenes: when the object and background colors are close, or when the object is small relative to the background, it is difficult to detect the salient object accurately from the RGB image alone. With the development of three-dimensional perception sensing technology, not only the shape and color information of an object but also its spatial position information can be obtained, further improving the perception of the scene. Depth information complements RGB image information, is of great significance for image salient object detection, and can effectively improve the accuracy of detection and recognition results. For the same scene, different data sources provide additional information in different modalities, making the scene representation richer and more comprehensive; therefore, better salient object detection results can be obtained by jointly considering the fusion of RGB image information and depth image information.
According to the feature extraction strategy, RGB-D image salient object detection can be roughly divided into research based on traditional models and research based on deep learning models. Traditional RGB-D image salient object detection methods mainly rely on prior knowledge to design object features and extract handcrafted features. While good results have been achieved, this relies heavily on the designer's prior knowledge and requires manual adjustment in different situations, making it less generalizable in complex scenes. In view of the various shortcomings of traditional handcrafted-feature methods, more and more researchers have begun to apply neural networks to RGB-D image salient object detection research.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deep-learning-based RGB-D image salient object detection method.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
A method for detecting an RGB-D image salient object based on deep learning comprises the following steps:
S1, acquiring an image dataset and preprocessing it;
S2, constructing and training an RGB-D image salient object detection model based on deep learning, wherein the model comprises an encoder and a decoder;
the encoder comprises an RGB encoder, a depth encoder, and an interaction attention module at the tail of the RGB encoder and the depth encoder;
the decoder comprises a cross-level feature fusion module, a cross-modal fusion module, and a convolution layer with a 3×3 kernel;
S3, performing salient object detection through the trained RGB-D image salient object detection model based on deep learning:
S3-1, extracting corresponding-level encoder features through the RGB encoder and the depth encoder respectively, wherein the multi-level encoder features of the RGB image and the depth image are denoted f_i^r and f_i^d respectively, where r denotes the RGB image, d denotes the depth image, and i denotes the feature level;
S3-2, strengthening the relationship between the multi-level encoder features of the RGB image and the depth image through the interaction attention module in the encoder to obtain a fusion feature f_5^{rd};
S3-3, taking the fusion feature f_5^{rd} as the input of the cross-level feature fusion module to obtain the RGB-branch and depth-branch decoding features;
S3-4, inputting the RGB-branch and depth-branch decoding features of each layer into the cross-modal feature fusion module of the corresponding level to obtain f_i^{rgbd}; finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}; the saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained.
Preferably, in step S1, the image dataset is preprocessed as follows: data expansion is performed by random flipping, rotation, and multi-scale input.
Preferably, the multi-scale input mode randomly adjusts the image size to 128×128, 256×256, and 352×352.
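As an illustrative sketch only (the patent publishes no code), the preprocessing described above could look as follows in PyTorch/torchvision; the rotation range and the helper name `augment` are assumptions, while the flip, rotation, and the three target sizes come from the text:

```python
import random
import torchvision.transforms.functional as TF

def augment(rgb, depth, gt):
    # Random horizontal flip, applied identically to RGB, depth, and ground truth.
    if random.random() < 0.5:
        rgb, depth, gt = TF.hflip(rgb), TF.hflip(depth), TF.hflip(gt)
    # Random rotation; the +/-15 degree range is an assumption.
    angle = random.uniform(-15, 15)
    rgb, depth, gt = (TF.rotate(t, angle) for t in (rgb, depth, gt))
    # Multi-scale input: one of the three sizes named in the text.
    size = random.choice([128, 256, 352])
    return tuple(TF.resize(t, [size, size]) for t in (rgb, depth, gt))
```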
Preferably, the RGB encoder and the depth encoder employ a ResNet50 backbone network to extract the multi-level encoder features of the RGB image and the depth image.
Preferably, the interactive attention module comprises a self-attention mechanism, a cross-attention layer and a feed-forward network.
Preferably, in step S2, the RGB-D image salient object detection model based on deep learning is trained as follows: the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total;
in training the network, a binary cross entropy loss function is used for optimization, and the final loss function is defined as:
Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
Preferably, the fusion feature f_5^{rd} is obtained through the interaction attention module as follows: first, the input fifth-layer RGB encoder feature and depth encoder feature are divided into a number of 1×1×128 tokens; second, a self-attention mechanism is applied to the RGB encoder feature for feature self-enhancement; a cross-attention layer is then introduced between the self-enhanced RGB encoder feature and the associated depth encoder feature; finally, a feed-forward network is adopted to obtain the fusion feature f_5^{rd}, expressed as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))

where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network.
Preferably, in the cross-level feature fusion module, two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features; the cross-level feature fusion module comprises an RGB branch and a depth branch, yielding the RGB-branch and depth-branch decoding features respectively.
Preferably, in the cross-modal fusion module, a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain a fused RGB feature f_i^r, and the fused depth feature f_i^d is obtained in the same way from the depth decoder feature; the output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}; for the fifth-level cross-modal fusion module, the feature f_5^{rd} refined by the interaction attention module replaces the previous-level output; finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}, specifically:

f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
The invention has the following characteristics and beneficial effects:
(1) An interaction attention module is added at the encoder stage; the RGB features are enhanced using both the RGB features and the depth features and are then input into the subsequent decoder module, realizing effective fusion of cross-modal information while reducing the amount of computation to a certain extent.
(2) At the decoder stage, a dual-branch structure is adopted: the RGB encoder features and the depth encoder features are first decoded separately, and the decoded information is input into the cross-modal fusion module of the corresponding level to explore a more comprehensive and deeper cross-modal fusion. A channel attention mechanism is also introduced into the cross-modal fusion module to enhance the fused features.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a diagram of an overall network framework in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of an interactive attention module in an embodiment of the invention.
FIG. 3 is a block diagram of a cross-modality fusion module in an embodiment of the invention.
FIG. 4 is a graph showing the results of an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a deep-learning-based RGB-D image salient object detection method, as shown in fig. 1, comprising the following steps:
s1, constructing an encoder to obtain multi-level features;
S1-1, RGB-branch and depth-branch feature extraction. The multi-level features of the RGB image and the depth image are denoted f_i^r and f_i^d respectively.
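For illustration, a minimal sketch of multi-level feature extraction from a ResNet50 backbone (the backbone named in S4) is given below; the split into five stages follows common practice and is an assumption, as are all names. The depth branch would apply an identical extractor to the depth map (commonly replicated to three channels):

```python
import torch.nn as nn
from torchvision.models import resnet50

class MultiLevelBackbone(nn.Module):
    """Splits ResNet50 into five stages whose outputs serve as f_1..f_5."""
    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)  # ImageNet initialization (see S4)
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu)  # 1/2 res.
        self.stage2 = nn.Sequential(net.maxpool, net.layer1)       # 1/4
        self.stage3 = net.layer2                                   # 1/8
        self.stage4 = net.layer3                                   # 1/16
        self.stage5 = net.layer4                                   # 1/32

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3,
                      self.stage4, self.stage5):
            x = stage(x)
            feats.append(x)
        return feats  # [f_i for i = 1..5], for either the RGB or depth branch
```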
S1-2, constructing the interaction attention module. As shown in fig. 2, first, the input fifth-layer RGB encoder feature and depth encoder feature are divided into several 1×1×128 tokens. Second, a self-attention mechanism is applied to the RGB encoder features to fully mine the relationships within them. A cross-attention layer is introduced between the self-enhanced RGB encoder features and the associated depth encoder features to further explore the relationship between the two modalities. Finally, a feed-forward layer is employed. The whole process is described as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))
where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network. Both the self-attention and cross-attention mechanisms use efficient attention to reduce memory and computational costs. Each attention mechanism is followed by a residual connection and a normalization operation. In the intra-feature self-attention mechanism, the RGB encoder feature serves as the query, key, and value, and the self-enhanced RGB encoder feature is output. In the inter-feature cross-attention mechanism, the self-enhanced RGB encoder feature serves as the query and the depth encoder feature as the key and value. Positional encodings are also introduced to avoid losing positional information among the tokens when computing the attention weights.
By stacking the above architecture L=2 times, the final RGB encoder feature both strengthens its intra-feature relationships and is further refined by the depth encoder features.
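The following is a hedged PyTorch sketch of one interaction attention block and the L=2 stack. It substitutes standard `nn.MultiheadAttention` for the efficient attention named above, omits positional encodings for brevity, and assumes the fifth-layer features have already been projected to 128-channel tokens; all names are illustrative:

```python
import torch.nn as nn

class InteractionAttentionBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(inplace=True),
            nn.Linear(4 * dim, dim))

    def forward(self, rgb, depth):
        # SA: self-enhancement of the RGB tokens (query = key = value).
        sa_out, _ = self.self_attn(rgb, rgb, rgb)
        rgb = self.norm1(rgb + sa_out)                 # residual + normalization
        # CA: self-enhanced RGB as query, depth tokens as key and value.
        ca_out, _ = self.cross_attn(rgb, depth, depth)
        fused = self.norm2(rgb + ca_out)
        # FF: feed-forward network, again with residual + normalization.
        return self.norm3(fused + self.ffn(fused))

class InteractionAttention(nn.Module):
    """Stacks L = 2 blocks, yielding the fusion feature f_5^{rd}."""
    def __init__(self, dim=128, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [InteractionAttentionBlock(dim) for _ in range(num_blocks)])

    def forward(self, rgb_tokens, depth_tokens):
        for blk in self.blocks:
            rgb_tokens = blk(rgb_tokens, depth_tokens)
        return rgb_tokens
```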
S2, constructing a decoder module;
S2-1, constructing the RGB-branch and depth-branch cross-level feature fusion module. Two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features, thereby constituting the cross-level feature fusion module.
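A minimal sketch of one cross-level fusion step follows, under the assumption that the "cascade fusion" above means channel-wise concatenation of the current encoder feature with the upsampled previous-level output; channel sizes and names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    def __init__(self, enc_channels, prev_channels, out_channels):
        super().__init__()
        # Two convolution layers, as stated in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(enc_channels + prev_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, enc_feat, prev_out):
        # Bring the previous-level decoder output to the current resolution.
        prev_out = F.interpolate(prev_out, size=enc_feat.shape[2:],
                                 mode='bilinear', align_corners=False)
        return self.conv(torch.cat([enc_feat, prev_out], dim=1))
```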
S2-2, constructing the cross-modal feature fusion module. As shown in fig. 3, first, a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain the fused RGB feature f_i^r; the fused depth feature f_i^d is obtained in the same way. The output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}. For the fifth-level cross-modal fusion module, which lacks a previous-level output, the feature f_5^{rd} refined by the interaction attention module is used instead. Finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}. Specifically:
f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
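A hedged sketch of this module is given below; the SE-style channel attention and all channel counts are assumptions, since the text fixes only the overall formula f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}])):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Assumed squeeze-and-excitation style; the text does not fix its form."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class CrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse_r = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_d = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv = nn.Conv2d(4 * channels, channels, 3, padding=1)
        self.ca = ChannelAttention(channels)

    def forward(self, dec_r, enc_r, dec_d, enc_d, prev):
        # prev: previous-level cross-modal output (f_5^{rd} at the fifth level),
        # brought to the current spatial size.
        prev = F.interpolate(prev, size=dec_r.shape[2:], mode='bilinear',
                             align_corners=False)
        f_r = self.fuse_r(torch.cat([dec_r, enc_r], dim=1))   # fused f_i^r
        f_d = self.fuse_d(torch.cat([dec_d, enc_d], dim=1))   # fused f_i^d
        f_r_p = torch.cat([f_r, prev], dim=1)                 # f_i^{r'}
        f_d_p = torch.cat([f_d, prev], dim=1)                 # f_i^{d'}
        # f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))
        return self.ca(self.conv(torch.cat([f_r_p, f_d_p], dim=1)))
```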
S3, constructing the deep-learning-based RGB-D image salient object detection model from the encoder and the decoder. The RGB features f_i^r and the depth features f_i^d extracted hierarchically by the encoder from the input images are fed into the decoder together with the fusion feature f_5^{rd} obtained by the interaction attention module. In the decoder, the RGB-branch and depth-branch decoding features are obtained through the cross-level feature fusion module of S2-1, and the decoder features of each layer are then input into the cross-modal feature fusion module of S2-2 to obtain f_i^{rgbd}. Finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}. The saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained. When training the network, a binary cross-entropy (BCE) loss function is used to optimize the RGB branch, the depth branch, and the RGB-D branch simultaneously. The final loss function is defined as:
Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
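As a sketch, this three-branch loss maps directly onto PyTorch's built-in binary cross entropy, assuming sigmoid-activated predictions; the function name is illustrative:

```python
import torch.nn.functional as F

def total_loss(s_r, s_d, s_rgbd, g):
    """Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G).

    s_* are predicted saliency maps in [0, 1]; g is the ground-truth map G.
    """
    return (F.binary_cross_entropy(s_r, g)
            + F.binary_cross_entropy(s_d, g)
            + F.binary_cross_entropy(s_rgbd, g))
```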
S4, training the established model and saving the parameters. The invention is implemented in PyTorch and trained on a 2080 Ti GPU. Data expansion is performed by random flipping, rotation, and a multi-scale input strategy that randomly adjusts the image size to 128×128, 256×256, and 352×352. During training, ResNet50 serves as the backbone network of the encoder stage; the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total.
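The training configuration described above could be sketched as follows; `model` and `train_loader` are placeholders, `total_loss` is the loss sketched earlier, and the step-decay factor of 0.1 is an assumption (the text says only that the learning rate is adjusted every 40 rounds):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(150):                      # 150 epochs in total
    for rgb, depth, gt in train_loader:       # batch size 8
        optimizer.zero_grad()
        s_r, s_d, s_rgbd = model(rgb, depth)  # three-branch predictions
        loss = total_loss(s_r, s_d, s_rgbd, gt)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # adjust the lr every 40 epochs
```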
S5, inputting the image data to be detected into the trained RGB-D image salient object detection model based on deep learning, which outputs the final saliency prediction map S^{rgbd} for the image data to be detected.
Fig. 4 is a comparison chart of results: the first column is the RGB image, the second column the depth image, the third column the ground-truth map, and the fourth column the result of the method of the present invention. Comparison shows that the final saliency prediction map S^{rgbd} produced by the scheme of this embodiment is closest to the ground-truth map in the third column.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (9)

1. A method for detecting an RGB-D image salient object based on deep learning, characterized by comprising the following steps:
S1, acquiring an image dataset and preprocessing it;
S2, constructing and training an RGB-D image salient object detection model based on deep learning, wherein the model comprises an encoder and a decoder;
the encoder comprises an RGB encoder, a depth encoder, and an interaction attention module at the tail of the RGB encoder and the depth encoder;
the decoder comprises a cross-level feature fusion module, a cross-modal fusion module, and a convolution layer with a 3×3 kernel;
S3, performing salient object detection through the trained RGB-D image salient object detection model based on deep learning:
S3-1, extracting corresponding-level encoder features through the RGB encoder and the depth encoder respectively, the multi-level encoder features of the RGB image and the depth image being denoted f_i^r and f_i^d respectively, where r denotes the RGB image, d denotes the depth image, and i denotes the feature level;
S3-2, strengthening the relationship between the multi-level encoder features of the RGB image and the depth image through the interaction attention module in the encoder to obtain a fusion feature f_5^{rd};
S3-3, taking the fusion feature f_5^{rd} as the input of the cross-level feature fusion module to obtain the RGB-branch and depth-branch decoding features;
S3-4, inputting the RGB-branch and depth-branch decoding features of each layer into the cross-modal feature fusion module of the corresponding level to obtain f_i^{rgbd}; finally, the output f_1^{rgbd} of the last decoder layer passes through a convolution layer with a 3×3 kernel to obtain the final saliency prediction map S^{rgbd}; the saliency maps predicted by the RGB branch and the depth branch, S^r and S^d, can also be obtained.
2. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein in step S1 the image dataset is preprocessed as follows: data expansion is performed by random flipping, rotation, and multi-scale input.
3. The method for detecting an RGB-D image salient object based on deep learning according to claim 2, wherein the multi-scale input randomly adjusts the image size to 128×128, 256×256, and 352×352.
4. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein the RGB encoder and the depth encoder employ a ResNet50 backbone network to extract the multi-level encoder features of the RGB image and the depth image.
5. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein the interaction attention module comprises a self-attention mechanism, a cross-attention layer, and a feed-forward network.
6. The method for detecting an RGB-D image salient object based on deep learning according to claim 1, wherein in step S2 the RGB-D image salient object detection model based on deep learning is trained as follows: the network is initialized with parameters pre-trained on ImageNet and optimized with the Adam algorithm, with a batch size of 8 and an initial learning rate of 1e-4; the learning rate is adjusted every 40 epochs, and training runs for 150 epochs in total;
in training the network, a binary cross-entropy loss function is used for optimization, and the final loss function is defined as:

Loss = l_bce(S^r, G) + l_bce(S^d, G) + l_bce(S^{rgbd}, G)

where G is the ground-truth map and l_bce = -[G log(S) + (1-G) log(1-S)].
7. The method for detecting an RGB-D image salient object based on deep learning according to claim 5, wherein the fusion feature f_5^{rd} is obtained through the interaction attention module as follows: the input fifth-layer RGB encoder feature and depth encoder feature are divided into a number of 1×1×128 tokens; a self-attention mechanism is applied to the RGB encoder feature for feature self-enhancement; a cross-attention layer is introduced between the self-enhanced RGB encoder feature and the associated depth encoder feature; and a feed-forward network is adopted to obtain the fusion feature f_5^{rd}, expressed as:

f_5^{rd} = FF(CA(SA(f_5^r), f_5^d))

where SA is the self-attention mechanism within a feature, CA is the cross-attention mechanism between features, and FF is the feed-forward network.
8. The method for detecting an RGB-D image salient object based on deep learning according to claim 4, wherein in the cross-level feature fusion module two convolution layers process the concatenation of the corresponding-level encoder features and the previous-level output to obtain the decoder features, the cross-level feature fusion module comprising an RGB branch and a depth branch to obtain the RGB-branch and depth-branch decoding features respectively.
9. The method for detecting an RGB-D image salient object based on deep learning according to claim 8, wherein in the cross-modal fusion module a convolution layer fuses the RGB decoder feature with the corresponding encoder feature to obtain a fused RGB feature f_i^r, and the fused depth feature f_i^d is obtained in the same way from the depth decoder feature; the output of the previous-level cross-modal fusion module is then concatenated with the fused RGB feature and the fused depth feature respectively to obtain f_i^{r'} and f_i^{d'}; for the fifth-level cross-modal fusion module, the feature f_5^{rd} refined by the interaction attention module replaces the previous-level output; finally, a channel attention mechanism enhances the fused f_i^{r'} and f_i^{d'} to obtain f_i^{rgbd}, specifically:

f_i^{rgbd} = CA(conv([f_i^{r'}, f_i^{d'}]))

where CA denotes the channel attention mechanism and [·, ·] denotes channel-wise concatenation.
CN202310668228.7A | Priority date 2023-06-07 | Filing date 2023-06-07 | RGB-D image salient object detection method based on deep learning | Status: Pending | Publication: CN116704174A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310668228.7A | 2023-06-07 | 2023-06-07 | RGB-D image salient object detection method based on deep learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310668228.7A | 2023-06-07 | 2023-06-07 | RGB-D image salient object detection method based on deep learning

Publications (1)

Publication Number | Publication Date
CN116704174A | 2023-09-05

Family

ID=87838676

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310668228.7A | RGB-D image salient object detection method based on deep learning | 2023-06-07 | 2023-06-07

Country Status (1)

Country Link
CN (1) CN116704174A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117854009A * | 2024-01-29 | 2024-04-09 | Nantong University | Cross-collaboration fusion light-weight cross-modal crowd counting method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination