CN111814895A - A saliency object detection method based on absolute and relative depth-induced networks - Google Patents
A saliency object detection method based on absolute and relative depth-induced networks
- Publication number
- CN111814895A (application number CN202010695446.6A)
- Authority
- CN
- China
- Prior art keywords
- depth
- network
- absolute
- feature
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 30
- 230000006698 induction Effects 0.000 claims abstract description 35
- 230000004927 fusion Effects 0.000 claims abstract description 13
- 238000012549 training Methods 0.000 claims abstract description 6
- 239000013589 supplement Substances 0.000 claims abstract description 4
- 238000000034 method Methods 0.000 claims description 17
- 238000011176 pooling Methods 0.000 claims description 8
- 230000010354 integration Effects 0.000 claims description 6
- 230000008569 process Effects 0.000 claims description 6
- 239000013598 vector Substances 0.000 claims description 6
- 230000000875 corresponding effect Effects 0.000 claims description 4
- 230000003993 interaction Effects 0.000 claims description 4
- 230000002596 correlated effect Effects 0.000 claims description 2
- 230000000306 recurrent effect Effects 0.000 claims description 2
- 230000001939 inductive effect Effects 0.000 claims 1
- 239000000284 extract Substances 0.000 abstract description 7
- 230000006870 function Effects 0.000 description 4
- 238000012545 processing Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a salient object detection method based on an absolute and relative depth induction network, comprising the following steps: training a depth induction network with a residual network as the backbone; cross-modal feature fusion in an absolute depth induction module to locate objects; and building a spatial geometric model in a relative depth induction module to supplement detail information. The invention not only extracts RGB image features from the residual network, but also exploits depth information to assist the salient object detection task. The absolute depth induction module fuses RGB image features and depth image information across modalities in a coarse-to-fine manner, avoiding the clutter noise caused by the asynchronous nature of the two feature spaces. The relative depth induction module builds a spatial graph convolution model to explore spatial structure and geometric information, strengthening local feature representation and thereby improving detection accuracy and robustness. The method achieves excellent detection results and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of salient object detection, and specifically relates to a salient object detection method based on absolute and relative depth induction networks.
Background Art
Salient object detection is a fundamental operation in computer image processing that aims to locate and segment the most visually distinctive objects in an image. In recent years it has been widely applied in various fields, such as re-localization, scene classification, visual tracking, and semantic segmentation. By using saliency detection to filter out irrelevant information before subsequent image processing operations, a computer can greatly reduce the processing workload and improve efficiency.
Early salient object detection methods mainly designed hand-crafted features (such as brightness, color, and texture) to detect salient objects in images. In recent years, driven by the development of CNNs, various deep-learning-based models have been proposed. In 2017, Hou et al. proposed a short-connection mechanism between layers and used it to aggregate feature maps from multiple scales. In 2017, Zhang et al. explored multi-level features at each scale and generated saliency maps recursively. In 2019, Feng et al. proposed an attentive feedback module to better explore the structure of salient objects. However, these recently proposed methods struggle in extremely complex situations such as semantically complex backgrounds, low-luminance environments, and transparent objects. To address this problem, we propose to supplement RGB images with depth information, so that the spatial structure and 3D geometric information of the scene can be explored, improving the effectiveness and robustness of the network.
The features extracted by traditional RGB-D salient object detection methods lack global context information and semantic cues. In recent years, the effective integration of depth and RGB features has become a key issue for this task. In 2019, Zhao et al. designed a contrast loss to exploit the contrast prior in depth images; an attention map is then generated by fusing the refined depth and RGB features, and the final saliency map is produced by a fluid pyramid integration strategy that fully exploits multi-scale cross-modal features. In 2019, Pial et al. integrated depth and RGB images hierarchically and refined the final saliency map with a recurrent attention model. However, in current methods the fused depth and RGB feature spaces are asynchronous, which introduces clutter noise into the network.
In summary, existing salient object detection techniques have the following shortcomings. First, most existing methods extract features only from RGB images, and these features are insufficient to distinguish salient objects from cluttered background regions. Second, most existing methods extract depth and RGB features with separate networks and fuse them directly using various strategies; however, the cross-modal feature spaces are inconsistent, and fusing them directly leads to noisy responses in the prediction results. Third, although the absolute depth induction module can accurately locate salient objects, the detailed saliency information of local regions is still not deeply explored, which limits further improvement of model performance.
Summary of the Invention
(1) Technical Problem to Be Solved
In view of the deficiencies of the prior art, the present invention provides a salient object detection method based on absolute and relative depth induction networks, which solves the problems mentioned in the background art.
(2) Technical Solution
To achieve the above object, the present invention provides the following technical solution: a salient object detection method based on absolute and relative depth induction networks, comprising the following steps:
a. Training a depth induction network with a residual network as the backbone: the last pooling layer and the fully connected layer of ResNet-50 are removed, the network input images are uniformly resized to 256×256, and the dataset is normalized; the feature maps generated by the five convolution blocks produce corresponding side output maps in a pyramid fashion, which are then fused top-down in the network.
b. Cross-modal feature fusion in the absolute depth induction module to locate objects: the depth image of the input image is fed into a set of convolutions to obtain a depth feature map of the same size as the Res2_x feature map; the absolute depth induction module is applied repeatedly, integrating the depth feature maps and the RGB feature maps in a recurrent manner to achieve cross-modal feature fusion. This avoids the noise interference caused by simply fusing two asynchronous modal features, strengthens the deep interaction between depth and color features, and allows RGB and depth features to be fused adaptively at every scale.
c. Building a spatial geometric model in the relative depth induction module to supplement detail information: the feature map from the last stage Res5_x of the decoding network is first upsampled and integrated with the feature map obtained by the cross-modal fusion of the absolute depth induction module to generate a new feature map; this feature map, together with the depth map produced by the absolute depth induction module, is fed into the relative depth induction module to explore the spatial structure and detailed saliency information of the image, embedding relative depth information in the network to improve the performance of the saliency model.
Further, to bring all images in the dataset to the uniform input size described in step a, bilinear interpolation is applied.
Further, when generating the side output maps in step a, the output feature maps of the four residual blocks are fed into a 1×1 convolutional layer that reduces their channel dimension; the resulting maps are the side outputs used for the subsequent top-down integration of multi-level feature maps.
Further, in step b the depth feature maps and RGB feature maps are integrated in a recurrent manner. The absolute depth induction module is implemented by a gated recurrent unit (GRU), which is designed to handle sequence problems; we formulate the multi-scale feature integration process as a sequence problem and treat each scale as one time step.
Further, in each time step the depth feature map is first reduced in dimension; the depth and RGB feature maps are then concatenated and transformed by global max pooling to generate a new feature vector, which passes through fully connected layers and related operations, so that RGB and depth features are fused adaptively at each scale.
Further, the relative depth induction module described in step c is used to explore the spatial structure and detailed saliency information of the image; this module uses a graph convolutional network (GCN) to exploit relative depth information.
Further, the proposed graph convolutional network (GCN) projects image pixels into 3D space according to their spatial positions and depth values, compensating for the fact that pixels adjacent in 2D space are not necessarily strongly correlated in the 3D point-cloud space. Information is propagated within local regions according to short-range relative depth relationships, and by exploring spatial structure and geometric information at multiple scales the local feature representation is progressively strengthened. In this way, detailed saliency information can be exploited in the relative depth induction network, which helps to predict the final result accurately.
(3) Beneficial Effects
Compared with the prior art, the present invention provides a salient object detection method based on absolute and relative depth induction networks, with the following beneficial effects:
The invention not only extracts RGB image features from the residual network but also exploits depth information to assist the salient object detection task. Most existing RGB-D models simply extract depth and RGB features and fuse them heuristically. In contrast, the absolute depth induction module fuses RGB image features and depth image information across modalities in a coarse-to-fine manner, avoiding the clutter noise caused by the asynchronous nature of the two feature spaces and thereby locating objects precisely. The relative depth induction module builds a spatial graph convolution model to explore spatial structure and geometric information, strengthening local feature representation and improving detection accuracy and robustness. The method achieves excellent detection results, facilitates integration with other fields, and has broad application prospects.
Brief Description of the Drawings
Fig. 1 is a flowchart of the salient object detection method based on absolute and relative depth induction networks proposed by the present invention.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention provides a technical solution: a salient object detection method based on absolute and relative depth induction networks, comprising the following steps.
Training the depth induction network with a residual network as the backbone: the last pooling layer and the fully connected layer of ResNet-50 are removed, the network input images are uniformly resized to 256×256, and the dataset is normalized. The feature maps generated by the five convolution blocks produce the corresponding side output maps in a pyramid fashion, and the fusion operation is then performed top-down in the network.
In detail, the last pooling layer and the fully connected layer of ResNet-50 are removed, and the backbone consists of five convolution blocks, Conv1, Res2_x, ..., Res5_x. An RGB image of size W×H is passed through the convolution blocks, generating feature maps at successively smaller resolutions; the shallower layers capture low-level information of the image, such as texture and spatial details, while the deeper feature maps contain high-level semantic information. The feature maps are fused in a pyramid fashion: a 1×1 convolution kernel reduces the channels of each stage to C to obtain the side output maps, and the multi-level feature maps are then integrated in a top-down manner, as in Eq. (1),

F_l = σ(W_l · CAT[S_l, UP(F_{l+1})] + b_l)    (1)

where S_l is the side output at level l, σ(·) is the ReLU activation function, CAT[·,·] is the concatenation operation joining two feature maps along the channel dimension, UP(·) is the upsampling operation with bilinear interpolation, and W_l, b_l are trainable parameters of the network.
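To make the backbone and the top-down integration of Eq. (1) concrete, a minimal PyTorch sketch is given below. It assumes a standard torchvision ResNet-50 split into the five blocks Conv1, Res2_x, ..., Res5_x; the channel number C, the class and attribute names, and the exact layer layout are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Backbone(nn.Module):
    """ResNet-50 with the last pooling and fully connected layers removed."""
    def __init__(self, c=64):
        super().__init__()
        net = resnet50()
        # Five convolution blocks: Conv1, Res2_x, Res3_x, Res4_x, Res5_x.
        self.blocks = nn.ModuleList([
            nn.Sequential(net.conv1, net.bn1, net.relu),
            nn.Sequential(net.maxpool, net.layer1),
            net.layer2, net.layer3, net.layer4])
        # 1x1 convolutions reduce every stage to C channels (the side outputs).
        self.side = nn.ModuleList(
            [nn.Conv2d(ch, c, 1) for ch in (64, 256, 512, 1024, 2048)])

    def forward(self, x):                           # x: B x 3 x 256 x 256
        sides, out = [], x
        for block, side in zip(self.blocks, self.side):
            out = block(out)
            sides.append(side(out))                 # pyramid of side output maps
        return sides

class TopDownDecoder(nn.Module):
    """Eq. (1): F_l = ReLU(W_l * CAT[S_l, UP(F_{l+1})] + b_l)."""
    def __init__(self, c=64, levels=5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(2 * c, c, 3, padding=1) for _ in range(levels - 1)])

    def forward(self, sides):
        dec = sides[-1]                             # the deepest side output starts the decoding
        decoded = [dec]
        for l in range(len(sides) - 2, -1, -1):
            up = F.interpolate(dec, size=sides[l].shape[2:],
                               mode='bilinear', align_corners=False)
            dec = F.relu(self.convs[l](torch.cat([sides[l], up], dim=1)))
            decoded.insert(0, dec)
        return decoded                              # multi-level decoded feature maps
```

With 256×256 inputs, the side outputs range from 1/2 down to 1/32 of the input resolution; the decoded maps are the features that the absolute and relative depth induction modules described below refine.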
Cross-modal feature fusion in the absolute depth induction module to locate objects: the input depth image D of size W×H is first fed into a set of convolutional layers to generate a depth feature map f_d of the same size as the Res2_x feature map. The absolute depth induction module (ADIM) is then applied repeatedly, integrating the depth feature maps with the RGB feature maps in a recurrent manner to strengthen the deep interaction between depth and color features: at each layer l, the module takes the RGB feature and the previous depth feature as inputs and outputs the updated depth feature together with the aggregation result of the depth and RGB information in layer l.
According to the above embodiment, preferably, the ADIM is implemented by a gated recurrent unit (GRU), which is designed to handle sequence problems. We formulate the multi-scale feature integration process as a sequence problem and treat each scale as a time step. At each time step, the RGB feature is regarded as the input of the GRU and the depth feature as the hidden state from the previous step. The two feature maps are concatenated and transformed by a global max pooling (GMP) operation to generate a feature vector, on which fully connected layers are applied to produce a reset gate r and an update gate z. The values of the two gates are normalized by a sigmoid function: the gate r controls how strongly the depth and RGB features are integrated, and z controls the update of the depth feature. In this way, RGB and depth features are fused adaptively at each scale, and the interaction between depth and RGB features is strengthened by the network. The generated multi-scale cross-modal feature map is then combined with the feature map in the decoding stage; that is, Eq. (1) is re-expressed accordingly as Eq. (3).
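A minimal sketch of one ADIM time step under the above description is given below: the RGB feature acts as the GRU input, the depth feature acts as the hidden state, and global max pooling plus a fully connected layer produce the reset gate r and update gate z. The specific gating arithmetic, layer sizes, and names are assumptions for illustration, not the patent's exact formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADIMStep(nn.Module):
    """One GRU-style ADIM step that adaptively fuses RGB and depth features at one scale."""
    def __init__(self, c):
        super().__init__()
        self.gates = nn.Linear(2 * c, 2 * c)        # fully connected layer -> reset gate r, update gate z
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, f_rgb, f_d):                  # both: B x C x H x W
        # Global max pooling turns the concatenated maps into one feature vector.
        v = F.adaptive_max_pool2d(torch.cat([f_rgb, f_d], dim=1), 1).flatten(1)
        r, z = torch.sigmoid(self.gates(v)).chunk(2, dim=1)     # sigmoid-normalised gates, B x C each
        r, z = r[..., None, None], z[..., None, None]
        # r controls how strongly the depth feature is integrated with the RGB feature.
        agg = F.relu(self.fuse(torch.cat([f_rgb, r * f_d], dim=1)))
        # z controls the update of the depth feature (the hidden state for the next scale).
        f_d_new = z * agg + (1 - z) * f_d
        return f_d_new, agg                         # updated depth feature, aggregation result
```

Applying one such step per scale, with the updated depth feature passed on as the hidden state of the next scale, realises the recurrent multi-scale integration described above.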
Building a spatial geometric model in the relative depth induction module to supplement detail information: the relative depth induction module (RDIM) is used in the decoding stage. The feature map from the last stage of the decoding network is first upsampled and integrated with the ADIM feature map, as described in Eq. (3). The RDIM is then applied to the resulting feature map and the depth image in order to embed relative depth information in the network.
According to the above embodiment, preferably, the RDIM is implemented by a graph convolutional network (GCN). To explore the relative depth relationships between pixels, the feature map generated by the ADIM is first represented as a graph G = (V, E), where V is the node set and E is the edge set. Each node n_i in the graph is treated as a point in a 3D coordinate system with coordinates (x_i, y_i, d_i), where (x_i, y_i) is the spatial position in the feature map and d_i is the corresponding depth value. The node set is denoted V = {n_1, n_2, ..., n_k}, with k the number of nodes. Edges e_{i,j} ∈ E connect each node, in 3D coordinates, to its m neighbouring elements, and the weight w_{i,j} on edge e_{i,j} is computed as the relative depth value to measure the spatial correlation between nodes n_i and n_j,
w_{i,j} = |(x_i, y_i, d_i) − (x_j, y_j, d_j)|    (5)
To describe the semantic relationship between nodes n_i and n_j, an attribute feature a_{i,j} is defined for each edge e_{i,j}. To further account for the global context of the image, global average pooling (GAP) is applied to the feature map to extract high-level semantic information, producing a feature vector f_g.
The spatial GCN consists of a set of stacked graph convolutional layers (GCLs). In each GCL, the attribute feature a_{i,j} of edge e_{i,j} is first updated,
where the two inputs of the update are the feature vectors at positions (x_i, y_i) and (x_j, y_j) of the feature map; the feature of each node is then updated with an MLP,
where N(n_i) is the set of neighbouring nodes of n_i, and w_{i,j} serves as the attention value derived from the relative depth on edge e_{i,j}; in this way the RDIM pays more attention to regions with larger relative distance, and messages are propagated along the edges between neighbouring nodes. The updated features of all nodes are then fed into a global max pooling layer to obtain the updated global feature vector f_g. Finally, the last GCL produces the overall output feature map of the RDIM at scale l. By using the GCN to pass messages between nodes, the feature of each node is updated and refined according to its relationship with all of its neighbouring nodes. In our network, the RDIM is applied at the third and fourth levels of the decoding stage, and the feature maps generated by the RDIM are fed into the next decoding stage.
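A minimal sketch of the RDIM graph construction and one graph convolutional layer is given below. It assumes each pixel of the ADIM feature map becomes a node, the m nearest neighbours in the 3D space (x, y, d) define the edges with Eq. (5) as edge weights, and a small MLP performs the node update; the edge-attribute update and the global pooling branch are simplified away, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_graph(depth, h, w, m=8):
    """Project each pixel to a 3D point (x, y, d) and connect it to its m nearest neighbours."""
    ys = torch.arange(h).repeat_interleave(w).float()
    xs = torch.arange(w).repeat(h).float()
    pts = torch.stack([xs, ys, depth.reshape(-1)], dim=1)       # K x 3 points, K = h * w
    dist = torch.cdist(pts, pts)                 # distances in (x, y, d) space, cf. Eq. (5)
    w_ij, nbr = dist.topk(m + 1, largest=False)  # nearest neighbours, including the node itself
    return nbr[:, 1:], w_ij[:, 1:]               # drop the self edge

class GraphConvLayer(nn.Module):
    """One GCL: aggregate neighbour features weighted by relative depth, then update with an MLP."""
    def __init__(self, c):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * c, c), nn.ReLU())

    def forward(self, node_feat, nbr, w_ij):     # node_feat: K x C
        attn = torch.softmax(w_ij, dim=1)        # larger relative distance -> larger attention
        msg = (attn.unsqueeze(-1) * node_feat[nbr]).sum(dim=1)   # K x C message from neighbours
        return self.mlp(torch.cat([node_feat, msg], dim=1))      # refined node features
```

Stacking several such layers propagates information through the short-range relative-depth edges, which is the message passing between neighbouring nodes described above.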
The feature map generated by the last decoding stage, which combines absolute and relative depth information, is selected to predict the final saliency map. This feature map is first upsampled with a bilinear interpolation operation to the same size as the input and then fed into a single-channel convolutional layer to obtain the final saliency map S. During training, the final saliency map is supervised by the ground-truth map through a cross-entropy loss function,
where S_{i,j} and the corresponding ground-truth value are the saliency values at position (i, j) of the predicted saliency map and the ground-truth map, respectively.
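The prediction head and its supervision can be sketched as follows: a minimal sketch assuming a single 1×1 convolution for the one-channel output and the standard pixel-wise binary cross-entropy; the class and function names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    """Upsample the last decoded feature map and predict a one-channel saliency map."""
    def __init__(self, c):
        super().__init__()
        self.pred = nn.Conv2d(c, 1, 1)

    def forward(self, feat, out_size):           # out_size: (H, W) of the input image
        feat = F.interpolate(feat, size=out_size, mode='bilinear', align_corners=False)
        return self.pred(feat)                   # logits of the final saliency map S

def saliency_loss(logits, gt):
    """Pixel-wise cross-entropy between the predicted saliency map and the ground-truth map."""
    return F.binary_cross_entropy_with_logits(logits, gt)
```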
The invention not only extracts RGB image features from the residual network but also exploits depth information to assist the salient object detection task. Most existing RGB-D models simply extract depth and RGB features and fuse them heuristically. In the present invention, the absolute depth induction module is designed to fuse RGB image features and depth image information across modalities in a coarse-to-fine manner, avoiding the clutter noise caused by the asynchronous nature of the two feature spaces and thereby locating objects precisely. At the same time, the relative depth induction module is designed to build a spatial graph convolution model that explores spatial structure and geometric information, strengthening local feature representation and improving detection accuracy and robustness. The method achieves excellent detection results, facilitates integration with other fields, and has broad application prospects.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprising", "including", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device.
In the description of the present invention, it should be noted that orientation or positional relationships indicated by terms such as "upper", "lower", "inner", "outer", "front end", "rear end", "both ends", "one end", and "the other end" are based on the orientations or positional relationships shown in the accompanying drawings. They are used only for convenience and simplification of the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the appended claims and their equivalents.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010695446.6A CN111814895B (en) | 2020-07-17 | 2020-07-17 | Salient object detection method based on absolute and relative depth induced network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010695446.6A CN111814895B (en) | 2020-07-17 | 2020-07-17 | Salient object detection method based on absolute and relative depth induced network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111814895A true CN111814895A (en) | 2020-10-23 |
CN111814895B CN111814895B (en) | 2024-12-03 |
Family
ID=72866457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010695446.6A Active CN111814895B (en) | 2020-07-17 | 2020-07-17 | Salient object detection method based on absolute and relative depth induced network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814895B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076947A (en) * | 2021-03-26 | 2021-07-06 | 东北大学 | RGB-T image significance detection system with cross-guide fusion |
CN113537279A (en) * | 2021-05-18 | 2021-10-22 | 齐鲁工业大学 | A COVID-19 Identification System Based on Class Residual Convolution and LSTM |
CN113963081A (en) * | 2021-10-11 | 2022-01-21 | 华东师范大学 | A method for intelligent synthesis of image charts based on graph convolutional network |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170351941A1 (en) * | 2016-06-03 | 2017-12-07 | Miovision Technologies Incorporated | System and Method for Performing Saliency Detection Using Deep Active Contours |
WO2019136946A1 (en) * | 2018-01-15 | 2019-07-18 | 中山大学 | Deep learning-based weakly supervised salient object detection method and system |
CN110210539A (en) * | 2019-05-22 | 2019-09-06 | 西安电子科技大学 | The RGB-T saliency object detection method of multistage depth characteristic fusion |
CN110399907A (en) * | 2019-07-03 | 2019-11-01 | 杭州深睿博联科技有限公司 | Thoracic cavity illness detection method and device, storage medium based on induction attention |
AU2020100274A4 (en) * | 2020-02-25 | 2020-03-26 | Huang, Shuying DR | A Multi-Scale Feature Fusion Network based on GANs for Haze Removal |
CN111242238A (en) * | 2020-01-21 | 2020-06-05 | 北京交通大学 | Method for acquiring RGB-D image saliency target |
CN111242138A (en) * | 2020-01-11 | 2020-06-05 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion |
-
2020
- 2020-07-17 CN CN202010695446.6A patent/CN111814895B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170351941A1 (en) * | 2016-06-03 | 2017-12-07 | Miovision Technologies Incorporated | System and Method for Performing Saliency Detection Using Deep Active Contours |
WO2019136946A1 (en) * | 2018-01-15 | 2019-07-18 | 中山大学 | Deep learning-based weakly supervised salient object detection method and system |
CN110210539A (en) * | 2019-05-22 | 2019-09-06 | 西安电子科技大学 | The RGB-T saliency object detection method of multistage depth characteristic fusion |
CN110399907A (en) * | 2019-07-03 | 2019-11-01 | 杭州深睿博联科技有限公司 | Thoracic cavity illness detection method and device, storage medium based on induction attention |
CN111242138A (en) * | 2020-01-11 | 2020-06-05 | 杭州电子科技大学 | RGBD significance detection method based on multi-scale feature fusion |
CN111242238A (en) * | 2020-01-21 | 2020-06-05 | 北京交通大学 | Method for acquiring RGB-D image saliency target |
AU2020100274A4 (en) * | 2020-02-25 | 2020-03-26 | Huang, Shuying DR | A Multi-Scale Feature Fusion Network based on GANs for Haze Removal |
Non-Patent Citations (2)
Title |
---|
LIU Zhengyi; DUAN Quntao; SHI Song; ZHAO Peng: "Salient object detection in RGB-D images based on multi-modal feature-fusion supervision", Journal of Electronics & Information Technology, no. 04, 15 April 2020 (2020-04-15), pages 206 - 213 *
CHEN Kai; WANG Yongxiong: "Saliency detection combining spatial attention and multi-level feature fusion", Journal of Image and Graphics, no. 06, 16 June 2020 (2020-06-16), pages 66 - 77 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076947A (en) * | 2021-03-26 | 2021-07-06 | 东北大学 | RGB-T image significance detection system with cross-guide fusion |
CN113076947B (en) * | 2021-03-26 | 2023-09-01 | 东北大学 | A RGB-T image saliency detection system based on cross-guided fusion |
CN113537279A (en) * | 2021-05-18 | 2021-10-22 | 齐鲁工业大学 | A COVID-19 Identification System Based on Class Residual Convolution and LSTM |
CN113963081A (en) * | 2021-10-11 | 2022-01-21 | 华东师范大学 | A method for intelligent synthesis of image charts based on graph convolutional network |
CN113963081B (en) * | 2021-10-11 | 2024-05-17 | 华东师范大学 | Image chart intelligent synthesis method based on graph convolution network |
Also Published As
Publication number | Publication date |
---|---|
CN111814895B (en) | 2024-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188705B (en) | Remote traffic sign detection and identification method suitable for vehicle-mounted system | |
CN110276316B (en) | A human keypoint detection method based on deep learning | |
TWI821671B (en) | A method and device for positioning text areas | |
CN111488474A (en) | A fine-grained hand-drawn sketch image retrieval method based on enhanced attention | |
CN109800817B (en) | Image classification method based on fusion semantic neural network | |
CN114463736B (en) | A multi-target detection method and device based on multimodal information fusion | |
CN111814895A (en) | A saliency object detection method based on absolute and relative depth-induced networks | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
Li et al. | ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN110853057A (en) | Aerial image segmentation method based on global and multi-scale fully convolutional network | |
CN112883934A (en) | Attention mechanism-based SAR image road segmentation method | |
Deng et al. | Fusing geometrical and visual information via superpoints for the semantic segmentation of 3D road scenes | |
CN113628329A (en) | Zero-sample sketch three-dimensional point cloud retrieval method | |
CN117455868A (en) | SAR image change detection method based on significant fusion difference map and deep learning | |
CN114519819A (en) | Remote sensing image target detection method based on global context awareness | |
CN115527159B (en) | Counting system and method based on inter-modal scale attention aggregation features | |
CN114298187B (en) | An Object Detection Method Fused with Improved Attention Mechanism | |
CN114693951A (en) | An RGB-D Saliency Object Detection Method Based on Global Context Information Exploration | |
CN110009625A (en) | Image processing system, method, terminal, and medium based on deep learning | |
CN118674989A (en) | Infrared and visible light target identification method based on image registration | |
CN113313108A (en) | Saliency target detection method based on super-large receptive field characteristic optimization | |
Li et al. | Stereo superpixel segmentation via decoupled dynamic spatial-embedding fusion network | |
CN114693953B (en) | A RGB-D salient object detection method based on cross-modal bidirectional complementary network | |
CN116433904A (en) | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |