CN110263813A - Saliency detection method based on residual network and depth information fusion - Google Patents
Saliency detection method based on residual network and depth information fusion
- Publication number
- CN110263813A (application number CN201910444775.0A)
- Authority
- CN
- China
- Prior art keywords
- layer
- output
- feature maps
- neural network
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention discloses a saliency detection method based on residual network and depth information fusion. In the training stage a convolutional neural network is constructed whose input layer comprises an RGB image input layer and a depth image input layer, whose hidden layer comprises 5 RGB image neural network blocks, 4 RGB image max-pooling layers, 5 depth image neural network blocks, 4 depth image max-pooling layers, 5 concatenation layers, 5 fusion neural network blocks, and 4 deconvolution layers, and whose output layer comprises 5 sub-output layers. The color real-object images and depth images of the training set are fed into the convolutional neural network for training, yielding saliency detection prediction maps; the trained convolutional neural network model is obtained by computing the loss function values between the prediction maps and the true saliency detection label images. In the test stage the trained model predicts a saliency detection image for a color real-object image to be detected. The advantage of the method is its high saliency detection accuracy.
Description
Technical Field
The present invention relates to a visual saliency detection technique, and in particular to a saliency detection method based on residual network and depth information fusion.
Background Art
Visual saliency helps humans rapidly filter out unimportant information and concentrate attention on meaningful regions, and thus better understand the scene at hand. With the rapid development of computer vision, one would like computers to have the same ability: when understanding and analyzing complex scenes, a computer that processes only the useful information can greatly reduce algorithmic complexity and suppress the interference of clutter. Traditionally, researchers modeled salient-object detection algorithms on various kinds of observed prior knowledge, such as contrast, center priors, edge priors, and semantic priors, to generate saliency maps. In complex scenes, however, these traditional approaches are often inaccurate, because such observations are usually limited to low-level features (e.g., color and contrast) and therefore cannot capture what salient objects essentially have in common.
In recent years convolutional neural networks have been applied throughout computer vision, and many hard vision problems have seen major progress. Unlike traditional approaches, deep convolutional neural networks model large numbers of training samples and automatically learn more essential characteristics end-to-end, thereby avoiding the drawbacks of hand-crafted modeling and feature design. More recently, the practical use of 3D sensors has enriched the available databases: one can acquire not only color images but also their depth information. Depth is an important cue for the human visual system in real 3D scenes, yet it was completely ignored by earlier traditional methods; the key task now is therefore to build models that exploit depth information effectively.
Deep-learning saliency detection methods on RGB-D databases perform pixel-level, end-to-end saliency detection directly: the training images are fed into the model framework for training, the weights and the model are obtained, and predictions are then made on the test set. Current deep-learning saliency detection models for RGB-D data mainly use an encoder-decoder architecture and exploit depth information in one of three ways. The first is to stack the depth information and the color-image information directly into a four-dimensional input, or to add or concatenate them during encoding; this is called early fusion. The second is to add or concatenate the color-image and depth features of the encoder into the corresponding decoding stages via skip connections; this is called late fusion. The third is to predict saliency separately from the color-image information and from the depth information and then fuse the two results. In the first approach, the distributions of color-image and depth information differ considerably, so injecting depth directly during encoding adds a certain amount of noise. In the third approach, if neither the depth-based nor the color-based prediction is accurate, the fused result is also relatively inaccurate. The second approach not only avoids the noise caused by using depth directly in the encoding stage, but also lets the network, as it is optimized, fully learn the complementary relationship between color-image and depth information. Consider an earlier late-fusion scheme, "RGB-D Saliency Detection by Multi-stream Late Fusion Network" (hereinafter MLF): MLF extracts features from and downsamples the color-image and depth information separately, fuses them at the highest level by element-wise multiplication, and outputs a very small saliency prediction map from the fused result. Because MLF contains only downsampling operations, the spatial details of objects are progressively blurred, and because it predicts saliency at the smallest size, much information about the salient objects is lost when the prediction is enlarged back to the original size.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a saliency detection method based on residual network and depth information fusion that exploits depth information and color-image information efficiently and thereby improves saliency detection accuracy.
The technical solution adopted by the present invention to solve the above technical problem is a saliency detection method based on residual network and depth information fusion, characterized by comprising a training stage and a test stage.
The specific steps of the training stage are as follows:
Step 1_1: Select Q original color real-object images, together with the depth image and the true saliency detection label image corresponding to each of them, to form a training set; denote the q-th original color real-object image in the training set, its depth image, and its true saliency detection label image as {Iq(i,j)}, {Dq(i,j)}, and {Gq(i,j)} respectively. Here Q is a positive integer with Q ≥ 200; q is a positive integer with initial value 1 and 1 ≤ q ≤ Q; 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W and H denote the width and height of {Iq(i,j)}, {Dq(i,j)}, and {Gq(i,j)}, both divisible by 2; {Iq(i,j)} is an RGB color image whose pixel value at coordinate (i,j) is Iq(i,j); {Dq(i,j)} is a single-channel depth image whose pixel value at coordinate (i,j) is Dq(i,j); and Gq(i,j) is the pixel value of {Gq(i,j)} at coordinate (i,j).
Step 1_2: Construct a convolutional neural network. The network comprises an input layer, a hidden layer, and an output layer. The input layer comprises an RGB image input layer and a depth image input layer. The hidden layer comprises 5 RGB image neural network blocks, 4 RGB image max-pooling layers, 5 depth image neural network blocks, 4 depth image max-pooling layers, 5 concatenation layers, 5 fusion neural network blocks, and 4 deconvolution layers. The output layer comprises 5 sub-output layers. The 5 RGB image neural network blocks and 4 RGB image max-pooling layers form the encoding structure for the RGB image, and the 5 depth image neural network blocks and 4 depth image max-pooling layers form the encoding structure for the depth image; together these two encoding structures form the encoding part of the convolutional neural network, while the 5 concatenation layers, 5 fusion neural network blocks, and 4 deconvolution layers form its decoding part.
The input end of the RGB image input layer receives the R, G, and B channel components of a training RGB color image, and its output end passes them to the hidden layer; the training RGB color image is required to have width W and height H.
The input end of the depth image input layer receives the training depth image corresponding to the training RGB color image received by the RGB image input layer, and its output end passes the training depth image to the hidden layer; the training depth image has width W and height H.
The input end of the 1st RGB image neural network block receives the R, G, and B channel components output by the RGB image input layer; its output end outputs 32 feature maps of width W and height H, whose set is denoted CP1.
The input end of the 1st RGB image max-pooling layer receives all feature maps in CP1; its output end outputs 32 feature maps of width W/2 and height H/2, whose set is denoted ZC1.
The input end of the 2nd RGB image neural network block receives all feature maps in ZC1; its output end outputs 64 feature maps of width W/2 and height H/2, whose set is denoted CP2.
The input end of the 2nd RGB image max-pooling layer receives all feature maps in CP2; its output end outputs 64 feature maps of width W/4 and height H/4, whose set is denoted ZC2.
The input end of the 3rd RGB image neural network block receives all feature maps in ZC2; its output end outputs 128 feature maps of width W/4 and height H/4, whose set is denoted CP3.
The input end of the 3rd RGB image max-pooling layer receives all feature maps in CP3; its output end outputs 128 feature maps of width W/8 and height H/8, whose set is denoted ZC3.
The input end of the 4th RGB image neural network block receives all feature maps in ZC3; its output end outputs 256 feature maps of width W/8 and height H/8, whose set is denoted CP4.
The input end of the 4th RGB image max-pooling layer receives all feature maps in CP4; its output end outputs 256 feature maps of width W/16 and height H/16, whose set is denoted ZC4.
The input end of the 5th RGB image neural network block receives all feature maps in ZC4; its output end outputs 256 feature maps of width W/16 and height H/16, whose set is denoted CP5.
The input end of the 1st depth image neural network block receives the training depth image output by the depth image input layer; its output end outputs 32 feature maps of width W and height H, whose set is denoted DP1.
The input end of the 1st depth image max-pooling layer receives all feature maps in DP1; its output end outputs 32 feature maps of width W/2 and height H/2, whose set is denoted DC1.
The input end of the 2nd depth image neural network block receives all feature maps in DC1; its output end outputs 64 feature maps of width W/2 and height H/2, whose set is denoted DP2.
The input end of the 2nd depth image max-pooling layer receives all feature maps in DP2; its output end outputs 64 feature maps of width W/4 and height H/4, whose set is denoted DC2.
The input end of the 3rd depth image neural network block receives all feature maps in DC2; its output end outputs 128 feature maps of width W/4 and height H/4, whose set is denoted DP3.
The input end of the 3rd depth image max-pooling layer receives all feature maps in DP3; its output end outputs 128 feature maps of width W/8 and height H/8, whose set is denoted DC3.
The input end of the 4th depth image neural network block receives all feature maps in DC3; its output end outputs 256 feature maps of width W/8 and height H/8, whose set is denoted DP4.
The input end of the 4th depth image max-pooling layer receives all feature maps in DP4; its output end outputs 256 feature maps of width W/16 and height H/16, whose set is denoted DC4.
The input end of the 5th depth image neural network block receives all feature maps in DC4; its output end outputs 256 feature maps of width W/16 and height H/16, whose set is denoted DP5.
The input end of the 1st concatenation layer receives all feature maps in CP5 and all feature maps in DP5 and stacks them; its output end outputs 512 feature maps of width W/16 and height H/16, whose set is denoted Con1.
The input end of the 1st fusion neural network block receives all feature maps in Con1; its output end outputs 256 feature maps of width W/16 and height H/16, whose set is denoted RH1.
The input end of the 1st deconvolution layer receives all feature maps in RH1; its output end outputs 256 feature maps of width W/8 and height H/8, whose set is denoted FJ1.
The input end of the 2nd concatenation layer receives all feature maps in FJ1, CP4, and DP4 and stacks them; its output end outputs 768 feature maps of width W/8 and height H/8, whose set is denoted Con2.
The input end of the 2nd fusion neural network block receives all feature maps in Con2; its output end outputs 256 feature maps of width W/8 and height H/8, whose set is denoted RH2.
The input end of the 2nd deconvolution layer receives all feature maps in RH2; its output end outputs 256 feature maps of width W/4 and height H/4, whose set is denoted FJ2.
The input end of the 3rd concatenation layer receives all feature maps in FJ2, CP3, and DP3 and stacks them; its output end outputs 512 feature maps of width W/4 and height H/4, whose set is denoted Con3.
The input end of the 3rd fusion neural network block receives all feature maps in Con3; its output end outputs 128 feature maps of width W/4 and height H/4, whose set is denoted RH3.
The input end of the 3rd deconvolution layer receives all feature maps in RH3; its output end outputs 128 feature maps of width W/2 and height H/2, whose set is denoted FJ3.
The input end of the 4th concatenation layer receives all feature maps in FJ3, CP2, and DP2 and stacks them; its output end outputs 256 feature maps of width W/2 and height H/2, whose set is denoted Con4.
The input end of the 4th fusion neural network block receives all feature maps in Con4; its output end outputs 64 feature maps of width W/2 and height H/2, whose set is denoted RH4.
The input end of the 4th deconvolution layer receives all feature maps in RH4; its output end outputs 64 feature maps of width W and height H, whose set is denoted FJ4.
The input end of the 5th concatenation layer receives all feature maps in FJ4, CP1, and DP1 and stacks them; its output end outputs 128 feature maps of width W and height H, whose set is denoted Con5.
The input end of the 5th fusion neural network block receives all feature maps in Con5; its output end outputs 32 feature maps of width W and height H, whose set is denoted RH5.
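The channel counts of the five concatenation layers follow from adding the widths of the stacked feature-map sets; a standalone sanity check of this bookkeeping (a sketch, with variable names of our own choosing):

```python
# Channel bookkeeping for the five concatenation layers (names ours):
enc = [32, 64, 128, 256, 256]   # widths of CP1..CP5 (and likewise DP1..DP5)
dec = [256, 256, 128, 64]       # widths of FJ1..FJ4 (the deconvolved decoder features)
# Con1 stacks CP5 and DP5; Con2..Con5 additionally stack the decoder feature FJ1..FJ4
cons = [2 * enc[4]] + [dec[k] + 2 * enc[3 - k] for k in range(4)]
assert cons == [512, 768, 512, 256, 128]
```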
The input end of the 1st sub-output layer receives all feature maps in RH1; its output end outputs 2 feature maps of width W/16 and height H/16, whose set is denoted Out1; one of the two feature maps in Out1 is a saliency detection prediction map.
The input end of the 2nd sub-output layer receives all feature maps in RH2; its output end outputs 2 feature maps of width W/8 and height H/8, whose set is denoted Out2; one of the two feature maps in Out2 is a saliency detection prediction map.
The input end of the 3rd sub-output layer receives all feature maps in RH3; its output end outputs 2 feature maps of width W/4 and height H/4, whose set is denoted Out3; one of the two feature maps in Out3 is a saliency detection prediction map.
The input end of the 4th sub-output layer receives all feature maps in RH4; its output end outputs 2 feature maps of width W/2 and height H/2, whose set is denoted Out4; one of the two feature maps in Out4 is a saliency detection prediction map.
The input end of the 5th sub-output layer receives all feature maps in RH5; its output end outputs 2 feature maps of width W and height H, whose set is denoted Out5; one of the two feature maps in Out5 is a saliency detection prediction map.
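Putting step 1_2 together, the following is a minimal PyTorch sketch of the two-stream encoder-decoder wiring just described. It is an illustration rather than the patented implementation: the neural network blocks and fusion blocks are reduced here to plain conv-BN-ReLU stacks (their residual-block internals, specified further below, are omitted for brevity), and all class and variable names are our own.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # Simplified stand-in for the "neural network blocks" / "fusion neural network
    # blocks"; the patent's blocks additionally contain a residual block.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class TwoStreamSaliencyNet(nn.Module):         # hypothetical name
    def __init__(self):
        super().__init__()
        enc = [32, 64, 128, 256, 256]          # widths of CP1..CP5 / DP1..DP5
        self.rgb = nn.ModuleList(block(i, o) for i, o in zip([3] + enc[:-1], enc))
        self.dep = nn.ModuleList(block(i, o) for i, o in zip([1] + enc[:-1], enc))
        self.pool = nn.MaxPool2d(2, 2)         # pooling size 2, stride 2
        fuse_in = [512, 768, 512, 256, 128]    # widths of Con1..Con5
        fuse_out = [256, 256, 128, 64, 32]     # widths of RH1..RH5
        self.fuse = nn.ModuleList(block(i, o) for i, o in zip(fuse_in, fuse_out))
        # deconvolution: kernel 2x2, stride 2 -> doubles width and height (FJ1..FJ4)
        self.up = nn.ModuleList(nn.ConvTranspose2d(c, c, 2, stride=2) for c in fuse_out[:-1])
        self.out = nn.ModuleList(nn.Conv2d(c, 2, 1) for c in fuse_out)  # sub-output layers

    def forward(self, rgb, depth):
        cps, dps = [], []
        x, y = rgb, depth
        for i in range(5):                     # the two parallel encoder streams
            x, y = self.rgb[i](x), self.dep[i](y)
            cps.append(x); dps.append(y)
            if i < 4:
                x, y = self.pool(x), self.pool(y)
        outs, f = [], None
        for k in range(5):                     # decoder: concatenate -> fuse -> deconvolve
            skips = ([f] if f is not None else []) + [cps[4 - k], dps[4 - k]]
            r = self.fuse[k](torch.cat(skips, dim=1))   # Con -> RH
            outs.append(self.out[k](r))                 # 2-channel side output
            if k < 4:
                f = self.up[k](r)                       # FJ
        return outs                            # Out1..Out5, coarsest to full resolution

outs = TwoStreamSaliencyNet()(torch.randn(1, 3, 512, 512), torch.randn(1, 1, 512, 512))
print([tuple(o.shape[-2:]) for o in outs])  # [(32, 32), (64, 64), ..., (512, 512)]
```

For a 512 × 512 input the printed sizes confirm the five side outputs at W/16 × H/16 up to W × H.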
Step 1_3: Take each original color real-object image in the training set as the training RGB color image and its corresponding depth image as the training depth image, feed them into the convolutional neural network, and train, obtaining for each original color real-object image the 5 saliency detection prediction maps described above; the 5 prediction maps corresponding to {Iq(i,j)} are referred to below as the prediction-map set of {Iq(i,j)}.
Step 1_4: Scale the true saliency detection label image corresponding to each original color real-object image in the training set to 5 different sizes, obtaining images of width W/16 and height H/16, width W/8 and height H/8, width W/4 and height H/4, width W/2 and height H/2, and width W and height H; the 5 scaled images obtained from the true saliency detection label image corresponding to {Iq(i,j)} are referred to below as the scaled-label set of {Iq(i,j)}.
Step 1_5: For each original color real-object image in the training set, compute the loss function value between its prediction-map set and its scaled-label set; the loss is obtained using categorical cross-entropy.
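Steps 1_4 and 1_5 can be sketched as follows, taking PyTorch's `cross_entropy` over the 2-channel outputs as the categorical cross-entropy; the function name is ours:

```python
import torch.nn.functional as F

def multiscale_loss(outs, label):
    # outs: the 5 side-output logits, shapes (N, 2, W/16, H/16) up to (N, 2, W, H);
    # label: (N, W, H) integer ground truth, 1 = salient, 0 = background.
    total = 0.0
    for out in outs:
        # step 1_4: rescale the true label image to this output's size
        lab = F.interpolate(label[:, None].float(), size=out.shape[-2:], mode="nearest")
        # step 1_5: categorical cross-entropy against the scaled label
        total = total + F.cross_entropy(out, lab[:, 0].long())
    return total
```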
Step 1_6: Repeat steps 1_3 to 1_5 a total of V times to obtain the trained convolutional neural network model, yielding Q×V loss function values in all; find the smallest of these Q×V loss values, and take the weight vector and bias term corresponding to that smallest loss as the optimal weight vector and optimal bias term of the trained model, denoted Wbest and bbest respectively; here V > 1.
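Step 1_6 then amounts to an ordinary training loop that keeps the weights with the smallest observed loss as Wbest and bbest. The sketch below reuses the model and loss sketches above; the Adam optimizer and the learning rate are our assumptions, since the patent only specifies repeating steps 1_3 to 1_5 V times and keeping the best weights.

```python
import copy
import torch

def train(model, loader, V=300, lr=1e-4):
    # model: TwoStreamSaliencyNet from the sketch above; loader yields (rgb, depth, label);
    # multiscale_loss: the loss sketch above. Optimizer and lr are assumptions.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None     # tracks Wbest / bbest
    for _ in range(V):                             # repeat steps 1_3 to 1_5 V times
        for rgb, depth, label in loader:
            loss = multiscale_loss(model(rgb, depth), label)
            opt.zero_grad(); loss.backward(); opt.step()
            if loss.item() < best_loss:            # keep the best weights seen so far
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state
```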
The specific steps of the test stage are as follows:
Step 2_1: Take a color real-object image to be subjected to saliency detection, together with its corresponding depth image; both have width W' and height H', and pixel coordinates in them are denoted (i',j'), with 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H'.
Step 2_2: Feed the R, G, and B channel components of the color real-object image, together with its depth image, into the trained convolutional neural network model, and predict using Wbest and bbest, obtaining 5 predicted saliency detection images of different sizes; take the predicted saliency detection image whose size matches that of the input image as the final predicted saliency detection image.
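In code, the test stage reduces to one forward pass through the trained model; a sketch reusing the model above, where taking a softmax over the two output channels as the saliency probability is our assumption:

```python
import torch

@torch.no_grad()
def predict(model, rgb, depth):
    # model: the trained network with best_state loaded via model.load_state_dict(best_state)
    model.eval()
    outs = model(rgb, depth)                  # 5 predicted maps of different sizes
    full = outs[-1]                           # the one whose size matches the W' x H' input
    return torch.softmax(full, dim=1)[:, 1]   # per-pixel saliency probability, (N, W', H')
```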
In step 1_2, the 1st RGB image neural network block and the 1st depth image neural network block have the same structure, consisting in order of a first convolution layer, a first batch-normalization layer, a first activation layer, a first residual block, a second convolution layer, a second batch-normalization layer, and a second activation layer. The input end of the first convolution layer is the input end of the block; the first batch-normalization layer receives all feature maps output by the first convolution layer; the first activation layer receives all feature maps output by the first batch-normalization layer; the first residual block receives all feature maps output by the first activation layer; the second convolution layer receives all feature maps output by the first residual block; the second batch-normalization layer receives all feature maps output by the second convolution layer; the second activation layer receives all feature maps output by the second batch-normalization layer, and its output end is the output end of the block. The first and second convolution layers both have kernel size 3×3, 32 kernels, and zero-padding 1; the first and second activation layers both use "Relu"; and the first and second batch-normalization layers, the first and second activation layers, and the first residual block each output 32 feature maps.
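Written out in PyTorch, the first block could look as follows; since the patent does not spell out the internals of its residual blocks, a standard width-preserving identity residual block (two 3×3 convolutions with batch normalization plus a skip connection) is assumed here:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Assumed internals: conv-BN-ReLU-conv-BN plus an identity skip, width-preserving."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

# 1st RGB image / depth image neural network block as specified:
# conv(3x3, 32, pad 1) -> BN -> ReLU -> residual block -> conv(3x3, 32, pad 1) -> BN -> ReLU
def first_block(in_ch):          # in_ch = 3 for the RGB stream, 1 for the depth stream
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        ResidualBlock(32),
        nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    )

print(first_block(3)(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```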
The 2nd, 3rd, 4th, and 5th RGB image neural network blocks and the correspondingly numbered depth image neural network blocks have the same layer sequence as the 1st block, built respectively from the third through tenth convolution, batch-normalization, and activation layers and the second through fifth residual blocks, connected in the same order (convolution, batch normalization, activation, residual block, convolution, batch normalization, activation), with the input end of the first convolution layer serving as the block input and the output end of the last activation layer serving as the block output. All of these convolution layers have kernel size 3×3 and zero-padding 1, and all of these activation layers use "Relu"; the number of convolution kernels, and likewise the number of feature maps output by each batch-normalization layer, activation layer, and residual block, is 64 in the 2nd block, 128 in the 3rd block, 256 in the 4th block, and 256 in the 5th block.
In step 1_2, the 4 RGB image max-pooling layers and the 4 depth image max-pooling layers are all max-pooling layers with pooling size 2 and stride 2.
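In PyTorch terms each of these pooling layers is simply nn.MaxPool2d(kernel_size=2, stride=2), which halves both spatial dimensions:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(torch.randn(1, 32, 512, 512)).shape)  # torch.Size([1, 32, 256, 256])
```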
In step 1_2, the 5 fusion neural network blocks have the same structure, consisting in order of an eleventh convolution layer, an eleventh batch-normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch-normalization layer, and a twelfth activation layer, connected in the same way as the blocks above: the input end of the eleventh convolution layer is the input end of the fusion block, each layer receives all feature maps output by the preceding layer, and the output end of the twelfth activation layer is the output end of the block. In every fusion block the two convolution layers have kernel size 3×3 and zero-padding 1, and the two activation layers use "Relu"; the number of convolution kernels, and likewise the number of feature maps output by each batch-normalization layer, activation layer, and residual block, is 256 in the 1st and 2nd fusion blocks, 128 in the 3rd, 64 in the 4th, and 32 in the 5th.
In step 1_2, the 1st and 2nd deconvolution layers both have kernel size 2×2, 256 kernels, stride 2, and zero-padding 0; the 3rd deconvolution layer has kernel size 2×2, 128 kernels, stride 2, and zero-padding 0; and the 4th deconvolution layer has kernel size 2×2, 64 kernels, stride 2, and zero-padding 0.
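With kernel 2×2, stride 2, and zero padding, a transposed convolution exactly doubles the width and height, since the output size is (in − 1)·stride − 2·padding + kernel = 2·in; a quick check under these settings:

```python
import torch
import torch.nn as nn

# output size = (in - 1) * 2 - 2 * 0 + 2 = 2 * in
up = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0)
print(up(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 64, 64])
```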
In step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolution layer with kernel size 1×1, 2 kernels, and zero-padding 0.
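Each sub-output layer is therefore a single 1×1 convolution mapping the fused feature maps to two channels; for example, applied to RH1's 256 channels:

```python
import torch
import torch.nn as nn

head = nn.Conv2d(256, 2, kernel_size=1, padding=0)  # e.g. the 1st sub-output layer on RH1
print(head(torch.randn(1, 256, 32, 32)).shape)      # torch.Size([1, 2, 32, 32])
```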
Compared with the prior art, the present invention has the following advantages:
1) The convolutional neural network constructed by the method of the present invention performs end-to-end salient-object detection and is easy, convenient, and fast to train. The color real-object images of the training set and their corresponding depth images are fed into the convolutional neural network for training to obtain a trained model; the color real-object image to be detected and its corresponding depth image are then fed into the trained model to predict the corresponding saliency detection image. Because the method combines residual blocks and deconvolution layers when constructing the network, it deepens the trained model while improving its prediction accuracy.
2) The method uses late fusion of the depth information: the depth and color-image information of the encoding part is concatenated with the corresponding decoding stages. This avoids the noise that early fusion introduces in the encoding stage, while the complementary information of color-image and depth information can be fully learned during training; good results are therefore obtained on both the training set and the test set.
3) The invention adopts multi-scale supervision: the deconvolution layers allow the spatial details of objects to be refined during upsampling, prediction maps are output at different sizes, and each is supervised by a label map of the corresponding size. This guides the trained model to build up the saliency detection prediction map step by step, again giving better results on the training and test sets.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the composition of the convolutional neural network constructed by the method of the present invention;
Fig. 2a shows the precision-recall curves obtained by predicting every color real-object image in the NLPR real-object image database test set with the method of the present invention, reflecting its saliency detection performance;
Fig. 2b shows the corresponding mean absolute error;
Fig. 2c shows the corresponding F-measure;
Fig. 3a is the 1st original color real-object image of a scene;
Fig. 3b is the depth image corresponding to Fig. 3a;
Fig. 3c is the predicted saliency detection image obtained by applying the method of the present invention to Fig. 3a;
Fig. 4a is the 2nd original color real-object image of a scene;
Fig. 4b is the depth image corresponding to Fig. 4a;
Fig. 4c is the predicted saliency detection image obtained by applying the method of the present invention to Fig. 4a;
Fig. 5a is the 3rd original color real-object image of a scene;
Fig. 5b is the depth image corresponding to Fig. 5a;
Fig. 5c is the predicted saliency detection image obtained by applying the method of the present invention to Fig. 5a;
Fig. 6a is the 4th original color real-object image of a scene;
Fig. 6b is the depth image corresponding to Fig. 6a;
Fig. 6c is the predicted saliency detection image obtained by applying the method of the present invention to Fig. 6a.
Detailed description of embodiments
The present invention is further described in detail below in conjunction with the accompanying drawings and embodiments.
The saliency detection method based on residual network and depth information fusion proposed by the present invention includes two processes: a training phase and a testing phase.
The specific steps of the training phase are as follows:
Step 1_1: Select Q original color real object images, together with the depth image and the ground-truth saliency detection label image corresponding to each of them, to form a training set. Denote the q-th original color real object image in the training set, its corresponding depth image and its ground-truth saliency detection label image as {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)} respectively, where Q is a positive integer with Q≥200 (for example Q=367), q is a positive integer with initial value 1, 1≤q≤Q, 1≤i≤W, 1≤j≤H, W denotes the width of {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)}, H denotes their height, and both W and H are divisible by 2 (for example W=512, H=512); {I_q(i,j)} is an RGB color image and I_q(i,j) denotes the pixel value of the pixel at coordinate (i,j) in {I_q(i,j)}; {D_q(i,j)} is a single-channel depth image and D_q(i,j) denotes the pixel value of the pixel at coordinate (i,j) in {D_q(i,j)}; G_q(i,j) denotes the pixel value of the pixel at coordinate (i,j) in {G_q(i,j)}. Here, the original color real object images are taken directly from the 800 images of the NLPR database training set.
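As a concrete illustration of this image/depth/label pairing, a minimal Python/PyTorch-style loading sketch follows; the file paths and the 0.5 binarization threshold for the label image are assumptions for illustration, not values given by the patent.

```python
from PIL import Image
import torchvision.transforms.functional as TF

def load_pair(rgb_path, depth_path, label_path, size=(512, 512)):
    # Resize so that W = H = 512, both divisible by 2 as required above.
    rgb   = TF.to_tensor(Image.open(rgb_path).convert('RGB').resize(size))  # (3, H, W)
    depth = TF.to_tensor(Image.open(depth_path).convert('L').resize(size))  # (1, H, W)
    label = TF.to_tensor(Image.open(label_path).convert('L').resize(size))
    return rgb, depth, (label > 0.5).long().squeeze(0)                      # label: (H, W)
```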
Step 1_2: Construct the convolutional neural network. As shown in Fig. 1, the convolutional neural network comprises an input layer, a hidden layer and an output layer. The input layer includes an RGB image input layer and a depth image input layer; the hidden layer includes 5 RGB image neural network blocks, 4 RGB image maximum pooling layers (Max pooling, Pool), 5 depth image neural network blocks, 4 depth image maximum pooling layers, 5 concatenation layers, 5 fusion neural network blocks and 4 deconvolution layers; the output layer includes 5 sub-output layers. The 5 RGB image neural network blocks and the 4 RGB image maximum pooling layers constitute the encoding structure for the RGB image, the 5 depth image neural network blocks and the 4 depth image maximum pooling layers constitute the encoding structure for the depth image, and these two encoding structures together constitute the encoding layers of the convolutional neural network; the 5 concatenation layers, the 5 fusion neural network blocks and the 4 deconvolution layers constitute the decoding layers of the convolutional neural network.
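For orientation, a minimal sketch of the constructor implied by this layout is given below in PyTorch (the library named later in this description); ConvBNReLUResBlock is a hypothetical helper whose structure is sketched after the block descriptions further down, and all channel counts are taken from the text.

```python
import torch.nn as nn

class RGBDSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-stage (in, out) channel widths, taken from the layer descriptions below.
        rgb_ch  = [(3, 32), (32, 64), (64, 128), (128, 256), (256, 256)]
        dep_ch  = [(1, 32), (32, 64), (64, 128), (128, 256), (256, 256)]
        fuse_ch = [(512, 256), (768, 256), (512, 128), (256, 64), (128, 32)]
        self.rgb_blocks   = nn.ModuleList(ConvBNReLUResBlock(i, o) for i, o in rgb_ch)
        self.depth_blocks = nn.ModuleList(ConvBNReLUResBlock(i, o) for i, o in dep_ch)
        self.rgb_pools    = nn.ModuleList(nn.MaxPool2d(2, 2) for _ in range(4))
        self.depth_pools  = nn.ModuleList(nn.MaxPool2d(2, 2) for _ in range(4))
        self.fuse_blocks  = nn.ModuleList(ConvBNReLUResBlock(i, o) for i, o in fuse_ch)
        self.deconvs = nn.ModuleList([
            nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2),
            nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2),
            nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2),
            nn.ConvTranspose2d(64,  64,  kernel_size=2, stride=2),
        ])
        # Each sub-output layer is a single 1x1 convolution onto 2 channels.
        self.outs = nn.ModuleList(nn.Conv2d(c, 2, kernel_size=1)
                                  for c in (256, 256, 128, 64, 32))
```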
For the RGB image input layer, its input receives the R channel component, G channel component and B channel component of a training RGB color image, and its output passes these three channel components to the hidden layer; the training RGB color image is required to have width W and height H.
For the depth image input layer, its input receives the training depth image corresponding to the training RGB color image received by the RGB image input layer, and its output passes the training depth image to the hidden layer; the training depth image has width W and height H.
For the 1st RGB image neural network block, its input receives the R, G and B channel components output by the RGB image input layer, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted CP1.
For the 1st RGB image maximum pooling layer, its input receives all feature maps in CP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted ZC1.
For the 2nd RGB image neural network block, its input receives all feature maps in ZC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted CP2.
For the 2nd RGB image maximum pooling layer, its input receives all feature maps in CP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted ZC2.
For the 3rd RGB image neural network block, its input receives all feature maps in ZC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted CP3.
For the 3rd RGB image maximum pooling layer, its input receives all feature maps in CP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted ZC3.
For the 4th RGB image neural network block, its input receives all feature maps in ZC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted CP4.
For the 4th RGB image maximum pooling layer, its input receives all feature maps in CP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted ZC4.
For the 5th RGB image neural network block, its input receives all feature maps in ZC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted CP5.
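A minimal sketch of how the five blocks and four pooling layers chain together; the depth stream described next is wired identically, and `net` is an instance of the hypothetical RGBDSaliencyNet sketched above.

```python
def encode(blocks, pools, x):
    # Produces the per-stage sets CP1..CP5 (or DP1..DP5 for the depth stream);
    # each pooling step halves the width and height before the next block.
    feats = []
    for i, block in enumerate(blocks):
        x = block(x)
        feats.append(x)          # kept as a skip connection for the decoder
        if i < len(pools):
            x = pools[i](x)
    return feats

# cps = encode(net.rgb_blocks, net.rgb_pools, rgb)   # CP1..CP5
```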
For the 1st depth image neural network block, its input receives the training depth image output by the depth image input layer, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted DP1.
For the 1st depth image maximum pooling layer, its input receives all feature maps in DP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DC1.
For the 2nd depth image neural network block, its input receives all feature maps in DC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DP2.
For the 2nd depth image maximum pooling layer, its input receives all feature maps in DP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DC2.
For the 3rd depth image neural network block, its input receives all feature maps in DC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DP3.
For the 3rd depth image maximum pooling layer, its input receives all feature maps in DP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DC3.
For the 4th depth image neural network block, its input receives all feature maps in DC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DP4.
For the 4th depth image maximum pooling layer, its input receives all feature maps in DP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DC4.
For the 5th depth image neural network block, its input receives all feature maps in DC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DP5.
For the 1st concatenation layer, its input receives all feature maps in CP5 and all feature maps in DP5 and stacks them; its output outputs 512 feature maps of width W/16 and height H/16, and the set of all output feature maps is denoted Con1.
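Concatenation here is channel-wise stacking; a one-line PyTorch sketch, assuming cp5 and dp5 are the batched tensors holding CP5 and DP5:

```python
import torch

con1 = torch.cat([cp5, dp5], dim=1)   # (N, 256 + 256 = 512, H/16, W/16)
```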
For the 1st fusion neural network block, its input receives all feature maps in Con1, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted RH1.
For the 1st deconvolution layer, its input receives all feature maps in RH1, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted FJ1.
For the 2nd concatenation layer, its input receives all feature maps in FJ1, CP4 and DP4 and stacks them; its output outputs 768 feature maps of width W/8 and height H/8, and the set of all output feature maps is denoted Con2.
For the 2nd fusion neural network block, its input receives all feature maps in Con2, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted RH2.
For the 2nd deconvolution layer, its input receives all feature maps in RH2, and its output outputs 256 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted FJ2.
For the 3rd concatenation layer, its input receives all feature maps in FJ2, CP3 and DP3 and stacks them; its output outputs 512 feature maps of width W/4 and height H/4, and the set of all output feature maps is denoted Con3.
For the 3rd fusion neural network block, its input receives all feature maps in Con3, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted RH3.
For the 3rd deconvolution layer, its input receives all feature maps in RH3, and its output outputs 128 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted FJ3.
For the 4th concatenation layer, its input receives all feature maps in FJ3, CP2 and DP2 and stacks them; its output outputs 256 feature maps of width W/2 and height H/2, and the set of all output feature maps is denoted Con4.
For the 4th fusion neural network block, its input receives all feature maps in Con4, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted RH4.
For the 4th deconvolution layer, its input receives all feature maps in RH4, and its output outputs 64 feature maps of width W and height H; the set of all output feature maps is denoted FJ4.
For the 5th concatenation layer, its input receives all feature maps in FJ4, CP1 and DP1 and stacks them; its output outputs 128 feature maps of width W and height H, and the set of all output feature maps is denoted Con5.
For the 5th fusion neural network block, its input receives all feature maps in Con5, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted RH5.
For the 1st sub-output layer, its input receives all feature maps in RH1, and its output outputs 2 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted Out1, and one of the feature maps in Out1 (the 2nd feature map) is the saliency detection prediction map.
For the 2nd sub-output layer, its input receives all feature maps in RH2, and its output outputs 2 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted Out2, and one of the feature maps in Out2 (the 2nd feature map) is the saliency detection prediction map.
For the 3rd sub-output layer, its input receives all feature maps in RH3, and its output outputs 2 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted Out3, and one of the feature maps in Out3 (the 2nd feature map) is the saliency detection prediction map.
For the 4th sub-output layer, its input receives all feature maps in RH4, and its output outputs 2 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted Out4, and one of the feature maps in Out4 (the 2nd feature map) is the saliency detection prediction map.
For the 5th sub-output layer, its input receives all feature maps in RH5, and its output outputs 2 feature maps of width W and height H; the set of all output feature maps is denoted Out5, and one of the feature maps in Out5 (the 2nd feature map) is the saliency detection prediction map.
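Putting the concatenation, fusion, deconvolution and sub-output steps together, a minimal sketch of the decoding path; cps and dps are the lists CP1..CP5 and DP1..DP5 produced by the hypothetical encode helper above.

```python
def decode(net, cps, dps):
    outs = []
    x = torch.cat([cps[4], dps[4]], dim=1)                      # Con1: 512 ch
    for k in range(5):
        x = net.fuse_blocks[k](x)                               # RH_{k+1}
        outs.append(net.outs[k](x))                             # Out_{k+1}: 2 ch
        if k < 4:
            up = net.deconvs[k](x)                              # FJ_{k+1}: double size
            x = torch.cat([up, cps[3 - k], dps[3 - k]], dim=1)  # Con_{k+2}
    return outs    # five predictions, from W/16 x H/16 up to W x H
```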
Step 1_3: Use each original color real object image in the training set as the training RGB color image and its corresponding depth image as the training depth image, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real object image in the training set; the set of the 5 saliency detection prediction maps corresponding to {I_q(i,j)} is denoted S_q.
Step 1_4: Scale the ground-truth saliency detection label image corresponding to each original color real object image in the training set to 5 different sizes, obtaining images of width W/16 and height H/16, width W/8 and height H/8, width W/4 and height H/4, width W/2 and height H/2, and width W and height H; the set of the 5 images obtained by scaling the ground-truth saliency detection label image corresponding to {I_q(i,j)} is denoted Y_q.
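A sketch of this scaling with torch.nn.functional.interpolate; nearest-neighbour resampling is an assumption chosen so the scaled label maps stay binary.

```python
import torch.nn.functional as F

def multi_scale_labels(label, factors=(16, 8, 4, 2, 1)):
    # label: (N, H, W) integer map; returns the 5 scaled copies, small to large.
    return [F.interpolate(label[:, None].float(), scale_factor=1.0 / f,
                          mode='nearest').squeeze(1).long()
            for f in factors]
```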
Step 1_5: Compute the loss function value between the set S_q of 5 saliency detection prediction maps corresponding to each original color real object image in the training set and the set Y_q of the 5 scaled ground-truth saliency detection images of that image; the loss function value between S_q and Y_q is denoted Loss_q and is obtained using categorical cross-entropy (categorical crossentropy).
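With 2-channel predictions, categorical cross-entropy reduces to PyTorch's F.cross_entropy, summed over the five scales; a sketch:

```python
# preds: the five Out maps, each (N, 2, h, w); targets: the matching scaled labels.
loss = sum(F.cross_entropy(p, t) for p, t in zip(preds, targets))
```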
Step 1_6: Repeat steps 1_3 to 1_5 a total of V times to obtain the convolutional neural network training model, yielding Q×V loss function values in total; then find the smallest of these Q×V loss function values, and take the weight vector and bias term corresponding to that smallest loss value as the optimal weight vector and optimal bias term of the training model, denoted W_best and b_best respectively, where V>1 (V=300 in this embodiment).
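A minimal training-loop sketch under these rules; the Adam optimizer and its learning rate are assumptions (the patent fixes only V and the loss), and `train_loader` is a hypothetical iterator over (RGB, depth, label) triples.

```python
import copy
import torch.optim as optim

net = RGBDSaliencyNet()
optimizer = optim.Adam(net.parameters(), lr=1e-4)
best_loss, best_state = float('inf'), None              # tracks W_best, b_best
for epoch in range(300):                                # V = 300
    for rgb, depth, label in train_loader:
        preds = decode(net,
                       encode(net.rgb_blocks, net.rgb_pools, rgb),
                       encode(net.depth_blocks, net.depth_pools, depth))
        loss = sum(F.cross_entropy(p, t)
                   for p, t in zip(preds, multi_scale_labels(label)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best_loss:                     # keep the smallest loss
            best_loss = loss.item()
            best_state = copy.deepcopy(net.state_dict())
```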
The specific steps of the testing phase are as follows:
Step 2_1: Let {I'(i',j')} denote the color real object image to be saliency-detected, and denote its corresponding depth image as {D'(i',j')}, where 1≤i'≤W', 1≤j'≤H', W' denotes the width of {I'(i',j')} and {D'(i',j')}, H' denotes their height, I'(i',j') denotes the pixel value of the pixel at coordinate (i',j') in {I'(i',j')}, and D'(i',j') denotes the pixel value of the pixel at coordinate (i',j') in {D'(i',j')}.
Step 2_2: Input the R, G and B channel components of {I'(i',j')} together with {D'(i',j')} into the convolutional neural network training model, and use W_best and b_best for prediction to obtain the 5 predicted saliency detection images of different sizes corresponding to {I'(i',j')}; the predicted saliency detection image whose size matches that of {I'(i',j')} is taken as the final predicted saliency detection image corresponding to {I'(i',j')} and is denoted {S'(i',j')}, where S'(i',j') denotes the pixel value of the pixel at coordinate (i',j') in {S'(i',j')}.
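A minimal inference sketch: the network is restored with the best weights and only the full-resolution side output is kept, matching the size-selection rule above; reading channel 1 after a softmax as the saliency probability is an assumption consistent with the "2nd feature map" convention used earlier.

```python
net.load_state_dict(best_state)
net.eval()
with torch.no_grad():
    preds = decode(net,
                   encode(net.rgb_blocks, net.rgb_pools, rgb_test),
                   encode(net.depth_blocks, net.depth_pools, depth_test))
    saliency = preds[-1].softmax(dim=1)[:, 1]   # (N, H', W'), values in [0, 1]
```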
In this specific embodiment, in step 1_2, the 1st RGB image neural network block and the 1st depth image neural network block have the same structure, consisting in sequence of a first convolution layer (Convolution, Conv), a first batch normalization layer (Batch Normalize, BN), a first activation layer (Activation, Act), a first residual block (Residual Block, RB), a second convolution layer, a second batch normalization layer and a second activation layer. The input of the first convolution layer is the input of the neural network block; the first batch normalization layer receives all feature maps output by the first convolution layer; the first activation layer receives all feature maps output by the first batch normalization layer; the first residual block receives all feature maps output by the first activation layer; the second convolution layer receives all feature maps output by the first residual block; the second batch normalization layer receives all feature maps output by the second convolution layer; the second activation layer receives all feature maps output by the second batch normalization layer, and its output is the output of the neural network block. The kernel size (kernel_size) of the first and second convolution layers is 3×3, the number of kernels (filters) is 32 and the zero-padding parameter (padding) is 1; the activation function of the first and second activation layers is "ReLU"; the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps.
In this specific embodiment, the 2nd RGB image neural network block and the 2nd depth image neural network block have the same structure, consisting in sequence of a third convolution layer, a third batch normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer, connected in the same way as the corresponding layers of the 1st block. The kernel size of the third and fourth convolution layers is 3×3, the number of kernels is 64 and the zero-padding parameter is 1; the activation function of the third and fourth activation layers is "ReLU"; the third batch normalization layer, the fourth batch normalization layer, the third activation layer, the fourth activation layer and the second residual block each output 64 feature maps.
In this specific embodiment, the 3rd RGB image neural network block and the 3rd depth image neural network block have the same structure, consisting in sequence of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth batch normalization layer and a sixth activation layer, connected in the same way. The kernel size of the fifth and sixth convolution layers is 3×3, the number of kernels is 128 and the zero-padding parameter is 1; the activation function of the fifth and sixth activation layers is "ReLU"; the fifth batch normalization layer, the sixth batch normalization layer, the fifth activation layer, the sixth activation layer and the third residual block each output 128 feature maps.
In this specific embodiment, the 4th RGB image neural network block and the 4th depth image neural network block have the same structure, consisting in sequence of a seventh convolution layer, a seventh batch normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth batch normalization layer and an eighth activation layer, connected in the same way. The kernel size of the seventh and eighth convolution layers is 3×3, the number of kernels is 256 and the zero-padding parameter is 1; the activation function of the seventh and eighth activation layers is "ReLU"; the seventh batch normalization layer, the eighth batch normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block each output 256 feature maps.
In this specific embodiment, the 5th RGB image neural network block and the 5th depth image neural network block have the same structure, consisting in sequence of a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer, connected in the same way. The kernel size of the ninth and tenth convolution layers is 3×3, the number of kernels is 256 and the zero-padding parameter is 1; the activation function of the ninth and tenth activation layers is "ReLU"; the ninth batch normalization layer, the tenth batch normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps. The five blocks therefore share one layout and differ only in channel width, as sketched below.
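A minimal sketch of this shared layout as one parameterized module; the internal form of the residual block (two 3×3 convolutions with an identity shortcut) is an assumption, since the patent names the block but does not spell out its internals.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # identity shortcut

class ConvBNReLUResBlock(nn.Module):
    # Conv -> BN -> ReLU -> ResidualBlock -> Conv -> BN -> ReLU, as described above.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            ResidualBlock(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```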
In this specific embodiment, in step 1_2, the 4 RGB image maximum pooling layers and the 4 depth image maximum pooling layers are all max-pooling layers with a pooling size (pool_size) of 2 and a stride of 2.
In this specific embodiment, in step 1_2, the 5 fusion neural network blocks have the same structure, consisting in sequence of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer, connected in the same way as the layers of the encoder blocks; the input of the eleventh convolution layer is the input of the fusion neural network block, and the output of the twelfth activation layer is the output of the block. In all five fusion neural network blocks the kernel size of the eleventh and twelfth convolution layers is 3×3, the zero-padding parameter is 1 and the activation function of the eleventh and twelfth activation layers is "ReLU". In the 1st and 2nd fusion neural network blocks the number of kernels is 256, and the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 256 feature maps; in the 3rd fusion neural network block the number of kernels is 128 and these layers each output 128 feature maps; in the 4th fusion neural network block the number of kernels is 64 and these layers each output 64 feature maps; in the 5th fusion neural network block the number of kernels is 32 and these layers each output 32 feature maps.
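Since the fusion blocks repeat the same Conv-BN-ReLU-RB-Conv-BN-ReLU layout and differ only in channel width, the hypothetical helper sketched above covers them as well; for example:

```python
fusion3 = ConvBNReLUResBlock(512, 128)   # 3rd fusion block: Con3 (512 ch) -> RH3 (128 ch)
```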
In this specific embodiment, in step 1_2, the 1st and 2nd deconvolution layers have a kernel size of 2×2, 256 kernels, a stride of 2 and a zero-padding parameter of 0; the 3rd deconvolution layer has a kernel size of 2×2, 128 kernels, a stride of 2 and a zero-padding parameter of 0; the 4th deconvolution layer has a kernel size of 2×2, 64 kernels, a stride of 2 and a zero-padding parameter of 0.
In this specific embodiment, in step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolution layer with a kernel size of 1×1, 2 kernels and a zero-padding parameter of 0.
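In PyTorch terms these two layer types are one call each; a sketch with the parameters just listed:

```python
import torch.nn as nn

deconv1 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0)  # doubles W and H
sub_out = nn.Conv2d(256, 2, kernel_size=1, padding=0)                       # 1x1 sub-output layer
```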
To further verify the feasibility and effectiveness of the method of the present invention, experiments were carried out.
The architecture of the convolutional neural network proposed by the method of the present invention was built using the Python-based deep learning library PyTorch 0.4.1. The NLPR real object image test set (200 real object images) was used to analyze the saliency detection performance of the method on color real object images. Three objective parameters commonly used to evaluate saliency detection methods serve as evaluation indicators: the precision-recall curve (Precision Recall Curve), the mean absolute error (Mean Absolute Error, MAE) and the F-measure (F-Measure).
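For reference, minimal sketches of two of these indicators; the adaptive binarization threshold (twice the mean saliency) and β² = 0.3 are common conventions in the saliency literature, not values fixed by the patent.

```python
import torch

def mae(pred, gt):
    # Mean absolute error between a [0, 1] saliency map and the binary label.
    return (pred - gt).abs().mean().item()

def f_measure(pred, gt, beta2=0.3):
    thr = min(1.0, 2 * pred.mean().item())   # adaptive threshold, capped at 1
    binary = (pred >= thr).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall /
            (beta2 * precision + recall + 1e-8)).item()
```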
The method of the present invention was used to predict each color real object image in the NLPR test set, obtaining the predicted saliency detection image corresponding to each image. The precision-recall (PR) curve reflecting the saliency detection performance of the method is shown in Fig. 2a; the mean absolute error (MAE), shown in Fig. 2b, is 0.058; the F-measure, shown in Fig. 2c, is 0.796. As can be seen from Figs. 2a to 2c, the saliency detection results obtained by the method of the present invention on color real object images are good, indicating that using the method to obtain the predicted saliency detection image corresponding to a color real object image is feasible and effective.
Fig. 3a shows the original color real object image of the 1st scene, Fig. 3b the corresponding depth image, and Fig. 3c the predicted saliency detection image obtained by predicting Fig. 3a with the method of the present invention; Fig. 4a shows the original color real object image of the 2nd scene, Fig. 4b the corresponding depth image, and Fig. 4c the predicted saliency detection image obtained by predicting Fig. 4a; Fig. 5a shows the original color real object image of the 3rd scene, Fig. 5b the corresponding depth image, and Fig. 5c the predicted saliency detection image obtained by predicting Fig. 5a; Fig. 6a shows the original color real object image of the 4th scene, Fig. 6b the corresponding depth image, and Fig. 6c the predicted saliency detection image obtained by predicting Fig. 6a. Comparing Fig. 3a with Fig. 3c, Fig. 4a with Fig. 4c, Fig. 5a with Fig. 5c, and Fig. 6a with Fig. 6c, it can be seen that the predicted saliency detection images obtained by the method of the present invention have high detection accuracy.