Disclosure of Invention
The invention aims to provide a saliency detection method based on residual network and depth information fusion, which improves saliency detection accuracy by efficiently exploiting both depth information and color image information.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a saliency detection method based on residual network and depth information fusion, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real-object images, together with the depth image and the real saliency detection label image corresponding to each original color real-object image, to form a training set; denote the q-th original color real-object image in the training set, its corresponding depth image and its corresponding real saliency detection label image as {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)} respectively; wherein Q is a positive integer, Q ≥ 200, q is a positive integer with initial value 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)}, H denotes the height of {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)}, and W and H are both divisible by 2 (in practice by 2^4 = 16, since the network described below halves the resolution four times); {I_q(i,j)} is an RGB color image, and I_q(i,j) denotes the pixel value of the pixel whose coordinate position in {I_q(i,j)} is (i,j); {D_q(i,j)} is a single-channel depth image, and D_q(i,j) denotes the pixel value of the pixel whose coordinate position in {D_q(i,j)} is (i,j); and G_q(i,j) denotes the pixel value of the pixel whose coordinate position in {G_q(i,j)} is (i,j);
Step 1_2: construct a convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers; and the output layer comprises 5 sub-output layers. The 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map; the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map; the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network; and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1;
For the 1st RGB map maximum pooling layer, its input receives all feature maps in CP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as ZC1;
For the 2nd RGB map neural network block, its input receives all feature maps in ZC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as CP2;
For the 2nd RGB map maximum pooling layer, its input receives all feature maps in CP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as ZC2;
For the 3rd RGB map neural network block, its input receives all feature maps in ZC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as CP3;
For the 3rd RGB map maximum pooling layer, its input receives all feature maps in CP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as ZC3;
For the 4th RGB map neural network block, its input receives all feature maps in ZC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as CP4;
For the 4th RGB map maximum pooling layer, its input receives all feature maps in CP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as ZC4;
For the 5th RGB map neural network block, its input receives all feature maps in ZC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as CP5;
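The channel and resolution bookkeeping of the RGB map coding structure described above (5 neural network blocks with 32/64/128/256/256 kernels, interleaved with 4 stride-2 maximum pooling layers) can be traced with a short sketch. This is an illustration only; the function name is an assumption, not part of the invention:

```python
# Illustrative sketch: trace (channels, width, height) through the RGB map encoder.
# The depth map encoder described below follows the identical progression.

def trace_rgb_encoder(W, H):
    """Return a list of (stage name, channels, width, height) tuples."""
    block_channels = [32, 64, 128, 256, 256]   # kernels of the 5 neural network blocks
    c, w, h = 3, W, H                          # RGB input has 3 channels
    stages = []
    for k, ch in enumerate(block_channels):
        c = ch                                 # block k+1 sets the channel count (CP_k)
        stages.append(("CP%d" % (k + 1), c, w, h))
        if k < 4:                              # 4 max pooling layers halve width/height (ZC_k)
            w, h = w // 2, h // 2
            stages.append(("ZC%d" % (k + 1), c, w, h))
    return stages

stages = trace_rgb_encoder(224, 224)
print(stages[-1])   # CP5: 256 feature maps of width W/16 and height H/16
```

For a 224 × 224 input this confirms that CP5 consists of 256 feature maps of size 14 × 14, i.e. W/16 × H/16.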
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1;
For the 1st depth map maximum pooling layer, its input receives all feature maps in DP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as DC1;
For the 2nd depth map neural network block, its input receives all feature maps in DC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as DP2;
For the 2nd depth map maximum pooling layer, its input receives all feature maps in DP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as DC2;
For the 3rd depth map neural network block, its input receives all feature maps in DC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as DP3;
For the 3rd depth map maximum pooling layer, its input receives all feature maps in DP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as DC3;
For the 4th depth map neural network block, its input receives all feature maps in DC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as DP4;
For the 4th depth map maximum pooling layer, its input receives all feature maps in DP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as DC4;
For the 5th depth map neural network block, its input receives all feature maps in DC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as DP5;
For the 1st cascade layer, its input receives all feature maps in CP5 and all feature maps in DP5, and concatenates all feature maps in CP5 and DP5; its output outputs 512 feature maps of width W/16 and height H/16, and the set formed by all output feature maps is denoted as Con1;
For the 1st fusion neural network block, its input receives all feature maps in Con1, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as RH1;
For the 1st deconvolution layer, its input receives all feature maps in RH1, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as FJ1;
For the 2nd cascade layer, its input receives all feature maps in FJ1, all feature maps in CP4 and all feature maps in DP4, and concatenates all feature maps in FJ1, CP4 and DP4; its output outputs 768 feature maps of width W/8 and height H/8, and the set formed by all output feature maps is denoted as Con2;
For the 2nd fusion neural network block, its input receives all feature maps in Con2, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as RH2;
For the 2nd deconvolution layer, its input receives all feature maps in RH2, and its output outputs 256 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as FJ2;
For the 3rd cascade layer, its input receives all feature maps in FJ2, all feature maps in CP3 and all feature maps in DP3, and concatenates all feature maps in FJ2, CP3 and DP3; its output outputs 512 feature maps of width W/4 and height H/4, and the set formed by all output feature maps is denoted as Con3;
For the 3rd fusion neural network block, its input receives all feature maps in Con3, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as RH3;
For the 3rd deconvolution layer, its input receives all feature maps in RH3, and its output outputs 128 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as FJ3;
For the 4th cascade layer, its input receives all feature maps in FJ3, all feature maps in CP2 and all feature maps in DP2, and concatenates all feature maps in FJ3, CP2 and DP2; its output outputs 256 feature maps of width W/2 and height H/2, and the set formed by all output feature maps is denoted as Con4;
For the 4th fusion neural network block, its input receives all feature maps in Con4, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as RH4;
For the 4th deconvolution layer, its input receives all feature maps in RH4, and its output outputs 64 feature maps of width W and height H; the set formed by all output feature maps is denoted as FJ4;
For the 5th cascade layer, its input receives all feature maps in FJ4, all feature maps in CP1 and all feature maps in DP1, and concatenates all feature maps in FJ4, CP1 and DP1; its output outputs 128 feature maps of width W and height H, and the set formed by all output feature maps is denoted as Con5;
For the 5th fusion neural network block, its input receives all feature maps in Con5, and its output outputs 32 feature maps of width W and height H; the set formed by all output feature maps is denoted as RH5;
For the 1st sub-output layer, its input receives all feature maps in RH1, and its output outputs 2 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as Out1, and one of the feature maps in Out1 is a saliency detection prediction map;
For the 2nd sub-output layer, its input receives all feature maps in RH2, and its output outputs 2 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as Out2, and one of the feature maps in Out2 is a saliency detection prediction map;
For the 3rd sub-output layer, its input receives all feature maps in RH3, and its output outputs 2 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as Out3, and one of the feature maps in Out3 is a saliency detection prediction map;
For the 4th sub-output layer, its input receives all feature maps in RH4, and its output outputs 2 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as Out4, and one of the feature maps in Out4 is a saliency detection prediction map;
For the 5th sub-output layer, its input receives all feature maps in RH5, and its output outputs 2 feature maps of width W and height H; the set formed by all output feature maps is denoted as Out5, and one of the feature maps in Out5 is a saliency detection prediction map;
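The channel counts of the 5 cascade layers follow directly from the encoder and deconvolution channel counts stated above, and can be double-checked with a few lines of arithmetic (an illustrative sketch; the dictionary names are assumptions):

```python
# Verify the cascade-layer channel counts from the per-stage channel counts above.
rgb_channels    = {"CP5": 256, "CP4": 256, "CP3": 128, "CP2": 64, "CP1": 32}
depth_channels  = {"DP5": 256, "DP4": 256, "DP3": 128, "DP2": 64, "DP1": 32}
deconv_channels = {"FJ1": 256, "FJ2": 256, "FJ3": 128, "FJ4": 64}

con = {
    "Con1": rgb_channels["CP5"] + depth_channels["DP5"],
    "Con2": deconv_channels["FJ1"] + rgb_channels["CP4"] + depth_channels["DP4"],
    "Con3": deconv_channels["FJ2"] + rgb_channels["CP3"] + depth_channels["DP3"],
    "Con4": deconv_channels["FJ3"] + rgb_channels["CP2"] + depth_channels["DP2"],
    "Con5": deconv_channels["FJ4"] + rgb_channels["CP1"] + depth_channels["DP1"],
}
print(con)
```

The sums reproduce the 512, 768, 512, 256 and 128 feature maps stated for the 1st through 5th cascade layers.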
Step 1_3: use each original color real-object image in the training set as an RGB color image for training and the depth image corresponding to it as a depth image for training, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real-object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {I_q(i,j)} is denoted as S_q;
Step 1_4: scale the real saliency detection label image corresponding to each original color real-object image in the training set to 5 different sizes, obtaining an image of width W/16 and height H/16, an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to {I_q(i,j)} is denoted as Y_q;
Step 1_5: calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real-object image in the training set and the set formed by the 5 images obtained by scaling the corresponding real saliency detection label image; the loss function value between S_q and Y_q is denoted as Loss_q and is obtained using categorical cross entropy;
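As a minimal sketch of the categorical cross entropy named in step 1_5 — assuming each sub-output produces 2 feature maps (per-pixel scores for the non-salient and salient classes) and the scaled label map is binary — the per-scale loss can be computed as below. This is an illustration, not the patent's exact implementation:

```python
# Per-pixel categorical cross entropy over a 2-channel prediction map.
import numpy as np

def categorical_cross_entropy(scores, label):
    """scores: (2, h, w) raw scores; label: (h, w) integer array of 0/1 class indices."""
    scores = scores - scores.max(axis=0, keepdims=True)                 # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over 2 channels
    picked = np.take_along_axis(probs, label[None, ...], axis=0)[0]     # probability of true class
    return float(-np.log(picked + 1e-12).mean())

# Loss_q would then sum this quantity over the 5 scales:
# Loss_q = sum of categorical_cross_entropy(Out_k, label scaled to Out_k's size), k = 1..5.
```

With uniform scores the loss equals log 2 per pixel, and it approaches 0 as the correct channel dominates.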
Step 1_6: repeat steps 1_3 to 1_5 for V iterations in total to obtain a convolutional neural network training model, obtaining Q × V loss function values; then find the minimum loss function value among the Q × V loss function values; then take the weight vector and bias term corresponding to that minimum loss function value as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted W_best and b_best respectively; wherein V > 1;
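The model selection in step 1_6 — keep the weights and biases of whichever of the Q × V loss evaluations is smallest — can be sketched as follows (the record format and names are assumptions for illustration):

```python
# Select the (weights, biases) pair with the minimum recorded loss, as in step 1_6.

def select_best(records):
    """records: iterable of (loss_value, (weights, biases)) pairs, Q*V in total."""
    best_loss, (w_best, b_best) = min(records, key=lambda r: r[0])
    return best_loss, w_best, b_best

# Toy example with placeholder parameter objects:
records = [(0.9, ("w1", "b1")), (0.4, ("w2", "b2")), (0.7, ("w3", "b3"))]
best_loss, w_best, b_best = select_best(records)
print(best_loss, w_best, b_best)
```

In practice one would store a checkpoint per iteration rather than per image, but the selection rule is the same.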
the test stage process comprises the following specific steps:
Step 2_1: let {I'(i',j')} denote the color real-object image to be saliency-detected, and denote its corresponding depth image as {D'(i',j')}; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I'(i',j')} and {D'(i',j')}, H' denotes the height of {I'(i',j')} and {D'(i',j')}, I'(i',j') denotes the pixel value of the pixel whose coordinate position in {I'(i',j')} is (i',j'), and D'(i',j') denotes the pixel value of the pixel whose coordinate position in {D'(i',j')} is (i',j');
Step 2_2: input the R channel component, G channel component and B channel component of {I'(i',j')} together with {D'(i',j')} into the convolutional neural network training model, make a prediction using W_best and b_best, and obtain the 5 predicted saliency detection images of different sizes corresponding to {I'(i',j')}; take the predicted saliency detection image whose size is the same as that of {I'(i',j')} as the final predicted saliency detection image corresponding to {I'(i',j')}, denoted as {S'(i',j')}; wherein S'(i',j') denotes the pixel value of the pixel whose coordinate position in {S'(i',j')} is (i',j').
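The test-stage post-processing of step 2_2 can be sketched as follows, assuming the model returns its 5 prediction maps as 2-channel arrays and that the final binary saliency map is obtained by a per-pixel argmax over the 2 channels of the full-size map (function and variable names are assumptions):

```python
# Pick the prediction whose size matches the input image and binarize it.
import numpy as np

def final_saliency_map(predictions, W, H):
    """predictions: list of arrays of shape (2, h, w) at 5 different scales."""
    for p in predictions:
        if p.shape[1:] == (H, W):      # same size as the input image
            return p.argmax(axis=0)    # 1 = salient, 0 = non-salient
    raise ValueError("no prediction matches the input size")

# Toy example: five scales for a 224 x 224 input, with one "salient" region.
preds = [np.zeros((2, s, s)) for s in (14, 28, 56, 112, 224)]
preds[-1][1, :100, :100] = 5.0
sal = final_saliency_map(preds, 224, 224)
print(sal.shape, int(sal.sum()))
```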
In step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure, consisting of a first convolution layer, a first batch normalization layer, a first activation layer, a first residual block, a second convolution layer, a second batch normalization layer and a second activation layer, arranged in sequence: the input of the first convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the second activation layer is the output of the neural network block. The convolution kernels of the first and second convolution layers are both of size 3 × 3 with 32 kernels and zero-padding parameter 1; the activation function of the first and second activation layers is ReLU; and the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps;
the 2nd RGB map neural network block and the 2nd depth map neural network block have the same structure, consisting of a third convolution layer, a third batch normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer, arranged in sequence: the input of the third convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the fourth activation layer is the output of the neural network block. The convolution kernels of the third and fourth convolution layers are both of size 3 × 3 with 64 kernels and zero-padding parameter 1; the activation function of the third and fourth activation layers is ReLU; and the third batch normalization layer, the fourth batch normalization layer, the third activation layer, the fourth activation layer and the second residual block each output 64 feature maps;
the 3rd RGB map neural network block and the 3rd depth map neural network block have the same structure, consisting of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth batch normalization layer and a sixth activation layer, arranged in sequence: the input of the fifth convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the sixth activation layer is the output of the neural network block. The convolution kernels of the fifth and sixth convolution layers are both of size 3 × 3 with 128 kernels and zero-padding parameter 1; the activation function of the fifth and sixth activation layers is ReLU; and the fifth batch normalization layer, the sixth batch normalization layer, the fifth activation layer, the sixth activation layer and the third residual block each output 128 feature maps;
the 4th RGB map neural network block and the 4th depth map neural network block have the same structure, consisting of a seventh convolution layer, a seventh batch normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence: the input of the seventh convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the eighth activation layer is the output of the neural network block. The convolution kernels of the seventh and eighth convolution layers are both of size 3 × 3 with 256 kernels and zero-padding parameter 1; the activation function of the seventh and eighth activation layers is ReLU; and the seventh batch normalization layer, the eighth batch normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block each output 256 feature maps;
the 5th RGB map neural network block and the 5th depth map neural network block have the same structure, consisting of a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer, arranged in sequence: the input of the ninth convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the tenth activation layer is the output of the neural network block. The convolution kernels of the ninth and tenth convolution layers are both of size 3 × 3 with 256 kernels and zero-padding parameter 1; the activation function of the ninth and tenth activation layers is ReLU; and the ninth batch normalization layer, the tenth batch normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps.
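All five paired RGB map / depth map neural network blocks share the same layer sequence (convolution → batch normalization → ReLU → residual block → convolution → batch normalization → ReLU), differing only in kernel count. A hypothetical spec generator (not part of the patent) makes this pattern explicit:

```python
# Generate the shared layer sequence of the k-th RGB map / depth map neural network
# block, with the per-block kernel counts 32/64/128/256/256 stated above.

def block_spec(block_index):
    kernels = [32, 64, 128, 256, 256][block_index - 1]
    conv = ("conv", {"kernel": (3, 3), "filters": kernels, "zero_padding": 1})
    return [
        conv,
        ("batch_norm", {}),
        ("relu", {}),
        ("residual_block", {"filters": kernels}),
        conv,
        ("batch_norm", {}),
        ("relu", {}),
    ]

print([name for name, _ in block_spec(1)])
```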
In step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are all maximum pooling layers with pooling size 2 and stride 2.
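A quick arithmetic check (illustrative only) confirms that a maximum pooling layer with pooling size 2 and stride 2 halves each spatial dimension, using the standard output-size formula out = floor((in − pool) / stride) + 1:

```python
# Output size of a max pooling layer with the stated parameters.

def maxpool_out(size, pool=2, stride=2):
    return (size - pool) // stride + 1

print(maxpool_out(224), maxpool_out(112))
```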
In step 1_2, the 5 fusion neural network blocks have the same structure, consisting of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer, arranged in sequence: the input of the eleventh convolution layer is the input of the fusion neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the twelfth activation layer is the output of the fusion neural network block. In all 5 fusion neural network blocks, the convolution kernels of the eleventh and twelfth convolution layers are of size 3 × 3 with zero-padding parameter 1, and the activation function of the eleventh and twelfth activation layers is ReLU. The number of convolution kernels — and hence the number of feature maps output by the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block — is 256 in the 1st and 2nd fusion neural network blocks, 128 in the 3rd, 64 in the 4th, and 32 in the 5th.
In step 1_2, the convolution kernels of the 1st and 2nd deconvolution layers are of size 2 × 2, with 256 kernels each, stride 2 and zero-padding parameter 0; the 3rd deconvolution layer has 2 × 2 kernels, 128 kernels, stride 2 and zero-padding parameter 0; and the 4th deconvolution layer has 2 × 2 kernels, 64 kernels, stride 2 and zero-padding parameter 0.
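It can likewise be checked (an illustrative sketch) that each 2 × 2, stride-2, zero-padding-0 deconvolution (transposed convolution) doubles the spatial size, using the standard formula out = (in − 1) × stride − 2 × padding + kernel:

```python
# Output size of a deconvolution layer with the stated parameters.

def deconv_out(size, kernel=2, stride=2, padding=0):
    return (size - 1) * stride - 2 * padding + kernel

print(deconv_out(14), deconv_out(28))
```

This is why the decoder recovers the full W × H resolution after the 4th deconvolution layer.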
In step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolution layer with kernel size 1 × 1, 2 convolution kernels, and zero-padding parameter 0.
Compared with the prior art, the invention has the advantages that:
1) the convolutional neural network constructed by the method realizes the detection of the salient object from end to end, is easy to train, and is convenient and quick; inputting the color real object images and the corresponding depth images in the training set into a convolutional neural network for training to obtain a convolutional neural network training model; and inputting the color real object image to be subjected to significance detection and the corresponding depth image into the convolutional neural network training model, and predicting to obtain a predicted significance detection image corresponding to the color real object image.
2) When utilizing the depth information, the method adopts a late-fusion strategy: in the decoding stage, the depth information and the color image information from the corresponding levels of the two coding structures are cascaded (concatenated). This avoids the noise information that early fusion would add in the coding stage, and at the same time allows the complementary information between the color image information and the depth information to be fully learned when training the convolutional neural network training model, so a better effect is obtained on both the training set and the test set.
3) The invention adopts multi-scale supervision (Multi-scale Supervision): prediction maps are output at different sizes and supervised by label maps of the corresponding sizes, so that the spatial detail information of an object can be refined during up-sampling through the deconvolution layers. This guides the convolutional neural network training model to construct the saliency detection prediction map progressively, and thus a better effect is obtained on both the training set and the test set.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The significance detection method based on the residual error network and the depth information fusion comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images and, for each original color real object image, its corresponding depth image and real saliency detection label image, to form a training set; the q-th original color real object image in the training set is denoted as {Iq(i,j)} and its corresponding depth image as {Dq(i,j)}; wherein Q is a positive integer, Q ≥ 200 (for example, Q = 367), q is a positive integer with an initial value of 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width and H the height of {Iq(i,j)}, {Dq(i,j)} and the real saliency detection label image, and W and H can both be divided exactly by 2 (for example, W = 512 and H = 512); {Iq(i,j)} is an RGB color image, and Iq(i,j) denotes the pixel value of the pixel point whose coordinate position is (i,j) in {Iq(i,j)}; {Dq(i,j)} is a single-channel depth image, and Dq(i,j) denotes the pixel value of the pixel point whose coordinate position is (i,j) in {Dq(i,j)}; the pixel value of the pixel point whose coordinate position is (i,j) in the real saliency detection label image is denoted likewise. In this embodiment, the original color real object images are selected directly from the 800 images of the training set of the NLPR database.
Step 1_2: constructing a convolutional neural network: as shown in fig. 1, the convolutional neural network includes an input layer, a hidden layer and an output layer; the input layer includes an RGB map input layer and a depth map input layer; the hidden layer includes 5 RGB map neural network blocks, 4 RGB map maximum pooling layers (Pool), 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers; the output layer includes 5 sub-output layers. The 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network.
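The overall topology described above — two parallel coding streams, cascade (concatenation) of their features, fusion, deconvolution, and a 2-channel sub-output — can be sketched in PyTorch. The sketch below is a minimal illustration only, reduced to a single encoder stage per stream; the class and layer names are hypothetical, and the widths and stage counts do not reproduce the patent's full 5-stage design.

```python
import torch
import torch.nn as nn

# Minimal sketch of the dual-stream encoder / fusion decoder topology.
# One encoder stage per stream for brevity; names are illustrative.
class TwoStreamSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # RGB coding structure: neural network block + max pooling
        self.rgb_block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.rgb_pool1 = nn.MaxPool2d(2, 2)
        self.rgb_block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # depth coding structure (single-channel input)
        self.dep_block1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.dep_pool1 = nn.MaxPool2d(2, 2)
        self.dep_block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # decoding layer: cascade (concatenate) -> fusion block -> deconvolution
        self.fuse = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        # sub-output layer: 1x1 convolution with 2 kernels
        self.sub_out = nn.Conv2d(32, 2, kernel_size=1)

    def forward(self, rgb, depth):
        r = self.rgb_block2(self.rgb_pool1(self.rgb_block1(rgb)))
        d = self.dep_block2(self.dep_pool1(self.dep_block1(depth)))
        con = torch.cat([r, d], dim=1)   # late fusion by concatenation
        rh = self.fuse(con)
        fj = self.deconv(rh)             # restores the full resolution
        return self.sub_out(fj)          # 2 feature maps, width W, height H

net = TwoStreamSketch()
out = net(torch.zeros(1, 3, 64, 64), torch.zeros(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The sub-output here has the same width and height as the input, matching the 5th sub-output layer; the deeper sub-outputs of the full design sit at coarser scales.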
For the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; among them, the width of the RGB color image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; the training depth image has a width W and a height H.
For the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1。
For the 1st RGB map maximum pooling layer, its input end receives all the feature maps in CP1, and its output end outputs 32 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as ZC1.
For the 2nd RGB map neural network block, its input end receives all the feature maps in ZC1, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as CP2.
For the 2nd RGB map maximum pooling layer, its input end receives all the feature maps in CP2, and its output end outputs 64 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as ZC2.
For the 3rd RGB map neural network block, its input end receives all the feature maps in ZC2, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as CP3.
For the 3rd RGB map maximum pooling layer, its input end receives all the feature maps in CP3, and its output end outputs 128 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as ZC3.
For the 4th RGB map neural network block, its input end receives all the feature maps in ZC3, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as CP4.
For the 4th RGB map maximum pooling layer, its input end receives all the feature maps in CP4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as ZC4.
For the 5th RGB map neural network block, its input end receives all the feature maps in ZC4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as CP5.
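The progression of the RGB coding structure — channel widths from the block list above, spatial size halved by each of the 4 pooling layers — can be tallied with a few lines of arithmetic (W = H = 512 taken from the example in step 1_1):

```python
# Trace the RGB coding structure of step 1_2: each of the 4 maximum
# pooling layers halves width and height; channel widths follow the
# 5 RGB map neural network blocks described above.
def rgb_encoder_trace(w, h):
    channels = [32, 64, 128, 256, 256]   # outputs of blocks 1..5
    sizes = []
    for i, c in enumerate(channels):
        sizes.append((c, w, h))          # CP1 .. CP5
        if i < 4:                        # only 4 pooling layers
            w, h = w // 2, h // 2
    return sizes

trace = rgb_encoder_trace(512, 512)
print(trace)
# [(32, 512, 512), (64, 256, 256), (128, 128, 128), (256, 64, 64), (256, 32, 32)]
```

The depth coding structure below follows exactly the same progression (DP1..DP5, DC1..DC4).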
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1。
For the 1st depth map maximum pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as DC1.
For the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as DP2.
For the 2nd depth map maximum pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as DC2.
For the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as DP3.
For the 3rd depth map maximum pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as DC3.
For the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as DP4.
For the 4th depth map maximum pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as DC4.
For the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as DP5.
For the 1st cascade (Concatenation) layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and superposes them, and its output end outputs 512 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as Con1.
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as RH1.
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as FJ1.
For the 2nd cascade layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and superposes them, and its output end outputs 768 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as Con2.
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as RH2.
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as FJ2.
For the 3rd cascade layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and superposes them, and its output end outputs 512 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as Con3.
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as RH3.
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as FJ3.
For the 4th cascade layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and superposes them, and its output end outputs 256 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as Con4.
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as RH4.
For the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps with width W and height H; the set formed by all the output feature maps is denoted as FJ4.
For the 5th cascade layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and superposes them, and its output end outputs 128 feature maps with width W and height H; the set formed by all the output feature maps is denoted as Con5.
For the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps with width W and height H; the set formed by all the output feature maps is denoted as RH5.
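The channel counts of the five cascade layers follow directly from the encoder and deconvolution widths given above; a quick bookkeeping check (all values taken from the layer descriptions):

```python
# Channel bookkeeping for the 5 cascade layers of the decoding stage:
# Con1 concatenates the two deepest encoder sets; Con2..Con5 each
# concatenate the previous deconvolution output FJ with the matching
# CP and DP encoder sets.
cp = dp = [32, 64, 128, 256, 256]    # encoder block widths (CP_k = DP_k)
fj = [None, 256, 256, 128, 64]       # deconvolution outputs FJ1..FJ4
con1 = cp[4] + dp[4]                 # Con1: CP5 + DP5
con2 = fj[1] + cp[3] + dp[3]         # Con2: FJ1 + CP4 + DP4
con3 = fj[2] + cp[2] + dp[2]         # Con3: FJ2 + CP3 + DP3
con4 = fj[3] + cp[1] + dp[1]         # Con4: FJ3 + CP2 + DP2
con5 = fj[4] + cp[0] + dp[0]         # Con5: FJ4 + CP1 + DP1
print(con1, con2, con3, con4, con5)  # 512 768 512 256 128
```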
For the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as Out1, and the 2nd feature map in Out1 is a saliency detection prediction map.
For the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as Out2, and the 2nd feature map in Out2 is a saliency detection prediction map.
For the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as Out3, and the 2nd feature map in Out3 is a saliency detection prediction map.
For the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as Out4, and the 2nd feature map in Out4 is a saliency detection prediction map.
For the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps with width W and height H; the set formed by all the output feature maps is denoted as Out5, and the 2nd feature map in Out5 is a saliency detection prediction map.
Step 1_3: take each original color real object image in the training set as an RGB color image for training and its corresponding depth image as a depth image for training, input them into the convolutional neural network for training, and obtain, for each original color real object image {Iq(i,j)} in the training set, the set formed by its 5 corresponding saliency detection prediction maps.
Step 1_4: scale the real saliency detection label image corresponding to each original color real object image in the training set to 5 different sizes, obtaining an image with width W/16 and height H/16, an image with width W/8 and height H/8, an image with width W/4 and height H/4, an image with width W/2 and height H/2, and an image with width W and height H; for each {Iq(i,j)}, the 5 images obtained by scaling its corresponding real saliency detection label image form a set.
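The patent excerpt does not name the resampling method used for this scaling. For a binary label map, nearest-neighbour downscaling (sketched below on a toy 16 × 16 map; the helper name is ours) has the convenient property of keeping every pixel value in {0, 1}:

```python
# Step 1_4 scales each label image to 5 sizes (W/16 ... W). Nearest-
# neighbour downscaling is one plausible choice for binary label maps;
# the interpolation method is an assumption, not stated in the patent.
def nearest_downscale(label, factor):
    h, w = len(label), len(label[0])
    return [[label[r * factor][c * factor] for c in range(w // factor)]
            for r in range(h // factor)]

label = [[1 if 4 <= r < 12 and 4 <= c < 12 else 0 for c in range(16)]
         for r in range(16)]                      # toy ground-truth map
pyramid = [nearest_downscale(label, f) for f in (16, 8, 4, 2, 1)]
print([len(m) for m in pyramid])                  # [1, 2, 4, 8, 16]
```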
Step 1_5: calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real object image in the training set and the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to that original color real object image; the loss function value is obtained using categorical cross entropy (Categorical Cross Entropy).
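For a single pixel of a 2-channel sub-output, categorical cross entropy is the negative log of the softmax probability assigned to the true class. A minimal worked example (pure Python, class 1 taken as "salient"):

```python
import math

# Categorical cross entropy for one pixel of a 2-channel sub-output:
# softmax over the 2 feature-map values (logits), then -log of the
# probability assigned to the true class.
def pixel_cross_entropy(logits, true_class):
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[true_class])

loss_confident = pixel_cross_entropy([0.0, 4.0], 1)  # confidently salient, correct
loss_wrong = pixel_cross_entropy([4.0, 0.0], 1)      # confidently non-salient, wrong
print(loss_confident < loss_wrong)  # True
```

The per-image loss of step 1_5 is the average of such pixel terms over all pixels of all 5 scales.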
Step 1_6: repeatedly execute step 1_3 to step 1_5 V times to obtain a convolutional neural network training model and Q × V loss function values; then find the loss function value with the minimum value among the Q × V loss function values; then take the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network training model, correspondingly denoted as W_best and b_best; where V > 1, and in this embodiment V = 300.
The test stage process comprises the following specific steps:
Step 2_1: let a color real object image to be subjected to saliency detection be given, and denote its corresponding depth image accordingly; wherein 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width and H' the height of the color real object image to be detected and of its corresponding depth image, and the pixel values of the pixel points whose coordinate position is (i', j') in the two images are defined in the same way as in the training stage.
Step 2_2: input the R channel component, the G channel component and the B channel component of the color real object image to be detected, together with its corresponding depth image, into the convolutional neural network training model, and predict using W_best and b_best to obtain the 5 predicted saliency detection images of different sizes corresponding to the color real object image; the predicted saliency detection image whose size is the same as that of the color real object image is taken as its final predicted saliency detection image, in which the pixel value of the pixel point whose coordinate position is (i', j') is defined likewise.
In this embodiment, in step 1_2, the 1 st RGB map neural network Block and the 1 st depth map neural network Block have the same structure, and are composed of a first Convolution layer (convention, Conv), a first normalization layer (Batch normalization, BN), a first active layer (Activation, Act), a first Residual Block (Residual Block, RB), a second Convolution layer, a second normalization layer, and a second active layer, which are sequentially arranged, an input end of the first Convolution layer is an input end of the neural network Block, an input end of the first normalization layer receives all the feature maps output by an output end of the first Convolution layer, an input end of the first active layer receives all the feature maps output by an output end of the first normalization layer, an input end of the first Residual Block receives all the feature maps output by an output end of the first active layer, an input end of the second Convolution layer receives all the feature maps output by an output end of the first Residual Block, the input end of the second batch of normalization layers receives all the characteristic graphs output by the output end of the second convolution layer, the input end of the second activation layer receives all the characteristic graphs output by the output end of the second batch of normalization layers, and the output end of the second activation layer is the output end of the neural network block where the second activation layer is located; the convolution kernel sizes (kernel _ size) of the first convolution layer and the second convolution layer are 3 x 3, the convolution kernel numbers (filters) are 32, the zero padding parameter (padding) is 1, the activation modes of the first activation layer and the second activation layer are 'Relu', and output ends of the first normalization layer, the second normalization layer, the first activation layer, the second activation layer and the first residual error block output 
32 characteristic diagrams respectively.
In this embodiment, the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third active layer, a second residual block, a fourth convolution layer, a fourth normalization layer, and a fourth active layer, which are sequentially arranged, where an input end of the third convolution layer is an input end of the neural network block where the third convolution layer is located, an input end of the third normalization layer receives all feature maps output by an output end of the third convolution layer, an input end of the third active layer receives all feature maps output by an output end of the third normalization layer, an input end of the second residual block receives all feature maps output by an output end of the third active layer, an input end of the fourth convolution layer receives all feature maps output by an output end of the second residual block, and an input end of the fourth normalization layer receives all feature maps output by an output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block.
In this specific embodiment, the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure, and are composed of a fifth convolution layer, a fifth normalization layer, a fifth active layer, a third residual block, a sixth convolution layer, a sixth normalization layer, and a sixth active layer, which are sequentially arranged, where an input end of the fifth convolution layer is an input end of the neural network block where the fifth convolution layer is located, an input end of the fifth normalization layer receives all feature maps output by an output end of the fifth convolution layer, an input end of the fifth active layer receives all feature maps output by an output end of the fifth normalization layer, an input end of the third residual block receives all feature maps output by an output end of the fifth active layer, an input end of the sixth convolution layer receives all feature maps output by an output end of the third residual block, and an input end of the sixth normalization layer receives all feature maps output by an output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and 128 feature graphs are output from output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block respectively.
In this specific embodiment, the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh active layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer, and an eighth active layer, which are sequentially arranged, an input end of the seventh convolution layer is an input end of the neural network block where the seventh convolution layer is located, an input end of the seventh normalization layer receives all feature maps output by an output end of the seventh convolution layer, an input end of the seventh active layer receives all feature maps output by an output end of the seventh normalization layer, an input end of the fourth residual block receives all feature maps output by an output end of the seventh active layer, an input end of the eighth convolution layer receives all feature maps output by an output end of the fourth residual block, an input end of the eighth normalization layer receives all feature maps output by an output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block.
In this embodiment, the 5 th RGB map neural network block and the 5 th depth map neural network block have the same structure, and are composed of a ninth convolutional layer, a ninth block of normalization layers, a ninth active layer, a fifth residual block, a tenth convolutional layer, a tenth block of normalization layers, and a tenth active layer, which are sequentially arranged, an input end of the ninth convolutional layer is an input end of the neural network block where the ninth convolutional layer is located, an input end of the ninth block of normalization layers receives all feature maps output by an output end of the ninth convolutional layer, an input end of the ninth active layer receives all feature maps output by an output end of the ninth block of normalization layers, an input end of the fifth residual block receives all feature maps output by an output end of the ninth active layer, an input end of the tenth block of normalization layers receives all feature maps output by an output end of the tenth block of normalization layers, the input end of the tenth active layer receives all characteristic graphs output by the output end of the tenth normalization layer, and the output end of the tenth active layer is the output end of the neural network block where the tenth active layer is located; the sizes of convolution kernels of the ninth convolution layer and the tenth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and 256 feature maps are output from output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block respectively.
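The repeated block pattern above (convolution → batch normalization → "Relu" activation → residual block → convolution → batch normalization → "Relu" activation) can be sketched in PyTorch. The internals of the residual block are not detailed in this excerpt; a common form — two 3 × 3 convolutions plus an identity skip — is assumed below, so treat this as an illustration rather than the patent's exact residual block.

```python
import torch
import torch.nn as nn

# Assumed residual block: two 3x3 convs with an identity skip connection.
class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # identity skip connection

# One encoder neural network block as described above:
# Conv -> BN -> ReLU -> residual block -> Conv -> BN -> ReLU.
def encoder_block(in_ch, out_ch):
    # kernel 3x3 with zero-padding 1, matching the parameters above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        ResidualBlock(out_ch),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU())

block1 = encoder_block(3, 32)   # e.g. the 1st RGB map neural network block
y = block1(torch.zeros(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```

With padding 1 and stride 1 throughout, the block changes only the channel count, never the spatial size — the spatial reduction is left entirely to the pooling layers.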
In this embodiment, in step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are both maximum pooling layers, and the pooling sizes (pool _ size) and the step sizes (stride) of the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are both 2 and 2, respectively.
In this embodiment, in step 1_2, the 5 fusion neural network blocks have the same structure and are composed of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer which are sequentially arranged; the input end of the eleventh convolution layer is the input end of the fusion neural network block where it is located, the input end of the eleventh batch normalization layer receives all the feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all the feature maps output by the output end of the eleventh batch normalization layer, the input end of the sixth residual block receives all the feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all the feature maps output by the output end of the sixth residual block, the input end of the twelfth batch normalization layer receives all the feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all the feature maps output by the output end of the twelfth batch normalization layer, and the output end of the twelfth activation layer is the output end of the fusion neural network block where it is located. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 1st and 2nd fusion neural network blocks are both 3 × 3, the numbers of convolution kernels are both 256, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 1st and 2nd fusion neural network blocks are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 1st and 2nd fusion neural network blocks each output 256 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 3rd fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 128, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 3rd fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 3rd fusion neural network block each output 128 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 4th fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 64, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 4th fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 4th fusion neural network block each output 64 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 5th fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 32, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 5th fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 5th fusion neural network block each output 32 feature maps.
In this embodiment, in step 1_2, the convolution kernel sizes of the 1st and 2nd deconvolution layers are both 2 × 2, the numbers of convolution kernels are both 256, the step sizes are both 2, and the zero-padding parameters are both 0; the convolution kernel size of the 3rd deconvolution layer is 2 × 2, the number of convolution kernels is 128, the step size is 2, and the zero-padding parameter is 0; the convolution kernel size of the 4th deconvolution layer is 2 × 2, the number of convolution kernels is 64, the step size is 2, and the zero-padding parameter is 0.
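The deconvolution parameters above imply that each deconvolution layer doubles the spatial size of its feature maps. A short sketch using the standard transposed-convolution output-size formula (with no output padding and dilation 1) makes this explicit:

```python
def deconv_output_size(size_in: int, kernel: int = 2, stride: int = 2,
                       padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (deconvolution):
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size_in - 1) * stride - 2 * padding + kernel

# With the parameters given in the text (2x2 kernel, step size 2, zero padding 0),
# every deconvolution layer doubles the feature-map width and height:
for size_in in (8, 16, 32):
    assert deconv_output_size(size_in) == 2 * size_in
```

This is why the decoder needs exactly 4 deconvolution layers to undo the 4 max-pooling layers of each encoder branch.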
In this embodiment, in step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolutional layer; the convolution kernel size of the thirteenth convolutional layer is 1 × 1, the number of convolution kernels is 2, and the zero-padding parameter is 0.
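A 1 × 1 convolution with 2 kernels maps each pixel's channel vector independently to 2 values (here, scores for the salient and non-salient classes) without changing the spatial size. A minimal numpy sketch of this operation (function name and dimensions are illustrative, not from the patent):

```python
import numpy as np

def conv1x1(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,).
    Every pixel's C_in-dimensional channel vector is mapped by the same linear map."""
    return np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

# Example: 32 input channels (as output by the 5th fusion block), 2 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8, 8))
w = rng.standard_normal((2, 32))
out = conv1x1(x, w, np.zeros(2))   # shape (2, 8, 8), spatial size unchanged
```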
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture proposed by the method of the invention is constructed using the Python-based deep learning library PyTorch 0.4.1. The NLPR real object image test set (containing 200 real object images) is used to analyze the saliency detection performance of the method of the invention on color real object images. Here, 3 objective parameters commonly used to evaluate saliency detection methods are adopted as evaluation indexes: the Precision-Recall Curve (PR Curve), the Mean Absolute Error (MAE), and the F-Measure.
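The two scalar indexes above can be computed as follows. This is a generic sketch, not the patent's evaluation code: β² = 0.3 is the value customarily used in saliency detection, and the fixed 0.5 threshold is an assumption (the PR Curve is obtained by sweeping the threshold instead).

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Absolute Error between a predicted saliency map and the ground
    truth, both assumed to be normalized to [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray,
              threshold: float = 0.5, beta2: float = 0.3) -> float:
    """F-Measure at one binarization threshold; beta^2 = 0.3 is the value
    customarily used in saliency detection."""
    pred_bin = pred >= threshold
    gt_bin = gt >= 0.5
    tp = np.sum(pred_bin & gt_bin)
    precision = tp / max(np.sum(pred_bin), 1)
    recall = tp / max(np.sum(gt_bin), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))
```

Lower MAE and higher F-Measure indicate better agreement between the predicted saliency detection image and the real saliency detection label image.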
The method of the invention is used to predict each color real object image in the NLPR real object image test set, obtaining a predicted saliency detection image corresponding to each color real object image. The Precision-Recall Curve (PR Curve) reflecting the saliency detection effect of the method of the present invention is shown in fig. 2a; the Mean Absolute Error (MAE) reflecting the saliency detection effect of the method of the present invention is shown in fig. 2b, with a value of 0.058; and the F-Measure reflecting the saliency detection effect of the method of the present invention is shown in fig. 2c, with a value of 0.796. As can be seen from fig. 2a to 2c, the saliency detection results obtained by the method of the present invention on color real object images are good, which indicates that obtaining predicted saliency detection images corresponding to color real object images by the method of the present invention is feasible and effective.
FIG. 3a shows the 1st original color real object image of the same scene, FIG. 3b shows the depth image corresponding to FIG. 3a, and FIG. 3c shows the predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention; FIG. 4a shows the 2nd original color real object image of the same scene, FIG. 4b shows the depth image corresponding to FIG. 4a, and FIG. 4c shows the predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention; FIG. 5a shows the 3rd original color real object image of the same scene, FIG. 5b shows the depth image corresponding to FIG. 5a, and FIG. 5c shows the predicted saliency detection image obtained by predicting FIG. 5a using the method of the present invention; FIG. 6a shows the 4th original color real object image of the same scene, FIG. 6b shows the depth image corresponding to FIG. 6a, and FIG. 6c shows the predicted saliency detection image obtained by predicting FIG. 6a using the method of the present invention. Comparing FIG. 3a with FIG. 3c, FIG. 4a with FIG. 4c, FIG. 5a with FIG. 5c, and FIG. 6a with FIG. 6c, it can be seen that the predicted saliency detection images obtained by the method of the present invention have high detection accuracy.