Disclosure of Invention
The invention aims to provide a saliency detection method based on residual network and depth information fusion, which improves saliency detection accuracy by efficiently exploiting both depth information and color image information.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a saliency detection method based on residual network and depth information fusion, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real-object images, together with the depth image and the real saliency detection label image corresponding to each original color real-object image, to form a training set; denote the q-th original color real-object image in the training set, its corresponding depth image and its corresponding real saliency detection label image as {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)} respectively; wherein Q is a positive integer, Q ≥ 200, q is a positive integer with initial value 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)}, H denotes the height of {I_q(i,j)}, {D_q(i,j)} and {G_q(i,j)}, and W and H are both divisible by 2 (in practice by 2^4 = 16, since the network described below halves the resolution four times); {I_q(i,j)} is an RGB color image, and I_q(i,j) denotes the pixel value of the pixel whose coordinate position in {I_q(i,j)} is (i,j); {D_q(i,j)} is a single-channel depth image, and D_q(i,j) denotes the pixel value of the pixel whose coordinate position in {D_q(i,j)} is (i,j); and G_q(i,j) denotes the pixel value of the pixel whose coordinate position in {G_q(i,j)} is (i,j);
Step 1_2: construct a convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers; and the output layer comprises 5 sub-output layers. The 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map; the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map; the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network; and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1;
For the 1st RGB map maximum pooling layer, its input receives all feature maps in CP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as ZC1;
For the 2nd RGB map neural network block, its input receives all feature maps in ZC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as CP2;
For the 2nd RGB map maximum pooling layer, its input receives all feature maps in CP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as ZC2;
For the 3rd RGB map neural network block, its input receives all feature maps in ZC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as CP3;
For the 3rd RGB map maximum pooling layer, its input receives all feature maps in CP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as ZC3;
For the 4th RGB map neural network block, its input receives all feature maps in ZC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as CP4;
For the 4th RGB map maximum pooling layer, its input receives all feature maps in CP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as ZC4;
For the 5th RGB map neural network block, its input receives all feature maps in ZC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as CP5;
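The channel and resolution bookkeeping of the RGB map coding structure described above (5 neural network blocks with 32/64/128/256/256 kernels, interleaved with 4 stride-2 maximum pooling layers) can be traced with a short sketch. This is an illustration only; the function name is an assumption, not part of the invention:

```python
# Illustrative sketch: trace (channels, width, height) through the RGB map encoder.
# The depth map encoder described below follows the identical progression.

def trace_rgb_encoder(W, H):
    """Return a list of (stage name, channels, width, height) tuples."""
    block_channels = [32, 64, 128, 256, 256]   # kernels of the 5 neural network blocks
    c, w, h = 3, W, H                          # RGB input has 3 channels
    stages = []
    for k, ch in enumerate(block_channels):
        c = ch                                 # block k+1 sets the channel count (CP_k)
        stages.append(("CP%d" % (k + 1), c, w, h))
        if k < 4:                              # 4 max pooling layers halve width/height (ZC_k)
            w, h = w // 2, h // 2
            stages.append(("ZC%d" % (k + 1), c, w, h))
    return stages

stages = trace_rgb_encoder(224, 224)
print(stages[-1])   # CP5: 256 feature maps of width W/16 and height H/16
```

For a 224 × 224 input this confirms that CP5 consists of 256 feature maps of size 14 × 14, i.e. W/16 × H/16.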
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1;
For the 1st depth map maximum pooling layer, its input receives all feature maps in DP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as DC1;
For the 2nd depth map neural network block, its input receives all feature maps in DC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as DP2;
For the 2nd depth map maximum pooling layer, its input receives all feature maps in DP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as DC2;
For the 3rd depth map neural network block, its input receives all feature maps in DC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as DP3;
For the 3rd depth map maximum pooling layer, its input receives all feature maps in DP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as DC3;
For the 4th depth map neural network block, its input receives all feature maps in DC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as DP4;
For the 4th depth map maximum pooling layer, its input receives all feature maps in DP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as DC4;
For the 5th depth map neural network block, its input receives all feature maps in DC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as DP5;
For the 1st cascade layer, its input receives all feature maps in CP5 and all feature maps in DP5, and concatenates all feature maps in CP5 and DP5; its output outputs 512 feature maps of width W/16 and height H/16, and the set formed by all output feature maps is denoted as Con1;
For the 1st fusion neural network block, its input receives all feature maps in Con1, and its output outputs 256 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as RH1;
For the 1st deconvolution layer, its input receives all feature maps in RH1, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as FJ1;
For the 2nd cascade layer, its input receives all feature maps in FJ1, all feature maps in CP4 and all feature maps in DP4, and concatenates all feature maps in FJ1, CP4 and DP4; its output outputs 768 feature maps of width W/8 and height H/8, and the set formed by all output feature maps is denoted as Con2;
For the 2nd fusion neural network block, its input receives all feature maps in Con2, and its output outputs 256 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as RH2;
For the 2nd deconvolution layer, its input receives all feature maps in RH2, and its output outputs 256 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as FJ2;
For the 3rd cascade layer, its input receives all feature maps in FJ2, all feature maps in CP3 and all feature maps in DP3, and concatenates all feature maps in FJ2, CP3 and DP3; its output outputs 512 feature maps of width W/4 and height H/4, and the set formed by all output feature maps is denoted as Con3;
For the 3rd fusion neural network block, its input receives all feature maps in Con3, and its output outputs 128 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as RH3;
For the 3rd deconvolution layer, its input receives all feature maps in RH3, and its output outputs 128 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as FJ3;
For the 4th cascade layer, its input receives all feature maps in FJ3, all feature maps in CP2 and all feature maps in DP2, and concatenates all feature maps in FJ3, CP2 and DP2; its output outputs 256 feature maps of width W/2 and height H/2, and the set formed by all output feature maps is denoted as Con4;
For the 4th fusion neural network block, its input receives all feature maps in Con4, and its output outputs 64 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as RH4;
For the 4th deconvolution layer, its input receives all feature maps in RH4, and its output outputs 64 feature maps of width W and height H; the set formed by all output feature maps is denoted as FJ4;
For the 5th cascade layer, its input receives all feature maps in FJ4, all feature maps in CP1 and all feature maps in DP1, and concatenates all feature maps in FJ4, CP1 and DP1; its output outputs 128 feature maps of width W and height H, and the set formed by all output feature maps is denoted as Con5;
For the 5th fusion neural network block, its input receives all feature maps in Con5, and its output outputs 32 feature maps of width W and height H; the set formed by all output feature maps is denoted as RH5;
For the 1st sub-output layer, its input receives all feature maps in RH1, and its output outputs 2 feature maps of width W/16 and height H/16; the set formed by all output feature maps is denoted as Out1, and one of the feature maps in Out1 is a saliency detection prediction map;
For the 2nd sub-output layer, its input receives all feature maps in RH2, and its output outputs 2 feature maps of width W/8 and height H/8; the set formed by all output feature maps is denoted as Out2, and one of the feature maps in Out2 is a saliency detection prediction map;
For the 3rd sub-output layer, its input receives all feature maps in RH3, and its output outputs 2 feature maps of width W/4 and height H/4; the set formed by all output feature maps is denoted as Out3, and one of the feature maps in Out3 is a saliency detection prediction map;
For the 4th sub-output layer, its input receives all feature maps in RH4, and its output outputs 2 feature maps of width W/2 and height H/2; the set formed by all output feature maps is denoted as Out4, and one of the feature maps in Out4 is a saliency detection prediction map;
For the 5th sub-output layer, its input receives all feature maps in RH5, and its output outputs 2 feature maps of width W and height H; the set formed by all output feature maps is denoted as Out5, and one of the feature maps in Out5 is a saliency detection prediction map;
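The channel counts of the 5 cascade layers follow directly from the encoder and deconvolution channel counts stated above, and can be double-checked with a few lines of arithmetic (an illustrative sketch; the dictionary names are assumptions):

```python
# Verify the cascade-layer channel counts from the per-stage channel counts above.
rgb_channels    = {"CP5": 256, "CP4": 256, "CP3": 128, "CP2": 64, "CP1": 32}
depth_channels  = {"DP5": 256, "DP4": 256, "DP3": 128, "DP2": 64, "DP1": 32}
deconv_channels = {"FJ1": 256, "FJ2": 256, "FJ3": 128, "FJ4": 64}

con = {
    "Con1": rgb_channels["CP5"] + depth_channels["DP5"],
    "Con2": deconv_channels["FJ1"] + rgb_channels["CP4"] + depth_channels["DP4"],
    "Con3": deconv_channels["FJ2"] + rgb_channels["CP3"] + depth_channels["DP3"],
    "Con4": deconv_channels["FJ3"] + rgb_channels["CP2"] + depth_channels["DP2"],
    "Con5": deconv_channels["FJ4"] + rgb_channels["CP1"] + depth_channels["DP1"],
}
print(con)
```

The sums reproduce the 512, 768, 512, 256 and 128 feature maps stated for the 1st through 5th cascade layers.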
Step 1_3: use each original color real-object image in the training set as an RGB color image for training and the depth image corresponding to it as a depth image for training, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real-object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {I_q(i,j)} is denoted as S_q;
Step 1_4: scale the real saliency detection label image corresponding to each original color real-object image in the training set to 5 different sizes, obtaining an image of width W/16 and height H/16, an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to {I_q(i,j)} is denoted as Y_q;
Step 1_5: calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real-object image in the training set and the set formed by the 5 images obtained by scaling the corresponding real saliency detection label image; the loss function value between S_q and Y_q is denoted as Loss_q and is obtained using categorical cross entropy;
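As a minimal sketch of the categorical cross entropy named in step 1_5 — assuming each sub-output produces 2 feature maps (per-pixel scores for the non-salient and salient classes) and the scaled label map is binary — the per-scale loss can be computed as below. This is an illustration, not the patent's exact implementation:

```python
# Per-pixel categorical cross entropy over a 2-channel prediction map.
import numpy as np

def categorical_cross_entropy(scores, label):
    """scores: (2, h, w) raw scores; label: (h, w) integer array of 0/1 class indices."""
    scores = scores - scores.max(axis=0, keepdims=True)                 # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over 2 channels
    picked = np.take_along_axis(probs, label[None, ...], axis=0)[0]     # probability of true class
    return float(-np.log(picked + 1e-12).mean())

# Loss_q would then sum this quantity over the 5 scales:
# Loss_q = sum of categorical_cross_entropy(Out_k, label scaled to Out_k's size), k = 1..5.
```

With uniform scores the loss equals log 2 per pixel, and it approaches 0 as the correct channel dominates.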
Step 1_6: repeat steps 1_3 to 1_5 for V iterations in total to obtain a convolutional neural network training model, obtaining Q × V loss function values; then find the minimum loss function value among the Q × V loss function values; then take the weight vector and bias term corresponding to that minimum loss function value as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted W_best and b_best respectively; wherein V > 1;
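The model selection in step 1_6 — keep the weights and biases of whichever of the Q × V loss evaluations is smallest — can be sketched as follows (the record format and names are assumptions for illustration):

```python
# Select the (weights, biases) pair with the minimum recorded loss, as in step 1_6.

def select_best(records):
    """records: iterable of (loss_value, (weights, biases)) pairs, Q*V in total."""
    best_loss, (w_best, b_best) = min(records, key=lambda r: r[0])
    return best_loss, w_best, b_best

# Toy example with placeholder parameter objects:
records = [(0.9, ("w1", "b1")), (0.4, ("w2", "b2")), (0.7, ("w3", "b3"))]
best_loss, w_best, b_best = select_best(records)
print(best_loss, w_best, b_best)
```

In practice one would store a checkpoint per iteration rather than per image, but the selection rule is the same.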
the test stage process comprises the following specific steps:
Step 2_1: let {I'(i',j')} denote the color real-object image to be saliency-detected, and denote its corresponding depth image as {D'(i',j')}; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I'(i',j')} and {D'(i',j')}, H' denotes the height of {I'(i',j')} and {D'(i',j')}, I'(i',j') denotes the pixel value of the pixel whose coordinate position in {I'(i',j')} is (i',j'), and D'(i',j') denotes the pixel value of the pixel whose coordinate position in {D'(i',j')} is (i',j');
Step 2_2: input the R channel component, G channel component and B channel component of {I'(i',j')} together with {D'(i',j')} into the convolutional neural network training model, make a prediction using W_best and b_best, and obtain the 5 predicted saliency detection images of different sizes corresponding to {I'(i',j')}; take the predicted saliency detection image whose size is the same as that of {I'(i',j')} as the final predicted saliency detection image corresponding to {I'(i',j')}, denoted as {S'(i',j')}; wherein S'(i',j') denotes the pixel value of the pixel whose coordinate position in {S'(i',j')} is (i',j').
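The test-stage post-processing of step 2_2 can be sketched as follows, assuming the model returns its 5 prediction maps as 2-channel arrays and that the final binary saliency map is obtained by a per-pixel argmax over the 2 channels of the full-size map (function and variable names are assumptions):

```python
# Pick the prediction whose size matches the input image and binarize it.
import numpy as np

def final_saliency_map(predictions, W, H):
    """predictions: list of arrays of shape (2, h, w) at 5 different scales."""
    for p in predictions:
        if p.shape[1:] == (H, W):      # same size as the input image
            return p.argmax(axis=0)    # 1 = salient, 0 = non-salient
    raise ValueError("no prediction matches the input size")

# Toy example: five scales for a 224 x 224 input, with one "salient" region.
preds = [np.zeros((2, s, s)) for s in (14, 28, 56, 112, 224)]
preds[-1][1, :100, :100] = 5.0
sal = final_saliency_map(preds, 224, 224)
print(sal.shape, int(sal.sum()))
```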
In step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure, consisting of a first convolution layer, a first batch normalization layer, a first activation layer, a first residual block, a second convolution layer, a second batch normalization layer and a second activation layer, arranged in sequence: the input of the first convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the second activation layer is the output of the neural network block. The convolution kernels of the first and second convolution layers are both of size 3 × 3 with 32 kernels and zero-padding parameter 1; the activation function of the first and second activation layers is ReLU; and the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps;
the 2nd RGB map neural network block and the 2nd depth map neural network block have the same structure, consisting of a third convolution layer, a third batch normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer, arranged in sequence: the input of the third convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the fourth activation layer is the output of the neural network block. The convolution kernels of the third and fourth convolution layers are both of size 3 × 3 with 64 kernels and zero-padding parameter 1; the activation function of the third and fourth activation layers is ReLU; and the third batch normalization layer, the fourth batch normalization layer, the third activation layer, the fourth activation layer and the second residual block each output 64 feature maps;
the 3rd RGB map neural network block and the 3rd depth map neural network block have the same structure, consisting of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth batch normalization layer and a sixth activation layer, arranged in sequence: the input of the fifth convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the sixth activation layer is the output of the neural network block. The convolution kernels of the fifth and sixth convolution layers are both of size 3 × 3 with 128 kernels and zero-padding parameter 1; the activation function of the fifth and sixth activation layers is ReLU; and the fifth batch normalization layer, the sixth batch normalization layer, the fifth activation layer, the sixth activation layer and the third residual block each output 128 feature maps;
the 4th RGB map neural network block and the 4th depth map neural network block have the same structure, consisting of a seventh convolution layer, a seventh batch normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth batch normalization layer and an eighth activation layer, arranged in sequence: the input of the seventh convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the eighth activation layer is the output of the neural network block. The convolution kernels of the seventh and eighth convolution layers are both of size 3 × 3 with 256 kernels and zero-padding parameter 1; the activation function of the seventh and eighth activation layers is ReLU; and the seventh batch normalization layer, the eighth batch normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block each output 256 feature maps;
the 5th RGB map neural network block and the 5th depth map neural network block have the same structure, consisting of a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer, arranged in sequence: the input of the ninth convolution layer is the input of the neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the tenth activation layer is the output of the neural network block. The convolution kernels of the ninth and tenth convolution layers are both of size 3 × 3 with 256 kernels and zero-padding parameter 1; the activation function of the ninth and tenth activation layers is ReLU; and the ninth batch normalization layer, the tenth batch normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps.
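All five paired RGB map / depth map neural network blocks share the same layer sequence (convolution → batch normalization → ReLU → residual block → convolution → batch normalization → ReLU), differing only in kernel count. A hypothetical spec generator (not part of the patent) makes this pattern explicit:

```python
# Generate the shared layer sequence of the k-th RGB map / depth map neural network
# block, with the per-block kernel counts 32/64/128/256/256 stated above.

def block_spec(block_index):
    kernels = [32, 64, 128, 256, 256][block_index - 1]
    conv = ("conv", {"kernel": (3, 3), "filters": kernels, "zero_padding": 1})
    return [
        conv,
        ("batch_norm", {}),
        ("relu", {}),
        ("residual_block", {"filters": kernels}),
        conv,
        ("batch_norm", {}),
        ("relu", {}),
    ]

print([name for name, _ in block_spec(1)])
```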
In step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are all maximum pooling layers with pooling size 2 and stride 2.
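A quick arithmetic check (illustrative only) confirms that a maximum pooling layer with pooling size 2 and stride 2 halves each spatial dimension, using the standard output-size formula out = floor((in − pool) / stride) + 1:

```python
# Output size of a max pooling layer with the stated parameters.

def maxpool_out(size, pool=2, stride=2):
    return (size - pool) // stride + 1

print(maxpool_out(224), maxpool_out(112))
```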
In step 1_2, the 5 fusion neural network blocks have the same structure, consisting of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer, arranged in sequence: the input of the eleventh convolution layer is the input of the fusion neural network block in which it is located; each subsequent layer's input receives all feature maps output by the preceding layer; and the output of the twelfth activation layer is the output of the fusion neural network block. In all 5 fusion neural network blocks, the convolution kernels of the eleventh and twelfth convolution layers are of size 3 × 3 with zero-padding parameter 1, and the activation function of the eleventh and twelfth activation layers is ReLU. The number of convolution kernels — and hence the number of feature maps output by the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block — is 256 in the 1st and 2nd fusion neural network blocks, 128 in the 3rd, 64 in the 4th, and 32 in the 5th.
In step 1_2, the convolution kernels of the 1st and 2nd deconvolution layers are of size 2 × 2, with 256 kernels each, stride 2 and zero-padding parameter 0; the 3rd deconvolution layer has 2 × 2 kernels, 128 kernels, stride 2 and zero-padding parameter 0; and the 4th deconvolution layer has 2 × 2 kernels, 64 kernels, stride 2 and zero-padding parameter 0.
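It can likewise be checked (an illustrative sketch) that each 2 × 2, stride-2, zero-padding-0 deconvolution (transposed convolution) doubles the spatial size, using the standard formula out = (in − 1) × stride − 2 × padding + kernel:

```python
# Output size of a deconvolution layer with the stated parameters.

def deconv_out(size, kernel=2, stride=2, padding=0):
    return (size - 1) * stride - 2 * padding + kernel

print(deconv_out(14), deconv_out(28))
```

This is why the decoder recovers the full W × H resolution after the 4th deconvolution layer.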
In step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolution layer with kernel size 1 × 1, 2 convolution kernels, and zero-padding parameter 0.
Compared with the prior art, the invention has the advantages that:
1) the convolutional neural network constructed by the method realizes the detection of the salient object from end to end, is easy to train, and is convenient and quick; inputting the color real object images and the corresponding depth images in the training set into a convolutional neural network for training to obtain a convolutional neural network training model; and inputting the color real object image to be subjected to significance detection and the corresponding depth image into the convolutional neural network training model, and predicting to obtain a predicted significance detection image corresponding to the color real object image.
2) When utilizing the depth information, the method adopts a late-fusion strategy: in the decoding stage, the depth information and the color image information from the corresponding levels of the two coding structures are cascaded (concatenated). This avoids the noise information that early fusion would add in the coding stage, and at the same time allows the complementary information between the color image information and the depth information to be fully learned when training the convolutional neural network training model, so a better effect is obtained on both the training set and the test set.
3) The invention adopts multi-scale supervision (Multi-scale Supervision): prediction maps are output at different sizes and supervised by label maps of the corresponding sizes, so that the spatial detail information of an object can be refined during up-sampling through the deconvolution layers. This guides the convolutional neural network training model to construct the saliency detection prediction map progressively, and thus a better effect is obtained on both the training set and the test set.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The significance detection method based on the residual error network and the depth information fusion comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images and, for each original color real object image, its corresponding depth image and real saliency detection label image, to form a training set; the q-th original color real object image in the training set is denoted as {Iq(i,j)} and its corresponding depth image as {Dq(i,j)}; wherein Q is a positive integer, Q ≥ 200 (for example, Q = 367), q is a positive integer with an initial value of 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width and H the height of {Iq(i,j)}, {Dq(i,j)} and the real saliency detection label image, and W and H can both be divided exactly by 2 (for example, W = 512 and H = 512); {Iq(i,j)} is an RGB color image, and Iq(i,j) denotes the pixel value of the pixel point whose coordinate position is (i,j) in {Iq(i,j)}; {Dq(i,j)} is a single-channel depth image, and Dq(i,j) denotes the pixel value of the pixel point whose coordinate position is (i,j) in {Dq(i,j)}; the pixel value of the pixel point whose coordinate position is (i,j) in the real saliency detection label image is denoted likewise. In this embodiment, the original color real object images are selected directly from the 800 images of the training set of the NLPR database.
Step 1_2: constructing a convolutional neural network: as shown in fig. 1, the convolutional neural network includes an input layer, a hidden layer and an output layer; the input layer includes an RGB map input layer and a depth map input layer; the hidden layer includes 5 RGB map neural network blocks, 4 RGB map maximum pooling layers (Pool), 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers; the output layer includes 5 sub-output layers. The 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network.
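The overall topology described above — two parallel coding streams, cascade (concatenation) of their features, fusion, deconvolution, and a 2-channel sub-output — can be sketched in PyTorch. The sketch below is a minimal illustration only, reduced to a single encoder stage per stream; the class and layer names are hypothetical, and the widths and stage counts do not reproduce the patent's full 5-stage design.

```python
import torch
import torch.nn as nn

# Minimal sketch of the dual-stream encoder / fusion decoder topology.
# One encoder stage per stream for brevity; names are illustrative.
class TwoStreamSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # RGB coding structure: neural network block + max pooling
        self.rgb_block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.rgb_pool1 = nn.MaxPool2d(2, 2)
        self.rgb_block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # depth coding structure (single-channel input)
        self.dep_block1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.dep_pool1 = nn.MaxPool2d(2, 2)
        self.dep_block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # decoding layer: cascade (concatenate) -> fusion block -> deconvolution
        self.fuse = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        # sub-output layer: 1x1 convolution with 2 kernels
        self.sub_out = nn.Conv2d(32, 2, kernel_size=1)

    def forward(self, rgb, depth):
        r = self.rgb_block2(self.rgb_pool1(self.rgb_block1(rgb)))
        d = self.dep_block2(self.dep_pool1(self.dep_block1(depth)))
        con = torch.cat([r, d], dim=1)   # late fusion by concatenation
        rh = self.fuse(con)
        fj = self.deconv(rh)             # restores the full resolution
        return self.sub_out(fj)          # 2 feature maps, width W, height H

net = TwoStreamSketch()
out = net(torch.zeros(1, 3, 64, 64), torch.zeros(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The sub-output here has the same width and height as the input, matching the 5th sub-output layer; the deeper sub-outputs of the full design sit at coarser scales.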
For the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; among them, the width of the RGB color image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; the training depth image has a width W and a height H.
For the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1。
For the 1st RGB map maximum pooling layer, its input end receives all the feature maps in CP1, and its output end outputs 32 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as ZC1.
For the 2nd RGB map neural network block, its input end receives all the feature maps in ZC1, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as CP2.
For the 2nd RGB map maximum pooling layer, its input end receives all the feature maps in CP2, and its output end outputs 64 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as ZC2.
For the 3rd RGB map neural network block, its input end receives all the feature maps in ZC2, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as CP3.
For the 3rd RGB map maximum pooling layer, its input end receives all the feature maps in CP3, and its output end outputs 128 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as ZC3.
For the 4th RGB map neural network block, its input end receives all the feature maps in ZC3, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as CP4.
For the 4th RGB map maximum pooling layer, its input end receives all the feature maps in CP4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as ZC4.
For the 5th RGB map neural network block, its input end receives all the feature maps in ZC4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as CP5.
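The progression of the RGB coding structure — channel widths from the block list above, spatial size halved by each of the 4 pooling layers — can be tallied with a few lines of arithmetic (W = H = 512 taken from the example in step 1_1):

```python
# Trace the RGB coding structure of step 1_2: each of the 4 maximum
# pooling layers halves width and height; channel widths follow the
# 5 RGB map neural network blocks described above.
def rgb_encoder_trace(w, h):
    channels = [32, 64, 128, 256, 256]   # outputs of blocks 1..5
    sizes = []
    for i, c in enumerate(channels):
        sizes.append((c, w, h))          # CP1 .. CP5
        if i < 4:                        # only 4 pooling layers
            w, h = w // 2, h // 2
    return sizes

trace = rgb_encoder_trace(512, 512)
print(trace)
# [(32, 512, 512), (64, 256, 256), (128, 128, 128), (256, 64, 64), (256, 32, 32)]
```

The depth coding structure below follows exactly the same progression (DP1..DP5, DC1..DC4).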
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1。
For the 1st depth map maximum pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as DC1.
For the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as DP2.
For the 2nd depth map maximum pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as DC2.
For the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as DP3.
For the 3rd depth map maximum pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as DC3.
For the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as DP4.
For the 4th depth map maximum pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as DC4.
For the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as DP5.
For the 1st cascade (Concatenation) layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and superposes them, and its output end outputs 512 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as Con1.
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as RH1.
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as FJ1.
For the 2nd cascade layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and superposes them, and its output end outputs 768 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as Con2.
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as RH2.
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as FJ2.
For the 3rd cascade layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and superposes them, and its output end outputs 512 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as Con3.
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as RH3.
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as FJ3.
For the 4th cascade layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and superposes them, and its output end outputs 256 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as Con4.
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as RH4.
For the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps with width W and height H; the set formed by all the output feature maps is denoted as FJ4.
For the 5th cascade layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and superposes them, and its output end outputs 128 feature maps with width W and height H; the set formed by all the output feature maps is denoted as Con5.
For the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps with width W and height H; the set formed by all the output feature maps is denoted as RH5.
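The channel counts of the five cascade layers follow directly from the encoder and deconvolution widths given above; a quick bookkeeping check (all values taken from the layer descriptions):

```python
# Channel bookkeeping for the 5 cascade layers of the decoding stage:
# Con1 concatenates the two deepest encoder sets; Con2..Con5 each
# concatenate the previous deconvolution output FJ with the matching
# CP and DP encoder sets.
cp = dp = [32, 64, 128, 256, 256]    # encoder block widths (CP_k = DP_k)
fj = [None, 256, 256, 128, 64]       # deconvolution outputs FJ1..FJ4
con1 = cp[4] + dp[4]                 # Con1: CP5 + DP5
con2 = fj[1] + cp[3] + dp[3]         # Con2: FJ1 + CP4 + DP4
con3 = fj[2] + cp[2] + dp[2]         # Con3: FJ2 + CP3 + DP3
con4 = fj[3] + cp[1] + dp[1]         # Con4: FJ3 + CP2 + DP2
con5 = fj[4] + cp[0] + dp[0]         # Con5: FJ4 + CP1 + DP1
print(con1, con2, con3, con4, con5)  # 512 768 512 256 128
```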
For the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps with width W/16 and height H/16; the set formed by all the output feature maps is denoted as Out1, and the 2nd feature map in Out1 is a saliency detection prediction map.
For the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps with width W/8 and height H/8; the set formed by all the output feature maps is denoted as Out2, and the 2nd feature map in Out2 is a saliency detection prediction map.
For the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps with width W/4 and height H/4; the set formed by all the output feature maps is denoted as Out3, and the 2nd feature map in Out3 is a saliency detection prediction map.
For the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps with width W/2 and height H/2; the set formed by all the output feature maps is denoted as Out4, and the 2nd feature map in Out4 is a saliency detection prediction map.
For the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps with width W and height H; the set formed by all the output feature maps is denoted as Out5, and the 2nd feature map in Out5 is a saliency detection prediction map.
Step 1_3: take each original color real object image in the training set as an RGB color image for training and its corresponding depth image as a depth image for training, input them into the convolutional neural network for training, and obtain, for each original color real object image {Iq(i,j)} in the training set, the set formed by its 5 corresponding saliency detection prediction maps.
Step 1_4: scale the real saliency detection label image corresponding to each original color real object image in the training set to 5 different sizes, obtaining an image with width W/16 and height H/16, an image with width W/8 and height H/8, an image with width W/4 and height H/4, an image with width W/2 and height H/2, and an image with width W and height H; for each {Iq(i,j)}, the 5 images obtained by scaling its corresponding real saliency detection label image form a set.
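The patent excerpt does not name the resampling method used for this scaling. For a binary label map, nearest-neighbour downscaling (sketched below on a toy 16 × 16 map; the helper name is ours) has the convenient property of keeping every pixel value in {0, 1}:

```python
# Step 1_4 scales each label image to 5 sizes (W/16 ... W). Nearest-
# neighbour downscaling is one plausible choice for binary label maps;
# the interpolation method is an assumption, not stated in the patent.
def nearest_downscale(label, factor):
    h, w = len(label), len(label[0])
    return [[label[r * factor][c * factor] for c in range(w // factor)]
            for r in range(h // factor)]

label = [[1 if 4 <= r < 12 and 4 <= c < 12 else 0 for c in range(16)]
         for r in range(16)]                      # toy ground-truth map
pyramid = [nearest_downscale(label, f) for f in (16, 8, 4, 2, 1)]
print([len(m) for m in pyramid])                  # [1, 2, 4, 8, 16]
```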
Step 1_5: calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real object image in the training set and the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to that original color real object image; the loss function value is obtained using categorical cross entropy (Categorical Cross Entropy).
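For a single pixel of a 2-channel sub-output, categorical cross entropy is the negative log of the softmax probability assigned to the true class. A minimal worked example (pure Python, class 1 taken as "salient"):

```python
import math

# Categorical cross entropy for one pixel of a 2-channel sub-output:
# softmax over the 2 feature-map values (logits), then -log of the
# probability assigned to the true class.
def pixel_cross_entropy(logits, true_class):
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[true_class])

loss_confident = pixel_cross_entropy([0.0, 4.0], 1)  # confidently salient, correct
loss_wrong = pixel_cross_entropy([4.0, 0.0], 1)      # confidently non-salient, wrong
print(loss_confident < loss_wrong)  # True
```

The per-image loss of step 1_5 is the average of such pixel terms over all pixels of all 5 scales.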
Step 1_6: repeatedly execute step 1_3 to step 1_5 V times to obtain a convolutional neural network training model and Q × V loss function values; then find the loss function value with the minimum value among the Q × V loss function values; then take the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network training model, correspondingly denoted as W_best and b_best; where V > 1, and in this embodiment V = 300.
The test stage process comprises the following specific steps:
Step 2_1: let a color real object image to be subjected to saliency detection be given, and denote its corresponding depth image accordingly; wherein 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width and H' the height of the color real object image to be detected and of its corresponding depth image, and the pixel values of the pixel points whose coordinate position is (i', j') in the two images are defined in the same way as in the training stage.
Step 2_2: input the R channel component, the G channel component and the B channel component of the color real object image to be detected, together with its corresponding depth image, into the convolutional neural network training model, and predict using W_best and b_best to obtain the 5 predicted saliency detection images of different sizes corresponding to the color real object image; the predicted saliency detection image whose size is the same as that of the color real object image is taken as its final predicted saliency detection image, in which the pixel value of the pixel point whose coordinate position is (i', j') is defined likewise.
In this embodiment, in step 1_2, the 1 st RGB map neural network Block and the 1 st depth map neural network Block have the same structure, and are composed of a first Convolution layer (convention, Conv), a first normalization layer (Batch normalization, BN), a first active layer (Activation, Act), a first Residual Block (Residual Block, RB), a second Convolution layer, a second normalization layer, and a second active layer, which are sequentially arranged, an input end of the first Convolution layer is an input end of the neural network Block, an input end of the first normalization layer receives all the feature maps output by an output end of the first Convolution layer, an input end of the first active layer receives all the feature maps output by an output end of the first normalization layer, an input end of the first Residual Block receives all the feature maps output by an output end of the first active layer, an input end of the second Convolution layer receives all the feature maps output by an output end of the first Residual Block, the input end of the second batch of normalization layers receives all the characteristic graphs output by the output end of the second convolution layer, the input end of the second activation layer receives all the characteristic graphs output by the output end of the second batch of normalization layers, and the output end of the second activation layer is the output end of the neural network block where the second activation layer is located; the convolution kernel sizes (kernel _ size) of the first convolution layer and the second convolution layer are 3 x 3, the convolution kernel numbers (filters) are 32, the zero padding parameter (padding) is 1, the activation modes of the first activation layer and the second activation layer are 'Relu', and output ends of the first normalization layer, the second normalization layer, the first activation layer, the second activation layer and the first residual error block output 
32 characteristic diagrams respectively.
In this embodiment, the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third active layer, a second residual block, a fourth convolution layer, a fourth normalization layer, and a fourth active layer, which are sequentially arranged, where an input end of the third convolution layer is an input end of the neural network block where the third convolution layer is located, an input end of the third normalization layer receives all feature maps output by an output end of the third convolution layer, an input end of the third active layer receives all feature maps output by an output end of the third normalization layer, an input end of the second residual block receives all feature maps output by an output end of the third active layer, an input end of the fourth convolution layer receives all feature maps output by an output end of the second residual block, and an input end of the fourth normalization layer receives all feature maps output by an output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block.
In this specific embodiment, the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure, and are composed of a fifth convolution layer, a fifth normalization layer, a fifth active layer, a third residual block, a sixth convolution layer, a sixth normalization layer, and a sixth active layer, which are sequentially arranged, where an input end of the fifth convolution layer is an input end of the neural network block where the fifth convolution layer is located, an input end of the fifth normalization layer receives all feature maps output by an output end of the fifth convolution layer, an input end of the fifth active layer receives all feature maps output by an output end of the fifth normalization layer, an input end of the third residual block receives all feature maps output by an output end of the fifth active layer, an input end of the sixth convolution layer receives all feature maps output by an output end of the third residual block, and an input end of the sixth normalization layer receives all feature maps output by an output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and 128 feature graphs are output from output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block respectively.
In this specific embodiment, the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh active layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer, and an eighth active layer, which are sequentially arranged, an input end of the seventh convolution layer is an input end of the neural network block where the seventh convolution layer is located, an input end of the seventh normalization layer receives all feature maps output by an output end of the seventh convolution layer, an input end of the seventh active layer receives all feature maps output by an output end of the seventh normalization layer, an input end of the fourth residual block receives all feature maps output by an output end of the seventh active layer, an input end of the eighth convolution layer receives all feature maps output by an output end of the fourth residual block, an input end of the eighth normalization layer receives all feature maps output by an output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block.
In this embodiment, the 5 th RGB map neural network block and the 5 th depth map neural network block have the same structure, and are composed of a ninth convolutional layer, a ninth block of normalization layers, a ninth active layer, a fifth residual block, a tenth convolutional layer, a tenth block of normalization layers, and a tenth active layer, which are sequentially arranged, an input end of the ninth convolutional layer is an input end of the neural network block where the ninth convolutional layer is located, an input end of the ninth block of normalization layers receives all feature maps output by an output end of the ninth convolutional layer, an input end of the ninth active layer receives all feature maps output by an output end of the ninth block of normalization layers, an input end of the fifth residual block receives all feature maps output by an output end of the ninth active layer, an input end of the tenth block of normalization layers receives all feature maps output by an output end of the tenth block of normalization layers, the input end of the tenth active layer receives all characteristic graphs output by the output end of the tenth normalization layer, and the output end of the tenth active layer is the output end of the neural network block where the tenth active layer is located; the sizes of convolution kernels of the ninth convolution layer and the tenth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and 256 feature maps are output from output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block respectively.
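The repeated block pattern above (convolution → batch normalization → "Relu" activation → residual block → convolution → batch normalization → "Relu" activation) can be sketched in PyTorch. The internals of the residual block are not detailed in this excerpt; a common form — two 3 × 3 convolutions plus an identity skip — is assumed below, so treat this as an illustration rather than the patent's exact residual block.

```python
import torch
import torch.nn as nn

# Assumed residual block: two 3x3 convs with an identity skip connection.
class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # identity skip connection

# One encoder neural network block as described above:
# Conv -> BN -> ReLU -> residual block -> Conv -> BN -> ReLU.
def encoder_block(in_ch, out_ch):
    # kernel 3x3 with zero-padding 1, matching the parameters above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        ResidualBlock(out_ch),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU())

block1 = encoder_block(3, 32)   # e.g. the 1st RGB map neural network block
y = block1(torch.zeros(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])
```

With padding 1 and stride 1 throughout, the block changes only the channel count, never the spatial size — the spatial reduction is left entirely to the pooling layers.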
In this embodiment, in step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are both maximum pooling layers, and the pooling sizes (pool _ size) and the step sizes (stride) of the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are both 2 and 2, respectively.
In this embodiment, in step 1_2, the 5 fusion neural network blocks have the same structure and are composed of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer which are sequentially arranged; the input end of the eleventh convolution layer is the input end of the fusion neural network block where it is located, the input end of the eleventh batch normalization layer receives all the feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all the feature maps output by the output end of the eleventh batch normalization layer, the input end of the sixth residual block receives all the feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all the feature maps output by the output end of the sixth residual block, the input end of the twelfth batch normalization layer receives all the feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all the feature maps output by the output end of the twelfth batch normalization layer, and the output end of the twelfth activation layer is the output end of the fusion neural network block where it is located. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 1st and 2nd fusion neural network blocks are both 3 × 3, the numbers of convolution kernels are both 256, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 1st and 2nd fusion neural network blocks are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 1st and 2nd fusion neural network blocks each output 256 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 3rd fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 128, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 3rd fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 3rd fusion neural network block each output 128 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 4th fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 64, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 4th fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 4th fusion neural network block each output 64 feature maps. The convolution kernel sizes of the eleventh and twelfth convolution layers in the 5th fusion neural network block are both 3 × 3, the numbers of convolution kernels are both 32, and the zero-padding parameters are both 1; the activation modes of the eleventh and twelfth activation layers in the 5th fusion neural network block are both "Relu"; the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in the 5th fusion neural network block each output 32 feature maps.
In this embodiment, in step 1_2, the convolution kernel sizes of the 1st and 2nd deconvolution layers are both 2 × 2, the numbers of convolution kernels are both 256, the step sizes are both 2, and the zero-padding parameters are both 0; the convolution kernel size of the 3rd deconvolution layer is 2 × 2, the number of convolution kernels is 128, the step size is 2, and the zero-padding parameter is 0; the convolution kernel size of the 4th deconvolution layer is 2 × 2, the number of convolution kernels is 64, the step size is 2, and the zero-padding parameter is 0.
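The deconvolution parameters above imply that each deconvolution layer doubles the spatial size of its feature maps. A short sketch using the standard transposed-convolution output-size formula (with no output padding and dilation 1) makes this explicit:

```python
def deconv_output_size(size_in: int, kernel: int = 2, stride: int = 2,
                       padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (deconvolution):
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size_in - 1) * stride - 2 * padding + kernel

# With the parameters given in the text (2x2 kernel, step size 2, zero padding 0),
# every deconvolution layer doubles the feature-map width and height:
for size_in in (8, 16, 32):
    assert deconv_output_size(size_in) == 2 * size_in
```

This is why the decoder needs exactly 4 deconvolution layers to undo the 4 max-pooling layers of each encoder branch.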
In this embodiment, in step 1_2, the 5 sub-output layers have the same structure, each consisting of a thirteenth convolutional layer; the convolution kernel size of the thirteenth convolutional layer is 1 × 1, the number of convolution kernels is 2, and the zero-padding parameter is 0.
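A 1 × 1 convolution with 2 kernels maps each pixel's channel vector independently to 2 values (here, scores for the salient and non-salient classes) without changing the spatial size. A minimal numpy sketch of this operation (function name and dimensions are illustrative, not from the patent):

```python
import numpy as np

def conv1x1(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,).
    Every pixel's C_in-dimensional channel vector is mapped by the same linear map."""
    return np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

# Example: 32 input channels (as output by the 5th fusion block), 2 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8, 8))
w = rng.standard_normal((2, 32))
out = conv1x1(x, w, np.zeros(2))   # shape (2, 8, 8), spatial size unchanged
```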
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture proposed by the method of the invention is constructed using the Python-based deep learning library PyTorch 0.4.1. The NLPR real object image test set (containing 200 real object images) is used to analyze the saliency detection performance of the method of the invention on color real object images. Here, 3 objective parameters commonly used to evaluate saliency detection methods are adopted as evaluation indexes: the Precision-Recall Curve (PR Curve), the Mean Absolute Error (MAE), and the F-Measure.
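The two scalar indexes above can be computed as follows. This is a generic sketch, not the patent's evaluation code: β² = 0.3 is the value customarily used in saliency detection, and the fixed 0.5 threshold is an assumption (the PR Curve is obtained by sweeping the threshold instead).

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Absolute Error between a predicted saliency map and the ground
    truth, both assumed to be normalized to [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray,
              threshold: float = 0.5, beta2: float = 0.3) -> float:
    """F-Measure at one binarization threshold; beta^2 = 0.3 is the value
    customarily used in saliency detection."""
    pred_bin = pred >= threshold
    gt_bin = gt >= 0.5
    tp = np.sum(pred_bin & gt_bin)
    precision = tp / max(np.sum(pred_bin), 1)
    recall = tp / max(np.sum(gt_bin), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))
```

Lower MAE and higher F-Measure indicate better agreement between the predicted saliency detection image and the real saliency detection label image.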
The method of the invention is used to predict each color real object image in the NLPR real object image test set, obtaining a predicted saliency detection image corresponding to each color real object image. The Precision-Recall Curve (PR Curve) reflecting the saliency detection effect of the method of the present invention is shown in fig. 2a; the Mean Absolute Error (MAE) reflecting the saliency detection effect of the method of the present invention is shown in fig. 2b, with a value of 0.058; and the F-Measure reflecting the saliency detection effect of the method of the present invention is shown in fig. 2c, with a value of 0.796. As can be seen from fig. 2a to 2c, the saliency detection results obtained by the method of the present invention on color real object images are good, which indicates that obtaining predicted saliency detection images corresponding to color real object images by the method of the present invention is feasible and effective.
FIG. 3a shows the 1st original color real object image of the same scene, FIG. 3b shows the depth image corresponding to FIG. 3a, and FIG. 3c shows the predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention; FIG. 4a shows the 2nd original color real object image of the same scene, FIG. 4b shows the depth image corresponding to FIG. 4a, and FIG. 4c shows the predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention; FIG. 5a shows the 3rd original color real object image of the same scene, FIG. 5b shows the depth image corresponding to FIG. 5a, and FIG. 5c shows the predicted saliency detection image obtained by predicting FIG. 5a using the method of the present invention; FIG. 6a shows the 4th original color real object image of the same scene, FIG. 6b shows the depth image corresponding to FIG. 6a, and FIG. 6c shows the predicted saliency detection image obtained by predicting FIG. 6a using the method of the present invention. Comparing FIG. 3a with FIG. 3c, FIG. 4a with FIG. 4c, FIG. 5a with FIG. 5c, and FIG. 6a with FIG. 6c, it can be seen that the predicted saliency detection images obtained by the method of the present invention have high detection accuracy.