CN110263813B - Significance detection method based on residual error network and depth information fusion - Google Patents

Publication number: CN110263813B
Authority: CN (China)
Prior art keywords: layer, output, feature maps, neural network, receives
Legal status: Active
Application number: CN201910444775.0A
Other languages: Chinese (zh)
Other versions: CN110263813A
Inventors: 周武杰, 吴君委, 雷景生, 何成, 钱亚冠, 王海江, 张伟
Current Assignee: Huahao Technology Xi'an Co ltd
Original Assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910444775.0A
Publication of CN110263813A (application published)
Publication of CN110263813B (application granted)

Classifications

    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention discloses a saliency detection method based on residual network and depth information fusion. In the training stage, a convolutional neural network is constructed whose input layer comprises an RGB map input layer and a depth map input layer, whose hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers, and whose output layer comprises 5 sub-output layers. The color real object images and the depth images in the training set are input into the convolutional neural network for training to obtain saliency detection prediction maps, and a convolutional neural network training model is obtained by calculating the loss function values between the saliency detection prediction maps and the real saliency detection label images. In the testing stage, the convolutional neural network training model is used to predict a color real object image to be saliency detected, yielding a predicted saliency detection image. The advantage is a high saliency detection accuracy.

Description

Significance detection method based on residual error network and depth information fusion
Technical Field
The invention relates to a visual saliency detection technology, in particular to a saliency detection method based on residual error network and depth information fusion.
Background
Visual saliency helps people quickly filter out unimportant information so that they can concentrate on meaningful regions and better understand the scene in front of them. With the rapid development of the computer vision field, it is hoped that computers can acquire the same capability as humans, that is, when understanding and analyzing a complex scene, a computer can process the useful information in a more targeted way, which reduces the complexity of the algorithm and suppresses noise interference. In conventional methods, researchers model salient object detection algorithms according to various kinds of observed prior knowledge to generate a saliency map. Such prior knowledge includes contrast, center prior, edge prior, semantic prior, and so on. However, in complex scenes the conventional approaches tend to be inaccurate, because these observations are usually limited to low-level features (e.g., color and contrast) and therefore cannot accurately capture the properties that salient objects have in common.
In recent years, convolutional neural networks have been widely used in various fields of computer vision, and great progress has been made on many difficult vision problems. Different from conventional methods, a deep convolutional neural network can be modeled from a large number of training samples and automatically learn more essential features end-to-end, which effectively avoids the drawbacks of traditional manual modeling and feature design. Recently, the effective application of 3D sensors has enriched the available databases: not only color pictures but also the depth information of the color pictures can be obtained. Depth information plays an important role in the human visual system in real 3D scenes, and it is exactly the information that has been completely ignored in conventional practice, so the most important task at present is how to build a model that effectively utilizes the depth information.
Deep-learning saliency detection methods on RGB-D databases perform pixel-level end-to-end saliency detection directly: the images in the training set are input into a model framework for training to obtain the weights and the model, which can then make predictions on the test set. At present, deep-learning saliency detection models based on RGB-D databases mainly use an encoding-decoding architecture, and there are three main ways of utilizing the depth information. The first is to directly stack the depth information and the color image information into a four-channel input, or to add or concatenate the color image information and the depth information during encoding; this is called pre-fusion. The second is to add or concatenate the mutually corresponding color image information and depth information of the encoding process into the corresponding decoding process through skip connections; this is called post-fusion. The third is to predict saliency from the color image information and from the depth information separately and then fuse the final results. In the first way, since the distributions of the color image information and the depth information differ greatly, directly adding the depth information during encoding introduces a certain amount of noise. The third way performs saliency prediction with the depth information and the color image information separately, but if either prediction is inaccurate, the final fused result is also relatively inaccurate. The second way not only avoids the noise brought by directly using the depth information in the encoding stage, but also allows the complementary relationship between the color image information and the depth information to be fully learned while the network model is continuously optimized. Among previous post-fusion schemes, the RGB-D Saliency Detection by Multi-stream Late Fusion Network model (hereinafter referred to as MLF) performs feature extraction and down-sampling on the color image information and the depth information separately, fuses them by multiplying corresponding position elements at the highest level, and outputs a small-sized saliency prediction map from the fused result. MLF only contains down-sampling operations, so the spatial detail information of objects is blurred by the successive down-sampling, and because MLF outputs its saliency prediction at the smallest size, much information about the salient object is lost after the prediction is enlarged back to the original size.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a saliency detection method based on residual network and depth information fusion that improves the saliency detection accuracy by efficiently utilizing the depth information and the color image information.
The technical scheme adopted by the invention for solving the technical problems is as follows: a significance detection method based on residual error network and depth information fusion is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images, together with the depth image and the real saliency detection label image corresponding to each original color real object image, to form a training set; denote the q-th original color real object image in the training set as {Iq(i,j)}, its corresponding depth image as {Dq(i,j)}, and its corresponding real saliency detection label image as {Gq(i,j)}; wherein Q is a positive integer with Q ≥ 200, q is a positive integer with initial value 1 and 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, H denotes the height of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, W and H are divisible by 2, {Iq(i,j)} is an RGB color image, Iq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Iq(i,j)}, {Dq(i,j)} is a single-channel depth image, Dq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Dq(i,j)}, and Gq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Gq(i,j)};
Step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers, and the output layer comprises 5 sub-output layers; the 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1
For the 1 st RGB map max pooling layer, its input receives CP1The output end of all the characteristic maps outputs 32 characteristic maps with the width of
Figure BDA0002073263270000041
And has a height of
Figure BDA0002073263270000042
The feature map of (1), a set of all feature maps outputted is denoted as ZC1
For the 2 nd RGB map neural network block, its input receives ZC1The output end of all the characteristic graphs in (1) outputs 64 width
Figure BDA0002073263270000043
And has a height of
Figure BDA0002073263270000044
The feature map of (1) is a set of all feature maps outputted, and is denoted as CP2
For the 2 nd RGB map max pooling layer, its input receives CP2The output end of all the characteristic graphs in (1) outputs 64 width
Figure BDA0002073263270000045
And has a height of
Figure BDA0002073263270000046
The feature map of (1), a set of all feature maps outputted is denoted as ZC2
For the 3 rd RGB map neural network block, its input receives ZC2The output end of all the characteristic maps outputs 128 width
Figure BDA0002073263270000051
And has a height of
Figure BDA0002073263270000052
The feature map of (1) is a set of all feature maps outputted, and is denoted as CP3
For the 3 rd RGB map max pooling layer, its input receives CP3The output end of all the characteristic maps outputs 128 width
Figure BDA0002073263270000053
And has a height of
Figure BDA0002073263270000054
The feature map of (1), a set of all feature maps outputted is denoted as ZC3
For the 4 th RGB map neural network block, its input receives ZC3All the characteristic maps in (1) have 256 output widths of
Figure BDA0002073263270000055
And has a height of
Figure BDA0002073263270000056
The feature map of (1) is a set of all feature maps outputted, and is denoted as CP4
For the 4 th RGB map max pooling layer, its input receives CP4All the characteristic maps in (1) have 256 output widths of
Figure BDA0002073263270000057
And has a height of
Figure BDA0002073263270000058
The feature map of (1), a set of all feature maps outputted is denoted as ZC4
For the 5 th RGB map neural network block, its input receives ZC4All the characteristic maps in (1) have 256 output widths of
Figure BDA0002073263270000059
And has a height of
Figure BDA00020732632700000510
The feature map of (1) is a set of all feature maps outputted, and is denoted as CP5
For the 1st depth map neural network block, its input receives the depth image for training output by the output of the depth map input layer, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted DP1.
For the 1st depth map maximum pooling layer, its input receives all feature maps in DP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DC1.
For the 2nd depth map neural network block, its input receives all feature maps in DC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DP2.
For the 2nd depth map maximum pooling layer, its input receives all feature maps in DP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DC2.
For the 3rd depth map neural network block, its input receives all feature maps in DC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DP3.
For the 3rd depth map maximum pooling layer, its input receives all feature maps in DP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DC3.
For the 4th depth map neural network block, its input receives all feature maps in DC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DP4.
For the 4th depth map maximum pooling layer, its input receives all feature maps in DP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DC4.
For the 5th depth map neural network block, its input receives all feature maps in DC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DP5.
For the 1st cascade layer, its input receives all feature maps in CP5 and all feature maps in DP5; it concatenates all feature maps in CP5 and all feature maps in DP5, and its output outputs 512 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted Con1.
For the 1st fusion neural network block, its input receives all feature maps in Con1, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted RH1.
For the 1st deconvolution layer, its input receives all feature maps in RH1, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted FJ1.
For the 2nd cascade layer, its input receives all feature maps in FJ1, all feature maps in CP4 and all feature maps in DP4; it concatenates all feature maps in FJ1, CP4 and DP4, and its output outputs 768 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted Con2.
For the 2nd fusion neural network block, its input receives all feature maps in Con2, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted RH2.
For the 2nd deconvolution layer, its input receives all feature maps in RH2, and its output outputs 256 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted FJ2.
For the 3rd cascade layer, its input receives all feature maps in FJ2, all feature maps in CP3 and all feature maps in DP3; it concatenates all feature maps in FJ2, CP3 and DP3, and its output outputs 512 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted Con3.
For the 3rd fusion neural network block, its input receives all feature maps in Con3, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted RH3.
For the 3rd deconvolution layer, its input receives all feature maps in RH3, and its output outputs 128 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted FJ3.
For the 4th cascade layer, its input receives all feature maps in FJ3, all feature maps in CP2 and all feature maps in DP2; it concatenates all feature maps in FJ3, CP2 and DP2, and its output outputs 256 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted Con4.
For the 4th fusion neural network block, its input receives all feature maps in Con4, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted RH4.
For the 4th deconvolution layer, its input receives all feature maps in RH4, and its output outputs 64 feature maps of width W and height H; the set of all output feature maps is denoted FJ4.
For the 5th cascade layer, its input receives all feature maps in FJ4, all feature maps in CP1 and all feature maps in DP1; it concatenates all feature maps in FJ4, CP1 and DP1, and its output outputs 128 feature maps of width W and height H; the set of all output feature maps is denoted Con5.
For the 5th fusion neural network block, its input receives all feature maps in Con5, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted RH5.
For the 1st sub-output layer, its input receives all feature maps in RH1, and its output outputs 2 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted Out1, and one of the feature maps in Out1 is a saliency detection prediction map.
For the 2nd sub-output layer, its input receives all feature maps in RH2, and its output outputs 2 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted Out2, and one of the feature maps in Out2 is a saliency detection prediction map.
For the 3rd sub-output layer, its input receives all feature maps in RH3, and its output outputs 2 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted Out3, and one of the feature maps in Out3 is a saliency detection prediction map.
For the 4th sub-output layer, its input receives all feature maps in RH4, and its output outputs 2 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted Out4, and one of the feature maps in Out4 is a saliency detection prediction map.
For the 5th sub-output layer, its input receives all feature maps in RH5, and its output outputs 2 feature maps of width W and height H; the set of all output feature maps is denoted Out5, and one of the feature maps in Out5 is a saliency detection prediction map.
Step 1_3: take each original color real object image in the training set as the RGB color image for training and the depth image corresponding to each original color real object image in the training set as the depth image for training, input them into the convolutional neural network for training, and obtain 5 saliency detection prediction maps corresponding to each original color real object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {Iq(i,j)} is denoted Sq.
Step 1_ 4: scaling the real significance detection label image corresponding to each original color real object image in the training set by 5 different sizes to obtain the width of
Figure BDA0002073263270000096
And has a height of
Figure BDA0002073263270000097
An image of width of
Figure BDA0002073263270000098
And has a height of
Figure BDA0002073263270000099
An image of width of
Figure BDA00020732632700000910
And has a height of
Figure BDA00020732632700000911
An image of width of
Figure BDA00020732632700000912
And has a height of
Figure BDA00020732632700000913
An image of width W and height H will be { I }q(i, j) } the set formed by 5 images obtained by zooming the corresponding real significance detection image is recorded as
Figure BDA00020732632700000914
Step 1_ 5: calculation trainingLoss function values between a set formed by 5 saliency detection prediction maps corresponding to each original color real object image in the training set and a set formed by 5 images obtained by scaling the real saliency detection images corresponding to the original color real object images are obtained
Figure BDA0002073263270000101
And
Figure BDA0002073263270000102
the value of the loss function in between is recorded as
Figure BDA0002073263270000103
Obtaining by adopting a classified cross entropy;
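A minimal sketch of steps 1_4 and 1_5, assuming the network returns the five side outputs from small to large (as in the SaliencyNet sketch above), that the per-scale losses are summed, and that the labels are rescaled with nearest-neighbour interpolation to keep them binary; the function name multi_scale_loss is illustrative only:

```python
import torch.nn.functional as F


def multi_scale_loss(outs, label):
    # outs:  list of five 2-channel prediction maps, from W/16 x H/16 up to W x H
    # label: real saliency detection label image of shape (N, H, W), values in {0, 1}
    loss = 0.0
    for out in outs:
        # Step 1_4: scale the label image to the size of this side output.
        scaled = F.interpolate(label.unsqueeze(1).float(),
                               size=out.shape[2:], mode='nearest')
        # Step 1_5: categorical cross entropy between prediction and scaled label.
        loss = loss + F.cross_entropy(out, scaled.squeeze(1).long())
    return loss
```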
Step 1_6: repeat step 1_3 to step 1_5 V times to obtain a convolutional neural network training model, and obtain Q×V loss function values; then find the smallest loss function value among the Q×V loss function values; and then take the weight vector and bias term corresponding to that smallest loss function value as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1.
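Step 1_6 can be sketched as an ordinary training loop that tracks the parameters giving the smallest loss value; the optimizer and learning rate below are assumptions, since the patent does not specify them, and the sketch reuses multi_scale_loss from above:

```python
import copy
import torch


def train(net, loader, epochs_V, lr=1e-4):
    # loader yields (rgb, depth, label) triples from the training set.
    opt = torch.optim.Adam(net.parameters(), lr=lr)   # optimizer choice is assumed
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs_V):                     # repeat steps 1_3 to 1_5 V times
        for rgb, depth, label in loader:
            outs = net(rgb, depth)
            loss = multi_scale_loss(outs, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < best_loss:               # keep W_best and b_best
                best_loss = loss.item()
                best_state = copy.deepcopy(net.state_dict())
    return best_state
```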
the test stage process comprises the following specific steps:
Step 2_1: let {I'(i',j')} denote the color real object image to be saliency detected, and denote the depth image corresponding to {I'(i',j')} as {D'(i',j')}; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I'(i',j')} and {D'(i',j')}, H' denotes the height of {I'(i',j')} and {D'(i',j')}, I'(i',j') denotes the pixel value of the pixel with coordinate position (i',j') in {I'(i',j')}, and D'(i',j') denotes the pixel value of the pixel with coordinate position (i',j') in {D'(i',j')}.
Step 2_2: input the R channel component, G channel component and B channel component of {I'(i',j')} together with {D'(i',j')} into the convolutional neural network training model, and use Wbest and bbest for prediction to obtain 5 predicted saliency detection images of different sizes corresponding to {I'(i',j')}; the predicted saliency detection image whose size is the same as that of {I'(i',j')} is taken as the final predicted saliency detection image corresponding to {I'(i',j')} and is denoted {S'(i',j')}; wherein S'(i',j') denotes the pixel value of the pixel with coordinate position (i',j') in {S'(i',j')}.
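Steps 2_1 and 2_2 then amount to a single forward pass with the best weights, keeping only the side output whose size matches the test image. A sketch under the same assumptions as above, reusing the SaliencyNet sketch and the best_state returned by the training sketch:

```python
import torch


@torch.no_grad()
def predict(net, best_state, rgb_test, depth_test):
    # rgb_test:   (1, 3, H', W') color real object image to be saliency detected
    # depth_test: (1, 1, H', W') corresponding depth image
    net.load_state_dict(best_state)    # use W_best and b_best
    net.eval()
    outs = net(rgb_test, depth_test)
    # The last side output has the same width W' and height H' as the input and
    # is taken as the final predicted saliency detection image.
    return outs[-1].argmax(dim=1)      # (1, H', W') binary saliency map
```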
In step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure and consist of a first convolution layer, a first batch normalization layer, a first activation layer, a first residual block, a second convolution layer, a second batch normalization layer and a second activation layer arranged in sequence, wherein the input end of the first convolution layer is the input end of the neural network block where it is located, the input end of the first batch normalization layer receives all feature maps output by the output end of the first convolution layer, the input end of the first activation layer receives all feature maps output by the output end of the first batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the first activation layer, the input end of the second convolution layer receives all feature maps output by the output end of the first residual block, the input end of the second batch normalization layer receives all feature maps output by the output end of the second convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, and the output end of the second activation layer is the output end of the neural network block where it is located; the convolution kernel sizes of the first convolution layer and the second convolution layer are both 3×3, the numbers of convolution kernels are both 32, the zero padding parameters are both 1, the activation modes of the first activation layer and the second activation layer are both "Relu", and the output ends of the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps;
the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth normalization layer and a fourth activation layer which are sequentially arranged, wherein the input end of the third convolution layer is the input end of the neural network block where the third convolution layer is located, the input end of the third normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the third normalization layer, the input end of the second residual block receives all feature maps output by the output end of the third activation layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the second residual block, and the input end of the fourth normalization layer receives all feature maps output by the output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block;
the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure and are composed of a fifth convolution layer, a fifth normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth normalization layer and a sixth activation layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the fifth normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the fifth normalization layer, the input end of the third residual block receives all feature maps output by the output end of the fifth activation layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the third residual block, and the input end of the sixth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and the output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block output 128 feature graphs;
the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer and an eighth activation layer which are sequentially arranged, wherein the input end of the seventh convolution layer is the input end of the neural network block where the seventh convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the seventh convolution layer, the input end of the seventh activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the fourth residual block receives all feature maps output by the output end of the seventh activation layer, the input end of the eighth convolution layer receives all feature maps output by the output end of the fourth residual block, and the input end of the eighth normalization layer receives all feature maps output by the output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block;
the 5 th RGB map neural network block and the 5 th depth map neural network block have the same structure and are composed of a ninth convolution layer, a ninth normalization layer, a ninth active layer, a fifth residual block, a tenth convolution layer, a tenth normalization layer and a tenth active layer which are sequentially arranged, wherein the input end of the ninth convolution layer is the input end of the neural network block where the ninth convolution layer is located, the input end of the ninth normalization layer receives all feature maps output by the output end of the ninth convolution layer, the input end of the ninth active layer receives all feature maps output by the output end of the ninth normalization layer, the input end of the fifth residual block receives all feature maps output by the output end of the ninth active layer, the input end of the tenth convolution layer receives all feature maps output by the output end of the fifth residual block, and the input end of the tenth normalization layer receives all feature maps output by the output end of the tenth convolution layer, the input end of the tenth active layer receives all characteristic graphs output by the output end of the tenth normalization layer, and the output end of the tenth active layer is the output end of the neural network block where the tenth active layer is located; the sizes of convolution kernels of the ninth convolution layer and the tenth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and 256 feature maps are output from output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block respectively.
In the step 1_2, the 4 RGB image maximum pooling layers and the 4 depth image maximum pooling layers are maximum pooling layers, the pooling sizes of the 4 RGB image maximum pooling layers and the 4 depth image maximum pooling layers are both 2, and the step sizes are both 2.
In step 1_2, the 5 fusion neural network blocks have the same structure and consist of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer arranged in sequence, wherein the input end of the eleventh convolution layer is the input end of the fusion neural network block where it is located, the input end of the eleventh batch normalization layer receives all feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all feature maps output by the output end of the eleventh batch normalization layer, the input end of the sixth residual block receives all feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all feature maps output by the output end of the sixth residual block, the input end of the twelfth batch normalization layer receives all feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all feature maps output by the output end of the twelfth batch normalization layer, and the output end of the twelfth activation layer is the output end of the fusion neural network block where it is located. In each fusion neural network block, the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3×3, the zero padding parameters are both 1, and the activation modes of the eleventh activation layer and the twelfth activation layer are both "Relu"; the numbers of convolution kernels are 256 in the 1st and 2nd fusion neural network blocks, 128 in the 3rd, 64 in the 4th and 32 in the 5th, and the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block in each fusion neural network block output the corresponding number of feature maps (256, 256, 128, 64 and 32, respectively).
In step 1_2, the sizes of convolution kernels of the 1 st deconvolution layer and the 2 nd deconvolution layer are both 2 × 2, the numbers of convolution kernels are both 256, the step lengths are both 2, and the zero padding parameter is 0, the sizes of convolution kernels of the 3 rd deconvolution layer are 2 × 2, the numbers of convolution kernels are 128, the step lengths are 2, and the zero padding parameter is 0, and the sizes of convolution kernels of the 4 th deconvolution layer are 2 × 2, the numbers of convolution kernels are 64, the step lengths are 2, and the zero padding parameter is 0.
In the step 1_2, the 5 sub-output layers have the same structure and consist of a thirteenth convolution layer; wherein, the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero padding parameter is 0.
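Under the same assumptions as the block sketch above, the maximum pooling layers, deconvolution layers and sub-output layers described here map directly onto standard PyTorch layers; the variable names are illustrative only:

```python
import torch.nn as nn

# Maximum pooling layers of the two encoders: pooling size 2, stride 2,
# which halves the width and height of the feature maps.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Deconvolution layers: kernel 2x2, stride 2, zero padding 0, which doubles the
# width and height; 256, 256, 128 and 64 kernels respectively (input channels
# follow the outputs RH1..RH4 of the fusion neural network blocks).
deconv1 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0)
deconv2 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0)
deconv3 = nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2, padding=0)
deconv4 = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2, padding=0)

# Sub-output layers: a single 1x1 convolution with 2 output channels and
# zero padding 0; input channels follow RH1..RH5 (256, 256, 128, 64, 32).
sub_out1 = nn.Conv2d(256, 2, kernel_size=1, padding=0)
sub_out5 = nn.Conv2d(32, 2, kernel_size=1, padding=0)
```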
Compared with the prior art, the invention has the advantages that:
1) the convolutional neural network constructed by the method realizes the detection of the salient object from end to end, is easy to train, and is convenient and quick; inputting the color real object images and the corresponding depth images in the training set into a convolutional neural network for training to obtain a convolutional neural network training model; and inputting the color real object image to be subjected to significance detection and the corresponding depth image into the convolutional neural network training model, and predicting to obtain a predicted significance detection image corresponding to the color real object image.
2) When utilizing the depth information, the method adopts a post-fusion scheme and concatenates the depth information and the color image information from the corresponding coding layers with the corresponding decoding layer, which avoids introducing noise information in the coding stage as pre-fusion does, and at the same time allows the complementary information of the color image information and the depth information to be fully learned when training the convolutional neural network training model, thereby obtaining better results on both the training set and the test set.
3) The invention adopts multi-scale supervision, that is, the spatial detail information of objects can be optimized during up-sampling through the deconvolution layers; prediction maps are output at different sizes and supervised by label maps of the corresponding sizes, which guides the convolutional neural network training model to gradually construct the saliency detection prediction maps, thereby obtaining better results on both the training set and the test set.
Drawings
FIG. 1 is a schematic diagram of the structure of a convolutional neural network constructed by the method of the present invention;
FIG. 2a is a precision-recall curve obtained by predicting each color real object image in the NLPR real object image database test set with the method of the present invention, reflecting the saliency detection effect of the method of the present invention;
FIG. 2b is a graph showing the mean absolute error of the saliency detection effect of the present invention as predicted for each color real object image in the real object image database NLPR test set by the present invention;
FIG. 2c is a F metric value for predicting each color real object image in the real object image database NLPR test set by using the method of the present invention to reflect the significance detection effect of the method of the present invention;
FIG. 3a is the 1 st original color real object image of the same scene;
FIG. 3b is a depth image corresponding to FIG. 3 a;
FIG. 3c is a predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention;
FIG. 4a is the 2 nd original color real object image of the same scene;
FIG. 4b is a depth image corresponding to FIG. 4 a;
FIG. 4c is a predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention;
FIG. 5a is the 3 rd original color real object image of the same scene;
FIG. 5b is a depth image corresponding to FIG. 5 a;
FIG. 5c is a predicted saliency detected image from the prediction of FIG. 5a using the method of the present invention;
FIG. 6a is the 4 th original color real object image of the same scene;
FIG. 6b is a depth image corresponding to FIG. 6 a;
fig. 6c is a predicted saliency detection image obtained by predicting fig. 6a by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The significance detection method based on the residual error network and the depth information fusion comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images, together with the depth image and the real saliency detection label image corresponding to each original color real object image, to form a training set; denote the q-th original color real object image in the training set as {Iq(i,j)}, its corresponding depth image as {Dq(i,j)}, and its corresponding real saliency detection label image as {Gq(i,j)}; wherein Q is a positive integer with Q ≥ 200, for example Q = 367, q is a positive integer with initial value 1 and 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, H denotes the height of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, W and H are divisible by 2, for example W = 512 and H = 512, {Iq(i,j)} is an RGB color image, Iq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Iq(i,j)}, {Dq(i,j)} is a single-channel depth image, Dq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Dq(i,j)}, and Gq(i,j) denotes the pixel value of the pixel with coordinate position (i,j) in {Gq(i,j)}; in this embodiment, the original color real object images are directly selected from the 800 images in the training set of the database NLPR.
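As an illustration of step 1_1 in this embodiment (NLPR images resized so that W = H = 512), a minimal data-loading sketch is given below; the directory layout, file naming and binarization threshold are assumptions, not part of the patent:

```python
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class NLPRSaliencyDataset(Dataset):
    # Hypothetical layout: <root>/rgb, <root>/depth and <root>/gt hold the color
    # images, single-channel depth images and binary label images respectively.
    def __init__(self, root, size=(512, 512)):
        self.root, self.size = root, size
        self.names = sorted(os.listdir(os.path.join(root, 'rgb')))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        rgb = Image.open(os.path.join(self.root, 'rgb', name)).convert('RGB').resize(self.size)
        dep = Image.open(os.path.join(self.root, 'depth', name)).convert('L').resize(self.size)
        gt = Image.open(os.path.join(self.root, 'gt', name)).convert('L').resize(self.size)
        rgb = torch.from_numpy(np.array(rgb)).permute(2, 0, 1).float() / 255.0
        dep = torch.from_numpy(np.array(dep)).unsqueeze(0).float() / 255.0
        label = (torch.from_numpy(np.array(gt)) > 127).long()   # binary label map
        return rgb, dep, label
```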
Step 1_2: construct a convolutional neural network: as shown in fig. 1, the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers (Pool), 5 depth map neural network blocks, 4 depth map maximum pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers, and the output layer comprises 5 sub-output layers; the 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map maximum pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map together form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network.
For the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; among them, the width of the RGB color image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; the training depth image has a width W and a height H.
For the 1st RGB map neural network block, its input receives the R channel component, G channel component and B channel component of the RGB color image for training output by the output of the RGB map input layer, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted CP1.
For the 1st RGB map maximum pooling layer, its input receives all feature maps in CP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted ZC1.
For the 2nd RGB map neural network block, its input receives all feature maps in ZC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted CP2.
For the 2nd RGB map maximum pooling layer, its input receives all feature maps in CP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted ZC2.
For the 3rd RGB map neural network block, its input receives all feature maps in ZC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted CP3.
For the 3rd RGB map maximum pooling layer, its input receives all feature maps in CP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted ZC3.
For the 4th RGB map neural network block, its input receives all feature maps in ZC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted CP4.
For the 4th RGB map maximum pooling layer, its input receives all feature maps in CP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted ZC4.
For the 5th RGB map neural network block, its input receives all feature maps in ZC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted CP5.
For the 1st depth map neural network block, its input receives the depth image for training output by the output of the depth map input layer, and its output outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted DP1.
For the 1st depth map maximum pooling layer, its input receives all feature maps in DP1, and its output outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DC1.
For the 2nd depth map neural network block, its input receives all feature maps in DC1, and its output outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted DP2.
For the 2nd depth map maximum pooling layer, its input receives all feature maps in DP2, and its output outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DC2.
For the 3rd depth map neural network block, its input receives all feature maps in DC2, and its output outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted DP3.
For the 3rd depth map maximum pooling layer, its input receives all feature maps in DP3, and its output outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DC3.
For the 4th depth map neural network block, its input receives all feature maps in DC3, and its output outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted DP4.
For the 4th depth map maximum pooling layer, its input receives all feature maps in DP4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DC4.
For the 5th depth map neural network block, its input receives all feature maps in DC4, and its output outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted DP5.
For the 1st cascade (concatenation) layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and superposes them, and its output end outputs 512 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as Con1.
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as RH1.
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as FJ1.
For the 2nd cascade layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and superposes them, and its output end outputs 768 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as Con2.
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as RH2.
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as FJ2.
For the 3rd cascade layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and superposes them, and its output end outputs 512 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as Con3.
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as RH3.
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as FJ3.
For the 4th cascade layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and superposes them, and its output end outputs 256 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as Con4.
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as RH4.
For the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps of width W and height H; the set of all output feature maps is denoted as FJ4.
For the 5th cascade layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and superposes them, and its output end outputs 128 feature maps of width W and height H; the set of all output feature maps is denoted as Con5.
For the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted as RH5.
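As an illustrative aside, the decoding layer repeats one pattern: concatenate the previous deconvolution output with the matching CP and DP feature-map sets (cascade layer), reduce the channel count (fusion neural network block), and, except after the 5th fusion block, double the resolution with a 2×2 stride-2 deconvolution. The sketch below demonstrates the shape arithmetic of one such stage; the function name decoder_stage and the single Conv-ReLU used as the fusion step are simplifications assumed for this sketch and are intended only to show the tensor sizes.

```python
import torch
import torch.nn as nn

def decoder_stage(prev, cp, dp, fuse_out_ch, upsample=True):
    # prev: output of the previous deconvolution (None for the 1st cascade layer)
    # cp, dp: feature-map sets from the RGB and depth coding structures at this scale
    parts = [cp, dp] if prev is None else [prev, cp, dp]
    con = torch.cat(parts, dim=1)              # cascade layer: channel-wise superposition
    fuse = nn.Sequential(                      # simplified fusion neural network block
        nn.Conv2d(con.shape[1], fuse_out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )
    rh = fuse(con)
    if not upsample:                           # the 5th stage has no deconvolution
        return rh
    deconv = nn.ConvTranspose2d(fuse_out_ch, fuse_out_ch,
                                kernel_size=2, stride=2, padding=0)
    return deconv(rh)                          # doubles both width and height

if __name__ == "__main__":
    cp5 = torch.randn(1, 256, 14, 14)          # CP5 at W/16 x H/16 (W = H = 224 here)
    dp5 = torch.randn(1, 256, 14, 14)          # DP5
    fj1 = decoder_stage(None, cp5, dp5, fuse_out_ch=256)
    print(fj1.shape)                           # 256 x 28 x 28, analogous to FJ1
```

Building the layers inside the function is done here only to keep the shape demonstration short; a training implementation would register them once inside a Module.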
For the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as Out1, and one of the feature maps in Out1 (the 2nd feature map) is a saliency detection prediction map.
For the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as Out2, and one of the feature maps in Out2 (the 2nd feature map) is a saliency detection prediction map.
For the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as Out3, and one of the feature maps in Out3 (the 2nd feature map) is a saliency detection prediction map.
For the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as Out4, and one of the feature maps in Out4 (the 2nd feature map) is a saliency detection prediction map.
For the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps of width W and height H; the set of all output feature maps is denoted as Out5, and one of the feature maps in Out5 (the 2nd feature map) is a saliency detection prediction map.
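For illustration, each sub-output layer is a 1×1 convolution that maps a fused feature-map set to 2 feature maps (one per class). A minimal sketch, with the channel counts taken from RH1 to RH5:

```python
import torch
import torch.nn as nn

# One sub-output layer per decoder scale; input channels follow RH1..RH5.
sub_outputs = nn.ModuleList([
    nn.Conv2d(in_ch, 2, kernel_size=1, padding=0)
    for in_ch in (256, 256, 128, 64, 32)
])

rh5 = torch.randn(1, 32, 224, 224)   # RH5: 32 feature maps of width W and height H
out5 = sub_outputs[4](rh5)           # Out5: 2 feature maps of width W and height H
print(out5.shape)                    # torch.Size([1, 2, 224, 224])
```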
Step 1_3: Take each original color real object image in the training set as a training RGB color image and the depth image corresponding to each original color real object image in the training set as a training depth image, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {Iq(i,j)} is referred to below as the prediction-map set of {Iq(i,j)}.
Step 1_4: Scale the real saliency detection label image corresponding to each original color real object image in the training set to 5 different sizes, obtaining an image of width W/16 and height H/16, an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to {Iq(i,j)} is referred to below as the label-map set of {Iq(i,j)}.
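For illustration, step 1_4 amounts to resizing each real saliency detection label image to the five resolutions produced by the sub-output layers. A minimal sketch follows; nearest-neighbour interpolation is an assumption of the sketch, as the patent only specifies the target sizes.

```python
import torch
import torch.nn.functional as F

def multiscale_labels(label):
    """label: (N, 1, H, W) real saliency detection label image with values in {0, 1}."""
    h, w = label.shape[-2:]
    sizes = [(h // 16, w // 16), (h // 8, w // 8),
             (h // 4, w // 4), (h // 2, w // 2), (h, w)]
    # Nearest-neighbour resizing keeps the label binary at every scale.
    return [F.interpolate(label, size=s, mode="nearest") for s in sizes]

gt = (torch.rand(1, 1, 224, 224) > 0.5).float()
print([tuple(l.shape[-2:]) for l in multiscale_labels(gt)])
# [(14, 14), (28, 28), (56, 56), (112, 112), (224, 224)]
```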
Step 1_5: Calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real object image in the training set and the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to that image, i.e. the loss function value between the prediction-map set and the label-map set of {Iq(i,j)}; this loss function value is obtained using categorical cross-entropy.
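As an illustrative sketch, the code below computes a categorical cross-entropy loss between each 2-channel sub-output and the label image of the same size; summing the five per-scale losses with equal weights is an assumption, since the text does not state how the five values are combined.

```python
import torch
import torch.nn.functional as F

def multiscale_ce_loss(outs, labels):
    # outs:   list of 5 tensors, each (N, 2, h, w), from the sub-output layers
    # labels: list of 5 tensors, each (N, 1, h, w), with values in {0, 1}
    loss = 0.0
    for out, lab in zip(outs, labels):
        target = lab.squeeze(1).long()              # per-pixel class index
        loss = loss + F.cross_entropy(out, target)  # categorical cross-entropy
    return loss

outs = [torch.randn(1, 2, s, s) for s in (14, 28, 56, 112, 224)]
labels = [torch.randint(0, 2, (1, 1, s, s)) for s in (14, 28, 56, 112, 224)]
print(multiscale_ce_loss(outs, labels).item())
```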
Step 1_6: Repeat step 1_3 to step 1_5 V times to obtain a convolutional neural network training model and Q × V loss function values; then find the smallest of the Q × V loss function values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted Wbest and bbest respectively; where V > 1, and in this example V = 300.
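For illustration, step 1_6 corresponds to training for V epochs while keeping the parameters that yielded the smallest observed loss value. In the sketch below the loss is tracked per training step rather than per image, and model, optimizer, loader and loss_fn are placeholders rather than components defined by the patent.

```python
import copy

def train_select_best(model, optimizer, loader, loss_fn, V=300):
    best_loss, best_state = float("inf"), None
    for epoch in range(V):
        for rgb, depth, labels in loader:      # one batch of (RGB, depth, label sets)
            outs = model(rgb, depth)           # 5 sub-output prediction maps
            loss = loss_fn(outs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:        # keep the weights behind W_best, b_best
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model, best_loss
```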
The test stage process comprises the following specific steps:
Step 2_1: Let a color real object image to be saliency-detected and its corresponding depth image be given, both of width W' and height H', where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H' index the pixel point at coordinate position (i',j') in each image.
Step 2_2: Input the R channel component, G channel component and B channel component of the color real object image to be saliency-detected together with its corresponding depth image into the convolutional neural network training model, and use Wbest and bbest for prediction to obtain the 5 predicted saliency detection images of different sizes corresponding to that image; the predicted saliency detection image whose size is the same as that of the input color real object image is taken as its final predicted saliency detection image.
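For illustration, the test stage keeps only the prediction whose size matches the input. The sketch assumes the trained model returns the five prediction maps ordered from coarsest to finest and that a softmax over the 2 channels yields the saliency probability; both are assumptions of the sketch rather than statements of the patent.

```python
import torch

@torch.no_grad()
def predict_saliency(model, rgb, depth):
    # rgb:   (1, 3, H', W') color real object image to be saliency-detected
    # depth: (1, 1, H', W') corresponding depth image
    model.eval()
    outs = model(rgb, depth)          # 5 predicted saliency detection images
    full_res = outs[-1]               # the one whose size equals W' x H'
    # Take the 2nd channel as the saliency map after a softmax over the 2 classes.
    return torch.softmax(full_res, dim=1)[:, 1]
```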
In this embodiment, in step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure and are each composed of a first Convolution layer (Convolution, Conv), a first Batch Normalization layer (Batch Normalization, BN), a first Activation layer (Activation, Act), a first Residual Block (Residual Block, RB), a second convolution layer, a second batch normalization layer and a second activation layer arranged in sequence; the input end of the first convolution layer is the input end of the neural network block where it is located, the input end of the first batch normalization layer receives all the feature maps output by the output end of the first convolution layer, the input end of the first activation layer receives all the feature maps output by the output end of the first batch normalization layer, the input end of the first residual block receives all the feature maps output by the output end of the first activation layer, the input end of the second convolution layer receives all the feature maps output by the output end of the first residual block, the input end of the second batch normalization layer receives all the feature maps output by the output end of the second convolution layer, the input end of the second activation layer receives all the feature maps output by the output end of the second batch normalization layer, and the output end of the second activation layer is the output end of the neural network block where it is located; the convolution kernel size (kernel_size) of the first and second convolution layers is 3 × 3, the number of convolution kernels (filters) is 32, the zero-padding parameter (padding) is 1, the activation mode of the first and second activation layers is 'Relu', and the output ends of the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps.
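A hedged PyTorch sketch of the block structure just described (Conv-BN-ReLU, residual block, Conv-BN-ReLU) follows; the internal layout of the residual block, taken here as two 3×3 Conv-BN stages with an identity skip connection, is an assumption of the sketch, since its internals are not spelled out in this passage.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Assumed internals: two 3x3 Conv-BN stages plus an identity skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # residual connection

class NeuralNetworkBlock(nn.Module):
    # Conv-BN-ReLU -> residual block -> Conv-BN-ReLU, as in the 1st RGB/depth block.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # kernel 3x3, padding 1
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            ResidualBlock(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stage(x)

if __name__ == "__main__":
    block1 = NeuralNetworkBlock(3, 32)        # analogous to the 1st RGB map block
    print(block1(torch.randn(1, 3, 224, 224)).shape)   # 32 x 224 x 224
```

The remaining blocks differ only in their channel counts (64, 128, 256, 256), as detailed in the paragraphs below.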
In this embodiment, the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third active layer, a second residual block, a fourth convolution layer, a fourth normalization layer, and a fourth active layer, which are sequentially arranged, where an input end of the third convolution layer is an input end of the neural network block where the third convolution layer is located, an input end of the third normalization layer receives all feature maps output by an output end of the third convolution layer, an input end of the third active layer receives all feature maps output by an output end of the third normalization layer, an input end of the second residual block receives all feature maps output by an output end of the third active layer, an input end of the fourth convolution layer receives all feature maps output by an output end of the second residual block, and an input end of the fourth normalization layer receives all feature maps output by an output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block.
In this specific embodiment, the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure, and are composed of a fifth convolution layer, a fifth normalization layer, a fifth active layer, a third residual block, a sixth convolution layer, a sixth normalization layer, and a sixth active layer, which are sequentially arranged, where an input end of the fifth convolution layer is an input end of the neural network block where the fifth convolution layer is located, an input end of the fifth normalization layer receives all feature maps output by an output end of the fifth convolution layer, an input end of the fifth active layer receives all feature maps output by an output end of the fifth normalization layer, an input end of the third residual block receives all feature maps output by an output end of the fifth active layer, an input end of the sixth convolution layer receives all feature maps output by an output end of the third residual block, and an input end of the sixth normalization layer receives all feature maps output by an output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and 128 feature graphs are output from output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block respectively.
In this specific embodiment, the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh active layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer, and an eighth active layer, which are sequentially arranged, an input end of the seventh convolution layer is an input end of the neural network block where the seventh convolution layer is located, an input end of the seventh normalization layer receives all feature maps output by an output end of the seventh convolution layer, an input end of the seventh active layer receives all feature maps output by an output end of the seventh normalization layer, an input end of the fourth residual block receives all feature maps output by an output end of the seventh active layer, an input end of the eighth convolution layer receives all feature maps output by an output end of the fourth residual block, an input end of the eighth normalization layer receives all feature maps output by an output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block.
In this embodiment, the 5 th RGB map neural network block and the 5 th depth map neural network block have the same structure, and are composed of a ninth convolutional layer, a ninth block of normalization layers, a ninth active layer, a fifth residual block, a tenth convolutional layer, a tenth block of normalization layers, and a tenth active layer, which are sequentially arranged, an input end of the ninth convolutional layer is an input end of the neural network block where the ninth convolutional layer is located, an input end of the ninth block of normalization layers receives all feature maps output by an output end of the ninth convolutional layer, an input end of the ninth active layer receives all feature maps output by an output end of the ninth block of normalization layers, an input end of the fifth residual block receives all feature maps output by an output end of the ninth active layer, an input end of the tenth block of normalization layers receives all feature maps output by an output end of the tenth block of normalization layers, the input end of the tenth active layer receives all characteristic graphs output by the output end of the tenth normalization layer, and the output end of the tenth active layer is the output end of the neural network block where the tenth active layer is located; the sizes of convolution kernels of the ninth convolution layer and the tenth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and 256 feature maps are output from output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block respectively.
In this embodiment, in step 1_2, the 4 RGB map max pooling layers and the 4 depth map max pooling layers are all max pooling layers, each with a pooling size (pool_size) of 2 and a stride of 2.
In this embodiment, in step 1_2, the 5 fusion neural network blocks have the same structure and are each composed of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer arranged in sequence; the input end of the eleventh convolution layer is the input end of the fusion neural network block where it is located, the input end of the eleventh batch normalization layer receives all the feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all the feature maps output by the output end of the eleventh batch normalization layer, the input end of the sixth residual block receives all the feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all the feature maps output by the output end of the sixth residual block, the input end of the twelfth batch normalization layer receives all the feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all the feature maps output by the output end of the twelfth batch normalization layer, and the output end of the twelfth activation layer is the output end of the fusion neural network block where it is located. In every fusion neural network block the convolution kernel size of the eleventh and twelfth convolution layers is 3 × 3, the zero-padding parameter is 1, and the activation mode of the eleventh and twelfth activation layers is 'Relu'; the number of convolution kernels, and hence the number of feature maps output by the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block, is 256 in the 1st and 2nd fusion neural network blocks, 128 in the 3rd fusion neural network block, 64 in the 4th fusion neural network block and 32 in the 5th fusion neural network block.
In this embodiment, in step 1_2, the convolution kernel size of the 1st and 2nd deconvolution layers is 2 × 2, the number of convolution kernels is 256, the stride is 2 and the zero-padding parameter is 0; the convolution kernel size of the 3rd deconvolution layer is 2 × 2, the number of convolution kernels is 128, the stride is 2 and the zero-padding parameter is 0; the convolution kernel size of the 4th deconvolution layer is 2 × 2, the number of convolution kernels is 64, the stride is 2 and the zero-padding parameter is 0.
In this embodiment, in step 1_2, the 5 sub-output layers have the same structure and are each composed of a thirteenth convolution layer; the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero-padding parameter is 0.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture proposed by the method of the invention was built with the Python-based deep learning library PyTorch 0.4.1. The test set of the real object image database NLPR (200 real object images) is used to analyze how well the method of the invention predicts the saliency of color real object images. Here, 3 objective parameters commonly used to evaluate saliency detection methods are adopted as evaluation indicators, namely the Precision-Recall curve, the Mean Absolute Error (MAE) and the F-measure (F-Measure).
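For reference, the two scalar metrics can be computed as in the sketch below; β² = 0.3 and a fixed binarization threshold of 0.5 are assumed defaults of the sketch rather than values stated in the patent.

```python
import numpy as np

def mae(pred, gt):
    # pred, gt: 2-D arrays with values in [0, 1]
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    p = pred >= threshold            # binarized prediction
    g = gt >= 0.5                    # binary ground truth
    tp = np.logical_and(p, g).sum()
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt), f_measure(pred, gt))
```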
The method of the invention is used to predict each color real object image in the NLPR test set, giving a predicted saliency detection image for each color real object image. The Precision-Recall curve (PR Curve) reflecting the saliency detection performance of the method of the invention is shown in FIG. 2a, the Mean Absolute Error (MAE) reflecting the saliency detection performance is shown in FIG. 2b and has a value of 0.058, and the F-measure (F-Measure) reflecting the saliency detection performance is shown in FIG. 2c and has a value of 0.796. As can be seen from FIG. 2a to FIG. 2c, the saliency detection results obtained by the method of the invention on color real object images are good, indicating that it is feasible and effective to use the method of the invention to obtain the predicted saliency detection image corresponding to a color real object image.
FIG. 3a shows the 1 st original color real object image of the same scene, FIG. 3b shows the depth image corresponding to FIG. 3a, and FIG. 3c shows the predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention; FIG. 4a shows the 2 nd original color real object image of the same scene, FIG. 4b shows the depth image corresponding to FIG. 4a, and FIG. 4c shows the predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention; FIG. 5a shows the 3 rd original color real object image of the same scene, FIG. 5b shows the depth image corresponding to FIG. 5a, and FIG. 5c shows the predicted saliency detection image obtained by predicting FIG. 5a using the method of the present invention; fig. 6a shows the 4 th original color real object image of the same scene, fig. 6b shows the depth image corresponding to fig. 6a, and fig. 6c shows the predicted saliency detection image obtained by predicting fig. 6a by using the method of the present invention. Comparing fig. 3a and 3c, comparing fig. 4a and 4c, comparing fig. 5a and 5c, and comparing fig. 6a and 6c, it can be seen that the detection accuracy of the predicted saliency detection image obtained by the method of the present invention is higher.

Claims (6)

1. A significance detection method based on residual error network and depth information fusion is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images together with the depth image and the real saliency detection label image corresponding to each original color real object image to form a training set, and denote the q-th original color real object image in the training set as {Iq(i,j)} and its corresponding depth image as {Dq(i,j)}; wherein Q is a positive integer, Q ≥ 200, q is a positive integer with initial value 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W and H respectively represent the width and height of {Iq(i,j)}, of {Dq(i,j)} and of the corresponding real saliency detection label image, W and H are divisible by 2, {Iq(i,j)} is an RGB color image, Iq(i,j) represents the pixel value of the pixel point at coordinate position (i,j) in {Iq(i,j)}, {Dq(i,j)} is a single-channel depth image, Dq(i,j) represents the pixel value of the pixel point at coordinate position (i,j) in {Dq(i,j)}, and the pixel value of the pixel point at coordinate position (i,j) in the corresponding real saliency detection label image is defined analogously;
Step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer, the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map max pooling layers, 5 depth map neural network blocks, 4 depth map max pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers, and the output layer comprises 5 sub-output layers; the 5 RGB map neural network blocks and the 4 RGB map max pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map max pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1
for the 1st RGB map max pooling layer, its input end receives all the feature maps in CP1, and its output end outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as ZC1;
for the 2nd RGB map neural network block, its input end receives all the feature maps in ZC1, and its output end outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as CP2;
for the 2nd RGB map max pooling layer, its input end receives all the feature maps in CP2, and its output end outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as ZC2;
for the 3rd RGB map neural network block, its input end receives all the feature maps in ZC2, and its output end outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as CP3;
for the 3rd RGB map max pooling layer, its input end receives all the feature maps in CP3, and its output end outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as ZC3;
for the 4th RGB map neural network block, its input end receives all the feature maps in ZC3, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as CP4;
for the 4th RGB map max pooling layer, its input end receives all the feature maps in CP4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as ZC4;
for the 5th RGB map neural network block, its input end receives all the feature maps in ZC4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as CP5;
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1
for the 1st depth map max pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as DC1;
for the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as DP2;
for the 2nd depth map max pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as DC2;
for the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as DP3;
for the 3rd depth map max pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as DC3;
for the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as DP4;
for the 4th depth map max pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as DC4;
for the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as DP5;
for the 1st cascade layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and superposes them, and its output end outputs 512 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as Con1;
for the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as RH1;
for the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as FJ1;
for the 2nd cascade layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and superposes them, and its output end outputs 768 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as Con2;
for the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as RH2;
for the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as FJ2;
for the 3rd cascade layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and superposes them, and its output end outputs 512 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as Con3;
for the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as RH3;
for the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as FJ3;
for the 4th cascade layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and superposes them, and its output end outputs 256 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as Con4;
for the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as RH4;
for the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps of width W and height H; the set of all output feature maps is denoted as FJ4;
for the 5th cascade layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and superposes them, and its output end outputs 128 feature maps of width W and height H; the set of all output feature maps is denoted as Con5;
for the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps of width W and height H; the set of all output feature maps is denoted as RH5;
for the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps of width W/16 and height H/16; the set of all output feature maps is denoted as Out1, and one of the feature maps in Out1 is a saliency detection prediction map;
for the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps of width W/8 and height H/8; the set of all output feature maps is denoted as Out2, and one of the feature maps in Out2 is a saliency detection prediction map;
for the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps of width W/4 and height H/4; the set of all output feature maps is denoted as Out3, and one of the feature maps in Out3 is a saliency detection prediction map;
for the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps of width W/2 and height H/2; the set of all output feature maps is denoted as Out4, and one of the feature maps in Out4 is a saliency detection prediction map;
for the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps of width W and height H; the set of all output feature maps is denoted as Out5, and one of the feature maps in Out5 is a saliency detection prediction map;
Step 1_3: take each original color real object image in the training set as a training RGB color image and the depth image corresponding to each original color real object image in the training set as a training depth image, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {Iq(i,j)} is referred to below as the prediction-map set of {Iq(i,j)};
Step 1_ 4: scaling the real significance detection label image corresponding to each original color real object image in the training set by 4 different sizes to obtain the real significance detection label image with the width of
Figure FDA0002698601000000074
And has a height of
Figure FDA0002698601000000075
An image of width of
Figure FDA0002698601000000076
And has a height of
Figure FDA0002698601000000077
An image of width of
Figure FDA0002698601000000078
And has a height of
Figure FDA0002698601000000079
An image of width of
Figure FDA00026986010000000710
And has a height of
Figure FDA00026986010000000711
Will { I }q(i, j) } the set formed by the 4 images obtained by scaling the corresponding real significance detection label images and the real significance detection label images is recorded as
Figure FDA00026986010000000712
Step 1_ 5: calculating loss function values between a set formed by 5 saliency detection prediction images corresponding to each original color real object image in a training set and a set formed by 4 images obtained by scaling real saliency detection label images corresponding to the original color real object images and the set formed by the real saliency detection label images, and calculating the loss function values of the sets
Figure FDA00026986010000000713
And
Figure FDA00026986010000000714
the value of the loss function in between is recorded as
Figure FDA00026986010000000715
Obtaining by adopting a classified cross entropy;
Step 1_6: repeat step 1_3 to step 1_5 V times to obtain a convolutional neural network training model and Q × V loss function values; then find the smallest of the Q × V loss function values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted Wbest and bbest respectively; where V > 1;
the test stage process comprises the following specific steps:
Step 2_1: let a color real object image to be saliency-detected and its corresponding depth image be given, both of width W' and height H', where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H' index the pixel point at coordinate position (i',j') in each image;
Step 2_2: input the R channel component, G channel component and B channel component of the color real object image to be saliency-detected together with its corresponding depth image into the convolutional neural network training model, and use Wbest and bbest for prediction to obtain the 5 predicted saliency detection images of different sizes corresponding to that image; the predicted saliency detection image whose size is the same as that of the color real object image to be saliency-detected is taken as its final predicted saliency detection image.
2. The method according to claim 1, wherein in step 1_2, the 1 st RGB graph neural network block and the 1 st depth graph neural network block have the same structure, and are composed of a first convolution layer, a first normalization layer, a first active layer, a first residual block, a second convolution layer, a second normalization layer, and a second active layer, which are sequentially arranged, wherein an input end of the first convolution layer is an input end of the neural network block where the first convolution layer is located, an input end of the first normalization layer receives all feature maps output from an output end of the first convolution layer, an input end of the first active layer receives all feature maps output from an output end of the first normalization layer, an input end of the first residual block receives all feature maps output from an output end of the first active layer, and an input end of the second convolution layer receives all feature maps output from an output end of the first residual block, the input end of the second batch of normalization layers receives all the characteristic graphs output by the output end of the second convolution layer, the input end of the second activation layer receives all the characteristic graphs output by the output end of the second batch of normalization layers, and the output end of the second activation layer is the output end of the neural network block where the second activation layer is located; the sizes of convolution kernels of the first convolution layer and the second convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 32, zero padding parameters are both 1, the activation modes of the first activation layer and the second activation layer are both 'Relu', and output ends of the first normalization layer, the second normalization layer, the first activation layer, the second activation layer and the first residual block respectively output 32 characteristic graphs;
the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth normalization layer and a fourth activation layer which are sequentially arranged, wherein the input end of the third convolution layer is the input end of the neural network block where the third convolution layer is located, the input end of the third normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the third normalization layer, the input end of the second residual block receives all feature maps output by the output end of the third activation layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the second residual block, and the input end of the fourth normalization layer receives all feature maps output by the output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block;
the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure and are composed of a fifth convolution layer, a fifth normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth normalization layer and a sixth activation layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the fifth normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the fifth normalization layer, the input end of the third residual block receives all feature maps output by the output end of the fifth activation layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the third residual block, and the input end of the sixth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and the output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block output 128 feature graphs;
the 4th RGB map neural network block and the 4th depth map neural network block have the same structure and are each composed of a seventh convolution layer, a seventh normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer and an eighth activation layer which are sequentially arranged, wherein the input end of the seventh convolution layer is the input end of the neural network block where the seventh convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the seventh convolution layer, the input end of the seventh activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the fourth residual block receives all feature maps output by the output end of the seventh activation layer, the input end of the eighth convolution layer receives all feature maps output by the output end of the fourth residual block, the input end of the eighth normalization layer receives all feature maps output by the output end of the eighth convolution layer, the input end of the eighth activation layer receives all feature maps output by the output end of the eighth normalization layer, and the output end of the eighth activation layer is the output end of the neural network block where the eighth activation layer is located; the convolution kernel sizes of the seventh convolution layer and the eighth convolution layer are both 3 × 3, the numbers of convolution kernels are both 256, the zero padding parameters are both 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and the output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block each output 256 feature maps;
the 5th RGB map neural network block and the 5th depth map neural network block have the same structure and are each composed of a ninth convolution layer, a ninth normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth normalization layer and a tenth activation layer which are sequentially arranged, wherein the input end of the ninth convolution layer is the input end of the neural network block where the ninth convolution layer is located, the input end of the ninth normalization layer receives all feature maps output by the output end of the ninth convolution layer, the input end of the ninth activation layer receives all feature maps output by the output end of the ninth normalization layer, the input end of the fifth residual block receives all feature maps output by the output end of the ninth activation layer, the input end of the tenth convolution layer receives all feature maps output by the output end of the fifth residual block, the input end of the tenth normalization layer receives all feature maps output by the output end of the tenth convolution layer, the input end of the tenth activation layer receives all feature maps output by the output end of the tenth normalization layer, and the output end of the tenth activation layer is the output end of the neural network block where the tenth activation layer is located; the convolution kernel sizes of the ninth convolution layer and the tenth convolution layer are both 3 × 3, the numbers of convolution kernels are both 256, the zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and the output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps.
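The four blocks above share one layout (convolution, normalization, activation, residual block, convolution, normalization, activation) and differ only in channel width. The following PyTorch-style sketch is illustrative only and is not part of the claim text: the internal structure of the residual block and the input channel widths are assumptions, since the claim defines the residual blocks and the block wiring elsewhere.

import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    # Assumed identity-shortcut residual block with two 3 x 3 convolutions;
    # the patent specifies its exact internal structure separately.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)


class StreamBlock(nn.Module):
    # Conv 3x3 -> normalization -> ReLU -> residual block -> Conv 3x3 -> normalization -> ReLU,
    # matching the layer ordering recited for the 2nd-5th RGB map / depth map blocks.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            ResidualBlock(out_channels),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


# Output widths from the claim: 64, 128, 256, 256 for the 2nd-5th blocks.
# Input widths are assumptions based on chaining the blocks one after another.
rgb_blocks = nn.ModuleList([
    StreamBlock(64, 64),    # 2nd block
    StreamBlock(64, 128),   # 3rd block
    StreamBlock(128, 256),  # 4th block
    StreamBlock(256, 256),  # 5th block
])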
3. The method according to claim 1 or 2, wherein in step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers all perform maximum pooling, and their pooling sizes are all 2 with a step size of 2.
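As a small illustrative sketch (not part of the claim), a 2 × 2 maximum pooling with step size 2 halves each spatial dimension of the feature maps passed between consecutive blocks; the tensor size below is an arbitrary example.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # pooling size 2, step size 2
x = torch.randn(1, 64, 224, 224)              # example feature maps (N, C, H, W)
print(pool(x).shape)                          # torch.Size([1, 64, 112, 112])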
4. The method according to claim 3, wherein in step 1_2, the 5 fused neural network blocks have the same structure and are each composed of an eleventh convolution layer, an eleventh normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth normalization layer and a twelfth activation layer which are sequentially arranged, wherein the input end of the eleventh convolution layer is the input end of the fused neural network block where the eleventh convolution layer is located, the input end of the eleventh normalization layer receives all feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all feature maps output by the output end of the eleventh normalization layer, the input end of the sixth residual block receives all feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all feature maps output by the output end of the sixth residual block, the input end of the twelfth normalization layer receives all feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all feature maps output by the output end of the twelfth normalization layer, and the output end of the twelfth activation layer is the output end of the fused neural network block where the twelfth activation layer is located; wherein, in the 1st and 2nd fused neural network blocks, the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3 × 3, the numbers of convolution kernels are both 256, the zero padding parameters are both 1, the activation modes of the eleventh activation layer and the twelfth activation layer are both 'Relu', and the output ends of the eleventh normalization layer, the twelfth normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 256 feature maps; in the 3rd fused neural network block, the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3 × 3, the numbers of convolution kernels are both 128, the zero padding parameters are both 1, the activation modes of the eleventh activation layer and the twelfth activation layer are both 'Relu', and the output ends of the eleventh normalization layer, the twelfth normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 128 feature maps; in the 4th fused neural network block, the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3 × 3, the numbers of convolution kernels are both 64, the zero padding parameters are both 1, the activation modes of the eleventh activation layer and the twelfth activation layer are both 'Relu', and the output ends of the eleventh normalization layer, the twelfth normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 64 feature maps; in the 5th fused neural network block, the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3 × 3, the numbers of convolution kernels are both 32, the zero padding parameters are both 1, the activation modes of the eleventh activation layer and the twelfth activation layer are both 'Relu', and the output ends of the eleventh normalization layer, the twelfth normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 32 feature maps.
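The five fused neural network blocks repeat the convolution, normalization, activation, residual block, convolution, normalization, activation layout of the stream blocks and differ only in width (256, 256, 128, 64 and 32 output feature maps). A minimal illustrative sketch follows; the input widths are hypothetical placeholders, since they depend on the concatenations defined in claim 1, and the residual block form is an assumption.

import torch.nn as nn


def conv_bn_relu(c_in, c_out):
    # Convolution (3 x 3, zero padding 1) followed by normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class Residual(nn.Module):
    # Assumed identity-shortcut residual block; the claim defines its exact form elsewhere.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_relu(c, c),
            nn.Conv2d(c, c, kernel_size=3, padding=1),
            nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)


def fused_block(c_in, c_out):
    # Conv-BN-ReLU -> residual block -> Conv-BN-ReLU, the claimed fused block layout.
    return nn.Sequential(conv_bn_relu(c_in, c_out), Residual(c_out), conv_bn_relu(c_out, c_out))


out_widths = [256, 256, 128, 64, 32]   # numbers of convolution kernels from claim 4
in_widths = [512, 512, 256, 128, 64]   # hypothetical input widths, for illustration only
fusion_blocks = nn.ModuleList(
    fused_block(c_in, c_out) for c_in, c_out in zip(in_widths, out_widths)
)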
5. The method according to claim 4, wherein in step 1_2, the convolution kernel sizes of the 1st and 2nd deconvolution layers are both 2 × 2, the numbers of convolution kernels are both 256, the step sizes are both 2, and the zero padding parameters are both 0; the convolution kernel size of the 3rd deconvolution layer is 2 × 2, the number of convolution kernels is 128, the step size is 2, and the zero padding parameter is 0; the convolution kernel size of the 4th deconvolution layer is 2 × 2, the number of convolution kernels is 64, the step size is 2, and the zero padding parameter is 0.
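With a 2 × 2 kernel, step size 2 and zero padding 0, each deconvolution (transposed convolution) layer doubles the spatial size of its input feature maps. The sketch below is illustrative only: the output widths 256, 256, 128 and 64 come from the claim, while the input widths and the example tensor size are assumptions.

import torch
import torch.nn as nn

deconvs = nn.ModuleList([
    nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),  # 1st deconvolution layer
    nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),  # 2nd deconvolution layer
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2, padding=0),  # 3rd deconvolution layer
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2, padding=0),   # 4th deconvolution layer
])

x = torch.randn(1, 256, 14, 14)   # illustrative input feature maps
print(deconvs[0](x).shape)        # torch.Size([1, 256, 28, 28]): spatial size doubled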
6. The method according to claim 5, wherein in step 1_2, the 5 sub-output layers have the same structure and are each composed of a thirteenth convolution layer, wherein the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero padding parameter is 0.
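Each sub-output layer is therefore a single 1 × 1 convolution producing 2 score maps (e.g. salient versus non-salient). A minimal sketch, assuming input widths tied to the fused-block widths above; those input widths and the example tensor size are not part of the claim text.

import torch
import torch.nn as nn

sub_outputs = nn.ModuleList(
    nn.Conv2d(c, 2, kernel_size=1, padding=0)   # 1 x 1 convolution, 2 kernels, zero padding 0
    for c in [256, 256, 128, 64, 32]            # assumed input widths, one per fused block
)

x = torch.randn(1, 32, 224, 224)                # illustrative input for the 5th sub-output layer
print(sub_outputs[4](x).shape)                  # torch.Size([1, 2, 224, 224])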
CN201910444775.0A 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion Active CN110263813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910444775.0A CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910444775.0A CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Publications (2)

Publication Number Publication Date
CN110263813A CN110263813A (en) 2019-09-20
CN110263813B (en) 2020-12-01

Family

ID=67915440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910444775.0A Active CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Country Status (1)

Country Link
CN (1) CN110263813B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751157B (en) * 2019-10-18 2022-06-24 厦门美图之家科技有限公司 Image significance segmentation and image significance model training method and device
CN110782458B (en) * 2019-10-23 2022-05-31 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110929736B (en) * 2019-11-12 2023-05-26 浙江科技学院 Multi-feature cascading RGB-D significance target detection method
CN111160410B (en) * 2019-12-11 2023-08-08 北京京东乾石科技有限公司 Object detection method and device
CN111209919B (en) * 2020-01-06 2023-06-09 上海海事大学 Marine ship significance detection method and system
CN111242238B (en) * 2020-01-21 2023-12-26 北京交通大学 RGB-D image saliency target acquisition method
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Convolutional neural network edge-assisted enhanced binocular saliency image detection method
CN111351450B (en) * 2020-03-20 2021-09-28 南京理工大学 Single-frame stripe image three-dimensional measurement method based on deep learning
CN112749712B (en) * 2021-01-22 2022-04-12 四川大学 RGBD significance object detection method based on 3D convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699151B2 (en) * 2016-06-03 2020-06-30 Miovision Technologies Incorporated System and method for performing saliency detection using deep active contours
CN109409380B (en) * 2018-08-27 2021-01-12 浙江科技学院 Stereo image visual saliency extraction method based on double learning networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961220A (en) * 2018-06-14 2018-12-07 上海大学 Image co-saliency detection method based on multilayer convolution feature fusion
CN109409435A (en) * 2018-11-01 2019-03-01 上海大学 Depth-aware saliency detection method based on convolutional neural networks
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 RGB-D salient object detection method based on a single-stream deep network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 Stereo image visual saliency extraction method based on a deep learning encoding-decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RGB-D Salient Object Detection by a CNN With Multiple Layers Fusion; Huang, Rui et al.; IEEE Signal Processing Letters; 20190430; full text *
Saliency Detection for Stereoscopic 3D Images in the Quaternion Frequency Domain; Wujie Zhou et al.; 3DR Express; 20181231; full text *
Saliency region prediction method using convolutional neural networks; Li Rong et al.; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); 20190228; full text *

Also Published As

Publication number Publication date
CN110263813A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263813B (en) Significance detection method based on residual error network and depth information fusion
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
Yu et al. Underwater-GAN: Underwater image restoration via conditional generative adversarial network
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN108664981B (en) Salient image extraction method and device
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN110490082B (en) Road scene semantic segmentation method capable of effectively fusing neural network features
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
US8692830B2 (en) Automatic avatar creation
CN110992238B (en) Digital image tampering blind detection method based on dual-channel network
CN108389224B (en) Image processing method and device, electronic equipment and storage medium
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110728682A (en) Semantic segmentation method based on residual pyramid pooling neural network
CN112070753A (en) Multi-scale information enhanced binocular convolutional neural network saliency image detection method
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN111445432A (en) Image significance detection method based on information fusion convolutional neural network
CN113139904B (en) Image blind super-resolution method and system
CN111310767A (en) Significance detection method based on boundary enhancement
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN111294614B (en) Method and apparatus for digital image, audio or video data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230118

Address after: Room 2202, 22 / F, Wantong building, No. 3002, Sungang East Road, Sungang street, Luohu District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen dragon totem technology achievement transformation Co.,Ltd.

Address before: 310023 No. 318 stay Road, Xihu District, Zhejiang, Hangzhou

Patentee before: Zhejiang University of Science and Technology

TR01 Transfer of patent right

Effective date of registration: 20230627

Address after: 710000 Room 1306, Building 7, Taihua Jinmao International, Keji Second Road, Hi tech Zone, Xi'an City, Shaanxi Province

Patentee after: Huahao Technology (Xi'an) Co.,Ltd.

Address before: Room 2202, 22 / F, Wantong building, No. 3002, Sungang East Road, Sungang street, Luohu District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen dragon totem technology achievement transformation Co.,Ltd.