CN110263813A - Saliency detection method based on residual network and depth information fusion - Google Patents

Saliency detection method based on residual network and depth information fusion

Info

Publication number
CN110263813A
CN110263813A (application CN201910444775.0A; granted publication CN110263813B)
Authority
CN
China
Prior art keywords
layer
output
feature maps
neural network
receives
Prior art date
Legal status
Granted
Application number
CN201910444775.0A
Other languages
Chinese (zh)
Other versions
CN110263813B (en)
Inventor
周武杰
吴君委
雷景生
何成
钱亚冠
王海江
张伟
Current Assignee
Huahao Technology Xi'an Co ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910444775.0A
Publication of CN110263813A
Application granted
Publication of CN110263813B
Legal status: Active
Anticipated expiration


Classifications

    • G06F 18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods


Abstract

The invention discloses a saliency detection method based on residual network and depth information fusion. In the training stage a convolutional neural network is constructed whose input layer comprises an RGB map input layer and a depth map input layer, whose hidden layer comprises 5 RGB map neural network blocks, 4 RGB map max pooling layers, 5 depth map neural network blocks, 4 depth map max pooling layers, 5 concatenation layers, 5 fusion neural network blocks and 4 deconvolution layers, and whose output layer comprises 5 sub-output layers. The color real object images and depth images in the training set are input into the convolutional neural network for training to obtain saliency detection prediction maps, and the convolutional neural network training model is obtained by computing the loss function values between the saliency detection prediction maps and the real saliency detection label images. In the testing stage the color real object image to be saliency-detected is predicted with the convolutional neural network training model to obtain the predicted saliency detection image. The advantage of the method is its high saliency detection accuracy.

Description

Saliency detection method based on residual network and depth information fusion
Technical Field
The invention relates to visual saliency detection technology, and in particular to a saliency detection method based on residual network and depth information fusion.
Background
Visual saliency helps people quickly filter out unimportant information so that they can focus on the meaningful regions of a scene and understand it better. With the rapid development of computer vision, it is hoped that computers can acquire the same capability, i.e., when understanding and analyzing a complex scene, a computer should process the useful information in a more targeted way, which reduces algorithmic complexity and suppresses noise interference. In traditional methods, researchers model salient object detection according to various kinds of observed prior knowledge, such as contrast priors, center priors, edge priors and semantic priors, to generate a saliency map. In complex scenes, however, these traditional approaches tend to be inaccurate, because such observations are usually limited to low-level features (e.g., color and contrast) and therefore cannot accurately capture what makes an object intrinsically salient.
In recent years, convolutional neural networks have been widely used in many fields of computer vision, and great progress has been made on many difficult vision problems. Unlike traditional methods, a deep convolutional neural network can be modeled from a large number of training samples and learn the essential features automatically and end-to-end, which effectively avoids the drawbacks of manual modeling and hand-crafted feature design. Recently, the effective application of 3D sensors has enriched the available databases: not only color pictures but also their depth information can be acquired. Depth information plays an important role in the human visual system in real 3D scenes, yet it is a piece of information that traditional practice has largely ignored, so a key task at present is how to build a model that uses depth information effectively.
Deep-learning saliency detection methods on RGB-D databases perform pixel-level end-to-end saliency detection directly: the images in the training set are input into a model framework for training to obtain the weights and the model, and prediction can then be performed on the test set. Current deep-learning saliency detection models based on RGB-D databases mainly use an encoding-decoding architecture, and there are three ways of utilizing the depth information. The first is to directly stack the depth information and the color image information into a four-channel input, or to add or stack the color image information and the depth information during encoding; this is called pre-fusion. The second is to add or stack the mutually corresponding color image information and depth information from the encoding process into the corresponding decoding process through skip connections; this is called post-fusion. The third is to predict saliency from the color image information and the depth information separately and to fuse the final results. In the first method, because the distributions of color image information and depth information differ considerably, directly adding the depth information during encoding introduces a certain amount of noise. The third method performs saliency prediction with the depth information and the color map information separately, but if either prediction is inaccurate, the final fused result will also be inaccurate. The second method not only avoids the noise brought by using the depth information directly in the encoding stage, but can also fully learn the complementary relation between color image information and depth information as the network model is optimized. A representative post-fusion scheme is the RGB-D Saliency Detection by Multi-stream Late Fusion Network model (hereinafter MLF): MLF performs feature extraction and down-sampling on the color image information and the depth information separately, fuses them at the highest level by element-wise multiplication of corresponding positions, and outputs a small-size saliency prediction map from the fusion result. Because MLF only contains down-sampling operations, the spatial detail information of objects is blurred by the successive down-sampling, and because MLF outputs its saliency prediction at the smallest size, much information about the salient objects is lost when the prediction is enlarged back to the original size.
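The three fusion strategies discussed above can be illustrated with a minimal PyTorch-style sketch; the tensors, shapes and the simple averaging below are illustrative assumptions, not details taken from this patent or from MLF:

```python
import torch

# Illustrative only: three ways of injecting depth into an RGB saliency model.
rgb   = torch.randn(1, 3, 64, 64)    # color image, N x 3 x H x W
depth = torch.randn(1, 1, 64, 64)    # single-channel depth map, N x 1 x H x W

# 1) Pre-fusion: stack depth onto the color channels before encoding.
early_input = torch.cat([rgb, depth], dim=1)               # N x 4 x H x W

# 2) Post-fusion: encode each modality separately, then concatenate
#    same-resolution RGB / depth features inside the decoder (skip connection).
rgb_feat   = torch.randn(1, 256, 4, 4)                     # encoder output, RGB stream
depth_feat = torch.randn(1, 256, 4, 4)                     # encoder output, depth stream
decoder_in = torch.cat([rgb_feat, depth_feat], dim=1)      # N x 512 x 4 x 4

# 3) Result fusion: predict a saliency map per modality and merge the results.
sal_rgb, sal_depth = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
sal_fused = 0.5 * (sal_rgb + sal_depth)
```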
Disclosure of Invention
The technical problem to be solved by the invention is to provide a saliency detection method based on residual network and depth information fusion that improves saliency detection accuracy by using depth information and color image information efficiently.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a saliency detection method based on residual network and depth information fusion, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select Q original color real object images, together with the depth image and the real saliency detection label image corresponding to each original color real object image, to form a training set; denote the q-th original color real object image in the training set, its corresponding depth image and its corresponding real saliency detection label image as {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)} respectively, where Q is a positive integer, Q ≥ 200, q is a positive integer with initial value 1, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, H denotes their height, W and H are both divisible by 2, {Iq(i,j)} is an RGB color image, Iq(i,j) denotes the pixel value of the pixel whose coordinate position in {Iq(i,j)} is (i,j), {Dq(i,j)} is a single-channel depth image, Dq(i,j) denotes the pixel value of the pixel whose coordinate position in {Dq(i,j)} is (i,j), and Gq(i,j) denotes the pixel value of the pixel whose coordinate position in {Gq(i,j)} is (i,j);
step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map max pooling layers, 5 depth map neural network blocks, 4 depth map max pooling layers, 5 concatenation layers, 5 fusion neural network blocks and 4 deconvolution layers, and the output layer comprises 5 sub-output layers; the 5 RGB map neural network blocks and the 4 RGB map max pooling layers form the encoding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map max pooling layers form the encoding structure of the depth map, the encoding structure of the RGB map and the encoding structure of the depth map form the encoding layer of the convolutional neural network, and the 5 concatenation layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1
For the 1 st RGB map max pooling layer, its input receives CP1The output end of all the characteristic maps outputs 32 characteristic maps with the width ofAnd has a height ofThe feature map of (1), a set of all feature maps outputted is denoted as ZC1
For the 2 nd RGB map neural network block, its input receives ZC1The output end of all the characteristic graphs in (1) outputs 64 widthAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as CP2
For the 2 nd RGB map max pooling layer, its input receives CP2The output end of all the characteristic graphs in (1) outputs 64 widthAnd has a height ofThe feature map of (1), a set of all feature maps outputted is denoted as ZC2
For the 3 rd RGB map neural network block, its input receives ZC2The output end of all the characteristic maps outputs 128 widthAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as CP3
For the 3 rd RGB map max pooling layer, its input receives CP3The output end of all the characteristic maps outputs 128 widthAnd has a height ofThe feature map of (1), a set of all feature maps outputted is denoted as ZC3
For the 4 th RGB map neural network block, its input receives ZC3All the characteristic maps in (1) have 256 output widths ofAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as CP4
For the 4 th RGB map max pooling layer, its input receives CP4All the characteristic maps in (1) have 256 output widths ofAnd has a height ofThe feature map of (1), a set of all feature maps outputted is denoted as ZC4
For the 5 th RGB map neural network block, its input receives ZC4All the characteristic maps in (1) have 256 output widths ofAnd has a height ofThe feature map of (1) is a set of all feature maps outputted, and is denoted as CP5
For the 1st depth map neural network block, its input end receives the depth image for training output by the output end of the depth map input layer, and its output end outputs 32 feature maps of width W and height H; the set formed by all the output feature maps is denoted as DP1;
For the 1st depth map max pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as DC1;
For the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as DP2;
For the 2nd depth map max pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as DC2;
For the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as DP3;
For the 3rd depth map max pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as DC3;
For the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as DP4;
For the 4th depth map max pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as DC4;
For the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as DP5;
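As a reading aid, the two encoder streams described above can be sketched in PyTorch as follows; the channel widths (32/64/128/256/256) and the 2x2, stride-2 max pooling follow the text, while `block` is only a stand-in, since the internal layout of a neural network block (convolution, batch normalization, ReLU and a residual block) is specified later in this description:

```python
import torch
import torch.nn as nn

# Simplified stand-in for one neural network block (real layout given later).
def block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class EncoderStream(nn.Module):
    """One encoder stream: 5 blocks separated by 4 max-pooling layers."""
    def __init__(self, in_ch):
        super().__init__()
        widths = [32, 64, 128, 256, 256]
        self.blocks = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.blocks.append(block(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        feats = []                       # CP1..CP5 (or DP1..DP5)
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            feats.append(x)
            if i < 4:                    # 4 pooling layers, none after block 5
                x = self.pool(x)
        return feats

rgb_stream   = EncoderStream(in_ch=3)    # receives the R, G, B channel components
depth_stream = EncoderStream(in_ch=1)    # receives the single-channel depth image
```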
For the 1st concatenation layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and stacks them, and its output end outputs 512 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as Con1;
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as RH1;
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as FJ1;
For the 2nd concatenation layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and stacks them, and its output end outputs 768 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as Con2;
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as RH2;
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as FJ2;
For the 3rd concatenation layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and stacks them, and its output end outputs 512 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as Con3;
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as RH3;
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as FJ3;
For the 4th concatenation layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and stacks them, and its output end outputs 256 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as Con4;
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as RH4;
For the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps of width W and height H; the set formed by all the output feature maps is denoted as FJ4;
For the 5th concatenation layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and stacks them, and its output end outputs 128 feature maps of width W and height H; the set formed by all the output feature maps is denoted as Con5;
For the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps of width W and height H; the set formed by all the output feature maps is denoted as RH5;
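The decoder wiring described above (concatenation layer, fusion neural network block, deconvolution layer, repeated with skip connections from both encoder streams) can be sketched as follows; `fuse` is a simplified stand-in for a fusion neural network block, whose exact layout is given later, while the channel counts and the 2x2, stride-2 deconvolutions follow the text:

```python
import torch
import torch.nn as nn

# Simplified stand-in for a fusion neural network block (real layout given later).
def fuse(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse1 = fuse(512, 256)                       # Con1 -> RH1
        self.up1   = nn.ConvTranspose2d(256, 256, 2, 2)   # RH1  -> FJ1
        self.fuse2 = fuse(768, 256)                       # Con2 -> RH2
        self.up2   = nn.ConvTranspose2d(256, 256, 2, 2)   # RH2  -> FJ2
        self.fuse3 = fuse(512, 128)                       # Con3 -> RH3
        self.up3   = nn.ConvTranspose2d(128, 128, 2, 2)   # RH3  -> FJ3
        self.fuse4 = fuse(256, 64)                        # Con4 -> RH4
        self.up4   = nn.ConvTranspose2d(64, 64, 2, 2)     # RH4  -> FJ4
        self.fuse5 = fuse(128, 32)                        # Con5 -> RH5

    def forward(self, cp, dp):            # cp = [CP1..CP5], dp = [DP1..DP5]
        rh1 = self.fuse1(torch.cat([cp[4], dp[4]], dim=1))        # 256+256 = 512
        fj1 = self.up1(rh1)
        rh2 = self.fuse2(torch.cat([fj1, cp[3], dp[3]], dim=1))   # 256+256+256 = 768
        fj2 = self.up2(rh2)
        rh3 = self.fuse3(torch.cat([fj2, cp[2], dp[2]], dim=1))   # 256+128+128 = 512
        fj3 = self.up3(rh3)
        rh4 = self.fuse4(torch.cat([fj3, cp[1], dp[1]], dim=1))   # 128+64+64 = 256
        fj4 = self.up4(rh4)
        rh5 = self.fuse5(torch.cat([fj4, cp[0], dp[0]], dim=1))   # 64+32+32 = 128
        return [rh1, rh2, rh3, rh4, rh5]
```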
For the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as Out1, and one of the feature maps in Out1 is a saliency detection prediction map;
For the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as Out2, and one of the feature maps in Out2 is a saliency detection prediction map;
For the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as Out3, and one of the feature maps in Out3 is a saliency detection prediction map;
For the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as Out4, and one of the feature maps in Out4 is a saliency detection prediction map;
For the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps of width W and height H; the set formed by all the output feature maps is denoted as Out5, and one of the feature maps in Out5 is a saliency detection prediction map;
step 1_3: take each original color real object image in the training set as the RGB color image for training and take the depth image corresponding to each original color real object image in the training set as the depth image for training, input them into the convolutional neural network for training, and obtain the 5 saliency detection prediction maps corresponding to each original color real object image in the training set; the set formed by the 5 saliency detection prediction maps corresponding to {Iq(i,j)} is denoted as Sq;
step 1_4: scale the real saliency detection label image corresponding to each original color real object image in the training set to 5 different sizes, obtaining an image of width W/16 and height H/16, an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 5 images obtained by scaling the real saliency detection label image corresponding to {Iq(i,j)} is denoted as Yq;
step 1_5: calculate the loss function value between the set formed by the 5 saliency detection prediction maps corresponding to each original color real object image in the training set and the set formed by the 5 images obtained by scaling the corresponding real saliency detection label image; the loss function value between Sq and Yq is denoted as Lossq and is obtained using categorical cross-entropy;
step 1_6: repeat steps 1_3 to 1_5 V times to obtain the convolutional neural network training model and Q×V loss function values; find the smallest of the Q×V loss function values; then take the weight vector and bias term corresponding to the smallest loss function value as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted correspondingly as Wbest and bbest; where V > 1;
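A sketch of the training loop in step 1_6, reusing the multiscale_loss sketch above; the optimizer, its learning rate and the data-loader interface are placeholders not specified in the patent, and only the "keep the weights that gave the smallest loss value" logic follows the text:

```python
import copy
import torch

def train(model, train_loader, V, lr=1e-4):
    # Placeholder optimizer settings; the patent does not specify them.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for epoch in range(V):                       # repeat steps 1_3 to 1_5 V times
        for rgb, depth, label in train_loader:
            outs = model(rgb, depth)             # 5 saliency detection prediction maps
            loss = multiscale_loss(outs, label)  # loss sketched in step 1_5 above
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < best_loss:          # keep the minimum-loss weights (Wbest, bbest)
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state
```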
the test stage process comprises the following specific steps:
step 2_1: let {Itest(i',j')} denote the color real object image to be saliency-detected, and denote its corresponding depth image as {Dtest(i',j')}; where 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {Itest(i',j')} and {Dtest(i',j')}, H' denotes their height, Itest(i',j') denotes the pixel value of the pixel whose coordinate position in {Itest(i',j')} is (i',j'), and Dtest(i',j') denotes the pixel value of the pixel whose coordinate position in {Dtest(i',j')} is (i',j');
step 2_2: input the R channel component, the G channel component and the B channel component of {Itest(i',j')} together with {Dtest(i',j')} into the convolutional neural network training model, and use Wbest and bbest to make a prediction, obtaining 5 predicted saliency detection images of different sizes corresponding to {Itest(i',j')}; take the predicted saliency detection image whose size is the same as that of {Itest(i',j')} as the final predicted saliency detection image corresponding to {Itest(i',j')}, denoted as {Stest(i',j')}; where Stest(i',j') denotes the pixel value of the pixel whose coordinate position in {Stest(i',j')} is (i',j').
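A sketch of the test-stage prediction in step 2_2: the trained model (here assumed to take the color image and depth map and return the 5 sub-output maps) is run once, the full-resolution sub-output is kept, and one of its two channels (assumed here to be channel 1) is taken as the predicted saliency detection image:

```python
import torch

@torch.no_grad()
def predict(model, rgb, depth):
    model.eval()
    outs = model(rgb, depth)                  # 5 prediction maps, Out1..Out5
    full = outs[-1]                           # the W x H (input-size) output
    sal = torch.softmax(full, dim=1)[:, 1]    # assumed salient-class channel
    return sal                                # N x H x W predicted saliency map
```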
In step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure and consist of a first convolution layer, a first batch normalization layer, a first activation layer, a first residual block, a second convolution layer, a second batch normalization layer and a second activation layer arranged in sequence, where the input end of the first convolution layer is the input end of the neural network block in which it is located, the input end of the first batch normalization layer receives all the feature maps output by the output end of the first convolution layer, the input end of the first activation layer receives all the feature maps output by the output end of the first batch normalization layer, the input end of the first residual block receives all the feature maps output by the output end of the first activation layer, the input end of the second convolution layer receives all the feature maps output by the output end of the first residual block, the input end of the second batch normalization layer receives all the feature maps output by the output end of the second convolution layer, the input end of the second activation layer receives all the feature maps output by the output end of the second batch normalization layer, and the output end of the second activation layer is the output end of the neural network block in which it is located; the convolution kernel sizes of the first and second convolution layers are both 3×3, the numbers of convolution kernels are both 32, the zero padding parameters are both 1, the activation mode of the first and second activation layers is 'ReLU', and the output ends of the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps;
the 2nd RGB map neural network block and the 2nd depth map neural network block have the same structure and consist of a third convolution layer, a third batch normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer arranged in sequence and connected in the same manner as in the 1st block, the input end of the third convolution layer being the input end of the neural network block and the output end of the fourth activation layer being its output end; the convolution kernel sizes of the third and fourth convolution layers are both 3×3, the numbers of convolution kernels are both 64, the zero padding parameters are both 1, the activation mode of the third and fourth activation layers is 'ReLU', and the output ends of the third batch normalization layer, the fourth batch normalization layer, the third activation layer, the fourth activation layer and the second residual block each output 64 feature maps;
the 3rd RGB map neural network block and the 3rd depth map neural network block have the same structure and consist of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth batch normalization layer and a sixth activation layer arranged in sequence and connected in the same manner as in the 1st block, the input end of the fifth convolution layer being the input end of the neural network block and the output end of the sixth activation layer being its output end; the convolution kernel sizes of the fifth and sixth convolution layers are both 3×3, the numbers of convolution kernels are both 128, the zero padding parameters are both 1, the activation mode of the fifth and sixth activation layers is 'ReLU', and the output ends of the fifth batch normalization layer, the sixth batch normalization layer, the fifth activation layer, the sixth activation layer and the third residual block each output 128 feature maps;
the 4th RGB map neural network block and the 4th depth map neural network block have the same structure and consist of a seventh convolution layer, a seventh batch normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth batch normalization layer and an eighth activation layer arranged in sequence and connected in the same manner as in the 1st block, the input end of the seventh convolution layer being the input end of the neural network block and the output end of the eighth activation layer being its output end; the convolution kernel sizes of the seventh and eighth convolution layers are both 3×3, the numbers of convolution kernels are both 256, the zero padding parameters are both 1, the activation mode of the seventh and eighth activation layers is 'ReLU', and the output ends of the seventh batch normalization layer, the eighth batch normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block each output 256 feature maps;
the 5th RGB map neural network block and the 5th depth map neural network block have the same structure and consist of a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer arranged in sequence and connected in the same manner as in the 1st block, the input end of the ninth convolution layer being the input end of the neural network block and the output end of the tenth activation layer being its output end; the convolution kernel sizes of the ninth and tenth convolution layers are both 3×3, the numbers of convolution kernels are both 256, the zero padding parameters are both 1, the activation mode of the ninth and tenth activation layers is 'ReLU', and the output ends of the ninth batch normalization layer, the tenth batch normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps.
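The block layout described above (convolution 3x3, batch normalization, ReLU, residual block, convolution 3x3, batch normalization, ReLU) can be sketched as follows; the internal structure of the residual block itself is not spelled out in this excerpt, so a standard two-convolution identity-shortcut residual block is assumed:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Assumed residual block: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))    # identity shortcut

class NeuralNetworkBlock(nn.Module):
    """conv(3x3) -> BN -> ReLU -> residual block -> conv(3x3) -> BN -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),   # first convolution, zero padding 1
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            ResidualBlock(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),  # second convolution, zero padding 1
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

# e.g. the 1st RGB-map block: 3 input channels (R, G, B), 32 output channels.
first_rgb_block = NeuralNetworkBlock(3, 32)
```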
In step 1_2, the 4 RGB map max pooling layers and the 4 depth map max pooling layers are all max pooling layers with pooling size 2 and stride 2.
In step 1_2, the 5 fusion neural network blocks have the same structure and consist of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer arranged in sequence, where the input end of the eleventh convolution layer is the input end of the fusion neural network block in which it is located, the input end of the eleventh batch normalization layer receives all the feature maps output by the output end of the eleventh convolution layer, the input end of the eleventh activation layer receives all the feature maps output by the output end of the eleventh batch normalization layer, the input end of the sixth residual block receives all the feature maps output by the output end of the eleventh activation layer, the input end of the twelfth convolution layer receives all the feature maps output by the output end of the sixth residual block, the input end of the twelfth batch normalization layer receives all the feature maps output by the output end of the twelfth convolution layer, the input end of the twelfth activation layer receives all the feature maps output by the output end of the twelfth batch normalization layer, and the output end of the twelfth activation layer is the output end of the fusion neural network block in which it is located; in all 5 fusion neural network blocks the convolution kernel sizes of the eleventh and twelfth convolution layers are 3×3, the zero padding parameters are 1, and the activation mode of the eleventh and twelfth activation layers is 'ReLU'; in the 1st and 2nd fusion neural network blocks the numbers of convolution kernels are both 256 and the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 256 feature maps; in the 3rd fusion neural network block the numbers of convolution kernels are both 128 and these output ends each output 128 feature maps; in the 4th fusion neural network block the numbers of convolution kernels are both 64 and these output ends each output 64 feature maps; and in the 5th fusion neural network block the numbers of convolution kernels are both 32 and these output ends each output 32 feature maps.
In step 1_2, the convolution kernel sizes of the 1st and 2nd deconvolution layers are both 2×2 with 256 convolution kernels, stride 2 and zero padding parameter 0; the convolution kernel size of the 3rd deconvolution layer is 2×2 with 128 convolution kernels, stride 2 and zero padding parameter 0; and the convolution kernel size of the 4th deconvolution layer is 2×2 with 64 convolution kernels, stride 2 and zero padding parameter 0.
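With a 2x2 kernel, stride 2 and zero padding 0, each deconvolution layer exactly doubles the spatial size, since output = (input - 1) x stride - 2 x padding + kernel = 2 x input; for example:

```python
import torch
import torch.nn as nn

# 3rd deconvolution layer: 128 kernels of size 2x2, stride 2, zero padding 0.
deconv3 = nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2, padding=0)
x = torch.randn(1, 128, 64, 64)     # e.g. a W/4 x H/4 feature map with W = H = 256
print(deconv3(x).shape)             # torch.Size([1, 128, 128, 128]) -> W/2 x H/2
```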
In the step 1_2, the 5 sub-output layers have the same structure and consist of a thirteenth convolution layer; wherein, the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero padding parameter is 0.
Compared with the prior art, the invention has the advantages that:
1) The convolutional neural network constructed by the method detects salient objects end to end, is easy to train, and is convenient and fast; the color real object images in the training set and their corresponding depth images are input into the convolutional neural network for training to obtain the convolutional neural network training model, and the color real object image to be saliency-detected together with its corresponding depth image is then input into the convolutional neural network training model to predict the corresponding predicted saliency detection image.
2) The method uses a post-fusion strategy for the depth information: the depth information and the color image information of the corresponding encoding layers are concatenated into the corresponding decoding layers, which avoids the noise introduced by pre-fusion in the encoding stage and, at the same time, allows the complementary information of the color image information and the depth information to be learned fully while the convolutional neural network training model is trained, so that better results are obtained on both the training set and the test set.
3) The invention adopts multi-scale supervision: the spatial detail information of objects can be optimized during up-sampling through the deconvolution layers, prediction maps are output at different sizes and supervised by label maps of the corresponding sizes, and this guides the convolutional neural network training model to construct the saliency detection prediction map progressively, so that better results are obtained on both the training set and the test set.
Drawings
FIG. 1 is a schematic diagram of the structure of a convolutional neural network constructed by the method of the present invention;
FIG. 2a shows the precision-recall curves obtained by predicting each color real object image in the NLPR real object image database test set with the method of the present invention, reflecting the saliency detection performance of the method;
FIG. 2b shows the mean absolute error obtained by predicting each color real object image in the NLPR real object image database test set with the method of the present invention, reflecting the saliency detection performance of the method;
FIG. 2c shows the F-measure values obtained by predicting each color real object image in the NLPR real object image database test set with the method of the present invention, reflecting the saliency detection performance of the method;
FIG. 3a is the 1 st original color real object image of the same scene;
FIG. 3b is a depth image corresponding to FIG. 3 a;
FIG. 3c is a predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention;
FIG. 4a is the 2 nd original color real object image of the same scene;
FIG. 4b is a depth image corresponding to FIG. 4 a;
FIG. 4c is a predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention;
FIG. 5a is the 3 rd original color real object image of the same scene;
FIG. 5b is a depth image corresponding to FIG. 5 a;
FIG. 5c is a predicted saliency detection image obtained by predicting FIG. 5a using the method of the present invention;
FIG. 6a is the 4 th original color real object image of the same scene;
FIG. 6b is a depth image corresponding to FIG. 6 a;
fig. 6c is a predicted saliency detection image obtained by predicting fig. 6a by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The saliency detection method based on residual network and depth information fusion proposed by the invention comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original color real object images, a depth image and a real significance detection label image corresponding to each original color real object image, forming a training set, and correspondingly marking the Q-th original color real object image in the training set, the depth image corresponding to the original color real object image and the real significance detection label image as { I }q(i,j)}、{Dq(i,j)}、Wherein Q is a positive integer, Q is not less than 200, if Q is 367, Q is a positive integer, the initial value of Q is 1, 1 is not less than Q is not less than Q, 1 is not less than I is not less than W, 1 is not less than j is not less than H, W represents { I ≦ Hq(i,j)}、{Dq(i,j)}、H represents { I }q(i,j)}、{Dq(i,j)}、W and H can be divided by 2, for example, W512, H512, { Iq(I, j) } RGB color image, Iq(I, j) represents { Iq(i, j) } pixel value of pixel point whose coordinate position is (i, j) { Dq(i, j) } is a single-channel depth image, Dq(i, j) represents { DqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),to representThe middle coordinate position is the pixel value of the pixel point of (i, j); in this case, the original color real object image is directly selected from 800 images in the training set of the database NLPR.
Step 1_ 2: constructing a convolutional neural network: as shown in fig. 1, the convolutional neural network includes an input layer, a hidden layer, and an output layer, where the input layer includes an RGB map input layer and a depth map input layer, the hidden layer includes 5 RGB map neural network blocks, 4 RGB map max pooling layers (Pool), 5 depth map neural network blocks, 4 depth map max pooling layers, 5 cascade layers, 5 fusion neural network blocks, and 4 deconvolution layers, and the output layer includes 5 sub-output layers; the coding structure of the depth map is formed by the 5 RGB map neural network blocks and the 4 RGB map maximum pooling layers, the coding structure of the depth map is formed by the 5 depth map neural network blocks and the 4 depth map maximum pooling layers, the coding structure of the RGB map and the coding structure of the depth map form a coding layer of the convolutional neural network, and the coding layer of the convolutional neural network is formed by the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers.
For the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; among them, the width of the RGB color image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; the training depth image has a width W and a height H.
For the 1st RGB map neural network block, its input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, and its output end outputs 32 feature maps of width W and height H; the set formed by all the output feature maps is denoted as CP1.
For the 1st RGB map max pooling layer, its input end receives all the feature maps in CP1, and its output end outputs 32 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as ZC1.
For the 2nd RGB map neural network block, its input end receives all the feature maps in ZC1, and its output end outputs 64 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as CP2.
For the 2nd RGB map max pooling layer, its input end receives all the feature maps in CP2, and its output end outputs 64 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as ZC2.
For the 3rd RGB map neural network block, its input end receives all the feature maps in ZC2, and its output end outputs 128 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as CP3.
For the 3rd RGB map max pooling layer, its input end receives all the feature maps in CP3, and its output end outputs 128 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as ZC3.
For the 4th RGB map neural network block, its input end receives all the feature maps in ZC3, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as CP4.
For the 4th RGB map max pooling layer, its input end receives all the feature maps in CP4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as ZC4.
For the 5th RGB map neural network block, its input end receives all the feature maps in ZC4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as CP5.
For the 1st depth map neural network block, its input end receives the depth image for training output by the output end of the depth map input layer, and its output end outputs 32 feature maps of width W and height H; the set formed by all the output feature maps is denoted as DP1.
For the 1st depth map max pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as DC1.
For the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as DP2.
For the 2nd depth map max pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as DC2.
For the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as DP3.
For the 3rd depth map max pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as DC3.
For the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as DP4.
For the 4th depth map max pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as DC4.
For the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as DP5.
For the 1st concatenation layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5 and stacks them, and its output end outputs 512 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as Con1.
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps of width W/16 and height H/16; the set formed by all the output feature maps is denoted as RH1.
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as FJ1.
For the 2nd concatenation layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4 and stacks them, and its output end outputs 768 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as Con2.
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps of width W/8 and height H/8; the set formed by all the output feature maps is denoted as RH2.
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as FJ2.
For the 3rd concatenation layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3 and stacks them, and its output end outputs 512 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as Con3.
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps of width W/4 and height H/4; the set formed by all the output feature maps is denoted as RH3.
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as FJ3.
For the 4th concatenation layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2 and stacks them, and its output end outputs 256 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as Con4.
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps of width W/2 and height H/2; the set formed by all the output feature maps is denoted as RH4.
For the 4th deconvolution layer, its input end receives all the feature maps in RH4, and its output end outputs 64 feature maps of width W and height H; the set formed by all the output feature maps is denoted as FJ4.
For the 5th concatenation layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1 and stacks them, and its output end outputs 128 feature maps of width W and height H; the set formed by all the output feature maps is denoted as Con5.
For the 5th fusion neural network block, its input end receives all the feature maps in Con5, and its output end outputs 32 feature maps of width W and height H; the set formed by all the output feature maps is denoted as RH5.
For the 1st sub-output layer, its input end receives all the feature maps in RH1, and its output end outputs 2 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as Out1, and the 2nd feature map in Out1 is a saliency detection prediction map.
For the 2nd sub-output layer, its input end receives all the feature maps in RH2, and its output end outputs 2 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as Out2, and the 2nd feature map in Out2 is a saliency detection prediction map.
For the 3rd sub-output layer, its input end receives all the feature maps in RH3, and its output end outputs 2 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as Out3, and the 2nd feature map in Out3 is a saliency detection prediction map.
For the 4th sub-output layer, its input end receives all the feature maps in RH4, and its output end outputs 2 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as Out4, and the 2nd feature map in Out4 is a saliency detection prediction map.
For the 5th sub-output layer, its input end receives all the feature maps in RH5, and its output end outputs 2 feature maps with a width of W and a height of H; the set of all output feature maps is denoted as Out5, and the 2nd feature map in Out5 is a saliency detection prediction map.
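The decoder chain described above (cascade layer → fusion neural network block → deconvolution layer) can be illustrated with a minimal PyTorch sketch. It assumes W = H = 224, and a single 3 × 3 convolution stands in for the full fusion block detailed later, so the shapes, not the exact layers, are the point here.

```python
import torch
import torch.nn as nn

# Hypothetical tensors standing in for CP5 and DP5 (256 maps each at W/16 x H/16; 224/16 = 14).
cp5 = torch.randn(1, 256, 14, 14)
dp5 = torch.randn(1, 256, 14, 14)

con1 = torch.cat([cp5, dp5], dim=1)                      # 1st cascade layer: 512 maps
fusion1 = nn.Conv2d(512, 256, kernel_size=3, padding=1)  # stand-in for the 1st fusion block
rh1 = torch.relu(fusion1(con1))                          # 256 maps at W/16 x H/16
deconv1 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 1st deconvolution layer
fj1 = deconv1(rh1)                                       # 256 maps at W/8 x H/8
print(con1.shape, rh1.shape, fj1.shape)  # (1,512,14,14) (1,256,14,14) (1,256,28,28)
```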
Step 1_ 3: taking each original color real object image in the training set as an RGB color image for training, taking a depth image corresponding to each original color real object image in the training set as a depth image for training, inputting the depth image into a convolutional neural network for training to obtain 5 saliency detection prediction images corresponding to each original color real object image in the training set, and taking { I } as a prediction image for the saliency detection, and calculating the saliency of each original color real object image in the training set according to the prediction image for the saliency detection, the saliency of each original color real object image in theq(i, j) } corresponding 5 significance testsThe set of prediction graph constructs is denoted as
Step 1_ 4: scaling the real significance detection label image corresponding to each original color real object image in the training set by 5 different sizes to obtain the width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width W and height H will be { I }q(i, j) } the set formed by 5 images obtained by zooming the corresponding real significance detection image is recorded as
Step 1_ 5: calculating each original color real object image in training setLoss function values between a set formed by 5 saliency detection prediction images corresponding to the images and a set formed by 5 images obtained by scaling real saliency detection images corresponding to the original color real object images are obtainedAndthe value of the loss function in between is recorded asObtained using categorical cross entropy (categorical cross entropy).
Step 1_ 6: repeatedly executing the step 1_3 to the step 1_5 for V times to obtain a convolutional neural network training model, and obtaining Q multiplied by V loss function values; then finding out the loss function value with the minimum value from the Q multiplied by V loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network training model, and correspondingly marking as WbestAnd bbest(ii) a Where V > 1, in this example V is 300.
The test stage process comprises the following specific steps:
step 2_ 1: order toRepresenting a color real object image to be saliency detected, willThe corresponding depth image is notedWherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' representsAndwidth of (A), H' representsAndthe height of (a) of (b),to representThe pixel value of the pixel point with the middle coordinate position (i ', j'),to representAnd the pixel value of the pixel point with the middle coordinate position of (i ', j').
Step 2_ 2: will be provided withR channel component, G channel component and B channel component of andinputting into a convolutional neural network training model and using WbestAnd bbestMaking a prediction to obtainCorresponding 5 prediction significance detection images with different sizes are obtained by comparing the sizes withAs the predicted saliency detection image of uniform sizeCorresponding final predicted saliency detection images and notationWherein,to representAnd the pixel value of the pixel point with the middle coordinate position of (i ', j').
In this embodiment, in step 1_2, the 1st RGB map neural network block and the 1st depth map neural network block have the same structure, and each consists of a first convolution layer (Convolution, Conv), a first batch normalization layer (Batch Normalization, BN), a first activation layer (Activation, Act), a first residual block (Residual Block, RB), a second convolution layer, a second batch normalization layer and a second activation layer arranged in sequence. The input end of the first convolution layer is the input end of the neural network block in which it is located; the input end of the first batch normalization layer receives all the feature maps output by the output end of the first convolution layer; the input end of the first activation layer receives all the feature maps output by the output end of the first batch normalization layer; the input end of the first residual block receives all the feature maps output by the output end of the first activation layer; the input end of the second convolution layer receives all the feature maps output by the output end of the first residual block; the input end of the second batch normalization layer receives all the feature maps output by the output end of the second convolution layer; the input end of the second activation layer receives all the feature maps output by the output end of the second batch normalization layer; and the output end of the second activation layer is the output end of the neural network block in which it is located. The convolution kernel sizes (kernel_size) of the first and second convolution layers are both 3 × 3, the numbers of convolution kernels (filters) are both 32, and the zero padding parameters (padding) are both 1; the activation mode of the first and second activation layers is "Relu"; and the output ends of the first batch normalization layer, the second batch normalization layer, the first activation layer, the second activation layer and the first residual block each output 32 feature maps.
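A minimal PyTorch sketch of this block layout; the internal structure of the residual block is not specified in this passage, so the two-convolution identity-shortcut form used below is an assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Assumed layout: two 3x3 convolutions with an identity shortcut.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class EncoderBlock(nn.Module):
    # Conv -> BN -> ReLU -> ResidualBlock -> Conv -> BN -> ReLU, as described for the
    # 1st RGB/depth map neural network block (in_ch = 3 for RGB, 1 for depth; out_ch = 32).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            ResidualBlock(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```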
In this embodiment, the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third active layer, a second residual block, a fourth convolution layer, a fourth normalization layer, and a fourth active layer, which are sequentially arranged, where an input end of the third convolution layer is an input end of the neural network block where the third convolution layer is located, an input end of the third normalization layer receives all feature maps output by an output end of the third convolution layer, an input end of the third active layer receives all feature maps output by an output end of the third normalization layer, an input end of the second residual block receives all feature maps output by an output end of the third active layer, an input end of the fourth convolution layer receives all feature maps output by an output end of the second residual block, and an input end of the fourth normalization layer receives all feature maps output by an output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block.
In this specific embodiment, the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure, and are composed of a fifth convolution layer, a fifth normalization layer, a fifth active layer, a third residual block, a sixth convolution layer, a sixth normalization layer, and a sixth active layer, which are sequentially arranged, where an input end of the fifth convolution layer is an input end of the neural network block where the fifth convolution layer is located, an input end of the fifth normalization layer receives all feature maps output by an output end of the fifth convolution layer, an input end of the fifth active layer receives all feature maps output by an output end of the fifth normalization layer, an input end of the third residual block receives all feature maps output by an output end of the fifth active layer, an input end of the sixth convolution layer receives all feature maps output by an output end of the third residual block, and an input end of the sixth normalization layer receives all feature maps output by an output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and 128 feature graphs are output from output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block respectively.
In this specific embodiment, the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh active layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer, and an eighth active layer, which are sequentially arranged, an input end of the seventh convolution layer is an input end of the neural network block where the seventh convolution layer is located, an input end of the seventh normalization layer receives all feature maps output by an output end of the seventh convolution layer, an input end of the seventh active layer receives all feature maps output by an output end of the seventh normalization layer, an input end of the fourth residual block receives all feature maps output by an output end of the seventh active layer, an input end of the eighth convolution layer receives all feature maps output by an output end of the fourth residual block, an input end of the eighth normalization layer receives all feature maps output by an output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block.
In this embodiment, the 5th RGB map neural network block and the 5th depth map neural network block have the same structure, and each consists of a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a fifth residual block, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer arranged in sequence. The input end of the ninth convolution layer is the input end of the neural network block in which it is located; the input end of the ninth batch normalization layer receives all the feature maps output by the output end of the ninth convolution layer; the input end of the ninth activation layer receives all the feature maps output by the output end of the ninth batch normalization layer; the input end of the fifth residual block receives all the feature maps output by the output end of the ninth activation layer; the input end of the tenth convolution layer receives all the feature maps output by the output end of the fifth residual block; the input end of the tenth batch normalization layer receives all the feature maps output by the output end of the tenth convolution layer; the input end of the tenth activation layer receives all the feature maps output by the output end of the tenth batch normalization layer; and the output end of the tenth activation layer is the output end of the neural network block in which it is located. The convolution kernel sizes of the ninth and tenth convolution layers are both 3 × 3, the numbers of convolution kernels are both 256, and the zero padding parameters are both 1; the activation mode of the ninth and tenth activation layers is "Relu"; and the output ends of the ninth batch normalization layer, the tenth batch normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block each output 256 feature maps.
In this embodiment, in step 1_2, the 4 RGB map max pooling layers and the 4 depth map max pooling layers are all max pooling layers, and for each of them the pooling size (pool_size) is 2 and the step size (stride) is 2.
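In PyTorch this corresponds simply to nn.MaxPool2d with pooling size 2 and stride 2, which halves the width and height, e.g.:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # pooling size 2, step size 2
x = torch.randn(1, 32, 224, 224)               # e.g. CP1: 32 maps of width W = 224, height H = 224
print(pool(x).shape)                           # torch.Size([1, 32, 112, 112]) -> W/2 x H/2
```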
In this embodiment, in step 1_2, the 5 fusion neural network blocks have the same structure, and each consists of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth batch normalization layer and a twelfth activation layer arranged in sequence. The input end of the eleventh convolution layer is the input end of the fusion neural network block in which it is located; the input end of the eleventh batch normalization layer receives all the feature maps output by the output end of the eleventh convolution layer; the input end of the eleventh activation layer receives all the feature maps output by the output end of the eleventh batch normalization layer; the input end of the sixth residual block receives all the feature maps output by the output end of the eleventh activation layer; the input end of the twelfth convolution layer receives all the feature maps output by the output end of the sixth residual block; the input end of the twelfth batch normalization layer receives all the feature maps output by the output end of the twelfth convolution layer; the input end of the twelfth activation layer receives all the feature maps output by the output end of the twelfth batch normalization layer; and the output end of the twelfth activation layer is the output end of the fusion neural network block in which it is located. In every fusion neural network block the convolution kernel sizes of the eleventh and twelfth convolution layers are both 3 × 3, the zero padding parameters are both 1, and the activation mode of the eleventh and twelfth activation layers is "Relu". In the 1st and 2nd fusion neural network blocks the numbers of convolution kernels are both 256, and the output ends of the eleventh batch normalization layer, the twelfth batch normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 256 feature maps; in the 3rd fusion neural network block the numbers of convolution kernels are both 128 and the corresponding output ends each output 128 feature maps; in the 4th fusion neural network block the numbers of convolution kernels are both 64 and the corresponding output ends each output 64 feature maps; and in the 5th fusion neural network block the numbers of convolution kernels are both 32 and the corresponding output ends each output 32 feature maps.
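Reusing the ResidualBlock class from the encoder-block sketch above, the five fusion neural network blocks could be instantiated as follows; the input channel counts (512, 768, 512, 256, 128) follow the cascade layers described earlier, and the output channel counts (256, 256, 128, 64, 32) follow this paragraph.

```python
import torch.nn as nn

def fusion_block(in_ch, out_ch):
    # Assumed to mirror the encoder blocks: Conv -> BN -> ReLU -> ResidualBlock -> Conv -> BN -> ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        ResidualBlock(out_ch),            # ResidualBlock as sketched for the encoder blocks
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

fusion_blocks = nn.ModuleList([
    fusion_block(512, 256),   # 1st fusion block (Con1 -> RH1)
    fusion_block(768, 256),   # 2nd fusion block (Con2 -> RH2)
    fusion_block(512, 128),   # 3rd fusion block (Con3 -> RH3)
    fusion_block(256, 64),    # 4th fusion block (Con4 -> RH4)
    fusion_block(128, 32),    # 5th fusion block (Con5 -> RH5)
])
```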
In this embodiment, in step 1_2, the convolution kernel sizes of the 1 st and 2 nd deconvolution layers are both 2 × 2, the number of convolution kernels is 256, the step size is 2, and the zero padding parameter is 0, the convolution kernel size of the 3 rd deconvolution layer is 2 × 2, the number of convolution kernels is 128, the step size is 2, and the zero padding parameter is 0, the convolution kernel size of the 4 th deconvolution layer is 2 × 2, the number of convolution kernels is 64, the step size is 2, and the zero padding parameter is 0.
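The four deconvolution layers map directly onto nn.ConvTranspose2d with the stated parameters, for example:

```python
import torch.nn as nn

# Kernel 2x2, stride 2, zero padding 0; 256, 256, 128 and 64 kernels respectively.
deconvs = nn.ModuleList([
    nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),  # 1st deconvolution layer
    nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2, padding=0),  # 2nd deconvolution layer
    nn.ConvTranspose2d(128, 128, kernel_size=2, stride=2, padding=0),  # 3rd deconvolution layer
    nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2, padding=0),    # 4th deconvolution layer
])
```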
In this embodiment, in step 1_2, the 5 sub-output layers have the same structure and are composed of the thirteenth convolution layer; wherein, the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero padding parameter is 0.
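Likewise, the five sub-output layers are plain 1 × 1 convolutions with 2 kernels and no zero padding; the input channel counts (256, 256, 128, 64, 32) are taken from RH1 to RH5.

```python
import torch.nn as nn

# One 1x1 convolution per sub-output layer, producing 2 feature maps each.
sub_outputs = nn.ModuleList([
    nn.Conv2d(c, 2, kernel_size=1, padding=0) for c in (256, 256, 128, 64, 32)
])
```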
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture proposed by the method of the invention is built with the Python-based deep learning library PyTorch 0.4.1. The test set of the real object image database NLPR (200 real object images) is used to analyze the saliency detection performance of the color real object images predicted by the method of the invention. Here, 3 common objective parameters of saliency detection methods are used as evaluation indexes of the detection performance of the predicted saliency detection images, namely the Precision-Recall curve, the Mean Absolute Error (MAE) and the F-Measure.
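For reference, MAE and the F-Measure can be computed as in the sketch below; the adaptive threshold (twice the mean saliency) and β² = 0.3 are common conventions and are assumptions here, since the document does not state how the scores were computed.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and ground truth, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-Measure with the conventional beta^2 = 0.3 and an adaptive threshold."""
    thresh = min(2 * pred.mean(), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```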
The method of the invention is used to predict each color real object image in the NLPR test set, obtaining a predicted saliency detection image for each color real object image. The Precision-Recall curve (PR Curve) reflecting the saliency detection performance of the method of the invention is shown in Fig. 2a, the Mean Absolute Error (MAE) reflecting the saliency detection performance is shown in Fig. 2b and has a value of 0.058, and the F-Measure reflecting the saliency detection performance is shown in Fig. 2c and has a value of 0.796. As can be seen from Figs. 2a to 2c, the saliency detection results obtained by the method of the invention on color real object images are good, which indicates that obtaining predicted saliency detection images of color real object images with the method of the invention is feasible and effective.
FIG. 3a shows the 1 st original color real object image of the same scene, FIG. 3b shows the depth image corresponding to FIG. 3a, and FIG. 3c shows the predicted saliency detection image obtained by predicting FIG. 3a using the method of the present invention; FIG. 4a shows the 2 nd original color real object image of the same scene, FIG. 4b shows the depth image corresponding to FIG. 4a, and FIG. 4c shows the predicted saliency detection image obtained by predicting FIG. 4a using the method of the present invention; FIG. 5a shows the 3 rd original color real object image of the same scene, FIG. 5b shows the depth image corresponding to FIG. 5a, and FIG. 5c shows the predicted saliency detection image obtained by predicting FIG. 5a using the method of the present invention; fig. 6a shows the 4 th original color real object image of the same scene, fig. 6b shows the depth image corresponding to fig. 6a, and fig. 6c shows the predicted saliency detection image obtained by predicting fig. 6a by using the method of the present invention. Comparing fig. 3a and 3c, comparing fig. 4a and 4c, comparing fig. 5a and 5c, and comparing fig. 6a and 6c, it can be seen that the detection accuracy of the predicted saliency detection image obtained by the method of the present invention is higher.

Claims (6)

1. A significance detection method based on residual error network and depth information fusion is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original color real object images, depth images corresponding to each original color real object image and real significance detection label images, forming a training set, and obtaining the Q-th original color real object image in the training set and the corresponding original color real object imageThe depth image and the real significance detection label image of (1) are correspondingly marked as { Iq(i,j)}、{Dq(i,j)}、Wherein Q is a positive integer, Q is not less than 200, Q is a positive integer, the initial value of Q is 1, Q is not less than 1 and not more than Q, I is not less than 1 and not more than W, j is not less than 1 and not more than H, W represents { I ≦q(i,j)}、{Dq(i,j)}、H represents { I }q(i,j)}、{Dq(i,j)}、W and H can be divided by 2, { Iq(I, j) } RGB color image, Iq(I, j) represents { Iq(i, j) } pixel value of pixel point whose coordinate position is (i, j) { Dq(i, j) } is a single-channel depth image, Dq(i, j) represents { DqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),to representThe middle coordinate position is the pixel value of the pixel point of (i, j);
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer, wherein the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map max pooling layers, 5 depth map neural network blocks, 4 depth map max pooling layers, 5 cascade layers, 5 fusion neural network blocks and 4 deconvolution layers, and the output layer comprises 5 sub-output layers; the 5 RGB map neural network blocks and the 4 RGB map max pooling layers form the coding structure of the RGB map, the 5 depth map neural network blocks and the 4 depth map max pooling layers form the coding structure of the depth map, the coding structure of the RGB map and the coding structure of the depth map form the coding layer of the convolutional neural network, and the 5 cascade layers, the 5 fusion neural network blocks and the 4 deconvolution layers form the decoding layer of the convolutional neural network;
for the RGB image input layer, the input end of the RGB image input layer receives an R channel component, a G channel component and a B channel component of an RGB color image for training, and the output end of the RGB image input layer outputs the R channel component, the G channel component and the B channel component of the RGB color image for training to the hidden layer; wherein, the width of the RGB color image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the depth image for training corresponding to the RGB color image for training received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the depth image for training to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the 1 st RGB map neural network block, the input end receives the R channel component, the G channel component and the B channel component of the RGB color image for training output by the output end of the RGB map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as CP1
For the 1st RGB map max pooling layer, its input end receives all the feature maps in CP1, and its output end outputs 32 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as ZC1;
For the 2nd RGB map neural network block, its input end receives all the feature maps in ZC1, and its output end outputs 64 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as CP2;
For the 2nd RGB map max pooling layer, its input end receives all the feature maps in CP2, and its output end outputs 64 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as ZC2;
For the 3rd RGB map neural network block, its input end receives all the feature maps in ZC2, and its output end outputs 128 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as CP3;
For the 3rd RGB map max pooling layer, its input end receives all the feature maps in CP3, and its output end outputs 128 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as ZC3;
For the 4th RGB map neural network block, its input end receives all the feature maps in ZC3, and its output end outputs 256 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as CP4;
For the 4th RGB map max pooling layer, its input end receives all the feature maps in CP4, and its output end outputs 256 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as ZC4;
For the 5th RGB map neural network block, its input end receives all the feature maps in ZC4, and its output end outputs 256 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as CP5;
For the 1 st depth map neural network block, the input end receives the training depth image output by the output end of the depth map input layer, the output end outputs 32 feature maps with width W and height H, and the set formed by all the output feature maps is recorded as DP1
For the 1st depth map max pooling layer, its input end receives all the feature maps in DP1, and its output end outputs 32 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as DC1;
For the 2nd depth map neural network block, its input end receives all the feature maps in DC1, and its output end outputs 64 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as DP2;
For the 2nd depth map max pooling layer, its input end receives all the feature maps in DP2, and its output end outputs 64 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as DC2;
For the 3rd depth map neural network block, its input end receives all the feature maps in DC2, and its output end outputs 128 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as DP3;
For the 3rd depth map max pooling layer, its input end receives all the feature maps in DP3, and its output end outputs 128 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as DC3;
For the 4th depth map neural network block, its input end receives all the feature maps in DC3, and its output end outputs 256 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as DP4;
For the 4th depth map max pooling layer, its input end receives all the feature maps in DP4, and its output end outputs 256 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as DC4;
For the 5th depth map neural network block, its input end receives all the feature maps in DC4, and its output end outputs 256 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as DP5;
For the 1st cascade layer, its input end receives all the feature maps in CP5 and all the feature maps in DP5, superposes all the feature maps in CP5 and all the feature maps in DP5, and its output end outputs 512 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as Con1;
For the 1st fusion neural network block, its input end receives all the feature maps in Con1, and its output end outputs 256 feature maps with a width of W/16 and a height of H/16; the set of all output feature maps is denoted as RH1;
For the 1st deconvolution layer, its input end receives all the feature maps in RH1, and its output end outputs 256 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as FJ1;
For the 2nd cascade layer, its input end receives all the feature maps in FJ1, all the feature maps in CP4 and all the feature maps in DP4, superposes them, and its output end outputs 768 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as Con2;
For the 2nd fusion neural network block, its input end receives all the feature maps in Con2, and its output end outputs 256 feature maps with a width of W/8 and a height of H/8; the set of all output feature maps is denoted as RH2;
For the 2nd deconvolution layer, its input end receives all the feature maps in RH2, and its output end outputs 256 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as FJ2;
For the 3rd cascade layer, its input end receives all the feature maps in FJ2, all the feature maps in CP3 and all the feature maps in DP3, superposes them, and its output end outputs 512 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as Con3;
For the 3rd fusion neural network block, its input end receives all the feature maps in Con3, and its output end outputs 128 feature maps with a width of W/4 and a height of H/4; the set of all output feature maps is denoted as RH3;
For the 3rd deconvolution layer, its input end receives all the feature maps in RH3, and its output end outputs 128 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as FJ3;
For the 4th cascade layer, its input end receives all the feature maps in FJ3, all the feature maps in CP2 and all the feature maps in DP2, superposes them, and its output end outputs 256 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as Con4;
For the 4th fusion neural network block, its input end receives all the feature maps in Con4, and its output end outputs 64 feature maps with a width of W/2 and a height of H/2; the set of all output feature maps is denoted as RH4;
For the 4 th deconvolution layer, its input terminal receives RH4The output end of all the feature maps outputs 64 feature maps with width W and height H, and the set of all the output feature maps is denoted as FJ4
For the 5th cascade layer, its input end receives all the feature maps in FJ4, all the feature maps in CP1 and all the feature maps in DP1, superposes them, and its output end outputs 128 feature maps with a width of W and a height of H; the set of all output feature maps is denoted as Con5;
For the 5 th converged neural network block, its input receives Con5The output end of all the feature maps outputs 32 feature maps with width W and height H, and the set of all the output feature maps is denoted as RH5
For the 1 st sub-output layer, its input receives RH1The output end of all the characteristic graphs in (1) outputs 2 characteristic graphs with the width ofAnd has a height ofThe feature map of (1), the set of all feature maps of output is denoted as Out1,Out1One of the feature maps is a significance detection prediction map;
for the 2 nd sub-output layer, its input receives RH2The output end of all the characteristic graphs in (1) outputs 2 characteristic graphs with the width ofAnd has a height ofThe feature map of (1), the set of all feature maps of output is denoted as Out2,Out2One of the feature maps is a significance detection prediction map;
for the 3 rd sub-output layer, its input receives RH3The output end of all the characteristic graphs in (1) outputs 2 characteristic graphs with the width ofAnd has a height ofThe feature map of (1), the set of all feature maps of output is denoted as Out3,Out3One of the feature maps is a significance detection prediction map;
for the 4 th sub-output layer, its input receives RH4The output end of all the characteristic graphs in (1) outputs 2 characteristic graphs with the width ofAnd has a height ofThe feature map of (1), the set of all feature maps of output is denoted as Out4,Out4One of the feature maps is a significance detection prediction map;
for the 5 th sub-output layer, its input receives RH52 feature maps with width W and height H are output from the output end of all feature maps in (1), and the set formed by all output feature maps is recorded as Out5,Out5One of the feature maps is a significance detection prediction map;
step 1_ 3: taking each original color real object image in the training set as an RGB color image for training, taking a depth image corresponding to each original color real object image in the training set as a depth image for training, inputting the depth image into a convolutional neural network for training to obtain 5 saliency detection prediction images corresponding to each original color real object image in the training set, and taking { I } as a prediction image for the saliency detection, and calculating the saliency of each original color real object image in the training set according to the prediction image for the saliency detection, the saliency of each original color real object image in theq(i, j) } the set formed by the 5 saliency detection prediction maps is marked as
Step 1_ 4: scaling the real significance detection label image corresponding to each original color real object image in the training set by 5 different sizes to obtain the width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width ofAnd has a height ofAn image of width W and height H will be { I }q(i, j) } the set formed by 5 images obtained by zooming the corresponding real significance detection image is recorded as
Step 1_ 5: calculating loss function values between a set formed by 5 saliency detection prediction images corresponding to each original color real object image in a training set and a set formed by 5 images obtained by scaling real saliency detection images corresponding to the original color real object images, and calculating loss function values between the setsAndthe value of the loss function in between is recorded asObtaining by adopting a classified cross entropy;
step 1_ 6: repeatedly executing the step 1_3 to the step 1_5 for V times to obtain a convolutional neural network training model, and obtaining Q multiplied by V loss function values; then finding out the loss function value with the minimum value from the Q multiplied by V loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network training model, and correspondingly marking as WbestAnd bbest(ii) a Wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_ 1: order toRepresenting a color real object image to be saliency detected, willThe corresponding depth image is notedWherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' representsAndwidth of (A), H' representsAndthe height of (a) of (b),to representThe pixel value of the pixel point with the middle coordinate position (i ', j'),to representImage of pixel point with middle coordinate position (i', jThe prime value;
step 2_ 2: will be provided withR channel component, G channel component and B channel component of andinputting into a convolutional neural network training model and using WbestAnd bbestMaking a prediction to obtainCorresponding 5 prediction significance detection images with different sizes are obtained by comparing the sizes withAs the predicted saliency detection image of uniform sizeCorresponding final predicted saliency detection images and notationWherein,to representAnd the pixel value of the pixel point with the middle coordinate position of (i ', j').
2. The method according to claim 1, wherein in step 1_2, the 1 st RGB graph neural network block and the 1 st depth graph neural network block have the same structure, and are composed of a first convolution layer, a first normalization layer, a first active layer, a first residual block, a second convolution layer, a second normalization layer, and a second active layer, which are sequentially arranged, wherein an input end of the first convolution layer is an input end of the neural network block where the first convolution layer is located, an input end of the first normalization layer receives all feature maps output from an output end of the first convolution layer, an input end of the first active layer receives all feature maps output from an output end of the first normalization layer, an input end of the first residual block receives all feature maps output from an output end of the first active layer, and an input end of the second convolution layer receives all feature maps output from an output end of the first residual block, the input end of the second batch of normalization layers receives all the characteristic graphs output by the output end of the second convolution layer, the input end of the second activation layer receives all the characteristic graphs output by the output end of the second batch of normalization layers, and the output end of the second activation layer is the output end of the neural network block where the second activation layer is located; the sizes of convolution kernels of the first convolution layer and the second convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 32, zero padding parameters are both 1, the activation modes of the first activation layer and the second activation layer are both 'Relu', and output ends of the first normalization layer, the second normalization layer, the first activation layer, the second activation layer and the first residual block respectively output 32 characteristic graphs;
the 2 nd RGB map neural network block and the 2 nd depth map neural network block have the same structure, and are composed of a third convolution layer, a third normalization layer, a third activation layer, a second residual block, a fourth convolution layer, a fourth normalization layer and a fourth activation layer which are sequentially arranged, wherein the input end of the third convolution layer is the input end of the neural network block where the third convolution layer is located, the input end of the third normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the third normalization layer, the input end of the second residual block receives all feature maps output by the output end of the third activation layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the second residual block, and the input end of the fourth normalization layer receives all feature maps output by the output end of the fourth convolution layer, the input end of the fourth activation layer receives all characteristic graphs output by the output end of the fourth batch of normalization layers, and the output end of the fourth activation layer is the output end of the neural network block where the fourth activation layer is located; the sizes of convolution kernels of the third convolution layer and the fourth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 64, zero padding parameters are both 1, the activation modes of the third activation layer and the fourth activation layer are both 'Relu', and 64 feature graphs are output by respective output ends of the third normalization layer, the fourth normalization layer, the third activation layer, the fourth activation layer and the second residual block;
the 3 rd RGB map neural network block and the 3 rd depth map neural network block have the same structure and are composed of a fifth convolution layer, a fifth normalization layer, a fifth activation layer, a third residual block, a sixth convolution layer, a sixth normalization layer and a sixth activation layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the fifth normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the fifth normalization layer, the input end of the third residual block receives all feature maps output by the output end of the fifth activation layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the third residual block, and the input end of the sixth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the sixth active layer receives all the characteristic graphs output by the output end of the sixth batch of normalization layers, and the output end of the sixth active layer is the output end of the neural network block where the sixth active layer is located; the sizes of convolution kernels of the fifth convolution layer and the sixth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 128, zero padding parameters are 1, the activation modes of the fifth activation layer and the sixth activation layer are both 'Relu', and the output ends of the fifth normalization layer, the sixth normalization layer, the fifth activation layer, the sixth activation layer and the third residual block output 128 feature graphs;
the 4 th RGB map neural network block and the 4 th depth map neural network block have the same structure, and are composed of a seventh convolution layer, a seventh normalization layer, a seventh activation layer, a fourth residual block, an eighth convolution layer, an eighth normalization layer and an eighth activation layer which are sequentially arranged, wherein the input end of the seventh convolution layer is the input end of the neural network block where the seventh convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the seventh convolution layer, the input end of the seventh activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the fourth residual block receives all feature maps output by the output end of the seventh activation layer, the input end of the eighth convolution layer receives all feature maps output by the output end of the fourth residual block, and the input end of the eighth normalization layer receives all feature maps output by the output end of the eighth convolution layer, the input end of the eighth active layer receives all characteristic graphs output by the output end of the eighth normalization layer, and the output end of the eighth active layer is the output end of the neural network block where the eighth active layer is located; the sizes of convolution kernels of the seventh convolution layer and the eighth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are 1, the activation modes of the seventh activation layer and the eighth activation layer are both 'Relu', and 256 characteristic graphs are output by respective output ends of the seventh normalization layer, the eighth normalization layer, the seventh activation layer, the eighth activation layer and the fourth residual block;
the 5 th RGB map neural network block and the 5 th depth map neural network block have the same structure and are composed of a ninth convolution layer, a ninth normalization layer, a ninth active layer, a fifth residual block, a tenth convolution layer, a tenth normalization layer and a tenth active layer which are sequentially arranged, wherein the input end of the ninth convolution layer is the input end of the neural network block where the ninth convolution layer is located, the input end of the ninth normalization layer receives all feature maps output by the output end of the ninth convolution layer, the input end of the ninth active layer receives all feature maps output by the output end of the ninth normalization layer, the input end of the fifth residual block receives all feature maps output by the output end of the ninth active layer, the input end of the tenth convolution layer receives all feature maps output by the output end of the fifth residual block, and the input end of the tenth normalization layer receives all feature maps output by the output end of the tenth convolution layer, the input end of the tenth active layer receives all characteristic graphs output by the output end of the tenth normalization layer, and the output end of the tenth active layer is the output end of the neural network block where the tenth active layer is located; the sizes of convolution kernels of the ninth convolution layer and the tenth convolution layer are both 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are both 1, the activation modes of the ninth activation layer and the tenth activation layer are both 'Relu', and 256 feature maps are output from output ends of the ninth normalization layer, the tenth normalization layer, the ninth activation layer, the tenth activation layer and the fifth residual block respectively.
3. The method according to claim 1 or 2, wherein in step 1_2, the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are maximum pooling layers, and the pooling sizes of the 4 RGB map maximum pooling layers and the 4 depth map maximum pooling layers are both 2 and the step size is 2.
4. The method according to claim 3, wherein in step 1_2, the 5 fusion neural network blocks have the same structure and are each composed of an eleventh convolution layer, an eleventh normalization layer, an eleventh activation layer, a sixth residual block, a twelfth convolution layer, a twelfth normalization layer and a twelfth activation layer which are sequentially arranged, wherein an input end of the eleventh convolution layer is an input end of the fusion neural network block where the eleventh convolution layer is located, an input end of the eleventh normalization layer receives all feature maps output by an output end of the eleventh convolution layer, an input end of the eleventh activation layer receives all feature maps output by an output end of the eleventh normalization layer, an input end of the sixth residual block receives all feature maps output by an output end of the eleventh activation layer, an input end of the twelfth convolution layer receives all feature maps output by an output end of the sixth residual block, an input end of the twelfth normalization layer receives all feature maps output by an output end of the twelfth convolution layer, an input end of the twelfth activation layer receives all feature maps output by an output end of the twelfth normalization layer, and an output end of the twelfth activation layer is an output end of the fusion neural network block where the twelfth activation layer is located; wherein in every fusion neural network block the convolution kernel sizes of the eleventh convolution layer and the twelfth convolution layer are both 3 × 3, the zero padding parameters are both 1, and the activation modes of the eleventh activation layer and the twelfth activation layer are both "Relu"; in the 1st and 2nd fusion neural network blocks the numbers of convolution kernels are both 256, and the output ends of the eleventh normalization layer, the twelfth normalization layer, the eleventh activation layer, the twelfth activation layer and the sixth residual block each output 256 feature maps; in the 3rd fusion neural network block the numbers of convolution kernels are both 128 and the corresponding output ends each output 128 feature maps; in the 4th fusion neural network block the numbers of convolution kernels are both 64 and the corresponding output ends each output 64 feature maps; and in the 5th fusion neural network block the numbers of convolution kernels are both 32 and the corresponding output ends each output 32 feature maps.
5. The saliency detection method based on residual error network and depth information fusion according to claim 4 is characterized in that in step 1_2, the convolution kernel sizes of the 1 st and 2 nd deconvolution layers are both 2 x 2, the number of convolution kernels is 256, the step size is 2, and the zero padding parameter is 0, the convolution kernel size of the 3 rd deconvolution layer is 2 x 2, the number of convolution kernels is 128, the step size is 2, and the zero padding parameter is 0, the convolution kernel size of the 4 th deconvolution layer is 2 x 2, the number of convolution kernels is 64, the step size is 2, and the zero padding parameter is 0.
6. The method according to claim 5, wherein in step 1_2, the 5 sub-output layers have the same structure and are composed of a thirteenth convolutional layer; wherein, the convolution kernel size of the thirteenth convolution layer is 1 × 1, the number of convolution kernels is 2, and the zero padding parameter is 0.
CN201910444775.0A 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion Active CN110263813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910444775.0A CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910444775.0A CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Publications (2)

Publication Number Publication Date
CN110263813A true CN110263813A (en) 2019-09-20
CN110263813B CN110263813B (en) 2020-12-01

Family

ID=67915440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910444775.0A Active CN110263813B (en) 2019-05-27 2019-05-27 Significance detection method based on residual error network and depth information fusion

Country Status (1)

Country Link
CN (1) CN110263813B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170351941A1 (en) * 2016-06-03 2017-12-07 Miovision Technologies Incorporated System and Method for Performing Saliency Detection Using Deep Active Contours
CN108961220A (en) * 2018-06-14 2018-12-07 上海大学 A kind of image collaboration conspicuousness detection method based on multilayer convolution Fusion Features
CN109409380A (en) * 2018-08-27 2019-03-01 浙江科技学院 A kind of significant extracting method of stereo-picture vision based on double learning networks
CN109409435A (en) * 2018-11-01 2019-03-01 上海大学 A kind of depth perception conspicuousness detection method based on convolutional neural networks
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUANG, RUI 等: "RGB-D Salient Object Detection by a CNN With Multiple Layers Fusion", 《IEEE SIGNAL PROCESSING LETTERS》 *
WUJIE ZHOU 等: "Saliency Detection for Stereoscopic 3D Images in the Quaternion Frequency Domain", 《3DR EXPRESS》 *
李荣 等: "利用卷积神经网络的显著性区域预测方法", 《重庆邮电大学学报( 自然科学版)》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751157A (en) * 2019-10-18 2020-02-04 厦门美图之家科技有限公司 Image saliency segmentation and image saliency model training method and device
CN110751157B (en) * 2019-10-18 2022-06-24 厦门美图之家科技有限公司 Image significance segmentation and image significance model training method and device
CN110782458B (en) * 2019-10-23 2022-05-31 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110782458A (en) * 2019-10-23 2020-02-11 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method
CN110929736B (en) * 2019-11-12 2023-05-26 浙江科技学院 Multi-feature cascading RGB-D significance target detection method
CN111160410A (en) * 2019-12-11 2020-05-15 北京京东乾石科技有限公司 Object detection method and device
CN111160410B (en) * 2019-12-11 2023-08-08 北京京东乾石科技有限公司 Object detection method and device
CN111209919A (en) * 2020-01-06 2020-05-29 上海海事大学 Marine ship significance detection method and system
CN111209919B (en) * 2020-01-06 2023-06-09 上海海事大学 Marine ship significance detection method and system
CN111242238A (en) * 2020-01-21 2020-06-05 北京交通大学 Method for acquiring RGB-D image saliency target
CN111242238B (en) * 2020-01-21 2023-12-26 北京交通大学 RGB-D image saliency target acquisition method
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Convolutional neural network edge-assisted enhanced binocular saliency image detection method
CN111351450B (en) * 2020-03-20 2021-09-28 南京理工大学 Single-frame stripe image three-dimensional measurement method based on deep learning
CN111351450A (en) * 2020-03-20 2020-06-30 南京理工大学 Single-frame stripe image three-dimensional measurement method based on deep learning
CN111783862A (en) * 2020-06-22 2020-10-16 浙江科技学院 Three-dimensional significant object detection technology of multi-attention-directed neural network
CN112749712A (en) * 2021-01-22 2021-05-04 四川大学 RGBD significance object detection method based on 3D convolutional neural network
CN112749712B (en) * 2021-01-22 2022-04-12 四川大学 RGBD significance object detection method based on 3D convolutional neural network

Also Published As

Publication number Publication date
CN110263813B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN110263813B (en) Significance detection method based on residual error network and depth information fusion
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
Yu et al. Underwater-GAN: Underwater image restoration via conditional generative adversarial network
CN110728192B (en) High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN108664981B (en) Salient image extraction method and device
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN110490082B (en) Road scene semantic segmentation method capable of effectively fusing neural network features
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110728682A (en) Semantic segmentation method based on residual pyramid pooling neural network
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
CN110210492B (en) Stereo image visual saliency detection method based on deep learning
CN112070753A (en) Multi-scale information enhanced binocular convolutional neural network saliency image detection method
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN110826411B (en) Vehicle target rapid identification method based on unmanned aerial vehicle image
CN111310767A (en) Significance detection method based on boundary enhancement
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN110458178A (en) The multi-modal RGB-D conspicuousness object detection method spliced more
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN114092774B (en) RGB-T image significance detection system and detection method based on information flow fusion
Guan et al. Srdgan: learning the noise prior for super resolution with dual generative adversarial networks
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230118

Address after: Room 2202, 22 / F, Wantong building, No. 3002, Sungang East Road, Sungang street, Luohu District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen dragon totem technology achievement transformation Co.,Ltd.

Address before: 310023 No. 318 stay Road, Xihu District, Zhejiang, Hangzhou

Patentee before: ZHEJIANG University OF SCIENCE AND TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20230627

Address after: 710000 Room 1306, Building 7, Taihua Jinmao International, Keji Second Road, Hi tech Zone, Xi'an City, Shaanxi Province

Patentee after: Huahao Technology (Xi'an) Co.,Ltd.

Address before: Room 2202, 22 / F, Wantong building, No. 3002, Sungang East Road, Sungang street, Luohu District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen dragon totem technology achievement transformation Co.,Ltd.