CN110175986B - Stereo image visual saliency detection method based on convolutional neural network - Google Patents

Stereo image visual saliency detection method based on convolutional neural network

Info

Publication number
CN110175986B
CN110175986B
Authority
CN
China
Prior art keywords
layer
output
neural network
input
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910327556.4A
Other languages
Chinese (zh)
Other versions
CN110175986A (en)
Inventor
周武杰
吕营
雷景生
张伟
何成
王海江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910327556.4A priority Critical patent/CN110175986B/en
Publication of CN110175986A publication Critical patent/CN110175986A/en
Application granted granted Critical
Publication of CN110175986B publication Critical patent/CN110175986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo image visual saliency detection method based on a convolutional neural network. The constructed convolutional neural network comprises an input layer, a hidden layer and an output layer, wherein the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework consists of an RGB feature extraction module, a depth feature extraction module and a feature fusion module. The left viewpoint image and the depth image of each stereo image in the training set are input into the convolutional neural network for training to obtain a saliency image of each stereo image in the training set; the loss function value between the saliency image of each stereo image in the training set and the corresponding real human-eye fixation image is calculated, and this process is repeated multiple times to obtain a convolutional neural network training model; the left viewpoint image and the depth image of a stereo image to be tested are input into the convolutional neural network training model, and a saliency prediction image is obtained by prediction. The advantage is higher visual saliency detection accuracy.

Description

Stereo image visual saliency detection method based on convolutional neural network
Technical Field
The invention relates to a visual saliency detection technology, in particular to a stereo image visual saliency detection method based on a convolutional neural network.
Background
Visual saliency has been a popular research topic in recent years in fields such as neuroscience, robotics, and computer vision. Research on visual saliency detection falls into two broad categories: eye-fixation prediction and salient object detection. The former predicts the points a person fixates on when viewing a natural scene, while the latter aims to accurately extract the object of interest. In general, visual saliency detection algorithms can be divided into top-down and bottom-up approaches. Top-down methods are task-driven and require supervised learning, whereas bottom-up methods typically rely on low-level cues such as color features, distance features, and heuristic saliency features. One of the most common heuristic saliency features is contrast, e.g. pixel-based or block-based contrast. Previous research on visual saliency detection has focused on two-dimensional images. However, three-dimensional data is often more suitable than two-dimensional data for practical applications, and as visual scenes become more complex, two-dimensional data alone is no longer sufficient for extracting salient objects. In recent years, with the progress of three-dimensional data acquisition technologies such as Time-of-Flight sensors and Microsoft Kinect, depth data has become much easier to obtain, which improves the ability to distinguish between different objects with similar appearance. Depth data is easy to capture, is independent of lighting, and can provide geometric cues that improve the prediction of visual saliency. Because of the complementarity of RGB data and depth data, many methods have been proposed that combine RGB images and depth images in pairs for visual saliency detection. Previous work has focused primarily on using domain-specific prior knowledge to construct low-level saliency features, for example the observation that humans tend to focus more on closer objects; however, such observations are difficult to generalize to all scenes. In most previous work, the multi-modal fusion problem was solved either by directly concatenating the RGB-D channels, or by processing each modality independently and then combining the decisions of the two modalities. While these strategies bring large improvements, they have difficulty fully exploring cross-modal complementarity. In recent years, with the success of convolutional neural networks (CNNs) in learning discriminative features from RGB data, more and more work uses CNNs to explore more powerful RGB-D representations through efficient multi-modal combination. Most of these works are based on a two-stream architecture, in which RGB features and depth features are learned in separate bottom-up streams and jointly inferred at an early or late stage. As the most popular solution, the two-stream architecture achieves significant improvements over work based on hand-crafted RGB-D features; however, a critical issue remains: how to effectively exploit multi-modal complementary information in the bottom-up process. Therefore, further research on RGB-D image visual saliency detection technology is necessary to improve the accuracy of visual saliency detection.
Disclosure of Invention
The invention aims to provide a stereo image visual saliency detection method based on a convolutional neural network, which has higher visual saliency detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, depth image and real human-eye fixation image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted as {I_n(x, y)}, and the left viewpoint image, depth image and real human-eye fixation image of {I_n(x, y)} are correspondingly denoted as {L_n(x, y)}, {D_n(x, y)} and {G_n(x, y)}; wherein N is a positive integer, N ≥ 300, both W and H are divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, I_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {I_n(x, y)}, L_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {L_n(x, y)}, D_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {D_n(x, y)}, and G_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {G_n(x, y)};
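For implementation purposes, a minimal sketch of how such a training set could be organized is given below, assuming PyTorch and PIL; the class name, the directory layout (left/, depth/, fixation/) and the file naming are illustrative assumptions rather than anything specified above.

```python
# Illustrative sketch only: directory layout and file naming are assumptions.
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class StereoSaliencyDataset(Dataset):
    """Pairs each left viewpoint image {L_n} with its depth image {D_n}
    and real human-eye fixation image {G_n}."""
    def __init__(self, root):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "left")))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        left = Image.open(os.path.join(self.root, "left", name)).convert("RGB")
        depth = Image.open(os.path.join(self.root, "depth", name)).convert("L")
        fix = Image.open(os.path.join(self.root, "fixation", name)).convert("L")
        # W and H must be divisible by 2, since the network downsamples and upsamples.
        return self.to_tensor(left), self.to_tensor(depth), self.to_tensor(fix)
```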
step 1_ 2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the hidden layer comprises a coding frame and a decoding frame, the coding frame comprises an RGB (red, green and blue) feature extraction module, a depth feature extraction module and a feature fusion module, the RGB feature extraction module comprises 1 to 4 neural network blocks and 1 to 3 down-sampling blocks, the depth feature extraction module comprises 5 to 8 neural network blocks and 4 to 6 down-sampling blocks, the feature fusion module comprises 9 to 15 neural network blocks and 1 to 4 maximum pooling layers, and the decoding frame comprises 16 to 19 neural network blocks and 1 to 4 up-sampling layers; the output layer consists of a first convolution layer, a first batch of normalization layers and a first activation layer, the convolution kernel size of the first convolution layer is 3 multiplied by 3, the step size is 1, the number of the convolution kernels is 1, the filling is 1, and the activation mode of the first activation layer is 'Sigmoid';
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input end of the 1 st neural network block receives the left viewpoint image for training output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature images with width W and height H, and the set formed by all the output feature images is recorded as P1(ii) a The input of the 1 st downsampling block receives P1Of the 1 st downsampling block, 64 output widths of
Figure BDA0002036696820000041
And has a height of
Figure BDA0002036696820000042
The feature map of (1) is a set of all feature maps outputted, and is denoted as X1(ii) a The input of the 2 nd neural network block receives X1The output end of the 2 nd neural network block outputs 128 characteristic maps with the width of
Figure BDA0002036696820000043
And has a height of
Figure BDA0002036696820000044
The feature map of (1) is a set of all feature maps of (1) output, and is denoted as P2(ii) a The input of the 2 nd downsampling block receives P2Of the 2 nd downsampling block, the output of the 2 nd downsampling block has 128 widths
Figure BDA0002036696820000045
And has a height of
Figure BDA0002036696820000046
The feature map of (1) is a set of all feature maps outputted, and is denoted as X2(ii) a The input of the 3 rd neural network block receives X2The output end of the 3 rd neural network block outputs 256 characteristic maps with the width of
Figure BDA0002036696820000047
And has a height of
Figure BDA0002036696820000048
The feature map of (1) is a set of all feature maps of (1) output, and is denoted as P3(ii) a The input of the 3 rd downsampling block receives P3Of 256 widths at the output of the 3 rd downsampling block
Figure BDA0002036696820000049
And has a height of
Figure BDA00020366968200000410
The feature map of (1) is a set of all feature maps outputted, and is denoted as X3(ii) a The input of the 4 th neural network block receives X3The output end of the 4 th neural network block outputs 512 characteristic maps with the width of
Figure BDA00020366968200000411
And has a height of
Figure BDA00020366968200000412
The feature map of (1) is a set of all feature maps of (1) output, and is denoted as P4
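A minimal sketch of the RGB feature extraction stream described above, assuming PyTorch; `nn_block(in_channels, out_channels)` and `down_block(channels)` are placeholder factories for the neural network blocks and downsampling blocks detailed later in step 1_2, and the depth feature extraction stream (blocks 5 to 8 with downsampling blocks 4 to 6) mirrors this structure on the depth input.

```python
import torch.nn as nn

class RGBFeatureExtraction(nn.Module):
    """Sketch of the RGB stream producing P1..P4 (X1..X3 are intermediate)."""
    def __init__(self, nn_block, down_block):
        super().__init__()
        self.block1 = nn_block(3, 64)      # P1: 64  x W   x H
        self.down1 = down_block(64)        # X1: 64  x W/2 x H/2
        self.block2 = nn_block(64, 128)    # P2: 128 x W/2 x H/2
        self.down2 = down_block(128)       # X2: 128 x W/4 x H/4
        self.block3 = nn_block(128, 256)   # P3: 256 x W/4 x H/4
        self.down3 = down_block(256)       # X3: 256 x W/8 x H/8
        self.block4 = nn_block(256, 512)   # P4: 512 x W/8 x H/8

    def forward(self, rgb):
        p1 = self.block1(rgb)
        p2 = self.block2(self.down1(p1))
        p3 = self.block3(self.down2(p2))
        p4 = self.block4(self.down3(p3))
        return p1, p2, p3, p4
```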
For the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 5th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P5; the input end of the 4th downsampling block receives all feature maps in P5, the output end of the 4th downsampling block outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as X4; the input end of the 6th neural network block receives all feature maps in X4, the output end of the 6th neural network block outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P6; the input end of the 5th downsampling block receives all feature maps in P6, the output end of the 5th downsampling block outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as X5; the input end of the 7th neural network block receives all feature maps in X5, the output end of the 7th neural network block outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P7; the input end of the 6th downsampling block receives all feature maps in P7, the output end of the 6th downsampling block outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as X6; the input end of the 8th neural network block receives all feature maps in X6, the output end of the 8th neural network block outputs 512 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P8;
For the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, the output end of the 9th neural network block outputs 3 feature maps with width W and height H, and the set of all output feature maps is denoted as P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 10th neural network block outputs 3 feature maps with width W and height H, and the set of all output feature maps is denoted as P10; all feature maps in P9 and all feature maps in P10 are subjected to an Element-wise Summation operation, after which 3 feature maps with width W and height H are output, and the set of all output feature maps is denoted as E1; the input end of the 11th neural network block receives all feature maps in E1, the output end of the 11th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P11; all feature maps in P1, all feature maps in P5 and all feature maps in P11 are subjected to an Element-wise Summation operation, after which 64 feature maps with width W and height H are output, and the set of all output feature maps is denoted as E2; the input end of the 1st maximum pooling layer receives all feature maps in E2, the output end of the 1st maximum pooling layer outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as Z1; the input end of the 12th neural network block receives all feature maps in Z1, the output end of the 12th neural network block outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P12; all feature maps in P2, all feature maps in P6 and all feature maps in P12 are subjected to an Element-wise Summation operation, after which 128 feature maps with width W/2 and height H/2 are output, and the set of all output feature maps is denoted as E3; the input end of the 2nd maximum pooling layer receives all feature maps in E3, the output end of the 2nd maximum pooling layer outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as Z2; the input end of the 13th neural network block receives all feature maps in Z2, the output end of the 13th neural network block outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P13; all feature maps in P3, all feature maps in P7 and all feature maps in P13 are subjected to an Element-wise Summation operation, after which 256 feature maps with width W/4 and height H/4 are output, and the set of all output feature maps is denoted as E4; the input end of the 3rd maximum pooling layer receives all feature maps in E4, the output end of the 3rd maximum pooling layer outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as Z3; the input end of the 14th neural network block receives all feature maps in Z3, the output end of the 14th neural network block outputs 512 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P14; all feature maps in P4, all feature maps in P8 and all feature maps in P14 are subjected to an Element-wise Summation operation, after which 512 feature maps with width W/8 and height H/8 are output, and the set of all output feature maps is denoted as E5; the input end of the 4th maximum pooling layer receives all feature maps in E5, the output end of the 4th maximum pooling layer outputs 512 feature maps with width W/16 and height H/16, and the set of all output feature maps is denoted as Z4; the input end of the 15th neural network block receives all feature maps in Z4, the output end of the 15th neural network block outputs 1024 feature maps with width W/16 and height H/16, and the set of all output feature maps is denoted as P15;
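The Element-wise Summation operations above simply add feature map sets of identical shape before the result is passed to the next block and to a 2 × 2 maximum pooling layer; a minimal sketch of one fusion step, assuming PyTorch, is given below. For example, with P1, P5 and P11 as inputs, `e` corresponds to E2 and the pooled output to Z1.

```python
import torch.nn as nn

# One fusion step of the feature fusion module: RGB features, depth features
# and the previous fusion features (all the same shape) are combined by
# Element-wise Summation, then halved in resolution by 2x2 max pooling.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

def fuse_step(p_rgb, p_depth, p_fused):
    e = p_rgb + p_depth + p_fused   # Element-wise Summation, e.g. E2 = P1 + P5 + P11
    return e, pool(e)               # pooled result (e.g. Z1) feeds the next block
```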
For the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, the output end of the 1st upsampling layer outputs 1024 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as S1; the input end of the 16th neural network block receives all feature maps in S1, the output end of the 16th neural network block outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P16; the input end of the 2nd upsampling layer receives all feature maps in P16, the output end of the 2nd upsampling layer outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as S2; the input end of the 17th neural network block receives all feature maps in S2, the output end of the 17th neural network block outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P17; the input end of the 3rd upsampling layer receives all feature maps in P17, the output end of the 3rd upsampling layer outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as S3; the input end of the 18th neural network block receives all feature maps in S3, the output end of the 18th neural network block outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P18; the input end of the 4th upsampling layer receives all feature maps in P18, the output end of the 4th upsampling layer outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as S4; the input end of the 19th neural network block receives all feature maps in S4, the output end of the 19th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P19;
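A minimal sketch of the decoding framework, assuming PyTorch; `nn_block(in_channels, out_channels)` is a placeholder factory for the 16th to 19th neural network blocks defined later, and the bilinear upsampling with scale factor 2 matches the description given further below.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoding framework: four bilinear x2 upsampling layers,
    each followed by a neural network block (blocks 16-19)."""
    def __init__(self, nn_block):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.block16 = nn_block(1024, 256)  # S1 -> P16
        self.block17 = nn_block(256, 128)   # S2 -> P17
        self.block18 = nn_block(128, 64)    # S3 -> P18
        self.block19 = nn_block(64, 64)     # S4 -> P19

    def forward(self, p15):
        p16 = self.block16(self.up(p15))    # W/16 x H/16 -> W/8 x H/8
        p17 = self.block17(self.up(p16))    # W/8  -> W/4
        p18 = self.block18(self.up(p17))    # W/4  -> W/2
        p19 = self.block19(self.up(p18))    # W/2  -> W
        return p19
```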
For the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image; wherein the width of the saliency image is W and the height is H;
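A sketch of the output layer as described (one 3 × 3 convolution kernel with stride 1 and padding 1, batch normalization, then a Sigmoid activation), assuming PyTorch:

```python
import torch.nn as nn

# Output layer: maps P19 (64 x W x H) to a one-channel saliency image of size W x H.
output_layer = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=1, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(1),
    nn.Sigmoid(),
)
```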
step 1_3: taking the left viewpoint image of each original stereo image in the training set as a training left viewpoint image and the depth image of each original stereo image in the training set as a training depth image, inputting them into the convolutional neural network for training, and obtaining the saliency image of each original stereo image in the training set; the saliency image of {I_n(x, y)} is denoted as {S_n(x, y)}; wherein S_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {S_n(x, y)};
step 1_4: calculating the loss function value between the saliency image of each original stereo image in the training set and the corresponding real human-eye fixation image; the loss function value between {S_n(x, y)} and {G_n(x, y)} is denoted as Loss_n and is obtained using the mean square error (MSE) loss function;
step 1_5: repeatedly executing step 1_3 and step 1_4 a total of V times to obtain a convolutional neural network training model, and obtaining N×V loss function values; then finding the minimum loss function value among the N×V loss function values; and then taking the weight vector and bias term corresponding to the minimum loss function value as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted as W_best and b_best; wherein V > 1;
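A minimal sketch of steps 1_3 to 1_5, assuming PyTorch; the optimizer, learning rate and the way the data loader yields (left viewpoint, depth, fixation) triples are assumptions, while the mean square error loss and the selection of the minimum-loss weights W_best and b_best follow the text.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, V=50, device="cpu"):
    """Train for V passes with the MSE loss and keep the parameters that
    produced the smallest loss value (W_best, b_best)."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer is an assumption
    best_loss, best_state = float("inf"), None
    for epoch in range(V):
        for left, depth, fixation in loader:
            left, depth, fixation = left.to(device), depth.to(device), fixation.to(device)
            saliency = model(left, depth)          # S_n
            loss = criterion(saliency, fixation)   # loss between S_n and G_n
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:            # track the minimum of the N x V values
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state
```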
the test stage process comprises the following specific steps:
step 2_1: let {I_test(x', y')} denote a stereo image to be tested with width W' and height H', and let the left viewpoint image and the depth image of {I_test(x', y')} be correspondingly denoted as {L_test(x', y')} and {D_test(x', y')}; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', I_test(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {I_test(x', y')}, L_test(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {L_test(x', y')}, and D_test(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {D_test(x', y')};
step 2_2: inputting {L_test(x', y')} and {D_test(x', y')} into the convolutional neural network training model and using W_best and b_best for prediction, obtaining the saliency prediction image of {I_test(x', y')}, denoted as {S_test(x', y')}; wherein S_test(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {S_test(x', y')}.
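A minimal sketch of the testing stage, assuming PyTorch; the checkpoint file name and the two-argument model interface are assumptions.

```python
import torch

def predict(model, left_test, depth_test, weights_path="best_model.pth"):
    """Load the best weights (W_best, b_best) saved during training and predict
    the saliency map of a test stereo image from its left viewpoint image
    (1 x 3 x H' x W') and depth image (1 x 1 x H' x W')."""
    model.load_state_dict(torch.load(weights_path))
    model.eval()
    with torch.no_grad():
        return model(left_test, depth_test)   # saliency prediction image S_test
```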
In step 1_2, the 1st to 8th neural network blocks have the same structure and each consists of a first dilated convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, wherein the input end of the first dilated convolution layer is the input end of the neural network block where it is located, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block where it is located; wherein the number of convolution kernels of the first dilated convolution layer and the second dilated convolution layer in each of the 1st and 5th neural network blocks is 64, the number of convolution kernels of the first dilated convolution layer and the second dilated convolution layer in each of the 2nd and 6th neural network blocks is 128, the number of convolution kernels of the first dilated convolution layer and the second dilated convolution layer in each of the 3rd and 7th neural network blocks is 256, the number of convolution kernels of the first dilated convolution layer and the second dilated convolution layer in each of the 4th and 8th neural network blocks is 512, the convolution kernel sizes of the first dilated convolution layer and the second dilated convolution layer in each of the 1st to 8th neural network blocks are all 3 × 3 with stride 1, dilation 2 and padding 2, and the activation mode of the second activation layer in each of the 1st to 8th neural network blocks is 'ReLU';
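A minimal sketch of one of the 1st to 8th neural network blocks, assuming PyTorch; `residual_block` is a placeholder for the first residual block described below, and the channel count is passed in per block (64, 128, 256 or 512).

```python
import torch.nn as nn

class EncoderNNBlock(nn.Module):
    """Sketch of the 1st-8th neural network blocks: dilated convolution,
    batch normalization and ReLU, a residual block, then a second dilated
    convolution and batch normalization."""
    def __init__(self, in_ch, out_ch, residual_block):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            residual_block(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.body(x)
```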
the 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3;
the 11th and 12th neural network blocks have the same structure and each consists of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 11th neural network block is 64, the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 12th neural network block is 128, the convolution kernel sizes of the third convolution layer and the fourth convolution layer in each of the 11th and 12th neural network blocks are all 3 × 3 with stride 1 and padding 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is 'ReLU';
the 13th to 19th neural network blocks have the same structure and each consists of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where it is located, the input end of the seventh batch normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh batch normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth batch normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth batch normalization layer, the input end of the seventh convolution layer receives all feature maps output by the output end of the fifth activation layer, the input end of the ninth batch normalization layer receives all feature maps output by the output end of the seventh convolution layer, and the output end of the ninth batch normalization layer is the output end of the neural network block where it is located; wherein the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 13th neural network block are all 256, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 14th neural network block are all 512, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 15th neural network block are all 1024, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 16th neural network block are 512, 512 and 256 respectively, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 17th neural network block are 256, 256 and 128 respectively, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 18th neural network block are 128, 128 and 64 respectively, the numbers of convolution kernels of the fifth, sixth and seventh convolution layers in the 19th neural network block are all 64, the convolution kernel sizes of the fifth, sixth and seventh convolution layers in each of the 13th to 19th neural network blocks are all 3 × 3 with stride 1 and padding 1, and the activation modes of the fourth activation layer and the fifth activation layer in each of the 13th to 19th neural network blocks are all 'ReLU'.
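A minimal sketch of the 13th to 19th neural network blocks, assuming PyTorch; the three channel arguments follow the kernel counts listed above (for example, the 13th block would be `fusion_decoder_block(128, 256, 256)` and the 16th block `fusion_decoder_block(1024, 512, 256)`).

```python
import torch.nn as nn

def fusion_decoder_block(in_ch, mid_ch, out_ch):
    """Three 3x3 convolutions, each followed by batch normalization, with ReLU
    after the first two (no activation follows the ninth batch normalization
    layer in the description above)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
    )
```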
In step 1_2, the 1 st to 6 th downsampling blocks have the same structure and are formed by the second residual block, the input end of the second residual block is the input end of the downsampling block where the second residual block is located, and the output end of the second residual block is the output end of the downsampling block where the second residual block is located.
The first residual block and the second residual block have the same structure, each comprising 3 convolution layers, 3 batch normalization layers and 3 activation layers, wherein the input end of the 1st convolution layer is the input end of the residual block where it is located, the input end of the 1st batch normalization layer receives all feature maps output by the output end of the 1st convolution layer, the input end of the 1st activation layer receives all feature maps output by the output end of the 1st batch normalization layer, the input end of the 2nd convolution layer receives all feature maps output by the output end of the 1st activation layer, the input end of the 2nd batch normalization layer receives all feature maps output by the output end of the 2nd convolution layer, the input end of the 2nd activation layer receives all feature maps output by the output end of the 2nd batch normalization layer, the input end of the 3rd convolution layer receives all feature maps output by the output end of the 2nd activation layer, the input end of the 3rd batch normalization layer receives all feature maps output by the output end of the 3rd convolution layer, and all feature maps received at the input end of the 1st convolution layer are added to all feature maps output by the output end of the 3rd batch normalization layer and, after passing through the 3rd activation layer, all feature maps output by the output end of the 3rd activation layer serve as all feature maps output by the output end of the residual block where it is located; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1st and 5th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2nd and 6th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3rd and 7th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4th and 8th neural network blocks is 512, the convolution kernel sizes of the 1st and 3rd convolution layers in the first residual block in each of the 1st to 8th neural network blocks are all 1 × 1 with stride 1, the convolution kernel size of the 2nd convolution layer in the first residual block in each of the 1st to 8th neural network blocks is 3 × 3 with stride 1 and padding 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1st and 4th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2nd and 5th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3rd and 6th downsampling blocks is 256, the convolution kernel sizes of the 1st and 3rd convolution layers in the second residual block in each of the 1st to 6th downsampling blocks are all 1 × 1 with stride 1, the convolution kernel size of the 2nd convolution layer in the second residual block in each of the 1st to 6th downsampling blocks is 3 × 3 with stride 2 and padding 1, and the activation modes of the 3 activation layers are all "ReLU".
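A minimal sketch of the first and second residual blocks, assuming PyTorch; how the identity branch is matched to the stride-2 convolution branch of the second residual block is not stated above, so the 2 × 2 average pooling of the identity used here is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """1x1 conv -> BN -> ReLU -> 3x3 conv (stride 1 or 2) -> BN -> ReLU ->
    1x1 conv -> BN, then the block input is added and passed through a final
    ReLU. All convolution layers keep the channel count unchanged."""
    def __init__(self, channels, stride=1):
        super().__init__()
        self.stride = stride
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Assumption: downsample the identity with average pooling when stride == 2.
        identity = x if self.stride == 1 else F.avg_pool2d(x, kernel_size=2, stride=2)
        return F.relu(self.branch(x) + identity)
```

A downsampling block (step 1_2) is then simply `ResidualBlock(channels, stride=2)`.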
In step 1_2, the sizes of the pooling windows of the 1 st to 4 th largest pooling layers are all 2 × 2, and the steps are all 2.
In step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are bilinear interpolation, and the scaling factors are 2.
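Expressed as PyTorch layers (the `align_corners` setting is an assumption; only bilinear interpolation with scale factor 2 is specified above), the pooling and upsampling layers are:

```python
import torch.nn as nn

# 1st-4th maximum pooling layers: 2x2 window, stride 2.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# 1st-4th upsampling layers: bilinear interpolation, scale factor 2.
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```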
Compared with the prior art, the invention has the advantages that:
1) Through the encoding framework provided in the constructed convolutional neural network, the method trains separate modules for RGB images and depth images (namely the RGB feature extraction module and the depth feature extraction module) to learn RGB and depth features at different levels, and provides a module dedicated to fusing the RGB and depth features, namely the feature fusion module, which fuses the two kinds of features from low level to high level; this helps make full use of cross-modal information to form new discriminative features and improves the accuracy of stereo visual saliency prediction.
2) The downsampling blocks in the RGB feature extraction module and the depth feature extraction module of the convolutional neural network constructed by the method use a residual block with stride 2 in place of the maximum pooling layer used in prior work, which helps the model adaptively select feature information and avoids losing important information through the max pooling operation.
3) The RGB feature extraction module and the depth feature extraction module of the convolutional neural network constructed by the method introduce residual blocks with dilated convolution layers placed before and after them, enlarging the receptive field of the convolution kernels, which helps the constructed convolutional neural network attend more to global information and learn richer content.
Drawings
FIG. 1 is a schematic diagram of the composition of a convolutional neural network constructed by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a stereo image visual saliency detection method based on a convolutional neural network.
The specific steps of the training phase process are as follows:
step 1_1: selecting N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, depth image and real human-eye fixation image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted as {I_n(x, y)}, and the left viewpoint image, depth image and real human-eye fixation image of {I_n(x, y)} are correspondingly denoted as {L_n(x, y)}, {D_n(x, y)} and {G_n(x, y)}; wherein N is a positive integer, N ≥ 300, for example N = 600, both W and H are divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, I_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {I_n(x, y)}, L_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {L_n(x, y)}, D_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {D_n(x, y)}, and G_n(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {G_n(x, y)}.
Step 1_ 2: constructing a convolutional neural network: as shown in fig. 1, the convolutional neural network includes an input layer, a hidden layer, and an output layer, where the input layer includes an RGB map input layer and a depth map input layer, the hidden layer includes a coding frame and a decoding frame, the coding frame includes three parts, namely, an RGB feature extraction module, a depth feature extraction module, and a feature fusion module, the RGB feature extraction module includes 1 st to 4 th neural network blocks, and 1 st to 3 rd downsampling blocks, the depth feature extraction module includes 5 th to 8 th neural network blocks, and 4 th to 6 th downsampling blocks, the feature fusion module includes 9 th to 15 th neural network blocks, and 1 st to 4 th maximum pooling layers, and the decoding frame includes 16 th to 19 th neural network blocks, and 1 st to 4 th upsampling layers; the output layer consists of a first convolution layer, a first batch of normalization layers and a first activation layer, the convolution kernel size of the first convolution layer is 3 multiplied by 3, the step size is 1, the number of the convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid'.
For the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; here, the width of the left viewpoint image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; the training depth image has a width W and a height H.
For the RGB feature extraction module, the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, the output end of the 1st neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P1; the input end of the 1st downsampling block receives all feature maps in P1, the output end of the 1st downsampling block outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as X1; the input end of the 2nd neural network block receives all feature maps in X1, the output end of the 2nd neural network block outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P2; the input end of the 2nd downsampling block receives all feature maps in P2, the output end of the 2nd downsampling block outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as X2; the input end of the 3rd neural network block receives all feature maps in X2, the output end of the 3rd neural network block outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P3; the input end of the 3rd downsampling block receives all feature maps in P3, the output end of the 3rd downsampling block outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as X3; the input end of the 4th neural network block receives all feature maps in X3, the output end of the 4th neural network block outputs 512 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P4.
For the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 5th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P5; the input end of the 4th downsampling block receives all feature maps in P5, the output end of the 4th downsampling block outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as X4; the input end of the 6th neural network block receives all feature maps in X4, the output end of the 6th neural network block outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P6; the input end of the 5th downsampling block receives all feature maps in P6, the output end of the 5th downsampling block outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as X5; the input end of the 7th neural network block receives all feature maps in X5, the output end of the 7th neural network block outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P7; the input end of the 6th downsampling block receives all feature maps in P7, the output end of the 6th downsampling block outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as X6; the input end of the 8th neural network block receives all feature maps in X6, the output end of the 8th neural network block outputs 512 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P8.
For the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB image input layer, the output end of the 9th neural network block outputs 3 feature maps with width W and height H, and the set of all output feature maps is denoted as P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, the output end of the 10th neural network block outputs 3 feature maps with width W and height H, and the set of all output feature maps is denoted as P10; all feature maps in P9 and all feature maps in P10 are subjected to an Element-wise Summation operation, after which 3 feature maps with width W and height H are output, and the set of all output feature maps is denoted as E1; the input end of the 11th neural network block receives all feature maps in E1, the output end of the 11th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P11; all feature maps in P1, all feature maps in P5 and all feature maps in P11 are subjected to an Element-wise Summation operation, after which 64 feature maps with width W and height H are output, and the set of all output feature maps is denoted as E2; the input end of the 1st maximum pooling layer receives all feature maps in E2, the output end of the 1st maximum pooling layer outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as Z1; the input end of the 12th neural network block receives all feature maps in Z1, the output end of the 12th neural network block outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P12; all feature maps in P2, all feature maps in P6 and all feature maps in P12 are subjected to an Element-wise Summation operation, after which 128 feature maps with width W/2 and height H/2 are output, and the set of all output feature maps is denoted as E3; the input end of the 2nd maximum pooling layer receives all feature maps in E3, the output end of the 2nd maximum pooling layer outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as Z2; the input end of the 13th neural network block receives all feature maps in Z2, the output end of the 13th neural network block outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P13; all feature maps in P3, all feature maps in P7 and all feature maps in P13 are subjected to an Element-wise Summation operation, after which 256 feature maps with width W/4 and height H/4 are output, and the set of all output feature maps is denoted as E4; the input end of the 3rd maximum pooling layer receives all feature maps in E4, the output end of the 3rd maximum pooling layer outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as Z3; the input end of the 14th neural network block receives all feature maps in Z3, the output end of the 14th neural network block outputs 512 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P14; all feature maps in P4, all feature maps in P8 and all feature maps in P14 are subjected to an Element-wise Summation operation, after which 512 feature maps with width W/8 and height H/8 are output, and the set of all output feature maps is denoted as E5; the input end of the 4th maximum pooling layer receives all feature maps in E5, the output end of the 4th maximum pooling layer outputs 512 feature maps with width W/16 and height H/16, and the set of all output feature maps is denoted as Z4; the input end of the 15th neural network block receives all feature maps in Z4, the output end of the 15th neural network block outputs 1024 feature maps with width W/16 and height H/16, and the set of all output feature maps is denoted as P15.
For the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, the output end of the 1st upsampling layer outputs 1024 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as S1; the input end of the 16th neural network block receives all feature maps in S1, the output end of the 16th neural network block outputs 256 feature maps with width W/8 and height H/8, and the set of all output feature maps is denoted as P16; the input end of the 2nd upsampling layer receives all feature maps in P16, the output end of the 2nd upsampling layer outputs 256 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as S2; the input end of the 17th neural network block receives all feature maps in S2, the output end of the 17th neural network block outputs 128 feature maps with width W/4 and height H/4, and the set of all output feature maps is denoted as P17; the input end of the 3rd upsampling layer receives all feature maps in P17, the output end of the 3rd upsampling layer outputs 128 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as S3; the input end of the 18th neural network block receives all feature maps in S3, the output end of the 18th neural network block outputs 64 feature maps with width W/2 and height H/2, and the set of all output feature maps is denoted as P18; the input end of the 4th upsampling layer receives all feature maps in P18, the output end of the 4th upsampling layer outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as S4; the input end of the 19th neural network block receives all feature maps in S4, the output end of the 19th neural network block outputs 64 feature maps with width W and height H, and the set of all output feature maps is denoted as P19.
For the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with a width of W and a height of H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the left viewpoint image for training; wherein the width of the saliency image is W and the height is H.
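A minimal PyTorch-style sketch of the output layer described above (one 3 × 3 convolution kernel with step 1 and padding 1, a batch normalization layer and a 'Sigmoid' activation); the class name and the example sizes are assumptions made for illustration:

import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Output layer: one 3x3 convolution kernel (step 1, padding 1) -> batch normalization -> Sigmoid."""
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(1)
        self.act = nn.Sigmoid()

    def forward(self, p19: torch.Tensor) -> torch.Tensor:
        # p19: the 64 feature maps of width W and height H produced by the 19th neural network block
        return self.act(self.bn(self.conv(p19)))  # saliency image of width W and height H

saliency = OutputLayer()(torch.randn(1, 64, 192, 256))   # example with W x H = 256 x 192
print(saliency.shape)                                    # torch.Size([1, 1, 192, 256])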
Step 1_3: the left viewpoint image of each original stereo image in the training set is used as the left viewpoint image for training, the depth image of each original stereo image in the training set is used as the depth image for training, and both are input into the convolutional neural network for training to obtain the saliency image of each original stereo image in the training set; the saliency image of {In(x, y)} is recorded as {Sn(x, y)}, where Sn(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {Sn(x, y)}.
Step 1_4: the loss function value between the saliency image of each original stereo image in the training set and the corresponding real human eye gazing image is calculated; the loss function value between {Sn(x, y)} and {Gn(x, y)} is obtained by using a mean square error loss function, where {Gn(x, y)} denotes the real human eye gazing image of {In(x, y)}.
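Since the loss formula itself is not reproduced here, the following short sketch only illustrates the stated choice of a mean square error loss between the saliency image and the real human eye gazing image; the tensor names and sizes are assumptions:

import torch
import torch.nn as nn

mse = nn.MSELoss()                             # mean square error loss, as stated in step 1_4

pred_saliency = torch.rand(1, 1, 192, 256)     # saliency image output by the network
gaze_map = torch.rand(1, 1, 192, 256)          # real human eye gazing image (ground truth)

loss = mse(pred_saliency, gaze_map)            # scalar loss value for this stereo image
print(loss.item())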
Step 1_5: step 1_3 and step 1_4 are repeatedly executed V times to obtain a convolutional neural network training model and N × V loss function values; the loss function value with the minimum value is then found among the N × V loss function values; the weight vector and the bias term corresponding to this minimum loss function value are taken as the optimal weight vector and the optimal bias term of the convolutional neural network training model, correspondingly recorded as Wbest and bbest; wherein V > 1, for example V = 50.
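A sketch of the training procedure of steps 1_3 to 1_5, repeating the forward pass and the loss computation V times and keeping the weights that yield the minimum loss value; the optimizer, learning rate and data loader are assumptions, since they are not specified in the description:

import copy
import torch

def train(model, loader, epochs_v=50, lr=1e-4, device="cuda"):
    """Repeat steps 1_3 and 1_4 V times and keep the parameters giving the minimum loss (step 1_5)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # assumed optimizer and learning rate
    criterion = torch.nn.MSELoss()                            # mean square error loss
    best_loss, best_state = float("inf"), None

    for _ in range(epochs_v):                                 # V passes, e.g. V = 50
        for left_view, depth, gaze in loader:                 # the N training stereo images
            left_view, depth, gaze = left_view.to(device), depth.to(device), gaze.to(device)
            saliency = model(left_view, depth)                # forward pass of the network
            loss = criterion(saliency, gaze)                  # one of the N x V loss function values
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                       # track the minimum loss function value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())  # corresponds to Wbest and bbest

    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_loss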
The test stage process comprises the following specific steps:
Step 2_1: let {Itest(x', y')} represent a stereo image to be tested with a width of W' and a height of H', and correspondingly record the left viewpoint image and the depth image of {Itest(x', y')} as {Ltest(x', y')} and {Dtest(x', y')}; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', Itest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Itest(x', y')}, Ltest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Ltest(x', y')}, and Dtest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Dtest(x', y')}.
Step 2_2: {Ltest(x', y')} and {Dtest(x', y')} are input into the convolutional neural network training model, and Wbest and bbest are used to make a prediction, obtaining the saliency prediction image of {Itest(x', y')}, which is recorded as {Stest(x', y')}; wherein Stest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Stest(x', y')}.
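The test stage of steps 2_1 and 2_2 then reduces to a single forward pass with the saved optimal parameters; a sketch under the assumption that the trained model and a checkpoint file (placeholder name 'best_weights.pth') are available:

import torch

@torch.no_grad()
def predict_saliency(model, left_view_test, depth_test,
                     checkpoint="best_weights.pth", device="cuda"):
    """Predict the saliency image of a stereo image to be tested using the saved optimal parameters."""
    model.load_state_dict(torch.load(checkpoint, map_location=device))  # load Wbest and bbest
    model.to(device).eval()
    left_view_test = left_view_test.to(device)   # left viewpoint image of width W' and height H'
    depth_test = depth_test.to(device)           # corresponding depth image
    return model(left_view_test, depth_test)     # saliency prediction image of width W' and height H'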
In this embodiment, in step 1_2, the 1 st to 8 th neural network blocks have the same structure and are composed of a first hole convolution layer, a second normalization layer, a second active layer, a first residual block, a second hole convolution layer and a third normalization layer, which are sequentially arranged, wherein an input end of the first hole convolution layer is an input end of the neural network block where the first hole convolution layer is located, an input end of the second normalization layer receives all feature maps output by an output end of the first hole convolution layer, an input end of the second active layer receives all feature maps output by an output end of the second normalization layer, an input end of the first residual block receives all feature maps output by an output end of the second active layer, an input end of the second hole convolution layer receives all feature maps output by an output end of the first residual block, an input end of the third normalization layer receives all feature maps output by an output end of the second hole convolution layer, the output end of the third batch of normalization layers is the output end of the neural network block where the third batch of normalization layers is located; wherein, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st to 8 th neural network blocks are both 3 × 3 and steps are both 1, the holes are all 2, the fillings are all 2, and the activation modes of the second activation layers in the 1 st to 8 th neural network blocks are all 'ReLU'.
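For the 1st to 8th neural network blocks just described (a dilated 3 × 3 convolution with dilation 2 and padding 2, batch normalization and ReLU, a residual block, and a second dilated convolution with batch normalization), a PyTorch-style sketch is given below; the class name is an assumption, and the residual block is passed in as a module and sketched separately further on:

import torch.nn as nn

class EncoderBlock(nn.Module):
    """Blocks 1-8: dilated conv -> BN -> ReLU -> residual block -> dilated conv -> BN."""
    def __init__(self, in_channels, out_channels, residual_block):
        super().__init__()
        # 3x3 hole (dilated) convolutions with step 1, dilation 2 and padding 2
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=1, padding=2, dilation=2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.res = residual_block
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=2, dilation=2)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.res(x)
        return self.bn2(self.conv2(x))

block1 = EncoderBlock(3, 64, nn.Identity())      # nn.Identity() stands in for the first residual block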
The 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3.
The 11th and 12th neural network blocks have the same structure and are composed of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer which are arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 11th neural network block is 64, the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 12th neural network block is 128, the convolution kernel sizes of the third convolution layer and the fourth convolution layer in each of the 11th and 12th neural network blocks are both 3 × 3, the steps are both 1, and the padding is both 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU".
The 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
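The 13th to 19th neural network blocks described above share a conv-BN-ReLU, conv-BN-ReLU, conv-BN pattern with 3 × 3 kernels, step 1 and padding 1; a sketch with assumed class and parameter names follows, the three channel counts being chosen per block (for example 512, 512 and 256 for the 16th block):

import torch.nn as nn

class FusionDecoderBlock(nn.Module):
    """Blocks 13-19: (3x3 conv -> BN -> ReLU) twice, followed by 3x3 conv -> BN."""
    def __init__(self, in_channels, mid1, mid2, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid1, 3, stride=1, padding=1),
            nn.BatchNorm2d(mid1), nn.ReLU(inplace=True),
            nn.Conv2d(mid1, mid2, 3, stride=1, padding=1),
            nn.BatchNorm2d(mid2), nn.ReLU(inplace=True),
            nn.Conv2d(mid2, out_channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.body(x)

block16 = FusionDecoderBlock(1024, 512, 512, 256)   # e.g. the 16th block: 1024 -> 512 -> 512 -> 256 maps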
In this embodiment, in step 1_2, the structure of the 1 st to 6 th downsampling blocks is the same, and they are composed of the second residual block, the input end of the second residual block is the input end of the downsampling block where it is located, and the output end of the second residual block is the output end of the downsampling block where it is located.
In this specific embodiment, the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, where the input end of the 1 st convolutional layer is the input end of the residual block where it is located, the input end of the 1 st batch normalization layer receives all the feature maps output by the output end of the 1 st convolutional layer, the input end of the 1 st active layer receives all the feature maps output by the output end of the 1 st batch normalization layer, the input end of the 2 nd convolutional layer receives all the feature maps output by the output end of the 1 st active layer, the input end of the 2 nd batch normalization layer receives all the feature maps output by the output end of the 2 nd convolutional layer, the input end of the 2 nd active layer receives all the feature maps output by the output end of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolution layer, all the feature maps received by the input end of the 1 st convolution layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and after passing through the 3 rd activation layer, all the feature maps output by the output end of the 3 rd activation layer are used as all the feature maps output by the output end of the residual block where the feature maps are located; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 
filling, and the activation modes of the 3 activation layers are both "ReLU".
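A PyTorch-style sketch of the residual block structure described above (1 × 1 convolution, 3 × 3 convolution and 1 × 1 convolution, each followed by batch normalization, with the block input added before the 3rd activation layer); the projection shortcut used when the step is 2 is an assumption made so that the shapes match, as the text does not spell out how the addition is performed in the down-sampling case:

import torch.nn as nn

class ResidualBlock(nn.Module):
    """1x1 conv-BN-ReLU -> 3x3 conv-BN-ReLU -> 1x1 conv-BN, then addition of the input and a final ReLU."""
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 1, stride=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, channels, 1, stride=1)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Assumption: a strided 1x1 projection keeps the shortcut shape consistent when stride = 2;
        # the text simply states that the block input is added to the output of the 3rd batch norm layer.
        self.shortcut = nn.Identity() if stride == 1 else nn.Sequential(
            nn.Conv2d(channels, channels, 1, stride=stride), nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))   # addition followed by the 3rd activation layer

first_res = ResidualBlock(64, stride=1)    # first residual blocks keep the resolution
second_res = ResidualBlock(64, stride=2)   # second residual blocks (down-sampling blocks) halve it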
In this embodiment, in step 1_2, the pooling windows of the 1 st to 4 th largest pooling layers are all 2 × 2 in size and all 2 in steps.
In this embodiment, in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are bilinear interpolation, and the scaling factors are all 2.
To verify the feasibility and effectiveness of the method of the invention, experiments were performed.
Here, the accuracy and stability of the method of the present invention were analyzed using the three-dimensional human eye tracking database (NCTU-3DFixation) provided by National Chiao Tung University, Taiwan. Four objective parameters commonly used to evaluate visual saliency extraction methods are adopted as evaluation indexes, namely the Linear Correlation Coefficient (CC), the Kullback-Leibler Divergence (KLD), the Area Under the ROC Curve (AUC) and the Normalized Scanpath Saliency (NSS).
The method of the invention is used to obtain the saliency prediction image of each stereo image in this eye tracking database, and the result is compared with the subjective visual saliency map of each stereo image, i.e. the real human eye gazing image provided with the database. Higher CC, AUC and NSS values and a lower KLD value indicate better consistency between the saliency prediction image obtained by the method of the invention and the subjective visual saliency map. The CC, KLD, AUC and NSS indexes reflecting the saliency extraction performance of the method of the invention are listed in Table 1.
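For reference, the linear correlation coefficient (CC) and the normalized scanpath saliency (NSS) used in this comparison can be computed as sketched below; this is a common formulation and not necessarily the exact evaluation code behind Table 1:

import numpy as np

def cc(saliency_pred: np.ndarray, saliency_gt: np.ndarray) -> float:
    """Linear correlation coefficient between a predicted and a ground-truth saliency map."""
    p = (saliency_pred - saliency_pred.mean()) / (saliency_pred.std() + 1e-12)
    g = (saliency_gt - saliency_gt.mean()) / (saliency_gt.std() + 1e-12)
    return float((p * g).mean())

def nss(saliency_pred: np.ndarray, fixation_map: np.ndarray) -> float:
    """Normalized scanpath saliency: mean of the normalized saliency at the fixated pixels."""
    p = (saliency_pred - saliency_pred.mean()) / (saliency_pred.std() + 1e-12)
    return float(p[fixation_map > 0].mean())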
TABLE 1 Consistency between the saliency prediction images obtained by the method of the invention and the subjective visual saliency maps
Performance index CC KLD AUC(Borji) NSS
Performance index value 0.7583 0.4868 0.8789 2.0692
As can be seen from the data listed in Table 1, the consistency between the saliency prediction images obtained by the method of the invention and the subjective visual saliency maps is good, indicating that the objective detection results agree well with subjective human perception, which is sufficient to demonstrate the feasibility and effectiveness of the method of the invention.

Claims (6)

1. A stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting N original stereo images with a width of W and a height of H; then forming a training set from all the selected original stereo images together with the left viewpoint image, the depth image and the real human eye gazing image of each original stereo image, recording the nth original stereo image in the training set as {In(x, y)}, and correspondingly recording the left viewpoint image, the depth image and the real human eye gazing image of {In(x, y)} as {Ln(x, y)}, {Dn(x, y)} and {Gn(x, y)}; wherein N is a positive integer, N ≥ 300, W and H are both evenly divisible by 2, n is a positive integer with an initial value of 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {In(x, y)}, Ln(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {Ln(x, y)}, Dn(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {Dn(x, y)}, and Gn(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {Gn(x, y)};
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer, wherein the input layer comprises an RGB (red, green, blue) image input layer and a depth map input layer, the hidden layer comprises a coding framework and a decoding framework, and the coding framework consists of an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module consists of the 1st to 4th neural network blocks and the 1st to 3rd down-sampling blocks, the depth feature extraction module consists of the 5th to 8th neural network blocks and the 4th to 6th down-sampling blocks, the feature fusion module consists of the 9th to 15th neural network blocks and the 1st to 4th maximum pooling layers, and the decoding framework consists of the 16th to 19th neural network blocks and the 1st to 4th up-sampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the step is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid';
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input end of the 1st neural network block receives the left viewpoint image for training output by the output end of the RGB image input layer, the output end of the 1st neural network block outputs 64 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P1; the input end of the 1st down-sampling block receives all feature maps in P1, the output end of the 1st down-sampling block outputs 64 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as X1; the input end of the 2nd neural network block receives all feature maps in X1, the output end of the 2nd neural network block outputs 128 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as P2; the input end of the 2nd down-sampling block receives all feature maps in P2, the output end of the 2nd down-sampling block outputs 128 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as X2; the input end of the 3rd neural network block receives all feature maps in X2, the output end of the 3rd neural network block outputs 256 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as P3; the input end of the 3rd down-sampling block receives all feature maps in P3, the output end of the 3rd down-sampling block outputs 256 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as X3; the input end of the 4th neural network block receives all feature maps in X3, the output end of the 4th neural network block outputs 512 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as P4;
For the depth feature extraction module, the input end of the 5th neural network block receives the depth image for training output by the output end of the depth map input layer, the output end of the 5th neural network block outputs 64 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P5; the input end of the 4th down-sampling block receives all feature maps in P5, the output end of the 4th down-sampling block outputs 64 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as X4; the input end of the 6th neural network block receives all feature maps in X4, the output end of the 6th neural network block outputs 128 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as P6; the input end of the 5th down-sampling block receives all feature maps in P6, the output end of the 5th down-sampling block outputs 128 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as X5; the input end of the 7th neural network block receives all feature maps in X5, the output end of the 7th neural network block outputs 256 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as P7; the input end of the 6th down-sampling block receives all feature maps in P7, the output end of the 6th down-sampling block outputs 256 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as X6; the input end of the 8th neural network block receives all feature maps in X6, the output end of the 8th neural network block outputs 512 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as P8;
For the feature fusion module, the input end of the 9th neural network block receives the left viewpoint image for training output by the output end of the RGB image input layer, the output end of the 9th neural network block outputs 3 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P9; the input end of the 10th neural network block receives the depth image for training output by the output end of the depth map input layer, the output end of the 10th neural network block outputs 3 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P10; an Element-wise Summation operation is performed on all feature maps in P9 and all feature maps in P10, 3 feature maps with a width of W and a height of H are output after the Element-wise Summation operation, and the set of all output feature maps is denoted as E1; the input end of the 11th neural network block receives all feature maps in E1, the output end of the 11th neural network block outputs 64 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P11; an Element-wise Summation operation is performed on all feature maps in P1, all feature maps in P5 and all feature maps in P11, 64 feature maps with a width of W and a height of H are output after the Element-wise Summation operation, and the set of all output feature maps is denoted as E2; the input end of the 1st maximum pooling layer receives all feature maps in E2, the output end of the 1st maximum pooling layer outputs 64 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as Z1; the input end of the 12th neural network block receives all feature maps in Z1, the output end of the 12th neural network block outputs 128 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as P12; an Element-wise Summation operation is performed on all feature maps in P2, all feature maps in P6 and all feature maps in P12, 128 feature maps with a width of W/2 and a height of H/2 are output after the Element-wise Summation operation, and the set of all output feature maps is denoted as E3; the input end of the 2nd maximum pooling layer receives all feature maps in E3, the output end of the 2nd maximum pooling layer outputs 128 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as Z2; the input end of the 13th neural network block receives all feature maps in Z2, the output end of the 13th neural network block outputs 256 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as P13; an Element-wise Summation operation is performed on all feature maps in P3, all feature maps in P7 and all feature maps in P13, 256 feature maps with a width of W/4 and a height of H/4 are output after the Element-wise Summation operation, and the set of all output feature maps is denoted as E4; the input end of the 3rd maximum pooling layer receives all feature maps in E4, the output end of the 3rd maximum pooling layer outputs 256 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as Z3; the input end of the 14th neural network block receives all feature maps in Z3, the output end of the 14th neural network block outputs 512 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as P14; an Element-wise Summation operation is performed on all feature maps in P4, all feature maps in P8 and all feature maps in P14, 512 feature maps with a width of W/8 and a height of H/8 are output after the Element-wise Summation operation, and the set of all output feature maps is denoted as E5; the input end of the 4th maximum pooling layer receives all feature maps in E5, the output end of the 4th maximum pooling layer outputs 512 feature maps with a width of W/16 and a height of H/16, and the set of all output feature maps is denoted as Z4; the input end of the 15th neural network block receives all feature maps in Z4, the output end of the 15th neural network block outputs 1024 feature maps with a width of W/16 and a height of H/16, and the set of all output feature maps is denoted as P15;
For the decoding framework, the input end of the 1st up-sampling layer receives all feature maps in P15, the output end of the 1st up-sampling layer outputs 1024 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as S1; the input end of the 16th neural network block receives all feature maps in S1, the output end of the 16th neural network block outputs 256 feature maps with a width of W/8 and a height of H/8, and the set of all output feature maps is denoted as P16; the input end of the 2nd up-sampling layer receives all feature maps in P16, the output end of the 2nd up-sampling layer outputs 256 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as S2; the input end of the 17th neural network block receives all feature maps in S2, the output end of the 17th neural network block outputs 128 feature maps with a width of W/4 and a height of H/4, and the set of all output feature maps is denoted as P17; the input end of the 3rd up-sampling layer receives all feature maps in P17, the output end of the 3rd up-sampling layer outputs 128 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as S3; the input end of the 18th neural network block receives all feature maps in S3, the output end of the 18th neural network block outputs 64 feature maps with a width of W/2 and a height of H/2, and the set of all output feature maps is denoted as P18; the input end of the 4th up-sampling layer receives all feature maps in P18, the output end of the 4th up-sampling layer outputs 64 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as S4; the input end of the 19th neural network block receives all feature maps in S4, the output end of the 19th neural network block outputs 64 feature maps with a width of W and a height of H, and the set of all output feature maps is denoted as P19;
For the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with a width of W and a height of H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the left viewpoint image for training; wherein the width of the saliency image is W and the height is H;
step 1_3: taking the left viewpoint image of each original stereo image in the training set as the left viewpoint image for training and the depth image of each original stereo image in the training set as the depth image for training, inputting them into the convolutional neural network for training to obtain the saliency image of each original stereo image in the training set, and recording the saliency image of {In(x, y)} as {Sn(x, y)}; wherein Sn(x, y) represents the pixel value of the pixel point with coordinate position (x, y) in {Sn(x, y)};
step 1_4: calculating the loss function value between the saliency image of each original stereo image in the training set and the corresponding real human eye gazing image, the loss function value between {Sn(x, y)} and {Gn(x, y)} being obtained by using a mean square error loss function;
step 1_5: repeatedly executing step 1_3 and step 1_4 V times to obtain a convolutional neural network training model and N × V loss function values; then finding the loss function value with the minimum value among the N × V loss function values; and then taking the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network training model, correspondingly recorded as Wbest and bbest; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: letting {Itest(x', y')} represent a stereo image to be tested with a width of W' and a height of H', and correspondingly recording the left viewpoint image and the depth image of {Itest(x', y')} as {Ltest(x', y')} and {Dtest(x', y')}; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', Itest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Itest(x', y')}, Ltest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Ltest(x', y')}, and Dtest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Dtest(x', y')};
step 2_2: inputting {Ltest(x', y')} and {Dtest(x', y')} into the convolutional neural network training model, and using Wbest and bbest to make a prediction to obtain the saliency prediction image of {Itest(x', y')}, which is recorded as {Stest(x', y')}; wherein Stest(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in {Stest(x', y')}.
2. The method according to claim 1, wherein in step 1_2, the 1 st to 8 th neural network blocks have the same structure and are composed of a first hole convolution layer, a second normalization layer, a second activation layer, a first residual block, a second hole convolution layer, and a third normalization layer, which are sequentially arranged, wherein an input end of the first hole convolution layer is an input end of the neural network block where the first hole convolution layer is located, an input end of the second normalization layer receives all feature maps output by an output end of the first hole convolution layer, an input end of the second activation layer receives all feature maps output by an output end of the second normalization layer, an input end of the first residual block receives all feature maps output by an output end of the second activation layer, and an input end of the second hole convolution layer receives all feature maps output by an output end of the first residual block, the input end of the third batch of normalization layers receives all characteristic graphs output by the output end of the second cavity convolution layer, and the output end of the third batch of normalization layers is the output end of the neural network block where the third batch of normalization layers is located; wherein, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st to 8 th neural network blocks are both 3 × 3 and steps are both 1, the holes are all 2, the fillings are all 2, and the activation modes of the second activation layers in the 1 st to 8 th neural network blocks are all 'ReLU';
the 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3;
the 11th and 12th neural network blocks have the same structure and are composed of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer which are arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 11th neural network block is 64, the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 12th neural network block is 128, the convolution kernel sizes of the third convolution layer and the fourth convolution layer in each of the 11th and 12th neural network blocks are both 3 × 3, the steps are both 1, and the padding is both 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU";
the 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
3. The method for detecting visual saliency of stereoscopic images based on convolutional neural network as claimed in claim 2, wherein in step 1_2, the 1 st to 6 th downsampling blocks have the same structure and are composed of the second residual block, the input end of the second residual block is the input end of the downsampling block where it is located, and the output end of the second residual block is the output end of the downsampling block where it is located.
4. The method according to claim 3, wherein the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, an input of a 1 st convolutional layer is an input of the residual block, an input of a 1 st batch normalization layer receives all feature maps output by an output of the 1 st convolutional layer, an input of a 1 st active layer receives all feature maps output by an output of the 1 st batch normalization layer, an input of a 2 nd convolutional layer receives all feature maps output by an output of the 1 st active layer, an input of a 2 nd batch normalization layer receives all feature maps output by an output of the 2 nd convolutional layer, an input of a 2 nd active layer receives all feature maps output by an output of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolutional layer, all the feature maps received by the input end of the 1 st convolutional layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and all the feature maps output by the output end of the 3 rd active layer after passing through the 3 rd active layer are used as all the feature maps output by the output end of the residual block; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 filling, and the activation modes of the 3 activation layers are both "ReLU".
5. The method for detecting the visual saliency of stereoscopic images based on convolutional neural network as claimed in any one of claims 1 to 4, wherein in step 1_2, the sizes of the pooling windows of the 1 st to 4 th maximum pooling layers are all 2 x 2 and the steps are all 2.
6. The method for detecting visual saliency of stereoscopic images based on a convolutional neural network as claimed in claim 5, wherein in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are all bilinear interpolation, and the scaling factor is all 2.
CN201910327556.4A 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network Active CN110175986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN110175986A CN110175986A (en) 2019-08-27
CN110175986B true CN110175986B (en) 2021-01-08

Family

ID=67689881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327556.4A Active CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN110175986B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555434B (en) * 2019-09-03 2022-03-29 浙江科技学院 Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110782458B (en) * 2019-10-23 2022-05-31 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
JP2023503827A (en) * 2019-11-14 2023-02-01 ズークス インコーポレイテッド Depth data model training with upsampling, loss and loss balance
US11157774B2 (en) 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN111369506B (en) * 2020-02-26 2022-08-02 四川大学 Lens turbidity grading method based on eye B-ultrasonic image
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
CN111612832B (en) * 2020-04-29 2023-04-18 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN112528900B (en) * 2020-12-17 2022-09-16 南开大学 Image salient object detection method and system based on extreme down-sampling
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113592795B (en) * 2021-07-19 2024-04-12 深圳大学 Visual saliency detection method for stereoscopic image, thumbnail generation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699151B2 (en) * 2016-06-03 2020-06-30 Miovision Technologies Incorporated System and method for performing saliency detection using deep active contours

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RGB-D Saliency Detection by Multi-stream Late Fusion Network; Chen, Hao et al.; Computer Vision Systems; 20171231; full text *
Saliency detection for stereoscopic 3D images in the quaternion frequency domain; Xingyu Cai et al.; 3D Research; 20180630; full text *
Saliency region prediction method using convolutional neural network; Li Rong et al.; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); 20190228; full text *

Also Published As

Publication number Publication date
CN110175986A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
CN108520535B (en) Object classification method based on depth recovery information
Li et al. Underwater image enhancement via medium transmission-guided multi-color space embedding
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
CN110032926B (en) Video classification method and device based on deep learning
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111563418A (en) Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN110210492B (en) Stereo image visual saliency detection method based on deep learning
US20110292051A1 (en) Automatic Avatar Creation
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN108491848B (en) Image saliency detection method and device based on depth information
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
Luo et al. Bi-GANs-ST for perceptual image super-resolution
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Sharma et al. A novel 3d-unet deep learning framework based on high-dimensional bilateral grid for edge consistent single image depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant