CN110175986A - Stereo image visual saliency detection method based on a convolutional neural network - Google Patents

Stereo image visual saliency detection method based on a convolutional neural network

Info

Publication number
CN110175986A
Authority
CN
China
Prior art keywords
layer
output
neural network
input
feature maps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910327556.4A
Other languages
Chinese (zh)
Other versions
CN110175986B (en)
Inventor
周武杰
吕营
雷景生
张伟
何成
王海江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910327556.4A priority Critical patent/CN110175986B/en
Publication of CN110175986A publication Critical patent/CN110175986A/en
Application granted granted Critical
Publication of CN110175986B publication Critical patent/CN110175986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo image visual saliency detection method based on a convolutional neural network. A convolutional neural network is constructed that comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework is composed of an RGB feature extraction module, a depth feature extraction module and a feature fusion module. The left viewpoint image and the depth image of every stereo image in the training set are input into the convolutional neural network for training, and the saliency image of every stereo image in the training set is obtained; the loss function value between the saliency image and the real human-eye gaze image of every stereo image in the training set is calculated, and after repeating this process several times a convolutional neural network training model is obtained. The left viewpoint image and the depth image of the stereo image to be tested are then input into the convolutional neural network training model, and a saliency prediction image is obtained by prediction. The advantage is a higher visual saliency detection accuracy.

Description

Stereo image visual saliency detection method based on convolutional neural network
Technical Field
The invention relates to a visual saliency detection technology, in particular to a stereo image visual saliency detection method based on a convolutional neural network.
Background
Visual saliency has in recent years become a popular research topic in fields such as neuroscience, robotics and computer vision. Research on visual saliency detection falls into two broad categories: eye fixation prediction and salient object detection. The former predicts the points a person fixates on when viewing a natural scene, while the latter accurately extracts the objects of interest. In general, visual saliency detection algorithms can be divided into top-down and bottom-up approaches. Top-down approaches are task driven and require supervised learning, whereas bottom-up methods typically use low-level cues such as color features, distance features and heuristic saliency features. One of the most common heuristic saliency features is contrast, e.g. pixel-based or block-based contrast. Earlier research on visual saliency detection focused on two-dimensional images. However, three-dimensional data is better suited to practical applications than two-dimensional data, and as visual scenes become more complex, two-dimensional data alone is no longer sufficient for extracting salient objects. In recent years, with the progress of three-dimensional data acquisition technologies such as Time-of-Flight sensors and Microsoft Kinect, depth data has become easy to capture; it is independent of lighting and provides geometric cues that improve the discrimination between different objects with similar appearance and the prediction of visual saliency. Because of the complementarity of RGB data and depth data, many methods have been proposed that combine RGB images with their paired depth images for visual saliency detection. Previous work mainly used domain-specific prior knowledge to construct low-level saliency features, for example the observation that humans tend to focus on closer objects, but such observations are difficult to generalize to all scenes. In most previous work, the multi-modal fusion problem was addressed either by directly concatenating the RGB-D channels or by processing each modality independently and then combining the decisions of the two modalities. Although these strategies bring clear improvements, they struggle to fully exploit cross-modal complementarity. More recently, with the success of convolutional neural networks (CNNs) in learning discriminative features from RGB data, more and more work uses CNNs to learn stronger RGB-D representations through efficient multi-modal combination. Most of this work is based on a two-stream architecture in which RGB data and depth data are learned in separate bottom-up streams and their features are jointly inferred at an early or late stage. As the most popular solution, the two-stream architecture achieves a significant improvement over work based on hand-crafted RGB-D features, yet a critical issue remains: how to effectively exploit multi-modal complementary information in the bottom-up process. Further research on RGB-D image visual saliency detection is therefore necessary to improve the accuracy of visual saliency detection.
Disclosure of Invention
The invention aims to provide a stereo image visual saliency detection method based on a convolutional neural network, which has higher visual saliency detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, the depth image and the real human-eye gaze image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted {In(x, y)}, and the left viewpoint image, the depth image and the real human-eye gaze image of {In(x, y)} are recorded correspondingly, the depth image being denoted {Dn(x, y)}; wherein N is a positive integer, N ≥ 300, W and H are both divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {In(x, y)}, Dn(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {Dn(x, y)}, and the pixel values of the left viewpoint image and of the real human-eye gaze image at coordinate position (x, y) are defined in the same way;
step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework is composed of an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid';
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H, whose set is denoted P1; the input end of the 1st downsampling block receives all feature maps in P1, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X1; the input end of the 2nd neural network block receives all feature maps in X1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P2; the input end of the 2nd downsampling block receives all feature maps in P2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X2; the input end of the 3rd neural network block receives all feature maps in X2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P3; the input end of the 3rd downsampling block receives all feature maps in P3, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X3; the input end of the 4th neural network block receives all feature maps in X3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P4;
for the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H, whose set is denoted P5; the input end of the 4th downsampling block receives all feature maps in P5, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X4; the input end of the 6th neural network block receives all feature maps in X4, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P6; the input end of the 5th downsampling block receives all feature maps in P6, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X5; the input end of the 7th neural network block receives all feature maps in X5, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P7; the input end of the 6th downsampling block receives all feature maps in P7, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X6; the input end of the 8th neural network block receives all feature maps in X6, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P8;
for the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H, whose set is denoted P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and its output end outputs 3 feature maps with width W and height H, whose set is denoted P10; all feature maps in P9 and all feature maps in P10 undergo an Element-wise Summation operation, which outputs 3 feature maps with width W and height H, whose set is denoted E1; the input end of the 11th neural network block receives all feature maps in E1, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P11; all feature maps in P1, P5 and P11 undergo an Element-wise Summation operation, which outputs 64 feature maps with width W and height H, whose set is denoted E2; the input end of the 1st max pooling layer receives all feature maps in E2, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted Z1; the input end of the 12th neural network block receives all feature maps in Z1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P12; all feature maps in P2, P6 and P12 undergo an Element-wise Summation operation, which outputs 128 feature maps with width W/2 and height H/2, whose set is denoted E3; the input end of the 2nd max pooling layer receives all feature maps in E3, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted Z2; the input end of the 13th neural network block receives all feature maps in Z2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P13; all feature maps in P3, P7 and P13 undergo an Element-wise Summation operation, which outputs 256 feature maps with width W/4 and height H/4, whose set is denoted E4; the input end of the 3rd max pooling layer receives all feature maps in E4, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted Z3; the input end of the 14th neural network block receives all feature maps in Z3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P14; all feature maps in P4, P8 and P14 undergo an Element-wise Summation operation, which outputs 512 feature maps with width W/8 and height H/8, whose set is denoted E5; the input end of the 4th max pooling layer receives all feature maps in E5, and its output end outputs 512 feature maps with width W/16 and height H/16, whose set is denoted Z4; the input end of the 15th neural network block receives all feature maps in Z4, and its output end outputs 1024 feature maps with width W/16 and height H/16, whose set is denoted P15;
for the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, and its output end outputs 1024 feature maps with width W/8 and height H/8, whose set is denoted S1; the input end of the 16th neural network block receives all feature maps in S1, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted P16; the input end of the 2nd upsampling layer receives all feature maps in P16, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted S2; the input end of the 17th neural network block receives all feature maps in S2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted P17; the input end of the 3rd upsampling layer receives all feature maps in P17, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted S3; the input end of the 18th neural network block receives all feature maps in S3, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted P18; the input end of the 4th upsampling layer receives all feature maps in P18, and its output end outputs 64 feature maps with width W and height H, whose set is denoted S4; the input end of the 19th neural network block receives all feature maps in S4, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P19;
for the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image; wherein the width of the saliency image is W and its height is H;
step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image in the training set as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {In(x, y)} is recorded accordingly, its value at coordinate position (x, y) being the pixel value of the pixel at (x, y) in that saliency image;
step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye gaze image; the loss function value between the saliency image of {In(x, y)} and its real human-eye gaze image is obtained using the mean squared error loss function;
step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model and N × V loss function values; then find the smallest of the N × V loss function values; the weight vector and bias term corresponding to that smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: let the stereo image to be tested have width W' and height H', and record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel values of the stereo image to be tested, of its left viewpoint image and of its depth image at coordinate position (x', y') are defined in the same way as in the training stage;
step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model, and predict using Wbest and bbest to obtain the saliency prediction image of the stereo image to be tested, whose value at coordinate position (x', y') is the pixel value of the pixel at (x', y') in that saliency prediction image.
In step 1_2, the 1st to 8th neural network blocks have the same structure and each consists of a first dilated convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, where the input end of the first dilated convolution layer is the input end of the neural network block it belongs to, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the first and second dilated convolution layers is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks the convolution kernel sizes of the first and second dilated convolution layers are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2, and the activation mode of the second activation layer is 'ReLU';
the 9th and 10th neural network blocks have the same structure and each consists of a second convolution layer and a fourth batch normalization layer arranged in sequence, where the input end of the second convolution layer is the input end of the neural network block it belongs to, the input end of the fourth batch normalization layer receives all feature maps output by the output end of the second convolution layer, and the output end of the fourth batch normalization layer is the output end of the neural network block it belongs to; in each of the 9th and 10th neural network blocks the number of convolution kernels of the second convolution layer is 3, the convolution kernel size is 7 × 7, the stride is 1 and the padding is 3;
the 11th and 12th neural network blocks have the same structure and each consists of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, where the input end of the third convolution layer is the input end of the neural network block it belongs to, the input end of the fifth batch normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the third and fourth convolution layers is 64 in the 11th neural network block and 128 in the 12th neural network block; in both blocks the convolution kernel sizes of the third and fourth convolution layers are 3 × 3, the strides are 1 and the paddings are 1, and the activation mode of the third activation layer is 'ReLU';
the 13th to 19th neural network blocks have the same structure and each consists of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence, where the input end of the fifth convolution layer is the input end of the neural network block it belongs to, each subsequent layer receives all feature maps output by the output end of the layer before it, and the output end of the ninth batch normalization layer is the output end of the neural network block it belongs to; the numbers of convolution kernels of the fifth, sixth and seventh convolution layers are 256, 256 and 256 in the 13th neural network block, 512, 512 and 512 in the 14th block, 1024, 1024 and 1024 in the 15th block, 512, 512 and 256 in the 16th block, 256, 256 and 128 in the 17th block, 128, 128 and 64 in the 18th block, and 64, 64 and 64 in the 19th block; in each of the 13th to 19th neural network blocks the convolution kernel sizes of the fifth, sixth and seventh convolution layers are 3 × 3, the strides are 1 and the paddings are 1, and the activation modes of the fourth and fifth activation layers are 'ReLU'.
In step 1_2, the 1st to 6th downsampling blocks have the same structure and each consists of a second residual block, whose input end is the input end of the downsampling block it belongs to and whose output end is the output end of that downsampling block.
The first residual block and the second residual block have the same structure and each comprises 3 convolution layers, 3 batch normalization layers and 3 activation layers, where the input end of the 1st convolution layer is the input end of the residual block it belongs to; the 1st batch normalization layer, the 1st activation layer, the 2nd convolution layer, the 2nd batch normalization layer, the 2nd activation layer, the 3rd convolution layer and the 3rd batch normalization layer are connected in sequence, each receiving all feature maps output by the layer before it; all feature maps received by the input end of the 1st convolution layer are added to all feature maps output by the output end of the 3rd batch normalization layer, and the sum, after passing through the 3rd activation layer, forms all feature maps output by the output end of the residual block. The number of convolution kernels of every convolution layer in the first residual block is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks, the convolution kernel sizes of the 1st and 3rd convolution layers in the first residual block are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 1 and padding 1. The number of convolution kernels of every convolution layer in the second residual block is 64 in each of the 1st and 4th downsampling blocks, 128 in each of the 2nd and 5th downsampling blocks, and 256 in each of the 3rd and 6th downsampling blocks; in each of the 1st to 6th downsampling blocks, the convolution kernel sizes of the 1st and 3rd convolution layers in the second residual block are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 2 and padding 1. The activation modes of the 3 activation layers are all "ReLU".
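For reference, the residual block just described can be written compactly in a deep-learning framework. The following is a minimal PyTorch sketch (the patent does not name a framework, and the class name BottleneckResidual is illustrative); the strided 1 × 1 projection on the skip path of the stride-2 variant is an added assumption needed to keep the element-wise addition shape-consistent, since the text leaves it implicit.

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions, each followed by batch normalization,
    with a skip connection added before the final ReLU."""
    def __init__(self, channels: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
        )
        # Assumption: when stride=2 (the second residual block used inside the
        # downsampling blocks) the skip path must also be downsampled so the
        # element-wise addition is shape-consistent; a strided 1x1 projection
        # is used here, which the patent text does not spell out.
        self.skip = (nn.Identity() if stride == 1 else
                     nn.Sequential(nn.Conv2d(channels, channels, 1, stride=stride),
                                   nn.BatchNorm2d(channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# e.g. BottleneckResidual(64, stride=2) corresponds to the second residual
# block of the 1st or 4th downsampling block.
down1 = BottleneckResidual(64, stride=2)
print(down1(torch.randn(1, 64, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```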
In step 1_2, the pooling window sizes of the 1st to 4th max pooling layers are all 2 × 2 and the strides are all 2.
In step 1_2, the sampling mode of the 1st to 4th upsampling layers is bilinear interpolation and the scale factors are all 2.
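Under the same assumptions, these fixed-configuration layers are one-liners in PyTorch; the downsampling blocks themselves are simply the stride-2 residual block sketched above.

```python
import torch.nn as nn

# 1st-4th max pooling layers: 2 x 2 pooling window, stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# 1st-4th upsampling layers: bilinear interpolation, scale factor 2
# (align_corners=False is an assumption the patent does not state)
upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
```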
Compared with the prior art, the invention has the following advantages:
1) In the encoding framework of the constructed convolutional neural network, the method trains a separate module for the RGB image and for the depth image (namely the RGB feature extraction module and the depth feature extraction module) to learn RGB and depth features at different levels, and provides a module dedicated to fusing them, namely the feature fusion module, which fuses the two kinds of features from low level to high level; this helps make full use of cross-modal information to form new discriminative features and improves the accuracy of stereo visual saliency prediction.
2) The downsampling blocks in the RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network use residual blocks with stride 2 instead of the max pooling layers used in prior work, which helps the model select feature information adaptively and prevents important information from being lost by the max pooling operation.
3) The RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network place dilated convolution layers before and after the residual blocks, enlarging the receptive field of the convolution kernels, which helps the constructed convolutional neural network attend more to global information and learn richer content.
Drawings
FIG. 1 is a schematic diagram of the composition of a convolutional neural network constructed by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a stereo image visual saliency detection method based on a convolutional neural network.
The specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, the depth image and the real human-eye gaze image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted {In(x, y)}, and the left viewpoint image, the depth image and the real human-eye gaze image of {In(x, y)} are recorded correspondingly, the depth image being denoted {Dn(x, y)}; wherein N is a positive integer, N ≥ 300, for example N = 600, W and H are both divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {In(x, y)}, Dn(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {Dn(x, y)}, and the pixel values of the left viewpoint image and of the real human-eye gaze image at coordinate position (x, y) are defined in the same way.
Step 1_2: construct a convolutional neural network: as shown in FIG. 1, the convolutional neural network includes an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework comprises three parts, namely an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid'.
For the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; here, the width of the left viewpoint image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; the training depth image has a width W and a height H.
For the RGB feature extraction module, the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H, whose set is denoted P1; the input end of the 1st downsampling block receives all feature maps in P1, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X1; the input end of the 2nd neural network block receives all feature maps in X1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P2; the input end of the 2nd downsampling block receives all feature maps in P2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X2; the input end of the 3rd neural network block receives all feature maps in X2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P3; the input end of the 3rd downsampling block receives all feature maps in P3, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X3; the input end of the 4th neural network block receives all feature maps in X3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P4.
For the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H, whose set is denoted P5; the input end of the 4th downsampling block receives all feature maps in P5, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X4; the input end of the 6th neural network block receives all feature maps in X4, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P6; the input end of the 5th downsampling block receives all feature maps in P6, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X5; the input end of the 7th neural network block receives all feature maps in X5, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P7; the input end of the 6th downsampling block receives all feature maps in P7, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X6; the input end of the 8th neural network block receives all feature maps in X6, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P8.
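The two extraction streams share the same pattern: four neural network blocks separated by three stride-2 downsampling blocks. The PyTorch sketch below traces the RGB stream's shapes (the framework and all names are assumptions, not from the patent); stand_in_block is a simple Conv-BN-ReLU placeholder for the patent's actual blocks, whose internals are detailed later in this embodiment, and the depth stream is obtained by changing only the number of input channels.

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout, stride=1):
    # placeholder for the patent's neural network / downsampling blocks;
    # substitute the dilated-convolution and residual modules described below
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class RGBStream(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1, self.down1 = stand_in_block(3, 64), stand_in_block(64, 64, stride=2)
        self.block2, self.down2 = stand_in_block(64, 128), stand_in_block(128, 128, stride=2)
        self.block3, self.down3 = stand_in_block(128, 256), stand_in_block(256, 256, stride=2)
        self.block4 = stand_in_block(256, 512)

    def forward(self, rgb):
        p1 = self.block1(rgb)             # 64  x W   x H
        p2 = self.block2(self.down1(p1))  # 128 x W/2 x H/2
        p3 = self.block3(self.down2(p2))  # 256 x W/4 x H/4
        p4 = self.block4(self.down3(p3))  # 512 x W/8 x H/8
        return p1, p2, p3, p4

p1, p2, p3, p4 = RGBStream()(torch.randn(1, 3, 224, 224))  # 224 x 224 is an example size
```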
For the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H, whose set is denoted P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and its output end outputs 3 feature maps with width W and height H, whose set is denoted P10; all feature maps in P9 and all feature maps in P10 undergo an Element-wise Summation operation, which outputs 3 feature maps with width W and height H, whose set is denoted E1; the input end of the 11th neural network block receives all feature maps in E1, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P11; all feature maps in P1, P5 and P11 undergo an Element-wise Summation operation, which outputs 64 feature maps with width W and height H, whose set is denoted E2; the input end of the 1st max pooling layer receives all feature maps in E2, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted Z1; the input end of the 12th neural network block receives all feature maps in Z1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P12; all feature maps in P2, P6 and P12 undergo an Element-wise Summation operation, which outputs 128 feature maps with width W/2 and height H/2, whose set is denoted E3; the input end of the 2nd max pooling layer receives all feature maps in E3, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted Z2; the input end of the 13th neural network block receives all feature maps in Z2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P13; all feature maps in P3, P7 and P13 undergo an Element-wise Summation operation, which outputs 256 feature maps with width W/4 and height H/4, whose set is denoted E4; the input end of the 3rd max pooling layer receives all feature maps in E4, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted Z3; the input end of the 14th neural network block receives all feature maps in Z3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P14; all feature maps in P4, P8 and P14 undergo an Element-wise Summation operation, which outputs 512 feature maps with width W/8 and height H/8, whose set is denoted E5; the input end of the 4th max pooling layer receives all feature maps in E5, and its output end outputs 512 feature maps with width W/16 and height H/16, whose set is denoted Z4; the input end of the 15th neural network block receives all feature maps in Z4, and its output end outputs 1024 feature maps with width W/16 and height H/16, whose set is denoted P15.
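This fusion reduces to repeated element-wise addition of same-resolution feature-map sets followed by 2 × 2 max pooling. A minimal PyTorch sketch of its forward pass follows; stand_in_block again stands in for the 9th to 15th neural network blocks described below, and the single-channel depth input is an assumption the patent does not state.

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout):
    # placeholder for the patent's 9th-15th neural network blocks
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FusionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.b9, self.b10 = stand_in_block(3, 3), stand_in_block(1, 3)
        self.b11 = stand_in_block(3, 64)
        self.b12 = stand_in_block(64, 128)
        self.b13 = stand_in_block(128, 256)
        self.b14 = stand_in_block(256, 512)
        self.b15 = stand_in_block(512, 1024)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, rgb, depth, rgb_feats, depth_feats):
        p1, p2, p3, p4 = rgb_feats      # P1-P4 from the RGB stream
        p5, p6, p7, p8 = depth_feats    # P5-P8 from the depth stream
        e1 = self.b9(rgb) + self.b10(depth)     # 3    x W    x H
        e2 = p1 + p5 + self.b11(e1)             # 64   x W    x H
        e3 = p2 + p6 + self.b12(self.pool(e2))  # 128  x W/2  x H/2
        e4 = p3 + p7 + self.b13(self.pool(e3))  # 256  x W/4  x H/4
        e5 = p4 + p8 + self.b14(self.pool(e4))  # 512  x W/8  x H/8
        return self.b15(self.pool(e5))          # 1024 x W/16 x H/16  (P15)
```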
For the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, and its output end outputs 1024 feature maps with width W/8 and height H/8, whose set is denoted S1; the input end of the 16th neural network block receives all feature maps in S1, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted P16; the input end of the 2nd upsampling layer receives all feature maps in P16, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted S2; the input end of the 17th neural network block receives all feature maps in S2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted P17; the input end of the 3rd upsampling layer receives all feature maps in P17, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted S3; the input end of the 18th neural network block receives all feature maps in S3, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted P18; the input end of the 4th upsampling layer receives all feature maps in P18, and its output end outputs 64 feature maps with width W and height H, whose set is denoted S4; the input end of the 19th neural network block receives all feature maps in S4, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P19.
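The decoding framework is therefore four rounds of bilinear 2 × upsampling, each followed by one of the 16th to 19th neural network blocks. A sketch under the same assumptions (stand_in_block replaces the three-convolution blocks described below):

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout):
    # placeholder for the patent's 16th-19th neural network blocks
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.b16 = stand_in_block(1024, 256)
        self.b17 = stand_in_block(256, 128)
        self.b18 = stand_in_block(128, 64)
        self.b19 = stand_in_block(64, 64)

    def forward(self, p15):               # p15: 1024 x W/16 x H/16
        p16 = self.b16(self.up(p15))      # 256 x W/8 x H/8
        p17 = self.b17(self.up(p16))      # 128 x W/4 x H/4
        p18 = self.b18(self.up(p17))      # 64  x W/2 x H/2
        return self.b19(self.up(p18))     # 64  x W   x H   (P19)

p19 = Decoder()(torch.randn(1, 1024, 14, 14))  # -> 1 x 64 x 224 x 224 for a 224 x 224 input
```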
For the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image; wherein the width of the saliency image is W and its height is H.
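The output layer is fully specified above and maps directly onto three layers in PyTorch (the framework choice is an assumption):

```python
import torch
import torch.nn as nn

output_layer = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),  # 64 maps from P19 -> 1 map
    nn.BatchNorm2d(1),
    nn.Sigmoid(),                                          # saliency values in [0, 1]
)

saliency = output_layer(torch.randn(1, 64, 224, 224))      # one W x H saliency image
```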
Step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image in the training set as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {In(x, y)} is recorded accordingly, its value at coordinate position (x, y) being the pixel value of the pixel at (x, y) in that saliency image.
Step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye gaze image; the loss function value between the saliency image of {In(x, y)} and its real human-eye gaze image is obtained using the mean squared error loss function.
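In other words, the training objective is the mean squared error between the predicted saliency image and the real human-eye gaze image, e.g.:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
pred = torch.rand(1, 1, 224, 224)   # saliency image output by the network
gaze = torch.rand(1, 1, 224, 224)   # real human-eye gaze (fixation) image
loss = criterion(pred, gaze)        # loss function value for this image
```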
Step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model and N × V loss function values; then find the smallest of the N × V loss function values; the weight vector and bias term corresponding to that smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1, for example V = 50.
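A sketch of this training loop is given below; the optimizer, learning rate and data-loader interface are illustrative assumptions, since the patent only specifies the number of repetitions V and the selection of the minimum-loss weights.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, V=50, lr=1e-4):
    """Train for V passes over the N training images and keep the weights
    that produced the smallest of the N*V loss values (Wbest / bbest)."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    best_loss, best_state = float('inf'), None
    for _ in range(V):
        for rgb, depth, gaze in loader:        # assumed (rgb, depth, gaze) batches
            pred = model(rgb, depth)
            loss = criterion(pred, gaze)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:        # minimum of the N*V loss values
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)          # optimal weights and bias terms
    return model
```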
The test stage process comprises the following specific steps:
Step 2_1: let the stereo image to be tested have width W' and height H', and record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel values of the stereo image to be tested, of its left viewpoint image and of its depth image at coordinate position (x', y') are defined in the same way as in the training stage.
Step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model, and predict using Wbest and bbest to obtain the saliency prediction image of the stereo image to be tested, whose value at coordinate position (x', y') is the pixel value of the pixel at (x', y') in that saliency prediction image.
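The test stage is then a single forward pass through the trained model with its optimal weights; a sketch under the same assumptions:

```python
import torch

@torch.no_grad()
def predict(model, rgb, depth):
    """rgb: 1 x 3 x H' x W' left viewpoint image, depth: 1 x 1 x H' x W' depth image;
    returns the saliency prediction image of the stereo image under test."""
    model.eval()
    return model(rgb, depth)  # values in [0, 1] thanks to the Sigmoid output layer
```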
In this embodiment, in step 1_2, the 1st to 8th neural network blocks have the same structure and each consists of a first dilated convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, where the input end of the first dilated convolution layer is the input end of the neural network block it belongs to, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the first and second dilated convolution layers is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks the convolution kernel sizes of the first and second dilated convolution layers are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2, and the activation mode of the second activation layer is 'ReLU'.
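A minimal PyTorch rendering of these 1st to 8th neural network blocks is sketched below (class names are illustrative); the bottleneck residual block follows the structure given earlier, with an identity skip since the input and output channel counts match at stride 1.

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """1x1 / 3x3 / 1x1 convolutions with batch norm and an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class DilatedBlock(nn.Module):
    """Dilated conv -> BN -> ReLU -> residual block -> dilated conv -> BN,
    i.e. the structure of the 1st-8th neural network blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.res = BottleneckResidual(out_ch)
        self.tail = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.tail(self.res(self.head(x)))

# e.g. the 1st neural network block: 3-channel RGB input, 64 output feature maps
block1 = DilatedBlock(3, 64)
print(block1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 224, 224])
```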
The 9th and 10th neural network blocks have the same structure, each consisting of a second convolution layer followed by a fourth batch normalization layer: the input of the second convolution layer is the input of the neural network block, the fourth batch normalization layer receives all feature maps output by the second convolution layer, and its output is the output of the neural network block. In each of the 9th and 10th blocks the second convolution layer has 3 convolution kernels of size 7 × 7, stride 1 and padding 3.
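A corresponding sketch for the 9th/10th blocks, with in_channels an assumption depending on the input branch, could be:

import torch.nn as nn

# Sketch of the 9th/10th neural network blocks: a single 7x7 convolution with
# 3 kernels (stride 1, padding 3) followed by batch normalization.
def make_block_9_10(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 3, kernel_size=7, stride=1, padding=3),  # second conv layer
        nn.BatchNorm2d(3),                                              # fourth batch normalization layer
    )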
The 11th and 12th neural network blocks have the same structure, each consisting of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence: the input of the third convolution layer is the input of the neural network block; the fifth batch normalization layer receives all feature maps output by the third convolution layer; the third activation layer receives all feature maps output by the fifth batch normalization layer; the fourth convolution layer receives all feature maps output by the third activation layer; the sixth batch normalization layer receives all feature maps output by the fourth convolution layer, and its output is the output of the neural network block. The third and fourth convolution layers have 64 convolution kernels each in the 11th block and 128 each in the 12th block; in both blocks their kernel size is 3 × 3, stride 1 and padding 1, and the activation mode of the third activation layer is "ReLU".
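The 11th/12th blocks can be sketched in the same illustrative style:

import torch.nn as nn

# Sketch of the 11th/12th neural network blocks: conv -> BN -> ReLU -> conv -> BN,
# with `channels` = 64 for the 11th block and 128 for the 12th block.
def make_block_11_12(in_channels, channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, channels, kernel_size=3, stride=1, padding=1),  # third conv layer
        nn.BatchNorm2d(channels),                                              # fifth batch normalization layer
        nn.ReLU(inplace=True),                                                 # third activation layer
        nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),     # fourth conv layer
        nn.BatchNorm2d(channels),                                              # sixth batch normalization layer
    )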
The 13th to 19th neural network blocks have the same structure, each consisting of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence: the input of the fifth convolution layer is the input of the neural network block; each batch normalization layer receives all feature maps output by the convolution layer preceding it, each activation layer receives all feature maps output by the batch normalization layer preceding it, and each subsequent convolution layer receives all feature maps output by the activation layer preceding it; the output of the ninth batch normalization layer is the output of the neural network block. The numbers of convolution kernels of the fifth, sixth and seventh convolution layers are 256, 256, 256 in the 13th block; 512, 512, 512 in the 14th block; 1024, 1024, 1024 in the 15th block; 512, 512, 256 in the 16th block; 256, 256, 128 in the 17th block; 128, 128, 64 in the 18th block; and 64, 64, 64 in the 19th block. In all of the 13th to 19th blocks the convolution kernel sizes are 3 × 3, the strides are 1 and the paddings are 1, and the activation modes of the fourth and fifth activation layers are "ReLU".
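A sketch covering the 13th to 19th blocks, parameterized by the three kernel counts listed above, could read:

import torch.nn as nn

# Sketch of the 13th-19th neural network blocks: three 3x3 conv layers, each followed
# by batch normalization, with ReLU after the first two. (c5, c6, c7) are the kernel
# counts of the fifth/sixth/seventh conv layers, e.g. (512, 512, 256) for the 16th block.
def make_block_13_19(in_channels, c5, c6, c7):
    return nn.Sequential(
        nn.Conv2d(in_channels, c5, kernel_size=3, stride=1, padding=1),  # fifth conv layer
        nn.BatchNorm2d(c5),                                              # seventh batch normalization layer
        nn.ReLU(inplace=True),                                           # fourth activation layer
        nn.Conv2d(c5, c6, kernel_size=3, stride=1, padding=1),           # sixth conv layer
        nn.BatchNorm2d(c6),                                              # eighth batch normalization layer
        nn.ReLU(inplace=True),                                           # fifth activation layer
        nn.Conv2d(c6, c7, kernel_size=3, stride=1, padding=1),           # seventh conv layer
        nn.BatchNorm2d(c7),                                              # ninth batch normalization layer
    )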
In this embodiment, in step 1_2, the 1st to 6th downsampling blocks have the same structure, each consisting of a second residual block whose input is the input of the downsampling block and whose output is the output of the downsampling block.
In this embodiment, the first residual block and the second residual block have the same structure, each comprising 3 convolution layers, 3 batch normalization layers and 3 activation layers: the input of the 1st convolution layer is the input of the residual block; the 1st batch normalization layer receives all feature maps output by the 1st convolution layer; the 1st activation layer receives all feature maps output by the 1st batch normalization layer; the 2nd convolution layer receives all feature maps output by the 1st activation layer; the 2nd batch normalization layer receives all feature maps output by the 2nd convolution layer; the 2nd activation layer receives all feature maps output by the 2nd batch normalization layer; the 3rd convolution layer receives all feature maps output by the 2nd activation layer; the 3rd batch normalization layer receives all feature maps output by the 3rd convolution layer; all feature maps received at the input of the 1st convolution layer are added to all feature maps output by the 3rd batch normalization layer, and after passing through the 3rd activation layer the result is the output of the residual block. The number of convolution kernels of every convolution layer in the first residual block is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in all of the 1st to 8th neural network blocks, the 1st and 3rd convolution layers of the first residual block have 1 × 1 kernels and stride 1, and the 2nd convolution layer has 3 × 3 kernels, stride 1 and padding 1. The number of convolution kernels of every convolution layer in the second residual block is 64 in each of the 1st and 4th downsampling blocks, 128 in each of the 2nd and 5th downsampling blocks, and 256 in each of the 3rd and 6th downsampling blocks; in all of the 1st to 6th downsampling blocks, the 1st and 3rd convolution layers of the second residual block have 1 × 1 kernels and stride 1, and the 2nd convolution layer has 3 × 3 kernels, stride 2 and padding 1. The activation mode of all 3 activation layers is "ReLU".
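A hedged sketch of the stride-1 (first) residual block follows; the class name is illustrative. The downsampling (second) residual block is described as identical except for stride 2 in the middle convolution, whose matching of the skip connection the text leaves implicit, so it is not shown here.

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the first residual block: 1x1 conv -> BN -> ReLU -> 3x3 conv -> BN
    -> ReLU -> 1x1 conv -> BN, with the block input added to the last BN output and
    the sum passed through a final ReLU. All three convs use `channels` kernels."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),             # 1st conv layer (1x1)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),  # 2nd conv layer (3x3)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),             # 3rd conv layer (1x1)
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)  # 3rd activation layer, applied after the skip addition

    def forward(self, x):
        return self.relu(x + self.body(x))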
In this embodiment, in step 1_2, the pooling windows of the 1st to 4th max pooling layers are all of size 2 × 2 and the strides are all 2.
In this embodiment, in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are bilinear interpolation, and the scaling factors are all 2.
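These two layers correspond directly to standard PyTorch modules; the align_corners setting below is an assumption not specified in the text.

import torch.nn as nn

# Max pooling and upsampling used in the feature fusion and decoding frameworks:
# each max pooling layer halves the spatial size (2x2 window, stride 2) and each
# upsampling layer doubles it (bilinear interpolation, scale factor 2).
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)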
To verify the feasibility and effectiveness of the method of the invention, experiments were performed.
Here, the accuracy and stability of the method of the present invention were analyzed on the three-dimensional eye-tracking database NCTU-3DFixation provided by National Chiao Tung University. Four objective parameters commonly used to evaluate visual saliency extraction methods were adopted as evaluation indexes: the Linear Correlation Coefficient (CC), the Kullback-Leibler Divergence (KLD), the Area Under the ROC Curve (AUC), and the Normalized Scanpath Saliency (NSS).
The method of the invention is used to obtain the saliency predicted image of each stereo image in this eye-tracking database, which is then compared with the subjective visual saliency map of that stereo image, i.e. its real human-eye fixation image (provided in the database). Higher CC, AUC and NSS values and a lower KLD value indicate better consistency between the saliency predicted image obtained by the method and the subjective visual saliency map. The CC, KLD, AUC and NSS values reflecting the saliency extraction performance of the method of the invention are listed in Table 1; a minimal sketch of the CC, KLD and NSS computations is given after Table 1.
TABLE 1  Accuracy and stability of the saliency predicted images obtained by the method of the invention with respect to the subjective visual saliency maps

Performance index      CC        KLD       AUC (Borji)    NSS
Value                  0.7583    0.4868    0.8789         2.0692
As can be seen from the data listed in Table 1, the consistency between the saliency predicted images obtained by the method of the invention and the subjective visual saliency maps is good in terms of both accuracy and stability, indicating that the objective detection results agree well with subjective human perception, which is sufficient to demonstrate the feasibility and effectiveness of the method of the invention.
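For reference, the following sketch gives standard formulations of three of the four evaluation indexes (CC, KLD and NSS); the exact normalizations and the AUC (Borji) computation used in the experiments are not spelled out here, so the definitions below follow common saliency-benchmark practice and are not taken from the patent.

import numpy as np

# `pred` is the predicted saliency map, `fix_map` a continuous fixation density map,
# `fix_pts` a binary map of fixation locations; all are 2-D numpy arrays.

def cc(pred, fix_map):
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    f = (fix_map - fix_map.mean()) / (fix_map.std() + 1e-12)
    return float((p * f).mean())                             # linear correlation coefficient

def kld(pred, fix_map, eps=1e-12):
    p = pred / (pred.sum() + eps)
    f = fix_map / (fix_map.sum() + eps)
    return float((f * np.log(eps + f / (p + eps))).sum())    # Kullback-Leibler divergence

def nss(pred, fix_pts):
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(p[fix_pts > 0].mean())                      # normalized scanpath saliency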

Claims (6)

1. A stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then form a training set from all the selected original stereo images together with their left viewpoint images, depth images and real human-eye fixation images, and denote the nth original stereo image in the training set as {I_n(x,y)}, its depth image as {D_n(x,y)}, and its left viewpoint image and real human-eye fixation image correspondingly; where N is a positive integer with N ≥ 300, W and H are both evenly divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, I_n(x,y) denotes the pixel value of the pixel at coordinate (x,y) in {I_n(x,y)}, D_n(x,y) denotes the pixel value of the pixel at coordinate (x,y) in {D_n(x,y)}, and the corresponding notations for the left viewpoint image and the real human-eye fixation image denote their pixel values at coordinate (x,y);
step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises a coding framework and a decoding framework; the coding framework comprises an RGB feature extraction module, a depth feature extraction module and a feature fusion module, wherein the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, and the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers; the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, wherein the first convolution layer has 1 convolution kernel of size 3 × 3, stride 1 and padding 1, and the activation mode of the first activation layer is "Sigmoid";
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input of the 1st neural network block receives the training left viewpoint image output by the RGB map input layer, and its output is 64 feature maps of width W and height H, whose set is denoted P1; the 1st downsampling block receives all feature maps in P1 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted X1; the 2nd neural network block receives all feature maps in X1 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P2; the 2nd downsampling block receives all feature maps in P2 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted X2; the 3rd neural network block receives all feature maps in X2 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P3; the 3rd downsampling block receives all feature maps in P3 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted X3; the 4th neural network block receives all feature maps in X3 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P4;
for the depth feature extraction module, the input of the 5th neural network block receives the training depth image output by the depth map input layer, and its output is 64 feature maps of width W and height H, whose set is denoted P5; the 4th downsampling block receives all feature maps in P5 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted X4; the 6th neural network block receives all feature maps in X4 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P6; the 5th downsampling block receives all feature maps in P6 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted X5; the 7th neural network block receives all feature maps in X5 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P7; the 6th downsampling block receives all feature maps in P7 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted X6; the 8th neural network block receives all feature maps in X6 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P8;
for the feature fusion module, the input of the 9th neural network block receives the training left viewpoint image output by the RGB map input layer, and its output is 3 feature maps of width W and height H, whose set is denoted P9; the input of the 10th neural network block receives the training depth image output by the depth map input layer, and its output is 3 feature maps of width W and height H, whose set is denoted P10; an Element-wise Summation operation is performed on all feature maps in P9 and P10, yielding 3 feature maps of width W and height H, whose set is denoted E1; the 11th neural network block receives all feature maps in E1 and outputs 64 feature maps of width W and height H, whose set is denoted P11; an Element-wise Summation operation on all feature maps in P1, P5 and P11 yields 64 feature maps of width W and height H, whose set is denoted E2; the 1st max pooling layer receives all feature maps in E2 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted Z1; the 12th neural network block receives all feature maps in Z1 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P12; an Element-wise Summation operation on all feature maps in P2, P6 and P12 yields 128 feature maps of width W/2 and height H/2, whose set is denoted E3; the 2nd max pooling layer receives all feature maps in E3 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted Z2; the 13th neural network block receives all feature maps in Z2 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P13; an Element-wise Summation operation on all feature maps in P3, P7 and P13 yields 256 feature maps of width W/4 and height H/4, whose set is denoted E4; the 3rd max pooling layer receives all feature maps in E4 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted Z3; the 14th neural network block receives all feature maps in Z3 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P14; an Element-wise Summation operation on all feature maps in P4, P8 and P14 yields 512 feature maps of width W/8 and height H/8, whose set is denoted E5; the 4th max pooling layer receives all feature maps in E5 and outputs 512 feature maps of width W/16 and height H/16, whose set is denoted Z4; the 15th neural network block receives all feature maps in Z4 and outputs 1024 feature maps of width W/16 and height H/16, whose set is denoted P15;
for the decoding framework, the 1st upsampling layer receives all feature maps in P15 and outputs 1024 feature maps of width W/8 and height H/8, whose set is denoted S1; the 16th neural network block receives all feature maps in S1 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted P16; the 2nd upsampling layer receives all feature maps in P16 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted S2; the 17th neural network block receives all feature maps in S2 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted P17; the 3rd upsampling layer receives all feature maps in P17 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted S3; the 18th neural network block receives all feature maps in S3 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted P18; the 4th upsampling layer receives all feature maps in P18 and outputs 64 feature maps of width W and height H, whose set is denoted S4; the 19th neural network block receives all feature maps in S4 and outputs 64 feature maps of width W and height H, whose set is denoted P19;
for the output layer, the input of the first convolution layer receives all feature maps in P19, and its output is one feature map of width W and height H; the first batch normalization layer receives the feature map output by the first convolution layer; the first activation layer receives the feature map output by the first batch normalization layer; the output of the first activation layer is the saliency image of the stereo image corresponding to the training left viewpoint image, with width W and height H;
step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {I_n(x,y)} is denoted correspondingly, its value at coordinate (x,y) being the pixel value of the corresponding pixel;
step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye fixation image; the loss function value between the saliency image of {I_n(x,y)} and its real human-eye fixation image is obtained using the mean squared error loss function;
step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model, yielding N × V loss function values; then find the minimum among these N × V loss function values; then take the weight vector and bias term corresponding to that minimum loss value as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted W_best and b_best respectively; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: let the stereo image to be tested have width W′ and height H′, and denote it together with its left viewpoint image and depth image accordingly; wherein 1 ≤ x′ ≤ W′ and 1 ≤ y′ ≤ H′, and the corresponding notations denote the pixel values at coordinate (x′, y′) in the stereo image to be tested, in its left viewpoint image and in its depth image, respectively;
step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model and predict with W_best and b_best, obtaining the saliency predicted image of the stereo image to be tested, whose value at coordinate (x′, y′) is the pixel value of the corresponding pixel.
2. The method according to claim 1, wherein in step 1_2, the 1 st to 8 th neural network blocks have the same structure and are composed of a first hole convolution layer, a second normalization layer, a second activation layer, a first residual block, a second hole convolution layer, and a third normalization layer, which are sequentially arranged, wherein an input end of the first hole convolution layer is an input end of the neural network block where the first hole convolution layer is located, an input end of the second normalization layer receives all feature maps output by an output end of the first hole convolution layer, an input end of the second activation layer receives all feature maps output by an output end of the second normalization layer, an input end of the first residual block receives all feature maps output by an output end of the second activation layer, and an input end of the second hole convolution layer receives all feature maps output by an output end of the first residual block, the input end of the third batch of normalization layers receives all characteristic graphs output by the output end of the second cavity convolution layer, and the output end of the third batch of normalization layers is the output end of the neural network block where the third batch of normalization layers is located; wherein, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st to 8 th neural network blocks are both 3 × 3 and steps are both 1, the holes are all 2, the fillings are all 2, and the activation modes of the second activation layers in the 1 st to 8 th neural network blocks are all 'ReLU';
the 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3;
the 11th and 12th neural network blocks have the same structure, each consisting of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 11th neural network block is 64, the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 12th neural network block is 128, the convolution kernel sizes of the third and fourth convolution layers in both the 11th and 12th neural network blocks are 3 × 3, the strides are all 1 and the paddings are all 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU";
the 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
3. The method for detecting visual saliency of stereoscopic images based on convolutional neural network as claimed in claim 2, wherein in step 1_2, the 1 st to 6 th downsampling blocks have the same structure and are composed of the second residual block, the input end of the second residual block is the input end of the downsampling block where it is located, and the output end of the second residual block is the output end of the downsampling block where it is located.
4. The method according to claim 3, wherein the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, an input of a 1 st convolutional layer is an input of the residual block, an input of a 1 st batch normalization layer receives all feature maps output by an output of the 1 st convolutional layer, an input of a 1 st active layer receives all feature maps output by an output of the 1 st batch normalization layer, an input of a 2 nd convolutional layer receives all feature maps output by an output of the 1 st active layer, an input of a 2 nd batch normalization layer receives all feature maps output by an output of the 2 nd convolutional layer, an input of a 2 nd active layer receives all feature maps output by an output of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolutional layer, all the feature maps received by the input end of the 1 st convolutional layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and all the feature maps output by the output end of the 3 rd active layer after passing through the 3 rd active layer are used as all the feature maps output by the output end of the residual block; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 filling, and the activation modes of the 3 activation layers are both "ReLU".
5. The method for detecting the visual saliency of stereoscopic images based on convolutional neural network as claimed in any one of claims 1 to 4, wherein in step 1_2, the sizes of the pooling windows of the 1 st to 4 th maximum pooling layers are all 2 x 2 and the steps are all 2.
6. The method for detecting visual saliency of stereoscopic images based on a convolutional neural network as claimed in claim 5, wherein in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are all bilinear interpolation, and the scaling factor is all 2.
CN201910327556.4A 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network Active CN110175986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN110175986A true CN110175986A (en) 2019-08-27
CN110175986B CN110175986B (en) 2021-01-08

Family

ID=67689881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327556.4A Active CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN110175986B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110782458A (en) * 2019-10-23 2020-02-11 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN111369506A (en) * 2020-02-26 2020-07-03 四川大学 Lens turbidity grading method based on eye B-ultrasonic image
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111612832A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN112528900A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on extreme down-sampling
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
WO2021096806A1 (en) * 2019-11-14 2021-05-20 Zoox, Inc Depth data model training with upsampling, losses, and loss balancing
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN113592795A (en) * 2021-07-19 2021-11-02 深圳大学 Visual saliency detection method of stereoscopic image, thumbnail generation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
US20170351941A1 (en) * 2016-06-03 2017-12-07 Miovision Technologies Incorporated System and Method for Performing Saliency Detection Using Deep Active Contours
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170351941A1 (en) * 2016-06-03 2017-12-07 Miovision Technologies Incorporated System and Method for Performing Saliency Detection Using Deep Active Contours
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN, HAO et al.: "RGB-D Saliency Detection by Multi-stream Late Fusion Network", COMPUTER VISION SYSTEMS *
XINGYU CAI et al.: "Saliency detection for stereoscopic 3D images in the quaternion frequency domain", 3D RESEARCH *
LI, Rong et al.: "Saliency region prediction method using convolutional neural networks", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110555434B (en) * 2019-09-03 2022-03-29 浙江科技学院 Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110782458A (en) * 2019-10-23 2020-02-11 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110782458B (en) * 2019-10-23 2022-05-31 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
US11681046B2 (en) 2019-11-14 2023-06-20 Zoox, Inc. Depth data model training with upsampling, losses and loss balancing
WO2021096806A1 (en) * 2019-11-14 2021-05-20 Zoox, Inc Depth data model training with upsampling, losses, and loss balancing
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN111369506A (en) * 2020-02-26 2020-07-03 四川大学 Lens turbidity grading method based on eye B-ultrasonic image
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
CN111612832A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN111612832B (en) * 2020-04-29 2023-04-18 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528900B (en) * 2020-12-17 2022-09-16 南开大学 Image salient object detection method and system based on extreme down-sampling
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528900A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on extreme down-sampling
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113592795A (en) * 2021-07-19 2021-11-02 深圳大学 Visual saliency detection method of stereoscopic image, thumbnail generation method and device
CN113592795B (en) * 2021-07-19 2024-04-12 深圳大学 Visual saliency detection method for stereoscopic image, thumbnail generation method and device

Also Published As

Publication number Publication date
CN110175986B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN108520535B (en) Object classification method based on depth recovery information
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN110032926B (en) Video classification method and device based on deep learning
CN107154023B (en) Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN111563418A (en) Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN110210492B (en) Stereo image visual saliency detection method based on deep learning
CN112734915A (en) Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
Wei et al. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN110458178A (en) The multi-modal RGB-D conspicuousness object detection method spliced more
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
Luo et al. Bi-GANs-ST for perceptual image super-resolution
CN107909565A (en) Stereo-picture Comfort Evaluation method based on convolutional neural networks
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant