CN110175986A - Stereo image visual saliency detection method based on a convolutional neural network - Google Patents

Stereo image visual saliency detection method based on a convolutional neural network

Info

Publication number
CN110175986A
Authority
CN
China
Prior art keywords
layer
output
neural network
input
feature maps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910327556.4A
Other languages
Chinese (zh)
Other versions
CN110175986B (en)
Inventor
周武杰
吕营
雷景生
张伟
何成
王海江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910327556.4A priority Critical patent/CN110175986B/en
Publication of CN110175986A publication Critical patent/CN110175986A/en
Application granted granted Critical
Publication of CN110175986B publication Critical patent/CN110175986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo image visual saliency detection method based on a convolutional neural network. A convolutional neural network is constructed that comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework is composed of an RGB feature extraction module, a depth feature extraction module and a feature fusion module. The left viewpoint image and the depth image of every stereo image in the training set are input into the convolutional neural network for training, and the saliency image of every stereo image in the training set is obtained; the loss function value between the saliency image and the real human-eye gaze image of every stereo image in the training set is calculated, and after repeating this process several times a convolutional neural network training model is obtained. The left viewpoint image and the depth image of the stereo image to be tested are then input into the convolutional neural network training model, and a saliency prediction image is obtained by prediction. The advantage is a higher visual saliency detection accuracy.

Description

Stereo image visual saliency detection method based on convolutional neural network
Technical Field
The invention relates to a visual saliency detection technology, in particular to a stereo image visual saliency detection method based on a convolutional neural network.
Background
Visual saliency has in recent years become a popular research topic in fields such as neuroscience, robotics and computer vision. Research on visual saliency detection falls into two broad categories: eye fixation prediction and salient object detection. The former predicts the points a person fixates on when viewing a natural scene, while the latter accurately extracts the objects of interest. In general, visual saliency detection algorithms can be divided into top-down and bottom-up approaches. Top-down approaches are task driven and require supervised learning, whereas bottom-up methods typically use low-level cues such as color features, distance features and heuristic saliency features. One of the most common heuristic saliency features is contrast, e.g. pixel-based or block-based contrast. Earlier research on visual saliency detection focused on two-dimensional images. However, three-dimensional data is better suited to practical applications than two-dimensional data, and as visual scenes become more complex, two-dimensional data alone is no longer sufficient for extracting salient objects. In recent years, with the progress of three-dimensional data acquisition technologies such as Time-of-Flight sensors and Microsoft Kinect, depth data has become easy to capture; it is independent of lighting and provides geometric cues that improve the discrimination between different objects with similar appearance and the prediction of visual saliency. Because of the complementarity of RGB data and depth data, many methods have been proposed that combine RGB images with their paired depth images for visual saliency detection. Previous work mainly used domain-specific prior knowledge to construct low-level saliency features, for example the observation that humans tend to focus on closer objects, but such observations are difficult to generalize to all scenes. In most previous work, the multi-modal fusion problem was addressed either by directly concatenating the RGB-D channels or by processing each modality independently and then combining the decisions of the two modalities. Although these strategies bring clear improvements, they struggle to fully exploit cross-modal complementarity. More recently, with the success of convolutional neural networks (CNNs) in learning discriminative features from RGB data, more and more work uses CNNs to learn stronger RGB-D representations through efficient multi-modal combination. Most of this work is based on a two-stream architecture in which RGB data and depth data are learned in separate bottom-up streams and their features are jointly inferred at an early or late stage. As the most popular solution, the two-stream architecture achieves a significant improvement over work based on hand-crafted RGB-D features, yet a critical issue remains: how to effectively exploit multi-modal complementary information in the bottom-up process. Further research on RGB-D image visual saliency detection is therefore necessary to improve the accuracy of visual saliency detection.
Disclosure of Invention
The invention aims to provide a stereo image visual saliency detection method based on a convolutional neural network, which has higher visual saliency detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows: a stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, the depth image and the real human-eye gaze image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted {In(x, y)}, and the left viewpoint image, the depth image and the real human-eye gaze image of {In(x, y)} are recorded correspondingly, the depth image being denoted {Dn(x, y)}; wherein N is a positive integer, N ≥ 300, W and H are both divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {In(x, y)}, Dn(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {Dn(x, y)}, and the pixel values of the left viewpoint image and of the real human-eye gaze image at coordinate position (x, y) are defined in the same way;
step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework is composed of an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid';
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H, whose set is denoted P1; the input end of the 1st downsampling block receives all feature maps in P1, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X1; the input end of the 2nd neural network block receives all feature maps in X1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P2; the input end of the 2nd downsampling block receives all feature maps in P2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X2; the input end of the 3rd neural network block receives all feature maps in X2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P3; the input end of the 3rd downsampling block receives all feature maps in P3, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X3; the input end of the 4th neural network block receives all feature maps in X3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P4;
for the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H, whose set is denoted P5; the input end of the 4th downsampling block receives all feature maps in P5, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X4; the input end of the 6th neural network block receives all feature maps in X4, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P6; the input end of the 5th downsampling block receives all feature maps in P6, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X5; the input end of the 7th neural network block receives all feature maps in X5, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P7; the input end of the 6th downsampling block receives all feature maps in P7, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X6; the input end of the 8th neural network block receives all feature maps in X6, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P8;
for the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H, whose set is denoted P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and its output end outputs 3 feature maps with width W and height H, whose set is denoted P10; all feature maps in P9 and all feature maps in P10 undergo an Element-wise Summation operation, which outputs 3 feature maps with width W and height H, whose set is denoted E1; the input end of the 11th neural network block receives all feature maps in E1, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P11; all feature maps in P1, P5 and P11 undergo an Element-wise Summation operation, which outputs 64 feature maps with width W and height H, whose set is denoted E2; the input end of the 1st max pooling layer receives all feature maps in E2, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted Z1; the input end of the 12th neural network block receives all feature maps in Z1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P12; all feature maps in P2, P6 and P12 undergo an Element-wise Summation operation, which outputs 128 feature maps with width W/2 and height H/2, whose set is denoted E3; the input end of the 2nd max pooling layer receives all feature maps in E3, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted Z2; the input end of the 13th neural network block receives all feature maps in Z2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P13; all feature maps in P3, P7 and P13 undergo an Element-wise Summation operation, which outputs 256 feature maps with width W/4 and height H/4, whose set is denoted E4; the input end of the 3rd max pooling layer receives all feature maps in E4, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted Z3; the input end of the 14th neural network block receives all feature maps in Z3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P14; all feature maps in P4, P8 and P14 undergo an Element-wise Summation operation, which outputs 512 feature maps with width W/8 and height H/8, whose set is denoted E5; the input end of the 4th max pooling layer receives all feature maps in E5, and its output end outputs 512 feature maps with width W/16 and height H/16, whose set is denoted Z4; the input end of the 15th neural network block receives all feature maps in Z4, and its output end outputs 1024 feature maps with width W/16 and height H/16, whose set is denoted P15;
for the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, and its output end outputs 1024 feature maps with width W/8 and height H/8, whose set is denoted S1; the input end of the 16th neural network block receives all feature maps in S1, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted P16; the input end of the 2nd upsampling layer receives all feature maps in P16, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted S2; the input end of the 17th neural network block receives all feature maps in S2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted P17; the input end of the 3rd upsampling layer receives all feature maps in P17, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted S3; the input end of the 18th neural network block receives all feature maps in S3, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted P18; the input end of the 4th upsampling layer receives all feature maps in P18, and its output end outputs 64 feature maps with width W and height H, whose set is denoted S4; the input end of the 19th neural network block receives all feature maps in S4, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P19;
for the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image; wherein the width of the saliency image is W and its height is H;
step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image in the training set as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {In(x, y)} is recorded accordingly, its value at coordinate position (x, y) being the pixel value of the pixel at (x, y) in that saliency image;
step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye gaze image; the loss function value between the saliency image of {In(x, y)} and its real human-eye gaze image is obtained using the mean squared error loss function;
step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model and N × V loss function values; then find the smallest of the N × V loss function values; the weight vector and bias term corresponding to that smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: let the stereo image to be tested have width W' and height H', and record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel values of the stereo image to be tested, of its left viewpoint image and of its depth image at coordinate position (x', y') are defined in the same way as in the training stage;
step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model, and predict using Wbest and bbest to obtain the saliency prediction image of the stereo image to be tested, whose value at coordinate position (x', y') is the pixel value of the pixel at (x', y') in that saliency prediction image.
In step 1_2, the 1st to 8th neural network blocks have the same structure and each consists of a first dilated convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, where the input end of the first dilated convolution layer is the input end of the neural network block it belongs to, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the first and second dilated convolution layers is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks the convolution kernel sizes of the first and second dilated convolution layers are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2, and the activation mode of the second activation layer is 'ReLU';
the 9th and 10th neural network blocks have the same structure and each consists of a second convolution layer and a fourth batch normalization layer arranged in sequence, where the input end of the second convolution layer is the input end of the neural network block it belongs to, the input end of the fourth batch normalization layer receives all feature maps output by the output end of the second convolution layer, and the output end of the fourth batch normalization layer is the output end of the neural network block it belongs to; in each of the 9th and 10th neural network blocks the number of convolution kernels of the second convolution layer is 3, the convolution kernel size is 7 × 7, the stride is 1 and the padding is 3;
the 11th and 12th neural network blocks have the same structure and each consists of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, where the input end of the third convolution layer is the input end of the neural network block it belongs to, the input end of the fifth batch normalization layer receives all feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the third and fourth convolution layers is 64 in the 11th neural network block and 128 in the 12th neural network block; in both blocks the convolution kernel sizes of the third and fourth convolution layers are 3 × 3, the strides are 1 and the paddings are 1, and the activation mode of the third activation layer is 'ReLU';
the 13th to 19th neural network blocks have the same structure and each consists of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence, where the input end of the fifth convolution layer is the input end of the neural network block it belongs to, each subsequent layer receives all feature maps output by the output end of the layer before it, and the output end of the ninth batch normalization layer is the output end of the neural network block it belongs to; the numbers of convolution kernels of the fifth, sixth and seventh convolution layers are 256, 256 and 256 in the 13th neural network block, 512, 512 and 512 in the 14th block, 1024, 1024 and 1024 in the 15th block, 512, 512 and 256 in the 16th block, 256, 256 and 128 in the 17th block, 128, 128 and 64 in the 18th block, and 64, 64 and 64 in the 19th block; in each of the 13th to 19th neural network blocks the convolution kernel sizes of the fifth, sixth and seventh convolution layers are 3 × 3, the strides are 1 and the paddings are 1, and the activation modes of the fourth and fifth activation layers are 'ReLU'.
In step 1_2, the 1st to 6th downsampling blocks have the same structure and each consists of a second residual block, whose input end is the input end of the downsampling block it belongs to and whose output end is the output end of that downsampling block.
The first residual block and the second residual block have the same structure and each comprises 3 convolution layers, 3 batch normalization layers and 3 activation layers, where the input end of the 1st convolution layer is the input end of the residual block it belongs to; the 1st batch normalization layer, the 1st activation layer, the 2nd convolution layer, the 2nd batch normalization layer, the 2nd activation layer, the 3rd convolution layer and the 3rd batch normalization layer are connected in sequence, each receiving all feature maps output by the layer before it; all feature maps received by the input end of the 1st convolution layer are added to all feature maps output by the output end of the 3rd batch normalization layer, and the sum, after passing through the 3rd activation layer, forms all feature maps output by the output end of the residual block. The number of convolution kernels of every convolution layer in the first residual block is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks, the convolution kernel sizes of the 1st and 3rd convolution layers in the first residual block are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 1 and padding 1. The number of convolution kernels of every convolution layer in the second residual block is 64 in each of the 1st and 4th downsampling blocks, 128 in each of the 2nd and 5th downsampling blocks, and 256 in each of the 3rd and 6th downsampling blocks; in each of the 1st to 6th downsampling blocks, the convolution kernel sizes of the 1st and 3rd convolution layers in the second residual block are 1 × 1 with stride 1, and the convolution kernel size of the 2nd convolution layer is 3 × 3 with stride 2 and padding 1. The activation modes of the 3 activation layers are all "ReLU".
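For reference, the residual block just described can be written compactly in a deep-learning framework. The following is a minimal PyTorch sketch (the patent does not name a framework, and the class name BottleneckResidual is illustrative); the strided 1 × 1 projection on the skip path of the stride-2 variant is an added assumption needed to keep the element-wise addition shape-consistent, since the text leaves it implicit.

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions, each followed by batch normalization,
    with a skip connection added before the final ReLU."""
    def __init__(self, channels: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
        )
        # Assumption: when stride=2 (the second residual block used inside the
        # downsampling blocks) the skip path must also be downsampled so the
        # element-wise addition is shape-consistent; a strided 1x1 projection
        # is used here, which the patent text does not spell out.
        self.skip = (nn.Identity() if stride == 1 else
                     nn.Sequential(nn.Conv2d(channels, channels, 1, stride=stride),
                                   nn.BatchNorm2d(channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# e.g. BottleneckResidual(64, stride=2) corresponds to the second residual
# block of the 1st or 4th downsampling block.
down1 = BottleneckResidual(64, stride=2)
print(down1(torch.randn(1, 64, 224, 224)).shape)  # torch.Size([1, 64, 112, 112])
```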
In step 1_2, the pooling window sizes of the 1st to 4th max pooling layers are all 2 × 2 and the strides are all 2.
In step 1_2, the sampling mode of the 1st to 4th upsampling layers is bilinear interpolation and the scale factors are all 2.
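Under the same assumptions, these fixed-configuration layers are one-liners in PyTorch; the downsampling blocks themselves are simply the stride-2 residual block sketched above.

```python
import torch.nn as nn

# 1st-4th max pooling layers: 2 x 2 pooling window, stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# 1st-4th upsampling layers: bilinear interpolation, scale factor 2
# (align_corners=False is an assumption the patent does not state)
upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
```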
Compared with the prior art, the invention has the following advantages:
1) In the encoding framework of the constructed convolutional neural network, the method trains a separate module for the RGB image and for the depth image (namely the RGB feature extraction module and the depth feature extraction module) to learn RGB and depth features at different levels, and provides a module dedicated to fusing them, namely the feature fusion module, which fuses the two kinds of features from low level to high level; this helps make full use of cross-modal information to form new discriminative features and improves the accuracy of stereo visual saliency prediction.
2) The downsampling blocks in the RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network use residual blocks with stride 2 instead of the max pooling layers used in prior work, which helps the model select feature information adaptively and prevents important information from being lost by the max pooling operation.
3) The RGB feature extraction module and the depth feature extraction module of the constructed convolutional neural network place dilated convolution layers before and after the residual blocks, enlarging the receptive field of the convolution kernels, which helps the constructed convolutional neural network attend more to global information and learn richer content.
Drawings
FIG. 1 is a schematic diagram of the composition of a convolutional neural network constructed by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a stereo image visual saliency detection method based on a convolutional neural network.
The specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then all the selected original stereo images, together with the left viewpoint image, the depth image and the real human-eye gaze image of each original stereo image, form a training set, the nth original stereo image in the training set is denoted {In(x, y)}, and the left viewpoint image, the depth image and the real human-eye gaze image of {In(x, y)} are recorded correspondingly, the depth image being denoted {Dn(x, y)}; wherein N is a positive integer, N ≥ 300, for example N = 600, W and H are both divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, In(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {In(x, y)}, Dn(x, y) represents the pixel value of the pixel at coordinate position (x, y) in {Dn(x, y)}, and the pixel values of the left viewpoint image and of the real human-eye gaze image at coordinate position (x, y) are defined in the same way.
Step 1_2: construct a convolutional neural network: as shown in FIG. 1, the convolutional neural network includes an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer, the hidden layer comprises an encoding framework and a decoding framework, and the encoding framework comprises three parts, namely an RGB feature extraction module, a depth feature extraction module and a feature fusion module; the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers, and the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, the convolution kernel size of the first convolution layer is 3 × 3, the stride is 1, the number of convolution kernels is 1, the padding is 1, and the activation mode of the first activation layer is 'Sigmoid'.
For the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; here, the width of the left viewpoint image for training is required to be W and the height is required to be H.
For the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; the training depth image has a width W and a height H.
For the RGB feature extraction module, the input end of the 1st neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 1st neural network block outputs 64 feature maps with width W and height H, whose set is denoted P1; the input end of the 1st downsampling block receives all feature maps in P1, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X1; the input end of the 2nd neural network block receives all feature maps in X1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P2; the input end of the 2nd downsampling block receives all feature maps in P2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X2; the input end of the 3rd neural network block receives all feature maps in X2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P3; the input end of the 3rd downsampling block receives all feature maps in P3, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X3; the input end of the 4th neural network block receives all feature maps in X3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P4.
For the depth feature extraction module, the input end of the 5th neural network block receives the training depth image output by the output end of the depth map input layer, and the output end of the 5th neural network block outputs 64 feature maps with width W and height H, whose set is denoted P5; the input end of the 4th downsampling block receives all feature maps in P5, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted X4; the input end of the 6th neural network block receives all feature maps in X4, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P6; the input end of the 5th downsampling block receives all feature maps in P6, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted X5; the input end of the 7th neural network block receives all feature maps in X5, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P7; the input end of the 6th downsampling block receives all feature maps in P7, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted X6; the input end of the 8th neural network block receives all feature maps in X6, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P8.
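The two extraction streams share the same pattern: four neural network blocks separated by three stride-2 downsampling blocks. The PyTorch sketch below traces the RGB stream's shapes (the framework and all names are assumptions, not from the patent); stand_in_block is a simple Conv-BN-ReLU placeholder for the patent's actual blocks, whose internals are detailed later in this embodiment, and the depth stream is obtained by changing only the number of input channels.

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout, stride=1):
    # placeholder for the patent's neural network / downsampling blocks;
    # substitute the dilated-convolution and residual modules described below
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class RGBStream(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1, self.down1 = stand_in_block(3, 64), stand_in_block(64, 64, stride=2)
        self.block2, self.down2 = stand_in_block(64, 128), stand_in_block(128, 128, stride=2)
        self.block3, self.down3 = stand_in_block(128, 256), stand_in_block(256, 256, stride=2)
        self.block4 = stand_in_block(256, 512)

    def forward(self, rgb):
        p1 = self.block1(rgb)             # 64  x W   x H
        p2 = self.block2(self.down1(p1))  # 128 x W/2 x H/2
        p3 = self.block3(self.down2(p2))  # 256 x W/4 x H/4
        p4 = self.block4(self.down3(p3))  # 512 x W/8 x H/8
        return p1, p2, p3, p4

p1, p2, p3, p4 = RGBStream()(torch.randn(1, 3, 224, 224))  # 224 x 224 is an example size
```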
For the feature fusion module, the input end of the 9th neural network block receives the training left viewpoint image output by the output end of the RGB map input layer, and the output end of the 9th neural network block outputs 3 feature maps with width W and height H, whose set is denoted P9; the input end of the 10th neural network block receives the training depth image output by the output end of the depth map input layer, and its output end outputs 3 feature maps with width W and height H, whose set is denoted P10; all feature maps in P9 and all feature maps in P10 undergo an Element-wise Summation operation, which outputs 3 feature maps with width W and height H, whose set is denoted E1; the input end of the 11th neural network block receives all feature maps in E1, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P11; all feature maps in P1, P5 and P11 undergo an Element-wise Summation operation, which outputs 64 feature maps with width W and height H, whose set is denoted E2; the input end of the 1st max pooling layer receives all feature maps in E2, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted Z1; the input end of the 12th neural network block receives all feature maps in Z1, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted P12; all feature maps in P2, P6 and P12 undergo an Element-wise Summation operation, which outputs 128 feature maps with width W/2 and height H/2, whose set is denoted E3; the input end of the 2nd max pooling layer receives all feature maps in E3, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted Z2; the input end of the 13th neural network block receives all feature maps in Z2, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted P13; all feature maps in P3, P7 and P13 undergo an Element-wise Summation operation, which outputs 256 feature maps with width W/4 and height H/4, whose set is denoted E4; the input end of the 3rd max pooling layer receives all feature maps in E4, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted Z3; the input end of the 14th neural network block receives all feature maps in Z3, and its output end outputs 512 feature maps with width W/8 and height H/8, whose set is denoted P14; all feature maps in P4, P8 and P14 undergo an Element-wise Summation operation, which outputs 512 feature maps with width W/8 and height H/8, whose set is denoted E5; the input end of the 4th max pooling layer receives all feature maps in E5, and its output end outputs 512 feature maps with width W/16 and height H/16, whose set is denoted Z4; the input end of the 15th neural network block receives all feature maps in Z4, and its output end outputs 1024 feature maps with width W/16 and height H/16, whose set is denoted P15.
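This fusion reduces to repeated element-wise addition of same-resolution feature-map sets followed by 2 × 2 max pooling. A minimal PyTorch sketch of its forward pass follows; stand_in_block again stands in for the 9th to 15th neural network blocks described below, and the single-channel depth input is an assumption the patent does not state.

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout):
    # placeholder for the patent's 9th-15th neural network blocks
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FusionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.b9, self.b10 = stand_in_block(3, 3), stand_in_block(1, 3)
        self.b11 = stand_in_block(3, 64)
        self.b12 = stand_in_block(64, 128)
        self.b13 = stand_in_block(128, 256)
        self.b14 = stand_in_block(256, 512)
        self.b15 = stand_in_block(512, 1024)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, rgb, depth, rgb_feats, depth_feats):
        p1, p2, p3, p4 = rgb_feats      # P1-P4 from the RGB stream
        p5, p6, p7, p8 = depth_feats    # P5-P8 from the depth stream
        e1 = self.b9(rgb) + self.b10(depth)     # 3    x W    x H
        e2 = p1 + p5 + self.b11(e1)             # 64   x W    x H
        e3 = p2 + p6 + self.b12(self.pool(e2))  # 128  x W/2  x H/2
        e4 = p3 + p7 + self.b13(self.pool(e3))  # 256  x W/4  x H/4
        e5 = p4 + p8 + self.b14(self.pool(e4))  # 512  x W/8  x H/8
        return self.b15(self.pool(e5))          # 1024 x W/16 x H/16  (P15)
```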
For the decoding framework, the input end of the 1st upsampling layer receives all feature maps in P15, and its output end outputs 1024 feature maps with width W/8 and height H/8, whose set is denoted S1; the input end of the 16th neural network block receives all feature maps in S1, and its output end outputs 256 feature maps with width W/8 and height H/8, whose set is denoted P16; the input end of the 2nd upsampling layer receives all feature maps in P16, and its output end outputs 256 feature maps with width W/4 and height H/4, whose set is denoted S2; the input end of the 17th neural network block receives all feature maps in S2, and its output end outputs 128 feature maps with width W/4 and height H/4, whose set is denoted P17; the input end of the 3rd upsampling layer receives all feature maps in P17, and its output end outputs 128 feature maps with width W/2 and height H/2, whose set is denoted S3; the input end of the 18th neural network block receives all feature maps in S3, and its output end outputs 64 feature maps with width W/2 and height H/2, whose set is denoted P18; the input end of the 4th upsampling layer receives all feature maps in P18, and its output end outputs 64 feature maps with width W and height H, whose set is denoted S4; the input end of the 19th neural network block receives all feature maps in S4, and its output end outputs 64 feature maps with width W and height H, whose set is denoted P19.
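The decoding framework is therefore four rounds of bilinear 2 × upsampling, each followed by one of the 16th to 19th neural network blocks. A sketch under the same assumptions (stand_in_block replaces the three-convolution blocks described below):

```python
import torch
import torch.nn as nn

def stand_in_block(cin, cout):
    # placeholder for the patent's 16th-19th neural network blocks
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.b16 = stand_in_block(1024, 256)
        self.b17 = stand_in_block(256, 128)
        self.b18 = stand_in_block(128, 64)
        self.b19 = stand_in_block(64, 64)

    def forward(self, p15):               # p15: 1024 x W/16 x H/16
        p16 = self.b16(self.up(p15))      # 256 x W/8 x H/8
        p17 = self.b17(self.up(p16))      # 128 x W/4 x H/4
        p18 = self.b18(self.up(p17))      # 64  x W/2 x H/2
        return self.b19(self.up(p18))     # 64  x W   x H   (P19)

p19 = Decoder()(torch.randn(1, 1024, 14, 14))  # -> 1 x 64 x 224 x 224 for a 224 x 224 input
```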
For the output layer, the input end of the first convolution layer receives all feature maps in P19, and the output end of the first convolution layer outputs one feature map with width W and height H; the input end of the first batch normalization layer receives the feature map output by the output end of the first convolution layer; the input end of the first activation layer receives the feature map output by the output end of the first batch normalization layer; the output end of the first activation layer outputs the saliency image of the stereo image corresponding to the training left viewpoint image; wherein the width of the saliency image is W and its height is H.
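The output layer is fully specified above and maps directly onto three layers in PyTorch (the framework choice is an assumption):

```python
import torch
import torch.nn as nn

output_layer = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),  # 64 maps from P19 -> 1 map
    nn.BatchNorm2d(1),
    nn.Sigmoid(),                                          # saliency values in [0, 1]
)

saliency = output_layer(torch.randn(1, 64, 224, 224))      # one W x H saliency image
```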
Step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image in the training set as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {In(x, y)} is recorded accordingly, its value at coordinate position (x, y) being the pixel value of the pixel at (x, y) in that saliency image.
Step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye gaze image; the loss function value between the saliency image of {In(x, y)} and its real human-eye gaze image is obtained using the mean squared error loss function.
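In other words, the training objective is the mean squared error between the predicted saliency image and the real human-eye gaze image, e.g.:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
pred = torch.rand(1, 1, 224, 224)   # saliency image output by the network
gaze = torch.rand(1, 1, 224, 224)   # real human-eye gaze (fixation) image
loss = criterion(pred, gaze)        # loss function value for this image
```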
Step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model and N × V loss function values; then find the smallest of the N × V loss function values; the weight vector and bias term corresponding to that smallest loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network training model, correspondingly denoted Wbest and bbest; wherein V > 1, for example V = 50.
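A sketch of this training loop is given below; the optimizer, learning rate and data-loader interface are illustrative assumptions, since the patent only specifies the number of repetitions V and the selection of the minimum-loss weights.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, V=50, lr=1e-4):
    """Train for V passes over the N training images and keep the weights
    that produced the smallest of the N*V loss values (Wbest / bbest)."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    best_loss, best_state = float('inf'), None
    for _ in range(V):
        for rgb, depth, gaze in loader:        # assumed (rgb, depth, gaze) batches
            pred = model(rgb, depth)
            loss = criterion(pred, gaze)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:        # minimum of the N*V loss values
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)          # optimal weights and bias terms
    return model
```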
The test stage process comprises the following specific steps:
Step 2_1: let the stereo image to be tested have width W' and height H', and record its left viewpoint image and depth image correspondingly; wherein 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', and the pixel values of the stereo image to be tested, of its left viewpoint image and of its depth image at coordinate position (x', y') are defined in the same way as in the training stage.
Step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model, and predict using Wbest and bbest to obtain the saliency prediction image of the stereo image to be tested, whose value at coordinate position (x', y') is the pixel value of the pixel at (x', y') in that saliency prediction image.
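The test stage is then a single forward pass through the trained model with its optimal weights; a sketch under the same assumptions:

```python
import torch

@torch.no_grad()
def predict(model, rgb, depth):
    """rgb: 1 x 3 x H' x W' left viewpoint image, depth: 1 x 1 x H' x W' depth image;
    returns the saliency prediction image of the stereo image under test."""
    model.eval()
    return model(rgb, depth)  # values in [0, 1] thanks to the Sigmoid output layer
```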
In this embodiment, in step 1_2, the 1st to 8th neural network blocks have the same structure and each consists of a first dilated convolution layer, a second batch normalization layer, a second activation layer, a first residual block, a second dilated convolution layer and a third batch normalization layer arranged in sequence, where the input end of the first dilated convolution layer is the input end of the neural network block it belongs to, the input end of the second batch normalization layer receives all feature maps output by the output end of the first dilated convolution layer, the input end of the second activation layer receives all feature maps output by the output end of the second batch normalization layer, the input end of the first residual block receives all feature maps output by the output end of the second activation layer, the input end of the second dilated convolution layer receives all feature maps output by the output end of the first residual block, the input end of the third batch normalization layer receives all feature maps output by the output end of the second dilated convolution layer, and the output end of the third batch normalization layer is the output end of the neural network block it belongs to; the number of convolution kernels of the first and second dilated convolution layers is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in each of the 1st to 8th neural network blocks the convolution kernel sizes of the first and second dilated convolution layers are 3 × 3, the strides are 1, the dilation rates are 2 and the paddings are 2, and the activation mode of the second activation layer is 'ReLU'.
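A minimal PyTorch rendering of these 1st to 8th neural network blocks is sketched below (class names are illustrative); the bottleneck residual block follows the structure given earlier, with an identity skip since the input and output channel counts match at stride 1.

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """1x1 / 3x3 / 1x1 convolutions with batch norm and an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class DilatedBlock(nn.Module):
    """Dilated conv -> BN -> ReLU -> residual block -> dilated conv -> BN,
    i.e. the structure of the 1st-8th neural network blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.res = BottleneckResidual(out_ch)
        self.tail = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.tail(self.res(self.head(x)))

# e.g. the 1st neural network block: 3-channel RGB input, 64 output feature maps
block1 = DilatedBlock(3, 64)
print(block1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 224, 224])
```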
The 9th and 10th neural network blocks have the same structure, each consisting of a second convolution layer followed by a fourth batch normalization layer: the input of the second convolution layer is the input of the neural network block, the fourth batch normalization layer receives all feature maps output by the second convolution layer, and its output is the output of the neural network block. In each of the 9th and 10th blocks the second convolution layer has 3 convolution kernels of size 7 × 7, stride 1 and padding 3.
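A corresponding sketch for the 9th/10th blocks, with in_channels an assumption depending on the input branch, could be:

import torch.nn as nn

# Sketch of the 9th/10th neural network blocks: a single 7x7 convolution with
# 3 kernels (stride 1, padding 3) followed by batch normalization.
def make_block_9_10(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 3, kernel_size=7, stride=1, padding=3),  # second conv layer
        nn.BatchNorm2d(3),                                              # fourth batch normalization layer
    )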
The 11th and 12th neural network blocks have the same structure, each consisting of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence: the input of the third convolution layer is the input of the neural network block; the fifth batch normalization layer receives all feature maps output by the third convolution layer; the third activation layer receives all feature maps output by the fifth batch normalization layer; the fourth convolution layer receives all feature maps output by the third activation layer; the sixth batch normalization layer receives all feature maps output by the fourth convolution layer, and its output is the output of the neural network block. The third and fourth convolution layers have 64 convolution kernels each in the 11th block and 128 each in the 12th block; in both blocks their kernel size is 3 × 3, stride 1 and padding 1, and the activation mode of the third activation layer is "ReLU".
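The 11th/12th blocks can be sketched in the same illustrative style:

import torch.nn as nn

# Sketch of the 11th/12th neural network blocks: conv -> BN -> ReLU -> conv -> BN,
# with `channels` = 64 for the 11th block and 128 for the 12th block.
def make_block_11_12(in_channels, channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, channels, kernel_size=3, stride=1, padding=1),  # third conv layer
        nn.BatchNorm2d(channels),                                              # fifth batch normalization layer
        nn.ReLU(inplace=True),                                                 # third activation layer
        nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),     # fourth conv layer
        nn.BatchNorm2d(channels),                                              # sixth batch normalization layer
    )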
The 13th to 19th neural network blocks have the same structure, each consisting of a fifth convolution layer, a seventh batch normalization layer, a fourth activation layer, a sixth convolution layer, an eighth batch normalization layer, a fifth activation layer, a seventh convolution layer and a ninth batch normalization layer arranged in sequence: the input of the fifth convolution layer is the input of the neural network block; each batch normalization layer receives all feature maps output by the convolution layer preceding it, each activation layer receives all feature maps output by the batch normalization layer preceding it, and each subsequent convolution layer receives all feature maps output by the activation layer preceding it; the output of the ninth batch normalization layer is the output of the neural network block. The numbers of convolution kernels of the fifth, sixth and seventh convolution layers are 256, 256, 256 in the 13th block; 512, 512, 512 in the 14th block; 1024, 1024, 1024 in the 15th block; 512, 512, 256 in the 16th block; 256, 256, 128 in the 17th block; 128, 128, 64 in the 18th block; and 64, 64, 64 in the 19th block. In all of the 13th to 19th blocks the convolution kernel sizes are 3 × 3, the strides are 1 and the paddings are 1, and the activation modes of the fourth and fifth activation layers are "ReLU".
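A sketch covering the 13th to 19th blocks, parameterized by the three kernel counts listed above, could read:

import torch.nn as nn

# Sketch of the 13th-19th neural network blocks: three 3x3 conv layers, each followed
# by batch normalization, with ReLU after the first two. (c5, c6, c7) are the kernel
# counts of the fifth/sixth/seventh conv layers, e.g. (512, 512, 256) for the 16th block.
def make_block_13_19(in_channels, c5, c6, c7):
    return nn.Sequential(
        nn.Conv2d(in_channels, c5, kernel_size=3, stride=1, padding=1),  # fifth conv layer
        nn.BatchNorm2d(c5),                                              # seventh batch normalization layer
        nn.ReLU(inplace=True),                                           # fourth activation layer
        nn.Conv2d(c5, c6, kernel_size=3, stride=1, padding=1),           # sixth conv layer
        nn.BatchNorm2d(c6),                                              # eighth batch normalization layer
        nn.ReLU(inplace=True),                                           # fifth activation layer
        nn.Conv2d(c6, c7, kernel_size=3, stride=1, padding=1),           # seventh conv layer
        nn.BatchNorm2d(c7),                                              # ninth batch normalization layer
    )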
In this embodiment, in step 1_2, the 1st to 6th downsampling blocks have the same structure, each consisting of a second residual block whose input is the input of the downsampling block and whose output is the output of the downsampling block.
In this embodiment, the first residual block and the second residual block have the same structure, each comprising 3 convolution layers, 3 batch normalization layers and 3 activation layers: the input of the 1st convolution layer is the input of the residual block; the 1st batch normalization layer receives all feature maps output by the 1st convolution layer; the 1st activation layer receives all feature maps output by the 1st batch normalization layer; the 2nd convolution layer receives all feature maps output by the 1st activation layer; the 2nd batch normalization layer receives all feature maps output by the 2nd convolution layer; the 2nd activation layer receives all feature maps output by the 2nd batch normalization layer; the 3rd convolution layer receives all feature maps output by the 2nd activation layer; the 3rd batch normalization layer receives all feature maps output by the 3rd convolution layer; all feature maps received at the input of the 1st convolution layer are added to all feature maps output by the 3rd batch normalization layer, and after passing through the 3rd activation layer the result is the output of the residual block. The number of convolution kernels of every convolution layer in the first residual block is 64 in each of the 1st and 5th neural network blocks, 128 in each of the 2nd and 6th blocks, 256 in each of the 3rd and 7th blocks, and 512 in each of the 4th and 8th blocks; in all of the 1st to 8th neural network blocks, the 1st and 3rd convolution layers of the first residual block have 1 × 1 kernels and stride 1, and the 2nd convolution layer has 3 × 3 kernels, stride 1 and padding 1. The number of convolution kernels of every convolution layer in the second residual block is 64 in each of the 1st and 4th downsampling blocks, 128 in each of the 2nd and 5th downsampling blocks, and 256 in each of the 3rd and 6th downsampling blocks; in all of the 1st to 6th downsampling blocks, the 1st and 3rd convolution layers of the second residual block have 1 × 1 kernels and stride 1, and the 2nd convolution layer has 3 × 3 kernels, stride 2 and padding 1. The activation mode of all 3 activation layers is "ReLU".
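A hedged sketch of the stride-1 (first) residual block follows; the class name is illustrative. The downsampling (second) residual block is described as identical except for stride 2 in the middle convolution, whose matching of the skip connection the text leaves implicit, so it is not shown here.

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the first residual block: 1x1 conv -> BN -> ReLU -> 3x3 conv -> BN
    -> ReLU -> 1x1 conv -> BN, with the block input added to the last BN output and
    the sum passed through a final ReLU. All three convs use `channels` kernels."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),             # 1st conv layer (1x1)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),  # 2nd conv layer (3x3)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),             # 3rd conv layer (1x1)
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)  # 3rd activation layer, applied after the skip addition

    def forward(self, x):
        return self.relu(x + self.body(x))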
In this embodiment, in step 1_2, the pooling windows of the 1st to 4th max pooling layers are all of size 2 × 2 and the strides are all 2.
In this embodiment, in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are bilinear interpolation, and the scaling factors are all 2.
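These two layers correspond directly to standard PyTorch modules; the align_corners setting below is an assumption not specified in the text.

import torch.nn as nn

# Max pooling and upsampling used in the feature fusion and decoding frameworks:
# each max pooling layer halves the spatial size (2x2 window, stride 2) and each
# upsampling layer doubles it (bilinear interpolation, scale factor 2).
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)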
To verify the feasibility and effectiveness of the method of the invention, experiments were performed.
Here, the accuracy and stability of the method of the present invention were analyzed on the three-dimensional eye-tracking database NCTU-3DFixation provided by National Chiao Tung University. Four objective parameters commonly used to evaluate visual saliency extraction methods were adopted as evaluation indexes: the Linear Correlation Coefficient (CC), the Kullback-Leibler Divergence (KLD), the Area Under the ROC Curve (AUC), and the Normalized Scanpath Saliency (NSS).
The method of the invention is used to obtain the saliency predicted image of each stereo image in this eye-tracking database, which is then compared with the subjective visual saliency map of that stereo image, i.e. its real human-eye fixation image (provided in the database). Higher CC, AUC and NSS values and a lower KLD value indicate better consistency between the saliency predicted image obtained by the method and the subjective visual saliency map. The CC, KLD, AUC and NSS values reflecting the saliency extraction performance of the method of the invention are listed in Table 1; a minimal sketch of the CC, KLD and NSS computations is given after Table 1.
TABLE 1  Accuracy and stability of the saliency predicted images obtained by the method of the invention with respect to the subjective visual saliency maps

Performance index      CC        KLD       AUC (Borji)    NSS
Value                  0.7583    0.4868    0.8789         2.0692
As can be seen from the data listed in Table 1, the consistency between the saliency predicted images obtained by the method of the invention and the subjective visual saliency maps is good in terms of both accuracy and stability, indicating that the objective detection results agree well with subjective human perception, which is sufficient to demonstrate the feasibility and effectiveness of the method of the invention.
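For reference, the following sketch gives standard formulations of three of the four evaluation indexes (CC, KLD and NSS); the exact normalizations and the AUC (Borji) computation used in the experiments are not spelled out here, so the definitions below follow common saliency-benchmark practice and are not taken from the patent.

import numpy as np

# `pred` is the predicted saliency map, `fix_map` a continuous fixation density map,
# `fix_pts` a binary map of fixation locations; all are 2-D numpy arrays.

def cc(pred, fix_map):
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    f = (fix_map - fix_map.mean()) / (fix_map.std() + 1e-12)
    return float((p * f).mean())                             # linear correlation coefficient

def kld(pred, fix_map, eps=1e-12):
    p = pred / (pred.sum() + eps)
    f = fix_map / (fix_map.sum() + eps)
    return float((f * np.log(eps + f / (p + eps))).sum())    # Kullback-Leibler divergence

def nss(pred, fix_pts):
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(p[fix_pts > 0].mean())                      # normalized scanpath saliency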

Claims (6)

1. A stereo image visual saliency detection method based on a convolutional neural network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: select N original stereo images with width W and height H; then form a training set from all the selected original stereo images together with their left viewpoint images, depth images and real human-eye fixation images, and denote the nth original stereo image in the training set as {I_n(x,y)}, its depth image as {D_n(x,y)}, and its left viewpoint image and real human-eye fixation image correspondingly; where N is a positive integer with N ≥ 300, W and H are both evenly divisible by 2, n is a positive integer with initial value 1, 1 ≤ n ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, I_n(x,y) denotes the pixel value of the pixel at coordinate (x,y) in {I_n(x,y)}, D_n(x,y) denotes the pixel value of the pixel at coordinate (x,y) in {D_n(x,y)}, and the corresponding notations for the left viewpoint image and the real human-eye fixation image denote their pixel values at coordinate (x,y);
step 1_2: construct a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises a coding framework and a decoding framework; the coding framework comprises an RGB feature extraction module, a depth feature extraction module and a feature fusion module, wherein the RGB feature extraction module comprises the 1st to 4th neural network blocks and the 1st to 3rd downsampling blocks, the depth feature extraction module comprises the 5th to 8th neural network blocks and the 4th to 6th downsampling blocks, and the feature fusion module comprises the 9th to 15th neural network blocks and the 1st to 4th max pooling layers; the decoding framework comprises the 16th to 19th neural network blocks and the 1st to 4th upsampling layers; the output layer consists of a first convolution layer, a first batch normalization layer and a first activation layer, wherein the first convolution layer has 1 convolution kernel of size 3 × 3, stride 1 and padding 1, and the activation mode of the first activation layer is "Sigmoid";
for the RGB image input layer, the input end of the RGB image input layer receives a left viewpoint image for training, and the output end of the RGB image input layer outputs the left viewpoint image for training to the hidden layer; wherein, the width of the left viewpoint image for training is required to be W and the height is required to be H;
for the depth map input layer, the input end of the depth map input layer receives the training depth image corresponding to the training left viewpoint image received by the input end of the RGB map input layer, and the output end of the depth map input layer outputs the training depth image to the hidden layer; wherein the width of the depth image for training is W and the height of the depth image for training is H;
for the RGB feature extraction module, the input of the 1st neural network block receives the training left viewpoint image output by the RGB map input layer, and its output is 64 feature maps of width W and height H, whose set is denoted P1; the 1st downsampling block receives all feature maps in P1 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted X1; the 2nd neural network block receives all feature maps in X1 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P2; the 2nd downsampling block receives all feature maps in P2 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted X2; the 3rd neural network block receives all feature maps in X2 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P3; the 3rd downsampling block receives all feature maps in P3 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted X3; the 4th neural network block receives all feature maps in X3 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P4;
for the depth feature extraction module, the input of the 5th neural network block receives the training depth image output by the depth map input layer, and its output is 64 feature maps of width W and height H, whose set is denoted P5; the 4th downsampling block receives all feature maps in P5 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted X4; the 6th neural network block receives all feature maps in X4 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P6; the 5th downsampling block receives all feature maps in P6 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted X5; the 7th neural network block receives all feature maps in X5 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P7; the 6th downsampling block receives all feature maps in P7 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted X6; the 8th neural network block receives all feature maps in X6 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P8;
for the feature fusion module, the input of the 9th neural network block receives the training left viewpoint image output by the RGB map input layer, and its output is 3 feature maps of width W and height H, whose set is denoted P9; the input of the 10th neural network block receives the training depth image output by the depth map input layer, and its output is 3 feature maps of width W and height H, whose set is denoted P10; an Element-wise Summation operation is performed on all feature maps in P9 and P10, yielding 3 feature maps of width W and height H, whose set is denoted E1; the 11th neural network block receives all feature maps in E1 and outputs 64 feature maps of width W and height H, whose set is denoted P11; an Element-wise Summation operation on all feature maps in P1, P5 and P11 yields 64 feature maps of width W and height H, whose set is denoted E2; the 1st max pooling layer receives all feature maps in E2 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted Z1; the 12th neural network block receives all feature maps in Z1 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted P12; an Element-wise Summation operation on all feature maps in P2, P6 and P12 yields 128 feature maps of width W/2 and height H/2, whose set is denoted E3; the 2nd max pooling layer receives all feature maps in E3 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted Z2; the 13th neural network block receives all feature maps in Z2 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted P13; an Element-wise Summation operation on all feature maps in P3, P7 and P13 yields 256 feature maps of width W/4 and height H/4, whose set is denoted E4; the 3rd max pooling layer receives all feature maps in E4 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted Z3; the 14th neural network block receives all feature maps in Z3 and outputs 512 feature maps of width W/8 and height H/8, whose set is denoted P14; an Element-wise Summation operation on all feature maps in P4, P8 and P14 yields 512 feature maps of width W/8 and height H/8, whose set is denoted E5; the 4th max pooling layer receives all feature maps in E5 and outputs 512 feature maps of width W/16 and height H/16, whose set is denoted Z4; the 15th neural network block receives all feature maps in Z4 and outputs 1024 feature maps of width W/16 and height H/16, whose set is denoted P15;
for the decoding framework, the 1st upsampling layer receives all feature maps in P15 and outputs 1024 feature maps of width W/8 and height H/8, whose set is denoted S1; the 16th neural network block receives all feature maps in S1 and outputs 256 feature maps of width W/8 and height H/8, whose set is denoted P16; the 2nd upsampling layer receives all feature maps in P16 and outputs 256 feature maps of width W/4 and height H/4, whose set is denoted S2; the 17th neural network block receives all feature maps in S2 and outputs 128 feature maps of width W/4 and height H/4, whose set is denoted P17; the 3rd upsampling layer receives all feature maps in P17 and outputs 128 feature maps of width W/2 and height H/2, whose set is denoted S3; the 18th neural network block receives all feature maps in S3 and outputs 64 feature maps of width W/2 and height H/2, whose set is denoted P18; the 4th upsampling layer receives all feature maps in P18 and outputs 64 feature maps of width W and height H, whose set is denoted S4; the 19th neural network block receives all feature maps in S4 and outputs 64 feature maps of width W and height H, whose set is denoted P19;
for the output layer, the input of the first convolution layer receives all feature maps in P19, and its output is one feature map of width W and height H; the first batch normalization layer receives the feature map output by the first convolution layer; the first activation layer receives the feature map output by the first batch normalization layer; the output of the first activation layer is the saliency image of the stereo image corresponding to the training left viewpoint image, with width W and height H;
step 1_3: take the left viewpoint image of each original stereo image in the training set as the training left viewpoint image and the depth image of each original stereo image as the training depth image, input them into the convolutional neural network for training, and obtain the saliency image of each original stereo image in the training set; the saliency image of {I_n(x,y)} is denoted correspondingly, its value at coordinate (x,y) being the pixel value of the corresponding pixel;
step 1_4: calculate the loss function value between the saliency image of each original stereo image in the training set and its real human-eye fixation image; the loss function value between the saliency image of {I_n(x,y)} and its real human-eye fixation image is obtained using the mean squared error loss function;
step 1_5: repeat step 1_3 and step 1_4 V times to obtain a convolutional neural network training model, yielding N × V loss function values; then find the minimum among these N × V loss function values; then take the weight vector and bias term corresponding to that minimum loss value as the optimal weight vector and optimal bias term of the convolutional neural network training model, denoted W_best and b_best respectively; wherein V is greater than 1;
the test stage process comprises the following specific steps:
step 2_1: let the stereo image to be tested have width W′ and height H′, and denote it together with its left viewpoint image and depth image accordingly; wherein 1 ≤ x′ ≤ W′ and 1 ≤ y′ ≤ H′, and the corresponding notations denote the pixel values at coordinate (x′, y′) in the stereo image to be tested, in its left viewpoint image and in its depth image, respectively;
step 2_2: input the left viewpoint image and the depth image of the stereo image to be tested into the convolutional neural network training model and predict with W_best and b_best, obtaining the saliency predicted image of the stereo image to be tested, whose value at coordinate (x′, y′) is the pixel value of the corresponding pixel.
2. The method according to claim 1, wherein in step 1_2, the 1 st to 8 th neural network blocks have the same structure and are composed of a first hole convolution layer, a second normalization layer, a second activation layer, a first residual block, a second hole convolution layer, and a third normalization layer, which are sequentially arranged, wherein an input end of the first hole convolution layer is an input end of the neural network block where the first hole convolution layer is located, an input end of the second normalization layer receives all feature maps output by an output end of the first hole convolution layer, an input end of the second activation layer receives all feature maps output by an output end of the second normalization layer, an input end of the first residual block receives all feature maps output by an output end of the second activation layer, and an input end of the second hole convolution layer receives all feature maps output by an output end of the first residual block, the input end of the third batch of normalization layers receives all characteristic graphs output by the output end of the second cavity convolution layer, and the output end of the third batch of normalization layers is the output end of the neural network block where the third batch of normalization layers is located; wherein, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the first hole convolution layer and the second hole convolution layer in each of the 1 st to 8 th neural network blocks are both 3 × 3 and steps are both 1, the holes are all 2, the fillings are all 2, and the activation modes of the second activation layers in the 1 st to 8 th neural network blocks are all 'ReLU';
the 9 th and 10 th neural network blocks have the same structure and are composed of a second convolution layer and a fourth batch of normalization layers which are sequentially arranged, wherein the input end of the second convolution layer is the input end of the neural network block where the second convolution layer is located, the input end of the fourth batch of normalization layers receives all characteristic diagrams output by the output end of the second convolution layer, and the output end of the fourth batch of normalization layers is the output end of the neural network block where the fourth batch of normalization layers is located; the number of convolution kernels of the second convolution layer in each of the 9 th neural network block and the 10 th neural network block is 3, the sizes of the convolution kernels are 7 multiplied by 7, the steps are 1, and the padding is 3;
the 11th and 12th neural network blocks have the same structure, each consisting of a third convolution layer, a fifth batch normalization layer, a third activation layer, a fourth convolution layer and a sixth batch normalization layer arranged in sequence, wherein the input end of the third convolution layer is the input end of the neural network block where it is located, the input end of the fifth batch normalization layer receives all the feature maps output by the output end of the third convolution layer, the input end of the third activation layer receives all the feature maps output by the output end of the fifth batch normalization layer, the input end of the fourth convolution layer receives all the feature maps output by the output end of the third activation layer, the input end of the sixth batch normalization layer receives all the feature maps output by the output end of the fourth convolution layer, and the output end of the sixth batch normalization layer is the output end of the neural network block where it is located; the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 11th neural network block is 64, the number of convolution kernels of the third convolution layer and the fourth convolution layer in the 12th neural network block is 128, the convolution kernel sizes of the third and fourth convolution layers in both the 11th and 12th neural network blocks are 3 × 3, the strides are all 1 and the paddings are all 1; the activation mode of the third activation layer in each of the 11th and 12th neural network blocks is "ReLU";
the 13 th to 19 th neural network blocks have the same structure, and are composed of a fifth convolution layer, a seventh normalization layer, a fourth activation layer, a sixth convolution layer, an eighth normalization layer, a fifth activation layer, a seventh convolution layer and a ninth normalization layer which are arranged in sequence, wherein the input end of the fifth convolution layer is the input end of the neural network block where the fifth convolution layer is located, the input end of the seventh normalization layer receives all feature maps output by the output end of the fifth convolution layer, the input end of the fourth activation layer receives all feature maps output by the output end of the seventh normalization layer, the input end of the sixth convolution layer receives all feature maps output by the output end of the fourth activation layer, the input end of the eighth normalization layer receives all feature maps output by the output end of the sixth convolution layer, the input end of the fifth activation layer receives all feature maps output by the output end of the eighth normalization layer, the input end of the seventh convolutional layer receives all the characteristic graphs output by the output end of the fifth activation layer, the input end of the ninth normalization layer receives all the characteristic graphs output by the output end of the seventh convolutional layer, and the output end of the ninth normalization layer is the output end of the neural network block where the ninth normalization layer is located; wherein, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 13 th neural network block is 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 14 th neural network block is 512, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 15 th neural network block is 1024, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 16 th neural network block is 512, 512 and 256, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in the 17 th neural network block is 256, 256 and 128, the number of convolution kernels of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernel in the 18 th neural network block is 128, 128 and 64, the number of the fifth convolution layer, the sixth convolution layer and the seventh convolution kernels in the 19 th neural network block is 64, convolution kernel sizes of a fifth convolution layer, a sixth convolution layer and a seventh convolution layer in each of the 13 th to 19 th neural network blocks are all 3 × 3, steps are all 1, padding is all 1, and activation modes of a fourth activation layer and a fifth activation layer in each of the 13 th to 19 th neural network blocks are all 'ReLU'.
3. The method for detecting visual saliency of stereoscopic images based on convolutional neural network as claimed in claim 2, wherein in step 1_2, the 1 st to 6 th downsampling blocks have the same structure and are composed of the second residual block, the input end of the second residual block is the input end of the downsampling block where it is located, and the output end of the second residual block is the output end of the downsampling block where it is located.
4. The method according to claim 3, wherein the first residual block and the second residual block have the same structure, and include 3 convolutional layers, 3 batch normalization layers, and 3 active layers, an input of a 1 st convolutional layer is an input of the residual block, an input of a 1 st batch normalization layer receives all feature maps output by an output of the 1 st convolutional layer, an input of a 1 st active layer receives all feature maps output by an output of the 1 st batch normalization layer, an input of a 2 nd convolutional layer receives all feature maps output by an output of the 1 st active layer, an input of a 2 nd batch normalization layer receives all feature maps output by an output of the 2 nd convolutional layer, an input of a 2 nd active layer receives all feature maps output by an output of the 2 nd batch normalization layer, the input end of the 3 rd convolutional layer receives all the feature maps output by the output end of the 2 nd active layer, the input end of the 3 rd batch of normalization layers receives all the feature maps output by the output end of the 3 rd convolutional layer, all the feature maps received by the input end of the 1 st convolutional layer are added with all the feature maps output by the output end of the 3 rd batch of normalization layers, and all the feature maps output by the output end of the 3 rd active layer after passing through the 3 rd active layer are used as all the feature maps output by the output end of the residual block; wherein the number of convolution kernels of each convolution layer in the first residual block in each of the 1 st and 5 th neural network blocks is 64, the number of convolution kernels of each convolution layer in the first residual block in each of the 2 nd and 6 th neural network blocks is 128, the number of convolution kernels of each convolution layer in the first residual block in each of the 3 rd and 7 th neural network blocks is 256, the number of convolution kernels of each convolution layer in the first residual block in each of the 4 th and 8 th neural network blocks is 512, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 1 × 1 and step length is 1, the sizes of convolution kernels of the 2 nd convolution layer in the first residual block in each of the 1 st to 8 th neural network blocks are both 3 × 3, the sizes of convolution kernels are both 1 and step length are 1, and the padding is both 1, the number of convolution kernels of each convolution layer in the second residual block in each of the 1 st and 4 th downsampling blocks is 64, the number of convolution kernels of each convolution layer in the second residual block in each of the 2 nd and 5 th downsampling blocks is 128, the number of convolution kernels of each convolution layer in the second residual block in each of the 3 rd and 6 th downsampling blocks is 256, the sizes of convolution kernels of the 1 st convolution layer and the 3 rd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 1 × 1 and 1 step, the sizes of convolution kernels of the 2 nd convolution layer in the second residual block in each of the 1 st to 6 th downsampling blocks are both 3 × 3, the steps are both 2 and 1 filling, and the activation modes of the 3 activation layers are both "ReLU".
5. The method for detecting the visual saliency of stereoscopic images based on convolutional neural network as claimed in any one of claims 1 to 4, wherein in step 1_2, the sizes of the pooling windows of the 1 st to 4 th maximum pooling layers are all 2 x 2 and the steps are all 2.
6. The method for detecting visual saliency of stereoscopic images based on a convolutional neural network as claimed in claim 5, wherein in step 1_2, the sampling modes of the 1 st to 4 th upsampling layers are all bilinear interpolation, and the scaling factor is all 2.
CN201910327556.4A 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network Active CN110175986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910327556.4A CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN110175986A true CN110175986A (en) 2019-08-27
CN110175986B CN110175986B (en) 2021-01-08

Family

ID=67689881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910327556.4A Active CN110175986B (en) 2019-04-23 2019-04-23 Stereo image visual saliency detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN110175986B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110782458A (en) * 2019-10-23 2020-02-11 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN111369506A (en) * 2020-02-26 2020-07-03 四川大学 Lens turbidity grading method based on eye B-ultrasonic image
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111612832A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN112528900A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on extreme down-sampling
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
WO2021096806A1 (en) * 2019-11-14 2021-05-20 Zoox, Inc Depth data model training with upsampling, losses, and loss balancing
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN113592795A (en) * 2021-07-19 2021-11-02 深圳大学 Visual saliency detection method of stereoscopic image, thumbnail generation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
US20170351941A1 (en) * 2016-06-03 2017-12-07 Miovision Technologies Incorporated System and Method for Performing Saliency Detection Using Deep Active Contours
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170351941A1 (en) * 2016-06-03 2017-12-07 Miovision Technologies Incorporated System and Method for Performing Saliency Detection Using Deep Active Contours
CN106462771A (en) * 2016-08-05 2017-02-22 深圳大学 3D image significance detection method
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN109598268A (en) * 2018-11-23 2019-04-09 安徽大学 A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN109635822A (en) * 2018-12-07 2019-04-16 浙江科技学院 The significant extracting method of stereo-picture vision based on deep learning coding and decoding network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN, HAO et al.: "RGB-D Saliency Detection by Multi-stream Late Fusion Network", COMPUTER VISION SYSTEMS *
XINGYU CAI et al.: "Saliency detection for stereoscopic 3D images in the quaternion frequency domain", 3D RESEARCH *
LI, Rong et al.: "Saliency region prediction method using convolutional neural networks", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110555434B (en) * 2019-09-03 2022-03-29 浙江科技学院 Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN110782458A (en) * 2019-10-23 2020-02-11 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
CN110782458B (en) * 2019-10-23 2022-05-31 浙江科技学院 Object image 3D semantic prediction segmentation method of asymmetric coding network
US11681046B2 (en) 2019-11-14 2023-06-20 Zoox, Inc. Depth data model training with upsampling, losses and loss balancing
WO2021096806A1 (en) * 2019-11-14 2021-05-20 Zoox, Inc Depth data model training with upsampling, losses, and loss balancing
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN111369506A (en) * 2020-02-26 2020-07-03 四川大学 Lens turbidity grading method based on eye B-ultrasonic image
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
CN111612832A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN111612832B (en) * 2020-04-29 2023-04-18 杭州电子科技大学 Method for improving depth estimation accuracy by utilizing multitask complementation
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528900B (en) * 2020-12-17 2022-09-16 南开大学 Image salient object detection method and system based on extreme down-sampling
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528900A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on extreme down-sampling
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113592795A (en) * 2021-07-19 2021-11-02 深圳大学 Visual saliency detection method of stereoscopic image, thumbnail generation method and device
CN113592795B (en) * 2021-07-19 2024-04-12 深圳大学 Visual saliency detection method for stereoscopic image, thumbnail generation method and device

Also Published As

Publication number Publication date
CN110175986B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN110175986B (en) Stereo image visual saliency detection method based on convolutional neural network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN108520535B (en) Object classification method based on depth recovery information
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN110032926B (en) Video classification method and device based on deep learning
CN107154023B (en) Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN111563418A (en) Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN110210492B (en) Stereo image visual saliency detection method based on deep learning
CN112734915A (en) Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
Wei et al. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN110458178A (en) The multi-modal RGB-D conspicuousness object detection method spliced more
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
Luo et al. Bi-GANs-ST for perceptual image super-resolution
CN107909565A (en) Stereo-picture Comfort Evaluation method based on convolutional neural networks
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant