CN111445432A - Image significance detection method based on information fusion convolutional neural network - Google Patents

Image significance detection method based on information fusion convolutional neural network

Info

Publication number
CN111445432A
CN111445432A (application CN201910971962.4A)
Authority
CN
China
Prior art keywords
layer
block
output
neural network
depth map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910971962.4A
Other languages
Chinese (zh)
Inventor
周武杰
吴君委
雷景生
何成
王海江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910971962.4A
Publication of CN111445432A
Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Abstract

The invention discloses an image saliency detection method based on an information fusion convolutional neural network. An original object image and its corresponding depth image are input into a convolutional neural network for training to obtain saliency detection prediction maps and salient object boundary prediction maps; loss function values are then computed between the set formed by the saliency detection prediction maps and the set formed by the real saliency detection images, and between the set formed by the boundary prediction maps and the set formed by the real salient object boundary images, to obtain the optimal weight vector and bias terms of the convolutional neural network training model; a scene image to be subjected to saliency detection and its corresponding depth image are then input into the trained convolutional neural network model to obtain a predicted saliency detection image. The method has the advantage of improving the efficiency and accuracy of saliency detection for object images.

Description

Image significance detection method based on information fusion convolutional neural network
Technical Field
The invention relates to a significance detection method based on deep learning, in particular to an image significance detection method based on an information fusion convolutional neural network.
Background
Visual saliency helps people quickly filter out unimportant information so that attention is focused on meaningful regions and the scene in front of the eyes can be better understood. With the development of computer vision, it is desirable that a computer have the same ability as a human being: when understanding and analyzing a complex scene, it can process useful information more selectively, so that the complexity of the algorithm is reduced and the interference of noise is eliminated. In conventional methods, researchers model the salient object detection algorithm according to various observed priors to generate a saliency map. These priors include contrast, center prior, edge prior, semantic prior, etc. However, in complex scenes, conventional approaches tend to be inaccurate, because these observations are often limited to low-level features (e.g., color and contrast) and do not accurately reflect the common saliency properties inherent to salient objects.
In recent years, deep convolutional neural networks have been widely used in various fields of computer vision, and great progress has been made on many difficult vision problems. Different from traditional methods, a deep convolutional neural network can be modeled from a large number of training samples and automatically learn more essential features end to end (end-to-end), effectively avoiding the defects of traditional manual modeling and feature design. Recently, the effective application of 3D sensors has enriched the available databases: not only color pictures but also the depth information of the pictures can be obtained. Depth information is an important part of the human visual system in real 3D scenes, yet it has been completely ignored in conventional methods, so the most important task at present is how to build a model that effectively utilizes depth information.
A deep-learning saliency detection method on an RGB-D database performs pixel-level end-to-end saliency detection directly: the images in the training set are input into the model framework for training to obtain the weights and the model, and prediction can then be carried out on the test set. At present, deep-learning saliency detection based on RGB-D databases mainly uses an encoding-decoding architecture, and there are three ways of using the depth information. The first is to directly stack the depth information and the color image information into a four-channel input, or to add or concatenate the color image information and the depth information during encoding; this is called pre-fusion. The second is to add or concatenate the mutually corresponding color image information and depth information into the corresponding decoding stage by skip connections; this is called post-fusion. The third is to use the color image information and the depth information separately for saliency prediction and then fuse the final results. In the first method, since the distributions of color image information and depth information differ greatly, directly adding the depth information during encoding introduces a certain amount of noise. In the third method, if either the depth-based or the color-based prediction is inaccurate, the final fused result is also inaccurate. The second method not only avoids the noise brought by directly utilizing the depth information in the encoding stage, but can also fully learn the complementary relationship between color image information and depth information as the network model is optimized. A previous post-fusion scheme is CNNs-Based RGB-D Saliency Detection via Cross-View Transfer and Multiview Fusion (RGB-D salient object detection based on cross-view transfer and multi-view fusion with convolutional neural networks), hereinafter referred to as CBSD. CBSD performs feature extraction and downsampling on the color image information and the depth information separately, fuses them at the smallest scale, and outputs a small-sized saliency prediction map on that basis. CBSD only has downsampling operations, so the spatial detail information of the object becomes blurred during the successive downsampling, and the different modalities are fused by direct addition, which affects the final result to a certain extent because the data distributions differ.
Disclosure of Invention
The invention aims to provide a significance detection method based on a convolutional neural network, which has high detection efficiency and high detection accuracy.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: select Q RGB images containing real objects, together with the known depth map, saliency detection label map and saliency boundary label map corresponding to each RGB image, to form a training set; the saliency boundary label map is obtained by applying a 3 × 3 convolution to the saliency detection label map to extract its boundary. The saliency detection label map is the image obtained after extracting the real object, and the saliency boundary label map is the image obtained after extracting the contour of the real object.
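As a concrete illustration of step 1, the sketch below derives a saliency boundary label map from a binary saliency detection label map with a 3 × 3 convolution. The patent only states that a 3 × 3 convolution is used; the Laplacian-style kernel and the function name extract_boundary are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def extract_boundary(label_map: torch.Tensor) -> torch.Tensor:
    """label_map: (1, 1, H, W) binary saliency label map with values in {0, 1}."""
    laplacian = torch.tensor([[0., 1., 0.],
                              [1., -4., 1.],
                              [0., 1., 0.]]).view(1, 1, 3, 3)
    edges = F.conv2d(label_map, laplacian, padding=1)   # non-zero only at object contours
    return (edges.abs() > 0).float()                    # binarised saliency boundary label map

boundary = extract_boundary(torch.zeros(1, 1, 480, 640))
```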
Step 2: construct an information fusion convolutional neural network, which comprises an input layer, a hidden layer and an output layer connected in sequence.
Step 3: input each RGB image in the training set and its corresponding depth map into the information fusion convolutional neural network from the input layer for training, and obtain four saliency detection prediction maps and four saliency boundary prediction maps from the output layer; the four saliency detection prediction maps form the saliency prediction map set and the four saliency boundary prediction maps form the boundary prediction map set. Scale the saliency detection label map corresponding to each RGB image to four different sizes to obtain four images with different widths and heights as the saliency label map set, and scale the saliency boundary label map corresponding to the same RGB image to four different sizes to obtain four images with different widths and heights as the boundary label map set. Calculate a first loss function value between the saliency prediction map set and the saliency label map set using categorical cross entropy, calculate a second loss function value between the boundary prediction map set and the boundary label map set using Dice loss, and add the first loss function value and the second loss function value to obtain the total loss function value.
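The total loss of step 3 can be sketched as follows, assuming the four saliency predictions are single-channel maps supervised with (binary) cross entropy and the four boundary predictions are supervised with Dice loss; the function names and the exact Dice formulation are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(sal_preds, sal_labels, bnd_preds, bnd_labels):
    # sal_preds / bnd_preds: lists of 4 predictions at 4 scales;
    # sal_labels / bnd_labels: label maps rescaled to the matching sizes.
    loss1 = sum(F.binary_cross_entropy_with_logits(p, t)
                for p, t in zip(sal_preds, sal_labels))
    loss2 = sum(dice_loss(p, t) for p, t in zip(bnd_preds, bnd_labels))
    return loss1 + loss2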
Step 4: repeat step 3 V times to obtain Q × V total loss function values, and take the weight vector and bias terms corresponding to the minimum total loss function value as the optimal weight vector and optimal bias terms of the information fusion convolutional neural network, thereby obtaining the trained information fusion convolutional neural network.
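A minimal training-loop sketch for step 4 is given below, assuming a standard gradient-based optimiser; model, train_loader, loss_fn and the Adam optimiser are illustrative placeholders, while keeping the weights of the minimum total loss follows the text above.

```python
import copy
import torch

def train(model, train_loader, V, loss_fn, lr=1e-4):
    """loss_fn is e.g. the total_loss sketched under step 3."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for _ in range(V):                           # V passes over the Q training pairs
        for rgb, depth, sal_labels, bnd_labels in train_loader:
            sal_preds, bnd_preds = model(rgb, depth)
            loss = loss_fn(sal_preds, sal_labels, bnd_preds, bnd_labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            if loss.item() < best_loss:          # keep the weights/bias of the minimum total loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```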
Step 5: acquire an RGB image to be subjected to saliency detection together with its corresponding depth map, input them into the trained information fusion convolutional neural network, and take the fourth saliency detection prediction map output by the network as the final predicted saliency detection map.
The input layer of the information fusion convolutional neural network comprises an RGB map input layer and a depth map input layer, and the hidden layer comprises a color map processing part and a depth map processing part; the RGB map input layer receives an RGB image and inputs it into the color map processing part, whose outputs are fed to four saliency sub-output layers, and the depth map input layer receives a depth map and inputs it into the depth map processing part, whose outputs are fed to four boundary sub-output layers.
The color image processing part comprises a first RGB image neural network block, a first RGB image maximum pooling layer, a second RGB image neural network block, a second RGB image maximum pooling layer, a third RGB image neural network block, a third RGB image maximum pooling layer, a fourth RGB image neural network block, a fourth RGB image maximum pooling layer, a fifth RGB image neural network block, a first significance detection module, a first multi-mode information fusion module, a first RGB up-sampling block, a second multi-mode information fusion module, a second RGB up-sampling block, a third multi-mode information fusion module and a third RGB up-sampling block which are connected in sequence, the RGB map received by the RGB input layer is input into the color map processing part through a first RGB map neural network block and is output by a first significance detection module, a first RGB up-sampling block, a second RGB up-sampling block and a third RGB up-sampling block.
The outputs of the fourth RGB map neural network block and the fifth RGB map neural network block are connected to the input of the first context information fusion block, and the output of the first context information fusion block is connected to the input of the first multi-mode information fusion module; the outputs of the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the second context information fusion block, and the output of the second context information fusion block is connected to the input of the second multi-mode information fusion module; the outputs of the second RGB map neural network block, the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the third context information fusion block, and the output of the third context information fusion block is connected to the input of the third multi-mode information fusion module; the output of the first depth map upsampling block is further connected to the input of the first multimodal information fusion module, the output of the second depth map upsampling block is further connected to the input of the second multimodal information fusion module, and the output of the third depth map upsampling block is further connected to the input of the third multimodal information fusion module.
The depth map processing part comprises a first depth map neural network block, a first depth map maximum pooling layer, a second depth map neural network block, a second depth map maximum pooling layer, a third depth map neural network block, a third depth map maximum pooling layer, a fourth depth map neural network block, a first depth map upsampling block, a first depth map upsampling layer, a second depth map upsampling block, a second depth map upsampling layer, a third depth map upsampling block, a third depth map upsampling layer and a fourth depth map upsampling block which are connected in sequence; the depth map received by the depth map input layer enters the depth map processing part through the first depth map neural network block, and the outputs of the part are produced by the first, second, third and fourth depth map upsampling blocks.
The output of the third depth map neural network block is connected to the input of the second depth map upsampling block: the output of the first depth map upsampling layer and the output of the third depth map neural network block are fused and then input into the second depth map upsampling block. The output of the second depth map neural network block is connected to the input of the third depth map upsampling block: the output of the second depth map upsampling layer and the output of the second depth map neural network block are fused and then input into the third depth map upsampling block. The output of the first depth map neural network block is connected to the input of the fourth depth map upsampling block: the output of the third depth map upsampling layer and the output of the first depth map neural network block are fused and then input into the fourth depth map upsampling block. The fusion is performed by adding the pixel values of the pixel points at corresponding positions in the output feature maps.
The output layers comprise four significance sub-output layers and four boundary sub-output layers, the outputs of the first significance detection module, the first RGB up-sampling block, the second RGB up-sampling block and the third RGB up-sampling block are respectively connected with the first significance sub-output layer, the second significance sub-output layer, the third significance sub-output layer and the fourth significance sub-output layer, and the outputs of the first significance sub-output layer, the second significance sub-output layer and the third significance sub-output layer are also respectively connected with the inputs of the first multi-mode information fusion module, the second multi-mode information fusion module and the third multi-mode information fusion module; the outputs of the first depth map upsampling block, the second depth map upsampling block, the third depth map upsampling block and the fourth depth map upsampling block are respectively connected with the first boundary sub-output layer, the second boundary sub-output layer, the third boundary sub-output layer and the fourth boundary sub-output layer.
Each depth map neural network block has the same structure: it is formed by sequentially connecting several convolution blocks, and each convolution block consists of a convolution layer, a batch normalization layer and an activation layer connected in sequence, as sketched below. The numbers of convolution blocks in the first, second, third and fourth depth map neural network blocks are 2, 2, 3 and 3 respectively. The first, second, third and fourth depth map upsampling blocks each contain 3 convolution blocks. The numbers of convolution blocks in the first, second, third, fourth and fifth color map neural network blocks are 2, 2, 3, 3 and 3 respectively. The first, second and third RGB upsampling blocks have the same structure and each consists of three convolution blocks and an upsampling layer: the three convolution blocks are connected in sequence, one end serves as the input of the RGB upsampling block, the other end is connected to the upsampling layer, and the output of the upsampling layer serves as the output of the RGB upsampling block.
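A minimal PyTorch-style sketch of such a convolution block (convolution + batch normalization + ReLU activation) and of a neural network block stacked from several of them is given below; the channel numbers of the example blocks follow the detailed description later in the text, and the helper names are illustrative.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, dilation=1):
    # convolution layer + batch normalization layer + activation layer ("Relu")
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def neural_network_block(in_ch, out_ch, n_blocks):
    layers = [conv_block(in_ch, out_ch)]
    layers += [conv_block(out_ch, out_ch) for _ in range(n_blocks - 1)]
    return nn.Sequential(*layers)

# e.g. the 1st and 3rd depth map neural network blocks
depth_block1 = neural_network_block(3, 64, n_blocks=2)
depth_block3 = neural_network_block(128, 256, n_blocks=3)
```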
The context information fusion blocks have the same structure. Each context information fusion block comprises a number of convolution layers equal to its number of inputs, the convolution layers corresponding to the inputs one by one: one end of each convolution layer is connected to one input, the other ends are jointly connected to a convolution block I and a convolution block II in sequence, and the output of convolution block II serves as the output of the context information fusion block.
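The sketch below illustrates a context information fusion block under the above description, assuming (as in the detailed embodiment) 1 × 1 convolution layers on the inputs followed by concatenation and two 3 × 3 convolution blocks; the class name, the channel numbers and the assumption that all inputs share the same spatial size are illustrative.

```python
import torch
import torch.nn as nn

class ContextInfoFusionBlock(nn.Module):
    """N inputs -> N 1x1 convolution layers -> concatenation -> convolution blocks I and II."""
    def __init__(self, in_channels_list, mid_ch=256, out_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels_list)
        def conv_block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            conv_block(mid_ch * len(in_channels_list), out_ch),  # convolution block I
            conv_block(out_ch, out_ch))                          # convolution block II

    def forward(self, feats):
        # feats: a list of 2, 3 or 4 feature maps, assumed to share the same spatial size
        reduced = [conv(f) for conv, f in zip(self.reduce, feats)]
        return self.fuse(torch.cat(reduced, dim=1))
```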
The first multi-modal information fusion module comprises a superposition layer, a multiplication layer, a first convolution layer, a second convolution layer and an addition layer; the output of the superposition layer is connected to the input of the multiplication layer and to the input of the first convolution layer, the output of the multiplication layer is connected to the input of the addition layer through the second convolution layer, the output of the first convolution layer is connected to the input of the addition layer, and the output of the addition layer serves as the output of the multi-modal information fusion module. Superposition means concatenating the output feature maps along the channel dimension, multiplication means multiplying the pixel values of the pixel points at corresponding positions in the output feature maps, and addition means adding the pixel values of the pixel points at corresponding positions in the output feature maps.
For the first multi-modal information fusion module: the output of the first context information fusion block and the output of the first saliency detection module are jointly input into the superposition layer and superposed (concatenated), forming the output of the superposition layer; the output of the superposition layer is also input into the first convolution layer; the output of the superposition layer and the output of the first saliency sub-output layer are jointly input into the multiplication layer and multiplied, forming the output of the multiplication layer, which is then input into the second convolution layer; the outputs of the first convolution layer, the second convolution layer and the first depth map upsampling block are jointly input into the addition layer and added, and the result is the output of the first multi-modal information fusion module.
For the second and third multi-modal information fusion modules: the output of the corresponding context information fusion block and the output of the corresponding RGB upsampling block are jointly input into the superposition layer and superposed (concatenated), forming the output of the superposition layer; the output of the superposition layer is also input into the first convolution layer; the output of the superposition layer and the output of the corresponding saliency sub-output layer are jointly input into the multiplication layer and multiplied, forming the output of the multiplication layer, which is input into the second convolution layer; the outputs of the first convolution layer, the second convolution layer and the corresponding depth map upsampling block are jointly input into the addition layer and added, and the result is the output of the second or third multi-modal information fusion module.
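A hedged sketch of a multi-modal information fusion module wired as described above is given below: concatenation, a multiplication weighted by the smaller-scale saliency prediction, two convolution layers and an element-wise addition with the depth-branch feature. The class name and channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, ctx_ch, rgb_ch, depth_ch):
        super().__init__()
        cat_ch = ctx_ch + rgb_ch
        self.conv1 = nn.Conv2d(cat_ch, depth_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(cat_ch, depth_ch, kernel_size=3, padding=1)

    def forward(self, ctx_feat, rgb_feat, sal_map, depth_feat):
        cat = torch.cat([ctx_feat, rgb_feat], dim=1)   # superposition (concatenation) layer
        weighted = cat * sal_map                       # multiplication by the 1-channel saliency map
        return self.conv1(cat) + self.conv2(weighted) + depth_feat  # addition layer
```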
Each depth map upsampling layer performs upsampling by bilinear interpolation, and the first saliency detection module adopts the network structure of a pyramid pooling module (pyramid pooling module).
Compared with the prior art, the invention has the advantages that:
1) The method constructs a convolutional neural network and inputs the color images and depth images in the training set into it for training to obtain a convolutional neural network training model. When constructing the convolutional neural network, the method combines dilated convolution layers (convolution layers with holes) with bilinear interpolation layers (i.e. the upsampling layers) to build the 1st to 3rd RGB upsampling blocks and the 1st to 4th depth map upsampling blocks, so that the spatial information of objects is refined step by step during upsampling; the dilated convolution layers obtain a larger receptive field, which improves the final detection effect.
2) When utilizing the depth information, the method innovatively uses it to obtain the boundary of the salient object, and designs a fusion mechanism for fusing the different modal information (i.e. color image information and depth image information), taking the smaller-scale saliency prediction map as input to progressively guide the prediction of the larger-scale saliency prediction map. Extracting the boundary and progressively predicting the salient object greatly improve the final detection effect.
3) The invention adopts multiple supervision, namely the saliency sub-output layers are supervised with the saliency detection label map and the saliency boundary sub-output layers are supervised with the saliency boundary label map, so that the boundary of the object is clearer and a better result is obtained.
Drawings
FIG. 1 is a block diagram of an overall implementation of the inventive method;
FIG. 2 is a schematic diagram of a multimodal information fusion module;
FIG. 3 is a diagram of a context information fusion block; the three context information fusion blocks of the invention differ only in their number of inputs (2, 3 and 4 respectively), the rest of the structure being identical; the figure shows the two-input case.
FIG. 4a is the 1 st original image of a real object;
4a-d are depth maps of the 1 st original image of a real object;
FIG. 4b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 4a using the method of the present invention;
FIG. 5a is the 2 nd original image of a real object;
FIGS. 5a-d are depth maps of the 2 nd original real object image;
FIG. 5b is a predicted saliency detection image obtained by predicting the original object image shown in FIG. 5a using the method of the present invention;
FIG. 6a is the 3 rd original image of a real object;
6a-d are depth maps of the 3 rd original real object image;
FIG. 6b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 6a using the method of the present invention;
FIG. 7a is the 4 th original image of a real object;
FIGS. 7a-d are depth maps of the 4 th original real object image;
FIG. 7b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 7a using the method of the present invention;
FIG. 8-a shows the precision-recall curves reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set;
FIG. 8-b shows the mean absolute error reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set;
FIG. 8-c shows the F-measure reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a significance detection method based on a convolutional neural network, the overall implementation block diagram of which is shown in fig. 1, and the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images, and for each of them the corresponding depth image and real saliency detection label image, to form a training set; denote the q-th original color real object image in the training set, its corresponding depth image and its real saliency detection label image as {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)} respectively. Then use a 3 × 3 convolution to extract the boundary of each real saliency detection label image in the training set, obtaining the saliency boundary map of each real saliency detection label image; the saliency boundary map of {Gq(i,j)} is denoted as {Eq(i,j)}. Here Q is a positive integer with Q ≥ 200; q is a positive integer with initial value 1 and 1 ≤ q ≤ Q; 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, H denotes their height, and both W and H are divisible by 2. {Iq(i,j)} is an RGB color image and Iq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Iq(i,j)}; {Dq(i,j)} is a single-channel depth image and Dq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Dq(i,j)}; Gq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Gq(i,j)}; Eq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Eq(i,j)}.
Step 1_2: construct a convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 4 depth map neural network blocks, 3 depth map maximum pooling layers, 3 RGB upsampling blocks, 1 global-based saliency detection module, 3 multi-modal information fusion modules, 3 context information fusion blocks, 4 depth map upsampling blocks and 3 depth map upsampling layers; the output layer comprises 4 saliency sub-output layers and 4 saliency boundary sub-output layers.
For the depth map input layer: its input end receives an original input depth image and replicates it into three-channel depth information, and its output end outputs the three-channel depth information of the original input image to the hidden layer; the input end of the input layer is required to receive an original input image with width W and height H.
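A minimal sketch of this depth input layer, assuming the single-channel depth map is simply replicated into three identical channels (the tensor sizes are illustrative):

```python
import torch

depth = torch.rand(1, 1, 480, 640)        # (batch, 1, H, W) single-channel depth map
depth_3ch = depth.repeat(1, 3, 1, 1)      # (batch, 3, H, W) three-channel depth information
```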
For the 1st depth map neural network block, it consists of a first convolution layer, a first batch normalization layer, a first activation layer, a second convolution layer, a second batch normalization layer and a second activation layer which are arranged in sequence. The input end of the 1st depth map neural network block receives the three-channel components of the original input image output by the output end of the depth map input layer, the output end of the 1st depth map neural network block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DP1. The convolution kernel size (kernel_size) of the first convolution layer is 3 × 3, the number of convolution kernels (filters) is 64 and the zero padding (padding) is 1; the output of the first batch normalization layer is 64 feature maps; the activation mode of the first activation layer is 'Relu'; the convolution kernel size of the second convolution layer is 3 × 3, the number of convolution kernels is 64 and the zero padding is 1; the output of the second batch normalization layer is 64 feature maps; the activation mode of the second activation layer is 'Relu'. DP1 is taken as input to the first maximum pooling layer (Pool), whose pooling size (pool_size) is 2 and stride is 2, and the output is denoted as DP1p; each feature map in DP1p has a width of W/2 and a height of H/2.
For the 2nd depth map neural network block, it consists of a third convolution layer, a third batch normalization layer, a third activation layer, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer which are arranged in sequence. The input end of the 2nd depth map neural network block receives all feature maps in DP1p, the output end of the 2nd depth map neural network block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DP2. The convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the third batch normalization layer is 128 feature maps; the activation mode of the third activation layer is 'Relu'; the convolution kernel size of the fourth convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the fourth batch normalization layer is 128 feature maps; the activation mode of the fourth activation layer is 'Relu'. DP2 is taken as input to the second maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as DP2p; each feature map in DP2p has a width of W/4 and a height of H/4.
For the 3rd depth map neural network block, it consists of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a sixth convolution layer, a sixth batch normalization layer, a sixth activation layer, a seventh convolution layer, a seventh batch normalization layer and a seventh activation layer which are arranged in sequence. The input end of the 3rd depth map neural network block receives all feature maps in DP2p, the output end of the 3rd depth map neural network block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DP3. The convolution kernel size of the fifth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the fifth batch normalization layer is 256 feature maps; the activation mode of the fifth activation layer is 'Relu'; the convolution kernel size of the sixth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the sixth batch normalization layer is 256 feature maps; the activation mode of the sixth activation layer is 'Relu'; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the seventh batch normalization layer is 256 feature maps; the activation mode of the seventh activation layer is 'Relu'. DP3 is taken as input to the third maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as DP3p; each feature map in DP3p has a width of W/8 and a height of H/8.
For the 4th depth map neural network block, it consists of an eighth convolution layer, an eighth batch normalization layer, an eighth activation layer, a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer which are arranged in sequence. The input end of the 4th depth map neural network block receives all feature maps in DP3p, the output end of the 4th depth map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is denoted as DP4. The convolution kernel size of the eighth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the eighth batch normalization layer is 512 feature maps; the activation mode of the eighth activation layer is 'Relu'; the convolution kernel size of the ninth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the ninth batch normalization layer is 512 feature maps; the activation mode of the ninth activation layer is 'Relu'; the convolution kernel size of the tenth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the tenth batch normalization layer is 512 feature maps; the activation mode of the tenth activation layer is 'Relu'. Each feature map in DP4 has a width of W/8 and a height of H/8.
For the 1st depth map upsampling block, it consists of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a twelfth convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a thirteenth convolution layer, a thirteenth batch normalization layer and a thirteenth activation layer which are arranged in sequence. The input end of the 1st depth map upsampling block receives all feature maps in DP4, the output end of the 1st depth map upsampling block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DU1. The convolution kernel size of the eleventh convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the eleventh batch normalization layer is 256 feature maps; the activation mode of the eleventh activation layer is 'Relu'; the convolution kernel size of the twelfth convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the twelfth batch normalization layer is 256 feature maps; the activation mode of the twelfth activation layer is 'Relu'; the convolution kernel size of the thirteenth convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the thirteenth batch normalization layer is 256 feature maps; the activation mode of the thirteenth activation layer is 'Relu'. Each feature map in DU1 has a width of W/8 and a height of H/8.
DU1 is taken as input to the first boundary sub-output layer, and the first boundary sub-output layer outputs 1 feature map, denoted as D3; the convolution kernel size of the convolution of the first boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D3 has a width of W/8 and a height of H/8.
For the first depth map upsampling layer, it consists of a first bilinear interpolation upsampling layer. The input end of the first depth map upsampling layer receives all feature maps in DU1, the output end of the first depth map upsampling layer outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DUB1; the magnification factor of the first bilinear interpolation upsampling layer is 2. Each feature map in DUB1 has a width of W/4 and a height of H/4.
For the 1st depth map fusion layer, the result of adding DUB1 and DP3 element-wise at corresponding positions is denoted as U3, which is taken as input to the 2nd depth map upsampling block. The 2nd depth map upsampling block consists of a fourteenth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fifteenth convolution layer, a fifteenth batch normalization layer, a fifteenth activation layer, a sixteenth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer which are arranged in sequence. The input end of the 2nd depth map upsampling block receives U3, the output end of the 2nd depth map upsampling block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DU2. The convolution kernel size of the fourteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the fourteenth batch normalization layer is 128 feature maps; the activation mode of the fourteenth activation layer is 'Relu'; the convolution kernel size of the fifteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the fifteenth batch normalization layer is 128 feature maps; the activation mode of the fifteenth activation layer is 'Relu'; the convolution kernel size of the sixteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the sixteenth batch normalization layer is 128 feature maps; the activation mode of the sixteenth activation layer is 'Relu'. Each feature map in DU2 has a width of W/4 and a height of H/4.
DU2 is taken as input to the 2nd boundary sub-output layer, and the 2nd boundary sub-output layer outputs 1 feature map, denoted as D2; the convolution kernel size of the convolution of the 2nd boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D2 has a width of W/4 and a height of H/4.
For the 2nd depth map upsampling layer, it consists of a 2nd bilinear interpolation upsampling layer. The input end of the 2nd depth map upsampling layer receives all feature maps in DU2, the output end of the 2nd depth map upsampling layer outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DUB2; the magnification factor of the 2nd bilinear interpolation upsampling layer is 2. Each feature map in DUB2 has a width of W/2 and a height of H/2.
For the 2nd depth map fusion layer, the result of adding DUB2 and DP2 element-wise at corresponding positions is denoted as U2, which is taken as input to the 3rd depth map upsampling block. The 3rd depth map upsampling block consists of a seventeenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a nineteenth convolution layer, a nineteenth batch normalization layer and a nineteenth activation layer which are arranged in sequence. The input end of the 3rd depth map upsampling block receives all feature maps in U2, the output end of the 3rd depth map upsampling block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DU3. The convolution kernel size of the seventeenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the seventeenth batch normalization layer is 64 feature maps; the activation mode of the seventeenth activation layer is 'Relu'; the convolution kernel size of the eighteenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the eighteenth batch normalization layer is 64 feature maps; the activation mode of the eighteenth activation layer is 'Relu'; the convolution kernel size of the nineteenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the nineteenth batch normalization layer is 64 feature maps; the activation mode of the nineteenth activation layer is 'Relu'. Each feature map in DU3 has a width of W/2 and a height of H/2.
DU3 is taken as input to the 3rd boundary sub-output layer, and the 3rd boundary sub-output layer outputs 1 feature map, denoted as D1; the convolution kernel size of the convolution of the 3rd boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D1 has a width of W/2 and a height of H/2.
For the 3rd depth map upsampling layer, it consists of a 3rd bilinear interpolation upsampling layer. The input end of the 3rd depth map upsampling layer receives all feature maps in DU3, the output end of the 3rd depth map upsampling layer outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DUB3; the magnification factor of the 3rd bilinear interpolation upsampling layer is 2. Each feature map in DUB3 has a width of W and a height of H.
For the 3rd depth map fusion layer, the result of adding DUB3 and DP1 element-wise at corresponding positions is denoted as U1, which is taken as input to the 4th depth map upsampling block. The 4th depth map upsampling block consists of a twentieth convolution layer, a twentieth batch normalization layer, a twentieth activation layer, a twenty-first convolution layer, a twenty-first batch normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twenty-second batch normalization layer and a twenty-second activation layer which are arranged in sequence. The input end of the 4th depth map upsampling block receives all feature maps in U1, and its output end outputs 64 feature maps. The convolution kernel size of the twentieth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twentieth batch normalization layer is 64 feature maps; the activation mode of the twentieth activation layer is 'Relu'; the convolution kernel size of the twenty-first convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twenty-first batch normalization layer is 64 feature maps; the activation mode of the twenty-first activation layer is 'Relu'; the convolution kernel size of the twenty-second convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twenty-second batch normalization layer is 64 feature maps; the activation mode of the twenty-second activation layer is 'Relu'. Each feature map output by the 4th depth map upsampling block has a width of W and a height of H. This output is taken as input to the 4th boundary sub-output layer, which outputs 1 feature map; the convolution kernel size of the convolution of the 4th boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0, and this feature map has a width of W and a height of H.
For a color input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original input color image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the input end of the input layer is required to receive the original input image with width W and height H.
For the 1st color map neural network block, it consists of a first convolution layer (Conv), a first batch normalization layer (BN), a first activation layer (Act), a second convolution layer, a second batch normalization layer and a second activation layer which are arranged in sequence. The input end of the 1st color map neural network block receives the R channel component, G channel component and B channel component of the original input image output by the output end of the color map input layer, the output end of the 1st color map neural network block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as P1. The convolution kernel size (kernel_size) of the first convolution layer is 3 × 3, the number of convolution kernels (filters) is 64 and the zero padding (padding) is 1; the output of the first batch normalization layer is 64 feature maps; the activation mode of the first activation layer is 'Relu'; the convolution kernel size of the second convolution layer is 3 × 3, the number of convolution kernels is 64 and the zero padding is 1; the output of the second batch normalization layer is 64 feature maps; the activation mode of the second activation layer is 'Relu'. P1 is taken as input to the first maximum pooling layer (Pool), whose pooling size (pool_size) is 2 and stride is 2, and the output is denoted as P1p; each feature map in P1p has a width of W/2 and a height of H/2.
For the 2nd color map neural network block, it consists of a third convolution layer, a third batch normalization layer, a third activation layer, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer which are arranged in sequence. The input end of the 2nd color map neural network block receives all feature maps in P1p, the output end of the 2nd color map neural network block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as P2. The convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the third batch normalization layer is 128 feature maps; the activation mode of the third activation layer is 'Relu'; the convolution kernel size of the fourth convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the fourth batch normalization layer is 128 feature maps; the activation mode of the fourth activation layer is 'Relu'. P2 is taken as input to the second maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as P2p; each feature map in P2p has a width of W/4 and a height of H/4.
For the 3rd color map neural network block, it consists of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a sixth convolution layer, a sixth batch normalization layer, a sixth activation layer, a seventh convolution layer, a seventh batch normalization layer and a seventh activation layer which are arranged in sequence. The input end of the 3rd color map neural network block receives all feature maps in P2p, the output end of the 3rd color map neural network block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as P3. The convolution kernel size of the fifth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the fifth batch normalization layer is 256 feature maps; the activation mode of the fifth activation layer is 'Relu'; the convolution kernel size of the sixth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the sixth batch normalization layer is 256 feature maps; the activation mode of the sixth activation layer is 'Relu'; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the seventh batch normalization layer is 256 feature maps; the activation mode of the seventh activation layer is 'Relu'. P3 is taken as input to the third maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as P3p; each feature map in P3p has a width of W/8 and a height of H/8.
For the 4th color map neural network block, it consists of an eighth convolution layer, an eighth batch normalization layer, an eighth activation layer, a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer which are arranged in sequence. The input end of the 4th color map neural network block receives all feature maps in P3p, the output end of the 4th color map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is denoted as P4. The convolution kernel size of the eighth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the eighth batch normalization layer is 512 feature maps; the activation mode of the eighth activation layer is 'Relu'; the convolution kernel size of the ninth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the ninth batch normalization layer is 512 feature maps; the activation mode of the ninth activation layer is 'Relu'; the convolution kernel size of the tenth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the tenth batch normalization layer is 512 feature maps; the activation mode of the tenth activation layer is 'Relu'. P4 is taken as input to the fourth maximum pooling layer, whose pooling size is 1 and stride is 1, and the output is denoted as P4p; each feature map in P4p has a width of W/8 and a height of H/8.
For the 5th color map neural network block, it consists of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a twelfth convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a thirteenth convolution layer, a thirteenth batch normalization layer and a thirteenth activation layer which are arranged in sequence. The input end of the 5th color map neural network block receives all feature maps in P4p, the output end of the 5th color map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is marked as P5. The convolution kernel size of each of the eleventh, twelfth and thirteenth convolution layers is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the eleventh, twelfth and thirteenth batch normalization layers each output 512 feature maps; the activation mode of the eleventh, twelfth and thirteenth activation layers is "Relu". Each feature map in P5 has a width of W/8 and a height of H/8.
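Each color map neural network block described above is a VGG-style stack of 3 × 3 convolution, batch normalization and "Relu" stages, optionally followed by a max pooling layer. A minimal PyTorch sketch of one such block is given below; the class name ColorMapBlock and the assumed channel count of P2p (128) are illustrative and not taken from the patent.

```python
# Minimal sketch of one color-map neural network block plus its max-pooling layer,
# assuming standard PyTorch layers; class and argument names are illustrative.
import torch
import torch.nn as nn

class ColorMapBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # three repetitions of 3x3 convolution (zero padding 1) + batch norm + ReLU
        layers = []
        for i in range(3):
            layers += [
                nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# e.g. the 3rd block: assumed 128 -> 256 channels, then pooling size 2, stride 2
block3 = ColorMapBlock(128, 256)
pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 128, 64, 64)   # a stand-in for P2p
p3 = block3(x)                    # P3: 256 x 64 x 64
p3p = pool3(p3)                   # P3p: 256 x 32 x 32 (W/8 x H/8 for a 256x256 input)
```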
For the first global-based saliency detection module, P5 is taken as its input; this module consists of a first pyramid pooling module (pyramid pooling module). The input end of the first global-based saliency detection module receives all feature maps in P5, the output end of the first global-based saliency detection module outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U4. Each feature map in U4 has a width of W/8 and a height of H/8.
U4 is taken as input into the 1st saliency sub-output layer, which outputs 1 feature map, marked as S1; the convolution kernel of the 1st saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S1 has a width of W/8 and a height of H/8.
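The patent identifies the first global-based saliency detection module with a pyramid pooling module but does not spell out its internals in this passage. The sketch below follows the common PSPNet-style design (average pooling at several grid sizes, per-branch channel reduction, bilinear up-sampling and concatenation); the bin sizes (1, 2, 3, 6), the per-branch reduction width and the fusion kernel size are assumptions.

```python
# Hedged sketch of a PSPNet-style pyramid pooling module producing a 256-channel output;
# bin sizes and channel splits are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                  # pool the whole map to b x b
                nn.Conv2d(in_ch, in_ch // len(bins), 1),  # reduce channels per branch
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.fuse = nn.Conv2d(in_ch + in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for s in self.stages]
        return self.fuse(torch.cat(feats, dim=1))         # 256-channel output, e.g. U4

u4 = PyramidPooling()(torch.randn(1, 512, 32, 32))        # P5 stand-in at W/8 x H/8
```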
As shown in FIG. 3, for the 1st context information fusion block, P5 and P4 are input to the fourteenth and fifteenth convolution layers respectively. The input end of the fourteenth convolution layer receives all feature maps in P5 and its output end outputs 256 feature maps, which form one set; the input end of the fifteenth convolution layer receives all feature maps in P4 and its output end outputs 256 feature maps, which form a second set. The convolution kernel size of the fourteenth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the fifteenth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in these two sets has a width of W/8 and a height of H/8. The two sets are superposed (concatenated) and the result is denoted C1, which is taken as input into the 1st context information fusion block; this block consists of a sixteenth convolution layer, a sixteenth batch normalization layer, a sixteenth activation layer, a seventeenth convolution layer, a seventeenth batch normalization layer and a seventeenth activation layer which are arranged in sequence. The input end of the 1st context information fusion block receives all feature maps in C1, the output end of the 1st context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX1. The convolution kernel size of each of the sixteenth and seventeenth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the sixteenth and seventeenth batch normalization layers each output 256 feature maps; the activation mode of the sixteenth and seventeenth activation layers is "Relu". Each feature map in SX1 has a width of W/8 and a height of H/8.
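A context information fusion block therefore reduces each incoming feature set with a 1 × 1 convolution, concatenates the reduced sets and refines the result with two 3 × 3 convolution + batch normalization + "Relu" stages. The sketch below captures this structure; the reduced channel width of 256 per branch is an illustrative choice and does not follow the literal kernel counts listed above.

```python
# Hedged sketch of a context information fusion block: per-input 1x1 reduction,
# concatenation, then two 3x3 conv + BN + ReLU stages; channel widths are illustrative.
import torch
import torch.nn as nn

class ContextFusionBlock(nn.Module):
    def __init__(self, in_channels=(512, 512), mid_ch=256, out_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch * len(in_channels), out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, inputs):
        # inputs are feature sets already brought to a common spatial size
        reduced = [r(x) for r, x in zip(self.reduce, inputs)]
        return torch.cat(reduced, dim=1), self.fuse(torch.cat(reduced, dim=1))[1] if False else self.fuse(torch.cat(reduced, dim=1))

p5 = torch.randn(1, 512, 32, 32)
p4 = torch.randn(1, 512, 32, 32)
sx1 = ContextFusionBlock()([p5, p4])   # fused context features, e.g. SX1
```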
For the first multimodal information fusion module, all feature maps in SX1, S1, U4 and DU1 are taken as input. First, SX1 and U4 are superposed (concatenated) and the result is denoted M1; M1 is input into the eighteenth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re1. Then all feature maps in M1 are multiplied element-wise by S1 and the result is recorded as Mul1; Mul1 is input into the nineteenth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For1. Finally, the corresponding position elements of all feature maps in For1, Re1 and DU1 are added, and the result is recorded as Mo1; Mo1 contains 256 feature maps in total, each with a width of W/8 and a height of H/8.
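The multimodal information fusion step can be read as: concatenate the context features SX with the decoder features U, refine the concatenation with one convolution to obtain Re, gate the concatenation with the saliency side output S by element-wise multiplication followed by a second convolution to obtain For, then add For, Re and the depth-branch features DU element-wise. A hedged sketch follows; the 3 × 3 kernel size of the two convolutions is an assumption, since the patent does not state it in this passage.

```python
# Hedged sketch of one multimodal information fusion module (Mo = For + Re + DU).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv_re = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.conv_for = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, sx, u, s, du):
        m = torch.cat([sx, u], dim=1)      # M  = concat(SX, U)
        re = self.conv_re(m)               # Re = conv(M)
        mul = m * s                        # Mul: the 1-channel S broadcasts over M
        fo = self.conv_for(mul)            # For = conv(Mul)
        return fo + re + du                # Mo

mo1 = MultiModalFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32),
                         torch.randn(1, 1, 32, 32), torch.randn(1, 256, 32, 32))
```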
For the 1st RGB map up-sampling block, it consists of a twentieth convolution layer, a twentieth batch normalization layer, a twentieth activation layer, a twenty-first convolution layer, a twenty-first batch normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twenty-second batch normalization layer, a twenty-second activation layer and a twenty-third up-sampling layer which are arranged in sequence. The input end of the 1st RGB map up-sampling block receives all feature maps in Mo1, the output end of the 1st RGB map up-sampling block outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U3. The convolution kernel size of each of the twentieth, twenty-first and twenty-second convolution layers is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the twentieth, twenty-first and twenty-second batch normalization layers each output 256 feature maps; the activation mode of the twentieth, twenty-first and twenty-second activation layers is "Relu". The amplification factor of the twenty-third up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U3 has a width of W/4 and a height of H/4.
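An RGB map up-sampling block is thus three dilated 3 × 3 convolution + batch normalization + "Relu" stages (zero padding equal to the dilation) followed by × 2 bilinear up-sampling, with the dilation growing from block to block (2, 4, 6). A minimal sketch, assuming standard PyTorch layers:

```python
# Hedged sketch of an RGB map up-sampling block: three dilated conv + BN + ReLU stages
# followed by x2 bilinear up-sampling.
import torch
import torch.nn as nn

def rgb_upsample_block(ch=256, dilation=2, out_ch=None):
    out_ch = out_ch or ch
    layers, c_in = [], ch
    for _ in range(3):
        layers += [nn.Conv2d(c_in, out_ch, 3, padding=dilation, dilation=dilation),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        c_in = out_ch
    layers.append(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
    return nn.Sequential(*layers)

u3 = rgb_upsample_block(256, dilation=2)(torch.randn(1, 256, 32, 32))  # W/8 -> W/4
```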
U3 is taken as input into the 2nd saliency sub-output layer, which outputs 1 feature map, marked as S2; the convolution kernel of the 2nd saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S2 has a width of W/4 and a height of H/4.
For the 2nd context information fusion block, P5 is input into the twenty-fourth up-sampling layer followed by the twenty-fifth convolution layer: the input end of the twenty-fourth up-sampling layer receives all feature maps in P5 and its output end outputs 512 feature maps, which form one set; the input end of the twenty-fifth convolution layer receives this set and its output end outputs 256 feature maps, which form a second set. P4 is input into the twenty-sixth up-sampling layer followed by the twenty-seventh convolution layer: the input end of the twenty-sixth up-sampling layer receives all feature maps in P4 and its output end outputs 512 feature maps, which form a set; the input end of the twenty-seventh convolution layer receives this set and its output end outputs 256 feature maps, which form another set. P3 is input into the twenty-eighth convolution layer: the input end of the twenty-eighth convolution layer receives all feature maps in P3 and its output end outputs 256 feature maps, which form a further set. The amplification factor of the twenty-fourth up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the twenty-fifth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the twenty-sixth up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the twenty-seventh convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the twenty-eighth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in the three 256-map sets has a width of W/4 and a height of H/4. The three sets are superposed (concatenated) and the result is denoted C2, which is taken as input into the 2nd context information fusion block; this block consists of a twenty-ninth convolution layer, a twenty-ninth batch normalization layer, a twenty-ninth activation layer, a thirtieth convolution layer, a thirtieth batch normalization layer and a thirtieth activation layer which are arranged in sequence. The input end of the 2nd context information fusion block receives all feature maps in C2, the output end of the 2nd context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX2. The convolution kernel size of each of the twenty-ninth and thirtieth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the twenty-ninth and thirtieth batch normalization layers each output 256 feature maps; the activation mode of the twenty-ninth and thirtieth activation layers is "Relu". Each feature map in SX2 has a width of W/4 and a height of H/4.
For the second multimodal information fusion module, all feature maps in SX2, S2, U3 and DU2 are taken as input. First, SX2 and U3 are superposed (concatenated) and the result is denoted M2; M2 is input into the thirty-first convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re2. Then all feature maps in M2 are multiplied element-wise by S2 and the result is recorded as Mul2; Mul2 is input into the thirty-second convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For2. Finally, the corresponding position elements of all feature maps in For2, Re2 and DU2 are added, and the result is recorded as Mo2; Mo2 contains 256 feature maps in total, each with a width of W/4 and a height of H/4.
For the 2nd RGB map up-sampling block, it consists of a thirty-third convolution layer, a thirty-third batch normalization layer, a thirty-third activation layer, a thirty-fourth convolution layer, a thirty-fourth batch normalization layer, a thirty-fourth activation layer, a thirty-fifth convolution layer, a thirty-fifth batch normalization layer, a thirty-fifth activation layer and a thirty-sixth up-sampling layer which are arranged in sequence. The input end of the 2nd RGB map up-sampling block receives all feature maps in Mo2, the output end of the 2nd RGB map up-sampling block outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U2. The convolution kernel size of each of the thirty-third, thirty-fourth and thirty-fifth convolution layers is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 4 and the dilation parameter is 4; the thirty-third, thirty-fourth and thirty-fifth batch normalization layers each output 256 feature maps; the activation mode of the thirty-third, thirty-fourth and thirty-fifth activation layers is "Relu". The amplification factor of the thirty-sixth up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U2 has a width of W/2 and a height of H/2.
U2 is taken as input into the 3rd saliency sub-output layer, which outputs 1 feature map, marked as S3; the convolution kernel of the 3rd saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S3 has a width of W/2 and a height of H/2.
For the 3rd context information fusion block, P5 is input into the thirty-seventh up-sampling layer followed by the thirty-eighth convolution layer: the input end of the thirty-seventh up-sampling layer receives all feature maps in P5 and its output end outputs 512 feature maps, which form one set; the input end of the thirty-eighth convolution layer receives this set and its output end outputs 256 feature maps. P4 is input into the thirty-ninth up-sampling layer followed by the fortieth convolution layer: the input end of the thirty-ninth up-sampling layer receives all feature maps in P4 and its output end outputs 512 feature maps, which form a set; the input end of the fortieth convolution layer receives this set and its output end outputs 256 feature maps. P3 is input into the forty-first up-sampling layer followed by the forty-second convolution layer: the input end of the forty-first up-sampling layer receives all feature maps in P3 and its output end outputs 256 feature maps, which form a set; the input end of the forty-second convolution layer receives this set and its output end outputs 256 feature maps. P2 is input into the forty-third convolution layer: the input end of the forty-third convolution layer receives all feature maps in P2 and its output end outputs 256 feature maps. The amplification factor of the thirty-seventh up-sampling layer is 4 and the method adopted is bilinear interpolation; the convolution kernel size of the thirty-eighth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the thirty-ninth up-sampling layer is 4 and the method adopted is bilinear interpolation; the convolution kernel size of the fortieth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the forty-first up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the forty-second convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the forty-third convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in the four 256-map sets output by the thirty-eighth, fortieth, forty-second and forty-third convolution layers has a width of W/2 and a height of H/2. The four sets are superposed (concatenated) and the result is denoted C3, which is taken as input into the 3rd context information fusion block; this block consists of a forty-fourth convolution layer, a forty-fourth batch normalization layer, a forty-fourth activation layer, a forty-fifth convolution layer, a forty-fifth batch normalization layer and a forty-fifth activation layer which are arranged in sequence. The input end of the 3rd context information fusion block receives all feature maps in C3, the output end of the 3rd context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX3. The convolution kernel size of each of the forty-fourth and forty-fifth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the forty-fourth and forty-fifth batch normalization layers each output 256 feature maps; the activation mode of the forty-fourth and forty-fifth activation layers is "Relu". Each feature map in SX3 has a width of W/2 and a height of H/2.
For the third multimodal information fusion module, all feature maps in SX3, S3, U2 and DU3 are taken as input, DU3 denoting the set output by the third depth map up-sampling block. First, SX3 and U2 are superposed (concatenated) and the result is denoted M3; M3 is input into the forty-sixth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re3. Then all feature maps in M3 are multiplied element-wise by S3 and the result is recorded as Mul3; Mul3 is input into the forty-seventh convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For3. Finally, the corresponding position elements of all feature maps in For3, Re3 and DU3 are added, and the result is recorded as Mo3; Mo3 contains 256 feature maps in total, each with a width of W/2 and a height of H/2.
For the 3rd RGB map up-sampling block, it consists of a forty-eighth convolution layer, a forty-eighth batch normalization layer, a forty-eighth activation layer, a forty-ninth convolution layer, a forty-ninth batch normalization layer, a forty-ninth activation layer, a fiftieth convolution layer, a fiftieth batch normalization layer, a fiftieth activation layer and a fifty-first up-sampling layer which are arranged in sequence. The input end of the 3rd RGB map up-sampling block receives all feature maps in Mo3, the output end of the 3rd RGB map up-sampling block outputs 64 feature maps, and the set formed by these 64 feature maps is marked as U1. The convolution kernel size of each of the forty-eighth, forty-ninth and fiftieth convolution layers is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the forty-eighth, forty-ninth and fiftieth batch normalization layers each output 64 feature maps; the activation mode of the forty-eighth, forty-ninth and fiftieth activation layers is "Relu". The amplification factor of the fifty-first up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U1 has a width W and a height H. U1 is taken as input into the 4th saliency sub-output layer, which outputs 1 feature map, marked as S4; the convolution kernel of the 4th saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S4 has a width W and a height H.
Step 1_3: each original real color object image in the training set, together with its corresponding depth image, is taken as an original input image and input into the convolutional neural network for training, obtaining 4 saliency detection prediction maps and 4 saliency boundary prediction maps corresponding to each original real object image in the training set; for {Iq(i, j)}, the set formed by its 4 saliency detection prediction maps (S1, S2, S3, S4) is recorded as the saliency prediction image set, and the set formed by its 4 saliency boundary prediction maps is recorded as the boundary prediction image set.
Step 1_4: the real saliency detection label image corresponding to each original color real object image in the training set is scaled to 4 different sizes, obtaining an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 4 scaled images of the real saliency detection image corresponding to {Iq(i, j)} is recorded as the saliency label image set. In the same way, the real saliency boundary label image corresponding to each original color real object image in the training set is scaled to the same 4 sizes (W/8 × H/8, W/4 × H/4, W/2 × H/2 and W × H), and the set formed by the 4 scaled images of the real saliency boundary image corresponding to {Iq(i, j)} is recorded as the boundary label image set.
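Step 1_4 amounts to building a four-level label pyramid at the resolutions W/8, W/4, W/2 and W. A small sketch is given below; the use of nearest-neighbour resampling for the binary label maps is an illustrative assumption.

```python
# Hedged sketch of step 1_4: rescale a ground-truth label map to the four supervision
# resolutions; nearest-neighbour resampling is an assumed choice for binary labels.
import torch
import torch.nn.functional as F

def multiscale_labels(label, full_size):
    """label: 1 x 1 x H x W tensor; full_size: (H, W)."""
    h, w = full_size
    sizes = [(h // 8, w // 8), (h // 4, w // 4), (h // 2, w // 2), (h, w)]
    return [F.interpolate(label, size=s, mode='nearest') for s in sizes]

pyramid = multiscale_labels(torch.randint(0, 2, (1, 1, 256, 256)).float(), (256, 256))
```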
Step 1_5: the loss function value between the saliency prediction image set and the saliency label image set and the loss function value between the boundary prediction image set and the boundary label image set are calculated respectively; the former loss function value is obtained using categorical cross-entropy (categorical crossentropy), the latter loss function value is obtained using Dice loss, and the two loss function values are added to obtain the final loss function value.
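A hedged sketch of this combined loss is shown below. It uses binary cross-entropy with logits for the saliency branch, a common stand-in for the categorical cross-entropy named above when the prediction maps are single-channel, plus a standard Dice loss for the boundary branch; the exact formulation used by the patent may differ.

```python
# Hedged sketch of step 1_5: cross-entropy over the four saliency scales plus Dice
# loss over the four boundary scales, summed into one training loss.
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(sal_preds, sal_labels, bnd_preds, bnd_labels):
    l_sal = sum(F.binary_cross_entropy_with_logits(p, t)
                for p, t in zip(sal_preds, sal_labels))
    l_bnd = sum(dice_loss(p, t) for p, t in zip(bnd_preds, bnd_labels))
    return l_sal + l_bnd
```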
Step 1_6: steps 1_3 to 1_5 are repeated V times to obtain the convolutional neural network classification training model and Q × V loss function values; the loss function value with the minimum value is found among these Q × V values, and the weight vector and bias term corresponding to this minimum loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, correspondingly marked as Wbest and bbest; where V > 1, and in this example V = 300.
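A rough training-loop sketch of step 1_6 follows. It assumes a model that returns the four saliency and four boundary predictions and a data loader that yields the RGB image, the depth image and the two label pyramids; tracking the per-batch minimum loss to pick Wbest and bbest follows the patent wording only loosely, and in practice an epoch-level criterion would be more usual.

```python
# Hedged sketch of step 1_6: train for V epochs and keep the weights that produced the
# smallest observed loss value; model/loader signatures are assumptions.
import copy

def train(model, loader, optimizer, loss_fn, V=300):
    best_loss, best_state = float('inf'), None
    for epoch in range(V):
        for rgb, depth, sal_labels, bnd_labels in loader:
            optimizer.zero_grad()
            sal_preds, bnd_preds = model(rgb, depth)
            loss = loss_fn(sal_preds, sal_labels, bnd_preds, bnd_labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                  # keep W_best, b_best
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```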
The test stage process comprises the following specific steps:
Step 2_1: let a color image of a real object to be subjected to saliency detection be given, together with the depth image corresponding to this real object; where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width and H' denotes the height of the color image and of the depth image, and the pixel values of the pixel points whose coordinate position is (i', j') in the color image and in the depth image are denoted correspondingly.
Step 2_2: the R channel component, the G channel component and the B channel component of the color image to be detected, together with the superposed three-channel components obtained from the corresponding depth image, are input into the convolutional neural network classification training model, and Wbest and bbest are used to make the prediction, obtaining the predicted saliency detection images and saliency boundary images corresponding to the color image and the depth image; the prediction corresponding to S1 is recorded as the final saliency detection image, whose pixel value at the coordinate position (i', j') is the predicted saliency of that pixel point.
In this embodiment, the real saliency boundary label image used in step 1_1 is acquired as follows:
Step 1_1a: take the first pixel point to be processed in the real saliency detection label image and define it as the current pixel point.
Step 1_1b: perform a convolution operation on the current pixel point with a 3 × 3 convolution kernel whose weights are all 1, obtaining a convolution result.
Step 1_1c: if the convolution result is 0 or 9, determine the current pixel point to be a non-boundary pixel point; if the convolution result is any value from 1 to 8, determine the current pixel point to be a boundary pixel point.
Step 1_1d: take the next pixel point to be processed in the real saliency detection label image as the current pixel point and return to step 1_1b, until all pixel points in the real saliency detection label image have been processed.
Step 1_1e: in the real saliency boundary label image, for the pixel point whose coordinate position is (i, j): if the pixel point at (i, j) in the real saliency detection label image is a non-boundary pixel point, its pixel value in the real saliency boundary label image is set to 0; if it is a boundary pixel point, its pixel value is set to 1.
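The boundary-label extraction of steps 1_1a to 1_1e can be vectorised as a single convolution of the binary saliency label map with a 3 × 3 all-ones kernel: a neighbourhood sum of 0 or 9 means the neighbourhood is uniform (non-boundary), while any sum from 1 to 8 marks a boundary pixel. A sketch follows; note that zero padding at the image border is an implementation choice the per-pixel description above does not fix.

```python
# Hedged sketch of steps 1_1a-1_1e: detect boundary pixels of a binary saliency label
# map by summing each 3x3 neighbourhood with an all-ones kernel.
import torch
import torch.nn.functional as F

def saliency_boundary(label):
    """label: 1 x 1 x H x W binary (0/1) saliency label map."""
    kernel = torch.ones(1, 1, 3, 3)
    counts = F.conv2d(label, kernel, padding=1)     # neighbourhood sums in [0, 9]
    boundary = (counts > 0) & (counts < 9)          # sums 1..8 -> boundary pixel
    return boundary.float()                         # 1 = boundary, 0 = non-boundary

edges = saliency_boundary(torch.randint(0, 2, (1, 1, 64, 64)).float())
```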
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The architecture of the convolutional neural network based on context and depth information fusion is built with the Python-based deep learning library PyTorch 1.0.1. The test set of the real object image database NJU2000 (400 real object images) is used to analyse the saliency detection effect obtained by predicting real scene images with the method. The detection performance of the predicted saliency detection images is evaluated with 3 objective parameters commonly used for saliency detection methods as evaluation indexes, namely the Precision-Recall curve, the Mean Absolute Error (MAE) and the F-Measure.
Each real object image in the NJU2000 test set is predicted with the method of the invention to obtain its predicted saliency detection image. The Precision-Recall curve (PR Curve) reflecting the saliency detection effect of the method is shown in FIG. 8-a, the mean absolute error (MAE) reflecting the saliency detection effect of the method is 0.054 as shown in FIG. 8-b, and the F-Measure reflecting the saliency detection effect of the method is 0.872 as shown in FIG. 8-c. As can be seen from FIGS. 8-a to 8-c, the saliency detection results obtained for real object images with the method of the invention are the best, which shows that it is feasible and effective to obtain predicted saliency detection images of real object images with the method of the invention.
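For reference, the two scalar metrics can be computed as below; the binarization threshold and the weighting beta² = 0.3 in the F-Measure are the values commonly used in saliency detection work and are assumptions here, since the patent does not state them in this passage.

```python
# Hedged sketch of the evaluation metrics: mean absolute error and F-Measure between a
# predicted saliency map (values in [0,1]) and its binary ground truth.
import torch

def mae(pred, gt):
    return (pred - gt).abs().mean().item()

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    binary = (pred >= thresh).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)).item()
```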
FIG. 4a shows the 1st original real scene image; FIG. 4a-d shows the depth map corresponding to the 1st original real scene image; FIG. 4b shows the predicted saliency detection image obtained by predicting the original scene image shown in FIG. 4a with the method of the invention. FIG. 5a shows the 2nd original object image; FIG. 5a-d shows the depth map corresponding to the 2nd original real object image; FIG. 5b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 5a with the method of the invention. FIG. 6a shows the 3rd original object image; FIG. 6a-d shows the depth map corresponding to the 3rd original real object image; FIG. 6b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 6a with the method of the invention. FIG. 7a shows the 4th original object image; FIG. 7a-d shows the depth map corresponding to the 4th original real object image; FIG. 7b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 7a with the method of the invention. Comparing FIG. 4a with FIG. 4b, FIG. 5a with FIG. 5b, FIG. 6a with FIG. 6b and FIG. 7a with FIG. 7b, it can be seen that the detection accuracy of the predicted saliency detection images obtained with the method of the invention is high.

Claims (5)

1. An image significance detection method based on an information fusion convolutional neural network is characterized by comprising the following steps:
step 1: selecting Q RGB images containing real objects, and a depth map, a saliency detection label map and a saliency boundary label map which are known and correspond to each RGB image to form a training set;
step 2: constructing an information fusion convolutional neural network, wherein the information fusion convolutional neural network comprises an input layer, a hidden layer and an output layer which are sequentially connected;
and step 3: inputting each RGB image in the training set and the corresponding depth map thereof into an information fusion convolutional neural network from an input layer for training, and outputting from an output layer to obtain four saliency detection prediction maps and four saliency boundary prediction maps; taking the four significance detection prediction images as significance prediction image sets, and taking the four significance boundary prediction images as boundary prediction image sets; carrying out scaling treatment on the saliency detection label graphs corresponding to each RGB image in different sizes to obtain four images with different widths and heights as a saliency label graph set, and carrying out scaling treatment on the saliency boundary label graphs corresponding to the same RGB image in different sizes to obtain four images with different widths and heights as a boundary label graph set; calculating a first loss function value between the significance prediction atlas and the significance label atlas, calculating a second loss function value between the boundary prediction atlas and the boundary label atlas, and adding the first loss function value and the second loss function value to obtain a total loss function value;
step 4, repeatedly executing the step 3 for V times to obtain Q × V total loss function values, and taking the weight vector and the bias item corresponding to the minimum total loss function value as the optimal weight vector and the optimal bias item of the information fusion convolutional neural network, so as to obtain the trained information fusion convolutional neural network;
and 5: and collecting an RGB image to be subjected to significance detection, inputting the RGB image into the trained information fusion convolutional neural network, and outputting to obtain a final significance detection prediction image.
2. The image saliency detection method based on information fusion convolutional neural network of claim 1, characterized in that: the input layer of the information fusion convolutional neural network comprises an RGB (red, green and blue) image input layer and a depth image input layer, the hidden layer comprises a color image processing part and a depth image processing part, the RGB input layer receives an RGB image and inputs the RGB image to the color image processing part for processing and then outputs the RGB image to obtain four significance sub-output layers, and the depth image input layer receives the depth image and inputs the RGB image to the depth image processing part for processing and then outputs the depth image to obtain four boundary sub-output layers;
the color image processing part comprises a first RGB image neural network block, a first RGB image maximum pooling layer, a second RGB image neural network block, a second RGB image maximum pooling layer, a third RGB image neural network block, a third RGB image maximum pooling layer, a fourth RGB image neural network block, a fourth RGB image maximum pooling layer, a fifth RGB image neural network block, a first significance detection module, a first multi-mode information fusion module, a first RGB up-sampling block, a second multi-mode information fusion module, a second RGB up-sampling block, a third multi-mode information fusion module and a third RGB up-sampling block which are connected in sequence, the RGB image received by the RGB input layer is input into the color image processing part through a first RGB image neural network block and is output by a first significance detection module, a first RGB up-sampling block, a second RGB up-sampling block and a third RGB up-sampling block;
the outputs of the fourth RGB map neural network block and the fifth RGB map neural network block are connected to the input of the first context information fusion block, and the output of the first context information fusion block is connected to the input of the first multi-mode information fusion module; the outputs of the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the second context information fusion block, and the output of the second context information fusion block is connected to the input of the second multi-mode information fusion module; the outputs of the second RGB map neural network block, the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the third context information fusion block, and the output of the third context information fusion block is connected to the input of the third multi-mode information fusion module; the output of the first depth map upsampling block is also connected to the input of the first multi-mode information fusion module, the output of the second depth map upsampling block is also connected to the input of the second multi-mode information fusion module, and the output of the third depth map upsampling block is also connected to the input of the third multi-mode information fusion module;
the depth map processing part comprises a first depth map neural network block, a first depth map maximum pooling layer, a second depth map neural network block, a second depth map maximum pooling layer, a third depth map neural network block, a third depth map maximum pooling layer, a fourth depth map neural network block, a first depth map upsampling layer, a second depth map upsampling block, a second depth map upsampling layer, a third depth map upsampling block, a third depth map upsampling layer and a fourth depth map upsampling block which are connected in sequence; the depth map received by the depth map input layer is input into a depth map processing part through a first depth map neural network block and is output by a first depth map upsampling block, a second depth map upsampling block, a third depth map upsampling block and a fourth depth map upsampling block;
the output of the third depth map neural network block is connected to the input of the second depth map upsampling block, and the output of the first depth map upsampling layer and the output of the third depth map neural network block are fused and then input into the second depth map upsampling block; the output of the second depth map neural network block is connected to the input of the third depth map upsampling block, and the output of the second depth map upsampling layer and the output of the second depth map neural network block are fused and then input into the third depth map upsampling block; the output of the first depth map neural network block is connected to the input of a fourth depth map upsampling block, and the output of the third depth map upsampling layer and the output of the first depth map neural network block are fused and then input into the fourth depth map upsampling block;
the output layers comprise four significance sub-output layers and four boundary sub-output layers, the outputs of the first significance detection module, the first RGB up-sampling block, the second RGB up-sampling block and the third RGB up-sampling block are respectively connected with the first significance sub-output layer, the second significance sub-output layer, the third significance sub-output layer and the fourth significance sub-output layer, and the outputs of the first significance sub-output layer, the second significance sub-output layer and the third significance sub-output layer are also respectively connected with the inputs of the first multi-mode information fusion module, the second multi-mode information fusion module and the third multi-mode information fusion module; the outputs of the first depth map upsampling block, the second depth map upsampling block, the third depth map upsampling block and the fourth depth map upsampling block are respectively connected with the first boundary sub-output layer, the second boundary sub-output layer, the third boundary sub-output layer and the fourth boundary sub-output layer.
3. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the structure of each depth map neural network block is the same, each depth map neural network block is mainly formed by sequentially connecting a plurality of convolution blocks, and each convolution block is mainly composed of a convolution layer, a batch normalization layer and an activation layer which are sequentially connected.
4. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the context information fusion blocks have the same structure, specifically: each context information fusion block comprises a plurality of convolution layers, a convolution block I and a convolution block II, wherein the number of the convolution layers is the same as the number of inputs of the context information fusion block and the convolution layers correspond to the inputs one by one; one end of each convolution layer is connected with one input, the other end of each convolution layer is connected with the convolution block I and the convolution block II in sequence, and the output of the convolution block II is used as the output of the context information fusion block.
5. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the first multi-mode information fusion module comprises an overlapping layer, a multiplying layer, a first convolution layer, a second convolution layer and an addition layer, wherein the output of the overlapping layer is respectively connected to the input of the multiplying layer and the input of the first convolution layer, the output of the multiplying layer is connected to the input of the addition layer through the second convolution layer, the output of the first convolution layer is connected to the input of the addition layer, and the output of the addition layer is used as the output of the multi-mode information fusion module.
CN201910971962.4A 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network Withdrawn CN111445432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910971962.4A CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910971962.4A CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Publications (1)

Publication Number Publication Date
CN111445432A true CN111445432A (en) 2020-07-24

Family

ID=71652559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910971962.4A Withdrawn CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Country Status (1)

Country Link
CN (1) CN111445432A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN113112464A (en) * 2021-03-31 2021-07-13 四川大学 RGBD (red, green and blue) saliency object detection method and system based on cross-mode alternating current encoder
CN113112464B (en) * 2021-03-31 2022-06-21 四川大学 RGBD (red, green and blue) saliency object detection method and system based on cross-mode alternating current encoder
CN113505800A (en) * 2021-06-30 2021-10-15 深圳市慧鲤科技有限公司 Image processing method and training method, device, equipment and medium of model thereof
CN113362322A (en) * 2021-07-16 2021-09-07 浙江科技学院 Distinguishing auxiliary and multi-mode weighted fusion salient object detection method
CN113362322B (en) * 2021-07-16 2024-04-30 浙江科技学院 Obvious object detection method based on discrimination assistance and multi-mode weighting fusion
WO2023077809A1 (en) * 2021-11-05 2023-05-11 五邑大学 Neural network training method, electronic device, and computer storage medium

Similar Documents

Publication Publication Date Title
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
CN111445432A (en) Image significance detection method based on information fusion convolutional neural network
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110263813B (en) Significance detection method based on residual error network and depth information fusion
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN112597985B (en) Crowd counting method based on multi-scale feature fusion
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN112861729B (en) Real-time depth completion method based on pseudo-depth map guidance
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111582316A (en) RGB-D significance target detection method
CN112258526A (en) CT (computed tomography) kidney region cascade segmentation method based on dual attention mechanism
CN110909615B (en) Target detection method based on multi-scale input mixed perception neural network
CN113269787A (en) Remote sensing image semantic segmentation method based on gating fusion
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN113538457B (en) Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN111161224A (en) Casting internal defect grading evaluation system and method based on deep learning
CN114187520B (en) Building extraction model construction and application method
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN111160356A (en) Image segmentation and classification method and device
CN111310767A (en) Significance detection method based on boundary enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200724

WW01 Invention patent application withdrawn after publication