CN111445432A - Image significance detection method based on information fusion convolutional neural network - Google Patents

Image significance detection method based on information fusion convolutional neural network

Info

Publication number
CN111445432A
CN111445432A (application CN201910971962.4A)
Authority
CN
China
Prior art keywords
layer
block
output
neural network
depth map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910971962.4A
Other languages
Chinese (zh)
Inventor
周武杰
吴君委
雷景生
何成
王海江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201910971962.4A
Publication of CN111445432A
Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Abstract

The invention discloses an image saliency detection method based on an information fusion convolutional neural network. An original object image and its corresponding depth image are input into a convolutional neural network for training to obtain saliency detection prediction maps and salient object boundary prediction maps; loss function values are then computed between the set formed by the saliency detection prediction maps and the set formed by the real saliency detection images, and between the set formed by the boundary prediction maps and the set formed by the real salient object boundary images, to obtain the optimal weight vector and bias terms of the convolutional neural network training model; a scene image to be subjected to saliency detection and its corresponding depth image are then input into the trained convolutional neural network model to obtain a predicted saliency detection image. The method has the advantage of improving the efficiency and accuracy of saliency detection for object images.

Description

Image significance detection method based on information fusion convolutional neural network
Technical Field
The invention relates to a significance detection method based on deep learning, in particular to an image significance detection method based on an information fusion convolutional neural network.
Background
Visual saliency helps people quickly filter out unimportant information so that attention is focused on meaningful regions and the scene in front of the eyes can be better understood. With the development of computer vision, it is desirable that a computer have the same ability as a human being: when understanding and analyzing a complex scene, it can process useful information more selectively, so that the complexity of the algorithm is reduced and the interference of noise is eliminated. In conventional methods, researchers model the salient object detection algorithm according to various observed priors to generate a saliency map. These priors include contrast, center prior, edge prior, semantic prior, etc. However, in complex scenes, conventional approaches tend to be inaccurate, because these observations are often limited to low-level features (e.g., color and contrast) and do not accurately reflect the common saliency properties inherent to salient objects.
In recent years, deep convolutional neural networks have been widely used in various fields of computer vision, and great progress has been made on many difficult vision problems. Different from traditional methods, a deep convolutional neural network can be modeled from a large number of training samples and automatically learn more essential features end to end (end-to-end), effectively avoiding the defects of traditional manual modeling and feature design. Recently, the effective application of 3D sensors has enriched the available databases: not only color pictures but also the depth information of the pictures can be obtained. Depth information is an important part of the human visual system in real 3D scenes, yet it has been completely ignored in conventional methods, so the most important task at present is how to build a model that effectively utilizes depth information.
A deep-learning saliency detection method on an RGB-D database performs pixel-level end-to-end saliency detection directly: the images in the training set are input into the model framework for training to obtain the weights and the model, and prediction can then be carried out on the test set. At present, deep-learning saliency detection based on RGB-D databases mainly uses an encoding-decoding architecture, and there are three ways of using the depth information. The first is to directly stack the depth information and the color image information into a four-channel input, or to add or concatenate the color image information and the depth information during encoding; this is called pre-fusion. The second is to add or concatenate the mutually corresponding color image information and depth information into the corresponding decoding stage by skip connections; this is called post-fusion. The third is to use the color image information and the depth information separately for saliency prediction and then fuse the final results. In the first method, since the distributions of color image information and depth information differ greatly, directly adding the depth information during encoding introduces a certain amount of noise. In the third method, if either the depth-based or the color-based prediction is inaccurate, the final fused result is also inaccurate. The second method not only avoids the noise brought by directly utilizing the depth information in the encoding stage, but can also fully learn the complementary relationship between color image information and depth information as the network model is optimized. A previous post-fusion scheme is CNNs-Based RGB-D Saliency Detection via Cross-View Transfer and Multiview Fusion (RGB-D salient object detection based on cross-view transfer and multi-view fusion with convolutional neural networks), hereinafter referred to as CBSD. CBSD performs feature extraction and downsampling on the color image information and the depth information separately, fuses them at the smallest scale, and outputs a small-sized saliency prediction map on that basis. CBSD only has downsampling operations, so the spatial detail information of the object becomes blurred during the successive downsampling, and the different modalities are fused by direct addition, which affects the final result to a certain extent because the data distributions differ.
Disclosure of Invention
The invention aims to provide a significance detection method based on a convolutional neural network, which has high detection efficiency and high detection accuracy.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step 1: select Q RGB images containing real objects, together with the known depth map, saliency detection label map and saliency boundary label map corresponding to each RGB image, to form a training set; the saliency boundary label map is obtained by applying a 3 × 3 convolution to the saliency detection label map to extract its boundary. The saliency detection label map is the image obtained after extracting the real object, and the saliency boundary label map is the image obtained after extracting the contour of the real object.
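As a concrete illustration of step 1, the sketch below derives a saliency boundary label map from a binary saliency detection label map with a 3 × 3 convolution. The patent only states that a 3 × 3 convolution is used; the Laplacian-style kernel and the function name extract_boundary are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def extract_boundary(label_map: torch.Tensor) -> torch.Tensor:
    """label_map: (1, 1, H, W) binary saliency label map with values in {0, 1}."""
    laplacian = torch.tensor([[0., 1., 0.],
                              [1., -4., 1.],
                              [0., 1., 0.]]).view(1, 1, 3, 3)
    edges = F.conv2d(label_map, laplacian, padding=1)   # non-zero only at object contours
    return (edges.abs() > 0).float()                    # binarised saliency boundary label map

boundary = extract_boundary(torch.zeros(1, 1, 480, 640))
```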
Step 2: construct an information fusion convolutional neural network, which comprises an input layer, a hidden layer and an output layer connected in sequence.
Step 3: input each RGB image in the training set and its corresponding depth map into the information fusion convolutional neural network from the input layer for training, and obtain four saliency detection prediction maps and four saliency boundary prediction maps from the output layer; the four saliency detection prediction maps form the saliency prediction map set and the four saliency boundary prediction maps form the boundary prediction map set. Scale the saliency detection label map corresponding to each RGB image to four different sizes to obtain four images with different widths and heights as the saliency label map set, and scale the saliency boundary label map corresponding to the same RGB image to four different sizes to obtain four images with different widths and heights as the boundary label map set. Calculate a first loss function value between the saliency prediction map set and the saliency label map set using categorical cross entropy, calculate a second loss function value between the boundary prediction map set and the boundary label map set using Dice loss, and add the first loss function value and the second loss function value to obtain the total loss function value.
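The total loss of step 3 can be sketched as follows, assuming the four saliency predictions are single-channel maps supervised with (binary) cross entropy and the four boundary predictions are supervised with Dice loss; the function names and the exact Dice formulation are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(sal_preds, sal_labels, bnd_preds, bnd_labels):
    # sal_preds / bnd_preds: lists of 4 predictions at 4 scales;
    # sal_labels / bnd_labels: label maps rescaled to the matching sizes.
    loss1 = sum(F.binary_cross_entropy_with_logits(p, t)
                for p, t in zip(sal_preds, sal_labels))
    loss2 = sum(dice_loss(p, t) for p, t in zip(bnd_preds, bnd_labels))
    return loss1 + loss2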
Step 4: repeat step 3 V times to obtain Q × V total loss function values, and take the weight vector and bias terms corresponding to the minimum total loss function value as the optimal weight vector and optimal bias terms of the information fusion convolutional neural network, thereby obtaining the trained information fusion convolutional neural network.
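A minimal training-loop sketch for step 4 is given below, assuming a standard gradient-based optimiser; model, train_loader, loss_fn and the Adam optimiser are illustrative placeholders, while keeping the weights of the minimum total loss follows the text above.

```python
import copy
import torch

def train(model, train_loader, V, loss_fn, lr=1e-4):
    """loss_fn is e.g. the total_loss sketched under step 3."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for _ in range(V):                           # V passes over the Q training pairs
        for rgb, depth, sal_labels, bnd_labels in train_loader:
            sal_preds, bnd_preds = model(rgb, depth)
            loss = loss_fn(sal_preds, sal_labels, bnd_preds, bnd_labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            if loss.item() < best_loss:          # keep the weights/bias of the minimum total loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```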
Step 5: acquire an RGB image to be subjected to saliency detection together with its corresponding depth map, input them into the trained information fusion convolutional neural network, and take the fourth saliency detection prediction map output by the network as the final predicted saliency detection map.
The input layer of the information fusion convolutional neural network comprises an RGB map input layer and a depth map input layer, and the hidden layer comprises a color map processing part and a depth map processing part; the RGB map input layer receives an RGB image and inputs it into the color map processing part, whose outputs are fed to four saliency sub-output layers, and the depth map input layer receives a depth map and inputs it into the depth map processing part, whose outputs are fed to four boundary sub-output layers.
The color image processing part comprises a first RGB image neural network block, a first RGB image maximum pooling layer, a second RGB image neural network block, a second RGB image maximum pooling layer, a third RGB image neural network block, a third RGB image maximum pooling layer, a fourth RGB image neural network block, a fourth RGB image maximum pooling layer, a fifth RGB image neural network block, a first significance detection module, a first multi-mode information fusion module, a first RGB up-sampling block, a second multi-mode information fusion module, a second RGB up-sampling block, a third multi-mode information fusion module and a third RGB up-sampling block which are connected in sequence, the RGB map received by the RGB input layer is input into the color map processing part through a first RGB map neural network block and is output by a first significance detection module, a first RGB up-sampling block, a second RGB up-sampling block and a third RGB up-sampling block.
The outputs of the fourth RGB map neural network block and the fifth RGB map neural network block are connected to the input of the first context information fusion block, and the output of the first context information fusion block is connected to the input of the first multi-mode information fusion module; the outputs of the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the second context information fusion block, and the output of the second context information fusion block is connected to the input of the second multi-mode information fusion module; the outputs of the second RGB map neural network block, the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the third context information fusion block, and the output of the third context information fusion block is connected to the input of the third multi-mode information fusion module; the output of the first depth map upsampling block is further connected to the input of the first multimodal information fusion module, the output of the second depth map upsampling block is further connected to the input of the second multimodal information fusion module, and the output of the third depth map upsampling block is further connected to the input of the third multimodal information fusion module.
The depth map processing part comprises a first depth map neural network block, a first depth map maximum pooling layer, a second depth map neural network block, a second depth map maximum pooling layer, a third depth map neural network block, a third depth map maximum pooling layer, a fourth depth map neural network block, a first depth map upsampling block, a first depth map upsampling layer, a second depth map upsampling block, a second depth map upsampling layer, a third depth map upsampling block, a third depth map upsampling layer and a fourth depth map upsampling block which are connected in sequence; the depth map received by the depth map input layer enters the depth map processing part through the first depth map neural network block, and the outputs of the part are produced by the first, second, third and fourth depth map upsampling blocks.
The output of the third depth map neural network block is connected to the input of the second depth map upsampling block: the output of the first depth map upsampling layer and the output of the third depth map neural network block are fused and then input into the second depth map upsampling block. The output of the second depth map neural network block is connected to the input of the third depth map upsampling block: the output of the second depth map upsampling layer and the output of the second depth map neural network block are fused and then input into the third depth map upsampling block. The output of the first depth map neural network block is connected to the input of the fourth depth map upsampling block: the output of the third depth map upsampling layer and the output of the first depth map neural network block are fused and then input into the fourth depth map upsampling block. The fusion is performed by adding the pixel values of the pixel points at corresponding positions in the output feature maps.
The output layers comprise four significance sub-output layers and four boundary sub-output layers, the outputs of the first significance detection module, the first RGB up-sampling block, the second RGB up-sampling block and the third RGB up-sampling block are respectively connected with the first significance sub-output layer, the second significance sub-output layer, the third significance sub-output layer and the fourth significance sub-output layer, and the outputs of the first significance sub-output layer, the second significance sub-output layer and the third significance sub-output layer are also respectively connected with the inputs of the first multi-mode information fusion module, the second multi-mode information fusion module and the third multi-mode information fusion module; the outputs of the first depth map upsampling block, the second depth map upsampling block, the third depth map upsampling block and the fourth depth map upsampling block are respectively connected with the first boundary sub-output layer, the second boundary sub-output layer, the third boundary sub-output layer and the fourth boundary sub-output layer.
Each depth map neural network block has the same structure: it is formed by sequentially connecting several convolution blocks, and each convolution block consists of a convolution layer, a batch normalization layer and an activation layer connected in sequence, as sketched below. The numbers of convolution blocks in the first, second, third and fourth depth map neural network blocks are 2, 2, 3 and 3 respectively. The first, second, third and fourth depth map upsampling blocks each contain 3 convolution blocks. The numbers of convolution blocks in the first, second, third, fourth and fifth color map neural network blocks are 2, 2, 3, 3 and 3 respectively. The first, second and third RGB upsampling blocks have the same structure and each consists of three convolution blocks and an upsampling layer: the three convolution blocks are connected in sequence, one end serves as the input of the RGB upsampling block, the other end is connected to the upsampling layer, and the output of the upsampling layer serves as the output of the RGB upsampling block.
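A minimal PyTorch-style sketch of such a convolution block (convolution + batch normalization + ReLU activation) and of a neural network block stacked from several of them is given below; the channel numbers of the example blocks follow the detailed description later in the text, and the helper names are illustrative.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, dilation=1):
    # convolution layer + batch normalization layer + activation layer ("Relu")
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def neural_network_block(in_ch, out_ch, n_blocks):
    layers = [conv_block(in_ch, out_ch)]
    layers += [conv_block(out_ch, out_ch) for _ in range(n_blocks - 1)]
    return nn.Sequential(*layers)

# e.g. the 1st and 3rd depth map neural network blocks
depth_block1 = neural_network_block(3, 64, n_blocks=2)
depth_block3 = neural_network_block(128, 256, n_blocks=3)
```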
The context information fusion blocks have the same structure. Each context information fusion block comprises a number of convolution layers equal to its number of inputs, the convolution layers corresponding to the inputs one by one: one end of each convolution layer is connected to one input, the other ends are jointly connected to a convolution block I and a convolution block II in sequence, and the output of convolution block II serves as the output of the context information fusion block.
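The sketch below illustrates a context information fusion block under the above description, assuming (as in the detailed embodiment) 1 × 1 convolution layers on the inputs followed by concatenation and two 3 × 3 convolution blocks; the class name, the channel numbers and the assumption that all inputs share the same spatial size are illustrative.

```python
import torch
import torch.nn as nn

class ContextInfoFusionBlock(nn.Module):
    """N inputs -> N 1x1 convolution layers -> concatenation -> convolution blocks I and II."""
    def __init__(self, in_channels_list, mid_ch=256, out_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels_list)
        def conv_block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            conv_block(mid_ch * len(in_channels_list), out_ch),  # convolution block I
            conv_block(out_ch, out_ch))                          # convolution block II

    def forward(self, feats):
        # feats: a list of 2, 3 or 4 feature maps, assumed to share the same spatial size
        reduced = [conv(f) for conv, f in zip(self.reduce, feats)]
        return self.fuse(torch.cat(reduced, dim=1))
```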
The first multi-modal information fusion module comprises a superposition layer, a multiplication layer, a first convolution layer, a second convolution layer and an addition layer; the output of the superposition layer is connected to the input of the multiplication layer and to the input of the first convolution layer, the output of the multiplication layer is connected to the input of the addition layer through the second convolution layer, the output of the first convolution layer is connected to the input of the addition layer, and the output of the addition layer serves as the output of the multi-modal information fusion module. Superposition means concatenating the output feature maps along the channel dimension, multiplication means multiplying the pixel values of the pixel points at corresponding positions in the output feature maps, and addition means adding the pixel values of the pixel points at corresponding positions in the output feature maps.
For the first multi-modal information fusion module: the output of the first context information fusion block and the output of the first saliency detection module are jointly input into the superposition layer and superposed (concatenated), forming the output of the superposition layer; the output of the superposition layer is also input into the first convolution layer; the output of the superposition layer and the output of the first saliency sub-output layer are jointly input into the multiplication layer and multiplied, forming the output of the multiplication layer, which is then input into the second convolution layer; the outputs of the first convolution layer, the second convolution layer and the first depth map upsampling block are jointly input into the addition layer and added, and the result is the output of the first multi-modal information fusion module.
For the second and third multi-modal information fusion modules: the output of the corresponding context information fusion block and the output of the corresponding RGB upsampling block are jointly input into the superposition layer and superposed (concatenated), forming the output of the superposition layer; the output of the superposition layer is also input into the first convolution layer; the output of the superposition layer and the output of the corresponding saliency sub-output layer are jointly input into the multiplication layer and multiplied, forming the output of the multiplication layer, which is input into the second convolution layer; the outputs of the first convolution layer, the second convolution layer and the corresponding depth map upsampling block are jointly input into the addition layer and added, and the result is the output of the second or third multi-modal information fusion module.
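A hedged sketch of a multi-modal information fusion module wired as described above is given below: concatenation, a multiplication weighted by the smaller-scale saliency prediction, two convolution layers and an element-wise addition with the depth-branch feature. The class name and channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, ctx_ch, rgb_ch, depth_ch):
        super().__init__()
        cat_ch = ctx_ch + rgb_ch
        self.conv1 = nn.Conv2d(cat_ch, depth_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(cat_ch, depth_ch, kernel_size=3, padding=1)

    def forward(self, ctx_feat, rgb_feat, sal_map, depth_feat):
        cat = torch.cat([ctx_feat, rgb_feat], dim=1)   # superposition (concatenation) layer
        weighted = cat * sal_map                       # multiplication by the 1-channel saliency map
        return self.conv1(cat) + self.conv2(weighted) + depth_feat  # addition layer
```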
Each depth map upsampling layer performs upsampling by bilinear interpolation, and the first saliency detection module adopts the network structure of a pyramid pooling module (pyramid pooling module).
Compared with the prior art, the invention has the advantages that:
1) The method constructs a convolutional neural network and inputs the color images and depth images in the training set into it for training to obtain a convolutional neural network training model. When constructing the convolutional neural network, the method combines dilated convolution layers (convolution layers with holes) with bilinear interpolation layers (i.e. the upsampling layers) to build the 1st to 3rd RGB upsampling blocks and the 1st to 4th depth map upsampling blocks, so that the spatial information of objects is refined step by step during upsampling; the dilated convolution layers obtain a larger receptive field, which improves the final detection effect.
2) When utilizing the depth information, the method innovatively uses it to obtain the boundary of the salient object, and designs a fusion mechanism for fusing the different modal information (i.e. color image information and depth image information), taking the smaller-scale saliency prediction map as input to progressively guide the prediction of the larger-scale saliency prediction map. Extracting the boundary and progressively predicting the salient object greatly improve the final detection effect.
3) The invention adopts multiple supervision, namely the saliency sub-output layers are supervised with the saliency detection label map and the saliency boundary sub-output layers are supervised with the saliency boundary label map, so that the boundary of the object is clearer and a better result is obtained.
Drawings
FIG. 1 is a block diagram of an overall implementation of the inventive method;
FIG. 2 is a schematic diagram of a multimodal information fusion module;
FIG. 3 is a diagram of a context information fusion block; the three context information fusion blocks of the invention differ only in their number of inputs (2, 3 and 4 respectively), the rest of the structure being identical; the figure shows the two-input case.
FIG. 4a is the 1 st original image of a real object;
4a-d are depth maps of the 1 st original image of a real object;
FIG. 4b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 4a using the method of the present invention;
FIG. 5a is the 2 nd original image of a real object;
FIGS. 5a-d are depth maps of the 2 nd original real object image;
FIG. 5b is a predicted saliency detection image obtained by predicting the original object image shown in FIG. 5a using the method of the present invention;
FIG. 6a is the 3 rd original image of a real object;
6a-d are depth maps of the 3 rd original real object image;
FIG. 6b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 6a using the method of the present invention;
FIG. 7a is the 4 th original image of a real object;
FIGS. 7a-d are depth maps of the 4 th original real object image;
FIG. 7b is a predicted saliency detection image obtained by predicting the original real object image shown in FIG. 7a using the method of the present invention;
FIG. 8-a shows the precision-recall curves reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set;
FIG. 8-b shows the mean absolute error reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set;
FIG. 8-c shows the F-measure reflecting the saliency detection performance of the method of the present invention when predicting each color real object image in the real object image database NJU2000 test set.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a significance detection method based on a convolutional neural network, the overall implementation block diagram of which is shown in fig. 1, and the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original color real object images, and for each of them the corresponding depth image and real saliency detection label image, to form a training set; denote the q-th original color real object image in the training set, its corresponding depth image and its real saliency detection label image as {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)} respectively. Then use a 3 × 3 convolution to extract the boundary of each real saliency detection label image in the training set, obtaining the saliency boundary map of each real saliency detection label image; the saliency boundary map of {Gq(i,j)} is denoted as {Eq(i,j)}. Here Q is a positive integer with Q ≥ 200; q is a positive integer with initial value 1 and 1 ≤ q ≤ Q; 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Iq(i,j)}, {Dq(i,j)} and {Gq(i,j)}, H denotes their height, and both W and H are divisible by 2. {Iq(i,j)} is an RGB color image and Iq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Iq(i,j)}; {Dq(i,j)} is a single-channel depth image and Dq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Dq(i,j)}; Gq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Gq(i,j)}; Eq(i,j) denotes the pixel value of the pixel with coordinates (i,j) in {Eq(i,j)}.
Step 1_2: construct a convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the input layer comprises an RGB map input layer and a depth map input layer; the hidden layer comprises 5 RGB map neural network blocks, 4 RGB map maximum pooling layers, 4 depth map neural network blocks, 3 depth map maximum pooling layers, 3 RGB upsampling blocks, 1 global-based saliency detection module, 3 multi-modal information fusion modules, 3 context information fusion blocks, 4 depth map upsampling blocks and 3 depth map upsampling layers; the output layer comprises 4 saliency sub-output layers and 4 saliency boundary sub-output layers.
For the depth map input layer: its input end receives an original input depth image and replicates it into three-channel depth information, and its output end outputs the three-channel depth information of the original input image to the hidden layer; the input end of the input layer is required to receive an original input image with width W and height H.
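A minimal sketch of this depth input layer, assuming the single-channel depth map is simply replicated into three identical channels (the tensor sizes are illustrative):

```python
import torch

depth = torch.rand(1, 1, 480, 640)        # (batch, 1, H, W) single-channel depth map
depth_3ch = depth.repeat(1, 3, 1, 1)      # (batch, 3, H, W) three-channel depth information
```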
For the 1st depth map neural network block, it consists of a first convolution layer, a first batch normalization layer, a first activation layer, a second convolution layer, a second batch normalization layer and a second activation layer which are arranged in sequence. The input end of the 1st depth map neural network block receives the three-channel components of the original input image output by the output end of the depth map input layer, the output end of the 1st depth map neural network block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DP1. The convolution kernel size (kernel_size) of the first convolution layer is 3 × 3, the number of convolution kernels (filters) is 64 and the zero padding (padding) is 1; the output of the first batch normalization layer is 64 feature maps; the activation mode of the first activation layer is 'Relu'; the convolution kernel size of the second convolution layer is 3 × 3, the number of convolution kernels is 64 and the zero padding is 1; the output of the second batch normalization layer is 64 feature maps; the activation mode of the second activation layer is 'Relu'. DP1 is taken as input to the first maximum pooling layer (Pool), whose pooling size (pool_size) is 2 and stride is 2, and the output is denoted as DP1p; each feature map in DP1p has a width of W/2 and a height of H/2.
For the 2nd depth map neural network block, it consists of a third convolution layer, a third batch normalization layer, a third activation layer, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer which are arranged in sequence. The input end of the 2nd depth map neural network block receives all feature maps in DP1p, the output end of the 2nd depth map neural network block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DP2. The convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the third batch normalization layer is 128 feature maps; the activation mode of the third activation layer is 'Relu'; the convolution kernel size of the fourth convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the fourth batch normalization layer is 128 feature maps; the activation mode of the fourth activation layer is 'Relu'. DP2 is taken as input to the second maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as DP2p; each feature map in DP2p has a width of W/4 and a height of H/4.
For the 3rd depth map neural network block, it consists of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a sixth convolution layer, a sixth batch normalization layer, a sixth activation layer, a seventh convolution layer, a seventh batch normalization layer and a seventh activation layer which are arranged in sequence. The input end of the 3rd depth map neural network block receives all feature maps in DP2p, the output end of the 3rd depth map neural network block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DP3. The convolution kernel size of the fifth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the fifth batch normalization layer is 256 feature maps; the activation mode of the fifth activation layer is 'Relu'; the convolution kernel size of the sixth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the sixth batch normalization layer is 256 feature maps; the activation mode of the sixth activation layer is 'Relu'; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the seventh batch normalization layer is 256 feature maps; the activation mode of the seventh activation layer is 'Relu'. DP3 is taken as input to the third maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as DP3p; each feature map in DP3p has a width of W/8 and a height of H/8.
For the 4th depth map neural network block, it consists of an eighth convolution layer, an eighth batch normalization layer, an eighth activation layer, a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer which are arranged in sequence. The input end of the 4th depth map neural network block receives all feature maps in DP3p, the output end of the 4th depth map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is denoted as DP4. The convolution kernel size of the eighth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the eighth batch normalization layer is 512 feature maps; the activation mode of the eighth activation layer is 'Relu'; the convolution kernel size of the ninth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the ninth batch normalization layer is 512 feature maps; the activation mode of the ninth activation layer is 'Relu'; the convolution kernel size of the tenth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the tenth batch normalization layer is 512 feature maps; the activation mode of the tenth activation layer is 'Relu'. Each feature map in DP4 has a width of W/8 and a height of H/8.
For the 1st depth map upsampling block, it consists of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a twelfth convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a thirteenth convolution layer, a thirteenth batch normalization layer and a thirteenth activation layer which are arranged in sequence. The input end of the 1st depth map upsampling block receives all feature maps in DP4, the output end of the 1st depth map upsampling block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DU1. The convolution kernel size of the eleventh convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the eleventh batch normalization layer is 256 feature maps; the activation mode of the eleventh activation layer is 'Relu'; the convolution kernel size of the twelfth convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the twelfth batch normalization layer is 256 feature maps; the activation mode of the twelfth activation layer is 'Relu'; the convolution kernel size of the thirteenth convolution layer is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the output of the thirteenth batch normalization layer is 256 feature maps; the activation mode of the thirteenth activation layer is 'Relu'. Each feature map in DU1 has a width of W/8 and a height of H/8.
DU1 is taken as input to the first boundary sub-output layer, and the first boundary sub-output layer outputs 1 feature map, denoted as D3; the convolution kernel size of the convolution of the first boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D3 has a width of W/8 and a height of H/8.
For the first depth map upsampling layer, it consists of a first bilinear interpolation upsampling layer. The input end of the first depth map upsampling layer receives all feature maps in DU1, the output end of the first depth map upsampling layer outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as DUB1; the magnification factor of the first bilinear interpolation upsampling layer is 2. Each feature map in DUB1 has a width of W/4 and a height of H/4.
For the 1st depth map fusion layer, the result of adding DUB1 and DP3 element-wise at corresponding positions is denoted as U3, which is taken as input to the 2nd depth map upsampling block. The 2nd depth map upsampling block consists of a fourteenth convolution layer, a fourteenth batch normalization layer, a fourteenth activation layer, a fifteenth convolution layer, a fifteenth batch normalization layer, a fifteenth activation layer, a sixteenth convolution layer, a sixteenth batch normalization layer and a sixteenth activation layer which are arranged in sequence. The input end of the 2nd depth map upsampling block receives U3, the output end of the 2nd depth map upsampling block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DU2. The convolution kernel size of the fourteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the fourteenth batch normalization layer is 128 feature maps; the activation mode of the fourteenth activation layer is 'Relu'; the convolution kernel size of the fifteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the fifteenth batch normalization layer is 128 feature maps; the activation mode of the fifteenth activation layer is 'Relu'; the convolution kernel size of the sixteenth convolution layer is 3 × 3, the number of convolution kernels is 128, the zero padding parameter is 4 and the dilation parameter is 4; the output of the sixteenth batch normalization layer is 128 feature maps; the activation mode of the sixteenth activation layer is 'Relu'. Each feature map in DU2 has a width of W/4 and a height of H/4.
DU2 is taken as input to the 2nd boundary sub-output layer, and the 2nd boundary sub-output layer outputs 1 feature map, denoted as D2; the convolution kernel size of the convolution of the 2nd boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D2 has a width of W/4 and a height of H/4.
For the 2nd depth map upsampling layer, it consists of a 2nd bilinear interpolation upsampling layer. The input end of the 2nd depth map upsampling layer receives all feature maps in DU2, the output end of the 2nd depth map upsampling layer outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as DUB2; the magnification factor of the 2nd bilinear interpolation upsampling layer is 2. Each feature map in DUB2 has a width of W/2 and a height of H/2.
For the 2nd depth map fusion layer, the result of adding DUB2 and DP2 element-wise at corresponding positions is denoted as U2, which is taken as input to the 3rd depth map upsampling block. The 3rd depth map upsampling block consists of a seventeenth convolution layer, a seventeenth batch normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighteenth batch normalization layer, an eighteenth activation layer, a nineteenth convolution layer, a nineteenth batch normalization layer and a nineteenth activation layer which are arranged in sequence. The input end of the 3rd depth map upsampling block receives all feature maps in U2, the output end of the 3rd depth map upsampling block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DU3. The convolution kernel size of the seventeenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the seventeenth batch normalization layer is 64 feature maps; the activation mode of the seventeenth activation layer is 'Relu'; the convolution kernel size of the eighteenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the eighteenth batch normalization layer is 64 feature maps; the activation mode of the eighteenth activation layer is 'Relu'; the convolution kernel size of the nineteenth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the output of the nineteenth batch normalization layer is 64 feature maps; the activation mode of the nineteenth activation layer is 'Relu'. Each feature map in DU3 has a width of W/2 and a height of H/2.
DU3 is taken as input to the 3rd boundary sub-output layer, and the 3rd boundary sub-output layer outputs 1 feature map, denoted as D1; the convolution kernel size of the convolution of the 3rd boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in D1 has a width of W/2 and a height of H/2.
For the 3rd depth map upsampling layer, it consists of a 3rd bilinear interpolation upsampling layer. The input end of the 3rd depth map upsampling layer receives all feature maps in DU3, the output end of the 3rd depth map upsampling layer outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as DUB3; the magnification factor of the 3rd bilinear interpolation upsampling layer is 2. Each feature map in DUB3 has a width of W and a height of H.
For the 3rd depth map fusion layer, the result of adding DUB3 and DP1 element-wise at corresponding positions is denoted as U1, which is taken as input to the 4th depth map upsampling block. The 4th depth map upsampling block consists of a twentieth convolution layer, a twentieth batch normalization layer, a twentieth activation layer, a twenty-first convolution layer, a twenty-first batch normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twenty-second batch normalization layer and a twenty-second activation layer which are arranged in sequence. The input end of the 4th depth map upsampling block receives all feature maps in U1, and its output end outputs 64 feature maps. The convolution kernel size of the twentieth convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twentieth batch normalization layer is 64 feature maps; the activation mode of the twentieth activation layer is 'Relu'; the convolution kernel size of the twenty-first convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twenty-first batch normalization layer is 64 feature maps; the activation mode of the twenty-first activation layer is 'Relu'; the convolution kernel size of the twenty-second convolution layer is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 8 and the dilation parameter is 8; the output of the twenty-second batch normalization layer is 64 feature maps; the activation mode of the twenty-second activation layer is 'Relu'. Each feature map output by the 4th depth map upsampling block has a width of W and a height of H. This output is taken as input to the 4th boundary sub-output layer, which outputs 1 feature map; the convolution kernel size of the convolution of the 4th boundary sub-output layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0, and this feature map has a width of W and a height of H.
For a color input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original input color image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the input end of the input layer is required to receive the original input image with width W and height H.
For the 1st color map neural network block, it consists of a first convolution layer (Conv), a first batch normalization layer (BN), a first activation layer (Act), a second convolution layer, a second batch normalization layer and a second activation layer which are arranged in sequence. The input end of the 1st color map neural network block receives the R channel component, G channel component and B channel component of the original input image output by the output end of the color map input layer, the output end of the 1st color map neural network block outputs 64 feature maps, and the set formed by these 64 feature maps is denoted as P1. The convolution kernel size (kernel_size) of the first convolution layer is 3 × 3, the number of convolution kernels (filters) is 64 and the zero padding (padding) is 1; the output of the first batch normalization layer is 64 feature maps; the activation mode of the first activation layer is 'Relu'; the convolution kernel size of the second convolution layer is 3 × 3, the number of convolution kernels is 64 and the zero padding is 1; the output of the second batch normalization layer is 64 feature maps; the activation mode of the second activation layer is 'Relu'. P1 is taken as input to the first maximum pooling layer (Pool), whose pooling size (pool_size) is 2 and stride is 2, and the output is denoted as P1p; each feature map in P1p has a width of W/2 and a height of H/2.
For the 2nd color map neural network block, it consists of a third convolution layer, a third batch normalization layer, a third activation layer, a fourth convolution layer, a fourth batch normalization layer and a fourth activation layer which are arranged in sequence. The input end of the 2nd color map neural network block receives all feature maps in P1p, the output end of the 2nd color map neural network block outputs 128 feature maps, and the set formed by these 128 feature maps is denoted as P2. The convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the third batch normalization layer is 128 feature maps; the activation mode of the third activation layer is 'Relu'; the convolution kernel size of the fourth convolution layer is 3 × 3, the number of convolution kernels is 128 and the zero padding parameter is 1; the output of the fourth batch normalization layer is 128 feature maps; the activation mode of the fourth activation layer is 'Relu'. P2 is taken as input to the second maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as P2p; each feature map in P2p has a width of W/4 and a height of H/4.
For the 3rd color map neural network block, it consists of a fifth convolution layer, a fifth batch normalization layer, a fifth activation layer, a sixth convolution layer, a sixth batch normalization layer, a sixth activation layer, a seventh convolution layer, a seventh batch normalization layer and a seventh activation layer which are arranged in sequence. The input end of the 3rd color map neural network block receives all feature maps in P2p, the output end of the 3rd color map neural network block outputs 256 feature maps, and the set formed by these 256 feature maps is denoted as P3. The convolution kernel size of the fifth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the fifth batch normalization layer is 256 feature maps; the activation mode of the fifth activation layer is 'Relu'; the convolution kernel size of the sixth convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the sixth batch normalization layer is 256 feature maps; the activation mode of the sixth activation layer is 'Relu'; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the output of the seventh batch normalization layer is 256 feature maps; the activation mode of the seventh activation layer is 'Relu'. P3 is taken as input to the third maximum pooling layer, whose pooling size is 2 and stride is 2, and the output is denoted as P3p; each feature map in P3p has a width of W/8 and a height of H/8.
For the 4th color map neural network block, it consists of an eighth convolution layer, an eighth batch normalization layer, an eighth activation layer, a ninth convolution layer, a ninth batch normalization layer, a ninth activation layer, a tenth convolution layer, a tenth batch normalization layer and a tenth activation layer which are arranged in sequence. The input end of the 4th color map neural network block receives all feature maps in P3p, the output end of the 4th color map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is denoted as P4. The convolution kernel size of the eighth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the eighth batch normalization layer is 512 feature maps; the activation mode of the eighth activation layer is 'Relu'; the convolution kernel size of the ninth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the ninth batch normalization layer is 512 feature maps; the activation mode of the ninth activation layer is 'Relu'; the convolution kernel size of the tenth convolution layer is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the output of the tenth batch normalization layer is 512 feature maps; the activation mode of the tenth activation layer is 'Relu'. P4 is taken as input to the fourth maximum pooling layer, whose pooling size is 1 and stride is 1, and the output is denoted as P4p; each feature map in P4p has a width of W/8 and a height of H/8.
For the 5th color map neural network block, it consists of an eleventh convolution layer, an eleventh batch normalization layer, an eleventh activation layer, a twelfth convolution layer, a twelfth batch normalization layer, a twelfth activation layer, a thirteenth convolution layer, a thirteenth batch normalization layer and a thirteenth activation layer which are arranged in sequence. The input end of the 5th color map neural network block receives all feature maps in P4p, the output end of the 5th color map neural network block outputs 512 feature maps, and the set formed by these 512 feature maps is marked as P5. The convolution kernel size of each of the eleventh, twelfth and thirteenth convolution layers is 3 × 3, the number of convolution kernels is 512 and the zero padding parameter is 1; the eleventh, twelfth and thirteenth batch normalization layers each output 512 feature maps; the activation mode of the eleventh, twelfth and thirteenth activation layers is "Relu". Each feature map in P5 has a width of W/8 and a height of H/8.
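Each color map neural network block described above is a VGG-style stack of 3 × 3 convolution, batch normalization and "Relu" stages, optionally followed by a max pooling layer. A minimal PyTorch sketch of one such block is given below; the class name ColorMapBlock and the assumed channel count of P2p (128) are illustrative and not taken from the patent.

```python
# Minimal sketch of one color-map neural network block plus its max-pooling layer,
# assuming standard PyTorch layers; class and argument names are illustrative.
import torch
import torch.nn as nn

class ColorMapBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # three repetitions of 3x3 convolution (zero padding 1) + batch norm + ReLU
        layers = []
        for i in range(3):
            layers += [
                nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# e.g. the 3rd block: assumed 128 -> 256 channels, then pooling size 2, stride 2
block3 = ColorMapBlock(128, 256)
pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 128, 64, 64)   # a stand-in for P2p
p3 = block3(x)                    # P3: 256 x 64 x 64
p3p = pool3(p3)                   # P3p: 256 x 32 x 32 (W/8 x H/8 for a 256x256 input)
```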
For the first global-based saliency detection module, P5 is taken as its input; this module consists of a first pyramid pooling module (pyramid pooling module). The input end of the first global-based saliency detection module receives all feature maps in P5, the output end of the first global-based saliency detection module outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U4. Each feature map in U4 has a width of W/8 and a height of H/8.
U4 is taken as input into the 1st saliency sub-output layer, which outputs 1 feature map, marked as S1; the convolution kernel of the 1st saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S1 has a width of W/8 and a height of H/8.
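The patent identifies the first global-based saliency detection module with a pyramid pooling module but does not spell out its internals in this passage. The sketch below follows the common PSPNet-style design (average pooling at several grid sizes, per-branch channel reduction, bilinear up-sampling and concatenation); the bin sizes (1, 2, 3, 6), the per-branch reduction width and the fusion kernel size are assumptions.

```python
# Hedged sketch of a PSPNet-style pyramid pooling module producing a 256-channel output;
# bin sizes and channel splits are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                  # pool the whole map to b x b
                nn.Conv2d(in_ch, in_ch // len(bins), 1),  # reduce channels per branch
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.fuse = nn.Conv2d(in_ch + in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for s in self.stages]
        return self.fuse(torch.cat(feats, dim=1))         # 256-channel output, e.g. U4

u4 = PyramidPooling()(torch.randn(1, 512, 32, 32))        # P5 stand-in at W/8 x H/8
```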
As shown in FIG. 3, for the 1st context information fusion block, P5 and P4 are input to the fourteenth and fifteenth convolution layers respectively. The input end of the fourteenth convolution layer receives all feature maps in P5 and its output end outputs 256 feature maps, which form one set; the input end of the fifteenth convolution layer receives all feature maps in P4 and its output end outputs 256 feature maps, which form a second set. The convolution kernel size of the fourteenth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the fifteenth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in these two sets has a width of W/8 and a height of H/8. The two sets are superposed (concatenated) and the result is denoted C1, which is taken as input into the 1st context information fusion block; this block consists of a sixteenth convolution layer, a sixteenth batch normalization layer, a sixteenth activation layer, a seventeenth convolution layer, a seventeenth batch normalization layer and a seventeenth activation layer which are arranged in sequence. The input end of the 1st context information fusion block receives all feature maps in C1, the output end of the 1st context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX1. The convolution kernel size of each of the sixteenth and seventeenth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the sixteenth and seventeenth batch normalization layers each output 256 feature maps; the activation mode of the sixteenth and seventeenth activation layers is "Relu". Each feature map in SX1 has a width of W/8 and a height of H/8.
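A context information fusion block therefore reduces each incoming feature set with a 1 × 1 convolution, concatenates the reduced sets and refines the result with two 3 × 3 convolution + batch normalization + "Relu" stages. The sketch below captures this structure; the reduced channel width of 256 per branch is an illustrative choice and does not follow the literal kernel counts listed above.

```python
# Hedged sketch of a context information fusion block: per-input 1x1 reduction,
# concatenation, then two 3x3 conv + BN + ReLU stages; channel widths are illustrative.
import torch
import torch.nn as nn

class ContextFusionBlock(nn.Module):
    def __init__(self, in_channels=(512, 512), mid_ch=256, out_ch=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch * len(in_channels), out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, inputs):
        # inputs are feature sets already brought to a common spatial size
        reduced = [r(x) for r, x in zip(self.reduce, inputs)]
        return torch.cat(reduced, dim=1), self.fuse(torch.cat(reduced, dim=1))[1] if False else self.fuse(torch.cat(reduced, dim=1))

p5 = torch.randn(1, 512, 32, 32)
p4 = torch.randn(1, 512, 32, 32)
sx1 = ContextFusionBlock()([p5, p4])   # fused context features, e.g. SX1
```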
For the first multimodal information fusion module, all feature maps in SX1, S1, U4 and DU1 are taken as input. First, SX1 and U4 are superposed (concatenated) and the result is denoted M1; M1 is input into the eighteenth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re1. Then all feature maps in M1 are multiplied element-wise by S1 and the result is recorded as Mul1; Mul1 is input into the nineteenth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For1. Finally, the corresponding position elements of all feature maps in For1, Re1 and DU1 are added, and the result is recorded as Mo1; Mo1 contains 256 feature maps in total, each with a width of W/8 and a height of H/8.
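The multimodal information fusion step can be read as: concatenate the context features SX with the decoder features U, refine the concatenation with one convolution to obtain Re, gate the concatenation with the saliency side output S by element-wise multiplication followed by a second convolution to obtain For, then add For, Re and the depth-branch features DU element-wise. A hedged sketch follows; the 3 × 3 kernel size of the two convolutions is an assumption, since the patent does not state it in this passage.

```python
# Hedged sketch of one multimodal information fusion module (Mo = For + Re + DU).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv_re = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.conv_for = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, sx, u, s, du):
        m = torch.cat([sx, u], dim=1)      # M  = concat(SX, U)
        re = self.conv_re(m)               # Re = conv(M)
        mul = m * s                        # Mul: the 1-channel S broadcasts over M
        fo = self.conv_for(mul)            # For = conv(Mul)
        return fo + re + du                # Mo

mo1 = MultiModalFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32),
                         torch.randn(1, 1, 32, 32), torch.randn(1, 256, 32, 32))
```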
For the 1st RGB map up-sampling block, it consists of a twentieth convolution layer, a twentieth batch normalization layer, a twentieth activation layer, a twenty-first convolution layer, a twenty-first batch normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twenty-second batch normalization layer, a twenty-second activation layer and a twenty-third up-sampling layer which are arranged in sequence. The input end of the 1st RGB map up-sampling block receives all feature maps in Mo1, the output end of the 1st RGB map up-sampling block outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U3. The convolution kernel size of each of the twentieth, twenty-first and twenty-second convolution layers is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 2 and the dilation parameter is 2; the twentieth, twenty-first and twenty-second batch normalization layers each output 256 feature maps; the activation mode of the twentieth, twenty-first and twenty-second activation layers is "Relu". The amplification factor of the twenty-third up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U3 has a width of W/4 and a height of H/4.
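An RGB map up-sampling block is thus three dilated 3 × 3 convolution + batch normalization + "Relu" stages (zero padding equal to the dilation) followed by × 2 bilinear up-sampling, with the dilation growing from block to block (2, 4, 6). A minimal sketch, assuming standard PyTorch layers:

```python
# Hedged sketch of an RGB map up-sampling block: three dilated conv + BN + ReLU stages
# followed by x2 bilinear up-sampling.
import torch
import torch.nn as nn

def rgb_upsample_block(ch=256, dilation=2, out_ch=None):
    out_ch = out_ch or ch
    layers, c_in = [], ch
    for _ in range(3):
        layers += [nn.Conv2d(c_in, out_ch, 3, padding=dilation, dilation=dilation),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        c_in = out_ch
    layers.append(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
    return nn.Sequential(*layers)

u3 = rgb_upsample_block(256, dilation=2)(torch.randn(1, 256, 32, 32))  # W/8 -> W/4
```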
U3 is taken as input into the 2nd saliency sub-output layer, which outputs 1 feature map, marked as S2; the convolution kernel of the 2nd saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S2 has a width of W/4 and a height of H/4.
For the 2nd context information fusion block, P5 is input into the twenty-fourth up-sampling layer followed by the twenty-fifth convolution layer: the input end of the twenty-fourth up-sampling layer receives all feature maps in P5 and its output end outputs 512 feature maps, which form one set; the input end of the twenty-fifth convolution layer receives this set and its output end outputs 256 feature maps, which form a second set. P4 is input into the twenty-sixth up-sampling layer followed by the twenty-seventh convolution layer: the input end of the twenty-sixth up-sampling layer receives all feature maps in P4 and its output end outputs 512 feature maps, which form a set; the input end of the twenty-seventh convolution layer receives this set and its output end outputs 256 feature maps, which form another set. P3 is input into the twenty-eighth convolution layer: the input end of the twenty-eighth convolution layer receives all feature maps in P3 and its output end outputs 256 feature maps, which form a further set. The amplification factor of the twenty-fourth up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the twenty-fifth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the twenty-sixth up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the twenty-seventh convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the twenty-eighth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in the three 256-map sets has a width of W/4 and a height of H/4. The three sets are superposed (concatenated) and the result is denoted C2, which is taken as input into the 2nd context information fusion block; this block consists of a twenty-ninth convolution layer, a twenty-ninth batch normalization layer, a twenty-ninth activation layer, a thirtieth convolution layer, a thirtieth batch normalization layer and a thirtieth activation layer which are arranged in sequence. The input end of the 2nd context information fusion block receives all feature maps in C2, the output end of the 2nd context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX2. The convolution kernel size of each of the twenty-ninth and thirtieth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the twenty-ninth and thirtieth batch normalization layers each output 256 feature maps; the activation mode of the twenty-ninth and thirtieth activation layers is "Relu". Each feature map in SX2 has a width of W/4 and a height of H/4.
For the second multimodal information fusion module, all feature maps in SX2, S2, U3 and DU2 are taken as input. First, SX2 and U3 are superposed (concatenated) and the result is denoted M2; M2 is input into the thirty-first convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re2. Then all feature maps in M2 are multiplied element-wise by S2 and the result is recorded as Mul2; Mul2 is input into the thirty-second convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For2. Finally, the corresponding position elements of all feature maps in For2, Re2 and DU2 are added, and the result is recorded as Mo2; Mo2 contains 256 feature maps in total, each with a width of W/4 and a height of H/4.
For the 2nd RGB map up-sampling block, it consists of a thirty-third convolution layer, a thirty-third batch normalization layer, a thirty-third activation layer, a thirty-fourth convolution layer, a thirty-fourth batch normalization layer, a thirty-fourth activation layer, a thirty-fifth convolution layer, a thirty-fifth batch normalization layer, a thirty-fifth activation layer and a thirty-sixth up-sampling layer which are arranged in sequence. The input end of the 2nd RGB map up-sampling block receives all feature maps in Mo2, the output end of the 2nd RGB map up-sampling block outputs 256 feature maps, and the set formed by these 256 feature maps is marked as U2. The convolution kernel size of each of the thirty-third, thirty-fourth and thirty-fifth convolution layers is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 4 and the dilation parameter is 4; the thirty-third, thirty-fourth and thirty-fifth batch normalization layers each output 256 feature maps; the activation mode of the thirty-third, thirty-fourth and thirty-fifth activation layers is "Relu". The amplification factor of the thirty-sixth up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U2 has a width of W/2 and a height of H/2.
U2 is taken as input into the 3rd saliency sub-output layer, which outputs 1 feature map, marked as S3; the convolution kernel of the 3rd saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S3 has a width of W/2 and a height of H/2.
For the 3rd context information fusion block, P5 is input into the thirty-seventh up-sampling layer followed by the thirty-eighth convolution layer: the input end of the thirty-seventh up-sampling layer receives all feature maps in P5 and its output end outputs 512 feature maps, which form one set; the input end of the thirty-eighth convolution layer receives this set and its output end outputs 256 feature maps. P4 is input into the thirty-ninth up-sampling layer followed by the fortieth convolution layer: the input end of the thirty-ninth up-sampling layer receives all feature maps in P4 and its output end outputs 512 feature maps, which form a set; the input end of the fortieth convolution layer receives this set and its output end outputs 256 feature maps. P3 is input into the forty-first up-sampling layer followed by the forty-second convolution layer: the input end of the forty-first up-sampling layer receives all feature maps in P3 and its output end outputs 256 feature maps, which form a set; the input end of the forty-second convolution layer receives this set and its output end outputs 256 feature maps. P2 is input into the forty-third convolution layer: the input end of the forty-third convolution layer receives all feature maps in P2 and its output end outputs 256 feature maps. The amplification factor of the thirty-seventh up-sampling layer is 4 and the method adopted is bilinear interpolation; the convolution kernel size of the thirty-eighth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the thirty-ninth up-sampling layer is 4 and the method adopted is bilinear interpolation; the convolution kernel size of the fortieth convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the amplification factor of the forty-first up-sampling layer is 2 and the method adopted is bilinear interpolation; the convolution kernel size of the forty-second convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0; the convolution kernel size of the forty-third convolution layer is 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in the four 256-map sets output by the thirty-eighth, fortieth, forty-second and forty-third convolution layers has a width of W/2 and a height of H/2. The four sets are superposed (concatenated) and the result is denoted C3, which is taken as input into the 3rd context information fusion block; this block consists of a forty-fourth convolution layer, a forty-fourth batch normalization layer, a forty-fourth activation layer, a forty-fifth convolution layer, a forty-fifth batch normalization layer and a forty-fifth activation layer which are arranged in sequence. The input end of the 3rd context information fusion block receives all feature maps in C3, the output end of the 3rd context information fusion block outputs 256 feature maps, and the set formed by these 256 feature maps is recorded as SX3. The convolution kernel size of each of the forty-fourth and forty-fifth convolution layers is 3 × 3, the number of convolution kernels is 256 and the zero padding parameter is 1; the forty-fourth and forty-fifth batch normalization layers each output 256 feature maps; the activation mode of the forty-fourth and forty-fifth activation layers is "Relu". Each feature map in SX3 has a width of W/2 and a height of H/2.
For the third multimodal information fusion module, all feature maps in SX3, S3, U2 and DU3 are taken as input, DU3 denoting the set output by the third depth map up-sampling block. First, SX3 and U2 are superposed (concatenated) and the result is denoted M3; M3 is input into the forty-sixth convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted Re3. Then all feature maps in M3 are multiplied element-wise by S3 and the result is recorded as Mul3; Mul3 is input into the forty-seventh convolution layer, whose output end outputs 256 feature maps, and the set of these 256 feature maps is denoted For3. Finally, the corresponding position elements of all feature maps in For3, Re3 and DU3 are added, and the result is recorded as Mo3; Mo3 contains 256 feature maps in total, each with a width of W/2 and a height of H/2.
For the 3rd RGB map up-sampling block, it consists of a forty-eighth convolution layer, a forty-eighth batch normalization layer, a forty-eighth activation layer, a forty-ninth convolution layer, a forty-ninth batch normalization layer, a forty-ninth activation layer, a fiftieth convolution layer, a fiftieth batch normalization layer, a fiftieth activation layer and a fifty-first up-sampling layer which are arranged in sequence. The input end of the 3rd RGB map up-sampling block receives all feature maps in Mo3, the output end of the 3rd RGB map up-sampling block outputs 64 feature maps, and the set formed by these 64 feature maps is marked as U1. The convolution kernel size of each of the forty-eighth, forty-ninth and fiftieth convolution layers is 3 × 3, the number of convolution kernels is 64, the zero padding parameter is 6 and the dilation parameter is 6; the forty-eighth, forty-ninth and fiftieth batch normalization layers each output 64 feature maps; the activation mode of the forty-eighth, forty-ninth and fiftieth activation layers is "Relu". The amplification factor of the fifty-first up-sampling layer is 2 and the method adopted is bilinear interpolation. Each feature map in U1 has a width W and a height H. U1 is taken as input into the 4th saliency sub-output layer, which outputs 1 feature map, marked as S4; the convolution kernel of the 4th saliency sub-output layer has a size of 1 × 1, the number of convolution kernels is 1 and the zero padding parameter is 0. Each feature map in S4 has a width W and a height H.
Step 1_3: each original real color object image in the training set, together with its corresponding depth image, is taken as an original input image and input into the convolutional neural network for training, obtaining 4 saliency detection prediction maps and 4 saliency boundary prediction maps corresponding to each original real object image in the training set; for {Iq(i, j)}, the set formed by its 4 saliency detection prediction maps (S1, S2, S3, S4) is recorded as the saliency prediction image set, and the set formed by its 4 saliency boundary prediction maps is recorded as the boundary prediction image set.
Step 1_4: the real saliency detection label image corresponding to each original color real object image in the training set is scaled to 4 different sizes, obtaining an image of width W/8 and height H/8, an image of width W/4 and height H/4, an image of width W/2 and height H/2, and an image of width W and height H; the set formed by the 4 scaled images of the real saliency detection image corresponding to {Iq(i, j)} is recorded as the saliency label image set. In the same way, the real saliency boundary label image corresponding to each original color real object image in the training set is scaled to the same 4 sizes (W/8 × H/8, W/4 × H/4, W/2 × H/2 and W × H), and the set formed by the 4 scaled images of the real saliency boundary image corresponding to {Iq(i, j)} is recorded as the boundary label image set.
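Step 1_4 amounts to building a four-level label pyramid at the resolutions W/8, W/4, W/2 and W. A small sketch is given below; the use of nearest-neighbour resampling for the binary label maps is an illustrative assumption.

```python
# Hedged sketch of step 1_4: rescale a ground-truth label map to the four supervision
# resolutions; nearest-neighbour resampling is an assumed choice for binary labels.
import torch
import torch.nn.functional as F

def multiscale_labels(label, full_size):
    """label: 1 x 1 x H x W tensor; full_size: (H, W)."""
    h, w = full_size
    sizes = [(h // 8, w // 8), (h // 4, w // 4), (h // 2, w // 2), (h, w)]
    return [F.interpolate(label, size=s, mode='nearest') for s in sizes]

pyramid = multiscale_labels(torch.randint(0, 2, (1, 1, 256, 256)).float(), (256, 256))
```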
Step 1_5: the loss function value between the saliency prediction image set and the saliency label image set and the loss function value between the boundary prediction image set and the boundary label image set are calculated respectively; the former loss function value is obtained using categorical cross-entropy (categorical crossentropy), the latter loss function value is obtained using Dice loss, and the two loss function values are added to obtain the final loss function value.
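A hedged sketch of this combined loss is shown below. It uses binary cross-entropy with logits for the saliency branch, a common stand-in for the categorical cross-entropy named above when the prediction maps are single-channel, plus a standard Dice loss for the boundary branch; the exact formulation used by the patent may differ.

```python
# Hedged sketch of step 1_5: cross-entropy over the four saliency scales plus Dice
# loss over the four boundary scales, summed into one training loss.
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(sal_preds, sal_labels, bnd_preds, bnd_labels):
    l_sal = sum(F.binary_cross_entropy_with_logits(p, t)
                for p, t in zip(sal_preds, sal_labels))
    l_bnd = sum(dice_loss(p, t) for p, t in zip(bnd_preds, bnd_labels))
    return l_sal + l_bnd
```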
Step 1_6: steps 1_3 to 1_5 are repeated V times to obtain the convolutional neural network classification training model and Q × V loss function values; the loss function value with the minimum value is found among these Q × V values, and the weight vector and bias term corresponding to this minimum loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, correspondingly marked as Wbest and bbest; where V > 1, and in this example V = 300.
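A rough training-loop sketch of step 1_6 follows. It assumes a model that returns the four saliency and four boundary predictions and a data loader that yields the RGB image, the depth image and the two label pyramids; tracking the per-batch minimum loss to pick Wbest and bbest follows the patent wording only loosely, and in practice an epoch-level criterion would be more usual.

```python
# Hedged sketch of step 1_6: train for V epochs and keep the weights that produced the
# smallest observed loss value; model/loader signatures are assumptions.
import copy

def train(model, loader, optimizer, loss_fn, V=300):
    best_loss, best_state = float('inf'), None
    for epoch in range(V):
        for rgb, depth, sal_labels, bnd_labels in loader:
            optimizer.zero_grad()
            sal_preds, bnd_preds = model(rgb, depth)
            loss = loss_fn(sal_preds, sal_labels, bnd_preds, bnd_labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                  # keep W_best, b_best
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```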
The test stage process comprises the following specific steps:
Step 2_1: let a color image of a real object to be subjected to saliency detection be given, together with the depth image corresponding to this real object; where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width and H' denotes the height of the color image and of the depth image, and the pixel values of the pixel points whose coordinate position is (i', j') in the color image and in the depth image are denoted correspondingly.
Step 2_2: the R channel component, the G channel component and the B channel component of the color image to be detected, together with the superposed three-channel components obtained from the corresponding depth image, are input into the convolutional neural network classification training model, and Wbest and bbest are used to make the prediction, obtaining the predicted saliency detection images and saliency boundary images corresponding to the color image and the depth image; the prediction corresponding to S1 is recorded as the final saliency detection image, whose pixel value at the coordinate position (i', j') is the predicted saliency of that pixel point.
In this embodiment, the real saliency boundary label image used in step 1_1 is acquired as follows:
Step 1_1a: take the first pixel point to be processed in the real saliency detection label image and define it as the current pixel point.
Step 1_1b: perform a convolution operation on the current pixel point with a 3 × 3 convolution kernel whose weights are all 1, obtaining a convolution result.
Step 1_1c: if the convolution result is 0 or 9, determine the current pixel point to be a non-boundary pixel point; if the convolution result is any value from 1 to 8, determine the current pixel point to be a boundary pixel point.
Step 1_1d: take the next pixel point to be processed in the real saliency detection label image as the current pixel point and return to step 1_1b, until all pixel points in the real saliency detection label image have been processed.
Step 1_1e: in the real saliency boundary label image, for the pixel point whose coordinate position is (i, j): if the pixel point at (i, j) in the real saliency detection label image is a non-boundary pixel point, its pixel value in the real saliency boundary label image is set to 0; if it is a boundary pixel point, its pixel value is set to 1.
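The boundary-label extraction of steps 1_1a to 1_1e can be vectorised as a single convolution of the binary saliency label map with a 3 × 3 all-ones kernel: a neighbourhood sum of 0 or 9 means the neighbourhood is uniform (non-boundary), while any sum from 1 to 8 marks a boundary pixel. A sketch follows; note that zero padding at the image border is an implementation choice the per-pixel description above does not fix.

```python
# Hedged sketch of steps 1_1a-1_1e: detect boundary pixels of a binary saliency label
# map by summing each 3x3 neighbourhood with an all-ones kernel.
import torch
import torch.nn.functional as F

def saliency_boundary(label):
    """label: 1 x 1 x H x W binary (0/1) saliency label map."""
    kernel = torch.ones(1, 1, 3, 3)
    counts = F.conv2d(label, kernel, padding=1)     # neighbourhood sums in [0, 9]
    boundary = (counts > 0) & (counts < 9)          # sums 1..8 -> boundary pixel
    return boundary.float()                         # 1 = boundary, 0 = non-boundary

edges = saliency_boundary(torch.randint(0, 2, (1, 1, 64, 64)).float())
```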
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The architecture of the convolutional neural network based on context and depth information fusion is built with the Python-based deep learning library PyTorch 1.0.1. The test set of the real object image database NJU2000 (400 real object images) is used to analyse the saliency detection effect obtained by predicting real scene images with the method. The detection performance of the predicted saliency detection images is evaluated with 3 objective parameters commonly used for saliency detection methods as evaluation indexes, namely the Precision-Recall curve, the Mean Absolute Error (MAE) and the F-Measure.
Each real object image in the NJU2000 test set is predicted with the method of the invention to obtain its predicted saliency detection image. The Precision-Recall curve (PR Curve) reflecting the saliency detection effect of the method is shown in FIG. 8-a, the mean absolute error (MAE) reflecting the saliency detection effect of the method is 0.054 as shown in FIG. 8-b, and the F-Measure reflecting the saliency detection effect of the method is 0.872 as shown in FIG. 8-c. As can be seen from FIGS. 8-a to 8-c, the saliency detection results obtained for real object images with the method of the invention are the best, which shows that it is feasible and effective to obtain predicted saliency detection images of real object images with the method of the invention.
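For reference, the two scalar metrics can be computed as below; the binarization threshold and the weighting beta² = 0.3 in the F-Measure are the values commonly used in saliency detection work and are assumptions here, since the patent does not state them in this passage.

```python
# Hedged sketch of the evaluation metrics: mean absolute error and F-Measure between a
# predicted saliency map (values in [0,1]) and its binary ground truth.
import torch

def mae(pred, gt):
    return (pred - gt).abs().mean().item()

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    binary = (pred >= thresh).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)).item()
```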
FIG. 4a shows the 1st original real scene image; FIG. 4a-d shows the depth map corresponding to the 1st original real scene image; FIG. 4b shows the predicted saliency detection image obtained by predicting the original scene image shown in FIG. 4a with the method of the invention. FIG. 5a shows the 2nd original object image; FIG. 5a-d shows the depth map corresponding to the 2nd original real object image; FIG. 5b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 5a with the method of the invention. FIG. 6a shows the 3rd original object image; FIG. 6a-d shows the depth map corresponding to the 3rd original real object image; FIG. 6b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 6a with the method of the invention. FIG. 7a shows the 4th original object image; FIG. 7a-d shows the depth map corresponding to the 4th original real object image; FIG. 7b shows the predicted saliency detection image obtained by predicting the original object image shown in FIG. 7a with the method of the invention. Comparing FIG. 4a with FIG. 4b, FIG. 5a with FIG. 5b, FIG. 6a with FIG. 6b and FIG. 7a with FIG. 7b, it can be seen that the detection accuracy of the predicted saliency detection images obtained with the method of the invention is high.

Claims (5)

1. An image significance detection method based on an information fusion convolutional neural network is characterized by comprising the following steps:
step 1: selecting Q RGB images containing real objects, and a depth map, a saliency detection label map and a saliency boundary label map which are known and correspond to each RGB image to form a training set;
step 2: constructing an information fusion convolutional neural network, wherein the information fusion convolutional neural network comprises an input layer, a hidden layer and an output layer which are sequentially connected;
and step 3: inputting each RGB image in the training set and the corresponding depth map thereof into an information fusion convolutional neural network from an input layer for training, and outputting from an output layer to obtain four saliency detection prediction maps and four saliency boundary prediction maps; taking the four significance detection prediction images as significance prediction image sets, and taking the four significance boundary prediction images as boundary prediction image sets; carrying out scaling treatment on the saliency detection label graphs corresponding to each RGB image in different sizes to obtain four images with different widths and heights as a saliency label graph set, and carrying out scaling treatment on the saliency boundary label graphs corresponding to the same RGB image in different sizes to obtain four images with different widths and heights as a boundary label graph set; calculating a first loss function value between the significance prediction atlas and the significance label atlas, calculating a second loss function value between the boundary prediction atlas and the boundary label atlas, and adding the first loss function value and the second loss function value to obtain a total loss function value;
step 4, repeatedly executing the step 3 for V times to obtain Q × V total loss function values, and taking the weight vector and the bias item corresponding to the minimum total loss function value as the optimal weight vector and the optimal bias item of the information fusion convolutional neural network, so as to obtain the trained information fusion convolutional neural network;
and 5: and collecting an RGB image to be subjected to significance detection, inputting the RGB image into the trained information fusion convolutional neural network, and outputting to obtain a final significance detection prediction image.
2. The image saliency detection method based on information fusion convolutional neural network of claim 1, characterized in that: the input layer of the information fusion convolutional neural network comprises an RGB (red, green and blue) image input layer and a depth image input layer, the hidden layer comprises a color image processing part and a depth image processing part, the RGB input layer receives an RGB image and inputs the RGB image to the color image processing part for processing and then outputs the RGB image to obtain four significance sub-output layers, and the depth image input layer receives the depth image and inputs the RGB image to the depth image processing part for processing and then outputs the depth image to obtain four boundary sub-output layers;
the color image processing part comprises a first RGB image neural network block, a first RGB image maximum pooling layer, a second RGB image neural network block, a second RGB image maximum pooling layer, a third RGB image neural network block, a third RGB image maximum pooling layer, a fourth RGB image neural network block, a fourth RGB image maximum pooling layer, a fifth RGB image neural network block, a first significance detection module, a first multi-mode information fusion module, a first RGB up-sampling block, a second multi-mode information fusion module, a second RGB up-sampling block, a third multi-mode information fusion module and a third RGB up-sampling block which are connected in sequence, the RGB image received by the RGB input layer is input into the color image processing part through a first RGB image neural network block and is output by a first significance detection module, a first RGB up-sampling block, a second RGB up-sampling block and a third RGB up-sampling block;
the outputs of the fourth RGB map neural network block and the fifth RGB map neural network block are connected to the input of the first context information fusion block, and the output of the first context information fusion block is connected to the input of the first multi-mode information fusion module; the outputs of the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the second context information fusion block, and the output of the second context information fusion block is connected to the input of the second multi-mode information fusion module; the outputs of the second RGB map neural network block, the third RGB map neural network block, the fourth RGB map neural network block and the fifth RGB map neural network block are all connected to the input of the third context information fusion block, and the output of the third context information fusion block is connected to the input of the third multi-mode information fusion module; the output of the first depth map upsampling block is also connected to the input of the first multi-mode information fusion module, the output of the second depth map upsampling block is also connected to the input of the second multi-mode information fusion module, and the output of the third depth map upsampling block is also connected to the input of the third multi-mode information fusion module;
the depth map processing part comprises a first depth map neural network block, a first depth map maximum pooling layer, a second depth map neural network block, a second depth map maximum pooling layer, a third depth map neural network block, a third depth map maximum pooling layer, a fourth depth map neural network block, a first depth map upsampling layer, a second depth map upsampling block, a second depth map upsampling layer, a third depth map upsampling block, a third depth map upsampling layer and a fourth depth map upsampling block which are connected in sequence; the depth map received by the depth map input layer is input into a depth map processing part through a first depth map neural network block and is output by a first depth map upsampling block, a second depth map upsampling block, a third depth map upsampling block and a fourth depth map upsampling block;
the output of the third depth map neural network block is connected to the input of the second depth map upsampling block, and the output of the first depth map upsampling layer and the output of the third depth map neural network block are fused and then input into the second depth map upsampling block; the output of the second depth map neural network block is connected to the input of the third depth map upsampling block, and the output of the second depth map upsampling layer and the output of the second depth map neural network block are fused and then input into the third depth map upsampling block; the output of the first depth map neural network block is connected to the input of a fourth depth map upsampling block, and the output of the third depth map upsampling layer and the output of the first depth map neural network block are fused and then input into the fourth depth map upsampling block;
the output layers comprise four significance sub-output layers and four boundary sub-output layers, the outputs of the first significance detection module, the first RGB up-sampling block, the second RGB up-sampling block and the third RGB up-sampling block are respectively connected with the first significance sub-output layer, the second significance sub-output layer, the third significance sub-output layer and the fourth significance sub-output layer, and the outputs of the first significance sub-output layer, the second significance sub-output layer and the third significance sub-output layer are also respectively connected with the inputs of the first multi-mode information fusion module, the second multi-mode information fusion module and the third multi-mode information fusion module; the outputs of the first depth map upsampling block, the second depth map upsampling block, the third depth map upsampling block and the fourth depth map upsampling block are respectively connected with the first boundary sub-output layer, the second boundary sub-output layer, the third boundary sub-output layer and the fourth boundary sub-output layer.
3. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the structure of each depth map neural network block is the same, each depth map neural network block is mainly formed by sequentially connecting a plurality of convolution blocks, and each convolution block is mainly composed of a convolution layer, a batch normalization layer and an activation layer which are sequentially connected.
4. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the context information fusion blocks have the same structure, specifically: each context information fusion block comprises a plurality of convolution layers, a convolution block I and a convolution block II, wherein the number of the convolution layers is the same as the number of inputs of the context information fusion block and the convolution layers correspond to the inputs one by one; one end of each convolution layer is connected with one input, the other end of each convolution layer is connected with the convolution block I and the convolution block II in sequence, and the output of the convolution block II is used as the output of the context information fusion block.
5. The image saliency detection method based on information fusion convolutional neural network of claim 2, characterized in that: the first multi-mode information fusion module comprises an overlapping layer, a multiplying layer, a first convolution layer, a second convolution layer and an addition layer, wherein the output of the overlapping layer is respectively connected to the input of the multiplying layer and the input of the first convolution layer, the output of the multiplying layer is connected to the input of the addition layer through the second convolution layer, the output of the first convolution layer is connected to the input of the addition layer, and the output of the addition layer is used as the output of the multi-mode information fusion module.
CN201910971962.4A 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network Withdrawn CN111445432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910971962.4A CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910971962.4A CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Publications (1)

Publication Number Publication Date
CN111445432A true CN111445432A (en) 2020-07-24

Family

ID=71652559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910971962.4A Withdrawn CN111445432A (en) 2019-10-14 2019-10-14 Image significance detection method based on information fusion convolutional neural network

Country Status (1)

Country Link
CN (1) CN111445432A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN113112464A (en) * 2021-03-31 2021-07-13 四川大学 RGBD (red, green and blue) saliency object detection method and system based on cross-mode alternating current encoder
CN113112464B (en) * 2021-03-31 2022-06-21 四川大学 RGBD (red, green and blue) saliency object detection method and system based on cross-mode alternating current encoder
CN113505800A (en) * 2021-06-30 2021-10-15 深圳市慧鲤科技有限公司 Image processing method and training method, device, equipment and medium of model thereof
CN113362322A (en) * 2021-07-16 2021-09-07 浙江科技学院 Distinguishing auxiliary and multi-mode weighted fusion salient object detection method
CN113362322B (en) * 2021-07-16 2024-04-30 浙江科技学院 Obvious object detection method based on discrimination assistance and multi-mode weighting fusion
WO2023077809A1 (en) * 2021-11-05 2023-05-11 五邑大学 Neural network training method, electronic device, and computer storage medium

Similar Documents

Publication Publication Date Title
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
CN111445432A (en) Image significance detection method based on information fusion convolutional neural network
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110263813B (en) Significance detection method based on residual error network and depth information fusion
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN112597985B (en) Crowd counting method based on multi-scale feature fusion
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN112861729B (en) Real-time depth completion method based on pseudo-depth map guidance
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111582316A (en) RGB-D significance target detection method
CN112258526A (en) CT (computed tomography) kidney region cascade segmentation method based on dual attention mechanism
CN110909615B (en) Target detection method based on multi-scale input mixed perception neural network
CN113269787A (en) Remote sensing image semantic segmentation method based on gating fusion
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN113538457B (en) Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN111161224A (en) Casting internal defect grading evaluation system and method based on deep learning
CN114187520B (en) Building extraction model construction and application method
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN111160356A (en) Image segmentation and classification method and device
CN111310767A (en) Significance detection method based on boundary enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200724

WW01 Invention patent application withdrawn after publication