CN112241743A - RGBD image saliency detection method for regenerating saliency map into three-stream network - Google Patents

RGBD image saliency detection method for regenerating saliency map into three-stream network

Info

Publication number
CN112241743A
CN112241743A
Authority
CN
China
Prior art keywords
layer
block
feature
convolutional
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011113013.1A
Other languages
Chinese (zh)
Inventor
周武杰
柳昌
郭沁玲
强芳芳
薛林林
雷景生
杨胜英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202011113013.1A
Publication of CN112241743A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an RGBD image saliency detection method in which the saliency map is regenerated by a three-stream convolutional neural network. First, a dual-stream end-to-end network with gate-structured decoding generates an initial saliency prediction map. This initial map is then fed as input to a single-stream lightweight network; to save memory and computation, the decoding stage of the dual-stream network guides the decoding of the single-stream network, so that the earlier information steers the subsequent feature information. Finally, the saliency prediction map produced by the single stream and the initial saliency prediction map are added with set weights to obtain the final saliency prediction map. By building this two-stage network, the method enhances the initial saliency prediction map through effective use of both the earlier and the later information, and experiments confirm the effectiveness of the method.

Description

RGBD image saliency detection method for regenerating saliency map into three-stream network
Technical Field
The invention relates to the technical field of salient object detection, and in particular to an RGBD image saliency detection method for regenerating the saliency map into a three-stream network.
Background
With the rapid improvement of computer hardware, the heavy computation of neural networks has gradually shifted from the CPU to the GPU, and NVIDIA has successively released acceleration packages such as CUDA (Compute Unified Device Architecture) for adaptive optimization, pushing the development of neural networks to unprecedented heights. The deepening of neural networks has in turn brought sweeping changes to computer vision tasks such as object detection, pedestrian tracking and semantic segmentation. Saliency detection is one of the computer vision directions affected by this trend: it detects the salient region of a target scene, that is, the region that attracts human interest.
Early saliency detection relied on hand-crafted feature extraction, which is inefficient and extracts salient objects inaccurately; accuracy improved markedly once convolutional neural networks were applied to saliency detection. Convolutional neural networks based on RGB three-channel color images have achieved excellent performance in this direction. However, a two-dimensional scene provides rather limited information for distinguishing salient regions, and the performance drops noticeably in complex scenes. With the development of depth sensors, using depth images carrying depth information to assist the saliency detection of RGB images has become one of today's hot topics, and the present invention is likewise built on the dual input of an RGB image and a depth image. Many previous methods generate the final saliency prediction map with only a single end-to-end network; in the human brain, however, people tend to reinterpret what the eyes receive according to their own prior experience of the scene, which is also one of the reasons human eyes can be deceived.
Therefore, how to provide an RGBD image saliency detection method whose saliency prediction maps have clear structure and sharp boundaries is a problem urgently to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides an RGBD image saliency detection method for regenerating a saliency map into a three-stream network, which solves the related problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
an RGBD image saliency detection method for regenerating a saliency map into a three-stream network comprises the following specific steps:
selecting RGB images, depth images and label images of N original RGBD images to form a training set;
constructing a neural network, adopting a dual-stream end-to-end convolutional neural network and a single-stream lightweight network; the dual-stream end-to-end convolutional neural network adopts VGG-16 as its basic encoding network and is pre-trained with ImageNet weights;

inputting the RGB image and the depth image of each original RGBD image in the training set into the neural network as original input images for training, and obtaining, for the RGB image of each original RGBD image in the training set, a saliency prediction map consisting of a foreground saliency prediction map and a background saliency prediction map;

calculating the loss between each saliency prediction map and the corresponding label map, the loss being obtained with a two-class (binary) cross-entropy loss function;

repeating the training and the loss calculation, iterating over the whole training set in each round, so as to obtain a trained convolutional neural network classification model and determine the minimum loss value; the weight vectors and bias terms corresponding to the minimum loss value are taken as the optimal weight vectors and optimal bias terms of the trained convolutional neural network classification model.
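As an illustration of this training procedure, the sketch below shows a minimal PyTorch training loop with two-class (binary) cross-entropy supervision that keeps the weights and biases of the round with the lowest loss; the model, data loader and optimizer settings are placeholders rather than the patent's actual implementation.

```python
# Hedged sketch of the training-and-selection procedure described above.
import copy
import torch
import torch.nn as nn

def train(model, loader, epochs=100, lr=1e-4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()            # two-class cross-entropy on raw saliency scores
    best_loss, best_state = float("inf"), None

    for epoch in range(epochs):
        epoch_loss = 0.0
        for rgb, depth, label in loader:    # RGB image, depth image, label map
            rgb, depth, label = rgb.to(device), depth.to(device), label.to(device)
            pred = model(rgb, depth)        # predicted saliency map (raw scores)
            loss = bce(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        if epoch_loss < best_loss:          # keep the weights/biases of the best round
            best_loss = epoch_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model, best_loss
```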
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, in the neural network, the encoding part of the dual-stream end-to-end convolutional neural network comprises a first neural network block, a second neural network block, a third neural network block, a fourth neural network block and a fifth neural network block, into which the depth image is input from a first input layer;
the RGB image is input from a second input layer into a sixth neural network block, a seventh neural network block, an eighth neural network block, a ninth neural network block and a tenth neural network block; the decoding part consists of a first global information guided multi-scale feature block, a first feature aggregation gate structure block, a second feature aggregation gate structure block, a third feature aggregation gate structure block, a fourth feature aggregation gate structure block, a first information guide block, a second information guide block, a third information guide block, a fourth information guide block and a first output layer; the dual-stream end-to-end convolutional neural network outputs an initial saliency prediction map that serves as the input of the single-stream lightweight network, and the single-stream lightweight network combines the earlier information to enhance the initial saliency prediction map; the single-stream lightweight network is composed of, in its encoder, a first feature enhancement block, a second feature enhancement block, a third feature enhancement block, a fourth feature enhancement block and a fifth feature enhancement block, and, in its decoder, a second global information guided multi-scale feature block, a first feature fine-tuning refinement block, a second feature fine-tuning refinement block, a third feature fine-tuning refinement block, a fourth feature fine-tuning refinement block, a first bidirectional attention block, a second bidirectional attention block, a third bidirectional attention block, a fourth bidirectional attention block, a second output layer, a third output layer, a fourth output layer, a fifth output layer, a sixth output layer, a seventh output layer and a final output layer; the overall arrangement is sketched below.
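The three-stream arrangement can be sketched as follows, assuming PyTorch; the submodule interfaces and the fusion weight alpha are illustrative assumptions, since the text only states that the two saliency prediction maps are added with set weights.

```python
# Hedged overview sketch of the three-stream arrangement; not the patent's actual code.
import torch.nn as nn

class ThreeStreamSaliency(nn.Module):
    def __init__(self, dual_stream: nn.Module, single_stream: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.dual_stream = dual_stream      # dual-stream end-to-end network (RGB + depth)
        self.single_stream = single_stream  # single-stream lightweight network
        self.alpha = alpha                  # assumed fusion weight for the final addition

    def forward(self, rgb, depth):
        # Stage 1: gate-structured decoding of the two encoders yields an initial
        # saliency prediction map plus the decoder features that guide stage 2.
        initial_map, guide_feats = self.dual_stream(rgb, depth)
        # Stage 2: the lightweight network re-encodes the initial map and is
        # guided by the first-stage decoding information.
        refined_map = self.single_stream(initial_map, guide_feats)
        # Stage 3: weighted addition of the two saliency prediction maps.
        return self.alpha * initial_map + (1.0 - self.alpha) * refined_map
```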
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, the first global information guided multi-scale feature block and the second global information guided multi-scale feature block have the same structure and include a first convolutional layer, a first activation layer, a second convolutional layer, a second activation layer, a first global average pooling layer, a third convolutional layer, a third activation layer, a first upsampling layer, a first global maximum pooling layer, a fourth convolutional layer, a fourth activation layer, a second upsampling layer, a fifth convolutional layer, a fifth activation layer, a first dilated convolutional layer, a sixth activation layer, a sixth convolutional layer, a seventh activation layer, a second dilated convolutional layer, an eighth activation layer, an eighth convolutional layer, a seventh convolutional layer, a ninth activation layer and a third dilated convolutional layer;
the feature map is fed into five branches after passing through the first convolutional layer, the first activation layer, the second convolutional layer and the second activation layer; the first branch is the first global average pooling layer, the third convolutional layer and the first upsampling layer; the second branch is the first global maximum pooling layer, the fourth convolutional layer and the second upsampling layer; the feature maps produced by the first branch and the second branch are added to obtain a global feature map; the third branch is the fifth convolutional layer, the third activation layer, the first dilated convolutional layer and the fourth activation layer; the fourth branch is the sixth convolutional layer, the fifth activation layer, the second dilated convolutional layer and the sixth activation layer; the fifth branch is the seventh convolutional layer, the seventh activation layer, the third dilated convolutional layer and the eighth activation layer; the feature maps produced by the third, fourth and fifth branches are channel-stacked with the global feature map, and the stacked result passes through the eighth convolutional layer and the ninth activation layer to obtain the final feature map A.
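A minimal PyTorch sketch of such a global information guided multi-scale feature block is given below; the internal channel width (128), the output width (512) and the dilation rates (1, 2, 4) follow the detailed embodiment later in the text and should be read as illustrative assumptions.

```python
# Hedged sketch of a global information guided multi-scale feature block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGuidedMultiScaleBlock(nn.Module):
    def __init__(self, in_ch, mid_ch=128, out_ch=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        # Branch 1: global average pooling -> 1x1 conv -> bilinear upsampling
        self.avg_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        # Branch 2: global max pooling -> 1x1 conv -> bilinear upsampling
        self.max_branch = nn.Sequential(nn.AdaptiveMaxPool2d(1),
                                        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        # Branches 3-5: 1x1 conv followed by a dilated 3x3 conv
        def dilated(rate):
            return nn.Sequential(
                nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=rate, dilation=rate), nn.ReLU(inplace=True))
        self.d1, self.d2, self.d4 = dilated(1), dilated(2), dilated(4)
        self.fuse = nn.Sequential(nn.Conv2d(mid_ch * 4, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.stem(x)
        size = x.shape[-2:]
        # add the two pooled branches to form the global feature map
        g = F.interpolate(self.avg_branch(x), size=size, mode="bilinear", align_corners=False) \
          + F.interpolate(self.max_branch(x), size=size, mode="bilinear", align_corners=False)
        # channel-stack the dilated branches with the global feature map, then fuse
        out = torch.cat([self.d1(x), self.d2(x), self.d4(x), g], dim=1)
        return self.fuse(out)
```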
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, the first feature aggregation gate structure block, the second feature aggregation gate structure block, the third feature aggregation gate structure block and the fourth feature aggregation gate structure block have the same structure and each include: a ninth convolutional layer, a tenth activation layer, a third upsampling layer, a tenth convolutional layer, an eleventh activation layer, an eleventh convolutional layer, a twelfth activation layer, a twelfth convolutional layer, a thirteenth activation layer, a thirteenth convolutional layer, a fourteenth activation layer, a fourteenth convolutional layer, a fifteenth activation layer, a fifteenth convolutional layer, a sixteenth activation layer, a first S-type activation function, a second S-type activation function, a sixteenth convolutional layer and a seventeenth activation layer;
each feature aggregation gate structure block is divided into a depth stream feature, an RGB stream feature and a fusion information stream feature; the fusion information stream feature map passes through the ninth convolutional layer, the tenth activation layer, the third upsampling layer, the tenth convolutional layer and the eleventh activation layer and is then added to the depth stream feature; the added feature map passes through the twelfth convolutional layer, the thirteenth activation layer, the thirteenth convolutional layer and the fourteenth activation layer to obtain a preliminary fusion feature map; the fusion information stream feature is dot-multiplied with the RGB stream feature map that has passed through the eleventh convolutional layer and the twelfth activation layer, and the result is added to the original RGB stream to obtain a gate structure feature map; the gate structure feature map passes through the first S-type activation function and the second S-type activation function to obtain a gate structure binarization weight; the RGB stream features pass through the fourteenth convolutional layer, the fifteenth activation layer, the fifteenth convolutional layer and the sixteenth activation layer and are dot-multiplied with the gate structure binarization weight to obtain an RGB information feature map; the preliminary fusion feature map is dot-multiplied with the gate structure binarization weight to obtain a depth information feature map; and after the depth information feature map and the RGB information feature map are channel-stacked, the result passes through the sixteenth convolutional layer and the seventeenth activation layer to obtain the final feature map B.
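A minimal PyTorch sketch of one feature aggregation gate structure block is given below. Channel counts are left as parameters; the upsampled fusion feature is reused in the gate computation so that shapes match, and the depth stream is added to the RGB stream before the RGB refinement convolutions, as in the detailed embodiment; these choices are assumptions where the text is ambiguous.

```python
# Hedged sketch of a feature aggregation gate structure block.
import torch
import torch.nn as nn

class FeatureAggregationGateBlock(nn.Module):
    # e.g. the fourth block would be FeatureAggregationGateBlock(ch=512, out_ch=256)
    def __init__(self, ch, out_ch):
        super().__init__()
        act = lambda: nn.ReLU(inplace=True)
        self.fuse_in = nn.Sequential(                      # fusion stream: conv, act, x2 upsample, conv, act
            nn.Conv2d(ch, ch, 3, padding=1), act(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, ch, 3, padding=1), act())
        self.pre_fuse = nn.Sequential(                     # toward the preliminary fusion feature map
            nn.Conv2d(ch, ch, 3, padding=1), act(),
            nn.Conv2d(ch, ch, 1), act())
        self.rgb_gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), act())
        self.rgb_refine = nn.Sequential(                   # toward the RGB information feature map
            nn.Conv2d(ch, ch, 3, padding=1), act(),
            nn.Conv2d(ch, ch, 1), act())
        self.out = nn.Sequential(nn.Conv2d(2 * ch, out_ch, 3, padding=1), act())

    def forward(self, depth_feat, rgb_feat, fused_feat):
        fused = self.fuse_in(fused_feat)                   # fusion information stream (upsampled)
        pre = self.pre_fuse(depth_feat + fused)            # preliminary fusion feature map
        gate = fused * self.rgb_gate(rgb_feat) + rgb_feat  # gate structure feature map
        gate_w = torch.sigmoid(torch.sigmoid(gate))        # two S-type activations: gate binarization weight
        rgb_info = self.rgb_refine(rgb_feat + depth_feat) * gate_w
        depth_info = pre * gate_w
        return self.out(torch.cat([depth_info, rgb_info], dim=1))  # final feature map B
```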
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, the first information guide block, the second information guide block, the third information guide block and the fourth information guide block have the same structure and each consist of a convolutional layer A; the input feature map passes through the convolutional layer A and is dot-multiplied with the final feature map B, and the feature map after the dot product operation is added to the final feature map B to obtain the final feature map C.
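A minimal PyTorch sketch of an information guide block, assuming that the encoder feature is what passes through convolutional layer A and that this convolution maps it to the channel count of the final feature map B:

```python
# Hedged sketch of an information guide block.
import torch.nn as nn

class InformationGuideBlock(nn.Module):
    # e.g. the first block would be InformationGuideBlock(64, 32): the sixth neural
    # network block's 64-channel output is mapped to the 32 channels of G1.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # convolutional layer A

    def forward(self, encoder_feat, gate_feat):
        guided = self.conv(encoder_feat) * gate_feat          # dot product with final feature map B
        return guided + gate_feat                             # final feature map C
```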
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, the first feature fine-tuning refinement block, the second feature fine-tuning refinement block, the third feature fine-tuning refinement block and the fourth feature fine-tuning refinement block have the same structure and include an eighty-seventh convolutional layer, a fifty-first activation layer, an eighty-eighth convolutional layer, a fifty-second activation layer and an eighty-ninth convolutional layer;
the first feature map is the feature map output by the feature enhancement block corresponding to each feature fine-tuning refinement block, and the second feature map is the feature map output by the bidirectional attention block corresponding to each feature fine-tuning refinement block; after the first feature map passes through the eighty-eighth convolutional layer and the fifty-second activation layer, it is split by channel into two groups, a first feature map w and a first feature map b; the second feature map, after passing through the eighty-seventh convolutional layer and the fifty-first activation layer, is dot-multiplied with the first feature map w, the result is added to the first feature map b, and the sum then passes through the eighty-ninth convolutional layer to obtain the final feature map D.
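A minimal PyTorch sketch of a feature fine-tuning refinement block under this reading, in which the enhancement-block feature supplies a channel-split scale map w and shift map b that modulate the attention-block feature; which operand passes through the eighty-seventh convolutional layer is an assumption, since the original wording is ambiguous.

```python
# Hedged sketch of a feature fine-tuning refinement block.
import torch
import torch.nn as nn

class FeatureRefineBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.to_wb = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pre = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, enhance_feat, attention_feat):
        # split the enhancement feature into scale map w and shift map b by channel
        w, b = torch.chunk(self.to_wb(enhance_feat), 2, dim=1)
        # modulate the attention feature, then refine: final feature map D
        return self.out(self.pre(attention_feat) * w + b)
```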
Preferably, in the above RGBD image saliency detection method for regenerating a saliency map into a three-stream network, the first bidirectional attention block, the second bidirectional attention block, the third bidirectional attention block and the fourth bidirectional attention block have the same structure and include: a seventh upsampling layer, a second global average pooling layer, a fifty-ninth convolutional layer, a first maximum normalization activation layer, a sixty-third convolutional layer, a ninth S-type activation function, a sixtieth convolutional layer, a sixty-first convolutional layer and a sixty-second convolutional layer;
the third feature map is the feature map output by the second global information guided multi-scale feature block or the feature map output by the corresponding feature fine-tuning refinement block;
the final feature map C is transformed into attention weights arranged along the channels by the second global average pooling layer and the fifty-ninth convolutional layer, and the attention weights are mapped into the [0,1] interval by the first maximum normalization activation layer; the normalized attention weights are dot-multiplied with the third feature map to obtain an attention feature map, the attention feature map is added to the final feature map C, and the sum passes through the sixtieth convolutional layer, the sixty-first convolutional layer and the sixty-second convolutional layer to obtain a residual channel attention map; the third feature map is passed through the seventh upsampling layer and a dimensionality-reduction operation to obtain a spatial feature map, and the spatial feature map passes through the sixty-third convolutional layer and the ninth S-type activation function to obtain a binarized spatial feature map; the binarized spatial feature map is dot-multiplied with the residual channel attention map tensor to obtain the final feature map E.
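A minimal PyTorch sketch of a bidirectional attention block is given below. The maximum normalization is taken as division by the per-sample channel maximum after a ReLU, the dimensionality reduction as a channel-wise mean, and the two inputs are assumed to have the same channel count; all three are assumptions where the text does not pin the operation down.

```python
# Hedged sketch of a bidirectional (channel + spatial) attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalAttentionBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.channel_fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1))
        self.refine = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                    nn.Conv2d(ch, ch, 3, padding=1),
                                    nn.Conv2d(ch, ch, 3, padding=1))
        self.spatial_conv = nn.Conv2d(1, 1, 3, padding=1)

    def forward(self, guide_feat, decoder_feat):
        # bring the third feature map to the guide resolution (seventh upsampling layer)
        decoder_feat = F.interpolate(decoder_feat, size=guide_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
        # channel attention weights from final feature map C, max-normalized into [0, 1]
        w = torch.relu(self.channel_fc(guide_feat))
        w = w / (w.amax(dim=1, keepdim=True) + 1e-6)
        att = decoder_feat * w                              # attention feature map
        channel_map = self.refine(att + guide_feat)         # residual channel attention map
        # spatial attention: channel-wise mean, conv, sigmoid -> binarized spatial map
        spatial = torch.sigmoid(self.spatial_conv(decoder_feat.mean(dim=1, keepdim=True)))
        return channel_map * spatial                        # final feature map E
```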
According to the above technical scheme, and compared with the prior art, the invention discloses an RGBD image saliency detection method for regenerating a saliency map into a three-stream network that has the following advantages:
(1) the invention adopts a brand-new network structure: a dual-stream end-to-end network first generates an initial saliency prediction map, and a single-stream lightweight network then performs feature enhancement on top of the initial saliency prediction map in combination with the earlier decoding information; although the network is a two-stage end-to-end convolutional neural network, its parameter count remains small because VGG-16 is adopted as the basic network;
(2) the method applies background and foreground supervision to the outputs and, in accordance with the network's characteristics, uses weighted addition to better combine the features of the initial saliency prediction map;
(3) the invention adopts a bidirectional attention mechanism that combines the spatial information and the channel information of the features, so that the earlier information features and the single-stream decoding features are fully fused; the experimental results show that this structure is efficient and that the generated saliency maps are of good quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is the first global information guided multi-scale feature block of the present invention; the other global information guided multi-scale feature block has the same structure;
FIG. 3 is the fourth feature aggregation gate structure block of the present invention; the other feature aggregation gate structure blocks have the same structure;
FIG. 4 is the first information guide block of the present invention; the other information guide blocks have the same structure;
FIG. 5 is the first bidirectional attention block of the present invention; the other bidirectional attention blocks have the same structure;
FIG. 6 is the fourth feature fine-tuning refinement block of the present invention; the other feature fine-tuning refinement blocks have the same structure;
FIG. 7a is a randomly selected RGB image from the test set; FIG. 7b is the depth image corresponding to the randomly selected test image; FIG. 7c is the saliency prediction map generated by the present invention for the randomly selected test image; FIG. 7d is the real scene label image corresponding to the randomly selected test image;
FIG. 8a is a randomly selected RGB image from the test set; FIG. 8b is the depth image corresponding to the randomly selected test image; FIG. 8c is the saliency prediction map generated by the present invention for the randomly selected test image; FIG. 8d is the real scene label image corresponding to the randomly selected test image;
FIG. 9a is a randomly selected RGB image from the test set; FIG. 9b is the depth image corresponding to the randomly selected test image; FIG. 9c is the saliency prediction map generated by the present invention for the randomly selected test image; FIG. 9d is the real scene label image corresponding to the randomly selected test image;
FIG. 10a is the PR (precision-recall) curve of the present invention on the NJU2K test set; FIG. 10b is the PR (precision-recall) curve of the present invention on the NLPR test set.
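For reference, a precision-recall curve such as those in FIG. 10a and FIG. 10b can be computed by thresholding the predicted saliency map at many levels and comparing each binarized map with the ground-truth label; the sketch below is a generic illustration, not the patent's evaluation code.

```python
# Hedged sketch of precision-recall computation for a saliency map.
import numpy as np

def pr_curve(pred, gt, num_thresholds=256):
    """pred: float saliency map in [0, 1]; gt: binary ground-truth map."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precisions.append(tp / max(binary.sum(), 1))   # TP / (TP + FP)
        recalls.append(tp / max(gt.sum(), 1))          # TP / (TP + FN)
    return np.array(precisions), np.array(recalls)
```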
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical scheme adopted by the invention for solving the technical problems is as follows: an RGBD image saliency detection method for regenerating a saliency map into a three-stream network is characterized by comprising two processes of a training stage and a testing stage:
the specific steps of the training phase process are as follows:
step 1_ 1: firstly, selecting RGB images, depth images and corresponding label images of N original RGBD images, forming a training set, and recording the RGB image of the nth original RGBD image in the training set as the RGB image of the nth original RGBD image
Figure BDA0002729215360000071
Depth image of original RGBD image is noted
Figure BDA0002729215360000072
The label graph comprises a real scene label graph and a real scene background label graph; real scene tag map as
Figure BDA0002729215360000073
The background label picture of the real scene is marked as
Figure BDA0002729215360000074
The real scene background label graph is used for supervising the end-to-end convolutional neural network in the embodiment of the invention, and the real scene label graph is used for supervising the single-flow lightweight network; the real scene background label image is a black image with all pixels being 1, minus the real scene label image, and the real scene label image is visually inverted in pixel and reversed in black and white. H represents the height of the image, W represents the width of the image, the RGB image is an image with color information in three channels of red, green and blue, and the depth image is a single-channel image with depth information shot by a depth sensor.
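The background label described above amounts to a pixel-wise inversion of the real scene label map, as the one-line sketch below illustrates (the framework choice and function name are incidental).

```python
# Hedged sketch: background label = all-ones map minus the real scene label map.
import torch

def background_label(label: torch.Tensor) -> torch.Tensor:
    """label: real scene label map with values in {0, 1} (or [0, 1])."""
    return 1.0 - label   # equivalently torch.ones_like(label) - label
```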
Step 1_2: constructing the neural network: the network is formed by two end-to-end networks, namely a dual-stream end-to-end convolutional neural network and a single-stream lightweight network. The dual-stream end-to-end convolutional neural network adopts VGG-16 as its basic encoding network and loads ImageNet-pretrained weights. Its encoding part comprises a first neural network block, a second neural network block, a third neural network block, a fourth neural network block and a fifth neural network block, into which the depth image is input from a first input layer, and a sixth neural network block, a seventh neural network block, an eighth neural network block, a ninth neural network block and a tenth neural network block, into which the RGB image is input from a second input layer. The decoding part comprises a first global information guided multi-scale feature block, a first feature aggregation gate structure block, a second feature aggregation gate structure block, a third feature aggregation gate structure block and a fourth feature aggregation gate structure block, a first information guide block, a second information guide block, a third information guide block and a fourth information guide block serving as information guidance, and a first output layer. The dual-stream end-to-end convolutional neural network outputs an initial saliency prediction map that serves as the input of the single-stream lightweight network, whose task is to enhance the initial saliency prediction map in combination with the earlier information. The single-stream lightweight network is composed of, in its encoder, a first feature enhancement block, a second feature enhancement block, a third feature enhancement block, a fourth feature enhancement block and a fifth feature enhancement block, and, in its decoder, a second global information guided multi-scale feature block, a first feature fine-tuning refinement block, a second feature fine-tuning refinement block, a third feature fine-tuning refinement block, a fourth feature fine-tuning refinement block, a first bidirectional attention block, a second bidirectional attention block, a third bidirectional attention block, a fourth bidirectional attention block, a second output layer, a third output layer, a fourth output layer, a fifth output layer, a sixth output layer, a seventh output layer and a final output layer.
Dual-stream end-to-end convolutional neural network: the first global information guided multi-scale feature block is composed of a first convolutional layer (Conv), a first activation layer (Activation, Act) whose activation function is 'ReLU', a second convolutional layer, a second activation layer, a first global average pooling layer, a third convolutional layer, a first global maximum pooling layer, a first upsampling layer whose upsampling mode is 'bilinear', a second upsampling layer, a fourth convolutional layer, a fifth convolutional layer, a third activation layer, a first dilated convolutional layer (dilated convolution), a fourth activation layer, a sixth convolutional layer, a fifth activation layer, a second dilated convolutional layer, a sixth activation layer, a seventh convolutional layer, a seventh activation layer, a third dilated convolutional layer, an eighth activation layer, an eighth convolutional layer and a ninth activation layer. The first convolutional layer has 3 × 3 convolution kernels, 128 of them; the second convolutional layer has 1 × 1 kernels, 128 of them; the output width and height of the first global average pooling layer are both 1; the third convolutional layer has 1 × 1 kernels, 128 of them; the output width and height of the first global maximum pooling layer are both 1; the fourth convolutional layer has 1 × 1 kernels, 128 of them; the fifth convolutional layer has 1 × 1 kernels, 128 of them; the first dilated convolutional layer has dilation 1, 3 × 3 kernels, 128 of them; the sixth convolutional layer has 1 × 1 kernels, 128 of them; the second dilated convolutional layer has dilation 2, 3 × 3 kernels, 128 of them; the seventh convolutional layer has 1 × 1 kernels, 128 of them; the third dilated convolutional layer has dilation 4, 3 × 3 kernels, 128 of them; the eighth convolutional layer has 1 × 1 kernels, 512 of them. The first global information guided multi-scale feature block receives the feature maps of the fifth neural network block and the tenth neural network block; these feature maps are channel-stacked and then input into the first global information guided multi-scale feature block. After passing through the first convolutional layer, the first activation layer, the second convolutional layer and the second activation layer, the feature map is fed into five branches: the first branch is the first global average pooling layer, the third convolutional layer and the first upsampling layer; the second branch is the first global maximum pooling layer, the fourth convolutional layer and the second upsampling layer.
The feature maps produced by the first branch and the second branch are added to obtain a global feature map. The third branch is the fifth convolutional layer, the third activation layer, the first dilated convolutional layer and the fourth activation layer; the fourth branch is the sixth convolutional layer, the fifth activation layer, the second dilated convolutional layer and the sixth activation layer; the fifth branch is the seventh convolutional layer, the seventh activation layer, the third dilated convolutional layer and the eighth activation layer. The feature maps produced by the third, fourth and fifth branches are channel-stacked with the global feature map, and the stacked result passes through the eighth convolutional layer and the ninth activation layer to obtain the final feature map. The 512 feature maps output by the first global information guided multi-scale feature block are denoted M1; each has width W/16 and height H/16.
For the fourth feature aggregation gate structure block: it is composed of a ninth convolutional layer, a tenth activation layer, a third upsampling layer (upsample layer) whose upsampling mode is bilinear interpolation ('bilinear'), a tenth convolutional layer, an eleventh activation layer, an eleventh convolutional layer, a twelfth activation layer, a twelfth convolutional layer, a thirteenth activation layer, a thirteenth convolutional layer, a fourteenth activation layer, a fourteenth convolutional layer, a fifteenth activation layer, a fifteenth convolutional layer, a sixteenth activation layer, a first S-type activation function and a second S-type activation function whose activation mode is 'Sigmoid', a sixteenth convolutional layer and a seventeenth activation layer. The ninth convolutional layer has 3 × 3 convolution kernels, 512 of them; the upsampling magnification of the third upsampling layer is 2; the tenth convolutional layer has 3 × 3 kernels, 512 of them; the eleventh convolutional layer has 3 × 3 kernels, 512 of them; the twelfth convolutional layer has 3 × 3 kernels, 512 of them; the thirteenth convolutional layer has 1 × 1 kernels, 512 of them; the fourteenth convolutional layer has 3 × 3 kernels, 512 of them; the fifteenth convolutional layer has 1 × 1 kernels, 512 of them; the sixteenth convolutional layer has 3 × 3 kernels, 256 of them. The fourth feature aggregation gate structure block receives the feature maps of the fourth neural network block and the ninth neural network block and the feature map M1 of the first global information guided multi-scale feature block, and its inputs are divided into a depth stream feature, an RGB stream feature and a fusion information stream feature, the depth stream feature being the output feature of the fourth neural network block. The fusion information stream feature map passes through the ninth convolutional layer and the tenth activation layer, then through the third upsampling layer, the tenth convolutional layer and the eleventh activation layer, and is added to the depth stream feature; the added feature map passes through the twelfth convolutional layer, the thirteenth activation layer, the thirteenth convolutional layer and the fourteenth activation layer to obtain a preliminary fusion feature map, which is dot-multiplied with the gate structure binarization weight to obtain the depth information feature map. The fusion information stream feature map (the output feature of the first global information guided multi-scale feature block) undergoes a linear operation with the RGB stream feature map: specifically, it is dot-multiplied with the RGB stream feature map that has passed through the eleventh convolutional layer and the twelfth activation layer, and the result is added to the original RGB stream to obtain the gate structure feature map; the gate structure feature map passes through the first S-type activation function and the second S-type activation function to obtain the gate structure binarization weight.
The RGB stream feature and the depth stream feature are added; the added feature map passes through the fourteenth convolutional layer, the fifteenth activation layer, the fifteenth convolutional layer and the sixteenth activation layer and is dot-multiplied with the gate structure binarization weight to obtain the RGB information feature map. After the depth information feature map and the RGB information feature map are channel-stacked, the result passes through the sixteenth convolutional layer and the seventeenth activation layer to obtain the final feature map. The 256 feature maps output by the fourth feature aggregation gate structure block are denoted G4; each has width W/8 and height H/8.
For the third feature aggregation gate structure block: it is composed of a seventeenth convolutional layer, an eighteenth activation layer, a fourth upsampling layer, an eighteenth convolutional layer, a nineteenth activation layer, a nineteenth convolutional layer, a twentieth activation layer, a twentieth convolutional layer, a twenty-first activation layer, a twenty-first convolutional layer, a twenty-second activation layer, a twenty-second convolutional layer, a twenty-third activation layer, a twenty-third convolutional layer, a twenty-fourth activation layer, a third S-type activation function, a fourth S-type activation function, a twenty-fourth convolutional layer and a twenty-fifth activation layer. The seventeenth convolutional layer has 3 × 3 convolution kernels, 256 of them; the upsampling magnification of the fourth upsampling layer is 2; the eighteenth convolutional layer has 3 × 3 kernels, 256 of them; the nineteenth convolutional layer has 3 × 3 kernels, 256 of them; the twentieth convolutional layer has 3 × 3 kernels, 256 of them; the twenty-first convolutional layer has 1 × 1 kernels, 256 of them; the twenty-second convolutional layer has 3 × 3 kernels, 512 of them; the twenty-third convolutional layer has 1 × 1 kernels, 256 of them; the twenty-fourth convolutional layer has 3 × 3 kernels, 128 of them. The third feature aggregation gate structure block receives the feature maps of the third neural network block and the eighth neural network block and the feature map G4 of the fourth feature aggregation gate structure block, and its inputs are divided into a depth stream feature, a fusion information stream feature and an RGB stream feature, the depth stream feature being the output feature of the third neural network block. The fusion information stream feature map passes through the seventeenth convolutional layer and the eighteenth activation layer, then through the fourth upsampling layer, the eighteenth convolutional layer and the nineteenth activation layer, and is added to the depth stream feature; the added feature map passes through the twentieth convolutional layer, the twenty-first activation layer, the twenty-first convolutional layer and the twenty-second activation layer to obtain a preliminary fusion feature map, which is dot-multiplied with the gate structure binarization weight to obtain the depth information feature map. The fusion information stream feature map undergoes a linear operation with the feature map from the RGB stream: specifically, it is dot-multiplied with the RGB stream feature map that has passed through the nineteenth convolutional layer and the twentieth activation layer, and the result is added to the original RGB stream to obtain the gate structure feature map; the gate structure feature map passes through the third S-type activation function and the fourth S-type activation function to obtain the gate structure binarization weight.
The RGB stream feature and the depth stream feature are added; the added feature map passes through the twenty-second convolutional layer, the twenty-third activation layer, the twenty-third convolutional layer and the twenty-fourth activation layer and is dot-multiplied with the gate structure binarization weight to obtain the RGB information feature map. After the depth information feature map and the RGB information feature map are channel-stacked, the result passes through the twenty-fourth convolutional layer and the twenty-fifth activation layer to obtain the final feature map. The 128 feature maps output by the third feature aggregation gate structure block are denoted G3; each has width W/4 and height H/4.
For the second feature aggregation gate structure block: it is composed of a twenty-fifth convolutional layer, a twenty-sixth activation layer, a fifth upsampling layer, a twenty-sixth convolutional layer, a twenty-seventh activation layer, a twenty-seventh convolutional layer, a twenty-eighth activation layer, a twenty-eighth convolutional layer, a twenty-ninth activation layer, a twenty-ninth convolutional layer, a thirtieth activation layer, a thirtieth convolutional layer, a thirty-first activation layer, a thirty-first convolutional layer, a thirty-second activation layer, a fifth S-type activation function, a sixth S-type activation function, a thirty-second convolutional layer and a thirty-third activation layer. The twenty-fifth convolutional layer has 3 × 3 convolution kernels, 128 of them; the upsampling magnification of the fifth upsampling layer is 2; the twenty-sixth convolutional layer has 3 × 3 kernels, 128 of them; the twenty-seventh convolutional layer has 3 × 3 kernels, 128 of them; the twenty-eighth convolutional layer has 3 × 3 kernels, 128 of them; the twenty-ninth convolutional layer has 1 × 1 kernels, 128 of them; the thirtieth convolutional layer has 3 × 3 kernels, 128 of them; the thirty-first convolutional layer has 1 × 1 kernels, 128 of them; the thirty-second convolutional layer has 3 × 3 kernels, 64 of them. The second feature aggregation gate structure block receives the feature maps of the second neural network block and the seventh neural network block and the feature map G3 of the third feature aggregation gate structure block, and its inputs are divided into a depth stream, a fusion information stream and an RGB stream, the depth stream feature being the output feature of the second neural network block. The fusion information stream feature map passes through the twenty-fifth convolutional layer and the twenty-sixth activation layer, then through the fifth upsampling layer, the twenty-sixth convolutional layer and the twenty-seventh activation layer, and is added to the depth stream feature; the added feature map passes through the twenty-eighth convolutional layer, the twenty-ninth activation layer, the twenty-ninth convolutional layer and the thirtieth activation layer to obtain a preliminary fusion feature map, which is dot-multiplied with the gate structure binarization weight to obtain the depth information feature map.
The feature map G3 of the third feature aggregation gate structure block (the fusion information stream feature map) undergoes a linear operation with the RGB stream feature map: specifically, it is dot-multiplied with the RGB stream feature map that has passed through the twenty-seventh convolutional layer and the twenty-eighth activation layer, and the result is added to the original RGB stream to obtain the gate structure feature map; the gate structure feature map passes through the fifth S-type activation function and the sixth S-type activation function to obtain the gate structure binarization weight. The RGB stream feature and the depth stream feature are added; the added feature map passes through the thirtieth convolutional layer, the thirty-first activation layer, the thirty-first convolutional layer and the thirty-second activation layer and is dot-multiplied with the gate structure binarization weight to obtain the RGB information feature map. After the depth information feature map and the RGB information feature map are channel-stacked, the result passes through the thirty-second convolutional layer and the thirty-third activation layer to obtain the final feature map. The 64 feature maps output by the second feature aggregation gate structure block are denoted G2; each has width W/2 and height H/2.
For the first feature aggregation gate structure block: it is composed of a thirty-third convolutional layer, a thirty-fourth activation layer, a sixth upsampling layer, a thirty-fourth convolutional layer, a thirty-fifth activation layer, a thirty-fifth convolutional layer, a thirty-sixth activation layer, a thirty-sixth convolutional layer, a thirty-seventh activation layer, a thirty-seventh convolutional layer, a thirty-eighth activation layer, a thirty-eighth convolutional layer, a thirty-ninth activation layer, a thirty-ninth convolutional layer, a fortieth activation layer, a seventh S-type activation function, an eighth S-type activation function, a fortieth convolutional layer and a forty-first activation layer. The thirty-third convolutional layer has 3 × 3 convolution kernels, 64 of them; the upsampling magnification of the sixth upsampling layer is 2; the thirty-fourth convolutional layer has 3 × 3 kernels, 64 of them; the thirty-fifth convolutional layer has 3 × 3 kernels, 64 of them; the thirty-sixth convolutional layer has 3 × 3 kernels, 64 of them; the thirty-seventh convolutional layer has 1 × 1 kernels, 64 of them; the thirty-eighth convolutional layer has 3 × 3 kernels, 64 of them; the thirty-ninth convolutional layer has 1 × 1 kernels, 64 of them; the fortieth convolutional layer has 3 × 3 kernels, 32 of them. The first feature aggregation gate structure block receives the feature maps of the first neural network block and the sixth neural network block and the feature map G2 of the second feature aggregation gate structure block, and its inputs are divided into a depth stream feature, a fusion information stream feature and an RGB stream feature, the depth stream feature being the output feature of the first neural network block. The fusion information stream feature map passes through the thirty-third convolutional layer and the thirty-fourth activation layer, then through the sixth upsampling layer, the thirty-fourth convolutional layer and the thirty-fifth activation layer, and is added to the depth stream feature; the added feature map passes through the thirty-sixth convolutional layer, the thirty-seventh activation layer, the thirty-seventh convolutional layer and the thirty-eighth activation layer to obtain a preliminary fusion feature map, which is dot-multiplied with the gate structure binarization weight to obtain the depth information feature map. The feature map G2 of the second feature aggregation gate structure block undergoes a linear operation with the RGB stream feature map: specifically, it is dot-multiplied with the RGB stream feature map that has passed through the thirty-fifth convolutional layer and the thirty-sixth activation layer, and the result is added to the original RGB stream to obtain the gate structure feature map; the gate structure feature map passes through the seventh S-type activation function and the eighth S-type activation function to obtain the gate structure binarization weight.
The RGB stream feature and the depth stream feature are added; the added feature map passes through the thirty-eighth convolutional layer, the thirty-ninth activation layer, the thirty-ninth convolutional layer and the fortieth activation layer and is dot-multiplied with the gate structure binarization weight to obtain the RGB information feature map. After the depth information feature map and the RGB information feature map are channel-stacked, the result passes through the fortieth convolutional layer and the forty-first activation layer to obtain the final feature map. The 32 feature maps output by the first feature aggregation gate structure block are denoted G1; each has width W and height H.
For the first output layer: it is composed of a forty-first convolutional layer, whose convolution kernel size is 3 × 3 with 1 convolution kernel. The first output layer outputs the initial saliency prediction map S1, which receives background supervision: that is, the real scene label map is inverted and two-class cross-entropy supervision is applied to the saliency prediction map. The initial saliency prediction map has width W and height H.
For the first information guide block: it is composed of a forty-second convolutional layer, whose convolution kernel size is 3 × 3 with 32 convolution kernels. The first information guide block receives the feature map G1 of the first feature aggregation gate structure block and the feature map of the sixth neural network block; the feature map output by the sixth neural network block passes through the forty-second convolutional layer and is dot-multiplied with G1 of the first feature aggregation gate structure block, and the feature map after the dot product operation is added to G1 to obtain the final feature map. The 32 feature maps output by the first information guide block are denoted I1; each has width W and height H.
For the second information guide block: it is composed of a forty-third convolutional layer, whose convolution kernel size is 3 × 3 with 64 convolution kernels. The second information guide block receives the feature map G2 of the second feature aggregation gate structure block and the feature map of the seventh neural network block; the feature map output by the seventh neural network block passes through the forty-third convolutional layer and is dot-multiplied with G2 of the second feature aggregation gate structure block, and the feature map after the dot product operation is added to G2 to obtain the final feature map. The 64 feature maps output by the second information guide block are denoted I2; each has width W/2 and height H/2.
For the third information guide block: it is composed of a forty-fourth convolutional layer, whose convolution kernel size is 3 × 3 with 128 convolution kernels. The third information guide block receives the feature map G3 of the third feature aggregation gate structure block and the feature map of the eighth neural network block; the feature map output by the eighth neural network block passes through the forty-fourth convolutional layer and is dot-multiplied with G3 of the third feature aggregation gate structure block, and the feature map after the dot product operation is added to G3 to obtain the final feature map. The 128 feature maps output by the third information guide block are denoted I3; each has width W/4 and height H/4.
For the fourth information guide block: it is composed of a forty-fifth convolutional layer, whose convolution kernel size is 3 × 3 with 256 convolution kernels. The fourth information guide block receives the feature map G4 of the fourth feature aggregation gate structure block and the feature map of the ninth neural network block; the feature map output by the ninth neural network block passes through the forty-fifth convolutional layer and is dot-multiplied with G4 of the fourth feature aggregation gate structure block, and the feature map after the dot product operation is added to G4 to obtain the final feature map. The 256 feature maps output by the fourth information guide block are denoted I4; each has width W/8 and height H/8.
Single-stream lightweight network: the role of the single-stream lightweight network is to enhance the information of the initial saliency prediction map in combination with the earlier feature information; its basic network is an adaptation of VGG-16. For the first feature enhancement block: it is composed of a forty-sixth convolutional layer, a first parametric activation layer (Parametric Rectified Linear Unit, PReLU), a forty-seventh convolutional layer and a second parametric activation layer. The forty-sixth convolutional layer has 3 × 3 convolution kernels, 32 of them; the forty-seventh convolutional layer has 3 × 3 kernels, 32 of them. The first feature enhancement block receives the initial saliency prediction map S1; S1 passes sequentially through the forty-sixth convolutional layer, the first parametric activation layer, the forty-seventh convolutional layer and the second parametric activation layer to obtain the final feature map. The 32 feature maps output by the first feature enhancement block are denoted R1; each has width W and height H.
For the second feature enhancement block: the second feature enhancement block consists of the first max pooling layer (maxpool layer), the forty-eighth convolutional layer, the third parametric activation layer, the forty-ninth convolutional layer and the fourth parametric activation layer. The forty-eighth convolutional layer has a 3 × 3 convolution kernel and 64 kernels; the forty-ninth convolutional layer has a 3 × 3 kernel and 64 kernels. The second feature enhancement block receives the feature map R1 of the first feature enhancement block; R1 passes through the first max pooling layer, the forty-eighth convolutional layer, the third parametric activation layer, the forty-ninth convolutional layer and the fourth parametric activation layer in sequence to obtain the final feature map. The 64 feature maps passing through the second feature enhancement block are denoted as R2, with width W/2 and height H/2.
For the third feature enhancement block: the third feature enhancement block consists of the second max pooling layer, the fiftieth convolutional layer, the fifth parametric activation layer, the fifty-first convolutional layer, the sixth parametric activation layer, the fifty-second convolutional layer and the seventh parametric activation layer. The fiftieth convolutional layer has a 3 × 3 convolution kernel and 128 kernels, the fifty-first convolutional layer has a 3 × 3 kernel and 128 kernels, and the fifty-second convolutional layer has a 3 × 3 kernel and 128 kernels. The third feature enhancement block receives the feature map R2 of the second feature enhancement block; R2 passes through the second max pooling layer, the fiftieth convolutional layer, the fifth parametric activation layer, the fifty-first convolutional layer, the sixth parametric activation layer, the fifty-second convolutional layer and the seventh parametric activation layer in sequence to obtain the final feature map. The 128 feature maps passing through the third feature enhancement block are denoted as R3, with width W/4 and height H/4.
For the fourth feature enhancement block: the fourth feature enhancement block consists of the third max pooling layer, the fifty-third convolutional layer, the eighth parametric activation layer, the fifty-fourth convolutional layer, the ninth parametric activation layer, the fifty-fifth convolutional layer and the tenth parametric activation layer. The fifty-third convolutional layer has a 3 × 3 convolution kernel and 256 kernels, the fifty-fourth convolutional layer has a 3 × 3 kernel and 256 kernels, and the fifty-fifth convolutional layer has a 3 × 3 kernel and 256 kernels. The fourth feature enhancement block receives the feature map R3 of the third feature enhancement block; R3 passes through the third max pooling layer, the fifty-third convolutional layer, the eighth parametric activation layer, the fifty-fourth convolutional layer, the ninth parametric activation layer, the fifty-fifth convolutional layer and the tenth parametric activation layer in sequence to obtain the final feature map. The 256 feature maps passing through the fourth feature enhancement block are denoted as R4, with width W/8 and height H/8.
For the fifth feature enhancement block: the fifth feature enhancement block consists of the fourth max pooling layer, the fifty-sixth convolutional layer, the eleventh parametric activation layer, the fifty-seventh convolutional layer, the twelfth parametric activation layer, the fifty-eighth convolutional layer and the thirteenth parametric activation layer. The fifty-sixth convolutional layer has a 3 × 3 convolution kernel and 512 kernels, the fifty-seventh convolutional layer has a 3 × 3 kernel and 512 kernels, and the fifty-eighth convolutional layer has a 3 × 3 kernel and 512 kernels. The fifth feature enhancement block receives the feature map R4 of the fourth feature enhancement block; R4 passes through the fourth max pooling layer, the fifty-sixth convolutional layer, the eleventh parametric activation layer, the fifty-seventh convolutional layer, the twelfth parametric activation layer, the fifty-eighth convolutional layer and the thirteenth parametric activation layer in sequence to obtain the final feature map. The 512 feature maps passing through the fifth feature enhancement block are denoted as R5, with width W/16 and height H/16.
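The five feature enhancement blocks follow the VGG-16 pattern of max pooling followed by stacked 3 × 3 convolutions with parametric activations. The sketch below shows how such a block might be assembled in PyTorch; the helper name and the exact activation placement are assumptions rather than the patent's definitions.

import torch.nn as nn

def feature_enhancement_block(in_ch, out_ch, n_convs=2, pool=True):
    # optional 2x2 max pooling, then a stack of 3x3 convolutions, each
    # followed by a parametric activation layer (PReLU)
    layers = []
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    for i in range(n_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.PReLU(out_ch))
    return nn.Sequential(*layers)

# e.g. the first block maps the 1-channel initial saliency map to 32 channels
# without pooling, while later blocks halve the resolution and widen the channels
first_block = feature_enhancement_block(1, 32, n_convs=2, pool=False)
fifth_block = feature_enhancement_block(256, 512, n_convs=3, pool=True)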
For the first bidirectional attention block: the first bidirectional attention block consists of the seventh upsampling layer, the second global average pooling layer, the fifty-ninth convolutional layer, the first maximum normalized activation layer (whose activation mode is 'Softmax'), the sixtieth convolutional layer, the sixty-first convolutional layer, the sixty-second convolutional layer, the sixty-third convolutional layer and the ninth S-type activation function. The output width and height of the second global average pooling layer are both 1; the fifty-ninth convolutional layer has a 1 × 1 convolution kernel and 32 kernels; the sixtieth convolutional layer has a 3 × 3 kernel and 16 kernels; the sixty-first convolutional layer has a 1 × 1 kernel and 16 kernels; the sixty-second convolutional layer has a 1 × 1 kernel and 32 kernels; the sixty-third convolutional layer has a 1 × 1 kernel and 1 kernel. The first bidirectional attention block receives the feature map I1 output by the first information guide block and the feature map F2 output by the second fine-tuning refinement module. The feature map I1 is transformed into channel-wise attention weights by the second global average pooling layer and the fifty-ninth convolutional layer, and the attention weights are mapped to the [0,1] interval by the first maximum normalized activation layer. The normalized attention weights are dot-multiplied with the feature map F2 output by the second fine-tuning refinement module to obtain an attention feature map; the attention feature map is added to the feature map I1 output by the first information guide block and passes through the sixtieth, sixty-first and sixty-second convolutional layers to obtain a residual channel attention map. The feature map F2 output by the second fine-tuning refinement module passes through the seventh upsampling layer and a dimensionality reduction operation to obtain a spatial feature map; the spatial feature map passes through the sixty-third convolutional layer and the ninth S-type activation function to obtain a binarized spatial feature map, and the dot product of the binarized spatial feature map and the residual channel attention map tensor gives the final feature map. The 32 feature maps passing through the first bidirectional attention block are denoted as B1, with width W and height H.
For the second bidirectional attention block: the second bidirectional attention block consists of the eighth upsampling layer, the third global average pooling layer, the sixty-fourth convolutional layer, the second maximum normalized activation layer, the sixty-fifth convolutional layer, the sixty-sixth convolutional layer, the sixty-seventh convolutional layer, the sixty-eighth convolutional layer and the tenth S-type activation function. The output width and height of the third global average pooling layer are both 1; the sixty-fourth convolutional layer has a 1 × 1 convolution kernel and 64 kernels; the sixty-fifth convolutional layer has a 3 × 3 kernel and 32 kernels; the sixty-sixth convolutional layer has a 1 × 1 kernel and 32 kernels; the sixty-seventh convolutional layer has a 1 × 1 kernel and 64 kernels; the sixty-eighth convolutional layer has a 1 × 1 kernel and 1 kernel. The second bidirectional attention block receives the feature map I2 output by the second information guide block and the feature map F3 output by the third fine-tuning refinement module. The feature map I2 is transformed into channel-wise attention weights by the third global average pooling layer and the sixty-fourth convolutional layer, and the attention weights are mapped to the [0,1] interval by the second maximum normalized activation layer. The normalized attention weights are dot-multiplied with the feature map F3 output by the third fine-tuning refinement module to obtain an attention feature map; the attention feature map is added to the feature map I2 output by the second information guide block and passes through the sixty-fifth, sixty-sixth and sixty-seventh convolutional layers to obtain a residual channel attention map. The feature map F3 output by the third fine-tuning refinement module passes through the eighth upsampling layer and a dimensionality reduction operation to obtain a spatial feature map; the spatial feature map passes through the sixty-eighth convolutional layer and the tenth S-type activation function to obtain a binarized spatial feature map, and the dot product of the binarized spatial feature map and the residual channel attention map tensor gives the final feature map. The 64 feature maps passing through the second bidirectional attention block are denoted as B2, with width W/2 and height H/2.
For the third bidirectional attention block: the third bidirectional attention block consists of the ninth upsampling layer, the fourth global average pooling layer, the sixty-ninth convolutional layer, the third maximum normalized activation layer, the seventieth convolutional layer, the seventy-first convolutional layer, the seventy-second convolutional layer, the seventy-third convolutional layer and the eleventh S-type activation function. The output width and height of the fourth global average pooling layer are both 1; the sixty-ninth convolutional layer has a 1 × 1 convolution kernel and 128 kernels; the seventieth convolutional layer has a 3 × 3 kernel and 64 kernels; the seventy-first convolutional layer has a 1 × 1 kernel and 64 kernels; the seventy-second convolutional layer has a 1 × 1 kernel and 128 kernels; the seventy-third convolutional layer has a 1 × 1 kernel and 1 kernel. The third bidirectional attention block receives the feature map I3 output by the third information guide block and the feature map F4 output by the fourth fine-tuning refinement module. The feature map I3 is transformed into channel-wise attention weights by the fourth global average pooling layer and the sixty-ninth convolutional layer, and the attention weights are mapped to the [0,1] interval by the third maximum normalized activation layer. The normalized attention weights are dot-multiplied with the feature map F4 output by the fourth fine-tuning refinement module to obtain an attention feature map; the attention feature map is added to the feature map I3 output by the third information guide block and passes through the seventieth, seventy-first and seventy-second convolutional layers to obtain a residual channel attention map. The feature map F4 output by the fourth fine-tuning refinement module passes through the ninth upsampling layer and a dimensionality reduction operation to obtain a spatial feature map; the spatial feature map passes through the seventy-third convolutional layer and the eleventh S-type activation function to obtain a binarized spatial feature map, and the dot product of the binarized spatial feature map and the residual channel attention map tensor gives the final feature map. The 128 feature maps passing through the third bidirectional attention block are denoted as B3, with width W/4 and height H/4.
For the fourth bidirectional attention block: the fourth bidirectional attention block consists of the fifth global average pooling layer, the seventy-fourth convolutional layer, the fourth maximum normalized activation layer, the seventy-fifth convolutional layer, the seventy-sixth convolutional layer, the seventy-seventh convolutional layer, the seventy-eighth convolutional layer and the twelfth S-type activation function. The output width and height of the fifth global average pooling layer are both 1; the seventy-fourth convolutional layer has a 1 × 1 convolution kernel and 256 kernels; the seventy-fifth convolutional layer has a 3 × 3 kernel and 128 kernels; the seventy-sixth convolutional layer has a 1 × 1 kernel and 128 kernels; the seventy-seventh convolutional layer has a 1 × 1 kernel and 256 kernels; the seventy-eighth convolutional layer has a 1 × 1 kernel and 1 kernel. The fourth bidirectional attention block receives the feature map I4 output by the fourth information guide block and the feature map M2 output by the second global information guided multi-scale feature block. The feature map I4 is transformed into channel-wise attention weights by the fifth global average pooling layer and the seventy-fourth convolutional layer, and the attention weights are mapped to the [0,1] interval by the fourth maximum normalized activation layer. The normalized attention weights are dot-multiplied with the feature map M2 output by the second global information guided multi-scale feature block to obtain an attention feature map; the attention feature map is added to the feature map I4 output by the fourth information guide block and passes through the seventy-fifth, seventy-sixth and seventy-seventh convolutional layers to obtain a residual channel attention map. The feature map M2 output by the second global information guided multi-scale feature block passes through the ninth upsampling layer and a dimensionality reduction operation to obtain a spatial feature map; the spatial feature map passes through the seventy-eighth convolutional layer and the twelfth S-type activation function to obtain a binarized spatial feature map, and the dot product of the binarized spatial feature map and the residual channel attention map tensor gives the final feature map. The 256 feature maps passing through the fourth bidirectional attention block are denoted as B4, with width W/8 and height H/8.
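The four bidirectional attention blocks share the same two-branch pattern, so one hedged PyTorch sketch suffices. It assumes the refinement (or multi-scale) feature is first brought to the guide feature's resolution and that the dimensionality reduction is a channel-wise mean; neither detail is stated explicitly above, and the module names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalAttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.channel_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.bottleneck = nn.Sequential(              # 3x3 conv followed by two 1x1 convs
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.Conv2d(channels // 2, channels // 2, 1),
            nn.Conv2d(channels // 2, channels, 1),
        )
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, guide_feat, refine_feat):
        # bring the refinement feature to the guide feature's spatial size (assumption)
        refine_feat = F.interpolate(refine_feat, size=guide_feat.shape[2:],
                                    mode='bilinear', align_corners=False)
        # channel branch: softmax-normalised attention weights gate the refinement feature
        w = F.softmax(self.channel_conv(self.gap(guide_feat)), dim=1)
        residual = self.bottleneck(w * refine_feat + guide_feat)   # residual channel attention map
        # spatial branch: channel mean, 1x1 conv and sigmoid give a soft spatial mask
        mask = torch.sigmoid(self.spatial_conv(refine_feat.mean(dim=1, keepdim=True)))
        return residual * mask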
For the second global information guided multi-scale feature block: the second global information guided multi-scale feature block consists of the seventy-ninth convolutional layer, the forty-second activation layer, the eightieth convolutional layer, the forty-third activation layer, the sixth global average pooling layer, the eighty-first convolutional layer, the second global maximum pooling layer, the eighty-second convolutional layer, the tenth upsampling layer, the eleventh upsampling layer, the eighty-third convolutional layer, the forty-fourth activation layer, the fourth expanded convolutional layer, the forty-fifth activation layer, the eighty-fourth convolutional layer, the forty-sixth activation layer, the fifth expanded convolutional layer, the forty-seventh activation layer, the eighty-fifth convolutional layer, the forty-eighth activation layer, the sixth expanded convolutional layer, the forty-ninth activation layer, the eighty-sixth convolutional layer and the fiftieth activation layer. The seventy-ninth convolutional layer has a 3 × 3 convolution kernel and 128 kernels; the eightieth convolutional layer has a 1 × 1 kernel and 128 kernels; the output width and height of the sixth global average pooling layer are both 1; the eighty-first convolutional layer has a 1 × 1 kernel and 128 kernels; the output width and height of the second global maximum pooling layer are both 1; the eighty-second convolutional layer has a 1 × 1 kernel and 128 kernels; the eighty-third convolutional layer has a 1 × 1 kernel and 128 kernels; the fourth expanded convolutional layer has dilation 1, a 3 × 3 kernel and 128 kernels; the eighty-fourth convolutional layer has a 1 × 1 kernel and 128 kernels; the fifth expanded convolutional layer has dilation 2, a 3 × 3 kernel and 128 kernels; the eighty-fifth convolutional layer has a 1 × 1 kernel and 128 kernels; the sixth expanded convolutional layer has dilation 4, a 3 × 3 kernel and 128 kernels; the eighty-sixth convolutional layer has a 1 × 1 kernel and 256 kernels. The second global information guided multi-scale feature block receives the feature map R5 of the fifth feature enhancement block. The feature map is fed into five branches after passing through the seventy-ninth convolutional layer, the forty-second activation layer, the eightieth convolutional layer and the forty-third activation layer. The first branch is the sixth global average pooling layer, the eighty-first convolutional layer and the tenth upsampling layer; the second branch is the second global maximum pooling layer, the eighty-second convolutional layer and the eleventh upsampling layer; the feature maps after the first and second branches are added to obtain a global feature map. The third branch is the eighty-third convolutional layer, the forty-fourth activation layer, the fourth expanded convolutional layer and the forty-fifth activation layer; the fourth branch is the eighty-fourth convolutional layer, the forty-sixth activation layer, the fifth expanded convolutional layer and the forty-seventh activation layer; the fifth branch is the eighty-fifth convolutional layer, the forty-eighth activation layer, the sixth expanded convolutional layer and the forty-ninth activation layer. The feature maps passing through the third, fourth and fifth branches are channel-stacked with the global feature map, and after the channel stacking operation pass through the eighty-sixth convolutional layer and the fiftieth activation layer to obtain the final feature map. The 256 feature maps of the second global information guided multi-scale feature block are denoted as M2, with width W/16 and height H/16.
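A sketch of such a global-information-guided multi-scale feature block is shown below; ReLU is assumed for the numbered activation layers, and the names are illustrative rather than the patent's numbered layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMultiScaleBlock(nn.Module):
    def __init__(self, in_ch=512, mid_ch=128, out_ch=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        self.gap_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(mid_ch, mid_ch, 1))
        self.gmp_branch = nn.Sequential(nn.AdaptiveMaxPool2d(1),
                                        nn.Conv2d(mid_ch, mid_ch, 1))
        # three dilated (expanded convolution) branches with dilation 1, 2 and 4
        self.dil = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d),
                          nn.ReLU(inplace=True))
            for d in (1, 2, 4)])
        self.fuse = nn.Sequential(nn.Conv2d(mid_ch * 4, out_ch, 1),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.stem(x)
        h, w = x.shape[2:]
        # global average and global max branches, upsampled back and summed
        g = F.interpolate(self.gap_branch(x), size=(h, w), mode='bilinear',
                          align_corners=False) + \
            F.interpolate(self.gmp_branch(x), size=(h, w), mode='bilinear',
                          align_corners=False)
        # channel-stack the three dilated branches with the global feature map
        out = torch.cat([branch(x) for branch in self.dil] + [g], dim=1)
        return self.fuse(out)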
For the fourth fine-tuning refinement module: the fourth fine-tuning refinement module consists of the eighty-seventh convolutional layer, the fifty-first activation layer, the eighty-eighth convolutional layer, the fifty-second activation layer and the eighty-ninth convolutional layer. The eighty-seventh convolutional layer has a 3 × 3 convolution kernel and 256 kernels; the eighty-eighth convolutional layer has a 3 × 3 kernel and 512 kernels; the eighty-ninth convolutional layer has a 3 × 3 kernel and 128 kernels. The fourth fine-tuning refinement module receives the feature map R4 from the fourth feature enhancement block and the feature map B4 of the fourth bidirectional attention block. After passing through the eighty-eighth convolutional layer, B4 is split along the channel dimension into two feature maps w and b; the feature map w is dot-multiplied with the feature map R4 of the fourth feature enhancement block after it has passed through the eighty-seventh convolutional layer and the fifty-first activation layer, the result is added to the feature map b, and the final feature map is obtained after the fifty-second activation layer and the eighty-ninth convolutional layer. The 128 feature maps passing through the fourth fine-tuning refinement module are denoted as F4, with width W/8 and height H/8.
For the third fine-tuning refinement module: the third fine-tuning refinement module consists of the ninetieth convolutional layer, the fifty-third activation layer, the ninety-first convolutional layer, the fifty-fourth activation layer and the ninety-second convolutional layer. The ninetieth convolutional layer has a 3 × 3 convolution kernel and 128 kernels; the ninety-first convolutional layer has a 3 × 3 kernel and 256 kernels; the ninety-second convolutional layer has a 3 × 3 kernel and 64 kernels. The third fine-tuning refinement module receives the feature map R3 from the third feature enhancement block and the feature map B3 of the third bidirectional attention block. After passing through the ninety-first convolutional layer, B3 is split along the channel dimension into two feature maps w and b; the feature map w is dot-multiplied with the feature map R3 of the third feature enhancement block after it has passed through the ninetieth convolutional layer and the fifty-third activation layer, the result is added to the feature map b, and the final feature map is obtained after the fifty-fourth activation layer and the ninety-second convolutional layer. The 64 feature maps passing through the third fine-tuning refinement module are denoted as F3, with width W/4 and height H/4.
For the second fine-tuning refinement module: the second fine-tuning refinement module consists of the ninety-third convolutional layer, the fifty-fifth activation layer, the ninety-fourth convolutional layer, the fifty-sixth activation layer and the ninety-fifth convolutional layer. The ninety-third convolutional layer has a 3 × 3 convolution kernel and 64 kernels; the ninety-fourth convolutional layer has a 3 × 3 kernel and 128 kernels; the ninety-fifth convolutional layer has a 3 × 3 kernel and 32 kernels. The second fine-tuning refinement module receives the feature map R2 from the second feature enhancement block and the feature map B2 of the second bidirectional attention block. After passing through the ninety-fourth convolutional layer, B2 is split along the channel dimension into two feature maps w and b; the feature map w is dot-multiplied with the feature map R2 of the second feature enhancement block after it has passed through the ninety-third convolutional layer and the fifty-fifth activation layer, the result is added to the feature map b, and the final feature map is obtained after the fifty-sixth activation layer and the ninety-fifth convolutional layer. The 32 feature maps passing through the second fine-tuning refinement module are denoted as F2, with width W/2 and height H/2.
For the first fine-tuning refinement module: the first fine-tuning refinement module consists of the ninety-sixth convolutional layer, the fifty-seventh activation layer, the ninety-seventh convolutional layer, the fifty-eighth activation layer and the ninety-eighth convolutional layer. The ninety-sixth convolutional layer has a 3 × 3 convolution kernel and 32 kernels; the ninety-seventh convolutional layer has a 3 × 3 kernel and 64 kernels; the ninety-eighth convolutional layer has a 3 × 3 kernel and 16 kernels. The first fine-tuning refinement module receives the feature map R1 from the first feature enhancement block and the feature map B1 of the first bidirectional attention block. After passing through the ninety-seventh convolutional layer, B1 is split along the channel dimension into two feature maps w and b; the feature map w is dot-multiplied with the feature map R1 of the first feature enhancement block after it has passed through the ninety-sixth convolutional layer and the fifty-seventh activation layer, the result is added to the feature map b, and the final feature map is obtained after the fifty-eighth activation layer and the ninety-eighth convolutional layer. The 16 feature maps passing through the first fine-tuning refinement module are denoted as F1, with width W and height H.
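All four fine-tuning refinement modules follow the same split-gate pattern. The sketch below uses the fourth module's channel counts (256 in, 128 out) and assumes ReLU for the two activation layers; the class name is illustrative.

import torch
import torch.nn as nn

class FineTuneRefineModule(nn.Module):
    def __init__(self, channels=256, out_channels=128):
        super().__init__()
        self.conv_r = nn.Conv2d(channels, channels, 3, padding=1)      # on the enhancement feature R
        self.conv_b = nn.Conv2d(channels, channels * 2, 3, padding=1)  # on the attention feature B
        self.conv_out = nn.Conv2d(channels, out_channels, 3, padding=1)
        self.act1 = nn.ReLU(inplace=True)
        self.act2 = nn.ReLU(inplace=True)

    def forward(self, r_feat, b_feat):
        # split the doubled-channel map into a scale map w and a bias map b
        w, b = torch.chunk(self.conv_b(b_feat), 2, dim=1)
        refined = self.act1(self.conv_r(r_feat)) * w + b   # scale-and-shift refinement
        return self.conv_out(self.act2(refined))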
For the output layers: the second output layer consists of the ninety-ninth convolutional layer and the twelfth upsampling layer; the ninety-ninth convolutional layer has a 3 × 3 convolution kernel and 1 kernel, and the upsampling magnification of the twelfth upsampling layer is 16. The second output layer receives the feature map M2 of the second global information guided multi-scale feature block; M2 passes through the ninety-ninth convolutional layer and the twelfth upsampling layer in sequence to obtain the saliency prediction map S2 of the second output layer. The third output layer consists of the one hundredth convolutional layer and the thirteenth upsampling layer; the one hundredth convolutional layer has a 3 × 3 kernel and 1 kernel, and the upsampling magnification of the thirteenth upsampling layer is 8. The third output layer receives the feature map F4 of the fourth fine-tuning refinement module; F4 passes through the one hundredth convolutional layer and the thirteenth upsampling layer in sequence to obtain the saliency prediction map S3 of the third output layer. The fourth output layer consists of the one hundred first convolutional layer and the fourteenth upsampling layer; the one hundred first convolutional layer has a 3 × 3 kernel and 1 kernel, and the upsampling magnification of the fourteenth upsampling layer is 8. The fourth output layer receives the feature map F3 of the third fine-tuning refinement module; F3 passes through the one hundred first convolutional layer and the fourteenth upsampling layer in sequence to obtain the saliency prediction map S4 of the fourth output layer. The fifth output layer consists of the one hundred second convolutional layer and the fifteenth upsampling layer; the one hundred second convolutional layer has a 3 × 3 kernel and 1 kernel, and the upsampling magnification of the fifteenth upsampling layer is 4. The fifth output layer receives the feature map F2 of the second fine-tuning refinement module; F2 passes through the one hundred second convolutional layer and the fifteenth upsampling layer in sequence to obtain the saliency prediction map S5 of the fifth output layer. The sixth output layer consists of the one hundred third convolutional layer and the sixteenth upsampling layer; the one hundred third convolutional layer has a 3 × 3 kernel and 1 kernel, and the upsampling magnification of the sixteenth upsampling layer is 2. The sixth output layer receives the feature map F1 of the first fine-tuning refinement module; F1 passes through the one hundred third convolutional layer and the sixteenth upsampling layer in sequence to obtain the saliency prediction map S6 of the sixth output layer.
The seventh output layer consists of the one hundred fourth convolutional layer, whose convolution kernel size is 3 × 3 with 1 convolution kernel. The seventh output layer receives the saliency prediction maps S2, S3, S4, S5 and S6 output by the second to sixth output layers; the outputs of these five output layers are stacked and passed through the one hundred fourth convolutional layer to obtain the saliency prediction map S7 of the seventh output layer. The saliency prediction map S7 and the pixel-inverted saliency prediction map S1 are combined by weighted addition in the final output layer to obtain the final saliency prediction map S. The pixel inversion operation subtracts the binarized saliency prediction map S1 from an image whose pixels are all 1, so that the background and the foreground of the saliency prediction map are swapped; the weighted addition multiplies the inverted saliency prediction map S1 by the real number 0.6, multiplies the saliency prediction map S7 by the real number 0.4, and adds the two.
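The final output layer therefore amounts to a pixel inversion of the binarized S1 followed by a weighted sum with S7. A small sketch of that fusion is given below; the binarization threshold of 0.5 is an assumption, while the 0.6/0.4 weights follow the description above.

import torch

def final_fusion(s1, s7, threshold=0.5, w1=0.6, w7=0.4):
    # invert the binarized S1 (all-ones image minus the binary map) so that
    # background and foreground are swapped, then add it to S7 with weights
    s1_inverted = 1.0 - (s1 > threshold).float()
    return w1 * s1_inverted + w7 * s7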
Step 1_3: the RGB image and the depth image of each original RGBD image in the training set are input into the convolutional neural network as the original input images for training, and the saliency prediction maps corresponding to each original RGBD image in the training set are obtained; the set formed by the 7 saliency prediction maps and the 1 background saliency prediction map corresponding to each original RGBD image is recorded, wherein the saliency prediction maps are supervised with the real-scene label map and the background saliency prediction map is supervised with the real-scene background label map;
Step 1_4: the loss function values between the set formed by the saliency prediction maps corresponding to the RGB images in the training set and the corresponding set formed by the real-scene label maps and the real-scene background label maps are computed; the loss function values are obtained with the binary (two-class) cross-entropy loss function.
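The loss of step 1_4 can be read as a deep-supervision sum of binary cross-entropy terms, one per supervised output. A hedged sketch, assuming the network outputs have already been passed through a sigmoid so that nn.BCELoss applies directly, is:

import torch.nn as nn

bce = nn.BCELoss()   # two-class (binary) cross-entropy

def total_loss(saliency_preds, background_pred, gt, gt_background):
    # each of the seven saliency predictions is supervised with the scene label,
    # and the background prediction with the background label
    loss = sum(bce(p, gt) for p in saliency_preds)
    loss = loss + bce(background_pred, gt_background)
    return loss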
Step 1_5: step 1_3 and step 1_4 are executed repeatedly T times, cycling through the whole training set each time, to obtain the convolutional neural network classification training model and a total of N × T binary cross-entropy loss function values. Assuming that the minimum loss function value corresponds to the optimal result, the smallest of the N × T loss function values is found, and the weight vector and bias term corresponding to this minimum loss function value are taken as the optimal weight vector Wbest and the optimal bias term bbest of the convolutional neural network classification training model. The parameters of the optimal weight vector and bias term are called the weights; the weights are saved to a designated folder when training ends, so that they can be called during testing. In the invention, T is 100.
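Step 1_5 then reduces to a standard training loop that tracks the smallest loss seen and saves the corresponding weights. The sketch below reuses the total_loss helper from the previous sketch; the optimizer, data loader, model interface and file name are assumptions.

import torch

def train(model, loader, optimizer, epochs=100, ckpt_path="best_weights.pth"):
    # T = 100 passes over the whole training set; the weights giving the
    # minimum binary cross-entropy loss are kept as Wbest / bbest
    best_loss = float("inf")
    for epoch in range(epochs):
        for rgb, depth, gt, gt_bg in loader:
            optimizer.zero_grad()
            saliency_preds, background_pred = model(rgb, depth)   # assumed model interface
            loss = total_loss(saliency_preds, background_pred, gt, gt_bg)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:
                best_loss = loss.item()
                torch.save(model.state_dict(), ckpt_path)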
The specific steps of the test-stage process are as follows:
The test images are the test-set images of the data set divided before the experiment, and must be images that were not used for training. A pair of RGBD images in the test set consists of an RGB image and a depth image. The pair of RGBD images is input into the established convolutional neural network model, and the saliency prediction map of the final output layer is obtained by calling the optimal weights selected in the training stage. This saliency prediction map is the final saliency prediction map of the invention; the different test sets are tested separately, the final saliency prediction maps are stored in different folders, and they are then compared with the real-scene label maps corresponding to the test sets to obtain the final experimental test results.
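In the test stage the saved optimal weights are loaded and a single forward pass produces the final saliency map. A minimal sketch, assuming the model returns the final output layer's prediction in evaluation mode and that the checkpoint file name matches the training sketch, is:

import torch

@torch.no_grad()
def predict(model, rgb, depth, weight_path="best_weights.pth"):
    # load the optimal weights selected in the training stage and run inference
    model.load_state_dict(torch.load(weight_path))
    model.eval()
    return model(rgb, depth)   # final saliency prediction map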
In order to verify the feasibility and effectiveness of the method of the invention, experiments were carried out. The experimental environment is an Intel i5-7500 processor and an NVIDIA TITAN XP 12GB graphics card, with the code written in Python using the PyTorch library. The internationally recognized public data sets NJU2K and NLPR were used as the experimental data sets to analyse the accuracy and effectiveness of the method of the invention. The selected training set consists of 1400 randomly extracted NJU2K image pairs and 650 NLPR image pairs; the remaining images are used as the experimental test set. Four objective parameters commonly used to evaluate visual saliency detection methods are used as evaluation indices: S-measure (structure measure), E-measure (enhanced alignment measure), F-measure and MAE (mean absolute error). The S-measure evaluates the structural similarity between the salient region in the saliency prediction map and the real-scene label map; the E-measure combines local pixel values with the image-level mean to jointly capture image-level statistics and local pixel-matching information, reported as a mean value; the F-measure is a region-based similarity measure expressed as a weighted harmonic mean; and the MAE is defined as the average per-pixel absolute error between the saliency prediction map and the real-scene label map. The four index values of the invention are shown in Table 1. As can be seen from the data listed in Table 1, the results generated by the method of the invention are very close to the real-scene label maps; the experiments show that the final saliency prediction maps generated by the method have high accuracy and a certain robustness on the two international public data sets, indicating that the invention is effective for saliency detection.
TABLE 1. Comparison of 4 commonly used objective indices of the method of the invention on two international public data sets

Data set    S-measure    E-mean    F-mean    MAE
NJU2K       0.907        0.929     0.889     0.042
NLPR        0.916        0.940     0.884     0.028
Combining Table 1 with a comparison between the final saliency prediction maps generated by the method and the real-scene label images shows that the method performs excellently on the two international public data sets, with high values for all four indices; in particular, the mean absolute error on NLPR reaches 0.028. The comparison images show that the saliency prediction maps generated by the method are very close to the real-scene label maps and can adapt to a variety of complex environments; the boundaries of the salient target objects are clear and definite, and the target structures are identified accurately and completely.
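Of the four indices, the MAE is the simplest to reproduce. A sketch of its computation on a single prediction/label pair, assuming both maps are scaled to [0, 1], is:

import torch

def mean_absolute_error(pred, gt):
    # average per-pixel absolute difference between prediction and label map
    return torch.mean(torch.abs(pred - gt)).item()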
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. An RGBD image saliency detection method for regenerating a saliency map into a three-stream network is characterized by comprising the following specific steps:
selecting RGB images, depth images and label images of N original RGBD images to form a training set;
constructing a neural network, and adopting a double-current end-to-end convolution neural network and a single-current light weight network; the double-current end-to-end convolutional neural network adopts VGG-16 as a basic coding network; importing the double-current end-to-end convolution neural network into ImageNet training weight for pre-training;
inputting the RGB image and the depth image of each original RGBD image in the training set as original input images into the neural network for training, and obtaining, for the RGB image in each original RGBD image in the training set, a corresponding set of saliency prediction maps formed by saliency prediction maps and a background saliency prediction map;
calculating a loss function value between the significance prediction graph and the corresponding label graph, wherein the loss function value is obtained by adopting a two-class cross entropy loss function;
repeatedly executing training and calculation, circulating the whole training set each time to obtain a convolutional neural network classification training model, and determining a minimum loss function value; and correspondingly taking the weight vector and the bias item corresponding to the minimum loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model.
2. The RGBD image saliency detection method of a saliency map reproduction three-flow network of claim 1, characterized in that in the neural network, the encoding part of the two-stream end-to-end convolutional neural network consists of a first neural network block, a second neural network block, a third neural network block, a fourth neural network block and a fifth neural network block, to which the depth image is input through a first input layer;
the RGB image is input from a second input layer into a sixth neural network block, a seventh neural network block, an eighth neural network block, a ninth neural network block and a tenth neural network block; a decoding part consists of a first global information guided multi-scale feature block, a first feature aggregation gate structure block, a second feature aggregation gate structure block, a third feature aggregation gate structure block, a fourth feature aggregation gate structure block, a first information guide block, a second information guide block, a third information guide block, a fourth information guide block and a first output layer; the two-stream end-to-end convolutional neural network outputs an initial saliency prediction map as the input of the single-stream lightweight network, and the single-stream lightweight network combines the previous information to enhance the initial saliency prediction map; the single-stream lightweight network is composed of a first feature enhancement block, a second feature enhancement block, a third feature enhancement block, a fourth feature enhancement block and a fifth feature enhancement block of the encoder, and of a second global information guided multi-scale feature block, a first feature fine-tuning refinement block, a second feature fine-tuning refinement block, a third feature fine-tuning refinement block, a fourth feature fine-tuning refinement block, a first bidirectional attention block, a second bidirectional attention block, a third bidirectional attention block and a fourth bidirectional attention block of the decoder, together with a second output layer, a third output layer, a fourth output layer, a fifth output layer, a sixth output layer, a seventh output layer and a final output layer.
3. The RGBD image saliency detection method of one saliency map reproduction three-flow network of claim 2, characterized in that the first global information guided multi-scale feature block and the second global information guided multi-scale feature block have the same structure, and include a first convolution layer, a first active layer, a second convolution layer, a second active layer, a first global average pooling layer, a third convolution layer, a third active layer, a first upsampling layer, a first global maximum pooling layer, a fourth convolution layer, a fourth active layer, a second upsampling layer, a fifth convolution layer, a fifth active layer, a first expanded convolution layer, a sixth active layer, a sixth convolution layer, a seventh active layer, a second expanded convolution layer, an eighth convolution layer, a seventh convolution layer, a ninth active layer, and a third expanded convolution layer;
the characteristic diagram is input into five branches after passing through the first convolution layer, the first activation layer, the second convolution layer and the second activation layer; the first branch is the first global average pooling layer, the third convolution layer and the first up-sampling layer; the second branch is the first global maximum pooling layer, the fourth convolution layer and the second up-sampling layer; adding the feature maps after passing through the first branch and the second branch to obtain a global feature map; a third branch is the fifth convolutional layer, the third active layer, the first expansion convolutional layer and the fourth active layer; a fourth branch being the sixth convolutional layer, the fifth active layer, the second expansion convolutional layer and the sixth active layer; the fifth branch is the seventh convolutional layer, the seventh active layer, the third expansion convolutional layer and the eighth active layer, the feature map is subjected to channel stacking operation with the global feature map through the third branch, the fourth branch, the fifth branch, and the eighth convolutional layer and the ninth active layer after the channel stacking operation to obtain a final feature map A.
4. The RGBD image saliency detection method of the saliency map reproduction three-flow network of claim 3, wherein the first feature aggregate gate structure block, the second feature aggregate gate structure block, the third feature aggregate gate structure block, and the fourth feature aggregate gate structure block are identical in structure, and include: a ninth convolutional layer, a tenth active layer, a third upsampling layer, a tenth convolutional layer, an eleventh active layer, an eleventh convolutional layer, a twelfth active layer, a twelfth convolutional layer, a thirteenth active layer, a thirteenth convolutional layer, a fourteenth active layer, a fourteenth convolutional layer, a fifteenth active layer, a fifteenth convolutional layer, a sixteenth active layer, a first S-type active function, a second S-type active function, a sixteenth convolutional layer, a seventeenth active layer;
each feature aggregation gate structure block is divided into a depth stream feature, an RGB stream feature and a fusion information stream feature, the depth stream feature is added to the fusion information stream feature map passing through the ninth convolution layer, the tenth active layer and the third upsampling layer, the tenth convolution layer and the eleventh active layer, the added feature map passes through the twelfth convolution layer, the thirteenth active layer, the thirteenth convolution layer and the fourteenth active layer to obtain a preliminary fusion feature map, and the fusion information stream feature is dot-product-operated with the RGB stream feature map passing through the eleventh convolution layer and the twelfth active layer and then added to the original RGB stream to obtain a gate structure feature map; the gate structure feature graph passes through the first S-shaped activation function and the second S-shaped activation function to obtain a gate structure binarization weight; the RGB stream features are subjected to the operations of the fourteenth convolution layer, the fifteenth activation layer, the fifteenth convolution layer and the sixteenth activation layer, and gate structure binarization weight dot product to obtain an RGB information feature map; the preparation fusion feature map and gate structure binarization weight dot product operation is carried out to obtain a depth information feature map; and after the depth information characteristic diagram and the RGB information characteristic diagram are subjected to channel stacking operation, a final characteristic diagram B is obtained through the sixteenth convolution layer and the seventeenth activation layer.
5. The RGBD image saliency detection method of claim 4, characterized in that the first information guide block, the second information guide block, the third information guide block and the fourth information guide block are identical in structure and comprise a convolutional layer A; the feature map, after passing through the convolutional layer A, is subjected to a dot product operation with the final feature map B, and the feature map after the dot product operation is added to the final feature map B to obtain a final feature map C.
6. The RGBD image saliency detection method of one saliency map reproduction three-flow network of claim 5, characterized in that said first, second, third and fourth feature fine-tuning refinement blocks are identical in structure and comprise an eighty-seventh convolutional layer, a fifty-first activation layer, an eighty-eighth convolutional layer, a fifty-second activation layer and an eighty-ninth convolutional layer;
the first feature map is the feature map output by the feature enhancement block corresponding to each feature fine-tuning refinement block; the second feature map is the feature map output by the bidirectional attention block corresponding to each feature fine-tuning refinement block; after the second feature map passes through the eighty-eighth convolutional layer, it is divided along the channel dimension into two sets of feature maps, a feature map w and a feature map b; the feature map w is subjected to a dot product operation with the first feature map after the eighty-seventh convolutional layer and the fifty-first activation layer, the result is added to the feature map b, and the final feature map D is obtained after the fifty-second activation layer and the eighty-ninth convolutional layer.
7. The RGBD image saliency detection method of saliency map reproduction three-flow network of claim 6, characterized in that said first bidirectional attention block, said second bidirectional attention block, said third bidirectional attention block and said fourth bidirectional attention block are identical in structure, including: a seventh upsampling layer, a second global average pooling layer, a fifty-ninth convolutional layer, a first maximum normalized activation layer, a sixtieth convolutional layer, a sixty-first convolutional layer, a sixty-second convolutional layer, a sixty-third convolutional layer and a ninth S-type activation function;
the third feature map is a feature map output by the second global information guide multi-scale feature block or a feature map output by each fine tuning refinement module;
the final feature map C is transformed into attention weights arranged according to channels through the second global average pooling layer and the fifty-ninth convolutional layer, the attention weights are mapped to the [0,1] interval through the first maximum normalized activation layer, the normalized attention weights are subjected to a dot product operation with the third feature map to obtain an attention feature map, and the attention feature map and the final feature map C are added and pass through the sixtieth convolutional layer, the sixty-first convolutional layer and the sixty-second convolutional layer to obtain a residual channel attention map; the third feature map passes through the seventh upsampling layer and a dimensionality reduction operation to obtain a spatial feature map, and the spatial feature map passes through the sixty-third convolutional layer and the ninth S-type activation function to obtain a binarized spatial feature map; the binarized spatial feature map and the residual channel attention map tensor are subjected to a dot product operation to obtain a final feature map E.
CN202011113013.1A 2020-10-17 2020-10-17 RGBD image saliency detection method for regenerating saliency map into three-stream network Pending CN112241743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011113013.1A CN112241743A (en) 2020-10-17 2020-10-17 RGBD image saliency detection method for regenerating saliency map into three-stream network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011113013.1A CN112241743A (en) 2020-10-17 2020-10-17 RGBD image saliency detection method for regenerating saliency map into three-stream network

Publications (1)

Publication Number Publication Date
CN112241743A true CN112241743A (en) 2021-01-19

Family

ID=74168875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011113013.1A Pending CN112241743A (en) 2020-10-17 2020-10-17 RGBD image saliency detection method for regenerating saliency map into three-stream network

Country Status (1)

Country Link
CN (1) CN112241743A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949490A (en) * 2021-03-01 2021-06-11 创新奇智(青岛)科技有限公司 Device action detection method and device, electronic device and readable storage medium
CN113362322A (en) * 2021-07-16 2021-09-07 浙江科技学院 Distinguishing auxiliary and multi-mode weighted fusion salient object detection method
CN113362322B (en) * 2021-07-16 2024-04-30 浙江科技学院 Obvious object detection method based on discrimination assistance and multi-mode weighting fusion

Similar Documents

Publication Publication Date Title
US11288546B2 (en) Apparatus and method for training facial locality super resolution deep neural network
Amerini et al. Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos
Nguyen et al. Eyebrow recognition for identifying deepfake videos
CN109410146A (en) A kind of image deblurring algorithm based on Bi-Skip-Net
Kohli et al. Detecting deepfake, faceswap and face2face facial forgeries using frequency cnn
CN112070753A (en) Multi-scale information enhanced binocular convolutional neural network saliency image detection method
CN112257509A (en) Stereo image single-stream visual saliency detection method based on joint information coding
CN112241743A (en) RGBD image saliency detection method for regenerating saliency map into three-stream network
CN114049332A (en) Abnormality detection method and apparatus, electronic device, and storage medium
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN106778571A (en) A kind of digital video feature extracting method based on deep neural network
KR20210034462A (en) Method for training generative adversarial networks to generate per-pixel annotation
CN111798436A (en) Salient object detection method based on attention expansion convolution feature fusion
Naeem et al. T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
Hao et al. Deepfake detection using multiple data modalities
CN116665110B (en) Video action recognition method and device
CN115984949B (en) Low-quality face image recognition method and equipment with attention mechanism
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN116977822A (en) Image recognition network integrating CNN and transducer model
Banerjee et al. Velocity estimation from monocular video for automotive applications using convolutional neural networks
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
CN113255817A (en) Robot indoor scene semantic understanding method based on two-way cross-modal interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination