CN112070753A - Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Info

Publication number
CN112070753A
Authority
CN
China
Prior art keywords
feature map
layer
convolution
block
feature
Prior art date
Legal status
Pending
Application number
CN202010948669.9A
Other languages
Chinese (zh)
Inventor
周武杰
柳昌
郭沁玲
雷景生
强芳芳
杨胜英
郭翔
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202010948669.9A
Publication of CN112070753A

Classifications

    • G06T 7/0002 Image analysis; Inspection of images, e.g. flaw detection
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/084 Learning methods; Backpropagation, e.g. using gradient descent
    • G06T 2207/10024 Image acquisition modality; Color image
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20104 Interactive definition of region of interest [ROI]


Abstract

The invention discloses a multi-scale information enhanced binocular convolutional neural network saliency image detection method, and relates to the technical field of neural networks. The method constructs an end-to-end convolutional neural network two-classification training model. Original scene image data in a training set are input into the model for training to obtain a final saliency prediction image; loss function values between the final saliency prediction image and the set of real saliency detection images are repeatedly calculated to obtain a trained two-classification training model; a scene image awaiting saliency detection is then input into the trained model to obtain a predicted saliency detection image. The saliency detection image generated by the method has clear boundaries and a complete structure, solving the problems of missing structure and incomplete detail in the saliency maps generated by existing methods.

Description

Multi-scale information enhanced binocular convolutional neural network saliency image detection method
Technical Field
The invention relates to the technical field of neural networks, in particular to a multi-scale information enhanced binocular convolutional neural network saliency image detection method.
Background
In recent years, owing to the rapid growth of computer hardware capability and the rise of deep neural networks, many traditional computer vision tasks have been rapidly redefined, and with deep learning tools many visual task benchmarks have been rapidly refreshed; saliency detection is among them. Saliency detection is a detection method based on the human attention mechanism. When a person views a scene, the brain automatically and preferentially processes certain regions of interest while temporarily ignoring the uninteresting regions; the regions of interest screened out by this unique attention mechanism are called salient regions, and saliency detection is the task of detecting these salient regions by existing means, similar to matting and semantic segmentation in computer vision. Traditional saliency detection relies on manual feature extraction; the process is cumbersome and inefficient, the generated results are not ideal, and industrial application in real life is difficult. These drawbacks have gradually been overcome as deep network models were introduced into saliency detection: even with a relatively simple pre-trained deep neural network as the encoder, the decoded result surpasses most methods based on hand-crafted features. Saliency detection can serve as the first step of other visual tasks, such as semantic segmentation and behavior recognition, and has many industrial applications, such as intelligent driving and mobile phone photography.
To apply deep neural networks to saliency detection, many researchers have proposed single-stream networks that achieve good results with RGB picture input alone. However, technology keeps advancing, and progress in sensor hardware now allows people to capture, alongside the RGB picture, an image slice carrying depth information, referred to as a depth image. The depth image depicts the same scene as the RGB picture but carries different semantic information. Existing methods use the depth image to assist the RGB image in saliency detection and build binocular convolutional neural network models. However, given the characteristics of depth images, an under-utilized depth image may fail to play a positive role in the network, and since depth images captured by sensors contain a certain amount of error, how to fuse the information and features of the depth image and the RGB image inside the network also requires in-depth study.
Therefore, how to provide a method that makes full use of feature information to effectively obtain a saliency image is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a multi-scale information enhanced binocular convolutional neural network saliency image detection method, so as to solve the problem, stated in the background above, that a saliency image cannot be effectively obtained because feature information is insufficiently utilized.
In order to achieve the purpose, the invention adopts the following technical scheme: a multi-scale information enhanced binocular convolutional neural network saliency image detection method comprises the following specific steps:
acquiring a training set, wherein the training set comprises N pairs of scene image data and real label images corresponding to the scene image data; wherein the N pairs of scene image data comprise N original RGB images and N depth images;
constructing an end-to-end convolutional neural network two-class training model;
inputting each pair of original scene image data in a training set into the convolutional neural network two-classification training model for training, performing multi-scale feature refinement extraction to obtain a significance prediction image corresponding to each pair of original scene image data in the training set, and performing pre-fusion to obtain a final significance prediction image;
obtaining a loss function value between the final saliency prediction image and the set of real saliency detection images by adopting binary cross-entropy;
and repeatedly calculating a loss function value, and determining the optimal weight vector and the optimal bias term to optimize the weight of the convolutional neural network two-class training model so as to obtain the trained convolutional neural network two-class training model.
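As a concrete illustration of this training procedure, the following PyTorch-style sketch applies binary cross-entropy to each stage prediction and to the fused final prediction. The optimizer, the model's return signature, and the assumption that every output already passes through a sigmoid and matches the label resolution are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """Hierarchical supervision sketch: binary cross-entropy on every stage
    saliency map and on the fused final prediction (not the patented code)."""
    bce = nn.BCELoss()
    model.train()
    for rgb, depth, label in loader:               # label: ground-truth saliency map in [0, 1]
        rgb, depth, label = rgb.to(device), depth.to(device), label.to(device)
        stage_maps, final_map = model(rgb, depth)  # assumed return: 5 stage maps + fused map,
                                                   # each sigmoid-activated at label resolution
        loss = bce(final_map, label)
        for s in stage_maps:                       # supervise each output channel separately
            loss = loss + bce(s, label)
        optimizer.zero_grad()
        loss.backward()                            # back-propagate to update weights and bias terms
        optimizer.step()
```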
Preferably, the convolutional neural network two-classification training model comprises five output channels and two inputs; the two inputs are an RGB image and a depth image, respectively. The first output channel and the second output channel each comprise a first neural network block, a second neural network block, a fusion block, a multi-scale feature extraction block, an attention block and a reverse attention block. The first neural network block processes the depth image to obtain a first feature map; the second neural network block processes the RGB image to obtain a second feature map. The first feature map passes through the reverse attention block to obtain a third feature map; the inputs of the reverse attention block are the first feature map and the saliency prediction image output by the next-stage output channel. The fusion block performs feature fusion on the first feature map and the second feature map to obtain a fifth feature map. The second feature map and the third feature map serve as the inputs of the multi-scale feature extraction block, which performs multi-scale feature extraction to obtain a fourth feature map. The fourth feature map and the fifth feature map serve as the inputs of the attention block to obtain a saliency prediction image, which is output through an output layer.
An upsampling layer is arranged between the reverse attention block and the multi-scale feature extraction block of the third output channel and of the fourth output channel;
the multi-scale feature extraction block of the fifth output channel only takes the second feature map as input;
each output channel obtains a saliency prediction image, and the final saliency prediction image is obtained by superimposing these saliency prediction images by channel.
By adopting the above scheme, the invention has the following beneficial effects: the multi-scale features are refined and re-extracted; a decoder based on an attention mechanism fuses multiple sources of information for selective enhancement; and after the high-level stage features are extracted, they guide the low-level stage features, forming a hierarchical feedback network. A saliency prediction map is generated at each stage, hierarchical supervision is applied with a binary cross-entropy function, and the saliency maps of the five stages are combined to generate the final saliency prediction map. The invention solves the problems of missing structure and incomplete detail in the saliency maps generated by existing methods.
Preferably, the fusion blocks respectively comprise a convolution layer I, a convolution layer II and an S-shaped activation layer I;
the first characteristic diagram passes through the convolution layer I to obtain a sixth characteristic diagram;
the second characteristic diagram passes through the convolution layer II to obtain a seventh characteristic diagram;
and after the sixth feature map passes through the S-shaped activation layer I and undergoes a tensor multiplication operation with the seventh feature map, the result is added to the seventh feature map, and the sum undergoes a channel stacking operation with the sixth feature map to output the fifth feature map.
By adopting the scheme, the method has the following beneficial effects: and a fusion block is arranged in the five output channels to construct joint information, the joint information is the fusion of the characteristics of each stage of the RGB image input basic network and the depth image input basic network, and because some depth images can be distorted or unobvious in display, the method adopts two basic networks to obtain a final significance prediction image through a series of operations by information fusion, and the generated significance prediction image has a clear boundary structure.
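A minimal PyTorch-style sketch of one such fusion block is shown below, using the 1 × 1, 32-kernel configuration described later for the first fusion block; the module and argument names are assumptions, and only the S-type activation is used, as in the text.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuses a depth-stream feature map with an RGB-stream feature map:
    sigmoid-gated multiplication, residual addition, then channel stacking."""
    def __init__(self, depth_ch, rgb_ch, mid_ch=32):
        super().__init__()
        self.conv_d = nn.Conv2d(depth_ch, mid_ch, kernel_size=1)  # convolution layer I
        self.conv_r = nn.Conv2d(rgb_ch, mid_ch, kernel_size=1)    # convolution layer II
        self.sigmoid = nn.Sigmoid()                                # S-shaped activation layer I

    def forward(self, feat_depth, feat_rgb):
        d = self.conv_d(feat_depth)            # "sixth" feature map
        r = self.conv_r(feat_rgb)              # "seventh" feature map
        gated = self.sigmoid(d) * r            # tensor multiplication
        fused = gated + r                      # addition with the RGB-side map
        return torch.cat([fused, d], dim=1)    # channel stacking -> 2 * mid_ch = 64 channels
```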
Preferably, the attention blocks each include: an S-type activation layer A, a convolution layer A, an activation layer A, a convolution layer B and an activation layer B;
the fourth feature map is compressed by the S-type activation layer A, through a tensor operation, into pixel values arranged by channel; the fifth feature map, after passing through the convolution layer A and the activation layer A, is multiplied by these pixel values to obtain an intermediate feature map A; and the intermediate feature map A, after passing through the convolution layer B and the activation layer B, is added to the fourth feature map to obtain the saliency prediction image, which is then output.
By adopting this scheme, the invention has the following beneficial effects: the attention blocks of the invention mainly constitute the decoder, which can reduce the size of the network model and improve the decoding precision.
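The following is a minimal PyTorch-style sketch of such an attention block, assuming 64-channel inputs, ReLU activations, and that "pixel values arranged according to channels" means per-channel weights obtained by a sigmoid followed by spatial averaging; these readings are assumptions, not statements from the patent.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Decoder attention block sketch: the multi-scale feature map provides
    channel-wise weights; the fused feature map is re-weighted, refined, and
    added back to the multi-scale map."""
    def __init__(self, ch=64):
        super().__init__()
        self.sigmoid = nn.Sigmoid()
        self.conv_a = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))
        self.conv_b = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

    def forward(self, multi_scale_feat, fused_feat):
        # per-channel weights ("pixel values arranged by channel", assumed reading)
        w = self.sigmoid(multi_scale_feat).mean(dim=(2, 3), keepdim=True)
        mid = self.conv_a(fused_feat) * w            # intermediate feature map A
        return self.conv_b(mid) + multi_scale_feat   # residual addition to the fourth feature map
```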
Preferably, the reverse attention blocks each comprise: an S-type activation layer a, a convolution layer a, an activation layer a, a convolution layer b and an activation layer b. The saliency prediction image is negated and then compressed by the S-type activation layer a, through a tensor operation, into pixel values arranged by channel, and the pixel values of this inverted feature map are inverted; the received first feature map, after passing through the convolution layer a and the activation layer a, is multiplied by these pixel values to obtain an intermediate feature map a; and the intermediate feature map a, after passing through the convolution layer b and the activation layer b, is added to the saliency prediction image to obtain and output the third feature map.
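A companion sketch of the reverse attention block under the same assumptions is given below; the two inversion steps are read here as taking the complement of the sigmoid-compressed channel weights, which is one plausible interpretation of the text rather than a definitive one.

```python
import torch
import torch.nn as nn

class ReverseAttentionBlock(nn.Module):
    """Feeds the higher-stage result back to the depth-stream feature map,
    emphasising regions the later stage did NOT mark as salient (sketch)."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.sigmoid = nn.Sigmoid()
        self.conv_a = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.conv_b = nn.Sequential(nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, depth_feat, higher_stage):
        # reversed channel weights (assumed reading of the double inversion)
        w = 1.0 - self.sigmoid(higher_stage).mean(dim=(2, 3), keepdim=True)
        mid = self.conv_a(depth_feat) * w        # intermediate feature map a
        return self.conv_b(mid) + higher_stage   # "third" feature map
```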
Preferably, the multi-scale feature extraction block includes: the system comprises a first multi-scale feature extraction block, a second multi-scale feature extraction block, a third multi-scale feature extraction block, a fourth multi-scale feature extraction block and a fifth multi-scale feature extraction block;
wherein the fifth multi-scale feature extraction block comprises five dilation convolution channels and one sampling channel; the second characteristic diagram A respectively passes through the expansion convolution channel and the sampling channel to obtain a fourth characteristic diagram A;
the fourth multi-scale feature extraction block comprises four expansion convolution channels and one sampling channel; the third feature map B is convolved and activated to obtain a ninth feature map B, the ninth feature map B undergoes a channel stacking operation with the second feature map B and passes through a convolution layer to obtain a pre-extraction feature map B; after the pre-extraction feature map B undergoes a channel stacking operation with the second feature map B, a tenth feature map B is obtained through the sampling channel, and meanwhile an eleventh feature map B is obtained from the pre-extraction feature map B through the expansion convolution channels; the tenth feature map B and the eleventh feature map B are channel-stacked and convolved to obtain a fourth feature map B;
the third multi-scale feature extraction block comprises three expansion convolution channels and one sampling channel; the third feature map C is convolved and activated to obtain a ninth feature map C, the ninth feature map C undergoes a channel stacking operation with the second feature map C and passes through a convolution layer to obtain a pre-extraction feature map C; after the pre-extraction feature map C undergoes a channel stacking operation with the second feature map C, a tenth feature map C is obtained through the sampling channel, and meanwhile an eleventh feature map C is obtained from the pre-extraction feature map C through the expansion convolution channels; the tenth feature map C and the eleventh feature map C are channel-stacked and convolved to obtain a fourth feature map C;
the second multi-scale feature extraction block comprises two expansion convolution channels and one sampling channel; the third feature map D is convolved and activated to obtain a ninth feature map D, the ninth feature map D undergoes a channel stacking operation with the second feature map D and passes through a convolution layer to obtain a pre-extraction feature map D; after the pre-extraction feature map D undergoes a channel stacking operation with the second feature map D, a tenth feature map D is obtained through the sampling channel, and meanwhile an eleventh feature map D is obtained from the pre-extraction feature map D through the expansion convolution channels; the tenth feature map D and the eleventh feature map D are channel-stacked and convolved to obtain a fourth feature map D;
the first multi-scale feature extraction block and the second multi-scale feature extraction block have the same structure.
By adopting this scheme, the invention has the following beneficial effects: the method uses the multi-scale feature extraction blocks to perform multi-scale feature refinement analysis on the features of each stage of the RGB-input basic network and to extract more detailed multi-scale features, so that the finally obtained saliency prediction image has clear details and high accuracy.
According to the technical scheme, compared with the prior art, the invention discloses the method for detecting the saliency image of the multi-scale information enhanced binocular convolution neural network, and the method has the following beneficial effects:
1. The method is built on a binocular end-to-end encoding-decoding network; multi-scale feature extraction blocks perform multi-scale feature refinement analysis on each stage of features of the RGB-input basic network to extract more detailed multi-scale features; and joint information is constructed, namely the pre-fusion of the features of each stage of the RGB-input basic network and the depth-input basic network, represented as the output of the fusion blocks in the framework diagram.
2. The decoder of the method is mainly composed of attention blocks, which reduces the size of the network model and improves the decoding precision; multi-level supervision is adopted to supervise the saliency map generated at each stage so that the model reaches its highest precision.
3. Comparing the final saliency prediction image generated by the method with the real labels shows that the saliency detection effect of the method is excellent: the detected objects are accurate, the details are clear, the boundaries are sharp, and the method adapts well to various environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the present invention;
FIG. 2 is a diagram of a first fusion block, and other fusion blocks have the same structure and the same function;
FIG. 3 is a fifth multi-scale feature extraction block;
FIG. 4 is a diagram of a fourth multi-scale feature extraction block, in which an adaptive multi-scale feature extraction strategy is employed, the third multi-scale feature extraction block has one less branch than the fourth multi-scale feature extraction block, the second multi-scale feature extraction block has one less branch than the third multi-scale feature extraction block, and the first multi-scale feature extraction block and the second multi-scale feature extraction block have the same structure;
FIG. 5 is a drawing of a fifth attention block, with the other attention blocks having the same structure and the same function;
FIG. 6 is a diagram of a fourth reverse attention block, the other reverse attention blocks having the same structure and the same function;
FIG. 7a is a randomly selected RGB image from the NJU2K test set; FIG. 7b is the corresponding depth image from the NJU2K test set; FIG. 7c is the final saliency prediction map generated by the present invention; FIG. 7d is the corresponding real label image from the test set;
FIG. 8a is a randomly selected RGB image from the NJU2K test set; FIG. 8b is the corresponding depth image from the NJU2K test set; FIG. 8c is the final saliency prediction map generated by the present invention; FIG. 8d is the corresponding real label image from the test set;
FIG. 9a is a randomly selected RGB image from the NLPR test set; FIG. 9b is the corresponding depth image from the NLPR test set; FIG. 9c is the final saliency prediction map generated by the present invention; FIG. 9d is the corresponding real label image from the test set;
FIG. 10a shows the PR curve of the present invention on the NJU2K test set; FIG. 10b shows the PR curve of the present invention on the NLPR test set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multi-scale information enhanced binocular convolutional neural network saliency image detection method, and the overall implementation block diagram of the method is shown in fig. 1, and the method comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
step 1_ 1: acquiring a training set, wherein the training set comprises N pairs of scene image data and real label images corresponding to the scene image data; wherein the N pairs of scene image data comprise N original RGB images and N depth images.
Specifically, the RGB image is a color image with three channels (red, green and blue), the depth image is a single-channel image with depth information captured using a depth sensor, and the real label map is a saliency image based on the correct classification given by the human attention mechanism; together these constitute the training set, in which an RGB image and its depth image form a pair of scene image data. The q-th original RGB image in the training set is recorded as {Iq(i, j)}, the corresponding q-th original depth image is recorded as {Rq(i, j)}, and the real label image corresponding to the original scene image pair {Iq(i, j)} is recorded as its ground-truth saliency map.
Before training begins, the depth image is converted into a three-channel image by simple copying, so as to facilitate pre-training. In this embodiment, NJU2K and NLPR are selected as the experimental data sets; owing to privacy issues, NJU2K contains 1985 picture pairs and NLPR contains 1000 picture pairs, where one pair consists of an RGB picture and a depth image. 1400 pairs are taken from NJU2K and 650 pairs from NLPR as the training set, and the remaining pictures are used as the test set.
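The "simple copying" of the single-channel depth image to three channels can be illustrated in one line; the tensor layout (N, 1, H, W) is an assumption.

```python
import torch

def depth_to_three_channels(depth: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel depth map to three channels so it fits an
    RGB-pretrained backbone. depth: (N, 1, H, W) -> (N, 3, H, W)."""
    return depth.repeat(1, 3, 1, 1)
```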
Step 1_ 2: and constructing an end-to-end convolutional neural network two-class training model.
As can be seen from FIG. 1, the convolutional neural network two-classification training model comprises an input layer, a hidden layer and an output layer. The input layer consists of a first input layer for the depth image and a second input layer for the RGB image. The hidden layer consists of: the first, second, third, fourth and fifth neural network blocks of the depth stream; the sixth, seventh, eighth, ninth and tenth neural network blocks of the RGB stream; the first, second, third, fourth and fifth fusion blocks of the joint information stream; the first, second, third, fourth and fifth multi-scale feature extraction blocks for multi-scale feature enhancement; the first, second, third, fourth and fifth attention blocks for decoding; and the first, second, third and fourth reverse attention blocks for receiving feedback level information. The output layer comprises a first output layer, a second output layer, a third output layer, a fourth output layer, a fifth output layer and a sixth output layer for the final-stage output. To facilitate pre-training, the depth image is simply copied to three channels through the first input layer, while the RGB image enters the second input layer unchanged. The input width of both the depth image and the RGB image is W, the input height of both is H, and both have three channels.
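The following PyTorch-style sketch shows one way the blocks listed above could be wired together in a forward pass; the attribute names, the zero-based stage indexing and the exact placement of the upsampling layers are illustrative assumptions rather than a transcription of FIG. 1.

```python
import torch
import torch.nn.functional as F

def forward(self, rgb, depth):
    """Assumed data flow of the full model: five depth-stream and five RGB-stream
    stage features, per-stage fusion, multi-scale refinement with feedback from
    the higher stage, attention decoding, and one saliency prediction per channel."""
    R = self.depth_backbone(depth)    # [R1 .. R5]
    I = self.rgb_backbone(rgb)        # [I1 .. I5]

    preds, a = [], None
    for k in reversed(range(5)):                       # stage 5 down to stage 1
        f = self.fusion[k](R[k], I[k])                 # joint information stream
        if a is None:
            p = self.msfe[k](I[k])                     # fifth block: RGB features only
        else:
            r = self.reverse_attention[k](R[k], a)     # feedback from the next-higher stage
            if k in (2, 3):                            # upsampling layers of the 3rd/4th channels
                r = F.interpolate(r, scale_factor=2, mode="bilinear", align_corners=False)
            p = self.msfe[k](I[k], r)
        a = self.attention[k](p, f)                    # decoded stage feature
        preds.append(self.out[k](a))                   # stage saliency prediction
    # channel-wise superposition of the five stage maps into the final prediction
    resized = [F.interpolate(s, size=preds[-1].shape[2:], mode="bilinear",
                             align_corners=False) for s in preds]
    final = self.final_out(torch.cat(resized, dim=1))
    return preds, final
```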
Further, for the depth-stream basic network, the depth stream includes five neural network blocks: the first, second, third, fourth and fifth neural network blocks. The number of feature maps output by the first neural network block is 64; this set of 64 feature maps is denoted R1, with width W/4 and height H/4. The number of feature maps output by the second neural network block is 256, denoted R2, with width W/4 and height H/4. The number of feature maps output by the third neural network block is 512, denoted R3, with width W/8 and height H/8. The number of feature maps output by the fourth neural network block is 1024, denoted R4, with width W/16 and height H/16. The number of feature maps output by the fifth neural network block is 2048, denoted R5, with width W/16 and height H/16.
Further, the RGB-stream basic network consists of five neural network blocks: the sixth, seventh, eighth, ninth and tenth neural network blocks. The RGB-stream basic network and the depth-stream basic network adopt the same structure. The number of feature maps output by the sixth neural network block is 64, denoted I1, with width W/4 and height H/4; the number of feature maps output by the seventh neural network block is 256, denoted I2, with width W/4 and height H/4; the number of feature maps output by the eighth neural network block is 512, denoted I3, with width W/8 and height H/8; the number of feature maps output by the ninth neural network block is 1024, denoted I4, with width W/16 and height H/16; the number of feature maps output by the tenth neural network block is 2048, denoted I5, with width W/16 and height H/16.
therein, ten neural network blocks of the present invention all refer to the Res2net structure.
The five fusion blocks shown in FIG. 1 adopt the same structure and the same parameters; only the first fusion block is described in detail in this embodiment, and the other four fusion blocks are identical to it.
As shown in FIG. 2, the first fusion block receives the feature map R1 from the first neural network block of the depth-stream basic network and the feature map I1 from the sixth neural network block of the RGB stream. The first fusion block consists of the sixty-eighth convolutional layer, the sixty-ninth convolutional layer and the first S-type activation layer, whose activation function is 'Sigmoid'. The sixty-eighth convolutional layer has a kernel size of 1 × 1 and 32 kernels, and the sixty-ninth convolutional layer has a kernel size of 1 × 1 and 32 kernels. The received feature maps R1 and I1 pass through the sixty-eighth and sixty-ninth convolutional layers respectively to obtain two intermediate feature maps. The depth-side intermediate feature map passes through the first S-type activation layer and undergoes a tensor multiplication operation with the RGB-side intermediate feature map; the product is added to the RGB-side intermediate feature map, and the result undergoes a channel stacking operation with the depth-side intermediate feature map. The number of feature maps output by the first fusion block is 64, denoted f1, with width W/4 and height H/4.

The number of feature maps output by the second fusion block is 64, denoted f2, with width W/4 and height H/4. The number of feature maps output by the third fusion block is 64, denoted f3, with width W/8 and height H/8. The number of feature maps output by the fourth fusion block is 64, denoted f4, with width W/16 and height H/16. The number of feature maps output by the fifth fusion block is 64, denoted f5, with width W/16 and height H/16.
For the fifth multi-scale feature extraction block, as shown in FIG. 1, the multi-scale feature extraction blocks act on the multi-scale features output by the basic network; the fifth multi-scale feature extraction block receives the feature map I5 from the tenth neural network block of the RGB basic network.
As shown in FIG. 3, the fifth multi-scale feature extraction block consists of the seventy-eighth convolutional layer, the ninety-fifth activation layer, the first adaptive average pooling layer (Adaptive AvgPool), the seventy-ninth convolutional layer, the first upsampling layer, the eightieth convolutional layer, the first dilated convolutional layer (dilation convolution), the ninety-sixth activation layer, the eighty-first convolutional layer, the second dilated convolutional layer, the ninety-seventh activation layer, the eighty-second convolutional layer, the third dilated convolutional layer, the ninety-eighth activation layer, the eighty-third convolutional layer, the fourth dilated convolutional layer, the ninety-ninth activation layer, the eighty-fourth convolutional layer, the fifth dilated convolutional layer, the one-hundredth activation layer and the eighty-fifth convolutional layer. The seventy-eighth convolutional layer has a kernel size of 1 × 1 and 64 kernels; the pooling size of the first adaptive average pooling layer is 1, and its function is to convert the feature map into pixel values arranged by channel. The seventy-ninth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the sampling mode of the first upsampling layer is 'bilinear', and the upsampled size is consistent with the feature map I5. The eightieth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the first dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 1 and a dilation rate of 1. The eighty-first convolutional layer has a kernel size of 1 × 1 and 16 kernels; the second dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 2 and a dilation rate of 2. The eighty-second convolutional layer has a kernel size of 1 × 1 and 16 kernels; the third dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 4 and a dilation rate of 4. The eighty-third convolutional layer has a kernel size of 1 × 1 and 16 kernels; the fourth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 6 and a dilation rate of 6. The eighty-fourth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the fifth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 8 and a dilation rate of 8. The eighty-fifth convolutional layer has a kernel size of 1 × 1 and 64 kernels. The feature map I5 first passes through the seventy-eighth convolutional layer; one branch then passes through the first adaptive pooling layer, the seventy-ninth convolutional layer and the first upsampling layer to give one feature map, while the other five branches pass respectively through the eightieth convolutional layer and the first dilated convolutional layer, the eighty-first convolutional layer and the second dilated convolutional layer, the eighty-second convolutional layer and the third dilated convolutional layer, the eighty-third convolutional layer and the fourth dilated convolutional layer, and the eighty-fourth convolutional layer and the fifth dilated convolutional layer, where each dilated convolution is followed by an activation layer. A channel stacking operation is then performed on these six feature maps, and the result passes through the eighty-fifth convolutional layer to obtain the final feature map. The number of feature maps output by the fifth multi-scale feature extraction block is 64, denoted P5, with width W/16 and height H/16.
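A PyTorch-style sketch of the fifth multi-scale feature extraction block as just described is given below; the module name, the bilinear resize of the pooled branch back to the input size and the ReLU after the 1 × 1 reduction are assumptions, while the branch widths and dilation rates follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBlock5(nn.Module):
    """Sketch: 1x1 reduction, one global-pooling "sampling" channel and five
    dilated-convolution channels (dilation rates 1, 2, 4, 6, 8), channel-stacked
    and fused by a final 1x1 convolution."""
    def __init__(self, in_ch=2048, mid_ch=64, branch_ch=16):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True))
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(mid_ch, branch_ch, 1))
        self.dilated = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid_ch, branch_ch, 1),
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d),
                          nn.ReLU(inplace=True))        # activation after each dilated convolution
            for d in (1, 2, 4, 6, 8)
        ])
        self.fuse = nn.Conv2d(branch_ch * 6, mid_ch, 1)  # 6 * 16 -> 64 channels

    def forward(self, x):
        x = self.reduce(x)
        h, w = x.shape[2:]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w),
                               mode="bilinear", align_corners=False)  # sampling channel
        branches = [pooled] + [m(x) for m in self.dilated]            # six branches
        return self.fuse(torch.cat(branches, dim=1))                  # channel stacking + fusion
```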
For the fourth multi-scale feature extraction block, as shown in FIG. 1, the fourth multi-scale feature extraction block receives the feature map I4 from the ninth neural network block of the RGB basic network and the feature map r4 from the fourth reverse attention block.
As shown in FIG. 4, the fourth multi-scale feature extraction block consists of the eighty-sixth convolutional layer, the one-hundred-and-first activation layer, the eighty-seventh convolutional layer, the eighty-eighth convolutional layer, the second adaptive average pooling layer, the second upsampling layer, the eighty-ninth convolutional layer, the sixth dilated convolutional layer, the one-hundred-and-second activation layer, the ninetieth convolutional layer, the seventh dilated convolutional layer, the one-hundred-and-third activation layer, the ninety-first convolutional layer, the eighth dilated convolutional layer, the one-hundred-and-fourth activation layer, the ninety-second convolutional layer, the ninth dilated convolutional layer, the one-hundred-and-fifth activation layer and the ninety-third convolutional layer. The eighty-sixth convolutional layer has a kernel size of 1 × 1 and 64 kernels, the eighty-seventh convolutional layer has a kernel size of 1 × 1 and 64 kernels, and the eighty-eighth convolutional layer has a kernel size of 1 × 1 and 16 kernels. The pooling size of the second adaptive average pooling layer is 1; the eighty-ninth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the sixth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 1 and a dilation rate of 1; the ninetieth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the seventh dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 2 and a dilation rate of 2; the ninety-first convolutional layer has a kernel size of 1 × 1 and 16 kernels; the eighth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 4 and a dilation rate of 4; the ninety-second convolutional layer has a kernel size of 1 × 1 and 16 kernels; the ninth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 6 and a dilation rate of 6; and the ninety-third convolutional layer has a kernel size of 1 × 1 and 64 kernels. The feature map I4 passes through the eighty-sixth convolutional layer and the one-hundred-and-first activation layer, undergoes a channel stacking operation with the feature map r4, and passes through the eighty-seventh convolutional layer to obtain the pre-extraction feature map. The pre-extraction feature map and the feature map r4 undergo a channel stacking operation and then pass through the second adaptive average pooling layer, the eighty-eighth convolutional layer and the second upsampling layer to obtain one feature map; meanwhile, the pre-extraction feature map passes respectively through the eighty-ninth convolutional layer, the sixth dilated convolutional layer and the one-hundred-and-second activation layer, the ninetieth convolutional layer, the seventh dilated convolutional layer and the one-hundred-and-third activation layer, the ninety-first convolutional layer, the eighth dilated convolutional layer and the one-hundred-and-fourth activation layer, and the ninety-second convolutional layer, the ninth dilated convolutional layer and the one-hundred-and-fifth activation layer to obtain four more feature maps, where each dilated convolution is followed by an activation layer. A channel stacking operation is then performed on these five feature maps, and the result passes through the ninety-third convolutional layer to obtain the final feature map. The number of feature maps output by the fourth multi-scale feature extraction block is 64, denoted P4, with width W/16 and height H/16.
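A matching sketch of the fourth multi-scale feature extraction block with its feedback input r4 follows; the names are illustrative, and stacking the pre-extraction map with the feedback map before the pooling branch follows the detailed description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBlock4(nn.Module):
    """Sketch: the RGB stage feature is combined with the feedback map from the
    reverse attention block, then refined by a pooling ("sampling") channel and
    four dilated-convolution channels (dilation rates 1, 2, 4, 6)."""
    def __init__(self, in_ch=1024, mid_ch=64, branch_ch=16):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True))
        self.pre_extract = nn.Conv2d(mid_ch * 2, mid_ch, 1)
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(mid_ch * 2, branch_ch, 1))
        self.dilated = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid_ch, branch_ch, 1),
                          nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d),
                          nn.ReLU(inplace=True))
            for d in (1, 2, 4, 6)
        ])
        self.fuse = nn.Conv2d(branch_ch * 5, mid_ch, 1)   # 5 * 16 -> 64 channels

    def forward(self, rgb_feat, feedback):
        x = self.reduce(rgb_feat)
        pre = self.pre_extract(torch.cat([x, feedback], dim=1))    # pre-extraction feature map
        h, w = pre.shape[2:]
        pooled = F.interpolate(self.pool_branch(torch.cat([pre, feedback], dim=1)),
                               size=(h, w), mode="bilinear", align_corners=False)
        branches = [pooled] + [m(pre) for m in self.dilated]        # five branches
        return self.fuse(torch.cat(branches, dim=1))
```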
For the third multi-scale feature extraction block, as shown in FIG. 1, the third multi-scale feature extraction block receives the feature map I3 from the eighth neural network block of the RGB basic network and the feature map r3 from the third reverse attention block. The third multi-scale feature extraction block has one branch fewer than the fourth multi-scale feature extraction block.

The third multi-scale feature extraction block consists of the ninety-fourth convolutional layer, the one-hundred-and-sixth activation layer, the ninety-fifth convolutional layer, the ninety-sixth convolutional layer, the third adaptive average pooling layer, the third upsampling layer, the ninety-seventh convolutional layer, the tenth dilated convolutional layer, the one-hundred-and-seventh activation layer, the ninety-eighth convolutional layer, the eleventh dilated convolutional layer, the one-hundred-and-eighth activation layer, the ninety-ninth convolutional layer, the twelfth dilated convolutional layer, the one-hundred-and-ninth activation layer and the one-hundredth convolutional layer. The ninety-fourth convolutional layer has a kernel size of 1 × 1 and 64 kernels, the ninety-fifth convolutional layer has a kernel size of 1 × 1 and 64 kernels, and the ninety-sixth convolutional layer has a kernel size of 1 × 1 and 16 kernels. The pooling size of the third adaptive average pooling layer is 1; the ninety-seventh convolutional layer has a kernel size of 1 × 1 and 16 kernels; the tenth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 1 and a dilation rate of 1; the ninety-eighth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the eleventh dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 2 and a dilation rate of 2; the ninety-ninth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the twelfth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 4 and a dilation rate of 4; and the one-hundredth convolutional layer has a kernel size of 1 × 1 and 64 kernels. The feature map I3 passes through the ninety-fourth convolutional layer and the one-hundred-and-sixth activation layer, undergoes a channel stacking operation with the feature map r3, and passes through the ninety-fifth convolutional layer to obtain the pre-extraction feature map. The pre-extraction feature map and the feature map r3 undergo a channel stacking operation and then pass through the third adaptive average pooling layer, the ninety-sixth convolutional layer and the third upsampling layer to obtain one feature map; meanwhile, the pre-extraction feature map passes respectively through the ninety-seventh convolutional layer, the tenth dilated convolutional layer and the one-hundred-and-seventh activation layer, the ninety-eighth convolutional layer, the eleventh dilated convolutional layer and the one-hundred-and-eighth activation layer, and the ninety-ninth convolutional layer, the twelfth dilated convolutional layer and the one-hundred-and-ninth activation layer to obtain three more feature maps, where each dilated convolution is followed by an activation layer. A channel stacking operation is then performed on these four feature maps, and the result passes through the one-hundredth convolutional layer to obtain the final feature map. The number of feature maps output by the third multi-scale feature extraction block is 64, denoted P3, with width W/8 and height H/8.
For the second multi-scale feature extraction block, as shown in FIG. 1, the second multi-scale feature extraction block receives the feature map I2 from the seventh neural network block of the RGB basic network and the feature map r2 from the second reverse attention block. The second multi-scale feature extraction block has one branch fewer than the third multi-scale feature extraction block.
The second multi-scale feature extraction block consists of the one-hundred-and-first convolutional layer, the one-hundred-and-tenth activation layer, the one-hundred-and-second convolutional layer, the one-hundred-and-third convolutional layer, the fourth adaptive average pooling layer, the fourth upsampling layer, the one-hundred-and-fourth convolutional layer, the thirteenth dilated convolutional layer, the one-hundred-and-eleventh activation layer, the one-hundred-and-fifth convolutional layer, the fourteenth dilated convolutional layer, the one-hundred-and-twelfth activation layer and the one-hundred-and-sixth convolutional layer. The one-hundred-and-first convolutional layer has a kernel size of 1 × 1 and 64 kernels, and the one-hundred-and-third convolutional layer has a kernel size of 1 × 1 and 16 kernels. The pooling size of the fourth adaptive average pooling layer is 1; the one-hundred-and-fourth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the thirteenth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 1 and a dilation rate of 1; the one-hundred-and-fifth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the fourteenth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 4 and a dilation rate of 4; and the one-hundred-and-sixth convolutional layer has a kernel size of 1 × 1 and 16 kernels. The feature map I2 passes through the one-hundred-and-first convolutional layer and the one-hundred-and-tenth activation layer, undergoes a channel stacking operation with the feature map r2, and passes through the one-hundred-and-second convolutional layer to obtain the pre-extraction feature map. The pre-extraction feature map and the feature map r2 undergo a channel stacking operation and then pass through the fourth adaptive average pooling layer, the one-hundred-and-third convolutional layer and the fourth upsampling layer to obtain one feature map; meanwhile, the pre-extraction feature map passes respectively through the one-hundred-and-fourth convolutional layer, the thirteenth dilated convolutional layer and the one-hundred-and-eleventh activation layer, and the one-hundred-and-fifth convolutional layer, the fourteenth dilated convolutional layer and the one-hundred-and-twelfth activation layer to obtain two more feature maps, where each dilated convolution is followed by an activation layer. A channel stacking operation is then performed on these three feature maps, and the result passes through the one-hundred-and-sixth convolutional layer to obtain the final feature map. The number of feature maps output by the second multi-scale feature extraction block is 64, denoted P2, with width W/4 and height H/4.
For the first multi-scale feature extraction block, as shown in FIG. 1, the first multi-scale feature extraction block receives the feature map I1 from the sixth neural network block of the RGB basic network and the feature map r1 from the first reverse attention block. The first multi-scale feature extraction block and the second multi-scale feature extraction block have the same structure.
The first multi-scale feature extraction block consists of the one-hundred-and-seventh convolutional layer, the one-hundred-and-thirteenth activation layer, the one-hundred-and-eighth convolutional layer, the one-hundred-and-ninth convolutional layer, the fifth adaptive average pooling layer, the fifth upsampling layer, the one-hundred-and-tenth convolutional layer, the fifteenth dilated convolutional layer, the one-hundred-and-fourteenth activation layer, the one-hundred-and-eleventh convolutional layer, the sixteenth dilated convolutional layer, the one-hundred-and-fifteenth activation layer and the one-hundred-and-twelfth convolutional layer. The one-hundred-and-seventh convolutional layer has a kernel size of 1 × 1 and 64 kernels, the one-hundred-and-eighth convolutional layer has a kernel size of 1 × 1 and 64 kernels, and the one-hundred-and-ninth convolutional layer has a kernel size of 1 × 1 and 16 kernels. The pooling size of the fifth adaptive average pooling layer is 1; the one-hundred-and-tenth convolutional layer has a kernel size of 1 × 1 and 16 kernels; the fifteenth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 1 and a dilation rate of 1; the one-hundred-and-eleventh convolutional layer has a kernel size of 1 × 1 and 16 kernels; the sixteenth dilated convolutional layer has a kernel size of 3 × 3, 16 kernels, a zero-padding parameter of 2 and a dilation rate of 2; and the one-hundred-and-twelfth convolutional layer has a kernel size of 1 × 1 and 16 kernels. The feature map I1 passes through the one-hundred-and-seventh convolutional layer and the one-hundred-and-thirteenth activation layer, undergoes a channel stacking operation with the feature map r1, and passes through the one-hundred-and-eighth convolutional layer to obtain the pre-extraction feature map. The pre-extraction feature map and the feature map r1 undergo a channel stacking operation and then pass through the fifth adaptive average pooling layer, the one-hundred-and-ninth convolutional layer and the fifth upsampling layer to obtain one feature map; meanwhile, the pre-extraction feature map passes respectively through the one-hundred-and-tenth convolutional layer, the fifteenth dilated convolutional layer and the one-hundred-and-fourteenth activation layer, and the one-hundred-and-eleventh convolutional layer, the sixteenth dilated convolutional layer and the one-hundred-and-fifteenth activation layer to obtain two more feature maps, where each dilated convolution is followed by an activation layer. A channel stacking operation is then performed on these three feature maps, and the result passes through the one-hundred-and-twelfth convolutional layer to obtain the final feature map. The number of feature maps output by the first multi-scale feature extraction block is 64, denoted P1, with width W/4 and height H/4.
For the fifth attention block, as shown in FIG. 1, the fifth attention block receives the feature map f5 from the fifth fusion block and the feature map P5 from the fifth multi-scale feature extraction block.
As shown in FIG. 5, the fifth attention block consists of the sixth S-type activation layer, the one-hundred-and-thirteenth convolutional layer, the one-hundred-and-sixteenth activation layer, the one-hundred-and-fourteenth convolutional layer and the one-hundred-and-seventeenth activation layer. The one-hundred-and-thirteenth convolutional layer has a kernel size of 1 × 1 and 64 kernels, and the one-hundred-and-fourteenth convolutional layer has a kernel size of 1 × 1 and 64 kernels. The received feature map P5 is compressed, via a tensor operation through the sixth S-type activation layer, into pixel values arranged by channel; the received feature map f5, after passing through the one-hundred-and-thirteenth convolutional layer and the one-hundred-and-sixteenth activation layer, is multiplied by these pixel values to obtain an intermediate feature map; the intermediate feature map, after passing through the one-hundred-and-fourteenth convolutional layer and the one-hundred-and-seventeenth activation layer, is added to the feature map P5 to obtain the final feature map. The number of feature maps output by the fifth attention block is 64, denoted a5, with width W/16 and height H/16.
The fourth attention block, the third attention block, the second attention block and the first attention block are identical in structure to the fifth attention block. The number of feature maps output by the fourth attention block is 64, denoted a4, with width W/16 and height H/16. The number of feature maps output by the third attention block is 64, denoted a3, with width W/8 and height H/8. The number of feature maps output by the second attention block is 64, denoted a2, with width W/4 and height H/4. The number of feature maps output by the first attention block is 64, denoted a1, with width W/4 and height H/4.
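A minimal PyTorch sketch of this shared attention-block structure is given below. The exact tensor "compression" operation is not fully specified in the text; it is assumed here to be an element-wise sigmoid gate, and the ReLU activations and class name are likewise assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of the attention block: a sigmoid gate from P multiplies the convolved f."""
    def __init__(self, channels=64):
        super().__init__()
        self.sigmoid = nn.Sigmoid()                 # S-shaped activation layer
        self.conv_a = nn.Conv2d(channels, 64, 1)    # 1x1, 64 kernels (e.g. 113th conv layer)
        self.act_a = nn.ReLU(inplace=True)
        self.conv_b = nn.Conv2d(64, 64, 1)          # 1x1, 64 kernels (e.g. 114th conv layer)
        self.act_b = nn.ReLU(inplace=True)

    def forward(self, f, p):
        gate = self.sigmoid(p)                      # pixel values arranged by channel
        mid = self.act_a(self.conv_a(f)) * gate     # intermediate feature map
        return self.act_b(self.conv_b(mid)) + p     # added back to P, e.g. feature map a5
```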
For the fourth reverse attention block, as shown in FIG. 1, the fourth reverse attention block receives the feature map R4 from the fourth neural network block and the feature map a5 from the fifth attention block.
As shown in FIG. 6, the fourth reverse attention block consists of the eleventh S-shaped activation layer, the one hundred and twenty-third convolutional layer, the one hundred and twenty-sixth activation layer, the one hundred and twenty-fourth convolutional layer and the one hundred and twenty-seventh activation layer. The convolution kernels of the one hundred and twenty-third convolutional layer are 1×1 in size and 64 in number; those of the one hundred and twenty-fourth convolutional layer are 1×1 in size and 64 in number. The received feature map a5 is inverted and then compressed by the eleventh S-shaped activation layer, through a tensor operation, into pixel values arranged by channel, the pixel values of the inverted feature map being negated. The received feature map R4, after passing through the one hundred and twenty-third convolutional layer and the one hundred and twenty-sixth activation layer, is multiplied by these pixel values to obtain an intermediate feature map; the intermediate feature map, after passing through the one hundred and twenty-fourth convolutional layer and the one hundred and twenty-seventh activation layer, is added to the feature map a5 to obtain the final feature map. The number of feature maps output by the fourth reverse attention block is 64; the 64 feature maps are denoted r4, with width W/16 and height H/16. They are input to the fourth multi-scale feature extraction block through the sixth up-sampling layer, whose up-sampling factor is 2.
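A minimal PyTorch sketch of the reverse attention block follows. The inversion is assumed to be the common reverse-attention form 1 − sigmoid(·); the text does not spell out the exact tensor operation, and the ReLU activations and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ReverseAttentionBlock(nn.Module):
    """Sketch of the reverse attention block: inverted sigmoid weights gate the convolved R."""
    def __init__(self, channels=64):
        super().__init__()
        self.sigmoid = nn.Sigmoid()                    # S-shaped activation layer
        self.conv_a = nn.Conv2d(channels, 64, 1)       # 1x1, 64 kernels (e.g. 123rd conv layer)
        self.act_a = nn.ReLU(inplace=True)
        self.conv_b = nn.Conv2d(64, 64, 1)             # 1x1, 64 kernels (e.g. 124th conv layer)
        self.act_b = nn.ReLU(inplace=True)

    def forward(self, r_feat, a_feat):
        rev = 1.0 - self.sigmoid(a_feat)               # negated (reverse) attention weights
        mid = self.act_a(self.conv_a(r_feat)) * rev    # intermediate feature map
        return self.act_b(self.conv_b(mid)) + a_feat   # added back to a, e.g. feature map r4
```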
For the third reverse attention block, as shown in FIG. 1, the third reverse attention block receives the feature map R3 from the third neural network block and the feature map a4 from the fourth attention block.
The third reverse attention block consists of the twelfth S-shaped activation layer, the one hundred and twenty-fifth convolutional layer, the one hundred and twenty-eighth activation layer, the one hundred and twenty-sixth convolutional layer and the one hundred and twenty-ninth activation layer. The convolution kernels of the one hundred and twenty-fifth convolutional layer are 1×1 in size and 64 in number; those of the one hundred and twenty-sixth convolutional layer are 1×1 in size and 64 in number. The received feature map a4 is inverted and then compressed by the twelfth S-shaped activation layer, through a tensor operation, into pixel values arranged by channel, the pixel values of the inverted feature map being negated. The received feature map R3, after passing through the one hundred and twenty-fifth convolutional layer and the one hundred and twenty-eighth activation layer, is multiplied by these pixel values to obtain an intermediate feature map; the intermediate feature map, after passing through the one hundred and twenty-sixth convolutional layer and the one hundred and twenty-ninth activation layer, is added to the feature map a4 to obtain the final feature map. The number of feature maps output by the third reverse attention block is 64; the 64 feature maps are denoted r3, with width W/8 and height H/8. They are input to the fourth multi-scale feature extraction block through the seventh up-sampling layer, whose up-sampling factor is 2.
For the second reverse attention block, as shown in FIG. 1, the second reverse attention block receives the feature map R2 from the second neural network block and the feature map a3 from the third attention block. The second reverse attention block consists of the thirteenth S-shaped activation layer, the one hundred and twenty-seventh convolutional layer, the one hundred and thirtieth activation layer, the one hundred and twenty-eighth convolutional layer and the one hundred and thirty-first activation layer. The convolution kernels of the one hundred and twenty-seventh convolutional layer are 1×1 in size and 64 in number; those of the one hundred and twenty-eighth convolutional layer are 1×1 in size and 64 in number. The received feature map a3 is inverted and then compressed by the thirteenth S-shaped activation layer, through a tensor operation, into pixel values arranged by channel, the pixel values of the inverted feature map being negated. The received feature map R2, after passing through the one hundred and twenty-seventh convolutional layer and the one hundred and thirtieth activation layer, is multiplied by these pixel values to obtain an intermediate feature map; the intermediate feature map, after passing through the one hundred and twenty-eighth convolutional layer and the one hundred and thirty-first activation layer, is added to the feature map a3 to obtain the final feature map. The number of feature maps output by the second reverse attention block is 64; the 64 feature maps are denoted r2, with width W/4 and height H/4.
The first reverse attention block receives the feature map R1 from the first neural network block and the feature map a2 from the second attention block. The first reverse attention block consists of the fourteenth S-shaped activation layer, the one hundred and twenty-ninth convolutional layer, the one hundred and thirty-second activation layer, the one hundred and thirtieth convolutional layer and the one hundred and thirty-third activation layer. The convolution kernels of the one hundred and twenty-ninth convolutional layer are 1×1 in size and 64 in number; those of the one hundred and thirtieth convolutional layer are 1×1 in size and 64 in number. The received feature map a2 is inverted and then compressed by the fourteenth S-shaped activation layer, through a tensor operation, into pixel values arranged by channel, the pixel values of the inverted feature map being negated. The received feature map R1, after passing through the one hundred and twenty-ninth convolutional layer and the one hundred and thirty-second activation layer, is multiplied by these pixel values to obtain an intermediate feature map; the intermediate feature map, after passing through the one hundred and thirtieth convolutional layer and the one hundred and thirty-third activation layer, is added to the feature map a2 to obtain the final feature map. The number of feature maps output by the first reverse attention block is 64; the 64 feature maps are denoted r1, with width W/4 and height H/4.
For the output layers: the fifth output layer consists of the one hundred and thirty-first convolutional layer and the eighth up-sampling layer; the convolution kernel of the one hundred and thirty-first convolutional layer is 1×1 in size and 1 in number, the up-sampling factor of the eighth up-sampling layer is 16, and the single feature map output by the fifth output layer is denoted S5. The fourth output layer consists of the one hundred and thirty-second convolutional layer and the ninth up-sampling layer; the convolution kernel of the one hundred and thirty-second convolutional layer is 1×1 in size and 1 in number, the up-sampling factor of the ninth up-sampling layer is 16, and the single feature map output by the fourth output layer is denoted S4. The third output layer consists of the one hundred and thirty-third convolutional layer and the tenth up-sampling layer; the convolution kernel of the one hundred and thirty-third convolutional layer is 1×1 in size and 1 in number, the up-sampling factor of the tenth up-sampling layer is 8, and the single feature map output by the third output layer is denoted S3. The second output layer consists of the one hundred and thirty-fourth convolutional layer and the eleventh up-sampling layer; the convolution kernel of the one hundred and thirty-fourth convolutional layer is 1×1 in size and 1 in number, the up-sampling factor of the eleventh up-sampling layer is 4, and the single feature map output by the second output layer is denoted S2. The first output layer consists of the one hundred and thirty-fifth convolutional layer and the twelfth up-sampling layer; the convolution kernel of the one hundred and thirty-fifth convolutional layer is 1×1 in size and 1 in number, the up-sampling factor of the twelfth up-sampling layer is 4, and the single feature map output by the first output layer is denoted S1. The feature maps S1, S2, S3, S4 and S5 are channel-stacked and input to the sixth output layer, which consists of the one hundred and thirty-sixth convolutional layer; its convolution kernel is 1×1 in size and 1 in number, and its output is the final saliency prediction map.
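A minimal PyTorch sketch of the five side output heads and the fused sixth output layer follows. Bilinear up-sampling, the ordering of the inputs and the class name are assumptions for illustration; only the 1×1 single-kernel convolutions and the up-sampling factors 4, 4, 8, 16, 16 are taken from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputHeads(nn.Module):
    """Sketch of the five output layers plus the channel-stacking fusion layer."""
    def __init__(self, channels=64):
        super().__init__()
        self.heads = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(5)])  # 1x1, 1 kernel each
        self.scales = [4, 4, 8, 16, 16]            # up-sampling factors for S1..S5
        self.fuse = nn.Conv2d(5, 1, 1)             # sixth output layer (136th conv layer)

    def forward(self, feats):
        # feats = [a1, a2, a3, a4, a5] from the attention blocks
        outs = []
        for head, scale, f in zip(self.heads, self.scales, feats):
            s = F.interpolate(head(f), scale_factor=scale, mode='bilinear', align_corners=False)
            outs.append(s)                          # side outputs S1..S5, one channel each
        fused = self.fuse(torch.cat(outs, dim=1))   # final saliency prediction map
        return outs, fused
```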
Step 1_3: input each pair of original scene images in the training set into the convolutional neural network two-class training model for training, obtain the saliency prediction image corresponding to each pair of original scene images in the training set through multi-scale feature refinement and extraction, and perform pre-fusion to obtain the final saliency prediction image;
step 1_4: calculate, using the two-class cross entropy, the loss function values between the final saliency prediction images obtained in step 1_3 and the set of real saliency detection images;
step 1_5: repeat step 1_3 and step 1_4 V times to train the convolutional neural network two-class training model, obtaining Q×V loss function values; find the loss function value with the smallest value among the Q×V loss function values; then take the weight vector and bias term corresponding to that smallest loss function value as the optimal weight vector and optimal bias term of the convolutional neural network two-class training model (the weight vector and bias term are collectively referred to as the weights), and obtain the trained convolutional neural network two-class training model by saving its optimal weights;
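A minimal training-loop sketch for steps 1_3 to 1_5 is given below. It assumes a model that returns the five side outputs and the fused prediction as in the sketch above; the optimizer, learning rate, file name and the use of logits with BCEWithLogitsLoss are illustrative assumptions, since the patent specifies only the binary cross-entropy loss and the rule of keeping the weights with the smallest loss value.

```python
import torch
import torch.nn as nn

def train(model, train_loader, V, lr=1e-4):
    """Sketch of steps 1_3-1_5: BCE loss, repeated V times, best weights kept."""
    criterion = nn.BCEWithLogitsLoss()          # two-class (binary) cross entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss = float('inf')
    for epoch in range(V):                      # repeat steps 1_3 and 1_4 V times
        for rgb, depth, gt in train_loader:     # paired RGB and depth images with labels
            side_outs, fused = model(rgb, depth)
            loss = criterion(fused, gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:         # keep the weights with the smallest loss value
                best_loss = loss.item()
                torch.save(model.state_dict(), 'best_weights.pth')
    return best_loss
```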
The specific steps of the test stage are as follows:
for a pair of scene images to be detected, consisting of an RGB image and a depth image, the test images are preprocessed in the same way as the training images and predicted with the trained convolutional neural network two-class training model; in this embodiment the prediction result is the final saliency prediction image of the sixth output layer. The saliency prediction image corresponding to the pair of scene images to be predicted is thus obtained, and the resulting score data are stored in separate folders as the final detection results.
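A minimal inference sketch for this test stage follows. The weight file name, the single-pair input shape and the sigmoid applied to the fused logits are illustrative assumptions; the patent only states that the trained model is used and that the sixth output layer's prediction is taken as the result.

```python
import torch

def predict(model, rgb, depth, weight_path='best_weights.pth'):
    """Sketch of the test stage: load the best weights and return the fused saliency map."""
    model.load_state_dict(torch.load(weight_path, map_location='cpu'))
    model.eval()
    with torch.no_grad():
        _, fused = model(rgb.unsqueeze(0), depth.unsqueeze(0))  # one RGB/depth pair
        saliency = torch.sigmoid(fused)[0, 0]                   # saliency prediction in [0, 1]
    return saliency
```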
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The code is written in Python on top of the PyTorch library, and the experiments are run on a machine with an Intel i5-7500 processor and an NVIDIA TITAN Xp 12 GB graphics card with CUDA acceleration. To ensure the rigor of the experiment, 1400 pairs from NJU2K and 650 pairs from NLPR were extracted as the training set, with the remainder used as the test sets; the NJU2K and NLPR test sets were evaluated separately. In this experiment, four objective parameters commonly used to evaluate saliency detection methods are adopted as evaluation indexes: the S-measure (structure measure), the F-measure (F-mean), the E-measure (enhanced alignment measure) and the MAE (mean absolute error) are used to evaluate the detection performance of the saliency detection images.
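For reference, a minimal sketch of two of these evaluation indexes (MAE and the F-measure) is shown below. The β² = 0.3 weighting and the 0.5 binarization threshold are values commonly used in saliency evaluation and are assumptions here; the S-measure and E-measure are omitted for brevity.

```python
import torch

def mae(pred, gt):
    # Mean absolute error between the predicted saliency map and the ground truth, both in [0, 1].
    return torch.mean(torch.abs(pred - gt)).item()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    # F-measure with the commonly used beta^2 = 0.3 weighting and a fixed binarization threshold.
    binary = (pred >= threshold).float()
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return ((1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)).item()
```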
The method of the present invention is used to predict each colour real-object image in the NJU2K test set of the real-object image database, obtaining a predicted saliency detection image for each colour real-object image; the precision-recall curves reflecting the saliency detection performance of the method of the present invention are shown in FIG. 10a and FIG. 10b. The tests show that the method achieves excellent saliency detection results, with high object detection accuracy, clear details and sharp boundaries, and adapts well to a variety of scenes.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A multi-scale information enhanced binocular convolutional neural network saliency image detection method is characterized by comprising the following specific steps:
acquiring a training set, wherein the training set comprises N pairs of scene image data and real label images corresponding to the scene image data; wherein the N pairs of scene image data comprise N original RGB images and N depth images;
constructing an end-to-end convolutional neural network two-class training model;
inputting each pair of original scene image data in a training set into the convolutional neural network two-classification training model for training, performing multi-scale feature refinement extraction to obtain a significance prediction image corresponding to each pair of original scene image data in the training set, and performing pre-fusion to obtain a final significance prediction image;
obtaining a loss function value between the final significance prediction image and all real significance detection image sets by adopting a binary-class cross entropy;
and repeatedly calculating a loss function value, and determining the optimal weight vector and the optimal bias term to optimize the weight of the convolutional neural network two-class training model so as to obtain the trained convolutional neural network two-class training model.
2. The method for detecting the saliency image of the multi-scale information-enhanced binocular convolutional neural network of claim 1, wherein the convolutional neural network two-class training model comprises 5 output channels, and its two inputs are an RGB image and a depth image, respectively; the first output channel and the second output channel each comprise a first neural network block, a second neural network block, a fusion block, a multi-scale feature extraction block, an attention block and a reverse attention block; the first neural network block processes the depth image to obtain a first feature map; the second neural network block processes the RGB image to obtain a second feature map; the first feature map passes through the reverse attention block to obtain a third feature map; the inputs of the reverse attention block are the first feature map and the significance prediction image output by the next-stage output channel; the fusion block performs feature fusion on the first feature map and the second feature map to obtain a fifth feature map; the second feature map and the third feature map are used as the input of the multi-scale feature extraction block, and multi-scale feature extraction is carried out to obtain a fourth feature map; the fourth feature map and the fifth feature map are used as the input of the attention block to obtain a significance prediction image, which is output through an output layer;
an up-sampling layer is arranged between the reverse attention block and the multi-scale feature extraction blocks of the third output channel and the fourth output channel;
the multi-scale feature extraction block of the fifth output channel only takes the second feature map as input;
and each output channel obtains a significance prediction image, and the final significance prediction image is obtained by stacking the significance prediction images by channel.
3. The method for detecting the saliency image of the binocular convolutional neural network based on multi-scale information enhancement as claimed in claim 1,
the fusion blocks comprise a convolution layer I, a convolution layer II and an S-shaped activation layer I;
the first characteristic diagram passes through the convolution layer I to obtain a sixth characteristic diagram;
the second characteristic diagram passes through the convolution layer II to obtain a seventh characteristic diagram;
and the sixth feature map, after passing through the S-shaped activation layer I and being multiplied with the seventh feature map by a tensor operation, is added to the seventh feature map; the obtained result is channel-stacked with the sixth feature map to output the fifth feature map.
4. The method of claim 2, wherein the attention blocks each comprise: an S-shaped activation layer A, a convolution layer A, a convolution layer B, an activation layer A and an activation layer B;
compressing the fourth feature map, by the S-shaped activation layer A, into pixel values arranged by channel through a tensor operation; and after passing through the convolution layer A and the activation layer A, the fifth feature map is multiplied by the pixel values to obtain an intermediate feature map A; after passing through the convolution layer B and the activation layer B, the intermediate feature map A is added with the eighth feature map to obtain the significance prediction image and output the significance prediction image.
5. The method of claim 4, wherein the reverse attention blocks each comprise: an S-shaped activation layer a, a convolution layer a, an activation layer a, a convolution layer b and an activation layer b; after the significance prediction image is inverted, it is compressed by the S-shaped activation layer a, through a tensor operation, into pixel values arranged by channel, and the pixel values of the inverted feature map are negated; the first feature map is received and, after passing through the convolution layer a and the activation layer a, is multiplied by the pixel values to obtain an intermediate feature map a; after passing through the convolution layer b and the activation layer b, the intermediate feature map a is added with the significance prediction image to obtain and output the third feature map.
6. The method for detecting the saliency image of the multi-scale information-enhanced binocular convolutional neural network of claim 2, wherein the multi-scale feature extraction blocks comprise: a first multi-scale feature extraction block, a second multi-scale feature extraction block, a third multi-scale feature extraction block, a fourth multi-scale feature extraction block and a fifth multi-scale feature extraction block;
wherein the fifth multi-scale feature extraction block comprises five dilated convolution channels and one sampling channel; the second feature map A passes through the dilated convolution channels and the sampling channel to obtain a fourth feature map A;
the fourth multi-scale feature extraction block comprises four dilated convolution channels and one sampling channel; the third feature map B is convolved and activated to obtain a ninth feature map B, and the ninth feature map B is channel-stacked with the second feature map B and passed through a convolutional layer to obtain a pre-extraction feature map B; after the pre-extraction feature map B is channel-stacked with the second feature map B, a tenth feature map B is obtained through the sampling channel, and meanwhile an eleventh feature map B is obtained by passing the pre-extraction feature map B through the dilated convolution channels; the tenth feature map B and the eleventh feature map B are channel-stacked and convolved to obtain a fourth feature map B;
the third multi-scale feature extraction block comprises three dilated convolution channels and one sampling channel; the third feature map C is convolved and activated to obtain a ninth feature map C, and the ninth feature map C is channel-stacked with the second feature map C and passed through a convolutional layer to obtain a pre-extraction feature map C; after the pre-extraction feature map C is channel-stacked with the second feature map C, a tenth feature map C is obtained through the sampling channel, and meanwhile an eleventh feature map C is obtained by passing the pre-extraction feature map C through the dilated convolution channels; the tenth feature map C and the eleventh feature map C are channel-stacked and convolved to obtain a fourth feature map C;
the second multi-scale feature extraction block comprises two dilated convolution channels and one sampling channel; the third feature map D is convolved and activated to obtain a ninth feature map D, and the ninth feature map D is channel-stacked with the second feature map D and passed through a convolutional layer to obtain a pre-extraction feature map D; after the pre-extraction feature map D is channel-stacked with the second feature map D, a tenth feature map D is obtained through the sampling channel, and meanwhile an eleventh feature map D is obtained by passing the pre-extraction feature map D through the dilated convolution channels; the tenth feature map D and the eleventh feature map D are channel-stacked and convolved to obtain a fourth feature map D;
the first multi-scale feature extraction block and the second multi-scale feature extraction block are identical in structure.
CN202010948669.9A 2020-09-10 2020-09-10 Multi-scale information enhanced binocular convolutional neural network saliency image detection method Pending CN112070753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010948669.9A CN112070753A (en) 2020-09-10 2020-09-10 Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010948669.9A CN112070753A (en) 2020-09-10 2020-09-10 Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Publications (1)

Publication Number Publication Date
CN112070753A true CN112070753A (en) 2020-12-11

Family

ID=73664696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010948669.9A Pending CN112070753A (en) 2020-09-10 2020-09-10 Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Country Status (1)

Country Link
CN (1) CN112070753A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597996A (en) * 2020-12-28 2021-04-02 山西云时代研发创新中心有限公司 Task-driven natural scene-based traffic sign significance detection method
CN112597996B (en) * 2020-12-28 2024-03-29 山西云时代研发创新中心有限公司 Method for detecting traffic sign significance in natural scene based on task driving
CN112906800A (en) * 2021-02-26 2021-06-04 上海大学 Image group self-adaptive collaborative saliency detection method
CN112906800B (en) * 2021-02-26 2022-07-12 上海大学 Image group self-adaptive collaborative saliency detection method
CN113095185A (en) * 2021-03-31 2021-07-09 新疆爱华盈通信息技术有限公司 Facial expression recognition method, device, equipment and storage medium
CN113343822A (en) * 2021-05-31 2021-09-03 合肥工业大学 Light field saliency target detection method based on 3D convolution
CN113343822B (en) * 2021-05-31 2022-08-19 合肥工业大学 Light field saliency target detection method based on 3D convolution
CN113409300A (en) * 2021-07-12 2021-09-17 上海市第一人民医院 New coronary pneumonia data processing system based on artificial intelligence technology
CN114241308A (en) * 2021-12-17 2022-03-25 杭州电子科技大学 Lightweight remote sensing image significance detection method based on compression module
CN114241308B (en) * 2021-12-17 2023-08-04 杭州电子科技大学 Lightweight remote sensing image significance detection method based on compression module
CN114548606A (en) * 2022-04-25 2022-05-27 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Construction method of cyclone strength prediction model and cyclone strength prediction method
CN115311338A (en) * 2022-08-11 2022-11-08 浙江盛发纺织印染有限公司 Intelligent production system and method for lining fabric of military tent

Similar Documents

Publication Publication Date Title
CN112070753A (en) Multi-scale information enhanced binocular convolutional neural network saliency image detection method
CN110263813B (en) Significance detection method based on residual error network and depth information fusion
CN110992238B (en) Digital image tampering blind detection method based on dual-channel network
CN109948692B (en) Computer-generated picture detection method based on multi-color space convolutional neural network and random forest
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN111767883A (en) Title correction method and device
CN113283336A (en) Text recognition method and system
CN112257509A (en) Stereo image single-stream visual saliency detection method based on joint information coding
CN114022506B (en) Image restoration method for edge prior fusion multi-head attention mechanism
CN112149662A (en) Multi-mode fusion significance detection method based on expansion volume block
CN115240259A (en) Face detection method and face detection system based on YOLO deep network in classroom environment
CN112215130B (en) Human behavior identification method based on 2.5D/3D hybrid convolution model
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN112241743A (en) RGBD image saliency detection method for regenerating saliency map into three-stream network
CN116630763A (en) Multi-scale context awareness-based multi-focus image fusion method
CN113743315B (en) Handwriting elementary mathematical formula identification method based on structure enhancement
CN113298814A (en) Indoor scene image processing method based on progressive guidance fusion complementary network
CN114529794A (en) Infrared and visible light image fusion method, system and medium
CN113256603A (en) Salient object detection method based on double-current network secondary fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination