CN110929736A

CN110929736A - Multi-feature cascade RGB-D significance target detection method

Info

Publication number: CN110929736A
Application number: CN201911099871.2A
Authority: CN
Inventors: 周武杰; 潘思佳; 林鑫杨; 黄铿达; 雷景生; 何成; 王海江; 薛林林
Original assignee: Zhejiang University of Science and Technology ZUST
Current assignee: Zhejiang University of Science and Technology ZUST
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2020-03-27
Anticipated expiration: 2039-11-12
Also published as: CN110929736B

Abstract

The invention discloses a method for detecting a characteristic cascade RGB-D significance target. Selecting RGB images, depth images corresponding to the RGB images and real significance images to form a training set, constructing a convolutional neural network which comprises two input layers, a hidden layer and an output layer, inputting the training set into the convolutional neural network for training to obtain significance prediction images corresponding to each RGB image in the training set, calculating loss function values between the significance prediction images corresponding to each RGB image in the training set and the corresponding real significance images, and continuously training weight vectors and bias items corresponding to the loss function values with minimum values; and inputting the RGB image and the depth image to be predicted into the trained convolutional neural network training model to obtain a predicted segmentation image. The model of the invention has novel structure, and the similarity rate of the saliency map and the target map obtained after model processing is high.

Description

Multi-feature cascade RGB-D significance target detection method

Technical Field

The invention relates to a method for detecting a human eye salient target, in particular to a method for detecting a multi-feature cascade RGB-D salient target.

Background

Salient object detection is a branch of image processing and is also an area of computer vision. Computer vision, in a broad sense, is the discipline that imparts natural visual capabilities to machines. Natural visual ability refers to the visual ability of the biological visual system. In fact, computer vision essentially addresses the problem of visual perception. The core problem is to study how to organize the input image information, identify objects and scenes, and further explain the image content.

Computer vision has been the subject of increasing interest and intense research over the last several decades. Computer vision is also increasingly being used to recognize patterns from images. Even with great play in various fields, with the dramatic achievements of artificial intelligence and computer vision technology becoming more and more prevalent in different industries, the future of computer vision seems to be replete with promising and unthinkable results. The detection of the salient objects referred to herein is one of the categories, but plays a great role.

Saliency detection is a method of predicting where a person is located in an image, and has attracted extensive research interest in recent years. It plays an important preprocessing role in image classification, image relocation, target recognition and other problems. Unlike RGB significance detection, RGB significance detection is less studied. According to the definition of significance, significance detection methods can be classified into top-down methods and bottom-up methods. Top-down saliency detection is a task-dependent detection method that incorporates high-level features to locate salient objects. On the other hand, the bottom-up approach is task-free, and it maps regions from a biological perspective using low-level features.

Disclosure of Invention

In order to solve the problems in the background art, the invention provides a multi-feature cascade RGB-D saliency target detection method, a saliency map obtained after model processing has high similarity with a target map, and the model has a novel structure.

The technical scheme adopted by the invention is as follows:

step 1_ 1: selecting Q original RGB images and depth maps corresponding to the original RGB images, and combining real significant images corresponding to the original RGB images to form a training set;

step 1_ 2: constructing a convolutional neural network: the convolutional neural network comprises two input layers, a hidden layer and an output layer, wherein the two input layers are connected to the input end of the hidden layer, and the output end of the hidden layer is connected to the output layer;

step 1_ 3: inputting each original RGB image in the training set and the depth image corresponding to the original RGB image in the training set as original input images of two input layers into a convolutional neural network for training to obtain a prediction significance image corresponding to each original RGB image in the training set; calculating a loss function value between a prediction significance image corresponding to each original RGB image in the training set and a corresponding real significance image, wherein the loss function value is obtained by adopting a BCE (binary coded decimal) loss function;

step 1_ 4: repeatedly executing the step 1_3 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then finding out the loss function value with the minimum value from the Q multiplied by V loss function values; then, the weight vector and the bias item corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias item, and the weight vector and the bias item in the trained convolutional neural network training model are replaced;

step 1_ 5: inputting the RGB image to be predicted and the depth image corresponding to the RGB image to be predicted into a trained convolutional neural network training model, and predicting by using the optimal weight vector and the optimal bias term to obtain a predicted saliency image corresponding to the RGB image to be predicted, thereby realizing saliency target detection.

In the two input layers in the step 1_2, the 1 st input layer is an RGB image input layer, and the 2 nd input layer is a depth image input layer; the hidden layer comprises an RGB (red, green and blue) feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, an SKNet network model and a post-processing module;

the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color up-sampling layers, four attention convolution layers and four color convolution layers which are sequentially connected; the four color map neural network blocks which are connected in sequence respectively correspond to four modules which are connected in sequence in ResNet50, the output of the first color map neural network block is respectively connected to the first RGB branch and the fifth RGB branch, the output of the second color map neural network block is respectively connected to the second RGB branch and the sixth RGB branch, the output of the third color map neural network block is respectively connected to the third RGB branch and the seventh RGB branch, and the output of the fourth color map neural network block is respectively connected to the fourth RGB branch and the eighth RGB branch;

the depth feature extraction module comprises four depth map neural network blocks, four depth attention layers, eight depth upsampling layers, four attention convolutional layers and four depth convolutional layers which are sequentially connected, wherein the four depth map neural network blocks which are sequentially connected respectively correspond to the four modules which are sequentially connected in ResNet50, the output of the first depth map neural network block is respectively connected to a first depth branch and a fifth depth branch, the output of the second depth map neural network block is respectively connected to a second depth branch and a sixth depth branch, the output of the third depth map neural network block is respectively connected to a third depth branch and a seventh depth branch, and the output of the fourth depth map neural network block is respectively connected to the fourth depth branch and the eighth depth branch;

multiplying the outputs of the first RGB branch and the second RGB branch to serve as one input of the low-level feature convolution layer, and multiplying the outputs of the first depth branch and the second depth branch to serve as the other input of the low-level feature convolution layer; multiplying the outputs of the third RGB branch and the fourth RGB branch to be used as one input of the advanced feature convolution layer, and multiplying the outputs of the third depth branch and the fourth depth branch to be used as the other input of the advanced feature convolution layer;

the outputs of the low-level characteristic convolution layer and the high-level characteristic convolution layer are input into the mixed characteristic convolution layer;

the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into a detail information processing module; the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module;

the output of the mixed characteristic convolution layer and the output of the detail information processing module are fused and then serve as one input of the SKNet network model, and the output of the mixed characteristic convolution layer and the output of the global information processing module are fused and then serve as the other input of the SKNet network model;

the post-processing module comprises a first anti-convolution layer and a second anti-convolution layer which are sequentially connected, the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is finally output through the output layer.

The first RGB branch comprises a first color attention layer, a first color up-sampling layer and a first attention convolution layer which are connected in sequence, the second RGB branch comprises a second color attention layer, a second color up-sampling layer and a second attention convolution layer which are connected in sequence, the third RGB branch comprises a third color attention layer, a third color up-sampling layer and a third attention convolution layer which are connected in sequence, and the fourth RGB branch comprises a fourth color attention layer, a fourth color up-sampling layer and a fourth attention convolution layer which are connected in sequence;

the fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;

the first depth branch comprises a first depth attention layer, a first depth up-sampling layer and a fifth attention convolution layer which are connected in sequence, the second depth branch comprises a second depth attention layer, a second depth up-sampling layer and a sixth attention convolution layer which are connected in sequence, the third depth branch comprises a third depth attention layer, a third depth up-sampling layer and a seventh attention convolution layer which are connected in sequence, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth up-sampling layer and an eighth attention convolution layer which are connected in sequence;

the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.

The detail information processing module comprises a first network module and a second over-convolution layer which are sequentially connected, and the input of the detail information processing module is fused with the output of the second over-convolution layer through the output of the first over-convolution layer and then is used as the output of the detail information processing module;

the global information processing module comprises three processing branches, the three processing branches comprise a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused and then serve as the output of the global information processing module.

Each color attention layer and each depth attention layer adopt a CBAM (Convolutional module attention mechanism module), and each color up-sampling layer and each depth up-sampling layer are used for performing up-sampling processing of bilinear interpolation on input features; each of the attention convolution layer, the color convolution layer, the depth convolution layer, the low-level feature convolution layer, the high-level feature convolution layer, and the mixed-feature convolution layer includes one convolution layer; each of said deconvolution layers comprises a deconvolution;

the excessive convolutional layer in the detail information processing module and the global convolutional layer in the global information processing module both comprise one convolutional layer, the first network module in the detail information processing module adopts a Dense block in a DenseNet network, and each global network module in the global information processing module adopts an ASPP module (a cavity space pyramid pool module).

The input end of the RGB image input layer receives an RGB input image, and the input end of the depth image input layer receives a depth image corresponding to the RGB image; the input of the RGB feature extraction module and the input of the depth feature extraction module are respectively the output of an RGB image input layer and a depth image input layer.

Compared with the prior art, the invention has the advantages that:

1) the invention uses ResNet50 to pre-train RGB and depth maps respectively (converting the depth map into three-channel input), extracts the different results of the RGB and depth maps passing through 4 modules in ResNet50 respectively, and performs two different operations on the extracted results, wherein the detail optimization of high-low level features is performed through an attention mechanism, and the high-low level features are fused after a main network is formed and then transmitted into a later model.

2) The method extracts the feature information from the pre-training, divides the image features into high-level and low-level features, extracts the detail features of the image from the high-level and low-level features at the left part of the model, and fuses the high-level features and the low-level features after the operation respectively, so that the method has excellent effect.

3) Two novel modules are designed in the right side of the model of the invention: the first module adopts a module combining convolution and a Dense block, and fully combines the advantages of convolution and DenseNet, so that the detection result of the method is more detailed. The second module utilizes ASPP to enlarge the visual field and then is matched with convolution, so that the global feature collection is facilitated, and the detection result of the method is more comprehensive. And finally, performing overall fusion on the detail features on the left side of the fusion model.

Drawings

Fig. 1 is a block diagram of the overall implementation of the method of the present invention.

Fig. 2a is an original RGB image.

Fig. 2b is the depth image of fig. 2 a.

Fig. 3a is the true saliency detection image of fig. 2 a.

Fig. 3b is a significance prediction image obtained by the invention of fig. 2a and 2 b.

FIG. 4a shows the results of the Recall evaluation of the present invention.

FIG. 4b shows the results of the present invention on ROC.

FIG. 4c is the results of the present invention on MAE.

Detailed Description

The invention is described in further detail below with reference to the accompanying examples.

The overall implementation block diagram of the multi-feature cascade RGB-D significance target detection method provided by the invention is shown in FIG. 1, and the method comprises three processes, namely a training stage, a verification stage and a testing stage.

The specific steps of the training phase process are as follows:

step 1_ 1: selecting Q color real target images, corresponding depth images and saliency images corresponding to each color real target image, forming a training set, and recording the Q-th original object image in the training set as { I }^q(i, j) }, depth image is noted as { D^q(I, j) }, the training set is summed with { I }^q(i, j) } the corresponding true saliency image is noted as

Wherein, the color real target image is an RGB color image, the depth map is a binary gray scale map, Q is a positive integer, Q is more than or equal to 200, if Q is 1588, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, I is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents { I ≦ H^q(I, j) }, H denotes { I }^q(I, j) } e.g. take W512, H512, I^q(I, j) represents { I^qThe coordinate position in (i, j) is (i,j) the pixel value of the pixel point of (a),

to represent

The middle coordinate position is the pixel value of the pixel point of (i, j); here, 1588 images in the training set of the database NJU2000 are directly selected as the color real target image.

Step 1_ 2: constructing a convolutional neural network:

the convolutional neural network comprises an input layer, a hidden layer and an output layer;

the input layers include an RGB image input layer and a depth image input layer. For an RGB image input layer, an input end receives an R channel component, a G channel component and a B channel component of an original input image, and an output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the input end of the input layer is required to receive the original input image with width W and height H. For a depth image input layer, an input end receives a depth image corresponding to an original input image, an output end of the input end outputs the original depth image, two channels are superposed by the original depth image to form a three-channel depth image, and three-channel components are sent to a hidden layer; wherein the input end of the input layer is required to receive the original input image with width W and height H.

The hidden layer comprises an RGB (red, green and blue) feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, an SKNet network model and a post-processing module;

the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color up-sampling layers, four attention convolution layers and four color convolution layers which are sequentially connected. The four color map neural network blocks which are connected in sequence respectively correspond to the four modules which are connected in sequence in the ResNet 50. The 1 st color image neural network block, the 2 nd color image neural network, the 3 rd color image neural network and the 4 th color image neural network are respectively and sequentially corresponding to ResNet50The 4 modules adopt a pre-training method, and pre-training is carried out on the input image by utilizing the network of ResNet50 carried by the pytorech and the weight of the network. Outputting 256 characteristic graphs after passing through the 1 st color image neural network block, and recording a set consisting of the 256 output characteristic graphs as P₁，P₁Wherein each feature map has a width of

Has a height of

Outputting 512 feature maps after passing through the 2 nd color image neural network block, and recording a set of the 512 output feature maps as P₂，P₂Wherein each feature map has a width of

Has a height of

The image is output as 1024 characteristic graphs after passing through a 3 rd color image neural network block, and a set formed by the output 1024 characteristic graphs is marked as P₃，P₃Wherein each feature map has a width of

Has a height of

2048 feature maps are output after passing through a 4 th color image neural network block, and a set formed by the 2048 output feature maps is marked as P₄，P₄Each feature map has a width of

Has a height of

The depth feature extraction module comprises four depth map neural network blocks and four depths which are connected in sequenceAn attention layer, eight depth upsampling layers, four attention convolution layers, and four depth convolution layers. The four depth map neural network blocks connected in sequence respectively correspond to the four modules connected in sequence in ResNet 50. For the 1 st depth image neural network, the 2 nd depth image neural network, the 3 rd depth image neural network and the 4 th depth image neural network, 4 modules in ResNet50 respectively correspond in sequence, a pre-training method is adopted, and the input image is pre-trained by using a network of ResNet50 carried by a pytorch and the weight of the network. The images are output as 256 characteristic graphs after passing through a 1 st depth image neural network block, and a set formed by the output 256 characteristic graphs is recorded as D₁，D₁Wherein each feature map has a width of

Has a height of

Outputting 512 feature maps after passing through a 2 nd depth image neural network block, and recording a set formed by the 512 output feature maps as D₂，D₂Wherein each feature map has a width of

Has a height of

The images are output as 1024 characteristic graphs after passing through a 3 rd depth image neural network block, and a set formed by the output 1024 characteristic graphs is recorded as D₃，D₃Wherein each feature map has a width of

Has a height of

2048 feature maps are output after passing through a 4 th depth image neural network block, and a set formed by the 2048 output feature maps is recorded as D₄，D₄Wherein each feature map has a width of

Has a height of

The output of the first color map neural network block is respectively connected to the first RGB branch and the fifth RGB branch, the output of the second color map neural network block is respectively connected to the second RGB branch and the sixth RGB branch, the output of the third color map neural network block is respectively connected to the third RGB branch and the seventh RGB branch, and the output of the fourth color map neural network block is respectively connected to the fourth RGB branch and the eighth RGB branch. The first RGB branch comprises a first color attention layer, a first color up-sampling layer and a first attention convolution layer which are connected in sequence, the second RGB branch comprises a second color attention layer, a second color up-sampling layer and a second attention convolution layer which are connected in sequence, the third RGB branch comprises a third color attention layer, a third color up-sampling layer and a third attention convolution layer which are connected in sequence, and the fourth RGB branch comprises a fourth color attention layer, a fourth color up-sampling layer and a fourth attention convolution layer which are connected in sequence; the fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;

the output of the first depth map neural network block is connected to the first depth branch and the fifth depth branch respectively, the output of the second depth map neural network block is connected to the second depth branch and the sixth depth branch respectively, the output of the third depth map neural network block is connected to the third depth branch and the seventh depth branch respectively, and the output of the fourth depth map neural network block is connected to the fourth depth branch and the eighth depth branch respectively. The first depth branch comprises a first depth attention layer, a first depth up-sampling layer and a fifth attention convolution layer which are connected in sequence, the second depth branch comprises a second depth attention layer, a second depth up-sampling layer and a sixth attention convolution layer which are connected in sequence, the third depth branch comprises a third depth attention layer, a third depth up-sampling layer and a seventh attention convolution layer which are connected in sequence, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth up-sampling layer and an eighth attention convolution layer which are connected in sequence; the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.

Each color attention layer or depth attention layer is composed of a cbam (volumetric block attention module) module. For color attention layer 1, this operation did not change the map size and number of channels, and was still 256 feature maps. For the 2 nd color attention layer, this operation did not change the map size and number of channels, and was still 512 feature maps. For the 3 rd color attention layer, this operation did not change the map size and the number of channels, and still was 1024 feature maps. For the 4 th color attention layer, this operation did not change the map size and the number of channels, and was still 2048 feature maps. For the 1 st depth attention layer, this operation does not change the map size and the number of channels, and still is 256 feature maps. For the 2 nd depth attention layer, this operation does not change the map size and the number of channels, and is still 512 feature maps. For the 3 rd depth attention layer, this operation does not change the map size and the number of channels, and still is 1024 feature maps. For the 4 th depth attention layer, this operation does not change the map size and the number of channels, and still 2048 feature maps.

Each attention convolutional layer consists of one convolutional layer. For the 1 st attention convolution layer, the convolution kernel size is 3 × 3, the convolutionThe number of kernels is 256, zero padding parameter is 1, step length is 1, 256 characteristic graphs are output, and a set formed by the 256 characteristic graphs is marked as S₁. For the 2 nd attention convolution layer, the size of convolution kernel is 5 × 5, the number of convolution kernels is 256, the zero padding parameter is 2, the step length is 1, the output is 256 feature maps, and the set composed of 256 feature maps is recorded as S₂. For the 3 rd attention convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 256, zero padding parameter is 1, step length is 1, output is 256 feature maps, and the set composed of 256 feature maps is recorded as S₃. For the 4 th attention convolution layer, the size of convolution kernel is 5 × 5, the number of convolution kernels is 256, the zero padding parameter is 2, the step length is 1, the output is 256 feature maps, and the set composed of 256 feature maps is recorded as S₄. For the 5 th attention convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 256, zero padding parameter is 1, step size is 1, output is 256 feature maps, and the set composed of 256 feature maps is marked as G₁. For the 6 th attention convolution layer, the size of convolution kernel is 5 × 5, the number of convolution kernels is 256, the zero padding parameter is 2, the step size is 1, the output is 256 feature maps, and the set composed of 256 feature maps is denoted as G. For the 7 th attention convolution layer, the size of convolution kernels is 3 × 3, the number of convolution kernels is 256, the zero padding parameter is 1, the step length is 1, 256 feature maps are output, and the set consisting of the 256 feature maps is marked as G₃. For the 8 th attention convolution layer, the size of convolution kernel is 5 × 5, the number of convolution kernels is 256, the zero padding parameter is 2, the step length is 1, the output is 256 feature maps, and the set composed of 256 feature maps is marked as G₄。

For the 1 st multiplication operation, S is₁And S₂Multiplying the results to output 256 feature maps, and designating a set of 256 feature maps as S₁S₂As one of the inputs to the low-level feature convolution layer. For the 2 nd multiplication operation, S₃And S₄Multiplying the results to output 256 feature maps, and designating a set of 256 feature maps as S₃S₄As another input to the low-level feature convolution layer. For the 3 rd multiplication operation, G₁And G₂Multiplying and outputting 256 characteristicsIn the figure, a set of 256 feature maps is denoted as G₁G₂As one of the inputs to the advanced feature convolution layer. For the 4 th multiplication operation, G₃And G₄Multiplying, outputting 256 characteristic diagrams, and recording the set of 256 characteristic diagrams as G₃G₄As another input to the advanced feature convolution layer.

Each color convolution layer or depth convolution layer is comprised of one convolution. For the 1 st color convolution layer, the convolution kernel size is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 2 nd color convolution layer, the convolution kernel size is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 3 rd color convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 4 th color convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 1 st depth convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 2 nd depth convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 3 rd depth convolution layer, the size of convolution kernel is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and the output is 512 feature maps. For the 4 th depth convolution layer, the size of convolution kernels is 3 × 3, the number of convolution kernels is 512, the zero padding parameter is 1, the step size is 1, and 512 feature maps are output.

Each color upsampling layer or depth upsampling layer is used for performing an upsampling process of bilinear interpolation on the input features.

For the 1 st color upsampling layer, set the output feature map width to

Has a height of

This operation does not change the number of feature maps. For the 2 nd color upsampling layer, set the output feature map width to

Has a height of

This operation does not change the number of feature maps. For the 3 rd color upsampling layer, set the output feature map width to

Has a height of

This operation does not change the number of feature maps. For the 4 th color upsampling layer, set the output feature map width to

Has a height of

This operation does not change the number of feature maps. For the 1 st depth up-sampling layer, setting the width of the output characteristic diagram to be

Has a height of

This operation does not change the number of feature maps. For the 2 nd depth up-sampling layer, setting the width of the output characteristic diagram to be

Has a height of

This operation does not change the number of feature maps. Setting the width of an output feature map to be 3 rd depth up-sampling layer

Has a height of

This operation does not change the number of feature maps. Setting the width of an output feature map to be

Has a height of

This operation does not change the number of feature maps.

For the 5 th color upsampling layer, set the output feature map width to

Has a height of

The operation does not change the number of the feature maps, 512 feature maps are output, and a set formed by the 512 feature maps is marked as U₁. For the 6 th color upsampling layer, set the output feature map width to

Has a height of

The operation does not change the number of the feature maps, 512 feature maps are output, and a set formed by the 512 feature maps is marked as U₂. Setting the output feature map width to be

Has a height of

The operation does not change the number of the feature maps, 512 feature maps are output, and a set formed by the 512 feature maps is marked as U₃. For the 8 th color upsampling layer, set the output feature map width to

Has a height of

The operation does not change the number of the feature maps, 512 feature maps are output, and a set formed by the 512 feature maps is marked as U₄. Setting the width of an output feature map to be 5 th depth up-sampling layer

Has a height of

This operation does not change the number of feature maps, 512 feature maps are output, and a set of 512 feature maps is denoted as F₁. Setting the width of an output feature map to be 6 th depth up-sampling layer

Has a height of

This operation does not change the number of feature maps, 512 feature maps are output, and a set of 512 feature maps is denoted as F₂. Setting the width of an output feature map to be 7 th depth up-sampling layer

Has a height of

This operation does not change the number of feature maps, 512 feature maps are output, and a set of 512 feature maps is denoted as F₃. Setting the width of an output feature map to be 8 th depth up-sampling layer

Has a height of

This operation does not change the number of feature maps, 512 feature maps are output, and a set of 512 feature maps is denoted as F₄。

The outputs of both the low level feature convolutional layer and the high level feature convolutional layer are input into the hybrid feature convolutional layer. For the 1 st high-level feature convolution layer, the high-level feature convolution layer is composed of convolution, the size of convolution kernels is 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are 1, the step length is 1, and 256 feature maps are output; the high-level features of the RGB map and the depth map are fused. For the 1 st low-level feature convolution layer, the convolution layer is composed of a convolution, the size of convolution kernels is 3 multiplied by 3, the number of the convolution kernels is 256, zero padding parameters are 1, the step length is 1, and 256 feature maps are output; the low-level features of the RGB map and the depth map are fused. For the 1 st mixed feature convolution layer, the convolution layer is composed of a convolution, the convolution kernel size is 3 multiplied by 3, the number of convolution kernels is 256, the zero padding parameter is 1, the step length is 1, the output is 256 feature maps, and the set composed of the 256 feature maps is recorded as X₁。

For the 5 th multiplication operation, U is added₁And U₂The result of addition and F₁And F₂The added results are multiplied, and 512 feature maps are output. For the 6 th multiplication operation, U is added₃And U₄The result of addition and F₃And F₄The added results are multiplied, and 512 feature maps are output. The fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into a detail information processing module; and the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module.

The detail information processing module comprises a first network module and a second over-convolution layer which are sequentially connected, and the input of the detail information processing module is fused with the output of the second over-convolution layer through the output of the first over-convolution layer and then is used as the output of the detail information processing module. For the 1 st network module, the Dense block of the DenseNet network is used. Wherein the parameters are set as: the number of layers is 6, the size is 4, the number of increasing stages is 4, and the output is 536 characteristic graphs. For the 1 st over convolutionThe layer is composed of convolution, the size of convolution kernel is 3 multiplied by 3, the number of convolution kernels is 256, zero filling parameter is 1, step length is 1, output is 256 characteristic graphs, and a set composed of 256 characteristic graphs is recorded as H₁. For the 2 nd over-convolution layer, the over-convolution layer is composed of one convolution, the size of convolution kernel is 3 multiplied by 3, the number of convolution kernels is 256, zero-filling parameter is 1, step length is 1, output is 256 characteristic diagrams, and the set composed of 256 characteristic diagrams is recorded as H₂。

The global information processing module comprises three processing branches, the three processing branches comprise a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused and then serve as the output of the global information processing module. Each global network module is an aspp (advanced Spatial Pyramid farming) module. For the 1 st global network module, the output is 512 feature maps. For the 2 nd global network module, the output is 512 feature maps. For the 3 rd global network module, the output is 512 feature maps. Each global convolutional layer consists of one convolutional layer. For the 1 st global convolutional layer, the size of the convolutional kernel is 3 × 3, the number of the convolutional kernels is 256, the zero padding parameter is 1, the step length is 1, 256 feature maps are output, and a set formed by the 256 feature maps is recorded as E₁. For the 2 nd global convolutional layer, the size of the convolutional kernel is 5 × 5, the number of the convolutional kernels is 256, the zero padding parameter is 2, the step length is 1, the output is 256 feature maps, and the set formed by the 256 feature maps is recorded as E₂. For the 3 rd global convolutional layer, the size of the convolutional kernel is 7 × 7, the number of the convolutional kernels is 256, the zero padding parameter is 3, the step length is 1, the output is 256 feature maps, and the set consisting of the 256 feature maps is recorded as E₃。

The output of the mixed characteristic convolution layer and the output of the detail information processing module are fused and then serve as one input of the SKNet network model, and the output of the mixed characteristic convolution layer and the output of the global information processing module are fused and then serve as the other input of the SKNet network model. For the 1 st SKNet, the SKNet is composed of a Selective Kernel Networks, the SKNet network model has two inputs, the first input is H₁、H₂And X₁Sum, second input E₁、E₂、E₃And X₁And the input parameters are: 256 characteristic diagrams with the width of the characteristic diagram

Has a height of

The output of this operation is still 256 feature maps, and the size of the maps is unchanged.

The post-processing module comprises a first anti-convolution layer and a second anti-convolution layer which are sequentially connected, the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is finally output through the output layer. For the 1 st deconvolution layer, the deconvolution layer is composed of one deconvolution, the convolution kernel size of the deconvolution layer is 2 x 2, the number of convolution kernels is 128, the zero filling parameter is 0, the step length is 2, and the width of each feature map is

Has a height of

And for the 2 nd deconvolution layer, the deconvolution layer is formed by deconvolution, the convolution kernel size of the deconvolution layer is 2 x 2, the number of the convolution kernels is 1, the zero padding parameter is 0, the step length is 2, and the width and the height of each feature map are W and H.

Step 1_ 3: converting the conversion size of each original color real target image in the training set into 224 multiplied by 224 to be used as an original RGB input image, converting the conversion size of the depth image corresponding to each original color real target image in the training set into 224 multiplied by 224 and converting the depth image into a three-channel image to be used as a depth input image, inputting the three-channel image into ResNet50 for pre-training, and inputting the corresponding feature map into a model for training after pre-training. Obtaining a significance detection prediction image corresponding to each color real target image in the training set, and calculating the significance of the color real target image according to the significance detection prediction image^q(i, j) } the set of significance detection prediction maps corresponding to the (i, j) } is recorded as

Step 1_ 4: calculating loss function values between a set formed by a saliency detection prediction image corresponding to each original color real target image in a training set and a set formed by a coding image of a corresponding size processed by a corresponding real saliency detection image, and processing the loss function values

And

the value of the loss function in between is recorded as

Obtained using the BCE loss function.

Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then finding out the loss function value with the minimum value from the Q multiplied by V loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, and correspondingly marking as W^bestAnd b^best(ii) a Where V > 1, in this example, V is 100.

The specific steps of the test phase process of the embodiment are as follows:

step 2_ 1: order to

Representing a color real target image to be saliency detected,

representing a depth image corresponding to a real object to be subjected to saliency detection; wherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' represents

Width of (A), H' represents

The height of (a) of (b),

to represent

The middle coordinate position is the pixel value of the pixel point of (i, j),

to represent

And the middle coordinate position is the pixel value of the pixel point of (i, j).

Step 2_ 2: will be provided with

And R, G and B channel components and transformed

Inputting the three-channel components into a convolutional neural network classification training model and utilizing W^bestAnd b^bestMaking a prediction to obtain

And

corresponding predicted saliency detection image, denoted

Wherein the content of the first and second substances,

to represent

And the pixel value of the pixel point with the middle coordinate position of (i ', j').

To further verify the feasibility and effectiveness of the method of the invention, experiments were conducted on the method of the invention.

And (3) constructing a multi-scale residual convolutional neural network by using a python-based deep learning library Pytrch 4.0.1. The method adopts a real object image database NJU2000 test set to analyze how significant detection effect of real scene images (397 real object images) is obtained by prediction by the method. Here, the detection performance of the predicted significance detection image is evaluated by using 3 common objective parameters of the significance detection method as evaluation indexes, namely, a Precision Recall Curve (Precision Recall Curve), a working characteristic Curve (ROC), and a Mean Absolute Error (MAE).

The method of the invention is utilized to predict each real scene image in the real scene image database NJU2000 test set, and a prediction significance detection image corresponding to each real scene image is obtained.

FIG. 4a reflects the accuracy recall Curve (PR Curve) for significance detection effect of the method of the present invention, with the result Curve being as close to 1 as better.

FIG. 4b reflects the operating characteristic curve (ROC) of significance detection effect of the method of the present invention, with the result curve being as close to 1 as possible.

Fig. 4c reflects the Mean Absolute Error (MAE) of significance detection by the method of the present invention, with lower MAE results indicating better detection.

As can be seen from the figure, the significance detection result of the real scene image obtained by the method is very good, which shows that the method for obtaining the prediction significance detection image corresponding to the real scene image is feasible and effective.

Claims

1. A multi-feature cascade RGB-D significance target detection method is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the step 1_2 includes two input layers, the 1 st input layer is an RGB image input layer, and the 2 nd input layer is a depth image input layer; the hidden layer comprises an RGB (red, green and blue) feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, an SKNet network model and a post-processing module;

3. The multi-feature concatenated RGB-D saliency target detection method of claim 2,

4. The multi-feature cascade RGB-D significance target detection method according to claim 2, wherein the detail information processing module comprises a first network module and a second over-convolution layer which are connected in sequence, and the input of the detail information processing module is fused with the output of the first over-convolution layer and the output of the second over-convolution layer to be used as the output of the detail information processing module;

5. The multi-feature cascade RGB-D saliency target detection method of claim 3, characterized in that each said color attention layer and depth attention layer uses CBAM module, each said color up-sampling layer and depth up-sampling layer are used for up-sampling processing of bilinear interpolation of input features; each of the attention convolution layer, the color convolution layer, the depth convolution layer, the low-level feature convolution layer, the high-level feature convolution layer, and the mixed-feature convolution layer includes one convolution layer; each of said deconvolution layers comprises a deconvolution.

6. The method as claimed in claim 4, wherein the hyper-convolutional layer in the detail information processing module and the global convolutional layer in the global information processing module each comprise a convolutional layer, the first network module in the detail information processing module uses a Dense block in a DenseNet network, and each global network module in the global information processing module uses an ASPP module.

7. The method as claimed in claim 2, wherein the input end of the RGB image input layer receives RGB input images, and the input end of the depth image input layer receives depth images corresponding to the RGB images; the input of the RGB feature extraction module and the input of the depth feature extraction module are respectively the output of an RGB image input layer and a depth image input layer.