CN110929736B - Multi-feature cascading RGB-D saliency target detection method


Info

Publication number
CN110929736B
Authority
CN
China
Prior art keywords
depth
layer
rgb
branch
color
Prior art date
Legal status
Active
Application number
CN201911099871.2A
Other languages
Chinese (zh)
Other versions
CN110929736A (en)
Inventor
周武杰
潘思佳
林鑫杨
黄铿达
雷景生
何成
王海江
薛林林
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN201911099871.2A
Publication of CN110929736A
Application granted
Publication of CN110929736B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-feature cascading RGB-D saliency target detection method. RGB images, their corresponding depth images and their true saliency images are selected to form a training set, a convolutional neural network is constructed, and the training set is input into the convolutional neural network for training to obtain a predicted saliency image for each RGB image in the training set. The loss function value between the predicted saliency image of each RGB image in the training set and the corresponding true saliency image is calculated, and the weight vector and bias term corresponding to the smallest loss function value obtained during training are retained. The RGB image and depth image to be predicted are then input into the trained convolutional neural network model to obtain the predicted segmentation image. The disclosed model has a novel structure, and the saliency map obtained after model processing has high similarity to the target map.

Description

Multi-feature cascading RGB-D saliency target detection method
Technical Field
The invention relates to a human eye saliency target detection method, in particular to a multi-feature cascading RGB-D saliency target detection method.
Background
Saliency target detection is a branch of image processing and is also a field of computer vision. Computer vision, in a broad sense, is the discipline that imparts natural vision capabilities to machines. Natural vision ability refers to the visual ability that the biological vision system embodies. In fact, computer vision is essentially the research of visual perception problems. The key problem is to study how to organize the input image information, identify objects and scenes, and further explain the image content.
Computer vision has attracted increasing interest and rigorous research in recent decades, and has become steadily better at recognizing patterns in images. As the striking achievements of artificial intelligence and computer vision become more common across industries, the future of the field appears full of promising and even unforeseeable results. Salient object detection, the topic of this invention, is one branch of this field, but it plays a significant role.
Saliency detection aims to predict where human observers look in an image, and has attracted extensive research interest in recent years. It plays an important role as a preprocessing step in problems such as image classification, image retargeting and object recognition. Compared with RGB saliency detection, RGB-D saliency detection is much less studied. According to how saliency is defined, saliency detection methods can be divided into top-down and bottom-up methods. Top-down saliency detection is task-dependent and incorporates high-level features to locate salient objects. Bottom-up methods, on the other hand, are data-driven, using low-level features to compute saliency maps from a biological perspective.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a multi-feature cascading RGB-D saliency target detection method; the saliency map obtained after model processing has high similarity to the target map, and the model structure is novel.
The technical solution adopted by the invention comprises the following steps:
step 1_1: q original RGB images and corresponding depth maps thereof are selected, and a training set is formed by combining the true significance images corresponding to the original RGB images;
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises two input layers, a hidden layer and an output layer, wherein the two input layers are connected to the input end of the hidden layer, and the output end of the hidden layer is connected to the output layer;
step 1_3: each original RGB image and the corresponding depth image in the training set are respectively used as the original input images of the two input layers and are input into a convolutional neural network for training, so that a prediction significance image corresponding to each original RGB image in the training set is obtained; calculating a loss function value between a predicted saliency image corresponding to each original RGB image in the training set and a corresponding real saliency image, wherein the loss function value is obtained by adopting a BCE loss function;
Step 1_4: repeating the step 1_3 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term, and the weight vector and the bias term in the trained convolutional neural network training model are replaced;
step 1_5: inputting the RGB image to be predicted and the depth image corresponding to the RGB image to be predicted into a trained convolutional neural network training model, and predicting by utilizing an optimal weight vector and an optimal bias term to obtain a predicted saliency image corresponding to the RGB image to be predicted, thereby realizing saliency target detection.
Two input layers in the step 1_2, wherein the 1 st input layer is an RGB image input layer, and the 2 nd input layer is a depth image input layer; the hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers which are connected in sequence; the four sequentially connected color map neural network blocks correspond to the four sequentially connected modules in the ResNet50 respectively, the output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch respectively;
The depth feature extraction module comprises four depth map neural network blocks, four depth attention layers, eight depth up-sampling layers, four attention convolution layers and four depth convolution layers which are sequentially connected, wherein the four depth map neural network blocks are respectively corresponding to the four modules which are sequentially connected in the ResNet50, the output of the first depth map neural network block is respectively connected to the first depth branch and the fifth depth branch, the output of the second depth map neural network block is respectively connected to the second depth branch and the sixth depth branch, the output of the third depth map neural network block is respectively connected to the third depth branch and the seventh depth branch, and the output of the fourth depth map neural network block is respectively connected to the fourth depth branch and the eighth depth branch;
the outputs of the first RGB branch and the second RGB branch are multiplied to be used as one input of the low-level characteristic convolution layer, and the outputs of the first depth branch and the second depth branch are multiplied to be used as the other input of the low-level characteristic convolution layer; the outputs of the third RGB branch and the fourth RGB branch are multiplied to be used as one input of the advanced characteristic convolution layer, and the outputs of the third depth branch and the fourth depth branch are multiplied to be used as the other input of the advanced characteristic convolution layer;
The outputs of the low-level characteristic convolution layer and the high-level characteristic convolution layer are input into the mixed characteristic convolution layer;
the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into a detail information processing module; the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module;
the output of the mixed characteristic convolution layer and the output of the detail information processing module are fused and then used as one input of the SKNet network model, and the output of the mixed characteristic convolution layer and the output of the global information processing module are fused and then used as the other input of the SKNet network model;
the post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected, wherein the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is finally output through the output layer.
The first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected;
The fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
the first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected;
the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
The detail information processing module comprises a first network module and a second transition convolution layer which are sequentially connected; the input of the detail information processing module also passes through a first transition convolution layer, and the output of the first transition convolution layer is fused with the output of the second transition convolution layer to form the output of the detail information processing module;
the global information processing module comprises three processing branches, wherein the three processing branches comprise a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused and then used as the output of the global information processing module.
Each color attention layer and each depth attention layer adopts a CBAM module (Convolutional Block Attention Module), and each color upsampling layer and each depth upsampling layer performs bilinear-interpolation upsampling of the input features; each attention convolution layer, color convolution layer, depth convolution layer, low-level feature convolution layer, high-level feature convolution layer and mixed feature convolution layer comprises one convolution layer; each of said deconvolution layers comprises one deconvolution;
the transition convolution layers in the detail information processing module and the global convolution layers in the global information processing module each comprise one convolution layer; the first network module in the detail information processing module adopts a Dense block of the DenseNet network, and each global network module in the global information processing module adopts an ASPP module (Atrous Spatial Pyramid Pooling module).
The input end of the RGB image input layer receives an RGB input image, and the input end of the depth image input layer receives a depth image corresponding to the RGB image; the input of the RGB feature extraction module and the depth feature extraction module is the output of an RGB image input layer and a depth image input layer respectively.
Compared with the prior art, the invention has the advantages that:
1) The invention uses ResNet50 to separately pre-train the RGB image and the depth image (the depth image is converted into a three-channel input), extracts the different outputs of the RGB image and the depth image from the 4 modules in ResNet50, and performs two different operations on the extracted results: first, detail optimization of the high- and low-level features through an attention mechanism; second, fusion of the high- and low-level features into a main branch, which is then passed into the later part of the model.
2) The invention extracts feature information from the pre-training and divides the image features into high-level and low-level features. In the left part of the model, detail features of the image are extracted from the high- and low-level features, and the high-level and low-level features are fused after their respective operations; this scheme works very well.
3) Two novel modules are designed in the right side of the model. The first module combines convolution with a Dense block, fully exploiting the advantages of convolution and DenseNet, so that the detection result of the method is finer. The second module uses ASPP to enlarge the receptive field and then pairs it with convolution, which benefits the collection of global features, so that the detection result of the method is more comprehensive. Finally, these are fused with the detail features from the left side of the model in an overall fusion.
Drawings
Fig. 1 is a block diagram of a general implementation of the method of the present invention.
Fig. 2a is an original RGB image.
Fig. 2b is the depth image of fig. 2 a.
Fig. 3a is a true saliency detection image of fig. 2 a.
Fig. 3b is a predicted saliency image of fig. 2a and 2b according to the present invention.
FIG. 4a shows the results of the present invention on the Precision-Recall (PR) evaluation.
Fig. 4b shows the results of the present invention on ROC.
FIG. 4c shows the results of the present invention on MAE.
Detailed Description
The invention is described in further detail below with reference to the embodiments of the drawings.
The invention provides a multi-feature cascading RGB-D significance target detection method, the general implementation block diagram of which is shown in figure 1, and the method comprises three processes of a training stage, a verification stage and a testing stage.
The training phase process comprises the following specific steps:
Step 1_1: Q color real target images and their corresponding depth images are selected, and together with the true saliency image corresponding to each color real target image they form a training set. The q-th original RGB image in the training set is denoted {I_q(i,j)}, its depth image is denoted {D_q(i,j)}, and the true saliency image corresponding to {I_q(i,j)} in the training set is denoted {G_q(i,j)}; the color real target images are RGB color images and the depth images are gray-scale images; Q is a positive integer with Q ≥ 200, here Q = 1588; q is a positive integer with 1 ≤ q ≤ Q; 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {I_q(i,j)} and H denotes its height, e.g. W = 512 and H = 512; I_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q(i,j)}, and G_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {G_q(i,j)}. Here, the color real target images are taken directly from the 1588 images of the NJU2000 training set.
Step 1_2: constructing a convolutional neural network:
the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the input layers include an RGB image input layer and a depth image input layer. For an RGB image input layer, an input end receives an R channel component, a G channel component and a B channel component of an original input image, and an output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the width of the original input image received by the input end of the required input layer is W, and the height is H. For a depth image input layer, an input end receives a depth image corresponding to an original input image, an output end of the input end outputs the original depth image, the original depth image is changed into a three-channel depth image through superposition of two channels, and three-channel components are given to a hidden layer; wherein the width of the original input image received by the input end of the required input layer is W, and the height is H.
The hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers which are connected in sequence. The four color map neural network blocks connected in sequence correspond to the four modules connected in sequence in the ResNet50 respectively. For the 1 st color image neural network block, the 2 nd color image neural network, the 3 rd color image neural network and the 4 th color image neural network, 4 modules in the ResNet50 are respectively corresponding in sequence, a pretraining method is adopted, and the input image is pretrained by utilizing the network of the ResNet50 of the pytorch and the weight thereof. The output is 256 feature images after passing through the 1 st color image neural network block, and the set formed by the 256 feature images is marked as P 1 ,P 1 The width of each characteristic diagram is
Figure BDA0002269501490000061
Height is +.>
Figure BDA0002269501490000062
The obtained image is output as 512 feature images after passing through the 2 nd color image neural network block, and the set formed by the output 512 feature images is marked as P 2 ,P 2 The width of each feature map in (a) is +.>
Figure BDA0002269501490000063
Height is +.>
Figure BDA0002269501490000064
The output is 1024 feature images after passing through the 3 rd color image neural network block, and the set formed by the 1024 feature images is marked as P 3 ,P 3 The width of each feature map in (a) is +.>
Figure BDA0002269501490000065
Height is +.>
Figure BDA0002269501490000066
The result is output as 2048 feature images after passing through the 4 th color image neural network block, and the set formed by the 2048 feature images is marked as P 4 ,P 4 The width of each feature map is +.>
Figure BDA0002269501490000067
Height is +.>
Figure BDA0002269501490000068
The depth feature extraction module comprises four depth map neural network blocks, four depth attention layers, eight depth upsampling layers, four attention convolution layers and four depth convolution layers which are connected in sequence. The four sequentially connected depth map neural network blocks correspond to the four sequentially connected modules in ResNet50. The 1st, 2nd, 3rd and 4th depth map neural network blocks correspond in order to the 4 modules of ResNet50; a pre-training method is adopted, and the input image is pre-trained using the pytorch ResNet50 network and its weights. After the 1st depth map neural network block the output is 256 feature maps, whose set is denoted D1; each feature map in D1 has width W/4 and height H/4. After the 2nd depth map neural network block the output is 512 feature maps, whose set is denoted D2; each feature map in D2 has width W/8 and height H/8. After the 3rd depth map neural network block the output is 1024 feature maps, whose set is denoted D3; each feature map in D3 has width W/16 and height H/16. After the 4th depth map neural network block the output is 2048 feature maps, whose set is denoted D4; each feature map in D4 has width W/32 and height H/32.
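A sketch of how the four block outputs P1–P4 (and likewise D1–D4 for the depth branch) could be extracted with torchvision's ResNet50; the layer1–layer4 grouping is torchvision's standard naming, and the channel counts and resolutions in the comments match the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class ResNet50Backbone(nn.Module):
    """Extracts the outputs of the four ResNet50 blocks used as P1..P4 / D1..D4."""
    def __init__(self, pretrained=True):
        super().__init__()
        net = models.resnet50(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.block1 = net.layer1   # 256 channels,  1/4 resolution
        self.block2 = net.layer2   # 512 channels,  1/8 resolution
        self.block3 = net.layer3   # 1024 channels, 1/16 resolution
        self.block4 = net.layer4   # 2048 channels, 1/32 resolution

    def forward(self, x):
        x = self.stem(x)
        p1 = self.block1(x)
        p2 = self.block2(p1)
        p3 = self.block3(p2)
        p4 = self.block4(p3)
        return p1, p2, p3, p4

backbone = ResNet50Backbone(pretrained=False)  # set True to load the ImageNet weights
feats = backbone(torch.rand(1, 3, 224, 224))
print([f.shape for f in feats])
```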
The output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch, respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch, respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch, respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch, respectively. The first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected; the fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
The output of the first depth map neural network block is connected to the first depth branch and the fifth depth branch respectively, the output of the second depth map neural network block is connected to the second depth branch and the sixth depth branch respectively, the output of the third depth map neural network block is connected to the third depth branch and the seventh depth branch respectively, and the output of the fourth depth map neural network block is connected to the fourth depth branch and the eighth depth branch respectively. The first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected; the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
Each color attention layer or depth attention layer is composed of one CBAM (Convolutional Block Attention Module) module. For the 1 st color attention layer, this operation does not change the graph size and channel number, and is still 256 feature graphs. For the 2 nd color attention layer, this operation did not change the graph size and channel number, still 512 feature graphs. For the 3 rd color attention layer, this operation does not change the graph size and channel number, still 1024 feature graphs. For the 4 th color attention layer, this operation does not change the graph size and channel number, still 2048 feature graphs. For the 1 st depth attention layer, this operation does not change the graph size and channel number, and is still 256 feature graphs. For the 2 nd depth attention layer, this operation did not change the graph size and channel number, still 512 feature graphs. For the 3 rd depth attention layer, this operation does not change the graph size and channel number, still 1024 feature graphs. For the 4 th depth attention layer, this operation does not change the graph size and channel number, still 2048 feature graphs.
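A compact sketch of one such attention layer following the published CBAM design; the reduction ratio of 16 and the 7×7 spatial kernel are common CBAM defaults rather than values stated in this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention. The output keeps the input size and channel count,
    as stated above for every color/depth attention layer."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)))
        return x * sa

attn = CBAM(256)
print(attn(torch.rand(1, 256, 56, 56)).shape)  # unchanged: [1, 256, 56, 56]
```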
Each attention convolution layer is composed of one convolution layer. The 1st, 3rd, 5th and 7th attention convolution layers use 3×3 convolution kernels with a zero-padding parameter of 1; the 2nd, 4th, 6th and 8th attention convolution layers use 5×5 convolution kernels with a zero-padding parameter of 2; every attention convolution layer has 256 convolution kernels and a stride of 1 and outputs 256 feature maps. The sets of 256 feature maps output by the 1st to 4th attention convolution layers are denoted S1, S2, S3 and S4 respectively, and the sets output by the 5th to 8th attention convolution layers are denoted G1, G2, G3 and G4 respectively.
For the 1st multiplication operation, S1 and S2 are multiplied, outputting 256 feature maps whose set is denoted S1S2 and used as one input of the low-level feature convolution layer. For the 2nd multiplication operation, S3 and S4 are multiplied, outputting 256 feature maps whose set is denoted S3S4 and used as one input of the high-level feature convolution layer. For the 3rd multiplication operation, G1 and G2 are multiplied, outputting 256 feature maps whose set is denoted G1G2 and used as the other input of the low-level feature convolution layer. For the 4th multiplication operation, G3 and G4 are multiplied, outputting 256 feature maps whose set is denoted G3G4 and used as the other input of the high-level feature convolution layer.
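A sketch of the low-level fusion path under the parameters above; the input channel counts follow the corresponding backbone blocks, the feature tensors are placeholders at an assumed W/4 × H/4 working size, and summing the two products before the low-level feature convolution layer is an assumption, since only the fact that both serve as inputs is stated.

```python
import torch
import torch.nn as nn

# Attention convolution layers with the parameters listed above: 3x3 kernels
# with zero padding 1 and 5x5 kernels with zero padding 2, 256 kernels, stride 1.
conv1 = nn.Conv2d(256, 256, 3, padding=1)   # 1st attention conv -> S1
conv2 = nn.Conv2d(512, 256, 5, padding=2)   # 2nd attention conv -> S2
conv5 = nn.Conv2d(256, 256, 3, padding=1)   # 5th attention conv -> G1
conv6 = nn.Conv2d(512, 256, 5, padding=2)   # 6th attention conv -> G2
low_level_conv = nn.Conv2d(256, 256, 3, padding=1)

# Placeholder branch features (after attention and upsampling) at W/4 x H/4.
r1, r2 = torch.rand(1, 256, 56, 56), torch.rand(1, 512, 56, 56)   # RGB branches 1, 2
d1, d2 = torch.rand(1, 256, 56, 56), torch.rand(1, 512, 56, 56)   # depth branches 1, 2

s1s2 = conv1(r1) * conv2(r2)   # 1st multiplication: S1 x S2 (RGB low-level product)
g1g2 = conv5(d1) * conv6(d2)   # 3rd multiplication: G1 x G2 (depth low-level product)

# Both products feed the low-level feature convolution layer; summing them
# first is an assumption of this sketch.
low = low_level_conv(s1s2 + g1g2)
print(low.shape)   # torch.Size([1, 256, 56, 56])
```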
Each color convolution layer or depth convolution layer is composed of one convolution. For the 1 st color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 2 nd color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 3 rd color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 4 th color convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 1 st depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 2 nd depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 3 rd depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output. For the 4 th depth convolution layer, the convolution kernel size is 3×3, the number of convolution kernels is 512, the zero padding parameter is 1, the step length is 1, and 512 feature images are output.
Each color upsampling layer or depth upsampling layer is used for upsampling processing of bilinear interpolation of the input features.
For each of the 1st, 2nd, 3rd and 4th color upsampling layers and the 1st, 2nd, 3rd and 4th depth upsampling layers, the width of the output feature map is set to W/4 and the height to H/4; these operations do not change the number of feature maps.
For the 5th, 6th, 7th and 8th color upsampling layers, the width of the output feature map is likewise set to W/4 and the height to H/4; the number of feature maps is not changed, so each outputs 512 feature maps, and the sets formed by these 512 feature maps are denoted U1, U2, U3 and U4 respectively.
For the 5th, 6th, 7th and 8th depth upsampling layers, the width of the output feature map is set to W/4 and the height to H/4; the number of feature maps is not changed, so each outputs 512 feature maps, and the sets formed by these 512 feature maps are denoted F1, F2, F3 and F4 respectively.
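A one-line sketch of the bilinear upsampling performed by every color and depth upsampling layer; the concrete sizes assume the 224×224 training input, so the common W/4 × H/4 size is 56×56.

```python
import torch
import torch.nn.functional as F

# Bilinear upsampling of a backbone feature map to the common working size.
p2 = torch.rand(1, 512, 28, 28)   # e.g. output of the 2nd color map neural network block
u = F.interpolate(p2, size=(56, 56), mode='bilinear', align_corners=False)
print(u.shape)   # torch.Size([1, 512, 56, 56]) -- the channel count is unchanged
```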
The outputs of both the low-level feature convolution layer and the high-level feature convolution layer are input into the mixed feature convolution layer. The 1st high-level feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps; it fuses the high-level features of the RGB map and the depth map. The 1st low-level feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps; it fuses the low-level features of the RGB map and the depth map. The 1st mixed feature convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted X1.
For the 5th multiplication operation, the result of adding U1 and U2 is multiplied by the result of adding F1 and F2, outputting 512 feature maps. For the 6th multiplication operation, the result of adding U3 and U4 is multiplied by the result of adding F3 and F4, outputting 512 feature maps. In this way the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into the detail information processing module, and the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module.
The detail information processing module comprises a first network module and a second transition convolution layer which are sequentially connected; the input of the detail information processing module also passes through a first transition convolution layer, whose output is fused with the output of the second transition convolution layer to form the output of the detail information processing module. The 1st network module uses a Dense block of the DenseNet network, with the following parameter settings: 6 layers, a bottleneck size of 4 and a growth rate of 4, outputting 536 feature maps. The 1st transition convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted H1. The 2nd transition convolution layer consists of one convolution with a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted H2.
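A sketch of this module under the parameters above (512 input channels, a 6-layer dense block with growth rate 4 giving 536 maps, and two 256-kernel transition convolutions); the simplified dense layer (BN–ReLU–3×3 convolution, without the DenseNet bottleneck) and returning H1 and H2 separately for the later summation are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """DenseNet-style dense block: each layer sees the concatenation of all
    previous feature maps (6 layers, growth 4: 512 -> 536 channels)."""
    def __init__(self, in_ch=512, growth=4, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False)))
            ch += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)
        return x                      # 536 channels for the defaults above

class DetailModule(nn.Module):
    """Detail information processing module: a convolutional skip path (H1)
    in parallel with dense block + transition convolution (H2)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.skip_conv = nn.Conv2d(in_ch, 256, 3, padding=1)           # 1st transition conv -> H1
        self.dense = DenseBlock(in_ch)
        self.trans_conv = nn.Conv2d(in_ch + 6 * 4, 256, 3, padding=1)  # 2nd transition conv -> H2

    def forward(self, x):
        h1 = self.skip_conv(x)
        h2 = self.trans_conv(self.dense(x))
        return h1, h2   # both are later summed with the mixed-feature output X1

m = DetailModule()
h1, h2 = m(torch.rand(1, 512, 56, 56))
print(h1.shape, h2.shape)
```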
The global information processing module comprises three processing branches; each processing branch comprises a global network module and a global convolution layer which are sequentially connected, and the outputs of the three processing branches are fused as the output of the global information processing module. Each global network module adopts an ASPP (Atrous Spatial Pyramid Pooling) module; the 1st, 2nd and 3rd global network modules each output 512 feature maps. Each global convolution layer consists of one convolution layer. The 1st global convolution layer has a 3×3 kernel, 256 convolution kernels, a zero-padding parameter of 1 and a stride of 1, and outputs 256 feature maps, whose set is denoted E1. The 2nd global convolution layer has a 5×5 kernel, 256 convolution kernels, a zero-padding parameter of 2 and a stride of 1, and outputs 256 feature maps, whose set is denoted E2. The 3rd global convolution layer has a 7×7 kernel, 256 convolution kernels, a zero-padding parameter of 3 and a stride of 1, and outputs 256 feature maps, whose set is denoted E3.
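A sketch of the three-branch global module; the ASPP implementation is simplified and its dilation rates are illustrative, while the 3×3/5×5/7×7 global convolutions follow the parameters above.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Simplified atrous spatial pyramid pooling: parallel dilated convolutions
    whose outputs are concatenated and projected back to `out_ch` channels.
    The dilation rates are illustrative, not taken from the patent."""
    def __init__(self, in_ch=512, out_ch=512, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class GlobalModule(nn.Module):
    """Global information processing module: three ASPP + convolution branches
    with 3x3, 5x5 and 7x7 kernels producing E1, E2, E3."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(ASPP(in_ch), nn.Conv2d(512, 256, k, padding=k // 2))
            for k in (3, 5, 7)])

    def forward(self, x):
        e1, e2, e3 = [b(x) for b in self.branches]
        return e1, e2, e3   # later summed with the mixed-feature output X1

g = GlobalModule()
print([o.shape for o in g(torch.rand(1, 512, 56, 56))])
```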
The output of the mixed feature convolution layer and the output of the detail information processing module are fused as one input of the SKNet network model, and the output of the mixed feature convolution layer and the output of the global information processing module are fused as the other input of the SKNet network model. The 1st SKNet consists of one Selective Kernel Network; it has two inputs, the first being the sum of H1, H2 and X1 and the second being the sum of E1, E2, E3 and X1. Each input consists of 256 feature maps of width W/4 and height H/4; the operation outputs 256 feature maps of unchanged size.
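A sketch of a selective-kernel-style fusion of the two 256-channel inputs, after Li et al.'s Selective Kernel Networks; the reduction ratio is an assumed default, and the exact internal form of the SKNet used here is not spelled out above.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """Selective-kernel-style fusion of two equally shaped feature tensors:
    channel attention weights computed from the summed inputs softly select
    between the two branches; size and channel count are unchanged."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        d = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        self.fcs = nn.ModuleList([nn.Linear(d, channels) for _ in range(2)])
        self.softmax = nn.Softmax(dim=0)

    def forward(self, a, b):
        u = a + b                                        # fuse the two inputs
        z = self.squeeze(u.mean(dim=(2, 3)))             # global average pooling -> compact descriptor
        att = torch.stack([fc(z) for fc in self.fcs])    # (2, B, C) branch logits
        att = self.softmax(att).unsqueeze(-1).unsqueeze(-1)
        return att[0] * a + att[1] * b                   # softly selected combination

sk = SKFusion(256)
out = sk(torch.rand(1, 256, 56, 56), torch.rand(1, 256, 56, 56))
print(out.shape)  # unchanged size and channel count, as stated above
```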
The post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected; its input is the output of the SKNet network model and its output is finally produced by the output layer. The 1st deconvolution layer consists of one deconvolution with a 2×2 kernel, 128 convolution kernels, a zero-padding parameter of 0 and a stride of 2; each output feature map has width W/2 and height H/2. The 2nd deconvolution layer consists of one deconvolution with a 2×2 kernel, 1 convolution kernel, a zero-padding parameter of 0 and a stride of 2; each output feature map has width W and height H.
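A sketch of the post-processing module with the deconvolution parameters listed above; the 56×56 input size again assumes the 224×224 training input.

```python
import torch
import torch.nn as nn

# Post-processing module: two transposed convolutions that restore the full
# W x H resolution (2x2 kernels, stride 2, no zero padding; 128 kernels, then 1).
post = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2, padding=0),  # -> W/2 x H/2
    nn.ConvTranspose2d(128, 1, kernel_size=2, stride=2, padding=0),    # -> W x H, one saliency channel
)

x = torch.rand(1, 256, 56, 56)   # SKNet output at W/4 x H/4
print(post(x).shape)             # torch.Size([1, 1, 224, 224])
```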
Step 1_3: each original color real target image in the training set is resized to 224×224 and used as the original RGB input image; the depth image corresponding to each original color real target image in the training set is resized to 224×224 and converted into a three-channel image to be used as the depth input image. These inputs are passed through ResNet50 for pre-training, and the resulting feature maps are then fed into the model for training, giving a saliency detection prediction map corresponding to each color real target image in the training set.
Step 1_4: the loss function value between the saliency detection prediction map corresponding to each original color real target image in the training set and the corresponding real saliency image, processed into an encoded image of corresponding size, is calculated; the loss function value between the prediction map corresponding to {I_q(i,j)} and {G_q(i,j)} is obtained using the BCE loss function.
Step 1_5: repeating the step 1_3 and the step 1_4 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, and correspondingly marked as W best And b best The method comprises the steps of carrying out a first treatment on the surface of the Where V > 1, v=100 in this example.
The specific steps of the test phase process of the embodiment are as follows:
Step 2_1: let {I'(i',j')} denote the color real target image to be detected for saliency, and let {D'(i',j')} denote the depth image corresponding to the real target image to be detected for saliency; where 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I'(i',j')}, H' denotes the height of {I'(i',j')}, I'(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I'(i',j')}, and D'(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {D'(i',j')}.
Step 2_2: the R channel component, G channel component and B channel component of {I'(i',j')} and the three-channel version of {D'(i',j')} are input into the convolutional neural network classification training model, and prediction is performed using W_best and b_best to obtain the predicted saliency detection image corresponding to {I'(i',j')} and {D'(i',j')}; its pixel value at the coordinate position (i',j') is the predicted saliency value of that pixel.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The architecture of the convolutional neural network was built using the Python-based deep learning library pytorch4.0.1. The NJU2000 test set of the real object image database (397 real object images) is used to analyze the saliency detection performance of the method on real scene images. The detection performance of the predicted saliency detection images is evaluated with 3 objective parameters commonly used to assess saliency detection methods: the precision-recall curve (Precision Recall Curve), the receiver operating characteristic curve (ROC) and the mean absolute error (Mean Absolute Error, MAE).
The method is used for predicting each real scene image in the real scene image database NJU2000 test set to obtain a prediction saliency detection image corresponding to each real scene image.
FIG. 4a reflects the precision-recall curve (PR Curve) of the saliency detection performance of the method of the present invention; the closer the curve is to 1, the better.
Fig. 4b reflects the receiver operating characteristic curve (ROC) of the saliency detection performance of the method of the present invention; the closer the curve is to 1, the better.
FIG. 4c reflects the mean absolute error (MAE) of the saliency detection performance of the method of the present invention; a lower MAE represents better detection performance.
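A minimal sketch of the MAE computation in its standard form for saliency evaluation; the arrays are illustrative placeholders for a predicted map and its ground truth.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and its ground
    truth, both given as arrays scaled to [0, 1]; lower is better."""
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())

# Example with a random map standing in for a prediction and a binary ground truth.
pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt))
```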
The figures show that the saliency detection results obtained by the method on real scene images are very good, indicating that it is feasible and effective to use the method to obtain predicted saliency detection images corresponding to real scene images.

Claims (6)

1. The multi-feature cascading RGB-D significance target detection method is characterized by comprising the following steps of:
step 1_1: q original RGB images and corresponding depth images thereof are selected, and a training set is formed by combining the true significance images corresponding to the original RGB images;
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises two input layers, a hidden layer and an output layer, wherein the two input layers are connected to the input end of the hidden layer, and the output end of the hidden layer is connected to the output layer;
step 1_3: each original RGB image and the corresponding depth image in the training set are respectively used as the original input images of the two input layers and are input into a convolutional neural network for training, so that a prediction significance image corresponding to each original RGB image in the training set is obtained; calculating a loss function value between a predicted saliency image corresponding to each original RGB image in the training set and a corresponding real saliency image, wherein the loss function value is obtained by adopting a BCE loss function;
Step 1_4: repeating the step 1_3 for V times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by V loss function values; then find out the smallest value of loss function value from Q X V pieces of loss function values; then, the weight vector and the bias term corresponding to the loss function value with the minimum value are correspondingly used as the optimal weight vector and the optimal bias term, and the weight vector and the bias term in the trained convolutional neural network classification training model are replaced;
step 1_5: inputting the RGB image to be predicted and the depth image corresponding to the RGB image to be predicted into a trained convolutional neural network classification training model, and predicting by utilizing an optimal weight vector and an optimal bias term to obtain a predicted saliency image corresponding to the RGB image to be predicted, thereby realizing saliency target detection;
among the two input layers in step 1_2, the 1st input layer is an RGB image input layer and the 2nd input layer is a depth image input layer; the hidden layer comprises an RGB feature extraction module, a depth feature extraction module, a mixed feature convolution layer, a detail information processing module, a global information processing module, a SKNet network model and a post-processing module;
the RGB feature extraction module comprises four sequentially connected color map neural network blocks, four color attention layers, eight color upsampling layers, four attention convolution layers and four color convolution layers; the four sequentially connected color map neural network blocks correspond respectively to the four sequentially connected modules of ResNet50; the output of the first color map neural network block is connected to the first RGB branch and the fifth RGB branch respectively, the output of the second color map neural network block is connected to the second RGB branch and the sixth RGB branch respectively, the output of the third color map neural network block is connected to the third RGB branch and the seventh RGB branch respectively, and the output of the fourth color map neural network block is connected to the fourth RGB branch and the eighth RGB branch respectively;
the depth feature extraction module comprises four sequentially connected depth map neural network blocks, four depth attention layers, eight depth upsampling layers, four attention convolution layers and four depth convolution layers; the four sequentially connected depth map neural network blocks correspond respectively to the four sequentially connected modules of ResNet50; the output of the first depth map neural network block is connected to the first depth branch and the fifth depth branch respectively, the output of the second depth map neural network block is connected to the second depth branch and the sixth depth branch respectively, the output of the third depth map neural network block is connected to the third depth branch and the seventh depth branch respectively, and the output of the fourth depth map neural network block is connected to the fourth depth branch and the eighth depth branch respectively;
the outputs of the first RGB branch and the second RGB branch are multiplied to form one input of the low-level feature convolution layer, and the outputs of the first depth branch and the second depth branch are multiplied to form the other input of the low-level feature convolution layer; the outputs of the third RGB branch and the fourth RGB branch are multiplied to form one input of the high-level feature convolution layer, and the outputs of the third depth branch and the fourth depth branch are multiplied to form the other input of the high-level feature convolution layer;
the outputs of the low-level feature convolution layer and the high-level feature convolution layer are input into the mixed feature convolution layer;
the fusion result of the fifth RGB branch and the sixth RGB branch is multiplied by the fusion result of the fifth depth branch and the sixth depth branch and then input into the detail information processing module; the fusion result of the seventh RGB branch and the eighth RGB branch is multiplied by the fusion result of the seventh depth branch and the eighth depth branch and then input into the global information processing module;
the output of the mixed feature convolution layer and the output of the detail information processing module are fused to form one input of the SKNet network model, and the output of the mixed feature convolution layer and the output of the global information processing module are fused to form the other input of the SKNet network model;
the post-processing module comprises a first deconvolution layer and a second deconvolution layer which are sequentially connected; the input of the post-processing module is the output of the SKNet network model, and the output of the post-processing module is delivered as the final output through the output layer.
2. A multi-feature cascading RGB-D significance target detection method according to claim 1, characterized in that,
the first RGB branch comprises a first color attention layer, a first color upsampling layer and a first attention convolution layer which are sequentially connected, the second RGB branch comprises a second color attention layer, a second color upsampling layer and a second attention convolution layer which are sequentially connected, the third RGB branch comprises a third color attention layer, a third color upsampling layer and a third attention convolution layer which are sequentially connected, and the fourth RGB branch comprises a fourth color attention layer, a fourth color upsampling layer and a fourth attention convolution layer which are sequentially connected;
The fifth RGB branch comprises a first color convolution layer and a fifth color up-sampling layer which are sequentially connected, the sixth RGB branch comprises a second color convolution layer and a sixth color up-sampling layer which are sequentially connected, the seventh RGB branch comprises a third color convolution layer and a seventh color up-sampling layer which are sequentially connected, and the eighth RGB branch comprises a fourth color convolution layer and an eighth color up-sampling layer which are sequentially connected;
the first depth branch comprises a first depth attention layer, a first depth upsampling layer and a fifth attention convolution layer which are sequentially connected, the second depth branch comprises a second depth attention layer, a second depth upsampling layer and a sixth attention convolution layer which are sequentially connected, the third depth branch comprises a third depth attention layer, a third depth upsampling layer and a seventh attention convolution layer which are sequentially connected, and the fourth depth branch comprises a fourth depth attention layer, a fourth depth upsampling layer and an eighth attention convolution layer which are sequentially connected;
the fifth depth branch comprises a first depth convolution layer and a fifth depth upsampling layer which are sequentially connected, the sixth depth branch comprises a second depth convolution layer and a sixth depth upsampling layer which are sequentially connected, the seventh depth branch comprises a third depth convolution layer and a seventh depth upsampling layer which are sequentially connected, and the eighth depth branch comprises a fourth depth convolution layer and an eighth depth upsampling layer which are sequentially connected.
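A minimal PyTorch sketch of the two kinds of branches described in claim 2, under the interpretation given in claim 4 (attention layer, bilinear upsampling, convolution). The ChannelAttention module below is a simplified stand-in for a full CBAM block, and all channel counts and scale factors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Simplified stand-in for the CBAM attention layer of claims 2 and 4 (channel attention only)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))             # global average pooling -> channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)    # reweight the feature channels

class AttentionBranch(nn.Module):
    """Attention layer -> bilinear upsampling -> attention convolution layer (e.g. the first RGB branch)."""
    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.attention = ChannelAttention(in_channels)
        self.scale = scale
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.attention(x)
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return self.conv(x)

class ConvBranch(nn.Module):
    """Convolution layer -> bilinear upsampling (e.g. the fifth RGB branch)."""
    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.scale = scale

    def forward(self, x):
        x = self.conv(x)
        return F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
```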
3. The multi-feature cascading RGB-D saliency target detection method of claim 1, wherein the detail information processing module comprises a first network module, a first transition convolution layer and a second transition convolution layer which are connected in sequence, and the input of the detail information processing module is fused with the output of the first transition convolution layer and the output of the second transition convolution layer to serve as the output of the detail information processing module;
the global information processing module comprises three processing branches, each of which comprises a global network module and a global convolution layer connected in sequence, and the outputs of the three processing branches are fused to serve as the output of the global information processing module.
4. The multi-feature cascading RGB-D saliency target detection method of claim 2, wherein each color attention layer and each depth attention layer adopts a CBAM module, and each color upsampling layer and each depth upsampling layer performs bilinear-interpolation upsampling of the input features; each attention convolution layer, color convolution layer, depth convolution layer, low-level feature convolution layer, high-level feature convolution layer and mixed feature convolution layer comprises one convolution; each deconvolution layer comprises one deconvolution.
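A minimal sketch of the post-processing module recited in claims 1 and 4: two deconvolution (transposed convolution) layers in sequence producing a single-channel saliency map. Kernel sizes, strides and channel counts here are assumptions for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class PostProcessing(nn.Module):
    """First and second deconvolution layers connected in sequence (claims 1 and 4)."""
    def __init__(self, in_channels=64, mid_channels=32):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(in_channels, mid_channels, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(mid_channels, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        x = torch.relu(self.deconv1(x))          # first deconvolution layer
        return torch.sigmoid(self.deconv2(x))    # second deconvolution layer -> single-channel saliency map
```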
5. A multi-feature cascading RGB-D saliency target detection method according to claim 3, wherein the transition convolution layers in the detail information processing module and the global convolution layers in the global information processing module each comprise one convolution; the first network module in the detail information processing module adopts a Dense block of the DenseNet network, and each global network module in the global information processing module adopts an ASPP module.
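For claim 5, minimal sketches of an ASPP-style global network module and a DenseNet-style block; the dilation rates, growth rate and number of layers are illustrative assumptions rather than the parameters used in the patent.

```python
import torch
import torch.nn as nn

class ASPPModule(nn.Module):
    """ASPP-style global network module: parallel dilated convolutions whose outputs are concatenated."""
    def __init__(self, in_channels, out_channels, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

class DenseBlock(nn.Module):
    """DenseNet-style block: each layer's output is concatenated with everything computed so far."""
    def __init__(self, in_channels, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Conv2d(channels, growth, kernel_size=3, padding=1))
            channels += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, torch.relu(layer(x))], dim=1)
        return x
```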
6. The multi-feature cascading RGB-D saliency target detection method of claim 1, wherein the RGB image input layer receives an RGB input image and the depth image input layer receives the depth image corresponding to that RGB image; the inputs of the RGB feature extraction module and the depth feature extraction module are the outputs of the RGB image input layer and the depth image input layer, respectively.
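Finally, a sketch of the multiplicative fusion recited in claim 1: the outputs of the first/second and third/fourth branches are multiplied element-wise and fed to the low-level and high-level feature convolution layers, whose results go to the mixed feature convolution layer. Combining the RGB and depth products (and the low/high outputs) by channel concatenation is an assumption; the claim only states that they form the inputs of each convolution layer.

```python
import torch

def cascade_fusion(rgb, depth, low_conv, high_conv, mixed_conv):
    """rgb and depth are lists of the outputs of the first to fourth RGB/depth branches,
    assumed to share a common spatial resolution; *_conv are the feature convolution layers."""
    low_rgb, low_depth = rgb[0] * rgb[1], depth[0] * depth[1]      # first x second branch, both modalities
    high_rgb, high_depth = rgb[2] * rgb[3], depth[2] * depth[3]    # third x fourth branch, both modalities
    low = low_conv(torch.cat([low_rgb, low_depth], dim=1))         # low-level feature convolution layer
    high = high_conv(torch.cat([high_rgb, high_depth], dim=1))     # high-level feature convolution layer
    return mixed_conv(torch.cat([low, high], dim=1))               # mixed feature convolution layer
```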
CN201911099871.2A 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method Active CN110929736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911099871.2A CN110929736B (en) 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method

Publications (2)

Publication Number Publication Date
CN110929736A CN110929736A (en) 2020-03-27
CN110929736B true CN110929736B (en) 2023-05-26

Family

ID=69852888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911099871.2A Active CN110929736B (en) 2019-11-12 2019-11-12 Multi-feature cascading RGB-D significance target detection method

Country Status (1)

Country Link
CN (1) CN110929736B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461043B (en) * 2020-04-07 2023-04-18 河北工业大学 Video significance detection method based on deep network
CN111666854B (en) * 2020-05-29 2022-08-30 武汉大学 High-resolution SAR image vehicle target detection method fusing statistical significance
CN111768375B (en) * 2020-06-24 2022-07-26 海南大学 Asymmetric GM multi-mode fusion significance detection method and system based on CWAM
CN111985552B (en) * 2020-08-17 2022-07-29 中国民航大学 Method for detecting diseases of thin strip-shaped structure of airport pavement under complex background
CN112330642B (en) * 2020-11-09 2022-11-04 山东师范大学 Pancreas image segmentation method and system based on double-input full convolution network
CN112580694B (en) * 2020-12-01 2024-04-19 中国船舶重工集团公司第七0九研究所 Small sample image target recognition method and system based on joint attention mechanism
CN112507933B (en) * 2020-12-16 2022-09-16 南开大学 Saliency target detection method and system based on centralized information interaction
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112651406B (en) * 2020-12-18 2022-08-09 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method
CN113516022B (en) * 2021-04-23 2023-01-10 黑龙江机智通智能科技有限公司 Fine-grained classification system for cervical cells
CN114723951B (en) * 2022-06-08 2022-11-04 成都信息工程大学 Method for RGB-D image segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437878B2 (en) * 2016-12-28 2019-10-08 Shutterstock, Inc. Identification of a salient portion of an image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409435A (en) * 2018-11-01 2019-03-01 上海大学 A kind of depth perception conspicuousness detection method based on convolutional neural networks
CN109903276A (en) * 2019-02-23 2019-06-18 中国民航大学 Convolutional neural networks RGB-D conspicuousness detection method based on multilayer fusion
CN110263813A (en) * 2019-05-27 2019-09-20 浙江科技学院 A kind of conspicuousness detection method merged based on residual error network and depth information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-scale deep encoder-decoder network for salient object detection; Qinghua Ren et al.; Neurocomputing; 2018-11-17; Vol. 316; full text *
Saliency detection based on cascaded fully convolutional neural networks; Zhang Songlong et al.; Laser & Optoelectronics Progress; 2018-10-29 (No. 07); full text *

Similar Documents

Publication Publication Date Title
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN108510532B (en) Optical and SAR image registration method based on deep convolution GAN
CN108182441B (en) Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN110728192B (en) High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN108154194B (en) Method for extracting high-dimensional features by using tensor-based convolutional network
Wu et al. 3D ShapeNets for 2.5D object recognition and next-best-view prediction
CN113221639B (en) Micro-expression recognition method for representative AU (AU) region extraction based on multi-task learning
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110458178B (en) Multi-mode multi-spliced RGB-D significance target detection method
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
JP7135659B2 (en) SHAPE COMPLEMENTATION DEVICE, SHAPE COMPLEMENTATION LEARNING DEVICE, METHOD, AND PROGRAM
Nguyen et al. Satellite image classification using convolutional learning
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN103646256A (en) Image characteristic sparse reconstruction based image classification method
CN113269224A (en) Scene image classification method, system and storage medium
CN113743521B (en) Target detection method based on multi-scale context awareness
CN114494594A (en) Astronaut operating equipment state identification method based on deep learning
CN111612046B (en) Feature pyramid graph convolution neural network and application thereof in 3D point cloud classification
CN112597956A (en) Multi-person attitude estimation method based on human body anchor point set and perception enhancement network
CN116342961B (en) Time sequence classification deep learning system based on mixed quantum neural network
Jafrasteh et al. Generative adversarial networks as a novel approach for tectonic fault and fracture extraction in high resolution satellite and airborne optical images
EP3588441B1 (en) Imagification of multivariate data sequences
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN105718858A (en) Pedestrian recognition method based on positive-negative generalized max-pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant