CN110889416A - Salient object detection method based on cascade improved network


Info

Publication number
CN110889416A
CN110889416A (application CN201911278227.1A)
Authority
CN
China
Prior art keywords
features
depth
cascade
rgb
map
Prior art date
Legal status
Granted
Application number
CN201911278227.1A
Other languages
Chinese (zh)
Other versions
CN110889416B (en)
Inventor
杨巨峰 (Jufeng Yang)
翟英杰 (Yingjie Zhai)
范登平 (Deng-Ping Fan)
Current Assignee
Nankai University
Original Assignee
Nankai University
Priority date
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN201911278227.1A priority Critical patent/CN110889416B/en
Publication of CN110889416A publication Critical patent/CN110889416A/en
Application granted granted Critical
Publication of CN110889416B publication Critical patent/CN110889416B/en
Current legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour

Abstract

The invention discloses an RGB-D salient object detection method based on a cascade improved network, belonging to the technical field of image processing. Most existing RGB-D models directly aggregate features from different levels of a CNN, which easily introduces the noise and interference information contained in the low-level features. The invention creatively proposes a cascaded refinement structure, which takes a saliency map generated from the high-level features as a mask to refine the low-level features and then generates the final saliency map by aggregating the refined low-level features. In addition, to suppress the interference information in the depth map, the invention proposes a depth enhancement module that preprocesses the depth features before they are fused with the RGB features. Experiments with 4 evaluation metrics on 7 datasets show that the invention surpasses all current state-of-the-art RGB-D salient object detection methods.

Description

Salient object detection method based on cascade improved network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an RGB-D salient object detection method based on a cascade improved network.
Background Art
RGB-D saliency detection aims to combine an RGB image with depth information to find the most salient objects in a scene. In recent years, smart devices capable of capturing depth information (such as smartphones and motion-sensing peripherals) have become popular and widely used, and a large number of RGB-D saliency algorithms have accordingly been proposed.
Early RGB-D saliency detection algorithms mainly used hand-crafted features, and these methods relied heavily on specific prior knowledge such as local contrast, global contrast, background priors, spatial priors and channel priors. To exploit the hand-crafted features effectively, researchers employed various classical tools such as support vector machines, Markov chains, random forests and cellular automata, all of which achieved reliable results. In addition, researchers explored various fusion strategies: early fusion, i.e., feeding the depth map into the network as a fourth channel alongside RGB; middle fusion, i.e., fusing features from the RGB network and the depth network; and late fusion, i.e., mixing the saliency maps predicted from the depth information and the RGB information by multiplication or addition. These strategies also achieved good results.
With the popularity of convolutional neural networks (CNNs), various deep-network-based algorithms have been proposed. Early deep algorithms were still based on hand-crafted features and used deep networks only for classification; they therefore relied on artificially defined features and could not be trained end to end. To make full use of the depth information, researchers proposed different deep network architectures (such as single-stream, two-stream and three-stream networks) and various multi-scale, multi-modal mixing strategies. However, since the depth map acquired by a device may contain considerable noise and misleading information, researchers have proposed using prior knowledge and depth filter units to improve the depth information.
Although the above work recognizes that features at all levels of the network contain useful information and exploits them, it ignores the noise and redundancy contained in the low-level features, so these features are not used effectively, and the interference they carry often causes the generated saliency map to include background clutter. Furthermore, depth features are usually combined with RGB information by channel-wise concatenation, element-wise addition or multiplication; these operations cannot effectively reduce the modality gap between depth features and RGB features, nor eliminate the interference of low-quality depth maps.
Disclosure of Invention
The invention aims to solve two problems of existing RGB-D saliency detection methods: the background interference caused by aggregating features from all levels directly and without distinction, which introduces the noise contained in the low-level features, and the poor modality matching between depth features and RGB features. To this end, it designs an RGB-D salient object detection method based on a cascade improved network.
The technical scheme adopted by the invention is as follows:
A salient object detection method based on a cascade improved network refines the low-level features using a preliminary saliency map generated from the high-level features, and generates the final saliency map by aggregating the refined low-level features. The method specifically comprises the following steps:
Step 1: two CNN networks with the same structure are used; one network takes the RGB image as input and extracts RGB features at 5 different levels, while the other takes the depth map as input and extracts depth features at 5 different levels.
Step 2: the 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are then fused with the RGB features of the corresponding level to obtain multi-modal features; the depth enhancement module consists of a channel attention operation and a spatial attention operation executed in sequence.
Step 3: a cascaded feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features of layers 3 to 5 to generate an initial saliency map; using this map as a mask, the low-level multi-modal features of layers 1 to 3 are refined by element-wise multiplication of the initial saliency map with each channel of those features.
Step 4: another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map. Each cascaded feature decoder consists of 3 global context modules and a pyramidal multi-level feature multiplication and concatenation operation.
Advantages and beneficial effects of the invention:
according to the method, the semantic information contained in the high-level features in the depth network is effectively utilized through the depth map filter module to generate a relatively accurate initial saliency map which is then used for improving the low-level features, so that the influence of noise in the low-level features can be fully inhibited, the detailed information of the low-level features can be well kept, and the saliency map with better edge and detailed information can be generated; on the other hand, the depth enhancement unit provided by the invention can enable the network to intensively extract information which is beneficial to significance detection in the depth map, and can improve the modal matching capability of the RGB features and the depth features.
Drawings
FIG. 1 is a block diagram of an embodiment of the saliency detection method based on the cascade improved network proposed by the present invention;
FIG. 2 shows the specific structure of the depth enhancement module (DEM) of the proposed method;
FIG. 3 shows the specific structure of the global context module (GCM) of the proposed method;
FIG. 4 is a comparison of the present invention against the 10 most advanced RGB-D saliency detection methods on 4 evaluation metrics, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet and DF) and 2 traditional hand-crafted-feature methods (SE and LBE).
Detailed Description of the Embodiments
referring to fig. 1, the salient object detection method based on the cascaded modified network provided by the present invention mainly comprises a depth enhancement unit (DEM) and a cascaded feature Decoder (Cascade Decoder), and the implementation steps of the salient object detection method based on the cascaded modified network are as follows:
1. Two ResNet-50 CNN networks with the same architecture are used: one network takes the RGB image as input and extracts RGB features at 5 different levels, $\{f_i^{rgb}\}_{i=1}^{5}$, while the other takes the depth map as input and extracts depth features at 5 different levels, $\{f_i^{d}\}_{i=1}^{5}$. The RGB network has 3 input channels and the depth network has 1 input channel.
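As an illustration of step 1, the following is a minimal PyTorch sketch of the two-stream feature extraction. PyTorch/torchvision, the untrained initialization (`weights=None`) and the helper names (`TwoStreamBackbone`, `_levels`) are assumptions, since the patent fixes only the backbone (ResNet-50), the 5 feature levels and the input channel counts:

```python
import torch.nn as nn
import torchvision.models as models

class TwoStreamBackbone(nn.Module):
    """Two ResNet-50 streams with identical architecture: one for the
    3-channel RGB image, one for the 1-channel depth map; each stream
    yields features at 5 different levels."""
    def __init__(self):
        super().__init__()
        self.rgb = models.resnet50(weights=None)
        self.depth = models.resnet50(weights=None)
        # The depth stream takes a single-channel input (see step 1).
        self.depth.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)

    @staticmethod
    def _levels(net, x):
        f1 = net.relu(net.bn1(net.conv1(x)))  # level 1, stride 2
        f2 = net.layer1(net.maxpool(f1))      # level 2, stride 4
        f3 = net.layer2(f2)                   # level 3, stride 8
        f4 = net.layer3(f3)                   # level 4, stride 16
        f5 = net.layer4(f4)                   # level 5, stride 32
        return [f1, f2, f3, f4, f5]

    def forward(self, rgb, depth):
        # Returns {f_i^rgb} and {f_i^d} for i = 1..5.
        return self._levels(self.rgb, rgb), self._levels(self.depth, depth)
```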
2. The 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are fused with the RGB features of the corresponding level by element-wise addition, giving the multi-modal features $\{f_i^{cm}\}_{i=1}^{5}$, namely:

$$f_i^{cm} = f_i^{rgb} + \mathrm{DEM}(f_i^{d}), \qquad i = 1, \dots, 5$$
Referring to FIG. 2, the depth enhancement module consists of a channel attention operation $C_{att}$ followed by a spatial attention operation $S_{att}$:

$$\mathrm{DEM}(f) = S_{att}\big(C_{att}(f)\big)$$

The channel attention operation and the spatial attention operation are defined as

$$C_{att}(f) = M\big(P_{max}(f)\big) \odot f, \qquad S_{att}(f) = \mathrm{Conv}\big(P'_{max}(f)\big) \odot f$$

where $f$ denotes the input feature maps, $M$ a 2-layer multi-layer perceptron, $P_{max}$ global max pooling over each feature map, $P'_{max}$ global max pooling along the channel dimension of the feature maps, and $\mathrm{Conv}$ a standard 3×3 convolution; $\odot$ denotes expanding the attention map to the dimensions of $f$ and then multiplying element-wise.
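To make the structure concrete, a minimal PyTorch sketch of the DEM follows. The sigmoid squashing, the MLP reduction ratio and the class/parameter names are assumptions; the patent fixes only the overall structure (global max pooling, a 2-layer perceptron, channel-wise max pooling and a 3×3 convolution):

```python
import torch.nn as nn

class DEM(nn.Module):
    """Depth Enhancement Module: a channel attention operation C_att
    followed by a spatial attention operation S_att, as in the formulas
    above. reduction=4 and the sigmoids are assumptions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # M: a 2-layer multi-layer perceptron on the channel descriptor.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Conv: a standard 3x3 convolution on the channel-wise max map.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        # C_att: P_max over each feature map, then M, expand, multiply.
        w = self.mlp(f.amax(dim=(2, 3)))
        f = f * w.view(b, c, 1, 1)
        # S_att: P'_max along the channel dimension, then Conv, multiply.
        s = self.conv(f.amax(dim=1, keepdim=True))
        return f * s
```

The enhanced depth features would then be added element-wise to the RGB features of the corresponding level, `f_cm = f_rgb + dem(f_d)`, as in the formula for $f_i^{cm}$ above.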
3. A cascaded feature decoder (Cascade Decoder 1) aggregates the high-level multi-modal features of layers 3 to 5, $\{f_i^{cm}\}_{i=3}^{5}$, to generate an initial saliency map, namely:

$$S_1 = D_1\big(f_3^{cm}, f_4^{cm}, f_5^{cm}\big)$$

where $D_1$ denotes the first cascaded feature decoder. The initial saliency map is then used as a mask to refine the low-level multi-modal features of layers 1 to 3, $\{f_i^{cm}\}_{i=1}^{3}$, by multiplying it element-wise with each of their channels, giving the refined features $\{f_i^{cm\prime}\}_{i=1}^{3}$, namely:

$$f_i^{cm\prime} = f_i^{cm} \odot S_1, \qquad i = 1, 2, 3$$
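A minimal sketch of this refinement step, assuming the initial saliency map $S_1$ is a single-channel map in $[0, 1]$ and is bilinearly resized to each feature's resolution (the resizing is an assumption needed only to match sizes):

```python
import torch.nn.functional as F

def refine_low_level(features, s1):
    """Refine the low-level multi-modal features f_1..f_3 by multiplying
    every channel with the initial saliency map S1 used as a mask."""
    refined = []
    for f in features[:3]:
        mask = F.interpolate(s1, size=f.shape[2:], mode='bilinear',
                             align_corners=False)
        refined.append(f * mask)  # 1-channel mask broadcasts over channels
    return refined
```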
Referring to FIG. 1, each cascaded feature decoder consists of 3 global context modules (GCMs) and a pyramidal multi-level feature multiplication and concatenation operation. As shown in FIG. 3, each global context module consists of four branches. In every branch the channel dimension of the feature map is first reduced to 32 by a 1×1 convolution; the 1st branch performs no further operation, while the k-th branch (k ∈ {2, 3, 4}) first applies a convolution with kernel size (2k-1)×(2k-1) and dilation rate 1, and then a 3×3 convolution with dilation rate 2k-1 to capture global context and enlarge the receptive field. The outputs of the 4 branches are concatenated along the channel dimension and then residually connected with the input.
Referring to FIG. 1(b), the features output by the global context modules, $\{f_i^{gcm}\}$, are processed by the pyramidal multiplication and concatenation operation: each output feature $f^{gcm}$ is updated by multiplying it in turn with all feature maps above it in the hierarchy, and the updated features are then concatenated channel-wise, level by level, to generate the final output. Where feature sizes do not match, the smaller feature maps are first processed by an upsampling-convolution (upsample-conv) step so that sizes match before the pyramid operation updates them.
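A minimal PyTorch sketch of one global context module consistent with this description; the 3×3 fusion convolution and the 1×1 skip projection (needed so the residual connection matches channel counts) are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class GCM(nn.Module):
    """Global context module: four branches whose receptive fields grow
    with k, concatenated channel-wise and residually connected."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        # Branch 1: only the 1x1 channel reduction to 32.
        branches = [nn.Conv2d(in_ch, out_ch, 1)]
        # Branches k = 2, 3, 4: 1x1 reduction, a (2k-1)x(2k-1) convolution
        # with dilation 1, then a 3x3 convolution with dilation 2k-1.
        for k in (2, 3, 4):
            d = 2 * k - 1
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.Conv2d(out_ch, out_ch, d, padding=d // 2),
                nn.Conv2d(out_ch, out_ch, 3, padding=d, dilation=d),
            ))
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 3, padding=1)  # assumption
        self.skip = nn.Conv2d(in_ch, out_ch, 1)                  # assumption

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(y) + self.skip(x)  # residual connection with input
```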
4. Another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, $\{f_i^{cm\prime}\}_{i=1}^{3}$, and a progressive upsampling module (PTM) then generates the final saliency map $S_2$:

$$S_2 = T\big(D_2(f_1^{cm\prime}, f_2^{cm\prime}, f_3^{cm\prime})\big)$$

where $D_2$ denotes the second cascaded feature decoder, whose structure is identical to that of the cascaded feature decoder described in step 3, and $T$ denotes the progressive upsampling module.
The progressive upsampling module (PTM) outputs a saliency map of fixed size. It consists of 2 residual-based deconvolution blocks with 32 channels, each containing 1 residual-based convolution operation and 1 residual-based deconvolution operation.
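A minimal sketch of the PTM consistent with that description; the kernel sizes, the ReLU placement, the final 1-channel prediction convolution and the simplified (non-residual) form of the transposed convolution are assumptions:

```python
import torch
import torch.nn as nn

class PTM(nn.Module):
    """Progressive upsampling module: 2 deconvolution blocks with 32
    channels, each holding one residual convolution and one stride-2
    transposed convolution that doubles the resolution."""
    def __init__(self, channels=32, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleDict({
                'res': nn.Conv2d(channels, channels, 3, padding=1),
                'up': nn.ConvTranspose2d(channels, channels, 4,
                                         stride=2, padding=1),
            }) for _ in range(n_blocks))
        self.head = nn.Conv2d(channels, 1, 3, padding=1)  # saliency logits

    def forward(self, x):
        for b in self.blocks:
            x = x + torch.relu(b['res'](x))  # residual convolution
            x = torch.relu(b['up'](x))       # transposed conv, doubles H and W
        return self.head(x)
```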
5. The loss function $\mathcal{L}$ of the whole network in the training phase is defined as

$$\mathcal{L} = \alpha\,\ell_{ce}(S_1, G) + (1 - \alpha)\,\ell_{ce}(S_2, G)$$

where $\alpha$ is the weight balancing the losses of the two stage outputs, set to 0.5 in the present invention, $G$ denotes the ground-truth label map, and $\ell_{ce}$ denotes the cross-entropy loss function, defined as

$$\ell_{ce}(S, G) = -\sum_{j}\big[G_j \log S_j + (1 - G_j)\log(1 - S_j)\big]$$

where $S$ denotes the predicted saliency map and the sum runs over all pixels $j$.
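A minimal training-loss sketch matching the formulas above; treating the network outputs as logits (so the sigmoid folds into the loss for numerical stability) and resizing the label map to the resolution of $S_1$ are implementation assumptions:

```python
import torch.nn.functional as F

def total_loss(s1, s2, g, alpha=0.5):
    """L = alpha * l_ce(S1, G) + (1 - alpha) * l_ce(S2, G), with l_ce the
    pixel-wise binary cross-entropy over predicted saliency maps."""
    g1 = F.interpolate(g, size=s1.shape[2:], mode='nearest')  # match S1's size
    return (alpha * F.binary_cross_entropy_with_logits(s1, g1)
            + (1 - alpha) * F.binary_cross_entropy_with_logits(s2, g))
```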
6. The effect of the invention is further illustrated by the following simulation experiment:
table 1 shows comparative experiments of the invention with other 18 RGB-D significance detection methods on seven data sets of NJU2K, NLPR, STERE, DES, LFSD, SSD and SIP. The evaluation index adopted in the experiment is S-measure (S)α)、Max F-measure(Fβ)、Max E-Measure(Eξ) And mae (m) to fully evaluate the method. The results show that the performance of the present invention exceeds all the latest results that have been published.
TABLE 1
[Table 1 is reproduced as an image in the original publication: quantitative comparison with the 18 methods on the 7 datasets under the 4 evaluation metrics.]
In addition, FIG. 4 compares the effect of the present invention with the 10 most advanced methods in specific application scenarios, including 8 deep-learning-based methods (DMRA, CPFP, TANet, PCF, MMCI, CTMF, AFNet and DF) and 2 traditional hand-crafted-feature methods (SE and LBE). It can be seen that the results of the present invention are closer to the ground-truth labels (GT) than those of the other methods.
The parts of this embodiment that are not described in detail belong to common general knowledge in the field and are not elaborated here.
The saliency detection method based on the cascade improved network has been described in detail above, and the principle and implementation of the invention have been explained through specific embodiments. For those skilled in the art, variations of the specific embodiments and the application scope are possible within the idea of the invention; the content of this specification should therefore not be construed as limiting the invention, and all designs similar or identical to the present invention fall within its protection scope.

Claims (4)

1. A salient object detection method based on a cascade improved network, characterized in that a preliminary saliency map generated from the high-level features is used to refine the low-level features, and the final saliency map is generated by aggregating the refined low-level features, comprising the following steps:
Step 1: two CNN networks with the same structure are used; one network takes the RGB image as input and extracts RGB features at 5 different levels, while the other takes the depth map as input and extracts depth features at 5 different levels;
Step 2: the 5 levels of depth features extracted in step 1 are each passed through a depth enhancement module (DEM) to obtain enhanced depth features, which are then fused with the RGB features of the corresponding level to obtain multi-modal features;
Step 3: a cascaded feature decoder aggregates the high-level multi-modal features of layers 3 to 5 to generate an initial saliency map, and the low-level multi-modal features of layers 1 to 3 are refined using the initial saliency map as a mask;
Step 4: another cascaded feature decoder (Cascade Decoder 2) aggregates the refined low-level multi-modal features of layers 1 to 3, and a progressive upsampling module (PTM) then generates the final saliency map.
2. The salient object detection method based on a cascade improved network according to claim 1, wherein: the depth enhancement module described in step 2 consists of a channel attention operation and a spatial attention operation performed in sequence.
3. The salient object detection method based on a cascade improved network according to claim 1, wherein: the refinement described in step 3 uses the initial saliency map as a mask and performs element-wise multiplication of the initial saliency map with each channel of the low-level multi-modal features of layers 1 to 3.
4. The salient object detection method based on a cascade improved network according to claim 1, wherein: the cascaded feature decoders in steps 3 and 4 each consist of 3 global context modules and a pyramidal multi-level feature multiplication and concatenation operation.
CN201911278227.1A 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network Active CN110889416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911278227.1A CN110889416B (en) 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network


Publications (2)

Publication Number Publication Date
CN110889416A true CN110889416A (en) 2020-03-17
CN110889416B CN110889416B (en) 2023-04-18

Family

ID=69751772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911278227.1A Active CN110889416B (en) 2019-12-13 2019-12-13 Salient object detection method based on cascade improved network

Country Status (1)

Country Link
CN (1) CN110889416B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068198A1 (en) * 2016-09-06 2018-03-08 Carnegie Mellon University Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network
CN110210539A (en) * 2019-05-22 2019-09-06 西安电子科技大学 The RGB-T saliency object detection method of multistage depth characteristic fusion
CN110458797A (en) * 2019-06-18 2019-11-15 南开大学 A kind of conspicuousness object detecting method based on depth map filter

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582316A (en) * 2020-04-10 2020-08-25 天津大学 RGB-D significance target detection method
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
WO2022134842A1 (en) * 2020-12-24 2022-06-30 广东博智林机器人有限公司 Method and apparatus for identifying building features
CN112905828A (en) * 2021-03-18 2021-06-04 西北大学 Image retriever, database and retrieval method combined with significant features
CN113298814A (en) * 2021-05-21 2021-08-24 浙江科技学院 Indoor scene image processing method based on progressive guidance fusion complementary network
CN113780241A (en) * 2021-09-29 2021-12-10 北京航空航天大学 Acceleration method and device for detecting salient object
CN113780241B (en) * 2021-09-29 2024-02-06 北京航空航天大学 Acceleration method and device for detecting remarkable object
CN116403174A (en) * 2022-12-12 2023-07-07 深圳市大数据研究院 End-to-end automatic driving method, system, simulation system and storage medium

Also Published As

Publication number Publication date
CN110889416B (en) 2023-04-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant