CN113963170A - RGBD image saliency detection method based on interactive feature fusion - Google Patents

RGBD image saliency detection method based on interactive feature fusion

Info

Publication number
CN113963170A
CN113963170A
Authority
CN
China
Prior art keywords
image
convolution
fusion
saliency detection
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111039181.5A
Other languages
Chinese (zh)
Inventor
赵晓丽
张倬尧
陈正
方志军
叶翰辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Engineering Science
Original Assignee
Shanghai University of Engineering Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Engineering Science filed Critical Shanghai University of Engineering Science
Priority to CN202111039181.5A priority Critical patent/CN113963170A/en
Publication of CN113963170A publication Critical patent/CN113963170A/en
Pending legal-status Critical Current

Classifications

    • G06F18/2415: Pattern recognition; analysing; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate (G: Physics; G06F: Electric digital data processing)
    • G06N3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks (G06N: Computing arrangements based on specific computational models)
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses an RGBD image saliency detection method based on interactive feature fusion. For each image in the training sample set, a multi-level convolutional neural network module first extracts features from the color image and the depth image at multiple levels. A cross feature fusion module then performs multi-level dot-product fusion on the color and depth features extracted by the deeper convolution levels to obtain initial saliency maps, and an Inception structure performs multi-scale fusion on the initial saliency maps to output a network-predicted saliency map. Finally, a focal loss is computed from the network-predicted saliency map and the target saliency map to learn the optimal parameters of the image saliency detection model, yielding a trained model that performs saliency detection on the RGB-D image to be processed. The method is simple, reliable, easy to operate and implement, and convenient to popularize and apply.

Description

RGBD image saliency detection method based on interactive feature fusion
Technical Field
The invention relates to the technical field of image processing, in particular to an RGBD image saliency detection method based on interactive feature fusion.
Background
In application fields such as autonomous driving, robotics and virtual reality, locating salient targets in a scene and filtering out information weakly related to the task are of great importance for reducing computational complexity and improving scene understanding, and constitute one of the core problems and research hotspots in computer vision.
In recent years, with the wide application of deep convolutional neural networks in image processing, saliency detection has developed rapidly, and a large number of saliency models based on visual features such as color and brightness have been proposed. Li et al. first constructed a multi-scale feature-based saliency model using a deep neural network in "Visual saliency based on multiscale deep features". Hou et al. proposed the DSS model in "Deeply Supervised Salient Object Detection with Short Connections", which uses a fully convolutional network (FCN) to extract multi-level, multi-scale features and fuses them through a skip-connection structure. Feng et al., in "Attentive Feedback Network for Boundary-aware Salient Object Detection", use a global perception module to refine the most salient features as a whole and an attentive feedback module to pass information between the corresponding encoders and decoders.
However, RGB image saliency detection faces two major challenges: first, when the target and the background have similar appearance, they are hard to distinguish with RGB information alone; second, when a single object contains several different colors, it is easily mis-detected as multiple objects. The depth map contains rich spatial structure and three-dimensional layout information and can provide many additional cues for distinguishing the target from the background while preserving the integrity of the detected region, so exploiting depth information can effectively improve saliency detection. Ciptadi et al. first introduced depth information on top of RGB in "An In Depth View of Saliency" and proposed an RGB-D-based saliency segmentation model. Peng et al. proposed a multi-stage RGB-D model in "RGBD salient object detection: a benchmark and algorithms", which simultaneously considers depth and appearance cues derived from low-level feature contrast, mid-level region grouping and high-level prior enhancement. Chen et al. designed a complementarity-aware fusion module in "Progressively Complementarity-aware Fusion Network for RGB-D Salient Object Detection" to learn complementary color and depth information, and progressively fuse multi-level information from deep to shallow through cascaded modules with densely added layer-wise supervision. Piao et al. proposed a depth-induced multi-scale recurrent attention network in "Depth-induced Multi-scale Recurrent Attention Network for Saliency Detection", which uses a depth refinement block with a residual structure to fuse complementary color and depth information, combines multi-scale contextual features with depth information to accurately locate salient targets, and further improves performance with a recurrent attention module.
In summary, existing RGB-D saliency detection methods mostly attach sub-networks to a backbone to learn complementary color and depth information and then fuse the features, but most of them have very large network structures with many parameters and are difficult to train.
Disclosure of Invention
The invention provides an RGBD image saliency detection method based on interactive feature fusion and a novel interactive dual-stream saliency detection framework: it designs a Global-Local feature extraction convolution Block (GL Block) to acquire global features and guide local feature extraction, proposes a dot-product method for obtaining the common features of the color image and the depth image, and builds a Cross Feature Fusion Module (CFFM) to cross-fuse the feature information of the color image and the depth image.
The invention can be realized by the following technical scheme:
an RGBD image saliency detection method based on interactive feature fusion comprises the following steps:
firstly, establishing an image sample set for training;
step two, establishing an image saliency detection model;
for each image in the image sample set, a multi-level convolutional neural network module first extracts features from the color image and the depth image at multiple levels; a cross feature fusion module performs multi-level dot-product fusion on the color and depth features extracted by the deeper convolution levels to obtain initial saliency maps; an Inception structure then performs multi-scale fusion on the initial saliency maps to output a network-predicted saliency map; finally, a focal loss is computed from the network-predicted saliency map and the target saliency map to learn the optimal parameters of the image saliency detection model, yielding a trained image saliency detection model;
and step three, inputting the RGB-D image to be processed into the trained image saliency detection model, and outputting a corresponding saliency detection result, namely a saliency map, through model calculation.
Further, the cross feature fusion module comprises a first convolution, a second convolution and a third convolution; the first convolution extracts features from the color image features, the second convolution extracts features from the depth image features, the common features of the color image features and the depth image features are extracted by dot multiplication and fused, and the third convolution then applies convolution and activation operations so that the fused features can be merged back into the original color image features and the original depth image features respectively.
Further, the first convolution, the second convolution and the third convolution have the same structure.
Further, the multi-level convolutional neural network module comprises two identical branches acting on the color image and the depth image respectively; it adopts an FCN structure with five convolution layers, where the first layer uses a standard convolution block and the remaining layers use global-local feature extraction convolution blocks;
the global-local feature extraction convolution block comprises a global branch and a local branch; the local branch first reduces the input feature map to 1/4 of its original size with a convolution of stride 2 and then extracts local features with two identical convolutions of stride 1; the global branch extracts global features with a bottleneck structure; finally, the extracted global and local features are fused by dot multiplication.
Further, the convolutions with stride 1 use 3 × 3 kernels and the ReLU activation function.
Further, the focal loss function L_fl is defined as

L_fl = -α · y · (1 - ŷ)^γ · log(ŷ) - (1 - α) · (1 - y) · ŷ^γ · log(1 - ŷ)

wherein y and ŷ respectively denote the target saliency map and the network-predicted saliency map, γ denotes a constant, and α denotes a balance factor.
The beneficial technical effects of the invention are as follows:
The novel interactive dual-stream saliency detection framework detects salient regions well, generates accurate saliency maps, and improves the efficiency and accuracy of salient target detection. Comprehensive experiments on three public data sets, NJU2000, NLPR and STEREO, show that the method performs well on mainstream evaluation indexes. In addition, the method of the invention is simple, reliable, easy to operate and implement, and convenient to popularize and apply.
Drawings
FIG. 1 is a schematic diagram of the architecture of a dual stream network of the present invention;
FIG. 2 is a schematic diagram of the structure of the global-local feature extraction convolution block GL Block of the present invention;
FIG. 3 is a schematic structural diagram of a cross feature fusion module CFFM according to the present invention;
FIG. 4 is a graphical representation of the comparison of the results of significance testing using the method of the present invention with other methods;
FIG. 5 is a graph comparing P-R curves for significance testing using the method of the present invention with other methods;
FIG. 6 is a graph comparing model sizes for significance detection using the method of the present invention with other methods.
Detailed Description
The following detailed description of the preferred embodiments will be made with reference to the accompanying drawings.
The invention provides an RGBD image saliency detection method based on interactive feature fusion. As shown in FIG. 1, the network framework adopts a dual-stream network: the proposed global-local feature extraction convolution block GL Block acquires and fuses global and local features and replaces the original standard convolution blocks in the FCN to generate initial saliency maps; to obtain the common salient features of the color and depth information, a cross feature fusion module CFFM based on dot multiplication is proposed; considering that shallow features contain more noise, the invention applies the CFFM to cross-fuse color and depth features at the deeper levels of the FCN, thereby reducing redundant features; finally, the initial saliency maps are fused through an Inception structure to improve the scale adaptability of the network. The method comprises the following specific steps:
firstly, establishing an image sample set for training;
The color image, the depth map and the manually annotated saliency map of each RGB-D image in the sample set are scaled together so that the computing device can bear the computational load of the neural network, and operations such as random cropping and horizontal flipping are applied jointly to increase the diversity of the data; the color image and the depth image are then normalized to highlight the foreground features of the image.
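As an illustration of this joint preprocessing, a minimal sketch is given below, assuming a torchvision-based pipeline; the target size, crop ratio and ImageNet normalization statistics are assumptions not specified by the invention:

```python
import random
import torchvision.transforms.functional as TF

def preprocess(rgb, depth, gt, size=(224, 224), train=True):
    # Jointly resize the color image, depth map and ground-truth saliency map.
    rgb, depth, gt = [TF.resize(x, list(size)) for x in (rgb, depth, gt)]
    if train:
        # Joint horizontal flip with probability 0.5.
        if random.random() < 0.5:
            rgb, depth, gt = [TF.hflip(x) for x in (rgb, depth, gt)]
        # Joint random crop (crop ratio 0.9 is an assumption).
        h, w = int(size[0] * 0.9), int(size[1] * 0.9)
        top = random.randint(0, size[0] - h)
        left = random.randint(0, size[1] - w)
        rgb, depth, gt = [TF.crop(x, top, left, h, w) for x in (rgb, depth, gt)]
    # Normalize the color image (ImageNet statistics assumed); depth and GT are only scaled to [0, 1].
    rgb = TF.normalize(TF.to_tensor(rgb), [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    depth = TF.to_tensor(depth)
    gt = TF.to_tensor(gt)
    return rgb, depth, gt
```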
Step two, establishing an image saliency detection model;
1. and for each image in the image sample set, firstly, a multilevel convolutional neural network module is utilized to respectively extract the characteristics of the color image and the depth image in a multilevel mode.
For a segmentation network, the larger the receptive field, the larger the range the network can capture, the more information is available for analysis, and the better the segmentation effect. Shallow convolution layers have narrow receptive fields and retain a large amount of detail information, which benefits fine segmentation; deep convolution layers have relatively wide receptive fields and can learn abstract features that improve classification performance. The FCN adopts a skip-level structure that makes full use of shallow information to assist gradual upsampling and thereby obtain a refined segmentation; however, in the FCN the actual receptive field of the fc7 layer covers only 1/4 of the full image rather than the whole image, which is not sufficient for the task. To obtain a larger receptive field, one usually either increases the network depth or uses large convolution kernels; the former not only greatly increases the network burden but also easily causes gradient explosion or vanishing gradients, while the latter sharply increases the computation, hinders increasing the network depth, and degrades computational performance.
Based on the above problems, the invention designs a global-local feature extraction convolution block GL Block with a dual-branch structure for extracting local and global features, whose structure is shown in FIG. 2 and which is used in the dual-stream network shown in FIG. 1. The multi-level convolutional neural network module contains two branches acting on the color image and the depth image respectively, each containing five convolution blocks: the first is a standard convolution block and the rest are the GL Blocks provided by the invention. Upsampling is then performed by deconvolution, and shallow information is merged through skip connections, so that every convolution block can perform global feature extraction without increasing the network load, preserving computation speed and helping to optimize the whole network structure.
The GL Block provided by the invention has a dual-branch structure, i.e. a local branch and a global branch, which extract local and global features respectively. The local branch first reduces the input feature map to 1/4 of its original size with a convolution layer of stride 2, kernel size 3 × 3 and ReLU activation, and then extracts local features with two identical convolution layers of stride 1. To reduce the computation of the branch, the global branch adopts a bottleneck structure: it first explicitly extracts global features with a global average pooling layer, applies a series of convolution operations to integrate the global spatial information of the whole image, then learns the global feature distribution with Softmax, and finally fuses the global and local features by dot multiplication.
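A minimal PyTorch sketch of such a GL Block follows; the channel widths and the exact layout of the bottleneck (1 × 1 convolutions around the global average pooling) are assumptions, since only the stride/kernel pattern of the local branch and the pooling-convolution-Softmax pattern of the global branch are specified above:

```python
import torch.nn as nn

class GLBlock(nn.Module):
    """Global-local feature extraction block (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Local branch: a stride-2 3x3 conv halves H and W (1/4 of the area),
        # followed by two identical stride-1 3x3 convs with ReLU.
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # Global branch: bottleneck of global average pooling, 1x1 convs and a
        # channel-wise Softmax that models the global feature distribution.
        self.glob = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        local = self.local(x)      # B x out_ch x H/2 x W/2
        glob = self.glob(x)        # B x out_ch x 1 x 1
        return local * glob        # dot-product (element-wise) fusion
```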
2. A cross feature fusion module performs multi-level dot-product fusion on the color and depth image features extracted by the deeper convolution levels of the multi-level convolutional neural network module to obtain initial saliency maps.
Because existing cross-modal feature fusion designs are mostly based on addition or concatenation, their structures are complex, the computation is large, and redundant noise is easily introduced. Inspired by the attention mechanism, the cross feature fusion module CFFM is therefore built on dot multiplication. As shown in FIG. 3, it is used to fuse the color image feature f_r ∈ R^(H×W×C), which provides vivid appearance and texture information, with the depth image feature f_d ∈ R^(H×W×C), which provides clear object shape, contour and spatial structure. Considering that shallow depth features contain a large amount of noise, the invention applies the cross feature fusion module at the deeper levels of the multi-level convolutional neural network module.
The cross feature fusion module fuses the color image feature f_r and the depth image feature f_d by dot multiplication. It comprises a first convolution and a second convolution: the first convolution performs feature extraction and channel compression on the color image feature f_r extracted by one branch of the multi-level convolutional neural network module, reducing the computation of subsequent processing, while the second convolution performs feature extraction and channel compression on the depth image feature f_d extracted by the other branch. The common features of f_r and f_d are then extracted by dot multiplication; the fused features have clear boundaries and semantic consistency. A third convolution then applies convolution and activation operations so that the fused features can be merged with the original color image feature f_r and depth image feature f_d, for example by restoring the channel number and adding the result to the original features. Through repeated cross feature fusion, the color image feature f_r and the depth image feature f_d gradually absorb each other's useful information so that they complement each other, reducing redundant information in the color image feature f_r and sharpening the boundaries of the depth image feature f_d. Finally, a 3 × 3 convolution restores the original channel number, and the result is added to the original color image feature f_r and depth image feature f_d to obtain refined features. The process can be expressed by the following formulas:
f_r = f_r + W_2(W_r(f_r) * W_d(f_d))

f_d = f_d + W_2(W_r(f_r) * W_d(f_d))

wherein W_r, W_d and W_2 are the network parameters of 3 × 3 convolutions used to compress and restore the channels.
The whole cross feature fusion module adopts a symmetric structure: the original color image feature f_r and depth image feature f_d extracted by the two branches of the multi-level convolutional neural network module are dot-multiplied, and the result is fed back into the corresponding branches. The more common information the two features share, the larger the product. The color image feature f_r passes detail information to the depth image feature f_d to refine its edges, and the depth image feature f_d passes saliency semantics to the color image feature f_r. Redundant information is discarded because it does not appear in the color and depth components at the same time, so the edges can be refined.
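A minimal PyTorch sketch of the cross feature fusion module implementing these formulas is given below; the channel compression ratio is an assumption:

```python
import torch.nn as nn

class CFFM(nn.Module):
    """Cross feature fusion module (sketch)."""
    def __init__(self, channels, reduce=4):
        super().__init__()
        mid = channels // reduce
        # W_r and W_d: 3x3 convolutions that extract features and compress channels.
        self.w_r = nn.Sequential(nn.Conv2d(channels, mid, 3, padding=1), nn.ReLU(inplace=True))
        self.w_d = nn.Sequential(nn.Conv2d(channels, mid, 3, padding=1), nn.ReLU(inplace=True))
        # W_2: 3x3 convolution that restores the original channel number.
        self.w_2 = nn.Sequential(nn.Conv2d(mid, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f_r, f_d):
        # Common features via element-wise (dot-product) multiplication.
        common = self.w_r(f_r) * self.w_d(f_d)
        fused = self.w_2(common)
        # Feed the common features back into both streams (residual addition).
        return f_r + fused, f_d + fused
```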
3. An Inception structure performs multi-scale fusion on the initial saliency maps and outputs a network-predicted saliency map; finally, a focal loss is computed from the network-predicted saliency map and the target saliency map to learn the optimal parameters of the image saliency detection model and obtain a trained image saliency detection model.
The Inception structure fuses the initial saliency maps output by the depth branch and the color branch and outputs the network-predicted saliency map. By connecting small and large convolution kernels in parallel, it achieves the intended fusion while compressing the number of model parameters.
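A possible PyTorch sketch of such an Inception-style fusion head follows; the branch kernel sizes (1 × 1, 3 × 3, 5 × 5) and channel widths are assumptions, since only the parallel use of small and large kernels is specified above:

```python
import torch
import torch.nn as nn

class InceptionFusion(nn.Module):
    """Multi-scale fusion of the color and depth initial saliency maps (sketch)."""
    def __init__(self, in_ch=2, mid_ch=16):
        super().__init__()
        # Parallel branches with small and large kernels.
        self.b1 = nn.Conv2d(in_ch, mid_ch, 1)
        self.b3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, mid_ch, 5, padding=2)
        self.out = nn.Conv2d(3 * mid_ch, 1, 1)   # single-channel predicted saliency map

    def forward(self, sal_rgb, sal_depth):
        x = torch.cat([sal_rgb, sal_depth], dim=1)                  # B x 2 x H x W
        x = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)  # multi-scale features
        return torch.sigmoid(self.out(x))
```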
The ordinary cross-entropy loss cannot cope with the imbalance between positive and negative samples and between foreground and background in real scenes. Therefore, the invention introduces the focal loss to solve this problem, whose formula is as follows:

L_fl = -α · y · (1 - ŷ)^γ · log(ŷ) - (1 - α) · (1 - y) · ŷ^γ · log(1 - ŷ)

wherein y and ŷ respectively denote the target saliency map and the network-predicted saliency map; γ is a constant that reduces the loss of easily classified samples so that the network pays more attention to hard samples; α is a balance factor that increases the contribution of the foreground to the loss function in order to balance the positive and negative samples.
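A minimal PyTorch sketch of this loss follows; the default values of γ and α are taken from the original Focal Loss paper and are assumptions, since no specific values are listed here:

```python
import torch

def focal_loss(pred, target, gamma=2.0, alpha=0.25, eps=1e-8):
    """Focal loss for a predicted saliency map `pred` in (0, 1) and a binary ground truth `target`."""
    pos = -alpha * (1 - pred).pow(gamma) * target * torch.log(pred + eps)
    neg = -(1 - alpha) * pred.pow(gamma) * (1 - target) * torch.log(1 - pred + eps)
    return (pos + neg).mean()
```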
And step three, inputting the RGB-D image to be processed into the trained image saliency detection model, and outputting a corresponding saliency detection result, namely a saliency map, through model calculation.
The model of the invention is implemented in PyTorch and trained on two GTX 1080Ti GPUs (11 GB) with the Adam optimizer; the training momentum, learning rate, weight decay and batch size are set to (0.9, 0.999), 0.0005, 1e-5 and 16, respectively. Since the model is end-to-end, no pre-training or other additional operations are required.
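A hypothetical optimizer configuration mirroring these hyper-parameters could look as follows; the one-layer network is only a stand-in for the full dual-stream model described above:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the dual-stream saliency model.
model = nn.Sequential(nn.Conv2d(4, 1, 3, padding=1), nn.Sigmoid())

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.0005,             # learning rate
    betas=(0.9, 0.999),    # training momentum terms
    weight_decay=1e-5,     # weight decay rate
)
batch_size = 16
```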
To verify the feasibility of the method, 1585 pictures from the NJU2000 data set are selected as a training set and 400 as a test set; 800 pictures from the NLPR data set as a training set and 200 as a test set; and 637 pictures from the STEREO data set as a training set and 160 as a test set. The experimental results in FIGS. 5-6 show that the proposed model consistently has advantages, accurately detects the salient regions of the image, and occupies fewer computing resources than other methods.
In the invention, Precision and Recall are used as evaluation indexes, and the P-R curve is drawn to evaluate the performance of the algorithm, as shown in FIG. 5; the calculation formulas are as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

wherein TP, FP, TN and FN respectively denote the numbers of true positives, false positives, true negatives and false negatives.
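A minimal NumPy sketch of the pixel-wise P-R computation is given below; the number of binarization thresholds is an assumption:

```python
import numpy as np

def precision_recall(pred, gt, num_thresholds=255):
    """Compute P-R pairs for a saliency map `pred` and ground truth `gt`, both float arrays in [0, 1]."""
    precisions, recalls = [], []
    gt = gt > 0.5
    for t in np.linspace(0, 1, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        fp = np.logical_and(binary, ~gt).sum()
        fn = np.logical_and(~binary, gt).sum()
        precisions.append(tp / (tp + fp + 1e-8))
        recalls.append(tp / (tp + fn + 1e-8))
    return np.array(precisions), np.array(recalls)
```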
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely examples and that many variations or modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is therefore defined by the appended claims.

Claims (6)

1. An RGBD image saliency detection method based on interactive feature fusion is characterized by comprising the following steps:
firstly, establishing an image sample set for training;
step two, establishing an image saliency detection model;
for each image in the image sample set, a multi-level convolutional neural network module first extracts features from the color image and the depth image at multiple levels; a cross feature fusion module performs multi-level dot-product fusion on the color and depth image features extracted by the deeper convolution levels to obtain initial saliency maps; an Inception structure then performs multi-scale fusion on the initial saliency maps to output a network-predicted saliency map; finally, a focal loss is computed from the network-predicted saliency map and the target saliency map to learn the optimal parameters of an image saliency detection model and obtain a trained image saliency detection model;
and step three, inputting the RGB-D image to be processed into the trained image saliency detection model, and outputting a corresponding saliency detection result, namely a saliency map, through model calculation.
2. The RGBD image saliency detection method based on interactive feature fusion of claim 1, characterized in that: the cross feature fusion module comprises a first convolution, a second convolution and a third convolution; the first convolution extracts features from the color image features, the second convolution extracts features from the depth image features, the common features of the color image features and the depth image features are extracted by dot multiplication and fused, and the third convolution then applies convolution and activation operations so that the fused features can be merged back into the original color image features and the original depth image features respectively.
3. The RGBD image saliency detection method based on interactive feature fusion of claim 2, characterized in that: the first convolution, the second convolution and the third convolution have the same structure.
4. The RGBD image saliency detection method based on interactive feature fusion of claim 1, characterized in that: the multi-level convolutional neural network module comprises two identical branches acting on the color image and the depth image respectively; it adopts an FCN structure with five convolution layers, where the first layer uses a standard convolution block and the remaining layers use global-local feature extraction convolution blocks;
the global-local feature extraction convolution block comprises a global branch and a local branch; the local branch first reduces the input feature map to 1/4 of its original size with a convolution of stride 2 and then extracts local features with two identical convolutions of stride 1; the global branch extracts global features with a bottleneck structure; finally, the extracted global and local features are fused by dot multiplication.
5. The RGBD image saliency detection method based on interactive feature fusion according to claim 4, characterized in that: the convolutions with stride 1 use 3 × 3 kernels and the ReLU activation function.
6. The RGBD image saliency detection method based on interactive feature fusion of claim 1, characterized in that: the focal loss function L_fl is defined as

L_fl = -α · y · (1 - ŷ)^γ · log(ŷ) - (1 - α) · (1 - y) · ŷ^γ · log(1 - ŷ)

wherein y and ŷ respectively denote the target saliency map and the network-predicted saliency map, γ denotes a constant, and α denotes a balance factor.
CN202111039181.5A 2021-09-06 2021-09-06 RGBD image saliency detection method based on interactive feature fusion Pending CN113963170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111039181.5A CN113963170A (en) 2021-09-06 2021-09-06 RGBD image saliency detection method based on interactive feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111039181.5A CN113963170A (en) 2021-09-06 2021-09-06 RGBD image saliency detection method based on interactive feature fusion

Publications (1)

Publication Number Publication Date
CN113963170A true CN113963170A (en) 2022-01-21

Family

ID=79461154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039181.5A Pending CN113963170A (en) 2021-09-06 2021-09-06 RGBD image saliency detection method based on interactive feature fusion

Country Status (1)

Country Link
CN (1) CN113963170A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445442A (en) * 2022-01-28 2022-05-06 杭州电子科技大学 Multispectral image semantic segmentation method based on asymmetric cross fusion
CN115359019A (en) * 2022-08-25 2022-11-18 杭州电子科技大学 Steel surface defect detection method based on interactive features and cascade features
CN115457259A (en) * 2022-09-14 2022-12-09 华洋通信科技股份有限公司 Image rapid saliency detection method based on multi-channel activation optimization
CN115457259B (en) * 2022-09-14 2023-10-31 华洋通信科技股份有限公司 Image rapid saliency detection method based on multichannel activation optimization
CN117593517A (en) * 2024-01-19 2024-02-23 南京信息工程大学 Camouflage target detection method based on complementary perception cross-view fusion network
CN117593517B (en) * 2024-01-19 2024-04-16 南京信息工程大学 Camouflage target detection method based on complementary perception cross-view fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination