NL2030745B1 - Computer system for saliency detection of rgbd images based on interactive feature fusion - Google Patents

Computer system for saliency detection of rgbd images based on interactive feature fusion

Info

Publication number
NL2030745B1
Authority
NL
Netherlands
Prior art keywords
image
feature
convolution
salience
color
Prior art date
Application number
NL2030745A
Other languages
Dutch (nl)
Inventor
Fang Zhijun
Zhao Xiaoli
Zhang Zhuorao
Chen Zheng
Original Assignee
Univ Shanghai Eng Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Shanghai Eng Science filed Critical Univ Shanghai Eng Science
Priority to NL2030745A priority Critical patent/NL2030745B1/en
Application granted granted Critical
Publication of NL2030745B1 publication Critical patent/NL2030745B1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A computer system for saliency detection of RGBD images based on interactive feature fusion: for each image in an image sample set, a multi-layer convolutional neural network module is first used to extract a multi-level color and depth image feature from a color image and a depth image respectively; a cross-feature fusion module is used to perform multi-level dot product fusion on the color and depth image feature extracted by deep convolution to obtain an initial salient image; an Inception structure is then used to perform multi-scale fusion on the initial salient image to output a network predicted salient image; finally, the network predicted salient image and a target salient image are used to solve a focus entropy loss function to learn an optimal parameter of the image saliency detection model and obtain a trained image saliency detection model, so as to perform saliency detection of a to-be-processed RGBD image.

Description

COMPUTER SYSTEM FOR SALIENCY DETECTION OF RGBD IMAGES
BASED ON INTERACTIVE FEATURE FUSION
TECHNICAL FIELD
[01] The present disclosure relates to a technical field of image processing, in particular to a computer system for saliency detection of RGBD images based on interactive feature fusion.
BACKGROUND ART
[02] In application fields such as autonomous driving, robotics and virtual reality, finding salient objects in a scene and filtering out information that is weakly related to the task is of great significance for reducing the computational complexity of a system and improving its ability to understand the scene, and is one of the core issues and research hotspots in the field of computer vision.
[03] In recent years, with the wide application of deep convolutional neural networks in the field of image processing, saliency detection has developed rapidly, and a large number of saliency models based on visual features such as color and brightness have been proposed. In “Visual saliency based on multiscale deep feature”, Li et al. use a deep neural network for the first time to build a saliency model based on multi-scale features; in “Deeply Supervised Salient Object Detection with Short Connections”, Hou et al. propose a DSS model, which uses a fully convolutional network (FCN) to extract multi-layer and multi-scale features and then fuses them by introducing a skip-layer structure; in “Attentive feedback network for boundary-aware salient object detection”, Feng et al. use a global perceptron module to refine the most salient features as a whole, and use an attention feedback module to transfer information between the corresponding encoder and decoder blocks.
[04] However, saliency detection of RGB images faces two major challenges: one is that when a target and the background have a similar appearance, it is difficult to distinguish them by relying only on RGB information; the other is that when the same object includes different colors, it is easily misjudged as different objects. A depth map includes rich spatial structure and three-dimensional layout information, which can provide a large number of additional clues to distinguish the target from the background while ensuring the integrity of the detection region. Therefore, the use of depth information can effectively improve the effect of saliency detection. In “An in depth view of saliency”, Ciptadi et al. introduce depth information on the basis of RGB for the first time and propose a saliency segmentation model based on RGB-D; in “Rgbd salient object detection: a benchmark and algorithms”, Peng et al. propose a multi-stage RGB-D model that simultaneously considers depth and appearance cues from low-level feature contrast, mid-level region grouping and high-level prior enhancement; in “Progressively complementarity-aware fusion network for RGB-D salient object detection”, Chen et al. design a complementary perceptual fusion module to learn color and depth complementary information and densely add layer-by-layer supervision from deep to shallow to gradually fuse multi-level information through a cascaded module; and in “Depth-induced multi-scale recurrent attention network for saliency detection”, Piao et al. propose a depth-induced multi-scale recurrent attention network, which uses depth refinement blocks including residual structures to fuse color and depth complementary information, combines multi-scale contextual features with depth information to accurately locate salient objects, and uses a recurrent attention module to further improve model performance.
[05] In summary, the existing RGB-D saliency detection methods mainly propose sub-networks based on a backbone network to learn color and depth complementary information and perform feature fusion, but most of these networks are very large, have many parameters, and are difficult to train.
SUMMARY
[06] The present disclosure provides a computer system for saliency detection of RGBD images based on interactive feature fusion and proposes a novel interactive dual-stream saliency detection framework: a global and local feature extraction convolution block (GL Block) is designed to obtain a global feature and guide local feature extraction, a dot product method is proposed to obtain common features of color images and depth images, and a cross-modal feature fusion module (CFFM) is built to cross-fuse the feature information of color images and depth images. The detection method performs saliency detection with high accuracy and few model parameters.
[07] The present disclosure can be realized through the following technical solutions:
[08] A computer system for saliency detection of RGBD images based on interactive feature fusion, includes: a processor, a memory, and a computer program stored on the memory and running on the processor, and when the processor executes the computer program, the following modules are executed:
[09] an image sample set establishing module, establishing an image sample set for training;
[10] a saliency detection model establishing module, establishing an image saliency detection model;
[11] for each image in the image sample set, first using a multi-layer convolutional neural network module to extract a multi-level color and depth image feature from a color image and a depth image respectively, and using a cross-feature fusion module to perform multi-level dot product fusion on the color and depth image feature extracted by deep convolution to obtain an initial salient image; then using an Inception structure to perform multi-scale fusion on the initial salient image to output a network predicted salient image; finally, using the network predicted salient image and a target salient image to solve a focus entropy loss function to learn an optimal parameter of the image saliency detection model and obtain a trained image saliency detection model;
[12] an output module, inputting a to-be-processed RGBD image into the trained image saliency detection model and outputting a corresponding saliency detection result, which is a saliency map, through a model calculation.
[13] Further, the cross-feature fusion module, which includes a first convolution and a second convolution, uses the first convolution to perform feature extraction on a color image feature and the second convolution to perform feature extraction on a depth image feature; a common feature of the color image feature and the depth image feature is extracted by a dot product method, fused and transformed, and then a third convolution is used to merge the fused feature with an original color image feature and an original depth image feature, respectively, through convolution and activation operations.
[14] Further, structures of the first convolution, the second convolution and the third convolution are the same.
[15] Further, the multi-layer convolutional neural network module includes two identical branches, which act on the color image and the depth image respectively, and both adopt an FCN structure comprising five layers of convolution, wherein the first convolution adopts a standard convolution block, and all other layers of convolution use a global-local feature extraction convolution block;
[16] the global-local feature extraction convolution block includes a global branch and a local branch; the local branch first reduces an input feature map to 1/4 of an original feature map with a convolution with a step size of 2, and then uses two identical convolutions with a step size of 1 to extract a local feature; the global branch adopts a bottleneck structure to extract a global feature; finally, the extracted global feature and local feature are fused by using a dot product method.
[17] Further, a size of a convolution kernel of the convolutions with the step size of 1 is 3x3, and an activation function is ReLU.
[18] Further, the focus entropy loss function $L(y,\hat{y})$ is set as:
[19] $$L(y,\hat{y})=\begin{cases}-\alpha\,(1-\hat{y})^{\gamma}\log(\hat{y}), & y=1\\ -(1-\alpha)\,\hat{y}^{\gamma}\log(1-\hat{y}), & y=0\end{cases}$$
[20] wherein $y$ and $\hat{y}$ represent the target salient image and the network predicted salient image respectively, $\gamma$ represents a constant, and $\alpha$ represents a balance factor.
[21] The beneficial technical effects of the present disclosure are:
[22] A novel interactive dual-stream saliency detection framework is adopted, which can well detect a salient region and generate an accurate saliency map, thereby improving the detection efficiency and accuracy of a saliency target. Comprehensive experiments on three public data sets, NJU2000, NLPR and STEREO, show that the present disclosure has a good detection effect on mainstream evaluation indicators. In addition, the method of the present disclosure is simple, reliable, easy to operate, easy to implement, and easy to popularize and apply.
BRIEF DESCRIPTION OF THE DRAWINGS
[23] FIG. 1 is a schematic structural diagram of a dual-stream network of the present disclosure;
[24] FIG. 2 is a schematic structural diagram of a global-local feature extraction convolution block (GL Block) of the present disclosure;
[25] FIG. 3 is a schematic structural diagram of a cross feature fusion module (CFFM) of the present disclosure;
[26] FIG. 4 is a schematic diagram of a comparison result of saliency detection by using a method of the present disclosure and other methods;
[27] FIG. 5 is a P-R curve comparison diagram of saliency detection by using a method of the present disclosure and other methods;
[28] FIG. 6 is a model size comparison diagram of saliency detection by using a method of the present disclosure and other methods.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[29] The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings and preferred embodiments.
[30] The present disclosure proposes a computer system for saliency detection of RGBD images based on interactive feature fusion, whose network framework adopts a dual-stream network as shown in FIG. 1. A proposed global-local feature extraction convolution block (GL Block) is used to obtain and fuse global and local features and replaces the original standard convolution block in the FCN to generate an initial saliency map; in order to obtain a common salient feature of color and depth information, a cross-feature fusion module (CFFM) based on a dot product method is proposed; considering that a shallow feature has more noise, the deeper layers of the FCN network in the present disclosure use the CFFM to cross-fuse color and depth features to reduce redundant features; finally, the initial saliency map is fused through an Inception structure to improve the scale adaptability of the network. Details are as follows:
[31] An image sample set establishing module:
[32] scales the color image, the depth map and the manually annotated saliency map of each RGB-D image in the image sample set together, so that a computing device can bear the computational load of the neural network; random cropping, horizontal flipping and other operations may also be applied jointly to increase the diversity of the data; the color image and the depth map in the image sample set are then normalized to highlight the foreground feature of an image.
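As a concrete illustration of this preprocessing, a minimal PyTorch/torchvision sketch is given below; the 224×224 target size and the normalization statistics are assumptions introduced for the example and are not specified in the disclosure.

```python
# Minimal preprocessing sketch for one RGB-D training sample (PyTorch).
# The 224x224 size and ImageNet normalization stats are illustrative assumptions.
import random
import torchvision.transforms.functional as TF

def preprocess(color, depth, gt, size=(224, 224)):
    # Scale the color image, depth map and annotated saliency map together.
    color, depth, gt = TF.resize(color, size), TF.resize(depth, size), TF.resize(gt, size)
    # Apply the same random horizontal flip to all three maps (random cropping could be added similarly).
    if random.random() < 0.5:
        color, depth, gt = TF.hflip(color), TF.hflip(depth), TF.hflip(gt)
    # Convert to tensors and normalize color to highlight the foreground feature.
    color = TF.normalize(TF.to_tensor(color), mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
    depth = TF.to_tensor(depth)   # single-channel depth, scaled to [0, 1]
    gt = TF.to_tensor(gt)
    return color, depth, gt
```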
[33] A saliency detection model establishing module:
[34] 1, for each image in the image sample set, first uses a multi-layer convolutional neural network module to extract a multi-level color and depth image feature from a color image and a depth image respectively.
[35] For a segmentation network, the larger the receptive field, the larger the range captured by the network, the more information that can be used for analysis, the better the segmentation effect. A receptive field of a convolutional layer located in a shallow layer is relatively narrow, which retains a large amount of detailed information, helping to refine a segmentation image; the receptive field of a deep convolutional layer is relatively wide, which can be used to learn some abstract features and improve the classification performance. A FCN network adopts a skip-level structure and makes full use of the shallow information to assist the gradual upsampling, so as to obtain the refined segmentation image. However, in FCN, an actual receptive field of fc7 layer is only 1/4 of a full image, not an entire image, which is not enough to complete the task well. In order to obtain a larger receptive field, methods of increasing a network depth and using large convolution kernels are usually used. However, capturing the global context information through the former will not only greatly increase the network burden, but also easily cause gradient explosion and gradient disappearance; the latter will lead to a sudden increase in the amount of calculation, which is not conducive to the increase of the network depth, and the calculation performance will also be reduced.
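For intuition about how slowly the receptive field grows when only small kernels are stacked, the sketch below computes the theoretical receptive field of a chain of convolutions using the standard recurrence r_out = r_in + (k - 1)·j_in, j_out = j_in·s; the layer configurations are illustrative only and are not the network of the disclosure.

```python
# Receptive-field recurrence for a stack of convolutions.
# The layer lists below are examples, not the FCN of the disclosure.
def receptive_field(layers):
    r, j = 1, 1                  # receptive field and jump (cumulative stride) at the input
    for k, s in layers:          # (kernel_size, stride) per layer
        r = r + (k - 1) * j
        j = j * s
    return r

print(receptive_field([(3, 2)] * 5))   # 63: five stride-2 blocks of 3x3 convs
print(receptive_field([(3, 1)] * 5))   # 11: five stride-1 3x3 convs grow very slowly
```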
[36] Based on the above problems, the present disclosure designs the global-local feature extraction convolution block (GL Block), which adopts a dual-branch structure to extract local and global features respectively, whose structure is shown in FIG. 2, and which is used in the dual-stream network shown in FIG. 1. The multi-layer convolutional neural network module includes two identical branches, which act on the color image and the depth image respectively and both include five convolution blocks, of which the first is the standard convolution block and the rest are the GL Block proposed by the present disclosure. Then, deconvolution is used for upsampling, and the shallow information is fused through skip-level connections. In this way, each convolution block can perform global feature extraction, which does not increase the network burden, ensures the calculation speed, and is conducive to the optimization of the entire network structure.
[37] The GL Block proposed by the present disclosure has the dual-branch structure, namely a local branch and a global branch, so as to extract a local feature and a global feature respectively. The local branch first reduces an input feature map to 1/4 of the original feature map with a convolutional layer with a step size of 2, a convolution kernel size of 3×3 and a ReLU activation function, and then uses two identical convolutions with a step size of 1 to extract the local feature. To reduce the amount of branch-network calculation, the global branch adopts a bottleneck structure, that is, a global average pooling layer is used to explicitly extract the global feature, whose purpose is to integrate the global spatial information of the entire image. After a series of convolution operations, Softmax is used to learn a global feature distribution, and finally the dot product method is used to fuse the global feature and the local feature.
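As a concrete illustration of this block, a minimal PyTorch sketch is given below; the channel widths and the bottleneck reduction ratio are assumptions introduced for the example, since the disclosure fixes only the 3×3 kernels, the strides and the ReLU/Softmax operations.

```python
import torch
import torch.nn as nn

class GLBlock(nn.Module):
    """Sketch of a global-local feature extraction block (GL Block).
    Channel widths and the bottleneck ratio are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        # Local branch: one stride-2 conv (halves each spatial side, i.e. 1/4 of the area),
        # followed by two identical stride-1 3x3 convs, all with ReLU.
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # Global branch: bottleneck built on global average pooling, ending in a
        # softmax over channels that acts as a learned global feature distribution.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        local = self.local(x)            # (B, out_ch, H/2, W/2)
        glob = self.global_branch(x)     # (B, out_ch, 1, 1)
        return local * glob              # dot-product (element-wise) fusion

# Usage: block = GLBlock(64, 128); y = block(torch.randn(1, 64, 56, 56))
```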
[38] 2, uses a cross-feature fusion module to perform multi-level dot product fusion on the color and depth image feature extracted by deep convolution to obtain an initial salient image.
[39] Since the design of existing cross-modal feature fusion methods is mostly based on addition or cascading, not only is the structure complex and the amount of calculation large, but redundant noise is also easily introduced. Inspired by an attention mechanism, the present disclosure adopts the dot product method to build the cross-feature fusion module (CFFM), as shown in FIG. 3, which is used to fuse the color image feature $f_c$, which carries distinct appearance and texture information, with the depth image feature $f_d$, which provides clear object shape, contour and spatial structure. Considering that the shallow depth feature contains a lot of noise, the present disclosure applies the cross-feature fusion module to the deeper layers of the multi-layer convolutional neural network module.
[40] The cross-feature fusion module, which includes a first convolution and a second convolution, uses the dot product method to fuse the color image feature $f_c$ and the depth image feature $f_d$. The first convolution performs feature extraction and channel compression on the color image feature $f_c$ extracted by one branch of the multi-layer convolutional neural network module, so as to reduce the calculation amount of the module and facilitate subsequent processing; at the same time, the second convolution performs feature extraction and channel compression on the depth image feature $f_d$ extracted by the other branch. A common feature of the color image feature $f_c$ and the depth image feature $f_d$ is then extracted by the dot product method, fused and transformed so that the fused feature has clear boundaries and semantic consistency, and a third convolution then merges the fused feature with the original color image feature $f_c$ and the original depth image feature $f_d$ through convolution and activation operations, the result being added back to the original feature once the channel count is restored. In this way, through multiple cross-feature fusions, the color image feature $f_c$ and the depth image feature $f_d$ gradually absorb each other's useful information and become complementary: the redundant information of the color image feature $f_c$ is reduced, and the boundaries of the depth image feature $f_d$ are sharpened. Finally, a 3×3 convolution is used to restore the original channel count, and the result is added to the original color image feature $f_c$ and depth image feature $f_d$ to obtain the refined features. The process can be expressed by the following formulas:
[41] $$\hat{f}_c = f_c + W_t\big(W_c(f_c)\cdot W_d(f_d)\big)$$
$$\hat{f}_d = f_d + W_t\big(W_c(f_c)\cdot W_d(f_d)\big)$$
[42] wherein $W_c$, $W_d$ and $W_t$ are the network parameters of the 3×3 convolutions used to compress and restore channels.
[43] The entire cross-feature fusion module adopts a symmetrical structure. After the dot product, the original color image feature $f_c$ and depth image feature $f_d$ extracted by the two branches of the multi-layer convolutional neural network module are introduced back into the corresponding branches. By multiplying the two, the shared information becomes larger: the color image feature $f_c$ transfers detail information to the depth image feature $f_d$ to refine an edge, and the depth image feature $f_d$ transfers saliency semantics to the color image feature $f_c$ to discard redundant information, so the edge is refined and redundant information appears in neither the color nor the depth component.
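Read together with the formulas above, the module can be sketched in PyTorch roughly as follows; the compressed channel width is an assumption, and the element-wise product stands in for the dot-product operation.

```python
import torch
import torch.nn as nn

class CFFM(nn.Module):
    """Sketch of the cross-feature fusion module (CFFM).
    The compressed channel count is an illustrative assumption."""
    def __init__(self, channels, compressed=None):
        super().__init__()
        compressed = compressed or channels // 2
        # W_c, W_d: 3x3 convolutions that extract features and compress channels.
        self.w_c = nn.Sequential(nn.Conv2d(channels, compressed, 3, padding=1), nn.ReLU(inplace=True))
        self.w_d = nn.Sequential(nn.Conv2d(channels, compressed, 3, padding=1), nn.ReLU(inplace=True))
        # W_t: 3x3 convolution that restores the original channel count.
        self.w_t = nn.Sequential(nn.Conv2d(compressed, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f_c, f_d):
        # Common feature of color and depth via the dot-product (element-wise) method.
        common = self.w_c(f_c) * self.w_d(f_d)
        fused = self.w_t(common)
        # Residual addition back onto the original color and depth features.
        return f_c + fused, f_d + fused

# Usage: cffm = CFFM(256); f_c2, f_d2 = cffm(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```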
[44] 3, uses an Inception structure to perform multi-scale fusion on the initial salient image to output a network predicted salient image; finally, uses the network predicted salient image and a target salient image to solve a focus entropy loss function to learn an optimal parameter of the image saliency detection model and obtain a trained image saliency detection model.
[45] The Inception structure is used to fuse the initial salient images of color and depth output by the color branch and the depth branch, and to output the network predicted salient image. The structure achieves the expected purpose by connecting small convolution kernels and large convolution kernels in parallel, while compressing the parameter amount of the model.
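A minimal PyTorch sketch of such a parallel small/large-kernel fusion head is shown below; the specific branch kernels (1×1, 3×3, 5×5) and channel counts are assumptions for illustration, since the disclosure only states that small and large kernels are connected in parallel.

```python
import torch
import torch.nn as nn

class InceptionFusion(nn.Module):
    """Sketch of an Inception-style multi-scale fusion head for the two
    initial saliency maps. Branch kernels and channels are illustrative assumptions."""
    def __init__(self, in_ch=2, mid_ch=8):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, mid_ch, 1)              # small kernel branch
        self.b3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, mid_ch, 5, padding=2)    # large kernel branch
        self.out = nn.Conv2d(3 * mid_ch, 1, 1)              # network predicted saliency map

    def forward(self, sal_color, sal_depth):
        x = torch.cat([sal_color, sal_depth], dim=1)                 # stack the two initial maps
        x = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)   # parallel multi-scale branches
        return torch.sigmoid(self.out(x))

# Usage: head = InceptionFusion(); s = head(torch.randn(1, 1, 224, 224), torch.randn(1, 1, 224, 224))
```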
[46] Because the ordinary cross-entropy loss function cannot solve the problem of unbalanced positive and negative samples and of unbalanced background and foreground in real scenes, the present disclosure introduces the focal loss (focus entropy loss) to solve this problem, and its formula is as follows:
[47] $$L(y,\hat{y})=\begin{cases}-\alpha\,(1-\hat{y})^{\gamma}\log(\hat{y}), & y=1\\ -(1-\alpha)\,\hat{y}^{\gamma}\log(1-\hat{y}), & y=0\end{cases}$$
[48] wherein $y$ and $\hat{y}$ represent the target salient image and the network predicted salient image respectively, $\gamma$ represents a constant, which reduces the loss of easy-to-classify samples and makes the network pay more attention to difficult samples, and $\alpha$ represents a balance factor, which increases the contribution of the foreground to the loss function to balance the positive and negative samples.
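Assuming that ŷ is the predicted saliency probability and y the binary ground truth, the formula above can be implemented as in the sketch below; the default values γ = 2 and α = 0.25 are common choices and are assumptions here, since the disclosure does not specify them.

```python
import torch

def focal_loss(y_hat, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal (focus entropy) loss for a predicted saliency map y_hat in (0, 1)
    and a binary target map y. The alpha/gamma defaults are illustrative assumptions."""
    y_hat = y_hat.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - y_hat).pow(gamma) * torch.log(y_hat)        # y = 1 (foreground)
    neg = -(1.0 - alpha) * y_hat.pow(gamma) * torch.log(1.0 - y_hat)  # y = 0 (background)
    return torch.where(y > 0.5, pos, neg).mean()
```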
[49] An output module inputs a to-be-processed RGBD image into the trained image saliency detection model and outputs a corresponding saliency detection result which is a saliency map through a model calculation.
[50] The model of the present disclosure is implemented based on PyTorch, the machine is configured with two GTX 1080Ti graphics cards (11 GB each), an Adam optimizer is used for training, and the training momentum, learning rate, weight decay rate and batch size are respectively set to (0.9, 0.999), 0.0005, 1E-5 and 16. Since the model of the present disclosure is an end-to-end model, no pre-training or other extra operations are required.
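The stated training settings translate directly into PyTorch as sketched below; the stand-in model and random data are placeholders used only to make the example self-contained, and `focal_loss` refers to the sketch given above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; only the optimizer and batch settings mirror the disclosure.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 1, 3, padding=1)   # color (3 ch) + depth (1 ch) stacked
    def forward(self, color, depth):
        return torch.sigmoid(self.conv(torch.cat([color, depth], dim=1)))

model = TinyNet()
train_set = TensorDataset(torch.rand(32, 3, 64, 64), torch.rand(32, 1, 64, 64),
                          (torch.rand(32, 1, 64, 64) > 0.5).float())

# Stated settings: Adam, betas (0.9, 0.999), lr 0.0005, weight decay 1e-5, batch size 16.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=1e-5)

for color, depth, gt in DataLoader(train_set, batch_size=16, shuffle=True):
    loss = focal_loss(model(color, depth), gt)      # focal loss sketched above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```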
[51] In order to verify the feasibility of the present disclosure, 1585 images are selected as a training set and 400 images as a test set on the NJU2000 data set; 800 images are selected as the training set and 200 images as the test set on the NLPR data set; 637 images are selected as the training set and 160 images as the test set on the STEREO data set. The experimental results in FIGS. 5 and 6 show that the model proposed by the present disclosure always has certain advantages, can accurately detect the salient region of images, and occupies fewer computing resources than other methods.
[52] The present disclosure adopts the Precision and Recall values as evaluation indicators and draws a P-R curve to evaluate the performance of the algorithm, as shown in FIG. 5; the calculation formulas are as follows:
[53] $$\text{Precision}=\frac{TP}{TP+FP},\qquad \text{Recall}=\frac{TP}{TP+FN}$$
[54] wherein, TP, FP, TN, FN represent the number of true positives, false positives,
true negatives, and false negatives, respectively.
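As an illustration, the sketch below computes precision and recall over a sweep of binarization thresholds to form a P-R curve; the 256-level threshold sweep is a common convention and an assumption here, not a detail taken from the disclosure.

```python
import numpy as np

def pr_curve(pred, gt, num_thresholds=256):
    """Precision/recall over binarization thresholds for one saliency map.
    pred in [0, 1], gt binary; the 256-threshold sweep is an assumed convention."""
    precisions, recalls = [], []
    gt = gt > 0.5
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        fp = np.logical_and(binary, ~gt).sum()
        fn = np.logical_and(~binary, gt).sum()
        precisions.append(tp / (tp + fp + 1e-8))   # Precision = TP / (TP + FP)
        recalls.append(tp / (tp + fn + 1e-8))      # Recall = TP / (TP + FN)
    return np.array(precisions), np.array(recalls)
```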
[55] Although the specific embodiments of the present disclosure have been described above, those skilled in the art should understand that these are only examples, and various changes and modifications may be made to these embodiments without departing from the principle and essence of the present disclosure, therefore, the scope of protection of the present disclosure is defined by the appended claims.

Claims (6)

1. A computer system for saliency detection of RGBD images based on interactive feature fusion, the system comprising: a processor, a memory, and a computer program stored in the memory and running on the processor, wherein, when the processor executes the computer program, the following modules are executed: an image sample set establishing module, which establishes an image sample set for training; a saliency detection model establishing module, which establishes an image saliency detection model; for each image in the image sample set, first using a multi-layer convolutional neural network module to extract a multi-level color and depth image feature from a color image and a depth image respectively, and using a cross-feature fusion module to perform multi-level dot product fusion on the color and depth image feature extracted by deep convolution to obtain an initial salient image; then using an Inception structure to perform multi-scale fusion on the initial salient image to output a network predicted salient image; finally, using the network predicted salient image and a target salient image to solve a focus entropy loss function to learn an optimal parameter of the image saliency detection model and obtain a trained image saliency detection model; and an output module, which inputs a to-be-processed RGB-D image into the trained image saliency detection model and outputs a corresponding saliency detection result, which is a saliency map, through a model calculation.

2. The computer system for saliency detection of RGBD images based on interactive feature fusion according to claim 1, wherein the cross-feature fusion module comprises a first convolution and a second convolution, uses the first convolution to perform feature extraction on a color image feature and the second convolution to perform feature extraction on a depth image feature, and wherein a common feature of the color image feature and the depth image feature is extracted by a dot product method, fused and transformed, after which a third convolution is used to merge the fused feature with an original color image feature and an original depth image feature, respectively, through convolution and activation operations.

3. The computer system for saliency detection of RGBD images based on interactive feature fusion according to claim 2, wherein structures of the first convolution, the second convolution and the third convolution are the same.

4. The computer system for saliency detection of RGBD images based on interactive feature fusion according to claim 1, wherein the multi-layer convolutional neural network module comprises two identical branches, which act on the color image and the depth image respectively, and both adopt an FCN structure comprising five layers of convolution, wherein the first convolution adopts a standard convolution block and all other layers of convolution use a global-local feature extraction convolution block; wherein the global-local feature extraction convolution block comprises a global branch and a local branch, the local branch first reducing an input feature map to 1/4 of an original feature map with a convolution with a step size of 2 and then using two identical convolutions with a step size of 1 to extract a local feature, the global branch adopting a bottleneck structure to extract a global feature, and finally the extracted global feature and local feature being fused by using a dot product method.

5. The computer system for saliency detection of RGBD images based on interactive feature fusion according to claim 4, wherein a size of a convolution kernel of the convolutions with the step size of 1 is 3×3, and wherein an activation function is ReLU.

6. The computer system for saliency detection of RGBD images based on interactive feature fusion according to claim 1, wherein the focus entropy loss function $L(y,\hat{y})$ is set as:
$$L(y,\hat{y})=\begin{cases}-\alpha\,(1-\hat{y})^{\gamma}\log(\hat{y}), & y=1\\ -(1-\alpha)\,\hat{y}^{\gamma}\log(1-\hat{y}), & y=0\end{cases}$$
wherein $y$ and $\hat{y}$ represent the target salient image and the network predicted salient image respectively, $\gamma$ represents a constant, and $\alpha$ represents a balance factor.
NL2030745A 2022-01-27 2022-01-27 Computer system for saliency detection of rgbd images based on interactive feature fusion NL2030745B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
NL2030745A NL2030745B1 (en) 2022-01-27 2022-01-27 Computer system for saliency detection of rgbd images based on interactive feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
NL2030745A NL2030745B1 (en) 2022-01-27 2022-01-27 Computer system for saliency detection of rgbd images based on interactive feature fusion

Publications (1)

Publication Number Publication Date
NL2030745B1 true NL2030745B1 (en) 2023-08-07

Family

ID=87569292

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2030745A NL2030745B1 (en) 2022-01-27 2022-01-27 Computer system for saliency detection of rgbd images based on interactive feature fusion

Country Status (1)

Country Link
NL (1) NL2030745B1 (en)

Similar Documents

Publication Publication Date Title
CN110956111A (en) Artificial intelligence CNN, LSTM neural network gait recognition system
CN111382677B (en) Human behavior recognition method and system based on 3D attention residual error model
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
Zhang et al. ReYOLO: A traffic sign detector based on network reparameterization and features adaptive weighting
US20220122351A1 (en) Sequence recognition method and apparatus, electronic device, and storage medium
CN113963170A (en) RGBD image saliency detection method based on interactive feature fusion
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110852330A (en) Behavior identification method based on single stage
Yuan et al. A lightweight network for smoke semantic segmentation
CN115171014B (en) Video processing method, video processing device, electronic equipment and computer readable storage medium
Xiang et al. Crowd density estimation method using deep learning for passenger flow detection system in exhibition center
CN111368707A (en) Face detection method, system, device and medium based on feature pyramid and dense block
Zhang et al. An industrial interference-resistant gear defect detection method through improved YOLOv5 network using attention mechanism and feature fusion
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN117953581A (en) Method and device for identifying actions, electronic equipment and readable storage medium
CN113920066A (en) Multispectral infrared inspection hardware detection method based on decoupling attention mechanism
NL2030745B1 (en) Computer system for saliency detection of rgbd images based on interactive feature fusion
CN116453192A (en) Self-attention shielding face recognition method based on blocking
Sang et al. Image recognition based on multiscale pooling deep convolution neural networks
CN114596609A (en) Audio-visual counterfeit detection method and device
CN117036658A (en) Image processing method and related equipment
CN114494978A (en) Pipeline-based parallel video structured inference method and system
CN112802026A (en) Deep learning-based real-time traffic scene semantic segmentation method