CN116310668A - Salient object detection algorithm based on dual-stream dense super-resolution - Google Patents

Salient object detection algorithm based on dual-stream dense super-resolution

Info

Publication number
CN116310668A
CN116310668A (application CN202211088752.9A)
Authority
CN
China
Prior art keywords
global
target detection
resolution
convolution
detection algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211088752.9A
Other languages
Chinese (zh)
Inventor
尹明臣 (Yin Mingchen)
邵晶丽 (Shao Jingli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Low Light Level Electronic Fusion Technology Research Institute Co ltd
Original Assignee
Suzhou Low Light Level Electronic Fusion Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Low Light Level Electronic Fusion Technology Research Institute Co., Ltd.
Priority to CN202211088752.9A
Publication of CN116310668A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements using pattern recognition or machine learning using neural networks
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Abstract

The invention discloses a salient object detection algorithm based on dual-stream dense super-resolution, comprising the following steps: a spatial dilated-convolution pyramid model is fused into the network so that it has a receptive field large enough to capture the features of targets of different sizes; a dense super-resolution module maintains high-resolution representations, connects all sub-networks in parallel, and realizes information interaction and multi-scale feature fusion among the sub-networks. The invention overcomes the insufficient extraction of the spatial information of salient targets in conventional RGB-only approaches, obtains more complete salient object detection results by fusing RGB and depth information, and resolves the image blurring caused by conventional sampling operations through the designed dense super-resolution module.

Description

Salient object detection algorithm based on dual-stream dense super-resolution
Technical Field
The invention relates to the field of detection algorithms, and in particular to a salient object detection algorithm based on dual-stream dense super-resolution.
Background
The purpose of salient object detection (Salient Object Detection, SOD) is to find the object regions in an image that most attract the human eye. Salient object detection is now commonly used in various computer vision tasks and exhibits excellent performance, for example in image segmentation, object recognition, and video retrieval. Traditional methods achieve image saliency detection mainly with hand-crafted features. However, such features are difficult to design for complex scenes and objects, are hard to adapt to new scenes, and generalize poorly. With the popularity of deep learning techniques and their successful application to different visual tasks, researchers have begun to explore new ideas for salient object detection and have achieved striking gains in effectiveness.
Wang et al (Wang L, Lu H, Ruan X, et al. Deep networks for saliency detection via local estimation and global search [C]. CVPR, 2015: 3183-3192) propose using two deep neural networks, for local estimation and global search respectively, to compensate for the insufficient extraction of global and detailed information by local and global models. Liu et al (Liu N, Han J. DHSNet: Deep hierarchical saliency network for salient object detection [C]. CVPR, 2016: 678-686) also adopt the idea of global and local fusion and provide a deep hierarchical saliency network. The network first explores features of the global structure through automatic learning to obtain a coarse global prediction, and then fuses local context information through a hierarchical recursive convolutional neural network to refine the details of the saliency map. To adequately capture context features, Liu et al (N. Liu, J. Han, and M.-H. Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. CVPR, pages 3089-3098, 2018) embed PiCANet into a UNet-based network to learn the relevant information carried by each pixel, which greatly improves salient object detection performance by integrating global context and multi-scale local context. Li et al (Li Z, Lang C, Chen Y, et al. Deep Reasoning with Multi-scale Context for Salient Object Detection [J]. CVPR, 2019) propose a lightweight network that first uses a fully convolutional network to extract multi-scale features and then uses a modified ShuffleNet to achieve fast and accurate prediction, for better feature fusion at reduced computational cost. Although the above models integrate global and local features and improve saliency detection, many challenges remain, for example: it is still difficult to accurately detect salient objects in complex scenes or when the objects are small.
With the development of depth sensors, researchers can easily capture depth images rich in spatial information, which offers significant advantages for improving SOD performance. For this reason, researchers have considered combining depth information such as three-dimensional layout and spatial structure with RGB information, i.e., jointly using RGB and Depth images for the salient object detection task. A number of RGB-D based saliency models have now been proposed. Pang et al (Y. Pang, L. Zhang, X. Zhao, and H. Lu, "Hierarchical dynamic filtering network for rgb-d salient object detection," in Proceedings of European Conference on Computer Vision, 2020, pp. 235-252) propose a simple and efficient hierarchical dynamic filtering network in which a dynamic dilated pyramid module and a hybrid enhancement loss function are used, respectively, to fuse cross-modal information at different scales and to produce regions with clearer edges and higher consistency. Li et al (G. Li, Z. Liu, M. Chen, Z. Bai, W. Lin, and H. Ling, "Hierarchical alternate interaction network for RGB-D salient object detection," IEEE Transactions on Image Processing, vol. 30, pp. 3528-3542, 2021) propose a hierarchical alternate interaction network (HAINet) for RGB-D SOD to mitigate the interference of depth images with SOD and accurately highlight salient objects in RGB images. The network mainly comprises three key parts: feature encoding, cross-modal alternating interaction, and saliency reasoning. Luo et al (A. Luo, X. Li, F. Yang, Z. Jiao, H. Cheng, and S. Lyu, "Cascade graph neural networks for rgb-d salient object detection," in Proceedings of European Conference on Computer Vision, 2020, pp. 346-364) designed cascade graph neural networks (Cas-GNN) based on graph techniques to model the relationships between multimodal information. Piao et al (Y. Piao, Z. Rong, M. Zhang, W. Ren, and H. Lu, "A2dele: Adaptive and attentive depth distiller for efficient rgb-d salient object detection," in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2020, pp. 9060-9069) propose a DF-GMM model that embeds a Gaussian mixture model into a deep neural network. The model uses a Gaussian mixture model to obtain a set of low-rank feature representations of the deep feature map, and then restores the low-rank representations to the original coordinate space to obtain a low-rank feature map, thereby alleviating the problem of discriminative-region diffusion on deep feature maps. Liu et al (N. Liu, N. Zhang, and J. Han, "Learning selective self-mutual attention for rgb-d saliency detection," in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2020, pp. 13753-13762) propose a novel Non-Local based intermediate fusion strategy that can accurately locate the subject target by fusing depth attention. At the same time, a new residual fusion module is introduced, and an attention mechanism is applied to the dual-stream CNN module to improve saliency detection performance.
Although existing RGB-D saliency detection methods perform well at detecting salient targets, such models merely employ various fusion schemes, for example early fusion and late fusion, which cannot exploit salient cues both within and across modalities for the final saliency estimation, and their training complexity is high.
Disclosure of Invention
To address the defects of existing salient object detection models, the invention provides a novel dual-stream dense super-resolution RGB-D saliency detection model.
In order to achieve the above purpose, the invention adopts the following technical scheme. The salient object detection algorithm based on dual-stream dense super-resolution comprises a dual-stream network based on RGB and Depth images, and is characterized in that: a spatial dilated-convolution pyramid model is fused into the network, ensuring that the network has a receptive field large enough to capture the features of targets of different sizes; high-resolution representations are maintained through a dense super-resolution module, all sub-networks are connected in parallel, and information interaction and multi-scale feature fusion among the sub-networks are realized; a feature extraction module with a dual-branch structure comprises a global feature extraction branch and a local feature extraction branch, the global feature extraction branch being used to extract global information and guide local feature extraction.
In a preferred embodiment of the present invention, the spatial dilated-convolution pyramid module comprises: a global average pooling layer that obtains image-level features, followed by a 1×1 convolution and bilinear interpolation back to the original size; one 1×1 convolution layer and three 3×3 dilated convolutions; the five features at different scales are concatenated along the channel dimension and then fed into a 1×1 convolution for fused output.
In a preferred embodiment of the present invention, the local feature extraction branch uses a convolution layer with a stride of 2 and a 3×3 kernel to reduce the input feature map to 1/4 of its original size, and then extracts local features with two convolution layers of stride 1.
In a preferred embodiment of the present invention, the global feature extraction branch mainly adopts a bottleneck structure to reduce the computational complexity of the network; specifically, global features are first extracted by global average pooling, the global feature distribution is then learned with a Softmax function, and finally the global and local features are multiplied element-wise to obtain the fused features.
In a preferred embodiment of the present invention, a Focal Loss function is further included, in which a parameter γ is added on top of BCE to reduce the loss contribution of easily classified samples, so that the model pays more attention to hard samples.
In a preferred embodiment of the invention, a balancing factor α is introduced to enhance the effect of the foreground on the loss function, balancing the positive and negative samples.
The invention overcomes the defects described in the background art and has the following beneficial effects:
(1) Through a dual-stream network structure with a spatial dilated-convolution pyramid model, the invention obtains more complete salient object detection results by fusing RGB and depth information, solving the insufficient extraction of salient-target spatial information from RGB images alone in the prior art.
(2) The invention resolves the image blurring caused by conventional sampling operations through the dense super-resolution module, which mainly uses parallel sub-networks and dense connections to fuse multi-scale features.
(3) Through the global and local feature extraction module with a dual-branch structure, the invention uses the global feature branch to extract global information and guide local feature extraction, solving the problem that the actual receptive field of existing saliency detection networks is far smaller than the theoretical receptive field, so that the global information of the image cannot be fully exploited.
(4) The invention introduces a Focal Loss function, which reduces the loss contribution of easily classified samples and makes the model focus more on hard samples.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a diagram of a super resolution model structure of a preferred embodiment of the present invention;
FIG. 2 is a block diagram of the spatial dilated-convolution pyramid module in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram of the global and local feature extraction modules of the preferred embodiment of the present invention;
FIG. 4 is a structural diagram of the salient object detection model based on dual-stream dense super-resolution according to a preferred embodiment of the present invention.
Detailed Description
Reference to "an embodiment," "one embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least some embodiments. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
The salient object detection algorithm based on dual-stream dense super-resolution comprises the following parts:
1. Super-resolution model: when deep learning is used for computer vision tasks, up-sampling and down-sampling operations are usually adopted to keep the resolution of the output saliency map consistent with the original image, but such operations easily blur the image and degrade detection accuracy. To further improve detection accuracy, Sun et al (Ju R, Ge L, Geng W, et al. Depth saliency based on anisotropic center-surround difference [C]. 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014: 1115-1119) propose a new architecture, the high-resolution network (HRNet), to ensure that the model maintains a high-resolution representation throughout; its structure is shown in FIG. 1. The network takes a high-resolution sub-network as the initial stage, and then gradually adds high-to-low-resolution sub-networks to form further stages. To reduce the complexity and depth of the network, the sub-networks are connected not in series but in parallel, from high resolution to low resolution. This structure not only maintains high resolution but also repeatedly exchanges information across the parallel multi-resolution sub-networks. Moreover, during feature fusion, unlike the common fusion of low-level and high-level representations, the high-resolution representation is enhanced by low-resolution representations of the same depth and similar level, realizing repeated multi-scale fusion.
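To make the parallel multi-resolution idea concrete, the following is a minimal PyTorch sketch of two parallel sub-networks with one cross-resolution exchange step. The class names, the two-branch layout, the channel widths (32 and 64), and fusion by element-wise addition are illustrative assumptions; the dense super-resolution module of the invention may use more branches, different widths, and denser connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Fuse a high-resolution and a low-resolution feature map in both directions."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then bilinear upsampling
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)
        # high -> low: strided 3x3 conv downsamples and matches channels
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[2:],
                           mode="bilinear", align_corners=False)
        down = self.high_to_low(x_high)
        # element-wise sums keep both branches alive at their own resolutions
        return F.relu(x_high + up), F.relu(x_low + down)

class TwoBranchHRBlock(nn.Module):
    """Two parallel sub-networks (high/low resolution) followed by one fusion step."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.branch_high = nn.Sequential(
            nn.Conv2d(c_high, c_high, 3, padding=1), nn.BatchNorm2d(c_high), nn.ReLU(inplace=True))
        self.branch_low = nn.Sequential(
            nn.Conv2d(c_low, c_low, 3, padding=1), nn.BatchNorm2d(c_low), nn.ReLU(inplace=True))
        self.exchange = ExchangeUnit(c_high, c_low)

    def forward(self, x_high, x_low):
        return self.exchange(self.branch_high(x_high), self.branch_low(x_low))

# example: the low-resolution branch runs at half the spatial size of the high one
block = TwoBranchHRBlock()
y_high, y_low = block(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
```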
2. Spatial dilated-convolution pyramid module: when a convolutional neural network extracts features from an image, the size of the receptive field often plays a critical role in what the feature map captures. Specifically, if the receptive field is too small, the network only acquires local features and cannot fully express the object as a whole; if the receptive field is too large, a large amount of irrelevant information is introduced, which harms salient-target detection. It is therefore important to design a model that can acquire multi-scale information to adapt to the various object sizes in an image and so extract object features properly.
Traditional CNNs basically rely on stacking convolution layers and on up- and down-sampling to enlarge the receptive field, but these operations tend to degrade detection accuracy. To enlarge the network's receptive field and introduce multi-scale information extraction, the invention provides a spatial dilated-convolution pyramid module. The module mainly uses dilated convolutions to build a pyramid model that captures multi-scale features. Dilated (atrous) convolution layers enlarge the receptive field, avoid the information loss caused by down-sampling, and effectively preserve the structural information of the image without reducing the feature size. In addition, dilated convolution keeps the running speed of the network to a certain extent, adding no extra parameters or computational burden.
The spatial dilated-convolution pyramid module mainly comprises the following parts: (1) a global average pooling layer that obtains image-level features, followed by a 1×1 convolution and bilinear interpolation back to the original size; (2) one 1×1 convolution layer and three 3×3 dilated convolutions; (3) the five features at different scales are concatenated along the channel dimension and then fed into a 1×1 convolution for fused output. Its structure is shown in FIG. 2.
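A minimal PyTorch sketch of such a module is given below; the dilation rates (2, 4, 8) and the ReLU activations are assumptions for illustration, since the text above fixes the branch layout but not the exact rates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedConvPyramid(nn.Module):
    """Spatial dilated-convolution pyramid: image-level branch, 1x1 branch,
    three 3x3 dilated branches, channel concatenation, 1x1 fusion."""
    def __init__(self, in_ch, out_ch, rates=(2, 4, 8)):
        super().__init__()
        # (1) image-level branch: global average pooling + 1x1 conv, upsampled in forward()
        self.image_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        # (2) one 1x1 convolution and three 3x3 dilated convolutions
        self.conv1x1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.dilated = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r),
                          nn.ReLU(inplace=True)) for r in rates])
        # (3) fuse the five concatenated scales with a 1x1 convolution
        self.fuse = nn.Conv2d(out_ch * (2 + len(rates)), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        img = F.interpolate(self.image_branch(x), size=(h, w),
                            mode="bilinear", align_corners=False)
        feats = [img, self.conv1x1(x)] + [branch(x) for branch in self.dilated]
        return self.fuse(torch.cat(feats, dim=1))
```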
3. Global and local feature extraction module: increasing network depth is the most common way to obtain a larger receptive field, but simply deepening the network to capture global context information is not feasible: it not only burdens the network but also makes gradients prone to exploding or vanishing. The invention therefore designs a global and local feature extraction block (Global and Local Feature Extraction Block, GLFE Block), shown in FIG. 3. The module adopts a dual-branch structure to extract local and global features separately; the main parameters of the two branches are listed in Table 1.
The local feature extraction branch mainly uses a convolution layer with a stride of 2 and a 3×3 kernel to reduce the input feature map to 1/4 of its original size, and then extracts local features with two convolution layers of stride 1. The global feature extraction branch mainly adopts a bottleneck structure to reduce the computational complexity of the network; specifically, global features are first extracted by global average pooling, the global feature distribution is then learned with a Softmax function, and finally the global and local features are multiplied element-wise to obtain the fused features.
TABLE 1: GLFE Block main parameters (table reproduced only as an image in the original publication)
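Because Table 1 appears only as an image in the original publication, the exact channel counts are not recoverable here. The sketch below therefore uses illustrative channel widths and an assumed bottleneck reduction ratio; it only shows the dual-branch layout described above: a local branch (one stride-2 3×3 convolution followed by two stride-1 convolutions) and a global branch (global average pooling, a bottleneck, Softmax distribution learning, and element-wise multiplication with the local features).

```python
import torch
import torch.nn as nn

class GLFEBlock(nn.Module):
    """Global and Local Feature Extraction block with a dual-branch structure."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        # local branch: one stride-2 3x3 conv (spatial area drops to 1/4 of the input),
        # followed by two stride-1 3x3 convs
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        # global branch: bottleneck over globally pooled features, Softmax over channels
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, in_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // reduction, out_ch, 1),
            nn.Softmax(dim=1))  # channel-wise distribution of the global context

    def forward(self, x):
        local_feat = self.local(x)             # B x out_ch x H/2 x W/2
        global_weight = self.global_branch(x)  # B x out_ch x 1 x 1
        # global information guides (re-weights) the local features
        return local_feat * global_weight
```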
4. Loss function: the loss function is an effective indicator of model performance. In the salient object detection task, the binary cross-entropy (BCE) loss is most commonly used; it is computed as follows:
L_BCE = − Σ_i [ g_i · log(p_i) + (1 − g_i) · log(1 − p_i) ]
where p_i denotes the predicted saliency probability of pixel i and g_i its ground-truth label.
For a dataset in which positive and negative samples are balanced, the BCE loss measures model performance effectively. However, uneven distribution of positive and negative samples is common in real scenes: when the background dominates the image, foreground pixels are easily ignored, so the BCE loss cannot handle the background-foreground imbalance. To mine difficult samples, the invention introduces a Focal Loss function, namely
L_FL = − Σ_i [ α · (1 − p_i)^γ · g_i · log(p_i) + (1 − α) · p_i^γ · (1 − g_i) · log(1 − p_i) ]
Focal Loss mainly adds a parameter γ on top of BCE to reduce the loss contribution of easily classified samples, so that the model pays more attention to hard samples. Meanwhile, to balance the positive and negative samples, a balancing factor α is introduced to enhance the contribution of the foreground to the loss function.
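A minimal sketch of this balanced Focal Loss follows, assuming predicted saliency probabilities p in (0, 1) and binary ground truth g; the defaults α = 0.25 and γ = 2 are commonly used values, not values stated in the text.

```python
import torch

def focal_loss(p, g, alpha=0.25, gamma=2.0, eps=1e-8):
    """Balanced Focal Loss: gamma down-weights easy samples, alpha balances
    foreground (positive) and background (negative) pixels."""
    p = p.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * g * torch.log(p)                  # foreground term
    neg = -(1.0 - alpha) * p ** gamma * (1.0 - g) * torch.log(1.0 - p)    # background term
    return (pos + neg).mean()
```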
In summary, for the saliency detection model based on dual-stream dense super-resolution, the invention mainly designs a dual-stream network based on RGB and Depth images; the model structure is shown in FIG. 4. Since salient objects occupy different proportions of the image, the invention fuses the spatial dilated-convolution pyramid to ensure that the network has a receptive field large enough to capture object features of different sizes. To compensate for the accuracy loss caused by the up- and down-sampling of conventional convolutional networks, a dense super-resolution module keeps the model at a high-resolution representation throughout, connects all sub-networks in parallel, and realizes information interaction and multi-scale feature fusion among the sub-networks. To address the problem that the actual receptive field of existing saliency detection networks is far smaller than the theoretical receptive field, so that the global information of the image cannot be fully exploited, the invention provides a global and local feature extraction module with a dual-branch structure, which uses the global feature branch to extract global information and guide local feature extraction.
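The end-to-end sketch below shows one way the pieces above could be wired together; it assumes the DilatedConvPyramid and GLFEBlock classes sketched earlier are in scope, and the single-convolution encoders, concatenation-based cross-modal fusion, and sigmoid head are placeholders for illustration, since FIG. 4 is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamSaliencyNet(nn.Module):
    """Illustrative dual-stream RGB-D saliency network built from the sketches above."""
    def __init__(self, feat_ch=64):
        super().__init__()
        # placeholder single-conv "encoders"; a real model would use deeper backbones
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_encoder = nn.Sequential(nn.Conv2d(1, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.rgb_pyramid = DilatedConvPyramid(feat_ch, feat_ch)
        self.depth_pyramid = DilatedConvPyramid(feat_ch, feat_ch)
        self.glfe = GLFEBlock(2 * feat_ch, feat_ch)
        self.head = nn.Conv2d(feat_ch, 1, 1)  # per-pixel saliency logit

    def forward(self, rgb, depth):
        f_rgb = self.rgb_pyramid(self.rgb_encoder(rgb))
        f_depth = self.depth_pyramid(self.depth_encoder(depth))
        fused = torch.cat([f_rgb, f_depth], dim=1)   # cross-modal fusion by concatenation
        sal = self.head(self.glfe(fused))
        # restore the input resolution before producing the sigmoid saliency map
        sal = F.interpolate(sal, size=rgb.shape[2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(sal)
```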
The preferred embodiments described above are intended to show that various changes and modifications can be made by a person skilled in the art on the basis of the above description without departing from the technical idea of the present invention. The technical scope of the present invention is therefore not limited to the description, but must be determined according to the scope of the claims.

Claims (6)

1. The salient object detection algorithm based on dual-stream dense super-resolution, comprising a dual-stream network based on RGB and Depth images, characterized in that:
a spatial dilated-convolution pyramid model is fused into the network, ensuring that the network has a receptive field large enough to capture the features of targets of different sizes;
high-resolution representations are maintained through a dense super-resolution module, all sub-networks are connected in parallel, and information interaction and multi-scale feature fusion among the sub-networks are realized;
a feature extraction module with a dual-branch structure comprises a global feature extraction branch and a local feature extraction branch, the global feature extraction branch being used to extract global information and guide local feature extraction.
2. The salient object detection algorithm based on dual-stream dense super-resolution of claim 1, characterized in that the spatial dilated-convolution pyramid module comprises: a global average pooling layer that obtains image-level features, followed by a 1×1 convolution and bilinear interpolation back to the original size; one 1×1 convolution layer and three 3×3 dilated convolutions; the five features at different scales are concatenated along the channel dimension and then fed into a 1×1 convolution for fused output.
3. The salient object detection algorithm based on dual-stream dense super-resolution of claim 1, characterized in that the local feature extraction branch uses a convolution layer with a stride of 2 and a 3×3 kernel to reduce the input feature map to 1/4 of its original size, and then extracts local features with two convolution layers of stride 1.
4. The salient object detection algorithm based on dual-stream dense super-resolution of claim 1, characterized in that the global feature extraction branch mainly adopts a bottleneck structure to reduce the computational complexity of the network; specifically, global features are extracted by global average pooling, the global feature distribution is learned with a Softmax function, and finally the global and local features are multiplied element-wise to obtain the fused features.
5. The salient object detection algorithm based on dual-stream dense super-resolution of claim 1, characterized in that the model further comprises a Focal Loss function, in which a parameter γ is added on top of BCE to reduce the loss contribution of easily classified samples, so that the model pays more attention to hard samples.
6. The salient object detection algorithm based on dual-stream dense super-resolution of claim 5, characterized in that a balancing factor α is introduced to enhance the contribution of the foreground to the loss function and balance the positive and negative samples.
CN202211088752.9A 2022-09-07 2022-09-07 Significance target detection algorithm based on double-current dense super-resolution Pending CN116310668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211088752.9A CN116310668A (en) 2022-09-07 2022-09-07 Significance target detection algorithm based on double-current dense super-resolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211088752.9A CN116310668A (en) 2022-09-07 2022-09-07 Significance target detection algorithm based on double-current dense super-resolution

Publications (1)

Publication Number Publication Date
CN116310668A (en) 2023-06-23

Family

ID=86824410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211088752.9A Pending CN116310668A (en) 2022-09-07 2022-09-07 Significance target detection algorithm based on double-current dense super-resolution

Country Status (1)

Country Link
CN (1) CN116310668A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824064A (en) * 2023-07-14 2023-09-29 湖南大学 Point cloud data model generation method and device, computing equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN111462013B (en) Single-image rain removing method based on structured residual learning
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
Xu et al. BANet: A balanced atrous net improved from SSD for autonomous driving in smart transportation
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN111414988B (en) Remote sensing image super-resolution method based on multi-scale feature self-adaptive fusion network
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN114612306A (en) Deep learning super-resolution method for crack detection
CN112581423A (en) Neural network-based rapid detection method for automobile surface defects
CN116310668A (en) Significance target detection algorithm based on double-current dense super-resolution
CN115830575A (en) Transformer and cross-dimension attention-based traffic sign detection method
CN114092824A (en) Remote sensing image road segmentation method combining intensive attention and parallel up-sampling
CN115984747A (en) Video saliency target detection method based on dynamic filter
Tao et al. F-pvnet: Frustum-level 3-d object detection on point–voxel feature representation for autonomous driving
Cheng et al. Dense-acssd for end-to-end traffic scenes recognition
Mu et al. Integration of gradient guidance and edge enhancement into super‐resolution for small object detection in aerial images
Li et al. Underwater object detection based on improved SSD with convolutional block attention
CN112348814A (en) High-resolution remote sensing image multi-scale sparse convolution change detection method
Xu et al. Research on pedestrian detection algorithm based on deep learning
An et al. Research review of object detection algorithms in vehicle detection
Hu A Review of Super-Resolution Reconstruction Based on Supervised Learning
Yang et al. Remote Sensing Image Object Detection Based on Improved YOLOv3 in Deep Learning Environment
CN115511968B (en) Two-dimensional hand posture estimation method, device, equipment and storage medium
Pan et al. MSFE-PANet: Improved YOLOv4-Based Small Object Detection Method in Complex Scenes

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination