CN114119993A - Salient object detection method based on self-attention mechanism - Google Patents

Salient object detection method based on self-attention mechanism

Info

Publication number
CN114119993A
CN114119993A (application CN202111278451.8A)
Authority
CN
China
Prior art keywords
features
feature
self
convolution
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111278451.8A
Other languages
Chinese (zh)
Inventor
陈福康
孙凤铭
袁夏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111278451.8A
Publication of CN114119993A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a salient object detection method based on a self-attention mechanism, which specifically comprises the following steps: extracting features from an input picture with a convolutional neural network to generate a group of feature maps, including shallow feature maps and deep feature maps that carry semantic information at different scales; fusing the shallow feature maps to generate low-level integrated features, and merging the deep feature maps to form high-level integrated features; constructing a self-attention module based on the self-attention mechanism, inputting the low-level and high-level integrated features into the module, capturing the features within each level, and exchanging semantic information between the two to form dependency relationships; and strengthening the obtained features through a multi-scale feature enhancement module, then sending the fused, enhanced features into a cascade decoder to generate the final salient object detection map. The invention reduces dependence on external information, is better at capturing the internal correlations of data or features, can accurately locate salient objects, and improves detection efficiency.

Description

Salient object detection method based on self-attention mechanism
Technical Field
The invention relates to the technical field of computer machine vision, in particular to a salient object detection method based on a self-attention mechanism.
Background
Computer vision uses computers to realize the human visual functions of perceiving, recognizing, and understanding three-dimensional scenes in the objective world. The human eye has a mechanism that rapidly scans the surrounding environment, filters out secondary information, and locates the main targets in a scene; this is called the human visual attention mechanism. In the field of computer vision, attention mechanisms that model and simulate the human visual system have attracted great academic interest and show broad application prospects. Research has shown that the human visual system pays more attention to certain objects in an image; these objects are referred to as salient objects.
Image saliency detection uses a computer to simulate the human visual attention mechanism and establishes a complete image saliency detection model, so that the regions a human eye would fixate on are detected accurately and quickly and are represented in the form of a saliency map. The purpose of salient object detection is to highlight the most visually distinctive parts of an image, which requires the computer to have a deep understanding of the semantics of the entire image and the detailed structure of objects. Traditionally, saliency detection relied on hand-crafted features, but such features cannot capture high-level semantics, so saliency prediction could not achieve satisfactory results. With the application of convolutional neural networks and the emergence of high-quality datasets, salient object detection based on deep learning has made substantial progress.
The fully convolutional neural network is currently the main approach to salient object detection: by stacking convolution and pooling layers, it gradually enlarges the receptive field and produces high-level semantics, and it plays an important role in salient object detection. To handle the complicated semantic information in images, several methods based on fully convolutional networks have been proposed to enhance detection capability. For example, document 1 (Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848.) enhances detection through multi-scale context fusion; document 2 (Ding H, et al. Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation [C]// IEEE/CVF Conference on Computer Vision & Pattern Recognition. IEEE, 2018.) proposes an encoder-decoder architecture to fuse multi-scale semantic features.
The attention mechanism is inspired by the human visual attention mechanism; the self-attention mechanism is one kind of attention mechanism and is also an important component of the Transformer. Document 3 (A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6000-6010.) describes this mechanism in detail; its characteristic is that dependencies are computed directly, regardless of the distance between features. Self-attention has achieved notable results, for example in document 4 (Tan Z, Wang M, Xie J, et al. Deep Semantic Role Labeling with Self-Attention [J]. 2017.) and document 5 (Verga P, Strubell E, McCallum A. Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction [J]. 2018.).
However, under challenging conditions such as small salient objects, complex semantic information, and low background contrast, existing methods still cannot accurately predict salient objects, owing to the lack of semantic information and weak feature dependencies, so the final salient object detection results are poor.
Disclosure of Invention
The invention aims to provide a salient object detection method based on a self-attention mechanism, which improves the salient object detection effect by fully utilizing image semantic information.
The technical solution for realizing the purpose of the invention is as follows: a salient object detection method based on a self-attention mechanism comprises the following steps:
step 1, performing feature extraction on an input picture by utilizing a convolutional neural network to generate a group of feature maps, wherein the feature maps comprise a shallow feature map and a deep feature map, and each feature map has semantic information with different scales;
step 2, fusing the shallow feature maps to generate low-level integrated features, and merging the deep feature maps to form high-level integrated features;
step 3, constructing a self-attention module based on a self-attention mechanism, inputting low-level integrated features and high-level integrated features into the self-attention module, respectively capturing the features in the high-level features and the low-level features, and exchanging semantic information to form a dependency relationship;
Step 4, strengthening the obtained features through a multi-scale feature enhancement module, and sending the fused and enhanced features into a cascade decoder to generate the final salient object detection map.
Further, the convolutional neural network in step 1 is specifically as follows:
A ResNet-101 convolutional neural network is selected as the picture feature extractor. It comprises 5 convolution groups, each containing a convolution computation process and a downsampling operation; the first convolution group contains only 1 convolution operation, while the 2nd to 5th convolution groups each consist of several identical residual units. The final global pooling layer and fully connected layer are discarded.
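For illustration, the following is a minimal PyTorch sketch of such a truncated ResNet-101 extractor. The use of torchvision, the stage grouping, and the pretrained weights are assumptions made for the sketch, not part of the disclosure; torchvision's layer grouping also differs slightly from the five downsampling groups described above.

```python
# A minimal sketch, assuming torchvision's ResNet-101 as the backbone.
import torch
import torchvision.models as models

class ResNet101Backbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Convolution group 1: a single convolution (plus BN/ReLU/pooling).
        self.stage1 = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        # Convolution groups 2-5: stacks of identical residual units.
        self.stage2, self.stage3 = net.layer1, net.layer2
        self.stage4, self.stage5 = net.layer3, net.layer4
        # net.avgpool and net.fc are intentionally discarded.

    def forward(self, x):
        r1 = self.stage1(x)
        r2 = self.stage2(r1)
        r3 = self.stage3(r2)
        r4 = self.stage4(r3)
        r5 = self.stage5(r4)
        return r1, r2, r3, r4, r5   # multi-scale feature maps R1..R5
```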
Further, the feature integration in step 2 is specifically as follows:
The shallow feature maps are fused into the low-level integrated features through a Concat operation, and the deep feature maps are merged into the high-level integrated features through the same Concat operation. The resulting high-level integrated features provide a large amount of semantic information, while the low-level integrated features contain spatial details that help refine object boundaries; the two contain complementary information.
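A minimal sketch of this integration step follows. The bilinear resizing to a common resolution before the Concat operation and the illustrative split of {R1..R5} into shallow and deep groups are both assumptions, since the text names only the Concat operation.

```python
# A minimal sketch, assuming resizing before concatenation.
import torch
import torch.nn.functional as F

def integrate(feature_maps, size):
    """Resize each map to `size` and concatenate along the channel axis."""
    resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feature_maps]
    return torch.cat(resized, dim=1)

# Illustrative grouping (an assumption):
# L_feat = integrate([r1, r2], size=r2.shape[2:])      # low-level integrated features
# H_feat = integrate([r3, r4, r5], size=r5.shape[2:])  # high-level integrated features
```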
Further, in step 3, a self-attention module is constructed based on a self-attention mechanism, specifically as follows:
The self-attention module comprises two 1 × 1 convolution layers and six 3 × 3 convolution layers. The two features are reduced in dimensionality through the embedding convolution layers and converted into queries, keys and values, while reshaping and pooling operations are applied; long-distance spatial interaction is then performed through dot-product attention computed between the features at the two levels, the result passes through a 1 × 1 convolution, and finally an element-wise summation with the original features yields the final representation of the long-range context information.
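The following hedged PyTorch sketch illustrates one way to realize this module with exactly two 1 × 1 and six 3 × 3 convolutions. The embedding width, the pooling stride on keys and values, and the channel bookkeeping for the residual sum are assumptions.

```python
# A hedged sketch of the cross self-attention exchange between H and L.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The two 1x1 embedding layers, one per feature. Keeping the embedding
        # width equal to `channels` is an assumption so the residual sum with
        # the original features type-checks.
        self.embed_h = nn.Conv2d(channels, channels, 1)
        self.embed_l = nn.Conv2d(channels, channels, 1)
        # The six 3x3 layers: query/key/value per feature.
        self.hq = nn.Conv2d(channels, channels, 3, padding=1)
        self.hk = nn.Conv2d(channels, channels, 3, padding=1)
        self.hv = nn.Conv2d(channels, channels, 3, padding=1)
        self.lq = nn.Conv2d(channels, channels, 3, padding=1)
        self.lk = nn.Conv2d(channels, channels, 3, padding=1)
        self.lv = nn.Conv2d(channels, channels, 3, padding=1)

    @staticmethod
    def attend(q, k, v):
        b, c, h, w = q.shape
        # Pool keys/values to shrink the attention matrix (the pooling
        # operation in the text; the stride of 2 is an assumption).
        k = F.avg_pool2d(k, 2).flatten(2)      # B x C x Nk
        v = F.avg_pool2d(v, 2).flatten(2)      # B x C x Nk
        q = q.flatten(2)                       # B x C x Nq
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # B x Nq x Nk
        return (v @ attn.transpose(1, 2)).view(b, c, h, w)

    def forward(self, H, L):
        h, l = self.embed_h(H), self.embed_l(L)
        # H attends to L and vice versa, then element-wise residual sums.
        H_out = H + self.attend(self.hq(h), self.lk(l), self.lv(l))
        L_out = L + self.attend(self.lq(l), self.hk(h), self.hv(h))
        return H_out, L_out
```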
Further, the multi-scale feature enhancement module in step 4 is specifically as follows:
The multi-scale feature enhancement module comprises one 1 × 1 convolution layer and a group of 3 × 3 convolution layers whose dilation rates are 2, 4 and 6 respectively; the feature maps reduced by the 3 × 3 convolution layers undergo a Concat operation, and the result is point-wise multiplied with the feature map reduced by the 1 × 1 convolution layer to form the enhanced map;
the multi-scale feature enhancement module extracts spatial information at different scales, enlarges the receptive field with dilated (atrous) convolution, and combines the multi-scale information into a fused, optimized feature output; the two groups of features are then predicted in a cascade decoding manner to form the prediction map.
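A hedged sketch of this module follows; the channel counts are assumptions, while the dilation rates 2, 4 and 6 and the Concat-then-product structure follow the description above.

```python
# A hedged sketch of the multi-scale feature enhancement module.
import torch
import torch.nn as nn

class MultiScaleEnhance(nn.Module):
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, 3 * mid_ch, 1)  # the 1x1 branch
        # Three 3x3 dilated (atrous) convolutions, rates 2, 4, 6.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r)
            for r in (2, 4, 6)
        ])

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # Concat
        return self.reduce(x) * multi   # point-wise (Hadamard) product
```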
Compared with the prior art, the invention has the following significant advantages: (1) the self-attention mechanism is used effectively to exchange and fuse semantic information and spatial edge information, and the final representation of long-range context information is obtained through dot-product attention; (2) the multi-scale feature enhancement module applies dilated convolution to the feature maps at different dilation rates, effectively enlarging the receptive field while preserving semantic information and improving the efficiency of salient object detection; (3) the final prediction map is obtained with a cascade decoder, and experiments show that the method achieves better results on 3 evaluation metrics over 5 public datasets than current state-of-the-art salient object detection methods.
Drawings
FIG. 1 is an overall framework diagram of a neural network model based on the self-attention mechanism in the present invention.
FIG. 2 is a block diagram of the self-attention module of the present invention.
FIG. 3 is a diagram of a multi-scale feature enhancement method according to the present invention.
FIG. 4 is a block diagram of a multi-scale feature enhancement module according to the present invention.
Detailed Description
The invention relates to a salient object detection method based on a self-attention mechanism, which comprises the following steps:
step 1, performing feature extraction on an input picture by utilizing a convolutional neural network to generate a group of feature maps, wherein the feature maps comprise a shallow feature map and a deep feature map, and each feature map has semantic information with different scales;
step 2, fusing the shallow feature maps to generate low-level integrated features, and merging the deep feature maps to form high-level integrated features;
step 3, constructing a self-attention module based on a self-attention mechanism, inputting low-level integrated features and high-level integrated features into the self-attention module, respectively capturing the features in the high-level features and the low-level features, and exchanging semantic information to form a dependency relationship;
Step 4, strengthening the obtained features through a multi-scale feature enhancement module, and sending the fused and enhanced features into a cascade decoder to generate the final salient object detection map.
As a specific example, step 1 is as follows: features are extracted from the input picture with a convolutional neural network, generating a group of feature maps R1, R2, R3, R4 and R5, which range from low-level to high-level feature maps carrying semantic information at different scales. The shallow feature maps are then fused to generate the low-level integrated features L, and the deep feature maps are merged to form the high-level integrated features H. The integrated features are input into the self-attention module built on the self-attention mechanism, the obtained features are then strengthened through the multi-scale feature enhancement module, and the prediction map is finally obtained.
The convolutional neural network is specifically as follows:
A ResNet-101 convolutional neural network is selected as the picture feature extractor. It comprises 5 convolution groups, each containing a convolution computation process and a downsampling operation; the first convolution group contains only 1 convolution operation, while the 2nd to 5th convolution groups each consist of several identical residual units. The final global pooling layer and fully connected layer are discarded.
As a specific example, step 2 is as follows: the high-level integrated features H formed in step 1 provide a large amount of semantic information, the low-level integrated features L contain spatial details that help refine object boundaries, and H and L contain complementary information. The self-attention module exploits the fact that the self-attention mechanism reduces dependence on external information and is better at capturing the internal correlations of data or features: the two features are reduced in dimensionality through embedding convolution layers, long-distance spatial interaction is performed through dot-product attention computed between the features at the two levels, and finally the result is summed element-wise with the original features to obtain the final representations H' and L' of the long-range context information.
The method comprises the following steps of feature integration:
The shallow feature maps are fused into the low-level integrated features through a Concat operation, and the deep feature maps are merged into the high-level integrated features through the same Concat operation. The resulting high-level integrated features provide a large amount of semantic information, while the low-level integrated features contain spatial details that help refine object boundaries; the two contain complementary information.
As a specific example, step 3 is as follows: the high-level integrated features provide a large amount of semantic information, and the low-level integrated features contain spatial details that help refine object boundaries; the two contain complementary information. The self-attention module exploits the fact that the self-attention mechanism reduces dependence on external information and is better at capturing the internal correlations of data or features: the two features are reduced in dimensionality through embedding convolution layers, long-distance spatial interaction is performed through dot-product attention computed between the features at the two levels, and finally the result is summed element-wise with the original features to obtain the final representation of the long-range context information.
The self-attention module comprises two 1 × 1 convolution layers and six 3 × 3 convolution layers. The two features are reduced in dimensionality through the embedding convolution layers and converted into queries, keys and values, while reshaping and pooling operations are applied; long-distance spatial interaction is then performed through dot-product attention computed between the features at the two levels, the result passes through a 1 × 1 convolution, and finally an element-wise summation with the original features yields the final representation of the long-range context information.
As a specific example, step 4 is as follows: after the feature extraction of steps 1 and 2 and the dot-product attention computation of step 3, H' and L' are sent to the multi-scale feature enhancement module, which extracts spatial information at different scales, enlarges the receptive field with dilated convolution, and combines the multi-scale information into a fused, optimized feature output; the two groups of features are then predicted in a cascade decoding manner to form the prediction map.
The multi-scale feature enhancement module is specifically as follows:
The multi-scale feature enhancement module comprises one 1 × 1 convolution layer and a group of 3 × 3 convolution layers whose dilation rates are 2, 4 and 6 respectively; the feature maps reduced by the 3 × 3 convolution layers undergo a Concat operation, and the result is point-wise multiplied with the feature map reduced by the 1 × 1 convolution layer to form the enhanced map.
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Examples
As shown in fig. 1 to 4, the salient object detection method based on the self-attention mechanism of the present invention mainly includes the following steps:
step a: and extracting multi-scale convolution characteristics by using a multi-scale characteristic extraction module.
ResNet-101 is selected as the picture feature extractor; its five convolution modules are retained and the final global pooling layer and fully connected layer are discarded to fit our practical needs, yielding multi-scale feature maps R1, R2, R3, R4 and R5, which range from low-level detail maps to high-level feature maps carrying semantic information at different scales. We fuse the shallow feature maps to generate the low-level integrated features (denoted L) and merge the deep feature maps together to form the high-level integrated features (denoted H).
Step b: a self-attention module is utilized to capture features in the high-level features and the low-level features respectively and exchange semantic information to form the dependency relationships.
In the self-attention module, the high-level integrated features H provide a large amount of semantic information, while the low-level integrated features L contain spatial details that help refine object boundaries; L and H contain complementary information. The two features are reduced in dimensionality through embedding convolution layers and are denoted H ∈ R^(C×H×W) and L ∈ R^(C×H×W). Convolution layers convert the features into queries q, keys k and values v, denoted H_q, H_k, H_v and L_q, L_k, L_v respectively. To keep the embedding cost-effective while still capturing spatial information, we use a 1 × 1 convolution to reduce the channel dimensionality of the query, key and value embedding layers, and then use 3 × 3 convolution layers to extract spatial information. We then compute dot-product attention between the two features to capture their long-range relationship. The calculation process is as follows:
DA(H_q, L_k, L_v) = softmax((H_q)^T L_k) (L_v)^T
DA(L_q, H_k, H_v) = softmax((L_q)^T H_k) (H_v)^T
The interaction between the query embedding layer and the key embedding layer forms a spatial attention matrix that models the spatial relationship between any two pixels of the features. The attention matrix is then matrix-multiplied with the value embedding layer, and the result of the multiplication is summed element-wise with the original features to obtain the final representations H' and L' of the long-range context information. The calculation process is as follows:

H' = H + DA(H_q, L_k, L_v)
L' = L + DA(L_q, H_k, H_v)
step c: and c, performing multi-scale feature reinforcement on the feature map obtained in the step b, and then sending the reinforced features into a cascade decoder to generate a final salient target detection map.
The H' and L' feature maps are input into the multi-scale feature enhancement module and processed in parallel by four convolution branches: the first performs average pooling followed by a 1 × 1 convolution for channel transformation, and the remaining three perform dilated convolution with dilation rates of 2, 4 and 6. The outputs of the four branches are then concatenated (Concat) to obtain the enhanced feature maps, which finally pass through cascade decoding to obtain the prediction map.
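For illustration only, the following is one plausible minimal reading of the cascade decoding step (a coarse map predicted from H' and refined with L'); the patent does not detail the decoder, so every layer shape here is an assumption.

```python
# A heavily hedged sketch of a cascade decoder, under assumed layer shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeDecoder(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.coarse = nn.Conv2d(ch, 1, 3, padding=1)   # coarse saliency from H'
        self.refine = nn.Sequential(
            nn.Conv2d(ch + 1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),            # refined saliency
        )

    def forward(self, h_feat, l_feat):
        coarse = self.coarse(h_feat)
        coarse_up = F.interpolate(coarse, size=l_feat.shape[2:],
                                  mode="bilinear", align_corners=False)
        # Cascade: the low-level branch refines the upsampled coarse map.
        fine = self.refine(torch.cat([l_feat, coarse_up], dim=1))
        return torch.sigmoid(fine + coarse_up)
```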
Description of experimental procedures and results:
the present invention first trains the proposed model using the DUTS-TR dataset. It contains 10533 images with high quality pixel-level annotations. The training set is increased by horizontal-vertical flipping and image cropping to alleviate the overfitting problem. And obtaining a final network model through pre-training and fine-tuning.
After training is completed, the network model is evaluated on five benchmark datasets widely used in the saliency detection field: ECSSD, DUT-OMRON, HKU-IS, PASCAL-S, and DUTS-TE. All of these datasets were manually labeled at the pixel level for quantitative evaluation. Evaluation metrics include precision-recall (PR) curves, the F-measure, and the mean absolute error (MAE). The PR curve is a standard indicator for evaluating saliency performance. The F-measure, denoted F_β, is an overall performance metric obtained as the weighted harmonic mean of precision and recall. The MAE measures the average difference between the predicted saliency map and the ground truth. Compared with existing methods, the proposed method achieves good results on 3 evaluation metrics over 5 public datasets, accurately locating salient objects in each dataset and segmenting out complete salient objects.
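For reference, hedged implementations of the two scalar metrics follow; β² = 0.3 is the value conventionally used in salient object detection, and the fixed-threshold binarization is an assumption about the exact evaluation protocol.

```python
# A hedged sketch of MAE and the F-measure on numpy saliency maps in [0, 1].
import numpy as np

def mae(pred, gt):
    """Mean absolute error between the predicted map and the ground truth."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F_beta = (1 + beta^2) P R / (beta^2 P + R), with assumed threshold."""
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```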
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A salient object detection method based on a self-attention mechanism is characterized by comprising the following steps:
step 1, performing feature extraction on an input picture by utilizing a convolutional neural network to generate a group of feature maps, wherein the feature maps comprise a shallow feature map and a deep feature map, and each feature map has semantic information with different scales;
step 2, fusing the shallow feature maps to generate low-level integrated features, and merging the deep feature maps to form high-level integrated features;
step 3, constructing a self-attention module based on a self-attention mechanism, inputting low-level integrated features and high-level integrated features into the self-attention module, respectively capturing the features in the high-level features and the low-level features, and exchanging semantic information to form a dependency relationship;
Step 4, strengthening the obtained features through a multi-scale feature enhancement module, and sending the fused and enhanced features into a cascade decoder to generate the final salient object detection map.
2. The salient object detection method based on the self-attention mechanism as claimed in claim 1, wherein the convolutional neural network in step 1 is as follows:
selecting a ResNet-101 convolutional neural network as the picture feature extractor, wherein the ResNet-101 network comprises 5 convolution groups, each containing a convolution computation process and a downsampling operation, the first convolution group containing only 1 convolution operation, the 2nd to 5th convolution groups each consisting of several identical residual units, and the final global pooling layer and fully connected layer being discarded.
3. The salient object detection method based on the self-attention mechanism as claimed in claim 1, wherein the feature integration in step 2 is as follows:
fusing the shallow feature maps into the low-level integrated features through a Concat operation, and merging the deep feature maps into the high-level integrated features through the same Concat operation, wherein the resulting high-level integrated features provide a large amount of semantic information, the low-level integrated features contain spatial details that help refine object boundaries, and the two contain complementary information.
4. The salient object detection method based on the self-attention mechanism as claimed in claim 1, wherein the self-attention module is constructed based on the self-attention mechanism in step 3, and specifically comprises the following steps:
the self-attention module comprises two 1 × 1 convolution layers and six 3 × 3 convolution layers; the two features are reduced in dimensionality through the embedding convolution layers and converted into queries, keys and values, while reshaping and pooling operations are applied; long-distance spatial interaction is then performed through dot-product attention computed between the features at the two levels, the result passes through a 1 × 1 convolution, and finally an element-wise summation with the original features yields the final representation of the long-range context information.
5. The salient object detection method based on the self-attention mechanism as claimed in claim 1, wherein the multi-scale feature enhancement module in step 4 is specifically as follows:
the multi-scale feature enhancement module comprises one 1 × 1 convolution layer and a group of 3 × 3 convolution layers whose dilation rates are 2, 4 and 6 respectively; the feature maps reduced by the 3 × 3 convolution layers undergo a Concat operation, and the result is point-wise multiplied with the feature map reduced by the 1 × 1 convolution layer to form the enhanced map;
the multi-scale feature enhancement module extracts spatial information at different scales, enlarges the receptive field with dilated convolution, and combines the multi-scale information into a fused, optimized feature output; the two groups of features are then predicted in a cascade decoding manner to form the prediction map.
CN202111278451.8A 2021-10-30 2021-10-30 Salient object detection method based on self-attention mechanism Pending CN114119993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111278451.8A CN114119993A (en) 2021-10-30 2021-10-30 Salient object detection method based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111278451.8A CN114119993A (en) 2021-10-30 2021-10-30 Salient object detection method based on self-attention mechanism

Publications (1)

Publication Number Publication Date
CN114119993A (en) 2022-03-01

Family

ID=80380031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111278451.8A Pending CN114119993A (en) 2021-10-30 2021-10-30 Salient object detection method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114119993A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115067945A (en) * 2022-08-22 2022-09-20 深圳市海清视讯科技有限公司 Fatigue detection method, device, equipment and storage medium
CN115424023A (en) * 2022-11-07 2022-12-02 北京精诊医疗科技有限公司 Self-attention mechanism module for enhancing small target segmentation performance
CN115424023B (en) * 2022-11-07 2023-04-18 北京精诊医疗科技有限公司 Self-attention method for enhancing small target segmentation performance
CN117492398A (en) * 2023-11-16 2024-02-02 北京雷格讯电子股份有限公司 High-speed data acquisition system and acquisition method thereof
CN117492398B (en) * 2023-11-16 2024-05-28 北京雷格讯电子股份有限公司 High-speed data acquisition system and acquisition method thereof

Similar Documents

Publication Publication Date Title
Xu et al. RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Chen et al. EF-Net: A novel enhancement and fusion network for RGB-D saliency detection
Cong et al. Does thermal really always matter for RGB-T salient object detection?
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN114119993A (en) Salient object detection method based on self-attention mechanism
Ding et al. DCU-Net: a dual-channel U-shaped network for image splicing forgery detection
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN110796026A (en) Pedestrian re-identification method based on global feature stitching
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
Huang et al. TISNet‐Enhanced Fully Convolutional Network with Encoder‐Decoder Structure for Tongue Image Segmentation in Traditional Chinese Medicine
CN116863319B (en) Copy mobile tamper detection method based on cross-scale modeling and alternate refinement
Zhou et al. Attention transfer network for nature image matting
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Fang et al. Context enhancing representation for semantic segmentation in remote sensing images
Ge et al. WGI-Net: A weighted group integration network for RGB-D salient object detection
Zhu et al. DFTR: Depth-supervised fusion transformer for salient object detection
Wang et al. Msfnet: multistage fusion network for infrared and visible image fusion
Yao et al. Double cross-modality progressively guided network for RGB-D salient object detection
Su et al. Physical model and image translation fused network for single-image dehazing
Zhou et al. CMPFFNet: Cross-modal and progressive feature fusion network for RGB-D indoor scene semantic segmentation
CN114445620A (en) Target segmentation method for improving Mask R-CNN
CN114332122A (en) Cell counting method based on attention mechanism segmentation and regression
CN116935178A (en) Cross-modal image fusion method based on multi-scale hole attention
Ou et al. A scene segmentation algorithm combining the body and the edge of the object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination