CN111598108A - Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control - Google Patents

Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control

Info

Publication number
CN111598108A
CN111598108A
Authority
CN
China
Prior art keywords
attention
convolution
neural network
scale
stereo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010319916.9A
Other languages
Chinese (zh)
Inventor
Yun Liu (刘云)
Xinyu Zhang (张鑫禹)
Mingming Cheng (程明明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202010319916.9A priority Critical patent/CN111598108A/en
Publication of CN111598108A publication Critical patent/CN111598108A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A fast salient object detection method using a multi-scale neural network based on stereo attention control. The method aims to design a lightweight convolutional neural network for salient object detection. The method extracts multi-scale convolution features through a multi-branch structure, in which each branch is a depthwise separable convolution with a different dilation rate; the convolution features of all branches are summed, and a stereo attention unit computes an attention map for each branch; each computed attention map is then multiplied with the features of its branch, the multiplication results of all branches are summed, and a residual connection is added, forming a multi-scale convolution module controlled by stereo attention; finally, the multi-scale modules are stacked to form a deep convolutional neural network that performs salient object detection on natural images. Experiments show that, compared with existing methods, the method is faster, has fewer parameters, requires less computation, and achieves similar accuracy.

Description

Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a fast salient object detection method using a multi-scale neural network controlled by stereo attention.
Background
Salient object detection, also referred to as saliency detection, aims to detect the most visually distinctive objects or regions in natural images. Saliency detection techniques have many applications in computer vision, such as image retrieval, image segmentation, object detection, object tracking, scene classification, and content-based image editing. Traditional salient object detection methods mainly rely on manually designed features and prior knowledge, such as image contrast, texture features, and the observation that salient objects often appear at the center of an image, but these methods usually lack high-level semantic information. Recently, thanks to the great progress of deep learning, the accuracy of saliency detection based on convolutional neural networks has improved continuously.
However, the improvement in accuracy comes at a huge price: current convolutional-neural-network-based methods typically rely on large networks with heavy computation and many parameters. For example, although the EGNet model proposed by Jiaxing Zhao et al. at the 2019 ICCV conference is one of the most accurate saliency detection methods at present, it has 108 million parameters, and merely storing these parameters requires 432 MB of memory. For images of size 336×336, EGNet runs at only 0.09 frames/second on an i7-8700K CPU, and at 12.7 frames/second even on a powerful NVIDIA TITAN XP GPU. Note that the rated power of a TITAN XP GPU is approximately 250 W, so the EGNet model cannot realistically be deployed on mobile devices. This makes EGNet difficult to deploy in practical applications, especially on mobile devices. Meanwhile, with the recent rise of mobile platforms such as smartphones, robots, virtual reality, and various intelligent terminals, deploying saliency detection systems on mobile devices has become an urgent problem.
Designing a lightweight convolutional neural network is a good way to solve these problems: a lightweight neural network uses design techniques to obtain a network with little computation, few parameters, and high speed. Lightweight neural networks have been studied in other fields, such as image classification and semantic segmentation, with well-known image classification models including MobileNet and ShuffleNet, but the present invention is dedicated to designing a lightweight neural network for salient object detection. Saliency detection typically faces two challenges: 1) it needs both high-level semantic information and low-level detail information to locate salient objects and refine object details; 2) it requires extracting multi-scale information to handle salient objects of different sizes in natural images. Since lightweight neural networks are usually shallow and their operations are often simplified, their learning and representation capabilities are usually inferior to those of large-scale convolutional neural networks. For this reason, directly using MobileNet or ShuffleNet as the backbone network to design a lightweight convolutional neural network for saliency detection does not work well.
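For intuition on why MobileNet-style designs, and the depthwise separable convolution the invention also adopts in step a below, are lightweight, the following PyTorch sketch compares parameter counts of an ordinary 3×3 convolution and its depthwise separable counterpart; the channel width is an arbitrary illustrative choice, not a value from the patent:

```python
import torch.nn as nn

C = 128  # example channel width, chosen only for illustration

# Ordinary 3x3 convolution: C * C * 3 * 3 weights.
conv = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)

# Depthwise separable counterpart: a per-channel 3x3 convolution
# (C * 3 * 3 weights) followed by a 1x1 pointwise convolution (C * C weights).
ds_conv = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False),
    nn.Conv2d(C, C, kernel_size=1, bias=False),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(conv))     # 147456
print(count(ds_conv))  # 17536, roughly 8.4x fewer parameters
```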
Disclosure of Invention
The invention aims to solve the problems of excessive computational complexity, low speed, and large parameter counts in existing salient object detection methods based on convolutional neural networks, and provides a fast salient object detection method using a multi-scale neural network controlled by stereo attention. The method achieves performance similar to previous methods with only 1.33 million parameters, reaching 343 frames/second on an NVIDIA TITAN XP GPU while still running at 5 frames/second on an i7-8700K CPU.
To achieve this purpose, a multi-scale convolution module controlled by stereo attention is first designed. The designed module extracts multi-scale convolution features well while remaining lightweight. These modules are stacked to form a deep convolutional neural network that learns high-level semantic information and low-level detail information from the image, so that salient objects in the image can be detected quickly and accurately.
The invention provides a fast salient object detection method using a multi-scale neural network based on stereo attention control, which comprises the following steps:
a. A multi-scale convolution module controlled by stereo attention is designed.
The module extracts multi-scale convolution features from the input image or feature map using several parallel depthwise separable convolutions with different dilation rates, and then fuses the convolution features of all branches by element-wise addition. Two attention mechanisms are applied to the fused convolution features: a channel-based attention mechanism, which produces a vector whose dimensionality equals the number of convolution feature channels multiplied by the number of parallel branches, and a space-based attention mechanism, which produces a single-channel matrix with the same spatial size as the convolution features. The two attentions are multiplied by expanding their matrix dimensions, yielding a stereo attention map, which is split along the channel dimension so that the convolution features extracted by each parallel branch correspond to a stereo attention map of the same size. The convolution features extracted by each branch are multiplied with the corresponding stereo attention map, the multiplication results of all branches are summed, and the module input is added to obtain the module output; a minimal sketch of the expansion-and-multiplication step follows.
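The following PyTorch sketch illustrates only that broadcasting step, building the stereo attention map from a channel vector and a spatial matrix; the shapes and names are illustrative assumptions, not the patented implementation:

```python
import torch

# Illustrative shapes (assumptions): N+1 = 4 branches, C = 64 channels, 56x56 feature maps.
num_branches, C, H, W = 4, 64, 56, 56

d = torch.randn(num_branches * C)      # channel attention vector, length (N+1)*C
s = torch.randn(H, W)                  # spatial attention matrix, H x W

# Expand both attentions to (N+1) x C x H x W by broadcasting, then multiply element-wise.
d_hat = d.view(num_branches, C, 1, 1)  # replicated over the spatial dimensions
s_hat = s.view(1, 1, H, W)             # replicated over branch and channel dimensions
v = d_hat * s_hat                      # stereo attention map, (N+1) x C x H x W

# Normalize across the branch dimension so the branch weights sum to 1 at every position.
w = torch.softmax(v, dim=0)
print(w.shape)  # torch.Size([4, 64, 56, 56])
```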
b. A deep convolutional neural network with an encoder-decoder structure is designed.
The encoding sub-network of the designed convolutional neural network can be divided into five stages. Each stage first applies a convolution layer with stride 2 to downsample the input spatially by a factor of two, followed by several of the designed multi-scale convolution modules controlled by stereo attention. The decoding sub-network starts from the last layer of the encoding sub-network, fuses the convolution features extracted by the encoding sub-network at different stages by progressive upsampling, predicts a saliency map after each feature fusion, and adds deep supervision for training.
c. The color natural image to be detected is input into the deep convolutional neural network designed in step b. The saliency map predicted after the last fusion in the decoding sub-network is the output of the designed convolutional neural network; the output saliency map has the same size as the original input image.
Advantages and advantageous effects of the invention
The invention obtains a convolutional neural network by stacking the designed multi-scale convolution modules based on stereo attention control, and can detect salient objects quickly and accurately. Because the designed module replaces traditional convolution with depthwise separable convolution, the method has few parameters, little computation, and high speed. At the same time, the designed module uses depthwise separable convolutions to learn multi-scale, rich image representations efficiently, so the method achieves accuracy similar to that of traditional methods.
Drawings
Fig. 1 is a multi-scale convolution module based on stereo attention control designed by the present invention.
Fig. 2 is an overall architecture of the convolutional neural network designed by the present invention.
FIG. 3 is a comparison of experimental results of the present invention and related methods.
Fig. 4 shows several sets of exemplary results of the present invention.
Detailed Description
The following describes in further detail embodiments of the present invention with reference to the accompanying drawings. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The fast salient object detection method using the multi-scale neural network based on stereo attention control specifically comprises the following operations:
a. Design a multi-scale convolution module controlled by stereo attention for extracting multi-scale convolution features.
Assume that DSConv3×3 denotes a depthwise separable convolution with kernel size 3×3, Conv3×3 denotes an ordinary convolution with kernel size 3×3, Conv1×1 denotes an ordinary convolution with kernel size 1×1, and r denotes the dilation rate of a convolution.
Assume the convolution feature input to the multi-scale convolution module based on stereo attention control is $I \in \mathbb{R}^{C\times H\times W}$, whose channel number, height, and width are denoted by $C$, $H$, and $W$, respectively. The input $I$ is first processed with a depthwise separable convolution DSConv3×3, formulated as

$$F_0 = \mathrm{DSConv}_{3\times 3}(I),$$

where $\mathrm{DSConv}_{3\times 3}(\cdot)$ denotes the DSConv3×3 operation. A multi-branch structure is designed, with each branch taking $F_0$ as input:

$$F_i = \mathrm{DSConv}^{(i)}_{3\times 3}(F_0), \quad i = 1, 2, \ldots, N,$$

where $\mathrm{DSConv}^{(i)}_{3\times 3}$ denotes the DSConv3×3 of the $i$-th branch, $N$ denotes the total number of branches, and the DSConv3×3 of each branch has a different dilation rate. The resulting convolution features of all branches are added element by element:

$$M = \sum_{i=0}^{N} F_i.$$

The fused feature $M$ will be used to compute the attention over the convolution features.
First consider the channel-based attention mechanism, which computes a channel attention vector $d \in \mathbb{R}^{(N+1)C}$. To explore the relationship between the different channels of the convolution feature $M$, global average pooling is performed on $M$:

$$z_c = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} M_{c,h,w},$$

where $z \in \mathbb{R}^{C}$. The feature vector $z$ is then processed with a multi-layer perceptron consisting of two layers:

$$d = \mathbf{W}_2\,\delta(\mathbf{W}_1 z),$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the weights of the two layers, $\delta$ is a nonlinear activation function, and the resulting vector $d \in \mathbb{R}^{(N+1)C}$ is the derived channel-based attention vector.
Next is the space-based attention mechanism, which computes a spatial attention matrix $s \in \mathbb{R}^{H\times W}$. Because a large receptive field extracts context information better, which is important for learning position-dependent attention, dilated DSConv3×3 is used to enlarge the receptive field. Specifically, the convolution feature $M$ is first mapped by a Conv1×1 to a feature with a smaller channel number $C/\lambda$, where $\lambda$ is the channel reduction ratio ($\lambda = 4$ in the experiments of the invention). The resulting convolution feature is then processed with two DSConv3×3 with dilation rates 2 and 4, respectively. Finally, the channel number is reduced to 1 with one Conv1×1, yielding a feature of size $\mathbb{R}^{1\times H\times W}$, which is reshaped into $\mathbb{R}^{H\times W}$. These operations may be formulated as

$$s = \mathrm{Conv}^{(4)}_{1\times 1}\Big(\mathrm{DSConv}^{(3)}_{3\times 3}\Big(\mathrm{DSConv}^{(2)}_{3\times 3}\big(\mathrm{Conv}^{(1)}_{1\times 1}(M)\big)\Big)\Big),$$

where $\mathrm{Conv}^{(i)}_{k\times k}$ denotes the $i$-th convolution with kernel size $k\times k$: an ordinary convolution when $k = 1$ and a depthwise separable convolution DSConv3×3 when $k = 3$. $s \in \mathbb{R}^{H\times W}$ is the derived space-based attention matrix.
The stereo attention map can be calculated according to the following formula:

$$v = \hat{d} \odot \hat{s},$$

where $d \in \mathbb{R}^{(N+1)C}$ and $s \in \mathbb{R}^{H\times W}$ are the channel-based attention vector and the space-based attention matrix described above, and $v \in \mathbb{R}^{(N+1)\times C\times H\times W}$ is the obtained stereo attention map. The symbol $\odot$ denotes element-by-element multiplication; before this multiplication, the channel-based attention $d$ is expanded by replication to $\hat{d} \in \mathbb{R}^{(N+1)\times C\times H\times W}$, and the space-based attention $s$ is likewise expanded by replication to $\hat{s} \in \mathbb{R}^{(N+1)\times C\times H\times W}$. A Softmax function is applied to normalize over the dimension corresponding to the branches, yielding the final stereo attention weights:

$$w^{i}_{c,h,w} = \frac{\exp\big(v^{i}_{c,h,w}\big)}{\sum_{j=0}^{N}\exp\big(v^{j}_{c,h,w}\big)},$$

where $c \in \{0, 1, \ldots, C-1\}$, $h \in \{0, 1, \ldots, H-1\}$, and $w \in \{0, 1, \ldots, W-1\}$ are indices in the channel, height, and width dimensions, respectively, and $w^{i} \in \mathbb{R}^{C\times H\times W}$ is the weight of the $i$-th branch. All branches can then be weighted and summed to obtain the attention-based fused feature:

$$F = \sum_{i=0}^{N} w^{i} \odot F_i,$$

where $\odot$ again denotes element-by-element multiplication.
Finally, the fused feature $F$ is processed by a Conv1×1 and added to the module input $I$ as a residual connection, yielding the output $O$ of the multi-scale convolution module based on stereo attention control:

$$O = \mathrm{Conv}_{1\times 1}(F) + I,$$

where $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes a Conv1×1 operation. Fig. 1 is a diagram of the multi-scale convolution module based on stereo attention control.
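To make the above formulas concrete, here is a minimal PyTorch sketch of such a module, assuming N = 3 branches with dilation rates 1, 2, and 3, a ReLU between the two perceptron layers, a hidden width of C/λ in the perceptron, and λ = 4; the layer names and these unstated details are assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable 3x3 convolution with a given dilation rate."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class StereoAttentionModule(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 3), reduction=4):
        super().__init__()
        self.pre = DSConv(channels)                   # produces F_0
        self.branches = nn.ModuleList(DSConv(channels, d) for d in dilations)
        n_feats = len(dilations) + 1                  # N + 1 (branches plus F_0)
        # Channel attention: global average pooling, then a two-layer perceptron.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, n_feats * channels))
        # Spatial attention: 1x1, dilated DSConv (r=2 and r=4), then 1x1 to one channel.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            DSConv(channels // reduction, dilation=2),
            DSConv(channels // reduction, dilation=4),
            nn.Conv2d(channels // reduction, 1, 1, bias=False))
        self.proj = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        f0 = self.pre(x)
        feats = [f0] + [b(f0) for b in self.branches]  # F_0 ... F_N
        stacked = torch.stack(feats, dim=1)            # (B, N+1, C, H, W)
        m = stacked.sum(dim=1)                         # element-wise fusion M
        b, c, h, w = m.shape
        d = self.mlp(m.mean(dim=(2, 3)))               # channel attention, (B, (N+1)*C)
        d = d.view(b, len(feats), c, 1, 1)
        s = self.spatial(m).view(b, 1, 1, h, w)        # spatial attention, (B, 1, 1, H, W)
        att = torch.softmax(d * s, dim=1)              # stereo attention, softmax over branches
        fused = (att * stacked).sum(dim=1)             # F = sum_i w_i * F_i
        return self.proj(fused) + x                    # residual connection
```

For example, `StereoAttentionModule(64)(torch.randn(1, 64, 56, 56))` returns a tensor of shape `(1, 64, 56, 56)`.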
b. Design a deep convolutional neural network with an encoder-decoder structure. The multi-scale convolution modules based on stereo attention control designed in step a are stacked to form the encoding sub-network of the designed convolutional neural network. As shown in fig. 2, the encoding sub-network can be divided into five stages. Each stage first applies a Conv3×3 or DSConv3×3 with stride 2, followed by n_i (i = 1, 2, 3, 4, 5) multi-scale convolution modules based on stereo attention control, with C_i (i = 1, 2, 3, 4, 5) convolution channels for all operations of stage i. Only the first stage uses Conv3×3 and all other stages use DSConv3×3, because the input color image typically has only three channels and does not need a separable convolution. The module counts n_i are set to 1, 3, 6, 3, and the channel numbers C_i (i = 1, 2, 3, 4, 5) are 16, 32, 64, 96, and 128, respectively. The modules of the first four stages each have three branches, with dilation rates 1, 2, and 3; the modules of the fifth stage have two branches, with dilation rates 1 and 2, because the convolution feature maps of the fifth stage are already small and a large dilation rate is unnecessary. After the fifth stage, a pyramid pooling module is attached, as proposed by Hengshuang Zhao et al. in "Pyramid Scene Parsing Network", published at the 2017 CVPR conference. Let the convolution feature map output by each stage be denoted S_i (i = 1, 2, 3, 4, 5, 6), where S_5 denotes the convolution features before the fifth-stage pyramid pooling module, S_6 denotes the convolution features output by the pyramid pooling module, and S_{i+1} (i = 1, 2, 3, 4) is half the size of S_i.
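A sketch of how such an encoder could be assembled from the module above; the per-stage module counts are a hypothetical completion, since the text lists only four of the five values of n_i:

```python
import torch.nn as nn

# Hypothetical per-stage configuration; the patent gives channel widths
# 16/32/64/96/128 but only four of the five module counts survive translation,
# so the last entry of stage_blocks is an assumption.
stage_channels = [16, 32, 64, 96, 128]
stage_blocks = [1, 3, 6, 3, 3]
stage_dilations = [(1, 2, 3)] * 4 + [(1, 2)]  # two branches in the small fifth stage

def build_encoder(in_channels=3):
    """Build the five encoder stages, reusing StereoAttentionModule from the
    sketch above. Each stage opens with a stride-2 convolution: an ordinary
    Conv3x3 for the first stage (RGB input), depthwise separable afterwards."""
    stages, prev = [], in_channels
    for i, (c, n, dil) in enumerate(zip(stage_channels, stage_blocks, stage_dilations)):
        if i == 0:
            entry = nn.Conv2d(prev, c, 3, stride=2, padding=1, bias=False)
        else:
            entry = nn.Sequential(
                nn.Conv2d(prev, prev, 3, stride=2, padding=1, groups=prev, bias=False),
                nn.Conv2d(prev, c, 1, bias=False))
        blocks = [StereoAttentionModule(c, dilations=dil) for _ in range(n)]
        stages.append(nn.Sequential(entry, *blocks))
        prev = c
    return nn.ModuleList(stages)
```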
The generated convolution features S_i (i = 1, 2, 3, 4, 5, 6) are used to construct a decoding sub-network for saliency detection. To fuse the highest-level features S_5 and S_6, a Conv1×1 first adjusts the channel number of S_5, then the processed S_5 is added element-wise to S_6, followed by a depthwise separable convolution with kernel size k×k. This step can be expressed as

$$R_5 = \mathrm{DSConv}^{(5)}_{k\times k}\big(\mathrm{Conv}_{1\times 1}(S_5) + S_6\big),$$

where $\mathrm{DSConv}^{(5)}_{k\times k}$ denotes the depthwise separable convolution of the fifth decoding stage with kernel size $k\times k$, $\mathrm{Conv}_{1\times 1}$ is an ordinary convolution with kernel size 1×1, and $R_5$ denotes the fused features of the fifth stage. The fusion of lower-level features is similar, except that the features passed down from the higher level are first upsampled to the same spatial size as the features of the current stage. This can be formulated as

$$R_i = \mathrm{DSConv}^{(i)}_{k\times k}\big(\mathrm{Conv}_{1\times 1}(S_i) + \mathrm{Up}(R_{i+1})\big), \quad i = 1, 2, 3, 4,$$

where Up denotes upsampling the convolution features by a factor of 2. The decoded convolution features $R_i$ (i = 1, 2, 3, 4, 5) are thus obtained. For neural network training, the invention uses deep supervision. Specifically, a Conv1×1 and a Sigmoid function are attached in turn to each $R_i$ (i = 1, 2, 3, 4, 5) to perform saliency prediction, the prediction results are upsampled to the same size as the original image, and the ground-truth saliency maps in the training set supervise the training.
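A minimal sketch of this coarse-to-fine decoder with deep supervision; the decoder kernel size k = 3, the shared decoder width, the separate 1×1 adjustment of S_6, and bilinear upsampling are illustrative assumptions the patent text leaves open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """R_5 = DSConv(Conv1x1(S_5) + S_6); R_i = DSConv(Conv1x1(S_i) + Up(R_{i+1})).
    A Conv1x1 + Sigmoid head on every R_i provides deep supervision."""
    def __init__(self, enc_channels=(16, 32, 64, 96, 128, 128), width=32, k=3):
        super().__init__()
        # Lateral 1x1 convolutions for S_1..S_5; S_6 gets its own (an assumption).
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, width, 1, bias=False) for c in enc_channels[:5])
        self.top = nn.Conv2d(enc_channels[5], width, 1, bias=False)
        # Depthwise separable k x k convolution after each fusion.
        self.fuse = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(width, width, k, padding=k // 2, groups=width, bias=False),
                nn.Conv2d(width, width, 1, bias=False))
            for _ in range(5))
        self.heads = nn.ModuleList(nn.Conv2d(width, 1, 1) for _ in range(5))

    def forward(self, feats, img_size):
        # feats = [S_1, ..., S_6]; S_5 and S_6 share the same spatial size.
        r = self.fuse[4](self.laterals[4](feats[4]) + self.top(feats[5]))  # R_5
        preds = []
        for i in (4, 3, 2, 1, 0):
            if i < 4:  # upsample the higher-level feature by 2x, then fuse
                up = F.interpolate(r, scale_factor=2.0, mode='bilinear',
                                   align_corners=False)
                r = self.fuse[i](self.laterals[i](feats[i]) + up)
            # Deep supervision: predict a saliency map at the original image size.
            p = F.interpolate(self.heads[i](r), size=img_size, mode='bilinear',
                              align_corners=False)
            preds.append(torch.sigmoid(p))
        return preds  # preds[-1], predicted from R_1, is the final saliency map
```

At inference time only `preds[-1]`, the map predicted from R_1, is used, matching step c below.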
c. Input the color natural image to be detected into the deep convolutional neural network with the encoder-decoder structure designed in step b. The saliency map predicted from the convolution feature R_1 in the decoding sub-network is the output of the designed neural network; the output saliency map has the same size as the input original image.
FIG. 3 shows a comparison of our method with other methods on six datasets: ECSSD, DUT-O, DUTS-TE, HKU-IS, SOD, and THUR15K. #Param denotes the number of parameters of the convolutional neural network, in millions (M). FLOPs denotes the number of floating-point operations, in billions (G). Speed is given in frames per second (FPS). F_β denotes the F-measure, where larger is better; MAE denotes the mean absolute error, where smaller is better. These are common metrics for saliency detection. SAMNet is the method of the present invention. RFCN is the method in "Saliency Detection with Recurrent Fully Convolutional Networks" published by Linzhao Wang et al. at the 2016 ECCV conference; DSS is the method in "Deeply Supervised Salient Object Detection with Short Connections" published by Qibin Hou et al. in IEEE TPAMI in 2019; SRM is the method in "A Stagewise Refinement Model for Detecting Salient Objects in Images" published by Tiantian Wang et al. at the 2017 ICCV conference; Amulet is the method in "Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection" published by Pingping Zhang et al. at the 2017 ICCV conference; UCF is the method in "Learning Uncertain Convolutional Features for Accurate Saliency Detection" published by Pingping Zhang et al. at the 2017 ICCV conference; C2S is the method in "Contour Knowledge Transfer for Salient Object Detection" published by Xin Li et al. at the 2018 ECCV conference; RAS is the method in "Reverse Attention for Salient Object Detection" published by Shuhan Chen et al. at the 2018 ECCV conference; CPD is the method in "Cascaded Partial Decoder for Fast and Accurate Salient Object Detection" published by Zhe Wu et al. at the 2019 CVPR conference; BASNet is the method in "BASNet: Boundary-Aware Salient Object Detection" published by Xuebin Qin et al. at the 2019 CVPR conference; and EGNet is the method in "EGNet: Edge Guidance Network for Salient Object Detection" published by Jiaxing Zhao et al. at the 2019 ICCV conference. It can be seen that the deep convolutional neural network designed by the invention has few parameters and little computation, and is fast, reaching 343 FPS, while its accuracy is almost the same as that of previous methods. The proposed method still runs at 5 frames/second on an i7-8700K CPU, which previous methods could not achieve.
FIG. 4 compares saliency maps obtained by the method of the present invention with those of other methods. The first column of each row is the original image, the second-to-last column is the method of the invention, and the last column is the ground-truth annotation from the dataset. The names of the methods corresponding to the columns are marked at the bottom of the figure.

Claims (4)

1. A fast salient object detection method using a multi-scale neural network based on stereo attention control, characterized by comprising the following steps:
a. designing a multi-scale convolution module controlled by stereo attention, which extracts multi-scale convolution features from an input image or feature map using a plurality of parallel branches formed by depthwise separable convolutions with different dilation rates, computes an attention map for each branch, and finally performs attention-map-based feature fusion on the multi-scale convolution features extracted by the plurality of branches;
b. designing a deep convolutional neural network with an encoder-decoder structure, wherein the encoding sub-network of the designed convolutional neural network is formed by stacking the stereo-attention-controlled multi-scale convolution modules designed in step a, and the decoding sub-network of the designed convolutional neural network fuses the convolution features extracted by the encoding sub-network at different stages by progressive upsampling;
c. inputting the color natural image to be detected into the deep convolutional neural network designed in step b, which outputs a saliency map of the same size as the original image.
2. The fast salient object detection method using a multi-scale neural network based on stereo attention control as claimed in claim 1, characterized in that: the designed multi-scale convolution module controlled by stereo attention fuses the multi-scale convolution features extracted by all branches by addition, and then computes the attention map of each branch from the fused features, so that the attention map corresponding to each branch is computed from the features of all branches.
3. The fast salient object detection method using a multi-scale neural network based on stereo attention control as claimed in claim 2, characterized in that: the computation of the stereo attention map fuses an attention mechanism based on convolution channels and an attention mechanism based on convolution space, so that each branch corresponds to a different stereo attention map.
4. The fast salient object detection method using a multi-scale neural network based on stereo attention control as claimed in claim 2, characterized in that: the feature fusion based on the stereo attention maps multiplies the convolution features computed on each branch element-wise with the corresponding stereo attention map, then adds the multiplied convolution features of all branches element-wise, and adds the input feature map as a residual connection.
CN202010319916.9A 2020-04-22 2020-04-22 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control Pending CN111598108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010319916.9A CN111598108A (en) 2020-04-22 2020-04-22 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010319916.9A CN111598108A (en) 2020-04-22 2020-04-22 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control

Publications (1)

Publication Number Publication Date
CN111598108A true CN111598108A (en) 2020-08-28

Family

ID=72185120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010319916.9A Pending CN111598108A (en) 2020-04-22 2020-04-22 Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control

Country Status (1)

Country Link
CN (1) CN111598108A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533084A (en) * 2019-08-12 2019-12-03 长安大学 A kind of multiscale target detection method based on from attention mechanism
CN110866907A (en) * 2019-11-12 2020-03-06 中原工学院 Full convolution network fabric defect detection method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang-Jiang Liu et al.: "A Simple Pooling-Based Design for Real-Time Salient Object Detection", arXiv *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258431A (en) * 2020-09-27 2021-01-22 成都东方天呈智能科技有限公司 Image classification model based on mixed depth separable expansion convolution and classification method thereof
US11694301B2 (en) * 2020-09-30 2023-07-04 Alibaba Group Holding Limited Learning model architecture for image data semantic segmentation
CN112037520A (en) * 2020-11-05 2020-12-04 杭州科技职业技术学院 Road monitoring method and system and electronic equipment
CN112381164A (en) * 2020-11-20 2021-02-19 北京航空航天大学杭州创新研究院 Ultrasound image classification method and device based on multi-branch attention mechanism
CN112528899A (en) * 2020-12-17 2021-03-19 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN114418003A (en) * 2022-01-20 2022-04-29 北京科技大学 Double-image identification and classification method based on attention mechanism and multi-size information extraction
CN114418003B (en) * 2022-01-20 2022-09-16 北京科技大学 Double-image recognition and classification method based on attention mechanism and multi-size information extraction
CN114494703A (en) * 2022-04-18 2022-05-13 成都理工大学 Intelligent workshop scene target lightweight semantic segmentation method
CN114494703B (en) * 2022-04-18 2022-06-28 成都理工大学 Intelligent workshop scene target lightweight semantic segmentation method

Similar Documents

Publication Publication Date Title
CN111598108A (en) Rapid salient object detection method of multi-scale neural network based on three-dimensional attention control
Cong et al. Going from RGB to RGBD saliency: A depth-guided transformation model
Zhang et al. LFNet: Light field fusion network for salient object detection
CN110543841A (en) Pedestrian re-identification method, system, electronic device and medium
CN111914107B (en) Instance retrieval method based on multi-channel attention area expansion
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN107590505B (en) Learning method combining low-rank representation and sparse regression
CN113515656B (en) Multi-view target identification and retrieval method and device based on incremental learning
CN110929735B (en) Rapid significance detection method based on multi-scale feature attention mechanism
CN109766918B (en) Salient object detection method based on multilevel context information fusion
Yang et al. HybridNet: Integrating GCN and CNN for skeleton-based action recognition
Kishorjit Singh et al. Image classification using SLIC superpixel and FAAGKFCM image segmentation
Zhong et al. Highly efficient natural image matting
Wang et al. STCD: efficient Siamese transformers-based change detection method for remote sensing images
Wang et al. A lightweight network with attention decoder for real-time semantic segmentation
Zong et al. A cascaded refined rgb-d salient object detection network based on the attention mechanism
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
Zhao et al. Depth enhanced cross-modal cascaded network for RGB-D salient object detection
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN115830633A (en) Pedestrian re-identification method and system based on multitask learning residual error neural network
Zhang et al. A multi-cue guidance network for depth completion
Pei et al. Fusing appearance and motion information for action recognition on depth sequences
Zhou et al. Salient object detection via reliability‐based depth compactness and depth contrast

Legal Events

Code Title
PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 20200828)