CN110097115A - Video salient object detection method based on an attention-shift mechanism - Google Patents
Video salient object detection method based on an attention-shift mechanism
- Publication number
- CN110097115A (application CN201910347420.XA)
- Authority
- CN
- China
- Prior art keywords
- attention
- module
- network
- shift
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
A video salient object detection method based on an attention-shift mechanism. Attention shift is a distinctive function of the human visual system, yet current methods ignore this important mechanism. The method of the present invention designs a new convolutional neural network architecture that efficiently combines the characteristics of a static convolutional network, a pyramid dilated convolutional network, a long short-term memory network and an attention-shift-aware module, thereby fully reflecting the attention-shift mechanism of the human visual system. It is more meaningful for real application scenarios and obtains better salient object detection results. Compared with current video salient object detection methods, the method of the present invention reaches a world-leading level, surpassing the current best methods on the performance evaluations of mainstream public datasets.
Description
Technical field
The invention belongs to the technical field of image processing, and specifically relates to a video salient object detection method based on an attention-shift mechanism.
Background technique
Video salient object detection (VSOD) aims to extract the most attention-grabbing objects from dynamic video. The task originates from the study of human visual attention behavior, i.e., the remarkable ability of the human visual system to rapidly locate the most important information in a scene (the visual attention mechanism). Early physiological studies quantitatively confirmed the strong correlation between explicit object-level saliency judgments and implicit visual attention allocation. Since we live in a dynamically changing world, video salient object detection is of great significance, and it has a wide range of practical applications, such as video segmentation, video summarization, video compression, autonomous driving and human-machine interaction. Because video data are large and diverse (e.g., different motion patterns, occlusion, blur, object deformation, etc.) and human visual attention behavior is complex (i.e., dynamic allocation of selective attention, attention shift, etc.), video saliency detection faces great challenges; it has attracted intense interest and has important academic value.
Early VSOD models were based on simple features (e.g., color, motion) and largely relied on classical image salient object detection heuristics (e.g., center-surround contrast, background priors) and cognitive theories of visual attention (e.g., feature integration theory, guided search). They explored ways of integrating spatial and temporal saliency features, such as gradient flow fields, geodesic distance, random walks and graph structures. Traditional VSOD models are constrained by their limited feature representation ability. Recently, however, VSOD models based on deep learning have received more attention, as deep neural networks applied to images have successfully achieved saliency detection for still images. More specifically, Wang et al. published the paper "Video salient object detection via fully convolutional networks" in the IEEE TIP journal (27(1): 38-49, 2018), which built a fully convolutional neural network for VSOD. Another contemporaneous paper, "Deeply supervised 3d recurrent fcn for salient object detection in videos", published at BMVC, used 3D filters to incorporate spatial and temporal information into a conditional random field framework. Subsequently, spatio-temporal deep features, recurrent neural networks and other techniques were proposed to better capture spatial and temporal saliency cues. In general, deep-network-based VSOD models possess powerful learning ability because they use neural networks to extract features; since the literature is extensive, it is not enumerated here one by one. However, these models ignore the very important attention-shift mechanism of human visual attention. For example, suppose a video scene contains a static black cat and a moving white cat: at first, people concentrate their attention on the moving white cat. A few seconds later, when the static black cat suddenly starts to quarrel and fight with the white cat, people's attention shifts to both the black cat and the white cat. Because existing models at home and abroad mostly focus on moving objects, or on saliency detection for purely static objects, their performance drops significantly in such scenes that require a comprehensive understanding of human attention shifts, and their detection results are unsatisfactory.
Summary of the invention
The object of the present invention is to address the failure of existing video salient object detection methods to account for the shift of the salient object, and to propose a video salient object detection method based on an attention-shift mechanism.
The method of the present invention, termed Saliency-Shift-Aware Video Salient Object Detection (SSAV), consists of two basic modules: a pyramid dilated convolution (PDC) module and a saliency-shift-aware module (SSLSTM). The former is trained with a strong still-image salient object learning method; the latter extends the traditional convolutional long short-term memory network (convLSTM) so that it is aware of saliency shifts. The present invention takes the static feature sequence obtained from the PDC module as input and produces the corresponding VSOD results with dynamic representations and attention shifts.
Technical solution of the present invention
A video salient object detection method based on an attention-shift mechanism, comprising the following steps:
a. Static convolutional network module: a multi-layer convolutional neural network is used to extract features from the still frames {I_t} of the input video, yielding a set of features {Q_t}; here T denotes the total number of frames of the input video and t indexes one frame. The multi-layer convolutional neural network is built from a backbone convolutional neural network; candidate backbones include the VGG-16 network, the ResNet-50 network, the ResNet-101 network and the SE network.
b. Pyramid dilated convolution (PDC) module: the features extracted in step a are fed into this module, and a pyramid of dilated convolutions is used to obtain multi-scale features. Specifically, the PDC module consists of K dilated convolutional layers, each with a different dilation rate, so as to extract the multi-scale feature vector X_t.
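The pyramid of dilated convolutions in step b can be illustrated with a small sketch. The following is a minimal NumPy illustration, not the patent's actual implementation: a single 3 × 3 kernel is applied at the dilation rates 2, 4, 8 and 16 used in the embodiment, and the resulting maps P_k are concatenated with the input Q along the channel axis, mirroring X = [Q, P_1, ..., P_K]. The single-channel map and random weights are illustrative stand-ins.

```python
import numpy as np

def dilated_conv2d(feat, kernel, rate):
    """'Same'-padded 2D dilated convolution of a single-channel feature map."""
    k = kernel.shape[0]
    eff = k + (k - 1) * (rate - 1)      # effective kernel size grows with the rate
    pad = eff // 2
    padded = np.pad(feat, pad, mode="constant")
    out = np.zeros_like(feat, dtype=float)
    for i in range(feat.shape[0]):
        for j in range(feat.shape[1]):
            # sample the padded map with stride `rate` (the "holes" of the kernel)
            patch = padded[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
Q = rng.random((32, 32))                # toy single-channel feature map
kernel = rng.random((3, 3))
rates = [2, 4, 8, 16]                   # dilation rates of the embodiment
P = [dilated_conv2d(Q, kernel, r) for r in rates]
X = np.stack([Q] + P)                   # X = [Q, P1, ..., PK]
print(X.shape)                          # (5, 32, 32)
```

Larger dilation rates enlarge the effective receptive field (k + (k − 1)(r − 1) for a k × k kernel) without adding parameters, which is how the PDC module gathers multi-scale context.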
c. Attention-shift-aware module: on the basis of a convolutional long short-term memory network (convLSTM), a weight module F_A, specially designed in the present invention, is added. The F_A module is built from a stack of simple convolutional layers; it assigns weights to the multi-scale features extracted in step b, thereby realizing attention-shift awareness.

The input of the attention-shift-aware module is the multi-scale feature vector X_t produced by the PDC module, and the output is a two-dimensional map S_t ∈ [0,1]^{W×H}. The module computes:

Hidden state: H_t = convLSTM(X_t, H_{t-1})
Attention-shift map: A_t = F_A({X_1, ..., X_t})
Gated state: G_{m,t} = A_t ⊙ H_{m,t}
Salient object prediction: S_t = σ(w_S * G_t)

Suppose the total length of the input video is T frames; the subscript t denotes the current frame and t-1 the previous frame. H_t is the hidden state of the 3D tensor at the current time, obtained by the long short-term memory network convLSTM(·) from the current input feature X_t and the hidden state of the previous time step. The weight module F_A(·), specially designed in the present invention, is built from a stack of simple convolutional layers and assigns weights to the set of features {X_1, ..., X_t} extracted in step b. G_{m,t} denotes the gated (shift-aware) state, where m ∈ {1, ..., M} is the channel index, M is the total number of channels, ⊙ denotes element-wise multiplication, and H_{m,t} is the hidden state of channel m of the 3D tensor at the current time. w_S is a 1 × 1 × M convolution kernel, * denotes the convolution operation, and σ is an activation function.
d. Generate the image result: a 1×1 convolutional layer is applied to the features output by step c, and an activation function is then used to decide which neurons are activated, so as to generate the salient object image of each video frame;
e. Update the network: a cross-entropy loss function is used to compute the loss between the salient object image generated in step d and the manually annotated reference image; the gradient is back-propagated and the network is updated.

The loss computed against the manually annotated reference image is:

L = L_VSOD(S_t, M_t) + I(F_t) · L_Att(A_t, F_t)

where L_Att and L_VSOD are cross-entropy losses, I(·) indicates whether a fixation (gaze-point) saliency map F_t exists, and M_t is the manually annotated reference image.
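The per-pixel cross-entropy used in step e can be written directly. The sketch below shows a generic binary cross-entropy between a predicted saliency map and a binary reference mask; the indicator-weighted combination of L_VSOD and L_Att is not reproduced.

```python
import numpy as np

def binary_cross_entropy(S, G, eps=1e-7):
    """Mean per-pixel cross-entropy between prediction S in (0,1) and binary mask G."""
    S = np.clip(S, eps, 1.0 - eps)       # avoid log(0)
    return float(-np.mean(G * np.log(S) + (1.0 - G) * np.log(1.0 - S)))

G = np.array([[0.0, 1.0], [1.0, 0.0]])      # toy ground-truth mask M_t
S_good = np.array([[0.1, 0.9], [0.8, 0.2]]) # prediction close to the mask
S_bad = np.array([[0.9, 0.1], [0.2, 0.8]])  # prediction far from the mask
assert binary_cross_entropy(S_good, G) < binary_cross_entropy(S_bad, G)
```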
Advantages and beneficial effects of the present invention:

The video saliency detection method of the present invention takes the attention-shift mechanism into account. This mechanism does not appear in the prior art: it occurs naturally in the human visual system but has long been ignored by researchers. Introducing it into a network is creative and non-trivial. Compared with other current models that detect saliency only from single video frames, the method of the present invention considers the spatio-temporal relations between frames of the video and models the shift of saliency according to shifts of attention fixation, which is more meaningful for applications; it obtains better practical results and reaches a world-leading standard.
Detailed description of the invention
Fig. 1 is the flowchart of the SSAV method of the present invention.

Fig. 2 is the implementation framework of the SSAV method of the present invention, where the numbers 473 × 473 × 3 on the image denote the length × width × number of channels of the input image.
Fig. 3 shows saliency-map examples obtained on the full ViSal dataset by the SSAV method of the present invention and by 17 existing state-of-the-art deep learning and conventional methods (the 17 compared methods are, in order: PDBM, MBNM, FGRN, DLVS, SCNN, SCOM, SFLR, SGSP, STBP, MSTM, GFVM, SAGM, MB+M, RWRV, SPVM, TIMP, SIVM);

Fig. 4 shows saliency-map examples obtained on the FBMS test set by the SSAV method of the present invention and by the same 17 methods as in Fig. 3;

Fig. 5 shows saliency-map examples obtained on the DAVIS test set by the SSAV method of the present invention and by the same 17 methods as in Fig. 3;

Fig. 6 shows saliency-map examples obtained on the DAVSOD test set by the SSAV method of the present invention and by the same 17 methods as in Fig. 3;

Fig. 7 shows examples with saliency shift obtained on the DAVSOD test set by the SSAV method of the present invention and by 5 existing state-of-the-art methods. Column (a) shows the input video frames, (b) the human fixation maps recorded for the corresponding input frames, (c) the manually annotated reference image frames, (d) the saliency maps obtained by the present SSAV method, and (e)-(i) the saliency maps of the 5 compared methods, in order: MBNM, FGRN, PDBM, SFLR, SAGM.
Specific embodiments

With reference to Fig. 1 and Fig. 2, the specific implementation steps of the present invention are as follows:
a. Static convolutional network module: a ResNet-50 neural network is used to extract features from the still frames {I_t}, obtaining a set of features {Q_t}, where T denotes the total number of frames of the input video and t indexes one frame. The example in Fig. 2 shows 3 input frames I_{t-1}, I_t, I_{t+1}; after the ResNet-50 network, the set of features Q_{t-1}, Q_t, Q_{t+1} is obtained.
b. Pyramid dilated convolution (PDC) module: the features extracted in step a are fed into this module, and a pyramid of dilated convolutions is used to obtain multi-scale features. Specifically, the PDC module consists of K parallel dilated convolutional layers, each with a different dilation rate. This embodiment uses 4 dilated convolutional layers with dilation rates 2, 4, 8 and 16, respectively. For example, the feature Q obtained in step a is passed through the pyramid convolutions to produce a set of features {P_1, ..., P_k, ..., P_K}, which is then concatenated with Q to obtain the multi-scale feature

X = [Q, P_1, ..., P_k, ..., P_K],

where X is the enhanced extracted feature, Q is the 3D feature tensor of frame I in the video, and [·] denotes the concatenation of the parallel branches. The pyramid dilated convolution module captures multi-scale information and extracts more robust features.
c. Attention-shift-aware module: on the basis of a convolutional long short-term memory network (convLSTM), a weight module F_A, specially designed in the present invention, is added. The F_A module is built from a stack of simple convolutional layers; it assigns weights to the multi-scale features extracted in step b, thereby realizing the attention-shift mechanism.

The input of the attention-shift-aware module is the multi-scale feature vector X_t produced by the PDC module, and the output is a two-dimensional map S_t ∈ [0,1]^{W×H}. The module computes:

Hidden state: H_t = convLSTM(X_t, H_{t-1})
Attention-shift map: A_t = F_A({X_1, ..., X_t})
Gated state: G_{m,t} = A_t ⊙ H_{m,t}
Salient object prediction: S_t = σ(w_S * G_t)

Suppose the total length of the input video is T frames; the subscript t denotes the current frame and t-1 the previous frame. H_t is the hidden state of the 3D tensor at the current time, obtained by the long short-term memory network convLSTM(·) from the current input feature X_t and the hidden state of the previous time step. The weight module F_A(·), specially designed in the present invention, is built from a stack of simple convolutional layers and assigns weights to the set of features {X_1, ..., X_t} extracted in step b. G_{m,t} denotes the gated state, m ∈ {1, ..., M} is the channel index, M is the total number of channels, ⊙ denotes element-wise multiplication, and H_{m,t} is the hidden state of channel m of the 3D tensor at the current time. w_S is a 1 × 1 × M convolution kernel, * denotes the convolution operation, and σ is a sigmoid activation function. As shown in Fig. 2, the convLSTM network uses a 3 × 3 × 32 convolution kernel.
d. Generate the image result: a 1×1 convolutional layer is applied to the features output by step c, and an activation function is then used to decide which neurons are activated, so as to generate the salient object image of each video frame;

e. Update the network: a cross-entropy loss function is used to compute the loss between the salient object image generated in step d and the manually annotated reference image; the gradient is back-propagated and the network is updated. The finally obtained model can be used to extract the salient objects, including attention shifts, from any video.

The loss computed against the manually annotated reference image is:

L = L_VSOD(S_t, M_t) + I(F_t) · L_Att(A_t, F_t)

where L_Att and L_VSOD are cross-entropy losses, I(·) indicates whether a fixation (gaze-point) saliency map F_t exists, and M_t is the manually annotated reference image.
The effect of the invention is further illustrated by the following simulation experiments:

(1) Experimental datasets and simulation conditions

The test images used in this experiment come from: the ViSal dataset constructed by Wang Wenguan et al. in 2015; the FBMS dataset constructed in 2014 by the group of Prof. Jitendra Malik at the University of California, Berkeley; the DAVIS dataset published in 2016 by the Adobe scientist Perazzi at the famous international conference on computer vision and pattern recognition (CVPR); the VOS dataset constructed in 2018 by the group of Li Jia at Beihang University; and the DAVSOD dataset released in 2019 by Fan Dengping et al. Among them, ViSal is the first dataset designed exclusively for the video salient object detection task; it contains 17 video sequences with a total of 193 annotated frames. FBMS is an earlier classical dataset designed for the object segmentation task; it has 59 videos with a total of 720 annotated frames and is now widely used for video salient object detection. DAVIS is the first densely and accurately annotated dataset, with a total of 3455 densely annotated frames across 50 videos; in only two years it has become widely used. VOS is the largest of these earlier datasets in video count: it consists of 200 videos with 7467 annotated frames. DAVSOD, constructed by the Media Computing Laboratory of Nankai University, is currently the world's largest dataset of its kind, with more than 200 videos and 23938 annotated frames, exceeding the sum of the annotated frames of all previous datasets. The experimental platform is an Intel E5-2676 v3 @ 2.4 GHz × 24 machine with a GTX TITAN XP graphics card; the simulation uses Python and Caffe.
(2) Evaluation criteria for video salient object detection

We use three gold-standard metrics, the maximum F-measure (max F), the structure measure (S) and the mean absolute error (M), to evaluate the results of video salient object detection.

Mathematically, the F-measure is the weighted harmonic mean of precision and recall, which enables a comprehensive evaluation; it is computed as:

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall)

β² is the weight that gives precision a higher status; the widespread practice in the literature is to set β² = 0.3. Precision and Recall are computed as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

Precision and recall are built from a confusion matrix: in the binary decision problem, TP denotes pixels predicted positive whose reference is also positive, FP denotes pixels predicted positive whose reference is negative, and FN denotes pixels predicted negative whose reference is positive. We binarize the detected result map with 256 different thresholds; each threshold yields an F value, and the maximum F-measure is the largest of these 256 F values.
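The max-F protocol described above — binarize the result map at 256 thresholds, compute F_β with β² = 0.3 at each, and keep the maximum — can be sketched as follows (toy data; the exact threshold spacing is an assumption):

```python
import numpy as np

def max_f_measure(S, G, beta2=0.3, n_thresh=256):
    """Maximum F-measure of saliency map S (values in [0,1]) against binary mask G."""
    best = 0.0
    for th in np.linspace(0.0, 1.0, n_thresh, endpoint=False):
        B = S > th                        # binarized prediction at this threshold
        tp = np.sum(B & (G > 0.5))        # predicted positive, reference positive
        fp = np.sum(B & (G <= 0.5))       # predicted positive, reference negative
        fn = np.sum(~B & (G > 0.5))       # predicted negative, reference positive
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f = (1 + beta2) * prec * rec / (beta2 * prec + rec)
        best = max(best, f)
    return best

G = np.zeros((8, 8))
G[2:6, 2:6] = 1.0                         # toy square ground-truth mask
assert max_f_measure(G, G) == 1.0         # a perfect prediction scores 1
```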
The structure measure (S) was proposed by Fan et al. in 2017 to measure the structural difference between the predicted result and the reference result. It combines a region-aware term S_r and an object-aware term S_o:

S = α · S_o + (1 − α) · S_r

where setting α = 0.5 assigns equal weight to the region and object terms. For the detailed formulas, refer to the original paper: "Structure-measure: A New Way to Evaluate Foreground Maps", ICCV 2017.
The mean absolute error (MAE, M) measures the mean absolute error between the predicted result and the reference result. Let the binary reference result be a two-dimensional matrix G and the prediction a two-dimensional matrix S; then

M = (1/N) · Σ |S − G|,

where N is the total number of pixels in the image and the sum runs over all pixels. The mean absolute error estimates pixel-level accuracy and is one of the most widely used evaluation metrics.
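MAE as defined above reduces to one line of NumPy; the toy example below checks it by hand:

```python
import numpy as np

def mae(S, G):
    """Mean absolute error between prediction S and reference G (same shape)."""
    return float(np.mean(np.abs(S - G)))

G = np.array([[0.0, 1.0], [1.0, 1.0]])   # toy reference mask
S = np.array([[0.1, 0.9], [1.0, 0.8]])   # toy prediction
print(round(mae(S, G), 6))               # (0.1 + 0.1 + 0.0 + 0.2) / 4 = 0.1
assert mae(G, G) == 0.0                  # a perfect prediction has zero error
```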
Table 1 below gives the maximum F-measure (max F), structure measure (S) and mean error (M) obtained by the method of the invention and by 17 current classical and state-of-the-art compared methods on the 5 most challenging public test datasets in the world (ViSal, FBMS-T, DAVIS-T, VOS-T, DAVSOD-T).
Table 1
(3) experiment content
Experiment one
From Table 1 above it can be seen that, compared with the current 17 methods, the SSAV method of the invention has a clear advantage: it reaches the highest accuracy on all 3 metrics across the 5 datasets ViSal, FBMS, DAVIS, VOS and DAVSOD. This fully demonstrates the effectiveness and robustness of the SSAV method of the present invention. The objective evaluation above quantitatively illustrates the advantage of the invention in detecting video salient objects in various scenes; besides the numerical results, a subjective evaluation of the visual results is also needed.
Experiment two
In this experiment, we further show representative test results on 4 datasets to illustrate the performance of the method of the invention. In Figs. 3-6, (a) shows 3 different frames of the input video, (b) the manually annotated reference image frames, (c) the saliency maps obtained by the present SSAV method, (d) the saliency maps obtained by the PDBM method, (e) MBNM, (f) FGRN, (g) DLVS, (h) SCNN, (i) SCOM, (j) SFLR, (k) SGSP, (l) STBP, (m) MSTM, (n) GFVM, (o) SAGM, (p) MB+M, (q) RWRV, (r) SPVM, (s) TIMP, and (t) SIVM.

Combining the results of Figs. 3-6, our method is very close to the manually annotated reference frames in all cases, whereas the 17 compared methods all show larger gaps from the reference images.
To further verify that the present invention can effectively cope with the saliency-shift phenomenon, Fig. 7 illustrates this result. In the figure, (a) shows 5 frames of a video in the DAVSOD dataset, (b) the human fixation maps, (c) the manually annotated reference images (GT), (d) the saliency maps obtained by the present SSAV method, (e) MBNM, (f) FGRN, (g) PDBM, (h) SFLR, and (i) SAGM. It can be seen from the figure that the SSAV method of the present invention obtains more satisfactory results than the other classical methods. The method of the invention effectively captures the saliency-shift phenomenon: [cat] → [cat, box] → [cat] → [box] → [cat, box]. The other methods, however, either fail to detect the salient objects completely (e.g., the SFLR and SAGM methods) or capture only the moving cat while ignoring the box (e.g., the MBNM method).
Parts of this embodiment that are not described in detail belong to common knowledge in the field and are not repeated here. The specific networks used in the implementation above (ResNet-50, etc.) serve only as examples of the invention and do not limit its protection scope; all designs similar or identical to the present invention fall within the protection scope of the present invention.
Claims (5)
1. A video salient object detection method based on an attention-shift mechanism, characterized in that the method comprises the following steps:
a. static convolutional network module: using a multi-layer convolutional neural network to extract features from the still frames of the video;
b. pyramid dilated convolution (PDC) module: taking the features extracted in step a as the input of the module, and obtaining multi-scale features with a pyramid of dilated convolutions;
c. attention-shift-aware module: on the basis of a convolutional long short-term memory network (convLSTM), adding a weight module F_A built from a stack of simple convolutional layers, and using the weight module F_A to assign weights to the multi-scale features extracted in step b, thereby realizing attention-shift awareness;
d. generating the image result: applying a 1×1 convolutional layer to the features output by step c, and then using an activation function to decide which neurons are activated, so as to generate the salient object image of each video frame;
e. updating the network: using a cross-entropy loss function to compute the loss between the salient object image generated in step d and the manually annotated reference image, back-propagating the gradient, and updating the network.
2. The video salient object detection method based on an attention-shift mechanism according to claim 1, characterized in that: the multi-layer convolutional neural network in step a is built from different backbone convolutional neural networks.
3. The video salient object detection method based on an attention-shift mechanism according to claim 2, characterized in that: the backbone convolutional neural networks include the VGG-16 network, the ResNet-50 network, the ResNet-101 network and the SE network.
4. The video salient object detection method based on an attention-shift mechanism according to any one of claims 1 to 3, characterized in that: the input of the attention-shift-aware module in step c is the multi-scale feature vector X_t produced by the PDC module, and the output is a two-dimensional map S_t ∈ [0,1]^{W×H}, where W is the image width and H is the image height; the attention-shift-aware module computes:

Hidden state: H_t = convLSTM(X_t, H_{t-1})
Attention-shift map: A_t = F_A({X_1, ..., X_t})
Gated state: G_{m,t} = A_t ⊙ H_{m,t}
Salient object prediction: S_t = σ(w_S * G_t)

where it is assumed that the total length of the input video is T frames, the subscript t denotes the current frame and t-1 the previous frame; H_t is the hidden state of the 3D tensor at the current time, obtained by the long short-term memory network convLSTM(·) from the current input feature X_t and the hidden state of the previous time step; the weight module F_A(·) is built from a stack of simple convolutional layers and assigns weights to the set of features {X_1, ..., X_t} extracted in step b; G_{m,t} denotes the gated state, m ∈ {1, ..., M} is the channel index, ⊙ denotes element-wise multiplication, and H_{m,t} is the hidden state of channel m of the 3D tensor at the current time; w_S is a 1 × 1 × M convolution kernel, * denotes the convolution operation, and σ is an activation function.
5. The video salient object detection method based on an attention-shift mechanism according to any one of claims 1 to 3, characterized in that: the loss computed in step e against the manually annotated reference image is:

L = L_VSOD(S_t, M_t) + I(F_t) · L_Att(A_t, F_t)

where L_Att and L_VSOD are cross-entropy losses, I(·) indicates whether a fixation (gaze-point) saliency map F_t exists, and M_t is the manually annotated reference image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910347420.XA CN110097115B (en) | 2019-04-28 | 2019-04-28 | Video salient object detection method based on attention transfer mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910347420.XA CN110097115B (en) | 2019-04-28 | 2019-04-28 | Video salient object detection method based on attention transfer mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110097115A true CN110097115A (en) | 2019-08-06 |
CN110097115B CN110097115B (en) | 2022-11-25 |
Family
ID=67446180
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910347420.XA Active CN110097115B (en) | 2019-04-28 | 2019-04-28 | Video salient object detection method based on attention transfer mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110097115B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929735A (en) * | 2019-10-17 | 2020-03-27 | Hangzhou Dianzi University | Rapid saliency detection method based on a multi-scale feature attention mechanism |
CN111242003A (en) * | 2020-01-10 | 2020-06-05 | Nankai University | Video salient object detection method based on a multi-scale constrained self-attention mechanism |
CN111275694A (en) * | 2020-02-06 | 2020-06-12 | University of Electronic Science and Technology of China | Attention-mechanism-guided progressive human body parsing model and method |
CN111340046A (en) * | 2020-02-18 | 2020-06-26 | University of Shanghai for Science and Technology | Visual saliency detection method based on a feature pyramid network and channel attention |
CN111507215A (en) * | 2020-04-08 | 2020-08-07 | Changshu Institute of Technology | Video object segmentation method based on a spatio-temporal convolutional recurrent neural network and dilated convolution |
CN111523410A (en) * | 2020-04-09 | 2020-08-11 | Harbin Institute of Technology | Video salient object detection method based on an attention mechanism |
CN115276784A (en) * | 2022-07-26 | 2022-11-01 | Xidian University | Orbital angular momentum mode recognition method based on deep learning |
CN115359310A (en) * | 2022-07-08 | 2022-11-18 | National University of Defense Technology | SIC prediction method and system based on ConvLSTM and conditional random fields |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101430689A (en) * | 2008-11-12 | 2009-05-13 | Harbin Institute of Technology | Method for detecting human actions in video |
US8363939B1 (en) * | 2006-10-06 | 2013-01-29 | Hrl Laboratories, Llc | Visual attention and segmentation system |
US20140270707A1 (en) * | 2013-03-15 | 2014-09-18 | Disney Enterprises, Inc. | Method and System for Detecting and Recognizing Social Interactions In a Video |
CN106127799A (en) * | 2016-06-16 | 2016-11-16 | Fang Yuming | Visual attention detection method for 3D video |
WO2017155661A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
WO2018023734A1 (en) * | 2016-08-05 | 2018-02-08 | Shenzhen University | Significance testing method for 3D image |
CN108428238A (en) * | 2018-03-02 | 2018-08-21 | Nankai University | General multi-type task detection method based on deep networks |
CN109118459A (en) * | 2017-06-23 | 2019-01-01 | Nankai University | Image salient object detection method and device |
CN109309834A (en) * | 2018-11-21 | 2019-02-05 | Beihang University | Video compression method based on convolutional neural networks and HEVC compression-domain saliency information |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8363939B1 (en) * | 2006-10-06 | 2013-01-29 | Hrl Laboratories, Llc | Visual attention and segmentation system |
CN101430689A (en) * | 2008-11-12 | 2009-05-13 | Harbin Institute of Technology | Method for detecting human actions in video |
US20140270707A1 (en) * | 2013-03-15 | 2014-09-18 | Disney Enterprises, Inc. | Method and System for Detecting and Recognizing Social Interactions In a Video |
WO2017155661A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN106127799A (en) * | 2016-06-16 | 2016-11-16 | Fang Yuming | Visual attention detection method for 3D video |
WO2018023734A1 (en) * | 2016-08-05 | 2018-02-08 | Shenzhen University | Significance testing method for 3D image |
CN109118459A (en) * | 2017-06-23 | 2019-01-01 | Nankai University | Image salient object detection method and device |
CN108428238A (en) * | 2018-03-02 | 2018-08-21 | Nankai University | General multi-type task detection method based on deep networks |
CN109309834A (en) * | 2018-11-21 | 2019-02-05 | Beihang University | Video compression method based on convolutional neural networks and HEVC compression-domain saliency information |
Non-Patent Citations (5)
Title |
---|
HONGMEI SONG et al.: "Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection", European Conference on Computer Vision * |
WENGUAN WANG et al.: "Revisiting Video Saliency: A Large-scale Benchmark and a New Model", IEEE Conference on Computer Vision and Pattern Recognition 2018 * |
ZHANG QING: "Experimental Design of Salient Object Detection Based on Visual Attention", Research and Exploration in Laboratory * |
XIAO LIMEI et al.: "Salient Moving Object Detection Based on Multi-scale Phase Spectrum", Journal of Lanzhou University of Technology * |
HU CHUNHAI et al.: "Visual-Saliency-Driven Segmentation Algorithm for Videos of Moving Fish", Journal of Yanshan University * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929735A (en) * | 2019-10-17 | 2020-03-27 | Hangzhou Dianzi University | Rapid saliency detection method based on a multi-scale feature attention mechanism |
CN110929735B (en) * | 2019-10-17 | 2022-04-01 | Hangzhou Dianzi University | Rapid saliency detection method based on a multi-scale feature attention mechanism |
CN111242003A (en) * | 2020-01-10 | 2020-06-05 | Nankai University | Video salient object detection method based on a multi-scale constrained self-attention mechanism |
CN111242003B (en) * | 2020-01-10 | 2022-05-27 | Nankai University | Video salient object detection method based on a multi-scale constrained self-attention mechanism |
CN111275694B (en) * | 2020-02-06 | 2020-10-23 | University of Electronic Science and Technology of China | Attention-mechanism-guided progressive human body parsing system and method |
CN111275694A (en) * | 2020-02-06 | 2020-06-12 | University of Electronic Science and Technology of China | Attention-mechanism-guided progressive human body parsing model and method |
CN111340046A (en) * | 2020-02-18 | 2020-06-26 | University of Shanghai for Science and Technology | Visual saliency detection method based on a feature pyramid network and channel attention |
CN111507215A (en) * | 2020-04-08 | 2020-08-07 | Changshu Institute of Technology | Video object segmentation method based on a spatio-temporal convolutional recurrent neural network and dilated convolution |
CN111523410A (en) * | 2020-04-09 | 2020-08-11 | Harbin Institute of Technology | Video salient object detection method based on an attention mechanism |
CN111523410B (en) * | 2020-04-09 | 2022-08-26 | Harbin Institute of Technology | Video salient object detection method based on an attention mechanism |
CN115359310A (en) * | 2022-07-08 | 2022-11-18 | National University of Defense Technology | SIC prediction method and system based on ConvLSTM and conditional random fields |
CN115359310B (en) * | 2022-07-08 | 2023-09-01 | National University of Defense Technology | SIC prediction method and system based on ConvLSTM and conditional random fields |
CN115276784A (en) * | 2022-07-26 | 2022-11-01 | Xidian University | Orbital angular momentum mode recognition method based on deep learning |
CN115276784B (en) * | 2022-07-26 | 2024-01-23 | Xidian University | Orbital angular momentum mode recognition method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN110097115B (en) | 2022-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097115A (en) | Video salient object detection method based on an attention transfer mechanism | |
Tao et al. | Smoke detection based on deep convolutional neural networks | |
CN109697435B (en) | People flow monitoring method and device, storage medium and equipment | |
CN106897670B (en) | Express violence sorting identification method based on computer vision | |
CN105608456B (en) | Multi-directional text detection method based on fully convolutional networks | |
CN110532900A (en) | Facial expression recognition method based on U-Net and LS-CNN | |
CN106952269B (en) | Neighborhood-reversible video foreground object sequence detection and segmentation method and system | |
CN104732208B (en) | Video human activity recognition method based on sparse subspace clustering | |
CN104392228B (en) | UAV image object class detection method based on conditional random field models | |
CN105160310A (en) | Human behavior recognition method based on 3D convolutional neural networks | |
CN108764142A (en) | UAV image forest smoke detection and classification method based on 3D CNN | |
CN105869173A (en) | Stereoscopic vision saliency detection method | |
CN103186775B (en) | Human motion recognition method based on mixed descriptors | |
CN108921822A (en) | Image object counting method based on convolutional neural networks | |
CN106845374A (en) | Pedestrian detection method and detection device based on deep learning | |
CN108805078A (en) | Video pedestrian re-identification method and system based on pedestrian average state | |
CN109559310A (en) | Power transmission and transformation inspection image quality evaluation method and system based on saliency detection | |
CN112926453B (en) | Examination room cheating behavior analysis method based on motion feature enhancement and long-term temporal modeling | |
CN113591968A (en) | Infrared dim and small target detection method based on asymmetric attention feature fusion | |
CN106570874A (en) | Image annotation method combining local image constraints and global target constraints | |
CN109376753A (en) | Densely connected three-dimensional spatial-spectral separable convolution deep network and construction method | |
CN111582091B (en) | Pedestrian recognition method based on a multi-branch convolutional neural network | |
CN105303163B (en) | Target detection method and detection device | |
CN107463954A (en) | Template matching recognition method for blurred images of different spectra | |
CN110399882A (en) | Text detection method based on deformable convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||